
Loch Prospector: MetaData Visualization for Lakes of Open Data

Neha Makhija, Mansi Jain, Nikolaos Tziavelis, Laura Di Rocco, Sara Di Bartolomeo, Cody Dunne

Motivation

Research on managing open data requires an appropriate collection of such datasets in order to test novel algorithms and techniques. Gaining insights about their properties can also inform the process of designing algorithms or benchmarks. Currently, the interface of most open data portals is limited in that regard.

Visualization

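Here we provide a visualization that uses Multidimensional Scaling (MDS) to depict datasets from https://www.data.gov/. We focus on four structural (metadata) attributes of each dataset: the number of rows, the number of columns, the percentage of null values, and the percentage of unique values.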

We also show (on the right) the distribution of these four attributes for each selection of datasets.

How MDS works

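Multidimensional Scaling lets us encode many attributes in the same visualization. For each pair of datasets \(d_1, d_2\), we calculate their weighted Euclidean distance, where the weights \(w_r, w_c, w_n, w_u\) control how much the rows, columns, null values, and unique values contribute: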

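\[ \begin{aligned} \mathrm{dist}(d_1, d_2) = \big[ & w_r(d_1.\mathrm{rows}-d_2.\mathrm{rows})^2 + w_c(d_1.\mathrm{cols}-d_2.\mathrm{cols})^2 \\ {} + {} & w_n(d_1.\mathrm{nulls}-d_2.\mathrm{nulls})^2 + w_u(d_1.\mathrm{unqs}-d_2.\mathrm{unqs})^2 \big]^{1/2} \end{aligned} \]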
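The distance formula can be sketched in a few lines of Python. This is only an illustration: the `DatasetMeta` class, its field names, and the default weights of 1.0 are assumptions made for the example, and the sketch does not normalize the attributes, since this page does not say whether the actual tool does.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetMeta:
    """Hypothetical container for the four metadata attributes of one dataset."""
    rows: float   # number of rows
    cols: float   # number of columns
    nulls: float  # percentage of null values
    unqs: float   # percentage of unique values


def weighted_distance(d1, d2, w_r=1.0, w_c=1.0, w_n=1.0, w_u=1.0):
    """Weighted Euclidean distance between two datasets, as in the formula."""
    return math.sqrt(
        w_r * (d1.rows - d2.rows) ** 2
        + w_c * (d1.cols - d2.cols) ** 2
        + w_n * (d1.nulls - d2.nulls) ** 2
        + w_u * (d1.unqs - d2.unqs) ** 2
    )


# Two made-up datasets that differ by 3 rows and 4 columns.
a = DatasetMeta(rows=100, cols=5, nulls=10.0, unqs=50.0)
b = DatasetMeta(rows=103, cols=9, nulls=10.0, unqs=50.0)

print(weighted_distance(a, b))           # 5.0
print(weighted_distance(a, b, w_c=0.0))  # 3.0 (column difference ignored)
```

Setting a weight to zero removes that attribute from the comparison, which is the effect the weight sliders are meant to give.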

An MDS algorithm then places the points (i.e., datasets) in a 2-dimensional space such that these distances are preserved as much as possible.
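One way to realize such a placement is to minimize the squared difference between the target distances and the distances in the 2D layout ("stress") by gradient descent. The sketch below is a minimal version of that idea and is not necessarily the MDS variant used by this visualization; the function name `mds_2d`, its parameters, and the example distance matrix are all assumptions for illustration.

```python
import math
import random


def mds_2d(dist, iterations=1000, learning_rate=0.1, seed=0):
    """Place n points in 2D so their pairwise distances approximate `dist`.

    `dist` is a symmetric n x n matrix (list of lists) of target distances.
    Minimizes the sum over pairs of (layout distance - target distance)^2.
    """
    n = len(dist)
    rng = random.Random(seed)
    pos = [[rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)] for _ in range(n)]
    step = learning_rate / n  # scale the step down as the number of points grows

    for _ in range(iterations):
        grad = [[0.0, 0.0] for _ in range(n)]
        for i in range(n):
            for j in range(n):
                if i == j:
                    continue
                dx = pos[i][0] - pos[j][0]
                dy = pos[i][1] - pos[j][1]
                cur = math.hypot(dx, dy) or 1e-9  # avoid division by zero
                # Gradient of (cur - target)^2 with respect to point i.
                coeff = 2.0 * (cur - dist[i][j]) / cur
                grad[i][0] += coeff * dx
                grad[i][1] += coeff * dy
        for i in range(n):
            pos[i][0] -= step * grad[i][0]
            pos[i][1] -= step * grad[i][1]
    return pos


# Made-up target distances that form a 3-4-5 triangle.
target = [
    [0.0, 3.0, 4.0],
    [3.0, 0.0, 5.0],
    [4.0, 5.0, 0.0],
]
layout = mds_2d(target)
for i in range(3):
    for j in range(i + 1, 3):
        d = math.hypot(layout[i][0] - layout[j][0], layout[i][1] - layout[j][1])
        print(i, j, round(d, 3), "target:", target[i][j])
```

Three points can always be placed exactly, so the printed distances should match the targets closely. With four attributes and many datasets the distances generally cannot all be preserved in two dimensions, and the layout is only the best approximation the optimizer finds.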

Select the data type that you are interested in:

Adjust the weights to control which attributes determine whether two datasets are similar
(a higher weight places more importance on an attribute)





Filters

Number of Rows

Number of Categorical Columns

Number of Numerical Columns

Percentage of Unique Values (Categorical)

Percentage of Null Values (Categorical)

Percentage of Unique Values (Numerical)

Percentage of Null Values (Numerical)

Distribution Summary

MDS Plot: Datasets are embedded into 2D space so that datasets that are more similar lie closer together.

Acknowledgments

IEEE Vis 2020

The paper and supplementary materials can be found at: https://osf.io/zkxv9/

Watch a short 30s video teaser:

or a 7-minute demo video: