Research on managing open data requires a suitable collection of open datasets on which to test novel algorithms and techniques. Insights into the properties of these datasets can also inform the design of algorithms and benchmarks. Currently, the interfaces of most open data portals offer little support for exploring such properties.
Here we provide a visualization that uses Multidimensional Scaling (MDS) to depict datasets from https://www.data.gov/. We focus on four structural (metadata) attributes of the datasets: the number of rows, the number of columns, the percentage of null values, and the percentage of unique values.
We also show (on the right) the distribution of these four attributes for each selection of datasets.
Multidimensional Scaling lets us encode many attributes in the same visualization. For each pair of datasets \(d_1, d_2\), we calculate their weighted Euclidean distance, with one weight per attribute (\(w_r, w_c, w_n, w_u\)):
\[ \begin{aligned} dist(d_1, d_2) = {} & \big[ w_r(d_1.rows-d_2.rows)^2 + w_c(d_1.cols-d_2.cols)^2 \\ & + w_n(d_1.nulls-d_2.nulls)^2 + w_u(d_1.unqs-d_2.unqs)^2 \big]^{1/2} \end{aligned} \]
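As a minimal sketch of this distance, the Python below computes it for two datasets and builds the full pairwise matrix. The `DatasetStats` container, the function names, and the default weights of 1.0 are illustrative assumptions, not part of the original tool. Whether the attributes are normalized before the distance is taken is not stated here, so the sketch uses the values as given.

```python
import math
from dataclasses import dataclass


@dataclass
class DatasetStats:
    """Hypothetical container for the four structural attributes."""
    rows: float   # number of rows
    cols: float   # number of columns
    nulls: float  # percentage of null values
    unqs: float   # percentage of unique values


def weighted_distance(d1, d2, w_r=1.0, w_c=1.0, w_n=1.0, w_u=1.0):
    """Weighted Euclidean distance over rows, cols, nulls and unqs."""
    return math.sqrt(
        w_r * (d1.rows - d2.rows) ** 2
        + w_c * (d1.cols - d2.cols) ** 2
        + w_n * (d1.nulls - d2.nulls) ** 2
        + w_u * (d1.unqs - d2.unqs) ** 2
    )


def distance_matrix(datasets, **weights):
    """Symmetric matrix of pairwise distances, as a list of lists."""
    return [[weighted_distance(a, b, **weights) for b in datasets]
            for a in datasets]
```

Raising one weight (for example `w_r=4.0`) makes differences in that attribute count more heavily, so datasets that differ mainly in row count end up farther apart.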
An MDS algorithm then places the points (i.e., datasets) in a 2-dimensional space such that these distances are preserved as much as possible.
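The page does not say which MDS algorithm is used, so the following is only a sketch of one common variant: gradient descent on the raw stress, i.e. the sum of squared differences between embedded and target distances. The function names, the iteration count, and the step size of 1/(2n) are assumptions chosen for illustration.

```python
import math
import random


def stress(pos, dist):
    """Raw stress: sum over pairs of (embedded distance - target distance)^2."""
    n = len(pos)
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            d = math.hypot(pos[i][0] - pos[j][0], pos[i][1] - pos[j][1])
            total += (d - dist[i][j]) ** 2
    return total


def mds_2d(dist, iterations=500, seed=0):
    """Place n points in 2D by gradient descent on the raw stress.

    `dist` is a symmetric n x n matrix (list of lists) of target distances.
    """
    n = len(dist)
    if n == 0:
        return []
    rng = random.Random(seed)
    # Random starting layout; the result depends on it only up to
    # rotation, reflection, translation and possible local minima.
    pos = [[rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)] for _ in range(n)]
    step = 1.0 / (2 * n)  # assumed step size, scaled by the number of points
    for _ in range(iterations):
        grad = [[0.0, 0.0] for _ in range(n)]
        for i in range(n):
            for j in range(i + 1, n):
                dx = pos[i][0] - pos[j][0]
                dy = pos[i][1] - pos[j][1]
                d = math.hypot(dx, dy)
                if d == 0.0:
                    continue  # coincident points give no direction to move in
                coeff = 2.0 * (d - dist[i][j]) / d
                grad[i][0] += coeff * dx
                grad[i][1] += coeff * dy
                grad[j][0] -= coeff * dx
                grad[j][1] -= coeff * dy
        for i in range(n):
            pos[i][0] -= step * grad[i][0]
            pos[i][1] -= step * grad[i][1]
    return pos
```

For distances that can be realized exactly in the plane, such as the sides of a 3-4-5 triangle, the stress should approach zero. With four attributes a perfect 2D fit is generally not possible, which is why the distances are preserved only as much as possible.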
Distribution panels: Number of Rows; Number of Categorical Columns; Number of Numerical Columns; Percentage of Unique Values (Categorical); Percentage of Null Values (Categorical); Percentage of Unique Values (Numerical); Percentage of Null Values (Numerical).
MDS Plot: Datasets are embedded in 2D space so that more similar datasets lie closer together.
The paper and supplementary materials can be found at: https://osf.io/zkxv9/
Watch a short 30-second video teaser:
or a 7-minute demo video: