Why and how you should build a data dictionary for big data sets

One of the most valuable tools that I've used, when performing exploratory analysis, is building a data dictionary. It offers the following advantages:

  • Identify areas of sparsity and areas of concentration in high-dimensional data sets
  • Identify outliers and data glitches
  • Get a good sense of what the data contains, and where to spend time (or not) in further data mining

What is a data dictionary

A data dictionary is a table with 3 or 4 columns. The first column represents a label: that is, the name of a variable, or a combination of multiple (up to 3) variables. The second column is the value attached to the label: the first and second columns actually constitute a name-value pair. The third column is a frequency count: it measures how many times the value (attached to the label in question) is found in the data set. You can add a 4-th column, that tells the dimension of the label (1 if it represents one variable, 2 if it represents a pair of two variables etc.)

Typically, you include all labels of dimension 1 and 2 with count > threshold (e.g. threshold = 5), but no or only very few values (the ones with high count) for labels of dimension 3. Labels of dimension 3 should be explored after having built the dictionary for dim 1 and 2, by drilling down on label/value of dim 2, that have a high count.

Example of dictionary entry

category~keyword travel~Tokyo 756 2

In this example, the entry corresponds to a label of dimension 2 (as indicated in column 4), and the simultaneous combination of the two values (travel, Tokyo) is found 756 times in the data set.

The first thing you want to do with a dictionary is to sort it using the following 3-dim index: column 4, then column 1, then column 3. Then look at the data and find patterns.

How do you build a dictionary

Browse your data set sequentially. For each observation, store all label/value of dim 1 and dim 2 as hash table keys, and increment count by 1 for each of these label/value. In Perl, it can be performed with code such as $hash{"$label\t$value"}++. 

If the hash table grows very large, stop, save the hash table on file then delete it in memory, and resume where you paused, with a new hash table. At the end, merge hash tables after ignoring hash entries where count is too small.

Views: 2962


You need to be a member of Data Science Central to add comments!

Join Data Science Central

© 2021   TechTarget, Inc.   Powered by

Badges  |  Report an Issue  |  Privacy Policy  |  Terms of Service