How to detect three types of hidden data and eliminate opportunity costs

Your data is like Gruyère: it has holes. Big holes; sometimes the empty space occupies a bigger volume than the data itself, just as dark matter is more abundant than visible matter in the universe. This article is not about shallow or sparse data, but about data that you do not see, data whose existence you do not even suspect, and yet data that contains better actionable nuggets than anything in your data warehouse.

I will present three cases of "Gruyère data", with a remedy for each.

1. Missing or incomplete data

That is the easiest problem to fix. Any talented data scientist can work around this issue using modern, unbiased imputation techniques. Most analytics software also includes mechanisms to handle missing data.
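As a toy illustration of the simplest imputation idea (the function name and data are made up for this sketch; real-world work would use multiple imputation or a model that preserves joint distributions, not just the marginal mean):

```python
from statistics import mean

def impute_mean(values):
    """Replace missing entries (None) with the mean of the observed values.

    A deliberately naive sketch: mean imputation shrinks variance and
    distorts joint distributions, which is why principled methods
    (e.g. multiple imputation) are preferred in practice.
    """
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

print(impute_mean([10.0, None, 14.0, None, 12.0]))  # -> [10.0, 12.0, 14.0, 12.0, 12.0]
```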

2. Censored data

By censored, I mean censored in the statistical sense. An example that comes to mind: you want to estimate the proportion of guns that are involved in a crime at least once during their lifetime. The data set you will use (gun or crime statistics) is censored, in the sense that a brand new gun has not killed anyone today but might be used in a shooting next week. Also, some criminals get rid of their gun, and the gun might not be traceable after the crime.

How do you deal with this issue? Again, any talented data scientist can handle it, fitting a statistical distribution (typically exponential) to time-to-crime and estimating its mean from the censored data using the correct statistical techniques. Problem solved.
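A minimal sketch of that estimator, assuming an exponential time-to-event model with right-censoring (the function name and toy numbers are mine, not from the article): the maximum-likelihood estimate of the mean is simply total observed exposure time divided by the number of events actually seen.

```python
def exp_mean_censored(times, observed):
    """MLE of the mean of an exponential lifetime under right-censoring.

    times:    observation time for each unit (the event time, or the
              follow-up time so far if the event has not yet occurred)
    observed: True if the event occurred, False if the unit is censored.

    Censored units contribute exposure time to the numerator but no
    event to the denominator, which corrects the downward bias you
    would get by averaging only the completed lifetimes.
    """
    total_time = sum(times)
    events = sum(observed)
    if events == 0:
        raise ValueError("no events observed; the mean is not identifiable")
    return total_time / events

# Three guns: crimes at t=2 and t=4 years, one still crime-free at t=6.
print(exp_mean_censored([2.0, 4.0, 6.0], [True, True, False]))  # -> 6.0
```

Note that the naive average of the two completed lifetimes would be 3.0; accounting for the censored unit doubles the estimated mean.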

3. Hidden data

That one is a far bigger issue. First, you don't even know it exists because it is invisible, at least from your vantage point. The data might indeed not exist at all, and might have to be assembled first.


Target is trying to optimize revenue numbers. They analyze their data to see when garden items sell best. They have no data about selling garden items in February: company headquarters are in Minneapolis, and anyone suggesting such an idea might be fired on the spot, or suspected of being on drugs. Yet in California, Target's competitors all sell garden items in February, leaving next to nothing for Target, in terms of sales, by the time June arrives. Target, unaware of the cause, concludes there is not much money to be made with garden items in California. Case closed.

How do you address this issue? Carefully studying competitor data (for instance, scanning and analyzing the millions of pieces of junk mail they send out every day) is a first step in the right direction. But the real solution is to hire a visionary data scientist:

Talented data scientists leverage data that everybody sees; visionary data scientists leverage data that nobody sees.


Comment by william e winkler on June 11, 2013 at 1:24pm

Re missing or incomplete data. It is straightforward to fill in missing univariate data. To fill in joint distributions in a principled manner, you need to extend methods such as those in Little and Rubin (2002, Chapter 13). If you further need to fill in data so that edit (business-rule) constraints are also satisfied, it is even more complicated. If your models for filling in data are quite large, or you are working with millions of records, then you need sufficiently fast algorithms.

Winkler, W. E. (2011), “Cleaning and using administrative lists: Methods and fast computational algorithms for record linkage and modeling/editing/imputation,” Proceedings of the ESSnet Conference on Data Integration, Madrid, Spain, November 2011 (http://www.ine.es/e/essnetdi_ws2011/ppts/Winkler.pdf).

After you have built a very large contingency table (a billion-plus cells) that ensures the joint distributions are preserved, how do you find, say, 100,000 cells that match a record on the non-missing values, and then draw one probability-proportional-to-size, in a few milliseconds? A somewhat junior person writing SAS code might have an algorithm that needs a minute per record to do the imputation (fill in the missing values).
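The probability-proportional-to-size draw in that closing question can be made fast with precomputed running totals and a binary search per draw; the sketch below (function name and data layout are illustrative, not taken from Winkler's paper) is O(log n) per draw rather than a linear scan per record.

```python
import bisect
import random
from itertools import accumulate

def pps_draw(cells, sizes, rng=random):
    """Draw one cell with probability proportional to its size (count).

    Building the running totals is O(n); each subsequent draw is a
    single O(log n) binary search, i.e. microseconds even across
    ~100,000 matching cells.
    """
    cum = list(accumulate(sizes))        # running totals of cell sizes
    r = rng.random() * cum[-1]           # uniform point on the total mass
    return cells[bisect.bisect(cum, r)]  # first cell whose total exceeds r

# Toy usage: three matching cells with sizes 10, 0 and 30.
cells, sizes = ["cell_a", "cell_b", "cell_c"], [10, 0, 30]
print(pps_draw(cells, sizes))  # "cell_a" ~25% of the time, "cell_c" ~75%
```

In production you would build the cumulative array once per group of matching cells and reuse it across records, rather than recomputing it on every draw.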
