Your data is like Gruyere. It has holes. Big holes: sometimes the empty space occupies a bigger volume than the data itself, just as dark matter is more abundant than visible matter in the universe. This article is not about shallow or sparse data, but about data that you do not see, that you do not even know exists, and yet that contains better actionable nuggets than anything in your data warehouse.
I will provide three cases of "Gruyere data", with the remedy for each case.
1. Missing or incomplete data
That's the easiest problem to fix. Any talented data scientist can work around this issue using modern, unbiased imputation techniques. Most analytic software also includes mechanisms to handle missing data.
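To make the idea of imputation concrete, here is a minimal sketch of the simplest version: replacing a missing value with the mean of its group. The field names (`region`, `sales`) and the data are hypothetical, made up for illustration. This single-imputation approach is not one of the modern, unbiased techniques mentioned above (it understates variability, for one thing); it only shows the basic mechanics.

```python
from statistics import mean


def impute_group_mean(rows, group_key, value_key):
    """Return a copy of rows where missing (None) values are replaced
    by the mean of the observed values in the same group."""
    # Collect the observed values for each group.
    by_group = {}
    for row in rows:
        if row[value_key] is not None:
            by_group.setdefault(row[group_key], []).append(row[value_key])

    # Fallback for groups that have no observed value at all.
    overall = mean(v for values in by_group.values() for v in values)

    filled = []
    for row in rows:
        new_row = dict(row)
        missing = new_row[value_key] is None
        if missing:
            values = by_group.get(new_row[group_key])
            new_row[value_key] = mean(values) if values else overall
        # Keep a flag so imputed values are never mistaken for real ones.
        new_row["imputed"] = missing
        filled.append(new_row)
    return filled


# Hypothetical weekly sales figures, with two holes.
rows = [
    {"region": "CA", "sales": 10.0},
    {"region": "CA", "sales": 14.0},
    {"region": "CA", "sales": None},
    {"region": "MN", "sales": 4.0},
    {"region": "MN", "sales": None},
]
filled = impute_group_mean(rows, "region", "sales")
```

Keeping the `imputed` flag is a deliberate choice: downstream analysis can then treat filled-in values differently from observed ones.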
2. Censored data
By censored, I mean censored from a statistical point of view. An example that comes to mind: you want to estimate the proportion of all guns that are involved in a crime at least once during their lifetime. The data set that you will use (gun or crime statistics) is censored, in the sense that a brand-new gun has not killed anyone as of today, but might be used in a shooting next week. Also, some criminals get rid of their gun, and the gun might not be traceable after the crime.
How do you deal with this issue? Again, any talented data scientist will easily handle this problem, using a statistical distribution (typically exponential) to model time-to-crime, and estimating its mean from the censored data with techniques designed for censored observations (survival analysis). Problem solved.
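As a minimal sketch of what this looks like in practice, assume time-to-crime is exponential and every gun is either observed up to its first crime or right-censored (followed for some time with no crime recorded so far). Under those assumptions, the maximum-likelihood estimate of the mean is the total time under observation divided by the number of observed events. The function names and the data below are hypothetical, made up for illustration.

```python
import math


def exponential_mean_censored(durations, event_observed):
    """Maximum-likelihood estimate of the mean of an exponential
    distribution from right-censored data.

    durations[i]      -- time item i was followed
    event_observed[i] -- True if the event occurred at durations[i],
                         False if the item was still event-free (censored)
    """
    if len(durations) != len(event_observed):
        raise ValueError("durations and event_observed must have the same length")
    events = sum(1 for observed in event_observed if observed)
    if events == 0:
        raise ValueError("at least one observed event is needed")
    # Censored items add to the exposure time but not to the event count.
    return sum(durations) / events


def prob_event_within(t, mean_time):
    """P(event occurs within time t) under the fitted exponential model."""
    return 1.0 - math.exp(-t / mean_time)


# Hypothetical data: years each gun has been followed.
durations = [2.0, 5.0, 1.0, 8.0, 4.0]
event_observed = [True, False, False, True, False]

mean_time = exponential_mean_censored(durations, event_observed)  # 20 / 2 = 10.0

# Naive estimate that throws away the censored guns: (2 + 8) / 2 = 5.0
uncensored = [d for d, e in zip(durations, event_observed) if e]
naive_mean = sum(uncensored) / len(uncensored)
```

In this toy data, ignoring the censored guns halves the estimated mean time-to-crime, which is exactly the kind of bias that handling censoring correctly avoids.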
3. Hidden data
That one is a far bigger issue. First, you don't even know it exists because it is invisible, at least from your vantage point. The data might indeed not exist at all, and might have to be assembled first.
Target is trying to optimize revenue numbers. They analyze their data to see when garden items sell best. They have no data about selling garden stuff in February: the company headquarters are in Minneapolis, and anyone suggesting such an idea might be fired on the spot, or suspected of being on drugs. Yet in California, Target's competitors are all selling garden stuff in February, leaving next to nothing for Target, in terms of sales, come June. Target, unaware of the cause, thinks there's not much money to be made with garden items in California. Case closed.
How to address this issue? Carefully looking at competitor data (for instance, scanning and analyzing the millions of pieces of junk mail they send to everyone every day) is a first step in the right direction. But the real solution is to hire a visionary data scientist:
Talented data scientists leverage data that everybody sees; visionary data scientists leverage data that nobody sees.