So the question is…when do you sample and when do you not? And does it even matter anymore in the world of big data?
As I’ll lay out here, in most cases today there is no point in wasting energy worrying about it. As long as a few basic criteria are met, do whatever you prefer.
First, let’s take care of the cases where sampling just won’t work. If you need to find the top 100 spending customers, you can’t do that with a sample. You’ll have to look at every single customer to accurately identify the top 100. Such scenarios, while common, aren’t the most prevalent type of analytic requirement, though they do represent an easy victory for the “no sampling” crowd. Similarly, even a model built on a sample will need to be applied to the full universe to use it appropriately. So, when it comes time to deploy, sampling isn’t an option.
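To make the top-100 point concrete, here is a minimal sketch on synthetic data. The customer ids, the lognormal spend figures, and the 10% sampling rate are all assumptions made up for illustration; none of them come from the article. It compares the true top 100 with the top 100 found in a simple random sample.

```python
import random

def top_k(spend, k):
    """Return the set of customer ids with the k highest spend values."""
    return set(sorted(spend, key=spend.get, reverse=True)[:k])

rng = random.Random(0)

# Synthetic, made-up spend figures for 100,000 hypothetical customers.
spend = {cid: rng.lognormvariate(4, 1) for cid in range(100_000)}
true_top = top_k(spend, 100)

# Draw a 10% simple random sample and take the top 100 within it.
sample_ids = rng.sample(range(100_000), 10_000)
sample_spend = {cid: spend[cid] for cid in sample_ids}
sample_top = top_k(sample_spend, 100)

# A true top-100 customer can only be recovered if it was drawn at all.
overlap = len(true_top & sample_top)
print(f"true top-100 customers recovered from the sample: {overlap}")
```

With a 10% sample, only about one in ten of the true top 100 can be expected to land in the sample at all, so the sample's "top 100" is mostly made up of other customers.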
Read more at: http://iianalytics.com/2012/04/to-sample-or-not-to-sample-does-it-e...
Comment
Here are my thoughts:
In combinatorial problems, sampling is necessary. If you try to find the optimum vector of attributes (e.g. the one with the best fraud discriminative power) in a data set that has 40 attributes, you must sample: to compute the discriminative power of a single vector of attributes, you need to process (say) 50 million observations (your data set), and the total number of potential vectors is 2^40. In short, you need to process 2^40 * 50MM, or roughly 5.5 * 10^19, data points, ideally in a few hours. Of course there are algorithms that significantly reduce the amount of computation by testing multiple vectors at once; that's what I designed when I was working with Visa to detect credit card fraud. Yet sampling the vector space is still necessary.
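The size of that search space is easy to check, and drawing random attribute vectors is one simple way to sample it. The sketch below is only an illustration under stated assumptions: the helper names `sample_subsets` and `attributes_in` are hypothetical, each vector is encoded as a 40-bit mask, and uniform random sampling stands in for whatever search strategy is actually used (it is not the algorithm mentioned above).

```python
import random

N_ATTRIBUTES = 40              # attributes in the data set (from the text)
N_OBSERVATIONS = 50_000_000    # observations scanned per vector (from the text)

# Each attribute is either in or out of a vector, so there are 2^40 vectors.
n_subsets = 2 ** N_ATTRIBUTES
total_work = n_subsets * N_OBSERVATIONS
print(f"{n_subsets:,} vectors, {total_work:.2e} data points")  # 5.50e+19

def sample_subsets(n_attributes, n_samples, seed=0):
    """Hypothetical helper: draw random attribute vectors as bit masks."""
    rng = random.Random(seed)
    return [rng.getrandbits(n_attributes) for _ in range(n_samples)]

def attributes_in(mask):
    """Hypothetical helper: list the attribute indices set in a bit mask."""
    return [i for i in range(N_ATTRIBUTES) if (mask >> i) & 1]

# Evaluate only a tiny random fraction of the vector space.
candidates = sample_subsets(N_ATTRIBUTES, 1_000)
print(attributes_in(candidates[0]))
```

Each sampled mask would then be scored on the observations (or on a sample of them), which keeps the total work proportional to the number of vectors drawn rather than to 2^40.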
For more general types of problems (computing averages, maxima, parameters, etc.), sampling works very well as long as it is done correctly, with proper cross-validation. More on this later.
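As a rough illustration of checking a sampled estimate, the sketch below estimates a mean from several disjoint sub-samples and looks at how much they disagree. Everything in it is an assumption for illustration: the population is synthetic, the helper `split_sample_means` is hypothetical, and comparing disjoint sub-samples is only a simple stand-in for a full cross-validation scheme.

```python
import random
import statistics

rng = random.Random(42)

# Synthetic population, made up for illustration only.
population = [rng.gauss(100, 15) for _ in range(200_000)]

def split_sample_means(data, n_folds, fold_size, rng):
    """Hypothetical helper: mean of each of n_folds disjoint random sub-samples."""
    drawn = rng.sample(data, n_folds * fold_size)  # sampled without replacement
    return [
        statistics.fmean(drawn[i * fold_size:(i + 1) * fold_size])
        for i in range(n_folds)
    ]

fold_means = split_sample_means(population, n_folds=5, fold_size=2_000, rng=rng)

estimate = statistics.fmean(fold_means)   # pooled estimate from 10,000 rows
spread = statistics.stdev(fold_means)     # disagreement between sub-samples
true_mean = statistics.fmean(population)  # known here only because data is synthetic

print(f"estimate={estimate:.2f}  spread={spread:.2f}  true={true_mean:.2f}")
```

If the sub-sample means disagree widely, the sample is too small or badly drawn; if they agree closely, a few thousand rows may be enough to estimate an average over hundreds of thousands.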
Now if you want to estimate highly granular values (e.g. the value of every single home in the US), you might keep all your data. Even then, you will need to perform some sound statistical inference for homes with little historical data. More on this later.
In short, you need a statistician involved in almost all of these situations, not just computer scientists. Otherwise you will get poor predictive power.