So the question is: when do you sample, and when do you not? And does it even matter anymore in the world of big data?
As I’ll lay out here, in most cases today there is no point in wasting energy worrying about it. As long as a few basic criteria are met, do whatever you prefer.
First, let’s take care of the cases where sampling just won’t work. If you need to find the top 100 spending customers, you can’t do that with a sample. You’ll have to look at every single customer to accurately identify the top 100. Such scenarios, while common, aren’t the most prevalent type of analytic requirement, though they do represent an easy victory for the “no sampling” crowd. Similarly, even a model built on a sample has to be applied to the entire universe of records once it is put to use. So, when it comes time to deploy, sampling isn’t an option.
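To make the top 100 point concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration only: the customer IDs, the lognormal spend figures, the 10% sample rate, and the helper name `top_n_ids` are all made up, not taken from any real data set.

```python
import random


def top_n_ids(spend_by_customer, n):
    """Return the set of the n customer IDs with the highest spend."""
    ranked = sorted(spend_by_customer, key=spend_by_customer.get, reverse=True)
    return set(ranked[:n])


rng = random.Random(42)  # fixed seed so the run is repeatable

# Synthetic data (hypothetical): 100,000 customers with skewed spend.
spend = {cid: rng.lognormvariate(4.0, 1.0) for cid in range(100_000)}

# A 10% simple random sample of those customers.
sample_ids = rng.sample(sorted(spend), k=10_000)
sample = {cid: spend[cid] for cid in sample_ids}

true_top = top_n_ids(spend, 100)     # requires looking at every customer
sample_top = top_n_ids(sample, 100)  # the best we can do from the sample

overlap = len(true_top & sample_top)
print(f"True top-100 customers recovered from the sample: {overlap}")
```

Each of the true top 100 customers has only about a 10% chance of landing in a 10% sample, so on average only around ten of them can show up in the sample's own top 100. The rest of the sample's list is filled with customers who merely look like top spenders because the real ones were never drawn.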