Comments - To Sample Or Not To Sample… Does It Even Matter? - Data Science Central
2020-07-11T22:04:11Z, https://www.datasciencecentral.com/profiles/comment/feed?attachedTo=6448529%3ABlogPost%3A12257&xn_auth=no
Vincent Granville (https://www.datasciencecentral.com/profile/VincentGranville), 2012-04-09T17:57:45.040Z
<p>Here are my thoughts:</p>
<p>In combinatorial problems, sampling is necessary. If you try to find the optimum vector of attributes (e.g., the one with the best fraud discriminative power) in a data set that has 40 attributes, you must sample: to compute the discriminative power of a single vector, you need to process (say) 50 million observations (your data set), and the total number of potential vectors is 2 to the power of 40. In short, you need to process 2^40 * 50MM, which is about 5.5 * 10^19 data points, ideally in a few hours. Of course, there are algorithms that significantly reduce the amount of computation by testing multiple vectors at once; that's what I designed when I was working with Visa to detect credit card fraud. Yet sampling the vector space is still necessary.</p>
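<p>To make the arithmetic concrete, here is a minimal Python sketch. It counts the data points an exhaustive search would touch (2^40 attribute subsets times 50 million observations, roughly 5.5 * 10^19) and compares that with scoring a random sample of subsets. The function names and the sample size of 1,000 vectors are illustrative assumptions; this is not the algorithm used at Visa, which the comment does not describe.</p>

```python
import random

N_ATTRIBUTES = 40
N_OBSERVATIONS = 50_000_000  # "50MM" observations, as in the comment


def exhaustive_cost(n_attributes: int, n_observations: int) -> int:
    """Data points touched if every attribute subset is scored on every observation."""
    return (2 ** n_attributes) * n_observations


def sample_attribute_vectors(n_attributes: int, n_samples: int, seed: int = 0):
    """Draw random 0/1 inclusion vectors uniformly from the 2**n_attributes candidates.

    Hypothetical helper: a 1 at position i means attribute i is included.
    """
    rng = random.Random(seed)
    return [
        tuple(rng.randint(0, 1) for _ in range(n_attributes))
        for _ in range(n_samples)
    ]


total = exhaustive_cost(N_ATTRIBUTES, N_OBSERVATIONS)
vectors = sample_attribute_vectors(N_ATTRIBUTES, n_samples=1000)  # assumed sample size
sampled_cost = len(vectors) * N_OBSERVATIONS

print(f"exhaustive search: {total:.2e} data points")
print(f"sampled search:    {sampled_cost:.2e} data points")
```

<p>Uniform sampling is only the simplest choice; a real search would likely bias the sample toward promising vectors.</p>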
<p>For more general types of problems (computing averages, maxima, parameters, etc.), sampling works well as long as it is done correctly, with proper cross-validation. More on this later.</p>
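<p>As a rough illustration of estimating an average from a sample, the sketch below draws a simple random sample, estimates the mean, and then checks how stable that estimate is across disjoint folds of the sample. The population is synthetic, the function name is hypothetical, and reading "cross-validation" as a fold-by-fold stability check is my assumption about what is meant here.</p>

```python
import random
import statistics


def sample_mean_with_folds(population, sample_size, k=5, seed=0):
    """Estimate the mean from a simple random sample and per-fold means.

    The sample is split into k disjoint folds; widely differing fold means
    suggest the sample is too small or was drawn badly.
    """
    rng = random.Random(seed)
    sample = rng.sample(population, sample_size)  # without replacement
    folds = [sample[i::k] for i in range(k)]
    fold_means = [statistics.fmean(fold) for fold in folds]
    return statistics.fmean(sample), fold_means


# Synthetic population, for illustration only.
data_rng = random.Random(42)
population = [data_rng.gauss(100.0, 15.0) for _ in range(200_000)]

estimate, fold_means = sample_mean_with_folds(population, sample_size=5_000)
spread = max(fold_means) - min(fold_means)

print(f"sample estimate of the mean: {estimate:.2f}")
print(f"spread across folds:         {spread:.2f}")
print(f"full-data mean:              {statistics.fmean(population):.2f}")
```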
<p>Now, if you want to compute highly granular data (e.g., the value of every single home in the US), you might keep all your data. Still, you will need to perform some sound statistical inference for homes with little historical data. More on this later.</p>
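<p>The comment does not say which inference method is intended for homes with little history. One common option, shown here purely as an assumption, is shrinkage: pull a home's own average price toward the average for its area, with the pull getting weaker as the home accumulates sales. The function name, the <code>prior_weight</code> parameter, and the prices are all hypothetical.</p>

```python
def shrunk_estimate(home_prices, area_mean, prior_weight=5.0):
    """Blend a home's own mean sale price with the area mean.

    prior_weight acts like a number of pseudo-observations at area_mean:
    with no sales the result is area_mean, and with many sales it
    approaches the home's own mean.
    """
    n = len(home_prices)
    if n == 0:
        return area_mean
    own_mean = sum(home_prices) / n
    return (n * own_mean + prior_weight * area_mean) / (n + prior_weight)


area_mean = 300_000.0  # made-up area average

no_history = shrunk_estimate([], area_mean)
one_sale = shrunk_estimate([400_000.0], area_mean)
many_sales = shrunk_estimate([400_000.0] * 50, area_mean)

print(f"no sales:  {no_history:,.0f}")
print(f"one sale:  {one_sale:,.0f}")
print(f"50 sales:  {many_sales:,.0f}")
```

<p>A home with a single sale stays close to the area average, while one with fifty sales is valued almost entirely on its own record.</p>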
<p>In short, you need a statistician involved in almost all of these situations, not just computer scientists. Otherwise, you will get poor predictive power.</p>