Featured Discussions - Data Science Central2020-03-30T08:43:49Zhttps://www.datasciencecentral.com/group/research/forum/topic/list?feed=yes&xn_auth=no&featured=1Black-box Confidence Intervals: Excel and Perl Implementationtag:www.datasciencecentral.com,2014-08-10:6448529:Topic:1928882014-08-10T22:15:54.401ZVincent Granvillehttps://www.datasciencecentral.com/profile/VincentGranville
<p><em>Originally posted <a href="http://www.datasciencecentral.com/profiles/blogs/black-box-confidence-intervals-excel-and-perl-implementations-det" target="_blank">here</a>. Check original article for most recent updates.</em></p>
<p>Confidence interval is abbreviated as CI. In this new article (part of our series on robust techniques for automated data science) we describe an implementation both in Excel and Perl, and discussion about our popular model-free confidence interval technique…</p>
<p><em>Originally posted <a href="http://www.datasciencecentral.com/profiles/blogs/black-box-confidence-intervals-excel-and-perl-implementations-det" target="_blank">here</a>. Check original article for most recent updates.</em></p>
<p>Confidence interval is abbreviated as CI. In this new article (part of our series on robust techniques for automated data science) we describe an implementation both in Excel and Perl, and discussion about our popular model-free confidence interval technique introduced <a href="http://www.analyticbridge.com/profiles/blogs/how-to-build-simple-accurate-data-driven-model-free-confidence-in" target="_blank">in our original Analyticbridge article</a>. This technique has the following advantages:</p>
<ul>
<li>Very easy to understand by non-statisticians (business analysts, software engineers, programmers, data architects)</li>
<li>Simple (if not basic) to code; no need to use tables of Gaussian, Student or other statistical distributions </li>
<li>Robust, not sensitive to outliers</li>
<li>Model-independent, data-driven: no assumptions required about the data set; it works with non-normal data, and produces asymmetrical confidence intervals</li>
<li>Therefore, suitable for black-box implementation or automated data science</li>
</ul>
<p>This is part of our series on data science techniques suitable for automation, usebla by non-experts. The next one to be detailed (with source code) will be our <a href="http://www.datasciencecentral.com/profiles/blogs/hidden-decision-trees-revisited" target="_blank">Hidden Decision Trees</a>.</p>
<p><a href="http://storage.ning.com/topology/rest/1.0/file/get/2656749873?profile=original" target="_self"><img width="607" class="align-center" src="http://storage.ning.com/topology/rest/1.0/file/get/2656749873?profile=original"/></a></p>
<p><em><strong>Figure 1</strong>: Confidence bands based on our CI (bold red and blue curves) - </em><em>Comparison with traditional normal model (light red anf blue curves)</em></p>
<p>Figure 1 is based on simulated data that does not follow a normal distribution : see section 2 and Figure 2 in this article. It shows how sensitive CI's are to model assumptions, when using the traditional approach, leading to very conservative (and plain wrong) CI's. Classical CI's are just based on 2 parameters: mean and variance. With the classical model, all data sets with same mean and same variance have same CI's. To the contrary, our CI's are based on k parameters - average values computed on k different bins - see next section for details. In short, they are much better predictive indicators when your data is not normal. Yet they are so easy to understand and compute, you don't even need to understand probability 101 to get started. The attached spreadsheet and Perl scripts have all computations done for you.</p>
<p><span class="font-size-4"><strong>1. General Framework</strong></span></p>
<p>We assume that we have n observations from a continuous or discrete variable. We randomly assign a bin number to each observation: we create k bins (1 ≤ k ≤ n) that have similar or identical sizes. We compute the average value in each bin, then we sort these averages. Let p(m) be the m-th lowest average (1 ≤ m ≤ k/2, with p(1) being the minimum average). Then our CI is defined as follows:</p>
<ul>
<li>Lower bound: p(m)</li>
<li>Upper bound: p(k-m+1)</li>
<li>Confidence level, also called level or CI level: equal to 1 - 2m/(k+1)</li>
</ul>
<p>The confidence level represents the probability that a new observation (from the same data set) will be between the lower and upper bounds of the CI. Note that this method produces asymetrical CI's. It is equivalent to designing percentile-based confidence intervals on aggregated data. In practice, k is chosen much smaller than n, say k = SQRT(n). Also m is chosen to that 1 - 2m/(k+1) is as close as possible to a pre-specified confidence level, for instance 0.95. Note that the higher m, the more robust (outlier-nonsensitive) your CI.</p>
<p>If you can't find m and k to satisfy level = 0.95 (say), then compute a few CI's (with different values of m), with confidence level close to 0.95. Then <span>inperpolate</span> or <span>extrapolate</span> the lower and upper bounds to get a CI with 0.95 confidence level. The concept is easy to visualize if you look at Figure 1. Also, do proper cross-validation: slpit your data in two; compute CI's using the first half, and test them on the other half, to see if they still continue to have sense (same confidence level, etc.) </p>
<p>CI's are extensively used in quality control, to check if a batch of new products (say, batteries) have failure rates, lifetime or other performance metrics that are within reason, that are acceptable. Or if wine advertised with 12.5% alcohol content has an actual alcohol content reasonably close to 12.5% in each batch, year after year. By "acceptable" or "reasonable", we mean between the upper and lower bound of a CI with pre-specified confidence level. CI are also used in scoring algorithms, to provide CI to each score.The CI provides an indication about how accurate the score is. Very small confidence levels (that is, narrow CI's) corresponds to data well understood, with all sources of variances perfectly explained. Converserly, large CI's mean lot's of noise and high individual variance in the data. Finally, if your data is stratified in multiple heterogeneous segments, compute separate CI's for each strata.</p>
<p>That's it, no need to know even rudimentary statistical science to understand this CI concept, as well as the concept of hypothesis testing (derived from CI) explained below in section 3. </p>
<p><strong>When Big Data is Useful</strong></p>
<p>If you look closely at Figure 1, it's clear that you can't compute accurate CI's with a high (above 0.99) level, with just a small sample and (say) k=100 bins. The higher the level, the more volatile the CI. Typically, an 0.999 level requires 10,000 or more observations to get something stable. These high-level CI's are needed especially in the context of assessing failure rates, food quality, fraud detection or sound statistical litigation. There are ways to work with much smaller samples by combining 2 tests, see section 3.</p>
<p>An advantage of big data is that you can create many different combinations of k bins (that is, test many values of m and k) to look at how the confidence bands in Figure 1 change depending on the bin selection - even allowing you to create CI's for these confidence bands, just like you could do with Bayesian models.</p>
<p><span class="font-size-4"><strong>2. Computations: Excel, Source Code</strong></span></p>
<p>The first step is to <a href="http://www.analyticbridge.com/profiles/blogs/interesting-computational-complexity-question" target="_blank">re-shuffle your data</a> to make sure that your observations are in perfect random order: read <em>A New Big Data Theorem</em> section <a href="http://www.hadoop360.com/blog/a-synthetic-variance-designed-for-hadoop-and-big-data" target="_blank">in this article</a> for an explanation why reshuffling is necessary (look at the second theorem). In short, you want to create bins that have the same mix of values: if the first half of your data set consisted of negative values, and the second half of positive values, you might end up with bins either filled with positive or negative values. You don't want that; you want each bin to be well balanced.</p>
<p><strong>Reshuffling Step</strong></p>
<p>Unless you know that your data is in an arbitrary order (this is the case most frequently), reshuffling is recommended. Reshuffling can easily be performed as follows:</p>
<ul>
<li>Add a column or variable called RAN, made up of simulated random numbers, using a function such as 100,000 + INT(10,000*RAND()) where RAND() returns a random number between 0 and 1.</li>
<li>Sort your data by column RAN</li>
<li>Delete column RAN</li>
</ul>
<p>Note that we use 100,000 + INT(10,000*RAND()) rather than just simply RAND() to make sure that all random numbers are integers with the same number of digits. This way, whether you sort alphabetically or numerically, the result will be identical, and correct. Sorting numbers of variable length alphabetically (without knowing it) is a source of many bugs in software engineering. This little trick helps you avoid this problem.</p>
<p>If the order in your data set is very important, just add a column that has the original rank attached to each observation (in your initial data set), and keep it through the res-shuffling process (after each observation has been assigned to a bin), so that you can always recover the original order if necessary, by sorting back according to this extra column.</p>
<p><strong>The Spreadsheet</strong></p>
<p>Download the <a href="http://storage.ning.com/topology/rest/1.0/file/get/2656750270?profile=original" target="_self">Excel spreadsheet</a>. Figures 1 and 2 are in the spreadsheet, as well as all CI computations, and more. The spreadsheet illustrates many not so well known but useful analytic Excel functions, such as: FREQUENCY, PERCENTILE, CONFIDENCE.NORM, RAND, AVERAGEIF, MOD (for bin creations) and RANK. The CI computations are in cells O2:Q27 in the Confidence Intervals tab. You can modify the data in column B, and all CI's will automatically be re-computed. Beware if you change the number of bins (cell F2): this can screw up the RANK function in column J (some ranks will be missing) and then screw up the CI's. </p>
<p>For other examples of great spreadsheet (from a tutorial point of view), check the Excel section in our <a href="http://www.datasciencecentral.com/profiles/blogs/data-science-cheat-sheet" target="_blank">data science cheat sheet</a>. </p>
<p><strong>Simulated Data</strong></p>
<p>The simulated data in our <a href="http://storage.ning.com/topology/rest/1.0/file/get/2656750270?profile=original" target="_self">Excel spreadsheet</a> (see the <em>data simulation</em> tab), represents a mixture of two uniform distributions, driven by the parameters in the orange cells F2, F3 and H2. The 1,000 original simulated values (see Figure 2) were stored in column D, and were subsequently hard-copied into column B in the <em>Confidence Interval</em> (results) tab (they still reside there), because otherwise, each time you modify the spreadsheet, new deviates produced by the RAND Excel function are automatically updated, changing everything and making our experiment non-reproducible. This is a drawback of Excel, thought I've heard that it is possible to freeze numbers produced by the function RAND. The simulated data is remarkably non-Gaussian, see Figure 2. It provides a great example of data that causes big problems with traditional statistical science, as described in our following subsection.</p>
<p>In any case, this is an interesting tutorial on how to generate simulated data in Excel. Other examples can be found in our <a href="http://www.datasciencecentral.com/profiles/blogs/data-science-cheat-sheet" target="_blank">Data Science Cheat Sheet</a> (see Excel section).</p>
<p><strong>Comparison with Traditional Confidence Intervals</strong></p>
<p>We provide a comparison with standard CI's (available in all statistical packages) in Figure 1, and in our spreadsheet. There are a few ways to compute traditional CI's:</p>
<ul>
<li>Simulate Gaussian deviates with pre-specified variance matching your data variance, by(1) generating (say) 10 million uniform deviates on [-1, +1] using a <a href="http://www.analyticbridge.com/forum/topics/challenge-of-the-week-random-numbers" target="_blank">great random generator</a>, (2) randomly grouping these generated values in 10,000 buckets each having 1,000 deviates, and (3) compute averages in each bucket. These 10,000 averages will approximate very well a Gaussian distribution, all you need to do is to scale them so that the variance matches the variance in your data sets. And then compute intervals that contain 99%, 95%, or 90% off all the scaled averages: these are your standard Gaussian CI's.</li>
<li>Use libraries to simulate Gaussian deviates, rather than the cave-man appoach mentioned above. Source code and simulators can be found in books such as <a href="http://www.nr.com/" target="_blank">Numerical Recipes</a>.</li>
<li>In our Excel spreadsheet, we used the Confidence.norm function.</li>
</ul>
<p>As you can see in Figure 1, traditional CI's fail miserably if your data has either a short or long tail, compared with data originating from a Gaussian process.</p>
<p><strong>Perl Code</strong></p>
<p>Here's some simple source code to compute CI for given m and k:</p>
<p><span class="font-size-2">$k=50; # number of bins</span><br/><span class="font-size-2">$m=5;</span></p>
<p><span class="font-size-2">open(IN,"< data.txt");</span><br/><span class="font-size-2">$binNumber=0;</span><br/><span class="font-size-2">while ($value=<IN>) {</span><br/><span class="font-size-2"> $value=~s/\n//g;</span><br/><span class="font-size-2"> $binNumber = $n % $k;</span> <br/><span class="font-size-2"> $binSum[$binNumber] += $value;</span><br/><span class="font-size-2"> $binCount[$binNumber] ++;</span><br/><span class="font-size-2"> $n++;</span><br/><span class="font-size-2">}</span></p>
<p><span class="font-size-2">if ($n < $k) {</span> <br/><span class="font-size-2"> print "Error: Too few observations: n < k (choose a smaller k)\n";</span> <br/><span class="font-size-2"> exit();</span><br/><span class="font-size-2">}</span></p>
<p><span class="font-size-2">if ($m> $k/2) {</span><br/><span class="font-size-2"> print "Error: reduce m (must be <= k/2)\n";</span><br/><span class="font-size-2"> exit();</span><br/><span class="font-size-2">}</span></p>
<p><span class="font-size-2">for ($binNumber=0; $binNumber<$k; $binNumber++) {</span><br/><span class="font-size-2"> $binAVG[$binNumber] = $binSum[$binNumber]/$binCount[$binNumber];</span><br/><span class="font-size-2">}</span></p>
<p><br/><span class="font-size-2">$binNumber=0;</span><br/><span class="font-size-2">foreach $avg (sort { $a <=> $b } @binAVG) { # sorting bins numerically</span><br/><span class="font-size-2"> $sortedBinAVG[$binNumber] = $avg;</span><br/><span class="font-size-2"> $binNumber++;</span><br/><span class="font-size-2">}</span></p>
<p><span class="font-size-2">$CI_LowerBound= $sortedBinAVG[$m];</span><br/><span class="font-size-2">$CI_UpperBound= $sortedBinAVG[$k-$m+1];</span><br/><span class="font-size-2">$CI_level=1-2*$m/($k+1);</span></p>
<p><span class="font-size-2">print "CI = [$CI_LowerBound,$CI_UpperBound] (level = $CI_level)\n";</span></p>
<p><strong>Exercise</strong>: write the code in R or Python.</p>
<p><span class="font-size-4"><strong>3. Application to Statistical Testing</strong></span></p>
<p>Rather than using <em>p</em>-values and other <a href="http://www.datasciencecentral.com/profiles/blogs/p-values-the-gold-standard-of-statistical-validity-are-not-as" target="_blank">dangerous concepts</a> (about to become extinct) that nobody but statisticians understand, here is an easy way to perform statistical tests. The method below is part of what we call <a href="http://www.datasciencecentral.com/profiles/blogs/data-science-has-been-using-rebel-statistics-for-a-long-time" target="_blank">rebel statistical science</a>.</p>
<p>Let's say that you want to test, with 99.5% confidence (level = 0.995), whether or not a wine manufacturer consistently produces a specific wine that has a 12.5% alcohol content. Maybe you are a lawyer, and the wine manufacturer is accused of lying on the bottle labels (claiming that alcohol content is 12.5% when indeed it is 13%), maybe to save some money. The test to perform is as follows: check out 100 bottles from various batches, and compute an 0.995-level CI for alcohol content. Is 12.5% between the upper and lower bounds? Note that you might not be able to get an exact 0.995-level CI if your sample size n is too small (say n=100), you will have to <span>extrapolate</span> from lower level CI's, but the reason here to use a high confidence level is to give the defendant the <em>benefit of the doubt</em> rather than wrongly accusing him based on a too small<em>confidence level</em>. If 12.5% is found inside even a small 0.50-level CI (which will be the case if the wine is truly 12.5% alcohol), then a fortiori it will be inside an 0.995-level CI, because these CI's are <span>nested</span> (see Figure 1 to understand these ideas). Likewise, if the wine truly has a 13% alcohol content, a tiny 0.03-level CI containing the value 13% will be enough to prove it. </p>
<p>One way to better answer these statistical tests (when your high-level CI's don't provide an answer) is to produce 2 or 3 tests (but no more, <a href="http://www.datasciencecentral.com/profiles/blogs/tutorial-how-to-detect-spurious-correlations-and-how-to-find-the-" target="_blank">otherwise your results will be biased</a>). Test whether the alcohol rate is</p>
<ul>
<li>As declared by the defendant (test #1)</li>
<li>Equal to a pre-specified value (the median computed on a decent sample, test #2)</li>
</ul>
<p><span class="font-size-4"><strong>4. Miscellaneous</strong></span></p>
<p>We include two figures in this section. The first one is about the data used in our test and Excel spreadsheet, to produce our confidence intervals. And the other figure shows the theorem that justifies the construction of our confidence intervals.</p>
<p><a href="http://storage.ning.com/topology/rest/1.0/file/get/2656750918?profile=original" target="_self"><img width="538" class="align-center" src="http://storage.ning.com/topology/rest/1.0/file/get/2656750918?profile=original"/></a></p>
<p><em><strong>Figure 2</strong>: Simulated data used to compute CI's: asymmetric mixture of non-normal distrubutions</em></p>
<p><img class="align-center" src="http://storage.ning.com/topology/rest/1.0/file/get/2220279788?profile=original"/></p>
<p><em><strong>Figure 3</strong>: Theorem used to justify our confidence intervals</em></p> How to detect spurious correlations, and how to find the real onestag:www.datasciencecentral.com,2014-05-26:6448529:Topic:1724512014-05-26T23:43:13.594ZVincent Granvillehttps://www.datasciencecentral.com/profile/VincentGranville
<p><em>Originally posted on DataSciebceCentral, by <a href="http://www.datasciencecentral.com/profiles/blogs/my-data-science-journey" target="_blank">Dr. Granville</a>. <a href="http://www.datasciencecentral.com/profiles/blogs/tutorial-how-to-detect-spurious-correlations-and-how-to-find-the-" target="_blank">Click here</a> to read original article and comments.</em></p>
<p>Specifically designed in the context of big data in our research lab, the new and simple <em>strong…</em></p>
<p><em>Originally posted on DataSciebceCentral, by <a href="http://www.datasciencecentral.com/profiles/blogs/my-data-science-journey" target="_blank">Dr. Granville</a>. <a href="http://www.datasciencecentral.com/profiles/blogs/tutorial-how-to-detect-spurious-correlations-and-how-to-find-the-" target="_blank">Click here</a> to read original article and comments.</em></p>
<p>Specifically designed in the context of big data in our research lab, the new and simple <em>strong correlation</em> synthetic metric proposed in this article should be used, whenever you want to check if there is a real association between two variables, especially in large-scale automated data science or machine learning projects. Use this new metric now, to avoid being accused of <a href="http://www.datasciencecentral.com/forum/topics/five-articles-worth-reading" target="_blank">reckless data science</a> and even<a href="http://www.analyticbridge.com/profiles/blogs/scientists-tried-on-manslaughter-charges-for-providing-incorrect" target="_blank">being sued for wrongful analytic practice</a>.</p>
<p>In this paper, the traditional correlation is referred to as the <em>weak correlation</em>, as it captures only a small part of the association between two variables: <em>weak correlation</em> results in capturing spurious correlations and predictive modeling deficiencies, even with as few as 100 variables. In short, our <em>strong correlation</em> (with a value between 0 and 1) is high (say above 0.80) if not only the <em>weak correlation</em> is also high (in absolute value), but when the internal structures (auto-dependencies) of both variables X and Y that you want to compare, exhibit a similar pattern or correlogram. Yet this new metric is simple and involves just one parameter <strong>a</strong> (with <strong>a</strong> = 0 corresponding to <em>weak correlation</em>, and <strong>a</strong> =1 being the recommended value for<em>strong correlation</em>). This setting is designed to avoid over-fitting. </p>
<p>Our <em>strong correlation</em> blends together the concept of ordinary or <em>weak regression</em> - indeed, an improved, robust, <a href="http://www.analyticbridge.com/profiles/blogs/correlation-and-r-squared-for-big-data" target="_blank">outlier-resistant version of ordinary regression</a> (or see <a href="http://www.datasciencecentral.com/profiles/blogs/my-data-science-book" target="_blank">my book</a> pages 130-140) - together with the concept of X and Y sharing similar <a href="http://www.analyticbridge.com/profiles/blogs/three-classes-of-metrics-centrality-volatility-and-bumpiness" target="_blank">bumpiness</a> (or see <a href="http://www.datasciencecentral.com/profiles/blogs/my-data-science-book" target="_blank">my book</a> pages 125-128).</p>
<p>In short, even nowadays, what makes two variables X and Y <em>seem</em> related in most scientific articles and pretty much all articles written by journalists, is based on ordinary (weak) regression. But there are plenty of other metrics that you can use to compare two variables. Including bumpiness in the mix (together with weak regression in just one single blended metric called <em>strong correlation</em> to boost accuracy) guarantees that high <em>strong</em> correlation means that the two variables are really associated, not just based on flawy, old-fashioned <em>weak</em> correlations, but also associated based on sharing similar internal auto-dependencies and structure. To put it differently, two variables can be highly <em>weakly</em> correlated yet have very different bumpiness coefficients, as shown in <a href="http://www.analyticbridge.com/profiles/blogs/correlation-and-r-squared-for-big-data" target="_blank">my original article</a> - meaning that there might be <a href="http://www.datasciencecentral.com/forum/topics/correlation-vs-causation" target="_blank">no causal relationship</a> (or see <a href="http://www.datasciencecentral.com/profiles/blogs/my-data-science-book" target="_blank">my book</a> pages 165-168) or hidden factors explaining the link. An artificial example is provided below in figure 3.</p>
<p>Using <span><em>strong</em></span>, rather than <em><span>weak</span></em> correlation, eliminates the majority of these spurious correlations, as we shall see in the examples below. This <em>strong correlation</em> metric is designed to be integrated in automated data science algorithms.</p>
<p><strong>1. Formal definition of <em>strong</em> correlation</strong></p>
<p>Let's define</p>
<ul>
<li><strong><em>Weak correlation</em> </strong>c(X, Y) as the absolute value of the ordinary correlation, with value between 0 and 1. This number is high (close to 1) if X and Y are highly correlated. I recommend using my <a href="http://www.analyticbridge.com/profiles/blogs/correlation-and-r-squared-for-big-data" target="_blank">rank-based, L-1 correlation</a> (or see <a href="http://www.datasciencecentral.com/profiles/blogs/my-data-science-book" target="_blank">my book</a> pages 130-140) to eliminate problems caused by outliers.</li>
<li>c1(X) as the lag-1 auto-correlation for X, that is, if we have n observations X_1 ... X_n, then c1(X) = c(X_1 ... X_{n-1}, X_2 ... X_n)</li>
<li>c1(Y) as the lag-1 auto-correlation for Y</li>
<li><strong><em>d-correlation</em> </strong>d(X, Y) = exp{ -<strong>a</strong> * | ln( c1(X) / c1(Y) ) | }, with possible adjustment if numerator or denominator is zero, and parameter <strong>a</strong> must be positive or zero. This number, with value between 0 and 1, is high (close to 1) if X and Y have similar lag-1 auto-correlations.</li>
<li><strong><em>Strong correlation</em> </strong>r(X, Y) = min{ c(X, Y), d(X, Y) }</li>
</ul>
<p>Note that c1(X), and c1(Y) are the <a href="http://www.analyticbridge.com/profiles/blogs/three-classes-of-metrics-centrality-volatility-and-bumpiness" target="_blank">bumpiness coefficients</a> (or see <a href="http://www.datasciencecentral.com/profiles/blogs/my-data-science-book" target="_blank">my book</a> pages 125-128) for X and Y. Also, d(X, Y) and thus r(X, Y) are between 0 and 1, with 1 meaning strong similarity between X and Y, and 0 meaning <span>either</span> dissimilar lag-1 auto-correlations for X and Y, <span>or</span> lack of old-fashioned correlation.</p>
<p>The <em>strong correlation</em> between X and Y is, by definition, r(X, Y). This is an approximation to having both spectra identical, a solution mentioned in my article <a href="http://www.analyticbridge.com/profiles/blogs/the-curse-of-big-data" target="_blank">The curse of Big Data</a> (see also <a href="http://www.datasciencecentral.com/profiles/blogs/my-data-science-book" target="_blank">my book</a> pages 41-45).</p>
<p>This definition of strong correlation was initially suggested in one of our <a href="http://www.analyticbridge.com/forum/topics/challenge-of-the-week-2" target="_blank">weekly challenges</a>.</p>
<p><strong>2. Comparison with traditional (<em>weak</em>) correlation</strong></p>
<p>When <strong>a</strong> = 0, weak and strong correlations are identical. Note that the <em>strong correlation</em> r(X, Y) still shares the same properties as the <em>weak correlation</em> c(X, Y): it is symmetric and invariant under linear transformations (such as re-scaling) of variables X or Y, regardless of <strong>a</strong>. <span> </span></p>
<p><span><a href="http://www.datasciencecentral.com/profiles/blogs/tutorial-how-to-detect-spurious-correlations-and-how-to-find-the-" target="_blank">Read full article</a>.</span></p>
<p></p>
<p></p> Practical illustration of Map-Reduce (Hadoop-style), on real datatag:www.datasciencecentral.com,2014-05-26:6448529:Topic:1723682014-05-26T23:41:32.861ZVincent Granvillehttps://www.datasciencecentral.com/profile/VincentGranville
<p><em>Originally posted on DataScienceCentral, by <a href="http://www.datasciencecentral.com/profiles/blogs/my-data-science-journey" target="_blank">Dr. Granville</a>. <a href="http://www.datasciencecentral.com/profiles/blogs/practical-illustration-of-map-reduce-hadoop-style-on-real-data?xg_source=activity" target="_blank">Click here</a> to read original article and comments.</em></p>
<p><span class="font-size-2">Here I will discuss a general framework to process web traffic data. The concept…</span></p>
<p><em>Originally posted on DataScienceCentral, by <a href="http://www.datasciencecentral.com/profiles/blogs/my-data-science-journey" target="_blank">Dr. Granville</a>. <a href="http://www.datasciencecentral.com/profiles/blogs/practical-illustration-of-map-reduce-hadoop-style-on-real-data?xg_source=activity" target="_blank">Click here</a> to read original article and comments.</em></p>
<p><span class="font-size-2">Here I will discuss a general framework to process web traffic data. The concept of Map-Reduce will be naturally introduced. Let's say you want to design a system to score Internet clicks, to measure the chance for a click to convert, or the chance to be fraudulent or un-billable. The data comes from a publisher or ad network; it could be Google. Conversion data is limited and poor (some conversions are tracked, some are not; some conversions are soft, just a click-out, and conversion rate is above 10%; some conversions are hard, for instance a credit card purchase, and conversion rate is below 1%). Here, for now, we just ignore the conversion data and focus on the low hanging fruits: click data. Other valuable data is impression data (for instance a click not associated with an impression is very suspicious). But impression data is huge, 20 times bigger than click data. We ignore impression data here.</span></p>
<p></p>
<p><span class="font-size-2">Here, we work with complete click data collected over a 7-day time period. Let's assume that we have 50 million clicks in the data set. Working with a sample is risky, because much of the fraud is spread across a large number of affiliates, and involve clusters (small and large) of affiliates, and tons of IP addresses but few clicks per IP per day (low frequency).</span></p>
<p><span class="font-size-2">The data set (ideally, a tab-separated text file, as CSV files can cause field misalignment here due to text values containing field separators) contains 60 fields: keyword (user query or advertiser keyword blended together, argh...), referral (actual referral domain or ad exchange domain, blended together, argh...), user agent (UA, a long string; UA is also known as browser, but it can be a bot), affiliate ID, partner ID (a partner has multiple affiliates), IP address, time, city and a bunch of other parameters.</span></p>
<p><span class="font-size-2">The first step is to extract the relevant fields for this quick analysis (a few days of work). Based on domain expertise, we retained the following fields:</span></p>
<ul>
<li><span class="font-size-2">IP address</span></li>
<li><span class="font-size-2">Day</span></li>
<li><span class="font-size-2">UA (user agent) ID - so we created a look-up table for UA's</span></li>
<li><span class="font-size-2">Partner ID</span></li>
<li><span class="font-size-2">Affiliate ID</span></li>
</ul>
<p><span class="font-size-2">These 5 metrics are the base metrics to create the following summary table. Each (IP, Day, UA ID, Partner ID, Affiliate ID) represents our atomic (most granular) data bucket.</span></p>
<p><span class="font-size-2"><strong>Building a summary table: the Map step</strong></span></p>
<p><span class="font-size-2">The summary table will be built as a text file (just like in Hadoop), the data key (for joins or groupings) being (IP, Day, UA ID, Partner ID, Affiliate ID). For each atomic bucket (IP, Day, UA ID, Partner ID, Affiliate ID) we also compute:</span></p>
<ul>
<li><span class="font-size-2">number of clicks</span></li>
<li><span class="font-size-2">number of unique UA's</span></li>
<li><span class="font-size-2">list of UA</span></li>
</ul>
<p><span class="font-size-2">The list of UA's, for a specific bucket, looks like ~6723|9~45|1~784|2, meaning that in the bucket in question, there are three browsers (with ID 6723, 45 and 784), 12 clicks (9 + 1 + 2), and that (for instance) browser 6723 generated 9 clicks.</span></p>
<p><span class="font-size-2">In Perl, these computations are easily performed, as you sequentially browse the data. The following updates the click count:</span></p>
<p><span class="font-size-2">$hash_clicks{"IP\tDay\tUA_ID\tPartner_ID\tAffiliate_ID"};</span></p>
<p><span class="font-size-2">Updating the list of UA's associated with a bucket is a bit less easy, but still almost trivial.</span></p>
<p><span class="font-size-2">The problem is that at some point, the hash table becomes too big and will slow down your Perl script to a crawl. The solution is to split the big data in smaller data sets (called subsets), and perform this operation separately on each subset. This is called the <em>Map</em> step, in Map-Reduce. You need to decide which fields to use for the mapping. Here, IP address is a good choice because it is very granular (good for load balance), and the most important metric. We can split the IP address field in 20 ranges based on the first byte of the IP address. This will result in 20 subsets. The splitting in 20 subsets is easily done by browsing sequentially the big data set with a Perl script, looking at the IP field, and throwing each observation in the right subset based on the IP address.</span></p>
<p><span class="font-size-2"><strong>Building a summary table: the Reduce step</strong></span></p>
<p><span class="font-size-2">Now, after producing the 20 summary tables (one for each subset), we need to merge them together. We can't simply use hash table here, because they will grow too large and it won't work - the reason why we used the <em>Map</em> step in the first place.</span></p>
<p><span class="font-size-2">Here's the work around: ...</span></p>
<p><span class="font-size-2"><a href="http://www.datasciencecentral.com/profiles/blogs/practical-illustration-of-map-reduce-hadoop-style-on-real-data?xg_source=activity" target="_blank">Read full article</a>.</span></p>
<p></p> Jackknife logistic and linear regression for clustering and predictionstag:www.datasciencecentral.com,2014-05-26:6448529:Topic:1723662014-05-26T23:39:46.654ZVincent Granvillehttps://www.datasciencecentral.com/profile/VincentGranville
<p><em>Originally posted on DataSciebceCentral, by <a href="http://www.datasciencecentral.com/profiles/blogs/my-data-science-journey" target="_blank">Dr. Granville</a>. <a href="http://www.datasciencecentral.com/profiles/blogs/jackknife-logistic-and-linear-regression" target="_blank">Click here</a> to read original article and comments.</em></p>
<p>This article discusses a far more general version of the technique described in our article …</p>
<p><em>Originally posted on DataSciebceCentral, by <a href="http://www.datasciencecentral.com/profiles/blogs/my-data-science-journey" target="_blank">Dr. Granville</a>. <a href="http://www.datasciencecentral.com/profiles/blogs/jackknife-logistic-and-linear-regression" target="_blank">Click here</a> to read original article and comments.</em></p>
<p>This article discusses a far more general version of the technique described in our article <a href="http://www.datasciencecentral.com/profiles/blogs/the-best-kept-secret-about-linear-and-logistic-regression" target="_blank">The best kept secret about regression</a>. Here we adapt our methodology so that it applies to data sets with a more complex structure, in particular with highly correlated independent variables.</p>
<p></p>
<p>Our goal is to produce a regression tool that can be used as a black box, be very robust and parameter-free, and usable and easy-to-interpret by non-statisticians. It is part of a bigger project: automating many fundamental data science tasks, to make it easy, scalable and cheap for data consumers, not just for data experts. Our previous attempts at automation include</p>
<ul>
<li><a href="http://www.analyticbridge.com/profiles/blogs/how-to-build-simple-accurate-data-driven-model-free-confidence-in" target="_blank">Data driven confidence intervals</a></li>
<li><a href="http://www.datasciencecentral.com/profiles/blogs/hidden-decision-trees-revisited" target="_blank">Hidden decision trees</a> (HDT)</li>
<li><a href="http://www.datasciencecentral.com/profiles/blogs/feature-selection-based-on-predictive-power" target="_blank">Fast Combinatorial Feature Selection</a></li>
<li><a href="http://www.datasciencecentral.com/profiles/blogs/practical-illustration-of-map-reduce-hadoop-style-on-real-data" target="_blank">Map-Reduce applied to log file processing</a></li>
<li><a href="http://www.bigdatanews.com/profiles/blogs/fast-clustering-algorithms-for-massive-datasets" target="_blank">Building giant taxonomies</a></li>
</ul>
<p>Readers are invited to further formalize the technology outlined here, and challenge my proposed methodology.</p>
<p><strong>1. Introduction</strong></p>
<p>As in our previous paper, without loss of generality, we focus on linear regression with centered variables (with zero mean), and no intercept. Generalization to logistic or non-centered variables is straightforward.</p>
<p>Thus we are still dealing with the following regression framework:</p>
<p>Y = a_1 * X_1 + ... + a_n * X_n + noise</p>
<p>Remember that the solution proposed in our previous paper was</p>
<ul>
<li>b_i = cov(Y, X_i) / var(X_i), i = 1, ..., n</li>
<li>a_i = M * b_i, i = 1, ..., n</li>
<li>M (a real number, not a matrix) is chosen to minimize var(Z), with Z = Y - a_1 * X_1 + ... + a_n * X_n</li>
</ul>
<p>When cov(X_i, X_j) = 0 for i < j, my regression and the classical regression produce identical regression coefficients, and M = 1.</p>
<p>Terminology: Z is the noise, Y is the (observed) response, the a_i's are the regression coefficients, and and S = a_1 * X_1 + ... + a_n * X_n is the estimated or predicted response. The X_i's are the independent variables or features.</p>
<p><strong>2. Re-visiting our previous data set</strong></p>
<p>I have added more cross-correlations to the <a href="http://www.datasciencecentral.com/profiles/blogs/the-best-kept-secret-about-linear-and-logistic-regression" target="_blank">previous simulated dataset</a> consisting of 4 independent variables, still denoted as x, y, z, u in the new, <a href="http://storage.ning.com/topology/rest/1.0/file/get/2656751664?profile=original" target="_self">updated attached spreadsheet</a>. Now corr(x, y) = 0.99.</p>
<p><a href="http://www.datasciencecentral.com/profiles/blogs/jackknife-logistic-and-linear-regression" target="_blank">Read full article</a>.</p> A synthetic variance designed for Hadoop and big datatag:www.datasciencecentral.com,2014-05-26:6448529:Topic:1722992014-05-26T23:37:25.460ZVincent Granvillehttps://www.datasciencecentral.com/profile/VincentGranville
<p><em>Originally posted on Hadoop36o, by <a href="http://www.datasciencecentral.com/profiles/blogs/my-data-science-journey" target="_blank">Dr. Granville</a>. <a href="http://www.hadoop360.com/blog/a-synthetic-variance-designed-for-hadoop-and-big-data" target="_blank">Click here</a> to read original article and comments.</em></p>
<p>The new variance introduced in this article fixes two big data problems associated with the traditional variance and the way it is computed in Hadoop, using a…</p>
<p><em>Originally posted on Hadoop36o, by <a href="http://www.datasciencecentral.com/profiles/blogs/my-data-science-journey" target="_blank">Dr. Granville</a>. <a href="http://www.hadoop360.com/blog/a-synthetic-variance-designed-for-hadoop-and-big-data" target="_blank">Click here</a> to read original article and comments.</em></p>
<p>The new variance introduced in this article fixes two big data problems associated with the traditional variance and the way it is computed in Hadoop, using a numerically unstable formula.</p>
<p><strong>Synthetic Metrics</strong></p>
<p>This new metric is <em>synthetic</em>: It was not derived naturally from mathematics like the variance taught in any statistics 101 course, or the variance currently implemented in Hadoop (see above picture). By<em>synthetic</em>, I mean that it was built to address issues with big data (outliers) and the way many big data computations are now done: Map Reduce framework, Hadoop being an implementation. It is a top-down approach to metric design - from data to theory, rather than the bottom-up traditional approach - from theory to data.</p>
<p>Other synthetic metrics designed in our research laboratory include:</p>
<ul>
<li><a href="http://www.datasciencecentral.com/profiles/blogs/feature-selection-based-on-predictive-power" target="_blank">Predictive power metric</a>, related to entropy (that is, information quantification), used in big data frameworks, for instance to identify optimum feature combinations for scoring algorithms.</li>
<li><a href="http://www.analyticbridge.com/profiles/blogs/correlation-and-r-squared-for-big-data" target="_blank">Correlation for big data</a>, defined by an algorithm and closely related to the optimum variance metric discussed here.</li>
<li><a href="http://www.analyticbridge.com/profiles/blogs/structuredness-coefficient-to-find-patterns-and-associations" target="_blank">Structuredness coefficient</a></li>
<li><a href="http://www.analyticbridge.com/profiles/blogs/three-classes-of-metrics-centrality-volatility-and-bumpiness" target="_blank">Bumpiness coefficient</a></li>
</ul>
<p><strong>Hadoop, numerical and statistical stability</strong></p>
<p>There are two issues with the formula used for computing Variance in Hadoop. First, the formula used, namely Var(x1, ... , xn) = {SUM(xi^2)/n} - {SUM(xi)/n}^2, is notoriously unstable. For large n, while both terms cancel out somewhat, each one taken separately can take a huge value, because of the squares aggregated over billions of observations. It results in numerical inaccuracies, with people having reported negative variances. Read the comments attached to my article <a href="http://www.analyticbridge.com/profiles/blogs/the-curse-of-big-data" target="_blank">The curse of Big Data</a> for details. Besides, there are variance formula that do not require two passes of the entire data sets, and that are numerically stable.</p>
<p><a href="http://www.hadoop360.com/blog/a-synthetic-variance-designed-for-hadoop-and-big-data" target="_blank">Read full article</a>.</p>
<p></p> Fast Combinatorial Feature Selection with New Definition of Predictive Powertag:www.datasciencecentral.com,2014-05-26:6448529:Topic:1725412014-05-26T23:34:48.980ZVincent Granvillehttps://www.datasciencecentral.com/profile/VincentGranville
<p><em>Originally posted on DataScienceCentral, by <a href="http://www.datasciencecentral.com/profiles/blogs/my-data-science-journey" target="_blank">Dr. Granville</a>. <a href="http://www.datasciencecentral.com/profiles/blogs/feature-selection-based-on-predictive-power" target="_blank">Click here</a> to read original article and comments.</em></p>
<p>In this article, I proposes a simple metric to measure <strong>predictive power</strong>. It is used for combinatorial feature selection, where a…</p>
<p><em>Originally posted on DataScienceCentral, by <a href="http://www.datasciencecentral.com/profiles/blogs/my-data-science-journey" target="_blank">Dr. Granville</a>. <a href="http://www.datasciencecentral.com/profiles/blogs/feature-selection-based-on-predictive-power" target="_blank">Click here</a> to read original article and comments.</em></p>
<p>In this article, I proposes a simple metric to measure <strong>predictive power</strong>. It is used for combinatorial feature selection, where a large number of feature combinations need to be ranked automatically and very fast, for instance in the context of transaction scoring, in order to optimize predictive models. This is about rather big data, and we would like to see an Hadoop methodology for the technology proposed here. It can easily be implemented in a Map Reduce framework. It was developed by the author in the context of credit card fraud detection, and click/keyword scoring. This material will be part of our <a href="http://www.datasciencecentral.com/group/data-science-apprenticeship" target="_blank">data science apprenticeship</a>, and included in our Wiley book.</p>
<p class="Para"><span class="font-size-2"><i>Feature selection</i> is a methodology used to detect the best subset of features, out of dozens or hundreds of features (also called variables or rules). By “best”, we mean with highest <i>predictive power</i>, a concept defined in the following subsection. In short, we want to remove duplicate features, simplify a bit the correlation structure (among features) and remove features that bring no value, such as a features taking on random values, thus lacking predictive power, or features (rules) that are almost never triggered (except if they are perfect fraud indicators when triggered).</span></p>
<p class="Para"><span class="font-size-2">The problem is combinatorial in nature. You want a manageable, small set of features (say 20 features) selected from (say) a set of 500 features, to run our <i>hidden decision trees</i> (or some other classification / scoring technique) in a way that is statistically robust. But there are 2.7 * 10<span>35</span> combinations of 20 features out of 500, and you need to compute all of them to find the one with maximum predictive power. This problem is computationally intractable, and you need to find an alternate solution. The good thing is that you don’t need to find the absolute maximum; you just need to find a subset of 20 features that is good enough.</span></p>
<p class="Para"><span class="font-size-2">One way to proceed is to compute the predictive power of each feature. Then, add one feature at a time to the subset (starting with 0 feature) until you reach either</span></p>
<ul>
<li><span class="font-size-2">20 features (your limit)</span></li>
<li><span class="font-size-2">Adding a new feature does not significantly improve the overall predictive power of the subset (in short, convergence has been attained)</span></li>
</ul>
<p class="Para"><span class="font-size-2">At each iteration, choose the feature to be added among the two remaining features with the highest predictive power: you will choose (among these two features) the one that increases the overall predictive power (of the subset under construction) most. Now you have reduced your computations from 2.7 * 10<span>35</span> to 40 = 2 * 20.</span></p>
<p class="Para"><span class="font-size-2"><a href="http://www.datasciencecentral.com/profiles/blogs/feature-selection-based-on-predictive-power" target="_blank">Read full article</a>.</span></p>
<p></p> Internet topology mappingtag:www.datasciencecentral.com,2014-05-26:6448529:Topic:1723642014-05-26T23:32:25.888ZVincent Granvillehttps://www.datasciencecentral.com/profile/VincentGranville
<p><em>Originally posted on DataScienceCentral, by <a href="http://www.datasciencecentral.com/profiles/blogs/my-data-science-journey" target="_blank">Dr. Granville</a>. <a href="http://www.datasciencecentral.com/profiles/blogs/a-little-known-component-that-should-be-part-of-most-data-science" target="_blank">Click here</a> to read original article and comments.</em></p>
<p><span class="font-size-2">This is a component often missing, yet valuable for most systems, algorithms and architectures…</span></p>
<p><em>Originally posted on DataScienceCentral, by <a href="http://www.datasciencecentral.com/profiles/blogs/my-data-science-journey" target="_blank">Dr. Granville</a>. <a href="http://www.datasciencecentral.com/profiles/blogs/a-little-known-component-that-should-be-part-of-most-data-science" target="_blank">Click here</a> to read original article and comments.</em></p>
<p><span class="font-size-2">This is a component often missing, yet valuable for most systems, algorithms and architectures that are dealing with online or mobile data, known as digital data: be it transaction scoring, fraud detection, online marketing, marketing mix and advertising optimization, online search, plagiarism and spam detection, etc.</span></p>
<p><span class="font-size-2">I will call it an <strong>Internet Topology Mapping</strong>. It might not be stored as a traditional database (it could be a graph database, a file system, or a set of look-up tables). It must be pre-built (e.g. as look-up tables, with regular updates) to be efficiently used.</span></p>
<p><span class="font-size-2"><strong>So what is the Internet Topology Mapping?</strong></span></p>
<p><span class="font-size-2">Essentially, it is a system that matches an IP address (Internet or mobile) with a domain name (ISP). When you process a transaction in real time in production mode (e.g. an online credit card transaction, to decide whether to accept or decline it), your system only has a few milliseconds to score the transaction to make the decision. In short, you only have a few milliseconds to call and run an algorithm (sub-process), on the fly, separately for each credit card transaction, to decide on accepting/rejecting. If the algorithm involves matching the IP address with an ISP domain name (this operation is called<em>nslookup</em>), it won't work: direct nslookups take between a few hundreds to a few thousands milliseconds, and they will slow the system to a grind.</span></p>
<p><span class="font-size-2">Because of that, Internet Topology Mappings are missing in most systems. Yet there is a very simple workaround to build it:</span></p>
<ol>
<li><span class="font-size-2">Look at all the IP addresses in your database. Chances are, even if you are Google, 99.9% of your traffic is generated by fewer than 100,000,000 IP addresses. Indeed, the total number of IP addresses (the whole universe) consists of less than 256^4 = 4,294,967,296 IP addresses. That's about 4 billion, not that big of a number in the real scheme of big data. Also, many IP addresses are clustered: 120.176.231.xxx are likely to be part of the same domain, for xxx in the range (say) 124-180. In short, you need to store a lookup table possibly as small as 20,000,000 records (IP ranges / domain mapping) to solve the nslookup issue for 99.9% of your transactions. For the remaining 0.1%, you can either assign 'Unknown Domain' (not recommended, since quite a few IP addresses actually have unknown domain), or 'missing' (better) or perform the cave-man, very slow nslookup on the fly.</span></li>
<li><span class="font-size-2">Create the look-up table that maps IP ranges to domain names, for 99.9% of your traffic.</span></li>
</ol>
<p><span class="font-size-2">When processing a transaction, access this look-up table (stored in memory, or least with some caching available in memory) to detect the domain name. Now you can use a rule system that does incorporate domain names.</span></p>
<p><span class="font-size-2"><strong>Example of rules and metrics based on domain names are</strong>:</span></p>
<ul>
<li><span class="font-size-2">domain extension (.com, .cc etc.)</span></li>
<li><span class="font-size-2">length of domain name</span></li>
</ul>
<p><span class="font-size-2"><a href="http://www.datasciencecentral.com/profiles/blogs/a-little-known-component-that-should-be-part-of-most-data-science" target="_blank">Read full article</a>.</span></p>
<p></p>
<p></p> Hidden decision trees revisitedtag:www.datasciencecentral.com,2014-05-26:6448529:Topic:1723622014-05-26T23:29:55.428ZVincent Granvillehttps://www.datasciencecentral.com/profile/VincentGranville
<p><em>Originally posted on DataScienceCentral, by <a href="http://www.datasciencecentral.com/profiles/blogs/my-data-science-journey" target="_blank">Dr. Granville</a>. <a href="http://www.analyticbridge.com/profiles/blogs/what-mapreduce-can-t-do" target="_blank">Click here</a> to read original article and comments.</em></p>
<div><p class="H1"><span class="font-size-2"><em>This is a revised version of an earlier article …</em></span></p>
</div>
<p><em>Originally posted on DataScienceCentral, by <a href="http://www.datasciencecentral.com/profiles/blogs/my-data-science-journey" target="_blank">Dr. Granville</a>. <a href="http://www.analyticbridge.com/profiles/blogs/what-mapreduce-can-t-do" target="_blank">Click here</a> to read original article and comments.</em></p>
<div><p class="H1"><span class="font-size-2"><em>This is a revised version of an earlier article <a href="http://www.analyticbridge.com/forum/topics/hidden-decision-trees-vs" target="_blank">posted on AnalyticBridge</a>.</em></span></p>
</div>
<p class="Para"><span class="font-size-2">Hidden decision trees (HDT) is a technique patented by Dr. Granville, to <i>score</i> large volumes of transaction data. It blends robust logistic regression with hundreds small decision trees (each one representing for instance a specific type of fraudulent transaction) and offers significant advantages over both logistic regression and decision trees: robustness, ease of interpretation, and no tree pruning, no node splitting criteria. It makes this methodology powerful and easy to implement even for someone with no statistical background.</span></p>
<p class="Para"><span class="font-size-2"><i>Hidden Decision Trees</i> is a statistical and data mining methodology (just like logistic regression, SVM, neural networks or decision trees) to handle problems with large amounts of data, non-linearity and strongly correlated independent variables.</span></p>
<p class="Para"><span class="font-size-2">The technique is easy to implement in any programming language. It is more robust than decision trees or logistic regression, and helps detect natural final nodes. Implementations typically rely heavily on large, granular hash tables.</span></p>
<p class="Para"><span class="font-size-2">No decision tree is actually built (thus the name hidden decision trees), but the final output of a hidden decision tree procedure consists of a few hundred nodes from multiple non-overlapping small decision trees. Each of these parent (invisible) decision trees corresponds e.g. to a particular type of fraud, in fraud detection models. Interpretation is straightforward, in contrast with traditional decision trees.</span></p>
<p class="Para"><span class="font-size-2">The methodology was first invented in the context of credit card fraud detection, back in 2003. It is not implemented in any statistical package at this time. Frequently, hidden decision trees are combined with logistic regression in an hybrid scoring algorithm, where 80% of the transactions are scored via hidden decision trees, while the remaining 20% are scored using a compatible logistic regression type of scoring.</span></p>
<p class="Para"><span class="font-size-2">Hidden decision trees take advantage of the structure of large multivariate features typically observed when scoring a large number of transactions, e.g. for fraud detection. The technique is not connected with hidden Markov fields.</span></p>
<p class="Para"><span class="font-size-3"><b>Potential Applications</b></span></p>
<ul>
<li><span class="font-size-2">Fraud detection, spam detection</span></li>
<li><span class="font-size-2">Web analytics</span><ul>
<li><span class="font-size-2">Keyword scoring/bidding (ad networks, paid search)</span></li>
<li><span class="font-size-2">Transaction scoring (click, impression, conversion, action)</span></li>
<li><span class="font-size-2">Click fraud detection</span></li>
<li><span class="font-size-2">Web site scoring, ad scoring, landing page / advertiser scoring</span></li>
<li><span class="font-size-2">Collective filtering (social network analytics)</span></li>
<li><span class="font-size-2">Relevancy algorithms</span></li>
</ul>
</li>
<li><span class="font-size-2">Text mining</span><ul>
<li><span class="font-size-2">Scoring and ranking algorithms</span></li>
<li><span class="font-size-2">Infringement detection</span></li>
<li><span class="font-size-2">User feedback: automated clustering</span></li>
</ul>
</li>
</ul>
<p class="Para"><span class="font-size-3"><b>Implementation</b></span></p>
<p class="Para"><span class="font-size-2">The model presented here is used in the context of click scoring. The purpose is to create predictive scores, where <i>score</i> = f(<i>response</i>), that is, score is a function of the response. The response is sometimes referred to as the <i>dependent variable</i> in statistical and predictive models.</span></p>
<ul>
<li><span class="font-size-2">Examples of Response:</span><ul>
<li><span class="font-size-2">Odds of converting (Internet traffic data – hard/soft conversions)</span></li>
<li><span class="font-size-2">CR (conversion rate)</span></li>
<li><span class="font-size-2">Probability that transaction is fraudulent</span></li>
</ul>
</li>
<li><span class="font-size-2">Independent variables: Called <i>features</i> or rules. They are highly correlated</span></li>
</ul>
<p><span class="font-size-2"><a href="http://www.datasciencecentral.com/profiles/blogs/hidden-decision-trees-revisited" target="_blank">Read full article</a>. </span></p> Correlation and R-Squared for Big Datatag:www.datasciencecentral.com,2014-05-26:6448529:Topic:1722962014-05-26T23:28:12.701ZVincent Granvillehttps://www.datasciencecentral.com/profile/VincentGranville
<p><em>Originally posted on Analyticbridge, by <a href="http://www.datasciencecentral.com/profiles/blogs/my-data-science-journey" target="_blank">Dr. Granville</a>. <a href="http://www.analyticbridge.com/profiles/blogs/what-mapreduce-can-t-do" target="_blank">Click here</a> to read original article and comments.</em></p>
<p><span class="font-size-2">With big data, one sometimes has to compute correlations involving thousands of buckets of paired observations or time series. For instance a data…</span></p>
<p><em>Originally posted on Analyticbridge, by <a href="http://www.datasciencecentral.com/profiles/blogs/my-data-science-journey" target="_blank">Dr. Granville</a>. <a href="http://www.analyticbridge.com/profiles/blogs/what-mapreduce-can-t-do" target="_blank">Click here</a> to read original article and comments.</em></p>
<p><span class="font-size-2">With big data, one sometimes has to compute correlations involving thousands of buckets of paired observations or time series. For instance a data bucket corresponds to a node in a decision tree, a customer segment, or a subset of observations having the same multivariate feature. Specific contexts of interest include multivariate feature selection (a combinatorial problem) or identification of best predictive set of metrics.</span></p>
<p><span class="font-size-2">In large data sets, some buckets will contain outliers or meaningless data, and buckets might have different sizes. We need something better than the tools offered by traditional statistics. In particular, we want a correlation metric that satisfies the following</span></p>
<p><span class="font-size-2"><strong>Five conditions:</strong></span></p>
<ol>
<li><span class="font-size-2">Independent of sample size to allow comparisons across buckets of different sizes: a correlation of (say) 0.6 must always correspond to (say) the 74-th percentile among all potential paired series (X, Y) of size n, regardless of n</span></li>
<li><span class="font-size-2">Same bounds as old-fashioned correlation for back-compatibility : it must be between -1 and +1, with -1 and +1 attained by extreme, singular data sets, and 0 meaning no correlation</span></li>
<li><span class="font-size-2">More general than traditional correlation: it measures the degree of monotonicity between two variables X and Y (does X grow when Y grows?) rather than linearity (Y = a + b*X + noise, with a, b chosen to minimize noise). Yet not as general as the <a href="http://en.wikipedia.org/wiki/Brownian_covariance" target="_blank">distance correlation</a> (equal to zero if and only if X and Y are independent) or my <a href="http://www.analyticbridge.com/profiles/blogs/structuredness-coefficient-to-find-patterns-and-associations" target="_blank">structuredness coefficient</a>.</span></li>
<li><span class="font-size-2">Not sensitive to outliers, robust</span></li>
<li><span class="font-size-2">More intuitive, more compatible with the way the human brain perceives correlations</span></li>
</ol>
<p><span class="font-size-2">Note that R-Squared, a goodness-of-fit measure used to compare model efficiency across multiple models, is typically the square of the correlation coefficient between observations and predicted values, measured on a training set via sound cross-validation techniques. It suffers the same drawbacks, and benefits from the same cures as traditional correlation. So we will focus here on the correlation.</span></p>
<p><span class="font-size-2">To illustrate the first condition (dependence on n), let's consider the following made-up data set with two paired variables or time series X, Y: ...</span></p>
<p><span class="font-size-2"><a href="http://www.analyticbridge.com/profiles/blogs/correlation-and-r-squared-for-big-data" target="_blank">Read full article</a>.</span></p>
<p></p> What MapReduce can't dotag:www.datasciencecentral.com,2014-05-26:6448529:Topic:1722942014-05-26T23:26:22.281ZVincent Granvillehttps://www.datasciencecentral.com/profile/VincentGranville
<p><em>Originally posted on Analyticbridge, by <a href="http://www.datasciencecentral.com/profiles/blogs/my-data-science-journey" target="_blank">Dr. Granville</a>. <a href="http://www.analyticbridge.com/profiles/blogs/what-mapreduce-can-t-do" target="_blank">Click here</a> to read original article and comments.</em></p>
<p><span class="font-size-2">We discuss here a large class of big data problems where MapReduce can't be used - not in a straightforward way at least - and we propose a rather…</span></p>
<p><em>Originally posted on Analyticbridge, by <a href="http://www.datasciencecentral.com/profiles/blogs/my-data-science-journey" target="_blank">Dr. Granville</a>. <a href="http://www.analyticbridge.com/profiles/blogs/what-mapreduce-can-t-do" target="_blank">Click here</a> to read original article and comments.</em></p>
<p><span class="font-size-2">We discuss here a large class of big data problems where MapReduce can't be used - not in a straightforward way at least - and we propose a rather simple analytic, statistical solution.</span></p>
<p><span class="font-size-2">MapReduce is a technique that splits big data sets into many smaller ones, process each small data set separately (but simultaneously) on different servers or computers, then gather and aggregate the results of all the sub-processes to produce the final answer. Such a distributed architecture allows you to process big data sets 1,000 times faster than traditional (non-distributed) designs, if you use 1,000 servers and split the main process into 1,000 sub-processes.</span></p>
<p><span class="font-size-2">MapReduce works very well in contexts where variables or observations are processed one by one. For instance, you analyze 1 terabyte of text data, and you want to compute the frequencies of all keywords found in your data. You can divide the 1 terabyte into 1,000 data sets, each 1 gigabyte. Now you produce 1,000 keyword frequency tables (one for each subset) and aggregate them to produce a final table.</span></p>
<p><span class="font-size-2">However, when you need to process variables or data sets jointly, that is 2 by 2 or or 3 by 3, MapReduce offers no benefit over non-distributed architectures. One must come with a more sophisticated solution.</span></p>
<p><span class="font-size-2"><strong>The Problem</strong></span></p>
<p><span class="font-size-2">Let's say that your data set consists of n observations and k variables. For instance, the k variables represent k different stock symbols or indices (say k=10,000) and the n observations represent stock price signals (up / down) measured at n different times. You want to find very high correlations (ideally with time lags to be able to make a profit) - e.g. if Google is up today, Facebook is up tomorrow.</span></p>
<p><span class="font-size-2">You have to compute k * (k-1) /2 correlations to solve this problem, despite the fact that you only have k=10,000 stock symbols. You can not spit your 10,000 stock symbols in 1,000 clusters, each containing 10 stock symbols, then use MapReduce. The vast majority of the correlations that you have to compute will involve a stock symbol in one cluster, and another one in another cluster (because you have far more correlations to compute than you have clusters). These cross-clusters computations makes MapReduce useless in this case. The same issue arises if you replace the word "correlation" by any other function, say f, computed on two variables, rather than one. This is why I claim that we are dealing here with a large class of problems where MapReduce can't help. I'll discuss another example (keyword taxonomy) later in this article.</span></p>
<p><span class="font-size-2"><strong>Three Solutions</strong></span></p>
<p><span class="font-size-2">Here I propose three solutions:</span></p>
<p><span class="font-size-2"><strong>1. Sampling</strong></span></p>
<p><span class="font-size-2">Instead of computing all cross-correlations, just compute a fraction of them: select m random pairs of variables, say m = 0.001 * k * (k-1) / 2, and compute correlations for these m pairs only. A smart strategy consists of starting with a very small fraction of all possible pairs, and increase the number of pairs until the highest (most significant) correlations barely grow anymore. Or you may use a simulated-annealing approach to decide with variables to keep, which ones to add, to form new pairs, after computing correlations on (say) 1,000 randomly selected seed pairs (of variables).</span></p>
<p><span class="font-size-2">I'll soon publish an article that shows how approximate solutions<span> (a local optimum) </span>to a problem, requiring a million time less computer resources than finding the global optimum, yield very good approximations with an error often smaller than the background noise found in any data set. In another paper, I will describe a semi-combinatorial strategy to handle not only 2x2 combinations (as in this correlation issue), but 3x3, 4x4 etc. to find very high quality multivariate vectors (in terms of predictive power) in the context of statistical scoring or fraud detection.</span></p>
<p><span class="font-size-2"><strong>2. Binning</strong></span></p>
<p><span class="font-size-2">If you can bin your variables in a way that makes sense, and if n is small (say=5), then you can pre-compute all potential correlations and save them in a lookup table. In our example, variables are already binned: we are dealing with signals (up or down) rather than actual, continuous metrics such as price deltas. With n=5, there are at most 512 potential pairs of value. An example of such a pair is {(up, up, down, up, down), (up, up, up,down, down)} where the first 5 values correspond to a particular stock, and the last 5 values to another stock. It is thus easy to pre-compute all 512 correlations. You will still have to browse all k * (l-1) / 2 pairs of stocks to solve you problem, but now it's much faster: for each pair you get the correlation from the lookup table - no computation required, only accessing a value in a hash table or an array with 512 cells.</span></p>
<p><span class="font-size-2">Note that with binary variables, the mathematical formula for correlation simplifies significantly, and using the simplified formula on all pairs migh be faster than using lookup tables to access 512 pre-computed correlations. However, the principle works regardless as to whether you compute a correlation, or much more complicated function f.</span></p>
<p><span class="font-size-2"><a href="http://www.analyticbridge.com/profiles/blogs/what-mapreduce-can-t-do" target="_blank">Read full article</a>.</span></p>