*Originally posted on Analyticbridge, by Dr. Granville. Click here to read original article and comments.*

We discuss here a large class of big data problems where MapReduce can't be used - not in a straightforward way at least - and we propose a rather simple analytic, statistical solution.

MapReduce is a technique that splits big data sets into many smaller ones, process each small data set separately (but simultaneously) on different servers or computers, then gather and aggregate the results of all the sub-processes to produce the final answer. Such a distributed architecture allows you to process big data sets 1,000 times faster than traditional (non-distributed) designs, if you use 1,000 servers and split the main process into 1,000 sub-processes.

MapReduce works very well in contexts where variables or observations are processed one by one. For instance, you analyze 1 terabyte of text data, and you want to compute the frequencies of all keywords found in your data. You can divide the 1 terabyte into 1,000 data sets, each 1 gigabyte. Now you produce 1,000 keyword frequency tables (one for each subset) and aggregate them to produce a final table.

However, when you need to process variables or data sets jointly, that is 2 by 2 or or 3 by 3, MapReduce offers no benefit over non-distributed architectures. One must come with a more sophisticated solution.

**The Problem**

Let's say that your data set consists of n observations and k variables. For instance, the k variables represent k different stock symbols or indices (say k=10,000) and the n observations represent stock price signals (up / down) measured at n different times. You want to find very high correlations (ideally with time lags to be able to make a profit) - e.g. if Google is up today, Facebook is up tomorrow.

You have to compute k * (k-1) /2 correlations to solve this problem, despite the fact that you only have k=10,000 stock symbols. You can not spit your 10,000 stock symbols in 1,000 clusters, each containing 10 stock symbols, then use MapReduce. The vast majority of the correlations that you have to compute will involve a stock symbol in one cluster, and another one in another cluster (because you have far more correlations to compute than you have clusters). These cross-clusters computations makes MapReduce useless in this case. The same issue arises if you replace the word "correlation" by any other function, say f, computed on two variables, rather than one. This is why I claim that we are dealing here with a large class of problems where MapReduce can't help. I'll discuss another example (keyword taxonomy) later in this article.

**Three Solutions**

Here I propose three solutions:

**1. Sampling**

Instead of computing all cross-correlations, just compute a fraction of them: select m random pairs of variables, say m = 0.001 * k * (k-1) / 2, and compute correlations for these m pairs only. A smart strategy consists of starting with a very small fraction of all possible pairs, and increase the number of pairs until the highest (most significant) correlations barely grow anymore. Or you may use a simulated-annealing approach to decide with variables to keep, which ones to add, to form new pairs, after computing correlations on (say) 1,000 randomly selected seed pairs (of variables).

I'll soon publish an article that shows how approximate solutions (a local optimum) to a problem, requiring a million time less computer resources than finding the global optimum, yield very good approximations with an error often smaller than the background noise found in any data set. In another paper, I will describe a semi-combinatorial strategy to handle not only 2x2 combinations (as in this correlation issue), but 3x3, 4x4 etc. to find very high quality multivariate vectors (in terms of predictive power) in the context of statistical scoring or fraud detection.

**2. Binning**

If you can bin your variables in a way that makes sense, and if n is small (say=5), then you can pre-compute all potential correlations and save them in a lookup table. In our example, variables are already binned: we are dealing with signals (up or down) rather than actual, continuous metrics such as price deltas. With n=5, there are at most 512 potential pairs of value. An example of such a pair is {(up, up, down, up, down), (up, up, up,down, down)} where the first 5 values correspond to a particular stock, and the last 5 values to another stock. It is thus easy to pre-compute all 512 correlations. You will still have to browse all k * (l-1) / 2 pairs of stocks to solve you problem, but now it's much faster: for each pair you get the correlation from the lookup table - no computation required, only accessing a value in a hash table or an array with 512 cells.

Note that with binary variables, the mathematical formula for correlation simplifies significantly, and using the simplified formula on all pairs migh be faster than using lookup tables to access 512 pre-computed correlations. However, the principle works regardless as to whether you compute a correlation, or much more complicated function f.

Tags:

© 2019 Data Science Central ® Powered by

Badges | Report an Issue | Privacy Policy | Terms of Service