.

# The curse of big data

This seminal article highlights the dangers of reckless applications and scaling of data science techniques that have worked well for small, medium-size and large data. We illustrate the problem with flaws in big data trading, and propose solutions. Also, we believe expert data scientists are more abundant (but very different) than what hiring companies claim: read our "related articles" section at the bottom for more details. This article is written in simple English, is very short and contains both high level material for decision makers, as well as deep technical explanations when needed.

In short, the curse of big data is the fact that when you search for patterns in very, very large data sets with billions or trillions of data points and thousands of metrics, you are bound to identify coincidences that have no predictive power - even worse, the strongest patterns might be

• entirely caused by chance (just like someone who wins at the lottery wins purely by chance) and
• not replicable,
• having no predictive power,
• but obscuring weaker patterns that are ignored yet have a strong predictive power.

The questions is: how do you discriminate between a real and accidental signal in vast amounts of data?

Let's focus on one example: identifying strong correlations or relationships between time series. If you have 1,000 metrics (time series), you can compute 499,500 = 1,000*999/2 correlations. If you include cross-correlations with time lags (e.g. stock prices for IBM today with stock prices for Google 2 days ago), then we are dealing with many, many millions of correlations. Out of all these correlations, a few will be extremely high just by chance: if you use such a correlation for predictive modeling, you will loose. Keep in mind that analyzing cross-correlations on all metrics is one of the very first step statisticians do at a beginning of any project - it's part of the exploratory analysis step. However, a spectral analysis of normalized time series (instead of correlation analysis) provide a much more robust mechanism to identify true relationships.

To illustrate the issue, let's say that you have k time series, each with n observations, for instance, price deltas (price increases or decreases) computed for k different stock symbols with various time lags over a same time period consisting of n days. For instance, you want to detect patterns such as "When Google stock price goes up, Facebook goes down one day later". In order to detect such profitable patterns, you must compute cross-correlations over thousands of stocks, with various time lags: one day, two days, or maybe one second, two seconds depending on whether you do daily trading or extremely fast intraday, high frequency trading. Typically, you keep a small number of observations - e.g. n=10 days or n=10 milliseconds - as these patterns evaporate very fast (when your competitors detect the patterns in question, it stops becoming profitable). In other words, you can assume that n=10, or maybe n=20. In other cases based on monthly data (environmental statistics, emergence of a new disease), maybe n=48 (monthly data collected over a 2-year time period). In some cases n might be much larger, but in that case the curse of big data is no longer a problem. The curse of big data is very acute when n is smaller than 200 and k moderately large, say k=500. However, instances where both n is large (> 1,000) and k is large (> 5,000) are rather rare.

Now let's review a bit of mathematics to estimate the chance of being wrong when detecting a very high correlation. We could have done Monte Carlo simulations to compute the chance in question, but here we use plain old-fashioned statistical modeling.