With big data, one sometimes has to compute correlations involving thousands of buckets of paired observations or time series. For instance a data bucket corresponds to a node in a decision tree, a customer segment, or a subset of observations having the same multivariate feature. Specific contexts of interest include multivariate feature selection (a combinatorial problem) or identification of best predictive set of metrics.
In large data sets, some buckets will contain outliers or meaningless data, and buckets might have different sizes. We need something better than the tools offered by traditional statistics. In particular, we want a correlation metric that satisfies the following
Note that R-Squared, a goodness-of-fit measure used to compare model efficiency across multiple models, is typically the square of the correlation coefficient between observations and predicted values, measured on a training set via sound cross-validation techniques. It suffers the same drawbacks, and benefits from the same cures as traditional correlation. So we will focus here on the correlation.
To illustrate the first condition (dependence on n), let's consider the following made-up data set with two paired variables or time series X, Y: ...