Originally posted on Analyticbridge, by Dr. Granville.
With big data, one sometimes has to compute correlations involving thousands of buckets of paired observations or time series. For instance, a data bucket might correspond to a node in a decision tree, a customer segment, or a subset of observations sharing the same multivariate feature. Specific contexts of interest include multivariate feature selection (a combinatorial problem) and identification of the best predictive set of metrics.
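To make the bucket setting concrete, here is a minimal sketch that computes the traditional (Pearson) correlation separately within each bucket. The function names `pearson` and `correlation_by_bucket`, the `(bucket_id, x, y)` row layout, and the sample rows are all assumptions made for illustration; none of them come from the original article.

```python
from collections import defaultdict
from math import sqrt


def pearson(xs, ys):
    """Classical Pearson correlation; returns None when it is undefined."""
    n = len(xs)
    if n < 2:
        return None  # a single observation carries no correlation
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return None  # a constant series has no defined correlation
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sqrt(sxx * syy)


def correlation_by_bucket(rows):
    """rows: iterable of (bucket_id, x, y). Returns {bucket_id: (n, r)}."""
    buckets = defaultdict(lambda: ([], []))
    for bucket_id, x, y in rows:
        xs, ys = buckets[bucket_id]
        xs.append(x)
        ys.append(y)
    return {b: (len(xs), pearson(xs, ys)) for b, (xs, ys) in buckets.items()}


# Hypothetical sample: three buckets of different sizes.
rows = [
    ("A", 1.0, 2.0), ("A", 2.0, 4.0), ("A", 3.0, 6.0),
    ("B", 1.0, 3.0), ("B", 2.0, 2.0), ("B", 3.0, 1.0),
    ("C", 5.0, 5.0),
]
result = correlation_by_bucket(rows)
for bucket_id, (n, r) in sorted(result.items()):
    print(bucket_id, n, r)
```

Returning the bucket size `n` alongside each coefficient keeps the two together, since coefficients computed on buckets of different sizes are not directly comparable.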
In large data sets, some buckets will contain outliers or meaningless data, and buckets might have different sizes. We need something better than the tools offered by traditional statistics. In particular, we want a correlation metric that satisfies the following five conditions:
Note that R-Squared, a goodness-of-fit measure used to compare model efficiency across multiple models, is typically the square of the correlation coefficient between observations and predicted values, measured on a training set via sound cross-validation techniques. It suffers from the same drawbacks, and benefits from the same cures, as traditional correlation. So we will focus here on the correlation.
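The link between R-Squared and correlation can be checked numerically. The sketch below assumes the simplest case, an ordinary least-squares line with an intercept evaluated on the same data used to fit it, where the identity holds exactly; other models or held-out data need not satisfy it. The helper names and the sample values are made up for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Classical Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sqrt(sxx * syy)


def fit_line(xs, ys):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


def r_squared(ys, preds):
    """R-Squared as 1 - (residual sum of squares / total sum of squares)."""
    my = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, preds))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Made-up observations, roughly linear with some noise.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

slope, intercept = fit_line(xs, ys)
preds = [slope * x + intercept for x in xs]

r2 = r_squared(ys, preds)
r = pearson(ys, preds)  # correlation between observations and predictions
print(r2, r ** 2)       # the two values agree up to rounding error
```

Because the two quantities coincide in this setting, anything that distorts the correlation (an outlier, a very small bucket) distorts R-Squared in the same way.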
To illustrate the first condition (dependence on n), let's consider the following made-up data set with two paired variables or time series X, Y: ...
Posted 1 March 2021