*Originally posted on Hadoop360, by Dr. Granville.*

The new variance introduced in this article addresses two big data problems with the traditional variance: its sensitivity to outliers, and the numerically unstable formula used to compute it in Hadoop.

**Synthetic Metrics**

This new metric is *synthetic*: it was not derived naturally from mathematics like the variance taught in any statistics 101 course, or the variance currently implemented in Hadoop (see above picture). By *synthetic*, I mean that it was built to address issues with big data (outliers) and the way many big data computations are now done: the MapReduce framework, Hadoop being one implementation. It is a top-down approach to metric design - from data to theory - rather than the traditional bottom-up approach - from theory to data.

Other synthetic metrics designed in our research laboratory include:

- Predictive power metric, related to entropy (that is, information quantification), used in big data frameworks, for instance to identify optimum feature combinations for scoring algorithms.
- Correlation for big data, defined by an algorithm and closely related to the optimum variance metric discussed here.
- Structuredness coefficient
- Bumpiness coefficient

**Hadoop, numerical and statistical stability**

There are two issues with the formula used for computing the variance in Hadoop. First, the formula used, namely Var(x1, ..., xn) = {SUM(xi^2)/n} - {SUM(xi)/n}^2, is notoriously unstable. For large n, the two terms are nearly equal and largely cancel each other out, yet each one taken separately can take a huge value, because of the squares aggregated over billions of observations. Subtracting two huge, nearly equal numbers results in numerical inaccuracies, and people have reported negative variances. Read the comments attached to my article *The curse of Big Data* for details. Besides, there are variance formulas that are numerically stable and do not require two passes over the entire data set.
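
To make the instability concrete, here is a minimal Python sketch. It compares the formula above with one well-known stable one-pass alternative, Welford's update, plus the standard rule for merging partial summaries, which is the kind of step a reducer could perform. This is not the Hadoop implementation, nor the new metric introduced in this article; the function names (`naive_variance`, `summarize`, `merge`, `variance`) and the test data are assumptions chosen for illustration.

```python
def naive_variance(xs):
    """Population variance via SUM(xi^2)/n - (SUM(xi)/n)^2 (the unstable formula)."""
    n = len(xs)
    total = 0.0
    total_sq = 0.0
    for x in xs:
        total += x
        total_sq += x * x
    mean = total / n
    return total_sq / n - mean * mean


def summarize(xs):
    """One pass over xs with Welford's update; returns (n, mean, m2),
    where m2 is the running sum of squared deviations from the mean."""
    n, mean, m2 = 0, 0.0, 0.0
    for x in xs:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    return n, mean, m2


def merge(a, b):
    """Combine two partial (n, mean, m2) summaries computed on disjoint chunks."""
    n_a, mean_a, m2_a = a
    n_b, mean_b, m2_b = b
    n = n_a + n_b
    if n == 0:
        return 0, 0.0, 0.0
    delta = mean_b - mean_a
    mean = mean_a + delta * n_b / n
    m2 = m2_a + m2_b + delta * delta * n_a * n_b / n
    return n, mean, m2


def variance(summary):
    """Population variance from a (n, mean, m2) summary."""
    n, _, m2 = summary
    return m2 / n


# Hypothetical data: small spread around a very large mean.
data = [1e9 + 4.0, 1e9 + 7.0, 1e9 + 13.0, 1e9 + 16.0]

print("naive   :", naive_variance(data))
print("one-pass:", variance(summarize(data)))
print("merged  :", variance(merge(summarize(data[:2]), summarize(data[2:]))))
```

The true population variance of this data is 22.5. The one-pass and merged versions recover it, while the formula with the two huge, nearly equal terms is off by far more than the variance itself in double precision, and can even come out negative. Because the summaries are small and mergeable, each mapper only needs to emit three numbers per chunk.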
