IBM Distinguished Engineer solves Big Data Conjecture

A mathematical problem related to big data was solved by Jean-Francois Puget, an engineer in the Solutions Analytics and Optimization group at IBM France. The problem was first mentioned on Data Science Central, and an award was offered to the first data scientist to solve it.

Bryan Gorman, Principal Physicist and Chief Scientist at the Johns Hopkins University Applied Physics Laboratory, made a significant breakthrough in July and won $500. Jean-Francois Puget completely solved the problem, independently of Bryan, and won a $1,000 award.

Figure: example of a rare, special permutation investigated to prove the theorem.

The competition was organized and financed by Data Science Central. Participants from around the world submitted a number of interesting approaches. The mathematical question was posed by Vincent Granville, a leading data scientist and co-founder of Data Science Central. Granville initially proposed a solution based on large-scale Monte Carlo simulations, but his solution turned out to be wrong.

The problem consisted of finding an exact formula for a new type of correlation and goodness-of-fit metric, designed specifically for big data. It generalizes Spearman's rank correlation coefficient and is especially robust for unbounded, ordinal data found in large data sets. From a mathematical point of view, the new metric is based on L-1 rather than L-2 theory: in other words, it relies on absolute rather than squared differences. Using squares (or higher powers) is what makes traditional metrics such as R squared notoriously sensitive to outliers, and why savvy statistical modelers avoid them. In big data, outliers are plentiful and even extreme outliers are not rare. Because outliers can render the conclusions of a statistical analysis invalid, this is a critical issue, sometimes referred to as the curse of big data.
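
To make the outlier issue concrete, here is a minimal sketch using invented data and the standard Pearson and Spearman estimators (not the new metric from the competition): a single extreme value wipes out the R squared, while the rank-based, L-1-flavored Spearman coefficient barely moves.

```python
# Minimal sketch with invented data: L-2 (squared-difference) metrics collapse
# under one extreme outlier, while a rank-based measure barely changes.
# Uses standard Pearson and Spearman estimators, not the competition's metric.
import numpy as np
from scipy.stats import pearsonr, spearmanr

rng = np.random.default_rng(42)
x = np.arange(100, dtype=float)
y = 2.0 * x + rng.normal(scale=5.0, size=x.size)  # near-linear relationship

r, _ = pearsonr(x, y)
rho, _ = spearmanr(x, y)
print(f"clean data:   R^2 = {r**2:.3f}, Spearman rho = {rho:.3f}")

y_out = y.copy()
y_out[-1] = -10_000.0  # a single extreme outlier, against the trend

r, _ = pearsonr(x, y_out)
rho, _ = spearmanr(x, y_out)
print(f"with outlier: R^2 = {r**2:.3f}, Spearman rho = {rho:.3f}")
```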

Interestingly, Granville's initial approach to computing the formula for the new metric consisted of sampling a very large number of permutations of n elements (hundreds of billions of them) in order to estimate a maximum over all permutations of n elements, with n as large as 30. The maximum is achieved by an incredibly small number of very rare, special permutations, which makes naive sampling ineffective and forces the use of smart sampling. Granville's formula was correct up to n = 21. Note that there are 21! (roughly 5 x 10^19, or 50 million trillion) permutations of order 21, so enumerating them all is out of reach even for the biggest cloud architectures, hence the idea of sampling, and indeed, smart sampling. Also, a solution is needed that works for any value of n, as small as 21 or as large as 5,000,000. Smart sampling consists of sampling an extremely tiny portion of the permutation space, in which the few permutations achieving the maximum are expected to be found.
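
To give a feel for why uniform sampling fails and what smart sampling means, here is a rough sketch. The statistic maximized below (total absolute rank displacement) is only a placeholder chosen for illustration; the actual metric is defined in the article linked at the end of this post. The point is the contrast between sampling permutations uniformly and sampling a tiny, well-chosen corner of the permutation space.

```python
# Rough sketch of "smart sampling" over permutations. The statistic f below is
# a placeholder, not the competition's metric; for this placeholder the
# reversal permutation happens to be a maximizer, so sampling near it trivially
# succeeds -- the point is restricting the search to a tiny region where
# maximizers are expected to live.
import itertools
import random

def f(sigma):
    """Placeholder statistic: total absolute displacement of a permutation."""
    return sum(abs(s - i) for i, s in enumerate(sigma))

def exact_max(n):
    """Brute force over all n! permutations -- only feasible for tiny n."""
    return max(f(sigma) for sigma in itertools.permutations(range(n)))

def naive_sample_max(n, trials=100_000):
    """Uniform random permutations: for large n, maximizers are far too rare
    to be hit by chance."""
    sigma, best = list(range(n)), 0
    for _ in range(trials):
        random.shuffle(sigma)
        best = max(best, f(sigma))
    return best

def smart_sample_max(n, trials=100_000, swaps=2):
    """Sample only a tiny neighborhood of the space: a few random swaps away
    from the reversal permutation."""
    base = list(range(n - 1, -1, -1))
    best = f(base)
    for _ in range(trials):
        sigma = base[:]
        for _ in range(swaps):
            i, j = random.randrange(n), random.randrange(n)
            sigma[i], sigma[j] = sigma[j], sigma[i]
        best = max(best, f(sigma))
    return best

print("exact maximum for n=8:  ", exact_max(8))
print("naive sampling for n=30:", naive_sample_max(30))
print("smart sampling for n=30:", smart_sample_max(30))
```

Running this, the naive estimate typically falls short of the true maximum (450 for this placeholder statistic at n = 30), while the restricted search attains it immediately.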

Jean-Francois and Bryan both came up with a new approach: instead of running heavy computations, they used mathematical thinking and leveraged their expertise in mathematical optimization as well as in permutation theory and combinatorics. And they succeeded. This shows that, sometimes, mathematical modeling can beat even the most powerful cluster of computers, though usually the two work hand in hand.

The new metrics designed by Granville for big data are in fact very old. They precede R squared by at least 50 years and were first investigated by statisticians in the 18th century. But the L-1 version (the oldest framework) was quickly abandoned because it was thought to be mathematically intractable, at least for 18th-century mathematicians. The L-2 version, which leads to elegant and simple mathematical formulas, and to what is known in statistical circles as the general linear model, was thus favored, even to this day, despite its poor robustness properties and its strong dependency on the normal distribution assumption. The conjecture proved by Puget shows that the L-1 model leads to exact, tractable mathematical formulas, reversing a long-established trend among statisticians. Granville certainly helped revive this old problem by using big computers and big simulations to discover that, maybe, the L-1 model is not as intractable as originally thought. While Granville's result was wrong (it was a rather good approximation, but not an exact solution), it led to the Data Science Central competition and a final resolution.

Granville is still interested in exploring the asymptotic behavior of these metrics, and this will probably result in a new competition in which data scientists, statisticians and mathematicians - actually anyone good with analytics - are invited to participate.

You can find details about the competition, the result proved by Jean-Francois (including the proof), the new metrics used for big data, the advantages of these metrics, and the curse of big data at http://bit.ly/133S6ns.


Comment by Dr. Z on October 28, 2013 at 12:33pm

@Vincent: That's great. I find this kind of research both intriguing and extremely practical. I hope there will be an article about this metric once it is complete. Thanks.

Comment by Vincent Granville on October 25, 2013 at 6:17am

@Dr. Z: my article entitled Structuredness Coefficient addresses this issue.

Comment by Dr. Z on October 25, 2013 at 6:02am

What about non-parametric correlation metrics that are insensitive to outliers and provide an accurate proxy for the relationship between two variables? The distance power is merely one aspect of the correlation metric, not the only one. The process of estimating the variables' centroids is equally important.

Comment by Jeff Daniels on October 24, 2013 at 7:46am

Kudos, Vincent.
