
An update on our data science competition

What seemed to be an intractable problem involving trillions of quadrillions of computations - far more than required to process all the data produced or collected on Earth since the beginning of time - has been reduced to something computationally feasible, possibly even quite simple. One applicant used an extremely smart way to sample the universe of all permutations, reducing the computational complexity from O(n!) to O(n). His contribution will be featured when we announce the winner. No proof or formula is available yet, but my two weaker results (in particular, the claim that q(n) is asymptotically equivalent to (1/3) n^2) are now considered erroneous. Specifically, the factor 1/3 is believed to be wrong; the correct value seems to be 3/8, as suggested by two candidates, one of whom made strong arguments in favor of 3/8 over 1/3, though we don't have a formal proof yet. For obvious reasons, I will not share the answers submitted by candidates until the winners are identified.

[Figure] Maxim's Permutation - the least "linear" permutation of (0, ..., 20). Standard correlation between X and Y is 0.0458.

More good news: we have decided to offer two awards rather than one, each with the same $500 prize (doubled if you have a reverse sponsor). The reason is that I now believe there is no finite formula to compute q(n), though I'd love to be proven wrong. If you come up with a simple infinite series, integral, or recurrence relation for q(n), and that turns out to be the best that can be obtained, you'll get the first award!

The second award will be offered to whoever finds the limiting normalized distribution of either fundamental quantity T or t when c = 1, as defined in my paper on the new correlation. I believe t is multimodal and might be very complex, possibly with an infinite number of local maxima as n tends to infinity. By normalized distribution, I mean the distribution obtained after an appropriate transformation, so that the limit is not a singular distribution - just as you normalize the sample mean by subtracting its expectation and multiplying by sqrt(n) to obtain the limiting Gaussian distribution (bell curve). Is the limiting normalized distribution also a bell curve in this case? I don't think so, but my guess is based on nothing more than gut feeling. In any case, finding the limiting normalized distribution is very important for business applications, and you can easily get a good approximation using Monte Carlo simulations (see the sketch below). But we are interested in the theoretical limiting distribution here, not approximations.
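Since T and t are defined in the paper rather than here, the following is only a minimal Monte Carlo sketch of the idea: the function stat() is a hypothetical placeholder (sum of absolute successive differences of a permutation), not the actual t. Swap in the real definition to approximate the normalized distribution empirically.

```python
import random
from collections import Counter

def stat(perm):
    # HYPOTHETICAL placeholder for the quantity t (or T) defined in the
    # paper; for illustration, use the sum of absolute successive
    # differences of the permutation.
    return sum(abs(a - b) for a, b in zip(perm, perm[1:]))

def normalized_samples(n, trials=50_000):
    """Sample random permutations of (0, ..., n-1) and return the
    statistic, centered and rescaled by its empirical standard deviation."""
    samples = []
    for _ in range(trials):
        perm = list(range(n))
        random.shuffle(perm)
        samples.append(stat(perm))
    mean = sum(samples) / trials
    sd = (sum((s - mean) ** 2 for s in samples) / trials) ** 0.5
    return [(s - mean) / sd for s in samples]

data = normalized_samples(n=20)
# Crude text histogram with 0.1-wide bins, to eyeball multimodality
bins = Counter(round(x, 1) for x in data)
for b in sorted(bins):
    print(f"{b:+5.1f} {'#' * (400 * bins[b] // len(data))}")
```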

We will also feature insights about the geometrical shape of these extreme, weird permutations (for large n) that maximize min(u, v). One example is posted in my comment dated July 14, at 11:28pm. I wonder whether these "least linear" permutations look like concentric circles when n becomes large. They definitely have a pattern that makes them very different from all other permutations; see the brute-force sketch below.
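For readers who want to experiment, here is a brute-force sketch. The quantities u and v are defined in the paper; the versions below are hypothetical placeholders (distances from the increasing and decreasing patterns), chosen only to convey the idea of a "least linear" permutation. Note the O(n!) cost, which is exactly what the smart sampling mentioned above avoids.

```python
from itertools import permutations

def u(perm):
    # HYPOTHETICAL placeholder: distance from the identity (increasing)
    # permutation. The real u is defined in the correlation paper.
    return sum(abs(p - i) for i, p in enumerate(perm))

def v(perm):
    # HYPOTHETICAL placeholder: distance from the reversed (decreasing)
    # permutation, the other perfectly "linear" pattern.
    n = len(perm)
    return sum(abs(p - (n - 1 - i)) for i, p in enumerate(perm))

def least_linear(n):
    # Exhaustive search over all n! permutations - feasible only for tiny n.
    return max(permutations(range(n)), key=lambda p: min(u(p), v(p)))

best = least_linear(8)
print(best, "->", min(u(best), v(best)))
```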


Comment by Vincent Granville on October 23, 2013 at 10:26am

Jean-Francois Puget, PhD, Distinguished Engineer, Industry Solutions Analytics and Optimization at IBM, found the exact formula and came up with a proof. He was awarded a $1,000 prize for his solution. Here is the result:

Theorem

Let m be the quotient and let r be the remainder of the Euclidean division of n by 4: n = 4m + r, 0 <= r < 4.

Let p(n) = 6m^2 + 3mr + r(r-1)/2.

Then:

  • q(n) = p(n), if p(n) is even
  • q(n) = p(n) - 1, if p(n) is odd

The rather lengthy and complicated proof can be found at datashaping.com/Puget-Proof.pdf.
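For convenience, here is a direct transcription of the theorem into code (the function names are mine). It also lets you verify numerically that q(n)/n^2 approaches the 3/8 constant discussed above.

```python
def p(n):
    # n = 4m + r with 0 <= r < 4 (Euclidean division of n by 4)
    m, r = divmod(n, 4)
    return 6 * m * m + 3 * m * r + r * (r - 1) // 2

def q(n):
    # Puget's theorem: q(n) = p(n) if p(n) is even, p(n) - 1 otherwise
    pn = p(n)
    return pn if pn % 2 == 0 else pn - 1

for n in (10, 100, 1000, 10000):
    print(n, q(n), q(n) / n ** 2)   # ratio tends to 3/8 = 0.375
```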

The two winners of this competition will be announced this month, on Data Science Central.

Note

Computing q(n) can be done with algorithms (simulations, smart permutation sampling) that involve processing huge amounts of data; Jean-Francois instead found an exact mathematical solution. This shows that sometimes mathematical modeling can beat even the most powerful cluster of computers at finding a solution, though usually the two work hand in hand.

The function q(n) is at the core of a new type of statistical metric developed for big data: a non-parametric, robust metric that measures a new type of correlation or goodness of fit. This metric generalizes traditional metrics that have been used for centuries, and it is most useful when working with large ordinal data series, such as rank data. While based on rank statistics, it is much less sensitive to outliers than existing rank-based metrics such as Spearman's rank correlation, which was designed for rather small n, where it is indeed very robust.
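For reference, this is how the classical rank-based baseline mentioned above is computed with SciPy; it is the traditional metric, not the new one.

```python
import numpy as np
from scipy.stats import spearmanr

x = np.arange(21)            # the series 0, 1, ..., 20
rng = np.random.default_rng(42)
y = rng.permutation(x)       # a random permutation of x

rho, pval = spearmanr(x, y)
print(f"Spearman rank correlation: {rho:.4f} (p-value: {pval:.3g})")
```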
