This data puzzle was originally posted on my blog, The Well-Tempered Spreadsheet. In a Data Science Central article, Kirk Borne cited the puzzle as an example of a data relationship that eludes many data discovery tools.
The Island of Games
Life is fun on the Island of Games. The thousand inhabitants enjoy competing at chess, checkers, and contests to solve the Rubik’s Cube puzzle as fast as possible. The islanders are rated for their skill at each of the three games. The ratings fall between 0 and 1.
The ratings for each category seem to follow a uniform distribution. For example, here is a histogram for the Chess ratings:
There isn't much correlation between the skills for the three games:
Correlations between Skills

               Chess    Checkers  Rubik's Cube
Chess          1.0000
Checkers       0.0530   1.0000
Rubik's Cube   0.0452   0.0049    1.0000
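For readers who want to reproduce a table like this one, here is a minimal sketch of a lower-triangular Pearson correlation matrix in plain Python. The helper names `pearson` and `lower_triangle` are illustrative, and the demo uses independent uniform draws as placeholder data, not the island's actual ratings.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def lower_triangle(columns):
    """Map (row_name, col_name) -> r for the lower triangle, diagonal included."""
    names = list(columns)
    table = {}
    for i, row in enumerate(names):
        for col in names[: i + 1]:
            table[(row, col)] = pearson(columns[row], columns[col])
    return table


if __name__ == "__main__":
    rng = random.Random(0)
    # Placeholder data (an assumption): independent uniform ratings.
    ratings = {
        name: [rng.random() for _ in range(1000)]
        for name in ("Chess", "Checkers", "Rubik's Cube")
    }
    for (row, col), r in lower_triangle(ratings).items():
        print(f"{row:>12} vs {col:<12} {r:7.4f}")
```

Values near zero off the diagonal only rule out a linear relationship between pairs of skills, which is why the table alone cannot settle whether there is structure in the data.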
The next three charts confirm this lack of correlation. The first chart compares the ratings for Chess and Checkers, and includes a linear regression:
Here is Chess vs Rubik’s Cube:
The raw data and some statistical analysis can be found here. All of this looks like pure noise. But there is a hidden structure. Can you find it?
Copyright 2013. All Rights Reserved.
Comment
I figured that if I can't see two-point correlations there may still be three-point correlations.
The first thing I tried was plotting checkers+cube vs. chess, which turns out to be a good starting point.
Thank you again.
You made an excellent comment about the art of data analysis. It highlights the importance of thorough visual exploration combined with deep subject matter understanding.
I wonder how many important relationships remain undetected because they are hard to visualize.
I would probably have run some kPCA analysis to denoise the data.
Anyhow, I doubt that with higher dimensions it'd be easy to spot this kind of relationship. One of the main problems I see is that distances between points will be very similar, so discretization might become helpful as well.
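The remark about distances becoming similar can be illustrated with a small experiment. This is only a sketch on uniform random points (an assumption, not the puzzle data): it measures how much the distances from one reference point to all the others vary, relative to the smallest of them. The function name `relative_contrast` is illustrative.

```python
import math
import random


def relative_contrast(dim, n_points=200, seed=0):
    """(max - min) / min of distances from one reference point to the rest."""
    rng = random.Random(seed)
    points = [[rng.random() for _ in range(dim)] for _ in range(n_points)]
    reference, others = points[0], points[1:]
    distances = [math.dist(reference, p) for p in others]
    return (max(distances) - min(distances)) / min(distances)


if __name__ == "__main__":
    for dim in (2, 10, 100, 500):
        print(f"dim={dim:4d}  relative contrast={relative_contrast(dim):.3f}")
```

The contrast should shrink as the dimension grows, which is one reason distance-based methods tend to lose discriminating power in high-dimensional versions of the puzzle.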
That's the "art" part of data science! You cannot go in blindly with your favourite algorithm. You first need to try to understand where the data comes from and find ways to manipulate and visualize it that make sense, which no computer can yet do automatically.
Thanks Gustavo! Very interesting.
Let me ask you another question, if you don't mind. Let's say that it is possible to visualize up to n dimensions (using color, marker size, animated time, etc.). I could construct an n+1 dimensional checkerboard pattern that would be impossible to visualize.
When you solved my puzzle, it seems you used a combination of visualization and LSSVM. If visualization were not helpful for a higher dimensional puzzle, do you think you still would have solved it? Thanks!
I looked at the data in 3D, and it looked "blocky". I then ran an LSSVM regression of Rubik ~ Chess, Checkers. The result was the following picture. I then transformed my Rubik data into a logical (Rubik > 0.5).
I also transformed the Chess and Checkers into belonging to one or the other checkerboard squares.
The confusion matrix was a perfect match.
Another way I thought about it was to make a translation of all the data by 0.5. In that case you would have a perfect match of the sign of (Chess*Checkers) and the sign of Rubik.
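Both checks described in this comment can be sketched in a few lines. The raw data is not reproduced here, so the example runs on synthetic ratings generated under the checkerboard rule stated elsewhere in this thread; that generator, and the names `synthetic_islanders`, `confusion_matrix`, and `sign_agreement`, are assumptions made for illustration.

```python
import random
from collections import Counter


def synthetic_islanders(n=1000, seed=0):
    """Synthetic (chess, checkers, rubik) rows obeying the checkerboard rule."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        chess, checkers = rng.random(), rng.random()
        same_half = (chess >= 0.5) == (checkers >= 0.5)
        # Ties at exactly 0.5 count as "high"; they are vanishingly rare.
        rubik = rng.uniform(0.5, 1.0) if same_half else rng.uniform(0.0, 0.5)
        rows.append((chess, checkers, rubik))
    return rows


def confusion_matrix(rows):
    """Count (predicted_high, actual_high) pairs for the logical Rubik >= 0.5."""
    counts = Counter()
    for chess, checkers, rubik in rows:
        predicted_high = (chess >= 0.5) == (checkers >= 0.5)
        actual_high = rubik >= 0.5
        counts[(predicted_high, actual_high)] += 1
    return counts


def sign_agreement(rows):
    """Fraction of rows where the sign of (chess - 0.5) * (checkers - 0.5)
    matches the sign of (rubik - 0.5), i.e. after translating by 0.5."""
    hits = sum(
        ((c - 0.5) * (k - 0.5) >= 0) == (r - 0.5 >= 0) for c, k, r in rows
    )
    return hits / len(rows)


if __name__ == "__main__":
    rows = synthetic_islanders()
    print(dict(confusion_matrix(rows)))
    print("sign agreement:", sign_agreement(rows))
```

On data that follows the rule, the off-diagonal cells of the confusion matrix are empty and the sign agreement is complete, matching the "perfect match" reported above.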
Gustavo,
You are correct!
Most of the people who solved the puzzle did so by inspecting the data. What did you do?
It's a checkerboard!
If 0 < Chess < 0.5 and 0 < Checkers < 0.5, then Rubik > 0.5
If 0.5 < Chess < 1 and 0.5 < Checkers < 1, then Rubik > 0.5
Otherwise, Rubik < 0.5
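The rule above can be written as a single predicate: Rubik is high exactly when Chess and Checkers fall in the same half of the scale. The sketch below encodes that and then averages Rubik within each Chess/Checkers quadrant. The data is synthetic and generated from the rule itself (an assumption, since the original ratings are not included here), and the function names are illustrative.

```python
import random
from statistics import mean


def rubik_is_high(chess, checkers):
    """Checkerboard rule: True when Chess and Checkers share the same half."""
    return (chess >= 0.5) == (checkers >= 0.5)


def make_rows(n=1000, seed=0):
    """Synthetic (chess, checkers, rubik) rows generated from the rule."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        chess, checkers = rng.random(), rng.random()
        if rubik_is_high(chess, checkers):
            rubik = rng.uniform(0.5, 1.0)
        else:
            rubik = rng.uniform(0.0, 0.5)
        rows.append((chess, checkers, rubik))
    return rows


def quadrant_means(rows):
    """Mean Rubik rating per (chess_half, checkers_half) quadrant."""
    groups = {}
    for chess, checkers, rubik in rows:
        key = (
            "high" if chess >= 0.5 else "low",
            "high" if checkers >= 0.5 else "low",
        )
        groups.setdefault(key, []).append(rubik)
    return {key: mean(values) for key, values in groups.items()}


if __name__ == "__main__":
    for quadrant, avg in sorted(quadrant_means(make_rows()).items()):
        print(quadrant, round(avg, 3))
```

Grouping by quadrant makes the pattern visible even though each pairwise correlation is close to zero: the matching quadrants average above 0.5 and the mixed quadrants below it.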