This is an interesting data science conjecture, inspired by the well-known six degrees of separation problem, which states that any two people on Earth are linked by a chain of no more than 6 connections, say you and anyone living in North Korea.
Here the link is between any two univariate data sets of the same size, say Data A and Data B. The claim is that Data A and Data B can be joined by a chain of no more than 6 intermediary data sets, each highly correlated with the previous one (correlation above 0.8). The concept is illustrated in the example below, where only 4 intermediary data sets (labeled Degree 1, Degree 2, Degree 3, and Degree 4) are actually needed. The numbers highlighted in red show how this chain of data sets is built.
We have the following correlations:
Note that the data sets A and B were randomly generated, using the RAND() function in Excel. The full correlation table is as follows (the spreadsheet with data and computations is available here):
Cross-correlations between the 6 data sets
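As a rough illustration of how such a chain could be built in code, here is a minimal Python sketch. It is not the spreadsheet's method: it assumes the intermediary data sets may be constructed freely, and it relies on the fact that the correlation between two centered, unit-length vectors is the cosine of the angle between them. The function names (standardize, correlation_chain) and the min_corr parameter are hypothetical. The sketch walks along the arc from A to B in equal angular steps, each smaller than arccos(0.8), which is about 36.9 degrees. Since the angle between two data sets is at most 180 degrees, this particular construction never needs more than 4 intermediaries.

```python
import numpy as np

def standardize(x):
    """Center x and scale it to unit length; correlations are unchanged."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    return x / np.linalg.norm(x)

def correlation_chain(a, b, min_corr=0.8):
    """Return [A, intermediaries..., B] (standardized) such that each
    consecutive pair has correlation above min_corr (0 < min_corr < 1)."""
    u, v = standardize(a), standardize(b)
    theta = np.arccos(np.clip(u @ v, -1.0, 1.0))   # angle between A and B
    max_step = np.arccos(min_corr)                 # largest angle allowed per link
    n_links = int(np.floor(theta / max_step)) + 1  # ensures theta / n_links < max_step
    if n_links == 1:
        return [u, v]                              # A and B are already linked
    if np.sin(theta) < 1e-12:
        raise ValueError("A and B are exactly anti-correlated; the arc is not unique")
    chain = []
    for k in range(n_links + 1):
        t = k / n_links
        # Spherical interpolation: stays centered and unit length
        w = (np.sin((1 - t) * theta) * u + np.sin(t * theta) * v) / np.sin(theta)
        chain.append(w)
    return chain

# Two random data sets, analogous to using RAND() in Excel
rng = np.random.default_rng(42)
a, b = rng.random(20), rng.random(20)
chain = correlation_chain(a, b)
links = [float(np.corrcoef(chain[i], chain[i + 1])[0, 1])
         for i in range(len(chain) - 1)]
print("intermediaries:", len(chain) - 2)
print("link correlations:", np.round(links, 3))
```

The first and last elements of the chain are standardized copies of Data A and Data B, so their correlations with every other data set are the same as for the originals. Lowering or raising min_corr changes how many intermediaries are needed.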
Conclusions
This is just a conjecture, and the maximum number of intermediary data sets or the 0.8 correlation threshold may need to be fine-tuned and could depend on the size of the data set. But it makes for an interesting theoretical data science research project, for people with too much free time on their hands.
In some way, one could say that anything is related to everything by a short path, or that anything is caused by everything. This has of course been exploited by many news outlets to convey a political message, or to get you to click on some random, worthless article, using subject lines that seem implausible in order to attract your attention.
Comment
Completely random, but this morning I was thinking about prime numbers and their properties. I was wondering if data structures (maybe scoped to a firm) could be viewed as discrete items like integers, and their correlation then examined by the ability to explain them in terms of other data sets. So let us say a data set identified as "20" could be explained by 2 data sets "10", or 4 data sets "5"...
A possible application in data management could be allowing a firm to decide what to maintain in persistence and what not to, what to use for computational queries (the smaller, the cheaper), and so on. If you get 96% of the desired outcome using your "10" data set in two different ways, compared to maintaining the larger "20" data set, it may make sense to leverage the smaller data set, as viewed from the holistic organization.
More conjecture.
Isn't that somehow related to deep learning and the multiple hidden layers in a DNN?
Vincent, this is a great little experiment! Something interesting is that the pairwise correlations (as one would expect for synthetic correlations of 0.8) are dropping by roughly 0.2 at each step. With greater numbers of steps, the relationship would probably become more geometrical than linear (as corr(step X,step X + 2) should approximate square(corr(step X,step X + 1))). You can fine-tune the number of "degrees of separation" desired by sliding up and down that minimum transition correlation.
I like this as a tool to explain chaos, as it can easily be used to demonstrate that a single state can be the "result" of any of literally an infinite number of prior states arrived at by chaining strong correlations -- even states that are perfect opposites can lead to the same resultant state at d=6, r=0.8.
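The geometric decay suggested in this comment can be checked with a small simulation. The sketch below is only an illustration under a stated assumption: each data set in the chain equals r times the previous one plus independent Gaussian noise, scaled so that consecutive correlation is r. Under that assumption the correlation between steps that are k apart is r to the power k, so about 0.64 at two steps and about 0.262 at six steps for r = 0.8. The function name noisy_chain and its parameters are hypothetical, and chains built in other ways (such as the one in the article's spreadsheet) need not follow this pattern.

```python
import numpy as np

def noisy_chain(n_points, n_steps, r, seed=0):
    """Build n_steps + 1 data sets where each one has correlation
    (approximately) r with the previous one, using independent noise."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(n_points)
    chain = [x]
    for _ in range(n_steps):
        noise = rng.standard_normal(n_points)
        # Mixing weights keep unit variance and give corr(new, old) = r
        x = r * x + np.sqrt(1 - r**2) * noise
        chain.append(x)
    return np.array(chain)

r = 0.8
chain_sim = noisy_chain(n_points=100_000, n_steps=6, r=r)
corr = np.corrcoef(chain_sim)  # rows are data sets

# Compare the observed correlation with the first data set to r**lag
for lag in range(1, 7):
    print(lag, round(float(corr[0, lag]), 3), round(r**lag, 3))
```

With a large sample the observed values should track r**lag closely; with small data sets, as in the article's example, sampling noise can make the drop look closer to linear.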
This conjecture teaches a great lesson to me. Data has to make sense according to its domain/environment. We cannot just be plugging more irrelevant attributes for the sake of finding correlation to far-fetched data sets. Or else, said news outlet will appear legitimate in making outrageous claims, which appear true, but are apparently false under the hood.
6 Degrees of Separation is great food for thought in the world of DS.
Hi Purshottam -- these data sets were not hand-picked for this property. I created Data A and Data B in Excel using the RAND() function, and I used the first pair (Data A, Data B) that it produced.
How do we deal with such a data set?