# Six Degrees of Separation Between Any Two Data Sets

This is an interesting data science conjecture, inspired by the well-known six degrees of separation problem, which states that any two people on Earth are linked by a chain of no more than 6 connections: for instance, you and someone living in North Korea.

Here the link is between any two univariate data sets of the same size, say Data A and Data B. The claim is that there is a chain involving no more than 6 intermediary data sets, each highly correlated to the previous one (with a correlation above 0.8), between Data A and Data B. The concept is illustrated in the example below, where only 4 intermediary data sets (labeled Degree 1, Degree 2, Degree 3, and Degree 4) are actually needed. The numbers highlighted in red show how this chain of data sets is built.

We have the following correlations:

• Between Data A and Data B: -0.0044
• Between Degree 1 and Data A: 0.8232
• Between Degree 2 and Degree 1: 0.8293
• Between Degree 3 and Degree 2: 0.8056
• Between Degree 4 and Degree 3: 0.8460
• Between Data B and Degree 4: 0.8069

Note that the data sets A and B were randomly generated, using the RAND() function in Excel. The full correlation table is as follows (the spreadsheet with data and computations is available here):

Cross-correlations between the 6 data sets
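The article does not spell out how the intermediary data sets were built, so the following is only a minimal sketch of one possible construction, not the method used in the spreadsheet. It relies on the fact that the correlation between two data sets equals the cosine of the angle between their centered versions, and it places the intermediary data sets at equal angular steps between Data A and Data B. The function names `center_unit` and `build_chain`, the sample size of 20, and the random seed are all assumptions made for illustration.

```python
import numpy as np

def center_unit(x):
    """Center a vector and scale it to unit length (correlation ignores both)."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    return x / np.linalg.norm(x)

def build_chain(a, b, threshold=0.8):
    """Return [A, Degree 1, ..., B] as centered unit vectors such that each
    consecutive pair has correlation >= threshold (hypothetical helper)."""
    u, v = center_unit(a), center_unit(b)
    rho = float(np.clip(u @ v, -1.0, 1.0))   # correlation between A and B
    phi = np.arccos(rho)                     # angle between A and B
    w = v - rho * u                          # part of B orthogonal to A
    norm_w = np.linalg.norm(w)
    if norm_w < 1e-12:
        raise ValueError("A and B are perfectly correlated or anti-correlated; "
                         "this sketch does not handle that case")
    w /= norm_w
    # Smallest number of equal angular steps whose cosine stays >= threshold.
    steps = max(1, int(np.ceil(phi / np.arccos(threshold) - 1e-12)))
    angles = np.linspace(0.0, phi, steps + 1)
    return [np.cos(t) * u + np.sin(t) * w for t in angles]

# Assumed setup: two random data sets of size 20, mimicking Excel's RAND().
rng = np.random.default_rng(0)
data_a = rng.random(20)
data_b = rng.random(20)

chain = build_chain(data_a, data_b)
print("intermediary data sets:", len(chain) - 2)
for prev, cur in zip(chain, chain[1:]):
    print(round(float(np.corrcoef(prev, cur)[0, 1]), 4))
```

The first and last elements of `chain` are rescaled versions of Data A and Data B, which leaves every correlation unchanged. Other constructions (for example, editing a few values at a time) would give different intermediary data sets.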

Conclusions

This is just a conjecture: the number of intermediary data sets or the 0.8 correlation threshold may need fine-tuning, and both could depend on the size of the data set. But it makes for an interesting theoretical data science research project, for people with too much free time on their hands.
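One way to explore how the threshold and the number of intermediary data sets interact is a geometric argument, offered here as a sketch rather than a proof of the conjecture. The correlation between two data sets is the cosine of the angle between their centered versions, and angles add up along a chain, so each link with correlation at least `threshold` can cover an angle of at most `acos(threshold)`. The function name `intermediaries_needed` is hypothetical, and the calculation assumes a threshold strictly between 0 and 1 and data sets with at least 3 values.

```python
import math

def intermediaries_needed(rho, threshold=0.8):
    """Smallest number of intermediary data sets for a chain from A to B,
    given corr(A, B) = rho and a minimum link correlation of threshold."""
    phi = math.acos(max(-1.0, min(1.0, rho)))  # angle between A and B
    alpha = math.acos(threshold)               # largest angle one link can cover
    steps = math.ceil(phi / alpha - 1e-12)     # number of links in the chain
    return max(steps - 1, 0)

# Uncorrelated (rho = 0) and perfectly anti-correlated (rho = -1) cases.
for threshold in (0.7, 0.8, 0.9, 0.95):
    print(threshold,
          intermediaries_needed(0.0, threshold),
          intermediaries_needed(-1.0, threshold))
```

Under these assumptions the count depends only on the correlation between Data A and Data B and on the threshold, not on the size of the data sets, and a threshold of 0.8 never requires more than 4 intermediaries.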

In a way, one could say that anything is related to everything by a short path, or that anything is caused by everything. This has of course been exploited by many news outlets to convey a political message, or to get you to click on some random, worthless article, by using subject lines that seem implausible in order to attract your attention.


Comment by Raymond K Roberts on January 26, 2020 at 6:10am

Completely random, but this morning I was thinking about prime numbers and their properties. I was wondering if data structures (maybe scoped to a firm) could be viewed as discrete items like integers, and their correlation then examined through the ability to explain them in terms of other data sets. So let us say a data set identified as "20" could be explained by 2 data sets "10", or 4 data sets "5"...

A possible application in data management could be allowing a firm to decide what to maintain in persistence and what not to, what to use for computational queries (the smaller, the cheaper), and so on. If you get 96% of the desired outcome using your "10" data set in two different ways, compared to maintaining the larger "20" data set, it may make sense to leverage the smaller data set as viewed from the holistic organization.

More conjecture.

Comment by Maxime Prat on October 1, 2019 at 8:15am

Isn't that somehow related to deep learning and the multiple hidden layers in a DNN?

Comment by Nate Whitten on September 9, 2019 at 11:46am

Vincent, this is a great little experiment! Something interesting is that the pairwise correlations (as one would expect for synthetic correlations of 0.8) are dropping by roughly 0.2 at each step. With greater numbers of steps, the relationship would probably become more geometrical than linear (as corr(step X,step X + 2) should approximate square(corr(step X,step X + 1))). You can fine-tune the number of "degrees of separation" desired by sliding up and down that minimum transition correlation.

I like this as a tool to explain chaos, as it can easily be used to demonstrate that a single state can be the "result" of any of literally an infinite number of prior states arrived at by chaining strong correlations -- even states that are perfect opposites can lead to the same resultant state at d=6, r=0.8.

Comment by Jon-David Woods on September 9, 2019 at 7:29am

This conjecture teaches a great lesson to me. Data has to make sense according to its domain/environment. We cannot just be plugging more irrelevant attributes for the sake of finding correlation to far-fetched data sets. Or else, said news outlet will appear legitimate in making outrageous claims, which appear true, but are apparently false under the hood.

6 Degrees of Separation is great food for thought in the world of DS.

Comment by Vincent Granville on September 9, 2019 at 3:47am

Hi Purshottam -- these data sets were not hand-picked for this property. I created Data A and Data B in Excel using the RAND() function, and I used the first pair (Data A, Data B) that it produced.

Comment by Purshottam Hoovayya on September 8, 2019 at 11:18pm

How do we deal with such a data set?