# Data analysis challenge + how can I detect advanced correlations between variables?

Hello data scientists :)

The idea is to discover the story behind a data set whose underlying rules you do not know.
I am looking for a tool or Python library that analyses data sets and automatically finds correlations and relationships between variables.

These are the kind of things I would like to get:

1. The correlation between each pair of variables (1 vs 1)
2. Details about this correlation (e.g. A and B have a positive correlation (or a non-linear one...), especially when 50<A<70, with R reaching 0.8)
3. The correlation between several variables, including non-numerical ones (e.g. A and B have a positive correlation, but only when C>70 and D='USA')
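For items 1 and 2, here is a minimal sketch of what a semi-automatic scan could look like with pandas. The data is synthetic and invented purely for illustration (B follows A only when A is between 50 and 70), and `corr_by_bin` is a hypothetical helper, not a function from any library.

```python
import numpy as np
import pandas as pd

# Synthetic data, invented for illustration only:
# B tracks A when 50 < A < 70 and is unrelated to A elsewhere.
rng = np.random.default_rng(42)
n = 600
a = rng.uniform(0, 100, n)
in_band = (a > 50) & (a < 70)
b = np.where(in_band, 5 * a, 300.0) + rng.normal(0, 5, n)
df = pd.DataFrame({"A": a, "B": b, "C": rng.uniform(0, 100, n)})

# Item 1: every 1-vs-1 correlation at once.
pearson = df.corr(method="pearson")
spearman = df.corr(method="spearman")  # rank-based, also picks up monotonic non-linear links

# Item 2: compare the global correlation with the one inside a sub-range.
overall_r = pearson.loc["A", "B"]
band_r = df.loc[in_band, "A"].corr(df.loc[in_band, "B"])


def corr_by_bin(frame, x, y, edges, min_rows=20):
    """Hypothetical helper: Pearson r between x and y inside each [lo, hi) slice of x."""
    rows = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        sub = frame[(frame[x] >= lo) & (frame[x] < hi)]
        if len(sub) >= min_rows:  # skip slices too small to trust
            rows.append({"lo": lo, "hi": hi, "n": len(sub), "r": sub[x].corr(sub[y])})
    return pd.DataFrame(rows)


by_bin = corr_by_bin(df, "A", "B", np.arange(0, 101, 10))

print(pearson.round(2))
print(f"overall r(A, B) = {overall_r:.2f}, r(A, B) for 50 < A < 70 = {band_r:.2f}")
print(by_bin.round(2))
```

Scanning fixed-width bins is only one possible way to find the interesting sub-range. The bin width and `min_rows` are arbitrary choices here, and narrower bins make each estimate noisier.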

As you can see, there is nothing very complicated from a math/stat standpoint.
The keyword here is "automatically": the goal is to avoid spending hours manually hunting for sub-rules and exceptions in subsets of the data.

Here is a little data analysis challenge!
The attached Excel file contains two tabs.
The first one holds a small data set to analyse and the rules of the game (what to find and how).
The second one contains some results found by manual research and graphical representations... results that should ideally be found by other, more intelligent means.

Tags: correlation, python, tool

### Replies to This Discussion

Are you running your computations in batch mode? Some libraries in Python and R will solve your problem, see for instance here. In Excel, it is easy to compute the correlation matrix (all the correlations at once) if you install the Analysis ToolPak, see here.

The main concern, if you have many variables, is that some correlations will be significant just by chance, even with totally random data. See the section on p-values in my book (here) page 236.
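To make the chance-correlation concern concrete, here is a small simulation sketch on pure random noise. The p-values use the Fisher z-transform approximation, and the Bonferroni correction is just one simple adjustment picked for illustration; neither is taken from the book mentioned above, and `fisher_p_value` is a hypothetical helper name.

```python
import math
import numpy as np


def fisher_p_value(r, n):
    """Approximate two-sided p-value for a Pearson r, via the Fisher z-transform."""
    z = math.atanh(r) * math.sqrt(n - 3)
    return math.erfc(abs(z) / math.sqrt(2))


rng = np.random.default_rng(0)
n_rows, n_vars = 50, 30
data = rng.normal(size=(n_rows, n_vars))  # pure noise: no real relationship anywhere

corr = np.corrcoef(data, rowvar=False)  # n_vars x n_vars correlation matrix
pairs = [(i, j) for i in range(n_vars) for j in range(i + 1, n_vars)]
p_values = [fisher_p_value(corr[i, j], n_rows) for i, j in pairs]

alpha = 0.05
naive_hits = sum(p < alpha for p in p_values)
# Bonferroni: divide the threshold by the number of tests performed.
bonferroni_hits = sum(p < alpha / len(pairs) for p in p_values)

print(f"{len(pairs)} pairs tested on random data")
print(f"'significant' at p < {alpha}: {naive_hits}")
print(f"still significant after Bonferroni: {bonferroni_hits}")
```

With 30 variables there are 435 pairs, so at a 5% threshold roughly twenty of them are expected to look significant even though the data is noise. Any automatic search over many variables and sub-ranges multiplies the number of tests further.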

Hello Vincent,

Not in batch mode; I am rather planning to manually load a CSV file into a Python/R/SAS program.

Excel can run correlation analysis between numerical variables, but does not handle:

- correlation analysis on sub-scopes (e.g. two variables X and Y have a low correlation, except when X>50)
- correlation involving more than two variables (e.g. two variables X and Y have a low correlation, except when Z>0)
- correlation with non-numerical variables like days, cities, etc.
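As a rough sketch of the last two cases, one option is to compute the correlation inside each segment defined by other variables, and to use a correlation ratio (eta squared) for a categorical variable against a numeric one. The data below is synthetic and invented for illustration, and `conditional_corr` and `correlation_ratio` are hypothetical helpers, not library functions.

```python
import numpy as np
import pandas as pd

# Synthetic data, invented for illustration only:
# Y follows X only when Z > 0 and city == "Boston"; W depends on city.
rng = np.random.default_rng(1)
n = 2000
df = pd.DataFrame({
    "X": rng.normal(size=n),
    "Z": rng.normal(size=n),
    "city": rng.choice(["Paris", "Boston", "Tokyo"], n),
})
linked = (df["Z"] > 0) & (df["city"] == "Boston")
df["Y"] = np.where(linked, 2 * df["X"], 0.0) + rng.normal(size=n)
df["W"] = df["city"].map({"Paris": 0.0, "Boston": 1.0, "Tokyo": 2.0}) + rng.normal(size=n)
df["Z_pos"] = df["Z"] > 0  # turn the numeric condition into a segment label


def conditional_corr(frame, x, y, by, min_rows=30):
    """Hypothetical helper: Pearson r between x and y inside each segment of `by`."""
    rows = []
    for key, group in frame.groupby(by):
        if len(group) >= min_rows:  # ignore segments too small to trust
            rows.append({"segment": key, "n": len(group), "r": group[x].corr(group[y])})
    result = pd.DataFrame(rows)
    return result.sort_values("r", key=lambda s: s.abs(), ascending=False)


def correlation_ratio(categories, values):
    """Eta squared: share of the variance of `values` explained by the category means."""
    grand_mean = values.mean()
    group_means = values.groupby(categories).transform("mean")
    ss_between = ((group_means - grand_mean) ** 2).sum()
    ss_total = ((values - grand_mean) ** 2).sum()
    return ss_between / ss_total


segments = conditional_corr(df, "X", "Y", by=["Z_pos", "city"])
eta_w = correlation_ratio(df["city"], df["W"])
eta_x = correlation_ratio(df["city"], df["X"])

print(f"overall r(X, Y) = {df['X'].corr(df['Y']):.2f}")
print(segments.round(2))
print(f"eta^2(city -> W) = {eta_w:.2f}, eta^2(city -> X) = {eta_x:.2f}")
```

The segment with the strongest correlation floats to the top of `segments`. This only checks the conditions you think of listing in `by`, so it narrows the manual search rather than replacing it, and every extra segment tested adds to the chance-correlation risk.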

I will check NumPy in Python, but I have not seen any sign that it handles these cases.

Thanks for the reference to your book, very interesting!