# Correlation vs. causation

What techniques can discriminate between coincidence, correlation, and real causation? Also, if you have a model where the response is highly correlated with (say) 5 non-causal variables and 2 direct causal variables, how do you assign a weight to the two causal variables?

Can you provide examples of successful cause detection? In which contexts is it important to detect the true causes? In which contexts can causation be ignored, as long as your predictive model generates great ROI?

### Replies to This Discussion

As an illustration, you can predict the chance of lung cancer within the next 5 years, for a given individual, either as a function of daily cigarette consumption or as a function of their electricity bill. Both are great predictors. The first one is a cause. The second one is not, but it is linked to age (the older you get, the more expensive commodities are, due to inflation) and thus is a very indirect (non-causal) factor for lung cancer.
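To make the electricity bill example concrete, here is a minimal simulation sketch. Everything in it is an assumption made for illustration: the variable names, the coefficients, and the linear data-generating process are invented, not taken from real data. It shows that the bill correlates with the risk score only through age, so the association disappears once age is controlled for, while the association with cigarettes does not.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000

# Hypothetical data-generating process (all coefficients are made up).
age = rng.uniform(20, 80, n)
cigarettes = rng.poisson(5, n).astype(float)                   # daily cigarettes, independent of age here
bill = 50 + 2.0 * age + rng.normal(0, 15, n)                   # the bill depends on age only
risk = 0.05 * age + 0.3 * cigarettes + rng.normal(0, 1, n)     # risk score: age and smoking, never the bill


def partial_corr(a, b, control):
    """Correlation between a and b after removing the linear effect of control."""
    design = np.column_stack([np.ones_like(control), control])
    res_a = a - design @ np.linalg.lstsq(design, a, rcond=None)[0]
    res_b = b - design @ np.linalg.lstsq(design, b, rcond=None)[0]
    return np.corrcoef(res_a, res_b)[0, 1]


raw_bill = np.corrcoef(bill, risk)[0, 1]       # sizeable, although the bill causes nothing
adj_bill = partial_corr(bill, risk, age)       # close to zero once age is held fixed
adj_cig = partial_corr(cigarettes, risk, age)  # stays clearly positive

print(f"bill vs risk, raw:            {raw_bill:.3f}")
print(f"bill vs risk, given age:      {adj_bill:.3f}")
print(f"cigarettes vs risk, given age: {adj_cig:.3f}")
```

Controlling for a variable this way only helps when the confounder is known and measured, which is exactly the hard part in practice.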

1. David Freedman is the author of an excellent book, "Statistical Models: Theory and Practice", which discusses the issue of causation. It is an unusual stat book in that it really gets into the issue of model assumptions. I highly recommend it. It claims to be introductory, but I believe that a semester or two of math stat as a prerequisite would be helpful.
2. In the time series context, you can run a VAR and then test for Granger causality to see if one variable is really "causing" the other, where "causing" is defined by Granger (see any of Granger's books for the technical definition of Granger causality). R has a nice package called vars which makes building VAR models and testing them extremely straightforward.
3. Correlation does not imply causation, although where there is causation you will often (but not always) have correlation. Causality analysis can be done by learning Bayesian networks from the data. See the excellent tutorial "A Tutorial on Learning With Bayesian Networks" by David Heckerman.
4. Stefan Conrady, who manages the company called "Bayesia Networks", once told me his methodology can identify causation.
5. For time series, see http://en.wikipedia.org/wiki/Granger_causality/
6. The only way to discriminate between correlation and coincidence is through a controlled experiment. Design the experiment such that you can test the effect of each parameter independently of the others.
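As a rough illustration of point 2, the idea behind a Granger test can be sketched in a few lines. This is not the vars package and not a full VAR: it is a hand-rolled two-variable F statistic, with a hypothetical helper name (`granger_f`) and simulated series whose coefficients are invented. It asks whether lagged values of one series improve the forecast of the other beyond that series' own lags.

```python
import numpy as np


def granger_f(cause, effect, lags=1):
    """F statistic for: do lagged values of `cause` help predict `effect`
    beyond the lagged values of `effect` itself?"""
    n = len(effect)
    target = effect[lags:]
    own = np.column_stack([effect[lags - k: n - k] for k in range(1, lags + 1)])
    other = np.column_stack([cause[lags - k: n - k] for k in range(1, lags + 1)])
    const = np.ones((n - lags, 1))

    def rss(design):
        # Residual sum of squares of an ordinary least squares fit.
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        resid = target - design @ coef
        return float(resid @ resid)

    rss_restricted = rss(np.hstack([const, own]))      # own lags only
    rss_full = rss(np.hstack([const, own, other]))     # own lags + lags of `cause`
    df_resid = (n - lags) - (1 + 2 * lags)
    return ((rss_restricted - rss_full) / lags) / (rss_full / df_resid)


# Simulated example (assumed process): x drives y with a one-step delay.
rng = np.random.default_rng(42)
n = 500
x = rng.normal(size=n)
y = np.zeros(n)
for t in range(1, n):
    y[t] = 0.5 * y[t - 1] + 0.8 * x[t - 1] + rng.normal()

f_xy = granger_f(x, y)   # large: past x helps predict y
f_yx = granger_f(y, x)   # small: past y does not help predict x
print(f"F (x -> y): {f_xy:.1f}")
print(f"F (y -> x): {f_yx:.1f}")
```

Keep in mind that this only measures predictive precedence. A third series driving both x and y with different delays would produce the same pattern, so Granger causality is not causality in the experimental sense.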

The best book that I have seen so far on the topic is Causality by Judea Pearl. He discusses the topic in great detail: direct and indirect effects, confounding, counterfactuals, etc.

Please note that a causal model aims to describe the mechanism (why something is happening), which is essentially a deeper understanding than is possible through correlation analysis.

The technique that I have used often is roughly described as:

• Start with a causal model diagram that will form your hypotheses (you can use SEM diagrams; I actually use a technique called system dynamics).
• Find data to confirm or refute the set of hypotheses made in your model. For data that does not exist, you will need to perform focused experiments.
• Based on the data, correlation, and SEM analysis, refine your causal understanding.
• Simulate the model to see if the results are plausible, and refine.
• Start using your model, and keep refining it as new evidence shows up.

Also, if you are dealing with more than 200 variables, you will find correlations that look significant (from a statistical point of view) but are actually an artifact caused by the large number of variables. Example: simulate 50 observations, each with 10,000 variables, each value being a simulated random number. Chances are you will find, among these 10,000 variables, two that are highly correlated, despite the fact that the real correlation between any two of these variables is (by design!) zero.
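The experiment is easy to try. The sketch below is a scaled-down version under stated assumptions: it uses 2,000 variables instead of 10,000 (to keep the full correlation matrix small in memory), standard normal draws as the "simulated random numbers", and an arbitrary 0.5 cutoff for "highly correlated".

```python
import numpy as np

rng = np.random.default_rng(1)
n_obs, n_vars = 50, 2_000   # scaled down from 10,000 variables to limit memory use

# Independent noise: the true correlation between any two columns is zero.
data = rng.normal(size=(n_obs, n_vars))

corr = np.corrcoef(data, rowvar=False)   # n_vars x n_vars sample correlation matrix
np.fill_diagonal(corr, 0.0)              # ignore each variable's correlation with itself
abs_corr = np.abs(corr)

i, j = np.unravel_index(np.argmax(abs_corr), abs_corr.shape)
max_abs_corr = abs_corr[i, j]
n_pairs = n_vars * (n_vars - 1) // 2
n_strong = int((np.triu(abs_corr, k=1) > 0.5).sum())   # count each pair once

print(f"pairs examined: {n_pairs}")
print(f"strongest pair: variables {i} and {j}, |r| = {max_abs_corr:.3f}")
print(f"pairs with |r| > 0.5: {n_strong}")
```

With only 50 observations, a sample correlation has a standard deviation of roughly 0.14 around zero, so among about two million pairs some extreme values are expected purely by chance.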

There are ways to deal with this issue, but this will be the subject of another discussion.

As David Hume tells us, all we observe in the world is constant conjunction.  Cause and effect is something we infer, guided by theory and experiments.

Thanks for the insight, some valuable information here.  I look forward to exploring some of it.

Another example:

Doing more sports is correlated with less obesity. So someone decides that all kids should have more physical education at school, in order to fight obesity. However, lack of sports is not the root cause of obesity: bad eating habits are, and that's what should be addressed first to fix the problem. Then, replacing sports with mathematics or foreign languages would make American kids both less obese (once eating habits are fixed) and more educated at the same time.

A lot can be done with black-box pattern detection, where patterns are found but not understood. For many applications (e.g. high-frequency trading) it's fine as long as your algorithm works. But in other contexts (e.g. root cause analysis for cancer eradication), deeper investigation is needed for higher success. And in all contexts, identifying and weighting the true factors that explain the cause usually allows for better forecasts, especially if good model selection, model fitting, and cross-validation are performed. But if advanced modeling requires paying a high salary to a statistician for 12 months, maybe the ROI becomes negative and black-box brute force performs better, ROI-wise. In both cases, whether you care about cause or not, it is still science. Indeed it is actually data science - and it includes an analysis to figure out when and whether deeper statistical science is required. And in all cases, it involves cross-validation and design of experiments. Only the theoretical statistical modeling aspect can be ignored. Other aspects, such as scalability and speed, must be considered, and this is science too: data and computer science.

Facebook data scientists hilariously debunk Princeton "correlation equals causation" based study that says Facebook will lose 80% of users - by "proving" that Princeton will lose all its students by 2021.

Vincent,

Since your post is 2012, you probably already know this but I'm posting here for future readers...

Regarding Bayesia's (Stefan Conrady's company) flagship product, BayesiaLab, and causation: yes, they have an approach that helps. Since BayesiaLab is designed primarily for Bayesian networks, it's probably not a big surprise that their approach relies on the same methodology as Judea Pearl's formalism for directed acyclic graph (DAG) models. The thing to note, however, is that one must develop the causal model first, and BayesiaLab is then used to test whether the data agrees with the model. That is, you cannot determine causality entirely from data.

The workflow is that an SME (perhaps in conjunction with a knowledge engineer) encodes their hypothesis in the form of a DAG. In contrast to Bayesian networks, here arcs denote causality and the flow of time. This is Pearl's "data generation model" (i.e., a hypothesized model for how the system works, using the appropriate level of granularity, etc.). The absence of arcs between nodes (random variables) is what is key -- that's where your assumptions come in (arcs merely represent that there is a possibility of causality). Then you run whatever data you have through the model and see whether it agrees. If not, it's back to the drawing board (albeit much wiser).

The need for an SME to encode the model (rather than learning it entirely from data) is because this approach is based on counterfactual reasoning and, thus, is essentially a missing data problem. You've got to account for that somehow and that's where human knowledge comes in. But using a graphical model makes the process bearable.
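For future readers who want to see what "testing whether the data agrees with the model" can look like, here is a small generic sketch. It is not BayesiaLab and does not use its API. It assumes a hypothesized chain X -> Z -> Y with no arc from X to Y, simulated linear-Gaussian data with invented coefficients, and partial correlation as a stand-in for a proper conditional independence test. The missing arc implies that X and Y should be independent once Z is known, and that implication can be checked against data.

```python
import numpy as np


def partial_corr(a, b, control):
    """Correlation between a and b after removing the linear effect of control."""
    design = np.column_stack([np.ones_like(control), control])
    res_a = a - design @ np.linalg.lstsq(design, a, rcond=None)[0]
    res_b = b - design @ np.linalg.lstsq(design, b, rcond=None)[0]
    return np.corrcoef(res_a, res_b)[0, 1]


rng = np.random.default_rng(7)
n = 10_000


def simulate(direct_effect):
    """Hypothetical system: X -> Z -> Y, plus an optional direct X -> Y effect."""
    x = rng.normal(size=n)
    z = 0.8 * x + rng.normal(size=n)
    y = 0.8 * z + direct_effect * x + rng.normal(size=n)
    return x, y, z


results = {}
for label, effect in [("data matches the DAG", 0.0), ("data violates the DAG", 0.6)]:
    x, y, z = simulate(effect)
    results[label] = partial_corr(x, y, z)
    print(f"{label}: corr(X, Y) = {np.corrcoef(x, y)[0, 1]:.3f}, "
          f"corr(X, Y | Z) = {results[label]:.3f}")
```

In both worlds X and Y are correlated, so correlation alone cannot tell them apart. Only the assumption encoded by the missing arc gives the data something to contradict, which is the point made above about assumptions living in the absent arcs.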

Hope this is helpful for future readers. Pearl's book is quite the tome but it's the most comprehensive on this approach, so that's the one to get if one wants to read more.

-Mark

One thing that doesn't seem to make it into the discussion of correlation vs. causation is practicality. Let us assume that X and Y are correlated but not causally linked. Every time we observe X happening, we observe a similar pattern in Y. If I notice X going up, I can take an action that depends on the outcome of Y. If Y then happens as I inferred from X, I win even if the two are not causally linked. Of course, there are a variety of situations where proving causality is necessary. My point, however, is that not all situations require it.

I would welcome your thoughts on this.

What about the decades-old literature that basically started the modern conversation on this topic? Rubin's causal model is standard fare (or should be) for graduate statistics programs ... The fact that it remains unknown or unrecognized in these circles suggests to me that data science is losing touch with its roots in statistics as it is blended in with other disciplines.
