Correlation vs. causation

What are the techniques to discriminate between coincidences, correlations, and real causation? Also, if you have a model where the response is highly correlated to (say) 5 non-causal variables and 2  direct causal variables, how do you assign a weight to the two causal variables?

Can you provide examples of successful cause detection? In which contexts is it important to detect the true causes? In which context causation can be ignored as long as your predictive model generates great ROI?

Related articles

Load Previous Replies
  • up

    Vincent Granville

    A lot can be done with black-box pattern detection, where patterns are found but not understood. For many applications (e.g. high frequency training) it's fine as long as your algorithm works. But in other contexts (e.g. root cause analysis for cancer eradication), deeper investigation is needed for higher success. And in all contexts, identifying and weighting true factors that explain the cause, usually allows for better forecasts, especially if good model selection, model fitting and cross-validation is performed. But if advanced modeling requires paying a high salary to a statistician for 12 months, maybe the ROI becomes negative and black-box brute force performs better, ROI-wise. In both cases, whether caring about cause or not, it is still science. Indeed it is actually data science - and it includes an analysis to figure out when/whether deeper statistical science can or can not be required. And ill all cases, it always involves cross-validation and design of experiment. Only the statistical theoretical modeling aspect can be ignored. Other aspects, such as scalability and speed, must be considered, and this is science too: data and computer science.

  • up

    Alex Esterkin

    Facebook data scientists hilariously debunk Princeton "correlation equals causation" based study that says Facebook will lose 80% of users - by "proving" that Princeton will lose all its students by 2021.

  • up

    Mark Meloon


    Since your post is 2012, you probably already know this but I'm posting here for future readers...

    Regarding Bayesia's (Stefan Conrad's company) flagship product, BayesiaLab, and causation: yes, they have an approach that helps. Since BayesiaLab is designed primarily for Bayesian networks, it's probably not a big surprise that their approach relies on the same methodology as Judea Pearl's formalism for directed acyclic graphical (DAG) models. The thing to note, however, is that one must develop the causative model first and BayesiaLab is then used to test whether the data agrees with the model. That is, you cannot determine causality entirely from data.

    The workflow is that an SME (perhaps in conjunction with a knowledge engineer) encodes their hypothesis in the form of a DAG. In contrast to Bayesian networks, here arcs denote causality and the flow of time. This is Pearl's "data generation model" (i.e., a hypothesized model for how the system works, using the appropriate level of granularity, etc.). The absence of arcs between nodes (random variables) is what is key -- that's where your assumptions come in (arcs merely represent that there is a possibility of causality). Then you run whatever data you have through the model and see whether it agrees. If not, it's back to the drawing board (albeit much wiser)

    The need for an SME to encode the model (rather than learning it entirely from data) is because this approach is based on counterfactual reasoning and, thus, is essentially a missing data problem. You've got to account for that somehow and that's where human knowledge comes in. But using a graphical model makes the process bearable.

    Hope this is helpful for future readers. Pearl's book is quite the tome but it's the most comprehensive on this approach, so that's the one to get if one wants to read more.


    Vincent Granville said:

    A few great answers from our LinkedIn groups:

    1. David Freedman is the author of an excellent book: "Statistical Models: Theory and Practice" which discusses the issue of causation. It's a very unique stat book in that it really gets into the issue of model assumptions. I highly recommend it. It claims to be introductory but I believe that a semester or two of math stat as a pre-req would be helpful.
    2. In the time series context, you can run a VAR and then do tests for Granger Causality to see if one variable is really "causing" the other where "causing" is defined by Granger. ( see any of Granger's books for the technical definition of granger causaility ). R has a nice package called vars which makes building VAR models and doing testing extremely straightforward. 
    3. Correlation does not imply causation although where there is causation you will often but not always have correlation. Causality analysis can be done by learning Bayesian Networks from the data. See the excellent tutorial "A Tutorial on Learning With Bayesian Networks" by David Heckerman.
    4. Stefan Conrady who manages the company called "Bayesia Networks" once told me his methodology can identify causation.
    5. For time series, see
    6. The only way to discriminate between correlation and coincidence is through a controlled experiment that. Design the experiment such that you can test the effect of each parameter independent of the other.