Correlation vs. causation

What are the techniques to discriminate between coincidences, correlations, and real causation? Also, if you have a model where the response is highly correlated with (say) 5 non-causal variables and 2 direct causal variables, how do you assign a weight to the two causal variables?

Can you provide examples of successful cause detection? In which contexts is it important to detect the true causes? In which contexts can causation be ignored, as long as your predictive model generates great ROI?


    Mark Meloon


    Since your post is 2012, you probably already know this but I'm posting here for future readers...

    Regarding Bayesia's (Stefan Conrady's company) flagship product, BayesiaLab, and causation: yes, they have an approach that helps. Since BayesiaLab is designed primarily for Bayesian networks, it's probably not a big surprise that their approach relies on the same methodology as Judea Pearl's formalism for directed acyclic graph (DAG) models. The thing to note, however, is that one must develop the causative model first; BayesiaLab is then used to test whether the data agrees with the model. That is, you cannot determine causality entirely from data.

    The workflow is that an SME (perhaps in conjunction with a knowledge engineer) encodes their hypothesis in the form of a DAG. In contrast to ordinary Bayesian networks, here arcs denote causality and the flow of time. This is Pearl's "data generation model" (i.e., a hypothesized model for how the system works, at the appropriate level of granularity, etc.). The absence of arcs between nodes (random variables) is what is key -- that's where your assumptions come in (arcs merely represent that there is a possibility of causality). Then you run whatever data you have through the model and see whether it agrees. If not, it's back to the drawing board (albeit much wiser).
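    One concrete way to "run the data through the model" is to test the conditional independencies the hypothesized DAG implies. Below is a minimal sketch in Python with simulated data (the chain X → Y → Z is a made-up example, not one from this thread): the absent arc X → Z implies that X and Z are independent given Y, which shows up as a near-zero partial correlation.

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    n = 100_000

    # Hypothesized data-generation model (a DAG): X -> Y -> Z.
    # The *absent* arc X -> Z is the testable assumption.
    x = rng.normal(size=n)
    y = 2.0 * x + rng.normal(size=n)
    z = -1.5 * y + rng.normal(size=n)

    def partial_corr(a, b, c):
        """Correlation of a and b after regressing out c (least squares)."""
        def residual(v):
            beta = np.dot(c, v) / np.dot(c, c)
            return v - beta * c
        return np.corrcoef(residual(a), residual(b))[0, 1]

    print(np.corrcoef(x, z)[0, 1])  # strongly nonzero: X and Z are correlated
    print(partial_corr(x, z, y))    # near zero: X independent of Z given Y
    ```

    If the partial correlation had come out far from zero, the data would disagree with the DAG, and it would be back to the drawing board.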

    The need for an SME to encode the model (rather than learning it entirely from data) arises because this approach is based on counterfactual reasoning and is thus essentially a missing-data problem. You've got to account for that somehow, and that's where human knowledge comes in. But using a graphical model makes the process bearable.

    Hope this is helpful for future readers. Pearl's book is quite the tome but it's the most comprehensive on this approach, so that's the one to get if one wants to read more.


    Vincent Granville said:

    A few great answers from our LinkedIn groups:

    1. David Freedman is the author of an excellent book, "Statistical Models: Theory and Practice", which discusses the issue of causation. It's a unique stat book in that it really gets into the issue of model assumptions. I highly recommend it. It claims to be introductory, but I believe a semester or two of math stat as a prerequisite would be helpful.
    2. In the time series context, you can run a VAR and then do tests for Granger causality to see if one variable is really "causing" the other, where "causing" is defined by Granger (see any of Granger's books for the technical definition of Granger causality). R has a nice package called vars which makes building VAR models and doing testing extremely straightforward.
    3. Correlation does not imply causation, although where there is causation you will often, but not always, have correlation. Causality analysis can be done by learning Bayesian networks from the data. See the excellent tutorial "A Tutorial on Learning With Bayesian Networks" by David Heckerman.
    4. Stefan Conrady, who manages the company Bayesia, once told me his methodology can identify causation.
    5. For time series, see
    6. The only way to discriminate between correlation and coincidence is through a controlled experiment. Design the experiment such that you can test the effect of each parameter independently of the others.

      Taymour Matin

      Vincent, just started to read your blogs - thanks for your contributions!  

      One thing that doesn't seem to make it into the discussion of correlation vs. causation is practicality. Let us assume that X and Y are correlated but not causally linked. Every time we observe X happening, we observe a similar pattern in Y. So if I notice X going up, I can take an action whose payoff depends on Y. If Y then behaves as my observation of X led me to expect, I win even though the two are not causally linked. Of course, there are a variety of situations where proving causality is necessary. My point, however, is that not all situations require it.

      I would welcome your thoughts on this.



        Matt Anthony

        What about the decades-old literature that basically started the modern conversation on this topic? Rubin's causal model is standard fare (or should be) for graduate statistics programs. The fact that it remains unknown or unrecognized in these circles suggests to me that data science is losing touch with its roots in statistics as it is blended with other disciplines.