Data Science Central

What are the techniques to discriminate between coincidences, correlations, and real causation? Also, if you have a model where the response is highly correlated with (say) 5 non-causal variables and 2 direct causal variables, how do you assign a weight to the two causal variables?

Can you provide examples of successful cause detection? In which contexts is it important to detect the true causes? In which contexts can causation be ignored, as long as your predictive model generates great ROI?

## Mark Meloon

Vincent,

Since your post is from 2012, you probably already know this, but I'm posting here for future readers.

Regarding Bayesia's (Stefan Conrad's company) flagship product, BayesiaLab, and causation: yes, they have an approach that helps. Since BayesiaLab is designed primarily for Bayesian networks, it's probably not a big surprise that their approach relies on the same methodology as Judea Pearl's formalism for directed acyclic graph (DAG) models. The thing to note, however, is that one must develop the causal model first; BayesiaLab is then used to test whether the data agrees with the model. That is, you cannot determine causality entirely from data.

The workflow is that an SME (perhaps in conjunction with a knowledge engineer) encodes their hypothesis in the form of a DAG. In contrast to Bayesian networks, here arcs denote causality and the flow of time. This is Pearl's "data generation model" (i.e., a hypothesized model for how the system works, using the appropriate level of granularity, etc.). The *absence* of arcs between nodes (random variables) is what is key: that's where your assumptions come in (arcs merely represent that there is a *possibility* of causality). Then you run whatever data you have through the model and see whether it agrees. If not, it's back to the drawing board (albeit much wiser).

The need for an SME to encode the model (rather than learning it entirely from data) is because this approach is based on counterfactual reasoning and, thus, is essentially a missing data problem. You've got to account for that somehow, and that's where human knowledge comes in. But using a graphical model makes the process bearable.
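As a rough illustration of the "see whether the data agrees" step, here is a minimal Python sketch. It is not BayesiaLab's method or API; it only assumes a hypothetical three-node DAG, X → Z → Y, with linear Gaussian relationships and made-up coefficients. The missing arc between X and Y implies that X and Y should be uncorrelated once Z is accounted for, which can be checked with a partial correlation.

```python
import math
import random


def pearson(a, b):
    """Plain Pearson correlation of two equal-length lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    return cov / math.sqrt(var_a * var_b)


def partial_corr(x, y, z):
    """Correlation of x and y after linearly controlling for z."""
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


def simulate(n, direct_effect, seed):
    """Draw n samples from X -> Z -> Y, plus an optional direct X -> Y arc.

    The coefficients (0.8, 0.7) are arbitrary values chosen for illustration.
    """
    rng = random.Random(seed)
    xs, zs, ys = [], [], []
    for _ in range(n):
        x = rng.gauss(0, 1)
        z = 0.8 * x + rng.gauss(0, 1)
        y = 0.7 * z + direct_effect * x + rng.gauss(0, 1)
        xs.append(x)
        zs.append(z)
        ys.append(y)
    return xs, zs, ys


# Data that matches the hypothesized DAG (no direct X -> Y arc).
x1, z1, y1 = simulate(5000, direct_effect=0.0, seed=1)
# Data from a world where the SME's "no arc" assumption is wrong.
x2, z2, y2 = simulate(5000, direct_effect=0.6, seed=2)

marginal_corr = pearson(x1, y1)          # X and Y are correlated overall
pc_consistent = partial_corr(x1, y1, z1)  # expected to be close to zero
pc_violating = partial_corr(x2, y2, z2)   # expected to be clearly non-zero

print(f"marginal corr(X, Y):           {marginal_corr:.3f}")
print(f"partial corr, data fits DAG:   {pc_consistent:.3f}")
print(f"partial corr, DAG is violated: {pc_violating:.3f}")
```

In the first data set the partial correlation should come out near zero, so the data does not contradict the DAG. In the second it should be clearly non-zero, which is the "back to the drawing board" signal. Note that passing the check does not prove the DAG is right; other graphs can imply the same independence.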

Hope this is helpful for future readers. Pearl's book is quite the tome but it's the most comprehensive on this approach, so that's the one to get if one wants to read more.

-Mark

Oct 16, 2014

## Taymour Matin

Vincent, just started to read your blogs - thanks for your contributions!

One thing that doesn't seem to make it into the discussion of correlation vs. causation is practicality. Let us assume that X and Y are correlated but not causally linked: every time we observe X happening, we observe a similar pattern in Y. If I notice X going up, I can take an action whose payoff depends on the outcome of Y. If Y then behaves as I inferred from X, I win, even though the two are not causally linked. Of course, there are a variety of situations where proving causality is necessary. My point, however, is that not all situations require it.
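A small simulation can make the distinction concrete. The sketch below is only an illustration under assumed conditions: a hypothetical hidden common cause C drives both X and Y, with made-up noise levels. When X is merely observed, it is a useful predictor of Y. When X is set by hand (an intervention), the link to Y disappears, which is the kind of situation where knowing the causal structure matters.

```python
import math
import random


def pearson(a, b):
    """Plain Pearson correlation of two equal-length lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    return cov / math.sqrt(var_a * var_b)


def simulate(n, intervene, seed):
    """Sample (X, Y) where a hidden common cause C drives both.

    If intervene is True, X is set independently of C, mimicking someone
    forcing X to a value instead of just watching it.
    """
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(n):
        c = rng.gauss(0, 1)  # hidden common cause (assumed)
        x = rng.gauss(0, 1) if intervene else c + rng.gauss(0, 0.5)
        y = c + rng.gauss(0, 0.5)  # Y never depends on X directly
        xs.append(x)
        ys.append(y)
    return xs, ys


obs_x, obs_y = simulate(5000, intervene=False, seed=3)
int_x, int_y = simulate(5000, intervene=True, seed=4)

corr_observed = pearson(obs_x, obs_y)    # strong: watching X helps predict Y
corr_intervened = pearson(int_x, int_y)  # near zero: pushing X does nothing to Y

print(f"corr(X, Y) when X is observed: {corr_observed:.3f}")
print(f"corr(X, Y) when X is forced:   {corr_intervened:.3f}")
```

So acting on a prediction of Y from an observed X can pay off without any causal link, as long as the action does not itself change X.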

I would welcome your thoughts on this.

on Sunday

## Matt Anthony

5 hours ago