What are the techniques to discriminate between coincidences, correlations, and real causation? Also, if you have a model where the response is highly correlated to (say) 5 non-causal variables and 2 direct causal variables, how do you assign a weight to the two causal variables?
Can you provide examples of successful cause detection? In which contexts is it important to detect the true causes? In which contexts can causation be ignored, as long as your predictive model generates great ROI?
As an illustration, you can predict chances of lung cancer within the next 5 years, for a given individual, either as a function of daily consumption of cigarettes, or as a function of electricity bill. Both are great predictors: the first one is a cause. The second one is not, but it is linked to age (the older you get, the more expensive commodities are due to inflation) and thus is a very indirect (non-causal) factor for lung cancer.
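To make the electricity-bill example concrete, here is a minimal sketch on synthetic data. Everything in it is an assumption made for illustration: the coefficients, the noise levels, and the use of a continuous "risk score" in place of an actual cancer outcome. Age drives both the bill and the risk, so the two are correlated, but the partial correlation (after removing the linear effect of age) drops to roughly zero.

```python
import math
import random

def pearson(x, y):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def partial_corr(x, y, z):
    """Correlation of x and y after linearly removing z from both."""
    rxy, rxz, ryz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (rxy - rxz * ryz) / math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))

def simulate(n=5000, seed=0):
    """Synthetic data: age influences both variables; the bill has no
    effect on risk. All numbers here are made up for illustration."""
    rng = random.Random(seed)
    age = [rng.uniform(20, 80) for _ in range(n)]
    bill = [50 + 2.0 * a + rng.gauss(0, 20) for a in age]  # depends on age only
    risk = [0.05 * a + rng.gauss(0, 1) for a in age]       # depends on age only
    return age, bill, risk

if __name__ == "__main__":
    age, bill, risk = simulate()
    print("corr(bill, risk)         =", round(pearson(bill, risk), 3))
    print("partial corr given age   =", round(partial_corr(bill, risk, age), 3))
```

Partial correlation only removes linear effects of variables you have measured, so it is a diagnostic, not a proof of causation.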
A few great answers from our LinkedIn groups:
Another great answer from Bipin Chadha:
The best book that I have seen so far on the topic is Causality by Judea Pearl. He discusses the topic in a lot of detail: direct and indirect effects, confounding, counterfactuals, etc.
Please note that the causal model is interested in describing the mechanism (why something is happening) – essentially a deeper understanding than is possible through correlation analysis.
The technique that I have used often is roughly described as:
Also, if you are dealing with more than 200 variables, you will find correlations that look statistically significant but are actually an artifact of the large number of variables. Example: simulate 50 observations of 10,000 variables, where every value is an independent random number. Chances are you will find, among these 10,000 variables, two that are highly correlated, even though the true correlation between any two of these variables is (by design!) zero.
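The simulation described above can be sketched in a few lines of standard-library Python. This is a scaled-down version (400 variables instead of 10,000, an assumption made only to keep the pure-Python run time short); the function names are hypothetical. Every variable is independent Gaussian noise, yet the largest pairwise correlation is far from zero.

```python
import math
import random

def standardize(values):
    """Center a list and scale it to unit length, so that the dot product
    of two standardized lists equals their Pearson correlation."""
    mean = sum(values) / len(values)
    centered = [v - mean for v in values]
    norm = math.sqrt(sum(c * c for c in centered))
    return [c / norm for c in centered]

def max_abs_correlation(n_obs, n_vars, seed=0):
    """Simulate independent noise variables and return the correlation with
    the largest absolute value, plus the pair of variable indices."""
    rng = random.Random(seed)
    columns = [standardize([rng.gauss(0.0, 1.0) for _ in range(n_obs)])
               for _ in range(n_vars)]
    best, best_pair = 0.0, None
    for i in range(n_vars):
        for j in range(i + 1, n_vars):
            r = sum(a * b for a, b in zip(columns[i], columns[j]))
            if abs(r) > abs(best):
                best, best_pair = r, (i, j)
    return best, best_pair

if __name__ == "__main__":
    n_vars = 400
    r, pair = max_abs_correlation(n_obs=50, n_vars=n_vars)
    n_pairs = n_vars * (n_vars - 1) // 2
    print(f"largest |r| among {n_pairs} pairs: {abs(r):.2f} for variables {pair}")
```

The number of pairs grows with the square of the number of variables, so the more variables you screen, the larger the best spurious correlation becomes for a fixed number of observations.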
There are ways to deal with this issue, but this will be the subject of another discussion.
As David Hume tells us, all we observe in the world is constant conjunction. Cause and effect is something we infer, guided by theory and experiments.
Thanks for the insight, some valuable information here. I look forward to exploring some of it.
Vincent Granville said:
A few great answers from our LinkedIn groups:
- David Freedman is the author of an excellent book, "Statistical Models: Theory and Practice", which discusses the issue of causation. It is an unusual statistics book in that it really gets into the issue of model assumptions. I highly recommend it. It claims to be introductory, but I believe that a semester or two of mathematical statistics as a prerequisite would be helpful.
- In the time series context, you can run a VAR and then test for Granger causality to see if one variable is really "causing" the other, where "causing" is defined by Granger (see any of Granger's books for the technical definition of Granger causality). R has a nice package called vars which makes building VAR models and doing the testing extremely straightforward.
- Correlation does not imply causation, although where there is causation you will often, but not always, have correlation. Causality analysis can be done by learning Bayesian networks from the data. See the excellent tutorial "A Tutorial on Learning With Bayesian Networks" by David Heckerman.
- Stefan Conrady who manages the company called "Bayesia Networks" once told me his methodology can identify causation.
- For time series, see http://en.wikipedia.org/wiki/Granger_causality/
- The only way to discriminate between correlation and coincidence is through a controlled experiment. Design the experiment such that you can test the effect of each parameter independently of the others.
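As a rough illustration of the controlled-experiment point in the last bullet, here is a minimal sketch on synthetic data. The setup is entirely hypothetical: a hidden factor influences both who gets the "treatment" and the outcome, and the true effect of the treatment is fixed at 1.0. When people self-select, the naive difference in means is badly biased; when treatment is assigned by coin flip, the same estimator recovers something close to the true effect.

```python
import random

def mean(values):
    return sum(values) / len(values)

def diff_in_means(treated_flags, outcomes):
    """Average outcome of the treated group minus that of the control group."""
    treated = [y for t, y in zip(treated_flags, outcomes) if t]
    control = [y for t, y in zip(treated_flags, outcomes) if not t]
    return mean(treated) - mean(control)

def estimate_effect(randomized, n=20000, true_effect=1.0, seed=0):
    """Simulate n subjects and return the naive difference-in-means estimate.
    All coefficients below are assumptions chosen for illustration."""
    rng = random.Random(seed)
    flags, outcomes = [], []
    for _ in range(n):
        hidden = rng.gauss(0, 1)  # unobserved confounder
        if randomized:
            treated = rng.random() < 0.5               # coin flip, independent of hidden
        else:
            treated = hidden + rng.gauss(0, 1) > 0     # self-selection driven by hidden
        outcome = true_effect * treated + 2.0 * hidden + rng.gauss(0, 1)
        flags.append(treated)
        outcomes.append(outcome)
    return diff_in_means(flags, outcomes)

if __name__ == "__main__":
    print("observational estimate:", round(estimate_effect(randomized=False), 2))
    print("randomized estimate:   ", round(estimate_effect(randomized=True), 2))
```

Randomization works here because the coin flip cannot be correlated with the hidden factor, so the two groups differ, on average, only in the treatment itself.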
Doing more sports is correlated with less obesity. So someone decides that all kids should have more physical education at school, in order to fight obesity. However, lack of sports is not the root cause of obesity: bad eating habits are, and that's what should be addressed first to fix the problem. Then, replacing sports with mathematics or foreign languages would make American kids both less obese (once eating habits are fixed) and more educated at the same time.
A lot can be done with black-box pattern detection, where patterns are found but not understood. For many applications (e.g. high-frequency trading) it's fine as long as your algorithm works. But in other contexts (e.g. root cause analysis for cancer eradication), deeper investigation is needed for higher success. And in all contexts, identifying and weighting the true factors that explain the cause usually allows for better forecasts, especially if good model selection, model fitting and cross-validation are performed. But if advanced modeling requires paying a high salary to a statistician for 12 months, maybe the ROI becomes negative and black-box brute force performs better, ROI-wise. In both cases, whether caring about cause or not, it is still science. Indeed it is actually data science - and it includes an analysis to figure out when and whether deeper statistical science is required. And in all cases, it always involves cross-validation and design of experiments. Only the theoretical statistical modeling aspect can be ignored. Other aspects, such as scalability and speed, must be considered, and this is science too: data and computer science.