Correlation vs. causation

What techniques can discriminate between coincidences, correlations, and real causation? Also, if you have a model where the response is highly correlated with (say) 5 non-causal variables and 2 direct causal variables, how do you assign a weight to the two causal variables?

Can you provide examples of successful cause detection? In which contexts is it important to detect the true causes? In which contexts can causation be ignored, as long as your predictive model generates great ROI?

    Vincent Granville

    Another example:

    Doing more sports is correlated with less obesity. So someone decides that all kids should have more physical education at school in order to fight obesity. However, lack of sports is not the root cause of obesity: bad eating habits are, and that is what should be addressed first to fix the problem. Then, replacing sports with mathematics or foreign languages would make American kids both less obese (once eating habits are fixed) and more educated at the same time.
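
    The pattern described here, where a third factor drives both variables, can be illustrated with a small simulation. This is only a sketch under made-up assumptions: the variable names (`diet`, `sports`, `obesity`), the coefficients, and the premise that diet alone drives both quantities are hypothetical and chosen to mirror the example, not taken from real data. It compares the raw correlation between sports and obesity with the partial correlation once diet is accounted for.

```python
import random

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def residuals(y, x):
    """Residuals of a simple least-squares regression of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return [b - my - slope * (a - mx) for a, b in zip(x, y)]

rng = random.Random(0)  # fixed seed so the run is repeatable
n = 20000

# Hypothetical data: diet quality is the only driver of both variables.
diet = [rng.gauss(0, 1) for _ in range(n)]
sports = [0.8 * d + rng.gauss(0, 1) for d in diet]    # better diet, more sports
obesity = [-1.0 * d + rng.gauss(0, 1) for d in diet]  # better diet, less obesity

# Raw correlation: clearly negative, although sports has no effect here.
raw_corr = pearson(sports, obesity)

# Partial correlation: correlate what is left after removing diet from both.
partial_corr = pearson(residuals(sports, diet), residuals(obesity, diet))

print(f"raw correlation:     {raw_corr:.3f}")
print(f"partial correlation: {partial_corr:.3f}")
```

    In this simulated setting the raw correlation is clearly negative while the partial correlation is close to zero. Adjusting like this only works when the confounder is measured and the relationships are roughly linear; it says nothing about confounders that were never recorded.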

    Vincent Granville

    A lot can be done with black-box pattern detection, where patterns are found but not understood. For many applications (e.g. high-frequency trading) it's fine as long as your algorithm works. But in other contexts (e.g. root cause analysis for cancer eradication), deeper investigation is needed for higher success. And in all contexts, identifying and weighting the true factors that explain the cause usually allows for better forecasts, especially if good model selection, model fitting and cross-validation are performed. But if advanced modeling requires paying a high salary to a statistician for 12 months, maybe the ROI becomes negative and black-box brute force performs better, ROI-wise. In both cases, whether you care about cause or not, it is still science. Indeed, it is data science, and it includes an analysis to figure out when and whether deeper statistical science is required. And in all cases, it involves cross-validation and design of experiments. Only the theoretical statistical modeling aspect can be ignored. Other aspects, such as scalability and speed, must be considered, and this is science too: data and computer science.
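
    One way to see why true factors tend to give better forecasts is a toy comparison between a model fitted on a causal variable and one fitted on a correlated proxy. The sketch below is built on hypothetical assumptions throughout: the data generator `sample`, the coefficient 2.0, and the noise levels are invented for illustration. It uses a simple held-out set rather than full k-fold cross-validation, plus a third data set in which the proxy no longer tracks the cause.

```python
import random

def fit_line(x, y):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return slope, my - slope * mx

def mse(x, y, model):
    """Mean squared error of a fitted (slope, intercept) line on x, y."""
    slope, intercept = model
    return sum((b - (slope * a + intercept)) ** 2
               for a, b in zip(x, y)) / len(x)

rng = random.Random(1)  # fixed seed so the run is repeatable

def sample(n, proxy_tracks_cause):
    """Hypothetical generator: the response depends only on `cause`."""
    cause = [rng.gauss(0, 1) for _ in range(n)]
    if proxy_tracks_cause:
        proxy = [c + rng.gauss(0, 0.3) for c in cause]  # correlated, not causal
    else:
        proxy = [rng.gauss(0, 1) for _ in range(n)]     # link to cause is broken
    y = [2.0 * c + rng.gauss(0, 0.5) for c in cause]
    return cause, proxy, y

cause_tr, proxy_tr, y_tr = sample(5000, True)    # training data
cause_te, proxy_te, y_te = sample(5000, True)    # held-out data, same regime
cause_sh, proxy_sh, y_sh = sample(5000, False)   # data after the link breaks

causal_model = fit_line(cause_tr, y_tr)
proxy_model = fit_line(proxy_tr, y_tr)

results = {
    "causal, held-out": mse(cause_te, y_te, causal_model),
    "proxy, held-out": mse(proxy_te, y_te, proxy_model),
    "causal, shifted": mse(cause_sh, y_sh, causal_model),
    "proxy, shifted": mse(proxy_sh, y_sh, proxy_model),
}
for name, err in results.items():
    print(f"{name:18s} MSE = {err:.2f}")
```

    In this toy setup the proxy model looks acceptable on held-out data from the same regime, which is all that validation on historical data can check, and degrades sharply once the proxy stops tracking the cause. The causal model's error stays about the same in both cases. Whether that extra robustness is worth the modeling cost is the ROI question raised above.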

    Alex Esterkin

    Facebook data scientists hilariously debunk a Princeton study, based on "correlation equals causation" reasoning, that says Facebook will lose 80% of its users. They do so by "proving" that Princeton will lose all its students by 2021.