"Correlation does not equal causation" is a mantra drilled into data scientists from an early age.
That's fine.
But very few people talk about the follow-on question:
How exactly do you determine causation?
This problem is compounded because most books and examples are based on standard datasets (e.g. Boston, Iris).
These examples do not discuss causation because the features have already been chosen on the assumption that they are causal (e.g. the factors affecting house prices).
So, if we start from the beginning (without simplified examples), how do you know whether a particular variable is a causal variable?
Firstly, causality cannot be determined from data alone.
Data gives correlation, but data alone cannot determine causation.
To determine causation, we need to perform an experiment or a controlled study.
In a statistical sense, two or more variables are correlated if their values change together, i.e. as one increases, the other tends to increase (positive correlation) or decrease (negative correlation). On the other hand, if there is a causal relationship between two variables, then the occurrence of one depends on the other, i.e. they exhibit a cause-and-effect relationship. For example, "smoking causes lung cancer" is a causal relationship, while smoking is correlated with alcoholism but does not cause alcoholism.
Correlation is typically measured using Pearson’s coefficient or Spearman’s coefficient. If there is correlation, then further investigation is needed to establish if there is a causal relationship.
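As a rough illustration of the two coefficients, here is a minimal sketch in plain Python. The helper names (`pearson`, `ranks`, `spearman`) and the data are made up for this example; in practice a statistics library would normally be used instead.

```python
import math

def pearson(x, y):
    """Pearson's r: covariance scaled by both standard deviations."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ranks(values):
    """Ranks from 1 to n; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result

def spearman(x, y):
    """Spearman's rho: Pearson's r computed on the ranks."""
    return pearson(ranks(x), ranks(y))

# Made-up data: y grows monotonically with x, but not linearly.
x = [1, 2, 3, 4, 5, 6]
y = [v ** 3 for v in x]

r = pearson(x, y)
rho = spearman(x, y)
print(f"Pearson r = {r:.3f}, Spearman rho = {rho:.3f}")
```

Pearson measures how close the relationship is to a straight line, while Spearman only asks whether the ordering is preserved, so the two can differ on the same data. Neither number says anything about which variable, if any, is the cause.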
The most effective way of establishing causation is by means of a controlled study.
In a controlled study, the sample or population is split in two, with both groups being comparable in almost every way.
The two groups then receive different treatments, and the outcomes of each group are assessed.
For example, in medical research, one group is given a placebo whereas the other group is given a new medication.
So, in a nutshell: "To find out what happens when you change something, it is necessary to change it." ... There are things you learn from perturbing a system that you'll never find out from any amount of passive observation.
Source: http://people.umass.edu/~stanek/pdffiles/causal-holland.pdf
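The difference between passive observation and a controlled study can be sketched with a small simulation. Everything here is an assumption made for illustration: the hidden "health" factor, the effect size of 2.0, and the function names are hypothetical and the data is synthetic.

```python
import random

random.seed(0)
TRUE_EFFECT = 2.0   # assumed effect of the treatment on the outcome
N = 20_000          # number of simulated people

def outcome(health, treated):
    # The outcome depends on the hidden factor, the treatment, and noise.
    return 3.0 * health + TRUE_EFFECT * treated + random.gauss(0, 1)

def mean_difference(rows):
    """Mean outcome of the treated group minus that of the untreated group."""
    treated = [y for t, y in rows if t]
    control = [y for t, y in rows if not t]
    return sum(treated) / len(treated) - sum(control) / len(control)

observational, randomized = [], []
for _ in range(N):
    health = random.gauss(0, 1)
    # Passive observation: healthier people opt in more often (confounding).
    chose = health > random.gauss(0, 1)
    observational.append((chose, outcome(health, chose)))
    # Controlled study: a coin flip decides, so the two groups are comparable.
    assigned = random.random() < 0.5
    randomized.append((assigned, outcome(health, assigned)))

obs_estimate = mean_difference(observational)
rct_estimate = mean_difference(randomized)
print(f"observational estimate: {obs_estimate:.2f}")
print(f"randomized estimate:    {rct_estimate:.2f}")
```

In the observational data the treated group was healthier to begin with, so the naive comparison overstates the effect. Random assignment breaks the link between the hidden factor and the treatment, which is why the randomized estimate should land close to the assumed effect.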
The design of controlled experiments is a non-trivial exercise.
(Figure not reproduced here; adapted from source.)
Finally, there are statistical methods such as Granger causality, which tests whether past values of one time series help predict another. This demonstrates a limited, predictive form of causality rather than true cause and effect.
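To give a feel for the idea behind Granger causality, here is a minimal one-lag sketch in plain Python. It compares a model that predicts a series from its own past with one that also uses the past of a second series, and reports an F-statistic. The function names (`ols_rss`, `granger_f`) and the synthetic series are assumptions for this example; a real analysis would use a statistics package, more lags, and a proper p-value.

```python
import random

def ols_rss(rows, targets):
    """Residual sum of squares of a least-squares fit (normal equations)."""
    k = len(rows[0])
    # Build the augmented system [X'X | X'y].
    a = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * t for r, t in zip(rows, targets))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda row: abs(a[row][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for row in range(k):
            if row != col:
                f = a[row][col] / a[col][col]
                a[row] = [u - f * v for u, v in zip(a[row], a[col])]
    beta = [a[i][k] / a[i][i] for i in range(k)]
    return sum((t - sum(b * v for b, v in zip(beta, r))) ** 2
               for r, t in zip(rows, targets))

def granger_f(cause, effect):
    """F-statistic: does lagged `cause` improve the prediction of `effect`?"""
    targets = effect[1:]
    restricted = [[1.0, effect[t - 1]] for t in range(1, len(effect))]
    full = [[1.0, effect[t - 1], cause[t - 1]] for t in range(1, len(effect))]
    rss_r, rss_f = ols_rss(restricted, targets), ols_rss(full, targets)
    dof = len(targets) - 3  # observations minus parameters in the full model
    return (rss_r - rss_f) / (rss_f / dof)  # one restriction tested

# Synthetic series: y depends on the previous value of x, not the reverse.
random.seed(1)
x = [random.gauss(0, 1) for _ in range(500)]
y = [0.0]
for t in range(1, 500):
    y.append(0.5 * y[t - 1] + 0.8 * x[t - 1] + random.gauss(0, 1))

f_x_to_y = granger_f(x, y)
f_y_to_x = granger_f(y, x)
print(f"F (x helps predict y): {f_x_to_y:.1f}")
print(f"F (y helps predict x): {f_y_to_x:.1f}")
```

A large F-statistic in one direction only says that one series carries predictive information about the other. A hidden third variable driving both would produce the same pattern, which is one of the limitations mentioned above.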
https://abs.gov.au/websitedbs/a3121120.nsf/home/statistical+languag...
Why do we need causality in data science
Image source: Khan academy
Comment
Well stated. You can try to boost the national economy by encouraging women to wear miniskirts, but it's probably not going to work. Causation is something that has to be tested directly, and on many fronts, such as economics, this is no easy task.
My brilliant insight: under information theory, signal and noise are a package deal. People usually measure correlation and infer causality by matching signals.
But, if the noise profile for a dependent signal cannot possibly match the noise profile of an influencing signal, how could causality be true? That is, comparing both signal and noise gives a possible negation, but does not prove causation.
© 2019 Data Science Central ® Powered by