
Correlation does not equal causation, but how exactly do you determine causation?





"Correlation does not equal causation" is a mantra drilled into data scientists from early on.

That’s fine.

But very few talk about the follow-on question:

How exactly do you determine causation?

This problem is compounded because most books and examples are based on standard datasets (e.g., Boston housing, Iris).

These examples do not discuss causation because the chosen features are already assumed to be causal (e.g., the features in a house-price dataset are assumed to affect the price).

So, if we start from the beginning (without simplified examples), how do you know whether a particular variable is a causal variable?

Firstly, causality cannot be determined from data alone.

Data gives correlation, but data alone cannot determine causation.

To determine causation, we need to perform an experiment or a controlled study.


In a statistical sense, two or more variables are correlated if their values change together, i.e., as one increases, the other tends to increase (positive correlation) or decrease (negative correlation). If there is a causal relationship between two variables, on the other hand, then the occurrence of one depends on the other, i.e., they exhibit a cause-and-effect relationship. For example, "smoking causes lung cancer" describes a causal relationship, whereas smoking is correlated with alcoholism but does not cause it.

Correlation is typically measured using Pearson’s coefficient (which captures linear association) or Spearman’s coefficient (which captures monotonic association based on ranks). If there is correlation, then further investigation is needed to establish whether there is a causal relationship.
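To make the two coefficients concrete, here is a minimal sketch that computes both by hand using only the Python standard library. The data (y = exp(x)) and the helper names `pearson`, `ranks` and `spearman` are assumptions made for illustration; in practice a library such as SciPy would normally be used instead.

```python
import math

def pearson(x, y):
    """Pearson correlation: covariance scaled by both standard deviations."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Illustrative data: y grows monotonically with x, but not linearly.
x = [0.5 * i for i in range(1, 21)]
y = [math.exp(v) for v in x]

pearson_r = pearson(x, y)
spearman_rho = spearman(x, y)
print(f"Pearson:  {pearson_r:.3f}")
print(f"Spearman: {spearman_rho:.3f}")
```

Because the relationship is perfectly monotonic, Spearman’s coefficient is 1, while Pearson’s coefficient is noticeably lower because the relationship is not linear. Neither number says anything about whether x causes y.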

How can causation be established?

The most effective way of establishing causation is by means of a controlled study.

In a controlled study, the sample or population is split in two, ideally by random assignment, so that both groups are comparable in almost every way.

The two groups then receive different treatments, and the outcomes of each group are assessed. 

For example, in medical research, one group is given a placebo whereas the other group is given a new medication.

So, in a nutshell: "To find out what happens when you change something, it is necessary to change it." There are things you learn from perturbing a system that you'll never find out from any amount of passive observation.
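The controlled study described above can be sketched as a small simulation. Everything here is an assumption chosen for illustration: the population size, the baseline distribution, and a made-up treatment effect `TRUE_EFFECT` that the experiment is supposed to recover. It is not a model of any real trial.

```python
import random
import statistics

random.seed(0)

TRUE_EFFECT = 2.0  # hypothetical effect of the medication, chosen for the demo

def outcome(baseline, treated):
    """Observed outcome = subject's baseline + treatment effect (if any) + noise."""
    return baseline + (TRUE_EFFECT if treated else 0.0) + random.gauss(0, 1)

# Each subject has a baseline level that also drives the outcome.
subjects = [random.gauss(50, 5) for _ in range(20_000)]

# Random assignment: shuffling makes the two groups comparable on average,
# including on characteristics we did not measure.
random.shuffle(subjects)
half = len(subjects) // 2
treatment_group = subjects[:half]   # receives the new medication
control_group = subjects[half:]     # receives the placebo

treated_outcomes = [outcome(b, True) for b in treatment_group]
control_outcomes = [outcome(b, False) for b in control_group]

estimated_effect = (
    statistics.mean(treated_outcomes) - statistics.mean(control_outcomes)
)
print(f"Estimated effect: {estimated_effect:.2f} (true effect: {TRUE_EFFECT})")
```

Because assignment is random, the difference in group means estimates the treatment effect up to sampling noise. Without randomization, the same difference could reflect pre-existing differences between the groups.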



The design of controlled experiments is a non-trivial exercise:

  • Subjects might drop out of the study or not follow instructions, among other issues.
  • You will need to make assumptions about how things are related in order to draw inferences.
  • You may have incomplete or imprecise data.
  • The target causal quantity of interest may not be well defined.
  • Confounding variables. A confounder is a variable that influences both the dependent variable and the independent variable, causing a spurious association.
  • Selection bias (self-selection, truncated samples).
  • Measurement error (which can induce confounding, not only noise).
  • Misspecification (e.g., the wrong functional form).
  • External validity problems (wrong inference to the target population).
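The confounding item in the list above is easy to reproduce in a simulation. The sketch below assumes a made-up confounder `z` that drives both `x` and `y`, while `x` has no effect on `y` at all; the variable names, sample size and noise levels are illustrative assumptions.

```python
import random

random.seed(1)

def mean(values):
    return sum(values) / len(values)

def pearson(x, y):
    """Pearson correlation coefficient."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def residuals(y, z):
    """What is left of y after removing a straight-line fit on z."""
    mz, my = mean(z), mean(y)
    slope = (
        sum((a - mz) * (b - my) for a, b in zip(z, y))
        / sum((a - mz) ** 2 for a in z)
    )
    return [b - (my + slope * (a - mz)) for a, b in zip(z, y)]

n = 5000
z = [random.gauss(0, 1) for _ in range(n)]    # confounder
x = [zi + random.gauss(0, 1) for zi in z]     # driven by z
y = [zi + random.gauss(0, 1) for zi in z]     # driven by z, NOT by x

naive = pearson(x, y)
# Partial correlation: correlate what remains of x and y once z is removed.
adjusted = pearson(residuals(x, z), residuals(y, z))

print(f"Correlation of x and y:            {naive:.3f}")
print(f"Correlation after adjusting for z: {adjusted:.3f}")
```

The naive correlation is clearly positive even though `x` does not influence `y`; it largely disappears once `z` is accounted for. The adjustment is only possible here because the confounder was measured, which is often not the case in practice.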

Adapted from source

Finally, there are statistical methods, such as the Granger causality test, that can provide some evidence of causality (with limitations).
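As a rough illustration of the idea behind Granger causality, the sketch below checks whether the previous value of `x` improves a one-lag prediction of `y` beyond what the previous value of `y` already provides. The simulated series, the single lag, and the helper names `ols_rss` and `granger_f` are assumptions for the demo; a real analysis would use an established implementation (for example in statsmodels), choose the lag order carefully, and check stationarity.

```python
import random

random.seed(2)

def ols_rss(rows, targets):
    """Residual sum of squares of a least-squares fit via normal equations."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(k)]
    a = [row[:] + [b] for row, b in zip(xtx, xty)]  # augmented matrix
    # Gaussian elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(col + 1, k):
            factor = a[r][col] / a[col][col]
            for c in range(col, k + 1):
                a[r][c] -= factor * a[col][c]
    # Back substitution.
    beta = [0.0] * k
    for i in reversed(range(k)):
        tail = sum(a[i][j] * beta[j] for j in range(i + 1, k))
        beta[i] = (a[i][k] - tail) / a[i][i]
    return sum(
        (t - sum(b * v for b, v in zip(beta, r))) ** 2
        for r, t in zip(rows, targets)
    )

def granger_f(x, y):
    """F statistic for 'lagged x helps predict y', using one lag."""
    targets = y[1:]
    restricted = [[1.0, y[t - 1]] for t in range(1, len(y))]
    unrestricted = [[1.0, y[t - 1], x[t - 1]] for t in range(1, len(y))]
    rss_r = ols_rss(restricted, targets)
    rss_u = ols_rss(unrestricted, targets)
    dof = len(targets) - 3  # observations minus unrestricted parameters
    return (rss_r - rss_u) / (rss_u / dof)

# Simulated series in which y depends on the previous value of x.
n = 500
x = [random.gauss(0, 1) for _ in range(n)]
y = [0.0]
for t in range(1, n):
    y.append(0.3 * y[t - 1] + 0.8 * x[t - 1] + random.gauss(0, 1))

f_x_to_y = granger_f(x, y)
f_y_to_x = granger_f(y, x)
print(f"F statistic, x -> y: {f_x_to_y:.1f}")
print(f"F statistic, y -> x: {f_y_to_x:.1f}")
```

A large F statistic in one direction only indicates that past values of `x` help predict `y`. That is predictive precedence, not proof of a causal mechanism, and a hidden common driver can produce the same pattern.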



Why do we need causality in data science

Image source: Khan academy


Comment by Marek J Druzdzel on December 18, 2019 at 10:57am

The basic claim of the article, i.e., "data alone cannot determine causation", is false. Under certain circumstances, which are not really uncommon, you can determine causation from data. Please see the excellent book by Spirtes, Glymour and Scheines, "Causation, Prediction and Search", also available online in PDF format. Granger causality is quite dubious as far as causation is concerned: it is based on too many assumptions. Finally, experiments alone are not sufficient to establish causation, as you need to make assumptions there as well. Complicated, isn't it :-)?

Comment by John L. Ries on October 7, 2019 at 11:16am

Well stated.  You can try to boost the national economy by encouraging women to wear miniskirts, but it's probably not going to work.  Causation is something that has to be tested directly, and on many fronts, such as economics, this is no easy task.

Comment by Lance Norskog on October 6, 2019 at 3:50pm

My brilliant insight: under information theory, signal and noise are a package deal. People usually measure correlation and infer causality by matching signals.

But if the noise profile for a dependent signal cannot possibly match the noise profile of an influencing signal, how could causality be true? That is, comparing both signal and noise gives a possible negation, but does not prove causation.
