Science, data science and causality - unresolved contradictions

The term "causality" has had a sad fate: depending on the point of view, it is treated either as a legitimate son (of any "truth seeker"?) or as a bastard. On the one hand, science claims that its goal is to find the causes of things. On the other hand, statistics, the science of all sciences, has lived in peace for about a century with the idea that "correlation is not causation" (without going beyond that flat statement) and has generated countless models in which causation never played any part.

Attempts to resolve the problem have also lasted about a century, starting with S. Wright's path analysis and J. Neyman's seminal ideas on potential outcomes, and blossoming in recent decades in the theories of P. Spirtes, J. Pearl (Directed Acyclic Graphs, DAG), D. Rubin (Potential Outcomes) and many others. But the gap between the three grand approaches - classical statistical inference (based on the idea of significance, or non-randomness), statistical (machine) learning (based on the idea of minimizing error on testing data), and causality theory per se - does not seem to be narrowing. Is data science going to play the role of the great unifier? Not very likely, at least judging from some opinions on this Central: “… from a typical 600-pages textbook on statistics, about 20 pages are relevant to data science, and these 20 pages can be compressed in 0.25 page.” Or: “Data science uses a bit of old statistical science:…A/B and multivariate testing, but without traditional tests of hypotheses…” (V. Granville, 2014, http://www.datasciencecentral.com/profiles/blogs/data-science-witho...).
Causality was also specifically addressed on this portal:
http://www.datasciencecentral.com/forum/topics/correlation-vs-causa... (2012-15). The summary was favorable to the DAG approach and to the BayesiaLab software based on it. Yet the problems persist.

It can easily be shown, for example, that the DAG approach fails even in the simplest regression-like situation, where all X variables are exogenous: it does not distinguish causal from non-causal variables. Accordingly, DAG theory does not answer the original question of the blog mentioned above: can one separate causal and non-causal variables in observational studies? In some situations the tests proposed by B. Schölkopf http://ml.dcs.shef.ac.uk/masamb/schoelkopf.pdf (2012) may help, but they address the direction of the effect (how to tell that time affects economic growth and not the other way around), not its size. Within the linear model, certain criteria for the distinction were proposed in I. Mandel (2017) Troublesome Dependency Modeling: Causality, Inference, Statistical Learning https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2984045 (section 4.2.2), together with a detailed analysis of the fundamental problems of the main causality approaches (this still needs additional testing). For binary variables, the problem of estimating causal coefficients is solved analytically in S. Lipovetsky and I. Mandel, Modeling Probability of Causal and Random Impacts, Journal of Modern Applied Statistical Methods, 2015, Vol. 14, No. 1, 180-195.
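To make the regression-like situation concrete, here is a small simulated sketch in Python. It is only one possible reading of the claim, not the construction from the cited papers: the data-generating process, the variable names (`x_causal`, `x_proxy`, the hidden `u`) and the helper `ols_with_t` are all assumptions made up for illustration. One regressor truly affects Y; the other is merely correlated with an unobserved cause of Y.

```python
import numpy as np

def ols_with_t(X, y):
    """OLS coefficients and classical t-statistics (intercept is column 0)."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    dof = len(y) - X1.shape[1]
    sigma2 = resid @ resid / dof
    cov = sigma2 * np.linalg.inv(X1.T @ X1)
    return beta, beta / np.sqrt(np.diag(cov))

rng = np.random.default_rng(0)
n = 5000
u = rng.normal(size=n)             # hidden cause of y, never observed (assumed)
x_causal = rng.normal(size=n)      # truly affects y
x_proxy = u + rng.normal(size=n)   # correlated with u only; has no effect on y
y = 1.0 * x_causal + 1.0 * u + rng.normal(size=n)

beta, t = ols_with_t(np.column_stack([x_causal, x_proxy]), y)
print("coefficients (intercept, causal, proxy):", np.round(beta, 3))
print("t-statistics (intercept, causal, proxy):", np.round(t, 1))
```

In this setup both slopes come out clearly different from zero, and nothing in the regression output itself marks `x_proxy` as non-causal; that label exists only because the simulation was written that way.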

Putting the question more generally: is there common ground for all three approaches? Or, more explicitly, what do people think about the following statements, all related to a dependency model of the form Y=F(X):

a) If the coefficients of the model are insignificantly different from zero, does it mean the absence of causality?

b) If, contrary to a), the coefficients are statistically significantly different from zero (with all other precautions related to high power, etc. - see "reproducibility crisis" https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2984045), does it mean causality?

c) If the answer to b) is negative (i.e. causality is not guaranteed), what does this significance mean, if anything?

d) If a statistical learning algorithm shows very low errors on testing data, but it is known for sure that the variables are not of a causal nature (as is very often the case in deep learning, for instance), what follows from it? Should one prefer a worse model with causal variables (if available) or not?

e) If one has three models of the same process and data (one with very good significance, another with very good "learning qualities", and a third which is surely the causal one), and the quality criteria contradict each other, which one should be preferred?
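Questions a) and d) can be illustrated with toy simulations. The Python sketch below is not an answer to them; it only shows, under assumed data-generating processes, that both situations are easy to construct. All names (`slope_and_t`, `holdout_mse`, `cause`, `symptom`) and all numeric settings are hypothetical choices for the illustration.

```python
import numpy as np

def slope_and_t(x, y):
    """Slope of y ~ a + b*x and its classical (homoskedastic) t-statistic."""
    xc = x - x.mean()
    b = (xc @ y) / (xc @ xc)
    a = y.mean() - b * x.mean()
    resid = y - a - b * x
    se = np.sqrt(resid @ resid / (len(y) - 2) / (xc @ xc))
    return b, b / se

rng = np.random.default_rng(1)

# Question a): x drives y completely, but through a symmetric (quadratic)
# effect, so the coefficient of a linear model is close to zero.
x = np.linspace(-2.0, 2.0, 401)
y = x**2 + rng.normal(scale=0.5, size=x.size)
b_lin, t_lin = slope_and_t(x, y)
corr_quadratic = np.corrcoef(x**2, y)[0, 1]

# Question d): 'symptom' is a consequence of the outcome, not a cause of it,
# yet it predicts the outcome on held-out data far better than the true cause.
n = 4000
cause = rng.normal(size=n)
outcome = cause + rng.normal(size=n)
symptom = outcome + rng.normal(scale=0.1, size=n)
train, test = slice(0, n // 2), slice(n // 2, n)

def holdout_mse(feature):
    """Fit a straight line on the training half, return MSE on the test half."""
    coefs = np.polyfit(feature[train], outcome[train], 1)
    pred = np.polyval(coefs, feature[test])
    return float(np.mean((outcome[test] - pred) ** 2))

mse_cause = holdout_mse(cause)
mse_symptom = holdout_mse(symptom)

print("a) linear slope, t-statistic:", round(b_lin, 3), round(t_lin, 2))
print("a) correlation of y with x^2:", round(corr_quadratic, 2))
print("d) test MSE, causal vs non-causal feature:", mse_cause, mse_symptom)
```

In the first case the insignificant linear coefficient coexists with a fully causal X; in the second the lower test error belongs to the variable that would be useless for any intervention. Which model to prefer then depends on the purpose, which is exactly what the questions above ask.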

There are many similar questions. Whether we want it or not, either unifying the old paradigms or replacing them with a new one is inevitable. And these or similar questions should be answered if data science claims to be this new paradigm.

 


Replies to This Discussion

Hello, Mandel, good discussion about the data science course. A qualified knowledge of the approaches and techniques of successful business conduct is indispensable for corporations to thrive. To become a professional, it is a well-known fact that experience matters. But in the present market scenario, the main thing that matters is getting the thinking of a professional. Maybe a skilled executive possesses an inherent sense of what is cardinally profitable for a firm and likewise what is detrimental. Thanks to the “ExcelR Solutions” institute, which provides Agile Certification, Tableau online training, PMP Certification course, Business Analytics training and Data Scientist course online or offline.


© 2017 Data Science Central
