As data science evolves into a separate and distinct scientific and business discipline, there is talk about the death of traditional statistics. It is true that today's large data sets are unlike the ones we analyzed in graduate statistics classes. It is also true that big data sets have different properties than small data sets. It is very true that one can lie with statistics and present an illusion of reality. It is certainly true that many professional statisticians lack business acumen and communication skills.
Yet, while the sins of statistics are well-known, assertions of its death are premature. In fact, the proper use of statistics can help us understand massive quantities of data and help clients make better decisions. Statistics professionals are among the smartest folks I know and - when their skills are properly used - should be a part of a well-rounded team of data scientists.
Data science uses a myriad of tools and techniques, including but not limited to: math, statistics, computer science, hacking, business acumen, storytelling and business communication skills. Statistics is part of the tool kit and very important.
Prudent use of statistics can be very useful for finding meaning in messy, large data sets. Misuse or intentional abuse of statistics can mislead and present a false view of reality. At best, statistics helps us simplify complexity to make better decisions faster. At worst, statistics can define and measure the wrong things, create dangerous illusions of reality and cause us to make bad decisions.
One of my favorite books is "How to Lie with Statistics" (1954) by Darrell Huff. Huff demonstrated that you can intentionally lie with statistics or make unintentional errors, and he discussed specific methods used to fool rather than to inform. He showed how statistical methods are used in reporting massive amounts of data (e.g., hard and social science, economics, business conditions, polls, census), yet without honest, clear communication and real understanding the result is semantic nonsense at best and dangerous at worst.
"The secret language of statistics, so appealing in a fact-minded culture, is employed to sensationalize, inflate, confuse, and oversimplify," warned Huff. A modern example is the Wall Street risk management models known as "Value at Risk" (VAR) prior to the 2008 financial meltdown. The various VAR models were complex and very precise - yet the assumptions embedded in the models were dangerously wrong and created an illusion of reality. Decision makers relying on VAR had a false sense of security that led to inaccurate conclusions - putting at risk not only their own firms but the entire global financial system.
Statisticians resent the old Mark Twain adage "Lies, Damned Lies and Statistics", yet the best and brightest statisticians on Wall Street (called "Quants") created a false view of reality that caused serious economic damage. If we are not careful, data science can also produce an illusion of reality that causes serious harm.
I argue the proper use of statistics can help us better understand complicated reality, make better decisions and make life better. Good quality data, scientific methods and the right statistical tools can help us find valuable, actionable insights in large data sets. Data scientists should have a solid grounding in statistical analysis - including concepts such as inference, correlation, causation and regression analysis. It is useful to understand - among other concepts - the mean, how the median is less influenced by outliers, standard deviation, how the weighting of index components affects results, correlation vs. causation, inflation-adjustment, precision vs. accuracy, the importance of using the appropriate unit of analysis, statistical vs. operational significance, and how performance data is sometimes manipulated.
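To make one of these concepts concrete, here is a minimal sketch of how the median resists outliers while the mean does not. The salary figures are hypothetical, invented purely for illustration, and the helper name `summarize` is my own.

```python
from statistics import mean, median

# Hypothetical annual salaries (in thousands of dollars) for a small team.
salaries = [48, 52, 55, 58, 61]

# The same team after adding one executive who earns far more than the rest.
salaries_with_outlier = salaries + [950]


def summarize(values):
    """Return (mean, median) for a list of numbers."""
    return mean(values), median(values)


mean_before, median_before = summarize(salaries)
mean_after, median_after = summarize(salaries_with_outlier)

print(f"Without outlier: mean={mean_before:.1f}, median={median_before:.1f}")
print(f"With outlier:    mean={mean_after:.1f}, median={median_after:.1f}")
```

With these made-up numbers, one extreme value drags the mean from 54.8 to 204, while the median only moves from 55 to 56.5. Reporting "the average salary" for the second group would be technically correct and still misleading.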
We should always be skeptical and understand how the biased or careless can manipulate or misrepresent data. Further, it is crucial to understand the distinction between "precision" and "accuracy." Precision means the state or quality of exactness and the ability of a measurement to be consistently reproduced. An example: my office is 8.4 miles from my home, rather than 8 miles. Accuracy means a faithful measurement or representation of the truth. An example: my office is 8.4 miles west of my home. A problem arises when I tell you my office is 8.4 miles east of my home - precise but not accurate. Statistical analysis will sometimes be precise but not accurate - like Wall Street VAR models.
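The same distinction can be expressed numerically. The sketch below is only an illustration under my own assumptions: it treats 8.4 miles as the true distance, uses two sets of hypothetical repeated readings, and measures accuracy as bias (average error) and precision as spread (standard deviation). The function name `bias_and_spread` is invented for this example.

```python
from statistics import mean, pstdev

TRUE_DISTANCE = 8.4  # miles; assumed here to be the true commute distance

# Hypothetical repeated readings from two imaginary odometers.
precise_but_inaccurate = [9.9, 10.0, 10.1, 10.0, 10.0]  # tight cluster, wrong place
accurate_but_imprecise = [7.4, 9.4, 8.0, 8.8, 8.4]      # scattered around the truth


def bias_and_spread(readings, truth):
    """Return (bias, spread): average error and standard deviation of readings."""
    bias = mean(readings) - truth   # accuracy: how far off we are on average
    spread = pstdev(readings)       # precision: how tightly the readings agree
    return bias, spread


bias_a, spread_a = bias_and_spread(precise_but_inaccurate, TRUE_DISTANCE)
bias_b, spread_b = bias_and_spread(accurate_but_imprecise, TRUE_DISTANCE)

print(f"Precise but inaccurate: bias={bias_a:+.2f}, spread={spread_a:.2f}")
print(f"Accurate but imprecise: bias={bias_b:+.2f}, spread={spread_b:.2f}")
```

The first set of readings agrees with itself almost perfectly yet is off by about 1.6 miles on average. The second set is noisy but centered on the true value. A tight spread alone says nothing about whether a model is measuring the right thing.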
Beware and be skeptical of key assumptions embedded in statistical models. Are you defining and measuring the right thing(s) to obtain understanding?
Most important, we need to have clarity about what we are attempting to define, measure, describe or explain. The paramount goal is clear and simple communication to the consumers of data science - the decision makers - so that they understand the analysis and can make optimal decisions.