I received a call from an old client who said his analytics team had suffered a recent string of failures that was alarming the firm and costing it money. He asked me to review and audit the team's work and analytical processes in an attempt to understand and remedy the failures. The data-crunching technology was modern, and data of varying quality and veracity was abundant.
I immediately detected three (3) red flags:
1. Almost all of the data analysts had undergraduate business degrees;
2. About a quarter of the analysts had renamed themselves as "data scientists"; and
3. None of the analysts had graduate school training or experience in a scientific discipline or strong analytic profession like law or medicine.
It appeared the analytics team had processes that:
1. Valued finding and interpreting patterns and correlations over causation;
2. Interpreted patterns where none actually existed;
3. Ignored differences in data yet focused on data similarities;
4. Failed to use scientific techniques to differentiate between signal and noise;
5. Failed to adequately interpret signal strength;
6. Failed to use proper scientific formulation of hypotheses when appropriate;
7. Spent a significant amount of time building models; and
8. Did not have expertise in designing and executing experiments.
This created ideal conditions for what is known as the "Texas Sharpshooter Fallacy," named after a Texas cowboy who fired a number of shots at the side of a barn, then painted a target centered on the biggest cluster of hits and claimed to be a sharpshooter!
In data science, the Texas Sharpshooter Fallacy leads us to falsely attribute patterns to random data. It usually arises when, in analyzing a large data set, the focus falls on only a small subset of the data, and a factor other than the one credited may be what gives the items in that subset their common property. When you find a data subset that shares some common property and attribute that property to a factor other than its actual cause, you are committing the Texas Sharpshooter Fallacy.
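As a rough illustration of how a "pattern" can emerge from random data, the sketch below scans many unrelated noise features for the one that best tracks a noise outcome. Everything in it is an assumption made for illustration: the function names, the sample size of 50, the 200 features and the seed do not come from the client engagement.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def best_noise_correlation(n_rows=50, n_features=200, seed=0):
    """Return (index, r) for the noise feature most correlated with a noise outcome.

    Hypothetical helper: every column is independent Gaussian noise,
    so any correlation found here is spurious by construction.
    """
    rng = random.Random(seed)
    outcome = [rng.gauss(0, 1) for _ in range(n_rows)]
    features = [[rng.gauss(0, 1) for _ in range(n_rows)]
                for _ in range(n_features)]
    correlations = [pearson(feature, outcome) for feature in features]
    # "Painting the target": pick the winner only after seeing all the shots.
    best = max(range(n_features), key=lambda i: abs(correlations[i]))
    return best, correlations[best]


if __name__ == "__main__":
    index, r = best_noise_correlation()
    print(f"Strongest 'pattern' in pure noise: feature {index}, r = {r:+.2f}")
```

With this many candidate features and so few rows, the winning correlation tends to look respectable even though no feature has any real relationship to the outcome. That is the barn-side target painted after the shots were fired.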
The fallacy typically arises when you have no specific hypothesis before collecting data, or when you formulate a hypothesis only after the data has been examined. It usually does not apply when you construct a hypothesis, or a prior expectation of how variables relate, before data collection and analysis. It is not scientifically proper to formulate hypotheses, or continually modify a specific hypothesis, after data collection and analysis.
In this case, the analytics team was continually constructing different hypotheses suggested by the data, which produced patterns and spurious correlations that were misinterpreted as strong signals. As a result, bad analysis caused poor decisions, leading to disastrous or sub-optimal business results.
Note that this is different from A/B testing or running well-designed and well-executed experiments without hypotheses. It may or may not be necessary to formulate a hypothesis, depending on the subject matter and context. Yet even without a hypothesis, it is prudent to use standard scientific methods to measure and record any experimental or test results for optimal decision making and continuous improvement. What you cannot do is form a specific hypothesis after data analysis, or continually construct different hypotheses suggested by the data.
Here, the analytics team was also engaging in "Cherry Picking": pointing to individual cases or data that seem to confirm a particular position while ignoring a significant portion of related cases or data that may contradict that position.
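To put a rough number on why continually constructing new hypotheses from the same data is risky, the sketch below computes the chance that at least one of several hypotheses passes a significance test purely by luck. It rests on assumptions not stated in the engagement: each hypothesis is tested at the conventional 0.05 level, and the tests are treated as independent. The function names are hypothetical.

```python
def family_wise_error_rate(n_hypotheses, alpha=0.05):
    """Probability of at least one false positive among independent tests.

    Assumes every tested hypothesis is actually false (pure noise) and
    that the tests are independent, which is a simplification.
    """
    return 1 - (1 - alpha) ** n_hypotheses


def bonferroni_threshold(n_hypotheses, alpha=0.05):
    """Per-test significance level that keeps the overall rate near alpha."""
    return alpha / n_hypotheses


if __name__ == "__main__":
    for m in (1, 5, 20, 100):
        print(f"{m:>3} hypotheses: "
              f"chance of a fluke 'finding' = {family_wise_error_rate(m):.2f}, "
              f"Bonferroni per-test level = {bonferroni_threshold(m):.4f}")
```

Under these assumptions, a single pre-specified hypothesis carries a 5% risk of a false positive, while twenty hypotheses suggested by the data carry roughly a 64% risk that at least one looks like a signal. The Bonferroni adjustment is one common, conservative way to compensate; it is shown as an option, not as the remedy used in this case.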
The team also often attempted to fit a story or pattern to a series of connected or disconnected facts, thus committing the sin of the "Narrative Fallacy." The combination of the Texas Sharpshooter Fallacy, Cherry Picking and the Narrative Fallacy caused the analytics team to often confuse signal (meaningful interpretation of data based on science that may be transformed into scientific evidence and knowledge) with noise (competing interpretation of data not grounded in science that may not be considered scientific evidence).
I suggested the organization hire real "data scientists" to work with the analytics team to avoid these problems in the future. It was clear that the team members who had renamed themselves "data scientists" had no scientific or deep analytical training or experience and could easily fall into these traps. They were not data scientists but garden-variety business or data analysts attempting to capitalize on a hot job title, exploiting market confusion over the definition of data science. Just as confusing signal and noise can lead to tragedy, confusing professional data scientists with garden-variety business and data analysts can lead an organization to disaster. See: http://bit.ly/1kXsipc