It’s happened to all of us sooner or later: The hypothesis seemed plausible, the data was clean, the conclusions sound. Our recommendation was damn near foolproof. Yet when put into practice, the result was anything but favorable.
How could that happen? Data science is about, well, DATA. And science, which implies a reliable method. We have the information, we have the models, we aren’t just shooting in the dark here. Where did it go wrong?
The truth is that lots of things can go wrong, and when we go looking for the root cause, let us begin with ourselves.
You’re crazy. And so am I.
You as a person have mental characteristics that influence how you work with and interpret data of any form, regardless of whether it comes from your senses, a database, or the written page. Some of these are common to our species, some are formed through our life experiences. Psychologists call these cognitive biases, and they are present in varying degrees in every person on the planet.
As data scientists it is tempting to think that we are objective and impartial. That would be nice if it were true, but unfortunately there are certain cognitive biases we must be continually mindful of at various stages in our process or risk the corruption of our results.
The Usual Suspects
Below are but a handful of the more prevalent forms that are particularly relevant to us:
- Stereotyping – We’ll start with this one because it’s widely understood. Stereotyping is simply assigning characteristics to all members of a population without specific data on each member of the population. While we talk about stereotyping mostly in the context of human relations, it can be equally damaging in data.
- Anchoring – First impressions are powerful, and your first impression of a data set (usually during exploratory analysis or visualization) will usually stick with you even in the face of contradictory evidence later. Holding to an initial idea when later data debunks it is the essence of Anchoring.
- Base Rate Fallacy – AKA “Can’t see the forest for the trees.” The basic idea is that zooming in on the specifics of a particular case without zooming back out to see how the data applies to the entire population is dangerous, because it ignores how common the outcome is in that population to begin with (the base rate).
- Curse of Knowledge – This is common to experts in a specific knowledge domain, and concerns the inability of the expert to see things from the perspective of a novice. In other words, you assume your audience has knowledge they do not. Ever been in a seminar where the speaker used industry jargon that sounded alien to you?
- Expectation Bias – This one is a killer. If you expect an analysis to turn out a certain way, you will tend to disbelieve or discount data or results that conflict with the expected outcome. I probably don’t have to elaborate on why that’s a bad thing.
- Framing Effect – This is simply drawing different conclusions from the same information, based on the context in which the information is presented. Any parent of a teenager has likely experienced this. Known in political circles as “spin doctoring.”
- Illusion of Validity – Ever been in a situation where management demands more information before making a decision, even when the information available is more than sufficient? That’s Illusion of Validity, the belief that more information is always better. Known in management circles as “Paralysis by Analysis,” but by no means is it exclusive to managers. I’ve known quite a few data scientists who fall into this category, especially those who specialize in Big Data.
- Irrational Escalation – AKA “Sunk Cost Fallacy,” the justification of additional investment in a project even when all data say the project was a bad decision. This is more common in managers than data scientists, simply because managers are the ones who typically decide how to spend budgets.
- Reactance – Doing the opposite of what someone else tells you in an effort to prove your independence from their influence or authority. Prevalent in toddlers and mainframe coders. Closely related to “Reverse Psychology,” which is widely practiced by parents and mainframe coder managers.
- Selection Bias – Working with samples that are not random for some reason, which skews the results of the analysis. A subset of this is the commonly seen “Survivorship Bias,” where only the candidates that passed a selection process (the survivors) are studied, to the exclusion of those that did not pass. For example, companies that went bankrupt get left out of a financial analysis of a vertical, simply because they didn’t show up in a search.
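To see how much damage the bankrupt-companies example under Selection Bias can do, here is a minimal Python sketch. Everything in it is invented for illustration: the 1,000 companies, the distribution of their returns, and the assumption that anything losing more than 15% goes bankrupt and drops out of the search results.

```python
import random
from statistics import mean

random.seed(42)  # fixed seed so the sketch is repeatable

# Hypothetical vertical: 1,000 companies whose annual returns are drawn
# from a normal distribution centered on zero. These numbers are made up.
returns = [random.gauss(0.0, 0.20) for _ in range(1000)]

# Assumption: any company that lost more than 15% went bankrupt and
# therefore no longer shows up when we search for companies to analyze.
BANKRUPTCY_THRESHOLD = -0.15
survivors = [r for r in returns if r > BANKRUPTCY_THRESHOLD]

all_mean = mean(returns)
survivor_mean = mean(survivors)

print(f"Companies in the full population: {len(returns)}")
print(f"Companies our search would find:  {len(survivors)}")
print(f"Mean return, full population:     {all_mean:+.3f}")
print(f"Mean return, survivors only:      {survivor_mean:+.3f}")
```

The survivors-only average comes out higher than the true average, not because the vertical is healthy but because the failures were filtered out before we ever looked.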
But the granddaddy of them all (in my opinion) is one called “Bias Blind Spot,” in which you can readily identify the cognitive biases in other people but believe you have significantly fewer than those people. I’ll leave it to your imagination to picture the chaos that can cause.
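Of the biases in the list above, the Base Rate Fallacy is the easiest to put numbers on. Here is a small Python sketch using Bayes' theorem. The scenario and every figure in it are assumptions chosen for illustration: a hypothetical fraud model that catches 95% of fraudulent transactions, wrongly flags 5% of legitimate ones, and runs on data where only 1% of transactions are actually fraudulent.

```python
def posterior(base_rate, true_positive_rate, false_positive_rate):
    """Return P(condition | flagged) using Bayes' theorem."""
    flagged_and_true = true_positive_rate * base_rate
    flagged_and_false = false_positive_rate * (1.0 - base_rate)
    return flagged_and_true / (flagged_and_true + flagged_and_false)


# Hypothetical numbers: 1% of transactions are fraudulent, the model
# flags 95% of real fraud and 5% of legitimate transactions.
p = posterior(base_rate=0.01, true_positive_rate=0.95, false_positive_rate=0.05)

print(f"P(fraud | flagged) = {p:.1%}")  # roughly 16%
```

Zoomed in on a single flagged transaction, "95% accurate" sounds like near certainty. Zoomed back out to the whole population, about five out of six flags are false alarms, simply because legitimate transactions vastly outnumber fraudulent ones.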