Spooky Scary Data Science Skeletons
October is the spookiest time of the year: the ghosts and witches are out in force, there’s a chill in the air as gray clouds gather, and pumpkin flavoring seems to turn up in just about everything. I blame a particular Seattle coffee chain for the last one, but there’s something about moving into Fall that focuses one’s mind on the spooky scary skeletons lurking underneath the bed.
In the realm of the data scientist, there are more than a few skeletons hiding in the closets as well. These are the things that keep analysts up at night, and no matter how well prepared you may be, these jump scares are enough to send anyone screaming.
Data Quality Demons. The business manager assured you that the data is great and has everything you could ever need. Yet when you pry the lid off the coffin and stare at the mouldering remains of software projects past, you get the creeping sensation that perhaps the manager was a bit … optimistic … in that assessment. Inconsistent spelling, arbitrary placeholders, lists of items stored as single strings, differing date and currency conventions, and data type errors can usually be dispelled with intelligent analysis software. The bigger demons come from cardinality misunderstandings, a failure to account for change in data over time, duplicate records whose subsequent edits create phantom information, and similar errors that can be difficult to catch and even harder to fix.
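For the smaller demons in that list, a few lines of Python go a long way. The sketch below is only an illustration: the records, the placeholder set, the semicolon list separator, and the month-first reading of slash dates are all assumptions made up for the example, not anything taken from a real dataset.

```python
from datetime import datetime

# Hypothetical raw records showing a few of the problems described above.
RAW = [
    {"city": "Seattle ", "tags": "books;cars", "joined": "2021-10-01", "price": "N/A"},
    {"city": "seattle", "tags": "books", "joined": "10/01/2021", "price": "$1,200.50"},
    {"city": "SEATTLE", "tags": "", "joined": "2021-10-01", "price": "-999"},
]

# Assumed placeholder values standing in for "missing"; adjust per source system.
PLACEHOLDERS = {"", "n/a", "na", "null", "unknown", "-999"}

# Assumed date conventions; slash dates are read month-first here.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y")


def clean_text(value):
    """Trim whitespace and turn known placeholders into None."""
    value = value.strip()
    return None if value.lower() in PLACEHOLDERS else value


def parse_date(value):
    """Try each known date format; return None if none of them match."""
    value = clean_text(value)
    if value is None:
        return None
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value, fmt).date()
        except ValueError:
            continue
    return None


def parse_price(value):
    """Strip the currency symbol and thousands separators before converting."""
    value = clean_text(value)
    if value is None:
        return None
    return float(value.replace("$", "").replace(",", ""))


def clean_record(rec):
    """Normalize one record: spelling, list fields, dates, and prices."""
    city = clean_text(rec["city"])
    tags = clean_text(rec["tags"])
    return {
        "city": city.title() if city else None,
        "tags": tags.split(";") if tags else [],
        "joined": parse_date(rec["joined"]),
        "price": parse_price(rec["price"]),
    }


CLEANED = [clean_record(r) for r in RAW]
```

None of this touches the bigger demons. Cardinality mistakes and phantom duplicates need an understanding of how the data was produced, not just tidier strings.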
Sparse Metadata Monsters. These are more subtle issues, arising from data that was collected primarily to facilitate fast transactions and that therefore carries minimal metadata about those transactions. The missing pieces include dimensional units (length, currency, and count units, since three books are not the same as three cars), the period over which a given entity exists within the system, and the provenance of the data (who entered it, why they entered it, how valid it is, and where its source of record lives). This metadata often determines the reliability of the data.
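One way to keep these monsters at bay is to make the metadata travel with the value. The following is a minimal sketch, assuming a hypothetical `Quantity` record and `add_quantities` helper invented for this example; the field names are illustrative, not a standard.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Quantity:
    """A value that carries its unit, provenance, and validity period."""
    value: float
    unit: str                        # e.g. "books", "cars", "USD"
    source: str                      # assumed name of the system of record
    entered_by: str                  # who recorded the value
    valid_from: date                 # first day the value applies
    valid_to: Optional[date] = None  # None means "still current"

    def is_current(self, on: date) -> bool:
        """True if the value applies on the given day."""
        return self.valid_from <= on and (self.valid_to is None or on < self.valid_to)


def add_quantities(a: Quantity, b: Quantity) -> float:
    """Add two quantities, refusing to mix units."""
    if a.unit != b.unit:
        raise ValueError(f"cannot add {a.unit} to {b.unit}")
    return a.value + b.value
```

With this in place, adding three books to three cars raises an error instead of quietly producing six of something.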
Modeling Mayhem. A recent preprint about COVID-19 vaccine efficacy in Wisconsin made a modeling assumption about the number of people who had been vaccinated in the state. It turned out that the number was off by a factor of 100, and what had seemed like a strong statistical case against the vaccine became instead a strong case for it. These kinds of modeling errors can break careers.
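The arithmetic behind this kind of reversal is simple enough to sketch. Every number below is made up for illustration and is not taken from the Wisconsin article; the point is only that a denominator that is too small by a factor of 100 inflates the rate by the same factor, which can flip a comparison.

```python
def rate_per_100k(events, population):
    """Events per 100,000 people."""
    return events / population * 100_000


def rate_ratio(events_a, pop_a, events_b, pop_b):
    """Rate in group A divided by rate in group B (above 1 means A fares worse)."""
    return rate_per_100k(events_a, pop_a) / rate_per_100k(events_b, pop_b)


# Hypothetical counts, chosen only to show the effect.
VAX_EVENTS = 50
UNVAX_EVENTS = 500
UNVAX_POP = 2_000_000
ASSUMED_VAX_POP = 30_000      # the mistaken modeling assumption
ACTUAL_VAX_POP = 3_000_000    # 100 times larger

naive = rate_ratio(VAX_EVENTS, ASSUMED_VAX_POP, UNVAX_EVENTS, UNVAX_POP)
corrected = rate_ratio(VAX_EVENTS, ACTUAL_VAX_POP, UNVAX_EVENTS, UNVAX_POP)

print(f"ratio with assumed denominator:   {naive:.2f}")
print(f"ratio with corrected denominator: {corrected:.2f}")
```

With these invented figures, the naive ratio sits well above 1 and the corrected ratio well below it, so the same event counts tell opposite stories.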
Bias Boggarts. Sampling by its very nature can be fraught with gotchas. Is the sample representative of the overall population? What hidden assumptions were made about the questions being asked or the means by which the information is gathered? For a long time, surveys were conducted over landlines, until a statistician realized that a growing number of people had abandoned them in favor of mobile phones, and those who were left were older, more conservative, and likely wealthier, skewing everything from product marketing to politics.
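The landline effect can be shown with nothing more than weighted averages. This is a rough sketch with two hypothetical groups; the population shares, landline rates, and support rates are invented for the example.

```python
# Hypothetical population: two groups that differ in both landline
# ownership and opinion. All figures are invented for illustration.
GROUPS = [
    {"name": "younger", "share": 0.5, "landline_rate": 0.1, "support": 0.30},
    {"name": "older",   "share": 0.5, "landline_rate": 0.7, "support": 0.60},
]


def true_support(groups):
    """Population-wide support: each group weighted by its population share."""
    return sum(g["share"] * g["support"] for g in groups)


def landline_survey_support(groups):
    """Support as seen by a survey that only reaches landline owners."""
    reached = sum(g["share"] * g["landline_rate"] for g in groups)
    supporters = sum(g["share"] * g["landline_rate"] * g["support"] for g in groups)
    return supporters / reached


def reweighted_support(groups):
    """Post-stratified estimate: group-level results reweighted by population share.

    Assumes landline owners within a group answer like the rest of that group.
    """
    return sum(g["share"] * g["support"] for g in groups) / sum(g["share"] for g in groups)


print(f"true support:          {true_support(GROUPS):.3f}")
print(f"landline-only survey:  {landline_survey_support(GROUPS):.3f}")
print(f"reweighted estimate:   {reweighted_support(GROUPS):.3f}")
```

The landline-only figure overstates support because the older group is overrepresented among the people the survey can reach. Reweighting helps only if the assumption in `reweighted_support` holds, which is itself a hidden assumption worth checking.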
Interpretation Imps. Having created a model and run the data, ultimately the question is how to interpret the results, and it is here that the imps of the perverse delight in ruining a data scientist’s day. Are the conclusions supported by the analysis? Is it possible that those who have commissioned the analysis will ignore all of the caveats about probabilities and will treat the results as absolute statements? (Yes.) Will people justify their own agendas based upon your conclusions, even when the conclusions do not support those agendas at all? Oh, definitely.
Data Science can be fun and exciting, but it can also be filled with deadly traps and snarling beasts. Sometimes the best that you can do is to be aware of all the goblins and ghoulies, and of course, read Data Science Central.
Goodnight, sleep tight … don’t let the bedbugs bite!
To subscribe to the DSC Newsletter, go to Data Science Central and become a member today. It’s free!
This email, and all related content, is published by Data Science Central, a division of TechTarget, Inc.
275 Grove Street, Newton, Massachusetts, 02466 US
copyright 2021 TechTarget, Inc. all rights reserved. Designated trademarks, brands, logos and service marks are the property of their respective owners.