
Data Science Central Weekly Digest, 18 Jan 2021

  • Kurt Cagle 
From the Editor’s Desk
When working with information, context is becoming increasingly important. All too often, data scientists and analysts work on the assumption that their data exists in a vacuum, disconnected from anything else. Yet producing data invariably involves dozens or even hundreds of decisions: which sources to use, which interpretations to apply, what shaped the data that was gathered in the first place, and so on. Loss of context can render good data meaningless and can introduce unexpected patterns that seem mystifying until the underlying biases are resolved.
A few years back, a study attempted to use machine learning on medical patient records to identify, a priori, whether anything in a person’s medical charts suggested an early-stage cancer that had not yet been detected. The researchers scanned tens of thousands of such records from a given hospital through OCR, and lo and behold, the machine learning algorithm was able to pick up such patterns 99.95% of the time.
While the young researchers were jubilant (their technique had worked, after all), more experienced data scientists raised an eyebrow at a rate so very nearly perfect and began checking the data acquisition chain. Eventually, they made a discovery. Nurses at the hospital who were dealing with patients who had cancer would routinely put a circled C on each record, because the form had no space to indicate a cancer diagnosis. In all likelihood, the algorithm had been keying on that mark rather than on any early sign of disease. Once this particular bit of context was factored in, the strong correlation disappeared.
Data science is ultimately about more than just running numbers through algorithms. Understanding the data itself is often more than half the battle. Anyone who believes that good data science work doesn’t require digging deeply into the quality, veracity, and context of the data they work with will not succeed in this field. This is why Data Science Central is here.
Data Science Central is your community. It is a chance to learn from other practitioners, and a chance to share what you know with the data science community at large. I encourage you to submit original articles and to make your name known to the people who will be hiring in the coming year. As always, let us know what you think.



DSC Featured Articles

Tech Target Articles

Picture of the Week

Statistics + Machine Learning = Statistical Learning

Machine Learning / Stats / BI: Mini Translation Dictionary



This email, and all related content, is published by Data Science Central, a division of TechTarget, Inc.


© 2020 TechTarget, Inc. all rights reserved. Designated trademarks, brands, logos and service marks are the property of their respective owners.
