From the Editor's Desk
When working with information, context is becoming increasingly important. All too often, data scientists and analysts work on the assumption that their data exists in a vacuum, disconnected from anything else. Yet the production of data invariably involves dozens or even hundreds of decisions - which sources to use, which interpretations to make, what shaped the data gathered in the first place, and so on. Loss of context can render good data meaningless and can introduce unexpected patterns that seem mystifying until the underlying biases are resolved.
A few years back, a study set out to apply machine learning to medical patient records to determine whether anything in a person's medical charts suggested an early-stage cancer that had not yet been detected. The researchers scanned tens of thousands of such records from a given hospital through OCR, and lo and behold, the machine learning algorithm was able to pick up such patterns 99.95% of the time.
While the young researchers were jubilant - their technique had worked, after all - more experienced data scientists raised an eyebrow at a rate so very nearly perfect and began checking the data acquisition chain. Eventually, they made a discovery. Nurses at the hospital who were dealing with patients who had cancer would routinely put a circled C on each record, because there was no space on the form to indicate that the patient was dealing with cancer. In effect, the algorithm had been picking up the nurses' annotation rather than any early sign of disease. Once this particular bit of context was factored in, the strong correlation disappeared.
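One inexpensive way to catch this kind of problem before celebrating is to ask how well each field predicts the outcome entirely on its own. The sketch below is only an illustration, not the method used in the study: the records, the field names (`circled_c`, `age_over_60`, `has_cancer`), and the 0.99 threshold are all hypothetical, made up to mirror the anecdote.

```python
from collections import Counter, defaultdict


def single_feature_accuracy(records, feature, label):
    """Accuracy of predicting `label` from `feature` alone, by guessing
    the majority label observed for each distinct feature value."""
    by_value = defaultdict(Counter)
    for rec in records:
        by_value[rec[feature]][rec[label]] += 1
    correct = sum(max(counts.values()) for counts in by_value.values())
    return correct / len(records)


def suspicious_features(records, label, threshold=0.99):
    """Fields that predict the label almost perfectly by themselves.
    The threshold is an arbitrary, assumed cut-off."""
    features = [name for name in records[0] if name != label]
    return [
        name for name in features
        if single_feature_accuracy(records, name, label) >= threshold
    ]


# Hypothetical toy records: `circled_c` stands in for the nurses' mark.
records = [
    {"circled_c": True,  "age_over_60": True,  "has_cancer": True},
    {"circled_c": True,  "age_over_60": False, "has_cancer": True},
    {"circled_c": False, "age_over_60": True,  "has_cancer": False},
    {"circled_c": False, "age_over_60": False, "has_cancer": False},
    {"circled_c": False, "age_over_60": True,  "has_cancer": False},
    {"circled_c": False, "age_over_60": False, "has_cancer": False},
]

flagged = suspicious_features(records, "has_cancer")
print(flagged)
```

A field that gets flagged this way is not necessarily a leak, but it is a prompt to go back to the people who produced the data and ask what that field actually records.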
Data science is ultimately about more than just running numbers through algorithms. Understanding the data itself is often more than half the battle. Anyone who believes that good data science work doesn't require getting heavily involved in ascertaining the quality, veracity, and context of the data they work with will not succeed in this field. This is why Data Science Central is here.
Data Science Central is your community. It is a chance to learn from other practitioners, and a chance to communicate what you know to the data science community overall. I encourage you to submit original articles and to make your name known to the people who are going to be hiring in the coming year. As always, let us know what you think.
DSC Featured Articles