Why is it so hard to train data scientists?

Some time ago I met a colleague who expressed her disappointment with two data scientists she had hired. They were the first employees with a data science degree hired by that company, and apparently they did not meet the high expectations. She felt that in some cases the data scientists did not do any work she could not have done without them, and in other cases they did not provide very useful insights.

I do not have specific information about the training and background of these two data scientists, but the difficulty of training effective data scientists is something I am definitely familiar with. At a time when almost anyone can call themselves a data scientist, training a complete data scientist is a challenge.

A data scientist should be familiar with databases, as much of the world’s data is organized in relational and non-relational databases. To work with a variety of data types, the data scientist needs to be able to parse and render files, and convert between data formats. Working with large databases often requires programming skills beyond basic scripting in R or Python, as well as knowledge of algorithm design and operating systems. Machine learning is also a required skill. In other words, a complete data scientist should have knowledge of computer science at the level of a trained computer scientist.

A data scientist must also be highly familiar with statistics, and understand multiple statistical methods for tasks such as regression, dimensionality reduction, statistical significance analysis, Monte Carlo simulations, and Bayesian methods, to name a few. The data scientist needs knowledge of statistics at a level close to that of a statistician.

In addition to statistics and computer science, a data scientist should also have knowledge of business administration. That knowledge is required to understand and define business problems, and to communicate the insights.

Data analysis tools change rapidly, so I do not consider knowledge of any specific tool a major part of data science training. However, the reality is that the data scientist also needs to be familiar with a collection of data analytics tools, and be able to learn new tools quickly.

The combination of knowledge in computer science, statistics, business administration, and applied technology is very difficult to teach at the depth required of a data scientist, certainly in an undergraduate program. Therefore, complete data scientists are a rare species, and most of those who identify themselves as data scientists should be matched carefully to a job that fits their skill set and knowledge.