Why is it so hard to train data scientists?

Some time ago I met a colleague who expressed her disappointment with two data scientists she had hired. These were the first employees with a data science degree hired by that company, and apparently they did not meet the high expectations. She felt that in some cases the data scientists did work she could have done without them, and in other cases they did not provide very useful insights.


I do not have specific information about the training and background of these two data scientists, but the difficulty of training effective data scientists is something I am definitely familiar with. At a time when almost anyone can define themselves as a data scientist, training a complete data scientist is a challenge.


A data scientist should be familiar with databases, as much of the world’s data is organized in relational and non-relational databases. To work with a variety of data types, the data scientist needs to be able to parse and render files and convert between data formats. Working with large databases often requires programming skills beyond basic scripting in R or Python, as well as knowledge of algorithm design and operating systems. Machine learning is also a required skill. In other words, a complete data scientist should have knowledge of computer science at the level of a trained computer scientist.


A data scientist must also be highly familiar with statistics, and understand multiple statistical methods for tasks such as regression, dimensionality reduction, statistical significance analysis, Monte Carlo simulations, and Bayesian methods, to name a few. The data scientist needs to know statistics at a level close to that of a statistician.


In addition to statistics and computer science, a data scientist should also have knowledge in business administration. That knowledge is required to understand and define the business problems, and communicate the insights.


Data analysis tools change rapidly, and I do not highlight knowledge of a specific tool as a major part of data science training. However, the reality is that the data scientist also needs to be familiar with a collection of data analytics tools, and be able to quickly learn new tools.


The combination of knowledge in computer science, statistics, business administration, and applied technology is very difficult to teach at the depth required of a data scientist, certainly in an undergraduate program. Therefore, complete data scientists are a rare species, and most of those who identify themselves as data scientists should be matched carefully to a job that fits their skillset and knowledge.


Comment by Seng Chu on February 16, 2018 at 12:17am

Just adding onto Anderson's response here. Currently, the job title "Data Scientist" is an umbrella term. In some cities, you can have the job title "Data Scientist" with zero knowledge in machine learning. A lot of employers don't actually know what they are looking for. There are employers who hire students straight out of data science/analytics programs and expect them to provide immediate value. They are left disappointed when they find out that the data scientist they hired did not provide any useful insights.

Personally, I believe the main issue is the lack of business domain knowledge. This is something students cannot learn in school, because it is specific to the business. Understanding the data specific to the domain is essential to producing meaningful results.

It is a bit unrealistic to expect a complete package data scientist. The spectrum of technical skills a data scientist could have is enormous. A better strategy is to establish some sort of base line on the technical side. Then hire someone who has the drive and motivation to fill in knowledge gaps. A good trainee would seek out information about the business and the skill sets they are lacking. Also, it is important to keep in mind that statistics and programming languages are tools. These tools are not very useful if the trainee doesn't understand the domain.

That being said, it is going to take time for companies to understand what they are looking for. Once they have that understanding, the job title "Data Scientist" would transform into many other roles.

Comment by Anderson L Amaral on February 15, 2018 at 2:37pm

I have even seen companies that expected Data Scientists not only to code the model in Python/R, but also to parse data, deploy the model, and reproduce it in languages such as Java/Scala. Many companies were sold on the idea that Data Scientists could work magic, when actually it is becoming obvious that we now have new roles, such as Data Engineers, which are essential in a Data Science team.

Comment by Eric A. King on February 15, 2018 at 6:24am

I run an AI / machine learning / predictive analytics training organization -- and have found that data scientists have given up formal events that tell a story and paint a picture for on-demand recorded lectures.

Industry surveys show that they're not even going through the on-demand -- with 94% attrition in those events.  We're finding that many are paying a low fee to simply research a particular aspect and cherry-pick the technique or item they wanted to learn. 

As a result, they're picking up ad-hoc tools and techniques... with no vision of how to put them together in a concerted way to arrive at results that are ultimately accountable, understandable, measurable, deployable or adoptable. It seems that most have given up formal training in favor of a glamorized Google search via on-demand subscriptions.

And most of the training available seems to reinforce a technology fascination over addressing the most critical killers of projects: how to maneuver complex environments, cultures, politics, team roles -- so that the results are ultimately deployed and adopted successfully. 

We examined this particular issue in the latest episode of The Analytics Clinic, where our expert panelist revealed that there is a higher-order set of three pillars above the standard three-legged Andrew Conway stool that defines the technical / tactical aspects of a data scientist. They (or someone on the team) also need to serve as a project lead to oversee successful deployment, as well as to ensure organizational alignment: that what we're building is truly a priority to the organization and fits within the culture and environment.

We also discussed why there is rarely good alignment between the data scientist being sought and the one hired.  A link to the archived recording of this event is attached.  Do note that it's free to attend live.  We'll soon write up a blog summary of this event which will address many of the destructive ways that organizations are qualifying data scientists and where there is a chronic 'void of dysfunction' between data scientists and leadership.

Great discussion.  Big issue.  I hope the trend reverses!

Comment by Suresh Babu on February 14, 2018 at 6:37am

We are talking about training data scientists (difficult/not difficult), not the starting points. Also, the term data science presumes some versatility in moving between the fields of computing and statistical learning. That is, you’re likely to find data scientists inhabiting the “tropics” of an imaginary data science sphere rather than living in the “polar circles”.

We should be aware of a well-known problem (the illusion of expertise) that is relevant to this discussion.

Daniel Kahneman, the Nobel Prize winner in Economics, teamed up with Amos Tversky in a memorable collaboration of pioneers contributing much to the foundations of behavioral economics. Kahneman, a psychologist, and Tversky, a decision theorist, formed an effective partnership bringing intuition and logic together to address the rather systematic biases in judgment that humans exhibit and the economic consequences of these.

One of the areas discussed by them is the way experts consistently underperform algorithms, a key finding of Paul Meehl. Mainly, Kahneman explains in his book, experts tend to frame problems in more complex terms than needed, to show their cleverness, and then proceed to make inconsistent judgments in summarizing complex information. This is a widespread phenomenon seen across areas.

He offers the classic example of Dr. Virginia Apgar, who in 1953 identified the relevant variables for assessing whether a newborn was in distress and created a simple scoring method (still used) that could be applied by anyone in the delivery room to determine the conditions of distress. The rate of infant mortality came down significantly as a result. Prior to that, physicians (the experts) who handled delivery were often inconsistent in their judgment about newborn distress, leading to many infant deaths.

Unless the right framing exists (like Apgar’s framing of the relevant factors for infant mortality), experts (like physicians in the above case) are likely to fail more often, rather consistently, and non-experts (like staff assisting delivery) who could be trained easily miss the opportunity to learn.

If you look at it in machine learning terms, I see the blanket call for expertise as overfitting the training set, leading to underperformance in test conditions. The illusion of expertise is a powerful initial bias.

Comment by Dr S Kotrappa on February 14, 2018 at 1:47am

I share a similar confusion. Many authors discuss the data science ecosystem from their own perspective: computer science/software engineering, statistics, probability and mathematics, programming (OSS: Python/R/Perl/Java/C++, Matlab, SAS, etc.), big data, business acumen, communication skills, and so on. But where to start, and what to start with? It is not possible for any one person to be the best in all these topics. I agree that it's hard to train data scientists in all these topics.

Comment by Lior Shamir on February 12, 2018 at 3:50pm

Thank you Rebecca! I couldn’t agree with you more. What is done in the classroom is ridiculously different from what real-world data science is.

I work with part-time graduate students who also have jobs, and in that case they bring their data from work to school and we analyze it in the class. That helps a bit. But with undergraduate students that is much more difficult.

Comment by Rebecca Barber, PhD on February 12, 2018 at 1:55pm

I would point out that, if you review the curriculum of most data science or analytics degree programs, they are heavy on programming and statistics, VERY light on things like data preparation, data understanding/exploration, and communicating results.  They work with pre-cleaned data, are rarely given any context for that data beyond maybe a paragraph, and then write up results focused on their process and the technical aspects of what they did.  It is as though all of the program time is spent on 20% of the work.  I never did figure out where the people who designed those programs expect their students to learn the other 80%.  

Comment by Lior Shamir on February 9, 2018 at 1:44pm

Thank you Suresh for the comment. You state that "It's the expectation that one has to be a great statistician as well as a true computing genius that leads to this issue.  Very few are both."

That is exactly the problem - the versatility of knowledge (and I would also add knowledge in business administration) needed for solving real-world data problems. You can surely define a problem based on available skills or available tools, but that might be a compromise on the solution. Not every data science problem has an immediate solution through just applying tools. At this point I suspect that undergraduate level education does not provide all of that knowledge.

Comment by Suresh Babu on February 9, 2018 at 11:16am

I don't know how you can reach this conclusion.  

Data science is a mix of statistics and computing, favoring both and the spectrum in between.  And there can be generalists and specialists.  It's the expectation that one has to be a great statistician as well as a true computing genius that leads to this issue.  Very few are both.

When such problems are encountered, it's important to check if you have framed the problem well and invited the right people to provide the solution.  Right framing of the problem, right statistical learning techniques and the right architectural approach (tools).  This three-legged stool (problem, technique, tools) becomes wobbly if any one of these legs is weak.  

Even if people show up with the technique and the tools what are they going to hammer away at?

In my opinion, to reach the conclusion that it is hard to train data scientists is a rush to judgment.  
