The First Things You Should Learn as a Data Scientist - Not What You Think

The list below is a (non-comprehensive) selection of topics that I believe should be taught first in data science classes, based on 30 years of business experience. This is a follow-up to my article Why logistic regression should be taught last.

I am not sure whether these topics are even discussed in data camps or college classes. One of the issues is the way teachers are recruited. The recruitment process favors individuals famous for their academic achievements or their "star" status, and they tend to teach the same thing over and over, for decades. Successful professionals have little interest in becoming teachers (as the saying goes: if you can't do it, you write about it; if you can't write about it, you teach it).

It does not have to be that way. Plenty of qualified professionals, though not stars themselves, would make perfect teachers and are not necessarily motivated by money. They come with tremendous experience gained in the trenches, and could be fantastic at helping students deal with real data. They do not even need to be data scientists: many engineers are entirely capable and qualified to provide great data science training.

Topics that should be taught very early on in a data science curriculum

I suggest the following:

  • An overview of how algorithms work
  • Different types of data and data issues (missing data, duplicated data, errors in data) together with exploring real-life sample data sets, and constructively criticizing them 
  • How to identify useful metrics
  • Lifecycle of data science projects
  • Introduction to programming languages and fundamental command-line instructions (Unix commands such as grep, sort, uniq, head, and pipes)
  • Communicating results to non-experts and understanding requests from decision makers (translating requests into action items for the data scientist)
  • Overview of popular techniques with pluses and minuses, and when to use them
  • Case studies
  • Being able to identify flawed studies
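Several of these bullets, in particular exploring real-life sample data sets and spotting missing, duplicated, or erroneous records, can be practiced with only a few lines of code. Here is a minimal sketch using the Python standard library; the field names and sample records are made up for illustration:

```python
import csv
import io
from collections import Counter

# A made-up CSV sample exhibiting three typical real-world issues:
# a missing value, a duplicated row, and an obvious data error.
raw = """id,age,country
1,34,US
2,,FR
3,27,US
3,27,US
4,-5,DE
"""

rows = list(csv.DictReader(io.StringIO(raw)))

# Missing data: records with any empty field.
missing = [r["id"] for r in rows if any(v == "" for v in r.values())]

# Duplicated data: identical records appearing more than once.
counts = Counter(tuple(r.items()) for r in rows)
duplicates = [dict(k)["id"] for k, n in counts.items() if n > 1]

# Errors in data: values failing a simple sanity check (age must be >= 0).
errors = [r["id"] for r in rows if r["age"] and int(r["age"]) < 0]

print(missing)     # ids of records with missing fields
print(duplicates)  # ids of duplicated records
print(errors)      # ids of records failing the sanity check
```

On a real data set, each of these one-line checks would be the starting point for a conversation about where the data came from and how the issues should be handled.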

By contrast, here is a typical list of topics discussed first, in traditional data science classes:

  • Probability theory, random variables, maximum likelihood estimation
  • Linear regression, logistic regression, analysis of variance, general linear model
  • K-NN (k-nearest neighbors), hierarchical clustering
  • Test of hypotheses, non-parametric statistics, Markov chains, time series
  • NLP, especially word clouds (applied to small-sample Twitter data)
  • Collaborative filtering algorithms 
  • Neural networks, decision trees, linear discriminant analysis, naive Bayes

There is nothing fundamentally wrong with these techniques (except the last two), but you are unlikely to use them in your career -- not the rudimentary versions presented in the classroom, anyway -- unless you are in a team of like-minded people all using the same old-fashioned black-box tools. They should indeed be taught, but perhaps not at the beginning.

Topics that should also be included in a data science curriculum

The topics listed below should not be taught at the very beginning, but they are very useful and rarely included in standard curricula:

  • Model selection, tool (product) selection, algorithm selection
  • Rules of thumb
  • Best practices
  • Turning unstructured data into structured data (creating taxonomies, cataloging algorithms and automated tagging)
  • Blending multiple techniques to get the best of each of them, as described here
  • Measuring model performance (R-Squared is the worst metric, but usually the only one taught in the classroom)
  • Data augmentation (finding external data sets and features to get better predictive power, blending it with internal data)
  • Building your own home-made models and algorithms
  • The curse of big data (different from the curse of dimensionality) and how to discriminate between correlation and causation
  • How frequently data science implementations (for instance, look-up tables) should be updated
  • From designing a prototype to deployment in production mode: caveats
  • Monte-Carlo simulations (a simple way to compute confidence intervals and test statistical hypotheses, without even knowing what a random variable is)
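The Monte-Carlo bullet above can be illustrated with a bootstrap: resample the observed data many times and read the confidence interval directly off the simulated distribution, with no distributional theory required. A minimal sketch in Python (the sample data, seed, and number of resamples are arbitrary choices for illustration):

```python
import random

random.seed(42)  # fixed seed so the simulation is reproducible

# Observed sample (made-up measurements).
data = [12.1, 9.8, 11.4, 10.2, 13.0, 9.5, 10.9, 11.7, 10.4, 12.3]

def mean(xs):
    return sum(xs) / len(xs)

# Bootstrap: resample with replacement, record each resample's mean.
boot_means = sorted(
    mean(random.choices(data, k=len(data))) for _ in range(10_000)
)

# A 95% confidence interval for the mean is read off the 2.5th and
# 97.5th percentiles of the simulated distribution.
lower = boot_means[int(0.025 * len(boot_means))]
upper = boot_means[int(0.975 * len(boot_means))]

print(f"mean = {mean(data):.2f}, 95% CI = [{lower:.2f}, {upper:.2f}]")
```

The same resampling idea extends to hypothesis testing: simulate the statistic under the null scenario and check where the observed value falls in the simulated distribution.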

To find out more about these techniques, use our search box to find literature about the topic in question. 

For related articles from the same author, click here or visit www.VincentGranville.com. Follow me on LinkedIn.




Comment by Lance Norskog on July 23, 2018 at 1:55pm

@Nancy Grady- would this be the "Netflix Prize Effect"?

Comment by Nancy Grady on July 23, 2018 at 6:45am
I pulled up this great article (again), and realized you should also mention emergent behavior, or the "mosaic effect". As big datasets get integrated, you can end up with security and privacy issues – even though the individual datasets did not have those concerns.
Comment by Lance Norskog on July 14, 2018 at 12:45pm

Also- put up a production site. Just a web site where you upload cat pix, the app consults Google's API, then gives an answer. This is not data sci per se, but you're going to want to understand the quirks of putting Data Sci into production. For example, in software you throw away last week's build when you ship, but with a DS site you would run a few inputs against an old "gold" model, to decide whether your newer model is drifting against real production data.

Comment by Lance Norskog on May 29, 2018 at 6:31pm

An aspect you touch on in the first part is that an experiment is a sacred ritual. Someone who spends months worrying about her petri dishes full of fungus understands this deeply. Programmers are never taught about it.

"Design of Experiments" is a common term for this. Basic failures like "Survivorship Bias" should be taught.


Note that the WW2 story of armoring planes has something of a "just so story" nature.

Comment by Robert de Graaf on May 29, 2018 at 3:13pm

Two more items to add to the list

-Enough knowledge of databases, data warehouses, data vaults and other data catacombs to be able to access the data you need, wherever it happens to be, and especially enough knowledge of the associated jargon to communicate with data gatekeepers.

- The ability to transform a fuzzy business objective that the business user barely understands themselves into a sharply defined data science problem, clearly defined with respect to both statistical aspects at the research and development stage, and software engineering aspects at the implementation stage. Not trivial.

Get the first one wrong, and you could spend a lot of time banging your head against a wall; get the second one wrong, and much of your effort will be wasted. Get them both right, and you'll only spend the effort you need, like a true Lazy Data Scientist.

Comment by Robert Vonderheid on May 29, 2018 at 2:22am

This item, "Communicating results to non-experts and understanding requests from decision makers (translating requests into action items for the data scientist)", should be taught over and over again.  Few folks can do this well, and even when it is done well, the explanations are often not trusted because of others who don't do it well.

Comment by John Elder on May 28, 2018 at 1:44pm

Excellent list!  Especially the first section.  I'm going to teach a workshop at PAW in Vegas next week that follows that outline quite closely!  And then a 2nd one that talks about the Top 12 Data Science Mistakes (Deadly Dozen) and How to Defeat them, after the conference there.  The links are, for the Machine Learning Methods:  https://www.predictiveanalyticsworld.com/lasvegas/workshops/the-bes...

And for Deadly Dozen: https://www.predictiveanalyticsworld.com/lasvegas/workshops/the-dea...

-John Elder ([email protected])

Comment by Pablo Martinez on May 28, 2018 at 7:45am

Thank you very much for this article.

Comment by Rebecca Barber, PhD on May 28, 2018 at 6:34am

Thanks for this!  I have been arguing this for a long time.  All too many graduates of these programs come out with a great deal of knowledge about the modeling part of a project and no knowledge to speak of regarding things like getting from a business question to a data science problem, understanding or preparing the data, or presenting the results to a non-technical audience.  I had a graduate of one of these programs come work for me who had never seen a dirty data set, had never had to pull together data at two different levels of observation (and, in fact, didn't know what that even meant, resulting in an exponential increase in the number of records when he tried to bring the data sets together, something he didn't recognize as a potential problem or even check for), and didn't realize what missing data would do to his models.  Yes, in high-end, large data science shops there are data engineers to sort out some of this for you, but those are the minority of shops these days, and frankly someone who understands only one part of the process is far less useful/employable.
