The list below is a (non-comprehensive) selection of what I believe should be taught first in data science classes, based on 30 years of business experience. This is a follow-up to my article "Why logistic regression should be taught last".
I am not sure whether the topics below are even discussed in data camps or college classes. One of the issues is the way teachers are recruited. The recruitment process favors individuals famous for their academic achievements, or for their "star" status, and they tend to teach the same thing over and over, for decades. Successful professionals have little interest in becoming teachers (as the saying goes: if you can't do it, you write about it; if you can't write about it, you teach it).
It does not have to be that way. Plenty of qualified professionals, even if they are not stars, would be perfect teachers and are not necessarily motivated by money. They come with tremendous experience gained in the trenches, and could be fantastic teachers, helping students deal with real data. And they do not need to be data scientists: many engineers are entirely capable of (and qualified for) providing great data science training.
Topics that should be taught very early on in a data science curriculum
I suggest the following:
By contrast, here is a typical list of topics discussed first, in traditional data science classes:
There is nothing fundamentally wrong with these techniques (except the last two), but you are unlikely to use them in your career -- not the rudimentary version presented in the classroom anyway -- unless you are on a team of like-minded people all using the same old-fashioned black-box tools. Indeed they should be taught, but maybe not at the beginning.
Topics that should also be included in a data science curriculum
The ones listed below should not be taught at the very beginning, but are very useful, and rarely included in standard curricula:
To find out more about these techniques, use our search box to find literature about the topic in question.
For related articles from the same author, visit www.VincentGranville.com. Follow me on LinkedIn.
Comment
@Nancy Grady- would this be the "Netflix Prize Effect"?
Also- put up a production site. Just a web site where you upload cat pix, the app consults Google's API, then gives an answer. This is not data sci per se, but you're going to want to understand the quirks of putting Data Sci into production. For example, in software you throw away last week's build when you ship, but with a DS site you would run a few inputs against an old "gold" model, to decide whether your newer model is drifting against real production data.
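The "gold" model check described above can be sketched in a few lines. This is only an illustration under stated assumptions: `gold_model`, `new_model`, the sample inputs and the 10% threshold are all hypothetical stand-ins, and a real site would compare actual model outputs on recent production inputs.

```python
from typing import Callable, Sequence


def disagreement_rate(
    gold: Callable[[float], str],
    new: Callable[[float], str],
    inputs: Sequence[float],
) -> float:
    """Fraction of inputs on which the two models give different answers."""
    if not inputs:
        raise ValueError("need at least one input to compare models")
    mismatches = sum(1 for x in inputs if gold(x) != new(x))
    return mismatches / len(inputs)


def is_drifting(
    gold: Callable[[float], str],
    new: Callable[[float], str],
    inputs: Sequence[float],
    max_rate: float = 0.1,  # hypothetical tolerance, to be tuned per application
) -> bool:
    """Flag the new model when it disagrees with the gold model too often."""
    return disagreement_rate(gold, new, inputs) > max_rate


# Hypothetical stand-ins for real models: each maps a score to a label.
def gold_model(score: float) -> str:
    return "cat" if score >= 0.5 else "not cat"


def new_model(score: float) -> str:
    return "cat" if score >= 0.7 else "not cat"


# A few held-back inputs, standing in for recent production data.
sample_inputs = [0.1, 0.4, 0.6, 0.8, 0.9]

rate = disagreement_rate(gold_model, new_model, sample_inputs)
drifting = is_drifting(gold_model, new_model, sample_inputs)
print(f"disagreement rate: {rate:.0%}, drifting: {drifting}")
```

Disagreement with the gold model does not say which model is right; it only signals that the new build behaves differently on real inputs and deserves a closer look before shipping.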
An aspect you touch on in the first part is that an experiment is a sacred ritual. Someone who spends months worrying about her petri dishes full of fungus understands this deeply. Programmers are never taught about it.
"Design of Experiments" is a common term for this. Basic failures like "Survivorship Bias" should be taught.
https://en.wikipedia.org/wiki/Survivorship_bias
Note that the WW2 story of armoring planes has something of a "just so story" nature.
Two more items to add to the list
- Enough knowledge of databases, data warehouses, data vaults and other data catacombs to be able to access the data you need, wherever it happens to be, and especially enough knowledge of the associated jargon to communicate with data gatekeepers.
- The ability to transform a fuzzy business objective that the business user barely understands themselves into a sharply defined data science problem, clearly defined with respect to both statistical aspects at the research and development stage, and software engineering aspects at the implementation stage. Not trivial.
Get the first wrong, and you could spend a lot of time banging your head against a wall; get the second wrong, and much of your effort will be wasted. Get them right, and you'll only spend the effort you need, like a true Lazy Data Scientist.
This item, "Communicating results to non-experts and understanding requests from decision makers (translating requests into action items for the data scientist)", should be taught over and over again. Few folks can do this well, and even when it is done well, their explanations are not trusted because of others who don't do it well.
Excellent list! Especially the first section. I'm going to teach a workshop at PAW in Vegas next week that follows that outline quite closely! And then a 2nd one that talks about the Top 12 Data Science Mistakes (Deadly Dozen) and How to Defeat them, after the conference there. The links are, for the Machine Learning Methods: https://www.predictiveanalyticsworld.com/lasvegas/workshops/the-bes...
And for Deadly Dozen: https://www.predictiveanalyticsworld.com/lasvegas/workshops/the-dea...
-John Elder
Thank you very much for this article.
Thanks for this! I have been arguing this for a long time. All too many graduates of these programs come out with a great deal of knowledge about the modeling part of a project and no knowledge to speak of regarding things like getting from a business question to a data science problem, understanding or preparing the data, or presenting the results to a non-technical audience. I had a graduate of one of these programs come work for me who had never seen a dirty data set, had never had to pull together data at two different levels of observation (and, in fact, didn’t know what that even meant, resulting in an exponential increase in the number of records when he tried to bring the data sets together, something he didn’t recognize as a potential problem or even check for), and didn’t realize what missing data would do to his models. Yes, in high-end, large data science shops there are data engineers to sort out some of this for you, but that is the minority of shops these days, and frankly someone who understands only one part of the process is far less useful/employable.
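To make the "two different levels of observation" problem concrete, here is a minimal sketch in plain Python. The tables (`orders`, `visits`), their columns and the helper names are hypothetical, invented only for illustration. It shows how joining two tables that each have several rows per customer multiplies records, and one simple check that would catch it.

```python
from collections import Counter, defaultdict


def join_on(left, right, key):
    """Inner join of two lists of dicts on `key`, keeping every matching pair."""
    index = defaultdict(list)
    for row in right:
        index[row[key]].append(row)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]


def duplicated_keys(rows, key):
    """Key values that appear on more than one row (empty if `key` is unique)."""
    counts = Counter(row[key] for row in rows)
    return sorted(k for k, n in counts.items() if n > 1)


# Hypothetical data: one row per order, and one row per site visit.
orders = [
    {"customer_id": 1, "order_id": "A", "amount": 100},
    {"customer_id": 1, "order_id": "B", "amount": 50},
    {"customer_id": 2, "order_id": "C", "amount": 70},
]
visits = [
    {"customer_id": 1, "visit_id": "v1"},
    {"customer_id": 1, "visit_id": "v2"},
    {"customer_id": 1, "visit_id": "v3"},
    {"customer_id": 2, "visit_id": "v4"},
]

# Naive join: both sides repeat customer_id, so rows multiply
# (customer 1 contributes 2 orders x 3 visits) and amounts are counted repeatedly.
naive = join_on(orders, visits, "customer_id")
naive_total = sum(row["amount"] for row in naive)

# The check that was skipped: is the key unique on the side we join in?
problem_keys = duplicated_keys(visits, "customer_id")

# One possible fix: aggregate visits to one row per customer before joining.
visit_counts = Counter(v["customer_id"] for v in visits)
visits_per_customer = [
    {"customer_id": c, "n_visits": n} for c, n in visit_counts.items()
]
safe = join_on(orders, visits_per_customer, "customer_id")
safe_total = sum(row["amount"] for row in safe)

print(len(orders), len(naive), len(safe))
print(naive_total, safe_total)
```

Comparing row counts and totals before and after a join is cheap, and it surfaces this kind of mistake long before it contaminates a model.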
Posted 12 April 2021
© 2021 TechTarget, Inc.