
Summary:  If you are wondering how to become a Data Scientist or what that title really means, try these insights.

I got started in data science way back.  I’ve been a commercial predictive modeler since 2001 and as naming trends have changed I now identify myself as a Data Scientist.  No one gave me this title.  But by observing the literature, the job listings, and my peers in the field it was clear that Data Scientist communicated most clearly what my knowledge and experience have led me to become.

These days you can get a degree in data science and show a diploma that certifies your credentials.  But these programs are relatively new so, with all due respect, if you only recently got your degree you are still a beginner.  Those of us who use this title today most likely came from combined backgrounds in business, hard science, computer science, operations research, and statistics.

What you call yourself is one thing but what your employer or client is looking for can be quite a different kettle of fish.  A lot has been written about data scientists being as elusive as unicorns.  Not being a unicorn I’d say this sets the bar pretty high.  At the same time, as I’ve perused the job listings, I've found the title used so loosely and with so little understanding that an ad for a data scientist may actually describe an entry-level analyst, while some ads for analysts are looking for polymath data scientists.

All of this confusion over what we’re called and what we actually do can make you downright schizophrenic.  It also makes it all the more complicated to answer the frequent inquiries I get from folks still in school or early in their careers about how to become a data scientist.

Imagine my surprise and delight when in the space of a week two publications came across my desk that not only cast new light and understanding on this question but also have helped me understand that there is not just one definition of data scientist, but a reasoned argument (based on statistical analysis) that there are in fact four types.

Four Types of Data Scientists

The information here comes from the O’Reilly paper “Analyzing the Analyzers” by Harris, Murphy, and Vaisman, 2013.  My hat’s off to these folks for their insightful survey and conclusions drawn by statistical analysis of those results.  This is a must read.  I was able to download this at no charge from http://www.oreilly.com/data/free/analyzing-the-analyzers.csp.

There are 40 pages of good analysis here so this will be only the highest level summary.  In short, they conclude there are four types of Data Scientists differentiated not so much by the breadth of knowledge, which is similar, but their depth in specific areas and how each type prefers to interact with data science problems.

  1. Data Businesspeople

  2. Data Creatives

  3. Data Developers

  4. Data Researchers

By evaluating 22 specific skills and multi-part self-identification statements they cluster and generalize according to these descriptions.  I am betting you will recognize yourself in one of these categories.

Data Businesspeople are those that are most focused on the organization and how data projects yield profit. They were most likely to rate themselves highly as leaders and entrepreneurs, and the most likely to have reported managing an employee. They were also quite likely to have done contract or consulting work, and a substantial proportion have started a business. Although they were the least likely to have an advanced degree among respondents, they were the most likely to have an MBA. But Data Businesspeople definitely have technical skills and were particularly likely to have undergraduate Engineering degrees. And they work with real data — about 90% report at least occasionally working on gigabyte-scale problems. 

Data Creatives.  Data scientists can often tackle the entire soup-to-nuts analytics process on their own: from extracting data, to integrating and layering it, to performing statistical or other advanced analyses, to creating compelling visualizations and interpretations, to building tools to make the analysis scalable and broadly applicable. We think of Data Creatives as the broadest of data scientists, those who excel at applying a wide range of tools and technologies to a problem, or creating innovative prototypes at hackathons — the quintessential Jack of All Trades. They have substantial academic experience with about three-quarters having taught classes and presented papers. Common undergraduate degrees were in areas like Economics and Statistics. Relatively few Data Creatives have a PhD. As the group most likely to identify as a Hacker they also had the deepest Open Source experience with about half contributing to OSS projects and about half working on Open Data projects.

Data Developers.  We think of Data Developers as people focused on the technical problem of managing data — how to get it, store it, and learn from it. Our Data Developers tended to rate themselves fairly highly as Scientists, although not as highly as Data Researchers did. This makes sense particularly for those closely integrated with the Machine Learning and related academic communities. Data Developers are clearly writing code in their day-to-day work. About half have Computer Science or Computer Engineering degrees.  More Data Developers land in the Machine Learning/Big Data skills group than other types of data scientist.

Data Researchers.  One of the interesting career paths that leads to a title like “data scientist” starts with academic research in the physical or social sciences, or in statistics. Many organizations have realized the value of deep academic training in the use of data to understand complex processes, even if their business domains may be quite different from classic scientific fields. The majority of respondents whose top Skills Group was Statistics ended up in this category. Nearly 75% of Data Researchers have published in peer-reviewed journals and over half have a PhD.

What Does this Mean for Someone Seeking to Enter the Field?

So if I am a young person seeking to enter Data Science how are these descriptions useful?  It’s possible that you could train and develop an emphasis that would lead you into the Researcher, Developer, or Creative roles.  It is less likely that education alone will put you on the Businesspeople track which implies experiences in business, not just education.  But here’s what’s interesting.  According to Harris, Murphy, and Vaisman it’s not the skills that are different but the way we choose to emphasize them in our approach to Data Science problems.  Here’s their chart.

The skills are the same but the emphasis we place on them varies.  Perhaps a better way to say this is how do you prefer to spend your day?  Programming, working in Machine Learning (statistics), analyzing and resolving business questions?  Your answer as a younger person fresh from education may be different from the more mature you a few years down the path and that’s OK.  However, if you know now that you identify as a Data Researcher then it appears that the statistical skills need to be your focus.  If you identify as a Creative or Developer then programming and ML/Big Data are an appropriate emphasis.  And as you gain experience and learn whether you are happier as a team member or a business leader you may shift your focus to project profitability and the solution of business problems.
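To make the idea of "same skills, different emphasis" concrete, here is a minimal sketch of how one might summarize a person's self-ratings into an emphasis profile and pick out their top skill group.  This is not the method used by Harris, Murphy, and Vaisman; the skill names, the grouping, the 1-to-5 rating scale, and the sample respondent are all hypothetical and chosen only for illustration.

```python
from statistics import mean

# Hypothetical skill groups and member skills (illustrative only;
# not the survey's actual 22 skills or its grouping).
SKILL_GROUPS = {
    "Business": ["product development", "business"],
    "ML/Big Data": ["machine learning", "big and distributed data"],
    "Programming": ["back-end programming", "systems administration"],
    "Statistics": ["statistics", "visualization"],
}


def emphasis_profile(ratings):
    """Average a respondent's self-ratings within each skill group."""
    return {
        group: mean(ratings[skill] for skill in skills)
        for group, skills in SKILL_GROUPS.items()
    }


def top_skill_group(ratings):
    """Return the skill group with the highest average self-rating."""
    profile = emphasis_profile(ratings)
    return max(profile, key=profile.get)


# A made-up respondent rating each skill from 1 (weak) to 5 (strong).
respondent = {
    "product development": 2,
    "business": 3,
    "machine learning": 4,
    "big and distributed data": 2,
    "back-end programming": 3,
    "systems administration": 1,
    "statistics": 5,
    "visualization": 4,
}

profile = emphasis_profile(respondent)
top_group = top_skill_group(respondent)
```

For this made-up respondent every skill group gets a non-zero score, which is the point: the breadth is there, and it is the highest-scoring group that hints at where the emphasis lies.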

Where Does Big Data Fit in all This?

Personally I love Big Data.  But I actually love it for the attention it draws to predictive analytics.  If you drew a Venn diagram of Big Data and Predictive Analytics there would be a big but not perfect overlap.  There are areas of Big Data that are purely operational and outside the realm of data science.  Take for example the use of NoSQL databases as the operational databases powering large multiplayer online games.  No analysis there.  Just getting it done.  Likewise there is a lot of room in predictive analytics that has nothing to do with Big Data.

However, there’s no reason we shouldn’t learn about Big Data on our path to becoming data scientists.  Just don’t expect to see a lot of it in your professional life unless you are engaged with big-web-users like Amazon or Facebook.

Once again, thanks to Harris, Murphy, and Vaisman we can see how often today’s data scientists work at the Petabyte and Terabyte level.

The answer is not very often at all.  Yes, NoSQL document databases like Mongo are gaining traction as the way to blend transactional and unstructured data and that may be the future.  But frankly, in terms of volume, data scientists are most often working at normal scales of data, not Big Data.

What Tools and Languages are Most Important?

The second document to come across my desk is the blog by Robert A. Muenchen, “The Popularity of Data Analysis Software”, which can be found at r4stats.com and is another must-read.  What separates this piece from other comparison reviews is the depth and variety of its analysis.  Muenchen uses 13 separate types of analysis to rate market share and popularity and, to his credit, does not try to reconcile the results, which can differ considerably depending on the source.  As he says, here’s the list of measures “in approximate order of usefulness”.

  1. Job Advertisements

  2. Scholarly Articles

  3. Books

  4. Website Popularity

  5. Blogs

  6. Surveys of Use

  7. Discussion Forum Activity

  8. Programming Activity

  9. Popularity Measures

  10. IT Research Firm Reports

  11. Sales or Download Measures

  12. Competition Use

  13. Growth in Capability

If you’ve been a practitioner for a while then your toolbox is probably already pretty well defined.  Where this is really helpful is in answering the questions of young people entering the Data Science field, ‘what should I learn to use?’

This won’t talk you out of using SPSS, SAS, R, or Python but it will show you some interesting trends.  Once again, you’ll have to read the blog since it is so rich in content, and it is left to the reader to evaluate results which can sometimes seem contradictory.  However, if I were trying to answer the ‘what should I study’ question I would look at these two graphs, at least, from the Muenchen blog.

Since getting a job should be foremost in your mind as you invest in your education, this look at total job listings requesting specific analytic software skills is an eye-opener.

Similarly, this graph based on survey of use data gives a very insightful look into what data scientists are using today.

I’m not going to try to answer the question, ‘what should I study’ other than to say the obvious: Java, R or Python, SAS or SPSS.  Frankly, it’s most likely to be whatever your professor wants you to use, which just as often seems to depend on how good an academic deal the particular analytic platform vendor made available.

R or Python?  I’m not touching that one except to say that there’s an interesting chart implying that Python is accelerating ahead of R.

If you’re looking for the answer to how to become a Data Scientist and what you should be learning, think of this as your first challenge.  Study the source material and draw your own conclusions.  I’m just happy these authors have brought this material together and hope they keep it updated in the future.

 

August 25, 2014

Bill Vorhies, President & Chief Data Scientist – Data-Magnum - © 2014, all rights reserved.

About the author:  Bill Vorhies is President & Chief Data Scientist at Data-Magnum and has practiced as a data scientist and commercial predictive modeler since 2001.  He can be reached at:

Bill@Data-Magnum.com

The original blog can be viewed at:

http://data-magnum.com/how-to-become-a-data-scientist/

Comment by Ramin Mehrani on December 15, 2015 at 11:56pm

Thank you very much for sharing such a good classification.

Comment by Sione Palu on December 15, 2015 at 9:51am

I would rephrase it as "How to become a data-analyst". That's very appropriate.

Richard Feynman said something like that over 30 years ago:

https://www.youtube.com/watch?v=IaO69CF5mbY

Comment by Vincent Granville on December 15, 2015 at 8:13am

Some high level data scientists do not necessarily code (though they typically know how to code). They manage people who manage programmers. Their job title might be Chief Data Scientist.

Comment by Sonia Prolac on December 15, 2015 at 12:57am

You also need some programming background to begin, preferably in Python. Most other things on this guide can be learned on the job (like random forests, pandas, A/B testing), but you can't get away without knowing how to program!

Comment by Martin Squires on July 8, 2015 at 3:56am

Hi Khurram,

My viewpoint would be that it comes down to the different types of data scientist mentioned earlier in the post. I suspect Indeed.com data reflects the data developer roles where Java, C# and Python are core to people developing data applications eg building dashboards in tools like Spotfire, rather than the data business people or data researchers where SAS and SPSS or R are the main tools of choice for decision trees etc.

Comment by Khurram on July 8, 2015 at 2:34am

Very nice article; I am lucky to have come across such a valuable article. A quick question: why does Java stand out on top for analytics software? It is a high-level language that is good for relational transactional databases, but I don't know whether it is good for statistical analysis or whether it can run algorithms like decision trees, KNN, regression, etc. R or RapidMiner, which come with all these algorithms for data analysis, would make more sense.

Comment by Livan Alonso on September 28, 2014 at 6:54pm

Hi Patrick,

Dr Vincent Granville is mentoring a Data Science Apprenticeship program. There are many projects to work on (here there is a list of projects) , where you can train your data science skills. I hope you enjoy it.

Comment by Patrick L. Hagerty on September 28, 2014 at 1:02pm

Thanks for the reply.

Well, I also have about 20+ years in software engineering, including several years of Java.  But that was before my stint in aerospace systems QA.  My software skills are a little bit rusty, but I have no doubt I could get back into it with little effort.  What's more, my introduction to data science has led to my discovering R and I spend much of my free time exploring it guided by the several texts on the language I've picked up, and practicing with those very same publicly available databases to which you refer.

As for schooling, I'm enrolling in a local college here in Milwaukee.

That salary thing, though.  That could be a problem.

Comment by William Vorhies on September 28, 2014 at 10:54am

Patrick:

That's a tough one.  I started late and made the switch so I presume others can too.  Mid-career switching, it seems, has more to do with your need for a paycheck at your current level, since this almost always means stepping back a bit.  I'm not aware of any apprenticeships but there are plenty of entry-level jobs for people with the right skills.  That generally means Java, R or Python, SAS or SPSS and some experience at working ML problems.  Once you have the tools you can easily get the practice with publicly available datasets, some of which are available right here on datasciencecentral.com.  For an educational approach try one of the free online MOOCs like the Coursera data science curriculum, from which you can also earn a certificate.

Good luck.

Comment by Patrick L. Hagerty on September 27, 2014 at 11:40am

Got a question for you, Bill.

Under the heading "What Does this Mean for Someone Seeking to Enter the Field?" you start by saying, "So if I am a young person seeking to enter Data Science [...]"  What if you're not quite so young? What if you've worked with some limited aspects of data science, but want to immerse yourself more deeply in the field?

I'm a quality assurance engineer and have done some limited data mining and basic statistical analysis exercises to identify defect trends and related factors.  I've just recently come across literature on Big Data, predictive analytics, and data analysis and have gotten very excited about the field.  I'm hoping there might exist somewhere an opportunity for someone like myself to leverage my background in quality assurance to wedge myself into an apprenticeship role or similar entry level position in serious data science.

Any thoughts, anyone?

Cordially yours,

Patrick L. Hagerty

Systems Quality Assurance Engineer

Astronautics Corporation of America

Milwaukee, WI
