There is no question that comes up more frequently than ‘how do I become a data scientist’. I’ve actually written several articles on this topic (and will reference them liberally in this post) but they lacked the global perspective that potential new entrants to data science want. I’m going to try to resolve that here.
Data Scientist versus Doing Data Science
I thought about changing the title to “Doing Data Science” instead of becoming a Data Scientist to focus on the activity and not just the job title. There are two good reasons. First, not everyone doing data science is necessarily a data scientist. Second, it’s often far from clear what the “data scientist” job title actually means. Data scientists are not born fully formed. For all but a tiny percentage of us there is career progression. As in any profession you start off a junior and progress through experience and learning. There aren’t any clearly recognized titles that differentiate senior from junior positions but there probably should be.
In the final analysis though it’s the job title that seems to interest folks the most, presumably because there are both substantial opportunities and good paychecks attached. (For a good current salary survey see: http://www.analyticbridge.com/group/salary-trends-and-reports/forum/topics/salary-trends-for-data-science-professionals) And the shortage of people who can do data science is likely to persist into the future. How long? Well, Gartner says that only about one in eight companies has adopted this discipline and technology so far, even though they describe it as fully mature (I’m talking about predictive analytics, not necessarily Big Data). So my personal guess is that this skills shortage and opportunity has a good 10 year run ahead, plenty of time for you to get on board.
Who Is This Article For
Do you need a Ph.D.? Absolutely not. However, if you’ve got a fresh one from outside data science, for example in the hard sciences, and you want in to data science, employers are looking for you. Check out the Stanford Insight post doc fellowship program in data science (http://insightdatascience.com). It is fully funded by anxious employers with a 100% hiring guarantee. For others, just Google “post doc data science fellowship programs” and you’ll find several of them.
However, this article is definitely not written for Ph.Ds. It’s for those of you in college, freshly out with bachelors or masters, and for those of you with a few years since graduation. It also describes career paths for those of you at the junior or mid-levels of IT or data analyst roles.
What’s In a Job Title?
A year ago I wrote an article about “How to Become a Data Scientist” (http://www.datasciencecentral.com/profiles/blogs/how-to-become-a-data-scientist). As I perused the job listings one of my first observations was that the title is used so loosely and with such little understanding that an ad for data scientist may actually describe an entry level analyst and some ads for analysts are looking for polymath data scientists. If you’re a job seeker you need to look carefully at what’s under the hood to see if you’re really going to get to do data science.
How do you tell if it’s Data Science?
‘Data science’ is so ambiguous. It would be so much better if we called it what it is: predictive analytics. It’s PREDICTIVE. It’s about the future and what will happen, not the past.
If employers sometimes call analysts data scientists and sometimes the reverse then what is at the core that tells you this is data science? I offer this simple test:
If you are using data to describe what happened in the past this is traditional business intelligence (BI) and that’s what data analysts focus on. Lots of successful careers have been built on BI but in the end you are presenting historical data to ‘business experts’ (often the most senior or best paid person in the room) who will offer their opinion about what it means and what actions should be taken ‘based on their years of experience’.
If you are doing data science, you are using tools that ‘discover the signal in the data’ without (almost) any human filtering. Data science tells us what the data predicts about the future and what actions we should take as a result (predictive and prescriptive analytics). This can be about human behaviors (why they come, why they stay, why they go, what they will buy next, what they will buy the most of) or about predicting physical values (e.g. the spot price of oil next week, the specific performance or failure of a complex machine or system, is that tumor cancerous or benign).
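To make the test concrete, here is a minimal Python sketch of the two mindsets applied to the same numbers. The sales figures and the function names (describe_past, fit_trend, predict_next) are made up for illustration, and the straight-line trend is only a stand-in for whatever model a real project would use.

```python
from statistics import mean

# Hypothetical monthly unit sales; the numbers are invented for illustration.
monthly_sales = [100, 110, 120, 130, 140, 150]


def describe_past(values):
    """BI-style summary: reports what already happened."""
    return {"total": sum(values), "average": mean(values), "best": max(values)}


def fit_trend(values):
    """Fit y = intercept + slope * t by ordinary least squares, with t = 0, 1, 2, ..."""
    n = len(values)
    t_mean = (n - 1) / 2
    y_mean = mean(values)
    covariance = sum((t - t_mean) * (y - y_mean) for t, y in enumerate(values))
    variance = sum((t - t_mean) ** 2 for t in range(n))
    slope = covariance / variance
    intercept = y_mean - slope * t_mean
    return intercept, slope


def predict_next(values):
    """Predictive-style answer: estimate the next period, which has not been observed yet."""
    intercept, slope = fit_trend(values)
    return intercept + slope * len(values)


summary = describe_past(monthly_sales)   # looks backward
forecast = predict_next(monthly_sales)   # looks forward
print(summary)
print(round(forecast, 1))
```

The first function is what a BI report delivers. The second is the simplest possible version of letting the data itself say something about next month.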
If you are doing data science you will be using Machine Learning (ML) tools. The core definition of ML tools includes things like regression, decision trees, neural nets, and the like. The great majority of folks who enter the data science field can spend their entire careers creating substantial value with just these tools.
However, it would only be fair to point out that since the introduction of NoSQL databases and our ability to analyze unstructured and semi-structured data, the definition of ML tools has been greatly expanded to include tools that give less specific and more directional guidance. These include Natural Language Processing (NLP), Recommenders (who should I date, what should I watch, what price should I pay or charge), the Internet of Things (IoT), Image Processing, and Deep Learning. (See my article “How NoSQL Fundamentally Changed Machine Learning” http://www.datasciencecentral.com/profiles/blogs/how-nosql-fundamentally-changed-machine-learning).
Two Different Worlds of Data Science (DS)
Remember that we are going to focus on how to enter this profession, not what your skills will be when you fully mature as a data scientist. Also remember that this is a young profession and the skills you need today will hold for five or ten years but by then the technology and knowledge will have carried us into areas not yet imagined, just as occurred over the last ten years. This is a profession that requires life-long-learning and intense curiosity.
One of the things that is rapidly emerging is that it is now possible to specialize in some of the cutting edge ML techniques and to build a career around that specialty. There is a need for these specialists and if that particularly appeals then you should explore it. However, it’s not something that I recommend for most new entrants because of the difference in opportunities between what I will call Core DS and Big Web User DS.
Let’s start with the Big Web User DS. These are B2C companies that interact with their customers primarily through the internet and they cover the waterfront from search, to ecommerce, social media, dating sites, MMOGs, and content sites. Think Google, Amazon, eHarmony, Twitter, Yahoo, and World of Warcraft.
What these companies have in common is that they have exploited the newest DS tools that are rapidly arising from our new abilities in ML on NoSQL: recommenders, NLP, dynamic pricing optimizers, image processing, IoT, and deep learning. If you want to work at the very cutting edge of DS then you can specialize in one of these areas and probably spend many productive years doing just that with no exposure to any of the other DS disciplines.
You will almost certainly work directly in code, probably R or Python, and that code will be incorporated directly into the ‘product’, because the product is what those companies are. You also have about a 90% probability of living in San Francisco, Los Angeles, or New York because frankly that’s where those jobs are.
Core DS: Core DS includes applications in all the industries not mentioned above, from brick-and-mortar retailers with add-on ecommerce functions, to manufacturing, finance, insurance, utilities, health care, government, education, transportation, services, and literally everything else. These can be B2C or B2B and the tie that binds is that they all have customers and/or they all have machines, processes, commodities, or scientific phenomena that they need to understand and predict.
For customers these are the why they come, why they stay, why they go, what will they buy next questions. For machines, processes, and phenomena this is about predicting specific future values, failure points, or techniques of optimization. There is literally not a mid-size or larger company anywhere in the world that could not operate more profitably and efficiently if they knew the answers to these questions.
The balance of this article will focus on Core DS. Why? Because the need and jobs are everywhere. There are many more of these jobs, and they are much easier for new entrants to access.
And you are not cut off from the leading edge techniques used in the Big Web User world. NLP, recommenders, optimizers, and IoT are all finding their way into the Core DS world but only as tools, not as your primary focus. This gives you time to master the fundamentals and grow into these new areas as your interest and talents guide you.
Does This Mean I Must Become a Mathematician or Programmer?
Absolutely not – mostly. You do not need to be a programmer or a mathematician in Core DS. However a little SQL, R, or Python will certainly help your resume. And the math you must master is a manageable set of statistics that will help you judge whether the models you construct are actually robust and predictive.
The Core ML tools like regression, segmentation, decision trees, and neural nets (there are many more) are all themselves mathematical algorithms that were developed in academia. None are proprietary. All are available open source. When turned loose on a properly formatted body of data these ML tools can discover ‘the signal in the data’ which they express as a predictive algorithm. We call this output a predictive model.
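For readers who would like to see what ‘discovering the signal’ looks like in practice, here is a toy Python sketch of the simplest possible decision tree, a single split (sometimes called a decision stump). The customer data is invented, the names fit_stump and predict_churn are hypothetical, and the rule direction (short tenure means churn) is fixed in advance to keep the example short. Real ML tools search far more broadly than this.

```python
# Hypothetical training data: (months_as_customer, churned) pairs, invented for illustration.
training_data = [
    (1, True), (2, True), (3, True), (4, True),
    (6, False), (9, False), (12, False), (24, True),
]


def fit_stump(rows):
    """Learn the tenure threshold that misclassifies the fewest training rows.

    Simplifying assumption: the rule is always 'predict churn when tenure is
    below the threshold'. A real decision tree would also test the opposite
    direction, other variables, and further splits.
    """
    tenures = sorted({tenure for tenure, _ in rows})
    # Candidate thresholds are the midpoints between neighboring tenure values.
    candidates = [(a + b) / 2 for a, b in zip(tenures, tenures[1:])]
    best_threshold, best_errors = None, len(rows) + 1
    for threshold in candidates:
        errors = sum((tenure < threshold) != churned for tenure, churned in rows)
        if errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold, best_errors


def predict_churn(threshold, tenure):
    """The 'predictive model': apply the learned rule to a new customer."""
    return tenure < threshold


threshold, training_errors = fit_stump(training_data)
print(threshold, training_errors)
print(predict_churn(threshold, 2), predict_churn(threshold, 10))
```

Nobody told the algorithm where to draw the line. It found the threshold from the data, and the resulting rule can be applied to customers it has never seen. Note too that one training row is still misclassified, which is exactly why you need the statistics to judge whether a model is robust.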
You can access and learn these ML tools by learning code. R and Python (among many others) are widely taught in schools today but these are not required in order to use these ML tools. Most DS practitioners will use ML tools via a specialty package or application designed to make life easier than coding. These include the widely adopted for-profit packages by SAS and SPSS (plus many others), or free open source packages like WEKA or PSPP (plus many others). Personally I am not a fan of coding though many are. To me any time spent writing, debugging, or otherwise focusing on code is time my expensive and talented data scientists could better spend thinking about the data.
The package or code of your choice can be learned from colleges, MOOCs, academies, for-pay instructional courses, on-line tutorials, and even paper manuals. There are also well defined certifications for most widely adopted versions that will help you establish your credential in an interview. If you want to pick, I suggest you go on DICE or your favorite job board and count the number of job openings available for each in the place you would most like to live. You will find SAS, SPSS, R, and Python all right up there at the top. (Once again see “How to Become a Data Scientist” http://www.datasciencecentral.com/profiles/blogs/how-to-become-a-data-scientist).
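If you would rather not count by hand, the tally can be sketched in a few lines of Python. This assumes you have already copied the text of the listings into a list. The sample listings below are invented, count_listings_by_tool is a hypothetical helper, and nothing here talks to DICE or any other job board.

```python
import re
from collections import Counter

TOOLS = ["SAS", "SPSS", "R", "Python"]

# Hypothetical job listing snippets, invented for illustration.
listings = [
    "Data scientist: build churn models in R or Python; SQL a plus.",
    "Predictive modeler with SAS certification; SPSS experience helpful.",
    "Junior analyst: data preparation in SAS and Python.",
    "Marketing analyst: Excel reporting and dashboards.",
]


def count_listings_by_tool(texts, tools):
    """Count how many listings mention each tool at least once."""
    counts = Counter({tool: 0 for tool in tools})
    for text in texts:
        for tool in tools:
            # Case-sensitive whole-word match, so "R" is not found inside
            # other words. It can still be fooled by strings like "R&D".
            pattern = r"(?<!\w)" + re.escape(tool) + r"(?!\w)"
            if re.search(pattern, text):
                counts[tool] += 1
    return counts


counts = count_listings_by_tool(listings, TOOLS)
for tool in TOOLS:
    print(tool, counts[tool])
```

Run it once per city you are considering and compare the counts.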
School versus Experience
In industry today only a tiny fraction of data scientists have Masters or Ph.Ds. in data science. Most of us have come to data science through OR, statistics, computer science, the physical sciences or other paths. Right now there is an explosion of certificate and degree granting programs of all levels just coming on line. If you’re under 25 or have the time and resources to reskill this is a good way to go. There are even free programs such as the apprenticeship offered through DataScienceCentral.com and some of the MOOCs.
Here’s one of many fairly comprehensive lists of non-university programs that will train you to become a data scientist: http://yet-another-data-blog.blogspot.com/2014/04/data-science-bootcamp-landscape-full.html
However, it’s also possible to start doing data science incrementally, starting at a junior level, drawing a pay check, and getting trained up as you go with on-the-job training (OJT).
How Do Companies Organize For Predictive Analytics?
In the Core DS world there seem to be two patterns for starting out but only one that succeeds. Once the need has been recognized roughly half of companies go out and hire a unicorn. Haven’t heard of that job title? A unicorn, a mythical beast, is a polymath data scientist who can do it all. They’re expensive, probably $150K to $250K and up and frankly they don’t like doing the junior level work. Companies may then promise to build analytic staff to support the unicorn since if they don’t (and sometimes even if they do) this expensive asset with a year’s worth of the company’s experience under their belt will depart for greener pastures. (See “Do You Really Need a Unicorn” http://www.datasciencecentral.com/profiles/blogs/do-you-really-need-a-unicorn)
The more logical pattern, and the pattern that even the unicorn-shoppers eventually converge on, is to build from the bottom up. What are the critical needs and what are the roles that demand the most time? The needs are 1.) Knowledge of the business, and 2.) Knowledge of the data that’s already on hand. The predictive analytics role that requires the most time is the data wrangler, who will provide about 80% of the labor in each modeling project. The model jockey and senior data scientist / analytics architect-manager roles require little time during startup, and many companies would be well advised to find a consulting data scientist to help organize, mentor, and train while performing the modeling and more senior duties. In year two or three, when the volume of models being deployed reaches say 6 or 8 or more, the top layers can be brought in-house.
What Does Career Progression Look Like and What Exactly Will I be Doing?
The key takeaway from successful bottom-up development of a predictive analytics practice is that many of the key requirements are for people who know the business and know the data. Especially at the entry levels of data prep and junior predictive modeling, this favors companies promoting from within and training up the folks they have.
If 80% of the effort is in data prep, about 10% in actual modeling, and the remaining 10% in organizing the project, deploying the production code, and monitoring to see when the models need to be refreshed, then there are many entry level opportunities. As a ‘junior data scientist’ or data wrangler your early tasks are likely to be doing the grunt work of preparing the data under the supervision of an experienced predictive modeler (a model jockey).
If you are already a data analyst or the person in IT that pulls the data for the analyst you already know more than half of this job. In many organizations the best person for this role is someone who already knows the company and its data and has the enthusiasm to self-nominate. Of course there’s more you need to know but you can either be self-taught, guided by the more senior predictive modeler, or even get your company to pay for training OJT.
When you get this role you should immediately look for opportunities to also start to produce the predictive models yourself. This is how you’ll get the experience to move up. That plus a little more training perhaps at company expense. If you’re using a package like SAS or SPSS all the tools you’ll need are right there and there’s nothing like a little experimentation to cement learning.
In my view the minimum requirement for a Core DS effort is three roles: the data wrangler, the model jockey, and a senior project lead whose task it is to communicate the project and its solution to the C-Levels, the SMEs, the functional project owner, and the IT staff that must implement, monitor, and maintain the production code that is the end product of the predictive analytics effort. Once the company has several models deployed it may want to add a Model Deployment Lead, where the key knowledge base is how the company’s operational systems are organized and deployed. This role can also favor a motivated insider.
Notice that all of the above tasks can be accomplished purely with packages like SAS or SPSS without ever having to touch a line of code. However, you will increase in value if you add some or all of these to your skill set: SQL, Python or R, NoSQL, recommenders, and Natural Language Processing. The list goes on from here. If you are truly on the path to becoming a data scientist you will know that this is about life-long learning and you will always be expanding and perfecting your skills.
July 1, 2015
Editorial Director, DSC