Summary: The world of data science is splitting into two distinct camps: the start-up app world and the commercial world. The good news is that almost all the opportunity lies in commercial predictive analytics, where you can broadly specialize and still play with all the latest innovations.
In case you’re the only person who hasn’t heard this phrase, data scientists have increasingly been referred to as ‘unicorns’ as in ‘as rare as a unicorn’. As a data scientist I have taken exception to this since it seems to set an unrealistically high bar and simply isn’t true of the many data scientists I know personally. (See my earlier article “How to Become a Data Scientist”).
In February I spent several days at the Strata Conference catching up on all things analytic and big data, and yes, there was still a fairly strong theme around the difficulty of finding these unicorns. One very persuasive speaker actually spoke about how to effect the capture, which in his version was to take fresh Ph.D.s in math, statistics, OR, or computer science and train them up himself. Well, I thought, 1) if the future is limited to data scientists with fresh Ph.D.s then the supply is indeed vanishingly small, and 2) what are the rest of us supposed to do for talent? Three things became apparent.
1. The days of finding a super-polymath data scientist are over. The field of data science has just become too big and too diverse for any one individual to master all its disciplines. This is like trying to find a single medical doctor who can cure cancer, deliver babies, diagnose tropical diseases, and perform open heart surgery. The era of specialization is upon us. I don’t care how hard you look: a single data scientist who is equally competent at image processing, natural language processing, graph databases, recommenders, big data, and all the variants on predictive modeling, who can conceptualize and solve analytics-based problems across all processes in all industries, and who is at home writing R, Python, Julia, and Scala while at the same time running a modeling factory with SAS, is in fact rarer than a unicorn. That person simply does not exist. The good news is that specialization will be good for us. It will allow practitioners to focus on what they like and what they are best at.
2. Is there really anyone who actually needs a Unicorn? Probably not. But the thing about the Strata Conference, which is always held in San Jose, is that you are in the app-icenter of big data and new and innovative ways to use analytics. At Strata you are likely to get a very biased view of what’s going on with analytics and big data since, by location and by its focus on what is new and novel, presentations necessarily skew to the cutting edge. As conferences go, that’s the way it should be for practitioner data scientists. But that cutting-edge skew does not reflect the needs of our customers, the business people who are paying our bills. Silicon Valley start-ups may indeed benefit from polymath near-unicorns, but I propose that they represent a very tiny percentage of practicing data scientists, both by job count and by qualified candidates.
3. Our world of Data Science is dividing in two. This is perhaps best illustrated by the debate over working in Python, R, Julia, or Scala versus working with advanced drag-and-drop modeling packages like SAS or SPSS. Supporters of coding in open source R or Python exult in the perfection of their code and the fact that it can do anything that packages can do. Supporters of packages like SAS say, ‘Why waste time coding when I can produce 10 times as many models in the same time as you coders?’ And if you weren’t aware, packages like SAS now do Natural Language Processing, graph analytics, optimization, and recommenders just like open source. In fact, some packages allow you to import R models or work directly in R alongside the internally generated models.
The great divide that is becoming evident seems to be this. If you are part of the start-up, high-innovation, cutting-edge app world, you are much more likely to be called on to utilize your R or Python skills. Basically, this is custom code that becomes a significant element of the product. You might even be a unicorn in training.
However, if you are out there in the business world in insurance, banking, ecommerce, manufacturing, brick-and-mortar retail, or any of the other well-established B2C or B2B models (including, by the way, government), then you are much more likely to be using a package like SAS or SPSS.
Is manual coding in R or Python likely to take over this space? Not very likely. Taking just SAS, they’re a $3 billion company with over 33% market share, and their customers include 93 of the top 100 companies on the Fortune Global 500. By sales or by head-count, this part of our Data Science world accounts for probably 90% of the folks identifying as data scientists. These analytics powerhouses are running modeling factories where turning out one or two new models a day just won’t cut it. They’re making use of cutting-edge in-memory analytics that makes model creation a volume event and can take in Big Data quantities of structured and unstructured data to enhance modeling. In short, they’re doing about 95% of the things that the most innovative start-ups are doing, and they’re doing it at volume and with a sharp focus on the business top and bottom line.
So how should we respond to these trends: the declining need for unicorns, the reality of broad specialization, and the split of our world into writing code central to start-up products versus the much larger production world of predictive analytics and data viz? It’s important for folks who are thinking of becoming data scientists to understand that the great majority of opportunities do not require unicorns. You can specialize (broadly), work on packages that make analytics faster than coding, and still play with all the really cool innovations that are constantly arising.
March 16, 2015
Bill Vorhies, President & Chief Data Scientist – Data-Magnum - © 2015, all rights reserved.
About the author: Bill Vorhies is President & Chief Data Scientist at Data-Magnum and has practiced as a data scientist and commercial predictive modeler since 2001. He can be reached at: