
Data Science is Changing and Data Scientists will Need to Change Too – Here’s Why and How

Summary:  Deep changes are underway in how data science is practiced and successfully deployed to solve business problems and create strategic advantage.  These same changes point to major changes in how data scientists will do their work.  Here’s why and how.

 

There’s a sea change underway in data science.  It’s changing how companies embrace data science and it’s changing the way data scientists do their job.  The increasing adoption and strategic importance of advanced analytics of all types is the backdrop.  There are two parts to this change. 

One is what is happening right now as analytic platforms build out to become one-stop shops for data scientists.  But the second and more important is what is just beginning but will now take over rapidly.  Advanced analytics will become the hidden layer of Systems of Intelligence (SOI) in the new enterprise applications stack. 

Both these movements are changing the way data scientists need to do their jobs and how we create value.

 

What’s Happening Now

Advanced analytic platforms are undergoing several evolutionary steps at once.  This is the final buildout in the current competitive strategy being used by advanced analytic platforms to capture as many data science users as possible.  These last steps include:

  1. Full integration from data blending, through prep, modeling, deployment, and maintenance.
  2. Cloud based so they can expand and contract their MPP resources as required.
  3. Expanding capabilities to include deep learning for text, speech, and image analysis.
  4. Adopting higher and higher levels of automation in both modeling and prep, reducing data science labor and increasing speed to solution. Gartner says that within two years 40% of our tasks will be automated.

Here are a few examples I’m sure you’ll recognize. 

  • Alteryx with roots in data blending is continuously upgrading its on-board analytic tools and expanding access to third party GIS and consumer data such as Experian.
  • SAS and SPSS have increased blending capability, incorporated MPP, and most recently added enhanced one-click model building and data prep options.
  • New entrants like DataRobot emphasize labor savings and speed-to-solution through MPP and maximum one-click automation.
  • The major cloud providers are introducing complete analytic platforms of their own to capture the maximum number of data science users. These include Google’s Cloud Datalab, Microsoft Azure, and Amazon SageMaker.

 

The Whole Strategic Focus of Advanced Analytic Platforms is About to Change

We are in the final stages of large analytics users wanting to assemble different packages in a best-of-breed strategy.  Gartner says users, starting with the largest, will increasingly consolidate around a single platform.

These same consolidation forces were at work in ERP systems in the 90s and in DW/BI and CRM systems in the 00s.  The playbook is to give the customer greater efficiency and ease of use with a single-vendor solution, creating a wide moat of good user experience combined with painfully high switching costs.

This is only the end of the last phase and not where advanced analytic platforms are headed over the next two to five years.  So far the emphasis has been on internal completeness and self-sufficiency.  According to both strategists and venture capitalists, the next movement will see the advanced analytic platform disappear into an integrated enterprise stack as the critical middle layer, the System of Intelligence.

 

Why the Change in Strategy – and When?

The phrase Systems of Intelligence (SOI) was first used by Microsoft CEO Satya Nadella in early 2015.  However, it wasn’t until 2017 that the strategy of creating wide moats using SOI was articulated by venture capitalist Jerry Chen at Greylock Partners.

Suddenly Systems of Intelligence is on everyone’s tongue as the next great generational shift in enterprise infrastructure, the great pivot in the ML platform revolution.

Where current Advanced Analytic Platform strategies rely on being the one-stop general-purpose data science platform of choice, those investing in and developing the next generation of platforms say that is about to change.  Their argument is that the needs of each industry, and of each major business process like finance, HR, ITSM, supply chain, ecommerce, and others, have become so specialized in terms of their data science content that wide moats are best constructed by making the data science disappear as the middle layer between systems of record and systems of engagement.

As Chen states, “Companies that focus too much on technology without putting it in context of a customer problem will be caught between a rock and a hard place”.  As an investor he would say that he is unwilling to back a general purpose DS platform for that very reason. 

Chen and many others are investing directly on the basis of these thoughts that the future of data science, machine learning, and AI is as the invisible secret sauce middle layer.  No one cares exactly how the magic is done, so long as your package arrives on time, or the campaign is successful, or whatever insight the DS has provided proves valuable. It’s all about the end user. 

From the developer’s and investor’s point of view, this strategy is also the only forward path to deliver measurable and lasting competitive differentiation.  The treasured wide moat.

So in the marketplace the emphasis is on the system of engagement.  Look at Slack, Amazon Alexa, and every other speech/text/conversational UI startup that uses ML as the basis for its interaction with the end user.  In China, Tencent and Alibaba have almost completely dominated ecommerce, gaming, chat, and mobile payments by focusing on improving their system of engagement through advanced ML.

It’s also true that systems of engagement experience more rapid evolution and turnover than either the underlying ML or the systems of record.  So it’s important that in this new enterprise stack the ML be able to work with a variety of existing and new systems of engagement and also systems of record. 

The old methods of engagement don’t disappear but new ones are added.  In fact, being in control of the end user and being compatible with multiple systems of record provides access to the flow of data that will allow the ML SOI to constantly improve, enhancing your dominant position.

Here’s how Chen and other SOI enthusiasts see the market today.

 

 

How Does this Change the Way Data Scientists Work?

So why does this matter to data scientists and how will it change the way we perform our tasks?  Gartner says that by 2020 more than 40% of data science tasks will be automated.  There are two direct results:

 

Algorithm Selection and Tuning Will No Longer Matter 

It will be automated.  It will no longer be one of the data scientist’s primary tasks.  We see the movement to automating model construction all around us from the automated modeling features in SPSS to the fully automated modeling platforms like DataRobot. 

Our ability to try various algorithms including our hands-on ability to tune hyperparameters will very rapidly be replaced by smart automation.  The amount of time we need to spend on this part of the project is dramatically reduced and will no longer be the best and most effective use of our expertise.
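To make concrete what "smart automation" of tuning means, here is a minimal sketch of the idea, not how SPSS or DataRobot actually work internally.  It assumes a toy one-dimensional dataset and a hand-rolled k-nearest-neighbors regressor; the names knn_predict, loo_error, and auto_tune are hypothetical.  The loop simply tries every candidate setting, scores each with leave-one-out validation, and keeps the winner.

```python
from statistics import mean


def knn_predict(train, x, k):
    """Predict by averaging the targets of the k training points nearest to x."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    return mean(y for _, y in nearest)


def loo_error(data, k):
    """Leave-one-out mean squared error for a given k."""
    errors = []
    for i, (x, y) in enumerate(data):
        train = data[:i] + data[i + 1:]  # hold out point i
        errors.append((knn_predict(train, x, k) - y) ** 2)
    return mean(errors)


def auto_tune(data, candidate_ks):
    """Score every candidate k and return the one with the lowest error."""
    scores = {k: loo_error(data, k) for k in candidate_ks}
    best_k = min(scores, key=scores.get)
    return best_k, scores


# Toy data (an assumption for illustration): y = x^2 sampled at x = 0..9.
data = [(float(x), float(x * x)) for x in range(10)]
best_k, scores = auto_tune(data, candidate_ks=[1, 2, 3, 5, 8])
print("selected k:", best_k)
for k, score in sorted(scores.items()):
    print(f"k={k}: leave-one-out MSE = {score:.2f}")
```

Commercial tools search over many algorithms and far larger hyperparameter spaces, but the core loop is the same: propose, validate, keep the best.  Nothing in it requires a data scientist's judgment once the candidates and the validation metric are fixed.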

 

Data Prep will be Mostly Automated 

Data prep for the most part will be automated and in some narrowly defined instances can be completely automated.  This problem is actually much more difficult to totally automate than model creation.  However, you can already utilize automated data prep in tools as diverse as SPSS and Xpanse Analytics.  Right now, of the many steps in prep at least the following can be reliably automated:

  • Blend data sources.
  • Profile the data for initial discovery.
  • Recode missing and mislabeled values.
  • Normalize the data distribution.
  • Run univariate analyses.
  • Bin categoricals.
  • Create N-grams from text fields.
  • Detect and resolve outliers.

If you’ve experienced any of these automated prep tools you know that today they’re not perfect.  Give them a little time.  This step alone eliminates all the unpleasant grunt work and lower level time and labor in ML.
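Three of the steps listed above (recoding missing values, resolving outliers, and normalizing) are mechanical enough to sketch in a few lines.  This is an illustration under stated assumptions, not the method any particular tool uses: missing values are marked as None and filled with the median, outliers are clipped at the common 1.5 × IQR fences, and "normalize" is taken to mean z-score standardization.  The function names are hypothetical.

```python
from statistics import mean, median, pstdev, quantiles


def impute_missing(values):
    """Replace None with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


def clip_outliers(values, k=1.5):
    """Clip values that fall outside the k * IQR fences."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [min(max(v, low), high) for v in values]


def standardize(values):
    """Rescale to zero mean and unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


# Hypothetical raw column: two missing entries and one wild value.
raw = [12.0, None, 11.0, 13.0, 250.0, 12.5, None, 11.5]
clean = standardize(clip_outliers(impute_missing(raw)))
print([round(v, 2) for v in clean])
```

Each rule is a fixed default applied blindly, which is both why it automates so easily and why the output still deserves a second look: a tool cannot know whether 250.0 was a typo or the most important customer in the file.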

 

Who You Want to Work For

The Systems of Intelligence strategy shift raises another interesting change.  It probably impacts who you want to work for.  One of the great imbalances in the shortage of the best data scientists is that such a high percentage work for tech companies mostly engaged in one-size-fits-all platforms.  Certainly one implication is that we may want to search out industry or process vertical solution developers who will be the primary beneficiaries of this major change.

 

What’s Left for the Data Scientist to do?

Whether you’ve been in the industry for long or are fresh out of school you’ve been intently focused on data prep, model selection, and tuning.  For many of us these are the tasks that define our core skill sets.  So what’s left?

This isn’t as dark as it seems.  We shift to the higher value tasks that were always there but represented a much smaller percentage of our work.

 

Feature Engineering and Model Validation Become a Focus

In all the automation of prep so far there have been some attempts to automate feature engineering (feature creation) by for example taking the difference in all the possible date fields, creating all the possible ratios among variables, looking at trending of values, and other techniques.  These have been brute force and tend to create lots of meaningless engineered features.
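The brute-force approach described above can be sketched as follows.  This is an assumed illustration rather than any vendor's implementation; the record, its field names, and the function brute_force_features are all hypothetical.

```python
from datetime import date
from itertools import combinations


def brute_force_features(record, date_fields, numeric_fields):
    """Generate every pairwise date difference and every pairwise ratio."""
    features = {}
    for a, b in combinations(date_fields, 2):
        features[f"{b}_minus_{a}_days"] = (record[b] - record[a]).days
    for a, b in combinations(numeric_fields, 2):
        if record[b] != 0:  # skip ratios that would divide by zero
            features[f"{a}_per_{b}"] = record[a] / record[b]
    return features


# Hypothetical customer record.
record = {
    "signup": date(2017, 1, 5),
    "first_order": date(2017, 2, 4),
    "last_order": date(2017, 12, 1),
    "orders": 12,
    "revenue": 840.0,
    "support_tickets": 3,
    "zip_code": 94105,
}

features = brute_force_features(
    record,
    date_fields=["signup", "first_order", "last_order"],
    numeric_fields=["orders", "revenue", "support_tickets", "zip_code"],
)
for name, value in features.items():
    print(f"{name}: {value}")
```

With only three date fields and four numeric fields the generator already emits nine candidates, and the count grows quadratically with the number of columns.  Some are plausible (days from signup to first order), while others, such as revenue divided by zip code, are exactly the meaningless features that domain knowledge would never have proposed.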

It is your knowledge of data science, and particularly your industry-specific domain knowledge, that will keep the creation and selection of important new predictive features a major part of your future efforts. 

Your expertise will also be required at the earliest stages of data examination to ensure the automation hasn’t gone off the rails.  It’s pretty easy to fool today’s automated prep tools into believing data may be linear when in fact it may be curvilinear or even non-correlated (I’m thinking Anscombe’s Quartet here).  It still takes an expert to validate that the automation is heading in the right direction.
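For readers who haven't seen it, Anscombe's Quartet is four small datasets whose summary statistics are nearly identical even though their shapes differ completely.  The sketch below uses the published quartet values and a hand-written helper (summarize is a hypothetical name) to compute the Pearson correlation and least-squares line for each set.

```python
from math import sqrt


def summarize(xs, ys):
    """Return (Pearson r, least-squares slope, intercept) for paired data."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return sxy / sqrt(sxx * syy), slope, mean_y - slope * mean_x


x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": (x4, [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

results = {name: summarize(xs, ys) for name, (xs, ys) in quartet.items()}
for name, (r, slope, intercept) in results.items():
    print(f"{name}: r = {r:.3f}, fit: y = {slope:.2f}x + {intercept:.2f}")
```

All four sets report essentially the same correlation and the same fitted line, yet only the first is a noisy linear relationship.  The second is a smooth curve, the third is a perfect line distorted by one outlier, and the fourth has a single point generating all of the apparent correlation.  A tool that looks only at these numbers cannot tell them apart; a person who plots the data can.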

 

Your Understanding of the Business Problem to be Solved

If you are working inside a large corporation as part of the advanced analytics team then your ability to correctly understand the business problem and translate that into a data science problem will be key.

If you are working under the SOI strategy and trying to solve a cross-industry process problem (HR, finance, supply chain, ITSM), or even if you are working with a more narrowly defined industry vertical (e.g. ecommerce customer engagement), it will be your knowledge and understanding of the end user’s experience that will be valued.

Even today progress as a data scientist requires deep domain knowledge of your specialty process or industry.  Knowledge of the data science required to implement the solution is not sufficient without domain knowledge.

 

Machine Learning Will Increasingly be a Team Sport

With all this talk of automation it is easy to be misled into thinking that professional data scientists will no longer be necessary.  Nothing could be further from the truth.  True, fewer of us will be needed per problem, since solutions can be implemented much more quickly.

Where does this leave the Citizen Data Scientist?  This is a movement that has quite a lot of momentum and it’s easy to understand that reasonably smart and motivated LOB managers and analysts may not only want to consume more data science but also want a hands-on seat at the table.

And indeed they should have a major role in defining the problem and implementing the solution.  However, even with all the new automated features the underlying data science still requires an expert’s eye. 

The new focus of your skills will be as a team leader, one with deep knowledge of the data science and the business domain.

 

How Fast Will All This Happen

The build out of advanced analytic platforms and automated features has been underway for about the last two years.  I’m with Gartner on this one.  I think roughly half our tasks will be automated within two years.  Beyond that it’s about how fast this trickles down from the largest companies to the smaller ones.  The speed and reduced cost that automation offers will be impossible to resist.

As for the absorption of the data science platform into the hidden middle layer of the stack as the System of Intelligence, you can already see this underway in many of the thousands of VC funded startups.  This is fairly new and it will take time for these startups to scale and mature.  However, don’t overlook the role that M&A will play in bringing these new platform concepts inside large existing players.  This is probable and will only accelerate the trend.

Is hiding the data science from the end user in any way a bad thing?  Not at all.  Our contribution to the end user’s experience was never meant to be on direct display.  This means more opportunities to apply our data science skills on more tightly focused groups of end users and create more delight in their experience.

 

 

About the author:  Bill Vorhies is Editorial Director for Data Science Central and has practiced as a data scientist since 2001.  He can be reached at:

[email protected]


Comment by Andreas Theil yesterday

This is a great article, and well, it's about time to implement state of the art algorithms into established frameworks. I think data scientists will be welcome team members for support staffs at companies for implementation of use cases and utilization.

However, I don't think it's the end of the line. New algorithms and APIs emerge as open source again with disruptive crowd based development and innovation as leverage. One platform was great and many of us learned or experienced it right. But in the meantime many (gen y...) speak of old thinking.

Maybe it's just the right balance in the new ecosystem of IT, research and innovation.

BR Andy

Comment by Girish Kurup on Thursday

Great article Will.

I agree that Systems of Intelligence are adopted more by Systems of Engagement. I have observed that the adoption rate of Systems of Intelligence by Systems of Record is lower than by Systems of Engagement.

Comment by Claude Cundiff on January 29, 2018 at 8:21pm

Again, I do think this is an important topic for everyone to consider. And as Michael Bryan pointed out, "dialogue" is important, in fact critical.

  • As someone who automated their own job, I found the most difficult part was the human facet, not just those who directly interacted with the software but probably more so, upper management, who were taking the risk with a yay or nay.
  • From a Data Scientist perspective, there's always a better algorithm or process which replaces the existing system(s)... maybe.
  • For every problem solved comes at least another new problem at best.

It's going to be interesting to say the least.

Take Care everyone!

~Claude

Comment by Ralph Winters on January 29, 2018 at 10:47am

If everything else is automated and optimized, why not feature engineering?  Software can feature engineer this as well, better than humans.  So, everything will eventually need to be model validated by humans.  The problem is that many of our optimized models are becoming so complex that not even the creators can understand them. 

Data Scientists can work on tools to decipher these models and add checkpoints to make sure that results are somewhat interpretable and that intermediary results do not lead to nonsense or bad decisions. 

Comment by Steven Ramirez on January 21, 2018 at 1:05pm

Bill, great article, a lot to think about here. 

As data scientists, we all understand how AI and machine learning is going to disrupt industries. If AI can power autonomous vehicles, I'm quite sure it can automate data prep and algorithm selection.

Yes, even data scientists have to be concerned about the future of work!

Comment by Savita Kirpalani on January 19, 2018 at 3:22am

Bill,

Informative article as always. Working with SAP, I see the same principles being applied for ML using Leonardo foundation ML services. Data scientists will still build some models, which can be put in as part of the services. Data engineers will have enablers for easier integration, cleansing, mapping, and loading. The developers will be building APIs to connect to the services.

Indeed a new way of looking at data science and analytics.

Comment by Michael Bryan on January 17, 2018 at 5:59am

Great article, Bill.  Liked and Shared.  For dialogue, though, I'll remain skeptical like Claude.

Promises of a push button, self service world are eternal and never realized, for essentially human reasons. Platforms continue to expand, and algorithms advance.  But Gartner's 40% automation just isn't sober.  Three specific observations:

  • Innovation (so far) increases capabilities, skills and work rather than less.
  • Most people fear math. And machines. Unless it's a STEM business, there's a bridge to build.
  • Left brained people are loners. A team of deep analysts stare at each other's shoes rather than their own.

Predicting a future inconsistent with the past is in danger of selling software.

Comment by Mohamed Judi on January 17, 2018 at 5:24am

Thanks, Bill, good article, however, you have omitted a major player. I would assume for now that your omission is due to lack of knowledge about the company and its solutions. SAP Leonardo is now offering many automated solutions with its Predictive Analysis and Factory products. Let's not forget SAP HANA and the massive predictive and advanced analytics in-database libraries. SAP is leading the way in embedded machine learning in its market-leading ERP and Value Chain solutions. Which is, by the way, a missed trend in your article too.

Embedded machine learning, predictive and optimization models in enterprise solutions are here and powering the new digital economy. Models are automatically retrained and refreshed to keep them relevant based on new data. Enterprise microservices in SAP HANA Cloud Platform are integrated with Leonardo IoT platform to give the business an extended visibility to metrics and measurements coming directly from sensors in the plant in real-time and presented to the user in Virtual Reality using SAP Visual Enterprise and Microsoft HoloLens. Machine Learning and other optimization algorithms play a major role in the SAP Preventive Maintenance solution as you can imagine.

Anyway, I just wanted to draw your readers' attention that there is so much going on, and SAP is in a leading position with Digital Transformation, and the power behind most of the leading organizations in every industry like Apple, Walmart, Exxon Mobil, General Electric, Shell, Coke, Pepsi, Microsoft, Lenovo, local and federal government in many countries, Petrobras, Samsung, etc. 

Comment by Dr. Dimitrios Geromichalos on January 16, 2018 at 11:41pm

In wide areas of the finance industry, auditors, for example, require detailed documentation and explanation of every method used - at least for the "critical" tasks.
Here, I think that black boxes still have a long way to go before they are broadly accepted, no matter how good the results are. This could happen, of course, if established and certified tools are used, but that would also increase systemic risk.

Comment by Claude Cundiff on January 16, 2018 at 12:51pm

Two Years? That's rather optimistic. While I have no doubt that these changes are on their way, it seems like there are a lot of, if I had to guess, business side stakeholders who are forgetting a few points:

  1. We can't even get rid of COBOL
  2. Our Algorithms are NOT as good as they could be
  3. As every Project Manager knows, scope creep is female dog, and
  4. most things take at least 4 times longer than planned

This is a good article! Thanks for posting. This is important information everyone in this field needs to know.

~Claude
