Cliff Notes for Managing the Data Science Function

Summary:  An increasing number of larger companies have truly embraced advanced analytics and deploy fairly large numbers of data scientists.  Many of these same companies are the ones beginning to ask about using AI.  Here are some observations and tips on the problems and opportunities associated with managing a larger data science function.


We spend a lot of time looking inward at our profession of data science, studying new developments, looking for anomalies in our own practices, and spreading the word to other practitioners.  But when we look outward to communicate about data science to others, it’s different.  Maybe you have had the same experience, but when I talk to new clients it is, as often as not, to educate them at a fairly basic level about what’s possible and what’s not.

The good news is that there’s now a third group:  execs and managers in larger companies who have embraced advanced analytics and who try to keep up by reading, but are not formally trained.  If these analytics managers are data scientists, all well and good.  But as you move up the chain of command just a little bit you’ll soon find yourself talking to someone who may be an enthusiastic supporter but whose well-intentioned self-education still leaves them short on some basic knowledge.

Who are these folks?  Well Gartner says you’re a mid-size user if you have 6 to 12 data scientists and it takes more than 12 to be a larger user.  And that’s not counting the dedicated data engineers, IT, and analysts also assigned to the task.  So it’s certainly the large users and probably many of the mid-size users we’re addressing.

For a while I’ve been collecting what I call Cliff Notes for Managing Data Science to address this group.  Here’s the first installment.


Do you really want an AI strategy?

I’ll try to keep this short because this topic tends to set me off.  The popular press and many of the platform and application vendors have just recently started calling everything in advanced analytics “Artificial Intelligence”.  Not only is this inaccurate, it makes the conversation much more difficult.

First of all, if you’ve already got a dozen data scientists then you are firmly in the camp of machine learning / predictive analytics.  Machine learning is much more mature and more broadly useful than just AI (which itself uses a narrow group of machine learning techniques).  So good for you.  Keep up the good work.  But the fact that it helps humans make decisions does not make what you have been doing so far AI.

Modern AI is the outcome of deep neural nets and reinforcement learning.  AI involves recognition and response to text, voice, image, and video.  It also encompasses automated and autonomous vehicles, game play, and the examination of ultra-large data sets to identify very rare events.  This area is quite new and only the text, voice, image, and video capabilities are ready for commercial deployment.

Technically, if you have deployed a chatbot anywhere in your organization you are utilizing AI.  Chatbot input, and sometimes output, is based on NLU (natural language understanding), one of the good applications of deep learning.  Chances are that if you do not have at least one chatbot today, you will have one within a year.  Chatbots are a great way to engage with customers and save money.  They are not whiz-bang solutions.  This is what most of AI is going to look like when deployed.

By all means begin the conversation about where modern AI may be of value in your strategy but don’t oversell this yet.  The real money is in what you’ve been doing right along with predictive and prescriptive analytics, IoT, and the other well developed machine learning technologies.


Should your data scientists be centralized or decentralized?

There are two schools of thought here and the deciding factor is probably how many data scientists you have.  One school of thought is that you should embed them closest to where the action is: in marketing, sales, finance, manufacturing.  You name it, every process can benefit from advanced analytics.  They will learn the unique perspective, language, problems, and data of that process, which will make them more effective.

On the other hand, the average data scientist has had that title only 2 ½ years.  That just shows you how fast we’re starting to graduate new ones and how rapidly they’re getting snapped up.  What it means is that you probably have a few relatively experienced data scientists who have been around the block and a larger number of juniors who are just getting started.

The juniors should have come with a very impressive set of technical skills and in theory can contribute to any data science problem.  The reality is that the juniors and the seniors as well need to keep learning by experience, not to mention having time to catch up with the new techniques that are being introduced all the time.  So the goal will be to have enough contact time between the seniors and the juniors so that everyone continues to develop. 

If you’ve got a half-dozen with various experience levels working together, that’s probably OK.  However, one interesting model is a hybrid that brings all your data scientists together on a fairly regular schedule so they can share experiences and learning.

Another possible implementation would be to have a few seniors deployed out in each end user organization with the juniors on rotating assignment to assist.

Spread them too thin and you won’t benefit from their growth.  It may also cause them to leave for greener pastures.


Should every data science project have an ROI?

The further you go up the chain of command, the more senior management will say ‘of course, this is our most basic concept’.  And that’s not necessarily bad but it needs to come with some balance.

It will be many years, if ever, before you fully exploit all the data and all the analytics that will create competitive advantage.  And many of those applications haven’t even been conceived today.

One type of financial discipline that is completely appropriate is pre-establishing time budgets or measures that tell you when the solution is good enough.  This is particularly true in all types of customer behavior modeling. 
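As a rough illustration of what such a pre-established budget might look like, here is a minimal sketch in Python.  Everything in it is an assumption made for illustration: the function name `should_stop`, the idea of tracking one validation score per modeling iteration, and the specific numbers for the target, the hours budget, and the minimum worthwhile gain are all hypothetical and would be set by whoever owns the schedule.

```python
def should_stop(history, hours_spent, target, hours_budget, min_gain=0.005):
    """Return a reason to stop tuning a model, or None to keep going.

    history      -- validation scores so far, one per iteration (higher is better)
    hours_spent  -- hours already spent on this model
    target       -- score agreed in advance as "good enough" (hypothetical)
    hours_budget -- time budget agreed in advance (hypothetical)
    min_gain     -- smallest improvement still worth another iteration
    """
    if hours_spent >= hours_budget:
        return "time budget exhausted"
    if history and history[-1] >= target:
        return "target reached"
    if len(history) >= 2 and history[-1] - history[-2] < min_gain:
        return "diminishing returns"
    return None


# Illustrative check after the third modeling iteration of a churn model.
reason = should_stop([0.70, 0.74, 0.742], hours_spent=20,
                     target=0.80, hours_budget=40)
```

The point of the sketch is only that the stopping criteria are written down before the work starts, so the decision to move on is not left to the moment.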

If your data science team has a built-in bias that is a potential weakness, it is that they will always want to keep working to make those models better, even when the time would be better spent on other data science projects.  This scheduling discipline requires an understanding of exactly how the work is done, and that most likely belongs to your Chief Data Scientist.

Incidentally, that Chief Data Scientist should also be regularly evaluating and recommending platforms and techniques to make the group more efficient, particularly in the fast emerging area of automated machine learning.

HOWEVER, there needs to be an opportunity for discovery.  This is a little like an engineering lab where your data scientists need a little formally allocated time to ‘go in there and find something interesting’.  Give them a little unstructured time to explore. 

The most interesting phrase in science is not ‘Eureka’, but ‘that’s funny’ (Isaac Asimov).


Should you keep all that data or only what you need?

This question is very closely related to the one above about ROI.  Our ordinary instinct is to keep only what we need.  However there is a strong school of thought among data scientists that data is now so inexpensive to store that we should keep it all and figure out how to benefit from it later.

The opposite school starts with ‘what is the problem we are trying to solve’ and works backwards to the data necessary to achieve that.  This is also the school that says when we have achieved X% accuracy to this question that is sufficient and any data not necessary to support that should be discarded.

Well the problem is that all that data that doesn’t appear to be predictive most assuredly contains pockets of outliers and pockets of really interesting new opportunities.  You may not have the manpower to dig into it today, but that’s also why we argued for giving your data science team some self-directed time to ‘find something interesting’.

Storing new data in a cloud data lake where data scientists can explore it is ridiculously cheap (but not free).  Where you need discipline is when you operationalize your new insight and it becomes mission critical.  Then you need the full weight of good data management, provenance control, bias elimination, and the proverbial single definition of the truth.


What about Citizen Data Scientists and the democratization of analytics?


All data science projects are team efforts and those teams consist of data scientists, LOB SMEs, and probably some analysts and folks from IT.  As these non-DS team members get more experience, of course they become more valuable.  In particular, they are increasingly able to restate business problems as data science problems that they can bring to the table.

What really scares me and should scare you too are statements in the press to the effect that AI and machine learning have become so user friendly that they are “only a little more complex than word processors or spreadsheets”, or “users no longer need to code”.

That may be very narrowly true but that does not mean you should ever begin a data science project without including an adequate number of formally trained data scientists.

It is true that our advanced analytic platforms are becoming easier to use.  The benefit is that fewer data scientists can do the work that used to require many more.  It does not mean that citizen data scientists, no matter how well intentioned, should be given control over these projects. 

The slick visual user interfaces in many analytic applications hide many critical considerations that a DS will know and a CDS will not.  The issues are much too long to list here but for example include false positive/false negative threshold cost tradeoffs, best algorithm selection, the creation of new features, hyperparameter adjustment of those algorithms, bias detection, and the list goes on.  This is no self-driving car.
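To make the first item on that list concrete, here is a minimal sketch in Python of how a false positive / false negative cost tradeoff moves the decision threshold.  Everything in it is an illustrative assumption: the function names `expected_cost` and `best_threshold`, the six scores and labels, and the two cost settings are hypothetical and not drawn from any particular platform.

```python
def expected_cost(scores, labels, threshold, cost_fp, cost_fn):
    """Total cost of flagging every case scored at or above `threshold`."""
    cost = 0.0
    for score, label in zip(scores, labels):
        flagged = score >= threshold
        if flagged and label == 0:
            cost += cost_fp   # false positive: acted on a case that was negative
        elif not flagged and label == 1:
            cost += cost_fn   # false negative: missed a case that was positive
    return cost


def best_threshold(scores, labels, cost_fp, cost_fn):
    """Return the candidate threshold with the lowest total cost."""
    # Candidates are the observed scores plus "flag nothing at all".
    candidates = sorted(set(scores)) + [float("inf")]
    return min(candidates,
               key=lambda t: expected_cost(scores, labels, t, cost_fp, cost_fn))


# Hypothetical model scores and true outcomes (1 = positive case).
scores = [0.1, 0.3, 0.4, 0.6, 0.8, 0.9]
labels = [0, 0, 1, 0, 1, 1]

miss_is_costly = best_threshold(scores, labels, cost_fp=1, cost_fn=10)
alarm_is_costly = best_threshold(scores, labels, cost_fp=10, cost_fn=1)
```

On this toy data, when a missed positive costs ten times a false alarm the cheapest cutoff is 0.4; reverse the costs and it moves to 0.8.  The model is identical in both cases, which is exactly the kind of business decision a default setting in a visual interface can hide.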

What we recommend is indeed getting those analysts and LOB managers deeply involved in the team process.  That will allow them to spot new opportunities.  If you want to empower your organization, start by actively educating for data literacy.  Leave actually driving the data science to the data scientists.


Cybersecurity may be what forces your hand into true AI

Whether you are dealing with cybersecurity in-house or contracting out, this is the first place you should be sure that there is real AI at work.  It turns out that the deep learning techniques at the core of modern AI are particularly good at spotting anomalies and threats. If you want to front load your AI strategy this is the best place to start.  As the Marines say, don’t bring a knife to a gun fight.



About the author:  Bill Vorhies is Editorial Director for Data Science Central and has practiced as a data scientist since 2001.


Comment by Rick Henderson on February 9, 2018 at 4:43am

So what kind of training is there for "data engineers"? Interesting article, Bill.

Comment by William Vorhies on February 7, 2018 at 7:12am


Good observations, particularly about the third possible organization strategy.  



Comment by Paul Bremner on February 6, 2018 at 3:59pm

I think on your question of where to put Data Scientists, there’s probably a third alternative to either “centralized” or “decentralized.”  Namely, that some business units will have their own small groups of data scientists or people with full-on programming and statistics expertise.  These people can handle some of the load that would be going to the central data science groups but will also work closely with the central groups on other projects.  And people like this might eventually move into the central Data Science group (and perhaps the other way around.)


When I was getting my MBA there was a specialization in what was then called “Management and Information Systems” (this was before the era of Data Science and these folks would now presumably be doing the types of IT/Analytics things you see in Data Science.)  My recollection is that some companies (Intel comes to mind) would hire folks from this specialization to work in functions like a “Finance MIS group.”  The benefit to Intel was that they had people with full-fledged MBAs and specialization in both finance and info systems, allowing them to work easily with the Finance folks and also be able to expedite important Finance work requiring IT involvement since they knew the IT stuff.  These people wouldn’t really be called “Citizen Data Scientists” (using today’s terminology) because they had far more expertise in IT and programming than a Line of Business analyst or market manager.  Pretty sure I see some companies doing this sort of hiring with Data Science people who have “business-line” backgrounds.


I agree with you about the hype concerning “Citizen Data Scientist” and “democratization of data science.”  You can’t do this stuff (at least not in a competent fashion) by pushing a button.  It’s true that Data Science Platforms (e.g. SAS Enterprise Miner) provide the capability for users to do most Data Science things without coding.  But they’re designed for people with statistics degrees or solid statistical backgrounds who may not have programming capabilities.  DSPs cannot “Data Science-enable” someone with a casual interest in numbers whose background is something other than Programming/Statistics.  While you don’t absolutely need programming capability to use the platform, from what I can see about two thirds of the people with an Enterprise Miner certification also have advanced programming certifications.  (Otherwise you’re depending on someone else to do all the major data munging necessary before processing data in a statistical application.)  And, as has been discussed elsewhere, a good Data Science Platform will have a full coding interface which allows the user, when necessary, to make full use of the programming language that backs it up (e.g. SAS) or to use Python/R/etc to extend the capability of the platform (as in Azure ML or other cloud DSPs.)  DSPs do not remove the need for either statistics expertise or programming capability – the first is definitely a requirement and the second is close behind.


I think after all the press and software vendor hype over citizen data scientists has run its course, what we’re going to find is that these systems mainly benefit Data Scientists, making them much more productive and enabling them to focus on higher-level, more important, issues as opposed to spending all their time writing thousands of lines of code that could easily be generated using drag and drop interfaces.
