
Summary:  Do you think we can replace data scientists with software?

Talk about a fraught concept; this one ought to give you the willies.  I don’t mean to be a Luddite about the magical abilities of technology, but the concept here is to replace data scientists with software.

For as long as I have practiced data science, I have kept coming upon new and unexpected reasons that my results may be misguided.  As a reminder, you might look back at my recent articles on Simpson’s Paradox (read it here) or Why Big Data Isn’t Necessarily Better Data (read it here).

So even if the most recent NoSQL ML techniques are too directional and not sufficiently specific in their results to lend themselves to automation (or perhaps these are exactly the required conditions), the question remains open as to whether more traditional ML techniques like predictive modeling can be successfully automated.  That would open the door to their application by untrained users or, charitably, “citizen data scientists”.

Tom Simonite takes on this topic in his February article of the same name.  The premise, he says, is this: there are not enough data scientists to go around, so the process must be automated.

For example, Google is apparently funding work on the “Automatic Statistician”.

“It’s not meant to replace exactly what a statistician would do, but it can help a lot,” says Zoubin Ghahramani, professor of information engineering at the University of Cambridge, who developed the software. “Sometimes it finds patterns that a regular data analyst would not,” he adds.

The Automatic Statistician uses an iterative building-block approach to create mathematical models.  The software first tries out the simplest of its modeling methods on the data; it then selects the ones that best explain the data for another round of experimentation, adding more mathematical techniques to see what happens. The best model is then used to generate the final written report.  (There’s a hint of genetic programming here, which I think is a much overlooked ML technique, but more on that separately.)
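The select-and-extend loop described above can be illustrated with a toy greedy search. This is only a sketch under stated assumptions, not the Automatic Statistician's actual method: the three building blocks in `BLOCKS`, the least-squares fit, the BIC score, and all function names are hypothetical stand-ins chosen purely for illustration.

```python
import math

# Hypothetical building blocks: simple basis functions of x (an assumption,
# not the components the real software uses).
BLOCKS = {
    "linear": lambda x: x,
    "quadratic": lambda x: x * x,
    "sine": lambda x: math.sin(x),
}

def solve(a, b):
    """Solve a @ w = b by Gaussian elimination with partial pivoting.

    Assumes the system is non-singular (true for the small examples here).
    """
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    w = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * w[c] for c in range(r + 1, n))
        w[r] = (m[r][n] - s) / m[r][r]
    return w

def bic(names, xs, ys):
    """Fit intercept + chosen blocks by least squares; return BIC (lower is better)."""
    cols = [[1.0] * len(xs)] + [[BLOCKS[k](x) for x in xs] for k in names]
    p, n = len(cols), len(xs)
    gram = [[sum(u * v for u, v in zip(cols[i], cols[j])) for j in range(p)]
            for i in range(p)]
    rhs = [sum(u * y for u, y in zip(cols[i], ys)) for i in range(p)]
    w = solve(gram, rhs)
    rss = sum((y - sum(w[j] * cols[j][i] for j in range(p))) ** 2
              for i, y in enumerate(ys))
    # Floor the residual so a perfect fit does not hit log(0).
    return n * math.log(max(rss, 1e-12) / n) + p * math.log(n)

def greedy_search(xs, ys):
    """Start from the intercept-only model; add one block per round while BIC improves."""
    chosen, best = [], bic([], xs, ys)
    while True:
        candidates = [(bic(chosen + [k], xs, ys), k)
                      for k in BLOCKS if k not in chosen]
        if not candidates:
            return chosen, best
        score, name = min(candidates)
        if score >= best:
            return chosen, best
        chosen.append(name)
        best = score

xs = [float(i) for i in range(10)]
ys = [2 + 3 * x for x in xs]
print(greedy_search(xs, ys))
```

The BIC penalty is what stops the loop: each extra block has to reduce the fitting error by enough to pay for its added parameter. Note that nothing in the loop decides which blocks belong in the candidate set or whether the data were fit for modeling in the first place.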

Another entrant to this field is Skytree.  Simonite reports “it claims (to be) the first commercial tool that can automatically select the best model to explain a particular data set.”  When I read the Skytree site and the most recent 451 Research report on Skytree, I see references to ‘near real time’ modeling and results from Big Data sets, with connectors for the big three Hadoop distributions.  It appears to compete directly with SAS and to operate the same way.  The claim of automagical operation isn’t easy to spot.

As we all know, the modeling takes only a small fraction of the total time.  It’s the decisions about data prep, such as whether and how to impute missing values or which variables to include or exclude, that take time to iterate on.  Skytree claims “accurate” models.  My take is that speed means “good enough” models, and if you’re satisfied with that, I’ll be happy to be your competitor any day.
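To make the data prep point concrete, here is a minimal sketch of how one such decision, the handling of missing values, changes what a downstream model sees. The column of values and the `impute` helper are hypothetical, invented only for illustration.

```python
import statistics

def impute(values, strategy):
    """Return a copy of values with None entries handled by the named strategy."""
    observed = [v for v in values if v is not None]
    if strategy == "drop":
        return observed
    # Pick the fill value from the observed entries only.
    fill = {"mean": statistics.mean, "median": statistics.median}[strategy](observed)
    return [fill if v is None else v for v in values]

# Hypothetical skewed column with two missing entries and one outlier.
incomes = [30, 32, 35, None, 38, 40, None, 400]

for strategy in ("drop", "mean", "median"):
    cleaned = impute(incomes, strategy)
    # Row count and spread both depend on the strategy chosen.
    print(strategy, len(cleaned), round(statistics.pstdev(cleaned), 1))
```

Because of the single outlier, the mean fill and the median fill put very different numbers into the gaps, and dropping rows changes the sample size. Which of these is right depends on why the values are missing, which is a judgment call rather than something the code can decide.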

Finally, Simonite identifies Narrative Science, a company that provides a service to turn numerical data into readable reports. Cofounder Kristian Hammond, who is also a professor at Northwestern University, says that Google’s Automatic Statistician could help data scientists be more efficient, but its reports would offer little to those who are unfamiliar with statistics. “Most business people don’t want to know about mathematical models,” says Hammond, “they want to know that they could save money by reducing factory activity by 50 percent between the hours of 1 a.m. and 6 a.m.”

Hmmm.  Since the interpretation of results, the storytelling, is such a major portion of any project, I’ll withhold judgment.  We should wait for some proof on this one and look for some head-to-head tests of this automated output pitted against the interpretive skills of a decent senior data scientist.

These are just three examples and I’m sure there are 10 times that number in development somewhere.  I can’t wait for some serious adopters of data science automation to report their results so we can get some head-to-head comparisons.

You can read Tom Simonite’s original article here.

 

September 14, 2015


Bill Vorhies
Editorial Director, DSC


Comment by Aruna Rajasekhar on September 15, 2015 at 9:02am

After having worked with data for many years now, I have learnt that there is an exception to every rule. Automation is a double-edged sword and it must be used carefully.

Comment by Pradyumna S. Upadrashta on September 14, 2015 at 9:24pm

“Most business people don’t want to know about mathematical models” says Hammond, “they want to know that they could save money by reducing factory activity by 50 percent between the hours of 1 a.m. and 6 a.m.”

All this tells me is why business leaders who don't want to know about the mathematical models will be obsolete -- their roles being taken over by qualified data scientists who bring the curiosity to the process.  If anything will accelerate the eventual rise of the data scientist within the C-suite that various articles keep mentioning, it is this statement here.

Comment by Pradyumna S. Upadrashta on September 14, 2015 at 9:21pm

I'm sorry, but the day they "automate" the data scientist, look out, because they'll have made every other role obsolete as well.  Strong AI changes everything.  Weak AI is no different than no AI, since you still need a data scientist in the iterative loop to validate the process, check off that the model is doing what it's supposed to do, and ensure proper execution in real-time.

I would ask: Who cares if it selects the best model? I would still trust a data scientist over a non data scientist to execute on that model, or to put that model into operation.  Companies that blindly trust models dig their own grave.

"All models are wrong, some models are useful" still applies ... so who is the person best qualified to judge which models are most useful? Who defines useful/utility? A model that chooses models? Or a human willing to take responsibility for the choice?

Choices still involve responsibility, and humans still take ownership better than machines last I checked.
