Guest post by Ben Lorica. The original can be viewed on radar.oreilly.com.

Ben Lorica is the Chief Data Scientist and Director of Content Strategy for Data at O'Reilly Media, Inc. He has applied Business Intelligence, Data Mining, Machine Learning, and Statistical Analysis in a variety of settings, including Direct Marketing, Consumer and Market Research, Targeted Advertising, Text Mining, and Financial Engineering. His background includes stints with an investment management company, internet startups, and financial services.

As we put the finishing touches on what promises to be another outstanding Hardcore Data Science Day at Strata + Hadoop World in New York, I sat down with my co-organizer Ben Recht for the latest episode of the O’Reilly Data Show Podcast. Recht is a UC Berkeley faculty member and a member of AMPLab, and his research spans many areas of interest to data scientists, including optimization, compressed sensing, statistics, and machine learning.

At the 2014 Strata + Hadoop World in NYC, Recht gave an overview of a nascent AMPLab research initiative into machine learning pipelines. The research team behind the project recently released an alpha version of a new software framework called KeystoneML, which gives developers a chance to test out some of the ideas that Recht outlined in his talk last year. We devoted a portion of this Data Show episode to machine learning pipelines in general, and a discussion of KeystoneML in particular.

Since its release in May, I’ve had a chance to play around with KeystoneML, and while it’s quite new, there are several things I already like about it:

KeystoneML opens up new data types

Most data scientists don’t normally play around with images or audio files. KeystoneML ships with easy-to-use sample pipelines for computer vision and speech. As more data loaders get created, KeystoneML will enable data scientists to work with many more data types and tackle new problems.
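To give a sense of what this looks like in practice, here is a rough sketch of how a KeystoneML pipeline is assembled: components are chained together with andThen, and an estimator can be fit in place by pairing it with training data. The loader and component names below (ImageLoader, PixelScaler, and so on) are illustrative placeholders rather than the exact classes that ship with the alpha release:

```scala
// Sketch only: ImageLoader, PixelScaler, PatchFeaturizer, and LinearSolver
// are placeholder names standing in for KeystoneML's bundled components.
val trainData = ImageLoader(sc, trainDir)        // a data loader produces labeled records

val pipeline = PixelScaler andThen               // normalize raw pixel values
  PatchFeaturizer(patchSize = 6) andThen         // turn images into feature vectors
  (LinearSolver(lambda = 1e-4),                  // fit a model as one step of the chain,
    trainData.data, trainData.labels) andThen    // using the training set
  MaxClassifier                                  // emit the top-scoring label

val predictions = pipeline(testData.data)        // the whole chain is a single function
```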

It’s built on top of Apache Spark: Community, scale, speed, results

Spark is the hot new processing framework that many data scientists and data engineers are already using (and judging from recent announcements, more enterprises will start paying attention to it as well). By targeting Spark developers, the creators of KeystoneML can tap into a rapidly growing pool of contributors.

As a distributed computing framework, Spark’s ability to comfortably scale out to large clusters can significantly speed up computations. Early experiments using KeystoneML tackle some computer vision and speech recognition tasks on modestly sized Spark clusters, with training times much faster than those of other approaches.

And while the project is still in its early stages, the pipelines that ship with KeystoneML already match some state-of-the-art results in speech recognition.


A comparison of spark.ml and KeystoneML. Source: Evan Sparks, used with permission.

Emphasis on understanding and reproducing end-to-end machine learning pipelines

Rather than the simplistic approach frequently used to teach machine learning (input data -> train model -> use model), KeystoneML’s API reinforces the importance of thinking in terms of end-to-end pipelines. As I and many others have pointed out, model building is actually just one component in data science workflows.

An image classification pipeline. Source: Evan Sparks, used with permission.
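The abstraction behind a pipeline like the one pictured above is simple enough to mimic in a few lines of plain Scala. The toy sketch below is my own code, not KeystoneML’s, but it captures the key idea: every stage, including the trained model, is a typed function, and the end-to-end pipeline is just their composition.

```scala
// A self-contained toy version of the idea (not KeystoneML code):
// each stage is a typed function, and a pipeline is their composition.
trait Node[A, B] { self =>
  def apply(in: A): B
  def andThen[C](next: Node[B, C]): Node[A, C] = new Node[A, C] {
    def apply(in: A): C = next(self(in))
  }
}

// Featurization stages...
object Tokenize extends Node[String, Seq[String]] {
  def apply(in: String): Seq[String] = in.toLowerCase.split("\\s+").toSeq
}

case class HashFeatures(dims: Int) extends Node[Seq[String], Array[Double]] {
  def apply(in: Seq[String]): Array[Double] = {
    val v = new Array[Double](dims)
    in.foreach(t => v(((t.hashCode % dims) + dims) % dims) += 1.0)
    v
  }
}

// ...and the "model" is just one more node in the chain.
case class LinearModel(weights: Array[Double]) extends Node[Array[Double], Double] {
  def apply(x: Array[Double]): Double =
    weights.zip(x).map { case (w, xi) => w * xi }.sum
}

val pipeline = Tokenize andThen HashFeatures(16) andThen LinearModel(Array.fill(16)(0.1))
val score = pipeline("machine learning pipelines")  // one call runs end to end
```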

There are many ways to contribute to KeystoneML

As a fairly new project, KeystoneML’s codebase is still relatively small and accessible to potential contributors. A typical pipeline includes data loaders, featurizers, models, and many other components. You need not be an algorithm whiz or a machine learning enthusiast to contribute. In fact, I think many important future contributions to KeystoneML will be pipeline components other than advanced modeling primitives. Moreover, if you have access to, or have already created, well-tuned components, the creators of KeystoneML provide examples of how to quickly integrate external libraries (including tools written in C) into your pipelines.
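To give a sense of how small a useful contribution can be: a new pipeline component in KeystoneML is a class that extends Transformer[In, Out] and implements a single apply method on individual records, with the framework taking care of running it over distributed data. The import path below reflects my understanding of the alpha release’s package layout, and the component itself is a made-up example:

```scala
import workflow.Transformer  // package layout as of the alpha release; may change

// A made-up, deliberately small component: normalize raw text records.
// Defining apply() on one record is all that's required; inside a pipeline,
// the framework lifts it to run over distributed RDDs of records.
object CleanText extends Transformer[String, String] {
  def apply(in: String): String = in.trim.toLowerCase
}

// It then composes with any node whose input and output types line up, e.g.:
//   val pipeline = CleanText andThen Tokenizer() andThen ...
```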

A platform for large-scale experiments: Benchmarking and reproducibility

As I noted in an earlier post, the project’s longer-term goal is to produce error bounds for end-to-end pipelines. Another important objective is to build a framework and accompanying components that make it easy for data scientists to run experiments and make comparisons. Recht explained:

There are benchmarks, and you get thousands of papers written about the same benchmark and it’s completely impossible to know how people are comparing. They’d say, “Algorithm A is better than algorithm B.” You don’t actually get to see how exactly they’re running algorithm A, or what they did to the default parameters in algorithm B. It’s very hard to actually make comparisons and to make them reproducible. Then [someone will] come along after a bunch of these [and] try to reproduce all of the results and [write a] survey paper comparing a bunch of results. … What we’d like to be able to do is have this framework where you can actually do those kinds of comparisons.
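Because KeystoneML pipelines are first-class objects, that kind of controlled comparison becomes natural: build two pipelines that differ in exactly one component, then score both against the same held-out data. Here is a sketch of the pattern, again with placeholder component and helper names (PatchFeaturizer, FisherVectorizer, evaluate) rather than the library’s actual classes:

```scala
// Sketch with placeholder names: swap a single featurizer while pinning
// everything else (data, solver, evaluation), so any difference in test
// error is attributable to the swapped component alone.
def buildPipeline(featurizer: Transformer[Image, Array[Double]]) =
  PixelScaler andThen featurizer andThen
    (LinearSolver(lambda = 1e-4), train.data, train.labels) andThen
    MaxClassifier

val pipelineA = buildPipeline(PatchFeaturizer(patchSize = 6))
val pipelineB = buildPipeline(FisherVectorizer(numComponents = 16))

// The same test set and the same metric for both variants.
val errorA = evaluate(pipelineA(test.data), test.labels)
val errorB = evaluate(pipelineB(test.data), test.labels)
```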

(Automatic) Tuning

Data scientists and data engineers who work with big data tools often struggle to configure and tune complex distributed systems. The designers of KeystoneML are building automation and optimization tools to address these issues. Recht noted:

The other thing that we’re hoping to be able to do is systems optimizations: meaning that we don’t want you to load a thousand-node cluster because you made an incorrect decision in caching. We’d like to be able to actually make these things where we can smartly allocate memory and other resources to be able to run on more compact clusters.

Subscribe to the O’Reilly Data Show Podcast

Stitcher, TuneIn, iTunes, SoundCloud, RSS


Comment by Sione Palu on July 19, 2015 at 10:33pm

Quote: “Most data scientists don’t normally play around with images or audio files.”

It should be rephrased as: “Most data scientists don’t play around with images or audio files.”

Most work on business data, which mostly doesn’t involve those data types, unless they want to learn about other uses of the machine learning algorithms they’re currently using.
