Note: Opinions expressed are solely my own and do not express the views or opinions of my employer.
As a data scientist who has been munging data and building machine learning models in tools like R, Python and other software, both open source and proprietary, I had always longed for a world without technical limitations. A world which would allow me to create data structures (data scientists usually call them vectors, matrices or dataframes) of virtually any size, manipulate them, and use them in machine learning models. A world where I could do all these fancy things without having to worry about whether they fit in memory, without having to wait for hours on end for my computations to finish (data scientists are an impatient breed), and without needing to write lots of code merely to compute a dot product between two vectors.
The seeds of my data science utopia were sown many years ago with the advent of Apache Hadoop and the MapReduce programming model, which solved the problem of storing and processing large amounts of data. A few years later, Apache Mahout was built on top of MapReduce, providing implementations of machine learning algorithms. It all seemed too good to be true.
However, in my opinion, as good as MapReduce is for many data processing workloads and use cases, it never seemed best suited to data scientists. Its lack of interactive data analysis capability, coupled with the need to write very verbose mappers and reducers in Java (or in other languages using Hadoop Streaming), was never going to endear it to the data science community. But this didn't mean the dream was over… Apache Spark came to the rescue!
For the uninitiated, Apache Spark is an in-memory, distributed data processing engine designed to run on top of distributed storage systems like HDFS. As a data scientist, Spark whets my appetite for a number of reasons.
To me, speed of analysis matters. It's no good having to wait hours for the results of a correlation matrix, only to forget why you ran it in the first place. The ability to do train-of-thought analysis interactively on large volumes of data is one of the most important features distinguishing Apache Spark from other data processing engines.
Writing succinct code to accomplish data science tasks was never the forte of MapReduce. Spark has nailed this with its high-level API in Scala and Python (a widely used scripting language in the data science community). Add to this its MLlib package, which provides implementations of a number of feature extraction and machine learning techniques.
Finally, Spark processes big data. I don't think I need to say much more here, other than that, in my view (with a few ifs and buts), more data beats clever algorithms. Maybe more on this in a different post.
As is the case with most things in life, every technology goes through its peaks and troughs. For now, those data scientists who dream of a similar utopia will find in Spark a much-needed ray of hope… Welcome to Sparkling Land.