In my last post, I covered setting up the basic tools to start doing machine learning (Python, NumPy, Matplotlib and Scikit-Learn). Now, you are probably wondering how to do this at a very large scale, involving terabytes (maybe even petabytes) of data spread across several server nodes.
The best answer is Apache Spark! Spark is an in-memory analytics engine that runs on top of HDFS and also unifies many other data sources, e.g. NoSQL databases like MongoDB, or even CSV files. Spark is also a much faster and simpler replacement for Hadoop's original processing model, MapReduce. IBM has announced plans to include Spark in all its analytics platforms and has committed 3,500+ developers to Spark-related projects.
The picture below shows how Spark plays across applications, data sources and environments.
A ton of material is already available on the benefits of Spark, so I will keep it short and simple. Spark is great because it is free and open source, general purpose, scales up massively (I mean up to 8,000 nodes and petabytes of data), is amazingly fast (~100x faster than traditional MapReduce; details here!) and comes with a delightfully elegant programming model! BTW, it is the programming model, with its deep roots in functional programming, that won me over. Details here!
Spark is also the hottest project at Apache.
How do we get started? See this course taught by Professor Anthony Joseph of Berkeley. The lectures are broken into small, bite-size videos (3 to 4 minutes at most) which are simple and very nicely explained. Some are followed by quiz questions, which help validate the knowledge immediately. The entire course environment is provided as a virtual machine (you need VirtualBox, Vagrant, and the image) which runs on your laptop. The best part of the course was the four labs. The labs came as iPython notebooks with sample exercises, and each exercise was followed by tests which gave a pass/fail result immediately.
The first lab was meant to count the most frequent words in ALL of Shakespeare's plays. The second lab provided web server logs from NASA and asked students to parse the Apache Common Log Format, create Spark RDDs (Resilient Distributed Datasets), and analyze how many requests succeeded (2xx responses), how many failed, which resources failed and when!
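The counting pipeline in the first lab is Spark's classic flatMap → map → reduceByKey pattern. As a rough illustration of that core logic (in plain Python rather than PySpark, so it runs without a cluster, and with a hypothetical `word_count` helper rather than the lab's actual code):

```python
from collections import Counter

def word_count(lines):
    """Mimic Spark's flatMap -> reduceByKey word count, locally."""
    # "flatMap": split each line into lowercase word tokens
    words = (word for line in lines for word in line.lower().split())
    # "reduceByKey": sum up occurrences of each word
    return Counter(words)

lines = ["To be or not to be", "that is the question"]
counts = word_count(lines)
print(counts.most_common(3))  # most frequent words first
```

In the lab itself, the same idea is expressed on an RDD, roughly as `rdd.flatMap(split).map(lambda w: (w, 1)).reduceByKey(add)`, which is what lets Spark distribute the counting across nodes.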
A screenshot of a section of the second lab, visualizing 404 responses by hour of day, is shown below.
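Under the hood, that analysis boils down to parsing each log line and bucketing the 404s by hour. A simplified sketch of the idea in plain Python (the regex here is my own hand-rolled approximation of the Common Log Format, not the lab's exact pattern):

```python
import re
from collections import Counter

# Apache Common Log Format, e.g.:
# 127.0.0.1 - - [01/Aug/1995:13:00:07 -0400] "GET /images/x.gif HTTP/1.0" 404 0
LOG_PATTERN = re.compile(
    r'^\S+ \S+ \S+ '                                   # host, ident, user
    r'\[\d{2}/\w{3}/\d{4}:(?P<hour>\d{2}):\d{2}:\d{2} [^\]]+\] '  # timestamp
    r'"[^"]*" (?P<status>\d{3}) \S+'                   # request, status, size
)

def not_found_by_hour(log_lines):
    """Count 404 responses per hour of day from Common Log Format lines."""
    hours = Counter()
    for line in log_lines:
        m = LOG_PATTERN.match(line)
        if m and m.group("status") == "404":
            hours[int(m.group("hour"))] += 1
    return hours
```

In the lab this becomes a chain of RDD transformations (parse, filter on status 404, map to hour, count), but the per-line logic is the same.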
The third lab provided product listings from Google and Amazon, and the objective was to use a bag-of-words technique to break product descriptions into tokens and compute similarity between products. A TF-IDF (Term Frequency–Inverse Document Frequency) weighting was used to compute similarity between documents of product descriptions. Learning such powerful text analysis techniques for entity resolution is a real asset for solving real-world problems.
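To make the technique concrete, here is a minimal plain-Python sketch of TF-IDF weighting plus cosine similarity over tokenized descriptions (the function names `tf_idf` and `cosine` are mine, and the lab's exact normalization may differ):

```python
import math
from collections import Counter

def tf_idf(docs):
    """Compute a sparse TF-IDF weight vector for each tokenized document."""
    n = len(docs)
    # document frequency: how many docs contain each token
    df = Counter(token for doc in docs for token in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # term frequency (normalized) times inverse document frequency
        vectors.append({t: (count / len(doc)) * math.log(n / df[t])
                        for t, count in tf.items()})
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse TF-IDF vectors."""
    dot = sum(a[t] * b.get(t, 0.0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0
```

Rare tokens (like a model number) get high weights, while tokens that appear in almost every listing contribute little, which is exactly why TF-IDF works so well for matching product records.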
The fourth lab was the icing on the cake. It analyzed movies from IMDB and came up with historical ratings of movies; a dataset of 500K ratings came along with the VM.
A screenshot of a section of this lab, retrieving the highest-rated movies of all time, is shown below.
It is a lot of fun to see the highest rated movies and be able to run your own queries on the RDD :-)
The lab used Spark's MLlib (Machine Learning Library) to implement Collaborative Filtering. This is a method to make automatic predictions about a user's interests using the preferences of many users (hence, collaboration). The basic assumption is that if a person A has the same opinion as a person B on one item x, then A is more likely to share B's opinion on another item y than a randomly chosen person would be. CF was combined with the Alternating Least Squares (ALS) technique to predict movie ratings. Finally, the lab asked the user to rate a small sample of movies in order to make personalized movie recommendations. You will love this lab!
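The idea behind ALS is to factor the (mostly empty) user–movie ratings matrix into a small factor per user and per movie, alternating between solving for one set while holding the other fixed. The lab uses MLlib's ALS implementation; as a toy illustration of the alternation itself, here is a rank-1 version in plain Python (my own sketch, not MLlib's algorithm, which uses higher-rank factors plus regularization), where each step has a closed-form least-squares solution:

```python
def als_rank1(ratings, n_users, n_items, iters=20):
    """Toy rank-1 ALS: learn one latent factor per user and per item.

    ratings: list of (user, item, rating) triples (the observed entries).
    The predicted rating for (u, i) is users[u] * items[i].
    """
    users = [1.0] * n_users
    items = [1.0] * n_items
    for _ in range(iters):
        # Fix item factors; least-squares solve for each user factor.
        for u in range(n_users):
            num = sum(r * items[i] for (uu, i, r) in ratings if uu == u)
            den = sum(items[i] ** 2 for (uu, i, r) in ratings if uu == u)
            if den:
                users[u] = num / den
        # Fix user factors; least-squares solve for each item factor.
        for i in range(n_items):
            num = sum(r * users[u] for (u, ii, r) in ratings if ii == i)
            den = sum(users[u] ** 2 for (u, ii, r) in ratings if ii == i)
            if den:
                items[i] = num / den
    return users, items
```

Once the factors are learned, predicting an unseen rating is just multiplying the user's factor by the movie's factor, which is how your handful of personal ratings in the lab turns into recommendations across the whole catalog.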
Additionally, the course discussion group was full of questions and very supportive responses from the staff and other students.
I found this to be an excellent course, pitched at the right level of difficulty, and helpful in demystifying Spark for a beginner and putting it to actual use.
I am looking forward to the next in the series - "Scalable Machine Learning".
Best wishes and best regards,