
Apache Spark and R: The Best of Both Worlds

As folks working in the field of Data Science and Analytics would know, R is one of the best languages for data analytics and machine learning. Its simple, easy-to-use syntax and its support for a huge library of capabilities make it a top Data Science language. But the biggest limitation of R is the amount of data it can process. Its data processing capacity is limited to the memory on a single node (at least in the free version).

Apache Spark is taking the Big Data world by storm. On one side, it has fast parallel computing capabilities that can extend over hundreds of nodes. On the other, it is easy to program. Libraries like Spark SQL and ML are pretty easy to learn and code with. Data transformation and processing is a scream. It supports Scala, Python and Java with the same set of features and libraries, which makes transitioning from a known language easier. And its interpreter mode provides an ad hoc analytics mode that data analysts would love.

Now there is SparkR from Apache Spark. SparkR provides an interface from R to Spark. You can use the R language and the RStudio IDE to connect to and work with data in Spark. Sitting at a Windows laptop running RStudio, you can process data on parallel nodes in a Spark cluster. The syntax is simple, straightforward and powerful. Data cleansing and transformation operations can be run as Map-Reduce activities across the cluster in real time. Summarized data can then be visualized using R's graphics capabilities.
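
To make the Map-Reduce idea concrete, here is a minimal sketch in plain Python rather than SparkR. It is not Spark code: a thread pool on one machine stands in for the cluster nodes, and the sample records and every function name (`split_into_partitions`, `clean_and_summarize`, `merge`, `mean_by_city`) are hypothetical, made up for illustration. Each "node" cleanses its own partition and produces a partial summary, and the partial summaries are then merged into one small result.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

# Hypothetical raw records: (city, temperature reading as text). Some are dirty.
RAW_ROWS = [
    ("Boston", "21.5"), ("boston ", "19.0"), ("Austin", "n/a"),
    ("Austin", "30.0"), ("AUSTIN", "32.0"), ("Boston", ""),
]


def split_into_partitions(rows, n):
    """Split rows into n partitions, roughly as a cluster would distribute them."""
    return [rows[i::n] for i in range(n)]


def clean_and_summarize(partition):
    """Map step: cleanse one partition and return partial (sum, count) per city."""
    partial = {}
    for city, reading in partition:
        try:
            value = float(reading)
        except ValueError:
            continue  # drop rows whose reading is not numeric
        key = city.strip().title()  # normalize whitespace and letter case
        total, count = partial.get(key, (0.0, 0))
        partial[key] = (total + value, count + 1)
    return partial


def merge(left, right):
    """Reduce step: combine two partial summaries into one."""
    merged = dict(left)
    for key, (total, count) in right.items():
        prev_total, prev_count = merged.get(key, (0.0, 0))
        merged[key] = (prev_total + total, prev_count + count)
    return merged


def mean_by_city(rows, n_partitions=3):
    """Run the map step on each partition in parallel, then reduce to means."""
    partitions = split_into_partitions(rows, n_partitions)
    with ThreadPoolExecutor(max_workers=n_partitions) as pool:
        partials = list(pool.map(clean_and_summarize, partitions))
    totals = reduce(merge, partials, {})
    return {city: total / count for city, (total, count) in totals.items()}


SUMMARY = mean_by_city(RAW_ROWS)
```

With the sample rows above, `SUMMARY` holds a mean of 20.25 for Boston and 31.0 for Austin. Only this small summarized result needs to come back to the laptop, which is the part you would then hand to R's graphics functions.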

A marriage made in heaven for R loyalists? Not yet. SparkR does not support the same suite of machine learning algorithms as the other languages in Spark. In fact, only a couple of algorithms are available. I expect and hope that Spark developers are working to add these in later versions to make SparkR feature-compatible with PySpark and Scala.

So is SparkR useless until then? Not really. You can still use SparkR for data cleansing and transformation activities without having to break up your data into smaller pieces to fit within R's memory limitations. Spark transformations are also many times faster since they happen on parallel nodes in memory.

Let us hope that Spark developers answer our prayers and provide us with a fully capable version of SparkR.


Comment by Brian Christopher Brown on March 11, 2016 at 5:33am

I concur with the author:  SparkR will be a great marriage between syntax and scalability, but it's not ready for prime time.  I have attempted to prototype 4 different use cases in SparkR, and I was unable to see any of them to fruition.  So it's not hard for me to understand why SparkR is only available in standalone mode.

SparkR works with DataFrames, and it supports Spark SQL.  But by release 1.6 it had stopped supporting RDDs, which every other Spark API has always supported and still does.  So a lot of the experimental code you will find on the web has been deprecated, and the low-level functionality one might take for granted simply isn't there now.

Another fundamental problem: it's hard to see how one could use a function written in R when querying a dataset, with or without Spark SQL.  I won't say it's impossible, but it certainly has not been made clear as far as I am concerned.  And this is very basic functionality we're talking about.

As far as machine learning goes, it's my understanding that SparkR supports only glm (as of release 1.6).  When you consider how much ML research has been done with R, it's hard to see how SparkR will catch up with scalable versions of familiar features anytime soon.

The bottom line: SparkR does not yet provide the fundamental means to either use or develop Spark's own natural scalable machine-learning capabilities, which would be the most natural and most practical use cases for SparkR.  It's going to take a lot of effort, and most likely a lot of time, before it becomes a first-class Spark API.


© 2017 Data Science Central
