Apache Spark: distributed data processing faster than Hadoop

This post is adapted from Data Science Hacks by the author.

Apache Spark is an Apache-licensed top-level project that can perform large-scale data processing much faster than Hadoop (I am referring to MR1.0 here). The concept behind this fast processing is the Resilient Distributed Dataset (RDD). An RDD is essentially a collection of objects, spread across a cluster and stored in RAM or on disk, that is automatically rebuilt on failure. Its purpose is to make higher-level, parallel operations on data as straightforward as possible.
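To make the RDD idea concrete, here is a toy model in plain Python (this is not the Spark API; `MiniRDD` and its methods are invented for illustration). It shows the two properties described above: data split into partitions, and a recorded lineage that lets any lost partition be recomputed rather than restored from a backup.

```python
class MiniRDD:
    """Toy model of an RDD: a partitioned dataset plus the lineage
    (parent + transformation) needed to recompute any partition."""

    def __init__(self, partitions, parent=None, fn=None):
        self._partitions = partitions      # base data (None for derived RDDs)
        self._parent = parent
        self._fn = fn

    def map(self, fn):
        # Transformations are lazy: only the lineage is recorded here,
        # no data is materialized yet.
        return MiniRDD(None, parent=self, fn=fn)

    def compute(self, i):
        # Rebuild partition i from its lineage -- this is the mechanism
        # by which Spark recovers partitions after a node failure.
        if self._partitions is not None:
            return self._partitions[i]
        return [self._fn(x) for x in self._parent.compute(i)]

    def num_partitions(self):
        if self._partitions is not None:
            return len(self._partitions)
        return self._parent.num_partitions()

    def collect(self):
        # An action: force computation of every partition.
        return [x for i in range(self.num_partitions())
                  for x in self.compute(i)]

base = MiniRDD([[1, 2], [3, 4]])       # two partitions
doubled = base.map(lambda x: x * 2)    # lazy: nothing computed yet
print(doubled.collect())               # [2, 4, 6, 8]
```

Note that `doubled` stores no data of its own; calling `compute(1)` alone would rebuild just that partition, which is exactly how a real RDD survives the loss of one machine's slice of the data.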

Apache Spark is often referred to as a data processing engine. Simply put, Spark is a cluster computing engine that makes it easy to handle a wide range of workloads: ETL, SQL-like queries, machine learning, and streaming. The amount of code you write is also greatly reduced compared to traditional MapReduce development. It has also been reported to be about 10x faster than Apache Mahout on machine learning workloads.
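The code-size claim is easiest to see with the classic word count. In PySpark it is roughly a three-step chain (`flatMap` into words, `map` to `(word, 1)` pairs, `reduceByKey` to sum), versus a full mapper/reducer pair plus driver in classic MapReduce. The sketch below expresses that same three-step pipeline with only the standard library, so it runs without a cluster; the input lines are made up for illustration.

```python
from collections import Counter

# Word count as the flatMap -> map -> reduceByKey pipeline Spark uses,
# written with plain Python so it runs anywhere.
lines = ["spark makes big data simple", "big data big insights"]

words = (w for line in lines for w in line.split())  # flatMap: lines -> words
pairs = ((w, 1) for w in words)                      # map: word -> (word, 1)

counts = Counter()
for w, n in pairs:                                   # reduceByKey: sum counts
    counts[w] += n

print(counts["big"])   # 3
print(counts["data"])  # 2
```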

The Spark engine has four major components.

  • Spark SQL: queries structured data, connects via JDBC drivers to import data, and works with existing data warehouses such as Hive
  • Spark MLlib: a scalable machine learning library (comparable to Mahout, but faster) that can run forecasting, classification, clustering, recommendations, and more
  • GraphX: an API for graph computation, ETL tasks, and exploratory data analysis
  • Spark Streaming: similar to Storm; can be used to build real-time streaming applications (e.g., analyzing Twitter feeds as they are posted)
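To give a flavor of the SQL-like querying that Spark SQL provides over structured data, here is the same idea expressed with Python's built-in sqlite3 module (not Spark; in Spark you would register a DataFrame and call `spark.sql(...)`). The `events` table and its rows are invented for illustration.

```python
import sqlite3

# A structured-query sketch: aggregate clicks per user, the kind of
# query Spark SQL runs at cluster scale.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE events (user TEXT, clicks INTEGER)")
con.executemany("INSERT INTO events VALUES (?, ?)",
                [("ana", 3), ("bo", 5), ("ana", 2)])

rows = con.execute(
    "SELECT user, SUM(clicks) FROM events GROUP BY user ORDER BY user"
).fetchall()

print(rows)  # [('ana', 5), ('bo', 5)]
```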

To install Spark, you need the following on your OS (Mac/Debian):

  • Java
  • Scala
  • Maven
  • Git
  • sbt
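On the two platforms mentioned, a typical way to pull in these prerequisites is via the system package manager. Exact package names vary by release, and on Debian sbt usually requires adding its own apt repository (see scala-sbt.org), so treat this as a sketch rather than a guaranteed recipe:

```shell
# Debian/Ubuntu (sbt typically needs its own apt repository):
sudo apt-get update
sudo apt-get install -y default-jdk scala maven git

# macOS with Homebrew:
brew install openjdk scala maven git sbt
```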

Visit Apache Spark Part I: Setting up preliminaries and Apache Spark Part II for more information.

Pavan Kumar

Lead blogger,

Data Science Hacks


Tags: analytics, big data, machine learning, data science, spark



Comment by Alexander Kashko on October 3, 2014 at 12:37am

I just downloaded the distribution and did not need any of these. I was able to run locally using Python and my HDFS system. I have yet to see whether starting up Hadoop as well makes a timing difference.
