
Apache Spark Introduction – A Comprehensive Guide for Beginners

This article was posted on Data Flair. Below is a quick overview of the original article.


1. Objective

This tutorial provides an introduction to Apache Spark: its ecosystem components, its core abstraction (the RDD), and RDD transformations and actions. The objective of this introductory guide is to give a detailed overview of Spark, including its history, architecture, deployment modes, and RDDs.

2. History

Apache Spark began in 2009 at UC Berkeley's RAD Lab, which later became the AMPLab. It was open-sourced in 2010 under a BSD license. In 2013 Spark was donated to the Apache Software Foundation, and it became a top-level project in 2014. By 2015 Spark had become one of the most active projects at Apache.

3. Introduction

Apache Spark is a general-purpose, lightning-fast cluster computing system for running large-scale data applications. It is written in Scala but provides high-level APIs in Java, Scala, Python, and R. Spark can run workloads up to 100 times faster than Hadoop MapReduce when data fits in memory, and about 10 times faster when reading from disk. It can be integrated with Hadoop and can process existing HDFS data.
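
To make this concrete, here is a minimal sketch of a standalone Spark application in Scala. The object name, application name, and HDFS path are illustrative assumptions (not from the original article), and the job would typically be launched with spark-submit, which supplies the cluster master:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Minimal Spark application: count the lines of a text file on HDFS.
// The path below is a placeholder; any reachable file path works.
object LineCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("LineCount")
    val sc   = new SparkContext(conf)

    val lines = sc.textFile("hdfs:///data/input.txt") // build an RDD of lines
    println(s"Line count: ${lines.count()}")          // action: runs the job

    sc.stop()
  }
}
```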

4. Need for Spark

5. Components

  • Spark Core
  • Spark SQL (see the sketch after this list)
  • Spark Streaming
  • MLlib
  • GraphX
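
As a quick illustration of one of these components, here is a hedged Spark SQL sketch. It assumes the SparkSession entry point introduced in Spark 2.x, and the people.json file and its name/age columns are made up for the example:

```scala
import org.apache.spark.sql.SparkSession

// Illustrative Spark SQL usage: load JSON data and query it with SQL.
// File path and column names are assumptions for the sketch.
object SqlSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("SqlSketch")
      .master("local[*]")   // run locally for this sketch
      .getOrCreate()

    val people = spark.read.json("people.json")
    people.createOrReplaceTempView("people")

    spark.sql("SELECT name, age FROM people WHERE age > 21").show()

    spark.stop()
  }
}
```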

6. Resilient Distributed Dataset (RDD)
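
The article's overview mentions RDD transformations and actions; a small sketch may help show the difference. It assumes an existing SparkContext bound to sc, as provided inside spark-shell:

```scala
// Transformations (map, filter) are lazy: they only describe the computation.
// An action (reduce) triggers actual execution across the cluster.
val numbers = sc.parallelize(1 to 10)      // create an RDD from a local range
val squares = numbers.map(n => n * n)      // transformation: lazy
val evens   = squares.filter(_ % 2 == 0)   // transformation: lazy
val total   = evens.reduce(_ + _)          // action: runs the job
println(total)                             // 220 (4 + 16 + 36 + 64 + 100)
```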

7. Spark Shell
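
The Spark shell is an interactive Scala REPL, launched with bin/spark-shell, that comes with a preconfigured SparkContext bound to sc. A hypothetical session might look like this (the README.md file name is an assumption):

```scala
// Typed at the scala> prompt inside spark-shell; sc is provided by the shell.
val readme     = sc.textFile("README.md")               // RDD of the file's lines
val sparkLines = readme.filter(_.contains("Spark"))     // transformation: lazy
sparkLines.count()                                      // action: number of matching lines
```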

To read the full article or get Spark training, click here.
