Fast-forwarding the transformation process in data science with Apache Spark
Curation is a critical process in data science: it prepares data for feature extraction so that the data can be used with machine learning algorithms. Curation generally involves extracting, organising and integrating data from different sources. It can be a difficult and time-consuming process, depending on the complexity and volume of the data involved.
Most of the time, data is not readily available for feature extraction. It may be hidden in unstructured and complex data sources and has to go through multiple transformation steps before features can be extracted.
When the volume of data is huge, these transformations take a long time and can become a bottleneck for the whole machine learning pipeline.
General tools used in data science:
- R language – Widely adopted in data science, with a lot of supporting libraries
- MATLAB – Commercial tool with a lot of built-in libraries for data science
- Apache Spark – New, powerful and gaining traction. Spark on Hadoop provides a distributed and resilient architecture that can speed up the curation process many times over.
Recent Study
One of my projects involved curating and extracting features from a huge volume of natural language conversation text. We started with the R programming language for the transformation process. R is simple and offers a lot of functionality in the statistics and data science space, but it has limitations in computation and memory, and in turn in efficiency and speed. We then migrated the transformation process to Apache Spark and observed a tremendous improvement in performance: the transformation time for a huge volume of data came down from more than a day to about an hour.
Here are some of the benefits of Apache Spark over R that I would like to highlight.
- Effective Utilization of resources:
By default, R runs on a single core and is limited by the capabilities of that core and by its memory usage. Even on a multi-core system, R uses only one core unless told otherwise. For memory, a 32-bit R process is limited to a virtual memory user space of about 3 GB, and a 64-bit R process is limited by the amount of RAM. R does have parallel-processing packages that can help spread the work across multiple cores.
Spark can run in distributed form, with the processing running on executors. Each executor runs in its own process and uses its own CPU and memory. Spark introduces the concept of the RDD (Resilient Distributed Dataset) to achieve a distributed, resilient and scalable processing solution.
- Optimized transformation:
Spark distinguishes between transformations and actions. Transformations are evaluated lazily: no job is executed until an action is called. This in turn lets Spark optimize the work when multiple transformations are chained before an action, which is the step that triggers execution and transfers the results back to the driver program.
- Integration with the Hadoop ecosystem:
Spark integrates well into the Hadoop ecosystem through the YARN architecture and can easily connect to HDFS and to multiple NoSQL databases such as HBase and Cassandra.
- Support for multiple languages:
Spark's APIs are available in multiple programming languages, such as Scala, Java and Python.
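
To make the lazy evaluation described under "Optimized transformation" more concrete, here is a minimal sketch in plain Python. It is a toy model, not Spark's API: the LazyDataset class, the clean helper and the sample lines are hypothetical names and data made up for this illustration, and everything runs in a single process. The sketch only shows the idea that transformations are recorded and nothing runs until an action is called, at which point all recorded steps are applied in one pass over the data.

```python
from typing import Any, Callable, Iterable, Iterator, List, Tuple

Step = Tuple[str, Callable[[Any], Any]]


class LazyDataset:
    """Toy stand-in for a lazily evaluated dataset (hypothetical, not Spark's API)."""

    def __init__(self, source: Iterable[Any], steps: Tuple[Step, ...] = ()):
        self._source = source
        self._steps = steps  # recorded transformations, not yet executed

    # Transformations: only record the step and return a new dataset.
    def map(self, fn: Callable[[Any], Any]) -> "LazyDataset":
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, pred: Callable[[Any], bool]) -> "LazyDataset":
        return LazyDataset(self._source, self._steps + (("filter", pred),))

    def _run(self) -> Iterator[Any]:
        # Apply every recorded step to each item in a single pass.
        for item in self._source:
            keep = True
            for kind, fn in self._steps:
                if kind == "map":
                    item = fn(item)
                elif kind == "filter" and not fn(item):
                    keep = False
                    break
            if keep:
                yield item

    # Actions: trigger execution and hand the results back to the caller.
    def collect(self) -> List[Any]:
        return list(self._run())

    def count(self) -> int:
        return sum(1 for _ in self._run())


calls: List[str] = []  # records every time the user function actually runs


def clean(line: str) -> str:
    calls.append(line)
    return line.strip().lower()


# Made-up sample of raw conversation text.
lines = ["  Hello Spark ", "", "  How can I HELP you?  ", "   "]

# Two chained transformations: nothing has been executed yet.
cleaned = LazyDataset(lines).map(clean).filter(lambda s: len(s) > 0)
calls_before_action = len(calls)

# The action runs both transformations in one pass over the data.
result = cleaned.collect()

print(calls_before_action)  # 0
print(result)               # ['hello spark', 'how can i help you?']
```

In Spark the same pattern applies to RDD transformations such as map and filter and to actions such as collect and count, with the difference that the work is spread across executors instead of running in one Python process.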

