Data Science is the study of extracting meaningful insights from the data using various tools and technique for the growth of the business. Despite its inception at the time when computers came into the picture, the recent hype is a result of the huge amount of unstructured data that is getting generated and the unprecedented computational capacity that modern computers possess.
However, there is a lot of misconception among the masses about the true meaning of this field with many of the opinion that it is about predicting future outcomes from the data. Though predictive analytics is a part of Data Science, it is certainly not all of what Data Science stands for. In an analytics project, the first and foremost role is to get the build the pipeline and get the relevant data to perform predictive analytics later on. The professional who is responsible for building such ETL pipelines and the creating the system for flawless data flow is the Data Engineer and this field is known as Data Engineering.
Over the years the role of Data Engineers has evolved a lot. Previously it was about building Relational Database Management System using Structured Query Language or run ETL jobs. These days, the plethora of unstructured data from a multitude of sources has resulted in the advent of Big Data. It is nothing but a different forms of voluminous data which carries a lot of information if mined properly.
Now, the biggest challenge that professionals face is to analyse these huge terabytes of data which traditional file storage systems are incapable of handling. This problem was resolved by Hadoop which is an open-source Apache framework built to process large data in the form of clusters. Hadoop has several components which takes care of the data and one such component is known as Map Reduce.
Created by Doug Cutting and Mike Cafarella in 2006, Hadoop facilitates distributed storage and processing of huge data sets in the form parallel clusters. HDFS or Hadoop Distributed File System is the storage component of Hadoop where different file formats could be stored to be processed using the Map Reduce programming which we would cover later on in this article.
The HDFS runs on large clusters and follows a master/slave architecture. The metadata of the file i.e., information about the relative position of the file in the node is managed by the NameNode which is the master and could save several DataNodes to store the data. Some of the other components of Hadoop are –
Hadoop is a vast concept and in detail explanation of each components is beyond the scope of this blog. However, we would dive into one of its components – Map Reduce and understand how it works.
Map Reduce is the programming paradigm that allows for massive scalability across hundreds or thousands of servers in a Hadoop Cluster, i.e. suppose you have a job to run and you write the Job using the MapReduce framework and then if there are a thousand machines available, the Job could run potentially in those thousand machines.
The Big Data is not stored traditionally in HDFS. The data gets divided into chunks of small blocks of data which gets stored in respective data nodes. No complete data’s present in one centralized location and hence a native client application cannot process the information right away. So a particular framework is needed with the capability of handling the data that stays as blocks of data into respective data nodes, and the processing can go there to process that data and bring back the result. In a nutshell, data is processed in parallel which makes processing faster.
To improve performance and for better efficiency, the idea of parallelization was developed. The process is automated and concurrently executed. The instructions which are fragmented could also run on a single machine or on different CPU’s. To gain direct disk access, multiple computers uses SAN or Storage Area Networks which is a common type of Clustered File System unlike the Distributed File Systems which sends the data using the network.
One term that is common in this maser/slave architecture of data processing is Load Balancing where among the processors the tasks are spread to avoid overload on any DataNode. Unlike the static balancers, there is more flexibility provided by the dynamic balancers.
The Map-Reduce algorithm which operates on three phases – Mapper Phase, Sort and Shuffle Phase and the Reducer Phase. To perform basic computation, it provides abstraction for Google engineers while hiding fault tolerance, parallelization, and load balancing details.
Some of the examples of Map Reduce programming are –
Data Engineering is a key step in any Data Science project and Map Reduce is undoubtedly an essential part of it. In this article we have a brief intuition about Big Data and provided an overview of Hadoop. Then we explained Map Reduce programming and its workflow and gave few real life applications of Map Reduce programming as well.
Read more here.