Data Science is the field of study in which large volumes of data are mined, analysed to build predictive models, and help the business in the process. The data used over here is often unstructured, and it’s huge in quantity. Such data which encompasses the definition of volume, velocity, veracity, and variety is known as Big Data.
Hadoop and Spark are two of the most popular open-source framework used to deal with big data. The Hadoop architecture includes the following –
- HDFS – It is the storage mechanism which stores the big data across multiple clusters.
- Map Reduce – The data are processed parallel in the form of programming model known as Map Reduce.
- Yarn – It manages resources required for the data processing.
- Ozzie – A scheduling system to manage Hadoop jobs.
- Mahout – For Machine Learning operations in Hadoop, the Mahout framework is used.
- Pig – Executes Map Reduce programs. It allows the flow of programs which are higher level.
- Flume – Used for streaming data into the Hadoop platform.
- Sqoop – The data transfer between Hadoop Distributed File System, and the Relational Database Management system is facilitated by Apache Sqoop.
- Hbase – A database management system which is column oriented, and works best with sparse data sets.
- Hive – Allows SQL like query operations for data manipulation in Hadoop.
- Impala – It is a SQL query engine for data processing but works faster than Hive.
As you can see there are numerous components of Hadoop with their own unique functionalities. In this article we would look into the basics of Hive and Impala.
Basics of Hive
Hive allows processing of large datasets using SQL which resides in the distributed storage. It is a Data Warehousing Tool which is built on top of the HDFS making operations like Data encapsulation, ad-hoc queries, data analysis, easy to perform.
The structure of Hive is such that first the tables, and the databases are created, and the tables are loaded with the data then after. It is platform designed to perform queries on only structured data which are loaded into the Hive tables. Unlike Map-Reduce, Hive has optimization features like UDFs which improves the performance. As Map-Reduce could be quite difficult to program, Hive resolved this difficulty, and allows to write queries in SQL which runs Map Reduce jobs in the backend. There is a Metastore in Hive as well which generally resides in a relational database.
Two of methods of interacting with Hive are Web GUI, and Java Database Connectivity Interface. There is a command line interface in Hive on which you could write queries using the Hive Query Language that is syntactically similar to SQL. Text file, Sequence file, ORC, RC file are some of the formats supported by Hive. The derby database is used for a single user storage metadata, and MYSQL is used for multiple user metadata.
Some notable points related to Hive are –
- The Hive Query Language is executed on the Hadoop infrastructure while the SQL is executed on the traditional database.
- Once a Hive query is ran, a series of Map Reduce jobs is generated automatically at the backend.
- The bucket, and the partition concepts in Hive allows for easy retrieval of data.
- The custom User Defined Functions could perform operations like filtering, cleaning, and so on.
Now, Hive allows you to execute some functionalities which could not be done in the relational databases. In production, it is highly necessary to reduce the execution time for the queries and thus Hive provides the advantage in this regard as the results are obtained in the second’s time.
There is a reason why queries are executed quite fast in Hive.
The Schema on Read and Write system in the relational databases allows one to create a table first, and then insert data into it. Thus insertions, modifications, updates could be performed over there. On the other hand, the Schema on Read only mechanism in Hive doesn’t allow modifications, updates to be done. The modifications across multiple nodes is not possible because on a typical cluster, the query is run on multiple data nodes. There is also a Read many write once mechanism in Hive where the tables could be updated in the latest versions after insertion is done.
The three core parts in Hive are – Hive Clients, Hive Services, Hive Storage and Computing.
- To enable communication across different type of applications, there are different drives which are provided by Hive. The Thrift client is provided for communication in Thrift based applications. The JDBC drivers are provided for the java related applications. The ODBC drivers are provided for the other type of applications. In the Hive service, there is again communication between these drivers and the Hiver server.
- The Hive Services allows client interactions. All operations in Hive are communicated through the Hiver Services before it is performed. The Hive service of the Data Definition Language is the Command Line Interface. The ODBC, JDBC, etc., is communicated by the drivers in the service.
- Services such as file system, Metastore, etc., performs certain actions after communicating with the storage.
In Hive, the query is first executed through the User Interface, and then its metadata information is gathered after an interaction between the driver, and the compiler. The plan is created by the compiler, and the metadata request is obtained. The compiler receives the metadata information back from the Meta store and starts communication to execute the query. The Execution engine receives the execution plans from the Driver. The bridge between Hadoop and Hive is the engine which processes the query. The results are fetched from the driver and sent to the Execution Engine which would eventually send the results to the front end via the driver.
There are two modes – Local, and Map Reduce on which Hive could operate. The local mode used in case of small data sets, and the data is processed at a faster speed in the local system. In Map Reduce mode, there are multiple data nodes in Hadoop and used to execute large datasets in a parallel manner. A better performance on large data sets could be achieved through this. The Map Reduce mode is default in Hive.
The server interface in Hive is known as HS2 or the Hive Server2 where the query execution against the Hive is enabled for the remote clients. Authentication and concurrency for multiple clients are some of the advanced features included in the latest versions.
Basics of Impala
Impala is a parallel query processing engine running on top of the HDFS. Between both the components the table’s information is shared after integrating with the Hive Metastore. As Hive is mostly used to perform batch operations by writing SQL queries, Impala makes such operations faster, and efficient to be used in different use cases.
There are some changes in the syntax in the SQL queries as compared to what is used in Hive. The transform operation is a limitation in Impala. Distributed across the Hadoop clusters, and used to query Hbase tables as well.
- The queries in Impala could be performed interactively with low latency.
- Impala produces results in second unlike the Hive Map Reduce jobs which could take some time in processing the queries.
- For real-time analytical operations in Hadoop, Impala is more suited and thus is ideal for a Data Scientist.
- Reporting tools like Pentaho, Tableau benefits form the real-time functionality of Impala as they already have connectors where visualizations could be performed directly from the GUI.
- All formats of files like ORC, Parquet are supported by Impala.
The parquet file used by Impala is used for large scale queries. In this format, the data is stored vertically i.e., the columnar storage of data. Thus the performance while using aggregation functions increases as only the columns split files are read. The encoding and compression schemes are efficiently supported by Impala. Impala could be used in scenarios of quick analysis or partial data analysis. Along with real-time processing, it works well for queries processed several times.
The architecture of Impala consists of three daemons – Impalad, Statestored, and Catalogd.
The Impalad is the core part of Impala which allows processing of data files and accepts queries with JDBC ODBC connections. The distribution of work across the nodes and the transmission of results to the coordinator node immediately is facilitated by the Impalad.
The Impala daemons availability is checked by the Statestored. The health of the nodes are continuously checked by constant communication between the daemons, and the Statestored. In case of a node failure, all other Impalad daemons are notified by the Statestored to leave that daemon out for future task assignment.
The metadata changed from DDL to other nodes are notified by the Catalogd daemon. Its configuration is required in a single host.
The Impalad takes any query requests, and the execution plan is created. Impalad communicates with the Statestored, and the hive Metastore before the execution.
Various built-in functions like MIN, MAX, AVG are supported in Impala. It also supports the dynamic operation. The VIEWS in Impala acts as aliases.
Big Data plays a massive part in the modern world with Hive, and Impala being two of the mechanisms to process such data. This article gave a brief understanding of their architecture and the benefits of each.
If you want to read more about data science, you can read our blogs here