Data Science is the field of study in which large volumes of data are mined, analysed to build predictive models, and help the business in the process. The data used over here is often unstructured, and it’s huge in quantity. Such data which encompasses the definition of volume, velocity, veracity, and variety is known as Big Data.
Hadoop and Spark are two of the most popular open-source framework used to deal with big data. The Hadoop architecture includes the following –
As you can see there are numerous components of Hadoop with their own unique functionalities. In this article we would look into the basics of Hive and Impala.
Hive allows processing of large datasets using SQL which resides in the distributed storage. It is a Data Warehousing Tool which is built on top of the HDFS making operations like Data encapsulation, ad-hoc queries, data analysis, easy to perform.
The structure of Hive is such that first the tables, and the databases are created, and the tables are loaded with the data then after. It is platform designed to perform queries on only structured data which are loaded into the Hive tables. Unlike Map-Reduce, Hive has optimization features like UDFs which improves the performance. As Map-Reduce could be quite difficult to program, Hive resolved this difficulty, and allows to write queries in SQL which runs Map Reduce jobs in the backend. There is a Metastore in Hive as well which generally resides in a relational database.
Two of methods of interacting with Hive are Web GUI, and Java Database Connectivity Interface. There is a command line interface in Hive on which you could write queries using the Hive Query Language that is syntactically similar to SQL. Text file, Sequence file, ORC, RC file are some of the formats supported by Hive. The derby database is used for a single user storage metadata, and MYSQL is used for multiple user metadata.
Some notable points related to Hive are –
Now, Hive allows you to execute some functionalities which could not be done in the relational databases. In production, it is highly necessary to reduce the execution time for the queries and thus Hive provides the advantage in this regard as the results are obtained in the second’s time.
There is a reason why queries are executed quite fast in Hive.
The Schema on Read and Write system in the relational databases allows one to create a table first, and then insert data into it. Thus insertions, modifications, updates could be performed over there. On the other hand, the Schema on Read only mechanism in Hive doesn’t allow modifications, updates to be done. The modifications across multiple nodes is not possible because on a typical cluster, the query is run on multiple data nodes. There is also a Read many write once mechanism in Hive where the tables could be updated in the latest versions after insertion is done.
The three core parts in Hive are – Hive Clients, Hive Services, Hive Storage and Computing.
In Hive, the query is first executed through the User Interface, and then its metadata information is gathered after an interaction between the driver, and the compiler. The plan is created by the compiler, and the metadata request is obtained. The compiler receives the metadata information back from the Meta store and starts communication to execute the query. The Execution engine receives the execution plans from the Driver. The bridge between Hadoop and Hive is the engine which processes the query. The results are fetched from the driver and sent to the Execution Engine which would eventually send the results to the front end via the driver.
There are two modes – Local, and Map Reduce on which Hive could operate. The local mode used in case of small data sets, and the data is processed at a faster speed in the local system. In Map Reduce mode, there are multiple data nodes in Hadoop and used to execute large datasets in a parallel manner. A better performance on large data sets could be achieved through this. The Map Reduce mode is default in Hive.
The server interface in Hive is known as HS2 or the Hive Server2 where the query execution against the Hive is enabled for the remote clients. Authentication and concurrency for multiple clients are some of the advanced features included in the latest versions.
Impala is a parallel query processing engine running on top of the HDFS. Between both the components the table’s information is shared after integrating with the Hive Metastore. As Hive is mostly used to perform batch operations by writing SQL queries, Impala makes such operations faster, and efficient to be used in different use cases.
There are some changes in the syntax in the SQL queries as compared to what is used in Hive. The transform operation is a limitation in Impala. Distributed across the Hadoop clusters, and used to query Hbase tables as well.
The parquet file used by Impala is used for large scale queries. In this format, the data is stored vertically i.e., the columnar storage of data. Thus the performance while using aggregation functions increases as only the columns split files are read. The encoding and compression schemes are efficiently supported by Impala. Impala could be used in scenarios of quick analysis or partial data analysis. Along with real-time processing, it works well for queries processed several times.
The architecture of Impala consists of three daemons – Impalad, Statestored, and Catalogd.
The Impalad is the core part of Impala which allows processing of data files and accepts queries with JDBC ODBC connections. The distribution of work across the nodes and the transmission of results to the coordinator node immediately is facilitated by the Impalad.
The Impala daemons availability is checked by the Statestored. The health of the nodes are continuously checked by constant communication between the daemons, and the Statestored. In case of a node failure, all other Impalad daemons are notified by the Statestored to leave that daemon out for future task assignment.
The metadata changed from DDL to other nodes are notified by the Catalogd daemon. Its configuration is required in a single host.
The Impalad takes any query requests, and the execution plan is created. Impalad communicates with the Statestored, and the hive Metastore before the execution.
Various built-in functions like MIN, MAX, AVG are supported in Impala. It also supports the dynamic operation. The VIEWS in Impala acts as aliases.
Big Data plays a massive part in the modern world with Hive, and Impala being two of the mechanisms to process such data. This article gave a brief understanding of their architecture and the benefits of each.
If you want to read more about data science, you can read our blogs here