Beyond the Data Warehouse – The Data Lake

Over the last few years, organizations have made a strategic decision to turn big data into competitive advantage. Rapid changes in the BI and data warehousing (DW) space have pushed organizations to explore how big data can be integrated into their existing enterprise data warehouse (EDW) infrastructure. Central to that integration challenge is the process of extracting data from sources such as social media, weblogs, and sensor feeds, and transforming it to suit the organization's analytical needs; this is the process commonly known as ETL (Extract, Transform & Load). Storing such huge volumes of data, and processing them economically, demands unconventional thinking: the challenge is not only selecting the right technology components but also integrating them in the right places. Thanks to the open source community, we now have many options for embracing big data at very low cost and with very high computing throughput, and the Data Lake has become almost synonymous with big data.

What Is a Data Lake?

The data lake concept is closely tied to Apache Hadoop and its ecosystem of open source projects. Most brainstorming sessions about the data lake revolve around how to build one using the power of the Apache Hadoop ecosystem. A data lake provides an economically viable and technologically feasible way to meet big data challenges, and organizations typically see it evolve out of their existing data architecture.

How Does Apache Hadoop Fit into Enterprise Architecture?

Apache Hadoop has a plethora of use cases. Harnessing its distributed computing power, Hadoop is often used as an ETL workhorse. Because it can store data well beyond the petabyte scale, with built-in redundancy that makes recovery easy, it also serves as backup or "cold" storage. Hadoop's MapReduce engine, and the Spark engine that runs alongside it, can deal with unstructured data because Hadoop's data model is not strict: it follows a "schema on read" approach, meaning data can be stored in any form and a model is imposed only when the data is read. This is the opposite of the data warehousing model, where the data model is predefined and is by nature "schema on write". The architecture diagram below gives a conceptual view of how Hadoop can be integrated with an organization's existing EDW and data marts, letting us combine the computational power of the Hadoop ecosystem with real-time analytics and advanced data science algorithms; a brief code sketch of the schema-on-read idea follows the diagram.

[Architecture diagram: Hadoop integrated with the existing EDW and data marts]
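To make "schema on read" concrete, here is a minimal PySpark sketch: raw JSON events land in HDFS exactly as produced, and a schema is applied only at read time. The paths and field names are hypothetical, purely for illustration.

```python
# Minimal sketch of "schema on read": raw JSON events sit in HDFS as-is,
# and a schema is applied only when the data is read.
# The HDFS path and field names below are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("schema-on-read-demo").getOrCreate()

# The schema lives in the reading code, not in the storage layer.
event_schema = StructType([
    StructField("user_id", StringType()),
    StructField("action", StringType()),
    StructField("ts", TimestampType()),
])

# Apply the schema at read time; the files were never validated against it
# at write time (unlike a warehouse's schema-on-write model).
events = spark.read.schema(event_schema).json("hdfs:///raw/weblogs/")
events.groupBy("action").count().show()
```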

By integrating Hadoop we can not only deal with unstructured data, but also run ETL processing quickly and efficiently thanks to Hadoop's distributed computing capability. And because Hadoop runs on commodity hardware, all of this comes at a very economical price.

A quote from CMSReport:

"The Teradata Active Data Warehouse starts at $57,000 per Terabyte. In comparison, the cost of a supported Hadoop distribution and all hardware and datacenter costs are around $1,450/TB, a fraction of the Teradata price. With such a compelling price advantage, it is a no brainer to use Hadoop to augment and in some cases even replace the enterprise data warehouse."

How Can Apache Hadoop Replace the Enterprise Data Warehouse?

Hadoop's traditional MapReduce computing engine is batch oriented, which makes it well suited to ETL jobs: Hadoop's strengths are massive redundant storage and highly efficient distributed computing, and ETL workloads are typically batch oriented too. Interactive data analysis is a different ball game, because users expect sub-second responses that a MapReduce engine cannot deliver. Thanks to various ecosystem projects, however, the Hadoop family has matured enough to support interactive data analysis. The architecture diagram below depicts how these ecosystem tools can be used to build an entirely new type of data warehouse, the Data Lake, which gives us the interactivity of a data warehouse and data marts along with all the benefits of unstructured data handling, massive storage, and a distributed computing workhorse, all at a rock-bottom price.

[Architecture diagram: the Data Lake built from Hadoop ecosystem tools for interactive analysis]

Meet the New Heroes of Data Analysis in the Hadoop Ecosystem

Apache Spark is an interactive data processing engine that can be used for ETL or for interactive reporting by pulling data directly out of Hadoop. Billed as a "lightning-fast cluster computing engine", Spark in its in-memory mode is claimed to be up to 100x faster than the MapReduce engine, and up to 10x faster on disk. By combining Hadoop's massive redundant storage with Spark's in-memory interactive computing, we can eliminate the EDW or relational database at the front end and run interactive reporting and analysis directly against the Data Lake. Spark also ships with a machine learning library, MLlib, which data scientists can use while still exploiting Spark's parallel computing capabilities.
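As a rough illustration of Spark playing both roles, the sketch below cleans raw readings into a curated Parquet dataset (the ETL step) and then runs an MLlib k-means job on the same cluster (the analysis step). All paths and column names are assumptions made up for the example.

```python
# Sketch: Spark as both ETL engine and analysis layer over the lake.
# HDFS paths and column names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.appName("lake-etl-and-ml").getOrCreate()

# ETL step: clean raw sensor readings and persist a curated dataset.
raw = spark.read.json("hdfs:///raw/sensors/")
curated = (raw
           .dropna(subset=["temperature", "humidity"])
           .withColumn("temp_c", (F.col("temperature") - 32) * 5 / 9))
curated.write.mode("overwrite").parquet("hdfs:///curated/sensors/")

# Analysis step: run an MLlib algorithm (k-means here) on the same cluster,
# reusing Spark's in-memory parallelism instead of exporting to an EDW.
features = VectorAssembler(inputCols=["temp_c", "humidity"],
                           outputCol="features").transform(curated)
model = KMeans(k=3, seed=42).fit(features)
model.transform(features).groupBy("prediction").count().show()
```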

Apache Impala: Impala is a distributed query engine that runs on Apache Hadoop. It integrates tightly with the Hadoop Distributed File System (HDFS) and Hadoop's metadata. Equipped with a massively parallel processing (MPP) engine, Impala supports high-concurrency workloads, giving business analysts across the entire organization broad access to the data, and it is among the fastest engines for time-to-insight. Impala bypasses the MapReduce engine and instead loads data from Hadoop into memory, which is why it is often described as Hadoop's in-memory MPP database. All the leading business intelligence (BI) tools integrate seamlessly with Impala through JDBC/ODBC drivers, so business users can continue working with their existing infrastructure.
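For programmatic access, a minimal sketch using the impyla package's DB-API interface is shown below; BI tools take the equivalent JDBC/ODBC route. The host, port, and table names here are assumptions for illustration.

```python
# Sketch: querying Impala from Python via impyla's DB-API interface.
# Host and table names are hypothetical; 21050 is Impala's usual
# HiveServer2-compatible port.
from impala.dbapi import connect

conn = connect(host="impala-coordinator.example.com", port=21050)
cur = conn.cursor()

# Impala's MPP engine executes this directly over HDFS data,
# bypassing MapReduce, so results come back interactively.
cur.execute("""
    SELECT region, COUNT(*) AS orders
    FROM sales.orders
    GROUP BY region
    ORDER BY orders DESC
    LIMIT 10
""")
for row in cur.fetchall():
    print(row)
```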

Apache Drill: Drill is an innovative distributed SQL engine designed for data exploration and analytics on non-relational data stores. Users can query data with standard SQL and BI tools without having to create and manage schemas first. Like Impala, Drill bypasses the MapReduce engine and works directly with data in Hadoop; it also supports many data sources beyond Hadoop and is extremely user and developer friendly.
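As a hedged illustration of schema-free querying, Drill exposes a REST endpoint (POST /query.json on the Drillbit's web port) that accepts plain SQL; the sketch below queries a raw JSON file without any schema definition. The Drill host and file path are hypothetical.

```python
# Sketch: querying a raw JSON file through Apache Drill's REST API
# without defining any schema first. Host and path are hypothetical.
import requests

DRILL_URL = "http://drillbit.example.com:8047/query.json"

payload = {
    "queryType": "SQL",
    # Drill infers the structure of the JSON file at query time.
    "query": "SELECT t.user_id, t.action FROM dfs.`/raw/weblogs/day1.json` t LIMIT 5",
}

resp = requests.post(DRILL_URL, json=payload)
resp.raise_for_status()
for row in resp.json().get("rows", []):
    print(row)
```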

Conclusion:

Data is bound to grow exponentially, and the need to rely on big data will become inevitable sooner rather than later. With its ability to work on both structured and unstructured data, the Hadoop ecosystem equips you with great power and lends you a competitive advantage. The cost is the icing on the cake: Hadoop is far more economical than enterprise data warehousing. Happy Hadooping.