According to Gartner, 80% of successful CDOs will have value creation or revenue generation as their Number 1 priority through 2021.
To create maximum value out of an organization's data landscape, traditional decision support system architectures are no longer adequate. New architectural patterns need to be developed to harness the power of data. To fully capture the value of big data, organizations need flexible data architectures that can extract maximum value from their data ecosystems.
The Data Lake concept has been around for some time now. However, I have seen organizations struggle to understand it, as many of them are still boxed into the older paradigm of the Enterprise Data Warehouse.
In this article, I will deep-dive into the conceptual constructs of the Data Lake and lay out an architecture pattern for it.
Let us start with the known first.
The traditional Enterprise DWH architecture pattern has been used for many years. There are data sources; data is extracted, transformed, and loaded (ETL), and along the way we apply some structure, cleanse the data, and so on. We predefine the data model in the EDW (a dimensional model or 3NF model) and then create departmental data marts for reporting, OLAP cubes for slicing and dicing, and self-service BI.
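To make that flow concrete, here is a minimal, hypothetical ETL sketch in Python with pandas: it extracts order records from a source export, cleanses them, and conforms them up front to a simple star schema (one dimension, one fact), the kind of predefined structure an EDW load targets. The file names and columns are illustrative assumptions, not any specific product or system.

```python
import pandas as pd
from pathlib import Path

# Extract: pull raw order records from a source system export (hypothetical file).
orders = pd.read_csv("orders_extract.csv", parse_dates=["order_date"])

# Transform: cleanse and impose structure before loading, as an EDW pipeline would.
orders = orders.dropna(subset=["customer_id", "amount"])
orders["amount"] = orders["amount"].astype(float)

# Conform to a predefined dimensional model: a customer dimension and a sales fact.
dim_customer = (
    orders[["customer_id", "customer_name", "region"]]
    .drop_duplicates(subset=["customer_id"])
)
fact_sales = orders[["order_id", "customer_id", "order_date", "amount"]]

# Load: write the modeled tables to the warehouse staging area (illustrative paths).
Path("edw").mkdir(exist_ok=True)
dim_customer.to_parquet("edw/dim_customer.parquet", index=False)
fact_sales.to_parquet("edw/fact_sales.parquet", index=False)
```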
This pattern is quite ubiquitous and has served us well for a long time now.
However, this pattern has some inherent challenges that prevent it from scaling in the era of Big Data. Let us look at a few of them:
Let us now discuss briefly how the definition of data has changed. The four Vs of big data (volume, velocity, variety, and veracity) are now very well known. Let me put some context around them:
In short, the definition of analysable data has changed. It is not just structured corporate data now, but all kinds of data. The challenge is to mash it all together and make sense of it.
Since 2000 there have been tremendous changes in processing capabilities, storage, and the corresponding cost structures, following what we call Moore's Law. Key points:
Let me explain the concept of Data Lake using an analogy.
Visiting a large lake is always a very pleasant feeling. The water in the lake is in its purest form, and different people perform different activities on the lake. Some people are fishing, some are enjoying a boat ride, and the lake also supplies drinking water to people living in Ontario. In short, the same lake is used for multiple purposes.
With the changes in the data paradigm, a new architectural pattern has emerged. It is called the Data Lake Architecture. Like the water in the lake, data in a data lake is in the purest possible form. And just as the lake caters to different people, those who want to fish, take a boat ride, or draw drinking water, a data lake architecture caters to multiple personas. It provides data scientists an avenue to explore data and create hypotheses. It provides an avenue for business users to explore data. It provides an avenue for data analysts to analyze data and find patterns. It provides an avenue for reporting analysts to create reports and present them to stakeholders.
The way I compare a data lake to a data warehouse or a mart is like this:
A Data Lake stores data in its purest form, caters to multiple stakeholders, and can also be used to package data in a form that end users can consume. A Data Warehouse, on the other hand, holds data that is already distilled and packaged for defined purposes.
Having explained the concept, let me now walk you through a conceptual architecture of a data lake. Here are the key components in a data lake architecture. We have our data sources, which can be structured or unstructured. They all feed into a raw data store that ingests data in the purest possible form, i.e. with no transformations. It is cheap, persistent storage that can hold data at scale. Then we have the analytical sandbox, which is used for understanding the data, creating prototypes, performing data science, and exploring the data to build new hypotheses and use cases.
Then we have a batch processing engine that processes the raw data into something the users can consume, i.e. a structure that can be used for reporting to the end user. We call this the processed data store. There is also a real-time processing engine that takes streaming data and processes it. All the data in this architecture is cataloged and curated.
Let me walk you through each component group in this Architecture.
The first component group caters to processing data. It follows an architecture pattern called the Lambda Architecture. Basically, Lambda architecture takes two processing paths: a batch layer and a speed layer. The batch layer stores data in the rawest possible form, i.e. the raw data store, while the speed layer processes data in near real time. The speed layer also writes data into the raw data store and may hold transient data before loading it into the processed data stores.
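Here is a minimal sketch of those two paths in plain Python, independent of any particular streaming engine, just to show the shape of the pattern: every event lands in the raw store, the speed layer keeps a running view of what has arrived since the last batch run, the batch layer periodically recomputes its view from the full raw store, and a query merges the two. All names and the toy event shape are illustrative assumptions.

```python
from collections import defaultdict

# Raw data store: every event is kept in its rawest form (append-only).
raw_store = []

# Processed (batch) view and the speed layer's near-real-time view.
batch_view = defaultdict(float)
realtime_view = defaultdict(float)

def ingest(event):
    """Speed layer path: land the event in the raw store and update the real-time view."""
    raw_store.append(event)
    realtime_view[event["product"]] += event["amount"]

def run_batch():
    """Batch layer path: periodically recompute the processed view from the full raw store."""
    batch_view.clear()
    for event in raw_store:
        batch_view[event["product"]] += event["amount"]
    realtime_view.clear()  # the speed layer now only needs to cover data newer than this batch

def query(product):
    """Serving: merge the batch view with whatever the speed layer has seen since."""
    return batch_view[product] + realtime_view[product]

ingest({"product": "widget", "amount": 10.0})
ingest({"product": "widget", "amount": 5.0})
print(query("widget"))   # 15.0, served from the speed layer before any batch run
run_batch()
print(query("widget"))   # 15.0, now served entirely from the batch view
```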
Analytical sandboxes are one of the key components in a data lake architecture. These are the exploratory areas for data scientists, where they can develop and test new hypotheses, mash up and explore data to form new use cases, create rapid prototypes to validate those use cases, and figure out what can be done to extract value for the business.
It's the place where data scientists can discover data, extract value, and help transform the business.
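As a hypothetical example of sandbox work, a data scientist might pull a slice of the raw store into a notebook, profile it, and test a quick hypothesis before anything is productionized. The file path and column names below are assumptions made purely for illustration.

```python
import pandas as pd

# Pull a slice of raw, untransformed data into the sandbox (hypothetical path).
clicks = pd.read_parquet("raw_store/web_clicks/2024-01.parquet")

# Profile it first: the data arrives as-is, so discovery comes before modeling.
print(clicks.dtypes)
print(clicks["channel"].value_counts(dropna=False))

# Rapid prototype of a hypothesis: do channels convert at different rates?
conversion_by_channel = (
    clicks.groupby("channel")["converted"].mean().sort_values(ascending=False)
)
print(conversion_by_channel)
```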
Data cataloging is an important principle that has been constantly overlooked in traditional business intelligence. In the big data landscape, cataloging is the most important aspect to focus on. Let me first give an analogy to explain what cataloging is. I do this exercise with my customers to get the point of cataloging across.
When I ask my customers to guess the potential price of the painting without providing the catalog information, the answers range from $100 to $100,000. The answers come much closer to the actual value once I provide the catalog information. By the way, this painting is 'The Old Guitarist' by Pablo Picasso, created in 1903. Its estimated value is more than $100 million.
The idea of a data catalog is very similar. Different data nuggets have different value, and this value varies based on the lineage of the data, its quality, its source of creation, and so on. The data needs to be cataloged so that a data analyst or a data scientist can decide for themselves which data point to use for a specific analysis.
The catalog map shows the potential metadata that can be captured. Cataloging is the process of capturing valuable metadata so that it can be used to determine the characteristics of the data and to decide whether or not to use it. There are basically two types of metadata: business and technical. Business metadata deals with definitions, logical data models, logical entities, and so on, whereas technical metadata captures the metadata related to the physical implementation of the data structure. It includes things like the database, quality score, columns, schema, etc.
Based on the catalog information, an analyst can choose to use a specific data point in the right context. Let me give you an example. Imagine that a data scientist wants to do an exploratory analysis of Inventory Turnover Ratio, and the metric is defined differently in the ERP and in the inventory system. If the term is cataloged, the data scientist can decide, based on the context, whether to use the column from the ERP or from the inventory system.
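Here is a minimal sketch of what such catalog entries might look like, with business and technical metadata side by side, and how an analyst could list the competing definitions of the same term before choosing one. The structure, field names, and sample values are my own illustration, not any specific catalog product.

```python
from dataclasses import dataclass

@dataclass
class CatalogEntry:
    # Business metadata: what the data means.
    term: str
    definition: str
    # Technical metadata: where and how it is physically stored.
    source_system: str
    database: str
    table: str
    column: str
    quality_score: float  # e.g. 0.0 to 1.0, from data profiling

catalog = [
    CatalogEntry(
        term="inventory_turnover_ratio",
        definition="Cost of goods sold divided by average inventory, fiscal year",
        source_system="ERP",
        database="erp_dw", table="finance_kpis", column="inv_turnover",
        quality_score=0.92,
    ),
    CatalogEntry(
        term="inventory_turnover_ratio",
        definition="Units shipped divided by average units on hand, rolling 90 days",
        source_system="InventorySystem",
        database="wms", table="stock_metrics", column="turnover_90d",
        quality_score=0.78,
    ),
]

# The analyst searches the catalog and picks the entry that fits the analysis context.
candidates = [e for e in catalog if e.term == "inventory_turnover_ratio"]
for entry in candidates:
    print(entry.source_system, entry.column, entry.quality_score, "-", entry.definition)
```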
Here is a slide that tries to explain the difference.
Cloud platforms are best suited to implement the Data Lake Architecture. They offer a host of composable services that can be woven together to achieve the required scalability. Microsoft's Cortana Intelligence Suite, for example, provides components that can be mapped to realize the Data Lake Architecture.