Let us examine this scenario more closely using the example of a library. To maintain books in a library, you have two options. The first is to catalog the books by topic, author, title, and so on, and arrange them in order on the racks. This method is universally followed, and it makes fetching a specific book a single look-up, which saves search time. Storage based on catalogs and indexing is what traditional database management systems use. Now suppose we had unlimited computing resources that could perform a search in negligible time; then we would not need to keep the books in order at all. Just place the books on the racks as they arrive, and when someone asks for a book, search every rack from beginning to end. The search stops when the book is located, so the time it takes varies with the book's position on the rack. With catalog-based arrangement, by contrast, the time to locate a book is the same regardless of its position, but it takes extra time to index the books and keep them arranged on the racks. When a borrower returns a book, it must be placed back in its allotted location; a misplaced book may end up declared missing, or a full search of the entire rack must be run before declaring it missing.
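The trade-off between the two shelving strategies can be sketched in a few lines of Python (the titles are hypothetical): a catalog is like a dictionary giving constant-time look-up, while an unordered rack forces a linear scan whose cost depends on where the book happens to sit.

```python
books = ["Moby Dick", "Dune", "Hamlet", "Ulysses"]  # books in arrival order

# Cataloged racks: build an index once, then fetch in a single look-up.
catalog = {title: position for position, title in enumerate(books)}

def fetch_indexed(title):
    """O(1) look-up -- same cost wherever the book is shelved."""
    return catalog.get(title)

def fetch_by_scan(title):
    """O(n) scan -- cost grows with the book's position on the rack."""
    for position, shelved in enumerate(books):
        if shelved == title:
            return position
    return None  # full rack searched: declare the book missing

print(fetch_indexed("Hamlet"))   # 2
print(fetch_by_scan("Hamlet"))   # 2, but only after checking two books first
```

Note that building and maintaining `catalog` is itself work, paid up front and again on every return, which is exactly the indexing cost the library analogy describes.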
Here we have two parameters to consider before choosing a method for arranging books on the library racks. The first is the time taken to keep the books on the racks in an ordered manner. The second is the time taken for a full search of the racks. In data warehouses, we keep the data ordered using traditional databases, and structuring the data and placing it in database tables costs time and money. With the arrival of low-cost distributed file systems such as the Hadoop Distributed File System (HDFS), it has become possible to run parallel searches over the data to speed up the fetch operation, keeping fetch time to a minimum. In fact, the emergence of low-cost, highly reliable distributed file systems is what triggered data lakes. In a data lake, we simply keep data files in directories, and files with the same name can coexist in different directories. Analytics operations such as MapReduce jobs and machine learning training are designed to work on the entire data set, which makes data lakes well suited to big data analytics and machine learning systems. Apache Spark is a proven distributed computing framework that works on data stored in HDFS, so data lakes hosted on HDFS are used efficiently by Spark applications for big data analytics and machine learning.
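The parallel-search idea can be illustrated with a small stand-in sketch (this is plain Python with threads, not actual Spark or HDFS): split the unordered records into chunks and let several workers scan their chunks concurrently, the way a distributed file system lets many nodes search at once.

```python
from concurrent.futures import ThreadPoolExecutor

# Unordered "data lake" records; names are made up for illustration.
records = [f"record-{i}" for i in range(1000)]

def scan_chunk(chunk, wanted):
    # Each worker scans its own chunk from beginning to end.
    return [r for r in chunk if r == wanted]

def parallel_search(wanted, workers=4):
    size = len(records) // workers
    chunks = [records[i:i + size] for i in range(0, len(records), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(scan_chunk, chunks, [wanted] * len(chunks))
    return [hit for partial in results for hit in partial]

print(parallel_search("record-742"))  # ['record-742']
```

In a real Spark application the chunks would be HDFS blocks and the workers would be executors on separate machines, but the shape of the computation is the same: no index, full scan, parallelism doing the work.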
A data lake is a repository of data stored in its natural format, usually object blobs or files, typically on a distributed file system that maintains raw copies of source system data for reliability. It is a scalable repository that lets you store all your structured and unstructured data as it arrives. You can store the data as is, without first structuring and indexing it, and run different types of analytics on it, from dashboards and visualizations to big data processing, real-time analytics, and machine learning, to support better decisions.
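The "store as is" idea can be sketched as follows (the paths, source name, and payload here are hypothetical): raw files are landed under partitioned directories exactly as they arrive, with no upfront schema or indexing step, leaving interpretation to whatever analytics job reads them later.

```python
import json
from pathlib import Path

def land_raw_event(base_dir, source, day, payload):
    """Write one raw event file under <base_dir>/<source>/<day>/."""
    partition = Path(base_dir) / source / day
    partition.mkdir(parents=True, exist_ok=True)
    # Name the file by its arrival order within the partition.
    target = partition / f"event-{len(list(partition.iterdir()))}.json"
    target.write_text(json.dumps(payload))  # stored as is, schema-on-read
    return target

path = land_raw_event("lake", "clickstream", "2024-01-01",
                      {"user": "u1", "action": "view"})
print(path)
```

On HDFS the same layout would be written with `hdfs dfs -put` or a Spark writer, but the principle is identical: directories and raw files instead of tables and indexes.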
Data lakes are an ideal workload to deploy in the cloud, because the cloud provides performance, scalability, reliability, availability, a diverse set of analytics engines, and massive economies of scale. The top reasons customers see the cloud as an advantage for data lakes are better security, faster time to deployment, better availability, more frequent feature and functionality updates, more elasticity, wider geographic coverage, and costs linked to actual utilization.
The crux of the story is this: a data warehouse is expensive to maintain but gives quick access to specific data records, while a data lake is a low-cost implementation that gives slow access to specific records and is ideal for applications in which the entire data set is accessed in every processing cycle. I hope this helps you decide on a storage strategy based on your data usage scenario.
See you next time ………