Contributed by Chuck Currin of Mather Economics:
There’s tremendous value in corporate data, and some companies can maximize that value through the use of a data lake. This assumes that the adopting company has high-volume, unstructured data to contend with. This article describes ways that a data lake can help companies maximize the value of their data. The term “data lake” has been credited to James Dixon, the CTO of Pentaho. He offered the following analogy:
“If you think of a datamart as a store of bottled water – cleansed and packaged and structured for easy consumption – the data lake is a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in or take samples.”
Rapid Data Application Development
In practice, a data lake can be a very useful platform for fishing out new data applications. In particular, the ability of Hadoop’s HDFS to store structured and unstructured data side by side is a game changer. Data analysts can use Hive or HBase to put queryable metadata on top of the unstructured data, which makes it possible to join very disparate data sources. Once that structure is in place, analysts can run queries and machine learning algorithms against the data iteratively to gain further insights, and can use R, Stata or Python for additional statistical analysis. This methodology enables data organizations to quickly develop churn models and survival models, run Monte Carlo simulations and perform other advanced analytics.
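The idea of putting queryable structure on top of raw data and then joining it to a structured source can be illustrated with a minimal Python sketch. It is not Hive or HBase code; the log format, the field names and the subscriber export are all hypothetical, chosen only to show a schema being applied at read time rather than at write time.

```python
import csv
import io
import re

# Hypothetical raw web-server log lines, stored as-is (no schema at write time).
RAW_LOGS = """\
2015-03-01T10:00:00 user=101 action=login
2015-03-01T10:05:00 user=102 action=view_article
malformed line that matches no pattern
2015-03-02T09:30:00 user=101 action=view_article
"""

# Hypothetical structured export from a subscription system.
SUBSCRIBERS_CSV = """\
user_id,plan
101,premium
102,basic
"""

# The "schema" lives in the reader, not in the stored data.
LOG_PATTERN = re.compile(r"^(?P<ts>\S+) user=(?P<user_id>\d+) action=(?P<action>\w+)$")


def read_logs(raw_text):
    """Apply a schema to raw log text at read time, skipping lines that do not fit."""
    for line in raw_text.splitlines():
        match = LOG_PATTERN.match(line)
        if match:
            yield match.groupdict()


def read_subscribers(csv_text):
    """Index the structured records by user_id for the join."""
    return {row["user_id"]: row for row in csv.DictReader(io.StringIO(csv_text))}


def actions_per_plan(raw_logs, subscribers_csv):
    """Join parsed log events to subscriber records and count events per plan."""
    subscribers = read_subscribers(subscribers_csv)
    counts = {}
    for event in read_logs(raw_logs):
        subscriber = subscribers.get(event["user_id"])
        if subscriber is None:
            continue  # event for an unknown user; left out of the join
        plan = subscriber["plan"]
        counts[plan] = counts.get(plan, 0) + 1
    return counts


print(actions_per_plan(RAW_LOGS, SUBSCRIBERS_CSV))
```

Because the raw lines are never rewritten, a different pattern can be applied to the same data later if an analyst needs other fields.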
Not only is the data lake very useful for data discovery, it is also a great platform for rapid data application development. Analysts use it to experiment with queries and statistical modelling and to develop new data applications iteratively. Because data storage and computing resources are inexpensive, you can scale out a Hadoop cluster on commodity hardware as necessary and accommodate massive data volumes. And because HDFS imposes no schema, it can handle any file format or type. Non-relational data such as machine logs, images and binary files can be stored in your data lake, alongside structured, relational data. An added benefit is that data can be stored compressed on HDFS and left compressed while you query over it. Beyond the fiscal advantages of pooled computing resources, it is highly advantageous to be able to quickly integrate disparate data sets into one place where advanced analytics can be applied.
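Querying data that stays compressed at rest can be sketched with Python’s standard gzip module. This is only an analogy for what Hadoop tooling does with compressed files, not HDFS code: the records and the count_status helper are hypothetical, and the bytes are decompressed in memory as they stream past while the stored copy remains compressed.

```python
import gzip
import io

# Hypothetical log records, compressed once at write time.
records = [f"{i},{'error' if i % 4 == 0 else 'ok'}" for i in range(12)]
compressed = gzip.compress("\n".join(records).encode("utf-8"))


def count_status(compressed_bytes, status):
    """Stream over gzip-compressed lines and count those with the given status."""
    count = 0
    # gzip.open accepts an existing file object; "rt" yields decoded text lines.
    with gzip.open(io.BytesIO(compressed_bytes), mode="rt", encoding="utf-8") as lines:
        for line in lines:
            _, line_status = line.rstrip("\n").split(",")
            if line_status == status:
                count += 1
    return count


print(count_status(compressed, "error"))
```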
Data Lakes Complement Data Warehouses
The data lake isn’t a replacement for a data mart or data warehouse; the functionalities of the two are complementary. In the data warehousing world, Extract, Transform and Load (ETL) processes feed the system: because a relational database is involved, the data is extracted, then staged and cleansed, and only then loaded. By contrast, the data lake side typically uses Extract, Load and Transform (ELT) processes. The letters are transposed in the data lake’s acronym because you load data into the lake indiscriminately, so extraction and loading come first and the transformation step happens later. Data warehousing provides temporal context around the various aspects of your business. That context cannot be replaced by a data lake.
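The difference in ordering between ETL and ELT can be shown with a small Python sketch. The source records, the cleansing rule and the function names are hypothetical; plain lists stand in for the warehouse and the lake.

```python
# Hypothetical source records; the second one is dirty on purpose.
SOURCE = [
    {"id": "1", "amount": "19.99"},
    {"id": "2", "amount": "n/a"},
    {"id": "3", "amount": "5.00"},
]


def transform(record):
    """Cleanse one record; return None when it cannot be repaired."""
    try:
        return {"id": int(record["id"]), "amount": float(record["amount"])}
    except ValueError:
        return None


def etl(source):
    """Warehouse style: transform first, load only records that pass cleansing."""
    warehouse = []
    for record in source:
        clean = transform(record)
        if clean is not None:
            warehouse.append(clean)
    return warehouse


def elt_load(source):
    """Lake style: load every record untouched."""
    return list(source)


def elt_transform(lake):
    """Lake style: transform later, when a query needs the cleansed view."""
    return [clean for clean in map(transform, lake) if clean is not None]


warehouse = etl(SOURCE)
lake = elt_load(SOURCE)
print(len(warehouse), len(lake))
print(elt_transform(lake) == warehouse)
```

In this sketch the lake keeps the record that failed cleansing, so a later, more forgiving transformation could still recover it, whereas the warehouse never sees it.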
Maximizing Your Data’s Value
Technology managers are always looking to maximize the ROI of their data projects. Indiscriminately pooling corporate data into a data lake gives them more opportunities to recoup their technology investments. By taking advantage of commodity hardware and the ease of integrating disparate data sources, data organizations can maximize the value of their data by continuously improving their business models and rapidly developing new data applications. Statistical models are likely to take multiple iterations to optimize, and a data lake makes it quick to integrate additional metrics into them. The data lake can also feed a traditional data warehouse, or it can load data from the warehouse to do a “mashup” against unstructured, non-relational data. Finally, the data lake’s place in the data organization is potentially huge. With the ability to keep all historical data, do complex statistical modelling, create new data applications and enhance the data warehouse, a company can continuously innovate and maximize the business value generated from its data.