Deep Diving the Data Lake – Automatically Determining What’s In There

Summary: As your data lake grows and your user base becomes more diverse, you will need tools that automatically catalog data and control access to it. They are a huge benefit, and they enhance rather than diminish the spirit of freely exploring data for new value.

Data Lakes have been around almost as long as we’ve had Big Data. That means this year could be their 10th birthday. And they are still a valuable approach to storing data and making it widely available for exploration without the pain and suffering associated with putting it in your EDW.

The big idea was to trade:

Structure -> Ingest -> Analyze

The basis for all RDBMS, for:

Ingest -> Analyze -> Structure.

This is what we know as ‘late schema’ or ‘schema on read’. And the value continues to be that we don’t burden IT with the need to create new schemas, tables, and data, and in return we get our data on demand, not weeks or months from now.
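To make ‘schema on read’ concrete, here is a minimal sketch in Python. It assumes raw events were landed as JSON lines with no table defined up front; the event fields, the `read_with_schema` helper, and the schema itself are all hypothetical and only there for illustration.

```python
import json

# Raw events stored exactly as they arrived; nobody designed a table for them.
RAW_EVENTS = [
    '{"user": "u1", "page": "/home", "ms": 120}',
    '{"user": "u2", "page": "/pricing", "ms": 340, "referrer": "ad"}',
    '{"user": "u1", "page": "/pricing"}',
]

def read_with_schema(raw_lines, schema):
    """Apply a schema at read time: pick fields, cast types, default missing values."""
    rows = []
    for line in raw_lines:
        record = json.loads(line)
        row = {}
        for field_name, (cast, default) in schema.items():
            value = record.get(field_name)
            row[field_name] = cast(value) if value is not None else default
        rows.append(row)
    return rows

# Hypothetical schema chosen by one analyst for one question (page timing).
page_timing_schema = {"page": (str, ""), "ms": (int, 0)}
rows = read_with_schema(RAW_EVENTS, page_timing_schema)
```

A different analyst could read the same raw lines tomorrow with a different schema, say one that keeps `referrer`, without anyone having to alter a table first.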

We’re not saying bad things about IT. We’re saying that demand for stored and extracted data from data scientists, analysts, and line of business users is growing so fast that IT can’t keep up. EDWs based on RDBMS also simply require too much work to modify for our new mindset, “let’s see what’s in there”.

Data Puddles

Early data lakes were the meat-and-potatoes of data scientists working on new projects. They could take data from transactional systems, from EDWs, and from new Big Data sources like text and streams, create their own data lake, and go to town.

Because these early users were largely data scientists and had good SQL and coding skills as well as some decent data engineering chops they could do this in a self-contained way that served their immediate set of projects. If only one or a few data scientists were using it, they controlled what went in, how it was processed, and what it was used for.

But as data lakes evolved into more robust systems and more and more users wanted this capability, these older narrow-use instances didn’t meet the need of the much larger group of folks who wanted larger and larger varieties of data. To democratize data you needed something larger than these siloed data puddles.

The Modern Data Lake

The number of vendors providing turnkey data lakes that are easy to establish and easy to use has exploded. So has the use of data lakes, because the same time delays and costs still exist while the number of users clamoring for data has grown.

It’s more common now, and certainly more efficient, to have a single large data lake than any number of smaller data puddles. Data lakes are now much more likely to be created and supported by IT specialists and data engineers. That’s a good thing.

Don’t Throw It Out

Another positive aspect of creating centralized data lakes is that it tends to support a new company-wide ethos to save everything and analyze it later, even if we don’t yet know what the value will be.

Storage is cheap and a remarkable amount of potentially valuable data is being discarded all the time. Take weblogs for example or CSR recordings or transcripts. All sorts of interesting and value-enhancing models are being built from these sources.

The Challenge

But here’s the rub. The more data you put in there, the harder it is for any given user to know where to find it, how valid it is, how clean or dirty it might be, or whether it duplicates other data already in the lake. Any of these is a reason to slow down and be concerned as you undertake your analytic or modeling project. One source says accurately “finding data in data lake is like shopping in a flea market”.

To further complicate the matter, a wider group of users means a much wider set of skills. The data scientists on your team may be equipped to search out, join, shape and prep the data they need but it’s very unlikely that your line of business users can competently do that for themselves.

A third issue is what sort of proprietary or sensitive data might be in there. Whether it’s company proprietary transactional or financial data or customer personally identifying information (PII), not everyone can be allowed to see everything. And there are plenty of shades of gray here too. For example, your data scientists might need to see absolutely everything about each customer but from a legal standpoint ‘everything’ probably shouldn’t include social security numbers or full credit card numbers. So there are degrees to which the data needs to be filtered and cleansed before each type or level of user sees it.
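Here is one way those shades of gray might look in code. This is a sketch only: the role names, the field names, and the `POLICY` table are assumptions made up for the example, and a real policy would come from your legal and data stewardship teams.

```python
def mask_tail(value, keep=4):
    """Hide all but the last `keep` characters of a string."""
    hidden = max(len(value) - keep, 0)
    return "*" * hidden + value[hidden:]

# Hypothetical per-role rules. Fields not listed are shown unchanged.
POLICY = {
    "data_steward": {},
    "data_scientist": {"ssn": "drop", "card_number": "mask"},
    "business_user": {"ssn": "drop", "card_number": "drop", "email": "drop"},
}

def view_for(role, record):
    """Return the version of a customer record that a given role may see."""
    rules = POLICY[role]
    visible = {}
    for field_name, value in record.items():
        action = rules.get(field_name, "show")
        if action == "drop":
            continue
        visible[field_name] = mask_tail(value) if action == "mask" else value
    return visible

# Made-up customer record for illustration.
customer = {
    "customer_id": "C-1001",
    "email": "pat@example.com",
    "ssn": "123-45-6789",
    "card_number": "4111111111111111",
    "lifetime_value": "1840.50",
}
scientist_view = view_for("data_scientist", customer)
business_view = view_for("business_user", customer)
```

In this sketch the data scientist still sees nearly everything about the customer, but the social security number is gone and only the last four card digits survive.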

While data lakes started out with a kind of free-spirited, anything-goes ethos, it has rapidly become apparent that even these meccas of experimentation need governance.

Good Governance Means a Good Metadata Catalog and Access Control

From a user perspective the most valuable tool would be a good Metadata Catalog showing where everything is and what we know about it.

From an organizational standpoint, exposing just the Metadata Catalog to users is the perfect point to add governance in the form of user authorization levels, and also a good place to be inspecting incoming data for sensitive data like PII.

Finally, since new data from existing sources is likely to be added all the time, we need to leverage what we learned about that data source the first time around and apply that learning to any new data.

Oh yes, and as much as possible this should be fully automated. And to the extent it still requires individual SMEs to look at the data and evaluate it, we should turn that knowledge into a kind of additional crowdsourced metadata.
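As a rough illustration of a catalog that doubles as the governance checkpoint, consider the sketch below. The `CatalogEntry` fields, the numeric authorization levels, and the sample entries are all hypothetical; commercial catalogs carry far richer metadata.

```python
from dataclasses import dataclass, field

@dataclass
class CatalogEntry:
    name: str        # data set name as users would search for it
    source: str      # originating system
    zone: str        # where in the lake the data lives
    min_level: int   # lowest authorization level allowed to see this entry
    tags: set = field(default_factory=set)

# Made-up entries for illustration.
CATALOG = [
    CatalogEntry("weblogs_raw", "web server", "working", 1, {"clickstream"}),
    CatalogEntry("customer_master", "CRM", "gold", 1, {"customer"}),
    CatalogEntry("customer_pii", "CRM", "sensitive", 3, {"customer", "pii"}),
]

def search(catalog, tag, user_level):
    """Return names of entries carrying `tag` that this user is cleared to see."""
    return [
        entry.name
        for entry in catalog
        if tag in entry.tags and user_level >= entry.min_level
    ]

analyst_results = search(CATALOG, "customer", user_level=1)
steward_results = search(CATALOG, "customer", user_level=3)
```

Because users only ever reach data through the search, the authorization check lives in one place, and a user without clearance never learns the restricted data set exists.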

The Market Responds

Count on the marketplace to come to our aid when an obvious pain point exists. Many of the proprietary database providers have built in just such tools. Microsoft has the Azure Data Catalog, Teradata has a tool called Loom, and Oracle has Big Data Discovery.

There are a number of standalone packages as well. Waterline is a widely used package with partnerships with Hortonworks and MapR. Attivio Data Source Discovery is another (not meant to be an exhaustive list).

Here are some of the general features around which these packages are organized.

Zones or Levels of Preprocessing

To make the data as useful to as wide a group of users as possible it is going to need some level of preprocessing. Typically these packages identify four levels:

The Raw or Landing Zone:

Data arrives here with no governance or processing applied.
Begin the process of curation by the automatic identification of data by type and source and begin the creation of metadata.
Identify sensitive or PII data as soon as it’s received and take the action appropriate for your organization, typically removal or highly restricted access.
Users of the Landing Zone are usually only data engineers.

The Sensitive Zone:

Separate out the sensitive and PII data into a separate partition.
Put this zone under the control of designated data stewards and tightly control access.
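The scan-and-separate step described for the Landing and Sensitive Zones can be sketched as follows. This is a deliberately simple assumption-laden example: the two regular expressions only catch US-style social security numbers and plain email addresses, and the column names and sample rows are invented. Commercial tools use much broader detection.

```python
import re

# Illustrative patterns only; real PII detection covers many more types.
PII_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
}

def scan_columns(rows):
    """Return {column: set of PII types found} for a list of dict rows."""
    findings = {}
    for row in rows:
        for column, value in row.items():
            for pii_type, pattern in PII_PATTERNS.items():
                if pattern.search(str(value)):
                    findings.setdefault(column, set()).add(pii_type)
    return findings

def split_sensitive(rows):
    """Split rows into a cleared partition and a sensitive partition by column."""
    flagged = scan_columns(rows)
    cleared = [{c: v for c, v in row.items() if c not in flagged} for row in rows]
    sensitive = [{c: v for c, v in row.items() if c in flagged} for row in rows]
    return cleared, sensitive, flagged

# Made-up incoming rows for illustration.
incoming = [
    {"id": 1, "contact": "pat@example.com", "note": "called about invoice"},
    {"id": 2, "contact": "lee@example.org", "note": "SSN on file 123-45-6789"},
]
cleared, sensitive, flagged = split_sensitive(incoming)
```

Note that a column is moved to the sensitive partition as soon as any single value in it looks like PII, which errs on the side of restricting too much rather than too little.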

The Working Zone:

Use the full array of automated tools and human evaluation to create both automated and crowdsourced metadata in depth.
Users in this zone are typically data scientists capable of evaluating the utility of the data for themselves.

The Gold Zone:

This title varies by vendor but means a zone in which the data has been heavily cataloged and preprocessed.
It is likely that metadata includes detailed notes on source, validity, and other related data that might be combined to produce commonly used analyses.
It is likely that the data will have been cleaned even to include statistically valid methods of missing data imputation.
Preprocessing could include blending from different sources, shaping for example to resolve matching entities like customers from different sources, or even transforming data like aggregating, bucketing, or converting codes to user friendly names.
Users in this zone are analysts and line of business users. Tools on this level might include SQL but are just as likely to be Excel, Tableau, or Qlik.

Automatically Tagging Data

These applications will attempt to automatically create metadata tags for each incoming data object and type. The first users of new data will need to check whether the automatic identification is accurate, but their confirmation or correction is recorded and adds authority and confidence for future users. Similarly, when new data from the same source is added later, this earlier crowdsourced curation will be applied to the new data.

Some vendors like Waterline have gone so far as to use predictive analytics to suggest the confidence the user should have in their automatically created tags to guide further user curation.
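A toy version of this tag, confirm, and reuse loop might look like the sketch below. To be clear about the assumptions: the confidence score here is just the fraction of values matching a pattern, which is a stand-in and not how Waterline or any other vendor computes it, and the `TagCatalog` class, tag names, and patterns are hypothetical.

```python
import re

# Illustrative tag patterns; a value must match in full to count as a hit.
TAG_PATTERNS = {
    "us_zip": re.compile(r"\d{5}(-\d{4})?"),
    "iso_date": re.compile(r"\d{4}-\d{2}-\d{2}"),
}

class TagCatalog:
    def __init__(self):
        # (source, column) -> tag confirmed or corrected by a human
        self.confirmed = {}

    def suggest(self, source, column, values):
        """Return (tag, confidence) for a column of values."""
        key = (source, column)
        if key in self.confirmed:
            # Earlier human curation is applied to new data from the same source.
            return self.confirmed[key], 1.0
        best_tag, best_score = "unknown", 0.0
        for tag, pattern in TAG_PATTERNS.items():
            hits = sum(1 for v in values if pattern.fullmatch(str(v)))
            score = hits / len(values) if values else 0.0
            if score > best_score:
                best_tag, best_score = tag, score
        return best_tag, best_score

    def confirm(self, source, column, tag):
        """Record a user's confirmation or correction of a tag."""
        self.confirmed[(source, column)] = tag

catalog = TagCatalog()
first_guess = catalog.suggest("crm", "postal", ["30301", "94105", "N/A", "10001"])
catalog.confirm("crm", "postal", "us_zip")
later_guess = catalog.suggest("crm", "postal", ["60601", "unknown"])
```

The first user sees a tag with a less than perfect score and knows to take a look; once that user confirms it, later loads from the same source inherit the tag without anyone having to repeat the inspection.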

Governed Data Lakes

The concept of governance over data lakes does not diminish the free-spirited exploration of data. Although governance requires some organization and resources, it greatly enhances the utility of the data to the largest group of users and limits the company’s risk from data misuse.

These tools to create automated metadata catalogs are a huge benefit and are now a must-have for any data-committed company.

About the author: Bill Vorhies is Editorial Director for Data Science Central and has practiced as a data scientist and commercial predictive modeler since 2001.
