Subscribe to DSC Newsletter

Data Warehouse and Data Lake Analytics Collaboration

This blog was written with the thoughtful assistance of David Leibowitz, Dell EMC Director of Business Intelligence, Analytics & Big Data

So data warehousing may not be cool anymore, you say? It’s yesterday’s technology (or 1990’s technology if you’re as old as me) that served yesterday’s business needs. And while it’s true that recent big data and data science technologies, architectures and methodologies seems to have rendered data warehousing to the back burner, it is entirely false that there is not a critical role for the data warehouse and Business Intelligence in digitally transformed organizations.

Maybe the best way to understand today’s role of the data warehouse is with a bit of history. And please excuse us if we take a bit of liberty with history (since we were there for most of this!).

Phase 1: The Data Warehouse Era

Phase 1: In the beginning, Gods (Ralph Kimble and Bill Inmon, depending upon your data warehouse religious beliefs) created the data warehouse. And it was good. The data warehouse, coupled with Business Intelligence (BI) tools, served the management and operational reporting needs of the organization so that executives and line-of-business managers could quickly and easily understand the status of the business, identify opportunities, and highlight potential areas of under-performance (see Figure 1).

Figure 1: The Data Warehouse Era

The data warehouse served as a central integration point; collecting, cleansing and aggregating a variety of data sources from AS/400, relational and file based (such as EDI). For the first time, data from supply chain, warehouse management, AP/AR, HR, point of sale was available in a “single version of the truth.”

Using extraction-transform-load (ETL) processing wasn’t always quick, and could require a degree of technical gymnastics to bring together all of these disparate data sources. At one point, the “enterprise service bus” entered the playing field to lighten the load on ETL maintenance, but routines quickly went from proprietary data sources, to proprietary (and sometimes arcane) middleware business logic code (anyone remember Monk?).

The data warehouse supported reports and interactive dashboards that enabled business management to have a full grasp on the state of the business. That said, report authoring was static and not really enabled for democratizing data. Typically, the nascent concept of self-service BI was limited to cloning a subset of the data warehouse to smaller data marts, and extracts to Excel for business analysis purposes. This proliferation of additional data silos created reporting environments that were out of sync (remember the heated sales meetings where teams couldn’t agree as to which report figures were correct?) and the analysis paralysis caused by spreadmarts meant that more time was spent working the data rather than driving insight. But we all dealt with it, as it was agreed that some information (no matter the effort it took to acquire) was more important that no data.

Phase 2: Optimize the Data Warehouse

But IT man grew unhappy with being held captive by proprietary data warehouse vendors. The costs of proprietary software and expensive hardware (and let’s not even get started on user-defined functions in PL/SQL and proprietary SQL extensions that created architectural lock-in) forced organizations to limit the amount and granularity of data in the data warehouse. IT Man grew restless and looked for ways to reduce the costs associated with operating these proprietary data warehouses while delivering more value to Business Man.

Then Hadoop was born out of the ultra-cool and hip labs of Yahoo. Hadoop provided a low-cost data management platform that leveraged commodity hardware and open sources software that was an estimated to be 20x to 100x cheaper than proprietary data warehouses.

Man soon realized the financial and operational benefits afforded by a commodity-based, natively parallel, open source Hadoop platform to provide an Operational Data Store (now that’s really going old school!) to off-load those nasty Extract Load and Transform (ETL) processes off the expensive data warehouse (see Figure 2).

Figure 2: Optimize the Data Warehouse

 

The Hadoop-based Operational Data Store was deemed very good as it helped IT Man to decrease spending on the data warehouse (guess not so good if you were a vendor of those proprietary data warehouse solutions…and you know who you are T-man!). Since it’s estimated that ETL consumes 60% to 90% of the data warehouse processing cycles, and since some vendors licensed their products based upon those cycles – this concept of “ETL Offload” could provide substantial cost reductions. So in an environment limited by Service Level Agreements (because outside of Doc Brown’s DeLorean equipped with a flux capacitor, there’s still only 24 hours in a day in which to do all the ETL work), Hadoop provided a low-cost, high-performance environment for dramatically slowing the investment in proprietary data warehouse platforms.

Things were getting better, but still weren’t perfect. While IT Man could shave costs, he couldn’t make the tools easy to use by simple data consumers (like Executive Man). And while Hadoop was great for storing unstructured and semi-structured data, it couldn’t always keep up to the speed relied upon for relational or cube based reporting from traditional transactional systems.

Phase 3: Introducing Data Science

Then God created the Data Scientists, or maybe it was the Devil based upon one’s perspective. The data scientists needed an environment where they could rapidly ingest high volumes of granular structured (tables), semi-structured (log files) and unstructured data (text, video, images). They realized that data beyond the firewall was needed in order to drive intelligent insight. Data such as weather, social, sensor and third party could be mashed up with the traditional data stores in the EDW and Hadoop to determine customer insight, customer behavior and product effectiveness. This made Marketing Man happy. The scientists needed an environment where they could quickly test new data sources, new data transformations and enrichments, and new analytic techniques in search of those variables and metrics that might be better predictors of business and operational performance. Thusly, the analytic sandbox, which also runs on Hadoop, was born (see Figure 3).


Figure 3: Introducing Data Science

 

The characteristics of a data science “sandbox” couldn’t be more different than the characteristics of a data warehouse:

 

Finance Man tried desperately to combine these two environments but the audiences, responsibilities and business outcomes were just too varying to create an cost-effectively business reporting and predictive analytics in single bubble.

Ultimately, the analytic sandbox became one of the drivers for the creation of the data lake that could support both the data science and data warehousing (Operational Data Store) needs.

Data access was getting better for the data scientists but we again were moving towards proprietary process and a technical skill reserved for the elite. Still, things were good as IT Man, Finance Man and Marketing Man could work through the data scientists to drive innovation. But they soon wanted more.

Phase 4: Creating Actionable Dashboards

But Executive Man was still unsatisfied. The Data Scientists were developing wonderful predictions about what was likely to happen and prescriptions about what to do, but the promise of self-service BI was missing. Instead of the old days, and having to run to IT Man for reports, now he was requesting them of the Data Scientist.

The reports and dashboards created to support executive and front-line management in Stage 1 were the natural channel for rendering the predictive and prescriptive insights, effectively closing the loop between the data warehouse and the data lake. With data visualization tools like Tableau and Power BI, IT Man could finally deliver on the promise of self-service BI by providing interactive descriptive and predictive dashboards that even Executive Man could operate (see Figure 4).


Figure 4: Closing the Analytics Loop

 

And Man was happy (until the advent of Terminator robots began making decisions for us).

 

Views: 1543

Comment

You need to be a member of Data Science Central to add comments!

Join Data Science Central

Comment by Dan Kuhn on November 9, 2017 at 1:41pm

Bill – Great research paper! You concluded that “…data is like crude oil with potential value but is of limited economic value in its raw form. Data is refined with analytics that has more potential value…” and “the kinetic value of data is not realized until the analytics are ‘put into motion…’”. I believe that Aginity Amp is the best element available to get from potential value to kinetic value. Here’s why:

 

We have had a unique view on analytic re-use.  Most tools, including notebooks, produce a sequence of events focused on a data flow from input to output. This presents a challenge for analytic re-use because it’s difficult to re-use a variety of components in a new use case.  At best, if people can find existing assets, they are driven to copy and paste - meaning there are now new versions decoupled and being used through the organization.  Aginity has been approaching the problem through a concept of object oriented analytics - allowing atomic re-use of algorithms as building blocks that support encapsulation, inheritance, polymorphism, etc.  These atomic algorithms (features to the data scientist) can then be simply assembled with software generating the processing flow on-demand to feed the analysis.  This can create a level of re-use and productivity far beyond a shared code repository for notebook code.

 

We've seen an issue with so many tools needing to share analytics and assets provided by others. One recent discussion with a CDO said they have 148 different tools in the analytics and analysis domain.  Some people are using self-serve data prep tools, others self-serve visualization tools, data science tools like notebooks and such, as well as the normal BI tools (even Excel).  A company is lucky to get re-use and sharing within a single tool, but almost never get efficient reuse/sharing across multiple tools.  Aginity has been delivering on a vision of filling a missing gap… being the analytic layer that spans the entire analytic ecosystem. 

What are your thoughts? 

Comment by Bill Schmarzo on November 8, 2017 at 8:57am

Hey Joshua, I agree that being able to "operationalize" the analytics (e.g., cataloging, indexing, check in/check out, version control, regression testing, artifact capture and sharing) could be part of a Phase 5.  We use a concept called Analytic Profiles (think key-value store) that is a combination of Jypter Notebooks and Github to capture and re-use the analytics.  And I talk about it more in the University of San Fransisco research paper on determining the economic value of data.

USF EvD Research Paper
https://infocus.emc.com/wp-content/uploads/2017/04/USF_The_Economic...


And I see more tools coming to market to address the need to capture, refine and re-use the organization's analytic assets.  I would certainly appreciate more details on what Aginity Amp is doing to help address the issue of "Analytics Operationalization."

Thanks!
Comment by Joshua Isunza on November 8, 2017 at 8:22am

Interesting article, but I think there is a missing piece or maybe it's phase 5?

I am finding that working in phase 4 is not a happy place for marketing man, executive man, finance man, IT man, or business man.

Also, in phase 4 lots of time is wasted on data prep, and it is not uncommon that two different BI tools will be at odds in regard to the same analytic. 

The reason: Math is not reusable or managed, and analytics become inconstant and unreliable when transferring them from the analytics environment to the BI/Marketing automation environment.

The idea of 'analytic reuse' is new, but our solution is a single semantic layer (think universal adaptor) that allows analytics to be created, cataloged, managed, and distributed(through API call) as assets. Thus, making the analytics 'reusable'.

In my opinion this would be phase 5. Aginity Amp is a single semantic layer that sits between your end-point tools (BI, Stats, Marketing Automation, Execution, etc.) and your data platforms (Hadoop, NZ, Oracle, AWS, Snowflake, etc). 

Based on Figure 4 Amp would exists between both of the red arrows in the diagram. Thus, tons of time would be saved on data prep and your analytic projects would yield more consistent results. 

Interested in learning about Aginity Amp? Email me at [email protected]

Disagree? just reply here. 

- Joshua Isunza

Follow Us

Videos

  • Add Videos
  • View All

Resources

© 2017   Data Science Central   Powered by

Badges  |  Report an Issue  |  Privacy Policy  |  Terms of Service