Data Warehouse and Data Lake Analytics Collaboration

This blog was written with the thoughtful assistance of David Leibowitz, Dell EMC Director of Business Intelligence, Analytics & Big Data

So data warehousing may not be cool anymore, you say? It’s yesterday’s technology (or 1990’s technology if you’re as old as me) that served yesterday’s business needs. And while it’s true that recent big data and data science technologies, architectures and methodologies seem to have pushed data warehousing to the back burner, it is entirely false that the data warehouse and Business Intelligence have no critical role in digitally transformed organizations.

Maybe the best way to understand today’s role of the data warehouse is with a bit of history. And please excuse us if we take a bit of liberty with history (since we were there for most of this!).

Phase 1: The Data Warehouse Era

Phase 1: In the beginning, Gods (Ralph Kimball and Bill Inmon, depending upon your data warehouse religious beliefs) created the data warehouse. And it was good. The data warehouse, coupled with Business Intelligence (BI) tools, served the management and operational reporting needs of the organization so that executives and line-of-business managers could quickly and easily understand the status of the business, identify opportunities, and highlight potential areas of under-performance (see Figure 1).

Figure 1: The Data Warehouse Era

The data warehouse served as a central integration point, collecting, cleansing and aggregating a variety of data sources: AS/400, relational and file-based (such as EDI). For the first time, data from supply chain, warehouse management, AP/AR, HR and point of sale was available in a “single version of the truth.”

Using extraction-transform-load (ETL) processing wasn’t always quick, and could require a degree of technical gymnastics to bring together all of these disparate data sources. At one point, the “enterprise service bus” entered the playing field to lighten the load of ETL maintenance, but routines quickly moved from proprietary data source connections to proprietary (and sometimes arcane) middleware business logic code (anyone remember Monk?).
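To make the extract-transform-load pattern described above concrete, here is a minimal sketch in Python. The source systems, field names and amounts are all hypothetical; the point is only to show the shape of the work: pull records from two disparate feeds, normalize them into one common schema, and land them in a single “single version of the truth” table.

```python
import sqlite3

# Hypothetical extracts from two disparate sources: a point-of-sale
# system (records with amounts in cents) and an EDI-style pipe-delimited file.
pos_rows = [
    {"store_id": "S01", "sku": "A100", "amount_cents": 1999},
    {"store_id": "S02", "sku": "B200", "amount_cents": 4550},
]
edi_lines = [
    "S01|A100|12.00",
    "S03|C300|7.25",
]

def transform_pos(rows):
    # Normalize cents to a common decimal amount.
    return [(r["store_id"], r["sku"], r["amount_cents"] / 100.0) for r in rows]

def transform_edi(lines):
    # Parse the delimited feed into the same (store, sku, amount) shape.
    out = []
    for line in lines:
        store, sku, amount = line.split("|")
        out.append((store, sku, float(amount)))
    return out

# Load: land both conformed feeds in one "single version of the truth" table.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (store TEXT, sku TEXT, amount REAL)")
con.executemany("INSERT INTO sales VALUES (?, ?, ?)",
                transform_pos(pos_rows) + transform_edi(edi_lines))
total = con.execute("SELECT COUNT(*), ROUND(SUM(amount), 2) FROM sales").fetchone()
print(total)  # (4, 84.74)
```

In a 1990s shop the same three steps would have been COBOL jobs or a proprietary ETL tool rather than a script, but the extract/transform/load division of labor was identical.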

The data warehouse supported reports and interactive dashboards that gave business management a full grasp of the state of the business. That said, report authoring was static and not really built for democratizing data. Typically, the nascent concept of self-service BI was limited to cloning a subset of the data warehouse to smaller data marts, plus extracts to Excel for business analysis. This proliferation of additional data silos created reporting environments that were out of sync (remember the heated sales meetings where teams couldn’t agree on which report figures were correct?), and the analysis paralysis caused by spreadmarts meant that more time was spent working the data than driving insight. But we all dealt with it, as it was agreed that some information (no matter the effort it took to acquire) was better than no data.

Phase 2: Optimize the Data Warehouse

But IT Man grew unhappy with being held captive by proprietary data warehouse vendors. The costs of proprietary software and expensive hardware (and let’s not even get started on user-defined functions in PL/SQL and proprietary SQL extensions that created architectural lock-in) forced organizations to limit the amount and granularity of data in the data warehouse. IT Man grew restless and looked for ways to reduce the costs of operating these proprietary data warehouses while delivering more value to Business Man.

Then Hadoop was born out of the ultra-cool and hip labs of Yahoo. Hadoop provided a low-cost data management platform that leveraged commodity hardware and open source software, and was estimated to be 20x to 100x cheaper than proprietary data warehouses.

Man soon realized the financial and operational benefits afforded by a commodity-based, natively parallel, open source Hadoop platform: it could provide an Operational Data Store (now that’s really going old school!) to off-load those nasty extract-transform-load (ETL) processes from the expensive data warehouse (see Figure 2).

Figure 2: Optimize the Data Warehouse


The Hadoop-based Operational Data Store was deemed very good, as it helped IT Man decrease spending on the data warehouse (guess it wasn’t so good if you were a vendor of those proprietary data warehouse solutions…and you know who you are, T-man!). Since it’s estimated that ETL consumes 60% to 90% of data warehouse processing cycles, and since some vendors licensed their products based upon those cycles, this concept of “ETL Offload” could provide substantial cost reductions. So in an environment limited by Service Level Agreements (because outside of Doc Brown’s DeLorean equipped with a flux capacitor, there are still only 24 hours in a day in which to do all the ETL work), Hadoop provided a low-cost, high-performance environment for dramatically slowing the investment in proprietary data warehouse platforms.
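A minimal sketch of the ETL-offload pattern, with hypothetical data: in a real deployment the heavy lifting would be Hive or Spark jobs over HDFS, but the idea is the same — do the expensive staging and aggregation in the cheap commodity tier, and let only the small conformed result touch the licensed warehouse cycles.

```python
from collections import defaultdict

# Hypothetical raw clickstream events, landed in the low-cost tier
# (in a real ETL-offload setup this would be HDFS plus Hive/Spark jobs).
raw_events = [
    {"user": "u1", "page": "home"},
    {"user": "u1", "page": "cart"},
    {"user": "u2", "page": "home"},
    {"user": "u2", "page": "home"},
]

def offload_transform(events):
    """Do the heavy aggregation in the cheap tier, not the warehouse."""
    counts = defaultdict(int)
    for e in events:
        counts[e["page"]] += 1
    return dict(counts)

# Only the small, conformed aggregate ever touches the (expensive) warehouse.
warehouse_fact = offload_transform(raw_events)
print(warehouse_fact)  # {'home': 3, 'cart': 1}
```

If the vendor licenses by processing cycles, the warehouse now spends its cycles serving queries on `warehouse_fact` instead of crunching the raw events.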

Things were getting better, but still weren’t perfect. While IT Man could shave costs, he couldn’t make the tools easy for simple data consumers (like Executive Man) to use. And while Hadoop was great for storing unstructured and semi-structured data, it couldn’t always keep up with the speed required for relational or cube-based reporting from traditional transactional systems.

Phase 3: Introducing Data Science

Then God created the Data Scientists (or maybe it was the Devil, depending upon one’s perspective). The data scientists needed an environment where they could rapidly ingest high volumes of granular structured (tables), semi-structured (log files) and unstructured data (text, video, images). They realized that data beyond the firewall was needed in order to drive intelligent insight. Data such as weather, social, sensor and third-party data could be mashed up with the traditional data stores in the EDW and Hadoop to determine customer insight, customer behavior and product effectiveness. This made Marketing Man happy. The scientists needed an environment where they could quickly test new data sources, new data transformations and enrichments, and new analytic techniques in search of those variables and metrics that might be better predictors of business and operational performance. Thus the analytic sandbox, which also runs on Hadoop, was born (see Figure 3).

Figure 3: Introducing Data Science
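The sandbox-style mash-up of internal and external data can be sketched in a few lines of Python. All of the sources and values below are hypothetical; the point is that schema is applied on read, so a new external feed (here, weather) can be joined in minutes rather than waiting on a warehouse schema change.

```python
import json

# Hypothetical sandbox mash-up: an EDW extract (structured), application
# logs (semi-structured JSON lines) and a third-party weather feed.
edw_sales = {"2017-06-01": 120, "2017-06-02": 85}       # date -> units sold
log_lines = [
    '{"date": "2017-06-01", "visits": 400}',
    '{"date": "2017-06-02", "visits": 390}',
]
weather = {"2017-06-01": "sun", "2017-06-02": "rain"}   # external source

# Schema on read: parse the logs and join everything by date, looking for
# candidate predictors (e.g., does weather move conversion rate?).
features = []
for line in log_lines:
    rec = json.loads(line)
    d = rec["date"]
    features.append({
        "date": d,
        "conversion": edw_sales[d] / rec["visits"],
        "weather": weather[d],
    })
print(features[0]["conversion"])  # 0.3
```

A feature table like this is exactly the kind of throwaway experiment the sandbox exists for: if weather turns out to predict nothing, the whole thing is discarded at no cost to the warehouse.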


The characteristics of a data science “sandbox” couldn’t be more different from those of a data warehouse.


Finance Man tried desperately to combine these two environments, but the audiences, responsibilities and business outcomes were just too varied to deliver cost-effective business reporting and predictive analytics in a single bubble.

Ultimately, the analytic sandbox became one of the drivers for the creation of the data lake that could support both the data science and data warehousing (Operational Data Store) needs.

Data access was getting better for the data scientists, but we were again moving toward proprietary processes and technical skills reserved for the elite. Still, things were good, as IT Man, Finance Man and Marketing Man could work through the data scientists to drive innovation. But they soon wanted more.

Phase 4: Creating Actionable Dashboards

But Executive Man was still unsatisfied. The Data Scientists were developing wonderful predictions about what was likely to happen and prescriptions about what to do, but the promise of self-service BI was missing. Unlike the old days, when he had to run to IT Man for reports, now he was requesting them from the Data Scientist.

The reports and dashboards created to support executive and front-line management in Phase 1 were the natural channel for rendering the predictive and prescriptive insights, effectively closing the loop between the data warehouse and the data lake. With data visualization tools like Tableau and Power BI, IT Man could finally deliver on the promise of self-service BI by providing interactive descriptive and predictive dashboards that even Executive Man could operate (see Figure 4).

Figure 4: Closing the Analytics Loop


And Man was happy (until the advent of Terminator robots began making decisions for us).





Comment by jaap Karman on January 27, 2018 at 9:37am

Bill, you started in the 80’s with Codd and 3NF. I hope you recognized your 2014 blog “Vin Diesel — flatten the star.” I used that one, as it is a very rare statement that comes with its explanation. You already made the statement there about the difference in descriptive analytics using the Kimball structure. Add Inmon, Linstedt and many others, as they are dictating a schema on write before we understand the data. Descriptive analytics is very different in its handling approach. This issue is the wall of confusion.

You have probably noticed I put the combined data sources as the delivery of the EWH (data lake, as you wish).

Neither the structuring into an OLAP dashboard nor the delivery to predictive/prescriptive is part of that.

No, it is not really realized that way yet; it will be a rather big change to get it accepted.

However, seeing the real production lines with warehouses & logistics, it is more like that. Manufacturing analyses are not topics dealt with in warehousing.

The descriptive analytics as comparison:  

My background is somewhat different from yours. I also had key-value, CODASYL (IDMS) and basics like the balance line. The mind is not blocked by the necessity of schemas, although you need ER relationships with a clear data context when retrieving and processing the information. I would prefer modelling the data quite differently; it would become data containers in some temporal validation chain. A specific axiom: no business logic, as that comes only on read, as needed, later.

The challenging figures mentioned for a predictive model project are a real experience.

Together with another one, to share only by private mail.





Comment by Bill Schmarzo on January 27, 2018 at 5:40am

What is the technology foundation for your "data marts"?  I find the "schema on load" necessity of a traditional data mart / data warehouse greatly inhibits the ability to quickly integrate, transform, and enrich new data sources.  I prefer the "schema on read" nature of Hadoop (and the fact that when I do build a schema, it's a simple and fast-to-design flat file).

Comment by jaap Karman on January 26, 2018 at 11:06pm

Bill, I understand the business value of giving results out of analytics back to the primary operations.
Classifying platinum / gold / silver / bronze for getting more value (sales) and better offerings is a classic one, as is the churn rate. As minimizing loss is the other end of profits, the same kind of analytics is applicable; just a sign difference, with the opposite direction.

Monthly updates are not really challenging. Make it daily, with a maximum of 1 hour processing (1 hour spare) for up to 300,000 scores, to achieve selections for further investigation. This will require a well-designed process of modelling, deployment and operations. What would you think of:

There are two local data marts: one for modelling and one, after deployment, for operations.

For the model evaluation, results are saved so they can be used for improvements and validation of new models.
The hand-over of modelling to deployed operations is in place. The full process circle is now possible and clear.


Comment by Bill Schmarzo on January 26, 2018 at 2:05pm

Thanks Jaap for the details.  I like your detailed flowcharts.

One reason why we might want to feed the analytic results back to the data lake is for future analysis. For example, let's say I'm creating a customer loyalty score that I update monthly.  I might want to track changes to that loyalty score, as changes greater than certain deviations may be reason for further investigation (they might flag at-risk customers).
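Bill's at-risk example can be sketched in a few lines. The scores, customers and threshold below are all hypothetical, and this is a deliberately naive month-over-month check rather than anyone's actual scoring method; it only illustrates why keeping score history in the lake pays off.

```python
# Hypothetical monthly loyalty scores fed back into the data lake; a swing
# larger than some threshold flags a customer for further investigation.
history = {
    "cust_a": [82, 80, 79, 61],   # sharp drop: possible at-risk customer
    "cust_b": [55, 57, 56, 58],   # stable
}
THRESHOLD = 10  # hypothetical deviation that warrants a look

def flag_at_risk(scores_by_customer, threshold):
    flagged = []
    for cust, scores in scores_by_customer.items():
        # Compare each month's score with the previous month's.
        for prev, cur in zip(scores, scores[1:]):
            if abs(cur - prev) > threshold:
                flagged.append(cust)
                break
    return flagged

print(flag_at_risk(history, THRESHOLD))  # ['cust_a']
```

Without the historical scores persisted somewhere, only the latest number survives and the deviation signal is lost — which is the argument for the feedback loop to the lake.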

Comment by jaap Karman on January 26, 2018 at 1:10am

Bill, I added something, but didn't like the result. A new attempt.

I see you are feeding back to the Enterprise Warehouse (data lake). Recently I concluded it should be acceptable, as it is acceptable in real physical warehouses. There is some dogma that it is not allowed (Inmon); as it was politically justified in the 80's, it could be a blocking limitation now.
This is my figure; note the similarity.


It was a result of reviewing and rebuilding the famous CRISP-DM visualisation into my own one. It adds the PDCA circle, the strategy-tactics-operations level, who is doing what when, and some additional circles.

For modelling and scoring, different requirements on data sources exist. Build them segregated. The deployment stage I have fully filled in. Dev/Test/UAT/PRD can be added in another dimension (different sheet).

No, I don't believe it will be an easy change in project approaches. This is why:



Comment by Bill Schmarzo on January 24, 2018 at 5:08pm

I agree that there are tools and applications out there today that help with re-use of the "math" and analytics.  It started with Jupyter Notebooks and GitHub, but has progressed to commercial tools like Aginity Amp and Domino Data.

You guys are certainly making life much easier in Phase 4!!

Comment by Dan Kuhn on November 9, 2017 at 1:41pm

Bill – Great research paper! You concluded that “…data is like crude oil with potential value but is of limited economic value in its raw form. Data is refined with analytics that has more potential value…” and “the kinetic value of data is not realized until the analytics are ‘put into motion…’”. I believe that Aginity Amp is the best element available to get from potential value to kinetic value. Here’s why:


We have had a unique view on analytic re-use.  Most tools, including notebooks, produce a sequence of events focused on a data flow from input to output. This presents a challenge for analytic re-use because it's difficult to re-use a variety of components in a new use case.  At best, if people can find existing assets, they are driven to copy and paste - meaning there are now new versions, decoupled and in use throughout the organization.  Aginity has been approaching the problem through a concept of object-oriented analytics - allowing atomic re-use of algorithms as building blocks that support encapsulation, inheritance, polymorphism, etc.  These atomic algorithms (features, to the data scientist) can then be simply assembled, with software generating the processing flow on demand to feed the analysis.  This can create a level of re-use and productivity far beyond a shared code repository for notebook code.


We've seen an issue with so many tools needing to share analytics and assets provided by others. In one recent discussion, a CDO said they have 148 different tools in the analytics and analysis domain.  Some people are using self-serve data prep tools, others self-serve visualization tools, data science tools like notebooks and such, as well as the normal BI tools (even Excel).  A company is lucky to get re-use and sharing within a single tool, but almost never gets efficient reuse/sharing across multiple tools.  Aginity has been delivering on a vision of filling a missing gap… being the analytic layer that spans the entire analytic ecosystem.

What are your thoughts? 

Comment by Bill Schmarzo on November 8, 2017 at 8:57am

Hey Joshua, I agree that being able to "operationalize" the analytics (e.g., cataloging, indexing, check in/check out, version control, regression testing, artifact capture and sharing) could be part of a Phase 5.  We use a concept called Analytic Profiles (think key-value store) that is a combination of Jupyter Notebooks and GitHub to capture and re-use the analytics.  And I talk about it more in the University of San Francisco research paper on determining the economic value of data.

USF EvD Research Paper

And I see more tools coming to market to address the need to capture, refine and re-use the organization's analytic assets.  I would certainly appreciate more details on what Aginity Amp is doing to help address the issue of "Analytics Operationalization."

Comment by Joshua Isunza on November 8, 2017 at 8:22am

Interesting article, but I think there is a missing piece or maybe it's phase 5?

I am finding that working in phase 4 is not a happy place for marketing man, executive man, finance man, IT man, or business man.

Also, in phase 4 lots of time is wasted on data prep, and it is not uncommon that two different BI tools will be at odds in regard to the same analytic. 

The reason: Math is not reusable or managed, and analytics become inconsistent and unreliable when transferring them from the analytics environment to the BI/Marketing automation environment.

The idea of 'analytic reuse' is new, but our solution is a single semantic layer (think universal adaptor) that allows analytics to be created, cataloged, managed, and distributed (through API calls) as assets. Thus, making the analytics 'reusable'.

In my opinion this would be phase 5. Aginity Amp is a single semantic layer that sits between your end-point tools (BI, Stats, Marketing Automation, Execution, etc.) and your data platforms (Hadoop, NZ, Oracle, AWS, Snowflake, etc). 

Based on Figure 4, Amp would exist between both of the red arrows in the diagram. Thus, tons of time would be saved on data prep and your analytic projects would yield more consistent results.

Interested in learning about Aginity Amp? Email me at [email protected]

Disagree? just reply here. 

- Joshua Isunza
