There is a phrase in baseball about pitchers “pitching through pain” that refers to pitchers taking the mound to pitch even though they have aches and pains – sore arms, stiff joints, blisters, strained muscles, etc. The idea is that these pitchers are so tough that they can pitch effectively even though they are not quite physically right.
However, when the human system is asked to do something that it’s not prepared to do in the most effective manner, other bad habits emerge in an attempt to counter these aches and pains. One problem is then compounded into multiple problems until the body breaks. Seasons end. And careers die. Sounds like the story of Data Lakes!
“In 2016, Gartner estimated that 60 percent of big data projects failed.” A year later, Gartner analyst Nick Heudecker said his company was “too conservative” with its 60 percent estimate and put the failure rate closer to 85 percent. Today, he says nothing has changed.
Many early data lake projects started with the CIO buying Hadoop, loading lots of data into the Hadoop environment, hiring some data scientists and waiting for magic to happen…and waiting for magic to happen…and waiting for magic to happen. Bueller, Bueller, Bueller.
And now these data lakes are “failing” – and creating data lake “second surgery” situations – for two reasons:
Economics – the production, consumption and transfer of value – is the most powerful force in the business world. Let’s see how one basic economic concept, plus one new one, can provide the frame for thinking about how we approach these data lake “second surgeries.”
Our first economics lesson is about the concept of sunk costs. A sunk cost is a cost that has already been incurred and cannot be recovered. Your dad probably referred to it as “throwing good money after bad” (my dad advised me to stop plowing more money into my 1968 Plymouth Fury III). To make intelligent business decisions, organizations should consider only the costs that will change as a result of the decision at hand and ignore sunk costs.
What this means in the world of technology is that once you have bought a particular technology and have trained folks on that technology, those acquisition, implementation and training costs should be ignored when making future decisions.
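The sunk-cost rule above boils down to a simple comparison: rank your options on forward-looking costs and benefits only. A minimal sketch (the option names and dollar figures are invented for illustration):

```python
# Sunk-cost rule: compare options only on costs and benefits that
# change going forward; money already spent never enters the comparison.

def best_option(options):
    """Pick the option with the highest net future value.

    Each option is a (name, future_cost, future_benefit) tuple; any
    sunk cost is deliberately absent from the comparison.
    """
    return max(options, key=lambda o: o[2] - o[1])

# Hypothetical figures: the $2M already sunk into the old platform is
# irrelevant to the choice below.
options = [
    ("keep patching old data lake", 1_500_000, 1_800_000),
    ("rebuild on a modern platform", 1_200_000, 2_500_000),
]
print(best_option(options)[0])  # -> rebuild on a modern platform
```

The point of the sketch: the rebuild wins on net future value ($1.3M vs. $300K) no matter how much was already spent on the original platform.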
In the world of Data Lakes (and Data Science), technologies will come and go. So, the sooner you can treat those technology investments as sunk costs, the more effective business decisions you will make. From the “Disposable Technology: A Concept Whose Time Has Come” blog about modern digital companies, we learned two important lessons:
These modern digital companies, through their aggressive open source architecture strategies, realize that they are not in the technology architecture business; they are in the data monetization business.
Solution: Stop factoring in the money and time you spent building your original (failing) data lake when making new data lake decisions going forward.
But not understanding sunk costs isn’t the worst economics mistake you can make. Let me introduce you to our second economic concept – Schmarzo’s Economics of Vampire Indecisions Theorem (I’m still campaigning for a Nobel Prize in economics). This principle refers to organizations’ inability to “let go” of out-of-date technologies, which in turn leads to “Vampire Indecisions” – IT’s inability to make the decision to kill irrelevant technologies (e.g., name your favorite data warehouse appliance). Consequently, these technologies continue to linger and slowly drain financial and human resources away from more important technology investments.
Heck, Computer Associates has created a business model around organizations that can’t muster the management fortitude to eradicate these irrelevant, outdated technologies.
Solution: Kill…eradicate irrelevant technologies and superfluous data in your data lake to free up human and financial resources to focus on those technologies and data sources that support the organization’s business strategy.
However, the biggest problem driving most data lake “failures” is the inability to leverage the data in the data lake to derive and drive data monetization efforts; that is, to uncover new sources of customer, product and operational value (see Figure 1).
Figure 1: CIO’s Top Challenges
If an organization does not know what business value it is trying to derive and drive out of its data lake (What is the targeted use case? What are the metrics against which progress and success will be measured? What decisions does that use case need to support?), then it doesn’t know which data sources are critical…and which ones are not. Consequently, the IT organization defaults to loading lots of unnecessary data into the data lake, resulting in a swamp of uncurated data that is unusable to the business user.
However, once the high-priority data sources are identified, IT organizations can embrace DataOps to turn that data swamp into a data monetization goldmine. DataOps is the key to driving the productivity and effectiveness of your data monetization efforts. It enables your data science team to explore variables and metrics that might be better predictors of performance without being burdened by the data aggregation, cleansing, integration, alignment, preparation, curation and publication processes. See the blog “What is DataOps and Why It’s Critical to the Data Monetization Valu...” for more details on the symbiotic role of DataOps and Data Science in driving Data Monetization (see Figure 2).
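To make the DataOps idea concrete, here is a minimal, hypothetical sketch of a curation step – aggregating raw extracts, dropping duplicates, enforcing completeness and applying an enrichment default – so the data science team receives analysis-ready rows (the field names and rules are invented for illustration):

```python
def dataops_prepare(raw_rows):
    """Hypothetical DataOps curation step.

    Each raw row is a dict with 'customer_id' and 'revenue' keys,
    pulled from multiple source systems.
    """
    seen, curated = set(), []
    for row in raw_rows:                      # aggregation across sources
        key = (row.get("customer_id"), row.get("revenue"))
        if key in seen:                       # cleansing: drop exact duplicates
            continue
        seen.add(key)
        if row.get("customer_id") is None:    # completeness: require an ID
            continue
        row = dict(row)
        if row.get("revenue") is None:        # enrichment: fill a default
            row["revenue"] = 0.0
        curated.append(row)
    return curated

# Two hypothetical raw extracts from different source systems.
crm = [{"customer_id": 1, "revenue": 100.0}, {"customer_id": 2, "revenue": None}]
web = [{"customer_id": 2, "revenue": 250.0}, {"customer_id": None, "revenue": 75.0}]
print(len(dataops_prepare(crm + web)))  # -> 3
```

The value of centralizing this in DataOps is that every data scientist starts from the same curated table instead of repeating the cleansing work per project.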
Figure 2: Data Monetization Value Chain
Yes, Hitachi Vantara has lived this Data Lake story – buying Hadoop, loading lots of data into the Hadoop environment, hiring some data scientists and waiting for magic to happen…. However, the difference between the Hitachi Vantara story and other failed data lakes is our visionary CIO, Renee Lahti. With a little help from a friend, Renee realized that her original data lake approach was doomed. Time for a “second surgery”!
Renee started her data lake “second surgery” (code named “Project Champagne” because Hitachi Vantara is going to drink its own champagne) by re-setting the data lake technology platform, identifying a business partner with whom to collaborate around the creation of business value (Jonathan Martin, Hitachi Vantara’s Chief Marketing Officer) and embracing our Data Science Digital Value Enablement (DVE) process.
The results will be unveiled at Hitachi Vantara’s customer event NEXT 2019 in Las Vegas October 8-10. But since I can’t wait to tip my hand (which is why I don’t gamble in Vegas), here are some of the “Drinking Our Own Champagne” initiatives we undertook:
But here’s my key observation: we were able to achieve 90% model predictive accuracy using only 3 data sources! Yes, just three! The importance of this learning is that organizations don’t need to start the process by loading tens, if not hundreds, of data sets into the data lake. If the organization has a deep understanding of the problem it is solving, initially limiting the data it works with lets it focus its data cleansing, completeness and enrichment efforts on those 3 most important data sets.
Now, can we further engineer those 3 data sources to improve model accuracy? Definitely, and that’s where IT will focus much of its data improvement efforts.
However, here’s the really interesting question: could we improve model accuracy by incorporating more data sources? Possibly, but at some point the cost of investigating new data sources has to be weighed against the marginal value of the improvements in the analytic model. That is, it is an economic decision whether to continue investing resources in improving this model and its supporting data, or to reassign that investment to the next use case (we have at least 10 more for Marketing).
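That marginal-value trade-off can be sketched as a simple check: onboard a new data source only if its expected payoff beats spending the same resources on the next use case. A hypothetical illustration (all figures are invented):

```python
def worth_adding(source_cost, expected_accuracy_gain, value_per_point,
                 next_use_case_value):
    """Hypothetical marginal-value check for adding a data source.

    Add the source only if its net expected value exceeds the expected
    return of the next use case competing for the same resources.
    """
    marginal_value = expected_accuracy_gain * value_per_point - source_cost
    return marginal_value > next_use_case_value

# Invented figures: a 4th source might add 2 accuracy points worth
# $50K each but costs $60K to onboard, while the next Marketing use
# case is expected to return $80K.
print(worth_adding(60_000, 2, 50_000, 80_000))  # -> False
```

Under those assumed numbers, the resources are better spent on the next use case – exactly the kind of economic decision the paragraph above describes.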
The Data Lake can become a “collaborative value creation platform” that drives organizational alignment around identifying and prioritizing those use cases that can derive and drive new sources of customer, product and operational value. However, don’t “pitch through pain” with a technology platform and a monetization approach that are outdated. Embrace the economic concepts associated with Sunk Costs and Vampire Indecisions to “let go” and move forward, especially as Data Lake “second surgeries” become the norm.
And while you are at it, drink some champagne to celebrate the transition!
By the way, I will be running a 90-minute “Thinking Like A Data Scientist” workshop with my University of San Francisco co-teacher – Professor Mouwafac Sidaoui – on Monday, October 7th in Las Vegas prior to NEXT 2019. It’s free, and participants will get a signed copy of my new book “The Art of Thinking Like A Data Scientist.” Click here for more details and to register (link to the “DataOps Fundamentals” workshop about halfway down the page).
Key blog points: