Why Data Is Not The Next Oil

Marketing, at least in the IT sector, has been replaced by memes. Every so often a Gartner slide deck goes viral, and the next thing anyone knows, pithy and mostly meaningless phrases and sayings are driving Fortune 500 strategies. Execs commit to multi-billion-dollar initiatives to make sure that their companies are perceived as being hip or cool (or, to use the more typical phrases, competitive and lean), big projects get greenlit, and at the end of the day, after a forced death march, the system goes live with a great big “meh”. Those same execs may see a boost of a couple of percentage points from the system for a quarter or two, but then the hemorrhaging begins anew.

The meme-factories recently spit out the meme “Data is the next Oil”. Translating from the Memespeak, what I believe this expression is intended to imply is that the data within your organization is valuable and that if you do not transform your organization to more effectively utilize that data, you will get left behind.

The problem with this is that it is true only for a small percentage of companies or for a limited period of time. If you’re a retailer, for instance, trying to transform your company into something that’s competitive with Amazon is a fool’s errand. It was a fool’s errand a decade ago. The tsunami forces that are now ripping apart brick-and-mortar retailers were inevitable once the Internet came along. Jeff Bezos happened to be in the right place at the right time to harness those forces, but if Amazon had stumbled, some other company would have been the next Amazon. Put another way, there was a data retail hole that opened up about twenty years ago, and Amazon happened to fill it.

The disruption that Amazon represents did not rely solely, or even primarily, upon data. Instead, it was a gamble that paid off: in a digital environment, the advantage that a brand confers is mostly negated by the efficiency of an exchange. This was the same bet that Uber made (and mostly capitalized on). These exchanges did require the digitization of information about products as part of the process, but it’s important not to confuse this with a blanket need for an organization to thoroughly transform its processes so that everything is digital.

Oil is a form of energy storage, though it also plays a role in reducing friction within systems. In the former capacity, oil, once burned, is generally unrecoverable. As gasoline or jet fuel, oil is really only useful when vaporized and combusted, with its byproducts generally being hydrocarbon sludge and carbon dioxide gas. The energy generated by oil usually gets transformed into kinetic energy (work) and heat. Refined, oil’s primary value then is as a potential energy store. 

Data, on the other hand, is information about the state of something at a given point in time. That data can be stored and analyzed for trends and behavior, but it doesn’t actually power anything. Data can also be sampled at arbitrarily fine sampling rates, but greater sampling does not necessarily make the data any more useful. If you sample the temperature distribution of a chemical reaction, sampling a million times a second may give a detailed picture of how that reaction works. On the other hand, sampling the price of houses a million times a second will likely yield no more information than if you sampled a set of houses once a month. However, the energy expended to process this sampling goes up linearly with the number of samples.
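To make the sampling point concrete, here is a small sketch in Python. The price function, the numbers in it, and the 30-day "month" are all made up for illustration; the only assumption that matters is that the underlying value changes about once a month.

```python
def house_price(day: int) -> int:
    """Hypothetical house price that only changes once every 30 days."""
    month = day // 30
    return 300_000 + 1_500 * month


def sample(signal, total_days: int, step: int) -> list[int]:
    """Read the signal every `step` days over `total_days` days."""
    return [signal(day) for day in range(0, total_days, step)]


def summarize(samples: list[int]) -> dict[str, int]:
    """Count how many readings were taken and how many differ from each other."""
    return {"samples": len(samples), "distinct_values": len(set(samples))}


daily = sample(house_price, total_days=360, step=1)
monthly = sample(house_price, total_days=360, step=30)

print("daily:  ", summarize(daily))
print("monthly:", summarize(monthly))

# Processing cost grows with the number of samples, not with what they tell you.
print("cost ratio (daily / monthly):", len(daily) // len(monthly))
```

In this toy case the daily feed holds thirty times as many readings as the monthly one, yet both contain exactly the same set of prices. The extra readings cost storage and compute without adding anything you did not already know.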

A significant percentage of data, even in traditional applications, is also categorical in nature. In the last several years, there has been a growing realization that these categories and identifiers, often known as controlled vocabularies or taxonomies, are an indispensable part of the business language of a given organization. What’s more, there is a growing realization that the inconsistencies in this categorical information due to poor management, ambiguous definitions, conceptual redundancies, and poor modeling mean that much of the data that currently exists in various relational data systems is simply in no shape to be used for meaningful analytics. All too often you are comparing apples to dump trucks, not just oranges.

Admittedly, there are areas where the analogy does seem to be increasingly true, but these are not necessarily positive ones. Recently oil prices have collapsed as the coronavirus has effectively frozen the economy, creating a situation where the profit to be made from that oil is far outweighed by the cost of pumping it. Moreover, on one historic day, the prices for a particular contract on oil (West Texas Intermediate, considered the benchmark for North American oil prices) actually went negative, because there was no place to store the oil being produced. Oil tankers were taking the long way around Africa (or sailing through the Indian Ocean) because tankers making their way into port would have to settle their oil purchases at rock-bottom prices (or might not even be allowed into port at all).

Similarly, data “lakes” and enterprise data warehouses are being filled with complex, poorly described and often useless data because of the notion that all data is valuable. In many cases, a better strategy would have been to better manage what is being stored, to reduce redundant data through intelligent knowledge bases, and to federate data in ways that place it in closest proximity (physically or in terms of network connectivity) to the people who would likely curate and utilize it.

This process is likely to be expensive because it necessitates reprocessing what exists towards a more manageable solution (hint: knowledge graphs) and then only keeping the traces of transactional data. The benefits, however, far outweigh the costs in the longer term: lower IT computing costs (in the cloud or elsewhere), better, cleaner data for analytics and decision making, adequate consistency to make autonomous data flow possible, and faster processing.

Data is not oil. Data is data – it is essential to the operation of your business and can ultimately help make the decisions to better situate your business, but it requires a very different paradigm of thinking about how best to take advantage of that data.