Guest blog post by Francesca Krihely.
Here’s a prediction and a challenge, rolled into one. The prediction first: whatever the level of your present understanding of Hadoop, you’re going to hear a lot more about it in future.
And the challenge? Well, it’s this: whatever the level of your present understanding of Hadoop, you’re also likely to be missing critical pieces of the jigsaw. Which pieces? Read on.
Hadoop, let’s first of all remind ourselves, is an open source data platform that performs a very neat trick. Simply put, Hadoop is a tool for tying together multiple servers into single, easily scalable clusters, ideal for distributed data storage and processing.
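To make "distributed storage and processing" a little more concrete, here is a minimal sketch of the general idea: split the data across several machines, let each machine work on its own share, then merge the partial answers. It is plain Python running on one computer, not Hadoop's own API, and the function names and sample records are invented purely for illustration.

```python
from collections import Counter
from functools import reduce


def split_across_nodes(records, node_count):
    """Deal records out round-robin, one list per (simulated) server."""
    return [records[i::node_count] for i in range(node_count)]


def count_on_node(records):
    """Work done locally on one server: count the words it holds."""
    counts = Counter()
    for line in records:
        counts.update(line.lower().split())
    return counts


def combine(partials):
    """Merge the per-server results into a single answer."""
    return reduce(lambda a, b: a + b, partials, Counter())


# Hypothetical sample data standing in for a much larger data set.
log_lines = [
    "order shipped",
    "order delayed",
    "order shipped",
    "refund issued",
]

partitions = split_across_nodes(log_lines, node_count=2)
totals = combine(count_on_node(p) for p in partitions)
print(totals["order"], totals["shipped"])
```

Because each share is processed independently, changing `node_count` changes how the work is divided but not the final totals, which is roughly why adding servers to a cluster is a way to scale.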
So it’s not too difficult to see just why Hadoop has been so phenomenally successful.
For one thing, by allowing organizations to piece together clusters from inexpensive commodity x86 servers, Hadoop sharply cuts the cost of cluster construction. And being open source, Hadoop not only works well with other open source technologies, but also offers an attractive—and surprisingly affordable—cost of initial acquisition and ongoing ownership.
All of which does a lot to transform the prospects of Big Data within even the most cash-constrained organizations. And so, by happy coincidence, even as Big Data has become all the rage, the price of entry to the party has fallen within pretty much everyone’s reach.
In short, thanks to Hadoop and other allied open source technologies, organizations can readily store, extract and analyze data in volumes that would until recently have been unthinkable. And, what’s more, do it at costs that would until recently have been considered unbelievable.
Now, why does this matter? Why is the ability to inexpensively store and analyze large data sets so valuable?
For one thing, it’s relatively new. Until quite recently, the large data sets within most organizations might have been large, but they certainly weren’t all-encompassing. Lots of data was simply thrown away, as the cost of storing it exceeded the likely value of keeping it or analyzing it.
The result? Among all the data that can now be kept, there are likely to be all sorts of interesting, and profitable, linkages and relationships, just waiting to be discovered. Structured or unstructured, within an existing database schema or not: to Hadoop it makes very little difference.
Then there’s the silo effect. Think large data sets in the corporate world, and you’ll typically think of ERP systems, where transaction volumes are high. Maybe so, but—to pick just one example—the data captured second-by-second on the factory floor by machine tools and quality systems is every bit as extensive. Stored separately, managed separately, and analyzed separately, such data exists in a silo of its own. No longer, perhaps: with Hadoop, all of it can be captured, and stored as fast as it is generated.
Better still, to repeat the point, Hadoop lowers the cost of entry to Big Data—and impressively so. So Hadoop—and the open source tools typically deployed alongside it—can be thought of as having something of a leveling effect, bringing Big Data to all organizations, and not just those with the biggest budgets.
The economic benefits of this? It’s difficult to say. But the last time we saw a truly transformational step change in technology (that time in terms of connectivity and inexpensive processing power), business startups certainly benefited disproportionately.
For proof, look no further than three such startups: Amazon.com, eBay and Google. The last of these, as it happens, provided some of the driving force in bringing the technology behind Hadoop to fruition.
Even so, the road ahead isn’t without a few bumps, chief among them Hadoop’s lack of query and analysis tools. Truth be told, Hadoop is arguably more of a data warehouse than a database: a great way of inexpensively storing data, but not such a great tool for making sense of that data.
In the short term, this isn’t a problem: organizations and their IT functions are simply grateful to have Hadoop at all, opening up the possibility of running queries and analytics against data sets of such size.
But in the longer term, it’s clear that these organizations and their IT functions will have to engage with more than just Hadoop if they are to deliver on the promise of Big Data.
For profitable Big Data insights, in short, Hadoop is a necessary condition, but not a sufficient one.