Hive, Pig, Spark, YARN, ZooKeeper, Falcon, Flume, Nutch, Sqoop, Storm. Unless you work with Big Data all the time, you can be forgiven for losing track of the latest developments from the open source Apache Software Foundation, which is the intellectual mother and father to all things Hadoop. Well, here’s one that you should pay attention to because it may very well be the basis for a revolution in the use of Big Data, and that’s DRILL.
Big Data on Hadoop is open source and essentially free, or at least available for what amounts to pocket change in most large company IT departments. And if you’re following the technology literature, it’s not possible to go a single day without seeing an article or whitepaper extolling the value of Hadoop. No, Big Data and Hadoop are not synonymous; explaining the difference would require a discussion of NoSQL databases. But the fact is that Hadoop is so far out in front right now that they might as well be one and the same.
Hadoop is also the Swiss army knife of data platforms: it handles unstructured or semi-structured data, blends structured and unstructured data, and ingests very fast data flows like IoT inputs. So why haven’t more companies implemented it yet? To my way of thinking it’s a resource problem, and by resource I mean human resources: not money or time or machines, but the well-trained and experienced information technology professionals of all stripes who are necessary to implement and exploit IT technologies.
We’ve spent 20 or 30 years developing IT staffs with the necessary skills, and the one thing that virtually 100% of them rely upon to get at the data is SQL. It’s pretty much the universal skill. But when Hadoop became available, it wasn’t possible to access the data using SQL, which is why it gets lumped in with NoSQL. You may know that Hadoop uses MapReduce to retrieve data, and that is slow and complicated. It was so slow and complicated that much of the development energy over the last five years has been devoted to making it easier. Most recently we have Spark, which targets piping Hadoop data into other media and has an element of SQL embedded in it.
Now comes the revolution, DRILL. DRILL is a standalone tool that any SQL-literate user can use to directly query and retrieve data not only from Hadoop (files stored in HDFS) but also from HBase (the column-oriented database built on Hadoop, much loved by data warehouse types).
Drum roll please. This means that, for the first time since Hadoop was introduced, our vast army of IT professionals who know SQL, along with the pretty substantial cadre of pure business users who know SQL, can use their existing skills to query and extract Big Data from Hadoop. No Hadoop specialists required.
As the Apache web site says, DRILL is purpose-built for semi-structured, nested data. Basically that means any data that has been encoded with CSV, TSV, JSON, Parquet, Avro or similar simple late-schema storage formats, including HBase (which previously was a particular challenge), can be queried with ANSI-standard SQL. Nothing new to learn.
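To make that concrete, here is a minimal sketch of what querying a nested JSON file with plain SQL might look like. It is an illustration under stated assumptions, not a recipe from the DRILL project: the file path /data/orders.json, the field names (customer.name, order_total) and the localhost address are all hypothetical, and it assumes a DRILL installation that exposes its REST query endpoint on the default web port with the dfs storage plugin enabled. The SQL itself is the point; the Python around it only packages the query for submission.

```python
import json
import urllib.request

# Hypothetical file path and field names, chosen only for illustration.
# The backticks let SQL address a raw file; the alias `t` is what allows
# dotted access into nested JSON fields such as customer.name.
EXAMPLE_SQL = (
    "SELECT t.customer.name AS name, t.order_total "
    "FROM dfs.`/data/orders.json` t "
    "WHERE t.order_total > 100 "
    "ORDER BY t.order_total DESC"
)


def build_drill_request(sql, base_url="http://localhost:8047"):
    """Build (but do not send) an HTTP POST for the REST query endpoint.

    base_url is an assumption: a locally running instance on its default
    web port. Adjust it for a real cluster.
    """
    body = json.dumps({"queryType": "SQL", "query": sql}).encode("utf-8")
    return urllib.request.Request(
        url=base_url + "/query.json",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_query(sql, base_url="http://localhost:8047"):
    """Send the query and return the parsed JSON response.

    This only works against a running server, so it is not called here.
    """
    with urllib.request.urlopen(build_drill_request(sql, base_url)) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Notice that nothing in the query describes a schema up front: the file is addressed directly, and the nested customer record is reached with ordinary dotted column names. Anyone comfortable with SELECT, WHERE and ORDER BY could write it.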
What’s not included? Well, pure unstructured text like Twitter streams doesn’t lend itself to this. But so much of our ‘new data’ is or can be stored in JSON or similar formats that this opens up a huge opportunity to bring our existing IT professionals back on-line for Big Data projects.
DRILL is in beta as of this writing and is due for release in Q2 2015. Its release will sweep away one of the last remaining impediments keeping companies of all sizes from exploring and exploiting Big Data.
February 23, 2015
Bill Vorhies, President & Chief Data Scientist – Data-Magnum - © 2015, all rights reserved.
About the author: Bill Vorhies is President & Chief Data Scientist at Data-Magnum and has practiced as a data scientist and commercial predictive modeler since 2001. He can be reached at:
The original article can be viewed at: