Does Big Data mean Hadoop? Not really, however when one thinks of the term Big Data, the first thing that comes to mind is Hadoop along with heaps of unstructured data. An exceptional lure for data scientists having the opportunity to work with large amounts data to train their models and businesses getting knowledge previously never imagined. But has it lived up to the hype? In this article, we will look at a brief history of Hadoop and see how it stands today.
Some key takeaways from the Hype cycle of 2015:
Just to check on the relevance of the Hype Cycle sitting in India, I check the Google trend for the terms ‘Big Data’ and ‘Hadoop’, and here are the results:
So there is definitely a fall after a point of inflection in 2014-2015.
Gartner conducted this survey with 284 companies, with only 125 stating that they had already invested in Hadoop, or were looking to invest within the next 2 years. The release can be found here:http://www.gartner.com/newsroom/id/3051717.
57% respondents in the survey cited Skills Gap as the major reason for not adopting, and 49% were still trying to figure out the value deriving process. While Gartner estimates that it would take 2 years for finding the right number of people with the needed skills, Hadoop distribution providers are working on creating more user friendly and integrated modules and interfaces. However, these still do not seem to be friendly enough for the average user.
RoI / Priority
As organizations will have to get busy figuring out ways to incorporating new processes, hiring skilled individuals, Hadoop or Big Data deployment is taking a backseat. Traditional database providers have evolved their products be it in-memory or Massive Parallel Processing systems, which stay good enough to get the job done for most times, or even better than Hadoop for some applications. For instance, this experiment was conducted at Airbnb comparing a Amazon Redshift 16 node cluster with a 44 node Hive/Hadoop EMR cluster, and the SQL based Redshift outperformed the EMR cluster. The study was done in 2013, and Hadoop has evolved from then with Hive on Yarn, Apache Impala and so on, however this does not change the fact that Hadoop wasn’t built as a database for performance optimized structured data querying, unless the data is in Petabytes of course.
When the need arises to create something sort of a data mart, the lure towards ‘complicating’ things with Hadoop is something to be debated. Keep in mind, most enterprises have teams who take ownership of social media and other such unstructured information on the digital medium. And there are numerous awesome solutions to do digital data tracking and brand monitoring and what not in real time. Point is Hadoop would be low priority in all such scenarios.
Another important question which would need a large qualitative analysis – Can Hadoop coexist peacefully with an existing data warehouse? If so, how? Offloading processes (even for parts) would have million dollar price tags for the enterprise.
After that bit, the dilemma exists of hosting a cluster but not building any legacy systems versus going cloud. Majority of Cloudera’s customers host their own clusters with a minority being on the cloud. The major options here are a. hosting on premise / leasing data center b. IaaS (Infrastructure as a Service) such as Amazon Web Service, Google Cloud Platform etc , and c. Now emerging Hadoop as a Service.
Expensive to host on premise, cloud based providers such as Amazon EMR require skills of not only Hadoop but also understanding the Amazon version of the ecosystem. Hard to find. Hadoop as a Service providers are still fine tuning their products and will take some time maturing for the enterprise.
Emergence of the Cloud
In the points list above, Skills Gap and RoI/Priority are something which will evolve over time. They are related to each other, and as organizations see value, they will either hire or train people for these skills which are very much attainable. But what will be interesting to see in the next few years is the emergence of cloud based solutions for Hadoop to fine tune integration for the enterprise.
SAP is acquiring Altiscale as reported on SAP reportedly buying Altiscale to power big data services. Altiscale being one the early providers of Hadoop on the Cloud.
Another interesting recent development is Cloudera asked Intel for $1 billion to build a cloud service. The Hadoop distribution market leader – Cloudera is pushing towards going cloud and gain some market share of the Big Data workloads from current leaders in the Hadoop on the Cloud space – Amazon AWS, IBM BigInsights, Google Cloud Platform, and Microsoft Azure HDInsight.
In summary, as organizations realize the value of Big Data, factors such as skills shortage and integration lead to the slow adoption rates. Moreover, traditional database providers have evolved their services with Massive Parallel Processing systems, In-memory and columnar database solutions which has delayed the realization of the value related to Hadoop. The emergence of cloud based Hadoop service providers provides an alternative way for organizations to incorporate Hadoop clusters for Big Data workloads in the future.
Originally posted here.