
Seduced by the Big Data meme: Hadoop vs the Public Cloud


Currently, Cloudera is in the news for all the wrong reasons (Cloudera stock down 42%).

Since Cloudera now also incorporates Hortonworks, the current issues are just the latest in a run of Big Data woes. Apparently, the third vendor commercialising open source Hadoop technology (MapR) also says that…

The demise of Big Data and Hadoop has been a while in the making. However, there is a lot to learn from it, and it points us to a future which is a lot more stable and led by the three Public Cloud providers (AWS, Azure and GCP).

There are three reasons for the current state of Big Data:

  • A technical reason
  • A commercial reason
  • And, most importantly, a psychological reason

Technical reason: Application beyond its intended use

The technical reason is easy to explain. In a nutshell, Big Data technology is useful for a specific type of problem – but was applied to many other types of Enterprise problems. Last year, after the Cloudera–Hortonworks merger, the new Cloudera was billed as the next ‘Oracle’. That’s a curious choice of words, since technically that was indeed the biggest gap to adoption of Big Data in the Enterprise. Vendors like Oracle were at pains to point out that Hadoop is not a database – but it was often marketed as a solution to problems which a relational database already solves for the Enterprise.

In essence, Hadoop is a collection of open-source software products that provides a distributed storage framework (through HDFS) to manage very large data sets. Its primary purpose is the storage, management, and delivery of data for mostly analytical applications, using the MapReduce paradigm to parallelize the processing. HDFS is a file system and not a database, and is therefore not suited for transactional applications that need ACID compliance. Initiatives like Cloudera Impala, Apache Hive, and Spark SQL attempted to add SQL-like capabilities to Hadoop – but these were still oriented to end-user analytic applications and not transactional ones. So the basic problem remains: Hadoop and its derivatives were not suited to the OLTP systems that the Enterprise depends on.
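To make the MapReduce idea concrete, here is a minimal sketch of the paradigm in plain Python. It is not Hadoop code and does not use any Hadoop API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are made up for illustration, and a small thread pool stands in for the nodes of a cluster. It shows the shape of the computation: independent map tasks over chunks of data, a grouping step, and independent reduce tasks per key.

```python
from collections import defaultdict
from itertools import chain
from multiprocessing.dummy import Pool  # thread pool standing in for cluster nodes


def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one chunk of input."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(item):
    """Reduce: collapse the values for one key into a single count."""
    key, values = item
    return key, sum(values)


def word_count(chunks, workers=2):
    """Run map, shuffle and reduce over a list of text chunks."""
    with Pool(workers) as pool:
        mapped = pool.map(map_phase, chunks)            # map tasks run independently
        grouped = shuffle(chain.from_iterable(mapped))  # bring equal keys together
        return dict(pool.map(reduce_phase, grouped.items()))  # one reduce per key


if __name__ == "__main__":
    print(word_count(["big data big", "data cloud"]))
```

Notice what is missing: there is no notion of updating a single record, rolling back a change, or isolating one writer from another. The pattern is built for scanning and aggregating large volumes of data, which is exactly why it fits analytics and not transactional workloads.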

Commercial reason: Ignoring the Public Cloud

The commercial reasons are also easy to spot. The proverbial elephant in the room for Hadoop vendors is the ‘Cloud’ – specifically the three PaaS vendors: AWS, Azure and GCP. The Cloud gives many options and is cheaper than on-premise. The Cloud strategy goes well beyond simply offering Hadoop as a platform. While the Hadoop vendors have been incorporating many new initiatives like containers, Kubernetes etc. – the Cloud moves fast and, more importantly, all three (Amazon, Microsoft and Google) have a great developer ecosystem which leads to much faster innovation.

Psychological reason: seduced by the meme

But there is a third very significant reason. It is a psychological reason.

It’s because Big Data is a meme – created by west coast conference organisers

Before Big Data – there was ‘web 2.0’

Unlike web 2.0 (which was a concept) – Big Data was based on open source software

Hence, a whole bandwagon was created, with companies getting on the Big Data meme – but forgetting the critical fact that the Cloud changes the nature of Open Source. In the Cloud, whether the software is Open Source or not does not matter.

That’s why the Hadoop vendors focussed on Hadoop as if it were an end in itself – while a whole alternate ecosystem was being developed around them.


Having said this, the future is bright

Today, Tableau was acquired by Salesforce. Salesforce has also acquired MuleSoft. Pivotal, which helps companies to deploy to multiple clouds, also h…. All this shows a move away from on-premise and towards simplicity with a few large Cloud players. This is a good development for the ecosystem. It is similar to the Mobile Apps industry a decade ago, when the smaller players lost out and there were effectively two ecosystems (Android and the iPhone). At that point, it was possible for apps to become mainstream. Today, the big ecosystems round the corner are Artificial Intelligence and IoT/Edge Computing – which we cover in …

I see growth in the AI and IoT ecosystems, driven by Azure, AWS and GCP.

Canalys indicates that Cloud infrastructure spending grew 46% in Q4 2018 and exceeded US$80 billion for 2018 as a whole, with the market shares of AWS, Azure and GCP at 32.3%, 16.5% and 9.5% respectively – and with Azure showing 75.9% and GCP showing 81.7% annual growth.

Hence, I believe that the time has come for the AI and IoT ecosystems to thrive – but driven by the three PaaS vendors.

Image source https://www.ibmbigdatahub.com/infographic/four-vs-big-data

Note that the views expressed in this article are personal only and not affiliated to any organization which the author is associated with.