
How Effective Is Hadoop Distribution for Big Data Analytics?

Hadoop, named after a toy elephant that belonged to the child of one of its inventors, is an open-source software framework. It can store very large amounts of data and run very large numbers of applications and jobs concurrently. These capabilities make it one of the most sought-after data platforms for businesses all over the world.

Hadoop Benefits

Because it can store and quickly process any type of data, Hadoop stands out in the open-source world. Data grows and changes every day because of social media, new mobile devices, and other technological advances. Here are a few more of its benefits:

  • Malleability - Unlike databases that must process data before storing it, Hadoop lets you store as much as you need and process it later. That applies to images, videos, and text as well.
  • Failure tolerance - Your data is protected against faulty hardware. If a node (a machine in the cluster) fails, its tasks are sent to other nodes. Several copies of the data are stored to ensure successful processing.
  • Minimal cost - The open-source framework is free and the cost of the commodity hardware it runs on is low.
  • Growth accommodation - It is relatively simple to grow your system by adding nodes to it.
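
The failure tolerance described above can be illustrated with a small simulation. This is a minimal sketch in plain Python, not Hadoop code: the node names, block names, and helper functions are hypothetical, and the replication factor of 3 is chosen because it is the HDFS default.

```python
import random

REPLICATION_FACTOR = 3  # assumed here; 3 is the HDFS default


def place_blocks(block_ids, nodes, replication=REPLICATION_FACTOR, seed=0):
    """Store each block on `replication` distinct nodes (random placement)."""
    rng = random.Random(seed)
    return {block: set(rng.sample(nodes, replication)) for block in block_ids}


def readable_blocks(placement, failed_nodes):
    """Return the blocks that still have a replica on a healthy node."""
    failed = set(failed_nodes)
    return {block for block, holders in placement.items() if holders - failed}


def reassign_tasks(placement, failed_nodes):
    """Choose a healthy replica holder to process each block."""
    failed = set(failed_nodes)
    schedule = {}
    for block, holders in placement.items():
        healthy = sorted(holders - failed)
        if healthy:  # blocks with no surviving replica cannot be scheduled
            schedule[block] = healthy[0]
    return schedule


# Hypothetical five-node cluster holding four data blocks.
nodes = ["node-1", "node-2", "node-3", "node-4", "node-5"]
blocks = ["block-a", "block-b", "block-c", "block-d"]
placement = place_blocks(blocks, nodes)

# Simulate the loss of one node and reschedule the work elsewhere.
schedule = reassign_tasks(placement, failed_nodes=["node-2"])
```

With three copies of every block, losing a single node never makes a block unreadable, so every task can be rescheduled on a surviving node. Adding a node to `nodes` is all it takes to grow this toy cluster, which mirrors the growth accommodation point.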


The Role of Hadoop in Big Data Analytics

Because Hadoop can handle enormous amounts of data of any kind, it is well suited to running analytical algorithms. It can help your business run more smoothly, discover new developments, and identify advantages over your competitors. Web-based recommendation is a well-known application: it is how Facebook suggests friends that you may know, how LinkedIn shows you jobs that may be of interest to you, and how eBay can predict which items you might want to bid on.
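
To make the recommendation idea concrete, here is a minimal sketch of a "people you may know" calculation written as a map step and a reduce step, the processing style Hadoop's MapReduce uses. It is an illustration only and not how Facebook or any other company actually computes suggestions; the sample friendship graph and all function names are hypothetical, and everything runs in plain Python on one machine.

```python
from collections import Counter
from itertools import combinations


def map_phase(friend_lists):
    """Emit a pair for every two people who appear in the same friend list."""
    for person, friends in friend_lists.items():
        for a, b in combinations(sorted(friends), 2):
            yield (a, b)  # a and b share `person` as a mutual friend


def reduce_phase(pairs, friend_lists):
    """Count mutual friends per pair, dropping pairs who are already friends."""
    counts = Counter(pairs)
    return {
        pair: n
        for pair, n in counts.items()
        if pair[1] not in friend_lists.get(pair[0], set())
    }


def suggest(person, mutual_counts):
    """Rank suggestions for `person` by mutual-friend count, then by name."""
    candidates = []
    for (a, b), n in mutual_counts.items():
        if person == a:
            candidates.append((b, n))
        elif person == b:
            candidates.append((a, n))
    return [name for name, _ in sorted(candidates, key=lambda c: (-c[1], c[0]))]


# Hypothetical friendship graph (each friendship is listed in both directions).
friends = {
    "ana": {"bo", "cy"},
    "bo": {"ana", "dee"},
    "cy": {"ana", "dee"},
    "dee": {"bo", "cy", "eli"},
    "eli": {"dee"},
}
mutual = reduce_phase(map_phase(friends), friends)
```

For this sample graph, `suggest("ana", mutual)` returns `['dee']`, because ana and dee share two friends (bo and cy) and are not yet connected. On a real cluster the map step would run in parallel over partitions of the friend lists and the counts would be combined per pair in the reduce step.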

Although Hadoop is a free platform, the need for commercial distributions is growing. A commercial distribution addresses issues with the open source version in areas such as the following:

  • Technical support - Assistance can be given to clients that need Hadoop to accomplish high level tasks that are outside of their expertise.
  • Stability - Hadoop vendors will alert clients immediately upon discovering a bug or virus in their system. Issues receive prompt attention to ensure stable solutions.
  • Complete package - Vendors will pair their distributions with add-on tools to help you personalize the Hadoop application for your specific needs.


Vendors Enjoying Growth by Commercially Distributing Hadoop

These are the leading Hadoop vendors who will contribute to its effectiveness in big data analytics over the next few years:

  • Amazon - Amazon has a long history with Hadoop, which has evolved into Amazon Web Services Elastic MapReduce (AWS EMR). It supports big data analytics workloads including scientific simulation, web indexing, and financial analysis. Instead of managing servers by the thousands, businesses can use this “cloud” platform that is ready to work.
  • Hortonworks - Hortonworks is an organisation that brings open source distributions to the IT market. Its main goal is to speed up the adoption of Hadoop among its partners. It is said to gain more than 59 new customers per quarter, and its customers include eBay, Bloomberg, Samsung, and Spotify. It has partnered with Microsoft, SAP, and more. Hortonworks generated more than $33 million in 2013, a 109% increase from the year before.
  • Cloudera - This organisation was founded by former Yahoo, Google, and Facebook engineers. It offers ready-made Hadoop solutions with extra technical support and training. Its customers include Allstate and the U.S. Army. It is partnered with IBM, NetApp, Oracle, and more, and holds approximately 53% of Hadoop’s market.
  • Microsoft - Microsoft has not typically collaborated on open source software, but it has lately decided to embrace open source. Hadoop is at its finest when used with Microsoft’s public cloud product, Azure. Office 365, one of Microsoft's software-plus-services offerings, has extended into various productivity sectors that involve big data, such as communication and education. In the communication sector, Office 365 integrates with cloud phone systems to save and quickly retrieve large data sets such as customer details, customer follow-up records, and previous conversations. Similarly, in the education sector, Office 365 integrates with Moodle to create a more productive environment for teachers and students by harmonizing login credentials, calendar management, and course content creation, in addition to other workflow improvements for education institutions.
  • IBM - IBM pairs Hadoop with high-level capabilities. Customers are able to create and move data in less than half an hour, at a data processing price of $0.60 per cluster per hour. They can also reach the market much faster thanks to the advanced big data analytics that Hadoop provides.

Hadoop is definitely a powerhouse in the open source world. It has capabilities that far exceed what the other data platforms can do. Its performance is increasing revenue, creating new jobs, and setting records.

What do you think about Hadoop and its distribution? Post a comment and let us know.


Comment by Benjamin Bertincourt on November 5, 2015 at 9:17am

Spark and Hadoop's MapReduce have complementary aspects. Spark is faster than Hadoop as long as the data can stay in main memory, which is a limitation. Since main memory is much smaller than HDDs, you will need more nodes for the same dataset size, which translates into a more expensive cluster. Main memory is also more expensive than HDDs, which raises the cost further.

A case where Spark is at a clear advantage over Hadoop's MapReduce is for a streaming data processing pipeline where you expect the newly acquired data to be fairly small in comparison to the whole and data does not need to stay in the "pipe". Another aspect is training models with Machine Learning algos which very often are highly iterative or at the very least need several passes over the data. Since the data can be cached, read overheads are likely affecting execution only once.

Comment by Kening Ren on November 5, 2015 at 8:03am

I heard some of those vendors are moving from MapReduce to Apache Spark. People find that the MapReduce logic inside Hadoop cannot handle increasingly complicated data. At least that is what Cloudera is doing. I would wait and see what AWS EMR and Hortonworks will do.

Comment by Shezagary on November 4, 2015 at 3:54am

Hi Benjamin, I truly appreciate your comments. MapR is indeed a great solution and I will definitely consider including that in my future article. Thanks for your insight. Best, Sheza.

Comment by Benjamin Bertincourt on October 30, 2015 at 4:46am

If citing the lead vendors of Hadoop distributions, MapR should definitely be mentioned. I don't have recent figures, but a couple of years ago it was still #2 in market share, significantly above Hortonworks. It is also the only one that AWS offers server side as an alternative to their own (though one can install just about any distribution on an EC2 instance).

For those looking for it, IBM's distribution is called BigInsights and Microsoft's is Azure HDInsight.

© 2018 Data Science Central