In the Name of Hadoop! What do you think about Flume, Oozie, Mahout, Pig, and Hive?

A few questions regarding these programming languages / environments.

  • In what situations should each be used?
  • What alternative languages are available?
  • What are the differences between Hive and Pig?
  • Can you provide success stories for each of them?
  • Strengths and weaknesses?

I'm planning a webinar event on this topic and would like to include your feedback.


I look forward to hearing from you.

Sincerely, 

Vincent Granville


Replies to This Discussion

I'm familiar with Hive, have heard a bit about Pig, but what are Flume, Oozie and Mahout?

Interesting topic, Vincent! I do have some experience with Mahout, and it's best described by the project's own site, so I'll quote them directly:

"Scalable to reasonably large data sets. Our core algorithms for clustering, classification and batch-based collaborative filtering are implemented on top of Apache Hadoop using the map/reduce paradigm. However, we do not restrict contributions to Hadoop-based implementations: contributions that run on a single node or on a non-Hadoop cluster are welcome as well. The core libraries are highly optimized to allow for good performance also for non-distributed algorithms.

Scalable to support your business case. Mahout is distributed under a commercially friendly Apache Software license.

Scalable community. The goal of Mahout is to build a vibrant, responsive, diverse community to facilitate discussions not only on the project itself but also on potential use cases. Come to the mailing lists to find out more.

Currently Mahout supports mainly four use cases: Recommendation mining takes users' behavior and from that tries to find items users might like. Clustering takes e.g. text documents and groups them into groups of topically related documents. Classification learns from existing categorized documents what documents of a specific category look like and is able to assign unlabelled documents to the (hopefully) correct category. Frequent itemset mining takes a set of item groups (terms in a query session, shopping cart content) and identifies which individual items usually appear together."
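To make the first of those use cases concrete, here is a minimal sketch of recommendation mining with Mahout's (non-distributed) Taste API. The input file name, the neighborhood size of 10, and user ID 42 are all illustrative assumptions on my part, not anything from the thread:

  // A minimal, hypothetical sketch: user-based collaborative filtering
  // with Mahout's Taste API. "ratings.csv" holds userID,itemID,preference
  // rows; the neighborhood size and user ID are illustrative only.
  import java.io.File;
  import java.util.List;
  import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
  import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
  import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
  import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
  import org.apache.mahout.cf.taste.model.DataModel;
  import org.apache.mahout.cf.taste.recommender.RecommendedItem;

  public class RecommenderSketch {
      public static void main(String[] args) throws Exception {
          // Load the user-item preference data from a local CSV file.
          DataModel model = new FileDataModel(new File("ratings.csv"));
          PearsonCorrelationSimilarity similarity =
              new PearsonCorrelationSimilarity(model);
          // Each user's 10 most similar neighbors drive the recommendations.
          NearestNUserNeighborhood neighborhood =
              new NearestNUserNeighborhood(10, similarity, model);
          GenericUserBasedRecommender recommender =
              new GenericUserBasedRecommender(model, neighborhood, similarity);
          // Print the top 3 recommended items for user 42, with scores.
          for (RecommendedItem item : recommender.recommend(42L, 3)) {
              System.out.println(item.getItemID() + " : " + item.getValue());
          }
      }
  }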

I'd love to hear some commentary from others who may be able to weigh in with their personal experience relative to Hive and Pig.

Cheers,

Stan

I gave a talk at Intuit on the Hadoop Ecosystem covering these. A synopsis is:

  • Pig – A procedural (data flow) programming language to hide Map/Reduce (see the sketch after this list)
  • Hive – A declarative, SQL-like query language to hide Map/Reduce (also sketched below)
  • HBase – A tool to manage massive amounts of data
  • Sqoop – A tool to move data between Hadoop and RDBMSs
  • Flume – A tool to collect and move log data into Hadoop
  • Oozie – A tool to manage complex workflows in Hadoop
  • Zookeeper – A tool to coordinate distributed applications in Hadoop
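To make the Pig/Hive contrast concrete, here is the same toy question - the ten most-viewed URLs from a hypothetical tab-delimited page_views data set - expressed both ways. The file and table names are illustrative assumptions, not from any real deployment. In Pig Latin you spell out the data flow step by step:

  -- Pig Latin: an explicit, procedural data flow (hypothetical input)
  views   = LOAD 'page_views' AS (user:chararray, url:chararray, ts:long);
  grouped = GROUP views BY url;
  counts  = FOREACH grouped GENERATE group AS url, COUNT(views) AS hits;
  ordered = ORDER counts BY hits DESC;
  top10   = LIMIT ordered 10;
  STORE top10 INTO 'top_urls';

In Hive you declare the result you want and let the engine plan the Map/Reduce jobs (assuming a page_views table has already been defined):

  -- HiveQL: declarative and SQL-like
  SELECT url, COUNT(*) AS hits
  FROM page_views
  GROUP BY url
  ORDER BY hits DESC
  LIMIT 10;

Both compile down to Map/Reduce jobs; the difference is chiefly ergonomics. Pig suits multi-stage ETL pipelines, while Hive suits analysts who already think in SQL.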
Let me know if you would like to hear more.
Alex

... And how do these compare to IBM's Netezza, or SAP's HANA?

Jan - I am not sure what you want to compare. Is it Hadoop vs. Netezza and HANA? Or is it HBase vs. Netezza and HANA?

The answers are different.

I'm interested in understanding whether or not Netezza and Hana impose strong constraints on solution formulation like Hadoop does. 

Really interesting and useful - as things change at an astonishing rate, this kind of discussion provides a perspective from the community of practitioners that I've not seen anywhere else. I did some investigating to see what I could come up with and found this useful nugget - it's a bit long but worth the read:

Traditional Analytics Approach

The front-end of the above analytics architecture remains relatively unchanged for casual users, who continue to use reports and dashboards running against dependent data marts (either physical or virtual) fed by a data warehouse.

This environment typically meets much of the information needs of the organization, which can be defined up-front through requirements-gathering exercises. Predefined reports and dashboards are designed to answer questions tailored to individual roles within the organization.

Ad hoc needs of casual users can also be serviced by the traditional data warehouse and data mart architecture. However, the interactive reports and dashboards rely on the IT department or “super users”—tech-savvy business colleagues—to create ad hoc reports and views on their behalf.

Search-based exploration tools that allow users to type queries in plain English and refine their search using facets or categories are one of several ways to give more business users access to the data without requiring deep technical sophistication.

One new addition to the casual user environment is dashboards powered by streaming/CEP engines (real-time reports). While these operational dashboards are primarily used by operational analysts and workers, many executives and managers are keen to keep their fingers on the pulse of their companies’ core processes by accessing these “twinkling” dashboards directly or, more commonly, receiving alerts from these systems.

New Analytics Approach

The biggest opportunity in the above analytics architecture is how it addresses the information needs of power users. It gives power users many new options for consuming corporate data rather than creating countless “spreadmarts”. A power user is a person whose job is to crunch data on a daily basis to generate insights and plans.

Power users include business analysts (e.g., Excel jockeys), analytical modelers (e.g., SAS programmers and statisticians), and data scientists (e.g., application developers with business process and database expertise). Under the new paradigm, power users query either an analytic platform (separate from the enterprise data warehouse) and/or Hadoop directly (the new semi-structured data warehouse).

An analytic platform can be implemented via a number of technology approaches:

  • MPP analytic databases (e.g. Greenplum, AsterData)
  • Columnar databases (e.g. ParAccel, Infobright, Sybase IQ, Vertica)
  • Analytic appliances (e.g. Netezza, Exadata)
  • In-memory databases (e.g. HANA, QlikView)
  • Hadoop-based analytics (e.g. Hive, HBase, Mahout, Giraph)

Which approach or combination of approaches are you currently using or going to use?

Do you think that the Hadoop open source ecosystem will evolve to the point where the other analytic platforms become less relevant (e.g. what happens when the community adds real-time, mixed-workload support to Hadoop and develops a comprehensive suite of Hadoop-enabled / parallelized analytic algorithms)?

In an attempt to be controversial, I’m going to predict that Hadoop will expand to provide support for a sophisticated analytics layer which surpasses the performance of all existing analytic platform alternatives.

All these platforms are integrating with Hadoop because Hadoop acts as a great initial data store and ETL pre-processing engine. However, this integration may ultimately lead to their demise as the Hadoop system’s capabilities begin to overlap with theirs.

I hope this proves helpful.

AP

I cannot see Hadoop ever replacing all analytical needs, since not all algorithms can be forced into that structure. How is one supposed to do large spatial queries, or engineering-style stiffness-matrix calculations, using that form of structure? Communication patterns among mappers need to be severely curtailed, and if a reference to an entire database needs to be passed along to each mapper, that carriage overhead bogs processors down.

I am looking for solutions that are more general. I also fear that many lessons learned doing parallel computation on Cray-scale computers and mesh computers have been lost amid the popularity of the map-reduce framework, such as the need to still do scalar computations very quickly and the costs of input/output loading.
