A few questions regarding these programming languages / environments.
I'm planning a webinar event on this topic and would like to include your feedback.
I look forward to hearing from you.
I'm familiar with Hive, have heard a bit about Pig, but what are Flume, Oozie and Mahout?
Interesting topic, Vincent! I do have some experience with Mahout, and it's best described on their site, so I'll quote them directly: "
Scalable to reasonably large data sets. Our core algorithms for clustering, classification and batch based collaborative filtering are implemented on top of Apache Hadoop using the map/reduce paradigm. However we do not restrict contributions to Hadoop based implementations: Contributions that run on a single node or on a non-Hadoop cluster are welcome as well. The core libraries are highly optimized to allow for good performance also for non-distributed algorithms.
Scalable to support your business case. Mahout is distributed under a commercially friendly Apache Software license.
Scalable community. The goal of Mahout is to build a vibrant, responsive, diverse community to facilitate discussions not only on the project itself but also on potential use cases. Come to the mailing lists to find out more.
Currently Mahout supports mainly four use cases: Recommendation mining takes users' behavior and from that tries to find items users might like. Clustering takes e.g. text documents and groups them into groups of topically related documents. Classification learns from existing categorized documents what documents of a specific category look like and is able to assign unlabelled documents to the (hopefully) correct category. Frequent itemset mining takes a set of item groups (terms in a query session, shopping cart content) and identifies which individual items usually appear together."
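To make the last use case concrete, here is a minimal single-machine sketch of the idea behind frequent itemset mining, restricted to pairs of items. It is plain Python, not Mahout's API or its actual algorithm; the function name `frequent_pairs`, the `min_support` threshold and the sample baskets are all made up for illustration.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(baskets, min_support):
    """Return the item pairs that occur together in at least min_support baskets."""
    counts = Counter()
    for basket in baskets:
        # Deduplicate and sort so ("a", "b") and ("b", "a") count as the same pair.
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}


# Hypothetical shopping carts, invented for this example.
baskets = [
    ["bread", "milk"],
    ["bread", "butter", "milk"],
    ["butter", "jam"],
    ["bread", "milk", "jam"],
]
result = frequent_pairs(baskets, min_support=2)
print(result)
```

With these four carts, only the pair bread and milk reaches the threshold. A distributed implementation has to spread this counting across machines, which is where the Hadoop layer comes in.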
I'd love to hear some commentary from others who may be able to weigh in on their personal experience with Hive and Pig.
I gave a talk at Intuit on the Hadoop Ecosystem covering these. A synopsis is:
... And how do these compare to IBM's Netezza, or SAP's HANA?
Jan - I am not sure what you want to compare. Is it Hadoop vs. Netezza and HANA? Or is it HBase vs. Netezza and HANA?
The answers are different.
I'm interested in understanding whether or not Netezza and HANA impose strong constraints on solution formulation like Hadoop does.
Really very interesting and useful - as things are ever changing, at an astonishing rate, I have to say this kind of discussion provides a perspective via the community of practitioners that I've not seen anywhere else. I did some investigating to see what I could come up with and found this useful nugget - it's a bit long but worth the read:
The front-end of the above analytics architecture remains relatively unchanged for casual users, who continue to use reports and dashboards running against dependent data marts (either physical or virtual) fed by a data warehouse.
This environment typically meets much of the information needs of the organization, which can be defined up-front through requirements-gathering exercises. Predefined reports and dashboards are designed to answer questions tailored to individual roles within the organization.
Ad hoc needs of casual users can also be serviced by the traditional data warehouse and data mart architecture. However, the interactive reports and dashboards rely on the IT department or “super users”—tech-savvy business colleagues—to create ad hoc reports and views on their behalf.
Search-based exploration tools that allow users to type queries in plain English and refine their search using facets or categories are one of several ways to give more business users access to the data without requiring them to be technically sophisticated.
New additions to the casual user environment are dashboards powered by streaming/CEP engines (real-time reports). While these operational dashboards are primarily used by operational analysts and workers, many executives and managers are keen to keep their fingers on the pulse of their companies’ core processes by accessing these “twinkling” dashboards directly or, more commonly, receiving alerts from these systems.
The biggest opportunity in the above analytics architecture is how it improves the information needs of power users. It gives power users many new options for consuming corporate data rather than creating countless “spreadmarts”. A power user is a person whose job is to crunch data on a daily basis to generate insights and plans.
Power users include business analysts (e.g., Excel jockeys), analytical modelers (e.g., SAS programmers and statisticians) and data scientists (e.g., application developers with business process and database expertise). Under a new paradigm, power users query an analytic platform (separate from the enterprise data warehouse), Hadoop directly (the new semi-structured data warehouse), or both.
An analytic platform can be implemented via a number of technology approaches:
Which approach or combination of approaches are you currently using or going to use?
Do you think that the Hadoop open source ecosystem will evolve to the point where the other analytic platforms become less relevant (e.g. what happens when the community adds real-time mixed-workload support to Hadoop, and develops a comprehensive suite of Hadoop-enabled / parallelized analytic algorithms)?
In an attempt to be controversial, I’m going to predict that Hadoop will expand to provide support for a sophisticated analytics layer which surpasses the performance of all existing analytic platform alternatives.
All these platforms are integrating with Hadoop because Hadoop acts as a great initial data store and ETL pre-processing engine. However, this integration will ultimately lead to their demise as the Hadoop system’s capabilities begin to overlap.
I hope this proves to be helpful.
I cannot see Hadoop ever replacing all analytical needs, since not all algorithms can be forced into that structure. How is one supposed to do large spatial queries? Or engineering-style stiff matrix calculations using that form of structure? Communication patterns among mappers need to be severely curtailed, and if a reference to an entire database needs to be passed along to each, that carriage overhead bogs processors down.
I am looking for solutions that are more general. I also fear the many lessons learned when doing parallel computation with Cray-scale computers and mesh computers have been lost because of the popularity of the map-reduce framework, such as the need to still do scalar computations very quickly, and the costs of input-output loading.
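For readers less familiar with the structure being debated here, the following is a minimal in-memory sketch of the map-reduce pattern in plain Python. It is not Hadoop code, and every name in it (`map_phase`, `shuffle`, `reduce_phase`, the word-count mapper and reducer, the sample lines) is made up for illustration. The point it tries to show is the constraint itself: each mapper call sees one record in isolation, and the only way information flows between records is through the keys they emit.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the mapper to each record independently; mappers share no state."""
    for record in records:
        yield from mapper(record)


def shuffle(pairs):
    """Group emitted values by key; this is the only cross-record communication."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Apply the reducer to each key and its list of values."""
    return {key: reducer(key, values) for key, values in groups.items()}


def word_mapper(line):
    # Emit (word, 1) for every word in a single line of text.
    for word in line.lower().split():
        yield word, 1


def sum_reducer(key, values):
    # Add up the counts collected for one word.
    return sum(values)


# Hypothetical input, invented for this example.
lines = ["Hive and Pig", "Pig and Oozie", "Hive"]
counts = reduce_phase(shuffle(map_phase(lines, word_mapper)), sum_reducer)
print(counts)
```

Word counting fits this shape naturally. A computation in which every record needs to consult many others, such as a spatial join or a coupled matrix solve, has to be recast as repeated rounds of this map, shuffle and reduce cycle, which is where the overhead concerns above come from.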
Hi Jan Galkowski,
I disagree with your statement that "not all algorithms can be forced into that structure", because it has been proven that the map-reduce paradigm is Turing complete.
That said, I agree with you that a map-reduce approach is not always useful. I work in signal data analysis, and in that context map-reduce is not a winning choice; often the easiest way is to use tools that raise the level at which the problem is expressed. An example is www.chartie.io.
Hive is a relational data warehousing and querying platform on top of Hadoop, whereas Pig is a dataflow language. So you can ETL your data using Pig and load it into Hive.
Oozie is a workflow environment for management of more complex scheduling of tasks. Mahout is a data mining package with data parallel scalability on Hadoop.
Flume is for collecting and aggregating log data into Hadoop, but nowadays it is being used for other streaming data as well.
I gave an overview of the Hadoop stack and some alternatives in my "Introduction to Big Data" course on Coursera (https://www.coursera.org/learn/intro-to-big-data/). Hope this helps further. Please feel free to download and use the slides I made available, with attribution.