Percolator, Dremel and Pregel: Alternatives to Hadoop

Posted on behalf of Michael Walker.

Hadoop (an implementation of MapReduce, in which code is expressed as map and reduce jobs that Hadoop then runs) is great at crunching data in batch, yet it is inefficient for analyzing data that changes: each time you add, change or manipulate data, you must stream over the entire dataset again.
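To make the cost of rescanning concrete, here is a minimal single-process sketch of a map and reduce word count. It is plain Python, not Hadoop's API, and the names (map_phase, reduce_phase, batch_word_count) are made up for illustration. The point it shows is that adding one small document still forces a pass over everything.

```python
from collections import defaultdict

def map_phase(documents):
    """Map step: emit a (word, 1) pair for every word in every document."""
    for text in documents.values():
        for word in text.split():
            yield word, 1

def reduce_phase(pairs):
    """Reduce step: sum the counts emitted for each word."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

def batch_word_count(documents):
    # A batch job has no memory of earlier runs, so it rescans the full dataset.
    return reduce_phase(map_phase(documents))

docs = {"a": "big data", "b": "big graphs"}
first = batch_word_count(docs)

docs["c"] = "data"                # one small addition...
second = batch_word_count(docs)   # ...still rereads "a" and "b" as well as "c"
```

The second run repeats all the work of the first, which is the inefficiency the rest of this post is about.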

In most organizations, data is always growing and changing, so the time needed to analyze it increases significantly.

As a result, for processing large and diverse data sets, ad hoc analytics or graph data structures, better alternatives to Hadoop / MapReduce are needed.

Google (which designed MapReduce, the model that the open source Hadoop project implements) thought so and architected a better, faster data crunching ecosystem that includes Percolator, Dremel and Pregel. Google is one of the key innovators in large-scale architecture.

Percolator is a system for incrementally processing updates to a large data set. Replacing a batch-based indexing system with one based on incremental processing using Percolator significantly speeds up indexing and reduces the time to analyze data.

Percolator’s architecture provides horizontal scalability and resilience. It reduces latency (the time between crawling a page and its availability in the index) by a factor of 100, and it simplifies the indexing algorithm. The big advantage of Percolator is that indexing time is now proportional to the size of the page being indexed rather than to the size of the whole existing index.
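The idea of paying only for what changed can be sketched with a tiny in-memory inverted index. This is an illustration of incremental updating in general, not Percolator's actual interface (the paper describes transactions and observers on top of Bigtable). The class and method names below are hypothetical.

```python
from collections import defaultdict

class IncrementalIndex:
    """Toy inverted index that is updated one document at a time."""

    def __init__(self):
        self.postings = defaultdict(set)  # word -> ids of documents containing it
        self.doc_words = {}               # doc id -> words currently indexed for it

    def upsert(self, doc_id, text):
        """Index a new or changed document; return how many postings were touched."""
        new_words = set(text.split())
        old_words = self.doc_words.get(doc_id, set())
        # Remove postings only for words that disappeared from this document.
        for word in old_words - new_words:
            self.postings[word].discard(doc_id)
            if not self.postings[word]:
                del self.postings[word]
        # Add postings only for words that are new to this document.
        for word in new_words - old_words:
            self.postings[word].add(doc_id)
        self.doc_words[doc_id] = new_words
        return len(old_words ^ new_words)

    def lookup(self, word):
        return sorted(self.postings.get(word, ()))

index = IncrementalIndex()
index.upsert("page1", "hadoop mapreduce")
index.upsert("page2", "hadoop pregel")
touched = index.upsert("page1", "hadoop dremel")  # recrawl of a single page
```

The recrawl of page1 touches only the postings for the words that changed on that page. The rest of the index, however large, is left alone.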

See: http://research.google.com/pubs/pub36726.html

Dremel is a scalable, interactive ad hoc query system for analysis of read-only nested data. By combining multi-level execution trees and columnar data layout, it can run aggregation queries over trillion-row tables in seconds. The system scales to thousands of CPUs and petabytes of data, which lets analysts get answers interactively and makes it roughly 100 times faster than MapReduce for such queries.

Dremel is similar in purpose to Pig and Hive. Yet while Hive and Pig rely on MapReduce for query execution, Dremel uses its own query execution engine based on aggregator trees.
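A rough sketch of the two ideas named above, columnar layout and tree-shaped aggregation, might look like the following. It runs in a single process and says nothing about Dremel's real storage format or query language. The function names, the fanout parameter and the sample data are all assumptions for illustration.

```python
def to_columns(rows):
    """Turn row records into one list per field (assumes every row has the same fields)."""
    columns = {}
    for row in rows:
        for field, value in row.items():
            columns.setdefault(field, []).append(value)
    return columns

def leaf_aggregate(partition):
    """Leaf step: read only the two columns the query needs, return partial sums."""
    partial = {}
    for country, clicks in zip(partition["country"], partition["clicks"]):
        partial[country] = partial.get(country, 0) + clicks
    return partial

def merge(partials):
    """Intermediate step: combine partial sums coming from child nodes."""
    total = {}
    for partial in partials:
        for key, value in partial.items():
            total[key] = total.get(key, 0) + value
    return total

def tree_aggregate(partitions, fanout=2):
    """Aggregate level by level until a single root result remains."""
    if fanout < 2:
        raise ValueError("fanout must be at least 2")
    level = [leaf_aggregate(p) for p in partitions]
    while len(level) > 1:
        level = [merge(level[i:i + fanout]) for i in range(0, len(level), fanout)]
    return level[0] if level else {}

partitions = [
    to_columns([{"country": "US", "clicks": 3}, {"country": "DE", "clicks": 1}]),
    to_columns([{"country": "US", "clicks": 2}]),
    to_columns([{"country": "DE", "clicks": 4}, {"country": "FR", "clicks": 5}]),
]
clicks_by_country = tree_aggregate(partitions)
```

Each leaf scans only the columns it needs and returns a small partial result, so the upper levels of the tree merge summaries instead of raw rows.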

See: http://research.google.com/pubs/pub36632.html

Pregel is a system for large-scale graph processing and graph data analysis. It is designed to run graph algorithms quickly with simple code: it computes over large graphs much faster than alternatives, and its application programming interface is easy to use.

Pregel is architected for efficient, scalable and fault-tolerant implementation on clusters of thousands of commodity computers, and its implied synchronicity makes reasoning about programs easier. Distribution-related details are hidden behind an abstract API. The result is a framework for processing large graphs that is expressive and easy to program.
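The synchronous, vertex-centric style can be sketched in a few lines. The example below spreads the largest value in a graph to every vertex, one round (superstep) at a time. It is a single-process illustration, not Pregel's API. The function name and data layout are assumptions, and it assumes every edge target also appears in the values mapping.

```python
def pregel_max(values, edges):
    """Propagate the maximum value through the graph in synchronized supersteps.

    values: vertex -> initial number
    edges:  vertex -> list of neighbours that receive its messages
    """
    values = dict(values)
    active = set(values)                  # every vertex is active in superstep 0
    inbox = {v: [] for v in values}
    superstep = 0
    while active:
        outbox = {v: [] for v in values}
        for v in active:
            best = max([values[v]] + inbox[v])
            changed = best > values[v]
            values[v] = best
            if superstep == 0 or changed:
                for neighbour in edges.get(v, []):
                    outbox[neighbour].append(best)
            # A vertex that learned nothing new sends nothing (it "votes to halt").
        # Barrier: messages sent in this superstep are delivered in the next one.
        inbox = outbox
        active = {v for v, messages in inbox.items() if messages}
        superstep += 1
    return values, superstep

graph_values = {"a": 3, "b": 6, "c": 2, "d": 1}
graph_edges = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
final_values, supersteps = pregel_max(graph_values, graph_edges)
```

Because each vertex only sees messages from the previous superstep, the per-vertex logic can be reasoned about without thinking about timing or message ordering. The computation stops when no messages are in flight.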

See: http://kowshik.github.com/JPregel/pregel_paper.pdf

Comment by Big Data Queen on September 8, 2012 at 5:36am

Vincent, good insight. As an alternative to Hadoop altogether, LexisNexis has open sourced its HPCC Systems big data platform, which is a complete enterprise-ready solution. Designed by data scientists, it provides a single architecture, a consistent data-centric programming language (ECL), and two data processing clusters. Its built-in analytics libraries for Machine Learning and BI integration provide a complete integrated solution from data ingestion and data processing to data delivery. This all-in-one platform means there is only one thing to support, and with a significantly smaller number of resources. In contrast, the complexity of the Hadoop ecosystem requires a huge investment in technology and resources up front and throughout. More at http://hpccsystems.com
