Hadoop (MapReduce where code is turned into map and reduce jobs, and Hadoop runs the jobs) is the most well known technology used for "Big Data" because it allows an organization to store huge quantities of data at very low costs. R is a programming language and software environment for statistical computing and graphics. Put the two together to provide easy to use R interfaces for the distributed computing Hadoop environment and you have one king-hell data crunching tool for serious data analytics.

RHadoop is a small, open-source package developed by Revolution Analytics that binds R to Hadoop and allows for the representation of MapReduce algorithms using R - allowing data scientists access to Hadoop’s scalability from their favorite language, R. It allows users to write general MapReduce programs, offering the full power and ecosystem of an existing, established programming language.

RHadoop is comprised of three packages:

- RHDFS, which provides file level manipulation for HDFS, the Hadoop file system
- RHBASE, which provides access to HBASE, the Hadoop database
- rmr, which allows you to write MapReduce programs in R

There are two basic types of graph engines:

(1) Graph databases providing real-time, traversal-based algorithms over linked-list graphs represented on a single-server (vendors include Neo4j, OrientDB, DEX, and InfiniteGraph).

(2) Batch-processing using vertex-centric message passing within a graph represented across a cluster of machines. (Hama, Golden Orb, Giraph, and Pregel).

With Hadoop, instead of focusing on a particular vertex-centric BSP-based graph-processing package such as Hama or Giraph, the results presented are via Hadoop (HDFS + MapReduce). Moreover, instead of developing the MapReduce algorithms in Java, the R programming language is used.

When a graph is on the order of 100+ billion elements (vertices+edges), then a single-server graph database will not be able to represent nor process the graph. A multi-machine graph engine is required. While Hadoop is not a graph engine, a graph can be represented in its distributed HDFS file system and processed using its distributed processing MapReduce framework.

The graph generated previously is loaded up in R and a count of its vertices and edges is conducted. Next, the graph is represented as an edge list. An edge list (for a single-relational graph) is a list of pairs, where each pair is ordered and denotes the tail vertex id and the head vertex id of the edge. The edge list can be pushed to HDFS using RHadoop.

Click Here for a brief introduction to Hadoop and R describing how to program in the MapReduce framework, and provides an alternative way to implement MapReduce programs that strikes a delicate compromise between power and usability.

See: http://bit.ly/MhyoBC

© 2019 Data Science Central ® Powered by

Badges | Report an Issue | Privacy Policy | Terms of Service

**Most Popular Content on DSC**

To not miss this type of content in the future, subscribe to our newsletter.

- Book: Statistics -- New Foundations, Toolbox, and Machine Learning Recipes
- Book: Classification and Regression In a Weekend - With Python
- Book: Applied Stochastic Processes
- Long-range Correlations in Time Series: Modeling, Testing, Case Study
- How to Automatically Determine the Number of Clusters in your Data
- New Machine Learning Cheat Sheet | Old one
- Confidence Intervals Without Pain - With Resampling
- Advanced Machine Learning with Basic Excel
- New Perspectives on Statistical Distributions and Deep Learning
- Fascinating New Results in the Theory of Randomness
- Fast Combinatorial Feature Selection

**Other popular resources**

- Comprehensive Repository of Data Science and ML Resources
- Statistical Concepts Explained in Simple English
- Machine Learning Concepts Explained in One Picture
- 100 Data Science Interview Questions and Answers
- Cheat Sheets | Curated Articles | Search | Jobs | Courses
- Post a Blog | Forum Questions | Books | Salaries | News

**Archives:** 2008-2014 |
2015-2016 |
2017-2019 |
Book 1 |
Book 2 |
More

**Most popular articles**

- Free Book and Resources for DSC Members
- New Perspectives on Statistical Distributions and Deep Learning
- Time series, Growth Modeling and Data Science Wizardy
- Statistical Concepts Explained in Simple English
- Machine Learning Concepts Explained in One Picture
- Comprehensive Repository of Data Science and ML Resources
- Advanced Machine Learning with Basic Excel
- Difference between ML, Data Science, AI, Deep Learning, and Statistics
- Selected Business Analytics, Data Science and ML articles
- How to Automatically Determine the Number of Clusters in your Data
- Fascinating New Results in the Theory of Randomness
- Hire a Data Scientist | Search DSC | Find a Job
- Post a Blog | Forum Questions

## You need to be a member of Data Science Central to add comments!

Join Data Science Central