The “big data” revolution has taken place. Startups are being formed and are running at full speed to take advantage of this new market. The press publishes articles daily on big data and how it will change all of our lives. Even stalwarts like Oracle and IBM are jumping into the fray. Seeing a product announcement from Oracle about a new “NoSQL” data store was a bit of a shock. These are all signs that the revolution has put down roots and is ready to evolve. All revolutions have to evolve at some point. They make their initial massive change, and then what? Evolve or die.
An important question to ask of the Hadoop platform, the center of the big data universe, is whether it will evolve. It already has. What started out as MapReduce and a distributed file system (HDFS) has turned into a complete ecosystem. Management and coordination with ZooKeeper, data storage with HBase, other user interfaces such as Hive and Pig, and a smattering of commercial add-ons are a testament to a growing and evolving platform.
But one of the key evolutions taking place now is rather misnamed: NextGen MapReduce. It’s also known as YARN (Yet Another Resource Negotiator), so let’s go with YARN, since that name better captures the underlying sea change. For all the infrastructure growing around Hadoop, the basic compute paradigm hadn’t changed until now. That paradigm is MapReduce. MapReduce is great at solving problems such as indexing the world’s web sites. However, it’s not the most flexible or the most efficient model for every computation problem out there. Think of Hadoop as a distributed operating system. Without YARN, Hadoop can run only one type of application, namely MapReduce. Linux would not have gotten very far with that constraint! YARN changes all that. YARN runs not only MapReduce applications but other application types as well. And as its name implies, YARN negotiates the use of Hadoop cluster resources among all application types, so that valuable cluster resources can be shared.
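To make the idea of negotiating shared resources concrete, here is a toy sketch in Python. It is not YARN’s API or its scheduling algorithm. The names `ToyNegotiator`, `Request` and `demo` are hypothetical, and the sketch assumes a cluster made of equally sized containers handed out first come, first served. The point is only that the negotiator does not care what kind of application is asking.

```python
from dataclasses import dataclass, field


@dataclass
class Request:
    app: str         # hypothetical application name
    kind: str        # e.g. "mapreduce", "mpi", "dataflow"; ignored by the negotiator
    containers: int  # number of equally sized containers requested


@dataclass
class ToyNegotiator:
    capacity: int                                 # total containers in the toy cluster
    granted: dict = field(default_factory=dict)   # app name -> containers held

    def free(self) -> int:
        """Containers not currently held by any application."""
        return self.capacity - sum(self.granted.values())

    def request(self, req: Request) -> int:
        """Grant as many containers as are free, up to the amount requested."""
        grant = min(req.containers, self.free())
        if grant > 0:
            self.granted[req.app] = self.granted.get(req.app, 0) + grant
        return grant

    def release(self, app: str) -> None:
        """Return all containers held by an application to the pool."""
        self.granted.pop(app, None)


def demo() -> list:
    """Three application types compete for a 10-container cluster."""
    cluster = ToyNegotiator(capacity=10)
    grants = [
        cluster.request(Request("web-index", "mapreduce", 6)),
        cluster.request(Request("simulation", "mpi", 3)),
        cluster.request(Request("etl", "dataflow", 4)),  # only 1 container is left
    ]
    cluster.release("web-index")  # the MapReduce job finishes
    grants.append(cluster.request(Request("etl", "dataflow", 3)))
    return grants
```

In this sketch the dataflow job gets only one container at first and picks up the rest once the MapReduce job releases its share. A real resource manager layers queues, priorities and per-node accounting on top of this basic idea.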
YARN is also flexible and highly extensible. MapReduce has already been “ported” to YARN and is available in beta form from Apache and the major distributions. YARN opens up the Hadoop platform to many different programming models. Efforts are under way to port other computation models, such as MPI, to YARN. The DataRush team at Pervasive is also porting our distributed framework to YARN. This will allow dataflow-based jobs to execute on the Hadoop platform, sharing resources with MapReduce and other job types. Dataflow is very efficient for data transformation and preparation on large data volumes. It is also a good model for predictive analytics algorithms, such as decision trees, that require multiple passes over large amounts of data. Dataflow also lends itself well to graphical development. Using a drag-and-drop GUI to build big data applications opens this technology to a wider audience.
With YARN in place, the choices available for building programs on the Hadoop platform will continue to grow. As we’ve seen in the High Performance Computing (HPC) world, no one programming model will solve everyone’s problems. Many models are required so that users can pick the right solution for their particular problem. YARN opens the door for additional solutions to grow and thrive. MapReduce, MPI, dataflow, real-time streaming and others can all live together on the same Hadoop platform. In the end, the platform benefits along with all of its users.