With the big three Hadoop vendors – Cloudera, Hortonworks and MapR – each providing their own Hadoop sandbox virtual machines (VMs), trying out Hadoop today has become extremely easy. For a developer, downloading one of these VMs is a quick way to get started with Hadoop and practice data science right away.
However, alongside core Apache Hadoop, these vendors package their own software into their distributions, mostly for orchestration and management, which is otherwise a pain given the many scattered open-source projects within the Hadoop ecosystem. For example, Hortonworks includes the open-source Ambari, while Cloudera ships its own Cloudera Manager for orchestrating Hadoop installations and managing multi-node clusters.
Moreover, most of these distributions today require a 64-bit machine and often a large amount of memory (for a laptop). For example, running Cloudera Manager with a full-blown Cloudera Distribution of Hadoop (CDH) 5.x requires at least 10 GB of RAM. For a developer with a laptop, RAM is always at a premium, so it may seem easier to try the vanilla Apache Hadoop downloads instead. The Hadoop documentation for installing a single-node cluster, and even a multi-node cluster, is much improved nowadays, but between downloading the distributions and setting up SSH, it can still take a long time to set up a useful multi-node cluster. The overhead of setting up and running multiple VMs by hand adds to the challenge.
This is where a tool like Vagrant can be very useful. Vagrant provides an easily configurable workflow for automating development environment setup. With simple commands like vagrant up and a single file describing the type of machine you want, the software to be installed, and how the machine can be accessed, configuring and setting up multiple VMs for a cluster becomes straightforward. A list of available base boxes for development can be found at vagrantbox.es.
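As a minimal sketch of that "single file", a Vagrantfile for a small multi-VM cluster might look like the following. The box name, hostnames, IP addresses and memory size here are illustrative assumptions, not the actual configuration used by any particular project:

```ruby
# Vagrantfile: minimal multi-VM cluster sketch.
# Box name, hostnames, IPs and memory are illustrative assumptions.
Vagrant.configure("2") do |config|
  config.vm.box = "ubuntu/trusty32"   # a 32-bit base box, as an example

  # One master and two worker nodes on a private network
  nodes = {
    "master"  => "192.168.33.10",
    "worker1" => "192.168.33.11",
    "worker2" => "192.168.33.12",
  }

  nodes.each do |name, ip|
    config.vm.define name do |node|
      node.vm.hostname = name
      node.vm.network "private_network", ip: ip
      node.vm.provider "virtualbox" do |vb|
        vb.memory = 1024              # keep per-VM RAM modest for laptops
      end
      # Provisioning (e.g. a shell script that installs Hadoop and Spark)
      # would be attached here with node.vm.provision.
    end
  end
end
```

Running vagrant up against such a file creates and provisions all three machines in one step, which is exactly the kind of repetitive setup that is tedious to do by hand.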
For this project, I tried adapting Jee Vang's excellent Vagrant project to allow for 32-bit machines and updated versions of Hadoop and Spark. What follows is a list of simple steps to get a multi-node cluster with Hadoop and Spark running in minutes.
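At the command-line level, the whole workflow boils down to a handful of Vagrant commands. This is a rough sketch: the clone URL is a placeholder for the project repository linked below, and the machine name "master" is an assumption about how the Vagrantfile names its nodes:

```shell
# Fetch the project (placeholder URL; use the actual GitHub repository)
git clone <project-url> hadoop-vagrant
cd hadoop-vagrant

vagrant up            # create and provision every VM in the Vagrantfile
vagrant status        # list the machines and their current state
vagrant ssh master    # log in to a node (node name is an assumption)
vagrant halt          # stop the VMs when done
vagrant destroy -f    # remove them entirely to reclaim disk space
```

The same commands work regardless of how many machines the Vagrantfile defines, which is what makes the approach attractive for multi-node clusters.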
For the detailed instructions and source files, refer to the project on GitHub:
Other approaches to this problem are container-based: Hadoop clusters can be set up with LXC (Linux containers), for example via the very popular Docker. There are also configuration-management options such as Puppet, Ansible, Chef and Salt, all of which allow easy installations. Some of these can be combined with Vagrant to build a virtualized Hadoop development cluster.
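To give a flavour of the container-based route, a single-node Hadoop environment can be pulled and started as a Docker image. The image name below is an assumption based on community images of that era; any prebuilt Hadoop image would follow the same pattern:

```shell
# Pull and run a prebuilt single-node Hadoop image (image name is an
# assumption; substitute whichever Hadoop image you prefer)
docker pull sequenceiq/hadoop-docker
docker run -it sequenceiq/hadoop-docker /bin/bash

# Inside the container, the usual Hadoop commands are then available,
# e.g. checking HDFS or submitting an example MapReduce job.
```

Compared with full VMs, containers start faster and use far less memory, at the cost of weaker isolation between nodes.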