Get started with Hadoop and Spark in 10 minutes

With the big three Hadoop vendors – Cloudera, Hortonworks and MapR – each providing their own Hadoop sandbox virtual machines (VMs), trying out Hadoop today has become extremely easy. For a developer, it is very useful to download one of these VMs and start practising data science with Hadoop right away.

However, along with core Apache Hadoop, these vendors package their own software into their distributions, mostly for orchestration and management, which can otherwise be a pain given the many scattered open-source projects within the Hadoop ecosystem. For example, Hortonworks includes the open-source Ambari, while Cloudera ships its own Cloudera Manager for orchestrating Hadoop installations and managing multi-node clusters.

Moreover, most of these distributions today require a 64-bit machine and often a large amount of memory for a laptop; for example, running Cloudera Manager with a full-blown Cloudera Hadoop Distribution (CDH) 5.x requires at least 10 GB of RAM. For a developer on a laptop, RAM is always at a premium, so it may seem easier to install the vanilla Apache Hadoop downloads instead. The Hadoop documentation for setting up a single-node cluster, and even a multi-node cluster, is much improved nowadays, but between downloading the distributions and configuring SSH, it can still take a long time to get a useful multi-node cluster running. The overhead of setting up and running multiple VMs can also be a challenge.

This is where a tool like Vagrant can be very useful. Vagrant provides an easily configurable workflow for automating development environment setups. With simple commands like vagrant up and a single file (the Vagrantfile) that describes the type of machine you want, the software to be installed and how the machine can be accessed, configuring and setting up multiple VMs for a cluster is extremely easy. A list of available development box configurations can be found at vagrantbox.es.
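
To give an idea of what such a file looks like, here is a minimal multi-machine Vagrantfile sketch: one master and two workers on a private network. The box name, IP addresses and the bootstrap.sh provisioning script are assumptions for illustration only, not the exact configuration used in the project linked below.

    # Vagrantfile (sketch): one master and two worker VMs on a private network.
    # Box name, IPs and the bootstrap.sh provisioning script are hypothetical.
    Vagrant.configure("2") do |config|
      config.vm.box = "ubuntu/trusty64"

      config.vm.define "master" do |master|
        master.vm.hostname = "master"
        master.vm.network "private_network", ip: "192.168.50.10"
        master.vm.provider "virtualbox" do |vb|
          vb.memory = 2048        # keep memory modest for laptops
        end
        master.vm.provision "shell", path: "bootstrap.sh"
      end

      (1..2).each do |i|
        config.vm.define "worker#{i}" do |worker|
          worker.vm.hostname = "worker#{i}"
          worker.vm.network "private_network", ip: "192.168.50.#{10 + i}"
          worker.vm.provision "shell", path: "bootstrap.sh"
        end
      end
    end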

For this project, I tried adapting Jee Vang's excellent Vagrant project to allow for 32-bit machines and updated versions of Hadoop and Spark. The following simple steps will get you started with a multi-node Hadoop and Spark cluster in minutes.

  1. Download and install the prerequisites: VirtualBox and Vagrant.
  2. Run the vagrant box add command with the link to the desired Vagrant development box configuration.
  3. Download Hadoop, Spark and Java locally to speed up installation. Alternatively, you can specify remote URLs from which to download these packages during provisioning.
  4. Run vagrant up to set up your development environment.
  5. Run vagrant ssh to access the environment and perform post-provisioning tasks such as starting the Hadoop and Spark daemons (a condensed command sequence is sketched after this list).
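
Put together, the workflow looks roughly like the shell session below. The box name and URL are placeholders, and the daemon start-up commands assume HADOOP_HOME and SPARK_HOME are set by the provisioning scripts; the exact paths depend on the box and on the Hadoop and Spark versions you pick.

    # 1. Add a base box (box name and URL are hypothetical examples).
    vagrant box add centos65 http://example.com/boxes/centos65.box

    # 2. Bring up all VMs defined in the Vagrantfile.
    vagrant up

    # 3. SSH into the master node for post-provisioning tasks.
    vagrant ssh master

    # 4. Inside the VM: format HDFS and start the Hadoop/Spark daemons.
    #    Assumes HADOOP_HOME and SPARK_HOME were exported by the provisioning scripts.
    $HADOOP_HOME/bin/hdfs namenode -format
    $HADOOP_HOME/sbin/start-dfs.sh
    $HADOOP_HOME/sbin/start-yarn.sh
    $SPARK_HOME/sbin/start-all.sh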

For the detailed instructions and source files, refer to the project on Github:

How to set up a multi-node Hadoop-Spark cluster with Vagrant

Other approaches to this problem take a container-based route to installation. Hadoop clusters can be set up with LXC (Linux containers), for example with the very popular Docker. There are also configuration-management options such as Puppet, Ansible, Chef and Salt, all of which allow easy installations. Some of these can also be combined with Vagrant to build a virtualized Hadoop development cluster.
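
As a rough sketch of what such a combination might look like, the Vagrantfile below hands provisioning off to an Ansible playbook via Vagrant's built-in Ansible provisioner; the box name and playbook name are hypothetical.

    # Vagrantfile (sketch): let Vagrant delegate provisioning to Ansible.
    # The box name and the hadoop-cluster.yml playbook are assumptions for illustration.
    Vagrant.configure("2") do |config|
      config.vm.box = "ubuntu/trusty64"
      config.vm.provision "ansible" do |ansible|
        ansible.playbook = "hadoop-cluster.yml"
      end
    end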
