Ten top languages for crunching Big Data

With an ever-growing number of businesses turning to Big Data and analytics to generate insights, there is a greater need than ever for people with the technical skills to apply analytics to real-world problems.

Computer programming is still at the core of the skillset needed to create algorithms that can crunch through whatever structured or unstructured data is thrown at them. Certain languages have proven themselves better at this task than others. Here’s a brief overview of 10 of the most popular and widely used.

Fractal landscape simulation requires a lot of computing (this one possibly produced with MATLAB)

Julia

Julia is a relative newcomer, having existed for only a few years, but it is quickly gaining popularity, with data scientists praising both its flexibility and its ease of use. Although designed as a “jack of all trades” language, able to cope with any sort of application, it is thought to be particularly efficient at utilizing the power of distributed systems such as Hadoop, which are frequently used in Big Data.

Crowd-sourced data science website Kaggle is currently running a competition which doubles as a tutorial on getting started with Julia – it will show you how to use it to create algorithms designed to detect text characters, such as roadside graffiti, in Google Street View images.

SAS

The SAS language is the programming language behind the SAS (Statistical Analysis System) analytics platform, which has been used for statistical modelling since the 1960s and is still popular today after many years of updates and refinements. Unlike many of the other languages mentioned here, it isn’t open source and so isn’t free, although there is a free University Edition designed for learners, available here.

Python

Python is one of the most popular open source (free) languages for working with the large and complicated datasets needed for Big Data. It has become very popular in recent years because it is both flexible and relatively easy to learn. Like most popular open source software, it also has a large and active community dedicated to improving the product and making it accessible to new users. A free Codecademy course will take you through the basics in 13 hours.
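To give a flavour of why Python is considered easy to pick up, here is a minimal sketch that reads a small table and averages one column per group, using only the standard library. The sample data, the column names and the helper `mean_by_key` are invented for illustration; a real project would more likely read a file from disk and use a dedicated data library.

```python
import csv
import io
import statistics

# Hypothetical sample data standing in for a real CSV file on disk.
RAW = """city,temperature_c
Oslo,4.5
Cairo,29.0
Lima,19.5
Oslo,6.5
Cairo,31.0
"""


def mean_by_key(rows, key, value):
    """Group rows by the `key` column and average the `value` column."""
    groups = {}
    for row in rows:
        groups.setdefault(row[key], []).append(float(row[value]))
    return {name: statistics.mean(vals) for name, vals in groups.items()}


# DictReader turns each line into a dict keyed by the header row.
rows = list(csv.DictReader(io.StringIO(RAW)))
means = mean_by_key(rows, "city", "temperature_c")
print(means)
```

The same pattern (read, group, summarize) scales up to much larger datasets once the in-memory dictionary is swapped for a tool built for the job.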

R

Like Python, R is hugely popular (one poll suggested that these two open source languages were between them used in nearly 85% of all Big Data projects) and supported by a large and helpful community. Where Python excels in simplicity and ease of use, R stands out for its raw number-crunching power. Its widespread adoption means you probably benefit from work done in R every day, as it has been used for data analysis at Google, Facebook, Twitter and many other services. A free, online beginners’ course in programming R can be found here.

SQL

Although SQL is not designed for the task of handling messy, unstructured datasets of the type which Big Data often involves, there is still a need for structured, quantified data analytics in many organizations. Older and less sexy than Python or R, it was still used by 30% of organizations for their data crunching, according to one poll (the same one mentioned above!) and is a useful tool for any statistician. A free course which will teach you the basics of SQL programming is available here.
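As a rough illustration of the kind of structured, quantified analysis SQL is good at, the sketch below runs a grouped total against an in-memory SQLite database from Python. The `sales` table and its figures are made up for the example, and SQLite is used only because it ships with Python; the query itself is ordinary SQL that most database systems would accept.

```python
import sqlite3

# An in-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")

# Hypothetical rows, invented for this example.
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("north", 120.0), ("south", 80.0), ("north", 30.0), ("south", 45.5)],
)

# Total sales per region, largest first.
totals = conn.execute(
    "SELECT region, SUM(amount) AS total "
    "FROM sales GROUP BY region ORDER BY total DESC"
).fetchall()
conn.close()

for region, total in totals:
    print(region, total)
```

Note how the query states what result is wanted (totals by region) rather than how to loop over the rows, which is a large part of SQL's lasting appeal.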

Scala

Scala compiles to bytecode that runs on the Java Virtual Machine, meaning it can be run on just about any platform and can make use of existing Java libraries. Just like Java, it has become popular with data scientists and statisticians thanks to its powerful number-crunching abilities and its scalability (the name is short for “scalable language”). A free course suitable for those with some basic experience of programming another language such as Java or Python is available here.

MATLAB

As the name suggests, MATLAB is designed for working with matrices, which makes it very good for statistical modelling and algorithm creation. It isn’t open source, so it doesn’t have the same volume of free community-driven support, but this is alleviated somewhat by its widespread use in academia: many will be introduced to it at college, and for those who are not there are ample resources online. Coursera offers Vanderbilt University’s Introduction to Programming with MATLAB free of charge.

HiveQL

HiveQL is a query language for Apache Hive, which is designed to work on top of Apache Hadoop or other distributed storage platforms such as Amazon’s S3 file system. It is based on SQL, one of the oldest and most widely used data programming languages, which has helped its adoption since its initial development by Facebook. It has since been passed to the Apache Software Foundation and released as open source. An intermediate level tutorial for those already familiar with SQL is available here.

Pig Latin

Another Hadoop-oriented, open source system, Pig Latin is the language layer of the Apache Pig platform, which is used to create Hadoop MapReduce jobs that sort and apply mathematical functions to large, distributed datasets. As with other newer languages, users can write functions in more established languages such as Python to carry out operations that are not natively supported. An online Pig tutorial can be found here.
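For readers who have not met MapReduce before, the following is a minimal single-machine sketch of the idea in plain Python. It is not Pig Latin and does not use Hadoop; the function names, the sensor records and the choice of an average as the "mathematical function" are all assumptions made for illustration. On a real cluster, each phase would run in parallel across many machines.

```python
from collections import defaultdict


def map_phase(records):
    """Emit one (key, value) pair per input record."""
    for sensor, reading in records:
        yield sensor, reading


def shuffle(pairs):
    """Collect all values that share a key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Apply a mathematical function (here, the mean) to each group."""
    return {key: sum(values) / len(values) for key, values in groups.items()}


# Hypothetical (sensor, reading) records, invented for this example.
records = [("a", 2.0), ("b", 10.0), ("a", 4.0), ("b", 20.0), ("c", 7.0)]

averages = reduce_phase(shuffle(map_phase(records)))

# Sort the results by average reading, highest first.
ranked = sorted(averages.items(), key=lambda kv: kv[1], reverse=True)
print(ranked)
```

A language like Pig Latin lets you describe the grouping, aggregation and ordering steps in a few declarative statements and leaves the generation of the underlying jobs to the platform.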

Go

Go was developed by Google and released under an open source licence. Its syntax is based on C, meaning many programmers will find it familiar, which has aided its adoption. Although it was not specifically designed for statistical computing, its speed and familiarity, along with the fact that it can call routines written in other languages (such as C, through its cgo tool) to handle functions it can’t cope with itself, mean it is growing in popularity for data programming. An online introduction and tutorial can be found here.


Comment by Michelle Hirsch on August 31, 2015 at 1:55am

Thanks for the interesting article and comments. A few small notes:

There is a vibrant community of MATLAB users providing code and support to each other through MATLAB Central. There are nearly 25,000 code submissions and a rapidly growing collection of well over 100,000 answered questions.

François suggested that GNU Octave is 99% compatible with MATLAB syntax. This isn't really the case anymore, as Octave has not kept pace with the development of the core MATLAB language and datatypes. Most notable for big data and data analytics are tables, categorical arrays, datetime arrays, image and text datastores, and support for MapReduce.

Comment by Rick Henderson on August 28, 2015 at 6:20am

Why are you posting a photo if you don't know the exact source? It *might* be MATLAB? Seriously. It looks like it was rendered in Terragen, but I guess a question would be where the data came from or how it was processed. However, if it was Terragen, it could be fractally generated and therefore not real.
