Guest blog post by Martijn Theuwissen, co-founder at DataCamp. Other R resources can be found here, and R Source code for various problems can be found here. A data science cheat sheet can be found here, to get you started with many aspects of data science, including R.
Learning R can be tricky, especially if you have no programming experience or are more familiar working with point-and-click statistical software versus a real programming language. This learning path is mainly for novice R users that are just getting started but it will also cover some of the latest changes in the language that might appeal to more advanced R users.
Creating this learning path was a continuous trade-off between being pragmatic and exhaustive. There are many excellent (free) resources on R out there, and unfortunately not all could be covered here. The material presented here is a mix of relevant documentation, online courses, books, and more that we believe is best to get you up to speed with R as fast as possible.
Here is an outline:
R is rapidly becoming the lingua franca of Data Science. Having its origins in academics, you will spot it today in an increasing number of business settings as well where it is a contestant to commercial software incumbents such as SAS, STATA and SPSS. Each year, R gains in popularity and in 2015 IEEE listed R in the top ten languages of 2015.
This implies that the demand for individuals with R knowledge is growing, and consequently learning R is definitely a smart investment career wise (according to this survey R even is the highest paying skill). This growth is unlikely to plateau in the next years with large players such as Oracle &Microsoft stepping up by including R in its offerings.
Nevertheless, money should not be the only driver when deciding to learn a new technology or programming language. Luckily, R has a lot more to offer than a solid paycheck. By engaging yourself with R, you will become familiar with a highly diverse and interesting community. Namely, R is being used for a diverse set of task such as finance, genomic analysis, real estate, paid advertising, and much more. All these fields are actively contributing to the development of R. You will encounter a diverse set of examples and applications on a daily basis, keeping things interesting and giving you the ability to apply your knowledge on a diverse range of problems.
Before you can actually start working in R, you need to download a copy of it on your local computer. R is continuously evolving and different versions have been released since R was born in 1993 with (funny) names such as World-Famous Astronaut and Wooden Christmas-Tree. Installing R is pretty straightforward and there are binaries available for Linux, Mac and Windows from the Comprehensive R Archive Network (CRAN).
Once R is installed, you should consider installing one of R’s integrated development environment as well (although you could also work with the basic R console if you prefer). Two fairly established IDE’s are RStudio and Architect. In case you prefer a graphical user interface, you should check out R-commander.
Learning the syntax of a programming language like R is very similar to the way you would learn a natural language like French or Spanish: by practice & by doing. One of the best ways to learn R by doing is through the following (online) tutorials:
Next to these online tutorials there are also some very good introductory books and written tutorials to get you started:
Every R package is simply a bundle of code that serves a specific purpose and is designed to be reusable by other developers. In addition to the primary codebase, packages often include data, documentation, and tests. As an R user, you can simply download a particular package (some are even pre-installed) and start using its functionalities. Everyone can develop R packages, and everyone can share their R packages with others.
The above is an extremely powerful concept and one of the key reasons R is so successful as a language and as a community. Namely, you don’t need to do all the hard core programming yourself or understand every complex detail of a particular algorithm or visualization. You can simple use the out-of-the box functions that come with the relevant package as an interface to such functionalities. As such it is useful to have an understanding of R’s package ecosystem.
Many R packages are available from the Comprehensive R Archive Network, and you can install them using the install.packages function. What is great about CRAN is that it associates packages with a particular task via Task Views. Alternatively, you can find R packages on bioconductor, github and bitbucket.
Looking for a particular package and corresponding documentation? Try Rdocumentation, where you can easily search packages from CRAN, github and bioconductor.
You will quickly find out that for every R question you solve, five new ones will pop-up. Luckily, there are many ways to get help:
Once you have an understanding of R’s syntax, the package ecosystem, and how to get help, it’s time to focus on how R can be useful for the most common tasks in the data analysis workflow
Before you can start performing analysis, you first need to get your data into R. The good thing is that you can import into R all sorts of data formats, the hard part this is that different types often need a different approach:
Performing data manipulation with R is a broad topic as you can see in for example this Data Wrangling with R video by RStudio or the book Data Manipulation with R. This is a list of packages in R that you should master when performing data manipulations:
One of the main reasons R is the favorite tool of data analysts and scientists is because of its data visualization capabilities. Tons of beautiful plots are created with R as shown by all the posts on FlowingData, such as this famous facebook visualization.
Credit card fraud scheme featuring time, location, and loss per event, using R: click here for source
If you want to get started with visualizations in R, take some time to study theggplot2 package. One of the (if not the) most famous packages in R for creating graphs and plots. ggplot2 is makes intensive use of the grammar of graphics, and as a result is very intuitive in usage (you’re continuously building part of your graphs so it’s a bit like playing with lego). There are tons of resources to get your started such as this interactive coding tutorial, a cheatsheet and an upcoming book by Hadley Wickham.
Besides ggplot2 there are multiple other packages that allow you to create highly engaging graphics and that have good learning resources to get you up to speed. Some of our favourites are:
Next to the “traditional” graphs, R is able to handle and visualize spatial data as well. You can easily visualize spatial data and models on top of static maps from sources such as Google Maps and Open Street Maps with a package such as ggmap. Another great package is choroplethr developed by Ari Lamstein of Trulia or the tmap package. Take this tutorial onIntroduction to visualising spatial data in R if you want to learn more.
In case you are new to statistics, there are some very solid sources that explain the basic concepts while making use of R:
Note that these resources are aimed at beginners. If you want to go more advanced you can look at the multiple resources there are for machine learning with R. Books such as Mastering Machine Learning with R andMachine Learning with R explain the different concepts very well, and online resources like the Kaggle Machine Learning course help you practice the different concepts. Furthermore there are some very interesting blogs to kickstart your ML knowledge like Machine Learning Mastery or this post.
One of the best way to share your models, visualizations, etc is through dynamic documents. R Markdown (based on knitr and pandoc) is a great tool for reporting your data analysis in a reproducible manner though html, word, pdf, ioslides, etc. This 4 hour tutorial on Reporting with R Markdownexplains the basics of R markdown. Once you are creating your own markdown documents, make sure this cheat sheet is on your desk.
R is a fast-evolving language. It’s adoption in academics and business is skyrocketing, and consequently the rate of new features and tools within R is rapidly increasing. These are some of the new technologies and packages that excite us the most:
Once you have some experience with R, a great way to level up your R skillset is the free book Advanced R by Hadley Wickham. In addition, you can start practicing your R skills by competing with fellow Data Science Enthusiasts on Kaggle, an online platform for data-mining and predictive modelling competitions. Here you have the opportunity to work on fun cases such as this titanic data set.
To end, you are now probably ready to start contributing to R yourself by writing your own packages. Enjoy!