This is a complete tutorial on data wrangling (data manipulation) with R. It covers dplyr, one of the most powerful R packages for data wrangling. The package was written by Hadley Wickham, a prominent R programmer who has also written many other useful R packages such as ggplot2 and tidyr. It is one of the most popular R packages to date. This post includes several examples and tips on how to use the dplyr package for cleaning and transforming data.

**dplyr vs. Base R Functions**

dplyr functions are often faster than their base R equivalents because they were written in a computationally efficient manner. Their syntax is also more consistent, and they are designed to work with data frames rather than vectors.
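To make the syntax difference concrete, here is a minimal sketch that performs the same row and column subset in base R and in dplyr. The data frame `mydata` and its columns are hypothetical stand-ins, not the tutorial's actual data set, and the sketch compares syntax only, not speed.

```r
library(dplyr)

# Hypothetical toy data (assumed column names, not the tutorial's data)
mydata <- data.frame(
  Index = c("A", "A", "C", "C", "F"),
  State = c("Alabama", "Alaska", "California", "Colorado", "Florida"),
  Y2002 = c(10, 20, 30, 40, 50),
  stringsAsFactors = FALSE
)

# Base R: logical indexing for rows, character vector for columns
base_result <- mydata[mydata$Index == "A", c("State", "Y2002")]

# dplyr: the same operation written with the filter() and select() verbs
dplyr_result <- select(filter(mydata, Index == "A"), State, Y2002)
```

Both results contain the same two rows; the dplyr version names each step (filter rows, then select columns) instead of packing both into one bracket expression.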

People have been using SQL to analyze data for decades. Most modern data analysis tools, such as Python, R and SAS, support SQL commands. But SQL was never designed for data analysis; it was designed for querying and managing data. There are many data analysis operations where SQL fails or makes simple things difficult, for example calculating the median of multiple variables or converting wide-format data to long format. The dplyr package, in contrast, was designed for data analysis.
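As an illustration of the median case mentioned above, the following sketch computes the median of several columns in one call. It assumes dplyr 1.0.0 or later (for `across()`), and the data frame and column names are hypothetical.

```r
library(dplyr)

# Hypothetical toy data (assumed column names)
mydata <- data.frame(
  State = c("Alabama", "Alaska", "California", "Colorado", "Florida"),
  Y2002 = c(10, 20, 30, 40, 50),
  Y2003 = c(15, 25, 35, 45, 55)
)

# Apply median() to each listed column; the result is a one-row data frame
medians <- summarise(mydata, across(c(Y2002, Y2003), median))
```

Adding more variables only means extending the column list inside `across()`.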

dplyr Function | Description | Equivalent SQL |
---|---|---|
select() | Selecting columns (variables) | SELECT |
filter() | Filter (subset) rows | WHERE |
group_by() | Group the data | GROUP BY |
summarise() | Summarise (or aggregate) data | - |
arrange() | Sort the data | ORDER BY |
inner_join(), left_join() etc. | Joining data frames (tables) | JOIN |
mutate() | Creating new variables | COLUMN ALIAS |
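
The verbs in the table can be chained into a single pipeline, much like the clauses of one SQL query. The sketch below is only an illustration: the data frame, its columns, and the derived names `Total` and `MeanTotal` are hypothetical.

```r
library(dplyr)

# Hypothetical toy data (assumed column names)
mydata <- data.frame(
  Index = c("A", "A", "C", "C", "F"),
  State = c("Alabama", "Alaska", "California", "Colorado", "Florida"),
  Y2002 = c(10, 20, 30, 40, 50),
  Y2003 = c(15, 25, 35, 45, 55),
  stringsAsFactors = FALSE
)

result <- mydata %>%
  filter(Y2002 >= 20) %>%                 # like WHERE
  mutate(Total = Y2002 + Y2003) %>%       # new computed column
  group_by(Index) %>%                     # like GROUP BY
  summarise(MeanTotal = mean(Total)) %>%  # aggregate within each group
  arrange(desc(MeanTotal))                # like ORDER BY ... DESC
```

Each step receives the output of the previous one, so the pipeline reads top to bottom in the order the operations are applied.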

sample_n(mydata,3)

sample_frac(mydata,0.1)
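The two calls above draw random rows from a data frame: `sample_n()` returns a fixed number of rows and `sample_frac()` returns a fraction of them. A minimal runnable sketch, using a hypothetical 10-row `mydata` built from R's built-in `state.name` vector:

```r
library(dplyr)

# Hypothetical toy data with 10 rows (assumed column names)
mydata <- data.frame(
  State = state.name[1:10],
  Y2002 = seq(10, 100, by = 10),
  stringsAsFactors = FALSE
)

set.seed(42)  # make the random draw reproducible

three_rows <- sample_n(mydata, 3)       # 3 randomly chosen rows
tenth_rows <- sample_frac(mydata, 0.1)  # 10% of 10 rows, i.e. 1 row
```

In dplyr 1.0.0 and later these functions are superseded by `slice_sample(mydata, n = 3)` and `slice_sample(mydata, prop = 0.1)`, although the older forms still work.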

**Example 3 : Selecting Variables (or Columns)**

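Suppose you are asked to select only a few variables. The call `select(mydata, Index, State:Y2008)` selects the variable "Index" along with all columns from "State" to "Y2008"; the colon operator picks a range of adjacent columns.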
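A small runnable sketch of this selection, assuming a hypothetical `mydata` whose columns run past Y2008 so the effect of the range is visible:

```r
library(dplyr)

# Hypothetical toy data (assumed column names)
mydata <- data.frame(
  Index = c("A", "C", "F"),
  State = c("Alabama", "California", "Florida"),
  Y2007 = c(1, 2, 3),
  Y2008 = c(4, 5, 6),
  Y2009 = c(7, 8, 9),
  stringsAsFactors = FALSE
)

# Keep Index, plus every column from State through Y2008 (by position)
mydata2 <- select(mydata, Index, State:Y2008)

names(mydata2)
```

The range `State:Y2008` follows the column order in the data frame, so Y2009 is left out because it sits after Y2008.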

**Example 4 : Dropping Variables**

The **minus sign** before a variable tells R to drop the variable. For example, `select(mydata, -Index, -State)` drops both "Index" and "State".

The same can also be written as:

mydata = select(mydata, -c(Index,State))
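A quick check that both ways of dropping variables give the same result, using a hypothetical `mydata`:

```r
library(dplyr)

# Hypothetical toy data (assumed column names)
mydata <- data.frame(
  Index = c("A", "C", "F"),
  State = c("Alabama", "California", "Florida"),
  Y2007 = c(1, 2, 3),
  Y2008 = c(4, 5, 6),
  stringsAsFactors = FALSE
)

# Drop the variables one by one
dropped_a <- select(mydata, -Index, -State)

# Drop them together by negating a vector of names
dropped_b <- select(mydata, -c(Index, State))

same_result <- identical(dropped_a, dropped_b)
```

Assigning to new names, rather than overwriting `mydata`, keeps the original data frame available for later steps.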

