This tutorial covers data wrangling (data manipulation) in R with dplyr, one of the most powerful R packages for the task. dplyr was written by Hadley Wickham, a well-known R developer who has also written many other useful R packages such as ggplot2 and tidyr. It is among the most popular R packages to date. This post includes several examples and tips on how to use the dplyr package for cleaning and transforming data.

**dplyr vs. Base R Functions**

dplyr functions are often faster than their base R equivalents when working on data frames, largely because they were written with computational efficiency in mind. Their syntax is also more consistent, and they are designed to operate on data frames rather than on individual vectors.

**SQL Queries vs. dplyr**

dplyr Function | Description | Equivalent SQL |
---|---|---|
select() | Select columns (variables) | SELECT |
filter() | Filter (subset) rows | WHERE |
group_by() | Group the data | GROUP BY |
summarise() | Summarise (or aggregate) data | Aggregate functions (e.g. SUM, AVG) |
arrange() | Sort the data | ORDER BY |
inner_join(), left_join(), etc. | Join data frames (tables) | JOIN |
mutate() | Create new variables | Computed column with an alias (AS) |
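To make the mapping above concrete, here is a minimal sketch that chains several of these verbs into the rough equivalent of a single SQL query. The data frame and its values are made up for illustration; only the column names `Index`, `State` and `Y2008` are borrowed from the examples in this post.

```r
library(dplyr)

# Hypothetical toy data; the values are invented for illustration
mydata <- data.frame(
  Index = c("A", "A", "C", "C"),
  State = c("Alabama", "Alaska", "California", "Colorado"),
  Y2008 = c(100, 200, 300, 400)
)

# Roughly equivalent to:
#   SELECT Index, AVG(Y2008) AS mean_2008
#   FROM mydata
#   WHERE Y2008 > 150
#   GROUP BY Index
#   ORDER BY Index
result <- mydata %>%
  filter(Y2008 > 150) %>%             # WHERE
  group_by(Index) %>%                 # GROUP BY
  summarise(mean_2008 = mean(Y2008)) %>%  # aggregate function
  arrange(Index)                      # ORDER BY

print(result)
```

Each verb takes a data frame and returns a data frame, which is what allows the steps to be chained with the `%>%` pipe.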


**Example 1 : Selecting Random N Rows**

The **sample_n** function selects random rows from a data frame (or table). Its second argument is the number of rows to select.

sample_n(mydata,3)
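As a self-contained sketch, the call can be tried on a small stand-in for `mydata`. The data frame below is hypothetical, and `set.seed()` is added only so that the random draw is reproducible.

```r
library(dplyr)

# Hypothetical stand-in for mydata; the values are invented
mydata <- data.frame(
  Index = c("A", "A", "A", "C", "C"),
  State = c("Alabama", "Alaska", "Arizona", "California", "Colorado"),
  Y2008 = c(1, 2, 3, 4, 5)
)

set.seed(42)                   # fix the seed so the sample is reproducible
picked <- sample_n(mydata, 3)  # 3 rows drawn without replacement
print(picked)
```

In dplyr 1.0.0 and later, `slice_sample(mydata, n = 3)` is the recommended replacement for `sample_n()`, which remains available.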


**Example 2 : Selecting Random Fraction of Rows**

The **sample_frac** function returns a random fraction of the rows. In the example below, it returns a random 10% of the rows.

sample_frac(mydata,0.1)
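A minimal sketch of the same call on a hypothetical 20-row data frame, where 10% corresponds to 2 rows (the column names `id` and `value` are made up for this example):

```r
library(dplyr)

# Hypothetical 20-row data frame; column names and values are invented
mydata <- data.frame(id = 1:20, value = (1:20) * 10)

set.seed(1)                          # reproducible draw
tenth <- sample_frac(mydata, 0.1)    # 10% of 20 rows
nrow(tenth)
```

The fraction is applied to the number of rows, so the size of the result changes with the size of the input.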

**Example 3 : Selecting Variables (or Columns)**

Suppose you are asked to select only a few variables. The code below selects the variable “Index” and all columns from “State” through “Y2008”.

mydata2 = select(mydata, Index, State:Y2008)
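The range syntax `State:Y2008` depends on the order of the columns in the data frame, which a small sketch makes visible. The data frame below is a hypothetical stand-in with invented values; only its column names follow the example above.

```r
library(dplyr)

# Hypothetical stand-in for mydata; the values are invented
mydata <- data.frame(
  Index = c("A", "C"),
  State = c("Alabama", "California"),
  Y2007 = c(10, 20),
  Y2008 = c(11, 21),
  Y2009 = c(12, 22)
)

# Keep Index plus every column positioned from State through Y2008
mydata2 <- select(mydata, Index, State:Y2008)
names(mydata2)
```

`Y2009` is dropped because it sits to the right of `Y2008`, the end of the range.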

**Example 4 : Dropping Variables**

The **minus sign** before a variable tells R to drop the variable.

mydata = select(mydata, -Index, -State)

The same result can also be written as:

mydata = select(mydata, -c(Index,State))
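The sketch below checks, on a hypothetical data frame with invented values, that the two forms drop the same columns. The results are stored under new names (`dropped_a`, `dropped_b`, both illustrative) so that the original data frame is not overwritten.

```r
library(dplyr)

# Hypothetical stand-in for mydata; the values are invented
mydata <- data.frame(
  Index = c("A", "C"),
  State = c("Alabama", "California"),
  Y2007 = c(10, 20),
  Y2008 = c(11, 21)
)

dropped_a <- select(mydata, -Index, -State)     # one minus sign per column
dropped_b <- select(mydata, -c(Index, State))   # minus applied to a vector of columns

names(dropped_a)
names(dropped_b)
```

Assigning the result back to `mydata`, as in the examples above, permanently removes the columns from that object, so running the second form after the first would fail because the columns no longer exist.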
