This is a complete tutorial on data wrangling (or manipulation) with R. It covers dplyr, one of the most powerful R packages for data wrangling. The package was written by Hadley Wickham, a well-known R programmer who has also written many other useful R packages such as ggplot2 and tidyr. It is one of the most popular R packages to date. This post includes several examples and tips on how to use the dplyr package for cleaning and transforming data.
dplyr vs. Base R Functions
dplyr functions generally run faster than their base R equivalents because they were written in a computationally efficient manner. Their syntax is also more consistent, and they are designed to work with data frames rather than vectors.
| dplyr Function | Description | Equivalent SQL | 
|---|---|---|
| select() | Select columns (variables) | SELECT |
| filter() | Filter (subset) rows | WHERE |
| group_by() | Group the data | GROUP BY |
| summarise() | Summarise (or aggregate) data | – |
| arrange() | Sort the data | ORDER BY |
| inner_join(), left_join(), etc. | Join data frames (tables) | JOIN |
| mutate() | Create new variables | COLUMN ALIAS |
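To make the mapping in the table concrete, here is a minimal sketch that chains the verbs in one pipeline. It uses the built-in mtcars dataset purely as a stand-in (it is not the data used in this tutorial), and the SQL keywords in the comments are only rough analogies.

```r
library(dplyr)

# mtcars ships with R; it is used here only as an illustrative stand-in
result <- mtcars %>%
  select(mpg, cyl, wt) %>%                    # SELECT mpg, cyl, wt
  filter(wt > 2.5) %>%                        # WHERE wt > 2.5
  mutate(wt_kg = wt * 1000 * 0.4536) %>%      # new column (wt is in 1000 lbs)
  group_by(cyl) %>%                           # GROUP BY cyl
  summarise(mean_mpg = mean(mpg),             # one row per cyl group
            mean_wt_kg = mean(wt_kg)) %>%
  arrange(desc(mean_mpg))                     # ORDER BY mean_mpg DESC

print(result)
```

Each step takes a data frame and returns a data frame, which is why the verbs can be chained with the pipe.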
Example 1 : Selecting Random N Rows
The sample_n function selects random rows from a data frame (or table). The second parameter of the function tells R the number of rows to select.
sample_n(mydata,3)
Example 2 : Selecting Random Fraction of Rows
The sample_frac function returns a random fraction of the rows. In the example below, it returns a random 10% of the rows.
sample_frac(mydata,0.1)
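The two sampling calls above can be tried on a small data frame. The sketch below is only an illustration: the 20-row mydata is a hypothetical stand-in for the tutorial's data, and set.seed is added so that the random draw is repeatable.

```r
library(dplyr)

# Hypothetical stand-in for mydata: 20 rows, two columns
mydata <- data.frame(Index = LETTERS[1:20], Y2008 = 1:20)

set.seed(123)                            # fix the seed so the draw is repeatable

three_rows <- sample_n(mydata, 3)        # exactly 3 random rows
ten_pct    <- sample_frac(mydata, 0.1)   # 10% of 20 rows, i.e. 2 rows

nrow(three_rows)
nrow(ten_pct)
```

In dplyr 1.0.0 and later, sample_n() and sample_frac() are marked as superseded in favour of slice_sample(n = ...) and slice_sample(prop = ...), although the older functions still work.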
Example 3 : Selecting Variables (or Columns)
Suppose you are asked to select only a few variables. The code below selects the variable “Index” along with all columns from “State” to “Y2008”.
mydata2 = select(mydata, Index, State:Y2008)
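As a quick check of how the State:Y2008 range behaves, here is a small sketch. The data frame is a hypothetical stand-in that only borrows the column names from the example above; the values are made up for illustration.

```r
library(dplyr)

# Hypothetical stand-in for mydata; the values are invented
mydata <- data.frame(
  Index = c("A", "A", "C"),
  State = c("Alabama", "Alaska", "California"),
  Y2002 = c(100, 200, 300),
  Y2008 = c(110, 210, 310),
  Y2015 = c(120, 220, 320)
)

# Index, plus every column positioned from State through Y2008
mydata2 <- select(mydata, Index, State:Y2008)

names(mydata2)
# "Index" "State" "Y2002" "Y2008"
```

The colon selects by column position, so Y2002 is included because it sits between State and Y2008, while Y2015 is left out.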
Example 4 : Dropping Variables
The minus sign before a variable tells R to drop the variable.
mydata = select(mydata, -Index, -State)
The code above can also be written as:
mydata = select(mydata, -c(Index,State))
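The two ways of dropping columns can be compared side by side. This is a sketch on a hypothetical stand-in for mydata; the results are stored under new names (dropped_a, dropped_b, chosen here for illustration) so the original data frame is not overwritten.

```r
library(dplyr)

# Hypothetical stand-in for mydata; the values are invented
mydata <- data.frame(
  Index = c("A", "A", "C"),
  State = c("Alabama", "Alaska", "California"),
  Y2002 = c(100, 200, 300),
  Y2008 = c(110, 210, 310)
)

dropped_a <- select(mydata, -Index, -State)     # one minus sign per column
dropped_b <- select(mydata, -c(Index, State))   # minus applied to a vector of columns

names(dropped_a)
# "Y2002" "Y2008"
```

Both calls keep the same remaining columns. Assigning the result back to mydata, as in the example above, discards the dropped columns for the rest of the session.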
