This is a complete tutorial on data wrangling (manipulation) with R. It covers dplyr, one of the most powerful R packages for data wrangling. The package was written by Hadley Wickham, one of the most popular R programmers, who has also written many other useful R packages such as ggplot2 and tidyr. dplyr is one of the most popular R packages to date. This post includes several examples and tips on how to use the dplyr package for cleaning and transforming data.
dplyr vs. Base R Functions
dplyr functions are often faster than their base R equivalents, because they were written in a computationally efficient manner, with much of the heavy lifting done in compiled code. Their syntax is also more consistent, and they are designed to work with data frames rather than vectors.
|dplyr Function||Description||Equivalent SQL|
|select()||Select columns (variables)||SELECT|
|filter()||Filter (subset) rows||WHERE|
|group_by()||Group the data||GROUP BY|
|summarise()||Summarise (or aggregate) data||–|
|arrange()||Sort the data||ORDER BY|
|inner_join(), left_join(), etc.||Join data frames (tables)||JOIN|
|mutate()||Create new variables||COLUMN ALIAS|
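To make the mapping in the table concrete, here is a minimal sketch that chains several of these verbs. The `sales` data frame and its columns `region` and `amount` are hypothetical and are not part of this tutorial's dataset; the SQL in the comment is only a rough equivalent.

```r
library(dplyr)

# Hypothetical toy data, made up for illustration only.
sales <- data.frame(
  region = c("East", "East", "West", "West", "West"),
  amount = c(10, 20, 5, 15, 40),
  stringsAsFactors = FALSE
)

# Roughly equivalent SQL:
#   SELECT region, SUM(amount) AS total
#   FROM sales WHERE amount > 5
#   GROUP BY region ORDER BY total DESC
result <- sales %>%
  filter(amount > 5) %>%            # WHERE: keep rows with amount above 5
  group_by(region) %>%              # GROUP BY: one group per region
  summarise(total = sum(amount)) %>%  # aggregate each group to a single row
  arrange(desc(total))              # ORDER BY: largest total first

print(result)
```

Each verb takes a data frame as its first argument and returns a data frame, which is why the steps can be chained with the `%>%` pipe.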
The sample_n function selects random rows from a data frame (or table). The second parameter of the function tells R the number of rows to select.
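As a small sketch of the call described above, the following draws 3 random rows. The `mydata` data frame built here is a hypothetical stand-in for the tutorial's dataset, and `set.seed()` is added only so that the draw is reproducible.

```r
library(dplyr)

# Hypothetical stand-in for the tutorial's mydata (10 rows).
mydata <- data.frame(
  Index = LETTERS[1:10],
  Y2008 = 1:10,
  stringsAsFactors = FALSE
)

set.seed(123)                   # make the random draw reproducible
picked <- sample_n(mydata, 3)   # second argument: number of rows to draw

nrow(picked)                    # number of rows that were selected
```

In dplyr 1.0.0 and later, `sample_n()` is marked as superseded by `slice_sample(n = ...)`, although it still works.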
The sample_frac function returns a random fraction of the rows. With a fraction of 0.1, for instance, it returns a random 10% of rows.
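A minimal sketch of a 10% sample might look like this. The `mydata` data frame here is again a hypothetical stand-in with 20 rows, so a fraction of 0.1 yields 2 rows.

```r
library(dplyr)

# Hypothetical stand-in for the tutorial's mydata (20 rows).
mydata <- data.frame(
  Index = 1:20,
  Y2008 = seq(100, 290, by = 10)
)

set.seed(123)                        # make the random draw reproducible
picked <- sample_frac(mydata, 0.1)   # second argument: fraction of rows to draw

nrow(picked)                         # 10% of 20 rows
```

In dplyr 1.0.0 and later, the superseding form is `slice_sample(prop = 0.1)`.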
Example 3 : Selecting Variables (or Columns)
Suppose you are asked to select only a few variables. The code below selects the variable “Index” and all columns from “State” to “Y2008”.
mydata2 = select(mydata, Index, State:Y2008)
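To see what the range operator `:` does inside `select()`, here is a self-contained sketch. The column names follow the tutorial, but the data frame itself (including the extra columns `Y2002` and `Y2010`) is a made-up assumption.

```r
library(dplyr)

# Hypothetical stand-in for the tutorial's mydata.
mydata <- data.frame(
  Index = c("A", "C"),
  State = c("Alabama", "California"),
  Y2002 = c(1, 2),
  Y2008 = c(3, 4),
  Y2010 = c(5, 6),
  stringsAsFactors = FALSE
)

# Index is picked by name; State:Y2008 picks every column
# positioned from State through Y2008, inclusive.
mydata2 <- select(mydata, Index, State:Y2008)

names(mydata2)
```

The range is positional, so which columns it captures depends on the column order in the data frame.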
Example 4 : Dropping Variables
The minus sign before a variable tells R to drop the variable.
mydata = select(mydata, -Index, -State)
The above code can also be written as:
mydata = select(mydata, -c(Index,State))
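The sketch below checks that the two forms drop the same columns. It uses a hypothetical toy `mydata` and stores each result under a new name (`dropped1`, `dropped2`, both illustrative), because overwriting `mydata` with the first form would remove the columns that the second form refers to.

```r
library(dplyr)

# Hypothetical stand-in for the tutorial's mydata.
mydata <- data.frame(
  Index = c("A", "C"),
  State = c("Alabama", "California"),
  Y2002 = c(1, 2),
  Y2008 = c(3, 4),
  stringsAsFactors = FALSE
)

dropped1 <- select(mydata, -Index, -State)     # one minus sign per column
dropped2 <- select(mydata, -c(Index, State))   # minus applied to a vector of columns

names(dropped1)
names(dropped2)
```

Both calls leave only the columns that were not named, in their original order.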