Subscribe to DSC Newsletter

Exclusive Tutorial on Data Manipulation with R (50 Examples)

It's a complete tutorial on data wrangling or manipulation with R. This tutorial covers one of the most powerful R package for data wrangling i.e. dplyr. This package was written by the most popular R programmer Hadley Wickham who has written many useful R packages such as ggplot2, tidyr etc. It's one of the most popular R package as of date. This post includes several examples and tips of how to use dply package for cleaning and transforming data.

dplyr vs. Base R Functions

dplyr functions process faster than base R functions. It is because dplyr functions were written in a computationally efficient manner. They are also more stable in the syntax and better supports data frames than vectors.

SQL Queries vs. dplyr
People have been utilizing SQL for analyzing data for decades. Every modern data analysis software such as Python, R, SAS etc supports SQL commands. But SQL was never designed to perform data analysis. It was rather designed for querying and managing data. There are many data analysis operations where SQL fails or makes simple things difficult. For example, calculating median for multiple variables, converting wide format data to long format etc. Whereas, dplyr package was designed to do data analysis.
dplyr Function Description Equivalent SQL
select() Selecting columns (variables) SELECT
filter() Filter (subset) rows. WHERE
group_by() Group the data GROUP BY
summarise() Summarise (or aggregate) data -
arrange() Sort the data ORDER BY
join() Joining data frames (tables) JOIN
mutate() Creating New Variables COLUMN ALIAS
Example 1 : Selecting Random N Rows

The sample_n function selects random rows from a data frame (or table). The second parameter of the function tells R the number of rows to select.
Example 2 : Selecting Random Fraction of Rows

The sample_frac function returns randomly N% of rows. In the example below, it returns randomly 10% of rows.

Example 3 : Selecting Variables (or Columns)

Suppose you are asked to select only a few variables. The code below selects variables "Index", columns from "State" to "Y2008".

mydata2 = select(mydata, Index, State:Y2008)

Example 4 : Dropping Variables

The minus sign before a variable tells R to drop the variable.

mydata = select(mydata, -Index, -State)

The above code can also be written like :

mydata = select(mydata, -c(Index,State))

For Original Article , click here

Views: 9502


You need to be a member of Data Science Central to add comments!

Join Data Science Central


  • Add Videos
  • View All

© 2019   Data Science Central ®   Powered by

Badges  |  Report an Issue  |  Privacy Policy  |  Terms of Service