Manipulating and processing data in R
Data structures are the way data is represented in data analytics. One of the most important aspects of computing with data in R is its ability to manipulate data and so enable subsequent analysis and visualization. Let us look at a few basic data structures in R:
Vectors: These are ordered containers of primitive elements and are used for one-dimensional data.
Types: integer, numeric, logical, character, complex
Matrices and arrays: These are rectangular collections of elements and are useful when all the data is of a single class, such as numeric or character.
Dimensions: two, three, etc.
Lists: These are ordered containers for arbitrary elements and are used for more complex data, such as the customer information of an organization. When data cannot be represented as an array or a data frame, a list is the best choice. This is because lists can contain all kinds of other objects, including other lists or data frames, and in that sense they are very flexible.
Data frames: These are two-dimensional containers for records and variables and are used for representing data from spreadsheets and similar sources. A data frame is similar to a single table in a database.
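As a minimal sketch of the four structures described above, the snippet below creates one small object of each kind. All object names and values are made up for illustration.

```r
# A vector: one-dimensional, all elements of one primitive type
v <- c(2.5, 3.1, 4.8)

# A matrix: a two-dimensional array whose elements share one class
m <- matrix(1:6, nrow = 2, ncol = 3)

# A list: an ordered container that can mix types, even other lists
lst <- list(name = "Acme", orders = c(10, 12), active = TRUE)

# A data frame: records in rows, variables in columns
df <- data.frame(id = 1:3, score = v)

str(df)
```

Calling str() on any of these objects is a quick way to inspect its structure.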
Creating Subsets of Data in R
As we know, data volumes keep growing, and analyzing a complete data set can be very time-consuming. The data is therefore divided into smaller samples, and the analysis is done on those samples. The process of creating samples is called subsetting.
Different methods of subsetting in R are:
The dollar sign operator ($) selects a single element of data. When you use this operator with a data frame, the result is always a vector.
Similar to $, the double square brackets operator ([[ ]]) also returns a single element, but it offers the flexibility of referring to the elements by position as well as by name. It can be used for data frames and lists.
The single square bracket operator ([ ]) returns multiple elements of data. The index within the square brackets can be a numeric vector, a logical vector, or a character vector.
For example, to retrieve the first 5 rows and all columns of the built-in data set iris, use the command below:
> iris[1:5, ]
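The three operators can be compared side by side on the built-in iris data set. This is only a sketch; the variable names a, b, c1 and d are introduced here for illustration.

```r
# $ returns one column of the data frame as a vector
a <- iris$Sepal.Length

# [[ ]] also returns one column, selected by name or by position
b <- iris[["Sepal.Length"]]
c1 <- iris[[1]]

# [ ] can return several rows and columns at once
d <- iris[1:5, c("Sepal.Length", "Species")]

identical(a, b)  # compare the two ways of selecting the same column
dim(d)
```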
As we have seen, samples are created from data for analysis. To create a sample, use the sample() command and specify the number of values to be drawn.
For example, to simulate 10 throws of a die, use the command below:
> sample(1:6, 10, replace=TRUE)
It gives output such as:
[1] 2 2 5 3 5 3 5 6 3 5
The sample() command draws different random values each time it is called, which makes test code hard to reproduce. If you set a seed value first, sample() produces the same sequence of values on every run.
The seed value is the starting point for the random number generator. It defines both the initial state of the generator and the sequence of values that follows from it.
Let us see how a seed value is used.
> set.seed(1)  # set the seed value for the sample() command
> sample(1:6, 10, replace=TRUE)
This gives output as below (the exact values depend on the R version, because R 3.6.0 changed the default sampling method):
[1] 2 3 4 6 2 6 6 4 4 1
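The effect of the seed can be checked directly. The sketch below draws the same sample twice after setting the same seed; the seed 42 and the names first and second are arbitrary choices made for this illustration.

```r
set.seed(42)  # 42 is an arbitrary seed chosen for this sketch
first <- sample(1:6, 10, replace = TRUE)

set.seed(42)  # resetting the same seed restarts the same sequence
second <- sample(1:6, 10, replace = TRUE)

identical(first, second)  # TRUE: both draws are the same
```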
Let us now see a few applications of subsetting data in R:
The command below shows how to find duplicate data. The duplicated() function finds duplicate values and returns a logical vector that tells you whether each value is a duplicate of a previous value.
> duplicated(c(1,2,1,3,1,4))
This gives output as below:
[1] FALSE FALSE TRUE FALSE TRUE FALSE
TRUE is returned for every value that duplicates an earlier one.
During analysis, any row with missing data can be identified and removed as below:
The complete.cases() command in R is used to find rows which are complete. It gives a logical vector with the value TRUE for rows that are complete, and FALSE for rows that have some NA values.
Rows which have NA values can be removed using the na.omit() function as below:
> row_name <- na.omit(file_name)
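Both functions can be tried on a small data frame. The data frame df below is hypothetical and built only for this sketch, with missing values placed in rows 2 and 3.

```r
# Hypothetical data frame with missing values in rows 2 and 3
df <- data.frame(x = c(1, NA, 3, 4), y = c("a", "b", NA, "d"))

complete.cases(df)  # TRUE for rows without any NA

clean <- na.omit(df)  # keep only the complete rows
nrow(clean)
```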
After you have created the appropriate subset of your data, the next step in your analysis is to perform some calculations. R makes it easy to perform calculations on columns of a data frame because each column is itself a vector.
Let us see data manipulation in R with the help of an example:
Let us calculate the ratio between the length and the width of the sepals.
The command for this is:
> x <- iris$Sepal.Length / iris$Sepal.Width
> head(x)  # display the first six elements of the result
It gives the output as:
[1] 1.457143 1.633333 1.468750 1.483871 1.388889 1.384615
Let us discuss some variations of the operations performed on data frames in R.
To reduce the amount of typing and make code more readable, the with() command is used as below:
> y <- with(iris, Sepal.Length / Sepal.Width)  # calculate the ratio between the length and width of the sepals using with()
> head(y)
This gives the same output as above with less typing.
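Let us now see the use of the within() function for the same task:
> iris <- within(iris, ratio <- Sepal.Length / Sepal.Width)
The with() function allows you to refer to columns inside a data frame without explicitly using the dollar sign or even the name of the data frame itself.
The two functions are closely related but return different things: with() returns the value of the expression, whereas within() returns a modified copy of the data frame, here with the new column ratio added.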
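The following sketch shows what each function returns when given the same expression. The name iris2 is introduced here so that the built-in iris data set is left unchanged.

```r
# with() evaluates the expression and returns its value (a vector here)
y <- with(iris, Sepal.Length / Sepal.Width)

# within() returns a copy of the data frame with the new column added
iris2 <- within(iris, ratio <- Sepal.Length / Sepal.Width)

class(y)
names(iris2)
all.equal(y, iris2$ratio)
```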
Statisticians often draw histograms to investigate their data. Because this kind of grouping is common in statistics, R has some functions for it.
The cut() function groups values of a variable into larger bins. It creates bins of equal size and classifies each element into its appropriate bin.
Let us see how cut() works in R with an example:
> cut(frost, 3, include.lowest=TRUE)
This gives the result as a factor with three levels.
The cut() function creates mathematical labels for the bins. The label names can also be provided by the user.
Let us see this with the help of an example:
> cut(frost, 3, include.lowest=TRUE, labels=c("Low", "Med", "High"))
The result shows three labels in the output.
To count the number of observations in each level of the factor, the table() command can be used as below:
> x <- cut(frost, 3, include.lowest=TRUE, labels=c("Low", "Med", "High"))
> table(x)
The result shows the output as a table containing the number of elements in each factor.
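The commands above use a vector named frost that the text does not define. The sketch below assumes, purely for illustration, that frost holds the Frost column of the built-in state.x77 matrix, so that the steps can be run end to end.

```r
# Assumption: frost holds the Frost column of the built-in state.x77 matrix
frost <- state.x77[, "Frost"]

# Three equal-width bins with user-supplied labels
x <- cut(frost, 3, include.lowest = TRUE, labels = c("Low", "Med", "High"))

levels(x)
table(x)  # number of states in each bin
```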
If you want to combine data from different sources in R, you can combine different sets of data in three ways:
Adding columns: If the two sets of data have an equal set of rows, and the order of the rows is identical, then adding columns makes sense. This can be done by using the data.frame() or cbind() function.
Adding rows: If both sets of data have the same columns and you want to add rows to the bottom, use rbind().
Combining data with different shapes: The merge() function combines data based on common columns as well as rows. In database language, this is usually called joining data.
The merge() function is useful for merging existing data. You can use merge() to combine data only when certain matching conditions are satisfied.
Let us see the use of the merge() function, which combines data frames, with an example:
> merge(cold.states, large.states)
This is the command to create a data frame that consists of states that are cold as well as large. The result has the columns Name, Frost and Area.
Let us see the different types of merge().
The merge() function allows four ways of combining data:
Natural join: To keep only rows that match from the data frames, specify the argument all=FALSE.
Full outer join: To keep all rows from both data frames, specify all=TRUE.
Left outer join: To include all the rows of your data frame x and only those from y that match, specify all.x=TRUE.
Right outer join: To include all the rows of your data frame y and only those from x that match, specify all.y=TRUE.
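The four variants can be compared on two small data frames. The data frames cold and large below are hypothetical stand-ins built for this sketch, and their values are for illustration only.

```r
# Hypothetical data frames; the values are for illustration only
cold <- data.frame(Name = c("Alaska", "Maine", "Nevada"),
                   Frost = c(152, 161, 188))
large <- data.frame(Name = c("Alaska", "Nevada", "Texas"),
                    Area = c(566432, 109889, 262134))

inner <- merge(cold, large)                # only matching rows
full  <- merge(cold, large, all = TRUE)    # all rows from both
left  <- merge(cold, large, all.x = TRUE)  # all rows of cold
right <- merge(cold, large, all.y = TRUE)  # all rows of large

sapply(list(inner = inner, full = full, left = left, right = right), nrow)
```

Rows without a partner in the other data frame are kept with NA in the columns that come from that data frame.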
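The merge() function takes a large number of arguments, including x and y (the data frames to combine), by, by.x and by.y (the columns to match on), and all, all.x and all.y (which unmatched rows to keep).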
The R match() function returns the matching positions of two vectors or, more specifically, the positions of the first matches of one vector in the second vector.
> index <- match(cold.states$Name, large.states$Name)
This is the command to search for large states that also occur in the data frame cold.states.
> index
It gives output as:
[1] 1 4 NA NA 5 6 NA NA NA NA NA
An NA means that the corresponding state in cold.states does not occur in large.states.
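A smaller sketch makes the return value of match() easier to read. The vectors cold and large below are hypothetical and stand in for the Name columns used above.

```r
# Hypothetical vectors of state names for illustration
cold <- c("Alaska", "Maine", "Nevada")
large <- c("Texas", "Alaska", "Nevada")

index <- match(cold, large)
index  # position in large of each element of cold; NA if absent

# Drop the NAs to keep only the matching entries of large
large[index[!is.na(index)]]
```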
A common task in data analysis and reporting is sorting information. You can answer many everyday questions with sorted tables of data that tell you the best or worst of specific things; for example, parents want to know which school in their area is the best, and businesses need to know the most productive factories or the most lucrative sales areas.
Let us first create a data frame and then sort it.
> some.states <- data.frame(Region = state.region, state.x77)
This is the command to create the data frame some.states.
> some.states <- some.states[1:10, 1:3]
This creates a subset of it containing the first 10 rows and the first 3 columns.
By default, sorting is done in ascending order.
> sort(some.states$Population)  # sort Population in ascending order
> sort(some.states$Population, decreasing=TRUE)  # sort Population in descending order
This is how sorting of data can be done in R.
Data frames can also be sorted as below:
> order.pop <- order(some.states$Population)
The command above returns the order of the elements, that is, the row positions of some.states arranged by increasing Population.
Now, to sort the data frame in ascending order, the command below is used:
> some.states[order.pop, ]
To sort in descending order, compute the order as below and use it as the row index in the same way:
> order(some.states$Population, decreasing=TRUE)
This is how the order() and sort() functions are used.
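The sorting steps above can be put together in one short sketch. The names by.pop and by.pop.desc are introduced here for illustration.

```r
some.states <- data.frame(Region = state.region, state.x77)[1:10, 1:3]

# sort() returns the sorted values themselves
sort(some.states$Population)

# order() returns row positions, which can then index the data frame
by.pop <- some.states[order(some.states$Population), ]
by.pop.desc <- some.states[order(some.states$Population, decreasing = TRUE), ]

head(by.pop, 3)
```

The difference is that sort() works on a single vector, while order() lets every column of the data frame follow the same ordering.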
To traverse data, R uses the apply family of functions. The output of each function depends on the data structure being traversed.
The apply() function traverses either the rows or columns of a matrix, applies a function to each resulting vector, and returns a vector of summarized results.
The lapply() function traverses a list and applies a function to each element. It returns a list of the same length as X, each element of which is the result of applying FUN to the corresponding element of X. Sometimes it is possible to simplify the resulting list into a matrix or vector.
The apply() function is used as below:
apply(X, MARGIN, FUN, ...)
The apply() function takes four arguments: X, the matrix or array to traverse; MARGIN, the dimension to traverse; FUN, the function to apply; and ..., any further arguments to pass to FUN.
In essence, the apply() function allows us to make entry-by-entry changes to data frames and matrices. If MARGIN=1, the function accepts each row of X as a vector argument, and returns a vector of the results. Similarly, if MARGIN=2 the function acts on the columns of X. Most impressively, when MARGIN=c(1,2) the function is applied to every entry of X.
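Let us now discuss the variations of the apply() function:
lapply(): We have already seen it above.
sapply(): It works on a list or vector and returns a vector when the results can be simplified.
tapply(): It is used to create tabular summaries of data. This function takes three arguments: X, the vector to summarize; INDEX, a factor that groups the elements of X; and FUN, the function to apply to each group.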
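A short sketch of the three variants follows. The list scores is hypothetical, and the tapply() call uses the built-in iris data set.

```r
scores <- list(a = c(1, 2, 3), b = c(10, 20))  # hypothetical list

as.list.result <- lapply(scores, mean)  # a list with one mean per element
as.vector.result <- sapply(scores, mean)  # the same results as a named vector

# tapply(): mean sepal length for each species in iris
by.species <- tapply(iris$Sepal.Length, iris$Species, mean)

as.vector.result
by.species
```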
An illustrative example
Consider the code below:
# Create the matrix
m <- matrix(c(seq(from=-98, to=100, by=2)), nrow=10, ncol=10)
# Return the product of each of the rows
apply(m, 1, prod)
# Return the sum of each of the columns
apply(m, 2, sum)
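The third case, MARGIN=c(1,2), can be sketched in the same way. The block below builds its own 10 by 10 matrix, and the halving function is an arbitrary choice made for illustration.

```r
m <- matrix(seq(from = -98, to = 100, by = 2), nrow = 10, ncol = 10)

# MARGIN = c(1, 2) applies the function to every entry of m
halved <- apply(m, c(1, 2), function(x) x / 2)

dim(halved)
halved[1, 1]
```

For simple arithmetic like this, m / 2 gives the same result directly; apply() with MARGIN=c(1,2) is mainly useful when the function is not already vectorized.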
The R formula interface allows you to concisely specify which columns to use when fitting a model, as well as the behavior of the model.
You need the formula operators when you start building models. Formula notation refers to statistical formulae, as opposed to mathematical formulae. For example, the formula operator + means to include a column, not to mathematically add two columns together.
Operator  Example    Meaning
~         y ~ x      Model y as a function of x
+         y ~ a + b  Include columns a as well as b
-         y ~ a - b  Include a but exclude b
:         y ~ a : b  Estimate the interaction of a and b
*         y ~ a * b  Include columns as well as their interaction (that is, y ~ a + b + a:b)
|         y ~ a | b  Estimate y as a function of a conditional on b
The table above shows the meanings of the different operators in the formula interface.
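As a hedged illustration of the + and * operators, the sketch below fits two linear models with lm() on the built-in iris data set. The choice of lm() and of these columns is an assumption made for the example, not something the text prescribes.

```r
# + includes both columns as predictors
fit.add <- lm(Sepal.Length ~ Sepal.Width + Petal.Length, data = iris)

# * includes both columns and their interaction
fit.int <- lm(Sepal.Length ~ Sepal.Width * Petal.Length, data = iris)

names(coef(fit.add))
names(coef(fit.int))
```

The second model has one extra coefficient, for the interaction term Sepal.Width:Petal.Length.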
When reshaping data, variables fall into two types:
Identifier variables: Identifier, or ID, variables act as the keys that identify the observations.
Measured variables: These represent the measurements that are observed.
Base R has a function, reshape(), that works fine for reshaping longitudinal data.
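A minimal sketch of reshape() converting wide data to long format is shown below. The data frame wide and its column names are hypothetical; the x.1 and x.2 naming is chosen so that reshape() can split the names at the "." separator.

```r
# Hypothetical wide data: one row per subject, one column per visit
wide <- data.frame(id = 1:2, x.1 = c(5, 6), x.2 = c(7, 8))

# Wide to long: x.1 and x.2 are stacked into a single column x,
# with a time column recording which visit each value came from
long <- reshape(wide, direction = "long",
                varying = c("x.1", "x.2"), idvar = "id", sep = ".")

long
```

Here id is the identifier variable and x is the measured variable.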
The problem of data reshaping is far more generic than simply dealing with longitudinal data. For this reason the reshape2 package was released, which contains several functions to convert data between long and wide formats.
> install.packages("reshape2")  # install the reshape2 package
> library("reshape2")  # load the reshape2 package
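The reshape2 package is based on two key functions: melt(), which converts wide-format data to long format, and dcast(), which converts long-format data back to wide format.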