In today’s tutorial I am going to teach on how to do basic data manipulation on a sample dataset. I will cater the data from US Census. US Census is one of the richest data source on internet. You can get great insights from American Community Survey about United States of America.
So let’s get started.
What we will do in this tutorial.
What we will learn in this tutorial
You have to first set your directory in R to your desired folder in your hard drive. Setting directory helps you get all the files at single click to your workspace.
Usually people like to import data using read.csv, but I like to import it through GUI.
setwd("H:/Tech Blogs/R Tutorial/ACS Manipulation")
However for the tutorial, let’s stick to R, so I will import via read.csv. I will put to acs_data.
acs_data <- read.csv("ACS_14_5YR_DP04_with_ann.csv")
Now since the data is imported, let’s have a look at it.
You can look at data both through GUI and console. The Above mentioned command is for GUI. Having a good look at data can itself give you great insights and help you understand what you want to do with it.
Let’s see how many rows and columns our data frame contains
##  567
##  3143
The data has 567 number of columns which means we have lots of variables. Dealing with such large files requires a lot of work. The other interesting thing about the data is that the columns really doesn’t have human understandable names. To our fortune, the other CSV file with metadata comes along with it which specifies all the column names description.
You can clearly see the column names which are not humanly understandable. So let’s give them a human face.
Let’s now load the metadata file which contains the column name description.
acs_desc <- read.csv("ACS_14_5YR_DP04_metadata.csv", header=F)
I have specificly included header = FALSE while loading this file into my R enviornment because I don’t want Geo.id and ID to be header name for columns. Make sure to study the significance of headers when working in R.
Let’s see how our acs_desc data frame looks like
## 1 GEO.id
## 2 GEO.id2
## 3 GEO.display-label
## 4 HC01_VC03
## 5 HC02_VC03
## 6 HC03_VC03
## 1 Id
## 2 Id2
## 3 Geography
## 4 Estimate; HOUSING OCCUPANCY - Total housing units
## 5 Margin of Error; HOUSING OCCUPANCY - Total housing units
## 6 Percent; HOUSING OCCUPANCY - Total housing units
You can now see that there’s a description for each variable we have in Acs_data file.
R is an amazing tool for data analysis, statistical inferences and data transformation. You wouldn’t believe the power of R. We can actually give a human face to our original file acs_data just by one command.
Let’s do it.
colnames(acs_data) <- acs_desc$V2
Now if you view your file, it would look like this.
We have proper description for each variable now. There are words instead of funny codes. The best thing about words are that they actually mean something :).
If you closely look at the data, you’d find that we have a lot of things hidden in the data. We have estimates, percentages and margin of errors but for now, we are only interested in looking estimates and that too with deep house mixes being played at our ears.
You can read about Regular expressions here.
Let’s first find how many data columns actually have estimates written in their headings. We will be using Regular expressions for that.
##  4 8 12 16 20 24 28 32 36 40 44 48 52 56 60 64 68
##  72 76 80 84 88 92 96 100 104 108 112 116 120 124 128 132 136
##  140 144 148 152 156 160 164 168 172 176 180 184 188 192 196 200 204
##  208 212 216 220 224 228 232 236 240 244 248 252 256 260 264 268 272
##  276 280 284 288 292 296 300 304 308 312 316 320 324 328 332 336 340
##  344 348 352 356 360 364 368 372 376 380 384 388 392 396 400 404 408
##  412 416 420 424 428 432 436 440 444 448 452 456 460 464 468 472 476
##  480 484 488 492 496 500 504 508 512 516 520 524 528 532 536 540 544
##  548 552 556 560 564
We can see all these columns have estimates in them but to play with these estimates, we need them to be in another data frame. So what we will do is, we will subset the data into another data frame. We will subset only the column names which have Estimate in them. Deep House mixes are cooL :)
Est <- acs_data[, grep('Estimate;', colnames (acs_data))]
Now if you view the dataset, we only have column names that had Estimate in them. The world is not perfect, there is no such thing as ideality. So we all encounter problems similar to what we have encountered now. If you look at original data file which is acs_data, you’d see that there’s a column named Geography which is not copied in estimates data frame. It happened because we asked the R to only copy columns with “Estimate” written in the header.
I think Geography is very important factor for any sort of analysis. So let’s now add geography from acs_data to Est.
What we will do is, we will copy the Geography from acs_data to Est.
Est$Geography <- acs_data$Geography
Now we have geography enjoined, I have this weird thought coming on check how many houses ran on Solar energy in USA. So we will repeat the similar regular expression as we did for subsetting estimates.
solar <- Est[, grep('Solar', colnames (Est))]
We have made a vector, let’s turn it into a dataframe with geography.
solar_data <- data.frame (acs_data$Geography, solar)
We shall here remove the first row because it contains the count for whole USA.
solar_data = solar_data[-1,]
Now we have a file which tells us for each county, how many houses we have which run on solar energy.
Let’s apply a delimiter and split the data in counties and states only. We would be able to do a better analysis on which state has more houses running on solar energy.
solar_data$county <- lapply(strsplit(as.character(solar_data$acs_data.Geography), "\\,"), "[", 1)
solar_data$state <- lapply(strsplit(as.character(solar_data$acs_data.Geography), "\\,"), "[", 2)
Now you can see, we have two new columns which are split by “,”
We want to sum the count which in our case in the column name called Solar with states. The first thing we will do is check what is the class type for column name state.
##  "list"
It’s list, we have to change it to character.
solar_data$state <- as.character(solar_data$state)
Now When its changed to character, we will perform a base function of R to aggregate the sum.
states_with_solar <- as.data.frame(xtabs(solar ~ state, solar_data))
Now you can see that we exactly know which state has how many houses running on solar energy but it’s not enough yet, so we have to visualize it for better comparison. We are not interested in all the states, we are only interested in top 5 states which has highest number of houses running on solar energy.
This is going to take a lot of effort and I am going to use all the base functions. You can use many packages here but I like to work with base functions. So let’s sort it for top 5 states that use Solar Energy.
top_5 <- states_with_solar[order(states_with_solar$Freq, decreasing = TRUE),]
We have created a new data frame and sorted the highest frequency on the top. Now we will extract the top 5 by simply using the head command.
top_5 <- head(top_5,5)
Here we have the top 5 states where houses rely on Solar Energy. It is pretty evident why the top state is California.
The only time I like to use a package is with visualization because ggplot2 is immensely powerful. First install it if you don’t have it installed already.
## Warning: package 'ggplot2' was built under R version 3.2.3
ggplot(top_5, aes(x=reorder(state, +Freq), y=Freq)) +
geom_bar(stat="identity", fill="lightgreen", color="grey50", position="dodge") +
So folks! that’s all for now. My deep house mix playlist is also over. We shall meet again with something interesting. Please return me with your feedback.
Usman is an aspiring data scientist who likes deep house mixes and hates chocolates. He likes to tweet @rana_usman and can be reached at [email protected]