Subscribe to DSC Newsletter

R Functions for Exploratory Analysis, Data Frame Merging & Map Displays

Given below is a list of R functions for quickly exploring the key attributes of the data set. The data set is based on car prices & insurance prices.

# import the dataset from an existing .csv file

> Make.and.Model.Price.by.State..June.8.2014 <- read.csv("C:/Users/Lucky/Desktop/Vozag/Car Insurance Analysis- May 28/UCA_R_June 2/Make and Model Price by State- June 8 2014.csv")

>   View(Make.and.Model.Price.by.State..June.8.2014)

> attach(Make.and.Model.Price.by.State..June.8.2014)

> MMPBS=Make.and.Model.Price.by.State..June.8.2014

 

# understanding data

> str (MMPBS)

'data.frame': 1011078 obs. of  11 variables:

$ VIN      : Factor w/ 1011078 levels "137FA90303E205513",..: 133480 133481 133486 133506 133509 133516 133521 133523 133526 133531 ...

$ State    : Factor w/ 50 levels "AK","AL","AR",..: 12 14 49 34 44 38 49 38 35 38 ...

$ Year     : int  2010 2010 2010 2010 2010 2010 2010 2010 2010 2010 ...

$ Make     : Factor w/ 61 levels "Acura","AM General",..: 19 19 19 19 19 19 19 19 19 19 ...

$ Model    : Factor w/ 926 levels "1 Series","190-Class",..: 308 308 308 308 308 308 308 308 308 308 ...

$ Trim     : Factor w/ 1945 levels "","1","1.5L Hybrid",..: 1872 1872 1872 1872 1872 1872 1872 1872 1872 1872 ...

$ Price    : num  22995 16995 23455 24985 18997 ...

$ Mileage  : int  62558 80747 35519 44909 55683 37135 65209 86571 73455 29757 ...

$ Color    : Factor w/ 12795 levels "","'High Voltage' Yellow",..: 1623 1336 2027 2592 12489 9805 2580 9117 1336 6598 ...

$ Latitude : num  42 40.3 38.5 42.1 40.1 ...

$ Longitude: Factor w/ 61 levels "-105.3272","-105.3272,,",..: 54 46 33 26 8 29 33 29 35 29 ...

 

#list the name of each variable (data column) and the first six rows of the dataset

> head(MMPBS)

               VIN State Year Make    Model Trim Price Mileage                      Color Latitude Longitude

1 1FMEU7DE5AUA46342    IA 2010 Ford Explorer  XLT 22995   62558 Black Pearl Slate Metallic  42.0046   -93.214

2 1FMEU7DE5AUA49113    IL 2010 Ford Explorer  XLT 16995   80747                      Black  40.3363  -89.0022

3 1FMEU7DE5AUA58569    WV 2010 Ford Explorer  XLT 23455   35519                 Blue Flame  38.4680  -80.9696

4 1FMEU7DE5AUB02070    NY 2010 Ford Explorer  XLT 24985   44909  Brilliant Silver Metallic  42.1497  -74.9384

5 1FMEU7DE5AUB07415    UT 2010 Ford Explorer  XLT 18997   55683                White Suede  40.1135 -111.8535

6 1FMEU7DE6AUA23930    PA 2010 Ford Explorer  XLT 20900   37135                Sangria Red  40.5773   -77.264

5 1FMEU7DE5AUB07415    UT 2010 Ford Explorer  XLT 18997   55683                White Suede  40.1135 -111.8535

6 1FMEU7DE6AUA23930    PA 2010 Ford Explorer  XLT 20900   37135                Sangria Red  40.5773   -77.264


 

 

 # basic statistics of the variables

> summary(MMPBS)

               VIN              State             Year             Make                    Model             Trim       

137FA90303E205513:      1   TX     :116568   Min.   :1981   Ford     :193128   F-150          : 38778   Base   :102740  

137FA90343E202923:      1   CA     :105994   1st Qu.:2007   Chevrolet:117950   Escape         : 20952   SE     : 59891  

137PH843X6E221839:      1   FL     : 91213   Median :2010   Toyota   : 90953   Fusion         : 20688   Limited: 46267  

137PH90316E218064:      1   IL     : 44525   Mean   :2009   Dodge    : 65722   Ram Pickup 1500: 19279   LT     : 45003  

137ZA8431YE188472:      1   OH     : 37376   3rd Qu.:2012   Nissan   : 61409   Silverado 1500 : 18852   XLT    : 35614  

137ZA8433WE182282:      1   NY     : 37351   Max.   :2015   Honda    : 59667   Focus          : 18519   LX     : 30689  

(Other)          :1011072   (Other):578051                  (Other)  :422249   (Other)        :874010   (Other):690874  


Price             Mileage            Color           Latitude         Longitude     

Min.   :    1074   Min.   :      0   Black  :110876   Min.   :21.11   -97.6475 :116568  

1st Qu.:   12500   1st Qu.:  24528   White  : 77807   1st Qu.:32.99   -119.7462:105984  

Median :   16995   Median :  43580   Silver : 72229   Median :37.77   -81.717  : 91213  

Mean   :   19061   Mean   :  55921   Gray   : 55899   Mean   :37.18   -89.0022 : 44524  

3rd Qu.:   23977   3rd Qu.:  79852   Blue   : 45341   3rd Qu.:40.58   -82.7755 : 37376  

Max.   :10000000   Max.   :3404022   Red    : 44259   Max.   :61.38   -74.9384 : 37351  

                                     (Other):604667                   (Other)  :578062  

 

From the summary, we can say that top selling colors are black (110,876) followed by white (77,807) and silver (72,229).

 # combining make and model as a new column

To find the top 10 selling used car models in the US, I need to get the count of each make and model. But, our data set consists make and model as a separate columns. So, I combined make and model in a new column to find top selling models based on count.

 

> MMPBS$Make_Model = paste(MMPBS$Make, MMPBS$Model, sep="_")

 

# getting the count by make and model

 

> MMC<-table(MMPBS$Make_Model)

> head (MMC)

 

Acura_CL     Acura_ILX Acura_Integra  Acura_Legend     Acura_MDX     Acura_NSX

         105           232            39            19          3123             3

 

 

# data table combination for cross table analysis

Example: Ford F-150 was the best selling car model with 38,778 units sold in Q1-2014 followed by Ford Escape (20,952), Ford Fusion (20,688), Chevrolet Silverado 1500 (18,852), Ford Focus(18,519), Toyota Camry (18,484), Honda Accord (17,448), Nissan Altima (16,327), Honda Civic (14,999), and Ford Edge (14,014).

We also got a separate file for average auto insurance, by year, make, model and state. we can combine both the files to know the total ownership cost (mainly internet asking price and average insurance) of top selling models.

To combine both data frames by make and model, first we can create a separate column combining State+ Make + Model+ Year in both files and used this as a “key” to link/merge these data frames.

 

# import the average insurance dataset from an existing .csv file

> Average.Insurance.by.Make.Model.and.Year..June.2.2014 <- read.csv("C:/Users/Lucky/Desktop/Gridlex/Used Car Analysis- May 28/UCA_R_June 2/Average Insurance by Make Model and Year- June 2 2014.csv")


>   View(Average.Insurance.by.Make.Model.and.Year..June.2.2014)

> MMAIBS=Average.Insurance.by.Make.Model.and.Year..June.2.2014


> head (MMAIBS)

 State Year  Make   Model Average.Insurance

1    SC 2000 Mazda Mazda6s              1166

2    SC 2001 Mazda Mazda6s              1167

3    SC 2002 Mazda Mazda6s              1168

4    SC 2003 Mazda Mazda6s              1169

5    SC 2004 Mazda Mazda6s              1170

6    SC 2005 Mazda Mazda6s              1171


 

# Creating new column in both the files to merge data frames using State+ Make + Model+ Year

> MMPBS$State_Year_Make_Model = paste(DF$State, DF$Year, DF$Make, DF$Model, sep="_")

> MMAIBS$State_Year_Make_Model = paste(DF1$State, DF1$Year, DF1$Make, DF1$Model, sep="_")


# Selecting specific columns from MMAIBS to reduce redundancy of make and model

MMAIBSS<-MMAIBS[,c("Average.Insurance","State_Year_Make_Model")]


# merging two data frames


> TC <- merge(MMPBS,MMAIBS,by="State_Year_Make_Model")




> head (TC)

State_Year_Make_Model               VIN State Year   Make   Model     Trim Price Mileage                Color Latitude

1     AK_2000_Ford_Focus 1FAFP33P4YW415496    AK 2000   Ford   Focus       LX  1499  100481                White   61.385

2     AK_2000_Honda_CR-V JHLRD187XYC035757    AK 2000  Honda    CR-V       SE  6990  143633 Naples Gold Metallic   61.385

3     AK_2000_Honda_CR-V JHLRD187XYS022545    AK 2000  Honda    CR-V       SE  8000  115786                Brown   61.385

4  AK_2000_Subaru_Legacy 4S3BH625XY7308180    AK 2000 Subaru  Legacy Brighton  6200  117573      Winestone Pearl   61.385

5 AK_2000_Subaru_Outback 4S3BH6863Y7600883    AK 2000 Subaru Outback  Limited  8995  120940                White   61.385

6 AK_2000_Subaru_Outback 4S3BH6759Y7630530    AK 2000 Subaru Outback     Base  6998  129729          White Birch   61.385


Longitude     Make_Model Average.Insurance

1 -152.2683     Ford_Focus              1293

2 -152.2683     Honda_CR-V              1184

3 -152.2683     Honda_CR-V              1184

4 -152.2683  Subaru_Legacy              1301

5 -152.2683 Subaru_Outback              1223

6 -152.2683 Subaru_Outback              1223

 

# adding price and average insurance to know the total cost of vehicle

> TC$Total_Price<-TC$Price+TC$Average.Insurance

 

> head(TC)

  State_Year_Make_Model               VIN State Year   Make   Model     Trim Price Mileage                Color Latitude

1     AK_2000_Ford_Focus 1FAFP33P4YW415496    AK 2000   Ford   Focus       LX  1499  100481                White   61.385

2     AK_2000_Honda_CR-V JHLRD187XYC035757    AK 2000  Honda    CR-V       SE  6990  143633 Naples Gold Metallic   61.385

3     AK_2000_Honda_CR-V JHLRD187XYS022545    AK 2000  Honda    CR-V       SE  8000  115786                Brown   61.385

4  AK_2000_Subaru_Legacy 4S3BH625XY7308180    AK 2000 Subaru  Legacy Brighton  6200  117573      Winestone Pearl   61.385

5 AK_2000_Subaru_Outback 4S3BH6863Y7600883    AK 2000 Subaru Outback  Limited  8995  120940                White   61.385

6 AK_2000_Subaru_Outback 4S3BH6759Y7630530    AK 2000 Subaru Outback     Base  6998  129729          White Birch   61.385


Longitude     Make_Model Average.Insurance Total_Price

1 -152.2683     Ford_Focus              1293        2792

2 -152.2683     Honda_CR-V              1184        8174

3 -152.2683     Honda_CR-V              1184        9184

4 -152.2683  Subaru_Legacy              1301        7501

5 -152.2683 Subaru_Outback              1223       10218

6 -152.2683 Subaru_Outback              1223        8221

 

# filtering Ford_F-150

Ford_F150<-TC[TC$Make_Model == 'Ford_F-150',]

 

# plotting used Ford_F-150 total price range on the US map

> install.packages("ggmap")

> install.packages("mapproj")

> library(ggmap)

> library(mapproj)

> map <- get_map(location = 'US', zoom = 4)

> ggmap(map)

> mapPoints <- ggmap(map) + geom_point(aes(x = Longitude, y = Latitude, size = Total_Price), data = Ford_F150, alpha = .5)

> mapPoints


 

Top Selling Model- Ford_F-150 Total Price Range by State:

 

By following last two steps, we can plot total price range for remaining top vehicles on the US map.

 

Views: 4317

Comment

You need to be a member of Data Science Central to add comments!

Join Data Science Central

Comment by Onsongo Wilfred on May 30, 2016 at 6:31am
Nice... how about using the dplyr package to manipulate the columns?
Comment by Jason Williams on June 3, 2015 at 12:35pm

Someone beat me to the ggplot2 comment! But a very good look at visual applications.

Comment by Anil Bhargava on June 1, 2015 at 6:05pm

This is good. Applicable and simple. However, I think this can be easily done via ggplot2

Follow Us

Videos

  • Add Videos
  • View All

Resources

© 2017   Data Science Central   Powered by

Badges  |  Report an Issue  |  Privacy Policy  |  Terms of Service