
Multi-Regression in R (Exxon Mobil stock price ~ WTI, Gas, and S&P500)

[Previous Post]
Single regression on Exxon's stock

[Introduction of Multi-regression]

Let's recall our last post. We ran a single regression of Exxon Mobil's stock on the WTI crude oil spot price. The result was encouraging: the model explained about 25% of the variation in the stock's movement (in other words, the R-squared was 0.25). The question is: are you happy with a result that explains only 25% of the variation?

For investors, it can be risky to know only one variable that affects the stock they hold. Is there a way to account for more of the variation in the stock price?

That's where multiple regression comes in.

y = a + (a1)x1 + (a2)x2 + (a3)x3 + ... + ERROR

In the previous post, we simplified the model to y = ax + b. That was a great starting point, but the real world is not that simple: a stock is affected by many variables.
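To see what a multiple regression does with a model of this form, here is a minimal, self-contained sketch on simulated data. The coefficients 0.5, 2, -1, and 0.3 are invented for illustration; the point is just that lm() recovers them from noisy data.

```r
# Toy sketch: lm() recovers the coefficients of y = a + a1*x1 + a2*x2 + a3*x3 + error
set.seed(42)
n  <- 500
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- rnorm(n)
y  <- 0.5 + 2*x1 - 1*x2 + 0.3*x3 + rnorm(n, sd = 0.1)
coef(lm(y ~ x1 + x2 + x3))  # estimates close to 0.5, 2, -1, 0.3
```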

I'll come back to this problem later, but one assumption of the model is that the variables x1, x2, and x3 are independent of one another. To make sense of that, you first need to understand what it means for two random variables to be independent.

[Independent]
<Theory>
When we say that two random variables are independent of each other, it means, roughly, that they are not correlated. Think about the relationship between temperature and your electricity bill. As the temperature goes up, you run the air conditioner more, which increases your bill: the two are "positively correlated." Now think about the relationship between temperature and the time you spend on your laptop. For some people these might be related, but for me, regardless of the temperature, I use my laptop more than 5 hours a day for work. We can say that temperature and computer usage are independent of each other. We can represent this mathematically:

P(A ∩ B) = P(A)P(B)
Correlation(A, B) = 0

(Strictly speaking, independence implies zero correlation, but zero correlation does not always imply independence; for our purposes the intuition is enough.)


To build intuition for correlation, you can picture two gears rotating together: when they are meshed, one's movement determines the other's.
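The temperature example above can be simulated in a few lines. The numbers (mean temperature, the bill's slope, the 5 laptop hours) are made up to match the story, not real data:

```r
# Hypothetical simulation: correlated vs. independent random variables
set.seed(1)
temperature  <- rnorm(1000, mean = 25, sd = 5)
# Electricity bill rises with temperature (positively correlated)
bill         <- 50 + 3 * temperature + rnorm(1000, sd = 5)
# Laptop hours do not depend on temperature (independent)
laptop_hours <- rnorm(1000, mean = 5, sd = 0.5)

cor(temperature, bill)          # close to 1
cor(temperature, laptop_hours)  # close to 0
```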

<In practice>
Of course, it's nearly impossible to find two totally independent random variables. In macroeconomics especially, it is close to impossible because all the variables are connected. So we have to decide where to draw the line. Here, we will assume that if the correlation is less than 0.2, we can treat the variables as independent.

[What can have an impact on Exxon's equity value?]
We assumed that the WTI crude oil price can, and that turned out to be true. The natural gas price can too: Exxon sells not only oil but also natural gas. Most importantly, as a member of the S&P 500, Exxon is heavily affected by the general market condition, and the S&P 500 index itself reflects a collection of macro variables, like the interest rate. So in this analysis, I'll add two more variables.

[Where can I get the data]
I really want to be nice, so I uploaded the WTI oil price and natural gas price to my Google Drive. How do we use them? You can watch a 3-minute video to learn how to retrieve the data from Google Drive into R.


[Before getting into the code]
(1) I used the tseries library, just as in the previous post. Please install it if you don't have it.
(2) I used Vanguard's S&P 500 index fund (VFINX) in lieu of the S&P 500 itself because it's more accessible.
(3) If you can't follow the code, please go over the previous post again. I deliberately ruled out any difficult code.
(4) As I mentioned earlier, you can use the WTI and natural gas data without changing the URLs. Don't download the files manually; just run the code.

[Code]
#Gas: https://docs.google.com/spreadsheets/d/1-mMwoHJNrv9_St2x9xMlcNjs-NG...
#WTI: https://docs.google.com/spreadsheets/d/19kE1nLp5u4Zf2UiZA4-GW0gYhdM...

#Multi Regression and Correlation

library(tseries)
library(zoo)

#Exxon Mobil's equity value in 2015
xom <- get.hist.quote("XOM", #Ticker symbol
                      start="2015-01-01", #Start date YYYY-MM-DD
                      end="2015-12-31" #End date YYYY-MM-DD
)
#S&P 500. I used Vanguard Index Fund instead of directly using S&P.
snp <- get.hist.quote("VFINX", #Ticker symbol
                      start="2015-01-01", #Start date YYYY-MM-DD
                      end="2015-12-31" #End date YYYY-MM-DD
)

#I am going to use close value only
xom_zoo <- xom$Close
snp_zoo <- snp$Close

#Please check the post that mentions how to get the data from Google Sheet.
wti <- read.csv("https://docs.google.com/spreadsheets/d/19kE1nLp5u4Zf2UiZA4-GW0gYhdMvWU60L2M-SIbYqX0/pub?gid=0&single=true&output=csv")

#When read.csv first reads the file, the VALUE column is categorical (a factor), so we need to convert it.
wti$VALUE <- as.character(wti$VALUE)
#The data also contains the garbage value "." (you can see it in Excel). We can clean it with the command below.
wti <- wti[wti$VALUE!=".", #Drop any row whose VALUE is "."
           1:2 #Keep the first and second columns
           ]
#Finally I want to convert character to numeric value.
wti$VALUE <- as.numeric(wti$VALUE)
wti_zoo <- read.zoo(wti, format="%m/%d/%Y")

gas <- read.csv(url("https://docs.google.com/spreadsheets/d/1-mMwoHJNrv9_St2x9xMlcNjs-NGKfJ32KOKRHixMZMk/pub?gid=0&single=true&output=csv"))
gas$Price <- as.character(gas$Price)
gas$Price <- as.numeric(gas$Price)
gas_zoo <- read.zoo(gas, format="%m/%d/%Y")

#Combine Two Time Series
two <- cbind(wti_zoo, snp_zoo)
three <- cbind(gas_zoo, two)
#Remove NAs: the three series do not cover exactly the same set of dates.
three.f <- na.omit(three)

#What we need is returns, not price levels. Note: zoo's lag() shifts a series
#forward one step, so these lines compute (x_t - x_(t+1))/x_t; the convention
#is the same for every series, so the regression slopes are unaffected.
xom_rate <- (xom_zoo - lag(xom_zoo))/xom_zoo
wti_rate <- (three.f$wti_zoo - lag(three.f$wti_zoo))/three.f$wti_zoo
snp_rate <- (three.f$snp_zoo - lag(three.f$snp_zoo))/three.f$snp_zoo
gas_rate <- (three.f$gas_zoo - lag(three.f$gas_zoo))/three.f$gas_zoo
regression_result <- lm(xom_rate~wti_rate+snp_rate+gas_rate)

print(summary(regression_result))

cols<-rainbow(4)

#Draw Time Series Graph
#lty: Line Style
plot(xom_rate, col=cols[1], lty=1, main="Return of XOM, WTI, S&P500, Gas", xlab="2015", ylab="Return")
lines(wti_rate, col=cols[2], lty=2)
lines(snp_rate, col=cols[3], lty=2)
lines(gas_rate, col=cols[4], lty=2)
legend('bottomleft', #It's located in bottom left
       c("XOM", "WTI", "S&P", "Gas"),
       lty=c(1,2,2,2), #Match the line styles used in the plot above (solid XOM, dashed others)
       col=cols, #color
       horiz=TRUE, #Horizontal alignment
       bty='n', #No background
       cex=0.65 #If it is 1.0 it's too big. Basically it is a scale factor
)

#Scatter plot for multi regression
#pch: Dot Style
plot(x=xom_rate, y=gas_rate, col=cols[1], pch=17, main="Scatter plot on return", xlab="XOM", ylab="Gas, S&P, WTI")
points(x=xom_rate, y=snp_rate, col=cols[2], pch=18)
points(x=xom_rate, y=wti_rate, col=cols[3], pch=19)
legend('bottomleft',
       c("GAS", "S&P", "WTI"),
       pch=c(17, 18, 19), #Point symbols matching the scatter points above
       col=cols, #color
       horiz=TRUE, #Horizontal alignment
       bty='n', #No background
       cex=0.65 #If it is 1.0 it's too big. Basically it is a scale factor
)

[How to interpret the result]

<Table>
Call:
lm(formula = xom_rate ~ wti_rate + snp_rate + gas_rate)

Residuals:
      Min        1Q    Median        3Q       Max
-0.030185 -0.004732  0.000406  0.005333  0.039771

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.0004744  0.0005458   0.869   0.3856    
wti_rate     0.1578732  0.0193612   8.154 1.78e-14 ***
snp_rate     0.9308219  0.0577671  16.113  < 2e-16 ***
gas_rate    -0.0310367  0.0150969  -2.056   0.0409 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.008641 on 247 degrees of freedom
Multiple R-squared:  0.6416, Adjusted R-squared:  0.6372
F-statistic: 147.4 on 3 and 247 DF,  p-value: < 2.2e-16

First, look at the Adjusted R-squared. Because we used multiple variables, the adjusted value is the one to read: 63.72% of the variation can be explained by this model (awesome!). Then look at the overall p-value: the probability of results like these arising if the model had no explanatory power is less than 2.2×10^-16, so the model is reliable.
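For reference, the adjusted R-squared can be computed from the R-squared by penalizing for the number of predictors. Using the values from the table above (n = 251 observations, since the residual degrees of freedom are 247 with p = 3 predictors and 1 intercept):

```r
# adj.R2 = 1 - (1 - R2) * (n - 1) / (n - p - 1)
r2 <- 0.6416  # Multiple R-squared from the summary
n  <- 251     # observations (247 residual df + 3 predictors + 1 intercept)
p  <- 3       # predictors
1 - (1 - r2) * (n - 1) / (n - p - 1)  # ~0.6372, matching the summary
```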

Next, look at the p-value for each variable: all are less than 0.05, so each variable is statistically significant as well.

Finally, the coefficient estimates give us the fitted equation.

y(Exxon's stock return) = 0.0004744 + (0.1578732)(WTI return) + (0.9308219)(S&P return) - (0.0310367)(Natural gas return)
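As a quick, hypothetical use of this equation: suppose that on some day WTI returns +2%, the S&P returns +1%, and natural gas returns -1%. The day's numbers are invented; the coefficients come from the summary above.

```r
# Plug a hypothetical day's returns into the fitted equation
b <- c(intercept = 0.0004744, wti = 0.1578732, snp = 0.9308219, gas = -0.0310367)
unname(b["intercept"] + b["wti"]*0.02 + b["snp"]*0.01 + b["gas"]*(-0.01))
# ~0.0133, i.e. a predicted XOM return of about 1.3% for that day
```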

Wait, something doesn't make sense. Is the natural gas price really negatively correlated with Exxon's stock return? I can put forward several hypotheses to account for this unexpected outcome.
(1) With the advent of fracking technology, natural gas is no longer only a product for Exxon; it has also become a raw material.
(2) The natural gas price responds to the market later than WTI, because investors focus more on WTI than on natural gas.
(3) Exxon doesn't rely much on natural gas, or hedged its exposure somehow as the oil price hit historic lows.

I'll let you decide on which theory is the most plausible.

[Time Series Graph]
We can see graphically how much the series correlate with each other. I drew all the time series in one graph first. You can see that the general ups and downs are consistent with one another.

[Scatter Plot]

You can do the same thing with a scatter plot, which lets you see the correlation between XOM returns and each of the other variables visually.

[Correlation in the real world]
The cor() command lets you compute the correlation between two variables. As I said, it is nearly impossible for macroeconomic variables to be truly independent.

> cor(wti_rate, snp_rate)
[1] 0.2796217
>

Because the WTI return and the S&P 500 return are not independent of each other, our model could be undermined. Strictly speaking, 0.28 is above the 0.2 rule of thumb I mentioned earlier, but it is not strong enough to seriously threaten the model. Again, it's about where you draw the line: as a data scientist, you set your own standard. From my standpoint, a correlation of 0.27 is acceptable.
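To check all the pairs at once, you can pass a matrix to cor() and get the full correlation matrix. The snippet below uses simulated stand-ins for the three return series; with the real data you would pass cbind(wti_rate, snp_rate, gas_rate) instead:

```r
# Pairwise correlations in one call (simulated stand-in data)
set.seed(7)
returns <- data.frame(wti = rnorm(100), snp = rnorm(100), gas = rnorm(100))
round(cor(returns), 2)  # 3x3 matrix; the diagonal is always 1
```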

I'll leave identifying the remaining correlations (S&P ~ gas, WTI ~ gas) to you as an assignment.

[Additional Tip]
If you find that, in the scatter plot, the blue dots totally eclipse the other dots, you can make the colors transparent. It's not that difficult.

#Make the color transparent
t_col <- function(color, percent = 50, name = NULL) {
    #  color = color name
    # percent = % transparency
    #   name = an optional name for the color
    ## Get RGB values for named color
    rgb.val <- col2rgb(color)
    ## Make new color using input color as base and alpha set by transparency
    t.col <- rgb(rgb.val[1], rgb.val[2], rgb.val[3],
                 max = 255,
                 alpha = (100-percent)*255/100,
                 names = name)
    ## Return the new color
    return(t.col)
}

plot(x=xom_rate, y=gas_rate, col=t_col(cols[1]), pch=17, main="Scatter plot on return", xlab="XOM", ylab="Gas, S&P, WTI")
points(x=xom_rate, y=snp_rate, col=t_col(cols[2]), pch=18)
points(x=xom_rate, y=wti_rate, col=t_col(cols[3]), pch=19)
legend('bottomleft',
       c("GAS", "S&P", "WTI"),
       pch=c(17, 18, 19), #Point symbols matching the scatter points above
       col=cols, #color
       horiz=TRUE, #Horizontal alignment
       bty='n', #No background
       cex=0.65 #If it is 1.0 it's too big. Basically it is a scale factor
)


More R codes? www.mbaprogrammer.com









Comment by Alberto Gaidys on May 30, 2016 at 7:25am
Gregory:

Very good tutorial, thank you. There is a 4th, increasingly likely, possibility for the negative coefficient on the price of natural gas: its strong correlation with the price of WTI oil, causing what in multiple regression is called "multicollinearity." It is well documented in the literature that the presence of multicollinearity among the independent variables causes their coefficients to lose precision individually. The regression is still valid, but one can't infer direct relationships between the dependent variable and each independent variable from its coefficient anymore, only their joint effect. One possible solution, if you still wish to isolate the individual effect of each independent variable, would be to extract the 3 principal components from the multivariate sample using princomp, run a regression exactly as you did, and roll the coefficients back through the transformation matrix -- this is the so-called "principal component regression." Or use 3 independent components for the regression through fastICA. Or you can use ridge regression or partial least squares regression, which are known to mitigate this effect. I hope this helps.

A.Gaidys
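The principal-component-regression idea from the comment above can be sketched on simulated collinear data as follows. The variable names and coefficients are invented for illustration; this is not the post's data.

```r
# Minimal sketch of principal component regression with collinear predictors
set.seed(3)
n  <- 300
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)  # strongly collinear with x1
x3 <- rnorm(n)
y  <- 1 + 2*x1 + 0.5*x3 + rnorm(n, sd = 0.2)

pc  <- princomp(cbind(x1, x2, x3))  # principal components of the predictors
fit <- lm(y ~ pc$scores)            # regress on the (uncorrelated) components
# Roll the component coefficients back to the original variables:
beta <- pc$loadings %*% coef(fit)[-1]
round(beta, 2)
```

Keeping all three components, as here, reproduces the ordinary least squares fit; the practical benefit comes from dropping the smallest components when the predictors are nearly collinear.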
