Contributed by Joseph Wang. He is currently enrolled in the NYC Data Science Academy's 12-week, full-time Data Science Bootcamp, which runs from April 11th to July 1st, 2016. This post is based on his first class project, R visualization (due in the 2nd week of the program).

**Motivation:**

With the recent downturn in the energy industry, I was curious whether other industries, such as semiconductors and finance, might be hit as well, and whether statistical inference on stock prices could show it. For an initial exploration, I picked two key players from each sector: Exxon (XOM) and Chevron (CVX) for the energy industry, J. P. Morgan (JPM) and Goldman Sachs (GS) for the finance sector, and AMD and Intel (INTC) for the semiconductor industry in the USA.

**Data Exploration:**

I gathered all the time series from Yahoo Finance using the R package quantmod (see the Appendix). The date range, June 1, 1999 to January 1, 2016, was chosen so that the data were complete for all six stocks. Because the maximum stock price of Goldman Sachs was much larger than that of the other stocks, I scaled each stock by its own maximum price over the whole period for visualization. In FIG. 1, Chevron's stock price almost collapses onto Exxon's over the past decade. It is interesting to see an oscillation with a period of around four years in the energy stock prices from the end of 2001 onward. No such regular oscillation occurs in the other sectors. However, one can sense a strong correlation between the finance and energy stocks when the energy stock prices plummeted. This is not true for the semiconductor sector. The trend in the time series also tells us that the raw prices are not symmetrically distributed, so they will not follow a normal distribution unless seasonal trends and bias are filtered out. Instead, we can study the correlation between stocks from a different perspective.
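The rescaling used for FIG. 1 can be sketched in a few lines of R. The prices below are made-up numbers standing in for two adjusted closing price series, and `scale_by_max` is a hypothetical helper name introduced only for this illustration.

```r
# Hypothetical adjusted closing prices (illustrative values, not real quotes)
prices <- data.frame(
  GS  = c(150, 180, 240, 200),
  AMD = c(10, 40, 20, 5)
)

# Divide each column by its own maximum so that every series peaks at 1
scale_by_max <- function(df) {
  as.data.frame(lapply(df, function(p) p / max(p)))
}

scaled <- scale_by_max(prices)
print(scaled)
```

After this step every series lies between 0 and 1, so stocks with very different price levels can share one axis.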

For stock trading, what is more interesting is the “up and down” of the stock price, defined as the difference between the stock prices on adjacent days, which is easy to calculate in Matlab or R. By basic calculus, the daily stock price can be recovered by integrating the difference signal over time: in discrete terms, it is the starting price plus the running sum of the differences. In other words, if we can learn the behavior of the difference signals, which are shown later to be approximately Gaussian, we may be able to make predictions about future stock prices.
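As a small illustration of the relationship between prices and their daily differences, the sketch below uses five made-up prices (not real quotes). It computes the difference signal with `diff` and then rebuilds the prices from the first price and the cumulative sum of the differences.

```r
# Hypothetical adjusted closing prices on five consecutive business days
price <- c(100.0, 101.5, 99.8, 100.2, 102.0)

# Daily difference signal: price[t + 1] - price[t]
d <- diff(price)

# Discrete "time integration": first price plus the running sum of differences
reconstructed <- c(price[1], price[1] + cumsum(d))

# TRUE when the rebuilt series matches the original up to rounding error
print(isTRUE(all.equal(reconstructed, price)))
```

Note that the difference signal is one element shorter than the price series, which is why the first price has to be supplied separately.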

In FIG. 2, we show the difference signals for all the stocks we selected. The signals appear to be approximately normally distributed, as shown by the comparable counts of positive and negative values about a mean of approximately zero, despite the highly non-normal distribution of the original stock prices. We also observe that there may be a strong correlation between signals within the same sector. Let us investigate in more detail with histograms.

In FIG. 3, we show the histograms for the different sectors. The signal for each sector is the sum of the difference signals of the constituent stocks in that sector. We observe a remarkably symmetric, approximately normal distribution. This gives us hope of drawing statistical inferences based on the Gaussian distribution.
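The construction of a sector signal and its histogram can be sketched in R as follows. The two input vectors are synthetic normal draws used as stand-ins for real daily differences, so the sketch only illustrates the mechanics and says nothing about the actual data.

```r
set.seed(1)  # reproducible synthetic data

# Synthetic stand-ins for the daily differences of two stocks in one sector
diff_a <- rnorm(4000, mean = 0, sd = 0.8)
diff_b <- rnorm(4000, mean = 0, sd = 0.5)

# Sector signal: day-by-day sum of the constituent stocks' differences
sector <- diff_a + diff_b

# Bin the signal without drawing it; breaks = 100 is only a suggestion to hist()
h <- hist(sector, breaks = 100, plot = FALSE)

# Rough symmetry checks: mean near zero, similar counts on both sides of zero
print(mean(sector))
print(c(negative = sum(sector < 0), positive = sum(sector > 0)))
```

Calling `plot(h)` would draw the histogram itself.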

In FIG. 4, we show the scatter plots of the difference signals between sectors. We observe a stronger correlation between the finance and energy sectors and a much weaker correlation for the other combinations. Let the null hypothesis be that there is no correlation between the difference signals of different sectors. The correlation coefficients and p-values can be calculated numerically as a correlation matrix Cij, where the indices i and j run from 1 to 3 and i is not equal to j (1: semiconductor sector; 2: finance sector; 3: energy sector). The linear correlations between sectors are given by the off-diagonal elements: C12 = C21 = 0.3382, C13 = C31 = 0.1968, and C23 = C32 = 0.4984. The corresponding p-values are almost zero to double precision. This means our null hypothesis is statistically rejected, so we can be statistically confident that there is a linear correlation between the different sectors. Because the correlation coefficients between the semiconductor and finance sectors and between the finance and energy sectors are larger, those two pairs are more strongly correlated than the energy and semiconductor sectors.
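For readers working in R rather than Matlab, a correlation matrix with pairwise p-values can be sketched with `cor` and `cor.test` from base R. The three signals below are synthetic: a shared random component is added to each one to induce correlation, so the numbers will not reproduce the values reported above.

```r
set.seed(42)  # reproducible synthetic data
n <- 4000

# A shared component makes the three synthetic sector signals correlated
common  <- rnorm(n)
semi    <- 0.5 * common + rnorm(n)
finance <- 1.0 * common + rnorm(n)
energy  <- 0.7 * common + rnorm(n)

Y <- cbind(semi, finance, energy)

# 3 x 3 matrix of Pearson correlation coefficients
C <- cor(Y)

# Pairwise p-values for the null hypothesis of zero correlation
P <- matrix(NA_real_, nrow = 3, ncol = 3, dimnames = dimnames(C))
for (i in 1:2) {
  for (j in (i + 1):3) {
    p <- cor.test(Y[, i], Y[, j])$p.value
    P[i, j] <- p
    P[j, i] <- p
  }
}

print(round(C, 4))
print(P)
```

The diagonal of `P` is left as `NA` because testing a signal against itself is not meaningful.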

**Conclusion and Discussion:**

Using this difference-based strategy, we can identify the stronger linear correlations between the finance sector and each of the other two sectors. The correlation between the semiconductor and energy sectors is statistically significant at the 95% confidence level, but it is not strong. To model correlations on shorter time scales, we may need to filter the stock price differences on scales shorter than a day, so that the seasonal signals and bias on the time scale of days can be accounted for. On longer time scales, processing the difference signal should be able to remove the bias and filter out the seasonal trends.

**Appendix:**

**Importing the time series data with R:**

library(quantmod)

data <- getSymbols("XOM", src = "yahoo", from = "1999-06-01", to = "2016-01-01", auto.assign = FALSE)

write.csv(data, file = "XOM.csv")

data <- getSymbols("CVX", src = "yahoo", from = "1999-06-01", to = "2016-01-01", auto.assign = FALSE)

write.csv(data, file = "CVX.csv")

data <- getSymbols("AMD", src = "yahoo", from = "1999-06-01", to = "2016-01-01", auto.assign = FALSE)

write.csv(data, file = "AMD.csv")

data <- getSymbols("INTC", src = "yahoo", from = "1999-06-01", to = "2016-01-01", auto.assign = FALSE)

write.csv(data, file = "INTC.csv")

data <- getSymbols("GS", src = "yahoo", from = "1999-06-01", to = "2016-01-01", auto.assign = FALSE)

write.csv(data, file = "GS.csv")

data <- getSymbols("JPM", src = "yahoo", from = "1999-06-01", to = "2016-01-01", auto.assign = FALSE)

write.csv(data, file = "JPM.csv")

**Next we read these csv files into Matlab data files to prepare for visualizing our results in Matlab (from this point on, the code is written in Matlab .m script files):**

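%The offsets (1,1) skip the header row and the date column written by write.csv (assumed file layout)

%Each matrix is saved under its ticker name so that a later "load XOM" restores a variable called XOM

XOM=csvread('XOM.csv',1,1); save('XOM.mat','XOM');

CVX=csvread('CVX.csv',1,1); save('CVX.mat','CVX');

AMD=csvread('AMD.csv',1,1); save('AMD.mat','AMD');

INTC=csvread('INTC.csv',1,1); save('INTC.mat','INTC');

GS=csvread('GS.csv',1,1); save('GS.mat','GS');

JPM=csvread('JPM.csv',1,1); save('JPM.mat','JPM');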

**Now we load the .mat files into matrix variables so that we can process the data in Matlab:**

clear all

%After downloading the time series data from Yahoo Finance through the R

%quantmod library, we saved the data into .csv files and then converted them into

%Matlab data files (.mat)

%Time series data are loaded based on closing time on business days.

load XOM %Exxon stock price

load CVX %Chevron stock price

load INTC %Intel stock price

load AMD %AMD stock price

load JPM %J.P. Morgan stock price

load GS %Goldman Sachs stock price

x=0:1:length(XOM(:,6))-1;

plot(x,XOM(:,6),'k') %Plot the sixth column of the Exxon data, which is the adjusted stock price

hold on; %Keep the axes so that the following curves are overlaid

plot(x,CVX(:,6),'b'); %Plot the Chevron data

plot(x,INTC(:,6),'r'); %Plot the Intel data

plot(x,AMD(:,6),'y') %Plot the AMD data

plot(x,JPM(:,6),'m'); %Plot the JPM data

plot(x,GS(:,6),'c'); %Plot the GS data

ylabel('Stock prices (dollars)','fontsize',14,'fontweight','b');

%To observe the trend better, we renormalize each stock

%price by its maximum price over the selected time series

figure

plot(x,XOM(:,6)/max(XOM(:,6)),'k')

hold on; %Keep the axes so that the following curves are overlaid

plot(x,CVX(:,6)/max(CVX(:,6)),'b');

plot(x,INTC(:,6)/max(INTC(:,6)),'r');

plot(x,AMD(:,6)/max(AMD(:,6)),'y')

plot(x,JPM(:,6)/max(JPM(:,6)),'m');

plot(x,GS(:,6)/max(GS(:,6)),'c');

xlabel('Business Days','fontsize',14,'fontweight','b');

ylabel('Renormalized stock prices (fraction of maximum)','fontsize',14,'fontweight','b');

%By observing the trend, we do not expect the data to be useful

%for statistical inference due to its non-normal distribution.

%Instead, what is more interesting is the "up and down" of the stock

%prices, which is defined as the difference of the stock prices on adjacent

%days and can be calculated by the diff function in MATLAB.

figure

diff_XOM=diff(XOM(:,6));

diff_CVX=diff(CVX(:,6));

diff_INTC=diff(INTC(:,6));

diff_AMD=diff(AMD(:,6));

diff_JPM=diff(JPM(:,6));

diff_GS=diff(GS(:,6));

xx=0:1:length(XOM(:,6))-2;

subplot(6,1,1)

plot(xx,diff_XOM,'k')

subplot(6,1,2)

plot(xx,diff_CVX,'b');

subplot(6,1,3)

plot(xx,diff_INTC,'r');

subplot(6,1,4)

plot(xx,diff_AMD,'y')

subplot(6,1,5)

plot(xx,diff_JPM,'m');

subplot(6,1,6)

plot(xx,diff_GS,'c');

xlabel('Business Days','fontsize',14,'fontweight','b')

ylabel('Stock Price Difference Daily','fontsize',14,'fontweight','b')

%Histograms showing the normally distributed stock price differences

figure %New figure so that the histograms do not overwrite the difference plots

subplot(1,3,1)

hist(diff_XOM+diff_CVX,100,'b')

ylabel('Counts in 100 bins','fontsize',14,'fontweight','b')

subplot(1,3,2)

hist(diff_INTC+diff_AMD,100,'r')

xlabel('Stock Price Difference Daily for semiconductor sector','fontsize',14,'fontweight','b')

ylabel('Counts in 100 bins','fontsize',14,'fontweight','b')

subplot(1,3,3)

hist(diff_JPM+diff_GS,100,'g')

ylabel('Counts in 100 bins','fontsize',14,'fontweight','b')

%Scatter plots between companies

%figure

%plot(diff_INTC,diff_XOM,'O')

%xlabel('INTC');ylabel('XOM')

%figure

%plot(diff_INTC,diff_CVX,'*')

%xlabel('INTC');ylabel('CVX')

%figure

%plot(diff_AMD,diff_XOM,'p')

%xlabel('AMD');ylabel('XOM')

%figure

%plot(diff_AMD,diff_CVX,'+')

%xlabel('AMD');ylabel('CVX')

%hold on;

%Plot scatter plots for different sectors

%Stock price differences from the same industrial sector are added together

figure %New figure so that the scatter plots do not overwrite the histograms

subplot(1,3,1)

plot(diff_AMD+diff_INTC,diff_JPM+diff_GS,'.')

subplot(1,3,2)

plot(diff_JPM+diff_GS,diff_XOM+diff_CVX,'.')

subplot(1,3,3)

plot(diff_AMD+diff_INTC,diff_XOM+diff_CVX,'.')

%Calculation of correlation between companies and sectors

%X=[diff_AMD diff_INTC diff_JPM diff_GS diff_XOM diff_CVX];

%[correlation_com,pval_com]=corr(X);

%Calculation of correlation matrix and p values

Y=[diff_AMD+diff_INTC diff_JPM+diff_GS diff_XOM+diff_CVX];

[correlation_sec,pval_sec] = corr(Y);

%Here we only care about the correlation between the signs of the stock price differences:

%what is the probability that when the stocks in one sector go up or down from one

%day to the next, the stocks in another sector also go up or down?

%I found that this occurred almost 66 percent of the time.

%Y1=diff_AMD+diff_INTC;

%Y2=diff_JPM+diff_GS;

%Y3=diff_XOM+diff_CVX;

%Only catch the sign

%for j=1:length(Y1)

% if Y1(j)>0

% Y1(j)=1;

% else

% Y1(j)=-1;

% end

% if Y2(j)>0

% Y2(j)=1;

% else

% Y2(j)=-1;

% end

% if Y3(j)>0

% Y3(j)=1;

% else

% Y3(j)=-1;

% end

%end

%Count the number of days both stocks are all up or down

%N1=0;

%N2=0;

%N3=0;

%for j=1:length(Y1)

% if Y1(j)*Y2(j)>0

% N1=N1+1;

% end

% if Y1(j)*Y3(j)>0

% N2=N2+1;

% end

% if Y2(j)*Y3(j)>0

% N3=N3+1;

% end

%end

%P1=N1/length(Y1);

%P2=N2/length(Y2);

%P3=N3/length(Y3);
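The commented-out loops above reduce each sector signal to its sign and count the days on which two sectors move in the same direction. A vectorized sketch of the same idea, written in R rather than Matlab and run on synthetic signals (so it does not reproduce the 66 percent figure from the real data), might look like this:

```r
set.seed(7)  # reproducible synthetic data

# Two synthetic difference signals that share a common component
common <- rnorm(1000)
y1 <- common + rnorm(1000)
y2 <- common + rnorm(1000)

# Keep only the direction: +1 for an up day, -1 otherwise
# (a zero difference counts as -1, matching the else branch of the loop)
s1 <- ifelse(y1 > 0, 1, -1)
s2 <- ifelse(y2 > 0, 1, -1)

# Fraction of days on which both signals move in the same direction
p_same <- mean(s1 * s2 > 0)
print(p_same)
```

Because the signs are only ever +1 or -1, `mean(s1 == s2)` gives the same fraction.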

Original blog post: http://blog.nycdatascience.com/r/correlation-between-stock-prices-i…