The **numpy**, **scipy**, and **statsmodels** libraries are frequently used to generate regression output in Python. Which of them is the best fit often depends on the data in question, among other considerations. Here, we will go through how to use each of the above to generate regression output.

We can use the numpy and scipy libraries to run a simple linear regression on two arrays (x and y), determining the intercept and slope along with the **p-value**, **R-squared value**, and **standard error**.

```python
from scipy import stats
import numpy as np

# two small return series; per the discussion below, x plays the role of
# market returns and y the asset's returns
x = [-0.018, -0.008, 0.011, 0.017, -0.008, -0.002]
y = [-0.006, -0.001, 0.015, 0.017, -0.0019, -0.005]

gradient, intercept, r_value, p_value, std_err = stats.linregress(x, y)

print("Gradient and intercept", gradient, intercept)
print("R-squared", r_value**2)
print("p-value", p_value)
```

This produces:

```
Gradient and intercept 0.724084177708 0.00398211223694
R-squared 0.854175903355
p-value 0.00839711859611
```

From the above, we see that these libraries let us go beyond fitting a line of best fit and extract the key regression statistics directly. Moreover, with a beta of **0.72** and an R-squared value of **0.85**, we see that the movements of the asset are highly correlated with those of the market.
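As a quick cross-check (a sketch of our own, not part of the original example), the gradient reported by linregress is simply the covariance of the two series divided by the variance of x, which is exactly how a market beta is usually defined:

```python
import numpy as np

x = np.array([-0.018, -0.008, 0.011, 0.017, -0.008, -0.002])
y = np.array([-0.006, -0.001, 0.015, 0.017, -0.0019, -0.005])

# beta = Cov(x, y) / Var(x); ddof=1 matches np.cov's default normalisation
beta = np.cov(x, y)[0, 1] / np.var(x, ddof=1)
print(beta)  # ~0.724, matching the linregress gradient
```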

When running a regression in Python, we often wish to use multiple independent variables. While this can be done with scikit-learn or scipy, it is more involved to set up, and it is usually more efficient to use the statsmodels library for this purpose.

In this instance, we use the dataset that we initially regressed in R for the Ordinary Least Squares tutorial pertaining to stock market data (data set included below). To form a linear regression model, we set up our model in statsmodels as follows (note that you can simply copy and paste the y, x1, x2 and x3 data arrays to replicate the example for yourself):

```python
import numpy as np
import statsmodels.api as sm

# dependent variable (y)
y = [984.5410628019, 128.8194444444, 94.495412844, 312.0331950207,
     65.1127819549, 168.3289588801, 301.7441860465, 90.1408450704,
     249.4573643411, 239.0361445783, 181.1775200714, 327.2440944882,
     230.9523809524, 158.9442815249, 30.3759398496, 152.5783619818,
     157.5938566553, 150.9933774834, 92.2413793103, 37.5706214689,
     161.2958226769, 125.1546391753, 181.0394610202, 423.9678899083,
     449.3975903614, 208.9949518128, 151.5695067265, 205.4315027158,
     187.1388998813, 121.733615222, 155.1660516605, 196.2901896125,
     211.9626168224, 120.3170028818, 104.2812254517, 245.7815565729,
     133.5061088486, 135.6054191363, 102.4313372355, 148.6814566764,
     125.9274357929, 166.783352781, 104.2042042042, 203.8586703859,
     183.3114323259, 251.4944939696, 124.5541022592, 264.9880095923]

# independent variable 1 (x1)
x1 = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
      0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
      0, 1]

# independent variable 2 (x2)
x2 = [28, 48, 45, 43, 46, 32, 27, 42, 36, 31, 30, 26, 29, 40, 49, 38, 35,
      37, 41, 47, 33, 39, 34, 12, 10, 4, 16, 15, 3, 20, 9, 2, 5, 23, 22, 11,
      13, 17, 24, 14, 19, 8, 18, 7, 21, 1, 25, 6]

# independent variable 3 (x3)
x3 = [0.09, 0.12, 0.17, 0.23, 0.31, 0.32, 0.33, 0.34, 0.38, 0.42, 0.43,
      0.5, 0.51, 0.52, 0.55, 0.74, 0.78, 0.79, 0.8, 0.83, 0.84, 0.93, 0.95,
      1.05, 1.05, 1.09, 1.11, 1.13, 1.2, 1.21, 1.24, 1.27, 1.33, 1.35, 1.38,
      1.43, 1.45, 1.45, 1.48, 1.6, 1.62, 1.7, 1.76, 1.79, 1.79, 1.79, 1.94, 2]

# stack the regressors into a single design matrix and prepend an intercept
x = np.column_stack((x1, x2, x3))
x = sm.add_constant(x, prepend=True)

# fit the model and print the regression summary
results = sm.OLS(y, x).fit()
print(results.summary())
```

In addition to defining our independent variables (x1, x2, x3), we stack them into a single x matrix using the **x = np.column_stack((x1,x2,x3))** command, and **sm.add_constant** prepends a column of ones so that the model includes an intercept term.

This allows our model to take multiple parameters into account and generate detailed output: we first fit the model through results = sm.OLS(y,x).fit() and then generate a summary of the results through the print(results.summary()) command.
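As an aside, the same model can also be specified through statsmodels' formula interface, which adds the intercept automatically. This is a sketch of our own rather than part of the original code, and the DataFrame column names are arbitrary:

```python
import pandas as pd
import statsmodels.formula.api as smf

# collect the arrays defined above into a DataFrame
df = pd.DataFrame({"y": y, "x1": x1, "x2": x2, "x3": x3})

# R-style formula: the intercept is included automatically
results = smf.ols("y ~ x1 + x2 + x3", data=df).fit()
print(results.summary())
```

Both approaches fit the same coefficients; only the variable labels in the summary differ.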

Once we run the column-stacked code above, we generate the following output in Python:

```
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.363
Model:                            OLS   Adj. R-squared:                  0.319
Method:                 Least Squares   F-statistic:                     8.342
Date:                Sun, 10 Jul 2016   Prob (F-statistic):           0.000168
Time:                        20:55:58   Log-Likelihood:                -295.45
No. Observations:                  48   AIC:                             598.9
Df Residuals:                      44   BIC:                             606.4
Df Model:                           3
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|     [95.0% Conf. Int.]
------------------------------------------------------------------------------
const        680.7824    102.383      6.649      0.000       474.443   887.122
x1          -116.6710     76.995     -1.515      0.137      -271.844    38.502
x2           -10.5126      2.459     -4.275      0.000       -15.469    -5.556
x3          -168.7207     52.589     -3.208      0.002      -274.706   -62.735
==============================================================================
Omnibus:                       62.282   Durbin-Watson:                   1.256
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              530.769
Skew:                           3.245   Prob(JB):                    5.56e-116
Kurtosis:                      17.942   Cond. No.                         189.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
```

As we can see, the statsmodels library allows us to generate highly detailed output on a level similar to R, with additional statistics such as skew, kurtosis, R-squared, and AIC. While these readings can also be obtained through scipy or scikit-learn, doing so is a more involved process, and in many cases the statistics must be calculated individually.
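To illustrate, here is a minimal sketch (our own, assuming scikit-learn is installed) of reproducing just two of those statistics by hand:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# reuse the y, x1, x2, x3 arrays defined earlier
X = np.column_stack((x1, x2, x3))
model = LinearRegression().fit(X, y)  # sklearn fits the intercept itself

# R-squared is built in, but adjusted R-squared must be derived manually
n, k = X.shape
r2 = model.score(X, y)
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)
print(r2, adj_r2)  # ~0.363 and ~0.319, matching the statsmodels summary
```

Every further statistic in the summary (F-statistic, AIC, Durbin-Watson, and so on) would need similar hand-rolling, which is why statsmodels is usually the more efficient choice for inference-focused work.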
