
Basics of Linear Regression

Regression analysis is a statistical tool for determining relationships between different types of variables. Variables that remain unaffected by changes made in other variables are known as independent variables, also called predictor or explanatory variables, while those that are affected are known as dependent variables, also called response variables.
Linear regression is a statistical procedure which is used to predict the value of a response variable, on the basis of one or more predictor variables.

There are two types of linear regressions in R:

  • Simple Linear Regression – Value of response variable depends on a single explanatory variable.
  • Multiple Linear Regression – Value of response variable depends on more than one explanatory variable.

Some common applications of linear regression are modeling GDP, the capital asset pricing model (CAPM), oil and gas prices, and medical diagnosis.

1. Simple Linear Regression in R

Simple linear regression enables us to find a relationship between a continuous dependent variable Y and a continuous independent variable X. It is assumed that the values of X are controlled and not subject to measurement error, and that the corresponding values of Y are observed.

The general simple linear regression model to evaluate the value of Y for a value of X:

$$ y_i = \beta_0 + \beta_1 x_i + \varepsilon_i $$

Here, the ith data point, yi, is determined by the variable xi;

β0 and β1 are regression coefficients;

εi is the random error in the ith observation, that is, the part of yi that is not explained by xi.
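As a minimal sketch of this model, the following R code simulates data from the simple linear regression equation and recovers the coefficients with lm(). The true values (2 and 0.5), the sample size, and the noise level are arbitrary assumptions chosen only for illustration.

```r
# Simulate data from y_i = beta0 + beta1 * x_i + eps_i (illustrative values only)
set.seed(1)
n     <- 200
beta0 <- 2      # assumed true intercept
beta1 <- 0.5    # assumed true slope
x     <- runif(n, min = 0, max = 10)
eps   <- rnorm(n, mean = 0, sd = 1)   # random error term
y     <- beta0 + beta1 * x + eps

# Fit the simple linear regression and inspect the estimated coefficients
fit <- lm(y ~ x)
coef(fit)
```

The estimates should land close to the assumed true values, with the gap shrinking as n grows.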

Regression analysis is implemented to do the following:

  • Establish a relationship between independent (x) and dependent (y) variables.
  • Predict the value of y based on a set of values of x1, x2…xn.
  • Identify which independent variables are important for explaining the dependent variable, thereby establishing a more precise and accurate causal relationship between the variables.

2. Multiple Linear Regression in R

In the real world, you may have to deal with more than one predictor variable to evaluate the value of the response variable. In this case, simple linear models cannot be used, and you need multiple linear regression to perform the analysis with multiple predictor variables.

A multiple linear regression model with two explanatory variables can be written as:

$$ y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \varepsilon_i $$

Here, the ith data point, yi, is determined by the levels of the two continuous explanatory variables x1i and x2i, by the three parameters β0, β1, and β2 of the model, and by the residual εi of point i from the fitted surface.

The general multiple regression model with k explanatory variables can be represented as:

$$ y_i = \beta_0 + \sum_{j=1}^{k} \beta_j x_{ji} + \varepsilon_i $$
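A short sketch of the two-variable case in R, assuming simulated data: the coefficients 1, 2, and -3 and the variable names x1 and x2 are invented for illustration and do not come from any dataset in this article.

```r
# Simulate a response that depends on two explanatory variables
set.seed(2)
n  <- 300
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 2 * x1 - 3 * x2 + rnorm(n, sd = 0.5)   # assumed true model

# Multiple linear regression: add predictors on the right-hand side with "+"
fit2 <- lm(y ~ x1 + x2)
coef(fit2)
```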

Least Square Estimation

A simple or multiple linear regression model cannot explain a non-linear relationship between the variables.

Multiple regression equations are defined in the same way as a simple regression equation, and the values of the unknown parameters are calculated by the least squares estimation method.

The least squares estimation method finds the best-fit line for the given data by minimizing the sum of squares of errors. These errors arise from the deviation of the observed points from the proposed line. Each such deviation is called a residual in regression analysis.

The sum of squares of residuals (SSR) is calculated as follows:

$$ SSR = \sum e_i^2 = \sum (y_i - b_0 - b_1 x_i)^2 $$

where e is the error, y and x are the variables, and b0 and b1 are the unknown parameters or coefficients.
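To make the least squares idea concrete, here is a small sketch that computes b0 and b1 from the standard closed-form solution for one predictor and checks them against lm(). The five data points are made up for illustration.

```r
# Toy data (invented values)
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# Closed-form least squares estimates for a single predictor
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0 <- mean(y) - b1 * mean(x)

# Residuals and their sum of squares at the least squares solution
e   <- y - (b0 + b1 * x)
SSR <- sum(e^2)

# lm() minimizes the same quantity, so the coefficients should agree
fit <- lm(y ~ x)
all.equal(unname(coef(fit)), c(b0, b1))
```

Any other choice of intercept or slope gives a larger sum of squared residuals than SSR.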

Checking Model Adequacy

Regression models are used for predictions. For appropriate predictions, it is important to first check the adequacy of these models.

R Squared and Adjusted R Squared methods are used to check the adequacy of models.

High values of R-squared represent a strong correlation between the response and predictor variables, while low values mean that the developed regression model is not appropriate for the required predictions.

The value of R-squared lies between 0 and 1, where 0 means no linear relationship in the sample data and 1 means an exact linear relationship.

One can calculate R-squared using the following formula:

$$ R^2 = 1 - \frac{SSR}{SST} $$

Here, SST (total sum of squares) and SSR (sum of squares of residuals) are the total sum of squares and the sum of squares of errors, respectively.
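The formula can be verified by hand in R. This sketch assumes R's built-in cars dataset (not used elsewhere in this article) and treats SSR as the sum of squared residuals.

```r
# Fit a simple model on the built-in cars dataset (an illustrative choice)
fit <- lm(dist ~ speed, data = cars)

SSR <- sum(residuals(fit)^2)                   # sum of squares of residuals
SST <- sum((cars$dist - mean(cars$dist))^2)    # total sum of squares
R2  <- 1 - SSR / SST

# Should match the value reported by summary()
all.equal(R2, summary(fit)$r.squared)
```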

Use adjusted R-squared when deciding whether to add a new explanatory variable to an existing regression model. Adjusted R-squared has two properties: it depends on the number of explanatory variables, and it includes a statistical penalty for each new predictor variable in the regression model.

Similar to R-squared, adjusted R-squared measures the proportion of the variation in the dependent variable explained by all explanatory variables.

We can calculate adjusted R-squared as follows:

$$ R^2_{adj} = R^2 - \frac{k(1 - R^2)}{n - k - 1} $$

Here, n represents the number of observations and k represents the number of explanatory variables.
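A quick check of the adjusted R-squared formula, assuming R's built-in mtcars dataset and an arbitrary two-predictor model chosen for illustration. Here k counts the explanatory variables and excludes the intercept.

```r
# Two explanatory variables, so k = 2 (model choice is illustrative)
fit <- lm(mpg ~ wt + hp, data = mtcars)
n   <- nrow(mtcars)
k   <- 2

R2    <- summary(fit)$r.squared
adjR2 <- R2 - k * (1 - R2) / (n - k - 1)

# Should match the adjusted value reported by summary()
all.equal(adjR2, summary(fit)$adj.r.squared)
```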

Regression Assumptions

When building a regression model, statisticians make some basic assumptions to ensure the validity of the regression model. These are:

  • Linearity – Assumes a linear relationship between the dependent and independent variables. Because it treats the predictor variables as fixed values (see above), linearity is really only a restriction on the parameters.
  • Independence – This assumes that the errors of the response variables are uncorrelated with each other.
  • Homoscedasticity – This means that different response variables have the same variance in their errors, regardless of the values of the predictor variables. In practice this assumption is often invalid (i.e. the errors are heteroscedastic) if the response variables can vary over a wide scale.
  • Normality – Assumes normal distribution of errors in the collected samples.

The regression model may be insufficient for making predictions if we violate any of these assumptions.
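One common way to examine these assumptions is to look at the residuals of a fitted model. The sketch below assumes the built-in cars dataset and uses two standard diagnostic plots plus the Shapiro-Wilk test; it is an informal check rather than a complete validation procedure.

```r
fit <- lm(dist ~ speed, data = cars)   # illustrative model
res <- residuals(fit)

# Linearity and homoscedasticity: residuals vs fitted values should show
# no curve and no funnel shape
plot(fitted(fit), res, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# Normality: points should lie close to the reference line
qqnorm(res)
qqline(res)

# A formal normality test on the residuals (small p-values suggest non-normality)
sw <- shapiro.test(res)
sw$p.value
```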

Note: Complexity of a regression model increases with increasing number of parameters.


Multicollinearity

Multicollinearity refers to redundancy among the explanatory variables. It exists when two or more explanatory variables have an exact or approximate linear relationship with one another, which leads to inaccurate parameter estimates.

One can detect multicollinearity by calculating the variance inflation factor (VIF) with the following formula:

$$ \mathrm{VIF}_i = \frac{1}{1 - R_i^2} $$

Here, Ri2 is the coefficient of determination (R-squared) obtained by regressing the explanatory variable xi on all the other explanatory variables.

In a regression model, multicollinearity is indicated when the estimated regression coefficients change significantly as explanatory variables are added or deleted, or when the VIF is high (5 or above) for an explanatory variable.
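The VIF can be computed directly from its definition using base R. This sketch assumes the built-in mtcars dataset and three predictors (wt, hp, disp) picked only for illustration: each predictor is regressed on the others, and the resulting R-squared is plugged into the formula.

```r
predictors <- c("wt", "hp", "disp")   # illustrative choice of explanatory variables

vif <- sapply(predictors, function(p) {
  others <- setdiff(predictors, p)
  # Auxiliary regression of one predictor on all the other predictors
  aux <- lm(reformulate(others, response = p), data = mtcars)
  1 / (1 - summary(aux)$r.squared)
})
vif
```

Values near 1 indicate little redundancy, while larger values flag predictors that are well explained by the others.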

Following are some impacts of Multicollinearity:

  • Wrong estimation of regression coefficients
  • Inability to estimate standard errors and coefficients.
  • High variance and covariance in ordinary least squares for closely related variables, making it difficult to assess the estimation precisely.
  • Relatively large standard errors, which increase the chance of failing to reject the null hypothesis.
  • Deflated t-statistics and degradation of model predictability.

Having seen the effects of multicollinearity, remove it where possible in one of the following ways:

  • Specifying the regression model again.
  • Using prior information or restrictions while estimating the coefficients.
  • Collecting new data or increasing the sample size.

Working with R Linear Regression

When two numerical variables, X and Y, have at least a moderate correlation, established through both the correlation coefficient and a scatterplot, they are in some type of linear relationship. Researchers often use that relationship to predict the value of Y for a given value of X using a straight line.

X and Y are called the explanatory and response variables, or the independent and dependent variables, respectively. (If X changes, the slope explains how much Y is expected to change in response.) The condition of linearity is checked by creating a scatterplot, which must show a linear pattern.

The following formula shows the regression line:

Y = mx + b

where m is the slope of the line (the change in y over the change in x) and b is the y-intercept (the place on the y-axis where the value of x is 0). When estimating the value of Y, the slope is calculated by multiplying the correlation between X and Y by the ratio of the standard deviation of the y-values to the standard deviation of the x-values.

Using the least squares method, one can obtain the best fit line. This method takes the line with the least possible sum of squares of errors (SSE).

NOTE: Never do a regression analysis unless you have already found at least a moderately strong correlation between the two variables. The rule of thumb is that the correlation should be at or beyond positive or negative 0.50. If the data does not resemble a line to begin with, you should not try to use a line to fit the data and make predictions.

The y-intercept, b, of the best fit line is obtained by subtracting the product of the slope and the mean of the x-values from the mean of the y-values. The formula is b = ȳ – m(x̄)

In the context of regression, the slope is interpreted from the change in y-values with respect to change in x-values.

The y-intercept, which is sometimes meaningful and sometimes not, is the place where the regression line crosses the Y-axis, where x=0.
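The slope and intercept formulas described above can be checked numerically. This sketch assumes the built-in cars dataset, with speed as X and dist as Y; the variable names m and b follow the Y = mx + b notation.

```r
x <- cars$speed
y <- cars$dist

# Slope: correlation times the ratio of standard deviations
m <- cor(x, y) * sd(y) / sd(x)

# Intercept: mean of y minus slope times mean of x
b <- mean(y) - m * mean(x)

# lm() reports the intercept first, then the slope
fit <- lm(y ~ x)
all.equal(unname(coef(fit)), c(b, m))
```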

In general, Y is the variable that you want to predict, and X is the variable you are using to make that prediction.

Note that the slope of the best-fitting line can be a negative number because the correlation can be a negative number. A negative slope indicates that the line is going downhill. For example, an increase in police officers maps to a decrease in the number of crimes in a linear fashion; the correlation and hence the slope of the best-fitting line is negative in this case.

Always make sure to use proper units when interpreting a slope. If you do not consider units, you would not understand the correlation between the two variables. For example, if Y is an exam score and X is the study time, and you find that the slope of the equation is 5, the number does not mean anything without any units.

Simple Linear Regression in R

Linear regression analysis involves large and complex calculations that are not feasible with a simple calculator. R is a popular tool that provides several built-in functions and commands for performing linear regression.

While implementing statistical tools, statisticians may come across data sets too large to be analyzed with commonly used software tools. Such data is known as Big Data, and its size may range from a few dozen terabytes to several petabytes. R is a statistical tool that can process large amounts of data and generate useful information for making predictions.

The five famous functions in R are as follows:

Famous Five Functions in R

  • sum(x) – Calculates the sum of all x values.
  • sum(y) – Calculates the sum of all y values.
  • sum(x^2) – Calculates the sum of the squares of all the values of x.
  • sum(y^2) – Calculates the sum of the squares of all the values of y.
  • sum(x * y) – Calculates the sum of the products of the corresponding values of x and y.

These are the famous five functions for calculation in regression.

One of the calculations in regression is the corrected sum of squares. The formula for the corrected sum of squares of x is:

$$ SSX = \sum x^2 - \frac{(\sum x)^2}{n} $$

The corrected sum of squares of y, SSY, is calculated in a similar manner.

The sum of products is calculated using the following formula:

$$ SSXY = \sum xy - \frac{(\sum x)(\sum y)}{n} $$
Note that for accuracy within a computer program, it is best not to use these shortcuts formulae, because they involve differences (minus) between potentially very large numbers (sums of squares) and hence are potentially subject to rounding errors. Instead, when programming, use the following equivalent formulae:

$$ SSX = \sum (x - \bar{x})^2 $$

$$ SSY = \sum (y - \bar{y})^2 $$

$$ SSXY = \sum (x - \bar{x})(y - \bar{y}) $$
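The following sketch computes the famous five sums and compares the shortcut formulas with the centered formulas. It assumes the built-in cars dataset; all variable names are illustrative.

```r
x <- cars$speed
y <- cars$dist
n <- length(x)

# The famous five
sum_x  <- sum(x)
sum_y  <- sum(y)
sum_x2 <- sum(x^2)
sum_y2 <- sum(y^2)
sum_xy <- sum(x * y)

# Shortcut formulas (differences of large numbers, prone to rounding error)
SSX_short  <- sum_x2 - sum_x^2 / n
SSY_short  <- sum_y2 - sum_y^2 / n
SSXY_short <- sum_xy - sum_x * sum_y / n

# Centered formulas (preferred when programming)
SSX  <- sum((x - mean(x))^2)
SSY  <- sum((y - mean(y))^2)
SSXY <- sum((x - mean(x)) * (y - mean(y)))

# The slope of the best fit line is the ratio SSXY / SSX
slope <- SSXY / SSX
```

On small, well-scaled data such as this, both versions agree to within floating-point tolerance; the difference matters when the sums of squares are very large relative to the corrected values.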

An important point is that two datasets with exactly the same slope and intercept can look quite different. What differs is the degree of scatter around the line, which is measured by the sum of squares of errors (SSE).

The SSE is calculated with the following formula, where m and b are the slope and intercept of the fitted line:

$$ SSE = \sum (y - b - mx)^2 $$
The other calculations in regression are the analysis of variance and the unreliability estimates for the parameters, for which you need to calculate the standard error of the intercept and the standard error of the slope. After calculating these values, one can predict and plot the variables.
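As a sketch of these remaining calculations, the code below computes the SSE and the standard errors of the slope and intercept by hand, using the standard textbook formulas for simple regression, and compares them with summary(). The cars dataset and the variable names are assumptions for illustration.

```r
x <- cars$speed
y <- cars$dist
n <- length(x)

SSX  <- sum((x - mean(x))^2)
SSY  <- sum((y - mean(y))^2)
SSXY <- sum((x - mean(x)) * (y - mean(y)))

m <- SSXY / SSX               # slope
b <- mean(y) - m * mean(x)    # intercept

# Sum of squares of errors and error variance (n - 2 degrees of freedom)
SSE <- sum((y - b - m * x)^2)
s2  <- SSE / (n - 2)

# Standard errors of the slope and the intercept
se_m <- sqrt(s2 / SSX)
se_b <- sqrt(s2 * sum(x^2) / (n * SSX))

# Compare with the values reported by summary() (intercept first, then slope)
fit <- lm(y ~ x)
summary(fit)$coefficients[, "Std. Error"]
```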

Linear Model Results Objects

As you know, the simplest form of regression is similar to a correlation, where you have two variables: a response variable and a predictor. We use the lm() function for this kind of linear modeling in R. The following example applies the lm() and summary() functions to a dataset named fw, which has two columns, count and speed, that may be correlated:

> fw.lm = lm(count ~ speed, data = fw)

lm() command performs linear regression analysis for the count and speed data and stores the result in the fw.lm object.

> summary(fw.lm)

The summary() command takes the fw.lm object as an argument and displays information about the object’s components.

The names() command displays the other details contained in the result object as below:

> names(fw.lm)

It shows details contained in the result object.

We can extract the components which come in the output of names()command using the $ syntax, as follows:

> fw.lm$coefficients

You can extract the coefficients in a result object using the coef() command. To use this command, simply give the name of the result object as follows:

> coef(fw.lm)

It finds the coefficients in the regression analysis.

A confidence interval is a range of values, calculated from the given sample data, that is likely to contain the true value of a parameter. Its width indicates the reliability of the parameter estimate.

We can obtain the confidence interval on the coefficients using the confint()command as follows:

> confint(fw.lm)

It obtains the confidence intervals on the coefficients in the regression analysis.

You can use the fitted() command to extract values that are used to plot the regression line.

We can obtain the fitted values, residuals, and formula using the respective commands:

> fitted(fw.lm)        # Extracts the values used to plot the regression line.
> residuals(fw.lm)  # Shows the residuals in the regression analysis.
> formula(fw.lm)   # Accesses the formula used in linear regression model.
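The commands in this section can be run end to end as follows. The fw data is not included in this article, so the data frame below is a hypothetical stand-in with invented values; only the column names count and speed come from the text.

```r
# Hypothetical stand-in for the fw data frame (values invented for illustration)
fw <- data.frame(
  speed = c(2, 4, 6, 10, 15, 22, 28, 35),
  count = c(8, 12, 11, 18, 20, 27, 30, 41)
)

fw.lm <- lm(count ~ speed, data = fw)

summary(fw.lm)          # coefficient table, R-squared, F-statistic
names(fw.lm)            # components stored in the result object
fw.lm$coefficients      # extract a component with $
coef(fw.lm)             # same coefficients via the extractor function
confint(fw.lm)          # 95% confidence intervals by default
fitted(fw.lm)           # values on the regression line
residuals(fw.lm)        # observed minus fitted values
formula(fw.lm)          # the model formula
```

Extractor functions such as coef() and residuals() are generally preferable to the $ syntax because they work consistently across model types.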

Model Building

When you have several predictor variables, you want to build the model that best describes the data using only statistically significant terms. Two strategies are commonly used to build such a model:

  • Forward Stepwise Regression using add1() Command – Start off with the single best variable and add more variables to build your model into a more complex form.
  • Backward Stepwise Deletion using drop1() Command – Put all the variables in and reduce the model by removing variables until you have only significant terms.

Adding Terms with Forward Stepwise Regression: You can use add1() command to see which of the predictor variables is the best one to add next.

> add1(object, scope)

This is the general syntax of the add1() command. The object is the linear model you are building, and scope specifies the candidate terms for inclusion in the new model. The result is a table of terms and the effect each term would have if added to the model; add1() itself does not change the model.

> add1(mf.lm, scope = mf)

It lists the effect of adding each candidate variable in mf, one at a time, to the linear model mf.lm.

> summary(mf.lm)

It displays a summary of the current model.

The add1() command can also run a significance test on each candidate term. For example:

> add1(mf.lm, scope = mf, test = 'F')

It reports an F-test for each candidate variable, showing whether adding that variable would significantly improve the model.

Removing Terms with Backward Deletion:

First, create a full model, then use the drop1() command to see which terms can be deleted:

> mf.lm = lm(Length ~ ., data = mf)

It creates the full model using the predictor and response variables.

> drop1(mf.lm, test = 'F')

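It reports an F-test for each term in the model, showing the effect of dropping that term. The test = 'F' argument selects the F-test; it is not the name of a term to drop.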
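A runnable sketch of both strategies follows. The mf data is not provided in this article, so the data frame below is simulated: the predictor names Speed, Algae, and NO3 and all values are hypothetical, and only Speed influences Length by construction. The scope is given as an explicit formula, which is the form described in R's documentation for add1().

```r
# Hypothetical stand-in for the mf data frame (simulated values)
set.seed(3)
n  <- 40
mf <- data.frame(
  Speed = runif(n, 5, 30),
  Algae = runif(n, 0, 100),
  NO3   = runif(n, 1, 3)
)
mf$Length <- 10 + 0.4 * mf$Speed + rnorm(n, sd = 1.5)

# Forward step: start from the intercept-only model and test each candidate
mf.lm0 <- lm(Length ~ 1, data = mf)
fwd <- add1(mf.lm0, scope = ~ Speed + Algae + NO3, test = "F")
fwd

# Backward step: start from the full model and test dropping each term
mf.lm <- lm(Length ~ ., data = mf)
bwd <- drop1(mf.lm, test = "F")
bwd

# add1() and drop1() only report; update() actually changes the model
mf.lm1 <- update(mf.lm0, . ~ . + Speed)
```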

Comparing Models

It is often useful to compare models built from the same dataset. This is useful as you always try to create a model that most adequately describes the data with the minimum number of terms.

You can compare two linear regression models by using the anova() command. The syntax is as follows:

> anova(mf.lm1, mf.lm2)

It compares two linear models mf.lm1 and mf.lm2. These models act as arguments to the anova() command.
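As an illustration, the sketch below compares two nested models with anova(). It assumes the built-in mtcars dataset, and the model names fit1 and fit2 are hypothetical; the comparison is meaningful because the smaller model's terms are a subset of the larger model's.

```r
fit1 <- lm(mpg ~ wt, data = mtcars)        # smaller model
fit2 <- lm(mpg ~ wt + hp, data = mtcars)   # adds one term

# F-test of whether the extra term significantly reduces the residual sum of squares
cmp <- anova(fit1, fit2)
cmp
```

A small p-value in the second row suggests that the additional term is worth keeping.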

Curvilinear Regression

Linear regression models do not have to be in the form of a straight line. As long as you can describe the mathematical relationship, you can carry out linear regression. But when this mathematical relationship is not in straight line form, then it is curvilinear.

We estimate the regression by adding more predictor variables to the multiple regression formula. A typical equation is as follows:

$$ y = m_1 x_1 + m_2 x_2 + \dots + m_n x_n + c $$

It takes a form similar to the multiple linear regression formula.

A logarithmic relation in curvilinear regression is as follows:

y = m log(x) + c

A polynomial relationship in curvilinear regression is as follows:

$$ y = m_1 x + m_2 x^2 + \dots + m_n x^n + c $$
We can say that logarithmic regression is similar to simple regression and polynomial regression is similar to multiple regression.
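Both curvilinear forms can be fitted with lm() by transforming the predictor inside the formula. This is a minimal sketch on simulated data; the coefficients and noise levels are arbitrary assumptions.

```r
set.seed(4)
x <- seq(1, 50, length.out = 100)

# Logarithmic relationship: y = m log(x) + c
y_log   <- 3 * log(x) + 2 + rnorm(100, sd = 0.2)
fit_log <- lm(y_log ~ log(x))
coef(fit_log)

# Polynomial (quadratic) relationship: I() protects ^ inside the formula
y_poly   <- 1 + 0.5 * x - 0.02 * x^2 + rnorm(100, sd = 0.5)
fit_poly <- lm(y_poly ~ x + I(x^2))
coef(fit_poly)
```

The logarithmic fit has a single predictor term, like simple regression, while the polynomial fit has one term per power of x, like multiple regression.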

