Regression analysis is a statistical tool for determining relationships between different types of variables. Variables that remain unaffected by changes in other variables are known as independent variables (also called predictor or explanatory variables), while those that are affected are known as dependent variables (also called response variables).
Linear regression is a statistical procedure which is used to predict the value of a response variable, on the basis of one or more predictor variables.
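There are two types of linear regression in R: simple linear regression, which uses one predictor variable, and multiple linear regression, which uses two or more.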
Some common applications of linear regression are estimating GDP, oil and gas prices, medical diagnosis, and capital asset pricing (CAPM).
Simple linear regression in R enables us to find a relationship between a continuous dependent variable Y and a continuous independent variable X. It is assumed that the values of X are controlled and not subject to measurement error, and that the corresponding values of Y are observed.
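The general simple linear regression model for the value of Y at a given value of X is:
$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$$
Here, the ith data point, $y_i$, is determined by the variable $x_i$;
$\beta_0$ and $\beta_1$ are the regression coefficients (intercept and slope);
$\varepsilon_i$ is the random error in the ith observation of y, that is, the part of $y_i$ not explained by $x_i$.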
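As a minimal sketch of this model, the R code below simulates data from it and recovers the coefficients with lm(). The data and the "true" values of 2 and 0.5 are assumptions made up for illustration; they do not come from any dataset in this article.

```r
# Hypothetical simulated data: beta0 = 2 and beta1 = 0.5 are assumed values.
set.seed(42)
x   <- seq(1, 50)
eps <- rnorm(length(x), mean = 0, sd = 1)  # random error term
y   <- 2 + 0.5 * x + eps                   # y_i = beta0 + beta1 * x_i + eps_i

sim.lm <- lm(y ~ x)  # least squares estimates of beta0 and beta1
coef(sim.lm)
```

The estimates printed by coef() should lie close to the assumed values, with the gap shrinking as the error standard deviation decreases.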
Regression analysis is implemented to do the following:
In the real world, you may have to deal with more than one predictor variable when evaluating the response variable. In that case a simple linear model cannot be used, and you need multiple linear regression in R to analyze several predictor variables together.
A multiple linear regression model with two explanatory variables can be written as:
$$y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \varepsilon_i$$
Here, the ith data point, $y_i$, is determined by the levels of the two continuous explanatory variables $x_{1i}$ and $x_{2i}$, by the three parameters $\beta_0$, $\beta_1$, and $\beta_2$ of the model, and by the residual $\varepsilon_i$ of point i from the fitted surface.
The general multiple regression model with k explanatory variables can be represented as:
$$y_i = \beta_0 + \sum_{j=1}^{k} \beta_j x_{ji} + \varepsilon_i$$
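A short sketch of fitting a model with two explanatory variables in R follows. The variables x1 and x2 and the coefficients used to generate y are hypothetical and chosen only to make the example self-contained.

```r
# Hypothetical simulated data with two explanatory variables.
set.seed(1)
n  <- 100
x1 <- runif(n, 0, 10)
x2 <- runif(n, 0, 5)
y  <- 1 + 2 * x1 - 3 * x2 + rnorm(n)  # assumed true coefficients: 1, 2, -3

multi.lm <- lm(y ~ x1 + x2)  # each term after ~ adds one predictor
coef(multi.lm)               # estimates of beta0, beta1 and beta2
```

Further predictors are added to the formula in the same way, separated by +.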
Simple and multiple linear regression models of this form cannot explain a nonlinear relationship between the variables unless the predictors are first transformed.
Multiple regression equations are fitted in the same way as a simple regression equation: the values of the unknown parameters are calculated by the least squares estimation method.
The least squares method chooses the line that minimizes the sum of squared errors for the given data. These errors are the deviations of the observed points from the proposed line; each such deviation is called a residual in regression analysis.
The sum of squares of residuals (SSR) is calculated as follows:
$$SSR = \sum e^2 = \sum \left(y - (b_0 + b_1 x)\right)^2$$
where e is the error (residual), y and x are the variables, and $b_0$ and $b_1$ are the unknown parameters or coefficients.
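The SSR formula can be checked by hand in R. The sketch below uses a small made-up dataset, computes SSR from the fitted coefficients, and compares it with the value R derives from the residuals.

```r
# Hypothetical data for illustration only.
x <- c(1, 2, 3, 4, 5, 6)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1, 12.2)

fit <- lm(y ~ x)
b0  <- coef(fit)[["(Intercept)"]]
b1  <- coef(fit)[["x"]]

e   <- y - (b0 + b1 * x)  # residuals computed from the formula
ssr <- sum(e^2)           # sum of squares of residuals

all.equal(ssr, sum(residuals(fit)^2))  # same value as R's own residuals

# Any other slope gives a larger SSR, which is what "least squares" means.
ssr.other <- sum((y - (b0 + (b1 + 0.1) * x))^2)
ssr.other > ssr
```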
Regression models are used for predictions. Before relying on those predictions, it is important to check the adequacy of the model.
R Squared and Adjusted R Squared are used to check the adequacy of models.
High values of R Squared indicate a strong relationship between the response and predictor variables, while low values mean that the regression model explains little of the variation and is not appropriate for the required predictions.
The value of R Squared lies between 0 and 1, where 0 means the model explains none of the variation in the sample data and 1 means an exact linear relationship.
One can calculate R Squared using the following formula:
$$R^2 = 1 - \frac{SSR}{SST}$$
Here, SST (total sum of squares) is the sum of squared deviations of y from its mean, and SSR (sum of squares of residuals) is the sum of squares of errors.
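As a rough illustration, R Squared can be computed from SSR and SST and compared with the value reported by summary(). The data below are invented for the example.

```r
# Hypothetical data for illustration only.
x <- c(1, 2, 3, 4, 5, 6, 7, 8)
y <- c(1.8, 4.1, 5.9, 8.3, 9.7, 12.2, 13.8, 16.4)

fit <- lm(y ~ x)
ssr <- sum(residuals(fit)^2)  # sum of squares of residuals
sst <- sum((y - mean(y))^2)   # total sum of squares
r2  <- 1 - ssr / sst

all.equal(r2, summary(fit)$r.squared)  # matches R's reported R Squared
```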
Use adjusted R Squared when deciding whether to add a new explanatory variable to an existing regression model. Adjusted R Squared has two properties: it depends on the number of explanatory variables, and it includes a statistical penalty for each new predictor variable in the regression model.
Like R Squared, adjusted R Squared measures the proportion of the variation in the dependent variable explained by all the explanatory variables.
We can calculate the Adjusted R Squared as follows:
$$R^2_{adj} = R^2 - \frac{k(1 - R^2)}{n - k - 1}$$
Here, n represents the number of observations and k represents the number of explanatory variables (the parameters other than the intercept).
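The sketch below applies the adjusted R Squared formula to a simulated model with k = 2 predictors, one of which is deliberately unrelated to the response. All variable names and values are assumptions for illustration.

```r
# Hypothetical simulated data: x2 has no real effect on y.
set.seed(7)
n  <- 30
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 2 * x1 + rnorm(n)

fit <- lm(y ~ x1 + x2)
r2  <- summary(fit)$r.squared
k   <- 2                                  # number of explanatory variables

adj <- r2 - k * (1 - r2) / (n - k - 1)    # formula from the text
all.equal(adj, summary(fit)$adj.r.squared)
adj < r2                                  # the penalty lowers the value
```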
When building a regression model, statisticians make some basic assumptions to ensure the validity of the regression model. These are:
The regression model may be insufficient for making predictions if we violate any of these assumptions.
Note: Complexity of a regression model increases with increasing number of parameters.
Multicollinearity refers to redundancy among the explanatory variables. It exists when two or more explanatory variables have an exact or approximate linear relationship with one another, which leads to inaccurate parameter estimates.
One can detect multicollinearity by calculating the variance inflation factor (VIF) with the following formula:
$$VIF_i = \frac{1}{1 - R_i^2}$$
Here, $R_i^2$ is the coefficient of determination obtained by regressing the explanatory variable $x_i$ on all the other explanatory variables.
In a regression model, multicollinearity is indicated when the estimated regression coefficients change significantly as explanatory variables are added or deleted, or when the VIF is high (5 or above).
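One way to compute the VIF using only base R is to regress each explanatory variable on the others, as in the sketch below. The predictors are simulated, with x2 built as a near copy of x1 so that the two are collinear; the names and the data frame preds are hypothetical.

```r
# Hypothetical predictors: x2 is almost a copy of x1, x3 is independent.
set.seed(3)
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)
x3 <- rnorm(n)
preds <- data.frame(x1, x2, x3)

# For each predictor, regress it on the others and apply VIF = 1 / (1 - R^2).
vif <- sapply(names(preds), function(v) {
  others <- setdiff(names(preds), v)
  r2 <- summary(lm(reformulate(others, response = v), data = preds))$r.squared
  1 / (1 - r2)
})
vif
```

With data built this way, x1 and x2 should show VIF values well above the threshold of 5, while x3 stays close to 1.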
Following are some impacts of Multicollinearity:
Having seen the effects of multicollinearity, remove it where possible in the following ways:
When two numerical variables, X and Y, have at least a moderate correlation, established through both the correlation coefficient and a scatterplot, they are in some type of linear relationship. Researchers often use that relationship to predict the value of Y for a given value of X using a straight line.
X and Y are called the explanatory and response variables, or the independent and dependent variables, respectively: if X changes, the slope tells how much Y is expected to change in response. The condition of linearity is checked by creating a scatterplot, which must show a linear pattern.
The following formula shows the regression line:
$$y = mx + b$$
where m is the slope of the line (the change in y over the change in x) and b is the y-intercept (the place on the y-axis where the value of x is 0). The slope is calculated by multiplying the correlation r between X and Y by the ratio of the standard deviation of the y-values to the standard deviation of the x-values: $m = r\,(s_y / s_x)$.
Using the least squares method, one can obtain the best fit line. This method takes the line with the least possible sum of squares of errors (SSE).
NOTE: Never do a regression analysis unless you have already found at least a moderately strong correlation between the two variables. The rule of thumb is that the correlation should be at or beyond either positive or negative 0.50. If the data do not resemble a line to begin with, you should not try to use a line to fit the data and make predictions.
The y-intercept, b, of the best-fit line is obtained by subtracting the product of the slope and the mean of the x-values from the mean of the y-values. The formula is $b = \bar{y} - m\bar{x}$.
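The slope and intercept rules described above can be verified in R against lm(). The study-hours and exam-score values below are invented for the example.

```r
# Hypothetical data: hours of study (x) and exam score (y).
x <- c(2, 4, 5, 7, 9, 10)
y <- c(55, 62, 66, 74, 83, 85)

m <- cor(x, y) * sd(y) / sd(x)  # slope: correlation times sd(y) / sd(x)
b <- mean(y) - m * mean(x)      # intercept: mean of y minus slope times mean of x
c(intercept = b, slope = m)

all.equal(unname(coef(lm(y ~ x))), c(b, m))  # same line as lm() finds
```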
In the context of regression, the slope is interpreted as the change in the y-values with respect to a change in the x-values.
The y-intercept, which is sometimes meaningful and sometimes not, is the place where the regression line crosses the y-axis, where x = 0.
In general, Y is the variable that you want to predict, and X is the variable you are using to make that prediction.
Note that the slope of the best-fitting line can be a negative number because the correlation can be a negative number. A negative slope indicates that the line is going downhill. For example, if an increase in police officers maps to a decrease in the number of crimes in a linear fashion, the correlation and hence the slope of the best-fitting line are negative.
Always make sure to use proper units when interpreting a slope. If you do not consider units, you cannot interpret the relationship between the two variables. For example, if Y is an exam score and X is the study time, and you find that the slope of the equation is 5, the number does not mean anything without units; stated as, say, 5 points per hour of study, it does.
Linear regression analysis involves large and complex calculations that are not feasible on a simple calculator. R is a popular tool that provides several built-in functions and commands for performing linear regression.
While implementing statistical tools, statisticians may come across data sets too large to be analyzed with commonly used software tools. Such data is known as Big Data, and its size may range from a few dozen terabytes to several petabytes. R is a statistical tool that can process large amounts of data and generate useful information for making predictions.
The famous five functions used for calculation in regression are as follows:
[Figure: Famous Five Functions in R]
One of the calculations in regression is the corrected sum of squares. The formula for the sum of squares of x is:
$$SSX = \sum x^2 - \frac{(\sum x)^2}{n}$$
One can calculate the sum of squares of y (SSY) in a similar manner.
The calculation of the sum of products uses the following formula:
$$SSXY = \sum xy - \frac{(\sum x)(\sum y)}{n}$$
Note that for accuracy within a computer program, it is best not to use these shortcut formulae, because they involve differences between potentially very large numbers (sums of squares) and hence are subject to rounding errors. Instead, when programming, use the following equivalent formulae:
$$SSX = \sum (x - \bar{x})^2$$
$$SSY = \sum (y - \bar{y})^2$$
$$SSXY = \sum (x - \bar{x})(y - \bar{y})$$
where $\bar{x}$ and $\bar{y}$ are the means of the x-values and y-values.
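A small sketch comparing the shortcut and centred formulae in R is shown below, using made-up values. On data of this size both forms agree; the centred versions are the ones recommended above for programs.

```r
# Hypothetical data for illustration only.
x <- c(1, 3, 4, 6, 8, 9)
y <- c(2, 5, 7, 10, 15, 16)
n <- length(x)

# Shortcut formulae (differences of large sums).
ssx.short  <- sum(x^2) - sum(x)^2 / n
ssxy.short <- sum(x * y) - sum(x) * sum(y) / n

# Centred formulae (preferred when programming).
ssx  <- sum((x - mean(x))^2)
ssy  <- sum((y - mean(y))^2)
ssxy <- sum((x - mean(x)) * (y - mean(y)))

all.equal(ssx, ssx.short)
all.equal(ssxy, ssxy.short)

# The least squares slope equals SSXY / SSX, which lm() confirms.
all.equal(ssxy / ssx, coef(lm(y ~ x))[["x"]])
```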
An important issue is that two datasets with exactly the same slope and intercept can look quite different. What differs is the degree of scatter of the points around the line.
The degree of scatter is calculated as the sum of squares of errors (SSE) using the following formula:
$$SSE = \sum (y - a - bx)^2$$
where a is the intercept and b is the slope of the fitted line.
The other calculations in regression are the analysis of variance and the unreliability estimates for the parameters, for which you need the standard error of the intercept and the standard error of the slope. After calculating these values, one can predict and plot the variables.
As you know, the simplest form of regression is similar to a correlation, where you have two variables: a response variable and a predictor. We use the lm() function for this kind of linear modeling in R. The examples below apply the lm() and summary() functions to a dataset named fw, which has two columns, count and speed, that may be correlated:
> fw.lm = lm(count ~ speed, data = fw)
The lm() command performs linear regression analysis for the count and speed data and stores the result in the fw.lm object.
> summary(fw.lm)
The summary() command takes the fw.lm object as an argument and displays information about the object's components.
The names() command lists the components contained in the result object:
> names(fw.lm)
We can extract the components listed in the output of the names() command using the $ syntax, as follows:
> fw.lm$coefficients
You can also extract the coefficients from a result object using the coef() command. To use this command, simply give the name of the result object as follows:
> coef(fw.lm)
It returns the coefficients of the regression analysis.
A confidence interval in statistics defines a range of values that indicates the reliability of a parameter estimate. This range is calculated from the given set of sample data.
We can obtain the confidence intervals on the coefficients using the confint() command as follows:
> confint(fw.lm)
It returns the confidence intervals on the coefficients in the regression analysis.
You can use the fitted() command to extract the values that are used to plot the regression line.
We can obtain the fitted values, residuals, and formula using the respective commands:
> fitted(fw.lm) # Extracts the values used to plot the regression line.
> residuals(fw.lm) # Shows the residuals in the regression analysis.
> formula(fw.lm) # Accesses the formula used in the linear regression model.
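The commands in this section can be run end to end as in the sketch below. The fw dataset itself is not included in this article, so the data frame here is a hypothetical stand-in with the same two column names; its values are made up.

```r
# Hypothetical stand-in for the fw dataset (values are invented).
fw <- data.frame(
  speed = c(2, 3, 5, 9, 14, 24, 29, 34),
  count = c(9, 25, 15, 2, 14, 25, 24, 47)
)

fw.lm <- lm(count ~ speed, data = fw)  # fit count as a function of speed

summary(fw.lm)    # coefficient table, R Squared, F statistic
names(fw.lm)      # components stored in the result object
coef(fw.lm)       # intercept and slope
confint(fw.lm)    # 95% confidence intervals by default
fitted(fw.lm)     # values on the regression line
residuals(fw.lm)  # observed minus fitted values
formula(fw.lm)    # count ~ speed
```

For each observation, the fitted value plus the residual gives back the observed count.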
When you have several predictor variables, you want to create the most statistically significant model from the data. There are two strategies for building such a regression model:
Adding Terms with Forward Stepwise Regression: You can use the add1() command to see which of the predictor variables is the best one to add next.
> add1(object, scope)
This is the syntax of the add1() command. The object is the linear model you are building and scope supplies the candidate terms for inclusion in the new model (here the data; a formula also works). The command does not change the model; the result is a list of terms and the effect these terms would have if added to the model.
> add1(mf.lm, scope = mf)
It lists each candidate term taken from mf and the effect of adding that term to the model mf.lm.
> summary(mf.lm)
It displays a summary of the current model, mf.lm.
The add1() command can also test the significance of each candidate variable. For example:
> add1(mf.lm, scope = mf, test = 'F')
It reports an F test for each variable that could be added to mf.lm.
Removing Terms with Backward Deletion:
First create a full model, then examine the terms that could be deleted using the drop1() command:
> mf.lm = lm(Length ~ ., data = mf)
It creates the full model, with Length as the response variable and all other columns of mf as predictors.
> drop1(mf.lm, test = 'F')
It reports an F test for the effect of dropping each term in turn from the mf.lm model.
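Both strategies can be tried on simulated data, as in the sketch below. The mf data frame here is a hypothetical stand-in: only the Length column name comes from the text, while Speed, Algae and NO3 and all values are assumptions. The scope is given as a formula, and the forward search starts from an intercept-only model named mf.lm0 for the purposes of this example.

```r
# Hypothetical stand-in for the mf dataset; only Speed truly affects Length.
set.seed(11)
mf <- data.frame(
  Speed = runif(40, 5, 25),
  Algae = runif(40, 10, 80),
  NO3   = runif(40, 1, 5)
)
mf$Length <- 10 + 0.6 * mf$Speed + rnorm(40)

# Forward: start from an intercept-only model and test each candidate term.
mf.lm0 <- lm(Length ~ 1, data = mf)
a1 <- add1(mf.lm0, scope = ~ Speed + Algae + NO3, test = "F")
a1

# Backward: start from the full model and test dropping each term.
mf.lm <- lm(Length ~ ., data = mf)
d1 <- drop1(mf.lm, test = "F")
d1
```

Neither command modifies the model. Each returns a table from which you choose the term to add or remove, then refit with lm() or update().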
It is often useful to compare models built from the same dataset, because you always try to create a model that most adequately describes the data with the minimum number of terms.
You can compare two linear regression models by using the anova() command. The syntax is as below:
> anova(mf.lm1, mf.lm2)
It compares the two linear models mf.lm1 and mf.lm2, which are passed as arguments to the anova() command.
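A minimal sketch of such a comparison follows, with two nested models fitted to the same simulated data. The data frame dat and its columns are hypothetical; the model names mirror those in the text.

```r
# Hypothetical data: x2 has no real effect on y.
set.seed(5)
dat <- data.frame(x1 = rnorm(50), x2 = rnorm(50))
dat$y <- 3 + 2 * dat$x1 + rnorm(50)

mf.lm1 <- lm(y ~ x1, data = dat)       # smaller model
mf.lm2 <- lm(y ~ x1 + x2, data = dat)  # same model plus one extra term

cmp <- anova(mf.lm1, mf.lm2)  # F test of whether the extra term helps
cmp
```

A large p-value in the second row suggests that the extra term does not improve the fit enough to justify keeping it.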
Linear regression models do not have to be in the form of a straight line. As long as you can describe the mathematical relationship, you can carry out linear regression. When this mathematical relationship is not a straight line, it is curvilinear.
We estimate a curvilinear regression by adding more predictor terms, in the same way as in the multiple regression formula. A typical multiple regression equation is as follows:
$$y = m_1 x_1 + m_2 x_2 + m_3 x_3 + \dots + m_n x_n + c$$
A logarithmic relationship in curvilinear regression is as follows:
$$y = m \log(x) + c$$
A polynomial relationship in curvilinear regression is as follows:
$$y = m_1 x + m_2 x^2 + m_3 x^3 + \dots + m_n x^n + c$$
We can say that logarithmic regression is similar to simple regression and polynomial regression is similar to multiple regression.
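Both curvilinear forms can be fitted with lm() by transforming the predictor inside the formula, as sketched below. The data are simulated, and the coefficients used to generate them are assumptions for illustration.

```r
# Hypothetical simulated data for the two curvilinear forms.
set.seed(9)
x <- seq(1, 20, by = 0.5)

# Logarithmic: y = m log(x) + c, fitted like a simple regression on log(x).
y.log  <- 4 * log(x) + 1 + rnorm(length(x), sd = 0.2)
log.lm <- lm(y.log ~ log(x))
coef(log.lm)

# Polynomial: y = m1 x + m2 x^2 + c, fitted like a multiple regression.
# I() protects ^ so that it means arithmetic squaring inside the formula.
y.poly  <- 0.5 * x^2 - 3 * x + 2 + rnorm(length(x), sd = 1)
poly.lm <- lm(y.poly ~ x + I(x^2))
coef(poly.lm)
```

The models remain linear in their coefficients, which is why lm() can fit them even though the fitted curves are not straight lines.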
© 2020 Data Science Central ® Powered by