*This article was posted by S. Richter-Walsh. *

**A Brief Introduction: **

Linear regression is a classic supervised statistical technique for predictive modelling which is based on the linear hypothesis:

where *y *is the **response** or outcome variable, *m* is the **gradient** of the linear trend-line, *x* is the **predictor** variable and *c* is the **intercept**. The intercept is the point on the y-axis where the value of the predictor *x* is zero.

In order to apply the linear hypothesis to a dataset with the end aim of modelling the situation under investigation, there needs to be a linear relationship between the variables in question. A simple scatterplot is an excellent visual tool to assess linearity between two variables. Below is an example of a linear relationship between miles per gallon (mpg) and engine displacement volume (disp) of automobiles which could be modelled using linear regression. Note that there are various methods of transforming non-linear data to make them appear more linear such as log and square root transformations but we won’t discuss those here.

attach(mtcars)

plot(disp, mpg, col = "blue", pch = 20)

Now, in order to fit a good model, appropriate values for the intercept and slope must be found. R has a nice function, lm(), which creates a linear model from which we can extract the most appropriate intercept and slope (coefficients).

model <- lm(mpg ~ disp, data = mtcars)

coef(model)

(Intercept) disp

29.59985476 -0.04121512

We see that the intercept is set at 29.59985476 on the y-axis and that the gradient is -0.04121512. The negative gradient tells us that there is an inverse relationship between mpg and displacement with one unit increase in displacement resulting in a 0.04 unit decrease in mpg.

How are these intercept and gradient values calculated one may ask? In finding the appropriate values, the goal is to reduce the value of a statistic known as the **mean squared error (MSE)**:

where* y* represents the observed value of the response variable, *y_preds* represents the predicted *y* value from the linear model after plugging in values for the intercept and slope, and *n* is the number of observations in the dataset. Each set of *xy* data points are iterated over to find the squared error, all squared errors are summed and the sum is divided by* n* to get the MSE.

Using the linear model we created earlier, we can obtain *y* predictions and plot them on the scatterplot as a regression line. We use the predict() function in base R and plot the predicted values using abline().

y_preds <- predict(model)

abline(model)

Next, we can calculate the MSE by summing the squared differences between observed *y*values and our predicted *y* values then dividing by the number of observations* n.* This gives a MSE of 9.911209 for this linear model.

errors <- unname((mpg - y_preds) ^ 2)

sum(errors) / length(mpg)

To read the full article, click here.

**Top DSC Resources**

- Article: What is Data Science? 24 Fundamental Articles Answering This Question
- Article: Hitchhiker's Guide to Data Science, Machine Learning, R, Python
- Tutorial: Data Science Cheat Sheet
- Tutorial: How to Become a Data Scientist - On Your Own
- Categories: Data Science - Machine Learning - AI - IoT - Deep Learning
- Tools: Hadoop - DataViZ - Python - R - SQL - Excel
- Techniques: Clustering - Regression - SVM - Neural Nets - Ensembles - Decision Trees
- Links: Cheat Sheets - Books - Events - Webinars - Tutorials - Training - News - Jobs
- Links: Announcements - Salary Surveys - Data Sets - Certification - RSS Feeds - About Us
- Newsletter: Sign-up - Past Editions - Members-Only Section - Content Search - For Bloggers
- DSC on: Ning - Twitter - LinkedIn - Facebook - GooglePlus

Follow us on Twitter: @DataScienceCtrl | @AnalyticBridge

© 2019 Data Science Central ® Powered by

Badges | Report an Issue | Privacy Policy | Terms of Service

**Most Popular Content on DSC**

To not miss this type of content in the future, subscribe to our newsletter.

- Book: Statistics -- New Foundations, Toolbox, and Machine Learning Recipes
- Book: Classification and Regression In a Weekend - With Python
- Book: Applied Stochastic Processes
- Long-range Correlations in Time Series: Modeling, Testing, Case Study
- How to Automatically Determine the Number of Clusters in your Data
- New Machine Learning Cheat Sheet | Old one
- Confidence Intervals Without Pain - With Resampling
- Advanced Machine Learning with Basic Excel
- New Perspectives on Statistical Distributions and Deep Learning
- Fascinating New Results in the Theory of Randomness
- Fast Combinatorial Feature Selection

**Other popular resources**

- Comprehensive Repository of Data Science and ML Resources
- Statistical Concepts Explained in Simple English
- Machine Learning Concepts Explained in One Picture
- 100 Data Science Interview Questions and Answers
- Cheat Sheets | Curated Articles | Search | Jobs | Courses
- Post a Blog | Forum Questions | Books | Salaries | News

**Archives:** 2008-2014 |
2015-2016 |
2017-2019 |
Book 1 |
Book 2 |
More

**Most popular articles**

- Free Book and Resources for DSC Members
- New Perspectives on Statistical Distributions and Deep Learning
- Time series, Growth Modeling and Data Science Wizardy
- Statistical Concepts Explained in Simple English
- Machine Learning Concepts Explained in One Picture
- Comprehensive Repository of Data Science and ML Resources
- Advanced Machine Learning with Basic Excel
- Difference between ML, Data Science, AI, Deep Learning, and Statistics
- Selected Business Analytics, Data Science and ML articles
- How to Automatically Determine the Number of Clusters in your Data
- Fascinating New Results in the Theory of Randomness
- Hire a Data Scientist | Search DSC | Find a Job
- Post a Blog | Forum Questions

## You need to be a member of Data Science Central to add comments!

Join Data Science Central