
# Machine Learning with Python - Linear Regression Model

I am taking a course in Data Science with Python. When I tried to implement a linear regression model to predict a new outcome, the prediction changed every time I re-ran the "cross_validation.train_test_split()" function. I also noticed that whenever I run this cell, it is not only the prediction that changes: the intercept, the coefficients and the mean squared error also change as the training and testing sets change.

My questions:

1) What does the mean squared error (MSE) signify in linear regression? If it measures the error in the predicted outcome, how can I optimise my model so that the MSE is as low as possible?

2) Does the "cross_validation.train_test_split()" function split the data at random into training and testing sets?


### Replies to This Discussion

Wikipedia gives quite a good answer to question 1):

Also in regression analysis, "mean squared error", often referred to as mean squared prediction error or "out-of-sample mean squared error", can refer to the mean value of the squared deviations of the predictions from the true values, over an out-of-sample test space, generated by a model estimated over a particular sample space. This also is a known, computed quantity, and it varies by sample and by out-of-sample test space.

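An obvious approach is to minimize your prediction error, for instance the MSE between the real-world data Y and the predicted data Y'. The MSE is the mean of the squared differences between Y and Y' (the squared L2 distance divided by the number of observations), and you minimize it by changing the parameters that determine Y', which for linear regression are the intercept and the coefficients.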
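As a minimal sketch of what "minimizing the MSE" means for a straight-line model, the snippet below computes the MSE by hand and fits a one-feature least-squares line. The data and the helper names `mean_squared_error` and `fit_line` are made up for illustration and are not from the original question.

```python
def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between true and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def fit_line(x, y):
    """Ordinary least squares for a single feature: returns (intercept, slope)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Made-up data, roughly y = 2x + 1 with a little noise.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [3.1, 4.9, 7.2, 8.8, 11.1]

# The least-squares line is the one with the lowest MSE on this data.
intercept, slope = fit_line(x, y)
fitted = [intercept + slope * xi for xi in x]
mse_fitted = mean_squared_error(y, fitted)

# Any other line (here an arbitrary guess) cannot do better on the same data.
guess = [0.0 + 2.5 * xi for xi in x]
mse_guess = mean_squared_error(y, guess)

print("intercept:", intercept, "slope:", slope)
print("MSE of fitted line:", mse_fitted)
print("MSE of guessed line:", mse_guess)
```

Note that this is the MSE on the data used for fitting. The out-of-sample MSE described in the quote above is computed the same way, but on data the model has not seen.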

2) I do not know this specific (Python) function. However, cross-validation is about the following process:

1. Split your sample randomly to train and test data
2. Fit the model to train set
3. Test the model on test set
4. Calculate the prediction error (e.g. using the MSE)
5. Repeat the process n times

The heuristic behind this is to average out the "noise" across the different splits, and thus to get a more robust picture of how the model performs.
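The five steps above can be sketched with NumPy as follows. The synthetic data, the number of repeats and the 80/20 split are assumptions chosen for illustration; `np.polyfit` with `deg=1` stands in for the linear regression model.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Synthetic data (an assumption for illustration): y = 3x + 2 plus noise.
X = rng.uniform(0, 10, size=100)
y = 3.0 * X + 2.0 + rng.normal(0, 1.0, size=100)

n_repeats = 20
test_fraction = 0.2
n_test = int(len(X) * test_fraction)
mse_scores = []

for _ in range(n_repeats):
    # 1. Split the sample randomly into train and test data.
    order = rng.permutation(len(X))
    test_idx, train_idx = order[:n_test], order[n_test:]

    # 2. Fit the model on the training set (a degree-1 polynomial is a line).
    slope, intercept = np.polyfit(X[train_idx], y[train_idx], deg=1)

    # 3. Test the model on the test set.
    predictions = slope * X[test_idx] + intercept

    # 4. Calculate the prediction error (MSE).
    mse_scores.append(float(np.mean((y[test_idx] - predictions) ** 2)))

# 5. The loop repeated the process n_repeats times; now summarise the scores.
mean_mse = float(np.mean(mse_scores))
print("MSE per split ranges from", min(mse_scores), "to", max(mse_scores))
print("Average MSE over all splits:", mean_mse)
```

Each split gives a different MSE, which is the behaviour described in the question. The average over many splits is a steadier estimate than any single one.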

The cross_validation.train_test_split() function is extremely helpful when creating test/training sets. All you have to do is pass the X and Y arrays (along with the split size) to the function and it creates the sets in one step. In R, I was creating the test/training sets the long way, whereas this function does it simply.

For the split size, I think you have to specify a fraction between 0.0 and 1.0. I typically do an 80/20 split, so 0.8/0.2.
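On question 2: by default, `train_test_split` shuffles the data before splitting, which is why the results change on every run. Passing a fixed `random_state` makes the split repeatable. The sketch below assumes a recent scikit-learn, where the function lives in `sklearn.model_selection` (the older `sklearn.cross_validation` module was removed in version 0.20). The toy arrays are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (an assumption for illustration): 10 samples, 2 features.
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# test_size=0.2 gives an 80/20 split; random_state fixes the shuffle.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Calling it again with the same random_state reproduces the same split.
X_train2, X_test2, y_train2, y_test2 = train_test_split(
    X, y, test_size=0.2, random_state=42
)
same_split = np.array_equal(X_test, X_test2) and np.array_equal(y_test, y_test2)

print("Train shape:", X_train.shape, "Test shape:", X_test.shape)
print("Same split on both calls:", same_split)
```

With a fixed `random_state`, the intercept, coefficients and MSE stay the same between runs, because the model is always fitted and tested on the same rows.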

