Subscribe to DSC Newsletter

If you are like me, back in engineering school you learned linear regression as a way to “fit a line to data” and probably called in “least squares”.  You probably extended it to multiple variables affecting a single dependent variable.  In a statistics class you had to calculate a bunch of stuff and estimate confidence intervals for those lines.  And that was probably about it for a long time, unless you were focusing on math or statistics.  You may have picked up along the way that there are assumptions inside of the decision to use “ordinary least squares”. 

Now, as there are languages and free code and packages to do most anything in analysis, it is quite easy to extend beyond ordinary least squares, and be of value to do so.  In R, doing a multiple linear regression using ordinary least squares requires only 1 line of code:

        Model  <- lm(Y ~ X, data = X_data)

Note that we could replace X by multiple variables.  You can use the model, now stored in Model, to make predictions from new data with one more line of code:

        Y_pred <- predict(Model, data = new_X_data)

We can generate some “ideal” data for regression easily in R:

        X_data <- seq(1, 100, 1)

        Y_raw <- 3.5 + 2.1 * X_data

        Y_noise <- rnorm(n = 100, mean = 0, sd = 5)

        Y <- data.frame(X = X_data, Y = Y_raw + Y_noise)

 Using the lm method, we get the following result:

On the left are the noisy data and the linear regression line; on the right are the residuals from the fit to the data plotted as a histogram, with a normal curve of same mean and standard deviation superimposed.  R gives us the model statistics by simply calling summary(Model):

     > summary(Model)

      Call:

      lm(formula = Y_noisy ~ X, data = Y)

       Residuals:

           Min       1Q   Median       3Q      Max

      -11.1348  -2.9799   0.3627   2.9478  10.3814

       Coefficients:

                 Estimate Std. Error t value Pr(>|t|)   

      (Intercept)  3.51543    0.98362   3.574 0.000548 ***

         X            2.11284    0.01691 124.946  < 2e-16 ***

      ---

      Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

      Residual standard error: 4.881 on 98 degrees of freedom

      Multiple R-squared:  0.9938,  Adjusted R-squared:  0.9937

      F-statistic: 1.561e+04 on 1 and 98 DF,  p-value: < 2.2e-16

We can see that the coefficients deviate slightly from the underlying model.  In addition, both model parameters are highly significant, which is expected.  Now let’s look at a messier case, and what we can do about it.

        X_data <- seq(1, 1000, 1)

      #

      # Y is linear in x with uniform, periodic, and skewed noise

      #

        Y_raw <- 1.37 + 2.097 * x

        Y_noise <- (X_data / 100) * 25 * (sin(2 * pi * X_data/100)) *

        runif(n = length(X_data), min = 3, max  = 4.5) +

        (X_data / 100)^3 * runif(n = 100, min = 1, max = 5)

        Y <- data.frame(X = X_data, Y = Y_raw + Y_noise)

Again using lm we can obtain the following:

On the left are the raw data, the red line is the linear least squares line, and the dashed line is the “real” Y, which of course we may not know in advance.  On the right are the residuals and normal cure as before.

Before proceeding further, we need to come clean on a few things.  First, this is an extreme and possibly unrealistic example.  If we knew nothing at all, we would likely conclude that the real Y was periodic, increased with X, and the amplitude of the periodic part also increased with X.  Suppose, though, that we have good reason to believe the underlying phenomena was linear.  Also, let’s suppose we had some idea of what Y should be, and we think the red line is way too high.  Under all these conditions, we might then wonder what to do with the data.  (Yes, we could fire the technician who gathered them, send the measuring equipment to maintenance, or just start all over.  For now, suppose we are starting up a process tomorrow and we need the regression equation for the control program.)

So we think we have a really big noise problem.  However, except for the bit of tail to the right, the residual plot looks remarkably normally distributed (Gaussian).  Let’s look at the residuals in a couple of other ways.

On the left is a plot of the residual values versus the fitted Y values.  Making such a plot is usually a good idea.  The basic assumption of ordinary least squares is that the residuals do not depend on the value of the function, and the the distribution of the residuals is the same everywhere.  We can see that both these conditions are violated just by inspection of the left chart.  This gives us a lot more inormation than just the histogram.  On the right is a normal Q-Q plot, which is constructed such that residual values that meet our normality and constancy requiremetns would all be on the red line.  This shows that towards higher values, there is a problem.  That is another clue for us that we might have missed just looking at the fit or the histogram.

The chart on the left demonstrates a behavor statistictians and others call heteroskedasticity.  I hope you get to use that word in Scrabble™ some day.  Essentially heteroskedsticity means the residuals do exhibit the unwanted variaions we observe.  So what should we do?  There are actually a whole bunch of modeling approaches that could be used, but here we will look only at weighted linear regression.  There is a body of statistics that can be brought to bear on determining weights to be applied to the raw data points to improve the accuracy of the regression model.  In many cases, a good estimate of the weights is to divide the Y values by the variance of the residuals.  This leads to the question of how do we ge the variance values, especially I few have only 1 point per X value?  Ideally in such a case, you would have collected multiple mesurements at every location, but what if, as here, that was not done or is not possible?

We illustrate a pragmatic approach here.  We define a bin size, in this case 10 measurements, and calculate a moving value of the residual variance by calculating the variance of the 10 measurements in the bin and using that at the value for the X corresponding to the bin center. On the left and right ends of the data we just used the variance of the initial and last 10 values, respectively.  Bingo, we have a value for the variance of the residuals for every Y value.  The R package MASS contains a robust linear model function, which we can use with these weights:

        Weighted_fit <- rlm(Y ~ X, data = Y, weights = 1/sd_variance)

Using rlm, we obtain the following:

One the left, the new fit is the green line.  Note that without any additional information other than the residual variance, the model is now much closer to the “true” Y.  On the right, we see that the residuals are skewed, but the most likely value is near 0.

In summary, inspecting the behavior of residuals to a regression model is always a good idea, in addition to the usual regression summary table.  If there is evidence of issues, there ways to address them, including the (easy) weighted regression we demonstrated here.

Views: 14140

Comment

You need to be a member of Data Science Central to add comments!

Join Data Science Central

Comment by Blaine Bateman on March 29, 2018 at 7:02am

Thank you for your kind remarks.

Comment by Abdoulaye Diallo on March 28, 2018 at 10:52pm

Realy interesting for all students, begging to manipulate data without any knowledge to this  can lead to a false interpretation meaning false analysis.

Thanks @Blaine Bateman

© 2019   Data Science Central ®   Powered by

Badges  |  Report an Issue  |  Privacy Policy  |  Terms of Service