
Regressions are widely used to estimate relations between variables or predict future values for a certain dataset.

If you want to know how much variable "x" influences variable "y", you might want to run a regression on your data. If you have a series of data points over time and want to know what your data will look like in the future, regression can help there too.


I will try to describe the steps that helped me successfully build linear and non-linear regressions in R, using polynomials and splines. I am not going to go into too much detail on each method; I just want to give an overall, step-by-step view of how to do a general regression in R, so that you can go further on your own.

 

First Steps: get to know your data

The first thing you should do is see what your data looks like. Plot the data, maybe try to get some statistics out of it, and try to understand what type of relation there is between variables.

There might be a linear (line) or non-linear (curvy) relationship between your data points. The data in question might depend on only one variable or on several variables.


Suppose you have the following dummy dataset:

 

x      y
1.1    1
2.1    4
3.1    10
4.1    16
5.1    25
6.1    36
7.6    52
8.1    64
9.1    81
10.1   100
11.1   121
13.1   138
13.1   169
14.1   196
15.1   225
16.1   256
17.1   289
18.1   324
19.1   361
18.1   400
21.1   441
22.1   484
23.1   529
24.6   574
25.1   625
26.1   676
27.1   729
29     789
29.1   841
30.1   900


Here we have one variable y that varies according to another variable x.
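For reference, the table above can be entered directly in R as two vectors (these are the exact values from the table):

```r
# x and y columns from the table above
x <- c(1.1, 2.1, 3.1, 4.1, 5.1, 6.1, 7.6, 8.1, 9.1, 10.1,
       11.1, 13.1, 13.1, 14.1, 15.1, 16.1, 17.1, 18.1, 19.1, 18.1,
       21.1, 22.1, 23.1, 24.6, 25.1, 26.1, 27.1, 29, 29.1, 30.1)
y <- c(1, 4, 10, 16, 25, 36, 52, 64, 81, 100,
       121, 138, 169, 196, 225, 256, 289, 324, 361, 400,
       441, 484, 529, 574, 625, 676, 729, 789, 841, 900)
```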


Plotting this data gives us the following graph:

 

> plot(x, y, col="blue", main="Example Graph")

> grid(nx = 12, ny = 12, col = "lightgray", lty = "dotted", lwd = par("lwd"), equilogs = TRUE)

 

Visually, we can say that this data seems to follow a non-linear pattern. Going further, the relationship between y and x looks like a second-degree polynomial.

Models

The natural first step is to check whether a second-degree polynomial matches our data. To create a model, we can use the lm() function, passing as a parameter the formula you think might suit your data. Here we are effectively "guessing" which model fits the data best.


my_model <- lm(y ~ poly(x, 2))


In the example above we are creating a model where y is a second-degree polynomial in x (y ≈ a + b·x + c·x²). Note that poly() uses orthogonal polynomials by default; pass raw = TRUE if you want raw polynomial coefficients.

Of course, you might want to test other alternatives:


my_model_linear <- lm(y ~ poly(x, 1))


Or (why not?)


my_model_degree_20 <- lm(y ~ poly(x, 20))


For more complex datasets, spline is a nice method to be used:

library(splines)

my_model_spline <- lm(y ~ bs(x))


Here, bs() generates a B-spline basis for the regression. Use the parameters knots and df to make the fitted curve smoother or curvier.
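As a minimal sketch of those parameters (using stand-in x and y vectors shaped like the table above, since the exact values don't matter here):

```r
library(splines)

x <- 1:30 + 0.1   # stand-in for the x column above
y <- (1:30)^2     # y roughly follows x squared, as in the table

# Default cubic B-spline basis
spline_default <- lm(y ~ bs(x))

# More degrees of freedom -> a more flexible, curvier spline
spline_flexible <- lm(y ~ bs(x, df = 6))

# Or place the interior knots by hand
spline_knots <- lm(y ~ bs(x, knots = c(10, 20)))
```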

 

We were lucky in our example with the second degree polynomial, but the idea here is to mess around a little with these functions and parameters, trying to find the best model possible.
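Beyond eyeballing the plot, one way to compare candidate models numerically is the adjusted R-squared reported by summary(), or anova() for nested models. A sketch with stand-in data shaped like our example:

```r
x <- 1:30 + 0.1   # stand-in data shaped like the example
y <- (1:30)^2

m1 <- lm(y ~ poly(x, 1))   # straight line
m2 <- lm(y ~ poly(x, 2))   # second-degree polynomial

summary(m1)$adj.r.squared  # lower: the line misses the curvature
summary(m2)$adj.r.squared  # close to 1: the quadratic fits well

# For nested models, anova() tests whether the extra term helps
anova(m1, m2)
```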

 

Check Results

After you have created some models, you can visually check how they fit your data by plotting your x values against the fitted values. Here we use the lines() function to do that:


lines(x, predict(my_model))





To further check how well your model fits your data, you can plot the model itself:


plot(my_model)

 

This is going to give you a set of diagnostic plots, such as the residuals against the fitted values.
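As a rough sketch of what to look for (again with stand-in data): the residuals of a good fit should scatter randomly around zero, with no visible pattern.

```r
set.seed(42)
x <- 1:30 + 0.1                      # stand-in data shaped like the example
y <- (1:30)^2 + rnorm(30, sd = 5)    # quadratic signal plus noise

my_model <- lm(y ~ poly(x, 2))

# Residuals of a good fit look like random noise around zero
plot(x, residuals(my_model))
abline(h = 0, lty = "dotted")
```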

You can also use t.test() to see whether the two groups (real versus modeled values) are similar. This test compares their means, assuming both are normally distributed.

t.test(y, predict(my_model))


 

Predict New Data

 

Now, suppose you have found a good function to model your data. With it, we can predict future values for our small dataset.

One important thing about the predict() function in R is that it expects a data frame with the same column names and types as the one you used to fit your model.

For example:

my_prediction <- predict(my_model, data.frame(column_name = c(value_to_be_predicted)))

If, for example, you had used dates in numeric form, you would have:

my_date <- "2016-05-10"

date_df <- data.frame(x = as.numeric(as.Date(my_date)))

my_pred <- predict(cubic_model, date_df)


In our example we used generic numbers under the column name "x":


my_pred <- predict(my_model, data.frame(x = c(31.1)))
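predict() also accepts several new values at once; a small sketch with stand-in data shaped like our example:

```r
x <- 1:30 + 0.1   # stand-in data shaped like the example
y <- (1:30)^2

my_model <- lm(y ~ poly(x, 2))

# The new data frame must reuse the column name "x" from the fit
new_points <- data.frame(x = c(31.1, 32.1, 33.1))
predict(my_model, new_points)
```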

 

Links


http://www.dummies.com/how-to/content/how-to-predict-new-data-values-with-r.html

http://www.r-bloggers.com/splines-opening-the-black-box/

http://statweb.stanford.edu/~jtaylo/courses/stats203/R/inference+polynomial/spline.R.html

http://data.princeton.edu/R/linearModels.html

http://www.r-bloggers.com/first-steps-with-non-linear-regression-in-r/

 


OBS: If you only want to interpolate your data, i.e. draw a smooth "line" through your data points, check out the smooth.spline() function. It will interpolate the data without you having to keep guessing the relationship between the variables.
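A minimal smooth.spline() sketch with stand-in noisy data:

```r
set.seed(1)
x <- 1:30 + 0.1                      # stand-in data shaped like the example
y <- (1:30)^2 + rnorm(30, sd = 10)   # quadratic signal plus noise

fit <- smooth.spline(x, y)

plot(x, y)
lines(fit, col = "red")

# predict() works here too, with no formula guessing needed
predict(fit, x = 31.1)
```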


OBS2: If the function underlying your data is complex, if you are not able to model your dataset well, or if you just want to try new things, neural networks can be a very powerful way of learning your data.
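As a hedged sketch of that idea, here is a tiny single-hidden-layer network using the nnet package (one of R's recommended packages). The scaling factors below are illustrative choices, not part of the original example; neural networks generally train better on scaled data.

```r
library(nnet)   # single-hidden-layer networks; ships with R

set.seed(1)
x <- 1:30 + 0.1   # stand-in data shaped like the example
y <- (1:30)^2

# Scale inputs and outputs to roughly [0, 1] before training
df <- data.frame(x = x / 30, y = y / 900)

net <- nnet(y ~ x, data = df, size = 5, linout = TRUE,
            maxit = 1000, trace = FALSE)

head(predict(net, df) * 900)   # rescale predictions back
```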

© 2018   Data Science Central ®