Regression is widely used to estimate the relationship between variables or to predict future values from a dataset.
If you want to know how much variable "x" influences variable "y", you might want to run a regression on your data. If you have a series of data points over time and want to know what your data will look like in the future, regression can help there as well.
I will describe the steps that helped me build linear and non-linear regressions in R, using polynomials and splines. I will not go into much detail on each method. I just want to give an overall step-by-step guide to doing a general regression in R, so that you can go further on your own.
The first thing you should do is see what your data looks like. Plot the data, compute some summary statistics, and try to understand what type of relationship exists between the variables.
There might be a linear (straight-line) or non-linear (curved) relationship between your data points. The variable you are interested in might depend on only one variable or on several.
Suppose you have the following dummy dataset:
x | y
1.1 | 1
2.1 | 4
3.1 | 10
4.1 | 16
5.1 | 25
6.1 | 36
7.6 | 52
8.1 | 64
9.1 | 81
10.1 | 100
11.1 | 121
13.1 | 138
13.1 | 169
14.1 | 196
15.1 | 225
16.1 | 256
17.1 | 289
18.1 | 324
19.1 | 361
18.1 | 400
21.1 | 441
22.1 | 484
23.1 | 529
24.6 | 574
25.1 | 625
26.1 | 676
27.1 | 729
29 | 789
29.1 | 841
30.1 | 900
Here we have one variable, y, that varies according to another variable, x.
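To follow along in R, the table can be typed in as two numeric vectors. This is only a minimal sketch: the data frame name `dummy` is an illustrative choice and does not come from the original text.

```r
# Dummy dataset from the table above, entered as two numeric vectors
x <- c(1.1, 2.1, 3.1, 4.1, 5.1, 6.1, 7.6, 8.1, 9.1, 10.1, 11.1, 13.1, 13.1, 14.1, 15.1,
       16.1, 17.1, 18.1, 19.1, 18.1, 21.1, 22.1, 23.1, 24.6, 25.1, 26.1, 27.1, 29, 29.1, 30.1)
y <- c(1, 4, 10, 16, 25, 36, 52, 64, 81, 100, 121, 138, 169, 196, 225,
       256, 289, 324, 361, 400, 441, 484, 529, 574, 625, 676, 729, 789, 841, 900)

# Optional: keep both columns together (the name 'dummy' is hypothetical)
dummy <- data.frame(x = x, y = y)

# Quick look at the data before modelling
str(dummy)
summary(dummy)
cor(x, y)  # linear correlation between x and y
```

The summary statistics give a first feel for the ranges involved before any plotting or modelling.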
Plotting this data gives us the following graph:
plot(x, y, col="blue", main="Example Graph")
grid(nx = 12, ny = 12, col = "lightgray", lty = "dotted", lwd = par("lwd"), equilogs = TRUE)
Visually, this data seems to follow a non-linear pattern. More specifically, the relationship between y and x looks like a second-degree polynomial.
The natural first step is to check whether a second-degree polynomial matches our data. To create a model, we can use the lm() function. You pass it a formula describing the equation you think suits your data. Here we are really "guessing" which model best fits the data.
my_model <- lm(y ~ poly(x, 2))
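In the example above we are creating a model in which y is a second-degree polynomial of x, that is, y = a + b*x + c*x^2. By default, poly() builds orthogonal polynomial terms, so the fitted curve is the same even though the reported coefficients are not a, b and c directly.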
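If you would like to read the fitted equation directly, one option is to ask poly() for raw terms. The sketch below assumes the dummy dataset from the table; the name `my_model_raw` is made up for this illustration.

```r
# Dummy dataset from the table above
x <- c(1.1, 2.1, 3.1, 4.1, 5.1, 6.1, 7.6, 8.1, 9.1, 10.1, 11.1, 13.1, 13.1, 14.1, 15.1,
       16.1, 17.1, 18.1, 19.1, 18.1, 21.1, 22.1, 23.1, 24.6, 25.1, 26.1, 27.1, 29, 29.1, 30.1)
y <- c(1, 4, 10, 16, 25, 36, 52, 64, 81, 100, 121, 138, 169, 196, 225,
       256, 289, 324, 361, 400, 441, 484, 529, 574, 625, 676, 729, 789, 841, 900)

# poly(x, 2) builds orthogonal polynomial terms by default;
# raw = TRUE uses x and x^2 directly, so the coefficients are easier to read
my_model     <- lm(y ~ poly(x, 2))
my_model_raw <- lm(y ~ poly(x, 2, raw = TRUE))  # hypothetical name

coef(my_model_raw)  # intercept, coefficient of x, coefficient of x^2

# Both parameterisations describe the same fitted curve
all.equal(unname(fitted(my_model)), unname(fitted(my_model_raw)))
```

The orthogonal form is usually the safer default for higher degrees, since raw powers of x become strongly correlated with one another.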
Of course, you might want to test other alternatives:
my_model_linear <- lm(y ~ poly(x, 1))
Or (why not?)
my_model_degree_20 <- lm(y ~ poly(x, 20))
For more complex datasets, splines are a good option:
library(splines)
my_model_spline <- lm(y ~ bs(x))
Here, bs() generates the B-spline basis for the spline. Use its knots and df parameters to make the fitted curve smoother or more flexible.
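As a rough illustration of those two parameters, the sketch below fits one spline with a chosen number of degrees of freedom and another with explicit knots. The model names and the knot positions (10 and 20) are arbitrary choices made for this example.

```r
library(splines)

# Dummy dataset from the table above
x <- c(1.1, 2.1, 3.1, 4.1, 5.1, 6.1, 7.6, 8.1, 9.1, 10.1, 11.1, 13.1, 13.1, 14.1, 15.1,
       16.1, 17.1, 18.1, 19.1, 18.1, 21.1, 22.1, 23.1, 24.6, 25.1, 26.1, 27.1, 29, 29.1, 30.1)
y <- c(1, 4, 10, 16, 25, 36, 52, 64, 81, 100, 121, 138, 169, 196, 225,
       256, 289, 324, 361, 400, 441, 484, 529, 574, 625, 676, 729, 789, 841, 900)

# df sets the number of basis functions, and so how flexible the curve is
spline_df5 <- lm(y ~ bs(x, df = 5))

# knots places the interior breakpoints explicitly (values here are arbitrary)
spline_knots <- lm(y ~ bs(x, knots = c(10, 20)))

length(coef(spline_df5))    # intercept + 5 basis coefficients
length(coef(spline_knots))  # intercept + (3 + 2 knots) basis coefficients
```

More knots or a larger df lets the curve bend more, at the cost of a higher risk of following noise.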
We were lucky in our example with the second degree polynomial, but the idea here is to mess around a little with these functions and parameters, trying to find the best model possible.
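One way to make this trial and error less subjective is to put a few candidate models side by side. The sketch below is an assumption about how that comparison could look, using adjusted R-squared and AIC; the list name `fits`, the table name `comparison` and the degree-5 candidate are illustrative.

```r
library(splines)

# Dummy dataset from the table above
x <- c(1.1, 2.1, 3.1, 4.1, 5.1, 6.1, 7.6, 8.1, 9.1, 10.1, 11.1, 13.1, 13.1, 14.1, 15.1,
       16.1, 17.1, 18.1, 19.1, 18.1, 21.1, 22.1, 23.1, 24.6, 25.1, 26.1, 27.1, 29, 29.1, 30.1)
y <- c(1, 4, 10, 16, 25, 36, 52, 64, 81, 100, 121, 138, 169, 196, 225,
       256, 289, 324, 361, 400, 441, 484, 529, 574, 625, 676, 729, 789, 841, 900)

# Candidate models (names are hypothetical)
fits <- list(
  linear    = lm(y ~ poly(x, 1)),
  quadratic = lm(y ~ poly(x, 2)),
  quintic   = lm(y ~ poly(x, 5)),
  spline    = lm(y ~ bs(x))
)

# Adjusted R^2 rewards fit but penalises extra terms; lower AIC is better
comparison <- data.frame(
  adj_r2 = sapply(fits, function(m) summary(m)$adj.r.squared),
  aic    = sapply(fits, AIC),
  row.names = names(fits)
)
comparison

# F-test for nested models: does the squared term improve on the straight line?
anova(fits$linear, fits$quadratic)
```

When two models score about the same, the simpler one is usually the better choice for prediction.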
After you have created some models, you can visually check how they fit by drawing the model's predicted values over the plot of your data. Here we use the lines() function to do that:
ord <- order(x)  # sort by x so the curve is drawn from left to right
lines(x[ord], predict(my_model)[ord], col = "red")
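To compare several fits on the same graph, it can help to predict over an evenly spaced grid of x values rather than over the original points. This is a sketch under that assumption; `grid_df` is a name introduced here for illustration.

```r
# Dummy dataset from the table above
x <- c(1.1, 2.1, 3.1, 4.1, 5.1, 6.1, 7.6, 8.1, 9.1, 10.1, 11.1, 13.1, 13.1, 14.1, 15.1,
       16.1, 17.1, 18.1, 19.1, 18.1, 21.1, 22.1, 23.1, 24.6, 25.1, 26.1, 27.1, 29, 29.1, 30.1)
y <- c(1, 4, 10, 16, 25, 36, 52, 64, 81, 100, 121, 138, 169, 196, 225,
       256, 289, 324, 361, 400, 441, 484, 529, 574, 625, 676, 729, 789, 841, 900)

my_model_linear <- lm(y ~ poly(x, 1))
my_model        <- lm(y ~ poly(x, 2))

# Evenly spaced x values give smooth curves regardless of the order of the data
grid_df <- data.frame(x = seq(min(x), max(x), length.out = 200))

plot(x, y, col = "blue", main = "Example Graph")
lines(grid_df$x, predict(my_model_linear, grid_df), col = "darkgray", lty = "dashed")
lines(grid_df$x, predict(my_model, grid_df), col = "red")
legend("topleft", legend = c("linear", "quadratic"),
       col = c("darkgray", "red"), lty = c("dashed", "solid"))
```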
To further check how well your model fits your data, you can plot the model itself:
plot(my_model)
This produces a series of diagnostic plots, such as the residuals against the fitted values.
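The residuals-versus-fitted plot can also be drawn by hand, which makes it clearer what is being shown. The sketch below assumes the quadratic model from earlier; a good fit should leave residuals scattered around zero with no obvious pattern.

```r
# Dummy dataset from the table above
x <- c(1.1, 2.1, 3.1, 4.1, 5.1, 6.1, 7.6, 8.1, 9.1, 10.1, 11.1, 13.1, 13.1, 14.1, 15.1,
       16.1, 17.1, 18.1, 19.1, 18.1, 21.1, 22.1, 23.1, 24.6, 25.1, 26.1, 27.1, 29, 29.1, 30.1)
y <- c(1, 4, 10, 16, 25, 36, 52, 64, 81, 100, 121, 138, 169, 196, 225,
       256, 289, 324, 361, 400, 441, 484, 529, 574, 625, 676, 729, 789, 841, 900)

my_model <- lm(y ~ poly(x, 2))

# Residuals against fitted values: look for leftover curvature or funnel shapes
plot(fitted(my_model), resid(my_model),
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = "dashed")

# The same kind of plot from R's built-in diagnostics (which = 1 selects it)
plot(my_model, which = 1)

# Points with the largest absolute residuals deserve a second look
head(sort(abs(resid(my_model)), decreasing = TRUE), 3)
```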
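You can also use t.test() to check whether the two groups (real versus modeled values) are similar. This test compares their means, assuming both are normally distributed. Because it looks only at the means, treat it as a rough sanity check rather than a measure of how well the model follows the data.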
t.test(y, predict(my_model))
Run ?t.test in R to read more about t.test().
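It is worth seeing how this test behaves on a model that clearly does not fit. A least-squares model with an intercept reproduces the mean of y, so the sketch below expects the t-test to rate the straight line and the quadratic about equally. The `rmse` helper is a hypothetical addition, shown as one possible complementary measure.

```r
# Dummy dataset from the table above
x <- c(1.1, 2.1, 3.1, 4.1, 5.1, 6.1, 7.6, 8.1, 9.1, 10.1, 11.1, 13.1, 13.1, 14.1, 15.1,
       16.1, 17.1, 18.1, 19.1, 18.1, 21.1, 22.1, 23.1, 24.6, 25.1, 26.1, 27.1, 29, 29.1, 30.1)
y <- c(1, 4, 10, 16, 25, 36, 52, 64, 81, 100, 121, 138, 169, 196, 225,
       256, 289, 324, 361, 400, 441, 484, 529, 574, 625, 676, 729, 789, 841, 900)

my_model_linear <- lm(y ~ poly(x, 1))
my_model        <- lm(y ~ poly(x, 2))

# Both models reproduce mean(y), so the t-test cannot tell them apart
t.test(y, predict(my_model_linear))$p.value
t.test(y, predict(my_model))$p.value

# Root mean squared error (hypothetical helper) does separate them
rmse <- function(model) sqrt(mean(resid(model)^2))
rmse(my_model_linear)
rmse(my_model)
```

Measures based on the residuals, such as this one or the adjusted R-squared reported by summary(), say more about the shape of the fit than a comparison of means does.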
Now, suppose you have found a good function to model your data. With it, we can predict future values for our small dataset.
One important thing about the predict() function in R is that it expects a data frame whose columns have the same names and types as the variables you used to build the model.
For example:
my_prediction <- predict(my_model, data.frame(column_name = c(value_to_be_predicted)))
If you had used dates in numeric form, for example, you would have:
my_date <- "2016-05-10"
# cubic_model is assumed to be a model fitted earlier on numeric dates stored in column x
date_df <- data.frame(x = as.numeric(as.Date(my_date)))
my_pred <- predict(cubic_model, date_df)
In our example, the predictor is a plain numeric column named "x":
my_pred <- predict(my_model, data.frame(x = c(31.1)))
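predict() can also report how uncertain each estimate is. The sketch below assumes the quadratic model and asks for prediction intervals at two new x values; `new_points` is a name chosen for this example, and 32.1 is an extra value added for illustration.

```r
# Dummy dataset from the table above
x <- c(1.1, 2.1, 3.1, 4.1, 5.1, 6.1, 7.6, 8.1, 9.1, 10.1, 11.1, 13.1, 13.1, 14.1, 15.1,
       16.1, 17.1, 18.1, 19.1, 18.1, 21.1, 22.1, 23.1, 24.6, 25.1, 26.1, 27.1, 29, 29.1, 30.1)
y <- c(1, 4, 10, 16, 25, 36, 52, 64, 81, 100, 121, 138, 169, 196, 225,
       256, 289, 324, 361, 400, 441, 484, 529, 574, 625, 676, 729, 789, 841, 900)

my_model <- lm(y ~ poly(x, 2))

# The new data frame must use the same column name ("x") as the model formula
new_points <- data.frame(x = c(31.1, 32.1))

# interval = "prediction" adds lower and upper bounds around each estimate
my_pred <- predict(my_model, new_points, interval = "prediction")
my_pred  # columns: fit, lwr, upr
```

Keep in mind that the further a new x value lies from the observed range, the less the model can be trusted, however narrow the interval looks.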
http://www.dummies.com/how-to/content/how-to-predict-new-data-values-with-r.html
http://www.r-bloggers.com/splines-opening-the-black-box/
http://statweb.stanford.edu/~jtaylo/courses/stats203/R/inference+polynomial/spline.R.html
http://data.princeton.edu/R/linearModels.html
http://www.r-bloggers.com/first-steps-with-non-linear-regression-in-r/
Note: If you only want to draw a smooth curve through your data points, check the smooth.spline() function. It fits a smoothing spline to your data, so you don't have to keep guessing the relationship between the variables. If you need the curve to pass through every point, look at spline() or splinefun() instead.
Note 2: If the function behind your data is complex, if you are not able to model your dataset correctly, or if you just want to try something new, neural networks can be a very powerful way of learning from your data.
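A minimal sketch of the smooth.spline() approach on the dummy dataset is shown below, leaving the amount of smoothing at its default. The name `smooth_fit` and the x values passed to predict() are illustrative.

```r
# Dummy dataset from the table above
x <- c(1.1, 2.1, 3.1, 4.1, 5.1, 6.1, 7.6, 8.1, 9.1, 10.1, 11.1, 13.1, 13.1, 14.1, 15.1,
       16.1, 17.1, 18.1, 19.1, 18.1, 21.1, 22.1, 23.1, 24.6, 25.1, 26.1, 27.1, 29, 29.1, 30.1)
y <- c(1, 4, 10, 16, 25, 36, 52, 64, 81, 100, 121, 138, 169, 196, 225,
       256, 289, 324, 361, 400, 441, 484, 529, 574, 625, 676, 729, 789, 841, 900)

# Fit a smoothing spline; the smoothing level is chosen automatically
smooth_fit <- smooth.spline(x, y)

# Evaluate the fitted curve at chosen x values (returns a list with x and y)
predict(smooth_fit, c(15.1, 20.1))

# Draw the fitted curve over the data
plot(x, y, col = "blue")
lines(smooth_fit, col = "red")
```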
© 2019 Data Science Central