
I am doing some regression analysis. Some of the independent variables are continuous and some are categorical; the dependent variable is continuous. Which regression model should I pick? Thanks for your help.

 

 


Replies to This Discussion

You could turn your categorical variables into binary dummy variables (search for "dummy variable"), and then use any standard regression. This is often done in the context of logistic regression, but it works the same way for linear regression.
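To make the dummy-variable idea concrete, here is a minimal sketch using NumPy. The helper `dummy_encode` and the `region` data are made up for illustration; in practice a library routine such as `pandas.get_dummies` does the same job. One level is dropped as the baseline so the dummies are not collinear with the intercept.

```python
import numpy as np

def dummy_encode(values, drop_first=True):
    """Return (kept_levels, 0/1 matrix) for a list of category labels.

    Hypothetical helper for illustration only.
    """
    levels = sorted(set(values))
    # Drop the first level so it acts as the baseline category
    kept = levels[1:] if drop_first else levels
    matrix = np.array(
        [[1.0 if v == level else 0.0 for level in kept] for v in values]
    )
    return kept, matrix

# Made-up categorical variable with three levels
region = ["north", "south", "west", "south", "north"]
names, dummies = dummy_encode(region)
print(names)    # one column per non-baseline level
print(dummies)  # each row has at most one 1
```

Here "north" is the baseline, so a row of all zeros means "north".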

Thanks for your help Vincent


It depends entirely on your objectives. Since your dependent variable is continuous, I think you should use a multiple linear regression model.

Read about the ANCOVA model. You can handle categorical variables using dummy variables, as Vincent said above.

Hi Prashanth.

You can build a basic multiple regression model after creating dummy variables to represent the categorical variables.
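A base model of this kind could look roughly like the sketch below. The data are simulated, and the variable names (`x`, `group`, `shift`) are assumptions made for the example, not anything from the original question. It fits ordinary least squares with one continuous predictor and two dummies for a three-level factor.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = rng.uniform(0, 10, size=n)               # continuous predictor
group = rng.choice(["a", "b", "c"], size=n)  # categorical predictor

# Simulated response: common slope 2.0, group-specific shifts, some noise
shift = {"a": 0.0, "b": 3.0, "c": -1.5}
y = 1.0 + 2.0 * x + np.array([shift[g] for g in group]) + rng.normal(0, 0.5, size=n)

# Design matrix: intercept, x, and dummies for "b" and "c" ("a" is the baseline)
X = np.column_stack([
    np.ones(n),
    x,
    (group == "b").astype(float),
    (group == "c").astype(float),
])

# Ordinary least squares fit
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print(dict(zip(["intercept", "x", "group_b", "group_c"], coef.round(2))))
```

The dummy coefficients are read as shifts relative to the baseline group "a".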

Once you have built this base model, you will absolutely need to check for the following assumptions and make sure they hold. You can do this by plotting and visual inspection.

1) There is a linear relationship between your response and your predictors.

2) Your predictors are not highly correlated with each other.

3) The mean of the residuals (i.e. Actual - Predicted) is zero.

4) The residuals display constant variance.

5) The residuals are normally distributed.
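Plots are the usual way to check these, but some of the checks can also be summarized numerically. The sketch below is only a rough stand-in for visual inspection, run on simulated data; the half-split variance comparison and the skewness/kurtosis summaries are simple choices made for this example, not formal tests. It covers points 2 to 5; linearity (point 1) is usually judged from a plot of residuals against fitted values.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 300
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 0.5 + 1.5 * x1 - 2.0 * x2 + rng.normal(0, 1.0, size=n)  # simulated data

X = np.column_stack([np.ones(n), x1, x2])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ coef
resid = y - fitted  # residual = actual - predicted

# (2) correlation between the predictors
pred_corr = np.corrcoef(x1, x2)[0, 1]

# (3) mean of the residuals
resid_mean = resid.mean()

# (4) constant variance: compare residual spread for low vs high fitted values
order = np.argsort(fitted)
low, high = resid[order[: n // 2]], resid[order[n // 2:]]
var_ratio = high.var(ddof=1) / low.var(ddof=1)

# (5) normality: skewness and excess kurtosis should both be near 0
z = (resid - resid.mean()) / resid.std()
skew = np.mean(z**3)
excess_kurt = np.mean(z**4) - 3.0

print(pred_corr, resid_mean, var_ratio, skew, excess_kurt)
```

Note that with an intercept in the model, least squares makes the residual mean zero by construction, so check 3 mostly guards against a missing intercept.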

If the assumptions are not satisfied, you will need to look into transforming your predictors (e.g. change x to x-squared, 1/x, log(x), (x - mean)/(standard deviation), etc.) and rechecking the assumptions.
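As a small illustration of why a transformation can help, the sketch below simulates a response that depends on log(x) and compares the fit on raw x, on log(x), and on standardized x. The helper `r_squared` and all the data are assumptions made for this example.

```python
import numpy as np

def r_squared(x, y):
    """R-squared of a simple least-squares fit of y on x (with intercept)."""
    X = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    return 1.0 - resid.var() / y.var()

rng = np.random.default_rng(2)
x = rng.uniform(1, 100, size=200)
y = 3.0 + 2.0 * np.log(x) + rng.normal(0, 0.2, size=200)  # truly linear in log(x)

r2_raw = r_squared(x, y)           # straight line in x: poorer fit
r2_log = r_squared(np.log(x), y)   # straight line in log(x): better fit
x_std = (x - x.mean()) / x.std()
r2_std = r_squared(x_std, y)       # standardizing alone leaves the fit unchanged

print(r2_raw, r2_log, r2_std)
```

One thing the comparison suggests: standardizing, (x - mean)/(standard deviation), rescales a predictor but does not change the shape of its relationship with the response, so it helps with scale and interpretation rather than with nonlinearity.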

Thanks Mithun for your help. Creating dummy variables was the key as I did not know that before. I was able to get this done and come up with a prediction model.

Sounds great Prashanth. Were you able to check for the above assumptions?


Hi Prashanth

As an alternative, you could try a tree-based method: a CART (Classification And Regression Tree) model, or an ensemble of such trees like Random Forest. This can help especially when a categorical feature has many categories. Say you had 25 categories: dummy coding would add 24 columns (25 if you keep every category and drop no baseline). In such a scenario you may run into the curse of dimensionality.

You could use an ordinary linear regression, as your dependent variable is continuous. The categorical variables can be converted to dummy variables (if they do not have too many unique values).

Depending on whether the model overfits or underfits, you can experiment with excluding independent variables (for example, removing those with the largest p-values).

You can try out Lasso or Ridge models if you want to shrink (regularize) the coefficients.
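To show what shrinking the coefficients means, here is a minimal sketch of Ridge regression using its closed-form solution in NumPy. The function `ridge_fit`, the penalty value `alpha=50.0`, and the simulated data are all assumptions for illustration; a library implementation would normally be used instead. Lasso has no closed form and is not shown.

```python
import numpy as np

def ridge_fit(X, y, alpha):
    """Ridge coefficients via the closed form; the intercept is not penalized.

    Hypothetical helper for illustration only.
    """
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean  # centering removes the intercept
    p = X.shape[1]
    # Solve (Xc'Xc + alpha * I) beta = Xc'yc
    coef = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(p), Xc.T @ yc)
    intercept = y_mean - x_mean @ coef
    return intercept, coef

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 3))
y = 1.0 + X @ np.array([2.0, -1.0, 0.5]) + rng.normal(0, 0.5, size=100)

_, coef_ols = ridge_fit(X, y, alpha=0.0)     # alpha = 0 is ordinary least squares
_, coef_ridge = ridge_fit(X, y, alpha=50.0)  # larger alpha pulls coefficients toward 0

print(coef_ols.round(3))
print(coef_ridge.round(3))
```

In practice the predictors are usually standardized first, since the penalty depends on their scale, and `alpha` is chosen by cross-validation.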

(Multiple) Linear Regression does *not* require a linear relationship between the response and the predictor(s).  "Linear" refers to the requirement that the model being considered is linear in the *parameters* to be estimated.
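A quick way to see this: the sketch below fits a curved relationship, y = b0 + b1*x + b2*x^2, with ordinary least squares. The model is a curve in x but linear in the parameters b0, b1, b2, so the usual linear regression machinery applies. The data are simulated for the example.

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(-3, 3, size=150)
# Simulated response that is quadratic (nonlinear) in x
y = 1.0 - 2.0 * x + 0.5 * x**2 + rng.normal(0, 0.3, size=150)

# The columns 1, x, x^2 are just three predictors as far as the fit is concerned
X = np.column_stack([np.ones_like(x), x, x**2])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

print(coef.round(2))  # estimates of b0, b1, b2
```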

