I am building a data set to use in R with an XGBoost program I have developed. I want to ask about several data strategies and the issues involved. The data includes fields such as term_gpa, total_gpa, total-failures, program of study, dev_ed_course_required_yes_no, assessment test scores such as ACT, SAT, and PERT, race, and many others, as well as demographic data (income, parents' income, county, zip code, etc.).

The strategy is to take data from previous fall terms (2013 to 2016) to build and test a model, then use it to predict on fall 2017 (or the latest term completed) to determine whether each student will return or go missing the following year.
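To make the setup concrete, here is a minimal sketch of the split described above. It is in Python rather than R purely for illustration, and the field names (`student_id`, `fall_year`, `returned_next_fall`) and the rows themselves are hypothetical, not taken from the real data.

```python
# Hypothetical rows: one record per student per fall term attended.
records = [
    {"student_id": 1, "fall_year": 2013, "term_gpa": 3.1, "returned_next_fall": 1},
    {"student_id": 1, "fall_year": 2014, "term_gpa": 2.8, "returned_next_fall": 0},
    {"student_id": 2, "fall_year": 2015, "term_gpa": 3.6, "returned_next_fall": 1},
    {"student_id": 2, "fall_year": 2016, "term_gpa": 3.4, "returned_next_fall": 1},
    {"student_id": 3, "fall_year": 2016, "term_gpa": 1.9, "returned_next_fall": 0},
    # The fall 2017 outcome is not known yet, so the label is missing.
    {"student_id": 2, "fall_year": 2017, "term_gpa": 3.5, "returned_next_fall": None},
    {"student_id": 4, "fall_year": 2017, "term_gpa": 2.2, "returned_next_fall": None},
]


def split_by_term(rows, first_year=2013, last_labeled_year=2016, score_year=2017):
    """Separate the labeled history (build and test) from the term to be scored."""
    labeled = [r for r in rows if first_year <= r["fall_year"] <= last_labeled_year]
    to_score = [r for r in rows if r["fall_year"] == score_year]
    return labeled, to_score


labeled, to_score = split_by_term(records)
```

Here `labeled` holds the 2013 to 2016 rows that would be divided into train and test, and `to_score` holds the fall 2017 rows the finished model would predict on.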

  1. In educational data, students often appear more than once, one row for each term they enroll. Should only the most recent fall term the student attended be used in test and train, or could multiple terms be used, since many of the field values change from term to term (GPA, term_GPA, etc.)?
  2. Educational data from one institution tends to be sparse: fewer than 10,000 students per term. Breaking the data up into test and train reduces its size further and affects the ability to model accurately. Question: After building and evaluating a model using test and train data, is it good practice to build the final predictive model using all of the data (test data and train data), or should training data ONLY be used?
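The options in the two questions can be sketched as follows, again in Python for illustration only, with hypothetical field names and made-up rows. `latest_term_only` keeps one row per student (the first option in question 1). `split_by_student` keeps every term but places all of a student's rows on the same side of the split, so the same student never appears in both train and test. The last lines show what "using all of the data" in question 2 would mean. Which of these is appropriate is the open question.

```python
import random


def latest_term_only(rows):
    """Keep only each student's most recent fall term (one row per student)."""
    latest = {}
    for r in rows:
        sid = r["student_id"]
        if sid not in latest or r["fall_year"] > latest[sid]["fall_year"]:
            latest[sid] = r
    return sorted(latest.values(), key=lambda r: r["student_id"])


def split_by_student(rows, test_fraction=0.25, seed=0):
    """Keep all terms, but put every row of a given student on one side."""
    ids = sorted({r["student_id"] for r in rows})
    rng = random.Random(seed)
    rng.shuffle(ids)
    n_test = max(1, round(len(ids) * test_fraction))
    test_ids = set(ids[:n_test])
    train = [r for r in rows if r["student_id"] not in test_ids]
    test = [r for r in rows if r["student_id"] in test_ids]
    return train, test


# Hypothetical labeled rows from fall 2013 to fall 2016.
rows = [
    {"student_id": 1, "fall_year": 2013, "returned_next_fall": 1},
    {"student_id": 1, "fall_year": 2014, "returned_next_fall": 0},
    {"student_id": 2, "fall_year": 2015, "returned_next_fall": 1},
    {"student_id": 2, "fall_year": 2016, "returned_next_fall": 1},
    {"student_id": 3, "fall_year": 2016, "returned_next_fall": 0},
    {"student_id": 4, "fall_year": 2014, "returned_next_fall": 1},
]

one_per_student = latest_term_only(rows)   # question 1, first option
train, test = split_by_student(rows)       # question 1, second option

# Question 2: after evaluating on `test`, the final model could be refit on
# every labeled row (train + test) before scoring the newest term.
all_labeled = train + test
```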

Any help with strategies or issues to consider would be enlightening.

william, Data and Policy Analyst, SF College

Tags: data, mining, modelling, predictive
