I am currently working on building a model for predicting school enrollment in a given county.
The county has N schools, and I am trying to build the model for Kindergarten. I have school enrollment information from 2011 to 2016. I have taken a few demographic indicators, such as the number of housing units in the area where the school is located, the population of each ethnicity, etc. There is often more than one school in a given area, so many of these schools have overlapping demographic information for a given year. I then concatenate all the years into a single dataset, drop the 'year' feature, and try to predict the enrollment for a school.
So far I have used a linear regressor and a Random Forest regressor, and I am getting very poor results, with a MAPE of 30%.
I want to know whether what I am doing is correct. Is it OK to concatenate all the years into a single dataset and remove the time element of the problem? Am I making some other grave, foundational mistake?
One thing I am thinking is that maybe I can add some school-specific features.
Any help will be greatly appreciated.
You could try taking a step back to see what, for schools in that county, determines which child goes to which school. In some counties it's simply the closest school to the home address; in others there are district lines that determine where the child goes. Maybe one school is a charter school and another is not. Another thing to check: did the Kindergarten age cut-off change for that county in your 2011-2016 range? That would throw you off.
Just out of curiosity, if you simply model this year as equal to last year, i.e. "enrollment(year n) = enrollment(year n-1)", do you beat that 30% MAPE?
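To make that check concrete, here is a minimal sketch of the "this year equals last year" baseline. It assumes the enrollment data can be arranged as a mapping from school to {year: enrollment}; the function names and the numbers below are made up for illustration and are not from the original dataset.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, in percent (skips zero actuals)."""
    pairs = [(a, p) for a, p in zip(actual, predicted) if a != 0]
    if not pairs:
        raise ValueError("no non-zero actual values to score")
    return 100.0 * sum(abs(a - p) / a for a, p in pairs) / len(pairs)


def persistence_mape(enrollment):
    """MAPE of predicting each year's enrollment with the previous year's.

    `enrollment` maps school -> {year: kindergarten enrollment}.
    Only (school, year) pairs that have a previous year are scored.
    """
    actual, predicted = [], []
    for years in enrollment.values():
        for year in sorted(years):
            if year - 1 in years:
                actual.append(years[year])
                predicted.append(years[year - 1])
    return mape(actual, predicted)


# Made-up numbers for two hypothetical schools, for illustration only.
example = {
    "School A": {2011: 100, 2012: 110, 2013: 99},
    "School B": {2011: 50, 2012: 50, 2013: 40},
}
baseline = persistence_mape(example)
print(f"Persistence baseline MAPE: {baseline:.1f}%")
```

If a model cannot beat this number on the same (school, year) pairs, the added features are probably not contributing much beyond what last year's enrollment already tells you.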
Thank you, Justin, for your reply. There were a few indicators that I wanted to include, like the one you suggested: for a given school, what ratio of its students come from its own area/zip code. But I am still searching for sources where I can find this data.
The other two points are valuable. I will try to include those features.
I will also compute the MAPE I get by simply predicting the current year's enrollment with the previous year's enrollment.
Thanks a lot for your reply.
That's certainly a challenging problem to model. There are quite a few factors that could contribute, some of which you have already listed. I would recommend listing and prioritizing all possible contributing factors. You may want to segment the existing data demographically. Some factors to consider are average household income, new schools and home developments within a specific radius, the rate of growth within the school area, school rating, ...
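As a rough sketch of how "within a specific radius" could be turned into a feature, the snippet below counts sites (new developments, or other schools) within a given distance of a school. It assumes latitude/longitude coordinates are available for both; the coordinates, the 5 km radius, and the function names are hypothetical.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlam = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def count_within_radius(school, sites, radius_km):
    """Number of sites (e.g. new developments) within radius_km of a school."""
    return sum(
        1
        for lat, lon in sites
        if haversine_km(school[0], school[1], lat, lon) <= radius_km
    )


# Hypothetical coordinates, for illustration only.
school = (40.00, -75.00)
developments = [(40.01, -75.00), (40.00, -75.02), (40.50, -75.00)]
nearby = count_within_radius(school, developments, radius_km=5.0)
print(f"Developments within 5 km: {nearby}")
```

The resulting count could be added as one column per school and year; the right radius is something you would have to tune for the county.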
I doubt that a linear regression model would address this problem. This may end up being modeled as a combination of special mathematical functions.
Good luck. Please keep us updated on your findings.