I have a dataset of about 5,000 people (rows) and 50 variables (columns). The columns measure different demographic and psychographic items for each person. This data is gathered at time t. Some time later, say at time t+, I gather consumption data for a given product from the same people; I call this the discriminant variable.
Most variables are continuous. The discriminant variable (also continuous) should be reduced to 2 classes: above and below a given consumption threshold.
How would you proceed to classify the people in terms of consumption *before* the data for the discriminant variable is available?
I tried discriminant analysis, but it doesn't deliver satisfactory results: too many people are assigned to the wrong predicted group, because the average discriminant scores of the two groups are very close. Is there any other approach you would suggest?
Your help will be highly appreciated.
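If your 5k dataset contains behavioural data, you may try a Latent Class Analysis (or latent profile analysis for continuous variables). It groups people into unobserved classes based on combinations of attributes, which can expose different behaviour "behind the curtains", and it can help show which columns contribute little.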
Another option is to "brute force" your data with different algorithms: treat the discriminant variable as the real-life outcome (label), let each algorithm run on the same data, and compare the results. Sometimes the simplest approach is the best.
It would probably also help to generate additional variables (attributes, columns) from the data you have, to fine-tune your results.
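To make the "brute force" idea concrete, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration: the data is synthetic (two made-up features instead of your 50), the threshold of 0.0 is hypothetical, and the three candidates (majority baseline, nearest centroid, 5-nearest neighbours) are just simple stand-ins for whatever algorithms you actually try. The point is only the workflow: binarize the outcome, then score every candidate with the same cross-validation.

```python
import random
from statistics import mean

def make_data(n=200, seed=0):
    """Synthetic stand-in: 2 features at time t, consumption observed later."""
    rng = random.Random(seed)
    rows, consumption = [], []
    for _ in range(n):
        x1, x2 = rng.gauss(0, 1), rng.gauss(0, 1)
        rows.append((x1, x2))
        consumption.append(2.0 * x1 - x2 + rng.gauss(0, 0.5))
    return rows, consumption

def sq_dist(a, b):
    """Squared Euclidean distance between two feature tuples."""
    return sum((u - v) ** 2 for u, v in zip(a, b))

def majority_fit(X, y):
    """Baseline: always predict the most frequent training class."""
    label = int(sum(y) * 2 >= len(y))
    return lambda row: label

def centroid_fit(X, y):
    """Predict the class whose mean feature vector is closest."""
    cents = {}
    for c in set(y):
        members = [row for row, lab in zip(X, y) if lab == c]
        cents[c] = tuple(mean(col) for col in zip(*members))
    return lambda row: min(cents, key=lambda c: sq_dist(row, cents[c]))

def knn_fit(X, y, k=5):
    """Predict by majority vote among the k closest training rows."""
    def predict(row):
        nearest = sorted(zip(X, y), key=lambda p: sq_dist(row, p[0]))[:k]
        return int(sum(lab for _, lab in nearest) * 2 > k)
    return predict

def cv_accuracy(fit, X, y, folds=5):
    """Mean accuracy over interleaved folds; each row is held out once."""
    scores = []
    for f in range(folds):
        test_idx = set(range(f, len(X), folds))
        X_train = [X[i] for i in range(len(X)) if i not in test_idx]
        y_train = [y[i] for i in range(len(X)) if i not in test_idx]
        model = fit(X_train, y_train)
        hits = sum(model(X[i]) == y[i] for i in test_idx)
        scores.append(hits / len(test_idx))
    return mean(scores)

X, consumption = make_data()
threshold = 0.0  # hypothetical consumption cut-off
y = [int(c > threshold) for c in consumption]  # 1 = above, 0 = at or below

candidates = {
    "majority baseline": majority_fit,
    "nearest centroid": centroid_fit,
    "5-nearest neighbours": knn_fit,
}
results = {name: cv_accuracy(fit, X, y) for name, fit in candidates.items()}
for name, acc in sorted(results.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {acc:.3f}")
```

The majority baseline is worth keeping in the comparison: if no algorithm clearly beats it on held-out data, the columns probably carry little information about consumption, whatever method is used.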
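As a rough illustration of generating extra columns, here is a small sketch. The column names (`age`, `income`, `household_size`, `brand_affinity`) are hypothetical, since I don't know your actual variables; the idea is just ratios, interactions and simple flags built from what you already have.

```python
def add_derived_features(person):
    """Return a copy of one person's record with extra derived columns."""
    out = dict(person)
    # Ratio: income spread over the household (guard against size 0).
    out["income_per_household_member"] = (
        person["income"] / max(person["household_size"], 1)
    )
    # Interaction between a demographic and a psychographic item.
    out["age_x_brand_affinity"] = person["age"] * person["brand_affinity"]
    # Simple flag derived from a continuous variable.
    out["is_senior"] = int(person["age"] >= 65)
    return out

# Hypothetical record with made-up values.
person = {"age": 40, "income": 60000.0, "household_size": 3, "brand_affinity": 0.5}
enriched = add_derived_features(person)
print(enriched)
```

Whether any such column helps is an empirical question; check it by rerunning the same validation with and without the new columns.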
It is really hard to say anything useful without knowing the data themselves...