In a predictive model, you have (say) n=500 variables and k=50,000,000 observations, with significant cross-correlations between variables (this is often the case with data related to fraud detection). How do you determine a good mix of variables? How many should you keep? I'm looking for something easy to interpret, robust, scalable, and model-independent. So PCA, whose components are hard to interpret, is not a good option here.
Any idea what criterion to use? With the right combination of variables, you get pretty much the same predictive power with 10 variables as with the full set. Yet using the full set allows you to identify extremely rare cases that are extremely costly in terms of fraud. What would be a good compromise?
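To make the trade-off concrete, here is one simple, model-independent criterion you could apply at this scale: greedily add the variable most correlated with the target, penalized by its correlation with the variables already chosen (an mRMR-style rule). This is purely an illustration, not the algorithm discussed in this post; the function name and the penalty weight are my own.

```python
import numpy as np

def forward_select(X, y, k, penalty=0.5):
    """Greedy correlation-based selection (illustrative sketch).

    At each step, pick the variable with the highest absolute correlation
    with the target, minus a penalty proportional to its mean absolute
    correlation with the variables already selected. The penalty keeps
    near-duplicate variables out of the chosen mix.
    """
    n_vars = X.shape[1]
    # relevance: absolute correlation of each candidate with the target
    relevance = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(n_vars)])
    selected, remaining = [], list(range(n_vars))
    while len(selected) < k and remaining:
        scores = []
        for j in remaining:
            if selected:
                # redundancy: how much j overlaps with what we already have
                redundancy = np.mean([abs(np.corrcoef(X[:, j], X[:, s])[0, 1])
                                      for s in selected])
            else:
                redundancy = 0.0
            scores.append(relevance[j] - penalty * redundancy)
        best = remaining[int(np.argmax(scores))]
        selected.append(best)
        remaining.remove(best)
    return selected
```

On 50 million rows you would run this on a sample or on pre-computed correlation matrices; the greedy loop itself only touches k x n correlation values, so it scales far better than exhaustive subset search.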
What about adaptive variable selection, that is, a set of variables that changes every month to improve predictive performance? And how do you modify your set of variables without impacting fraud scores, that is, while keeping scores consistent over time?
Here is a modern algorithm - not your typical stepwise procedure - for selecting the most useful variables.
As the person selecting the variables, the information accessible to you will be... variable. To allow for this facet of human intelligence, I suggest that anyone in the prediction business follow the path of the ancient Persians. When you have identified as many key variables as you can, go out for a few beers. Chances are that the next morning you will wake up with at least one major variable that had slipped your mind.