Feature selection is one of the core topics in machine learning. In statistical science, it is called variable reduction or variable selection. Our scientist published a methodology to automate this process and efficiently handle a large number of features (called variables by statisticians). Click here for details.
Here, we mention an article published by Isabelle Guyon and André Elisseeff in the Journal of Machine Learning Research. Although it was published in 2003, it remains one of the best ML papers on feature selection.
Figure 3 from the article: A variable useless by itself can be useful together with others
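The point of this figure is worth a quick demonstration. Below is a tiny sketch (a hypothetical example, not Figure 3 itself, assuming Python with NumPy and scikit-learn): on an XOR-style target, each variable alone carries essentially zero mutual information with the label, yet the two variables together determine it exactly.

```python
# Hypothetical illustration of the caption's point (not Figure 3 itself):
# two variables that are individually useless but jointly informative.
import numpy as np
from sklearn.feature_selection import mutual_info_classif
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(1000, 2))  # two independent binary variables
y = X[:, 0] ^ X[:, 1]                   # XOR target

# Marginal mutual information of each variable with y: both close to 0.
print(mutual_info_classif(X, y, discrete_features=True))

# Yet together they predict y perfectly (a depth-2 tree learns XOR).
tree = DecisionTreeClassifier(max_depth=2).fit(X, y)
print(tree.score(X, y))  # 1.0
```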
Here is the abstract:

Variable and feature selection have become the focus of much research in areas of application for which datasets with tens or hundreds of thousands of variables are available. These areas include text processing of internet documents, gene expression array analysis, and combinatorial chemistry. The objective of variable selection is three-fold: improving the prediction performance of the predictors, providing faster and more cost-effective predictors, and providing a better understanding of the underlying process that generated the data. The contributions of this special issue cover a wide range of aspects of such problems: providing a better definition of the objective function, feature construction, feature ranking, multivariate feature selection, efficient search methods, and feature validity assessment methods.
And here is the conclusion section:

The recent developments in variable and feature selection have addressed the problem from the pragmatic point of view of improving the performance of predictors. They have met the challenge of operating on input spaces of several thousand variables. Sophisticated wrapper or embedded methods improve predictor performance compared to simpler variable ranking methods like correlation methods, but the improvements are not always significant: domains with large numbers of input variables suffer from the curse of dimensionality, and multivariate methods may overfit the data. For some domains, applying first a method of automatic feature construction yields improved performance and a more compact set of features.

The methods proposed in this special issue have been tested on a wide variety of data sets (see Table 1), which limits the possibility of making comparisons across papers. Further work includes the organization of a benchmark. The approaches are very diverse and motivated by various theoretical arguments, but a unifying theoretical framework is lacking.

Because of these shortcomings, it is important when starting with a new problem to have a few baseline performance values. To that end, we recommend using a linear predictor of your choice (e.g. a linear SVM) and selecting variables in two alternate ways: (1) with a variable ranking method using a correlation coefficient or mutual information; (2) with a nested subset selection method performing forward or backward selection, or with multiplicative updates.

Further down the road, connections need to be made between the problems of variable and feature selection and those of experimental design and active learning, in an effort to move away from observational data toward experimental data, and to address problems of causality inference.
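To make this baseline recipe concrete, here is a minimal sketch, assuming scikit-learn (the article does not prescribe any library); the synthetic dataset, the choice of k = 10 selected variables, and the SVM settings are illustrative assumptions, not the authors' choices.

```python
# A minimal sketch of the two recommended baselines, assuming scikit-learn.
# The dataset, k=10, and SVM settings are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.feature_selection import (SelectKBest, SequentialFeatureSelector,
                                       mutual_info_classif)
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic data: 300 samples, 50 variables, only 10 of them informative.
X, y = make_classification(n_samples=300, n_features=50, n_informative=10,
                           random_state=0)

# (1) Variable ranking (filter): score each variable by its mutual
#     information with the target, keep the top 10, then fit a linear SVM.
ranking_baseline = make_pipeline(
    SelectKBest(score_func=mutual_info_classif, k=10),
    StandardScaler(),
    LinearSVC(max_iter=10000),
)

# (2) Nested subset selection (wrapper): greedy forward selection
#     wrapped around the same linear predictor.
forward_baseline = make_pipeline(
    StandardScaler(),
    SequentialFeatureSelector(LinearSVC(max_iter=10000),
                              n_features_to_select=10, direction="forward"),
    LinearSVC(max_iter=10000),
)

for name, model in [("mutual-info ranking", ranking_baseline),
                    ("forward selection", forward_baseline)]:
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: mean CV accuracy = {scores.mean():.3f}")
```

Backward selection would slot into the same wrapper step by passing direction="backward" to SequentialFeatureSelector.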
Below is a screenshot from the article:
The 10 questions below are an extract from this long article:
Click here to read the full article (PDF).