
This question was posted on one of our LinkedIn groups. The author wrote:

In practice, given a wide range of classifiers, we often have to choose one based on performance comparison through validation. The research literature shows that no classifier performs universally best across all contexts and all problems. The following paper applied the 8 most popular classifier families in machine learning (SVM, neural networks, ensembles, KNN, decision trees, logistic regression, discriminant analysis, and naive Bayes) to a problem currently confronting financial institutions such as banks, insurers, and asset managers in their derivative valuation and risk management. The paper shows that when the classifiers are properly parameterized (which the paper discusses in detail), their performances are either consistent with or contrary to some classic studies in the area. The paper is available on SSRN.

I replied, saying that a better question is which classifier performs worst, as the answer is simpler. The answers that come to my mind are classifiers based on discriminant analysis or naive Bayes. For instance, linear discriminant analysis can only detect clusters that are linearly separable. A blend of various classifiers usually works better than any single one, and sometimes just transforming your data (e.g., onto a log scale) provides a substantial improvement.
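To make the last point concrete, here is a minimal sketch in Python (scikit-learn; the synthetic, deliberately skewed features are my illustrative assumption, not part of the discussion) showing a log transform helping a linear classifier:

# Minimal sketch: log-transforming skewed features before a linear classifier.
# The data here is synthetic; any right-skewed, non-negative features would do.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X = np.exp(X)  # make the features right-skewed on purpose

raw = LogisticRegression(max_iter=1000)
logged = make_pipeline(FunctionTransformer(np.log1p), LogisticRegression(max_iter=1000))

print("raw:   ", cross_val_score(raw, X, y, cv=5).mean())
print("logged:", cross_val_score(logged, X, y, cv=5).mean())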

However, this being an important question, I'd like to ask our members how they choose a classifier. Even defining performance is not straightforward here.
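As a quick illustration of why "performance" is ambiguous, here is a sketch (scikit-learn; the synthetic imbalanced data is my assumption) where the same model looks good or mediocre depending on the metric you pick:

# Sketch: one classifier, scored under three common definitions of
# "performance" (accuracy, ROC AUC, log loss) on an imbalanced problem.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score, log_loss
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]

print("accuracy:", accuracy_score(y_te, proba > 0.5))  # high partly from imbalance
print("ROC AUC :", roc_auc_score(y_te, proba))
print("log loss:", log_loss(y_te, proba))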


Replies to This Discussion

I would be really glad if LDA and QDA became obsolete, because the formulas behind these models are really hard to understand!

Logistic regression enables the researcher to explicitly parameterize and estimate a theoretical model. The other techniques, to varying degrees, put more emphasis on purely empirical estimation from the data.

Whether this is advantageous or disadvantageous will of course depend on the situation, but as we move to an increasingly Big Data world, I believe theory-based research will have an inherent advantage: with thousands of potential explanatory variables available, the potential for noise to overwhelm the signal increases, unless we have theoretical models to suggest which variables are more likely to convey signal.
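As a sketch of what "explicitly parameterize and estimate a theoretical model" can look like in practice (statsmodels; the variables, coefficients, and data below are hypothetical stand-ins for a real theory):

# Sketch: a theory-specified logistic model where each coefficient is an
# explicit, interpretable parameter. Variable names and true coefficients
# are hypothetical, chosen only to illustrate the workflow.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 500
income = rng.normal(50, 10, n)        # hypothetical explanatory variables
debt_ratio = rng.uniform(0, 1, n)
logit = -2.0 + 0.05 * income - 3.0 * debt_ratio   # the assumed "theory"
y = rng.binomial(1, 1 / (1 + np.exp(-logit)))

X = sm.add_constant(np.column_stack([income, debt_ratio]))
result = sm.Logit(y, X).fit(disp=0)
print(result.summary())  # estimates, standard errors, p-values per parameter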

It's similar to how computers can beat human chess (and now Go) players, yet the strongest players in the world have been expert humans combined with computers. Pure machine learning gets stronger every day, but it still benefits from human input.

I expect KNN to fail because of the curse of dimensionality, so it should be put aside at the outset.
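A quick numpy sketch of the underlying problem, distance concentration: as dimension grows, the nearest and farthest points become nearly equidistant, so "nearest neighbor" loses its meaning (uniform random data is my illustrative assumption):

# Sketch: relative contrast (max-min)/min of distances to a query point
# shrinks as dimension d grows, which is why KNN degrades in high dimensions.
import numpy as np

rng = np.random.default_rng(0)
for d in [2, 10, 100, 1000]:
    X = rng.random((500, d))                 # 500 uniform random points
    q = rng.random(d)                        # a random query point
    dist = np.linalg.norm(X - q, axis=1)
    contrast = (dist.max() - dist.min()) / dist.min()
    print(f"d={d:5d}  relative contrast={contrast:.3f}")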

Exclude LDA next: The Elements of Statistical Learning (Hastie, Tibshirani, Friedman) makes a persuasive case that several of these methods are essentially equivalent. LDA is not better, by any standard, than logistic regression, so rule it out. SVM can be seen as a close relative of logistic regression with regularized parameters; Hastie's website has slides on this (https://web.stanford.edu/~hastie/Papers/svmtalk.pdf). These fall into a family of classifiers that do well.
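A minimal sketch of that near-equivalence (scikit-learn; the synthetic Gaussian data is my assumption, chosen to satisfy LDA's model): both methods fit linear boundaries and score almost identically:

# Sketch: LDA vs. logistic regression on Gaussian class-conditional data.
# Both produce linear decision boundaries and comparable accuracy here.
from sklearn.datasets import make_blobs
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_blobs(n_samples=1000, centers=2, cluster_std=3.0, random_state=0)

for clf in [LinearDiscriminantAnalysis(), LogisticRegression(max_iter=1000)]:
    score = cross_val_score(clf, X, y, cv=5).mean()
    print(type(clf).__name__, round(score, 3))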

To me, with many mutually correlated observables, the random forest seems most persuasive. One article claims that a variant of SVM does as well as the random forest, and is faster: J. Wainer, "Comparison of 14 different families of classification algorithms on 115 binary datasets", https://arxiv.org/pdf/1606.00930.
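A toy version of the kind of head-to-head comparison Wainer runs at scale might look like this (scikit-learn; the dataset and hyperparameters are illustrative only, and the SVM features are standardized first, as SVMs require):

# Sketch: random forest vs. RBF SVM under the same cross-validation.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

rf = RandomForestClassifier(n_estimators=300, random_state=0)
svm = make_pipeline(StandardScaler(), SVC(C=1.0, gamma="scale"))

print("RF :", cross_val_score(rf, X, y, cv=5).mean())
print("SVM:", cross_val_score(svm, X, y, cv=5).mean())

Wainer's point is that this comparison only means much when both sides are properly tuned; the fixed hyperparameters above are a simplification.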

That comparison covers only two-category classification, and I'm not entirely persuaded that multi-category classification will behave the same way.

What about using multiple classifiers together, like a multi-agent system (MAS), and letting them work together?
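One concrete way to let classifiers work together is a voting ensemble; here is a minimal sketch (scikit-learn's VotingClassifier; the member models are arbitrary illustrative choices, and stacking would be another option):

# Sketch: a soft-voting ensemble that averages the member models'
# predicted probabilities instead of relying on any single classifier.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

X, y = load_breast_cancer(return_X_y=True)

vote = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("rf", RandomForestClassifier(n_estimators=200, random_state=0)),
        ("nb", GaussianNB()),
    ],
    voting="soft",  # average predicted probabilities across members
)
print("ensemble:", cross_val_score(vote, X, y, cv=5).mean())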

Unless I missed it (I quickly searched the paper for "boost"), I'm surprised gradient boosted trees aren't mentioned (see http://fastml.com/what-is-better-gradient-boosted-trees-or-random-f...), and XGBoost in particular (not the gradient boosted tree implementation in sklearn), which seems to be winning more than half of Kaggle contests (https://github.com/dmlc/xgboost/tree/master/demo#machine-learning-c...). XGBoost tweaks the original formulation of gradient boosted trees and is thus different from sklearn's implementation (https:[email protected]_bour/6-tricks-i-learned-from-the-otto-kag...). Users claim better predictive power than sklearn's.
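For anyone who wants to try it, a minimal XGBoost sketch using its scikit-learn wrapper (the dataset and hyperparameters are illustrative defaults, not a tuned Kaggle recipe):

# Sketch: gradient boosted trees via XGBoost's sklearn-style API.
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = XGBClassifier(n_estimators=300, max_depth=4, learning_rate=0.1)
model.fit(X_tr, y_tr)
print("accuracy:", accuracy_score(y_te, model.predict(X_te)))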

As indicated in our paper, the original objective of our research was to find a real-world solution to a real-world need: finding a proxy for a type of financial market feature variable (in this case, Credit Default Swap, or CDS, curves) for illiquid corporates, i.e., those that don't have liquid quotes. Classifier performance comparison turned out to be a natural extension of that research.

We followed the existing literature (two classic studies are mentioned in the paper) to find the best (in terms of optimal parameterization choices) of the best (across the 8 most popular classifier families). Naturally, we had to compare 156 classifiers (we actually ran far more than that, but had to cut the paper down to its current size). The paper is here: https://ssrn.com/abstract=2967184

The paper doesn't discuss or compare random forests among the classifier families, despite their popularity. I am curious how they would compare with the rest.

It does compare bagged trees as an example of an ensemble method.
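For readers curious about the distinction: bagged trees and random forests differ mainly in that the forest also samples a random subset of features at each split, which decorrelates the trees. A minimal sketch (scikit-learn; the dataset choice is illustrative):

# Sketch: bagged trees vs. a random forest. The main difference below is
# max_features, the per-split random feature subsampling in the forest.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

bagged = BaggingClassifier(DecisionTreeClassifier(), n_estimators=200, random_state=0)
forest = RandomForestClassifier(n_estimators=200, max_features="sqrt", random_state=0)

print("bagged trees :", cross_val_score(bagged, X, y, cv=5).mean())
print("random forest:", cross_val_score(forest, X, y, cv=5).mean())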

Certainly, there are other classifiers beyond the 8 classifier families covered in our paper. We will take your suggestions and study them in our future research.
