Which machine learning algorithm should I use?

By Hui Li, Principal Staff Scientist, Data Science, at SAS.

A typical question asked by a beginner, when facing a wide variety of machine learning algorithms, is “which algorithm should I use?” The answer to the question varies depending on many factors, including:

  • The size, quality, and nature of data.
  • The available computational time.
  • The urgency of the task.
  • What you want to do with the data.

Even an experienced data scientist cannot tell which algorithm will perform best before trying several of them. We are not advocating a one-and-done approach, but we do hope to provide some guidance on which algorithms to try first, depending on a few clear factors.
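
As a rough illustration of what "trying different algorithms" can look like in practice, the sketch below scores two deliberately simple classifiers with k-fold cross-validation and reports the mean accuracy of each. Everything in it is an assumption made for the example and is not taken from the article: the synthetic two-class data, the two toy classifiers (nearest centroid and 1-nearest neighbor), and the helper names `make_data`, `cross_val_accuracy`, and `compare`. In real work the candidates would more likely come from an established library.

```python
import math
import random
import statistics


def make_data(n=200, seed=0):
    """Hypothetical synthetic data: two numeric features, two classes."""
    rng = random.Random(seed)
    data = []
    for i in range(n):
        label = i % 2
        center = 2.0 * label  # class 1 is shifted by 2.0 on both features
        features = (rng.gauss(center, 1.0), rng.gauss(center, 1.0))
        data.append((features, label))
    rng.shuffle(data)
    return data


def nearest_centroid(train):
    """Fit one centroid per class; predict the class of the closest centroid."""
    centroids = {}
    for label in {y for _, y in train}:
        points = [x for x, y in train if y == label]
        centroids[label] = tuple(statistics.fmean(dim) for dim in zip(*points))

    def predict(x):
        return min(centroids, key=lambda c: math.dist(x, centroids[c]))

    return predict


def one_nearest_neighbor(train):
    """Predict the label of the single closest training point."""

    def predict(x):
        return min(train, key=lambda row: math.dist(x, row[0]))[1]

    return predict


def cross_val_accuracy(fit, data, k=5):
    """Mean accuracy over k folds; each fold is held out exactly once."""
    folds = [data[i::k] for i in range(k)]
    scores = []
    for i, held_out in enumerate(folds):
        train = [row for j, fold in enumerate(folds) if j != i for row in fold]
        predict = fit(train)
        correct = sum(predict(x) == y for x, y in held_out)
        scores.append(correct / len(held_out))
    return statistics.fmean(scores)


def compare(candidates, data, k=5):
    """Score every candidate on the same data and the same folds."""
    return {name: cross_val_accuracy(fit, data, k) for name, fit in candidates.items()}


# Candidate algorithms to try; the names are labels for this example only.
CANDIDATES = {
    "nearest centroid": nearest_centroid,
    "1-nearest neighbor": one_nearest_neighbor,
}

if __name__ == "__main__":
    for name, accuracy in compare(CANDIDATES, make_data()).items():
        print(f"{name}: mean accuracy {accuracy:.3f}")
```

Scoring every candidate on the same folds keeps the comparison fair. The ranking on one dataset says little about the ranking on another, which is the point of trying more than one algorithm.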

The machine learning algorithm cheat sheet

[Figure: Flow chart showing which algorithms to use when]

The full article describes when to use each of the following algorithms:

  • Linear regression and logistic regression
  • Linear SVM and kernel SVM
  • Trees and ensemble trees
  • Neural networks and deep learning
  • k-means/k-modes, GMM (Gaussian mixture model) clustering
  • DBSCAN
  • Hierarchical clustering
  • PCA, SVD and LDA


Comment by R Bohn on May 5, 2017 at 2:02pm

Thanks to Dr. Li for attempting this. Obviously, it's a complex topic. I would be interested in her thoughts on how often the choice of algorithm matters. Which is more productive: improving features (variables), tuning Algorithm X, or trying Algorithm Y?

I also noticed what appeared to be several bugs in the text. For example:

"When most dependent variables are numeric, logistic regression and SVM should be the first try for classification. These models are easy to implement, their parameters easy to tune, and the performances are also pretty good. So these models are appropriate for beginners."

I think this should say "when most independent variables..." With continuous dependent variables, you cannot even use logistic regression. (I was unable to put this comment on the source page.) 

 

Comment by Chris Pehura on May 5, 2017 at 11:28am

My manager is a data steward who needs to understand the algorithm to better understand the data. He needs to do this to develop an intuition that the data is right.

Unlike before, we cannot treat algorithms that process data as black boxes. The moment we do, the plans are wrong, bad decisions are made, and we spend a lot of time chasing and investing in phantoms.

Comment by Siddhi Patel on April 27, 2017 at 7:04pm

Thanks... It is really very helpful...
