
# Difference between Machine Learning and Statistics

I run into this question a lot. I have heard statisticians say that we all do machine learning, because none of us actually runs a regression or classification by hand on paper. We all use machines.

On the other hand, some computer scientists I talk to say that when you use programmatic techniques to orchestrate an analytical flow, rather than a GUI in SAS or SPSS, you are doing machine learning.

One more answer I have heard is that if you use algorithms like Random Forest, Deep Learning, GBM, etc., you are doing machine learning rather than statistics.

I think all of the above observations are partly right. But as a person trained in both Computer Science and Statistics, I have a very specific test to split the two.

To me, the difference really lies in how the loss function is defined. In statistics, the loss function is pre-defined and wired to the type of method you are running. For example, for least-squares regression the loss function is the mean squared error (MSE), and the best result is the one that minimizes the MSE.
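To make the "pre-defined loss" point concrete, here is a minimal sketch in plain Python. It assumes simple one-variable least-squares regression; the toy data and the helper names `fit_simple_ols` and `mse` are made up for illustration. The closed-form fit never asks you for a loss function, yet nudging its coefficients in any direction only makes the MSE worse.

```python
def fit_simple_ols(xs, ys):
    """Closed-form least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


def mse(xs, ys, intercept, slope):
    """Mean squared error of the line on the given data."""
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    return sum(r ** 2 for r in residuals) / len(residuals)


# Toy data, invented for this illustration.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

intercept, slope = fit_simple_ols(xs, ys)
best_mse = mse(xs, ys, intercept, slope)

# Any other line does at least as badly on MSE as the fitted one.
nudged_slope_mse = mse(xs, ys, intercept, slope + 0.1)
nudged_intercept_mse = mse(xs, ys, intercept + 0.1, slope)
print(best_mse <= nudged_slope_mse and best_mse <= nudged_intercept_mse)
```

The loss here is baked into the method: you pick "regression" and the MSE comes with it.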

If you are using machine learning, you will most likely write a custom program for a unique loss function specific to your problem. For instance, you might want to take an average of the MSE of several models and then select one based on some criteria (ensemble methods). Or you might have a very skewed dataset, with 1% positives and 99% negatives, where you want to introduce a bias toward positives and code your loss function accordingly. These kinds of operations require a heavily programmatic approach, because ultimately the loss function you land on will be coded specifically for your problem.
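The skewed-dataset case might look something like the sketch below. This is only one possible way to "bias for positives": a log loss where each positive example is multiplied by a weight. The function name `weighted_log_loss`, the `pos_weight` parameter, and the two sets of predicted probabilities are assumptions made up for the example.

```python
import math


def weighted_log_loss(y_true, p_pred, pos_weight=1.0, eps=1e-12):
    """Binary log loss where each positive example counts pos_weight times."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # keep log() away from 0
        if y == 1:
            total += -pos_weight * math.log(p)
        else:
            total += -math.log(1.0 - p)
    return total / len(y_true)


# 1% positives, 99% negatives, as in the text.
y_true = [1] + [0] * 99

# Hypothetical model A: says "almost certainly negative" for everything.
ignores_positive = [0.01] * 100
# Hypothetical model B: finds the positive, but is less sure about negatives.
catches_positive = [0.9] + [0.1] * 99

plain_a = weighted_log_loss(y_true, ignores_positive)
plain_b = weighted_log_loss(y_true, catches_positive)
weighted_a = weighted_log_loss(y_true, ignores_positive, pos_weight=99.0)
weighted_b = weighted_log_loss(y_true, catches_positive, pos_weight=99.0)

print(plain_a < plain_b)        # unweighted loss prefers model A
print(weighted_b < weighted_a)  # weighted loss prefers model B
```

The same two models get ranked in opposite order depending on the loss you code up, which is exactly why the choice of loss ends up being specific to the problem.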

This to me is the key difference between the two.


Comment by Douglas Kell on August 27, 2014 at 2:22pm

Rubbish. Read the seminal article by Breiman L: Statistical modeling: The two cultures. Stat Sci 2001; 16:199-215. Stats starts with a hypothesis and tests the goodness of fit of the data to the hypothesis. ML starts with the data and finds the hypothesis that best fits the data. See also (epistemologically):

Kell DB, Oliver SG: Here is the evidence, now what is the hypothesis? The complementary roles of inductive and hypothesis-driven science in the post-genomic era. Bioessays 2004; 26:99-105.

Comment by Vincent Granville on August 6, 2014 at 9:11am

Ahahah - Louis, I started in Fortran too, long ago! And then Pascal on an Apple machine that could only display text, 25 lines x 80 columns (40 columns per screen, but you could logically - not physically - swap screens). It was a time when some people used their TV as a computer screen. And memory was limited to 64 KB, I think. I used the IMSL statistical libraries in the eighties. They are still alive today, and they recently ran some advertisements on Data Science Central; they've been acquired by Rogue Wave Software, but still use the name IMSL.

Anyway, have you read my article that compares data science with machine learning and other disciplines? Here's the link.

Comment by Niraj on August 6, 2014 at 7:39am

Louis - You are probably one of the few who have been doing "Data Science" for ages. I would say that Jerome Friedman and Leo Breiman are both Data Scientists.

I should probably have compared "Data Science" and "Analytics" in my post, rather than Machine Learning and Statistics.

Comment by Louis Giokas on August 6, 2014 at 7:32am

I don't know if I fully agree. I started doing statistics when the use of packages was fairly rare. We coded everything in FORTRAN. So, to base this on writing a program is a little shaky.

Take Mahout, for example.  It is a machine learning library in which many "traditional" statistical functions are implemented using machine learning methods.  One example is logistic regression (a very traditional statistical method) trained using SGD.  This seems to be using both.  Is it statistics, or is it machine learning?  It seems to be both.
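For readers who have not seen the combination, logistic regression trained with SGD can be sketched in a few lines of plain Python. This is not Mahout's implementation, just an illustrative one-feature version; the function names, learning rate, epoch count and toy data are all assumptions.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def train_logistic_sgd(xs, ys, learning_rate=0.1, epochs=200, seed=0):
    """Fit p(y=1|x) = sigmoid(weight * x + bias), one example per update."""
    rng = random.Random(seed)
    weight, bias = 0.0, 0.0
    order = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(order)  # visit the examples in a fresh random order
        for i in order:
            # Gradient of the log loss for a single example.
            error = sigmoid(weight * xs[i] + bias) - ys[i]
            weight -= learning_rate * error * xs[i]
            bias -= learning_rate * error
    return weight, bias


# Toy, linearly separable data invented for this illustration.
xs = [-2.0, -1.5, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0]
ys = [0, 0, 0, 0, 1, 1, 1, 1]

weight, bias = train_logistic_sgd(xs, ys)
predictions = [1 if sigmoid(weight * x + bias) >= 0.5 else 0 for x in xs]
print(predictions == ys)
```

The model and its log loss are the traditional statistical ones; only the fitting procedure is the "machine learning" part.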