
Data Science and Machine Learning Without Mathematics

There is a set of techniques covering all aspects of machine learning (the statistical engine behind data science) that does not use any mathematics or statistical theory beyond high school level. So when you hear that some serious mathematical knowledge is required to become a data scientist, this should be taken with a grain of salt.

Mathematics is thought to be a requirement for the following reasons:

  • Standard tools, such as logistic regression, decision trees, or confidence intervals, are math-heavy
  • Most employers use standard tools
  • As a result, hiring managers look for candidates with a strong math background, mostly for historical reasons
  • Academic training for data scientists is math-heavy for historical reasons (it relies on the professors who used to teach stats classes)

Because of this, you really need to be math-savvy to get a "standard" job, so sticking to standard math-heavy training and standard tools works for people interested in becoming data scientists. To make things more complicated, most of the courses advertised as "math-free" or "learn data science in three days" are selling you snake oil: they won't help you get a job, and the training material is often laughable. You can learn data science very quickly, even on your own, if you are a self-learner with a strong background in working with data and programming (maybe you have a physics background), but that is another story.

Yet there is a set of techniques, designed by a data scientist with a strong mathematical background and a long list of publications in top statistical journals, that uses neither mathematics nor statistical modeling. These techniques work just as well, and some of them have been proved to be equivalent to their math-heavy cousins, with the additional bonus of generally being more robust. They are easy to understand and lead to easy interpretations, yet they are not snake oil: they are based on years of experience processing large volumes of diverse data, mostly in automated mode.

If you create your own startup, develop your own data science consultancy, or work for an organization that does not care about the tools that you use -- as long as they are cheap, easy to implement, and reliable -- you might consider using these simple, scalable, math-free methods. For instance, if you develop algorithms for stock trading, you wouldn't want to use the same tools as your competitors. These math-free techniques can give you a competitive advantage. 

Below, I describe several math-free techniques covering a good chunk of data science, and how they differ from their traditional math-heavy cousins. I use them pretty much every day, though most of the time in automated mode.

  • Advanced Machine Learning with Basic Excel -- This is a light implementation of the technique described below. It is so simple (yet efficient) that basic Excel implementations exist. It is also available in Python, Perl, Julia, and R, and we are currently working on an SQL implementation.
  • State-of-the-Art Machine Learning Automation with HDT -- This blends two traditional techniques: decision trees and regression. However, this implementation involves neither node splitting nor any traditional regression model (the regression part is the math-free Jackknife regression described below). An earlier version was based on logistic regression, but after noticing that simple data transformations and fewer parameters resulted in better, more robust performance, logistic regression was replaced by Jackknife regression.
  • Model-Free Confidence Intervals -- To understand the concept of a confidence interval, you typically need some notions about random variables and probability distributions: in short, all the stuff explained in the first 200 pages of any statistics textbook. Not here. These confidence intervals are based on percentiles, which are very easy to understand and math-free, yet you can reliably use them for predictive analytics. My confidence intervals are equivalent to the standard ones in most cases, the only difference being that you don't need to read 200 pages of stats to understand how they work.
  • Tests of Hypotheses -- This is one of the difficult topics for students taking stats classes. Here, it has been replaced by a simple variant of my confidence intervals, so understanding the concept is now straightforward.
  • Jackknife Regression with Excel -- This regression technique is so simple (and efficient) that it can be implemented in Excel or SQL. 
  • Jackknife Regression: Theory -- This is regression without statistical theory behind it: not even linear algebra. Yet it comes with confidence intervals. Despite using few meta-parameters, the loss of accuracy (compared with classic regression) is minimal. The methodology works well in the presence of outliers, highly correlated features, or other violations of the assumptions that your data set must satisfy when using traditional regression.
  • Indexation, Cataloguing, and NLP -- A "math-free" approach to supervised clustering.
  • Fast Combinatorial Feature Selection -- Traditional techniques are based on some variance reduction principle, which usually requires understanding the concept of random variable. Not here.  
  • Debunking Some Statistical Myths -- Which is better: a 99% accurate technique on a data set that is 80% accurate, or a 90% accurate technique on a data set that is 90% accurate?
  • Variance, Clustering, and Density Estimation Revisited -- No maths involved. 
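
To make the percentile idea behind the model-free confidence intervals a bit more concrete, here is a minimal sketch in Python. It uses the plain percentile bootstrap, which is a stand-in chosen for illustration and not necessarily the exact procedure described in the article listed above. The function names, the sample values, and the parameters (level, n_resamples, seed) are all hypothetical.

```python
import random
import statistics


def percentile(sorted_values, q):
    """Return the q-th percentile (0 <= q <= 100) of an already sorted list,
    using linear interpolation between neighboring values."""
    if not sorted_values:
        raise ValueError("need at least one value")
    pos = (len(sorted_values) - 1) * q / 100.0
    lower = int(pos)
    upper = min(lower + 1, len(sorted_values) - 1)
    weight = pos - lower
    return sorted_values[lower] * (1 - weight) + sorted_values[upper] * weight


def percentile_interval(data, stat=statistics.mean, level=0.90,
                        n_resamples=2000, seed=0):
    """Percentile bootstrap interval (an illustrative assumption, see lead-in).

    Resample the data with replacement, recompute the statistic each time,
    then read the interval off the percentiles of those recomputed values.
    """
    rng = random.Random(seed)
    n = len(data)
    estimates = sorted(
        stat(rng.choices(data, k=n)) for _ in range(n_resamples)
    )
    tail = (1 - level) / 2 * 100  # e.g. 5 for a 90% interval
    return percentile(estimates, tail), percentile(estimates, 100 - tail)


if __name__ == "__main__":
    # Made-up measurements, including one outlier (40).
    sample = [12, 15, 11, 19, 14, 13, 40, 16, 12, 15, 14, 17]
    low, high = percentile_interval(sample, stat=statistics.median)
    print("90% interval for the median:", low, high)
```

Nothing here requires a probability distribution or a formula for the standard error: the interval is simply the range that contains the middle 90% of the recomputed values, and swapping `stat` for another function gives an interval for a different quantity.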

I am now working on developing stats-free methods for time series, though I developed stats-heavy ones in the past, in the context of extreme event analytics.


Comment by Scott Burk on June 30, 2017 at 9:54am

Vincent, I enjoy the discussion. I don't remember much measure theory nor many of the rote mechanics of calculus. But taking the classes helped me learn to think, and I still understand the principles of calculus and use those concepts even today. I look forward to diving deeper into some of your links. I think there are a lot of programs chasing students and agree with your snake oil analogy. It is a shame to get students into these programs only for them to be disappointed by the lack of a job, or for an employer to be disappointed by what they hired. Look forward to more of these interesting discussions. Thanks.

Comment by Jeffrey Lapides on June 30, 2017 at 7:15am

Hi Vincent:

Really like your approach and look forward to reading through all these. I often tell my clients very similar things, namely that you don't need fancy statistics to do lots of interesting and useful analysis. I tell them that I focus on being a scientist, trying out ideas and making sure I understand results, not blindly applying complex algorithms.

I learned this the hard way a long time ago when my undergraduate thesis advisor made a fool of me (not intentionally). I had spent 3 weeks trying to use FFTs to squeeze a signal out of a small amount of data on the college mainframe, and he finally wondered what I was doing. He asked me for the data, took out a piece of graph paper, plotted every 3 points, looked at it and said, "Hmmm... Looks like you screwed up the experiment, there is no signal here. I suggest you redo the experiment and stop analyzing the data."

My own view is that if you can't visualize the result, you won't convince anybody of anything too easily. My son told me this is the well known statistical test he learned in medical school, the TFO test: Totally F..ing Obvious.

Jeff Lapides


© 2017   Data Science Central