Background

The debate between the Bayesian and frequentist approaches in statistics is long-running. I am interested in how these approaches affect machine learning. Often, books on machine learning combine the two approaches or, in some cases, take only one. This does not help from a learning standpoint.

So, in this two-part blog, we first discuss the differences between the frequentist and Bayesian approaches. Then, we discuss how they apply to machine learning algorithms.

Introduction

Traditionally, we understand statistics as follows. Given a collection of items to be studied (e.g. the heights of people), which we call the population, you can acquire a sample of the population. You can calculate some useful properties of the sample (such as the mean). These give you the descriptive statistics for the sample. But if you want to generalise about the population based on the sample, you need inferential statistics. The goal of inferential statistics is to infer some quantity about the population from the sample. There are two general philosophies for inferential statistics: frequentist and Bayesian.
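
To make the population/sample distinction concrete, here is a minimal sketch. The population of heights is simulated (the mean of 170 cm and spread of 10 cm are made-up values for illustration), and we compare a descriptive statistic of a small sample with the same quantity for the whole population.

```python
import random
from statistics import mean

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical population: 100,000 simulated heights in cm (assumed values).
population = [rng.gauss(170, 10) for _ in range(100_000)]

# In practice we only get to see a sample, here 50 people drawn at random.
sample = rng.sample(population, 50)

population_mean = mean(population)  # usually unknown in real problems
sample_mean = mean(sample)          # descriptive statistic of the sample

print("population mean:", round(population_mean, 2))
print("sample mean:    ", round(sample_mean, 2))
```

The two numbers are close but not equal. Inferential statistics is about what the sample mean lets us say about the population mean.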

The frequentist and Bayesian approaches differ in their interpretation of probability. In the frequentist world, you can only assign probabilities to repeated random phenomena (such as the rolling of a die). From observations of these long-run phenomena, you can infer the probability of a specific event (for instance, how often a fair die lands on 6). Thus, in the frequentist world, to apply probability we need a repeatable event that is observed over a long run. In contrast, in the Bayesian view, we assign probabilities to specific events, and the probability represents a measure of belief or confidence in that event. The belief can be updated in the light of new evidence. In a purist frequentist sense, you could not assign a probability to the outcome of an election, because it is not a repeated event.
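
The frequentist reading of probability as a long-run frequency can be illustrated with a small simulation. This is only a sketch: the function name long_run_frequency and the choice of seed are mine, not part of any standard library.

```python
import random

def long_run_frequency(n_rolls, face=6, seed=0):
    """Fraction of n_rolls of a simulated fair six-sided die that show `face`."""
    rng = random.Random(seed)
    hits = sum(1 for _ in range(n_rolls) if rng.randint(1, 6) == face)
    return hits / n_rolls

# The observed frequency settles near 1/6 as the number of rolls grows.
for n in (10, 1_000, 100_000):
    print(n, long_run_frequency(n))
```

With only a handful of rolls the frequency can be far from 1/6. It is the long run that gives the frequentist probability its meaning.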

There are three key points to remember when discussing the frequentist vs. the Bayesian philosophies.

First, as we already mentioned, Bayesians assign a probability to a specific outcome.
Secondly, Bayesian inference yields probability distributions, while frequentist inference focuses on point estimates.
Finally, in Bayesian statistics, parameters are treated as random quantities with a probability distribution, whereas in the frequentist approach the parameters are fixed. Thus, in frequentist statistics, we take random samples from the population and aim to find a set of fixed parameters that correspond to the underlying distribution that generated the data. In contrast, in Bayesian statistics, we take the entire data and aim to find the parameters of the distribution that generated the data, but we describe these parameters by probability distributions, i.e. they are not fixed.
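
The difference between a point estimate and a distribution over a parameter can be sketched with a coin-flip style example. The data (7 successes in 10 trials) and the uniform Beta(1, 1) prior are assumptions chosen for illustration, and the function names are hypothetical. The interval is approximated by sampling so that only the Python standard library is needed.

```python
import random

def mle(successes, trials):
    """Frequentist point estimate: the observed proportion."""
    return successes / trials

def beta_posterior(successes, trials, prior_a=1.0, prior_b=1.0):
    """Beta posterior parameters for a success probability.

    Assumes a Beta(prior_a, prior_b) prior; Beta(1, 1) is uniform.
    """
    return prior_a + successes, prior_b + (trials - successes)

def posterior_summary(a, b, n_draws=20_000, seed=0):
    """Posterior mean plus an approximate central 95% interval (by sampling)."""
    rng = random.Random(seed)
    draws = sorted(rng.betavariate(a, b) for _ in range(n_draws))
    post_mean = a / (a + b)  # exact mean of a Beta(a, b) distribution
    lower = draws[int(0.025 * n_draws)]
    upper = draws[int(0.975 * n_draws) - 1]
    return post_mean, lower, upper

a, b = beta_posterior(successes=7, trials=10)
print("frequentist point estimate:", mle(7, 10))
print("Bayesian posterior (mean, lower, upper):", posterior_summary(a, b))
```

The frequentist answer is a single number. The Bayesian answer is a whole distribution, summarised here by its mean and an interval, and it would narrow as more data arrive.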

Analysis

So, the question arises: we have seen how Bayesians incorporate uncertainty in their modelling, but how do frequentists treat uncertainty if they work with point estimates?

The general approach for frequentists is to make an estimate but also to specify the conditions under which the estimate is valid.

Frequentists use three ideas to understand uncertainty: the null hypothesis, p-values and confidence intervals. These come broadly under statistical hypothesis testing.

Use of p-values to indicate statistical significance: The p-value is the probability, assuming your null hypothesis is true, of obtaining results at least as extreme as those you observed. A high p-value means the data are consistent with the null hypothesis, i.e. the observed effect could plausibly be due to chance rather than to the experiment you performed. The smaller the p-value, the more statistically significant the result. Note that p-values are statements about the data sample, not about the probability that the hypothesis itself is true.
Use of confidence intervals: A confidence interval gives an estimated range of values which is likely to include an unknown population parameter, the estimated range being calculated from a given set of sample data. (Definition taken from Valerie J. Easton and John H. McColl’s Statistics Glossary v1.1). Common choices for the confidence level C are 0.90, 0.95, and 0.99. For example, a 95% confidence interval is constructed so that, if the sampling were repeated many times, about 95% of the intervals produced would contain the true parameter.
But if your population distribution is not normal, you can still use these tools when your samples are large, thanks to the Central Limit Theorem: the distribution of the sample mean approaches a normal distribution as the sample size grows.
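
As a rough sketch of how a p-value and a confidence interval for a mean might be computed, the following uses the normal approximation justified by the Central Limit Theorem, so it assumes a reasonably large sample. The function names and the simulated heights are illustrative assumptions; for small samples a t-distribution would normally be used instead.

```python
import math
import random
from statistics import NormalDist, mean, stdev

def z_confidence_interval(sample, level=0.95):
    """Approximate confidence interval for the population mean (normal approx.)."""
    std_error = stdev(sample) / math.sqrt(len(sample))
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for level=0.95
    centre = mean(sample)
    return centre - z * std_error, centre + z * std_error

def two_sided_p_value(sample, null_mean):
    """Two-sided p-value for the null hypothesis 'population mean == null_mean'."""
    std_error = stdev(sample) / math.sqrt(len(sample))
    z = (mean(sample) - null_mean) / std_error
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Hypothetical data: 200 simulated heights in cm (assumed values).
rng = random.Random(0)
heights = [rng.gauss(170, 10) for _ in range(200)]

print("95% CI for the mean:", z_confidence_interval(heights))
print("p-value for null mean 170:", two_sided_p_value(heights, null_mean=170))
print("p-value for null mean 180:", two_sided_p_value(heights, null_mean=180))
```

A null value close to the sample mean gives a large p-value, and a null value far from it gives a p-value near zero. Neither number is the probability that the null hypothesis is true.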

To summarise

In this post, we summarised some complex ideas about frequentist and Bayesian probability. In part two, we will see how these ideas apply to machine learning and deep learning algorithms.