This post is part of my forthcoming book The Mathematical Foundations of Data Science. Probability is one of the foundations of machine learning (along with linear algebra and optimization). In this post, we discuss the areas of machine learning where probability theory applies. If you want to know more about the book, follow me on LinkedIn (Ajit Jaokar).
First, we explore some background on probability theory.
Probability as a measure of uncertainty
Probability is a measure of uncertainty. It applies to machine learning because, in the real world, we need to make decisions with incomplete information. Hence, we need a mechanism to quantify uncertainty, which probability provides. Using probability, we can model elements of uncertainty such as risk in financial transactions and in many other business processes. In contrast, traditional programming deals with deterministic problems, i.e. the solution is not affected by uncertainty.
Probability of an event
Probability quantifies the likelihood, or degree of belief, that an event will occur. Probability theory has three important concepts: the sample space, which is the set of all possible outcomes; an event, which is an outcome or set of outcomes to which a probability is assigned; and the probability function, which maps each event to its probability. A probability distribution describes how probability is spread across all the outcomes in the sample space. When all outcomes are equally likely, the probability of an event can be calculated directly by counting the outcomes in which the event occurs and dividing by the total number of possible outcomes. A probability is a value between 0 and 1 inclusive, where 0 means the event is impossible and 1 means it is certain.
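To make the counting rule concrete, here is a minimal sketch that assumes two fair six-sided dice, so that every outcome is equally likely. The names `sample_space` and `probability` are illustrative only and do not come from any library.

```python
from fractions import Fraction
from itertools import product

# Sample space: every ordered outcome of rolling two fair six-sided dice.
sample_space = list(product(range(1, 7), repeat=2))

def probability(event):
    """Probability of an event (a predicate over outcomes), found by counting.

    Only valid because every outcome in sample_space is equally likely.
    """
    favourable = sum(1 for outcome in sample_space if event(outcome))
    return Fraction(favourable, len(sample_space))

p_sum_is_seven = probability(lambda o: o[0] + o[1] == 7)
p_double = probability(lambda o: o[0] == o[1])
print(p_sum_is_seven, p_double)  # prints: 1/6 1/6
```

An event that covers the whole sample space has probability 1, and an event that matches no outcome has probability 0, which are the two ends of the range.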
Two Schools of Probability
There are two ways of interpreting probability: frequentist probability, which treats the probability of an event as its long-run relative frequency over repeated trials, and Bayesian probability, which treats it as how strongly we believe that the event will occur. Frequentist statistics includes techniques like p-values and confidence intervals for statistical inference and maximum likelihood estimation for parameter estimation.
Frequentist techniques are based on counts and Bayesian techniques are based on beliefs. In the Bayesian approach, probabilities are assigned to events based on evidence and prior belief. Bayesian techniques are based on Bayes’ theorem, which updates a prior belief into a posterior belief as evidence arrives. Bayesian analysis can be used to model events that have not occurred before or that occur infrequently. In contrast, frequentist techniques are based on sampling, and hence on the frequency of occurrence of an event. For example, the p-value is a number between 0 and 1: the probability, assuming the null hypothesis is true, of observing data at least as extreme as the data actually observed. The smaller the p-value, the less compatible the data is with the null hypothesis. A large p-value means the data is consistent with the null hypothesis, but it does not prove it. By convention, if the p-value is less than a chosen significance level (commonly 0.05), we reject the null hypothesis in favor of the alternative hypothesis.
With this background, let us explore how probability applies to machine learning.
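The two views can be compared on one small example. The sketch below assumes a coin that lands heads 9 times in 10 tosses (made-up data). The frequentist side computes a one-sided exact binomial p-value under the null hypothesis of a fair coin. The Bayesian side starts from a uniform Beta(1, 1) prior, which is an assumption chosen for simplicity, and reports the posterior mean of the heads probability. The function names are hypothetical.

```python
from math import comb

def binomial_p_value(successes, trials, p_null=0.5):
    """One-sided p-value: P(X >= successes) when X ~ Binomial(trials, p_null)."""
    return sum(
        comb(trials, k) * p_null**k * (1 - p_null) ** (trials - k)
        for k in range(successes, trials + 1)
    )

def beta_posterior_mean(successes, trials, prior_a=1.0, prior_b=1.0):
    """Posterior mean of the heads probability under a Beta(prior_a, prior_b) prior."""
    return (prior_a + successes) / (prior_a + prior_b + trials)

# Made-up data: 9 heads in 10 tosses.
p_value = binomial_p_value(9, 10)           # 11/1024, about 0.0107
posterior_mean = beta_posterior_mean(9, 10)  # 10/12, about 0.833

print(f"p-value under a fair coin: {p_value:.4f}")
print(f"posterior mean of P(heads): {posterior_mean:.3f}")
```

The p-value is below 0.05, so a frequentist analysis at that level would reject the fair-coin hypothesis. The Bayesian result is a belief about the heads probability itself, and it would change if a different prior were assumed.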
Sampling – Dealing with non-deterministic processes
Probability forms the basis of sampling. In machine learning, uncertainty can arise in many ways, for example as noise in the data. Probability provides a set of tools to model this uncertainty. Noise can arise from variability in the observations, from measurement error, or from other sources. Noise affects both inputs and outputs.
Apart from noise in the sample data, we should also cater for the effects of bias. Even when the observations are uniformly sampled, i.e. no bias is assumed in the sampling, other limitations can introduce bias. For example, if we choose a set of participants from a specific region of the country, then by definition the sample is biased towards that region. We could expand the scope of the sample, and the variance in the data, by including more regions of the country. We need to balance the variance and the bias so that the sample chosen is representative of the task we are trying to model.
Typically, we are given a dataset, i.e. we do not have control over how the dataset was created and sampled. To cater for this lack of control over sampling, we split the data into train and test sets or we use resampling techniques. Hence, probability (through sampling) is involved whenever we have incomplete coverage of the problem domain.
Pattern recognition is a key part of machine learning. We can approach machine learning as a pattern recognition problem from a Bayesian standpoint. In Pattern Recognition and Machine Learning, Christopher Bishop takes a Bayesian view and presents approximate inference algorithms for situations where exact answers are not feasible. For the same reasons listed above, probability theory is a key part of pattern recognition: it helps to cater for noise and uncertainty, to account for the finite size of the sample, and to apply Bayesian principles to machine learning.
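As a rough illustration of both ideas, the sketch below uses only the Python standard library. It shuffles a dataset into train and test sets, then uses bootstrap resampling (sampling with replacement) to see how much the mean of the training data varies from resample to resample. The dataset, the 25% test fraction, and the function names are all assumptions made for the example.

```python
import random

def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of rows and split it into (train, test) lists."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def bootstrap_means(values, n_resamples=1000, seed=0):
    """Mean of each bootstrap resample (same size as values, with replacement)."""
    rng = random.Random(seed)
    n = len(values)
    return [sum(rng.choices(values, k=n)) / n for _ in range(n_resamples)]

# Made-up dataset: the integers 0..99 stand in for real observations.
data = list(range(100))
train, test = train_test_split(data)
means = bootstrap_means(train)

print(len(train), len(test))  # prints: 75 25
print(f"bootstrap means range from {min(means):.1f} to {max(means):.1f}")
```

The spread of the bootstrap means gives a sense of how uncertain an estimate is when all we have is one finite sample.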
Training – use in Maximum likelihood estimation
Many machine learning training procedures are based on maximum likelihood estimation (MLE), which comes from probability theory: we choose the model parameters that make the observed data most probable. MLE is used for training in models like linear regression, logistic regression and artificial neural networks.
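A minimal sketch of MLE on the simplest possible model may help. It assumes the data are independent binary outcomes from a single Bernoulli distribution with unknown parameter p. For this model the maximum likelihood estimate has a closed form, the sample mean, and the code checks that against a brute-force search over a grid of candidate values. The data and function names are made up for illustration.

```python
import math

def bernoulli_log_likelihood(p, outcomes):
    """Log-likelihood of binary outcomes under a Bernoulli(p) model."""
    return sum(math.log(p) if y == 1 else math.log(1 - p) for y in outcomes)

def bernoulli_mle(outcomes):
    """Closed-form maximum likelihood estimate: the fraction of ones."""
    return sum(outcomes) / len(outcomes)

# Made-up data: 7 successes in 10 trials.
outcomes = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]
p_hat = bernoulli_mle(outcomes)

# Brute-force check: the candidate p with the highest log-likelihood.
grid = [i / 100 for i in range(1, 100)]
p_grid = max(grid, key=lambda p: bernoulli_log_likelihood(p, outcomes))

print(p_hat, p_grid)  # prints: 0.7 0.7
```

Models such as logistic regression have no closed form, so the same log-likelihood is maximized numerically instead, but the principle is the same.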
Developing specific algorithms
Probability forms the basis of specific algorithms like the Naive Bayes classifier.
In machine learning models such as neural networks, hyperparameters are tuned through techniques like grid search. Bayesian optimization can also be used for hyperparameter optimization.
In binary classification tasks, we predict a single probability score. Model evaluation techniques require us to summarize the performance of a model based on predicted probabilities. For example, aggregate measures like log loss require an understanding of probability theory.
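To show where the probability enters, here is a small sketch of a Naive Bayes text classifier written from scratch. It assumes a bag-of-words model with add-one (Laplace) smoothing, and the tiny training set is made up. The function names are hypothetical, and a real project would more likely use an existing library implementation.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(documents):
    """Count labels and per-label words from (list_of_words, label) pairs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for words, label in documents:
        label_counts[label] += 1
        word_counts[label].update(words)
        vocabulary.update(words)
    return label_counts, word_counts, vocabulary

def predict(model, words):
    """Return the label with the highest posterior score for the given words."""
    label_counts, word_counts, vocabulary = model
    total_docs = sum(label_counts.values())
    scores = {}
    for label, count in label_counts.items():
        total_words = sum(word_counts[label].values())
        # log P(label) + sum of log P(word | label), with add-one smoothing.
        score = math.log(count / total_docs)
        for word in words:
            smoothed = (word_counts[label][word] + 1) / (total_words + len(vocabulary))
            score += math.log(smoothed)
        scores[label] = score
    return max(scores, key=scores.get)

# Made-up training data.
training = [
    (["win", "cash", "now"], "spam"),
    (["cheap", "cash", "offer"], "spam"),
    (["meeting", "at", "noon"], "ham"),
    (["lunch", "meeting", "today"], "ham"),
]
model = train_naive_bayes(training)

print(predict(model, ["cash", "offer"]))     # prints: spam
print(predict(model, ["meeting", "today"]))  # prints: ham
```

The "naive" part is the assumption that words are independent given the label, which is what lets the per-word probabilities be multiplied (added, in log space).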
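Log loss can be sketched in a few lines. The version below assumes binary labels coded as 0 and 1 and predicted probabilities for the positive class. It clips probabilities away from exactly 0 and 1 to avoid taking the logarithm of zero. The clipping constant `eps` is an arbitrary choice for the example.

```python
import math

def log_loss(y_true, y_prob, eps=1e-15):
    """Average negative log-likelihood of binary labels under predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # keep log() finite
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

labels = [1, 0, 1, 0]
confident_and_right = log_loss(labels, [0.9, 0.1, 0.8, 0.2])
confident_and_wrong = log_loss(labels, [0.1, 0.9, 0.2, 0.8])
uninformative = log_loss(labels, [0.5, 0.5, 0.5, 0.5])  # equals log(2)

print(f"{confident_and_right:.3f} {uninformative:.3f} {confident_and_wrong:.3f}")
```

Lower is better: confident correct predictions score close to 0, a constant 0.5 scores log(2), and confident wrong predictions are penalized heavily.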
Applied fields of study
Probability forms the foundation of many fields such as physics, biology, and computer science where maths is applied.
Probability is a key part of inference: maximum likelihood estimation in the frequentist approach and Bayesian inference in the Bayesian approach.
As we see above, there are many areas of machine learning where probability concepts apply. Yet, they are not so commonly taught in typical coding programs on machine learning. In the last blog, we discussed this trend in the context of correlation vs causation. I suspect the same reason applies here, i.e. the starting point for most developers is a dataset which they have already been provided. In contrast, if you conduct an experiment for a PhD thesis, you typically have to build your experiment from scratch.