**Bayesian Machine Learning (Part 1)**

**Introduction**

As a data scientist, I am curious about understanding different analytical processes from a probabilistic point of view. There are two popular ways of looking at any event, namely **Bayesian** and **Frequentist**. While Frequentist researchers look at an event through its frequency of occurrence, Bayesian researchers focus more on the probability of the event happening.

I am starting this series of blog posts to illustrate the Bayesian methods of performing analytics. I will try to cover as much theory as possible, with illustrative examples and sample code so that readers can learn and practice simultaneously.

Let’s start !!!

**Defining Bayes’ Rule**

As we all know, Bayes’ rule is one of the most popular probability equations. It is defined as:

P(a **given** b) = P(a **intersection** b) / P(b) ….. (1)

Here a and b are events that have taken place.

In the above equation I have bold-marked **given** and **intersection**, as these words carry the major significance in Bayes’ rule. **Given** indicates that event b has already happened and we now need to determine the probability of event a happening. **Intersection** indicates the occurrence of events a and b simultaneously.

Another form in which the above equation can be written is as follows:

P(a **given** b) = P(b **given** a) * P(a) / P(b) …. (2)

(This equation can easily be derived from equation 1 by writing P(a **intersection** b) = P(b **given** a) * P(a).)

The above equation forms the foundation of Bayesian inference.
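As a quick sketch, equation 2 can be turned into a one-line function. The diagnostic-test numbers below are made up purely for illustration; they are not from any real study:

```python
def bayes_rule(p_b_given_a, p_a, p_b):
    """Equation 2: P(a given b) = P(b given a) * P(a) / P(b)."""
    return p_b_given_a * p_a / p_b

# Made-up diagnostic-test numbers (purely illustrative):
# P(positive | disease) = 0.95, P(disease) = 0.01, and an assumed 5%
# false-positive rate, so P(positive) = 0.95*0.01 + 0.05*0.99 = 0.059.
p_disease_given_positive = bayes_rule(0.95, 0.01, 0.059)
print(round(p_disease_given_positive, 3))  # prints 0.161
```

Even with a 95% accurate test, the low prior P(disease) keeps the posterior small, which is exactly the kind of reasoning Bayes’ rule captures.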

**Understanding Bayes’ Rule from an Analytics Perspective**

In analytics we always try to identify worldly behaviors through **models**. These models are mathematical equations with some **parameters** in them. These **parameters** are estimated based upon the behavior of events, or the evidence we collect from the world. This evidence is popularly known as **data**.

So the question arises: how do Bayesian methods help in identifying these parameters?

Let us first see how Bayes’ rule can incorporate these models. We will take *theta* and X as our events in Bayes’ rule and re-write equation 2.

P(*theta* **given** X) = P(X **given** *theta*) * P(*theta*) / P(X) ….. (3)

Now, let us define all the different components of the above equation:

- P(*theta* **given** X) : Posterior distribution
- P(X **given** *theta*) : Likelihood
- P(*theta*) : Prior distribution
- P(X) : Evidence

We can use the term distribution because all these terms are probabilities ranging from 0 to 1. *theta* in the above case represents the parameters of the model we need to compute, and X is the data on which the model is trained.

Equation 3 can be re-written as:

posterior distribution = likelihood * prior distribution / evidence ….. (4)

Now let us see all the above components individually.

**Prior Distribution :** We consider the prior distribution of *theta* as the information we have regarding *theta* before even starting the model fitting process. This information is mostly based upon experience. Usually we take a Normal distribution with mean = 0 and variance = 1 as the prior distribution of *theta*.
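As a small sketch, the density of this N(0, 1) prior can be written directly from the Normal distribution formula:

```python
import math

def standard_normal_pdf(theta):
    """Density of the N(mean=0, variance=1) prior at a given theta."""
    return math.exp(-theta ** 2 / 2) / math.sqrt(2 * math.pi)

# The prior is highest at theta = 0 and falls off symmetrically,
# encoding the belief that parameter values near zero are more plausible.
print(round(standard_normal_pdf(0.0), 4))  # prints 0.3989
print(round(standard_normal_pdf(2.0), 4))  # prints 0.054
```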

**Posterior Distribution : ** This is the solution distribution we get over our *theta* given our data. That is, once we have trained our model on the given data, we end up with tuned values for the parameters of the model. The posterior distribution is the distribution over the estimated *theta* *(this is again a big difference between the Frequentist and Bayesian ways of inference: the result is a distribution over theta, not a single point estimate)*.

**Likelihood : ** This term is not a probability distribution over *theta*, but the probability of occurrence of the data given *theta*. In other words, given some *theta*, how likely are we to observe the given data? That is, how accurately can our model, with the given *theta* as parameters, explain the data?

**Evidence : ** This is the probability of the occurrence of the data itself.
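For a discrete set of hypotheses, the evidence can be sketched as the likelihood averaged over the prior, i.e. P(X) = sum over h of P(X given h) * P(h). A minimal illustration with made-up numbers:

```python
def evidence(likelihoods, priors):
    """P(X) = sum over hypotheses h of P(X given h) * P(h)."""
    return sum(l * p for l, p in zip(likelihoods, priors))

# Made-up numbers: two hypotheses with equal priors of 1/2 each.
p_x = evidence([0.2, 0.05], [0.5, 0.5])
print(p_x)  # 0.2*0.5 + 0.05*0.5 = 0.125
```

Because this sum runs over every hypothesis, the evidence is the same no matter which single hypothesis we are scoring, which is why it acts as a constant when comparing posteriors.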

Now that we have our definitions in place, let us see an example showing how Bayesian reasoning can help in selecting a hypothesis, given the data.

Let us suppose we have the following data:

X = {2, 4, 8, 32, 64}

And we propose the following two hypotheses:

1) 2^n, where n ranges from 0 to 9

2) 2*n, where n ranges from 1 to 50

Now let us see how we can use Bayes’ rule.

Note : as we have no prior information, we assign equal prior probability to both hypotheses.

—– Hypothesis 1: 2^n, where n ranges from 0 to 9 —–

This hypothesis takes the following 10 values: 1, 2, 4, 8, 16, 32, 64, 128, 256, 512.

- **prior 1** : 1/2
- **likelihood 1** : (1/10) * (1/10) * (1/10) * (1/10) * (1/10) = (1/10)^5
- **evidence** : constant for all hypotheses, as the input data is fixed
- **posterior 1** : (1/10)^5 * (1/2) / evidence

—– Hypothesis 2: 2*n, where n ranges from 1 to 50 —–

This hypothesis takes the following 50 values: 2, 4, 6, 8, 10, 12, 14, 16, …, 100.

- **prior 2** : 1/2
- **likelihood 2** : (1/50) * (1/50) * (1/50) * (1/50) * (1/50) = (1/50)^5
- **evidence** : constant for all hypotheses, as the input data is fixed
- **posterior 2** : (1/50)^5 * (1/2) / evidence

From the above analysis we can easily see that posterior 1 >> posterior 2 (their ratio is 5^5 = 3125), which means hypothesis 1 explains the data much better than hypothesis 2.
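The comparison above can be checked with a short sketch. Since the evidence is the same for both hypotheses it cancels, so we compare likelihood * prior directly; hypothesis 2 is taken over the 50 values 2, 4, …, 100, so that each data point has probability 1/50, matching the likelihood above:

```python
# The evidence P(X) is the same for both hypotheses, so it cancels when
# comparing posteriors and we can compare likelihood * prior directly.
X = [2, 4, 8, 32, 64]

h1 = [2 ** n for n in range(0, 10)]  # 2^n for n = 0..9  -> 10 values
h2 = [2 * n for n in range(1, 51)]   # 2*n for n = 1..50 -> 50 values

def unnormalized_posterior(hypothesis, data, prior):
    """likelihood * prior, where each data point has probability 1/len(hypothesis)."""
    if any(x not in hypothesis for x in data):
        return 0.0  # the hypothesis cannot generate this data point
    return (1 / len(hypothesis)) ** len(data) * prior

post1 = unnormalized_posterior(h1, X, 1 / 2)
post2 = unnormalized_posterior(h2, X, 1 / 2)
print(post1 > post2)  # prints True: hypothesis 1 explains the data better
# The ratio post1 / post2 works out to (50/10)^5 = 5^5 = 3125.
```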

If we look closely at the evaluation of the **posterior** for both hypotheses, we will note that the major difference came from the **likelihood**. In future posts we will see that **maximizing** this **likelihood** helps in parameter tuning. This method is popularly known as **Maximum Likelihood Estimation**.

So in this post I introduced Bayes’ rule. In the next post we will see how to use it in estimating parameters for linear regression, with an example.

Thanks For Reading !!!