**Bayesian Machine Learning (Part 1)**

**Introduction**

As a data scientist, I am curious about understanding different analytical processes from a probabilistic point of view. There are two popular ways of looking at any event, namely **Bayesian** and **Frequentist**. While Frequentist researchers look at an event in terms of its frequency of occurrence, Bayesian researchers focus more on the probability of the event happening.

I am starting this series of blog posts to illustrate Bayesian methods of performing analytics. I will try to cover as much theory as possible, with illustrative examples and sample code, so that readers can learn and practice simultaneously.

Let's start !!!

**Defining Bayes' Rule**

As we all know, Bayes' rule is one of the most popular probability equations. It is defined as:

P(a **given** b) = P(a **intersection** b) / P(b) ..... (1)

Here a and b are events that have taken place.

In the above equation I have bold-marked **given** and **intersection**, as these words carry the major significance in Bayes' rule. **Given** indicates that event b has already happened and we now need to determine the probability of event a happening. **Intersection** indicates the occurrence of events a and b simultaneously.

Another form in which the above equation can be written is as follows:

P(a **given** b) = P(b **given** a) * P(a) / P(b) .... (2)

(This equation can easily be derived from equation 1, by noting that P(a **intersection** b) = P(b **given** a) * P(a).)

The above equation formulates the foundation of Bayesian inference.
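To make equation 2 concrete, here is a minimal numeric sketch of Bayes' rule. The events and numbers (a spam-filter scenario) are purely hypothetical, chosen for illustration:

```python
# A minimal numeric check of Bayes' rule, equation 2.
# Hypothetical events: a = "email is spam", b = "email contains 'offer'".
p_a = 0.3                    # P(a): prior probability of spam
p_b_given_a = 0.8            # P(b given a)
p_b_given_not_a = 0.1        # P(b given not a)

# Evidence: P(b) = P(b given a) P(a) + P(b given not a) P(not a)
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

# Bayes' rule: P(a given b) = P(b given a) * P(a) / P(b)
p_a_given_b = p_b_given_a * p_a / p_b
print(round(p_a_given_b, 4))   # roughly 0.77: seeing b raises P(a) from 0.3
```

Observing event b moves the probability of a from the prior 0.3 up to about 0.77, which is exactly the kind of belief update the rest of this post builds on.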

**Understanding Bayes' Rule from an Analytics Perspective**

In analytics we always try to capture real-world behavior with **models**. These models are mathematical equations with **parameters** in them. The parameters are estimated from the behavior of events, i.e., the evidence we collect from the world. This evidence is popularly known as **data**.

So the question arises: how do Bayesian methods help in identifying these parameters?

Let us first see how Bayes' rule can incorporate these models. We take *theta* and X as our events in Bayes' rule and re-write equation 2.

P(*theta* **given** X) = P(X **given** *theta*) * P(*theta*) / P(X) ..... (3)

Now, let us define all the different components of the above equation:

- P(*theta* **given** X) : Posterior distribution\*
- P(X **given** *theta*) : Likelihood
- P(*theta*) : Prior distribution\*
- P(X) : Evidence

\* We can use the term distribution since all these terms are probabilities ranging from 0 to 1. *theta* in the above case represents the parameters of the model we need to compute; X is the data on which the model is trained.

Equation 3 can be re-written as :

posterior distribution = likelihood * prior distribution / evidence ..... (4)
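Equation 4 can be computed directly once we discretize *theta*. The following sketch uses a hypothetical coin-flipping example (7 heads in 10 flips, with *theta* as the heads-probability) and a flat prior over a grid of candidate values:

```python
import numpy as np

# posterior = likelihood * prior / evidence, on a discrete grid of theta.
# Hypothetical data: 7 heads out of 10 coin flips.
theta = np.linspace(0.01, 0.99, 99)       # candidate parameter values
prior = np.ones_like(theta) / len(theta)  # flat prior over the grid

# P(X given theta) for 7 heads and 3 tails
likelihood = theta**7 * (1 - theta)**3

# P(X): sum of likelihood * prior over all candidate thetas
evidence = np.sum(likelihood * prior)

posterior = likelihood * prior / evidence
print(theta[np.argmax(posterior)])        # the posterior peaks near 7/10
```

Note that the evidence is just the normalizing constant that makes the posterior sum to 1; it does not change which *theta* values are favored.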

Now let us see all the above components individually.

**Prior Distribution:** We consider the prior distribution of *theta* to be the information we have regarding *theta* before even starting the model fitting process. This information is mostly based upon experience. A common default is a Normal distribution with mean = 0 and variance = 1 as the prior over *theta*.
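The Normal(0, 1) prior mentioned above can be written out explicitly. A small self-contained sketch of its density, evaluated at a few candidate *theta* values:

```python
import numpy as np

# Density of the standard Normal prior N(mean=0, var=1) over theta.
def normal_pdf(x, mean=0.0, var=1.0):
    return np.exp(-(x - mean)**2 / (2 * var)) / np.sqrt(2 * np.pi * var)

thetas = np.array([-2.0, 0.0, 2.0])
print(normal_pdf(thetas))   # density is highest at the prior mean, 0
```

This prior encodes the belief that, before seeing any data, parameter values near 0 are more plausible than values far from it.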

**Posterior Distribution:** This is the distribution we obtain over *theta* given our data. That is, once we have trained our model on the given data, we end up with a full distribution over the tuned parameters rather than a single point estimate. *(This is again a big difference between the frequentist and Bayesian ways of inference.)*

**Likelihood:** This term is not a probability distribution over *theta*. Rather, it is the probability of the occurrence of the data given the parameters *theta*.

**Evidence:** It is the probability of the occurrence of the data itself.

Now that we have our definitions in place, let us see an example showing how Bayesian reasoning can help in selecting a hypothesis, given the data.

Let us suppose we have the following data:

X = {2, 4, 8, 32, 64}

And we propose the following two hypotheses:

1) 2^n, where n ranges from 0 to 9

2) 2*n, where n ranges from 0 to 50

Now let us see how we can use Bayes' rule.

Note: as we have no prior information, we assign equal prior probability to both hypotheses.

----- Hypothesis 2^n, where n ranges from 0 to 9 -----

This hypothesis takes the following 10 values: 1, 2, 4, 8, 16, 32, 64, 128, 256, 512

- **prior 1**: 1/2
- **likelihood 1**: (1/10)*(1/10)*(1/10)*(1/10)*(1/10)
- **evidence**: constant across hypotheses, as the input data is fixed
- **posterior 1**: (1/10)*(1/10)*(1/10)*(1/10)*(1/10) * (1/2) / evidence

----- Hypothesis 2*n, where n ranges from 0 to 50 -----

This hypothesis takes the following 51 values: 0, 2, 4, 6, 8, 10, 12, 14, 16, ..., 100.

- **prior 2**: 1/2
- **likelihood 2**: (1/51)*(1/51)*(1/51)*(1/51)*(1/51) (each data point gets probability 1/51, since the hypothesis spreads its probability over 51 values)
- **evidence**: constant across hypotheses, as the input data is fixed
- **posterior 2**: (1/51)*(1/51)*(1/51)*(1/51)*(1/51) * (1/2) / evidence

From the above analysis we can easily see that posterior 1 >> posterior 2, which means hypothesis 1 explains the data much better than hypothesis 2.
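The whole comparison above can be reproduced in a few lines. Each hypothesis is modeled as a uniform distribution over its support (so each value it can generate gets probability 1/size-of-support, and values outside get 0):

```python
# Comparing the two hypotheses above with Bayes' rule.
data = [2, 4, 8, 32, 64]

h1 = {2**n for n in range(10)}   # 2^n, n = 0..9  -> 10 values
h2 = {2*n for n in range(51)}    # 2*n, n = 0..50 -> 51 values

def likelihood(data, support):
    """P(data given hypothesis): uniform over the hypothesis's support."""
    p = 1.0
    for x in data:
        p *= (1.0 / len(support)) if x in support else 0.0
    return p

prior = 0.5                       # equal prior on both hypotheses
unnorm1 = likelihood(data, h1) * prior
unnorm2 = likelihood(data, h2) * prior
evidence = unnorm1 + unnorm2      # P(data), summed over both hypotheses

post1 = unnorm1 / evidence
post2 = unnorm2 / evidence
print(post1, post2)               # posterior 1 dominates posterior 2
```

Even though hypothesis 2 also contains every data point, it spreads its probability over many more values, so its likelihood (and hence its posterior) is several orders of magnitude smaller.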

If we look closely at the evaluation of the **posterior** for both hypotheses, we note that the major difference-maker was the **likelihood**. Later in this series we will see that **maximizing** this **likelihood** helps in parameter tuning. This method is popularly known as **Maximum Likelihood Estimation**.
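As a small preview of maximum likelihood estimation, here is a hypothetical coin-flip example: we search a grid of candidate heads-probabilities for the one that maximizes the (log-)likelihood of the observed flips:

```python
import numpy as np

# Maximum Likelihood Estimation sketch for Bernoulli (coin-flip) data.
# The likelihood p^heads * (1-p)^tails is maximized at p = heads/total.
flips = np.array([1, 1, 0, 1, 0, 1, 1, 0, 1, 1])  # hypothetical data

p_grid = np.linspace(0.01, 0.99, 99)  # candidate values of p
heads, tails = flips.sum(), len(flips) - flips.sum()

# Log-likelihood is easier to work with numerically than the raw product.
log_lik = heads * np.log(p_grid) + tails * np.log(1 - p_grid)

p_mle = p_grid[np.argmax(log_lik)]
print(p_mle)   # matches the sample mean of the flips, 0.7
```

Note that this picks a single best *theta* (a point estimate) rather than a full posterior distribution; that distinction is exactly the frequentist/Bayesian difference mentioned earlier.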

So in this post I introduced Bayes' rule. In the next post we will see how to use it for estimating the parameters of linear regression, with an example.

Thanks For Reading !!!

© 2019 Data Science Central
