# Subscribe to DSC Newsletter

Bayesian Machine Learning (part - 2)

Bayesian Way Of Linear Regression

Now that we have an understanding of Baye’s Rule, we will move ahead and try to use it to analyze linear regression models. To start with let us first define linear regression model mathematically.

Yj =i  wj* Xij

Where i is the dimensionality of the data X. j represents the index of input data X. wi    are the weights of the linear regression model. Y  is the corresponding output for Xj .

Let us see with an example, how our regression equation looks, let i = 3, which implies,

Yj  = w1* x1j  + w2* x2j   + w3* x3j

Where j is ranging from 1 to N where N is the number of data points we have.

Bayesian Model For Linear Regression

(We will discuss the process of Bayesian modelling in next part, but for now please consider the below model as true) P(w,Y,X) = P(Y/X, w) * P(w) * P(X) ….. (4)

Or

P(w ,Y ,X) * P(X) = P(Y/ X ,w) * P(w)  ….. (5)

Or

P(w, Y/X) = P(Y/X, w) * P(w)  ….. (6)

The model shown above is derived from Bayesian model theory, and the equation is from same model. We will see in detail the methodology of Bayesian in coming Posts. For now below is the statement  which is derived from the model:

Target Y is dependent on Weights W and input data X. And Weights and Data are independent of each other.

Now let us try to build our Baye’s equation for the above model. We aim at determining the parameters of our model i.e. weights w. Thus the posterior distribution with given Xtrain  , Ytrain  as data looks like:

P(w / Ytrain  , Xtrain) = P(Ytrain  / w, Xtrain) * P(w) / P(Ytrain  / Xtrain) ….. (7)

Here: Likelihood: P(Ytrain  / w, Xtrain)

Prior:  P(w)

Evidence:  P(Ytrain  / Xtrain) = constant, as data is fixed

Now we consider that likelihood is coming from a Normal distribution with mean as wTX and variance as σ2 I The probability density function looks like: P(Ytrain  / w, Xtrain)  ~ N(Y|wTX, σ2 I)

We have taken σ2 I as identity matrix because of calculation simplicity, but people can take different covariance matrix and that will mean that different dimensions of the data are intercorrelated.

As a prior distribution on w we take Normal distribution with mean = zero and variance = 1. The probability distribution function can be defined as P(w) ~ N(w|0,1)

Now our Posterior distribution looks like: [ N(Y | wT X ,σ2 I) * N(w | 0,1) / constant ]  - we need to maximize this with respect to w. This method is also known as Maximum A Posteriori.

Mathematical Calculation

P(w/Ytrain  , Xtrain) = P(Ytrain  / w , Xtrain) * P(w) ---- maximizing this term w.r.t  w

Taking log both sides-

log(P(w/Ytrain  , Xtrain))  =  log(P(Ytrain  / w , Xtrain)) + log(P(w))

LHS = log(C1 * e( -(y - wTx) ( 2σ2 I)-1  (y - wTx)T )) + log(C2 * e(- (w) ( 2γ2 )-1  (w)T ))

LHS = log(C1)    -   (2σ2  )-1  * || y - wT X||2     +   log(C2)    -   (γ2 )-1  * ||w|| -- maximizing w.r.t w

Removing constant terms as they won’t appear in differentiation

Multiplying the expression by -2σ2  and re-writing we get:

= ||y – WTX||  +    λ  * ||w||2   -- minimizing w.r.t w ----  (8)

The above minimization problem is the exact expression we obtain in L2 Norm regularization. Thus we see that Bayesian method of supervised linear regression takes care of overfitting or underfitting inherently.

Implementation Of  Bayesian Regression

Now we know that Bayesian model expresses the parameters of a linear regression equation in form of distribution, which we call as posterior distribution. To compute this distribution we have different methodologies, one of which is Monte-Carlo Markov-Chain (MCMC). MCMC is a sampling technique which samples out the points from parameter-space which are in proportion to the actual distribution of the parameter in its space. (I believe readers who are reading this post might not know MCMC at this stage, but don’t worry I will explain it in detail in coming Posts.)

For now I am not going into coding part of the postrior distiribution, as many of the stuff is still not explained by here. I will start posting about coding methodologies when relevant theory will be completely coverd. But if somebody wants to try can follow this link.

In the next post I will explain how to build Bayesian models.

Thanks for Reading !!!

Views: 2469

Comment

### You need to be a member of Data Science Central to add comments!

Join Data Science Central

## Videos

• ### DSC Webinar Series: How to Use Time Series Data to Forecast at Scale

Added by Tim Matteson

• ### DSC Webinar Series: Why Data Prep is Step 1 for Analytics Success

Added by Tim Matteson

• ### DSC Webinar Series: 9 Best Practices for Data Blending

Added by Tim Matteson