Bayesian Machine Learning (part - 2)
Bayesian Way Of Linear Regression
Now that we understand Bayes' rule, we will use it to analyze linear regression models. To start, let us define the linear regression model mathematically.
$$Y_j = \sum_{i} w_i X_{ij}$$
Here i indexes the dimensions (features) of the data X, and j indexes the input data points. $w_i$ are the weights of the linear regression model, and $Y_j$ is the output corresponding to the input $X_j$.
Let us see with an example how our regression equation looks. Take three dimensions, so that i runs from 1 to 3, which gives
$$Y_j = w_1 x_{1j} + w_2 x_{2j} + w_3 x_{3j}$$
where j ranges from 1 to N, and N is the number of data points we have.
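To make the indexing concrete, here is a minimal NumPy sketch of the three-dimensional example. The numbers in `w` and `X` are made up for illustration, and storing X with one row per dimension and one column per data point is my assumption, chosen so that the whole equation becomes Y = w^T X.

```python
import numpy as np

# Hypothetical values: 3 dimensions (i = 1..3) and N = 4 data points (j = 1..4).
w = np.array([0.5, -1.0, 2.0])            # weights w_i, shape (3,)
X = np.array([[1.0, 2.0, 3.0, 4.0],       # x_{1j}
              [0.0, 1.0, 0.0, 1.0],       # x_{2j}
              [2.0, 2.0, 1.0, 1.0]])      # x_{3j}; X has shape (3, N)

# Vectorized form: Y_j = sum_i w_i * x_{ij}, i.e. Y = w^T X.
Y = w @ X

# The same sum written out explicitly, one data point at a time.
Y_loop = np.array([
    sum(w[i] * X[i, j] for i in range(X.shape[0]))
    for j in range(X.shape[1])
])

print(Y)  # [4.5 4.  3.5 3. ]
```

Both forms give the same outputs; the vectorized one is what the w^T X notation later in the post refers to.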
Bayesian Model For Linear Regression
(We will discuss the process of Bayesian modelling in the next part; for now, please take the model below as given.)
$$P(w, Y, X) = P(Y \mid X, w)\, P(w)\, P(X) \tag{4}$$
Or
$$\frac{P(w, Y, X)}{P(X)} = P(Y \mid X, w)\, P(w) \tag{5}$$
Or
$$P(w, Y \mid X) = P(Y \mid X, w)\, P(w) \tag{6}$$
The model shown above comes from Bayesian model theory, and the equations follow from that model. We will look at the Bayesian modelling methodology in detail in coming posts. For now, here is the statement the model encodes:
The target Y depends on the weights w and the input data X, while the weights and the data are independent of each other.
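One way to read this statement is as a recipe for generating data: draw the weights and the inputs separately, then draw the target from both. The sketch below does exactly that. The Gaussian choices, the sizes `D` and `N`, and the noise level `sigma` are all assumptions made for illustration; the model itself only fixes the dependency structure.

```python
import numpy as np

rng = np.random.default_rng(0)

D, N = 3, 200      # hypothetical: 3 dimensions, 200 data points
sigma = 0.5        # assumed noise standard deviation

# Weights and data are drawn independently of each other: P(w) and P(X).
w = rng.normal(0.0, 1.0, size=D)
X = rng.normal(0.0, 1.0, size=(D, N))

# The target depends on both: Y ~ P(Y | X, w), here w^T X plus Gaussian noise.
Y = w @ X + rng.normal(0.0, sigma, size=N)

# The spread of the residual should be close to the noise level we chose.
residual_std = np.std(Y - w @ X)
```

Nothing about `w` was used to generate `X`, or the other way round, which is what factorizing the joint as P(Y | X, w) P(w) P(X) expresses.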
Now let us build Bayes' equation for the above model. We aim to determine the parameters of our model, i.e. the weights w. Given $X_{\text{train}}$ and $Y_{\text{train}}$ as data, the posterior distribution is:
$$P(w \mid Y_{\text{train}}, X_{\text{train}}) = \frac{P(Y_{\text{train}} \mid w, X_{\text{train}})\, P(w)}{P(Y_{\text{train}} \mid X_{\text{train}})} \tag{7}$$
Here:
Likelihood: $P(Y_{\text{train}} \mid w, X_{\text{train}})$
Prior: $P(w)$
Evidence: $P(Y_{\text{train}} \mid X_{\text{train}})$, which is a constant because the data is fixed
Now we assume that the likelihood is a Normal distribution with mean $w^T X$ and covariance $\sigma^2 I$. The probability density function is: $P(Y_{\text{train}} \mid w, X_{\text{train}}) = N(Y \mid w^T X, \sigma^2 I)$
We have taken the covariance to be $\sigma^2 I$, a scaled identity matrix, to keep the calculation simple. One can choose a different covariance matrix; non-zero off-diagonal entries would mean that the noise in different outputs is correlated.
As a prior distribution on w we take a Normal distribution with mean zero and variance 1. The probability density function is $P(w) = N(w \mid 0, 1)$. In the calculation that follows, the prior variance is written as $\gamma^2$, so here $\gamma^2 = 1$.
Our posterior distribution is now $N(Y \mid w^T X, \sigma^2 I)\, N(w \mid 0, 1) / \text{constant}$, and we need to maximize it with respect to w. This method is known as Maximum A Posteriori (MAP) estimation.
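Before doing the algebra, it may help to see the quantity being maximized as a function we can evaluate. The sketch below computes the log of the posterior without its normalizing constants, which do not depend on w. The function name, the toy data, and the `sigma` and `gamma` arguments (noise and prior standard deviations, both defaulting to 1) are my own illustrative choices.

```python
import numpy as np

def log_posterior_unnormalized(w, X, Y, sigma=1.0, gamma=1.0):
    """log N(Y | w^T X, sigma^2 I) + log N(w | 0, gamma^2 I),
    dropping every term that does not depend on w."""
    residual = Y - w @ X                                  # y - w^T X
    log_likelihood = -0.5 * (residual @ residual) / sigma**2
    log_prior = -0.5 * (w @ w) / gamma**2
    return log_likelihood + log_prior

# Toy data (hypothetical): 2 dimensions, 3 data points, no noise.
X = np.array([[1.0, 2.0, 3.0],
              [1.0, 0.0, 1.0]])
w_good = np.array([1.0, -1.0])
Y = w_good @ X                                            # [0., 2., 2.]

score_good = log_posterior_unnormalized(w_good, X, Y)     # -1.0
score_zero = log_posterior_unnormalized(np.zeros(2), X, Y)  # -4.0
```

The weights that reproduce the data score higher than the all-zero weights, even though the prior pulls towards zero. MAP estimation searches for the w with the highest such score.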
Mathematical Calculation
Writing y for $Y_{\text{train}}$ and X for $X_{\text{train}}$, and dropping the constant evidence term, we maximize the following with respect to w:
$$P(w \mid y, X) \propto P(y \mid w, X)\, P(w)$$
Taking the log of both sides:
$$\log P(w \mid y, X) = \log P(y \mid w, X) + \log P(w) + \text{constant}$$
$$\text{RHS} = \log\left(C_1 \exp\left(-(y - w^T X)\,(2\sigma^2 I)^{-1}\,(y - w^T X)^T\right)\right) + \log\left(C_2 \exp\left(-w^T (2\gamma^2)^{-1} w\right)\right) + \text{constant}$$
$$\text{RHS} = \log C_1 - (2\sigma^2)^{-1} \|y - w^T X\|^2 + \log C_2 - (2\gamma^2)^{-1} \|w\|^2 + \text{constant}$$
This is to be maximized with respect to w. We remove the constant terms, as they vanish on differentiation.
Multiplying the expression by $-2\sigma^2$, which turns the maximization into a minimization, and rewriting, we get:
$$\|y - w^T X\|^2 + \lambda^2 \|w\|^2, \qquad \lambda^2 = \frac{\sigma^2}{\gamma^2} \tag{8}$$
which is to be minimized with respect to w.
The above minimization problem is exactly the objective we obtain in L2-norm regularization (ridge regression). Thus the Bayesian treatment of supervised linear regression has regularization built in: the prior on w penalizes large weights, which guards against overfitting.
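We can check this numerically. The sketch below minimizes expression (8) using the standard closed-form ridge solution, $w = (X X^T + \lambda^2 I)^{-1} X y^T$, which is a textbook result and not something derived in this post. The data is synthetic, and setting `lam_sq` to `sigma**2 / gamma**2` is my reading of the derivation above.

```python
import numpy as np

rng = np.random.default_rng(1)

D, N = 3, 50                       # hypothetical sizes
sigma, gamma = 0.5, 1.0            # assumed noise and prior standard deviations
lam_sq = sigma**2 / gamma**2       # plays the role of lambda^2 in (8)

w_true = np.array([1.0, -2.0, 0.5])
X = rng.normal(size=(D, N))        # one column per data point
Y = w_true @ X + rng.normal(0.0, sigma, size=N)

def objective(w):
    """Expression (8): ||y - w^T X||^2 + lambda^2 * ||w||^2."""
    residual = Y - w @ X
    return residual @ residual + lam_sq * (w @ w)

# Closed-form minimizer of (8): solve (X X^T + lambda^2 I) w = X y^T.
w_map = np.linalg.solve(X @ X.T + lam_sq * np.eye(D), X @ Y)

# At the minimum the gradient of (8) with respect to w should vanish.
grad = -2.0 * X @ (Y - w_map @ X) + 2.0 * lam_sq * w_map
```

`w_map` is the MAP estimate under the Gaussian likelihood and prior assumed here. A larger `lam_sq` (a tighter prior, or noisier data) shrinks it further towards zero.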
Implementation Of Bayesian Regression
We now know that a Bayesian model expresses the parameters of a linear regression equation as a distribution, which we call the posterior distribution. There are several ways to compute this distribution, one of which is Markov chain Monte Carlo (MCMC). MCMC is a sampling technique that draws points from the parameter space in proportion to the actual distribution of the parameter over that space. (Readers of this post may not know MCMC at this stage, but don't worry, I will explain it in detail in coming posts.)
For now I am not going into the coding of the posterior distribution, as much of the required material has not been explained yet. I will start posting about coding methodologies once the relevant theory has been covered. If you want to try it now, you can follow this link:
https://brendanhasz.github.io/2018/12/03/tfp-regression
In the next post I will explain how to build Bayesian models.
Thanks for Reading !!!