Bayesian Machine Learning (part 7)

Expectation-Maximization: A Solved Example

I have covered the theoretical part of the EM algorithm in my previous posts; for reference, see Part 4, Part 5 and Part 6. There I explained the complete derivation of the EM algorithm. In this post we take a very simple example and see what we actually need to remember in order to apply EM to a clustering problem. So let's start!!!

Problem Statement

Consider a discrete random variable X that takes the three values {1, 2, 3} with frequencies {70, 30, 50} respectively. Our objective is to build a clustering model on top of this data.

Key features of the data:

- The data is discrete in nature.
- We can treat it as a mixture of 2 probability distribution functions.

Our objective is therefore to find the parameters of these distributions, which we define as follows:

p1: P(X=1) = α, P(X=2) = 1 − α, P(X=3) = 0
p2: P(X=1) = 0, P(X=2) = 1 − β, P(X=3) = β

That is, under p1 the random variable takes the values {1, 2, 3} with probability {α, 1 − α, 0}, and under p2 with probability {0, 1 − β, β}. So, to be precise, we want to compute the exact values of α and β.

Introduction of a Latent Variable

Since we want to apply the EM algorithm, we need a latent variable! We use a latent variable when we cannot directly explain the behaviour of the observed variable: the latent variable defines that behaviour. In our clustering problem, we introduce a latent variable t that tells us which observed data point comes from which cluster (distribution function).
In other words, t tells us how probable each component is. We define its distribution as P(t = p1) = γ and P(t = p2) = 1 − γ. So now, to be more precise, we need to compute the values of γ, α and β. So, let's start!!

Prior Setting

Let us first pick initial values for all our parameters: γ = α = β = 0.5. This states that:

P(X=1 | t = p1) = 0.5
P(X=2 | t = p1) = 0.5
P(X=3 | t = p1) = 0
P(X=1 | t = p2) = 0
P(X=2 | t = p2) = 0.5
P(X=3 | t = p2) = 0.5
P(t = p1) = 0.5
P(t = p2) = 0.5

E-Step

In the E-step we need to compute, for every data point i:

q(t=c) = P(t=c | Xi, θ)

Here q is the distribution from the lower bound obtained after applying Jensen's inequality (refer to Part 5). By Bayes' rule:

P(t=c | Xi) = P(Xi | t=c) · P(t=c) / Σc P(Xi | t=c) · P(t=c)

So, computing q(t=c) for all values in our problem:

P(t=1 | X=1) = (α·γ) / (α·γ + 0) = 1
P(t=1 | X=2) = ((1−α)·γ) / ((1−α)·γ + (1−β)·(1−γ)) = 0.5
P(t=1 | X=3) = (0·γ) / (0·γ + β·(1−γ)) = 0
P(t=2 | X=1) = (0·(1−γ)) / (0·(1−γ) + α·γ) = 0
P(t=2 | X=2) = ((1−β)·(1−γ)) / ((1−β)·(1−γ) + (1−α)·γ) = 0.5
P(t=2 | X=3) = (β·(1−γ)) / (β·(1−γ) + 0) = 1

Wasn't it simple!!! Let's see the M-step.

M-Step

In the M-step we differentiate the lower bound w.r.t. γ, α and β, treating the values computed in the E-step as constants.
The objective is:

L(γ, α, β) = Σi Σc q(t=c | Xi) · log P(Xi, t=c)

(We can drop the generic θ from this representation here, because the problem is simple and discrete, with direct probabilities for the 3 values.) The expression is easy to remember: a sum over data points of (a sum over the q values computed in the E-step of (the log of the joint distribution over Xi and t)).

Expanding the equation and differentiating three times, each time w.r.t. one parameter while keeping the other two constant, we end up with:

γ = 85/150, α = 70/85, β = 50/65

You can check the calculation yourself; these were my hand-computed results :)

Now we can plug these computed parameter values back into the E-step to recompute the posterior, and then perform the M-step again (for this particular problem, the values will most probably not change). We repeat these iterations until the parameter values stop updating. And this is how we perform the EM algorithm!!!

Once we know the parameter values, we know the probability functions p1 and p2. For any new value taken by the random variable X, we can then say specifically which cluster (distribution) it belongs to, and with what probability.

Thanks for reading!!!
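The whole loop above can be written out in a few lines. A minimal sketch (NumPy; the variable names are mine, and the data is hard-coded as the counts {70, 30, 50}):

```python
import numpy as np

# Observed data: value v = 1, 2, 3 occurs counts[v-1] times
counts = np.array([70.0, 30.0, 50.0])

# Initial parameters (the priors chosen above)
gamma, alpha, beta = 0.5, 0.5, 0.5

for _ in range(100):
    # Component PMFs over the values {1, 2, 3}
    p1 = np.array([alpha, 1 - alpha, 0.0])
    p2 = np.array([0.0, 1 - beta, beta])

    # E-step: responsibility of component 1 for each value, P(t=1 | X=v)
    num1 = gamma * p1
    num2 = (1 - gamma) * p2
    r1 = num1 / (num1 + num2)

    # M-step: weighted maximum-likelihood updates
    n1 = np.sum(counts * r1)                    # effective count for component 1
    n2 = np.sum(counts * (1 - r1))              # effective count for component 2
    new_gamma = n1 / counts.sum()
    new_alpha = counts[0] * r1[0] / n1          # mass on X=1 within p1
    new_beta = counts[2] * (1 - r1[2]) / n2     # mass on X=3 within p2

    # Stop when the parameters stop updating
    if np.allclose([new_gamma, new_alpha, new_beta], [gamma, alpha, beta]):
        break
    gamma, alpha, beta = new_gamma, new_alpha, new_beta

print(gamma, alpha, beta)  # 85/150, 70/85, 50/65
```

Running this reproduces the hand-computed result in one iteration, and the second iteration confirms convergence, matching the remark above that the values do not change further.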

Bayesian Machine Learning (part 6)

Probabilistic Clustering – Gaussian Mixture Model

Continuing our discussion of probabilistic clustering, which we left off in Part 4 of our Bayesian inference series: having seen the modelling theory of the Expectation-Maximization algorithm in Part 5, it is time to implement it. So, let's start!!! For revision, please follow the links: Part 4, Part 5.

Remembering Our Problem

We have the observed data from Part 4. We assume there are 3 clusters in total, and for each cluster c we define a Gaussian distribution N(x | μc, Σc). We consider that the data was generated via a latent variable t that knows which data point belongs to which cluster, and with what probability. With the prior P(t = c) = πc, the probability of the observed data given the parameters is:

P(X | θ, π) = Πi Σc πc · N(Xi | μc, Σc)

The point to note here is that this expression is already in its lower-bound form, so there is no need to apply Jensen's inequality again.

E-Step

As we know, the E-step solution is:

q(t=c) = P(t=c | Xi, θ)

Since there are 3 clusters in our case, we need to compute this posterior distribution over the latent variable t for every observed data point and every cluster. For example, with 100 observed data points and 3 clusters, the matrix of posteriors over t has dimension 100 × 3.

M-Step

In the M-step, we maximize the variational lower bound w.r.t. θ, using the posterior over t computed above as constants. The quantities we maximize over are θ = {μc, Σc} and the priors π1, π2, π3, where the priors are computed under the constraints that every prior ≥ 0 and all priors sum to 1. The E-step and M-step are then iterated multiple times, one after the other, so that the results of the E-step are used in the M-step and the results of the M-step are used in the E-step.
We end the iterations when the loss function stops improving; this loss function is the same lower bound we maximize in the M-step. Applying the EM algorithm to the given data set recovers the three Gaussian clusters. So, in this post we saw how to implement the EM algorithm for probabilistic clustering.

Thanks for reading!!!
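The E/M loop described above can be sketched end to end. This is a minimal 1-D version (NumPy only; the synthetic data, K = 3, and all variable names are my own choices, not from the original post):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 1-D data from three well-separated Gaussians
X = np.concatenate([rng.normal(-5, 1, 100),
                    rng.normal(0, 1, 100),
                    rng.normal(5, 1, 100)])
N, K = len(X), 3

# Initial parameters: mixing priors pi, means mu, variances var
pi = np.full(K, 1.0 / K)
mu = np.array([-1.0, 0.0, 1.0])
var = np.ones(K)

prev_ll = -np.inf
for _ in range(200):
    # E-step: responsibilities q, an N x K matrix whose rows sum to 1
    dens = np.exp(-0.5 * (X[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
    joint = pi * dens                          # pi_c * N(x_i | mu_c, var_c)
    ll = np.log(joint.sum(axis=1)).sum()       # the objective we monitor
    q = joint / joint.sum(axis=1, keepdims=True)

    # Stop when the objective stops improving
    if ll - prev_ll < 1e-8:
        break
    prev_ll = ll

    # M-step: weighted ML updates; pi_c >= 0 and sum(pi) = 1 hold by construction
    Nk = q.sum(axis=0)                         # effective cluster sizes
    pi = Nk / N
    mu = (q * X[:, None]).sum(axis=0) / Nk
    var = (q * (X[:, None] - mu) ** 2).sum(axis=0) / Nk

print(np.sort(mu))   # three means close to the true -5, 0, 5
```

Note that the priors need no explicit constrained optimization here: dividing the effective cluster sizes by N automatically keeps them non-negative and summing to 1.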


Bayesian Machine Learning (part 5)

Introduction: Expectation-Maximization

In this blog we are going to look very closely at how the Expectation-Maximization algorithm works. This post is a strict continuation of the previous one, where we saw how probabilistic clustering ends up in a chicken-and-egg problem: if we have the distribution of the latent variable, we can compute the parameters of the clusters, and vice versa. To understand how the whole approach works we need a few mathematical tools, namely Jensen's inequality and the KL divergence. So, let's start!!!

Jensen's Inequality

Jensen's inequality states that for a convex function f(x), if we pick points x = a and x = b and take α, β such that α + β = 1, then:

f(αa + βb) ≤ αf(a) + βf(b)

If we consider x as a random variable that takes the values a, b with probabilities α, β (since α + β = 1), then αa + βb is the expectation E[x], and αf(a) + βf(b) is E[f(x)]. Thus Jensen's inequality can be rewritten as:

f(E[x]) ≤ E[f(x)]

This expression will be used in the analysis below.

Kullback–Leibler Divergence

The KL divergence measures the divergence between two probability distributions:

KL(q ‖ p) = ∫ q(x) log (q(x) / p(x)) dx

where q(x) and p(x) are density functions of a random variable x. KL(q ‖ p) is always ≥ 0, and it is 0 only when q = p.

Expectation-Maximization Algorithm

From our previous discussion, we know that the marginal log-likelihood is:

log P(X | θ) = Σi log Σc P(Xi, t=c | θ)

We need to maximize this expression w.r.t. θ. The problem is that this maximization is not simple, and the function usually has many local maxima. We could use gradient descent, but the process would be too lengthy. To solve this, we take a different approach altogether: we construct a lower bound of the expression above, using Jensen's inequality.
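Both tools can be sanity-checked numerically. A small sketch (NumPy; the convex function x², the points and the random distributions are arbitrary choices of mine):

```python
import numpy as np

rng = np.random.default_rng(42)

# Jensen's inequality for a convex f (here f(x) = x^2):
# f(alpha*a + beta*b) <= alpha*f(a) + beta*f(b), with alpha + beta = 1
f = lambda x: x ** 2
a, b, alpha = 2.0, 5.0, 0.3
beta = 1.0 - alpha
assert f(alpha * a + beta * b) <= alpha * f(a) + beta * f(b)

# KL(q || p) for two discrete distributions: always >= 0,
# and exactly 0 when q == p
q = rng.random(5); q /= q.sum()
p = rng.random(5); p /= p.sum()
kl = np.sum(q * np.log(q / p))
assert kl >= 0
assert np.isclose(np.sum(q * np.log(q / q)), 0.0)   # KL(q || q) = 0
print(kl)   # a non-negative number
```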
Once we build our lower bound, we will try to maximize it; this lower bound is also called the variational lower bound. To apply Jensen's inequality, we take the log as our function f, introduce a new distribution q(t=c), and multiply and divide by it inside the sum. The equation becomes:

log P(X | θ) = Σi log Σc q(t=c) · (P(Xi, t=c | θ) / q(t=c))

Now consider the ratio P(Xi, t=c | θ) / q(t=c) as a random variable with probability weights q(t=c), since Σ_{c=1}^{3} q(t=c) = 1. Because log is concave, Jensen's inequality flips direction and gives:

log P(X | θ) ≥ Σi Σc q(t=c) log (P(Xi, t=c | θ) / q(t=c)) = L(θ, q)

Graphically, varying q traces out a family of curves that all lie beneath log P(X | θ); this family is the lower bound L(θ, q).

Expectation Step

One thing is clear: the lower bound always lies beneath the actual function, and we want to push it as close to the actual function as possible. In the expectation step we fix θ and maximize the lower bound w.r.t. q, i.e. we reduce the gap between the actual function and the lower bound:

gap = log P(X | θ) − L(θ, q)

Putting in some algebra, this gap turns out to be exactly the KL divergence between q and the posterior over t, KL(q(t) ‖ P(t | X, θ)). Since the KL divergence is zero when its two arguments are equal, the solution of the expectation step is:

q(t=c) = P(t=c | Xi, θ)

Maximization Step

Once the gap between the variational lower bound L(θ, q) and log P(X | θ) has been closed, we maximize the lower bound w.r.t. θ. Expanding

L(θ, q) = Σi Σc q(t=c) log P(Xi, t=c | θ) − Σi Σc q(t=c) log q(t=c),

the second component is constant w.r.t. θ; therefore we drop it, and the equation that remains is:

Σi Σc q(t=c) log P(Xi, t=c | θ)

In the next post we will continue our discussion with the Gaussian mixture model and implement the Expectation-Maximization algorithm to perform clustering.

Thanks for reading!!!
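The key identity from the E-step — log P(X | θ) = L(θ, q) + KL(q ‖ posterior) for any q, with the gap closing exactly at q = posterior — can be checked on a toy discrete model. A sketch (single observation, 3 latent states; the joint-probability numbers are made up for illustration):

```python
import numpy as np

# Tiny discrete model: one observation x, latent t in {1, 2, 3}.
# Joint P(x, t=c | theta) for the observed x, as an arbitrary vector:
joint = np.array([0.10, 0.25, 0.15])
log_px = np.log(joint.sum())          # log P(x | theta)
posterior = joint / joint.sum()       # P(t=c | x, theta)

def lower_bound(q):
    """Variational lower bound L(theta, q) = sum_c q_c * log(joint_c / q_c)."""
    return np.sum(q * np.log(joint / q))

def kl(q, p):
    """KL divergence between discrete distributions q and p."""
    return np.sum(q * np.log(q / p))

# For any q: log P(x) = L(theta, q) + KL(q || posterior)
q = np.array([0.2, 0.5, 0.3])
assert np.isclose(log_px, lower_bound(q) + kl(q, posterior))

# The gap vanishes exactly when q is the posterior (the E-step solution)
assert np.isclose(lower_bound(posterior), log_px)
```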

