Bayesian Machine Learning (part - 4)

Introduction

In the previous post we learnt about the importance of latent variables in Bayesian modelling. Starting from this post we will see Bayesian methods in action: we will walk through different aspects of machine learning and see how Bayesian methods help us design solutions, and what additional capabilities and insights we gain by using them. The sections that follow are generally known as Bayesian inference. In this post we will see how Bayesian methods can be used to cluster the given data.

Probabilistic Clustering

Clustering is the task of splitting data into separate groups based on the inherent properties of the data. When we add the word "probabilistic", we imply that each and every point in the given data is a part of every cluster, but with some probability; the cluster with the maximum probability defines the point. So let's start!!!

For such a clustering solution we need to know a few things in advance:

1. How is each cluster defined probabilistically?
2. How many clusters will be formed?

Answering these two questions will help us define each point in the data as a probabilistic member of each cluster. So let us first formulate the answers.

How is each cluster defined probabilistically?

We define each cluster as coming from a Gaussian distribution with mean µ and covariance matrix Σ (note that Σ here is a covariance matrix, not a standard deviation), whose density is

N(x | µ, Σ) = (2π)^(-d/2) |Σ|^(-1/2) exp( -(1/2) (x - µ)ᵀ Σ⁻¹ (x - µ) )

where d is the dimensionality of x.

How many clusters will be formed?

To answer this question, let us take a sample data set and visualize it as a scatter plot. Have a close look at such a plot, and you will feel that there are 3 clusters in it.
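The Gaussian cluster density above can be evaluated directly in code; a minimal sketch in Python (the mean and covariance values here are made up for illustration):

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Multivariate normal density N(x | mu, sigma)."""
    d = len(mu)
    diff = x - mu
    norm = np.sqrt(((2 * np.pi) ** d) * np.linalg.det(sigma))
    return float(np.exp(-0.5 * diff @ np.linalg.inv(sigma) @ diff) / norm)

mu = np.array([0.0, 0.0])      # illustrative cluster mean
sigma = np.eye(2)              # illustrative covariance (identity)
# Density at the mean of a 2-D standard Gaussian is 1/(2*pi):
print(gaussian_pdf(np.array([0.0, 0.0]), mu, sigma))
```

In practice one would use `scipy.stats.multivariate_normal.pdf`, but the hand-rolled version makes the formula explicit.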
So let's define our 3 clusters probabilistically.

Now let us try to write an equation for the probability of a point in the data under all the clusters. Before that, we need a mechanism, a probability function, that tells us from which cluster a point comes, and with what probability. Suppose for now that a point can come from any cluster with equal probability 1/3. Then the distribution of a point looks like:

P(x) = (1/3) N(x | µ1, Σ1) + (1/3) N(x | µ2, Σ2) + (1/3) N(x | µ3, Σ3)

But the assumption that every point can come from any cluster with equal probability is not true in general. Somebody has to tell us these membership probabilities, and for that we use a latent variable which knows the distribution of every point over the clusters. In the Bayesian model, the latent variable t points to the observed variable x: t knows to which cluster the point x belongs, and with what probability.

Let us collect all our parameters as θ = {π1, π2, π3, µ1, µ2, µ3, Σ1, Σ2, Σ3}. Then

P(t = c | θ) = πc

states that the probability with which the latent variable takes value c, given all the parameters, is πc, and

P(x | t = c, θ) = N(x | µc, Σc)

states that the probability of a point, given that it has come from cluster c, is the Gaussian of cluster c. From our Bayesian model we know that P(x, t = c | θ) = P(x | t = c, θ) · P(t = c | θ), so we can marginalize t to compute

P(x | θ) = Σc P(x | t = c, θ) P(t = c | θ) = Σc πc N(x | µc, Σc)

which is exactly the same as our original mixture equation (with πc = 1/3 for every cluster).

Fact checks:
- The equation P(t = c | θ) = πc has no conditioning on the data point x, and thus it can be considered the prior distribution over clusters.
- Marginalization was done on t to recover the exact expression we chose originally; therefore t remains latent and unobserved.

If we want the probability of a point belonging to a specific cluster, given the parameters θ and the point x, the expression looks like:

P(t = c | x, θ) = P(x | t = c, θ) P(t = c | θ) / Z

where the LHS is the posterior distribution of t given x and θ.
Z is the normalization constant and can be calculated as:

Z = Σc P(x | t = c, θ) P(t = c | θ) = Σc πc N(x | µc, Σc)

The Chicken-and-Egg Problem in Probabilistic Clustering

In the Bayesian model we made between t and x, when we say the latent variable knows the cluster number of every point, we say it probabilistically: the formula applied is the posterior distribution we discussed in the fact-check section. Now the problem is: to compute θ we should know t, and to compute t we should know θ. We can see this in the following equations. If the sources t were known, the parameters would follow directly; for example, the mean of cluster c would be the average of the Nc points assigned to it:

µc = (1/Nc) Σ_{j : tj = c} xj

And the posterior over t is computed using:

P(t = c | x, θ) = πc N(x | µc, Σc) / Σk πk N(x | µk, Σk)

From these expressions it is clear that the means and covariances can be computed only if we know the sources t, and vice versa. In the next post we will see how to solve this problem using the Expectation-Maximization algorithm. Thanks for reading!!!
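The posterior P(t = c | x, θ) can be computed for every point at once; a minimal numpy sketch (the mixture parameters πc, µc, Σc below are invented for illustration, not fitted):

```python
import numpy as np

def responsibilities(X, pis, mus, sigmas):
    """P(t = c | x, theta) for each point (rows) and cluster (columns)."""
    n, d = X.shape
    k = len(pis)
    joint = np.zeros((n, k))           # pi_c * N(x | mu_c, Sigma_c)
    for c in range(k):
        diff = X - mus[c]
        inv = np.linalg.inv(sigmas[c])
        norm = np.sqrt(((2 * np.pi) ** d) * np.linalg.det(sigmas[c]))
        joint[:, c] = pis[c] * np.exp(-0.5 * np.sum(diff @ inv * diff, axis=1)) / norm
    return joint / joint.sum(axis=1, keepdims=True)   # divide by Z, row-wise

X = np.array([[0.0, 0.0], [5.0, 5.0]])                # two well-separated points
pis = [0.5, 0.5]
mus = [np.zeros(2), 5.0 * np.ones(2)]
sigmas = [np.eye(2), np.eye(2)]
print(responsibilities(X, pis, mus, sigmas).round(3))
```

Each row sums to 1, and each point puts almost all of its posterior mass on the cluster whose mean it sits on.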

Bayesian Machine Learning (part - 3)

Bayesian Modelling

In this post we will see the methodology of building Bayesian models. In my previous post I used a Bayesian model for linear regression, in which the Weights node and the Data node both point to the Target node. Let us first understand the construction of such a model:

- When there is an arrow pointing from one node to another, the start node causes the end node. In the model above, the Target node depends on the Weights node as well as the Data node.
- The start node is known as the parent, and the end node is known as the child.
- Most importantly, cycles are avoided while building a Bayesian model.
- These structures are normally generated from the given data and from experience.

The mathematical representation of a Bayesian model is obtained using the chain rule. For the model above, the chain rule is applied as follows:

P(y, w, x) = P(y | w, x) P(w) P(x)

The generalized chain rule looks like:

P(x1, ..., xn) = Πk P(xk | parents(xk))

Bayesian models are built based upon the subject-matter expertise and experience of the developer.

An Example

Problem statement: Given three variables, Sprinkle, Rain and Wet Grass, where Sprinkle and Rain are predictors and Wet Grass is the predicted variable, design a Bayesian model over them.

Solution:

- Sprinkling is used to wet the grass, therefore Sprinkle causes Wet Grass, so the Sprinkle node is a parent of the Wet Grass node.
- Rain also wets the grass, therefore Rain causes Wet Grass, so the Rain node is a parent of the Wet Grass node.
- If there is rain, there is no need to sprinkle, therefore there is a negative relation between the Sprinkle and Rain nodes.
So the Rain node is a parent of the Sprinkle node.

The chain rule implementation is:

P(S, R, W) = P(W | S, R) P(S | R) P(R)

Latent Variable Introduction

Wiki definition: in statistics, latent variables (from Latin lateo, "lie hidden", as opposed to observable variables) are variables that are not directly observed but are rather inferred (through a mathematical model) from other variables that are observed (directly measured).

In my words: latent variables are hidden variables, i.e. they are not observed. Latent variables are instead inferred, and are thought of as the cause of the observed variables. In Bayesian models they are mostly used when we end up generating a cycle in our model. Latent variables help us simplify the mathematical solution of our problem, though this is not always the case.

Let us see some examples. Suppose we have three variables: Hunger, Eat and Work. A naive Bayesian model over them reads like this: if we work, we feel hunger; if we feel hunger, we eat; if we eat, we have energy to work. This model has a cycle in it, and if the chain rule is applied to it, the chain becomes infinitely long. So this Bayesian model is not correct, and we need to introduce a latent variable here; let us call it T.

The new model states that T is responsible for Eat, Hunger and Work to happen. The variable T is not observed, but it can be inferred as the cause of Work, Eat and Hunger. This assumption also seems reasonable: in a biological body, something resides that pushes it to eat and work, even though we cannot observe it physically.

Let us write the chain rule equation for the new model:

P(W, E, H, T) = P(W | T) P(E | T) P(H | T) P(T)

Another Example

Let us see another example. Suppose we have the following variables: GPA, IQ, School.
The model reads like this: if a person has a good IQ, he/she will get into a good school and get a good GPA; if he/she got into a good school, he/she probably has a good IQ and may get a good GPA; and if he/she gets a good GPA, he/she may have a good IQ and may be from a good school. From this description, every node is connected to every other node, so the chain rule cannot be applied to this model. We need to introduce a latent variable here; let us call it I. In the new model, the latent variable I is responsible for all the other three variables, and the chain rule can easily be applied:

P(S, G, Q, I) = P(S | I) P(G | I) P(Q | I) P(I)

Hence in this post we saw how to model and create latent variables. They mostly help in reducing the complexity of the problem. From the next post we will start the much more interesting part, Bayesian inference, using latent variables wherever required. Thanks for reading!!!
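The chain-rule factorizations above can be checked in code; a minimal sketch of the sprinkler network, where all conditional probability tables are made-up numbers chosen only to illustrate the structure:

```python
# Joint probability of the sprinkler network via the chain rule:
#   P(S, R, W) = P(W | S, R) * P(S | R) * P(R)
# All probability tables below are invented, illustrative numbers.
P_R = {True: 0.2, False: 0.8}                       # P(Rain)
P_S_given_R = {True: {True: 0.01, False: 0.99},     # sprinkler rarely on when raining
               False: {True: 0.4, False: 0.6}}      # (the "negative relation" above)
P_W_given_SR = {(True, True): 0.99, (True, False): 0.9,
                (False, True): 0.8, (False, False): 0.0}

def joint(s, r, w):
    p_w = P_W_given_SR[(s, r)] if w else 1.0 - P_W_given_SR[(s, r)]
    return p_w * P_S_given_R[r][s] * P_R[r]

# A valid joint distribution sums to 1 over all 8 assignments:
total = sum(joint(s, r, w) for s in (True, False)
            for r in (True, False) for w in (True, False))
print(round(total, 10))
```

The same pattern extends to the latent-variable models: P(W, E, H, T) would multiply three child tables conditioned on T with a table for T itself.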

Bayesian Machine Learning (part - 2)

Bayesian Way Of Linear Regression

Now that we have an understanding of Bayes' rule, we will move ahead and use it to analyze linear regression models. To start, let us define the linear regression model mathematically:

Yj = Σi wi Xij

where i runs over the dimensions of the data X, j is the index of the input data point, wi are the weights of the linear regression model, and Yj is the corresponding output for Xj.

Let us see an example of how our regression equation looks. Let the dimensionality be 3, which implies

Yj = w1 x1j + w2 x2j + w3 x3j

where j ranges from 1 to N, and N is the number of data points we have.

Bayesian Model For Linear Regression

(We will discuss the process of Bayesian modelling in the next part; for now please take the model below as given.)

P(w, Y, X) = P(Y | X, w) P(w) P(X) ..... (4)

or, dividing both sides by P(X),

P(w, Y, X) / P(X) = P(Y | X, w) P(w) ..... (5)

or

P(w, Y | X) = P(Y | X, w) P(w) ..... (6)

The model shown above is derived from Bayesian model theory, and we will see the methodology in detail in coming posts. For now, the statement derived from the model is: the target Y depends on the weights w and the input data X, and the weights and the data are independent of each other.

Now let us build our Bayes equation for the above model. We aim at determining the parameters of our model, i.e. the weights w. Thus the posterior distribution given the training data Xtrain, Ytrain looks like:

P(w | Ytrain, Xtrain) = P(Ytrain | w, Xtrain) P(w) / P(Ytrain | Xtrain) .....
(7)

Here:
Likelihood: P(Ytrain | w, Xtrain)
Prior: P(w)
Evidence: P(Ytrain | Xtrain) = constant, as the data is fixed

Now we take the likelihood to come from a normal distribution with mean wᵀX and covariance σ²I:

P(Ytrain | w, Xtrain) = N(Y | wᵀX, σ²I)

We have taken the covariance as σ²I (a scaled identity matrix) for calculational simplicity; one can take a different covariance matrix, which would mean that the different dimensions of the data are inter-correlated.

As a prior distribution on w we take a normal distribution with mean zero and covariance γ²I:

P(w) = N(w | 0, γ²I)

Now our posterior distribution is proportional to N(Y | wᵀX, σ²I) · N(w | 0, γ²I) / constant, and we need to maximize this with respect to w. This method is known as Maximum A Posteriori (MAP).

Mathematical Calculation

We maximize P(w | Ytrain, Xtrain) ∝ P(Ytrain | w, Xtrain) P(w) with respect to w. Taking log on both sides:

log P(w | Ytrain, Xtrain) = log P(Ytrain | w, Xtrain) + log P(w) + const

RHS = log( C1 exp( -(y - wᵀX) (2σ²I)⁻¹ (y - wᵀX)ᵀ ) ) + log( C2 exp( -w (2γ²I)⁻¹ wᵀ ) )

    = log C1 - (2σ²)⁻¹ ||y - wᵀX||² + log C2 - (2γ²)⁻¹ ||w||²   (maximizing w.r.t. w)

Removing the constant terms, as they won't appear in differentiation, and multiplying the expression by -2σ² (which turns the maximization into a minimization), we get:

||y - wᵀX||² + λ ||w||²,  with λ = σ²/γ², to be minimized w.r.t. w ..... (8)

This minimization problem is exactly the expression we obtain in L2-norm (ridge) regularization. Thus we see that the Bayesian method of supervised linear regression takes care of overfitting inherently.

Implementation Of Bayesian Regression

Now we know that the Bayesian model expresses the parameters of a linear regression equation in the form of a distribution, which we call the posterior distribution. To compute this distribution we have different methodologies, one of which is Markov chain Monte Carlo (MCMC).
MCMC is a sampling technique which samples points from the parameter space in proportion to the actual posterior distribution of the parameters. (Readers might not know MCMC at this stage; don't worry, I will explain it in detail in coming posts.)

For now I am not going into the coding part of the posterior distribution, as much of the required material has not been explained yet. I will start posting about coding methodologies once the relevant theory is completely covered. If somebody wants to try, they can follow this link: https://brendanhasz.github.io/2018/12/03/tfp-regression

In the next post I will explain how to build Bayesian models. Thanks for reading!!!
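As a small preview, the MAP problem of equation (8) has a closed-form solution (the familiar ridge-regression estimate); a minimal numpy sketch on synthetic data, where the true weights, noise level and λ are all made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 3
X = rng.normal(size=(n, d))            # rows are data points x_j
w_true = np.array([1.0, -2.0, 0.5])    # invented "true" weights
y = X @ w_true + 0.1 * rng.normal(size=n)

lam = 1.0                              # lambda = sigma^2 / gamma^2, chosen arbitrarily
# Minimizer of ||y - X w||^2 + lam * ||w||^2 (the MAP / ridge solution):
w_map = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
print(w_map.round(2))                  # close to w_true, slightly shrunk toward 0
```

The prior acts exactly as the regularizer: increasing λ (a tighter prior, smaller γ) shrinks the recovered weights further toward zero.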

Bayesian Machine Learning (part - 1)

Introduction

As a data scientist, I am curious about looking at different analytical processes from a probabilistic point of view. There are two popular ways of looking at any event, namely Bayesian and frequentist. Where frequentist researchers look at an event through its frequency of occurrence, Bayesian researchers focus more on the probability of the event happening. I am starting this series of blog posts to illustrate the Bayesian methods of performing analytics. I will try to cover as much theory as possible, with illustrative examples and sample code, so that readers can learn and practice simultaneously. Let's start!!!

Defining Bayes' Rule

As we all know, Bayes' rule is one of the most popular probability equations. It is defined as:

P(a given b) = P(a intersection b) / P(b) ..... (1)

Here a and b are events that have taken place. In the above equation I have bold-marked given and intersection, as these words have major significance in Bayes' rule. Given indicates that event b has already happened and we now need to determine the probability of event a happening. Intersection indicates the occurrence of events a and b simultaneously.

Another form in which the above equation can be written is:

P(a given b) = P(b given a) * P(a) / P(b) ..... (2) (this equation can easily be derived from equation 1)

The above equation forms the foundation of Bayesian inference.

Understanding Bayes' Rule from an Analytics Perspective

In analytics we always try to capture worldly behaviors in models. These models are mathematical equations with some parameters in them, and these parameters are estimated from the behavior of events, i.e. the evidence we collect from the world. This evidence is popularly known as data.

So the question arises: how do Bayesian methods help in identifying these parameters? Let us first see how Bayes' rule can incorporate these models.
Now we take theta and X as our events in Bayes' rule and re-write equation 2:

P(theta given X) = P(X given theta) * P(theta) / P(X) ..... (3)

Let us define all the components of the above equation:

P(theta given X) : Posterior distribution**
P(X given theta) : Likelihood
P(theta) : Prior distribution**
P(X) : Evidence**

** We can use the term distribution as all these terms are probabilities ranging from 0 to 1.

theta in the above case becomes the parameters of the model we need to compute; X is the data on which the model is trained. Equation 3 can be re-written as:

posterior distribution = likelihood * prior distribution / evidence ..... (4)

Now let us look at each of the components individually.

Prior distribution: the information we have about theta before even starting the model-fitting process. This information is mostly based upon experience. Usually we take a normal distribution with mean = 0 and variance = 1 as the prior distribution of theta.

Posterior distribution: the solution distribution we get over theta given our data. Once we have trained our model on the given data, we end up tuning the parameters of the model; the posterior distribution is the distribution over the estimated theta. (This is again a big difference between the frequentist and Bayesian ways of inference.)

Likelihood: this term is not a probability distribution over theta; it is the probability of the occurrence of the data given theta. In other words, given some theta, how likely are we to get the given data, i.e. how accurately our model with theta as parameters explains the given data.

Evidence: the probability of the occurrence of the data itself.

Now that we have our definitions in place, let us see an example showing how Bayesian reasoning can help in selecting a hypothesis, given the data.
Let us suppose we have the following data: X = {2, 4, 8, 32, 64}, and we propose the following two hypotheses:
1) 2^n, where n ranges from 0 to 9
2) 2*n, where n ranges from 1 to 50

Now let us see how we can use Bayes' rule. Note: as we have no prior information, we give equal prior probability to both hypotheses.

----- Hypothesis 1: 2^n, n from 0 to 9 -----
This hypothesis takes the following 10 values: 1, 2, 4, 8, 16, 32, 64, 128, 256, 512.
Prior 1 : 1/2
Likelihood 1 : (1/10)*(1/10)*(1/10)*(1/10)*(1/10)
Evidence : constant for all hypotheses, as the input data is fixed
Posterior 1 : (1/10)^5 * (1/2) / evidence

----- Hypothesis 2: 2*n, n from 1 to 50 -----
This hypothesis takes the following 50 values: 2, 4, 6, 8, 10, ..., 100.
Prior 2 : 1/2
Likelihood 2 : (1/50)*(1/50)*(1/50)*(1/50)*(1/50)
Evidence : constant for all hypotheses, as the input data is fixed
Posterior 2 : (1/50)^5 * (1/2) / evidence

From the above analysis we can easily see that Posterior 1 >> Posterior 2, which means Hypothesis 1 explains the data much better than Hypothesis 2. If we look closely at the evaluation of the posteriors for both hypotheses, we will note that the major difference comes from the likelihood. Later we will see that maximizing this likelihood helps in parameter tuning; that method is popularly known as Maximum Likelihood Estimation.

So in this post I introduced Bayes' rule. In the next post we will see how to use it in estimating parameters for linear regression, with an example. Thanks for reading!!!
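The comparison above is easy to verify numerically; a minimal sketch, with the hypotheses hard-coded as in the text and the evidence left out since it is a shared constant:

```python
# Unnormalized posteriors for the two hypotheses over X = {2, 4, 8, 32, 64}.
# The evidence P(X) is a shared constant, so it is omitted throughout.
X = [2, 4, 8, 32, 64]

h1 = [2 ** n for n in range(0, 10)]    # 2^n, n = 0..9  -> 10 values
h2 = [2 * n for n in range(1, 51)]     # 2*n, n = 1..50 -> 50 values

def unnorm_posterior(hypothesis, data, prior=0.5):
    # Likelihood: each point drawn uniformly from the hypothesis' value set,
    # and zero if any point lies outside it.
    if any(x not in hypothesis for x in data):
        return 0.0
    return prior * (1.0 / len(hypothesis)) ** len(data)

p1 = unnorm_posterior(h1, X)   # 0.5 * (1/10)^5
p2 = unnorm_posterior(h2, X)   # 0.5 * (1/50)^5
print(p1 / p2)                 # posterior ratio: (50/10)^5 = 3125
```

The evidence cancels in the ratio, so Hypothesis 1 is favored by a factor of 5^5 = 3125, entirely driven by its sharper likelihood.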