Maiia Bakhova's Posts - Data Science Central2019-12-09T14:17:44ZMaiia Bakhovahttps://www.datasciencecentral.com/profile/MaiiaBakhovahttps://storage.ning.com/topology/rest/1.0/file/get/2800449814?profile=RESIZE_48X48&width=48&height=48&crop=1%3A1https://www.datasciencecentral.com/profiles/blog/feed?user=3nldv2dx1thoq&xn_auth=noOnline Ad selection: Upper Confidence Bound method and Thompson Sampling methodtag:www.datasciencecentral.com,2019-02-28:6448529:BlogPost:8061442019-02-28T19:00:00.000ZMaiia Bakhovahttps://www.datasciencecentral.com/profile/MaiiaBakhova
<p>This post is about selecting the most popular ad to display on a webpage in order to gather the most clicks. The rate at which webpage visitors click on an ad is called the conversion rate for the ad.</p>
<p>Assume that we have several ads and one place on a webpage to show one of them. We could display them one by one, record all the clicks, analyse the results afterwards and figure out which is the most popular. But displaying an ad may be pricey. It would be more efficient to estimate the rates in real time and to display the most popular ad as soon as the rates can be compared, especially if an ad leads to a page where a visitor can buy something. There are a couple of methods for such estimation: the Upper Confidence Bound method and the Thompson Sampling method.</p>
<p>The first one is based on the concept of a confidence interval, which is studied in a Statistics course and has a good intuitive explanation. Roughly speaking, a confidence interval is a numeric interval where our value is supposed to lie with some probability, usually 95%. (The real statistical definition is more technical and does not mean quite this, but in practice the above explanation is close enough.) <span>During our ad displays we can compute the average rates at each step with the corresponding confidence intervals, and pick for the next display the ad with the highest upper confidence bound.</span> You can see how it happens in the video below.</p>
<p><a href="https://youtu.be/ptinlR-FtJM">https://youtu.be/faf60dNjlsw</a></p>
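<p>The selection rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the post does not say which confidence interval formula the video uses, so the sketch assumes the widely used UCB1 bound (average rate plus sqrt(2 ln t / n)), and the function names and simulated click rates are hypothetical.</p>

```python
import math
import random


def ucb_select(clicks, shows, t):
    """Return the index of the ad with the highest upper confidence bound.

    Assumes the UCB1 bound; `t` is the number of the current display.
    """
    for ad, n in enumerate(shows):
        if n == 0:  # show every ad once before comparing bounds
            return ad
    bounds = [c / n + math.sqrt(2.0 * math.log(t) / n)
              for c, n in zip(clicks, shows)]
    return max(range(len(bounds)), key=bounds.__getitem__)


def run_ucb(true_rates, steps, seed=0):
    """Simulate `steps` displays; `true_rates` are made-up conversion rates."""
    rng = random.Random(seed)
    clicks = [0] * len(true_rates)
    shows = [0] * len(true_rates)
    for t in range(1, steps + 1):
        ad = ucb_select(clicks, shows, t)
        shows[ad] += 1
        if rng.random() < true_rates[ad]:  # simulated visitor click
            clicks[ad] += 1
    return clicks, shows
```

<p>Note that early on the bound can easily exceed 1, which is the drawback discussed next: the rule does not know that a rate cannot be larger than 1.</p>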
<p><span>The method has some drawbacks. It does not take into account that our rates must be between 0 and 1, so the initial confidence intervals are usually much wider than that. This means that we lose some time getting realistic values for our intervals. Worse, if we throw in an additional ad, the process takes a lot of time to recover.</span><br/> <br/> <span>Here is another method which is more efficient: the Thompson Sampling method. It constructs a Beta distribution for each ad rate and, instead of computing averages, draws a random number in accordance with the distribution. Here is a picture of how it goes for one ad, with a blue vertical line marking the mean and a red line marking the random value:</span></p>
<p><a href="https://youtu.be/faf60dNjlsw">https://youtu.be/faf60dNjlsw</a></p>
<p>As you can see, a random value is more likely to appear where the curve is higher, so it gets closer and closer to the mean at each step. You might view the region where the curve is visibly above the horizontal axis as an analogue of a confidence interval.</p>
<p>Here is how it works for a few ads (I dropped the means to make the picture clearer):</p>
<p><a href="https://youtu.be/ptinlR-FtJM">https://youtu.be/ptinlR-FtJM</a></p>
<p><span>In addition it accommodates an additional ad in the middle of the process more easily. </span></p>
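<p>For readers who prefer code, here is a minimal Thompson Sampling sketch. It assumes a uniform Beta(1, 1) prior for every ad, so that after c clicks in n displays the rate follows Beta(1 + c, 1 + n - c); the prior, the function names and the simulated rates are my assumptions for illustration, not taken from the videos.</p>

```python
import random


def thompson_select(clicks, shows, rng):
    """Draw one random rate per ad from its Beta distribution, pick the largest."""
    draws = [rng.betavariate(1 + c, 1 + (n - c))
             for c, n in zip(clicks, shows)]
    return max(range(len(draws)), key=draws.__getitem__)


def run_thompson(true_rates, steps, seed=0):
    """Simulate `steps` displays; `true_rates` are made-up conversion rates."""
    rng = random.Random(seed)
    clicks = [0] * len(true_rates)
    shows = [0] * len(true_rates)
    for _ in range(steps):
        ad = thompson_select(clicks, shows, rng)
        shows[ad] += 1
        if rng.random() < true_rates[ad]:  # simulated visitor click
            clicks[ad] += 1
    return clicks, shows
```

<p>In this sketch a new ad can join at any moment by appending a zero to both <code>clicks</code> and <code>shows</code>: it starts from the prior while the other ads keep their distributions.</p>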
<p></p>
<p>Do not hesitate to ask questions and point out mistakes!</p>
<p>(The post originally appeared here: <a href="https://myabakhova.blogspot.com/2019/02/online-ad-selection-upper-confidence.html" target="_blank" rel="noopener">Mya Bakhova's blog</a>)</p>San Diego Water Pollution Map by Stationstag:www.datasciencecentral.com,2018-11-06:6448529:BlogPost:7758642018-11-06T20:46:45.000ZMaiia Bakhovahttps://www.datasciencecentral.com/profile/MaiiaBakhova
<p>I have been working with the San Diego water quality data project:</p>
<p><a href="https://www.sandiegodata.org/2018/04/summer-water-quality-data-project/">https://www.sandiegodata.org/2018/04/summer-water-quality-data-project/</a></p>
<p>Here are the data sets:</p>
<p><a href="https://data.sandiegodata.org/dataset?tags=water-project">https://data.sandiegodata.org/dataset?tags=water-project</a></p>
<p>Unfortunately, my complete work does not fit into a blog post (or even a few posts) because of the post size restriction (1 MB). Here is a repository for it: <a href="https://github.com/san-diego-water-quality/MyaBakhova">https://github.com/san-diego-water-quality/MyaBakhova</a></p>
<p>In particular, I created a number of maps. You can see one of them below, where the sizes of the station markers correspond to pollution amounts:</p>
<p><a href="http://storage.ning.com/topology/rest/1.0/file/get/2808372018?profile=original" target="_self"><img src="http://storage.ning.com/topology/rest/1.0/file/get/2808372018?profile=RESIZE_1024x1024" width="750" class="align-full"/></a></p>Neural Networks as a Corporation Chain of Commandtag:www.datasciencecentral.com,2017-06-26:6448529:BlogPost:5817162017-06-26T19:00:00.000ZMaiia Bakhovahttps://www.datasciencecentral.com/profile/MaiiaBakhova
<p><span><a href="http://storage.ning.com/topology/rest/1.0/file/get/2808333220?profile=original" target="_self"><img src="http://storage.ning.com/topology/rest/1.0/file/get/2808333220?profile=RESIZE_320x320" width="258" class="align-full"/></a></span></p>
<p><span>Neural networks are considered complicated, and they are always explained using neurons and brain function. But we do not need to learn how the brain works to understand the structure of neural networks and how they operate. We can look at something people encounter more often in everyday life, like a corporate hierarchy.</span></p>
<p><span><br/> <span>Let us start with logistic regression. Recall that a logistic regression divides 2 sets by a line (or by a hyperplane if we have higher dimensions).</span><br/></span></p>
<div class="separator"><a href="https://3.bp.blogspot.com/-DtoFDmc1pQc/WVCQYnHLaXI/AAAAAAAAD6o/PrWi3iSmu9oLmmczNLeVjpYbYWR6EHjpQCLcBGAs/s1600/unnamed-chunk-1-1.png"><img border="0" height="320" src="https://3.bp.blogspot.com/-DtoFDmc1pQc/WVCQYnHLaXI/AAAAAAAAD6o/PrWi3iSmu9oLmmczNLeVjpYbYWR6EHjpQCLcBGAs/s320/unnamed-chunk-1-1.png" width="280"/></a></div>
<p><span><br/> <span>The logistic regression yields values from 0 to 1, and we can consider the process as making an evaluation. In the process we get data and we calculate our evaluation by a formula.</span><br/> <br/> <span> For example, we may have the following assignment: to compute whether we have enough goods in storage to last for a week of sales. This is quite a common problem, and say some clerks report their numbers to their manager to figure it out. The manager collects the information, processes it and makes an evaluation.</span><br/> <br/> <span>Note that this is how a logistic regression functions.</span><br/> <br/> <br/> <br/> <br/></span></p>
<div class="separator"><a href="https://1.bp.blogspot.com/-7ZYrqryBt6M/WVCS47VzRzI/AAAAAAAAD60/lHlaKfLiw4kwIReDGe90JknLL1mLFSomwCLcBGAs/s1600/LinRegAsChain1.png"><img alt="" border="0" height="254" src="https://1.bp.blogspot.com/-7ZYrqryBt6M/WVCS47VzRzI/AAAAAAAAD60/lHlaKfLiw4kwIReDGe90JknLL1mLFSomwCLcBGAs/s320/LinRegAsChain1.png" title="" width="320"/></a><a href="https://2.bp.blogspot.com/-VM1LE2Qj44c/WVCTXdvT0eI/AAAAAAAAD7E/gO7kY-06M4cFgo_8RvbLbqrXmukyO0hgwCLcBGAs/s1600/LinRegrWithWeights.png"><img border="0" height="220" src="https://2.bp.blogspot.com/-VM1LE2Qj44c/WVCTXdvT0eI/AAAAAAAAD7E/gO7kY-06M4cFgo_8RvbLbqrXmukyO0hgwCLcBGAs/s320/LinRegrWithWeights.png" width="320"/></a></div>
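<p>The manager's job in this picture can be written as a tiny function: weigh each clerk's report, add the results up, and squash the total into a value between 0 and 1. The sketch below is only an illustration; the names <code>manager_evaluation</code>, <code>reports</code>, <code>weights</code> and <code>bias</code> are hypothetical, and in a real logistic regression the weights would be fitted from data rather than chosen by hand.</p>

```python
import math


def sigmoid(z):
    """Squash any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def manager_evaluation(reports, weights, bias):
    """Logistic regression as a manager: weigh the clerks' reports, then evaluate."""
    weighted_sum = bias + sum(w * x for w, x in zip(weights, reports))
    return sigmoid(weighted_sum)
```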
<div class="separator">Usually, computing whether an amount of goods is sufficient is not the only problem. In addition we need to know, for example, whether our storage is filled to its optimal capacity (75%-85% or something like this). Therefore we need to evaluate another statistic. </div>
<div class="separator"><a href="https://3.bp.blogspot.com/-KcHjibJ-k9w/WVCUrCnpJOI/AAAAAAAAD7M/2d-V77vQJdUk3VQz-eOFpqO9wo__SoCRwCLcBGAs/s1600/LinRegAsChain2.png"><img border="0" height="275" src="https://3.bp.blogspot.com/-KcHjibJ-k9w/WVCUrCnpJOI/AAAAAAAAD7M/2d-V77vQJdUk3VQz-eOFpqO9wo__SoCRwCLcBGAs/s320/LinRegAsChain2.png" width="320"/></a></div>
<div class="separator">And of course these people should report to their supervisor who will make another evaluation:</div>
<div class="separator"><a href="https://1.bp.blogspot.com/-Cbx6qj31OY4/WVCWAXO9cZI/AAAAAAAAD7U/PUvx9c86XrEJ3Lot3bg-1NaEtSVauM58wCLcBGAs/s1600/LinRegAsChain3.png"><img border="0" height="226" src="https://1.bp.blogspot.com/-Cbx6qj31OY4/WVCWAXO9cZI/AAAAAAAAD7U/PUvx9c86XrEJ3Lot3bg-1NaEtSVauM58wCLcBGAs/s400/LinRegAsChain3.png" width="400"/></a></div>
<div class="separator">So we get a whole hierarchy of evaluations, and at the end they report to the CEO. We can compare it with the structure of a neural network:</div>
<div class="separator"><a href="https://3.bp.blogspot.com/-e_pGRWgGPTw/WVCWeM7zkuI/AAAAAAAAD7Y/a1tVBn9AGy0lX36El0GLbZKc0yPxIejggCLcBGAs/s1600/NN_Layers.png"><img border="0" height="347" src="https://3.bp.blogspot.com/-e_pGRWgGPTw/WVCWeM7zkuI/AAAAAAAAD7Y/a1tVBn9AGy0lX36El0GLbZKc0yPxIejggCLcBGAs/s640/NN_Layers.png" width="640"/></a></div>
<div class="separator">We can observe a lot in common with a corporation's chain of command. As we see, middle managers are the hidden layers, which do the bulk of the job. We have a similar flow and processing of information, which is analogous to forward propagation and backward propagation. </div>
<div class="separator"></div>
<div class="separator">What is left now is to explain that computing a sigmoid function at each node is too costly, so it is mostly reserved for the CEO level. </div>
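<p>To make the chain of command concrete, here is a minimal forward-pass sketch. The post does not say what replaces the sigmoid at the middle-manager level, so the sketch assumes ReLU, a common cheap choice; all names and the layer layout are hypothetical and only meant to show the flow of reports upward.</p>

```python
import math


def relu(z):
    """A cheap activation for middle managers (an assumption of this sketch)."""
    return max(0.0, z)


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def layer(inputs, weights, biases, activation):
    """Each node weighs every incoming report and applies its activation."""
    return [activation(b + sum(w * x for w, x in zip(row, inputs)))
            for row, b in zip(weights, biases)]


def forward(inputs, hidden_layers, ceo_weights, ceo_bias):
    """Pass reports up the hierarchy; only the CEO node uses the sigmoid."""
    signal = inputs
    for weights, biases in hidden_layers:
        signal = layer(signal, weights, biases, relu)
    return layer(signal, [ceo_weights], [ceo_bias], sigmoid)[0]
```

<p>Backward propagation, which adjusts the weights from the CEO's evaluation back down the hierarchy, is not shown here.</p>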
<p></p>
<p><span>To read entire post, click <a href="http://myabakhova.blogspot.com/2017/06/" target="_blank">here</a></span></p>Detection of Practical Dependency of Variables with Confidence Intervalstag:www.datasciencecentral.com,2016-11-02:6448529:BlogPost:4825702016-11-02T20:30:00.000ZMaiia Bakhovahttps://www.datasciencecentral.com/profile/MaiiaBakhova
<p><span>This article attempts to detect dependent variables with a non-linear method.</span></p>
<p>I'm going to apply a method for checking variable dependency which was introduced in <a href="http://myabakhova.blogspot.com/2016/09/measuring-indepence-of-variables-with.html" target="_blank">my previous post</a>. Because the "dependency" I get with this rule is not true dependency as defined in Probability, I will call variables <em>practically dependent at a confidence level "alpha"</em>, where "alpha" is the confidence level of the bootstrapped confidence intervals.</p>
<p>I will modify the idea slightly: I won’t compute means with interval lengths, because it is sufficient to verify that the confidence intervals for Pr(A and B) and Pr(A)Pr(B) do not intersect. For this I only need the confidence interval endpoints. In addition, I’ve noted that if a variable has only two values, then it is enough to check practical dependency for only one of the values, because the relative frequencies for such a variable are complementary.</p>
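<p>The endpoint check described above is short enough to write down. This is a sketch with a hypothetical function name; each interval is assumed to be a (lower, upper) pair, and intervals that merely touch are treated as intersecting.</p>

```python
def intervals_disjoint(first, second):
    """True if two (lower, upper) intervals have no point in common."""
    low_1, high_1 = first
    low_2, high_2 = second
    return high_1 < low_2 or high_2 < low_1
```

<p>With this helper, two events would be called practically dependent when the interval for Pr(A and B) and the interval for Pr(A)Pr(B) are disjoint.</p>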
<p></p>
<p></p>
<div id="introduction" class="section level4"><p>I have tried the “boot” package mentioned in the previous post and discovered that it is not convenient for really big data. It generates a huge matrix and then calculates a statistic for each column. Such an approach requires a lot of memory. It is more prudent to generate a vector, calculate the statistic, and then generate the next vector, replacing the previous one.</p>
</div>
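<p>The one-vector-at-a-time idea can be sketched as follows. The example is in Python rather than R and assumes a plain percentile bootstrap interval, which may differ from the interval type used in the full post; the function name and parameters are hypothetical. Only one resampled vector exists at any time, and only the computed statistics are kept.</p>

```python
import random


def bootstrap_interval(values, statistic, n_resamples=1000, level=0.95, seed=0):
    """Percentile bootstrap interval that holds one resample in memory at a time."""
    rng = random.Random(seed)
    n = len(values)
    stats = []
    for _ in range(n_resamples):
        # The previous resample is discarded as soon as this one is built.
        resample = [values[rng.randrange(n)] for _ in range(n)]
        stats.append(statistic(resample))
    stats.sort()
    tail = (1.0 - level) / 2.0
    lower = stats[int(tail * (n_resamples - 1))]
    upper = stats[int(round((1.0 - tail) * (n_resamples - 1)))]
    return lower, upper
```

<p>For a relative frequency such as Pr(A), <code>values</code> can be a list of 0/1 indicators and <code>statistic</code> the mean.</p>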
<p></p>
<div id="data-description-load-and-initial-investigation" class="section level4"><h4>Data Description, Load and Initial Investigation</h4>
<p>I’m going to use data from KDD cup 1998, from <a href="https://kdd.ics.uci.edu/databases/kddcup98/kddcup98.html">here</a>. There is a training data set in text format, a data dictionary and some other files.</p>
<p>I will load the data set, which is already in my working directory. Then we can look at our data set and compare it with the data dictionary, as usual.</p>
<p></p>
<p><em>To read more, click <a href="http://myabakhova.blogspot.com/2016/09/measuring-indepence-of-variables-with.html" target="_blank">here</a>.</em></p>
</div>Measuring Dependence of Variables with Confidence Intervals.tag:www.datasciencecentral.com,2016-09-06:6448529:BlogPost:4656902016-09-06T22:07:20.000ZMaiia Bakhovahttps://www.datasciencecentral.com/profile/MaiiaBakhova
<p>In this post I will sometimes use the term “variable” for a “feature” (“predictor”) or an “outcome” (“predicted value”).</p>
<p>The question of variable dependencies for a particular data set is quite important, because it can help to reduce the number of predictors used for a model. Or it can tell us which feature is not helpful for constructing a model, although it still can be used for engineering another predictor. For example, sometimes it is better to compute speed than to use distance values. In addition, some standard algorithms assume independence of features, and it is useful to know how close to reality such an assumption is.</p>
<p>The standard way to check dependencies of variables is to compute their covariance matrix. But it detects only linear dependencies. If the dependencies are not linear, then the covariance matrix may not pick them up. There are numerous well-known examples, so I will not repeat them here.</p>
<p>Let us take a different approach. The definition of independent events is the following equality:</p>
<p><span> <strong>Pr</strong>(A and B)=<strong>Pr</strong>(A)<strong>Pr</strong>(B).<br/></span></p>
<p><span>Hence for dependent events we should have an inequality. A simple measure of such disparity is the absolute value of the difference between the expressions on the left-hand side and on the right-hand side:</span></p>
<div class="MathJax_Display"><span class="MathJax" id="MathJax-Element-2-Frame"> |<strong>Pr</strong>(A and B)−<strong>Pr</strong>(A)<strong>Pr</strong>(B)|.</span></div>
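<p>As a small worked example of this measure (my own toy case, not taken from the post), let X be uniform on {-1, 0, 1} and Y = X². The covariance of X and Y is exactly zero, yet the events A = {X = 0} and B = {Y = 0} are clearly dependent, and the disparity above picks this up. Exact fractions are used so that no rounding is involved.</p>

```python
from fractions import Fraction

# Toy distribution: X uniform on {-1, 0, 1} and Y = X**2.
outcomes = [(-1, 1), (0, 0), (1, 1)]
weight = Fraction(1, 3)


def prob(event):
    """Probability of an event given as a predicate on (x, y)."""
    return sum(weight for x, y in outcomes if event(x, y))


def disparity(event_a, event_b):
    """|Pr(A and B) - Pr(A)Pr(B)| for two events."""
    joint = prob(lambda x, y: event_a(x, y) and event_b(x, y))
    return abs(joint - prob(event_a) * prob(event_b))


mean_x = sum(weight * x for x, _ in outcomes)
mean_y = sum(weight * y for _, y in outcomes)
covariance = sum(weight * (x - mean_x) * (y - mean_y) for x, y in outcomes)

gap = disparity(lambda x, y: x == 0, lambda x, y: y == 0)
```

<p>Here Pr(A and B) = 1/3 while Pr(A)Pr(B) = 1/9, so the disparity is 2/9 even though the covariance vanishes.</p>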
<p>Since in Data Science we work with probability estimates, exact equality in the first formula is not likely anyway. The question is: how far from zero may the difference in the second formula be before we believe that the considered variables are dependent?</p>
<p>Well, in Data Science we can estimate bounds for a particular value with confidence intervals computed from the given data. For example, in R it can be done with the package “boot”, and in Python with “scikits.bootstrap”. Thus confidence intervals for <strong>Pr</strong>(A and B), <strong>Pr</strong>(A) and <strong>Pr</strong>(B) can be estimated with a desired degree of probability. What is left to work out is a confidence interval for the product <strong>Pr</strong>(A)<strong>Pr</strong>(B).</p>
<p><span>To estimate bounds for the product we can use a standard approach from Numerical Analysis which is used <span>to compute an accrued error of calculation caused by truncation errors</span>.</span></p>
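<p>The details of that approach are in the full post. As a rough illustration only, one simple way to bound a product of two non-negative quantities is interval arithmetic: if Pr(A) lies in [a1, b1] and Pr(B) lies in [a2, b2], then the product lies in [a1·a2, b1·b2]. The sketch below assumes this bound and uses a hypothetical function name; it may differ from the author's derivation, and the confidence level of the resulting interval is generally not the same as that of the two input intervals.</p>

```python
def product_interval(interval_a, interval_b):
    """Bounds for x * y when x and y lie in non-negative (lower, upper) intervals."""
    low_a, high_a = interval_a
    low_b, high_b = interval_b
    if min(low_a, low_b) < 0:
        raise ValueError("this simple bound assumes non-negative endpoints")
    return low_a * low_b, high_a * high_b
```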
<p><a href="http://myabakhova.blogspot.com/2016/09/measuring-indepence-of-variables-with.html" target="_blank">read more</a></p>Visualizing Bagged Trees as Approximating Borderstag:www.datasciencecentral.com,2016-05-18:6448529:BlogPost:4265832016-05-18T23:12:09.000ZMaiia Bakhovahttps://www.datasciencecentral.com/profile/MaiiaBakhova
<p><a target="_self" href="http://storage.ning.com/topology/rest/1.0/file/get/2808313054?profile=original"><img class="align-center" src="http://storage.ning.com/topology/rest/1.0/file/get/2808313054?profile=original" width="360"/></a></p>
<p>The bagged trees algorithm is a commonly used classification method. By resampling our data and creating trees for the resampled data, we can get an aggregated vote of classification prediction. In this blog post I will demonstrate how bagged trees work visualizing each step.</p>
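<p>The resample-and-vote idea can be sketched with the simplest possible trees, one-split "stumps". This is an illustration in plain Python under my own assumptions (hypothetical function names, stumps instead of full trees, majority vote); the linked posts use R and full trees.</p>

```python
import random
from collections import Counter


def fit_stump(points, labels):
    """Find the single axis-aligned split with the fewest training errors."""
    best = None
    for feature in range(len(points[0])):
        for threshold in sorted({p[feature] for p in points}):
            left = [l for p, l in zip(points, labels) if p[feature] <= threshold]
            right = [l for p, l in zip(points, labels) if p[feature] > threshold]
            left_label = Counter(left).most_common(1)[0][0]
            right_label = Counter(right).most_common(1)[0][0] if right else left_label
            errors = (sum(l != left_label for l in left)
                      + sum(l != right_label for l in right))
            if best is None or errors < best[0]:
                best = (errors, feature, threshold, left_label, right_label)
    return best[1:]


def predict_stump(stump, point):
    feature, threshold, left_label, right_label = stump
    return left_label if point[feature] <= threshold else right_label


def fit_bagged(points, labels, n_trees=25, seed=0):
    """Fit one stump per bootstrap resample of the data."""
    rng = random.Random(seed)
    n = len(points)
    stumps = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        stumps.append(fit_stump([points[i] for i in idx],
                                [labels[i] for i in idx]))
    return stumps


def predict_bagged(stumps, point):
    """Aggregate the individual predictions by majority vote."""
    votes = Counter(predict_stump(s, point) for s in stumps)
    return votes.most_common(1)[0][0]
```

<p>Each stump draws one border perpendicular to an axis, and because the resamples differ, the borders land in slightly different places; the vote averages them out.</p>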
<p><a href="http://myabakhova.blogspot.com/2016/05/visualizing-bagged-trees-as.html" target="_blank">Visualizing Bagged Trees as Approximating Borders, Part 1</a></p>
<p><a href="http://myabakhova.blogspot.com/2016/05/visualizing-bagged-trees-part-2.html" target="_blank">Visualizing Bagged Trees, Part 2</a></p>
<p><strong>Conclusion:</strong> Other tree aggregation methods differ in how they grow trees, and they may compute a weighted average. But in the end we can visualize the result of such an algorithm as borders between the classified sets in the shape of connected perpendicular segments, as in this 2-dimensional case. In higher dimensions these become multidimensional rectangular pieces of hyperplanes which are perpendicular to each other.</p>Improving performance of random forests for a particular value of outcome by adding chosen featurestag:www.datasciencecentral.com,2016-05-05:6448529:BlogPost:4203802016-05-05T20:30:00.000ZMaiia Bakhovahttps://www.datasciencecentral.com/profile/MaiiaBakhova
<p>Choosing features to improve the performance of a particular algorithm is a difficult question. Currently there is PCA, which is difficult to understand (although it can be used out-of-the-box), requires centering and scaling of features, and is not easy to interpret. In addition, it does not allow us to improve prediction performance for a particular outcome (if its accuracy is lower than for others or it has a particular importance). My method makes it possible to use features without preprocessing. Therefore the resulting prediction is easy to explain. Plus, it can be used to improve performance for a particular outcome value. It is based on a comparison of feature densities and has a good visual interpretation, which does not require thorough knowledge of linear algebra or calculus. I have an example of applying the method by adding chosen features, completely worked out with R code, here:</p>
<p><a href="http://myabakhova.blogspot.com/2016/04/improving-performance-of-random-forests.html" target="_blank">a long blog post</a> (includes source code and datasets)</p>
<p>The method can be used to evaluate consistency of feature differences during boosting or cross-validation as well. </p>
<p>Regretfully there is no definite rule which tells how to use the method to get a specified accuracy, for example, 99%. I believe if enough people are interested it may be worked out. </p>
<p><a href="http://storage.ning.com/topology/rest/1.0/file/get/2808311980?profile=original" target="_self"><img src="http://storage.ning.com/topology/rest/1.0/file/get/2808311980?profile=original" width="487" class="align-center"/></a></p>
<p style="text-align: center;"><em>Chart from the long article</em></p>Choosing features for random forests algorithmtag:www.datasciencecentral.com,2016-02-18:6448529:BlogPost:3889872016-02-18T20:00:00.000ZMaiia Bakhovahttps://www.datasciencecentral.com/profile/MaiiaBakhova
<p>There are many ways to choose features from given data, and it is always a challenge to pick the ones with which a particular algorithm will work better. Here I will consider data from monitoring the performance of physical exercises with wearable accelerometers, for example, wrist bands.</p>
<p>The data for this project come from this source: <a href="http://groupware.les.inf.puc-rio.br/har">http://groupware.les.inf.puc-rio.br/har</a>.</p>
<p>In this project, researchers used data from accelerometers on the belt, forearm, arm, and dumbbell of a few participants. They were asked to perform barbell lifts correctly, marked as "A", and incorrectly with four typical mistakes, marked as "B", "C", "D" and "E". The goal of the project is to predict the manner in which they did the exercise.</p>
<p>There are 52 numeric variables and one classification variable, the outcome. We can plot density graphs for the first 6 features, which are in effect smoothed-out histograms.</p>
<p><a href="http://storage.ning.com/topology/rest/1.0/file/get/1995010?profile=original" target="_self"><img src="http://storage.ning.com/topology/rest/1.0/file/get/1995010?profile=original" width="640" class="align-full"/></a></p>
<div class="separator">We can see that the data behave in a complicated way. Some of the features are bimodal or even multimodal. These properties could be caused by the participants' different sizes or training levels or something else, but we do not have enough information to check it. Nevertheless, it is clear that our variables do not follow a normal distribution. Therefore we are better off with algorithms which do not assume normality, like trees and random forests. We can visualize how these algorithms work in the following way: as finding vertical lines which divide the areas under the curves such that the areas to the right and to the left of the line are significantly different for different outcomes.</div>
<div class="separator"></div>
<div class="separator"></div>
<div class="separator">There are a number of ways to distinguish functions analytically on an interval in Functional Analysis. It looks like the most suitable is to consider areas between curves. Clearly, we should scale it with respect to the size of the curves. For every feature I will consider all pairs of density curves to find out if they are sufficiently different. Here is my final criterion:</div>
<div class="separator"><a href="http://storage.ning.com/topology/rest/1.0/file/get/1995018?profile=original" target="_self"><img src="http://storage.ning.com/topology/rest/1.0/file/get/1995018?profile=original" width="331" class="align-center"/></a></div>
<div class="separator"></div>
<div class="separator"><span>If there is a pair of curves for a feature which satisfies it, then the feature is chosen for the prediction. As a result I got 21 features for a random forests algorithm. The random forest yielded 99% accuracy both for the model itself and on a validation set. I checked how many variables we need for the same accuracy with PCA preprocessing, and it was 36. Mind you, those variables are scaled and rotated, and we still use the same original 52 features to construct them. Thus more effort is needed to construct a prediction and to explain it. With the above method it is easier, since areas under the curves represent numbers of observations.</span></div>
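<p>To give a feel for the kind of comparison involved, here is a rough sketch of a scaled area between two density curves. It is not the author's criterion, which is given in the image above and was computed in R: the sketch assumes histogram densities on a shared grid instead of smoothed curves, and it divides the area between the curves by 2 (the combined area under two densities), so that 0 means identical curves and 1 means curves that do not overlap. The function names and the number of bins are hypothetical.</p>

```python
def histogram_density(values, low, high, bins):
    """Histogram estimate of a density on [low, high] with equal-width bins."""
    width = (high - low) / bins
    counts = [0] * bins
    for v in values:
        index = min(int((v - low) / width), bins - 1)
        counts[index] += 1
    return [c / (len(values) * width) for c in counts], width


def scaled_area_between(values_a, values_b, bins=20):
    """Area between two histogram densities, scaled to lie between 0 and 1."""
    low = min(min(values_a), min(values_b))
    high = max(max(values_a), max(values_b))
    if high == low:  # all observations coincide, so the curves are identical
        return 0.0
    density_a, width = histogram_density(values_a, low, high, bins)
    density_b, _ = histogram_density(values_b, low, high, bins)
    area = sum(abs(a - b) for a, b in zip(density_a, density_b)) * width
    return area / 2.0
```

<p>A feature would then be kept if, for some pair of outcome classes, this value exceeds a chosen threshold; the threshold itself is a further assumption that would have to be tuned.</p>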