<p>Pablo Gutierrez's Posts - Data Science Central (feed updated 2021-06-13)</p>
<p><strong>Entropy of rolling dice</strong> (Pablo Gutierrez, 2020-05-21)</p>
<p>Entropy is one of the most important concepts in many fields, including physics, mathematics and information theory.</p>
<p>Entropy is related to the number of states that a stochastic system can take and to how the system evolves over time, in such a way that the uncertainty is maximized.</p>
<p>This happens in two ways. First, every system will choose the configuration with the highest entropy among all those available. Second, if we let the system evolve, after some time it will be in a configuration with a higher entropy than its initial value.</p>
<p>We define the entropy as the logarithm of the number of states that are available to the system.</p>
<p></p>
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/5222022298?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/5222022298?profile=RESIZE_710x" class="align-center" width="99" height="21"/></a>The number of states available to a system is determined by the probabilities of each of these states, so it seems reasonable to look for a definition based on these probabilities.</p>
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/5222060259?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/5222060259?profile=RESIZE_710x" class="align-center" width="157" height="62"/></a>We also require this function to satisfy some constraints:</p>
<p>The sum of the probabilities of the available states must equal 1</p>
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/5222141898?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/5222141898?profile=RESIZE_710x" class="align-center" width="98" height="65"/></a></p>
<p>The entropy of a system made up of two subsystems must be the sum of the entropies of these subsystems</p>
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/5222179699?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/5222179699?profile=RESIZE_710x" class="align-center" width="128" height="21"/></a></p>
<p>These requirements force the entropy function to be consistent whether we look at the system as a whole or as the sum of two subsystems 1 and 2. Thus</p>
<p></p>
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/5222257885?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/5222257885?profile=RESIZE_710x" class="align-center" width="338" height="65"/></a></p>
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/5222288685?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/5222288685?profile=RESIZE_710x" class="align-center" width="218" height="66"/></a></p>
<p>So, we find that</p>
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/5222228901?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/5222228901?profile=RESIZE_710x" class="align-center" width="445" height="60"/></a></p>
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/5222320670?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/5222320670?profile=RESIZE_710x" class="align-center" width="422" height="66"/></a></p>
<p>The only function that can maintain this relationship is the logarithm, so we introduce the definition of entropy as a function of the probabilities of the states in this way.</p>
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/5222400895?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/5222400895?profile=RESIZE_710x" class="align-center" width="194" height="33"/></a>Now that we have an entropy function that we can manage, let’s think of a system that can take different states with different probabilities.</p>
<p>One familiar example that is easy to analyze is rolling dice.</p>
<p>One die has 6 faces with values (1,2,3,4,5,6) and a uniform probability of 1/6 for every value, so the entropy for one die is given by S = ln 6 = 1.792</p>
<p>If we increase the complexity of the system by introducing more dice, n=2, n=3, n=4, etc., the entropy of the system will be the sum of the individual entropies = (1.792·n)</p>
<p>For example, using R, we calculate the possible combinations for two dice:</p>
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/5222472480?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/5222472480?profile=RESIZE_710x" class="align-center"/></a></p>
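<p>The author's code is only available as an image, so the following is a minimal sketch of the same idea rather than the original script. It assumes base R's <em>expand.grid</em> is an acceptable way to enumerate the ordered outcomes; the names <em>faces</em>, <em>two_dice</em>, <em>n_states</em> and <em>entropy_two</em> are illustrative.</p>

```r
# Enumerate every ordered outcome of two distinguishable dice.
faces <- 1:6
two_dice <- expand.grid(die1 = faces, die2 = faces)

# All ordered outcomes are equally likely, so the entropy is the
# natural logarithm of the number of states.
n_states <- nrow(two_dice)     # 6^2 states
entropy_two <- log(n_states)   # equals 2 * log(6)

print(head(two_dice))
print(entropy_two)
```

<p>Because the states are equally likely, the entropy of two distinguishable dice is exactly twice the single-die value.</p>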
<p>We are interested in what happens to the entropy if we consider the dice indistinguishable. This situation may occur when we do not have direct access to the states, but only to a function of them (sometimes called an observable). To illustrate this, we will consider the sum of the values of the n dice in every throw.</p>
<p>Considering the states given by the observable “sum”, we will observe a dramatic change in the probability distribution and in the entropy. Now the states (2,1) and (1,2) become the same state because both sum to 3; (6,1), (5,2), (4,3), (3,4), (2,5) and (1,6) become the same state because they all sum to 7, and so on.</p>
<p>To calculate the new states and new probabilities of the system we use a numerical approach.</p>
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/5222754252?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/5222754252?profile=RESIZE_710x" class="align-center"/></a>Using this function, we can calculate the new states and their probabilities</p>
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/5222877901?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/5222877901?profile=RESIZE_710x" class="align-left"/></a><a href="https://storage.ning.com/topology/rest/1.0/file/get/5222987259?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/5222987259?profile=RESIZE_710x" class="align-left"/></a><a href="https://storage.ning.com/topology/rest/1.0/file/get/5222942462?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/5222942462?profile=RESIZE_710x" class="align-left"/></a><a href="https://storage.ning.com/topology/rest/1.0/file/get/5223090252?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/5223090252?profile=RESIZE_710x" class="align-left"/></a></p>
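<p>Since the original function is shown only as an image, here is one possible sketch of how the distribution of the sum could be computed. It is an assumption, not the author's method: instead of simulating throws, it builds the exact distribution by repeatedly convolving the single-die distribution. The function name <em>sum_distribution</em> and its arguments are hypothetical.</p>

```r
# Exact probability distribution of the sum of n fair dice,
# built by repeated convolution of the single-die distribution.
sum_distribution <- function(n, faces = 6) {
  single <- rep(1 / faces, faces)
  probs <- single                      # distribution for one die (sums 1..faces)
  if (n > 1) {
    for (k in 2:n) {
      new_probs <- numeric(length(probs) + faces - 1)
      for (f in seq_len(faces)) {
        # Adding a die that shows f shifts the previous sums by f.
        idx <- f:(f + length(probs) - 1)
        new_probs[idx] <- new_probs[idx] + probs * single[f]
      }
      probs <- new_probs
    }
  }
  data.frame(sum = n:(n * faces), prob = probs)
}

print(sum_distribution(2))
```

<p>For two dice this yields 11 states (sums 2 to 12) instead of the 36 ordered outcomes, with the sum 7 being the most likely.</p>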
<p>Plotting the probability distribution for different numbers of dice, it is easy to see that as more dice are added, the probability concentrates around a specific value.</p>
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/5223330499?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/5223330499?profile=RESIZE_710x" class="align-center" width="680" height="288"/></a></p>
<p>Indistinguishability has introduced symmetry and symmetry has modified the probability distribution making some states more likely than others.</p>
<p>This of course brings a significant reduction in the uncertainty and consequently in the entropy.</p>
<p>To calculate the entropy of the system using indistinguishable states we use this function</p>
<p></p>
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/5223405867?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/5223405867?profile=RESIZE_710x" class="align-center"/></a></p>
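<p>The entropy function above is only visible as an image. As a hedged illustration, the sketch below computes the entropy of the “sum” observable with the natural logarithm, which is consistent with the value 1.792 = ln 6 quoted for one die. The helper names <em>entropy</em> and <em>sum_probs</em> are assumptions, and the brute-force enumeration is only practical for a small number of dice.</p>

```r
# Shannon entropy (natural logarithm) of a probability vector.
entropy <- function(p) {
  p <- p[p > 0]
  -sum(p * log(p))
}

# Distribution of the sum of n dice by brute-force enumeration
# of all faces^n ordered outcomes (fine for small n).
sum_probs <- function(n, faces = 6) {
  outcomes <- expand.grid(rep(list(seq_len(faces)), n))
  totals <- rowSums(outcomes)
  as.numeric(table(totals)) / nrow(outcomes)
}

# Entropy per die of the "sum" observable for n = 1..5.
n_values <- 1:5
per_die <- sapply(n_values, function(n) entropy(sum_probs(n)) / n)
print(round(per_die, 3))
```

<p>For n=1 the result is ln 6, and the per-die value falls as n grows, which is the behaviour described in the text.</p>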
<p>Plotting the entropy per die for every <strong>n</strong>, we can see that it decreases with <strong>n</strong>.</p>
<p>The value for n=1 is 1.792, the entropy of a single die; as n increases, the symmetry starts to reduce the value.</p>
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/5223748488?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/5223748488?profile=RESIZE_710x" class="align-center"/></a></p>
<p>In a system without indistinguishability the entropy would be 1.792·n for every n.</p>
<p><strong>Nonlinear regression of COVID19 infected cases.</strong> (Pablo Gutierrez, 2020-04-12)</p>
<p>In 1927, W. O. Kermack and A. G. McKendrick described the first mathematical model for infectious diseases using a set of differential equations. This model is called SIR because of the three states an individual can be in.<br/> These states are:</p>
<ul>
<li>Susceptible: The individuals that can be infected by the disease</li>
<li>Infected: The individuals that have been infected and have the disease.</li>
<li>Recovered: The individuals that recovered from the disease and have become immune.</li>
</ul>
<p>The equations that represent these states are as follows:</p>
<ol>
<li>The number of susceptible individuals decreases over time at a rate governed by a transmission factor β and the size of the susceptible population.<a href="https://storage.ning.com/topology/rest/1.0/file/get/4403172973?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/4403172973?profile=RESIZE_710x" class="align-center"/></a></li>
<li>The variation of the infected is the number of new infections among those who are still susceptible, minus the number of people who recover and therefore become immune.<a href="https://storage.ning.com/topology/rest/1.0/file/get/4403176583?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/4403176583?profile=RESIZE_710x" class="align-center"/></a></li>
<li>The variation of the recovered depends directly on the number of infected multiplied by α, a factor that determines the time the infected need to recover, that is:<a href="https://storage.ning.com/topology/rest/1.0/file/get/4403183149?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/4403183149?profile=RESIZE_710x" class="align-center"/></a></li>
</ol>
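<p>To make the three equations concrete, here is a minimal numerical sketch. It assumes the standard SIR form (dS/dt = -βSI/N, dI/dt = βSI/N - αI, dR/dt = αI), which may differ in normalization from the equations shown in the images, and it uses a simple forward-Euler step. The function <em>simulate_sir</em> and all parameter values are illustrative, not fitted to any data.</p>

```r
# Forward-Euler integration of a standard SIR model (assumed form):
#   dS/dt = -beta * S * I / N
#   dI/dt =  beta * S * I / N - alpha * I
#   dR/dt =  alpha * I
simulate_sir <- function(beta, alpha, S0, I0, R0 = 0, days = 100, dt = 0.1) {
  steps <- round(days / dt)
  N <- S0 + I0 + R0
  S <- numeric(steps + 1)
  I <- numeric(steps + 1)
  R <- numeric(steps + 1)
  S[1] <- S0
  I[1] <- I0
  R[1] <- R0
  for (k in seq_len(steps)) {
    new_inf <- beta * S[k] * I[k] / N * dt   # newly infected in this step
    new_rec <- alpha * I[k] * dt             # newly recovered in this step
    S[k + 1] <- S[k] - new_inf
    I[k + 1] <- I[k] + new_inf - new_rec
    R[k + 1] <- R[k] + new_rec
  }
  data.frame(t = (0:steps) * dt, S = S, I = I, R = R)
}

# Illustrative parameter values only.
sim <- simulate_sir(beta = 0.4, alpha = 0.1, S0 = 9990, I0 = 10)
print(tail(sim, 1))
```

<p>Every individual who leaves one compartment enters another, so S + I + R stays constant throughout the simulation.</p>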
<p>The boundary conditions are:</p>
<ul>
<li>The population must always remain constant<a href="https://storage.ning.com/topology/rest/1.0/file/get/4403193435?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/4403193435?profile=RESIZE_710x" class="align-center"/></a></li>
<li>At t=0:<a href="https://storage.ning.com/topology/rest/1.0/file/get/4403198067?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/4403198067?profile=RESIZE_710x" class="align-center"/></a><a href="https://storage.ning.com/topology/rest/1.0/file/get/4403201722?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/4403201722?profile=RESIZE_710x" class="align-center"/></a></li>
</ul>
<p>The analytical solution of this system can be found in different articles, for example here: <a href="https://arxiv.org/abs/1403.2160">arXiv:1403.2160</a></p>
<p>Instead, I will focus on equation (2) and note that it is a Bernoulli equation of the form</p>
<p></p>
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/4403208181?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/4403208181?profile=RESIZE_710x" class="align-center"/></a></p>
<p>Where</p>
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/4403219260?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/4403219260?profile=RESIZE_710x" class="align-center"/></a></p>
<p>The solution of this Bernoulli differential equation is the <strong>logistic</strong> function, whose most general form is:</p>
<p> <a href="https://storage.ning.com/topology/rest/1.0/file/get/4403231925?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/4403231925?profile=RESIZE_710x" class="align-center"/></a></p>
<p>In the epidemiological context, this logistic function represents the cumulative number of infected people as a function of time.</p>
<p>This model can be fitted to real data to obtain the values of its parameters. The fit is done by minimizing the residuals in the loss function</p>
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/4403243711?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/4403243711?profile=RESIZE_710x" class="align-center"/></a>Because the function to be fitted is not linear, the method used to minimize the loss function must be suitable for nonlinear regression. To do this regression, I used the nls function in R, which uses the Gauss-Newton algorithm by default.</p>
<p></p>
<p>The data are the number of infected people in Spain as a function of time, as provided by the Ministry of Health.</p>
<p></p>
<p style="text-align: center;"><a href="https://storage.ning.com/topology/rest/1.0/file/get/4403279266?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/4403279266?profile=RESIZE_710x" class="align-center" width="472" height="288"/></a>This graph represents the data.</p>
<p style="text-align: left;">How to run the regression in R:</p>
<ol>
<li>Load the CSV with data using read_csv</li>
</ol>
<p><em>library(readr); library(dplyr)</em></p>
<p><em>descarga <- read_csv("serie_historica_acumulados.csv", col_types = cols(Fallecidos = col_double(), Fecha = col_date(format = "%d/%m/%Y"), Hospitalizados = col_double(), Recuperados = col_double(), UCI = col_double(), X8 = col_skip()))</em></p>
<p> </p>
<ol start="2">
<li>Group the data by date and sum all regions</li>
</ol>
<p><em>agregados_por_fecha<-descarga %>% group_by(Fecha) %>% summarize(Fallecidos=sum(Fallecidos), Casos=sum(Casos), Hospitalizados=sum(Hospitalizados),UCI=sum(UCI), Recuperados=sum(Recuperados))</em></p>
<p> </p>
<ol start="3">
<li><span>Create a sequence to use it as a time scale</span></li>
</ol>
<p><em>s <- seq_along(agregados_por_fecha$Fecha)</em></p>
<p><em>agregados_por_fecha["dia"] <- s</em></p>
<p> </p>
<ol start="4">
<li>Use nls to fit the curve. To get a good fit, it is necessary to provide initial parameter values that are compatible with the data. These need to be chosen manually.</li>
</ol>
<p> <em>logis.m1 <- nls(Casos ~ logis(dia, a, b, c,d), data = agregados_por_fecha, start = list(a = 0, b = 180000, c = 40, d=5))</em></p>
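<p>The post does not show the definition of <em>logis</em>, so the following is only a sketch of how such a fit could look. It assumes a four-parameter logistic in which a is the lower asymptote, b the upper asymptote, c the inflection day and d the time scale; this parametrization is a guess. The data are synthetic and generated for illustration; they are not the Ministry of Health series, and <em>datos</em> and <em>fit</em> are hypothetical names.</p>

```r
# Hypothetical four-parameter logistic; the post does not show its own logis().
#   a: lower asymptote, b: upper asymptote, c: inflection day, d: time scale
logis <- function(x, a, b, c, d) {
  a + (b - a) / (1 + exp(-(x - c) / d))
}

# Synthetic cumulative-case data for illustration only.
set.seed(1)
dia <- 1:80
Casos <- logis(dia, a = 0, b = 180000, c = 40, d = 5) +
  rnorm(length(dia), sd = 500)
datos <- data.frame(dia = dia, Casos = Casos)

# nls() uses Gauss-Newton by default; start values must be close to the data.
fit <- nls(Casos ~ logis(dia, a, b, c, d), data = datos,
           start = list(a = 0, b = 170000, c = 38, d = 6))
print(coef(fit))
```

<p>With poor starting values nls may fail to converge, which is why the start list has to be chosen by looking at the data first.</p>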
<p> </p>
<ol start="5">
<li>Use summary to retrieve the details of the regression.</li>
</ol>
<p><em>summary(logis.m1)</em></p>
<p> </p>
<pre>
Formula: Casos ~ logis(dia, a, b, c, d)

Parameters:
    Estimate Std. Error t value Pr(>|t|)
a -2.320e+03  5.344e+02  -4.342 0.000115 ***
b  1.788e+05  2.111e+03  84.706  < 2e-16 ***
c  3.914e+01  1.317e-01 297.217  < 2e-16 ***
d  5.362e+00  1.033e-01  51.920  < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
</pre>
<p></p>
<p style="text-align: center;"></p>
<p style="text-align: center;"><a href="https://storage.ning.com/topology/rest/1.0/file/get/4403276439?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/4403276439?profile=RESIZE_710x" class="align-center" width="485" height="277"/></a>This graph represents the data and the regression curve.</p>
<p style="text-align: left;">Conclusions:</p>
<ul>
<li>The regression found parameter values that are compatible with the data.</li>
<li>The inflection point occurred on day 39 (March 29).</li>
<li>The maximum number of infected people will be about 180,000.</li>
<li>The number of infected will grow until May 15th.</li>
</ul>
<p><strong>An easy way to evaluate the probability of winning a commercial opportunity</strong> (Pablo Gutierrez, 2019-11-26)</p>
<p>Whenever we visit a client and present our proposal, we start wondering whether it will be accepted or rejected. Usually, the customer will analyze our proposal, compare it with those of our competitors and make a decision.</p>
<p>In order to build our commercial forecast system, we need to assign a probability to every proposal we have presented and assign a numerical value to every one of them.<br/> One way of doing this is multiplying the value of the proposal by the probability of winning it.</p>
<p>Expected_income = proposal_value * proposal_probability</p>
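<p>As a small worked example of this formula, the sketch below applies it to a pipeline of three proposals. The values and probabilities are made up purely for illustration, and the data frame <em>proposals</em> is a hypothetical name.</p>

```r
# Hypothetical pipeline: values and win probabilities are invented for illustration.
proposals <- data.frame(
  proposal_value = c(10000, 25000, 5000),
  proposal_probability = c(0.5, 0.2, 0.9)
)

# Expected income per proposal and for the whole pipeline.
proposals$expected_income <-
  proposals$proposal_value * proposals$proposal_probability
total_forecast <- sum(proposals$expected_income)

print(proposals)
print(total_forecast)
```

<p>The forecast for the whole pipeline is simply the sum of the expected incomes of the individual proposals.</p>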
<p>But how do we assign a probability to the different opportunities? <br/> Most commercial departments estimate the probability of winning an opportunity using the knowledge, experience and instinct of the team.</p>
<p>Is there any other way to calculate the chances that an opportunity will be successful?</p>
<p>The answer is yes. Using logistic regression, one of the most popular techniques in machine learning, it is possible to train a model that calculates the probability that a commercial opportunity will be successful.</p>
<p>Logistic regression can be applied to many variables, but to keep it simple we will use only one, and that is the time (in days) that the opportunity has been “alive” since it was created.</p>
<p>We choose this variable because most purchasing departments need a certain amount of time to analyze proposals and make a decision for a specific kind of product; after this time the opportunity has little chance of being successful.</p>
<p>To train the model it is necessary to use historical information and prepare the data so that the X column holds the time an opportunity was “alive” and the Y column holds “1” or “0” depending on whether the opportunity was successful or not.</p>
<p>Using R we can perform this regression very easily<br/> <em>fit <- glm(result ~ time, data = Products, family = 'binomial')</em></p>
<p>R stores the fitted coefficient for the independent variable and the intercept, so we can construct the expression z=a*time+intercept .</p>
<p>Finally we calculate the time-dependent probability as p(t)=sigmoid(z)</p>
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/3742219732?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/3742219732?profile=RESIZE_710x" class="align-full"/></a>We should obtain a trend like this. The final shape of the curve will depend on the nature of the market.</p>
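<p>The steps above can be sketched end to end as follows. The historical data here are synthetic and generated under an assumed relationship in which older opportunities are won less often; the column names <em>time</em> and <em>result</em> and the data frame <em>Products</em> follow the glm call in the text, while <em>win_probability</em> is a hypothetical helper.</p>

```r
# Synthetic history for illustration: older opportunities are won less often.
set.seed(42)
time <- sample(1:120, 300, replace = TRUE)   # days the opportunity was alive
true_p <- plogis(3 - 0.06 * time)            # assumed "true" win probability
result <- rbinom(length(time), size = 1, prob = true_p)
Products <- data.frame(time = time, result = result)

fit <- glm(result ~ time, data = Products, family = "binomial")

# Extract the fitted intercept and the coefficient of time.
intercept <- coef(fit)[["(Intercept)"]]
a <- coef(fit)[["time"]]

# p(t) = sigmoid(a * t + intercept); plogis() is the logistic sigmoid in base R.
win_probability <- function(t) plogis(a * t + intercept)
print(win_probability(c(10, 50, 100)))
```

<p>The same probabilities can be obtained with <em>predict(fit, newdata, type = "response")</em>; writing the sigmoid explicitly just makes the link to z = a*time + intercept visible.</p>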
<p></p>
<p></p>