Kostas Hatalis's Posts - Data Science Central 2020-11-28T14:53:26Z Kostas Hatalis https://www.datasciencecentral.com/profile/KostasHatalis https://storage.ning.com/topology/rest/1.0/file/get/2801342535?profile=RESIZE_48X48&width=48&height=48&crop=1%3A1 https://www.datasciencecentral.com/profiles/blog/feed?user=0cdd94ehr5tws&xn_auth=no Tutorial: Multistep Forecasting with Seasonal ARIMA in Python tag:www.datasciencecentral.com,2018-04-12:6448529:BlogPost:712181 2018-04-12T19:30:00.000Z Kostas Hatalis https://www.datasciencecentral.com/profile/KostasHatalis <div class="output_subarea output_png"><p>When trend and seasonality are present in a time series, instead of decomposing it manually to fit an ARMA model using the Box-Jenkins method, another very popular approach is to use the seasonal autoregressive integrated moving average (SARIMA) model, which is a generalization of an ARMA model. SARIMA models are denoted SARIMA(p,d,q)(P,D,Q)[S], where S refers to the number of periods in each season, p and q are the orders of the non-seasonal autoregressive and moving average terms, d is the degree of differencing (the number of times the data have had past values subtracted), and the uppercase P, D, and Q refer to the autoregressive, differencing, and moving average terms for the seasonal part of the ARIMA model.</p> <p>The SARIMA model is a bit complex to write out directly so a backshift operator is needed to describe it. 
For example SARIMA(1,1,1)(1,1,1) is written as:</p> <p><a href="http://storage.ning.com/topology/rest/1.0/file/get/2808355691?profile=original" target="_self"><img width="750" src="http://storage.ning.com/topology/rest/1.0/file/get/2808355691?profile=RESIZE_1024x1024" class="align-center" width="495" height="134"/></a></p> <p>The backward shift operator B is a useful notational device when working with time series lags: By(t)=y(t−1)</p> <div class="cell text_cell unselected rendered"><div class="inner_cell"><div class="text_cell_render rendered_html"><h3 id="5.1-Rules-for-SARIMA-model-selection-from-ACF/PACF-plots">Rules for SARIMA model selection from ACF/PACF plots<a class="anchor-link" href="http://localhost:8888/notebooks/Time%20Series%20Analysis%20Notes.ipynb#5.1-Rules-for-SARIMA-model-selection-from-ACF/PACF-plots"></a></h3> </div> </div> </div> <div class="cell text_cell unselected rendered"><div class="inner_cell"><div class="text_cell_render rendered_html"><p>These are all rules of thumb, not an exact science, for picking the value of each parameter in SARIMA(p,d,q)(P,D,Q)[S]. Picking good parameters from the ACF/PACF plots is an art. The following rules also apply to ARMA and ARIMA models.</p> <p><strong>Identifying the order of differencing:</strong></p> <p>d=0 if the series has no visible trend or the ACF is low at all lags.</p> <p>d≥1 if the series has a visible trend or positive ACF values out to a high number of lags.</p> <p>Note: if, after differencing the series, the ACF at lag 1 is -0.5 or more negative, the series may be overdifferenced.</p> <p>Note: If you find the best d to be d=1 then the original series has a constant trend. 
A model with d=2 assumes that the original series has a time-varying trend.</p> <strong>Identifying the number of AR and MA terms</strong></div> <div class="text_cell_render rendered_html"><p>p is equal to the first lag where the PACF value is above the significance level.</p> <p>q is equal to the first lag where the ACF value is above the significance level.</p> <p><strong>Identifying the seasonal part of the model:</strong></p> <p>S is equal to the ACF lag with the highest value (typically at a high lag).</p> <p>D=1 if the series has a stable seasonal pattern over time.</p> <p>D=0 if the series has an unstable seasonal pattern over time.</p> <p>Rule of thumb: d+D≤2</p> <p>P≥1 if the ACF is positive at lag S, else P=0.</p> <p>Q≥1 if the ACF is negative at lag S, else Q=0.</p> <p>Rule of thumb: P+Q≤2</p> <div class="inner_cell"><div class="text_cell_render rendered_html"><h3 id="5.2-Grid-search-for-SARIMA-model-selection"><span style="font-size: 12pt;">Grid search for SARIMA model selection</span></h3> </div> </div> </div> </div> </div> <div class="inner_cell"><div class="text_cell_render rendered_html"><p>Doing a full manual time series analysis can be a tedious task, especially when you have many data sets to analyze. It is preferred to then automate the task of model selection with grid search. For SARIMA, since we have many parameters, grid search may take hours to complete on one data set if we set the limit of each parameter too high. Setting the limits too high will also make your model too complex and overfit the training data.</p> <p>To prevent the long runtime and overfitting problem, we apply what is known as the parsimony principle where we create a combination of all parameters such that p+d+q+P+D+Q≤ 6. Another approach is to set each parameter as 0 or 1 or 2 and do grid search using AIC with each combination. 
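</p> <p>As a rough illustration of the parsimony rule, the sketch below enumerates every (p, d, q, P, D, Q) combination in which each parameter is 0, 1 or 2 and the sum is at most 6. The function name <code>candidate_orders</code> and its arguments are hypothetical, chosen only for this example. It builds the search grid only and does not fit any model.</p>

```python
from itertools import product


def candidate_orders(max_order=2, max_total=6):
    """Enumerate (p, d, q, P, D, Q) tuples allowed by the parsimony rule."""
    values = range(max_order + 1)  # each parameter is 0, 1 or 2
    return [
        combo
        for combo in product(values, repeat=6)
        if sum(combo) <= max_total  # p + d + q + P + D + Q <= 6
    ]


orders = candidate_orders()
print(len(orders), "candidate models out of", 3 ** 6)
```

<p>In a full grid search each tuple would then be fit on the training data and the model with the lowest AIC kept. One library that could do the fitting is statsmodels, through its <code>SARIMAX</code> class, but that choice is an assumption here.</p> <p>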
It is more common in forecasting studies to apply grid search on SARIMA when you are using it as a benchmark method to more advanced models such as deep neural networks.</p> <p>But as a reminder, grid search may not always give you the best model. To get the best model you may need to manually experiment with different parameters using the ACF/PACF plots.</p> </div> </div> <p><span style="font-size: 12pt;"><strong>Python Tutorial</strong></span><a href="http://storage.ning.com/topology/rest/1.0/file/get/2808359054?profile=original" target="_self"><img width="750" src="http://storage.ning.com/topology/rest/1.0/file/get/2808359288?profile=RESIZE_1024x1024" class="align-center" width="750"/><img width="750" src="http://storage.ning.com/topology/rest/1.0/file/get/2808359054?profile=RESIZE_1024x1024" class="align-center" width="750"/></a>After loading in our time series we plot it, here we use the classical Air Passengers time series.</p> <p><a href="http://storage.ning.com/topology/rest/1.0/file/get/2808365875?profile=original" target="_self"><img width="750" src="http://storage.ning.com/topology/rest/1.0/file/get/2808365875?profile=RESIZE_1024x1024" class="align-center" width="750"/></a></p> <p><a href="http://storage.ning.com/topology/rest/1.0/file/get/2808366235?profile=original" target="_self"><img src="http://storage.ning.com/topology/rest/1.0/file/get/2808366235?profile=original" class="align-center" width="392"/></a>From inspecting the plot we can conclude that this time series has a positive linear trend, multiplicative seasonal patterns, and possibly some irregular patterns. This information strongly suggests for us to use a SARIMA model to do our forecasting. Let's get to it! 
First we use 70% of the data for training and 30% for testing.</p> <div class="output_subarea output_png"><div class="cell text_cell rendered selected"><div class="inner_cell"><div class="text_cell_render rendered_html"><p><a href="http://storage.ning.com/topology/rest/1.0/file/get/2808366338?profile=original" target="_self"><img width="750" src="http://storage.ning.com/topology/rest/1.0/file/get/2808366338?profile=RESIZE_1024x1024" class="align-center" width="750"/></a>Next, since the data has multiplicative seasonality, we apply a log filter and then analyze the residuals with autocorrelation plots.</p> <p><a href="http://storage.ning.com/topology/rest/1.0/file/get/2808376041?profile=original" target="_self"><img width="750" src="http://storage.ning.com/topology/rest/1.0/file/get/2808376041?profile=RESIZE_1024x1024" class="align-center" width="750"/></a></p> <p><a href="http://storage.ning.com/topology/rest/1.0/file/get/2808376422?profile=original" target="_self"><img src="http://storage.ning.com/topology/rest/1.0/file/get/2808376422?profile=original" class="align-center" width="404"/></a></p> <p>We see here that the multiplicative effect and the trend are gone. However, an unstable seasonal pattern is still present in this residual series. This indicates that we need to remove the seasonal pattern, which can be done with SARIMA. 
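</p> <p>The transformation step can be sketched in plain Python. The code below assumes that the "log filter" means taking the natural log and then first differences, and it uses a synthetic monthly series rather than the Air Passengers data. The helper names <code>log_diff</code> and <code>sample_acf</code> are hypothetical.</p>

```python
import math


def log_diff(series):
    """Take the natural log, then first differences."""
    logged = [math.log(v) for v in series]
    return [b - a for a, b in zip(logged, logged[1:])]


def sample_acf(x, max_lag):
    """Biased sample autocorrelation for lags 0..max_lag."""
    n = len(x)
    mean = sum(x) / n
    dev = [v - mean for v in x]
    denom = sum(d * d for d in dev)
    return [
        sum(dev[t] * dev[t + k] for t in range(n - k)) / denom
        for k in range(max_lag + 1)
    ]


# Synthetic data (an assumption): a linear trend multiplied by a 12-month cycle.
series = [
    (100 + 2 * t) * (1 + 0.2 * math.sin(2 * math.pi * t / 12))
    for t in range(144)
]
acf = sample_acf(log_diff(series), max_lag=24)
print("ACF at lag 6:", round(acf[6], 2), "ACF at lag 12:", round(acf[12], 2))
```

<p>For a series like this the transformed data keeps a strong positive autocorrelation at lag 12, which is the kind of seasonal signature the ACF plot is used to detect.</p> <p>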
We can select the seasonal pattern parameters of SARIMA by looking at the ACF and PACF plots.<a href="http://storage.ning.com/topology/rest/1.0/file/get/2808377487?profile=original" target="_self"><img width="750" src="http://storage.ning.com/topology/rest/1.0/file/get/2808377487?profile=RESIZE_1024x1024" class="align-center" width="750"/></a><a href="http://storage.ning.com/topology/rest/1.0/file/get/2808377761?profile=original" target="_self"><img width="750" src="http://storage.ning.com/topology/rest/1.0/file/get/2808377761?profile=RESIZE_1024x1024" width="750"/></a></p> <div class="text_cell_render rendered_html"><p>Looking at the ACF and PACF plots of the differenced series, we see our first significant value at lag 4 for the ACF and at the same lag 4 for the PACF, which suggests using p = 4 and q = 4. We also have a large value at lag 12 in the ACF plot, which suggests our season is S = 12, and since the ACF at this lag is positive it suggests P = 1 and Q = 0. Since this is a differenced series, for SARIMA we set d = 1, and since the seasonal pattern is not stable over time we set D = 0. Altogether this gives us a SARIMA(4,1,4)(1,0,0) model. 
Next we run SARIMA with these values to fit a model on our training data.</p> <p><a href="http://storage.ning.com/topology/rest/1.0/file/get/2808377871?profile=original" target="_self"><img width="750" src="http://storage.ning.com/topology/rest/1.0/file/get/2808377871?profile=RESIZE_1024x1024" class="align-center" width="750"/></a>Now we can forecast.</p> <p><a href="http://storage.ning.com/topology/rest/1.0/file/get/2808378154?profile=original" target="_self"><img width="750" src="http://storage.ning.com/topology/rest/1.0/file/get/2808378154?profile=RESIZE_1024x1024" class="align-center" width="750"/></a><a href="http://storage.ning.com/topology/rest/1.0/file/get/2808378306?profile=original" target="_self"><img src="http://storage.ning.com/topology/rest/1.0/file/get/2808378306?profile=original" class="align-center" width="615"/></a>We can see here that the multi-step forecast of our SARIMA(4,1,4)(1,0,0) model fits the testing data well, with an RMSE of 23.7! When you manually conduct a good time series analysis, as I have done here, it will be difficult to beat ARMA models for forecasting. I also ran grid search and found the best model to be SARIMA(1, 0, 1)x(1, 1, 1), which had an AIC of 696.05. This resulted in a forecast with an RMSE of 24.74, which is also pretty good! In conclusion, depending on your forecasting problem, SARIMA is often a great choice.</p> <div class="output_subarea output_png"></div> <div class="output_subarea output_png"></div> </div> </div> </div> </div> </div> <div class="output_subarea output_png"></div> </div> What is intelligence? tag:www.datasciencecentral.com,2018-03-15:6448529:BlogPost:704321 2018-03-15T21:30:00.000Z Kostas Hatalis https://www.datasciencecentral.com/profile/KostasHatalis <p>This is an AI related post on the nature and philosophy of intelligence. In the various fields that study the mind, human or otherwise, there are many definitions of the term 'intelligence', and in some cases none at all. 
What is it, how can we measure it, how can we reproduce it? What implications does this have in the fields of AI, machine learning, and data science? A paper by Shane Legg and Marcus Hutter attempted to survey the definitions from these various fields. The following are some sample definitions they collected.</p> <p></p> <p><strong>Collective definitions:</strong></p> <p>“The ability to use memory, knowledge, experience, understanding, reasoning, imagination and judgment in order to solve problems and adapt to new situations.” AllWords Dictionary, 2006</p> <p>“The capacity to acquire and apply knowledge.” The American Heritage Dictionary, fourth edition, 2000</p> <p></p> <p><strong>Psychologist definitions:</strong></p> <p>“The facet of mind underlying our capacity to think, to solve novel problems, to reason and to have knowledge of the world.” M. Anderson</p> <p>“Sensation, perception, association, memory, imagination, discrimination, judgement and reasoning.” N. E. Haggerty</p> <p></p> <p><strong>AI researcher definitions:</strong></p> <p>“Achieving complex goals in complex environments” B. Goertzel</p> <p>“Intelligence is the ability to use optimally limited resources – including time – to achieve goals.” R. Kurzweil</p> <p>“The ability to solve hard problems.” M. Minsky</p> <p></p> <p><strong>Definition by S. Legg and M. 
Hutter:</strong></p> <p>“Intelligence measures an agent’s ability to achieve goals in a wide range of environments”</p> <p></p> <p><strong>Here is also a definition from the Wikipedia article on an Intelligent Agent:</strong></p> <p>“An intelligent agent is an autonomous entity which observes through sensors and acts upon an environment using actuators and directs its activity towards achieving goals.”</p> <p></p> <p>From the given definitions it is clear that intelligence has certain characteristics:</p> <ol> <li>It is an aspect arising from an agent.</li> <li>The agent exists in an uncertain environment.</li> <li>The agent can perceive this environment (has percepts).</li> <li>The agent can take actions in this environment (has actuators).</li> <li>The agent has some means of computation for decision making, i.e. it can think. It has the ability to input information from its percepts, process this information to make a decision, and output information to take an action.</li> <li>The agent’s actions are goal directed (very important).</li> <li>The agent can measure how well it is meeting its goal.</li> </ol> <p>Depending on the goals to be achieved by an agent, intelligence can be fairly simple or very complex. Human-level intelligence, for instance, is extremely difficult to measure and define. People certainly have a number of processing abilities, including learning, reasoning, adapting online, self-analysis, processing large amounts of data, short- and long-term memory, etc.</p> <p>However, based on the definitions aggregated above, an intelligence can be any black box which has a defined goal, the ability to input percepts, process the percepts (perform a calculation), and output an action related to the goal. For instance, this can be a reflex agent, something as simple as a computer function based on condition-action rules such as if-else statements. 
It can be even simpler: a function that takes two variables, x and y, which are its percepts, adds them, which is its calculation, and outputs the result, which is its goal-directed action. This would, though, be about as low a level of intelligence as possible.</p> <p>There is currently no universal measure for intelligence (say between a calculator, an ant, and a human) since intelligence is based largely on how well a certain goal can be achieved. Thus intelligence, as far as I understand, can only be applied as a relative measure to agents that have similar goals and problem solving capabilities.</p> <p> <a href="http://arxiv.org/pdf/0706.3639.pdf" rel="nofollow">http://arxiv.org/pdf/0706.3639.pdf</a></p> Basics of Bayesian Decision Theory tag:www.datasciencecentral.com,2018-03-15:6448529:BlogPost:704430 2018-03-15T21:00:00.000Z Kostas Hatalis https://www.datasciencecentral.com/profile/KostasHatalis <p><a href="http://storage.ning.com/topology/rest/1.0/file/get/2808355431?profile=original" target="_self"><img src="http://storage.ning.com/topology/rest/1.0/file/get/2808355431?profile=original" class="align-center" width="470"/></a>The use of formal statistical methods to analyse quantitative data in data science has increased considerably over the last few years. 
One such approach, <strong>Bayesian Decision Theory (BDT)</strong>, closely related to Bayesian hypothesis testing and Bayesian inference, is a fundamental statistical approach that quantifies the tradeoffs between various decisions using distributions and costs that accompany such decisions. In pattern recognition it is used for designing classifiers, under the assumption that the problem is posed in probabilistic terms and that all of the relevant probability values are known. Generally, we don’t have such perfect information, but it is a good place to start when studying machine learning, statistical inference, and detection theory in signal processing. BDT also has many applications in science, engineering, and medicine.</p> <p>From the perspective of BDT, any kind of probability distribution - such as the distribution for tomorrow's weather - represents a prior distribution. That is, it represents what we expect today about tomorrow's weather. This contrasts with frequentist inference, the classical probability interpretation, where conclusions about an experiment are drawn from a set of repetitions of that experiment, each producing statistically independent results. For a frequentist, a probability function would be a simple distribution function with no special meaning.</p> <div>In BDT a decision can be viewed as choosing a hypothesis about where observations of the random variable <em>Y</em> come from. For instance, in image analysis you may want to decide if a picture is of a cat or a dog, in medicine you may want to see if a heartbeat is nominal or irregular, or in radar you may want to decide if a target is on the map or not. 
We assume two possible hypotheses <img src="https://s0.wp.com/latex.php?latex=H_%7B0%7D&amp;bg=ffffff&amp;fg=444444&amp;s=0" alt="H_{0}" title="H_{0}" class="latex"/> (null hypothesis) and <img src="https://s0.wp.com/latex.php?latex=H_%7B1%7D&amp;bg=ffffff&amp;fg=444444&amp;s=0" alt="H_{1}" title="H_{1}" class="latex"/> (alternate hypothesis) corresponding to two possible probability distributions <img src="https://s0.wp.com/latex.php?latex=P_%7B0%7D&amp;bg=ffffff&amp;fg=444444&amp;s=0" alt="P_{0}" title="P_{0}" class="latex"/> and <img src="https://s0.wp.com/latex.php?latex=P_%7B1%7D&amp;bg=ffffff&amp;fg=444444&amp;s=0" alt="P_{1}" title="P_{1}" class="latex"/> on the observation space <img src="https://s0.wp.com/latex.php?latex=%5CGamma&amp;bg=ffffff&amp;fg=444444&amp;s=0" alt="\Gamma" title="\Gamma" class="latex"/>. We write this problem as <img src="https://s0.wp.com/latex.php?latex=H_%7B0%7D%3A+P_%7B0%7D%28y%29&amp;bg=ffffff&amp;fg=444444&amp;s=0" alt="H_{0}: P_{0}(y)" title="H_{0}: P_{0}(y)" class="latex"/> versus <img src="https://s0.wp.com/latex.php?latex=H_%7B1%7D%3A+P_%7B1%7D%28y%29&amp;bg=ffffff&amp;fg=444444&amp;s=0" alt="H_{1}: P_{1}(y)" title="H_{1}: P_{1}(y)" class="latex"/>. 
A decision rule <img src="https://s0.wp.com/latex.php?latex=%5Cdelta&amp;bg=ffffff&amp;fg=444444&amp;s=0" alt="\delta" title="\delta" class="latex"/> for <img src="https://s0.wp.com/latex.php?latex=H_%7B0%7D&amp;bg=ffffff&amp;fg=444444&amp;s=0" alt="H_{0}" title="H_{0}" class="latex"/> versus <img src="https://s0.wp.com/latex.php?latex=H_%7B1%7D&amp;bg=ffffff&amp;fg=444444&amp;s=0" alt="H_{1}" title="H_{1}" class="latex"/> is any partition of the observation set <img src="https://s0.wp.com/latex.php?latex=%5CGamma&amp;bg=ffffff&amp;fg=444444&amp;s=0" alt="\Gamma" title="\Gamma" class="latex"/> into sets <img src="https://s0.wp.com/latex.php?latex=%5CGamma_%7B0%7D&amp;bg=ffffff&amp;fg=444444&amp;s=0" alt="\Gamma_{0}" title="\Gamma_{0}" class="latex"/> and <img src="https://s0.wp.com/latex.php?latex=%5CGamma_%7B1%7D%3D1-+%5CGamma_%7B0%7D&amp;bg=ffffff&amp;fg=444444&amp;s=0" alt="\Gamma_{1}=1- \Gamma_{0}" title="\Gamma_{1}=1- \Gamma_{0}" class="latex"/>. We think of the decision rule as such:</div> <div style="text-align: center;"><img src="https://s0.wp.com/latex.php?latex=%5Cdelta%28y%29+%3D+%5Cleft%5C%7B+%5Cbegin%7Barray%7D%7Bll%7D+1+if+y+%5Cin+%5CGamma_%7B1%7D%5C%5C+0+if+y+%5Cin+%5CGamma_%7B0%7D+%5Cend%7Barray%7D+%5Cright.&amp;bg=ffffff&amp;fg=444444&amp;s=0" alt="\delta(y) = \left\{ \begin{array}{ll} 1 if y \in \Gamma_{1}\\ 0 if y \in \Gamma_{0} \end{array} \right." title="\delta(y) = \left\{ \begin{array}{ll} 1 if y \in \Gamma_{1}\\ 0 if y \in \Gamma_{0} \end{array} \right." class="latex"/></div> <div>We would like to optimize how we choose <img src="https://s0.wp.com/latex.php?latex=%5CGamma_%7B1%7D&amp;bg=ffffff&amp;fg=444444&amp;s=0" alt="\Gamma_{1}" title="\Gamma_{1}" class="latex"/> so to do so we assign costs to our decisions, which are some positive numbers. 
<img src="https://s0.wp.com/latex.php?latex=C_%7Bij%7D&amp;bg=ffffff&amp;fg=444444&amp;s=0" alt="C_{ij}" title="C_{ij}" class="latex"/> is the cost incurred by choosing hypothesis <img src="https://s0.wp.com/latex.php?latex=H_%7Bi%7D&amp;bg=ffffff&amp;fg=444444&amp;s=0" alt="H_{i}" title="H_{i}" class="latex"/> when hypothesis <img src="https://s0.wp.com/latex.php?latex=H_%7Bj%7D&amp;bg=ffffff&amp;fg=444444&amp;s=0" alt="H_{j}" title="H_{j}" class="latex"/> is true. The decision rule can alternatively be written in terms of the likelihood ratio L(y): it computes L(y) for the observed value of Y and then makes its decision by comparing this ratio to the threshold <img src="https://s0.wp.com/latex.php?latex=%5Ctau&amp;bg=ffffff&amp;fg=444444&amp;s=0" alt="\tau" title="\tau" class="latex"/>:</div> <div style="text-align: center;"> <img src="https://s0.wp.com/latex.php?latex=%5Cdelta%28y%29+%3D+%5Cleft%5C%7B+%5Cbegin%7Barray%7D%7Bll%7D+1+if+L%28y%29+%5Cgeq+%5Ctau+%5C%5C+0+if+L%28y%29+%3C+%5Ctau+%5Cend%7Barray%7D+%5Cright.&amp;bg=ffffff&amp;fg=444444&amp;s=0" alt="\delta(y) = \left\{ \begin{array}{ll} 1 if L(y) \geq \tau \\ 0 if L(y) &lt; \tau \end{array} \right." title="\delta(y) = \left\{ \begin{array}{ll} 1 if L(y) \geq \tau \\ 0 if L(y) &lt; \tau \end{array} \right." 
class="latex"/></div> <div><div>where</div> </div> <div style="text-align: center;"><img src="https://s0.wp.com/latex.php?latex=L%28y%29+%3D+%5Cfrac%7Bp_%7B1%7D%28y%29%7D%7Bp_%7B0%7D%28y%29%7D&amp;bg=ffffff&amp;fg=444444&amp;s=0" alt="L(y) = \frac{p_{1}(y)}{p_{0}(y)}" title="L(y) = \frac{p_{1}(y)}{p_{0}(y)}" class="latex"/> and <img src="https://s0.wp.com/latex.php?latex=%5Ctau+%3D+%5Cfrac%7B%5Cpi_%7B0%7D%28C_%7B10%7D-C_%7B00%7D%29%7D%7B%5Cpi_%7B1%7D%28C_%7B01%7D-C_%7B11%7D%29%7D&amp;bg=ffffff&amp;fg=444444&amp;s=0" alt="\tau = \frac{\pi_{0}(C_{10}-C_{00})}{\pi_{1}(C_{01}-C_{11})}" title="\tau = \frac{\pi_{0}(C_{10}-C_{00})}{\pi_{1}(C_{01}-C_{11})}" class="latex"/></div> <div>We then define the conditional risk for each hypothesis as the expected (average) cost incurred by the decision rule <img src="https://s0.wp.com/latex.php?latex=%5Cdelta&amp;bg=ffffff&amp;fg=444444&amp;s=0" alt="\delta" title="\delta" class="latex"/> when that hypothesis is true:</div> <div style="text-align: center;"><img src="https://s0.wp.com/latex.php?latex=R_%7B0%7D+%3D+C_%7B00%7DP_%7B0%7D%28%5CGamma_%7B0%7D%29%2BC_%7B10%7DP_%7B0%7D%28%5CGamma_%7B1%7D%29&amp;bg=ffffff&amp;fg=444444&amp;s=0" alt="R_{0} = C_{00}P_{0}(\Gamma_{0})+C_{10}P_{0}(\Gamma_{1})" title="R_{0} = C_{00}P_{0}(\Gamma_{0})+C_{10}P_{0}(\Gamma_{1})" class="latex"/></div> <div style="text-align: center;"><img src="https://s0.wp.com/latex.php?latex=R_%7B1%7D+%3D+C_%7B11%7DP_%7B1%7D%28%5CGamma_%7B1%7D%29%2BC_%7B01%7DP_%7B1%7D%28%5CGamma_%7B0%7D%29&amp;bg=ffffff&amp;fg=444444&amp;s=0" alt="R_{1} = C_{11}P_{1}(\Gamma_{1})+C_{01}P_{1}(\Gamma_{0})" title="R_{1} = C_{11}P_{1}(\Gamma_{1})+C_{01}P_{1}(\Gamma_{0})" class="latex"/></div> <div><img src="https://s0.wp.com/latex.php?latex=R_%7B0%7D&amp;bg=ffffff&amp;fg=444444&amp;s=0" alt="R_{0}" title="R_{0}" class="latex"/> is the cost of choosing <img src="https://s0.wp.com/latex.php?latex=H_%7B0%7D+&amp;bg=ffffff&amp;fg=444444&amp;s=0" alt="H_{0} " title="H_{0} " class="latex"/> when 
<img src="https://s0.wp.com/latex.php?latex=H_%7B0%7D+&amp;bg=ffffff&amp;fg=444444&amp;s=0" alt="H_{0} " title="H_{0} " class="latex"/> is true multiplied by the probability of this decision, plus the cost of choosing <img src="https://s0.wp.com/latex.php?latex=H_%7B1%7D+&amp;bg=ffffff&amp;fg=444444&amp;s=0" alt="H_{1} " title="H_{1} " class="latex"/> when <img src="https://s0.wp.com/latex.php?latex=H_%7B0%7D+&amp;bg=ffffff&amp;fg=444444&amp;s=0" alt="H_{0} " title="H_{0} " class="latex"/> is true multiplied by the probability of doing this. The second conditional risk is built the same way for the case where <img src="https://s0.wp.com/latex.php?latex=H_%7B1%7D+&amp;bg=ffffff&amp;fg=444444&amp;s=0" alt="H_{1} " title="H_{1} " class="latex"/> is true. Next we assign the prior probability <img src="https://s0.wp.com/latex.php?latex=%5Cpi_%7B0%7D+&amp;bg=ffffff&amp;fg=444444&amp;s=0" alt="\pi_{0} " title="\pi_{0} " class="latex"/> that <img src="https://s0.wp.com/latex.php?latex=H_%7B0%7D+&amp;bg=ffffff&amp;fg=444444&amp;s=0" alt="H_{0} " title="H_{0} " class="latex"/> is true, unconditioned on the observation, and the prior probability <img src="https://s0.wp.com/latex.php?latex=%5Cpi_%7B1%7D+%3D+1-+%5Cpi_%7B0%7D+&amp;bg=ffffff&amp;fg=444444&amp;s=0" alt="\pi_{1} = 1- \pi_{0} " title="\pi_{1} = 1- \pi_{0} " class="latex"/> that <img src="https://s0.wp.com/latex.php?latex=H_%7B1%7D+&amp;bg=ffffff&amp;fg=444444&amp;s=0" alt="H_{1} " title="H_{1} " class="latex"/> is true. 
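</div> <div>The likelihood ratio rule and the conditional risks can be illustrated numerically. The sketch below assumes a hypothetical setup that is not taken from the text's figure: two unit-variance Gaussians with means 0 and 1, uniform costs, and equal priors. All constant and function names are illustrative.</div>

```python
import math


def gaussian_pdf(y, mean, std):
    z = (y - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2 * math.pi))


def normal_cdf(y, mean, std):
    return 0.5 * (1 + math.erf((y - mean) / (std * math.sqrt(2))))


# Assumed setup: H0 is N(0, 1), H1 is N(1, 1), uniform costs, equal priors.
MU0, MU1, SIGMA = 0.0, 1.0, 1.0
C00, C11, C10, C01 = 0.0, 0.0, 1.0, 1.0
PI0 = 0.5
PI1 = 1 - PI0

# Threshold: tau = pi0 (C10 - C00) / (pi1 (C01 - C11))
tau = PI0 * (C10 - C00) / (PI1 * (C01 - C11))


def likelihood_ratio(y):
    return gaussian_pdf(y, MU1, SIGMA) / gaussian_pdf(y, MU0, SIGMA)


def delta(y):
    """Decide 1 (H1) when L(y) >= tau, otherwise 0 (H0)."""
    return 1 if likelihood_ratio(y) >= tau else 0


# For equal variances and MU1 > MU0, L(y) >= tau is equivalent to y >= boundary.
boundary = (MU0 + MU1) / 2 + SIGMA ** 2 * math.log(tau) / (MU1 - MU0)

# Conditional risks: R0 = C00 P0(G0) + C10 P0(G1), R1 = C11 P1(G1) + C01 P1(G0)
p0_gamma1 = 1 - normal_cdf(boundary, MU0, SIGMA)  # deciding H1 under H0
p1_gamma0 = normal_cdf(boundary, MU1, SIGMA)      # deciding H0 under H1
R0 = C00 * (1 - p0_gamma1) + C10 * p0_gamma1
R1 = C11 * (1 - p1_gamma0) + C01 * p1_gamma0

print(tau, boundary, round(R0, 4), round(R1, 4))
```

<div>With these symmetric choices the threshold is 1 and the decision boundary falls halfway between the two means, so both conditional risks are equal.</div> <div>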
Given the risks and prior probabilities we can then define the Bayes risk, which is the overall average cost of the decision rule:</div> <div style="text-align: center;"><img src="https://s0.wp.com/latex.php?latex=r%28%5Cdelta%29%3D+%5Cpi_%7B0%7DR_%7B0%7D%28%5Cdelta%29%2B+%5Cpi_%7B1%7DR_%7B1%7D%28%5Cdelta%29&amp;bg=ffffff&amp;fg=444444&amp;s=0" alt="r(\delta)= \pi_{0}R_{0}(\delta)+ \pi_{1}R_{1}(\delta)" title="r(\delta)= \pi_{0}R_{0}(\delta)+ \pi_{1}R_{1}(\delta)" class="latex"/></div> <div>The optimum decision rule for <img src="https://s0.wp.com/latex.php?latex=H_%7B0%7D&amp;bg=ffffff&amp;fg=444444&amp;s=0" alt="H_{0}" title="H_{0}" class="latex"/> versus <img src="https://s0.wp.com/latex.php?latex=H_%7B1%7D&amp;bg=ffffff&amp;fg=444444&amp;s=0" alt="H_{1}" title="H_{1}" class="latex"/> is one that minimizes the Bayes risk over all decision rules. Such a rule is called the Bayes rule. Below is a simple illustrative example of the decision boundary where <img src="https://s0.wp.com/latex.php?latex=p_%7B0%7D&amp;bg=ffffff&amp;fg=444444&amp;s=0" alt="p_{0}" title="p_{0}" class="latex"/> and <img src="https://s0.wp.com/latex.php?latex=p_%7B1%7D&amp;bg=ffffff&amp;fg=444444&amp;s=0" alt="p_{1}" title="p_{1}" class="latex"/> are Gaussian, and we have uniform costs and equal priors.</div> Swarm Optimization: Goodbye Gradients tag:www.datasciencecentral.com,2018-03-15:6448529:BlogPost:704428 2018-03-15T21:00:00.000Z Kostas Hatalis https://www.datasciencecentral.com/profile/KostasHatalis <p><a href="http://storage.ning.com/topology/rest/1.0/file/get/2808355324?profile=original" target="_self"><img src="http://storage.ning.com/topology/rest/1.0/file/get/2808355324?profile=original" class="align-center" width="514" height="293"/></a>Fish schools, bird flocks, and bee swarms. These combinations of real-time biological systems can blend knowledge, exploration, and exploitation to unify intelligence and solve problems more efficiently. There’s no centralized control. 
These simple agents interact locally, within their environment, and new behaviors emerge from the group as a whole. In the world of evolutionary algorithms one such inspired method is <strong>particle swarm optimization (PSO)</strong>. It is a swarm intelligence based computational technique that can be used to find an approximate solution to a problem by iteratively improving candidate solutions (called particles) with regard to a given measure of quality. The movements of the particles are guided by their own best known position in the search space as well as the entire swarm’s best known position. PSO makes few or no assumptions about the problem being optimized and can search very large spaces of candidate solutions. As a global optimization method PSO does not use the gradient of the problem being optimized, which means PSO does not require that the optimization problem be differentiable, as is required by classic optimization methods such as gradient descent. This makes it a widely popular optimizer for many nonconvex or nondifferentiable problems. 
<strong>Applications that may be of interest to data scientists include:</strong></p> <ol> <li>Hyperparameter optimization for deep learning models.</li> <li>Training different neural networks.</li> <li>Learning to play video games.</li> <li>Natural language processing tasks.</li> <li>Data clustering.</li> <li>Feature selection in classification.</li> </ol> <p>In 1995, Dr. Eberhart and Dr. Kennedy developed PSO as a population based stochastic optimization strategy inspired by the social behavior of a flock of birds. When using PSO, a possible solution to the numeric optimization problem under investigation is represented by the position of a particle. Each particle has a current velocity, which represents a magnitude and direction toward a new, presumably better, solution. A particle also has a measure of the quality of its current position, the particle’s best known position (a previous position with the best known quality), and the quality of the global best known position of the swarm. In other words, in a PSO system, particles fly around in a multidimensional search space. During flight, each particle adjusts its position according to its own experience and according to the experience of a neighboring particle, making use of the best position encountered by itself and its neighbor.</p> <p>The PSO algorithm keeps track of three global variables: a target value, a global best (gBest) value indicating which particle’s data in the population is currently closest to the target, and a stopping value indicating when the algorithm should stop if the target isn’t found. The particle associated with the best solution (fitness value) is the leader, and each particle keeps track of its coordinates in the problem space. 
Basically each particle consists of: data representing a possible solution, a velocity value indicating how much the data can be changed, and a personal best (pBest) fitness value indicating the closest the particle’s data has ever come to the target since the algorithm started. The particle’s data could be anything. For example, in a flock of birds flying over a food source, the data would be the X, Y, Z coordinates of each bird. The individual coordinates of each bird would try to move closer to the coordinates of the bird (gBest) which is closest to the food’s coordinates. The gBest value only changes when any particle’s pBest value comes closer to the target than gBest. At each iteration of the algorithm, gBest gradually moves closer and closer to the target until one of the particles reaches the target. If the data is a pattern or sequence, then individual pieces of the data would be manipulated until the pattern matches the target pattern.</p> <p style="text-align: center;"><a href="http://storage.ning.com/topology/rest/1.0/file/get/2773310271?profile=original" target="_self"><img src="http://storage.ning.com/topology/rest/1.0/file/get/2773310271?profile=original" class="align-center" width="560"/></a><strong><em>Figure:</em></strong> <em>A particle swarm searching for the global minimum of a function (Source: Wikipedia).</em></p> <p>The velocity value is calculated according to how far an individual’s data is from the target. The further it is, the larger the velocity value. In the bird example, the individuals furthest from the food would make an effort to keep up with the others by flying faster toward the gBest bird. If the data is a pattern or sequence, the velocity would describe how different the pattern is from the target, and thus, how much it needs to be changed to match the target. The swarm of particles, initialized with a population of random candidate solutions, moves through the D-dimensional problem space to search for new solutions. 
The fitness, f, is a problem-specific measure of the quality of a solution. Each particle has a position represented by a position vector and a velocity. The update of the particle’s velocity from the previous velocity to the new one is determined by the following equation:</p> <p><a href="http://storage.ning.com/topology/rest/1.0/file/get/2808358843?profile=original" target="_self"><img src="http://storage.ning.com/topology/rest/1.0/file/get/2808358843?profile=original" class="align-center" width="566"/></a>From this a particle decides where to move next, considering its own experience, which is the memory of its best past position, and the experience of the most successful particle in the swarm. The new position is then determined by the sum of the previous position and the new velocity. In more mathematical notation, the previous two equations can be written as:</p> <p><a href="http://storage.ning.com/topology/rest/1.0/file/get/2808358901?profile=original" target="_self"><img src="http://storage.ning.com/topology/rest/1.0/file/get/2808358901?profile=original" class="align-center" width="454"/></a><a href="http://storage.ning.com/topology/rest/1.0/file/get/2808359296?profile=original" target="_self"><img src="http://storage.ning.com/topology/rest/1.0/file/get/2808359296?profile=original" class="align-center" width="216"/></a>The first equation updates a particle’s velocity, which is a vector value with multiple components. The term <img src="https://s0.wp.com/latex.php?latex=%5Coverrightarrow%7Bv%7D_%7Bi%7D%28t%2B1%29+&amp;bg=ffffff&amp;fg=444444&amp;s=0" alt="\overrightarrow{v}_{i}(t+1) " title="\overrightarrow{v}_{i}(t+1) " class="latex"/> means the new velocity at time <img src="https://s0.wp.com/latex.php?latex=t%2B1+&amp;bg=ffffff&amp;fg=444444&amp;s=0" alt="t+1 " title="t+1 " class="latex"/>. The new velocity depends on three terms. 
The first term is <img src="https://s0.wp.com/latex.php?latex=%5Coverrightarrow%7Bv%7D_%7Bi%7D%28t%29+&amp;bg=ffffff&amp;fg=444444&amp;s=0" alt="\overrightarrow{v}_{i}(t) " title="\overrightarrow{v}_{i}(t) " class="latex"/>, which is the current velocity at time <img src="https://s0.wp.com/latex.php?latex=t+&amp;bg=ffffff&amp;fg=444444&amp;s=0" alt="t " title="t " class="latex"/>. The second term is <img src="https://s0.wp.com/latex.php?latex=c_%7B1%7D+%5Cphi_%7B1%7D+%28%5Coverrightarrow%7Bp%7D_%7Bi%7D%28t%29+-+%5Coverrightarrow%7Bx%7D_%7Bi%7D%28t%29%29+&amp;bg=ffffff&amp;fg=444444&amp;s=0" alt="c_{1} \phi_{1} (\overrightarrow{p}_{i}(t) - \overrightarrow{x}_{i}(t)) " title="c_{1} \phi_{1} (\overrightarrow{p}_{i}(t) - \overrightarrow{x}_{i}(t)) " class="latex"/>. The <img src="https://s0.wp.com/latex.php?latex=c_%7B1%7D+&amp;bg=ffffff&amp;fg=444444&amp;s=0" alt="c_{1} " title="c_{1} " class="latex"/> term is a positive constant called the coefficient of the self-recognition component. The <img src="https://s0.wp.com/latex.php?latex=%5Cphi_%7B1%7D+&amp;bg=ffffff&amp;fg=444444&amp;s=0" alt="\phi_{1} " title="\phi_{1} " class="latex"/> and <img src="https://s0.wp.com/latex.php?latex=%5Cphi_%7B2%7D+&amp;bg=ffffff&amp;fg=444444&amp;s=0" alt="\phi_{2} " title="\phi_{2} " class="latex"/> factors are uniformly distributed random numbers in [0,1]. The <img src="https://s0.wp.com/latex.php?latex=%5Coverrightarrow%7Bp%7D_%7Bi%7D%28t%29+&amp;bg=ffffff&amp;fg=444444&amp;s=0" alt="\overrightarrow{p}_{i}(t) " title="\overrightarrow{p}_{i}(t) " class="latex"/> vector value is the particle’s best position found so far. The <img src="https://s0.wp.com/latex.php?latex=%5Coverrightarrow%7Bx%7D_%7Bi%7D%28t%29+&amp;bg=ffffff&amp;fg=444444&amp;s=0" alt="\overrightarrow{x}_{i}(t) " title="\overrightarrow{x}_{i}(t) " class="latex"/> vector value is the particle’s current position. 
The third term in the velocity update equation is <img src="https://s0.wp.com/latex.php?latex=c_%7B2%7D+%5Cphi_%7B2%7D+%28%5Coverrightarrow%7Bp%7D_%7Bg%7D%28t%29+-+%5Coverrightarrow%7Bx%7D_%7Bi%7D%28t%29%29+&amp;bg=ffffff&amp;fg=444444&amp;s=0" alt="c_{2} \phi_{2} (\overrightarrow{p}_{g}(t) - \overrightarrow{x}_{i}(t)) " title="c_{2} \phi_{2} (\overrightarrow{p}_{g}(t) - \overrightarrow{x}_{i}(t)) " class="latex"/>. The <img src="https://s0.wp.com/latex.php?latex=c_%7B2%7D+&amp;bg=ffffff&amp;fg=444444&amp;s=0" alt="c_{2} " title="c_{2} " class="latex"/> factor is a constant called the coefficient of the social component. The <img src="https://s0.wp.com/latex.php?latex=%5Coverrightarrow%7Bp%7D_%7Bg%7D%28t%29+&amp;bg=ffffff&amp;fg=444444&amp;s=0" alt="\overrightarrow{p}_{g}(t) " title="\overrightarrow{p}_{g}(t) " class="latex"/> vector value is the best known position found by any particle in the swarm so far. Once the new velocity,  <img src="https://s0.wp.com/latex.php?latex=%5Coverrightarrow%7Bv%7D_%7Bi%7D%28t%2B1%29+&amp;bg=ffffff&amp;fg=444444&amp;s=0" alt="\overrightarrow{v}_{i}(t+1) " title="\overrightarrow{v}_{i}(t+1) " class="latex"/>, has been determined, it’s used to compute the new particle position <img src="https://s0.wp.com/latex.php?latex=%5Coverrightarrow%7Bx%7D_%7Bi%7D%28t%2B1%29+&amp;bg=ffffff&amp;fg=444444&amp;s=0" alt="\overrightarrow{x}_{i}(t+1) " title="\overrightarrow{x}_{i}(t+1) " class="latex"/>. The term <img src="https://s0.wp.com/latex.php?latex=%5Coverrightarrow%7Bv%7D_%7Bi%7D+&amp;bg=ffffff&amp;fg=444444&amp;s=0" alt="\overrightarrow{v}_{i} " title="\overrightarrow{v}_{i} " class="latex"/> is limited to the range <img src="https://s0.wp.com/latex.php?latex=%5Cpm+%5Coverrightarrow%7Bv%7D_%7Bmax%7D+&amp;bg=ffffff&amp;fg=444444&amp;s=0" alt="\pm \overrightarrow{v}_{max} " title="\pm \overrightarrow{v}_{max} " class="latex"/>. 
If a velocity component exceeds this limit, it is set to the limit.</p> <p>PSO does not require many parameters to be initialized, but the choice of parameters can have a large impact on optimization performance and has been the subject of much research. The number of particles is an important factor. For most practical applications, a good choice is typically in the range [20,40]. Often as few as 10 particles are sufficient to get good results, while for difficult problems the number can be increased to 100 or 200. The parameters <img src="https://s0.wp.com/latex.php?latex=c_%7B1%7D+&amp;bg=ffffff&amp;fg=444444&amp;s=0" alt="c_{1} " title="c_{1} " class="latex"/> and <img src="https://s0.wp.com/latex.php?latex=c_%7B2%7D+&amp;bg=ffffff&amp;fg=444444&amp;s=0" alt="c_{2} " title="c_{2} " class="latex"/>, the coefficients of the self-recognition and social components, are critical for the convergence of the PSO algorithm. Fine-tuning these learning factors aids faster convergence and helps avoid local minima. A common choice for both parameters is 2.</p> <p>Below is the full PSO algorithm shown as a flow chart:</p> <p><a href="http://storage.ning.com/topology/rest/1.0/file/get/2808359494?profile=original" target="_self"><img src="http://storage.ning.com/topology/rest/1.0/file/get/2808359494?profile=original" class="align-center" width="470"/></a><strong>References:</strong></p> <p> Lorenzo, Pablo Ribalta, et al. "Particle swarm optimization for hyper-parameter selection in deep neural networks." <i>Proceedings of the Genetic and Evolutionary Computation Conference</i>. ACM, 2017.</p> <p> Gudise, Venu G., and Ganesh K. Venayagamoorthy. "Comparison of particle swarm optimization and backpropagation as training algorithms for neural networks." <i>Swarm Intelligence Symposium, 2003. SIS'03. Proceedings of the 2003 IEEE</i>. 
IEEE, 2003.</p> <p> Singh, Garima, and Kusum Deep. "Role of Particle Swarm Optimization in Computer Games." <i>Proceedings of Fourth International Conference on Soft Computing for Problem Solving</i>. Springer, New Delhi, 2015.</p> <p> Tambouratzis, George. "Applying PSO to natural language processing tasks: Optimizing the identification of syntactic phrases." <i>Evolutionary Computation (CEC), 2016 IEEE Congress on</i>. IEEE, 2016.</p> <p> Chuang, Li-Yeh, Chih-Jen Hsiao, and Cheng-Hong Yang. "Chaotic particle swarm optimization for data clustering." <i>Expert systems with Applications</i> 38.12 (2011): 14555-14563.</p> <p> Xue, Bing, Mengjie Zhang, and Will N. Browne. "Particle swarm optimization for feature selection in classification: A multi-objective approach." <i>IEEE transactions on cybernetics</i> 43.6 (2013): 1656-1671.</p> <p></p> Probabilistic Forecasting: Learning Uncertainty tag:www.datasciencecentral.com,2018-03-15:6448529:BlogPost:704411 2018-03-15T21:00:00.000Z Kostas Hatalis https://www.datasciencecentral.com/profile/KostasHatalis <p></p> <p>The majority of industry and academic numeric predictive projects deal with deterministic or <strong>point forecasts</strong> of expected values of a random variable given some conditional information. In some cases, these predictions are enough for decision making. However, these predictions don’t say much about the uncertainty of your underlying stochastic process. 
A common desire of all data scientists is to make predictions for an uncertain future. Clearly then, forecasts should be probabilistic, i.e., they should take the form of probability distributions over future quantities or events. This form of prediction is known as <strong>probabilistic forecasting</strong> and in the last decade has seen a surge in popularity. Recent evidence of this is the 2014 and 2017 Global Energy Forecasting Competitions (GEFCom). GEFCom2014 focused on producing multiple quantile forecasts for wind, solar, load, and electricity prices, and GEFCom2017 focused on hierarchical rolling probabilistic forecasts of load. More recently, the M4 Competition aims to produce point forecasts of 100,000 time series and, for the first time, optionally accepts prediction interval forecasts as well.</p> <p>So, what are probabilistic forecasts exactly? In a nutshell, they try to quantify the uncertainty in a prediction, which can be an essential ingredient for optimal decision making. Probabilistic forecasting comes in three main flavors: the estimation of quantiles, prediction intervals, and full density functions. The general goal of these predictions is to maximize the sharpness of the predictive distributions, subject to calibration. <strong>Calibration</strong> refers to the statistical consistency between the distributional forecasts and the observations and is a joint property of the predictions and the observed values. <strong>Sharpness</strong> refers to the concentration of the predictive distributions and is a property of the forecasts only.</p> <p>In more formal terms, probabilistic forecasts can be defined as follows. For a random variable <em>Y_t</em> at time <em>t</em>, denote its probability density function by <em>f_t</em> and its cumulative distribution function by <em>F_t</em>. 
If <em>F_t</em> is strictly increasing, the quantile <em>q(t, τ)</em> with proportion <em>τ</em> ∈ [0,1] of the random variable <em>Y_t</em> is uniquely defined as the value <em>x</em> such that <em>P(Y_t &lt; x) = τ</em>, or equivalently as the inverse of the distribution function, <em>q(t, τ) = F_t^{-1}(τ)</em>. A quantile forecast <em>q(t+k, τ)</em> with nominal proportion <em>τ</em> is an estimate of the true quantile for the lead time <em>t+k</em>, given predictor values. Prediction intervals then give a range of possible values within which an observed value is expected to lie with a certain probability. A prediction interval produced at time <em>t</em> for future horizon <em>t+k</em> is defined by its lower and upper bounds, which are the quantile forecasts <em>q(t+k, τ_l)</em> and <em>q(t+k, τ_u)</em>. Below is an example of prediction interval forecasts on the popular Air Passengers time series. The forecasts are produced by a SARIMA model assuming a normal density:<a href="http://storage.ning.com/topology/rest/1.0/file/get/2808355561?profile=original" target="_self"><img width="750" src="http://storage.ning.com/topology/rest/1.0/file/get/2808355561?profile=RESIZE_1024x1024" class="align-center" width="660" height="376"/></a>When it is assumed the future density function will take a certain form, this is called <strong>parametric</strong> probabilistic forecasting. For instance, if a process is assumed to be Gaussian then all we must do is estimate the future mean and variance of that process. If no assumption is made about the shape of the distribution, a <strong>nonparametric</strong> probabilistic forecast can be made of the density function. This can be done either by forecasting a finite set of quantiles with chosen nominal proportions spread over the unit interval, most commonly with quantile regression, or through direct distribution estimation methods such as kernel density estimation. 
For most stochastic processes, from renewable energy production, to online sales, to disease propagation, it is often hard to say whether they come from a specific distribution, which makes nonparametric probabilistic forecasting a more reasonable choice.</p>
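<p>The difference between the parametric and nonparametric approaches can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the method behind the figure above: the sample is synthetic (drawn from an exponential distribution as a stand-in for a skewed, non-negative process), and the helper names <code>gaussian_interval</code>, <code>empirical_quantile</code>, and <code>empirical_interval</code> are hypothetical. The parametric version estimates a mean and standard deviation and inverts a normal CDF; the nonparametric version reads the quantiles directly off the sorted sample.</p>

```python
import random
from statistics import NormalDist, mean, stdev


def gaussian_interval(samples, tau_l=0.05, tau_u=0.95):
    """Parametric interval: assume a Gaussian, estimate its mean and
    standard deviation, then invert the normal CDF at tau_l and tau_u."""
    dist = NormalDist(mean(samples), stdev(samples))
    return dist.inv_cdf(tau_l), dist.inv_cdf(tau_u)


def empirical_quantile(samples, tau):
    """Nonparametric quantile: linear interpolation between order statistics."""
    ordered = sorted(samples)
    pos = tau * (len(ordered) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    frac = pos - lo
    return ordered[lo] * (1 - frac) + ordered[hi] * frac


def empirical_interval(samples, tau_l=0.05, tau_u=0.95):
    """Nonparametric interval: no assumption about the distribution's shape."""
    return empirical_quantile(samples, tau_l), empirical_quantile(samples, tau_u)


# Synthetic, skewed, non-negative sample (an assumption for illustration only).
rng = random.Random(0)
sample = [rng.expovariate(1.0) for _ in range(2000)]

gauss_lo, gauss_hi = gaussian_interval(sample)
emp_lo, emp_hi = empirical_interval(sample)

print("Gaussian  90% interval:", round(gauss_lo, 3), round(gauss_hi, 3))
print("Empirical 90% interval:", round(emp_lo, 3), round(emp_hi, 3))
```

<p>On skewed data like this, the lower bound of the Gaussian interval typically falls below zero even though no observation can, while the empirical interval always stays within the observed range. That mismatch is one reason to prefer a nonparametric forecast when the shape of the distribution is unknown.</p>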