Vincent Granville's Posts - Data Science Central
2021-06-14T01:18:40Z
Vincent Granville
https://www.datasciencecentral.com/profile/VincentGranville
The Machine Learning Process in 7 Steps
tag:www.datasciencecentral.com,2021-06-13:6448529:BlogPost:1053382
2021-06-13T04:00:00.000Z
Vincent Granville
https://www.datasciencecentral.com/profile/VincentGranville
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/9084500862?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/9084500862?profile=RESIZE_710x" width="600" class="align-center"/></a></p>
<p></p>
<p>In this article, I describe the various steps involved in managing a machine learning process from beginning to end. Depending on which company you work for, you may or may not be involved in all the steps. In larger companies, you typically focus on one or two specialized aspects of a project. In small companies, you may be involved in all the steps. Here the focus is on large projects, such as developing a taxonomy, as opposed to ad-hoc or one-time analyses. I also mention all the people involved, besides machine learning professionals.</p>
<p><span style="font-size: 14pt;"><strong>Steps involved in machine learning projects</strong></span></p>
<p>In chronological order, here are the main steps. Sometimes it is necessary to recognize errors in the process, move back, and start again at an earlier step. This is by no means a linear process; it is closer to trial-and-error experimentation.</p>
<p><strong>1</strong>. <strong>Defining the problem</strong> and the metrics (also called features) that we want to track. Assessing the data available (internal and third-party sources) or the databases that need to be created, as well as the database architecture for optimal storage and processing. Discuss the cloud architectures to choose from, data volume (potential future scaling issues), and data flows. Do we need real-time data? How much can safely be outsourced? Do we need to hire staff? Discuss costs, ROI, vendors, and the timeframe. Decision makers and business analysts are heavily involved, and data scientists and engineers may participate in the discussion.</p>
<p><strong>2. Defining goals</strong> and types of analyses to be performed. Can we monetize the data? Are we going to use the data for segmentation, customer profiling and better targeting, to optimize some processes such as pricing or supply chain, for fraud detection, taxonomy creation, to increase sales, for competitive or marketing intelligence, or to improve user experience for instance via a recommendation engine or better search capacities? What are the most relevant goals? Who will be the main users?</p>
<p><strong>3</strong>. <strong>Collecting the data</strong>. Assessing who has access to the data (and which parts of it, such as summary tables versus live databases), and in what capacity. Privacy and security issues are also discussed here. The IT team, legal team, and data engineers are typically involved. Dashboard design is also discussed, with the goal of producing good dashboards for end users such as decision makers, the product or marketing team, or customers. </p>
<p><strong>4. Exploratory data analysis</strong>. Here data scientists are more heavily involved, though this step should be automated as much as possible. You need to detect missing data and decide how to handle it (using imputation methods), identify outliers and what they mean, summarize and visualize the data, find erroneously coded data and duplicates, find correlations, perform preliminary analyses, and find the best predictive features and optimum binning techniques (see section 4 <a href="https://www.datasciencecentral.com/profiles/blogs/decomposition-of-statistical-distributions-using-mixture-models-a" target="_blank" rel="noopener">in this article</a>). This could lead to the discovery of data flaws, and may force you to revisit and start again from a previous step to fix any significant issue.</p>
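<p>As an illustration, the basic checks listed above (missing data and imputation, duplicates, outliers, correlations) can be sketched in a few lines of Python with pandas. The dataset and column names below are made up for the example, and the thresholds are arbitrary:</p>

```python
import numpy as np
import pandas as pd

# Toy dataset standing in for real business data (hypothetical columns)
df = pd.DataFrame({
    "revenue": [120.0, 135.0, np.nan, 140.0, 5000.0, 128.0],
    "visits":  [10.0, 12.0, 11.0, np.nan, 13.0, 12.0],
})

# Missing data: count it, then impute with the column median
missing = df.isna().sum()
df_imp = df.fillna(df.median())

# Duplicate rows
n_dupes = int(df.duplicated().sum())

# Crude outlier flag: more than 2 standard deviations from the mean
z = (df_imp - df_imp.mean()) / df_imp.std()
outliers = (z.abs() > 2).any(axis=1)

# Correlations between candidate features
corr = df_imp.corr()

print(missing.to_dict())             # {'revenue': 1, 'visits': 1}
print(n_dupes, int(outliers.sum()))  # 0 1
```

In a real project each of these checks would be more elaborate (and, as noted above, automated), but the structure is the same: flag, inspect, then fix or impute.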
<p><strong>5. The true machine learning / modeling step</strong>. At this point, we assume that the data collected is stable enough, and can be used for its original purpose. Predictive models are tested; neural networks or other algorithms / models are trained, with goodness-of-fit tests and cross-validation. The data is available for various analyses, such as post-mortems, fraud detection, or proofs of concept. Algorithms are prototyped, automated, and eventually implemented in production mode. Output data is stored in auxiliary tables for further use, such as email alerts or populating dashboards. External data sources may be added and integrated. By this point, major data issues have been fixed.</p>
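<p>As a minimal illustration of the training and cross-validation part of this step, here is a Python sketch using scikit-learn on synthetic data (the model and dataset are placeholders for whatever your project actually uses, not a recommendation):</p>

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the stable, production-ready dataset
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Prototype model, validated with 5-fold cross-validation
model = LogisticRegression(max_iter=1000)
scores = cross_val_score(model, X, y, cv=5)
print(round(scores.mean(), 3))  # average out-of-fold accuracy
```

The cross-validated score, not the training score, is what you would monitor before promoting a prototype to production mode.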
<p><strong>6. Creation of the end-user platform</strong>. Typically, this comes as dashboards featuring visualizations and summary data that can be exported in standardized formats, including spreadsheets. This provides the insights that decision makers can act upon. The platform can be used for A/B testing. It can also come as a system of email alerts sent to decision makers, customers, or anyone who needs to be informed.</p>
<p><strong>7. Maintenance</strong>. The models need to be adapted to changing data, changing patterns, or changing definitions of core metrics. Some satellite database tables must be updated, for instance every six months. Maybe a new platform able to store more data is needed, and data migration must be planned. Audits are performed to keep the system sound. New metrics may be introduced as new sources of data are collected. Old data may be archived. By now we should have a good idea of the long-term yield (ROI) of the project, what works well, and what needs to be improved. </p>
<p></p>
<p><span><em>To receive a weekly digest of our new articles, subscribe to our newsletter, <a href="https://www.datasciencecentral.com/profiles/blogs/check-out-our-dsc-newsletter" target="_blank" rel="noopener">here</a>.</em></span></p>
<p><span><em><strong>About the author</strong>: Vincent Granville is a data science pioneer, mathematician, book author (Wiley), patent owner, former post-doc at Cambridge University, former VC-funded executive, with 20+ years of corporate experience including CNET, NBC, Visa, Wells Fargo, Microsoft, eBay. Vincent is also self-publisher at <a href="http://datashaping.com/" target="_blank" rel="noopener">DataShaping.com</a>, and founded and co-founded a few start-ups, including one with a successful exit (Data Science Central acquired by Tech Target). You can access Vincent's articles and books <a href="http://datashaping.com/" target="_blank" rel="noopener">here</a>.</em></span></p>
The Pros and Cons of Working for a Startup
tag:www.datasciencecentral.com,2021-06-04:6448529:BlogPost:1052488
2021-06-04T03:30:00.000Z
Vincent Granville
https://www.datasciencecentral.com/profile/VincentGranville
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/9032472853?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/9032472853?profile=RESIZE_710x" width="600" class="align-center"/></a></p>
<p></p>
<p>As a machine learning professional, I have worked for several startups ranging from zero to 600 employees, as well as companies such as eBay, Wells Fargo, Visa and Microsoft. Here I share my experience. A brief summary can be found in my conclusions, at the bottom of this article.</p>
<p>It is not easy to define what a startup is. The first one I worked for was NBCi, a spinoff of CNET; it had 600 employees almost on day one, and nearly half a billion dollars in funding from GE. The pay was not great, especially for San Francisco. I had stock options, but the company went the way many startups go: it was shut down after two years when the Internet bubble popped, so I was only able to cash out one year's worth of salary from my stock options. Still not bad, but a far cry from what most people imagine. I was essentially the only statistician in the company, though they had a big IT team with many data engineers, and were collecting a lot of data. I quickly learned that my best allies were in the IT department, and I was the bridge between IT and the marketing department. I was probably the only "neutral" employee who could talk to both departments, as they were at war with each other (my boss was the head of the marketing department). I also interacted a lot with the sales, product, and finance teams, and with executives. I really liked that situation, and the high turnover allowed me to work with many new people (thus new connections and friends) on a regular basis, and on many original projects. The drawback: I was the only statistician. It was not an issue for me.</p>
<p>When people think about startups, many think about a company starting from scratch, with 20 employees, and funded with VC money. I also experienced that situation, and again, I was the only statistician (actually chief scientist and co-founder), though we also had a strong IT team. It lasted a few years, until the 2008 crash; I had a great salary, and great stock options that never materialized. But they eventually bought one of my patents. I was hired as co-founder because I was (back then) the top expert in my field: click fraud detection, and scoring Internet traffic for advertisers and publishers. Again, I was the only machine learning guy, and not involved with live databases other than to set the rules and analyze the data, and to conceptually design the dashboard platform for our customers. I interacted with various people from various small teams, occasionally even with clients, prototyping solutions and working on proofs of concept - some helped us win a few large customers. I was in all the big meetings involving large, new clients, sometimes flying to the client's location. This is one of the benefits of working as a sole data scientist. Another one, especially if you have specialized, hard-to-find skills (in my case, earned by running small businesses on the side), is that I could work remotely, from home. </p>
<p>Yet another startup, the last one I co-founded, structured as an S-corp, had zero employees, no payroll, no funding, no CEO, and no office or headquarters (the official address, needed for tax purposes, was listed as my home address). It had no home-made Internet platform or database: this was inexpensively outsourced. We were working with people in different countries; our IT team (a one-man operation) was in Eastern Europe. This is the one that was acquired recently by a tech publisher, and my most successful exit. It still continues to grow very nicely today, despite (or thanks to) Covid. It started bare-bones, unlike the other ones, making its survival more likely, with 50% profit margins. However, people working with us were well paid, offered a lot of flexibility, and of course everyone was always working from home. We only met face-to-face when visiting a client. No stock options were ever issued; I made money in a different way. I was interacting mostly with sales, while also contributing content and automatically growing our membership using proprietary techniques of my own that outsmarted all the competitors.</p>
<p>As for the big companies I worked for, I will say this. At Wells Fargo, I was part of a small group (about 100 people) with an open office, a relatively flat hierarchy, and all the feel of working for a startup. I was told that this was a special experiment that Wells Fargo reluctantly tried in order to hire a different type of talent; it is unusual to find such a working environment there. By contrast, Visa looked more like a big corporation, with many machine learning people each working on very specialized tasks, and a heavier hierarchy. Still, I loved the place, and it really helped grow my career. The data sets were quite big, which pleased me. One of the benefits of working for such a company is the career opportunities it provides. Finally, it is possible to work for a startup within a big company, in what is called a corporate startup. My first example, NBCi, illustrates this concept; in the end I was indirectly working for GE or NBC, and even met with the GE auditing team and their six-sigma philosophy. Many of the folks they brought to the company were actually GE and NBC internal employees. </p>
<p><strong>Conclusion</strong></p>
<p>Finding a job at a startup may be easier than applying for positions at big companies. If you have solid expertise, the salary might even be better. Stock options could prove to be elusive. The job is usually more flexible and requires creativity; you might be the only machine learning employee in the company, interacting with various teams and even with clients. Projects can potentially be more varied and interesting, and the environment is usually fast-paced. Working from home is usually an option. You may report directly to the CEO; the hierarchy is typically less heavy. It requires adaptation and may not be a good fit for everyone. You can also work for a startup within a big corporation: it is called a corporate startup. Working for a big company may be a better move for your career, especially if your plan is to work for big companies in the future. Of course, startups also try to attract talent from big companies. </p>
Simple Introduction to Public-Key Cryptography and Cryptanalysis: Illustration with Random Permutations
tag:www.datasciencecentral.com,2021-06-02:6448529:BlogPost:1052064
2021-06-02T04:30:00.000Z
Vincent Granville
https://www.datasciencecentral.com/profile/VincentGranville
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/9022337069?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/9022337069?profile=RESIZE_710x" width="720" class="align-full"/></a></p>
<p></p>
<p>In this article, I illustrate the concept of an asymmetric key with a simple example. Rather than discussing algorithms such as RSA (still widely used, for instance to set up secure websites), I focus on a system that is easier to understand, based on random permutations. I discuss how to generate these random permutations and compound them, and how to enhance such a system using steganography techniques. I also explain why permutation-based cryptography is not good for public-key encryption. In particular, I show how such a system can be reverse-engineered, no matter how sophisticated it is, using cryptanalysis methods. This article also features some nontrivial, interesting asymptotic properties of permutations (usually not taught in math classes) as well as the connection with a specific class of matrices, using simple English rather than advanced math, so that this article can be understood by a wide audience.</p>
<p><span style="font-size: 14pt;"><strong>1. Description of my public key encryption system</strong></span></p>
<p>Here <em>x</em> is the original message created by the sender, and <em>y</em> is the encrypted version that the receiver gets. The original message can be described as a sequence of bits (zeros and ones). This is the format in which it is internally encoded on a computer or when traveling through the Internet, be it encrypted or not, as computers only deal with bits (we are not talking about quantum computers or the quantum Internet here, which operate differently). </p>
<p>The general system can be broken down into three main components:</p>
<ul>
<li>Pre-processing: blurring the message to make it appear like random noise</li>
<li>Encryption via bit-reshuffling </li>
<li>Decryption</li>
</ul>
<p>We now explain these three steps. Note that the whole system processes information by blocks, each block (say 2048 bits) being processed separately.</p>
<p><strong>1.1. Blurring the message</strong></p>
<p>This step consists of adding random bits at the end of each block (sometimes referred to as <em>padding</em>), then performing a XOR to further randomize the message. The bits to be added consist of zeroes and ones in such a proportion that the resulting, extended block has roughly 50 percent zeroes and 50 percent ones. For instance, if the original block contains 2048 bits, the extended block may contain up to 4096 bits.</p>
<p>Then, use a random string of bits, for instance 4096 binary digits of the square root of two, and do a bitwise XOR (see <a href="https://en.wikipedia.org/wiki/Exclusive_or" target="_blank" rel="noopener">here</a>) with the 4096 bits obtained in the previous step. The resulting bit string is the input for the next step. </p>
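<p>Here is a minimal Python sketch of this pre-processing step. Two simplifications of mine: the padding is uniformly random rather than balanced to exactly 50 percent, and the mask is a fixed pseudo-random bit string standing in for the binary digits of the square root of two:</p>

```python
import random

BLOCK = 2048      # bits in the original block
EXTENDED = 4096   # bits after padding

# Fixed shared mask; a stand-in for 4096 binary digits of sqrt(2)
_mask_rng = random.Random(2)
MASK = [_mask_rng.randrange(2) for _ in range(EXTENDED)]

def blur(bits):
    """Pad the block to EXTENDED bits with random bits, then XOR with MASK."""
    rng = random.Random()
    padded = bits + [rng.randrange(2) for _ in range(EXTENDED - len(bits))]
    return [b ^ m for b, m in zip(padded, MASK)]

def unblur(blurred, original_len=BLOCK):
    """Reverse the step: XOR with the same MASK, then drop the padding."""
    return [b ^ m for b, m in zip(blurred, MASK)][:original_len]

msg = [random.randrange(2) for _ in range(BLOCK)]
assert unblur(blur(msg)) == msg   # round trip recovers the original block
```

Since XOR is its own inverse, the same mask both blurs and un-blurs the block; only the padding needs to be stripped afterwards.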
<p><strong>1.2. Actual encryption step</strong></p>
<p>The block to be encoded is still denoted as <em>x</em>, though it is assumed to be the output of the previous step discussed in section 1.1, not part of the original message. The encryption step transforms <em>x</em> into <em>y</em>, and the general transformation can be described by</p>
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/9021486485?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/9021486485?profile=RESIZE_710x" width="300" class="align-center"/></a></p>
<p>Here * is an <a href="https://en.wikipedia.org/wiki/Associative_property" target="_blank" rel="noopener">associative operator</a>, typically matrix multiplication or the <a href="https://en.wikipedia.org/wiki/Function_composition" target="_blank" rel="noopener">composition operator</a> between two functions, the latter usually denoted as o, as in (<em>f</em> o <em>g</em>)(<em>x</em>) = <em>f</em>(<em>g</em>(<em>x</em>)). The transforms <em>K</em> and <em>L</em> can be seen as <a href="https://en.wikipedia.org/wiki/Permutation_matrix" target="_blank" rel="noopener">permutation matrices</a>. In our case they are actual permutations whose purpose is to reshuffle the bits of <em>x</em>, but permutations can be represented by matrices. The crucial element here is that <em>L</em> * <em>K</em> = <em>L</em>^<em>n</em> = <em>I</em> (that is, <em>L</em> at the power <em>n</em> is the identity operator): this allows us to easily decrypt the message. Indeed, <em>x</em> = <em>L</em> * <em>y</em>. We need to be very careful in our choice of <em>L</em>, so that the smallest <em>n</em> satisfying <em>L</em>^<em>n</em> = <em>I</em> is very large. More on this in section 2. This is related to the mathematical theory of finite groups, but the reader does not need to be familiar with <a href="https://en.wikipedia.org/wiki/Group_theory" target="_blank" rel="noopener">group theory</a> to understand the concept. It is enough to know that permutations can be multiplied (composed), raised to any power, or inverted, just like matrices. More about this can be found <a href="https://en.wikipedia.org/wiki/Permutation_group" target="_blank" rel="noopener">here</a>.</p>
<p>That said, the public and private keys are:</p>
<ul>
<li><strong>Public key</strong>: <em>K</em> (this is all the sender needs to know to encrypt the block <em>x</em> as <em>y</em> = <em>K</em> * <em>x</em>)</li>
<li><strong>Private keys</strong>: <em>n</em> and <em>L</em> (kept secret by the recipient); the decrypted block is <em>x</em> = <em>L</em> * <em>y</em></li>
</ul>
<p><strong>1.3. Decryption step</strong></p>
<p>I explained how to retrieve the block <em>x</em> in section 1.2 when you actually receive <em>y</em>. Once a block is decrypted, you still need to reverse the step described in section 1.1. This is accomplished by applying to <em>x</em> the same XOR as in section 1.1, then by removing the padding (the extra bits that were added to pre-process the message).</p>
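<p>Putting sections 1.2 and 1.3 together, here is a toy sketch in Python. It is my own illustration, using 0-indexed permutations, a tiny block of 16 bits, and omitting the blurring step of section 1.1: it generates a private permutation <em>L</em>, derives the public key <em>K</em> = <em>L</em>^(<em>n</em>-1), encrypts, and decrypts.</p>

```python
import math
import random

def compose(p, q):
    """Permutation composition: (p * q)[i] = p[q[i]] (apply q first, then p)."""
    return [p[q[i]] for i in range(len(p))]

def perm_power(p, e):
    """Compute p^e by repeated squaring (fast even for very large e)."""
    result, base = list(range(len(p))), p   # start from the identity
    while e:
        if e & 1:
            result = compose(base, result)
        base = compose(base, base)
        e >>= 1
    return result

def order(p):
    """Order of p = least common multiple of its cycle lengths."""
    seen, n = [False] * len(p), 1
    for i in range(len(p)):
        if not seen[i]:
            length, j = 0, i
            while not seen[j]:
                seen[j], j, length = True, p[j], length + 1
            n = n * length // math.gcd(n, length)
    return n

def apply_perm(p, bits):
    """Reshuffle bits: the bit at position i moves to position p[i]."""
    out = [0] * len(bits)
    for i, b in enumerate(bits):
        out[p[i]] = b
    return out

m = 16
rng = random.Random(1)
L = list(range(m)); rng.shuffle(L)   # private key L
n = order(L)                         # private: smallest n with L^n = identity
K = perm_power(L, n - 1)             # public key K, so that L * K = L^n = I

x = [rng.randrange(2) for _ in range(m)]   # (already pre-processed) block
y = apply_perm(K, x)                       # sender encrypts with K
assert apply_perm(L, y) == x               # recipient decrypts with L
```

Applying <em>K</em> and then <em>L</em> moves each bit through <em>L</em>^<em>n</em> = <em>I</em>, which is exactly why <em>x</em> = <em>L</em> * <em>y</em> in the text above.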
<p><span style="font-size: 14pt;"><strong>2. About the random permutations</strong></span></p>
<p>Many algorithms are available to reshuffle the bits of <em>x</em>; see for instance <a href="https://en.wikipedia.org/wiki/Fisher%E2%80%93Yates_shuffle" target="_blank" rel="noopener">here</a>. Our focus is to explain the simplest one, and to discuss some interesting background about permutations, in order to reverse-engineer our encryption system (see section 3).</p>
<p><strong>2.1. Permutation algebra: basics</strong></p>
<p>Let's begin with basic definitions. A permutation <em>L</em> of <em>m</em> elements can be represented by an <em>m</em>-dimensional vector. For instance, <em>L</em> = (5, 4, 1, 2, 3) means that the first element of your bitstream is moved to position 5, the second one to position 4, the third one to position 1, and so forth. This can be written as <em>L</em>(1) = 5, <em>L</em>(2) = 4, <em>L</em>(3) = 1, <em>L</em>(4) = 2, and <em>L</em>(5) = 3. Now the square of <em>L</em> is simply <em>L</em>(<em>L</em>), and the <em>n</em>-th power is <em>L</em>(<em>L</em>(...<em>L</em>...)), where <em>L</em> appears <em>n</em> times in that expression. The <strong><em>order</em></strong> of a permutation (see <a href="http://mathonline.wikidot.com/the-order-of-a-permutation" target="_blank" rel="noopener">here</a>) is the smallest <em>n</em> such that <em>L</em>^<em>n</em> is the identity permutation.</p>
<p>Each permutation is made up of a number of usually small sub-cycles, themselves treated as sub-permutations. For instance, in our example, <em>L</em>(1) = 5, <em>L</em>(5) = 3, <em>L</em>(3) = 1. This constitutes a sub-cycle of length 3. The other cycle, of length 2, is <em>L</em>(2) = 4, <em>L</em>(4) = 2. To compute the order of a permutation, compute the orders of each sub-cycle. The least common multiple of these orders is the order of your permutation. The successive powers of a permutation have the same sub-cycle structure. As a result, if <em>K</em> is a power of <em>L</em>, and <em>L</em> has order <em>n</em>, then both <em>L</em>^<em>n</em> and <em>K</em>^<em>n</em> are the identity permutation. This fact is of crucial importance to reverse-engineer this encryption system. </p>
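<p>To make this concrete, here is the cycle decomposition and order of the example permutation <em>L</em> = (5, 4, 1, 2, 3), in a short Python sketch (positions shifted to 0-indexing):</p>

```python
from math import gcd

def cycles(p):
    """Decompose a 0-indexed permutation into its sub-cycles."""
    seen, out = [False] * len(p), []
    for i in range(len(p)):
        if not seen[i]:
            c, j = [], i
            while not seen[j]:
                seen[j] = True
                c.append(j)
                j = p[j]
            out.append(c)
    return out

def order(p):
    """Order of the permutation = lcm of its cycle lengths."""
    n = 1
    for c in cycles(p):
        n = n * len(c) // gcd(n, len(c))
    return n

# The example from the text, L = (5, 4, 1, 2, 3), shifted to 0-indexing
L = [4, 3, 0, 1, 2]
print(cycles(L))  # [[0, 4, 2], [1, 3]]: one 3-cycle and one 2-cycle
print(order(L))   # lcm(3, 2) = 6
```

The two cycles match the sub-cycles of lengths 3 and 2 identified above, and the order is their least common multiple, 6.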
<p>Finally, the power of a permutation can be computed very fast, using the <a href="https://en.wikipedia.org/wiki/Exponentiation_by_squaring" target="_blank" rel="noopener">exponentiation by squaring algorithm</a>, applied to permutations. Thus even if the order <em>n</em> is very large, it is easy to compute <em>K</em> (the public key). Unfortunately, the same algorithm can be used by a hacker to discover the private key <em>L</em>, and the order <em>n</em> (kept secret) of the permutation in question, once she has discovered the sub-cycles of <em>K</em> (which is easy to do, as illustrated in my example). For the average length of a sub-cycle in a random permutation, see <a href="https://math.stackexchange.com/questions/1409862/average-length-of-a-cycle-in-a-n-permutation" target="_blank" rel="noopener">this article</a>.</p>
<p><strong>2.2. Main asymptotic result</strong></p>
<p>The expected order <em>n</em> of a random permutation of length <em>m</em> (that is, when reshuffling <em>m</em> bits) is</p>
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/9022071859?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/9022071859?profile=RESIZE_710x" width="300" class="align-center"/></a></p>
<p>For details, see <a href="https://en.wikipedia.org/wiki/Random_permutation_statistics#Order_of_a_random_permutation" target="_blank" rel="noopener">here</a>. For instance, if <em>m</em> = 4,096, then <em>n</em> is approximately equal to 6 x 10^10. If <em>m</em> = 65,536, then <em>n</em> is approximately equal to 2 x 10^37. It is possible to add many bits, all equal to zero, to the block being encrypted, to increase its size <em>m</em> and thus <em>n</em>, without increasing the size of the encrypted message too much after compression. However, if used with a public key, this encryption system has a fundamental flaw discussed in section 3, no matter how large <em>n</em> is.</p>
<p><strong>2.3. Random permutations</strong></p>
<p>The easiest way to produce a random permutation of <em>m</em> elements is as follows.</p>
<ul>
<li>Generate <em>L</em>(1) as a pseudo random integer between 1 and <em>m</em>. If <em>L</em>(1) = 1, repeat until <em>L</em>(1) is different from 1.</li>
<li>Assume that <em>L</em>(1), ..., <em>L</em>(<em>k</em>-1) have been generated. Generate <em>L</em>(<em>k</em>) as a pseudo random integer between 1 and <em>m</em>. If <em>L</em>(<em>k</em>) is equal to one of the previous <em>L</em>(1), ..., <em>L</em>(<em>k</em>-1), or if it is equal to <em>k</em>, repeat until this is no longer the case.</li>
<li>Stop after generating the last entry, <em>L</em>(<em>m</em>).</li>
</ul>
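<p>In Python, the three steps above can be sketched as follows. Note that the condition <em>L</em>(<em>k</em>) ≠ <em>k</em> means the output has no fixed point (it is a derangement), and that the procedure can get stuck when the only value left for the last position is <em>m</em> itself; the restart in that case is my own addition:</p>

```python
import random

def random_permutation(m, rng=None):
    """Rejection sampling, as in the three steps above (1-indexed values).
    Because of the extra condition L(k) != k, no element stays in place.
    If the only value left for the last position is m itself -- a case the
    steps above can loop forever on -- we simply restart."""
    rng = rng or random.Random(0)
    while True:
        L, used = [], set()
        for k in range(1, m + 1):
            # Equivalent to redrawing random integers until one is acceptable
            candidates = [v for v in range(1, m + 1) if v != k and v not in used]
            if not candidates:
                break                   # stuck at the last position: restart
            v = rng.choice(candidates)
            L.append(v)
            used.add(v)
        else:
            return L

L = random_permutation(10)
assert sorted(L) == list(range(1, 11))           # a valid permutation of 1..10
assert all(L[k - 1] != k for k in range(1, 11))  # no fixed points
```

Choosing uniformly among the remaining valid values is equivalent in distribution to redrawing until a valid value appears, but avoids wasted draws.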
<p>I use binary digits of irrational numbers, stored in a large table, to simulate random integers, but there are better (faster) solutions. Also, the Fisher-Yates algorithm (see <a href="https://en.wikipedia.org/wiki/Random_permutation#Fisher-Yates_shuffles" target="_blank" rel="noopener">here</a>) is more efficient. </p>
<p><span style="font-size: 14pt;"><strong>3. Reverse-engineering the system: cryptanalysis</strong></span></p>
<p>To reverse-engineer my system, you need to be able to decrypt the encrypted block <em>y</em> if you only know the public key <em>K</em>, but not the private key <em>L</em> nor <em>n</em>. As discussed in section 2, the first step is to identify all the sub-cycles in the permutation <em>K</em>. This is easily done, see example in section 2.1. Once this is accomplished, compute all the orders of these sub-cycle permutations and compute the least common multiple of these orders. Again, this is easy to do, and this allows you to retrieve <em>n</em> even though it was kept secret. Now you know that <em>K</em>^<em>n</em> is the identity permutation. Compute <em>K</em> at power <em>n</em>-1, and apply this new permutation to the encrypted block <em>y</em>. Since <em>y</em> = <em>K</em> * <em>x</em>, you get the following:</p>
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/9022314263?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/9022314263?profile=RESIZE_710x" width="300" class="align-center"/></a></p>
<p>Now you've found <em>x</em>, problem solved. You can compute <em>K</em> at the power <em>n</em>-1 very fast even if <em>n</em> is very large, using the exponentiation by squaring algorithm mentioned in section 2.1. Of course you also need to undo the step discussed in section 1.1 to really fully decrypt the message, but that is another problem. The goal here was simply to break the step described in section 1.2.</p>
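<p>The full attack fits in a short, self-contained Python sketch (my own illustration, 0-indexed, ignoring the section 1.1 blurring step as discussed). The attacker sees only the public key <em>K</em> and the intercepted block <em>y</em>:</p>

```python
import math
import random

def compose(p, q):
    """(p * q)[i] = p[q[i]]: apply q first, then p."""
    return [p[q[i]] for i in range(len(p))]

def perm_power(p, e):
    """Exponentiation by squaring, applied to permutations."""
    result, base = list(range(len(p))), p
    while e:
        if e & 1:
            result = compose(base, result)
        base = compose(base, base)
        e >>= 1
    return result

def order(p):
    """Order = lcm of the cycle lengths -- easy for anyone to compute."""
    seen, n = [False] * len(p), 1
    for i in range(len(p)):
        if not seen[i]:
            length, j = 0, i
            while not seen[j]:
                seen[j], j, length = True, p[j], length + 1
            n = n * length // math.gcd(n, length)
    return n

def apply_perm(p, bits):
    """The bit at position i moves to position p[i]."""
    out = [0] * len(bits)
    for i, b in enumerate(bits):
        out[p[i]] = b
    return out

# Setup: the attacker never sees L or n, only K and the intercepted y
m = 128
rng = random.Random(7)
L = list(range(m)); rng.shuffle(L)          # private key
K = perm_power(L, order(L) - 1)             # public key
x = [rng.randrange(2) for _ in range(m)]    # plaintext block
y = apply_perm(K, x)                        # intercepted ciphertext

# Attack: read n off K's cycle structure, then undo K with K^(n-1)
n = order(K)
x_recovered = apply_perm(perm_power(K, n - 1), y)
assert x_recovered == x                     # private key never needed
```

The attack works because the "secret" <em>n</em> is written into the cycle structure of the public <em>K</em> itself, and exponentiation by squaring makes computing <em>K</em>^(<em>n</em>-1) cheap no matter how large <em>n</em> is.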
<p>In order to make a secure system, one must choose a transform <em>K</em> that is very difficult to invert, and permutations or permutation matrices (which can be hacked using the same technique) do not fit the bill. Permutation-based encryption may still be a good idea for symmetric key systems, that is, when no public key is involved.</p>
Could Machine Learning Practitioners Prove Deep Math Conjectures?
tag:www.datasciencecentral.com,2021-05-26:6448529:BlogPost:1051791
2021-05-26T04:30:00.000Z
Vincent Granville
https://www.datasciencecentral.com/profile/VincentGranville
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/8980996868?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/8980996868?profile=RESIZE_710x" width="600" class="align-center"/></a></p>
<p></p>
<p>Many of us have solid foundations in math or have an interest in learning more, and are passionate about solving difficult problems during our free time. Of course, most of us are not professional mathematicians, but we may bring some value to help solve some of the most challenging mathematical conjectures, especially the ones that can be stated in rather simple words. In my opinion, the less math-trained you are (up to some extent), the more likely you could come up with original, creative solutions. Not that we could end up proving the Riemann hypothesis or other problems of the same caliber and popularity: the short answer is no. But we might think of a different path, a potential new approach to tackle these problems, and discover new theories, models and techniques along the way, some applicable to data analysis and real business problems. And sharing our ideas with professional mathematicians could have benefits for them and for us. Working on these problems during our leisure time could also benefit our machine learning career, if anything. In this article, I elaborate on these various points.</p>
<p><strong>The less math you learned, the more creative you could be</strong></p>
<p>Of course, this is true only up to a point. You need to know much more than just high school math. When I started my PhD studies and asked my mentor whether I should attend some classes or learn material that I knew was missing from my education, his answer was no: he said that the more you learn, the more you can get stuck in one particular way of thinking, and it can hurt creativity. He meant that acquiring deep vertical knowledge too fast may not help; of course, acquiring horizontal knowledge in various relevant fields broadens your horizons and can be very useful. That said, you still need to know a minimum (that is, decent, deep enough vertical knowledge about the problem you are trying to solve), and these days it is very easy to self-learn advanced math by reading articles, using tools such as <a href="https://oeis.org/" target="_blank" rel="noopener">OEIS</a> or <a href="https://www.wolframalpha.com/" target="_blank" rel="noopener">Wolfram Alpha</a> (Mathematica), and posting questions on websites such as MathOverflow (see my profile and my posted questions <a href="https://mathoverflow.net/users/140356/vincent-granville" target="_blank" rel="noopener">here</a>), which are frequented by professional, research-level mathematicians. The drawback of not reading the classics (you should read them) is that you are bound to reinvent the wheel time and again, though in my case, that is the best way for me to learn new things. In addition to reinventing the wheel, your knowledge will have big gaps, and it will show.</p>
<p>Professionals with a background in physics, computer science, probability theory, statistics, pure math, or quantitative finance may have a competitive advantage. Most importantly, you need to be passionate about your own private research, have a lot of modesty, perseverance, and patience as you will face many disappointments, and not expect fame or financial rewards: in short, no different from starting a PhD program. Some companies like Google may allow you to work on pet projects, and experimental research in number theory geared towards applications may fit the bill. After all, some of the people who computed trillions of digits of the number Pi (and analyzed them) did it during their tenure at Google, and in the process contributed to the development of high-performance computing. Some of them also helped deepen the field of number theory.</p>
<p>In my case, it was never my goal to prove any big conjecture. I stumbled upon them time and again while working on otherwise unrelated math projects. They piqued my interest, and over time, I spent a lot of energy trying to understand the depth of these conjectures and why they may be true. And I got more and more interested in trying to pierce their mystery. This is true for the Riemann hypothesis (RH), a tantalizing conjecture with many implications if true, and relatively easy to understand. Even quantum physicists have worked on it, and obtained promising results. I know I will never prove RH, but if I can find a new direction towards proving it, that is all I am asking for. If my scenario for a proof is worth exploring, I will then work with mathematicians who know much more than I do, and enlist them to work on my foundations (likely to involve brand-new math). The hope is that they can finish the work I started but cannot complete myself, due to my somewhat limited mathematical knowledge.</p>
<p>After all, many top mathematicians made stellar discoveries in their thirties, outperforming peers 30 years their senior, even though their knowledge was limited by their young age. This is another example showing that knowing too much does not necessarily help.</p>
<p>Note that to get a job, "the less you know, the better" does not work, as employers expect you to know everything needed to perform well in their company. You can and should continue to learn a lot on the job, but you must master the basics just to be offered a job, and to be able to keep it.</p>
<p><strong>What I learned from working on these math projects: the benefits</strong></p>
<p>To begin with, not being affiliated with a professional research lab or academia has some benefits: you don't have to publish, you choose your research projects yourself, you work at your own pace (ideally much faster than in academia), you don't have to deal with politics, and you don't have to teach. Yet you have access to similar resources (computing power, literature, and so on). You can even teach if you want to; in my case I don't really teach, but I write a lot of tutorials to get more people interested in the subject, and I will probably self-publish books in the future, which could become a source of revenue. My math questions on MathOverflow get a lot of criticism and some great answers too, which serves as peer review, and readers even point me to literature that I should read, as well as new, state-of-the-art, yet unpublished research results. On occasion, I correspond with well-known university professors, which further helps me avoid going in the wrong direction.</p>
<p>The top benefit I've found in working on these problems is the incredible opportunity to hone your machine learning skills. The biggest data sets I have ever worked on come from these math projects. They allow you to test and benchmark various statistical models, discover new probability distributions with applications to real-world problems (see <a href="https://www.datasciencecentral.com/profiles/blogs/hurwitz-riemann-zeta-and-other-special-probability-distributions" target="_blank" rel="noopener">this example</a>) and new visualizations (see <a href="https://www.datasciencecentral.com/profiles/blogs/spectacular-visualization-the-eye-of-the-riemann-zeta-function" target="_blank" rel="noopener">here</a>), develop new statistical tests of randomness and new probabilistic games (see <a href="https://www.datasciencecentral.com/profiles/blogs/data-science-foundations-for-a-new-stock-market" target="_blank" rel="noopener">here</a>), and even discover interesting, sometimes truly original math theory: for instance, complex random variables with applications (see <a href="https://www.datasciencecentral.com/profiles/blogs/introduction-to-complex-random-variables-with-applications" target="_blank" rel="noopener">here</a>), lattice point distributions in the infinite-dimensional simplex (yet unpublished), advanced matrix algebra asymptotics (infinite matrices, yet unpublished, but similar to <a href="https://arxiv.org/abs/1511.08154" target="_blank" rel="noopener">this article</a>), and a new type of Dirichlet functions. Still, 90% of my research never gets published. I only share peer-reviewed, usually new results. The rest is discarded, as is always the case when you do research. For those interested, much of what I wrote that I consider worth sharing can be found in the math section, <a href="http://datashaping.com/free-articles.html" target="_blank" rel="noopener">here</a>.</p>
<p></p>
Fun Math Problems for Machine Learning Practitioners
tag:www.datasciencecentral.com,2021-05-20:6448529:BlogPost:1051076
2021-05-20T03:26:33.000Z
Vincent Granville
https://www.datasciencecentral.com/profile/VincentGranville
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/8947380099?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/8947380099?profile=RESIZE_710x" width="500" class="align-center"/></a></p>
<p><span>This is part of a series featuring the following aspects of machine learning:</span></p>
<ul>
<li><span>Mathematics, simulations, benchmarking algorithms based on synthetic data (in short, experimental data science)</span></li>
<li><span>Opinions, for instance about the value of a PhD in our field, or the use of some techniques</span></li>
<li><a href="https://www.datasciencecentral.com/profiles/blogs/more-machine-learning-tricks-recipes-and-statistical-models" target="_blank" rel="noopener">Methods, principles, rules of thumb, recipes, tricks</a></li>
<li><span><a href="https://www.datasciencecentral.com/profiles/blogs/a-plethora-of-machine-learning-articles-part-1" target="_blank" rel="noopener">Business analytics</a> </span></li>
<li><span><a href="https://www.datasciencecentral.com/profiles/blogs/a-plethora-of-machine-learning-articles-part-2" target="_blank" rel="noopener">Core Techniques</a> </span></li>
</ul>
<p><span>This issue focuses on cool math problems that come with data sets, source code, and algorithms. Many have a statistical, probabilistic or experimental flavor, and some deal with dynamical systems. They can be used to extend your math knowledge, to practice your machine learning skills on original problems, or simply out of curiosity. My articles, posted on Data Science Central, are always written in simple English and accessible to professionals with typically one year of calculus or statistical training at the undergraduate level. They are geared towards people who use data but are interested in gaining more practical analytical experience. The style is compact, suited to people who do not have a lot of free time. </span></p>
<p><span>Despite these restrictions, state-of-the-art, off-the-beaten-path results, as well as machine learning trade secrets and research material, are frequently shared. References to more advanced literature (from myself and other authors) are provided for those who want to dig deeper into the topics discussed. </span></p>
<p><span><strong>1. Fun Math Problems for Machine Learning Practitioners</strong></span></p>
<p><span>These articles focus on techniques that have wide applications or that are otherwise fundamental or seminal in nature.</span></p>
<ol>
<li><a href="https://www.datasciencecentral.com/profiles/blogs/introduction-to-complex-random-variables-with-applications">Fascinating Facts About Complex Random Variables and the Riemann Hypothesis</a></li>
<li><a href="https://www.datasciencecentral.com/profiles/blogs/more-beautiful-math-images" target="_blank" rel="noopener">More Surprising Math Images</a></li>
<li><a href="https://www.datasciencecentral.com/profiles/blogs/beautiful-mathematical-images" target="_blank" rel="noopener">Beautiful Mathematical Images</a></li>
<li><a href="https://www.datasciencecentral.com/profiles/blogs/deep-visualizations-riemann-s-conjecture" target="_blank" rel="noopener">Deep visualizations to Help Solve Riemann's Conjecture</a></li>
<li><a href="https://www.datasciencecentral.com/profiles/blogs/spectacular-visualization-the-eye-of-the-riemann-zeta-function" target="_blank" rel="noopener">Spectacular Visualization: The Eye of the Riemann Zeta Function</a></li>
<li><a href="https://www.datasciencecentral.com/profiles/blogs/new-probabilistic-approach-to-factoring-big-numbers" target="_blank" rel="noopener">New Probabilistic Approach to Factoring Big Numbers</a></li>
<li><a href="https://www.datasciencecentral.com/profiles/blogs/simple-trick-to-dramatically-improve-speed-of-convergence" target="_blank" rel="noopener">Simple Trick to Dramatically Improve Speed of Convergence</a></li>
<li><a href="https://www.datasciencecentral.com/profiles/blogs/state-of-the-art-statistical-science-to-address-famous-number-the" target="_blank" rel="noopener">State-of-the-Art Statistical Science to Tackle Famous Number Theory Conjectures</a></li>
<li><a href="https://www.datasciencecentral.com/profiles/blogs/new-perspective-on-fermat-s-last-theorem" target="_blank" rel="noopener">New Perspective on Fermat's Last Theorem</a></li>
<li><a href="https://www.datasciencecentral.com/profiles/blogs/math-fun-infinite-nested-radicals-of-random-variables" target="_blank" rel="noopener">Fun Math: Infinite Nested Radicals of Random Variables</a> - Connection with Fractals and Brownian Motions</li>
<li><a href="https://www.datasciencecentral.com/profiles/blogs/surprising-uses-of-synthetic-random-data-sets" target="_blank" rel="noopener">Surprising Uses of Synthetic Random Data Sets</a></li>
<li><a href="https://www.datasciencecentral.com/profiles/blogs/two-new-deep-conjectures-in-probabilistic-number-theory" target="_blank" rel="noopener">Two New Deep Conjectures in Probabilistic Number Theory</a></li>
<li><a href="https://www.datasciencecentral.com/profiles/blogs/extreme-events-modeling-using-continued-fractions" target="_blank" rel="noopener">Extreme Events Modeling Using Continued Fractions</a></li>
<li><a href="https://www.datasciencecentral.com/profiles/blogs/a-strange-family-of-statistical-distributions" target="_blank" rel="noopener">A Strange Family of Statistical Distributions</a></li>
<li><a href="https://www.datasciencecentral.com/profiles/blogs/some-fun-with-the-golden-ratio-time-series-and-number-theory" target="_blank" rel="noopener">Some Fun with Gentle Chaos, the Golden Ratio, and Stochastic Number Theory</a></li>
<li><a href="https://www.datasciencecentral.com/profiles/blogs/fascinating-new-results-in-the-theory-of-randomness" target="_blank" rel="noopener">Fascinating New Results in the Theory of Randomness</a></li>
<li><a href="https://www.analyticbridge.datasciencecentral.com/profiles/blogs/from-infinite-matrices-to-new-integration-formula" target="_blank" rel="noopener">From Infinite Matrices to New Integration Formula</a></li>
</ol>
<p><span><strong>2. Free books</strong></span></p>
<ul>
<li><span><b>Statistics: New Foundations, Toolbox, and Machine Learning Recipes</b></span><p><span>Available <a href="https://www.datasciencecentral.com/profiles/blogs/free-book-statistics-new-foundations-toolbox-and-machine-learning">here</a>. In about 300 pages and 28 chapters it covers many new topics, offering a fresh perspective on the subject, including rules of thumb and recipes that are easy to automate or integrate in black-box systems, as well as new model-free, data-driven foundations to statistical science and predictive analytics. The approach focuses on robust techniques; it is bottom-up (from applications to theory), in contrast to the traditional top-down approach.</span></p>
<p><span>The material is accessible to practitioners with a one-year college-level exposure to statistics and probability. The compact and tutorial style, featuring many applications with numerous illustrations, is aimed at practitioners, researchers, and executives in various quantitative fields.</span></p>
</li>
<li><span><b>Applied Stochastic Processes</b></span><p><span>Available <a href="https://www.datasciencecentral.com/profiles/blogs/fee-book-applied-stochastic-processes">here</a>. Full title: Applied Stochastic Processes, Chaos Modeling, and Probabilistic Properties of Numeration Systems (104 pages, 16 chapters.) This book is intended for professionals in data science, computer science, operations research, statistics, machine learning, big data, and mathematics. In 100 pages, it covers many new topics, offering a fresh perspective on the subject.</span></p>
<p><span>It is accessible to practitioners with a two-year college-level exposure to statistics and probability. The compact and tutorial style, featuring many applications (Blockchain, quantum algorithms, HPC, random number generation, cryptography, Fintech, web crawling, statistical testing) with numerous illustrations, is aimed at practitioners, researchers and executives in various quantitative fields.</span></p>
</li>
</ul>
<p></p>
<p><span><em><strong>About the author</strong>: Vincent Granville is a data science pioneer, mathematician, book author (Wiley), patent owner, former post-doc at Cambridge University, former VC-funded executive, with 20+ years of corporate experience including CNET, NBC, Visa, Wells Fargo, Microsoft, eBay. Vincent is also self-publisher at <a href="http://datashaping.com/" target="_blank" rel="noopener">DataShaping.com</a>, and founded and co-founded a few start-ups, including one with a successful exit (Data Science Central acquired by Tech Target). He recently opened <a href="https://www.parisrestaurantandbar.com/" target="_blank" rel="noopener">Paris Restaurant</a>, in Anacortes. You can access Vincent's articles and books, <a href="http://datashaping.com/" target="_blank" rel="noopener">here</a>.</em></span></p>
Fascinating Facts About Complex Random Variables and the Riemann Hypothesis
tag:www.datasciencecentral.com,2021-05-09:6448529:BlogPost:1049859
2021-05-09T17:00:00.000Z
Vincent Granville
https://www.datasciencecentral.com/profile/VincentGranville
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/8907977684?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/8907977684?profile=RESIZE_710x" width="500" class="align-center"/></a></p>
<p style="text-align: center;"><em>Orbit of the Riemann zeta function in the complex plane (see also <a href="https://www.datasciencecentral.com/profiles/blogs/spectacular-visualization-the-eye-of-the-riemann-zeta-function" target="_blank" rel="noopener">here</a>)</em></p>
<p>Despite my long statistical and machine learning career, both in academia and in industry, I had never heard of complex random variables until recently, when I stumbled upon them by chance while working on a number theory problem. However, I learned that they are used in several applications, including signal processing, quadrature amplitude modulation, information theory and actuarial sciences. See <a href="https://en.wikipedia.org/wiki/Complex_random_variable" target="_blank" rel="noopener">here</a> and <a href="https://www.casact.org/sites/default/files/database/forum_15fforum_halliwell_complex.pdf" target="_blank" rel="noopener">here</a>. </p>
<p>In this article, I provide a short overview of the topic, with an application to understanding why the Riemann hypothesis (arguably the most famous unsolved mathematical conjecture of all time) might be true, using probabilistic arguments. State-of-the-art, recent developments about this conjecture are discussed in a way that most machine learning professionals can understand. The style of my presentation is very compact, with numerous references provided as needed. It is my hope that this will broaden the horizon of the reader, adding new modeling tools to their arsenal, along with some off-the-beaten-path reading. The level of mathematics is rather simple, and you need to know very little (if anything) about complex numbers. After all, these random variables can be understood as bivariate vectors (<em>X</em>, <em>Y</em>) with <em>X</em> representing the real part and <em>Y</em> the imaginary part. They are typically denoted as <em>Z</em> = <em>X</em> + <em>iY</em>, where the complex number <em>i</em> (whose square is equal to -1) is the <a href="https://en.wikipedia.org/wiki/Imaginary_unit" target="_blank" rel="noopener">imaginary unit</a>. There are some subtle differences with bivariate real variables, and the interested reader can find more details <a href="https://en.wikipedia.org/wiki/Complex_random_variable" target="_blank" rel="noopener">here</a>. The complex Gaussian variable (see <a href="https://en.wikipedia.org/wiki/Complex_normal_distribution" target="_blank" rel="noopener">here</a>) is of course the most popular case.</p>
<p><span style="font-size: 14pt;"><strong>1. Illustration with damped complex random walks</strong></span></p>
<p>Let (<em>Z<span style="font-size: 8pt;">k</span></em>) be an infinite sequence of independently and identically distributed random variables, with <em>P</em>(<em>Z<span style="font-size: 8pt;">k</span></em> = 1) = <em>P</em>(<em>Z<span style="font-size: 8pt;">k</span></em> = -1) = 1/2. We define the damped sequence as </p>
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/8906629896?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/8906629896?profile=RESIZE_710x" width="120" class="align-center"/></a></p>
<p>The originality here is that <em>s</em> = <em>σ</em> + <em>it</em> is a complex number. The above sequence clearly converges if the real part of <em>s</em> (the real number <em>σ</em>) is strictly above 1. The computation of the variance (first for the real part of <em>Z</em>(<em>s</em>), then for the imaginary part, then the full variance) yields:</p>
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/8906638864?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/8906638864?profile=RESIZE_710x" width="400" class="align-center"/></a></p>
<p>Here <span><em>ζ</em> is the <a href="https://en.wikipedia.org/wiki/Riemann_zeta_function" target="_blank" rel="noopener">Riemann zeta function</a>. See also <a href="https://www.datasciencecentral.com/page/search?q=riemann+zeta" target="_blank" rel="noopener">here</a>. So we are dealing with a Riemann-zeta type of distribution; other examples of such distributions are found in one of my previous articles, <a href="https://www.datasciencecentral.com/profiles/blogs/hurwitz-riemann-zeta-and-other-special-probability-distributions" target="_blank" rel="noopener">here</a>. The core result is that the damped sequence not only converges if <em>σ</em> > 1, as announced earlier, but also if <em>σ</em> > 1/2 when you look at the variance: <em>σ</em> > 1/2 keeps the variance of the infinite sum <em>Z</em>(<em>s</em>) finite. This result, due to the fact that we are manipulating complex rather than real numbers, will be of crucial importance in the next section, focusing on an application. </span></p>
<p><span>It is possible to plot the distribution of <em>Z</em>(<em>s</em>), depending on the complex parameter <em>s</em> (or equivalently, on the two real parameters <em>σ</em> and <em>t</em>), using simulations. You can also compute its distribution numerically, using the inverse Fourier transform of its characteristic function. The characteristic function, computed for real <em>τ</em>, is given by the following surprising product:</span></p>
<p><span><a href="https://storage.ning.com/topology/rest/1.0/file/get/8906825294?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/8906825294?profile=RESIZE_710x" width="250" class="align-center"/></a></span></p>
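<p>The variance claim is easy to check empirically. Below is a minimal simulation sketch (my own illustration, not code from the article): it draws many copies of the truncated damped walk for <em>σ</em> = 0.75, which lies inside the critical interval (1/2, 1], and compares the empirical variance (real plus imaginary parts) with the theoretical value, the partial sum of <em>k</em>^(-2<em>σ</em>), which tends to <em>ζ</em>(2<em>σ</em>) as the number of terms grows.</p>

```python
import random
import statistics

s = 0.75 + 1.0j                     # complex exponent; sigma = Re(s) = 0.75 > 1/2
n, trials = 200, 4000               # walk length and number of simulated walks
weights = [k ** (-s) for k in range(1, n + 1)]   # damping factors k^{-s} (complex)

rng = random.Random(42)
samples = []
for _ in range(trials):
    # One realization of the damped walk: sum of Z_k * k^{-s}, Z_k = +/-1 with prob 1/2
    samples.append(sum(rng.choice((1, -1)) * w for w in weights))

# Var(Z) = Var(Re Z) + Var(Im Z) = sum |k^{-s}|^2 = sum k^{-2 sigma} -> zeta(2 sigma)
var_emp = (statistics.pvariance([z.real for z in samples])
           + statistics.pvariance([z.imag for z in samples]))
var_theory = sum(k ** (-2 * s.real) for k in range(1, n + 1))
```

With 4,000 trials the empirical variance should land within a few percent of the theoretical partial sum.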
<p><strong>1.1. Smoothed random walks and distribution of runs</strong></p>
<p><span>This sub-section is useful for the application discussed in section 2, and also for its own sake. If you don't have much time, you can skip it, and come back to it later.</span></p>
<p><span>The sum of the first <em>n</em> terms of the series defining <em>Z</em>(<em>s</em>) represents a random walk, with <em>n</em> playing the role of time. If <em>s</em> = 0, corresponding to the classic random walk, it has zero mean and variance equal to <em>n</em> (thus growing indefinitely with <em>n</em>); it can take on positive or negative values, and can stay positive (or negative) for a very long time, though it will eventually oscillate infinitely many times between positive and negative values (see <a href="https://mathworld.wolfram.com/PolyasRandomWalkConstants.html" target="_blank" rel="noopener">here</a>). We define the smoothed version <em>Z*</em>(<em>s</em>) as follows:</span></p>
<p><span><a href="https://storage.ning.com/topology/rest/1.0/file/get/8906746501?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/8906746501?profile=RESIZE_710x" width="300" class="align-center"/></a></span></p>
<p><span>A <em>run</em> of length <em>m</em> is defined as a maximal subsequence <em>Z<span style="font-size: 8pt;">k</span></em><span style="font-size: 8pt;">+1</span>, ..., <em>Z<span style="font-size: 8pt;">k</span></em><span style="font-size: 8pt;">+<em>m</em></span> whose terms all have the same sign: that is, <em>m</em> consecutive values all equal to +1, or all equal to -1. The probability for a run to be of length <em>m</em> > 0, in the original sequence (<em>Z<span style="font-size: 8pt;">k</span></em>), is equal to 1 / 2^<em>m</em>. Here 2^<em>m</em> denotes 2 raised to the power <em>m</em>. In the smoothed sequence (<em>Z*<span style="font-size: 8pt;">k</span></em>), after removing the zeroes, that probability is now 2 / 3^<em>m</em>. While by construction the <em>Z<span style="font-size: 8pt;">k</span></em>'s are independent, note that the <em>Z*<span style="font-size: 8pt;">k</span></em>'s are no longer independent. After removing all the zeroes (representing 50% of the <em>Z*<span style="font-size: 8pt;">k</span></em>'s), the runs in the sequence (<em>Z*<span style="font-size: 8pt;">k</span></em>) tend to be much shorter than those in (<em>Z<span style="font-size: 8pt;">k</span></em>). This implies that the associated random walk (now actually less random) based on the <em>Z*<span style="font-size: 8pt;">k</span></em>'s is better controlled, and cannot keep going up (or down) for as long, unlike the original random walk based on the <em>Z<span style="font-size: 8pt;">k</span></em>'s. A classic result, known as the <a href="https://en.wikipedia.org/wiki/Law_of_the_iterated_logarithm" target="_blank" rel="noopener">law of the iterated logarithm</a>, states that</span></p>
<p><span><a href="https://storage.ning.com/topology/rest/1.0/file/get/8906801290?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/8906801290?profile=RESIZE_710x" width="200" class="align-center"/></a></span></p>
<p><span>almost surely (that is, with probability 1). The definition of "lim sup" can be found <a href="https://en.wikipedia.org/wiki/Limit_inferior_and_limit_superior" target="_blank" rel="noopener">here</a>. Of course, this is no longer true for the sequence (<em>Z*<span style="font-size: 8pt;">k</span></em>) even after removing the zeroes.</span></p>
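<p>The run-length law for the original sequence, <em>P</em>(run of length <em>m</em>) = 1 / 2^<em>m</em>, can be verified by simulation. A minimal sketch (my own illustration, not code from the article):</p>

```python
import random
from collections import Counter

rng = random.Random(0)
n = 200_000
z = [rng.choice((1, -1)) for _ in range(n)]   # iid sequence of +1/-1, prob 1/2 each

# Tally the lengths of maximal runs of identical consecutive values.
runs = Counter()
length = 1
for prev, cur in zip(z, z[1:]):
    if cur == prev:
        length += 1
    else:
        runs[length] += 1
        length = 1
runs[length] += 1            # close the final run

total = sum(runs.values())
freq = {m: runs[m] / total for m in (1, 2, 3)}   # theory: 1/2, 1/4, 1/8
```

The observed frequencies match the geometric law 1 / 2^<em>m</em> to within sampling error.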
<p><span style="font-size: 14pt;"><strong>2. Application: heuristic proof of the Riemann hypothesis</strong></span></p>
<p><span>The Riemann hypothesis, one of the most famous unsolved mathematical problems, is discussed <a href="https://en.wikipedia.org/wiki/Riemann_hypothesis" target="_blank" rel="noopener">here</a>, and in the DSC article entitled <a href="https://www.datasciencecentral.com/profiles/blogs/will-bigdata-solve-the-riemann-hypothesis" target="_blank" rel="noopener">Will Big Data Solve the Riemann Hypothesis</a>. We approach this problem using a function <em>L</em>(<em>s</em>) that behaves (to some extent) like the <em>Z</em>(<em>s</em>) defined in section 1. We start with the following definitions:</span></p>
<p><span><a href="https://storage.ning.com/topology/rest/1.0/file/get/8908523659?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/8908523659?profile=RESIZE_710x" width="450" class="align-center"/></a></span></p>
<p>where</p>
<ul>
<li><span><em>Ω</em>(<em>k</em>) is the <a href="https://en.wikipedia.org/wiki/Prime_omega_function" target="_blank" rel="noopener">prime omega function</a>, counting the number of primes (including multiplicity) dividing <em>k</em>,</span></li>
<li><span><em>λ</em>(<em>k</em>) is the <a href="https://en.wikipedia.org/wiki/Liouville_function" target="_blank" rel="noopener">Liouville function</a></span><span>,</span></li>
<li><span><em>p</em><span style="font-size: 8pt;">1</span>, <em>p</em><span style="font-size: 8pt;">2</span>, and so on (with <em>p</em><span style="font-size: 8pt;">1</span> = 2) are the prime numbers.</span></li>
</ul>
<p><span>Note that <em>L</em>(<em>s</em>, 1) = <em>ζ</em>(<em>s</em>) is the Riemann zeta function, and <em>L</em>(<em>s</em>) = <em>ζ</em>(2<em>s</em>) / <em>ζ</em>(<em>s</em>). Again, <em>s</em> = <em>σ</em> + <em>it</em> is a complex number. We also define <em>L<span style="font-size: 8pt;">n</span></em> = <em>L<span style="font-size: 8pt;">n</span></em>(0) and <em>ρ</em> = <em>L</em>(0, 1/2). We have <em>L</em>(1) = 0. The series for <em>L</em>(<em>s</em>) converges for sure if <em>σ</em> > 1.</span></p>
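<p>Since <em>λ</em>(<em>k</em>) = (-1)^<em>Ω</em>(<em>k</em>), the identity <em>L</em>(<em>s</em>) = <em>ζ</em>(2<em>s</em>) / <em>ζ</em>(<em>s</em>) can be checked numerically where the series certainly converges (<em>σ</em> > 1). A minimal sketch (my own illustration, using trial-division factoring); at <em>s</em> = 2 the right-hand side is <em>ζ</em>(4) / <em>ζ</em>(2) = <em>π</em>²/15:</p>

```python
import math

def big_omega(k):
    """Prime omega function Omega(k): prime factors of k, counted with multiplicity."""
    count, d = 0, 2
    while d * d <= k:
        while k % d == 0:
            k //= d
            count += 1
        d += 1
    if k > 1:          # leftover prime factor
        count += 1
    return count

def liouville(k):
    """Liouville function: lambda(k) = (-1)^Omega(k)."""
    return (-1) ** big_omega(k)

N = 20_000
partial = sum(liouville(k) / k ** 2 for k in range(1, N + 1))   # L(2), truncated at N
target = (math.pi ** 4 / 90) / (math.pi ** 2 / 6)               # zeta(4)/zeta(2) = pi^2/15
```

The truncation error at <em>s</em> = 2 is bounded by the tail of <em>Σ</em> 1/<em>k</em>², about 1/<em>N</em>, so the partial sum agrees with <em>π</em>²/15 to several decimal places.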
<p><strong>2.1. How to prove the Riemann hypothesis?</strong></p>
<p><span>Any of the following conjectures, if proven, would make the Riemann hypothesis true:</span></p>
<ul>
<li><span>The series for <em>L</em>(<em>s</em>) also converges if <em>σ</em> > 1/2: this is what we investigate in section 2.2. If it were to converge only when <em>σ</em> is larger than (say) <em>σ</em><span style="font-size: 8pt;">0</span> = 0.65, it would only establish the absence of zeros of <em>ζ</em>(<em>s</em>) in the strip <em>σ</em><span style="font-size: 8pt;">0</span> < <em>σ</em> < 1, leaving 1/2 < <em>σ</em> < <em>σ</em><span style="font-size: 8pt;">0</span> unresolved. It would still be a major victory, allowing much more precise estimates of the distribution of prime numbers than are known today. RH is equivalent to the statement that <em>ζ</em>(<em>s</em>) has no zero with 1/2 < <em>σ</em> < 1.</span></li>
<li><span>The number <em>ρ</em> is a <a href="https://en.wikipedia.org/wiki/Normal_number" target="_blank" rel="noopener">normal number</a> in base 2 (this would prove the much stronger Chowla conjecture, see <a href="https://mathoverflow.net/questions/391736/normal-numbers-liouville-function-and-the-riemann-hypothesis" target="_blank" rel="noopener">here</a>)</span></li>
<li><span>The sequence (<em>λ</em>(<em>k</em>)) is ergodic (this would also prove the much stronger Chowla conjecture, see <a href="https://arxiv.org/abs/1611.09338" target="_blank" rel="noopener">here</a>)</span></li>
<li><span>The sequence <em>x</em>(<em>n</em>+1) = 2<em>x</em>(<em>n</em>) - INT(2<em>x</em>(<em>n</em>)), with <em>x</em>(0) = (1 + <em>ρ</em>) / 2, is ergodic. This is equivalent to the previous statement. Here INT stands for the integer part function, and the <em>x</em>(<em>n</em>)'s are iterates of the <a href="https://en.wikipedia.org/wiki/Dyadic_transformation" target="_blank" rel="noopener">Bernoulli map</a>, one of the simplest chaotic discrete dynamical systems (see Update 2 <a href="https://mathoverflow.net/questions/391736/normal-numbers-liouville-function-and-the-riemann-hypothesis" target="_blank" rel="noopener">in this post</a>), whose main invariant distribution is uniform on [0, 1]</span></li>
<li><span>The function 1 / <em>L</em>(<em>s</em>) = <em>ζ</em>(<em>s</em>) / <em>ζ</em>(2<em>s</em>) has no zero if 1/2 < <em>σ </em> < 1</span></li>
<li><span>The numbers <em>λ</em>(<em>k</em>)'s behave in a way that is random enough, so that for any <em>ε</em> > 0, we have: (see <a href="https://mathoverflow.net/questions/391736/normal-numbers-liouville-function-and-the-riemann-hypothesis" target="_blank" rel="noopener">here</a>)</span><span><a href="https://storage.ning.com/topology/rest/1.0/file/get/8906956661?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/8906956661?profile=RESIZE_710x" width="250" class="align-center"/></a></span></li>
</ul>
<p>Note that the last statement is weaker than the law of the iterated logarithm mentioned in section 1.1. The coefficient <em>λ</em>(<em>k</em>) plays the same role as <em>Z<span style="font-size: 8pt;">k</span></em> in section 1, however because <em>λ</em>(<i>mn</i>) = <em>λ</em>(<i>m</i>)<em>λ</em>(<i>n</i>), they can't be independent, not even <a href="https://projecteuclid.org/download/pdf_1/euclid.lnms/1215465639" target="_blank" rel="noopener">asymptotically independent</a>, unlike the <em>Z<span style="font-size: 8pt;">k</span></em>'s. Clearly, the sequence (<em>λ</em>(<em>k</em>)) has weak dependencies. That in itself does not prevent the law of the iterated logarithm from applying (see examples <a href="https://projecteuclid.org/journals/annals-of-probability/volume-5/issue-3/A-Functional-Law-of-the-Iterated-Logarithm-for-Empirical-Distribution/10.1214/aop/1176995795.full" target="_blank" rel="noopener">here</a>) nor does it prevent <em>ρ</em> from being a normal number (see <a href="https://arxiv.org/abs/1804.02844" target="_blank" rel="noopener">here</a> why). But it is conjectured that the law of the iterated logarithm does not apply to the sequence (<em>λ</em>(<em>k</em>)), due to another conjecture by <span>Gonek (see <a href="https://arxiv.org/abs/math/0310381" target="_blank" rel="noopener">here</a>).</span></p>
<p><strong>2.2. Probabilistic arguments in favor of the Riemann hypothesis</strong></p>
<p><span>The deterministic sequence (<em>λ</em>(<em>k</em>)), consisting of +1's and -1's in a 50/50 ratio, appears to behave rather randomly (judging by its limiting empirical distribution), just as the sequence (<em>Z<span style="font-size: 8pt;">k</span></em>) in section 1 behaves perfectly randomly. Thus, one might expect the series defining <em>L</em>(<em>s</em>) to also converge for <em>σ</em> > 1/2, not just for <em>σ</em> > 1: the same thing happens to <em>Z</em>(<em>s</em>) in section 1, for the same reason. And if it is true, then the Riemann hypothesis is true, by the first statement in the bullet list in section 2.1. Remember, <em>s</em> = <em>σ</em> + <em>it</em>; in other words, <em>σ</em> is the real part of the complex number <em>s</em>.</span></p>
<p><span>However, there is a big caveat that could perhaps be addressed to make the argument more convincing; this is the purpose of this section. As noted at the bottom of section 2.1, the sequence (<em>λ</em>(<em>k</em>)), even though it passes all the randomness tests that I have tried, is much less random than it appears to be. It obviously has weak dependencies, since the function <em>λ</em> is multiplicative: <em>λ</em>(<i>mn</i>) = <em>λ</em>(<i>m</i>)<em>λ</em>(<i>n</i>). This is related to the fact that prime numbers are not perfectly randomly distributed. Another disturbing fact is that <em>L<span style="font-size: 8pt;">n</span></em>, the equivalent of the random walk defined in section 1, seems biased towards negative values. For instance, except for <em>n</em> = 1, it is negative up to <em>n</em> = 906,150,257, a fact proved in 1980 that disproved Polya's conjecture (see <a href="https://en.wikipedia.org/wiki/P%C3%B3lya_conjecture" target="_blank" rel="noopener">here</a>). One way to address this is to work with Rademacher multiplicative random functions instead of (<em>Z<span style="font-size: 8pt;">k</span></em>) in section 1: see <a href="https://londmathsoc.onlinelibrary.wiley.com/doi/full/10.1112/jlms.12421" target="_blank" rel="noopener">here</a> for an example that would make the last item in the bullet list in section 2.1 true. Or see <a href="https://www.ams.org/journals/proc/2013-141-02/S0002-9939-2012-11332-2/" target="_blank" rel="noopener">here</a> for an example that preserves the law of the iterated logarithm (which itself would also imply the Riemann Hypothesis). </span></p>
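<p>The negative bias of <em>L<span style="font-size: 8pt;">n</span></em> is easy to observe numerically. The sketch below (the sieve helper is my own choice) computes the summatory values <em>L<span style="font-size: 8pt;">n</span></em> = <em>λ</em>(1) + ... + <em>λ</em>(<em>n</em>) and confirms that, apart from <em>n</em> = 1, <em>L<span style="font-size: 8pt;">n</span></em> never becomes positive below one million, consistent with the 1980 result quoted above:</p>

```python
def liouville_sieve(n):
    """lam[k] = (-1)^Omega(k) for k = 1..n, via a smallest-prime-factor sieve."""
    spf = list(range(n + 1))
    for p in range(2, int(n**0.5) + 1):
        if spf[p] == p:
            for m in range(p * p, n + 1, p):
                if spf[m] == m:
                    spf[m] = p
    lam = [0] * (n + 1)
    lam[1] = 1
    for k in range(2, n + 1):
        lam[k] = -lam[k // spf[k]]
    return lam

N = 10**6
lam = liouville_sieve(N)
L, worst = 0, -N                           # worst = largest L_n seen for n >= 2
for n in range(1, N + 1):
    L += lam[n]
    if n >= 2 and L > worst:
        worst = L
print(worst)                               # 0: L_n <= 0 for all 2 <= n <= 10^6
```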
<p>Finally, working with a smoothed version of <em>L</em>(<em>s</em>) or <em>L<span style="font-size: 8pt;">n</span></em>, using the smoothing technique described in section 1.1, may lead to results that are easier to obtain, and possibly to new insights into the original series <em>L</em>(<em>s</em>). The smoothed version <em>L</em>*(<em>s</em>) is defined, using the same technique as in section 1.1, as</p>
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/8908556871?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/8908556871?profile=RESIZE_710x" width="500" class="align-center"/></a></p>
<p>The function <em>η</em>(<em>s</em>) is the <a href="https://en.wikipedia.org/wiki/Dirichlet_eta_function" target="_blank" rel="noopener">Dirichlet eta function</a>, and <em>L</em>*(<em>s</em>) can be computed in Mathematica using (DirichletEta[s] + Zeta[2s] / Zeta[s]) / 2. Mathematica uses the <a href="https://en.wikipedia.org/wiki/Analytic_continuation" target="_blank" rel="noopener">analytic continuation</a> of the <em>ζ</em> function if <em>σ</em> < 1. For instance, see the computation of <em>L</em>*(0.7) = -0.237771..., <a href="https://www.wolframalpha.com/input/?i=%3D%28DirichletEta%5B0.7%5D%2BZeta%5B1.4%5D%2FZeta%5B0.7%5D%29%2F2" target="_blank" rel="noopener">here</a>. A table of the first million values <em>λ</em>(<i>k</i>) of the Liouville function can be produced in Mathematica in just a few seconds, using the command Table[LiouvilleLambda[n], {n, 1, 1000000}]. For convenience, I stored them in a text file, <a href="http://www.datashaping.com/Liouville4b.txt" target="_blank" rel="noopener">here</a>. It would be interesting to see how well (or poorly) they would perform as the basis of a pseudorandom number generator.</p>
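<p>As a first, crude answer to that last question, one can run a monobit test on the signs: for a truly random ±1 sequence, <em>L<span style="font-size: 8pt;">n</span></em> / √<em>n</em> is approximately standard normal, so a value far outside ±4 would flag gross non-randomness. In the sketch below (helper name is my own choice), the <em>λ</em>(<em>k</em>)'s pass this most basic test; their weak dependencies only surface in finer tests:</p>

```python
import math

def liouville_sieve(n):
    """lam[k] = (-1)^Omega(k) for k = 1..n, via a smallest-prime-factor sieve."""
    spf = list(range(n + 1))
    for p in range(2, int(n**0.5) + 1):
        if spf[p] == p:
            for m in range(p * p, n + 1, p):
                if spf[m] == m:
                    spf[m] = p
    lam = [0] * (n + 1)
    lam[1] = 1
    for k in range(2, n + 1):
        lam[k] = -lam[k // spf[k]]
    return lam

N = 10**6
lam = liouville_sieve(N)
S = sum(lam[1:])                 # L_N: net excess of +1's over -1's
z = S / math.sqrt(N)             # monobit statistic, ~ N(0, 1) under true randomness
print(f"L_N = {S}, z = {z:.3f}")  # |z| stays well under 4: passes this crude test
```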
<p></p>
<p><span><em>To receive a weekly digest of our new articles, subscribe to our newsletter, <a href="https://www.datasciencecentral.com/profiles/blogs/check-out-our-dsc-newsletter" target="_blank" rel="noopener">here</a>.</em></span></p>
<p><span><em><strong>About the author</strong>: Vincent Granville is a d<span class="lt-line-clamp__raw-line">ata science pioneer, mathematician, book author (Wiley), patent owner, former post-doc at Cambridge University, former VC-funded executive, with 20+ years of corporate experience including CNET, NBC, Visa, Wells Fargo, Microsoft, eBay. Vincent is also self-publisher at <a href="http://datashaping.com/" target="_blank" rel="noopener">DataShaping.com</a>, and founded and co-founded a few start-ups, including one with a successful exit (Data Science Central acquired by Tech Target).</span> He recently opened <a href="https://www.parisrestaurantandbar.com/" target="_blank" rel="noopener">Paris Restaurant</a>, in Anacortes. You can access Vincent's articles and books, <a href="http://datashaping.com/" target="_blank" rel="noopener">here</a>.</em></span></p>
What I Learned From 25 Years of Machine Learning
tag:www.datasciencecentral.com,2021-05-04:6448529:BlogPost:1049094
2021-05-04T06:00:00.000Z
Vincent Granville
https://www.datasciencecentral.com/profile/VincentGranville
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/8890975463?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/8890975463?profile=RESIZE_710x" width="600" class="align-center"/></a></p>
<p style="text-align: center;"><em>Source: <a href="https://www.zeolearn.com/magazine/what-is-machine-learning" target="_blank" rel="noopener">here</a></em></p>
<p>Here is what I learned from practicing machine learning in business settings for over two decades, and prior to that in academia. Back in the nineties, it was known as computational statistics in some circles, and some problems, such as image analysis, were already popular. Of course, a lot of progress has been made since, thanks in part to the power of modern computers, the cloud, and large data sets now being ubiquitous. The trend has evolved towards more robust, model-free, data-driven techniques, sometimes designed as black boxes: for instance, deep neural networks. Text analysis (NLP) has also seen substantial progress. I hope that the advice I provide below will be helpful in your data science job. </p>
<p><span style="font-size: 14pt;"><strong>11 pieces of advice</strong></span></p>
<ul>
<li>The biggest achievement in my career was automating most of the data cleaning / data massaging / outlier detection and exploratory analysis, freeing me to focus on tasks that truly justified my salary. I had to write a few reusable scripts to take care of it, but it was well worth the effort. </li>
<li>Be friends with the IT department. In one company, much of my job consisted of producing and blending various reports for decision makers. I automated it all (which required direct access, via Perl code, to sensitive databases) and I even told my boss about it. He said that I did not work a lot (compared to the hard workers) but understood, and was happy to always receive the reports on time, automatically delivered to his mailbox, even when I was on vacation.</li>
<li>Leverage APIs. In one company, a big project consisted of creating and maintaining a list of the top 95% of keywords searched for on the web, and attaching a value / yield to each of them. The list had about one million keywords. I started by querying internal databases, scraping the web, and developing yield models; there was a lot of NLP involved. Then I found out that I could get all that information from Google and Microsoft by accessing their APIs. It was not free, but not expensive either, and initially I used my own credit card to pay for the services, which saved me a lot of time. Eventually my boss adopted my idea, and the company reimbursed me for these paid API calls. They continued to use them, under my own personal accounts, long after I was gone. </li>
<li>Document your code, your models, and every core task you perform, with enough detail and in such a way that other people can understand your documentation. Without it, you might not even remember what a piece of your own code does three years down the road, and will have to rewrite it from scratch. Use simple English as much as possible. It is also good practice, as it will help you train your replacement when you leave.</li>
<li>When blending data from different sources, adjust the metrics accordingly for each source; metrics are likely not fully compatible across sources, or some may be missing, as things are probably measured in different ways depending on the source. Even over time, the same metric in the same database can evolve to the point of no longer being compatible with historical data. I actually have a patent that addresses this issue.</li>
<li>Be wary of job interviews for a supposedly wonderful data science job requiring a lot of creativity. I was misled quite a few times; the job eventually turned out to be a coding job, which can be dead-end and boring. I like doing the work of a software engineer, but only as long as it helps me automate and optimize my tasks.</li>
<li>Working remotely can have many rewards, especially financial ones. Sometimes it also means less time spent in corporate meetings. I had to travel every single week between Seattle and San Francisco, for years. I did not like it, but I saved a lot of money (not least because there is no state income tax in Washington, and real estate is much less expensive). Also, walking from your hotel to your workplace is less painful than commuting, and it saves a lot of time. Nowadays, telecommuting makes it even easier. </li>
<li>Embrace simple models. Use synthetic or simulated data to test them. For instance, I implemented various statistical tests, and used artificial data (many times from number theory experiments) to fine-tune and assess the validity of my tests / models on datasets for which the exact answer is known. It was a win-win: working on a topic I love (experimental and probabilistic number theory) and at the same time producing good models and algorithms with applications to real business processes.</li>
<li>Being a generalist rather than a specialist offers more career opportunities, within your company (horizontal move) or elsewhere. You still need to be an expert in at least one or two areas. As a generalist, it will be easier for you to become a consultant or start your own company, should you decide to go that route. It may also help you understand the real problems that decision makers are facing in your company, and build a better, closer relationship with them, or with any department (sales, finance, marketing, IT).</li>
<li>In data we trust: I disagree with that statement. I remember a job at Wells Fargo where I was analyzing user sessions of corporate clients doing online transactions. The sessions were extremely short. I decided to have my boss run a simulated session with multiple transactions, and analyze it the next day. It turned out that the session was broken down into multiple sessions: the tracking service (powered by Tealeaf back then) started a new session anytime an HTTP request by the same user came from a different server, that is, pretty much for every user request. The Tealeaf issue was fixed once Wells Fargo reported it, and I am sure this was my most valuable contribution at the bank. In a different company, reports from a third party were totally erroneous, missing most page views in their count: it turned out that their software truncated every URL that contained a comma, a glitch caused by bad programming by some software engineer at that third-party company, combined with the fact that 95% of our URLs contained commas. If you miss such massive glitches (even though, in some ways, it is not your job to detect them), your analyses will be worthless. One way to detect these glitches is to rely on more than one data source.</li>
<li>Get very precise definitions of the metrics you are dealing with. The fact that there is so much fake news nowadays is probably due to the concept of fake news never having been properly defined, rather than to a data / modeling issue.</li>
</ul>
<p></p>
<p></p>
More Machine Learning Tricks, Recipes, and Statistical Models
tag:www.datasciencecentral.com,2021-04-30:6448529:BlogPost:1049160
2021-04-30T03:57:32.000Z
Vincent Granville
https://www.datasciencecentral.com/profile/VincentGranville
<p><span><a href="https://storage.ning.com/topology/rest/1.0/file/get/8873246459?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/8873246459?profile=RESIZE_710x" width="500" class="align-center"/></a></span></p>
<p style="text-align: center;"><em>Source for picture: <a href="https://www.forbes.com/sites/kalevleetaru/2019/01/15/why-machine-learning-needs-semantics-not-just-statistics" target="_blank" rel="noopener">here</a></em></p>
<p><span>The first part of this list was published <a href="https://www.datasciencecentral.com/profiles/blogs/a-plethora-of-machine-learning-tricks-recipes-and-statistical-mod" target="_blank" rel="noopener">here</a>. These are articles that I wrote in the last few years. The whole series will feature articles related to the following aspects of machine learning:</span></p>
<ul>
<li><span>Mathematics, simulations, benchmarking algorithms based on synthetic data (in short, experimental data science)</span></li>
<li><span>Opinions, for instance about the value of a PhD in our field, or the use of some techniques</span></li>
<li><span>Methods, principles, rules of thumb, recipes, tricks</span></li>
<li><span><a href="https://www.datasciencecentral.com/profiles/blogs/a-plethora-of-machine-learning-articles-part-1" target="_blank" rel="noopener">Business analytics</a> </span></li>
<li><span><a href="https://www.datasciencecentral.com/profiles/blogs/a-plethora-of-machine-learning-articles-part-2" target="_blank" rel="noopener">Core Techniques</a> </span></li>
</ul>
<p><span>My articles are always written in simple English and accessible to professionals with, typically, one year of calculus or statistical training at the undergraduate level. They are geared towards people who use data but are interested in gaining more practical analytical experience. Managers and decision makers are part of my intended audience. The style is compact, geared towards people who do not have a lot of free time. </span></p>
<p><span>Despite these restrictions, state-of-the-art and off-the-beaten-path results, as well as machine learning trade secrets and research material, are frequently shared. References to more advanced literature (from myself and other authors) are provided for those who want to dig deeper into the topics discussed. </span></p>
<p><span><strong>1. Machine Learning Tricks, Recipes and Statistical Models</strong></span></p>
<p><span>These articles focus on techniques that have wide applications or that are otherwise fundamental or seminal in nature.</span></p>
<ol>
<li><a href="https://www.datasciencecentral.com/profiles/blogs/one-trillion-random-digits" target="_blank" rel="noopener">One Trillion Random Digits</a></li>
<li><a href="https://www.datasciencecentral.com/profiles/blogs/new-perspective-on-central-limit-theorem-and-related-stats-topics" target="_blank" rel="noopener">New Perspective on the Central Limit Theorem and Statistical Testing</a></li>
<li><a href="https://www.datasciencecentral.com/profiles/blogs/feature-selection-a-simple-solution?xg_source=activity" target="_blank" rel="noopener">Simple Solution to Feature Selection Problems</a></li>
<li><a href="https://www.datasciencecentral.com/profiles/blogs/scale-invariant-clustering-and-regression" target="_blank" rel="noopener">Scale-Invariant Clustering and Regression</a></li>
<li><a href="https://www.datasciencecentral.com/profiles/blogs/deep-dive-into-polynomial-regression-and-overfitting" target="_blank" rel="noopener">Deep Dive into Polynomial Regression and Overfitting</a></li>
<li><a href="https://www.datasciencecentral.com/profiles/blogs/stochastic-processes-new-tests-for-randomness-application-to-numb" target="_blank" rel="noopener">Stochastic Processes and New Tests of Randomness</a> - Application to Cool Number Theory Problem</li>
<li><a href="https://www.datasciencecentral.com/profiles/blogs/a-simple-introduction-to-complex-stochastic-processes-part-2" target="_blank" rel="noopener">A Simple Introduction to Complex Stochastic Processes - Part 2</a></li>
<li><a href="https://www.datasciencecentral.com/profiles/blogs/a-simple-introduction-to-complex-stochastic-processes" target="_blank" rel="noopener">A Simple Introduction to Complex Stochastic Processes</a></li>
<li><a href="https://www.datasciencecentral.com/profiles/blogs/high-precision-computing-benchmark-examples-and-tutorial" target="_blank" rel="noopener">High Precision Computing: Benchmark, Examples, and Tutorial</a></li>
<li><a href="https://www.datasciencecentral.com/profiles/blogs/logistic-map-chaos-randomness-and-quantum-algorithms" target="_blank" rel="noopener">Logistic Map, Chaos, Randomness and Quantum Algorithms</a></li>
<li><a href="https://www.bigdatanews.datasciencecentral.com/profiles/blogs/graph-theory-six-degrees-of-separation-problem" target="_blank" rel="noopener">Graph Theory: Six Degrees of Separation Problem</a></li>
<li><a href="http://www.analyticbridge.datasciencecentral.com/profiles/blogs/interesting-probability-problem-for-serious-geeks" target="_blank" rel="noopener">Interesting Problem for Serious Geeks: Self-correcting Random Walks</a></li>
<li><a href="http://www.datasciencecentral.com/profiles/blogs/9-off-th-beaten-path-statistical-science-topics" target="_blank" rel="noopener">9 Off-the-beaten-path Statistical Science Topics with Interesting Applications</a></li>
<li><a href="http://www.datasciencecentral.com/profiles/blogs/data-science-method-to-discover-large-prime-numbers" target="_blank" rel="noopener">Data Science Method to Discover Large Prime Numbers</a></li>
<li><a href="http://www.datasciencecentral.com/profiles/blogs/nice-generalization-of-the-k-nn-clustering-algorithm" target="_blank" rel="noopener">Nice Generalization of the K-NN Clustering Algorithm</a> - Also Useful for Data Reduction</li>
<li><a href="http://www.analyticbridge.datasciencecentral.com/profiles/blogs/mysterious-sequences-that-look-random-with-surprising-properties" target="_blank" rel="noopener">How to Detect if Numbers are Random or Not</a></li>
<li><a href="http://www.datasciencecentral.com/profiles/blogs/how-and-why-decorrelate-time-series" target="_blank" rel="noopener">How and Why: Decorrelate Time Series</a></li>
<li><a href="http://www.datasciencecentral.com/profiles/blogs/distribution-of-arrival-times-of-extreme-events" target="_blank" rel="noopener">Distribution of Arrival Times of Extreme Events</a></li>
<li><a href="http://www.datasciencecentral.com/profiles/blogs/why-zipf-s-law-explains-so-many-big-data-and-physics-phenomenons" target="_blank" rel="noopener">Why Zipf's law explains so many big data and physics phenomenons</a></li>
</ol>
<p><span><strong>2. Free books</strong></span></p>
<ul>
<li><span><b>Statistics: New Foundations, Toolbox, and Machine Learning Recipes</b></span><p><span>Available <a href="https://www.datasciencecentral.com/profiles/blogs/free-book-statistics-new-foundations-toolbox-and-machine-learning">here</a>. In about 300 pages and 28 chapters it covers many new topics, offering a fresh perspective on the subject, including rules of thumb and recipes that are easy to automate or integrate in black-box systems, as well as new model-free, data-driven foundations to statistical science and predictive analytics. The approach focuses on robust techniques; it is bottom-up (from applications to theory), in contrast to the traditional top-down approach.</span></p>
<p><span>The material is accessible to practitioners with a one-year college-level exposure to statistics and probability. The compact and tutorial style, featuring many applications with numerous illustrations, is aimed at practitioners, researchers, and executives in various quantitative fields.</span></p>
</li>
<li><span><b>Applied Stochastic Processes</b></span><p><span>Available <a href="https://www.datasciencecentral.com/profiles/blogs/fee-book-applied-stochastic-processes">here</a>. Full title: Applied Stochastic Processes, Chaos Modeling, and Probabilistic Properties of Numeration Systems (104 pages, 16 chapters.) This book is intended for professionals in data science, computer science, operations research, statistics, machine learning, big data, and mathematics. In 100 pages, it covers many new topics, offering a fresh perspective on the subject.</span></p>
<p><span>It is accessible to practitioners with a two-year college-level exposure to statistics and probability. The compact and tutorial style, featuring many applications (Blockchain, quantum algorithms, HPC, random number generation, cryptography, Fintech, web crawling, statistical testing) with numerous illustrations, is aimed at practitioners, researchers and executives in various quantitative fields.</span></p>
</li>
</ul>
<p></p>
Unusual Opportunities for AI, Machine Learning, and Data Scientists
tag:www.datasciencecentral.com,2021-04-20:6448529:BlogPost:1048019
2021-04-20T01:30:00.000Z
Vincent Granville
https://www.datasciencecentral.com/profile/VincentGranville
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/8812460479?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/8812460479?profile=RESIZE_710x" width="500" class="align-center"/></a></p>
<p>Here are some off-the-beaten-path options to consider when looking for a first job, a new job, or extra income leveraging your machine learning experience. Many were offers that came to my mailbox at some point in the last 10 years, mostly from people looking at my LinkedIn profile. Hence the importance of growing your network and visibility: write blogs, and show the world some of your portfolio and accomplishments (code posted on GitHub, etc.). If you do it right, after a while you will never have to apply for a job again: hiring managers and other opportunities will come to you, rather than the other way around.</p>
<p><span style="font-size: 14pt;"><strong>1. For beginners</strong></span></p>
<p>Participate in Kaggle and other competitions. Become a teacher for one of the many online teaching companies or data camps, such as Coursera. Write, self-publish, and sell your own books: an example is Jason Brownlee (see <a href="https://machinelearningmastery.com/" target="_blank" rel="noopener">here</a>), who found his niche selling tutorials that explain data science in simple words to software engineers. I am moving in the same direction as well, see <a href="http://datashaping.com/" target="_blank" rel="noopener">here</a>. Another option is to develop an API, for instance to offer trading signals (buy / sell) to investors who pay a fee to subscribe to your service: something I did in the past, and it earned me a bit more income than I had expected. I also created a website where recruiters can post data science job ads for a fee: it still exists (see <a href="https://www.analytictalent.com/" target="_blank" rel="noopener">here</a>) though it was acquired. You need to aggregate jobs from multiple websites, build a large mailing list of data scientists, and charge a fee only for <em>featured jobs</em>. Many of these ideas require that you promote your services for free using social media: this is the hard part. A starting point is to create and grow your own groups on social networks. All this can be done while holding a full-time job at the same time. </p>
<p>You can also become a contributor/writer for various news outlets, though initially you may have to do it for free. As you gain experience and notoriety, it can become a full-time, lucrative job. Finally, you can raise money with a partner to start your own company. </p>
<p><span style="font-size: 14pt;"><strong>2. For mid-career and seasoned professionals</strong></span></p>
<p>You can offer consulting services, especially to your former employers to begin with. Here are some unusual opportunities I was offered. I did not accept all of them, but I was still able to maintain a full-time job while earning decent side income.</p>
<ul>
<li>Expert witness - get paid by big law firms to show up in court and help them win big money for their clients (and for themselves, and for you along the way). Or you can work for a company specializing in statistical litigation, such as <a href="https://www.wecker.com/" target="_blank" rel="noopener">this one</a>.</li>
<li>Become a part-time, independent recruiter. Some machine learning recruiters are former machine learning experts. You can still keep your full-time job.</li>
<li>Get involved in patent reviews (pertaining to machine learning problems that you know very well.)</li>
<li>Help Venture Capital companies do their due diligence on startups they could potentially fund, or help them find new startups worthy to invest in. The last VC firm that contacted me offered $1,000 per report, each requiring 2-3 hours of work. </li>
<li>I was once contacted to be the data scientist for an Indian tribe. Other unusual job offers came from the adult industry (they needed an expert to fight advertising fraud on their websites) and even from the casino industry. I eventually created my own, very unique lottery system, see <a href="https://www.datasciencecentral.com/profiles/blogs/data-science-foundations-for-a-new-stock-market" target="_blank" rel="noopener">here</a>. I plan to either sell the intellectual property or work with existing lottery operators (governments or casinos) to make it happen and monetize it. If you own some IP (intellectual property), think about monetizing it if you can. </li>
</ul>
<p>There are of course plenty of other opportunities, such as working for a consulting firm or for governments to uncover tax fraudsters via data mining techniques, to give just one example. Another idea, if you own properties, is to obtain a realtor certification so you can save a lot of money by selling them yourself, without using a third party - and use your analytic acumen to buy low and sell high at the right times. Working from home in (say) Nevada for an employer in the Bay Area can also save you a lot of money. </p>
<p></p>
<p><span><em>To receive a weekly digest of our new articles, subscribe to our newsletter, <a href="https://www.datasciencecentral.com/profiles/blogs/check-out-our-dsc-newsletter" target="_blank" rel="noopener">here</a>.</em></span></p>
<p><span><em><strong>About the author</strong>: Vincent Granville is a d<span class="lt-line-clamp__raw-line">ata science pioneer, mathematician, book author (Wiley), patent owner, former post-doc at Cambridge University, former VC-funded executive, with 20+ years of corporate experience including CNET, NBC, Visa, Wells Fargo, Microsoft, eBay. Vincent is also self-publisher at <a href="http://datashaping.com/" target="_blank" rel="noopener">DataShaping.com</a>, and founded and co-founded a few start-ups, including one with a successful exit (Data Science Central acquired by Tech Target).</span> You can access Vincent's articles and books, <a href="http://datashaping.com/" target="_blank" rel="noopener">here</a>.</em></span></p>
Simple Machine Learning Approach to Testing for Independence
tag:www.datasciencecentral.com,2021-04-08:6448529:BlogPost:1046622
2021-04-08T06:00:00.000Z
Vincent Granville
https://www.datasciencecentral.com/profile/VincentGranville
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/8771488658?profile=original" rel="noopener" target="_blank"><img class="align-center" src="https://storage.ning.com/topology/rest/1.0/file/get/8771488658?profile=RESIZE_710x" width="500"></img></a></p>
<p>We describe here a methodology that applies to any statistical test, and illustrated in the context of assessing independence between successive observations in a data set. After reviewing a few standard approaches, we discuss our methodology, its benefits, and drawbacks. The data used here for illustration purposes, has known theoretical…</p>
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/8771488658?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/8771488658?profile=RESIZE_710x" width="500" class="align-center"/></a></p>
<p>We describe here a methodology that applies to any statistical test; we illustrate it in the context of assessing independence between successive observations in a data set. After reviewing a few standard approaches, we discuss our methodology, its benefits, and its drawbacks. The data used here for illustration purposes has known theoretical auto-correlations, so it can be used to benchmark various statistical tests. Our methodology also applies to data with high volatility, in particular to time series models with undefined autocorrelations. Such models (see for instance Figure 1 <a href="https://www.datasciencecentral.com/profiles/blogs/defining-and-measuring-chaos-in-data-sets-why-and-how-in-simple-w" target="_blank" rel="noopener">in this article</a>) are usually ignored by practitioners, despite their interesting properties.</p>
<p>Independence is a stronger concept than all autocorrelations being equal to zero. In particular, some functional non-linear relationships between successive data points may result in zero autocorrelation even though the observations exhibit strong auto-dependencies: a classic example is points randomly located on a circle centered at the origin; the correlation between the <em>X</em> and <em>Y</em> variables may be zero, but of course <em>X</em> and <em>Y</em> are not independent.</p>
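<p>The circle example is easy to verify numerically. The sketch below (Python with NumPy; the seed and sample size are arbitrary choices) samples points uniformly on the unit circle and checks that the <em>X</em>-<em>Y</em> correlation is near zero even though <em>Y</em> is fully determined by <em>X</em> up to sign:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
theta = rng.uniform(0.0, 2.0 * np.pi, size=100_000)
x, y = np.cos(theta), np.sin(theta)        # points on a circle centered at the origin

r = np.corrcoef(x, y)[0, 1]                # sample correlation: near zero
dependent = np.allclose(x**2 + y**2, 1.0)  # yet y**2 = 1 - x**2 exactly
```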
<p><span style="font-size: 14pt;"><strong>1. Testing for independence: classic methods</strong></span></p>
<p>The best-known test is the Chi-Square test, see <a href="http://mlwiki.org/index.php/Chi-Squared_Test_of_Independence" target="_blank" rel="noopener">here</a>. It is used to test independence in contingency tables or between two time series. In the latter case, it requires binning the data, and it works only if each bin has enough observations, usually more than 5. Under the assumption of independence, its statistic has a known distribution: Chi-Squared, itself well approximated by a normal distribution for moderately sized data sets, see <a href="https://en.wikipedia.org/wiki/Chi-square_distribution#Asymptotic_properties" target="_blank" rel="noopener">here</a>. </p>
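<p>As an illustration (one possible setup, not the only one), you can bin two series and run the test with SciPy. The series <em>y</em> below is a hypothetical construction chosen only so that it closely tracks <em>x</em>; with 5 x 5 bins and 5,000 points, every expected cell count is around 200, well above the usual threshold of 5:</p>

```python
import numpy as np
from scipy.stats import chi2_contingency

rng = np.random.default_rng(1)
x = rng.uniform(size=5000)
y = (x + 0.1 * rng.uniform(size=5000)) % 1.0   # y closely tracks x: strongly dependent

# bin both series into a 5 x 5 contingency table
table, _, _ = np.histogram2d(x, y, bins=5)
chi2, p, dof, expected = chi2_contingency(table)
# tiny p-value: independence is rejected
```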
<p>Another test is based on the Kolmogorov-Smirnov statistic. It is typically used to measure <a href="https://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test" target="_blank" rel="noopener">goodness of fit</a>, but it can be adapted to assess independence between two variables (or columns, in a data set); see <a href="https://projecteuclid.org/journals/electronic-journal-of-statistics/volume-8/issue-2/A-Kolmogorov-Smirnov-type-test-for-independence-between-marks-and/10.1214/14-EJS961.full" target="_blank" rel="noopener">here</a>. Convergence to the exact distribution is slow. Our test described in section 2 is somewhat similar, but it is entirely data-driven and model-free: our confidence intervals are based on re-sampling techniques, not on tabulated values of known statistical distributions. Our test was first discussed in section 2.3 of a previous article entitled <em>New Tests of Randomness and Independence for Sequences of Observations</em>, available <a href="https://www.datasciencecentral.com/profiles/blogs/a-new-test-of-independence" target="_blank" rel="noopener">here</a>. In section 2 of this article, a better, simplified version is presented, suitable for big data. In addition, we discuss how to build confidence intervals in a simple way that will appeal to machine learning professionals.</p>
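<p>For reference, the goodness-of-fit version of the Kolmogorov-Smirnov test is a one-liner in SciPy. The toy samples below are deterministic: an evenly spread sample that fits the uniform distribution almost perfectly, and its square, whose CDF is the square root function, so the maximum CDF gap against uniform is about 0.25:</p>

```python
import numpy as np
from scipy.stats import kstest

good = (np.arange(2000) + 0.5) / 2000   # evenly spread over [0, 1]
stat1, p1 = kstest(good, "uniform")     # excellent fit to uniform

bad = good ** 2                          # empirical CDF ~ sqrt(t), not uniform
stat2, p2 = kstest(bad, "uniform")      # max gap |sqrt(t) - t| is ~0.25
```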
<p>Finally, rather than testing for independence in successive observations (say, a time series) one can look at the square of the observed autocorrelations of lag-1, lag-2 and so on, up to lag-<em>k</em> (say <em>k</em> = 10). The absence of autocorrelations does not imply independence, but this test is easier to perform than a full independence test. The Ljung-Box and the Box-Pierce tests are the most popular ones used in this context, with Ljung-Box converging faster to the limiting (asymptotic) Chi-Squared distribution of the test statistic, as the sample size increases. See <a href="https://en.wikipedia.org/wiki/Ljung%E2%80%93Box_test" target="_blank" rel="noopener">here</a>.</p>
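<p>A minimal, self-contained sketch of the Ljung-Box statistic, using the standard formula <em>Q</em> = <em>n</em>(<em>n</em>+2) &Sigma; &rho;&#770;<sub><em>k</em></sub>&sup2;/(<em>n</em>-<em>k</em>) compared to a Chi-Squared distribution with <em>k</em> degrees of freedom (statsmodels also ships a production version as <code>acorr_ljungbox</code>); the AR(1) series below is a hypothetical example with autocorrelations 0.5^<em>k</em>, far from zero:</p>

```python
import numpy as np
from scipy.stats import chi2

def ljung_box(x, h=10):
    """Q = n(n+2) * sum_{k=1..h} rho_k^2 / (n-k), with a Chi-Squared(h) p-value."""
    n = len(x)
    xc = x - x.mean()
    denom = np.sum(xc ** 2)
    rho = np.array([np.sum(xc[k:] * xc[:-k]) / denom for k in range(1, h + 1)])
    q = n * (n + 2) * np.sum(rho ** 2 / (n - np.arange(1, h + 1)))
    return q, chi2.sf(q, df=h)

rng = np.random.default_rng(3)
e = rng.normal(size=2000)
x = np.empty(2000)
x[0] = e[0]
for t in range(1, 2000):
    x[t] = 0.5 * x[t - 1] + e[t]   # AR(1): strongly autocorrelated

q, p = ljung_box(x, h=10)          # p is essentially zero: dependence detected
```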
<p><span style="font-size: 14pt;"><strong>2. Our Test</strong></span></p>
<p>The data consists of a time series <em>x</em><span style="font-size: 8pt;">1</span>, <em>x</em><span style="font-size: 8pt;">2, ...<span style="font-size: 10pt;">, <em>x</em><span style="font-size: 8pt;"><em>n</em></span></span></span>. We want to test whether successive observations are independent or not, that is, whether <em>x</em><span style="font-size: 8pt;">1</span>, <em>x</em><span style="font-size: 8pt;">2</span>, ..., x<span style="font-size: 8pt;"><em>n</em>-1</span> and <em>x</em><span style="font-size: 8pt;">2</span>, <em>x</em><span style="font-size: 8pt;">3</span>, ..., x<span style="font-size: 8pt;"><em>n</em></span> are independent or not. It can be generalized to a broader test of independence (see section 2.3 <a href="https://www.datasciencecentral.com/profiles/blogs/a-new-test-of-independence" target="_blank" rel="noopener">here</a>) or to bivariate observations: <em>x</em><span style="font-size: 8pt;">1</span>, <em>x</em><span style="font-size: 8pt;">2</span>, ..., <em>x<span style="font-size: 8pt;">n</span></em> versus <em>y</em><span style="font-size: 8pt;">1</span>, <em>y</em><span style="font-size: 8pt;">2</span>, ..., <em>y</em><span style="font-size: 8pt;"><em>n</em></span>. For the sake of simplicity, we assume that the observations are in [0, 1].</p>
<p><strong>2.1. Step #1: Computing some probabilities</strong></p>
<p>The first step of the test consists of computing the following statistics:</p>
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/8779418488?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/8779418488?profile=RESIZE_710x" width="400" class="align-center"/></a></p>
<p>for <em>N</em> vectors (<em><span lang="el" title="Greek-language text" xml:lang="el">α</span></em><span>, </span><span lang="el" title="Greek-language text" xml:lang="el"><em>β</em>)<em>'s,</em></span> where <em><span lang="el" title="Greek-language text" xml:lang="el">α</span></em><span>, </span><span lang="el" title="Greek-language text" xml:lang="el"><em>β</em> </span>are randomly sampled or equally spaced values in [0, 1], and <em>χ</em> is the indicator function: <em>χ</em>(<em>A</em>) = 1 if <em>A</em> is true, otherwise <em>χ</em>(<em>A</em>) = 0. The idea behind the test is intuitive: if <em>q</em>(<em><span lang="el" title="Greek-language text" xml:lang="el">α</span></em><span>, </span><span lang="el" title="Greek-language text" xml:lang="el"><em>β</em></span>) is statistically different from zero for one or more of the randomly chosen (<em><span lang="el" title="Greek-language text" xml:lang="el">α</span></em><span>, </span><span lang="el" title="Greek-language text" xml:lang="el"><em>β</em></span>)'s, then successive observations can not possibly be independent, in other words, <em>x<span style="font-size: 8pt;">k</span></em> and <em>x</em><span style="font-size: 8pt;"><em>k</em>+1</span> are not independent. </p>
<p>In practice, I chose <em>N</em> = 100 vectors (<em><span lang="el" title="Greek-language text" xml:lang="el">α</span></em>, <span lang="el" title="Greek-language text" xml:lang="el"><em>β</em>)</span> <span lang="el" title="Greek-language text" xml:lang="el">evenly distributed on the unit square [0, 1] x [0, 1], assuming that the <em>x<span style="font-size: 8pt;">k</span></em>'s take values in [0, 1] and that <em>n</em> is much larger than <em>N</em>, say n = 25 <em>N</em>. </span></p>
<p><strong>2.2. Step #2: The statistic associated with the test</strong></p>
<p>Two natural statistics for the test are</p>
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/8779295860?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/8779295860?profile=RESIZE_710x" width="200" class="align-center"/></a></p>
<p>The first one <em>S</em>, once standardized, should asymptotically have a Kolmogorov-Smirnov distribution. The second one <em>T</em>, once standardized, should asymptotically have a normal distribution, despite the fact that the various <em>q</em>(<em><span lang="el" title="Greek-language text" xml:lang="el">α</span></em><span>, </span><span lang="el" title="Greek-language text" xml:lang="el"><em>β</em>)'s are never independent. However, we do not care about the theoretical (asymptotic) distribution, thus moving away from the classic statistical approach. We use a methodology that is typical of machine learning, and described in section 2.3.</span></p>
<p><span lang="el" title="Greek-language text" xml:lang="el">Nevertheless, the principle is the same in both cases: the higher the value of <em>S</em> or <em>T</em> computed on the data set, the more likely we are to reject the assumption of independence. Of the two statistics, <em>T</em> is less volatile than <em>S</em> and may be preferred, but <em>S</em> is better at detecting very small departures from independence.</span></p>
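<p>As a sketch (assuming, consistent with section 2.1, that <em>q</em>(<em>&alpha;</em>, <em>&beta;</em>) is the joint empirical CDF of (<em>x<sub>k</sub></em>, <em>x</em><sub><em>k</em>+1</sub>) minus the product of the two marginal empirical CDFs, which is zero in expectation under independence), <em>q</em>, <em>S</em> and <em>T</em> can be computed as follows:</p>

```python
import numpy as np

def q_grid(x, m=10):
    """q(alpha, beta) on an m x m evenly spaced grid of the unit square:
    joint empirical CDF of (x_k, x_{k+1}) minus the product of the marginals."""
    u, v = x[:-1], x[1:]
    grid = np.arange(1, m + 1) / (m + 1)
    return np.array([[np.mean((u < a) & (v < b)) - np.mean(u < a) * np.mean(v < b)
                      for b in grid] for a in grid])

rng = np.random.default_rng(0)
x = rng.uniform(size=2500)     # independent deviates: q should be near zero
q = q_grid(x)                  # N = 100 vectors (alpha, beta)
S = np.max(np.abs(q))          # Kolmogorov-Smirnov-like statistic
T = np.sum(q ** 2)
```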
<p><strong>2.3. Step #3: Assessing statistical significance</strong></p>
<p>The technique described here is generic, intuitive, and simple. It applies to any statistical test of hypotheses, not just tests of independence, and it is somewhat similar to cross-validation. It consists of reshuffling the observations in various ways (see the <a href="https://en.wikipedia.org/wiki/Resampling_(statistics)" target="_blank" rel="noopener">resampling entry</a> in Wikipedia to see how it actually works) and computing <em>S</em> (or <em>T</em>) for each of, say, 10 different reshuffled time series. Reshuffling destroys any serial, pairwise dependence, so these 10 values give you an idea of the distribution of <em>S</em> (or <em>T</em>) under independence. Now compute <em>S</em> on the original time series. Is it higher than all 10 values computed on the reshuffled series? If yes, you have a 90% chance that the original time series exhibits serial, pairwise dependency. </p>
<p>A better but more complicated method consists of computing the empirical distribution of the <em>x<span style="font-size: 8pt;">k</span></em>'s, then generating 10 <em>n</em> independent deviates from that distribution. This yields 10 time series, each with <em>n</em> independent observations. Compute <em>S</em> for each of these time series, and compare with the value of <em>S</em> computed on the original time series. If the value computed on the original time series is higher, then you have a 90% chance that the original time series exhibits serial, pairwise dependency. This is the preferred method if the original time series has strong, long-range autocorrelations.</p>
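<p>The reshuffling version of the test can be sketched as follows. The dependent series below is a hypothetical slow random walk on [0, 1], chosen only so that successive values sit close to each other; <code>S_stat</code> assumes the grid-based definition of <em>q</em> from section 2.1:</p>

```python
import numpy as np

def S_stat(x, m=10):
    """Max of |q(alpha, beta)| over an m x m evenly spaced grid."""
    u, v = x[:-1], x[1:]
    grid = np.arange(1, m + 1) / (m + 1)
    return max(abs(np.mean((u < a) & (v < b)) - np.mean(u < a) * np.mean(v < b))
               for a in grid for b in grid)

rng = np.random.default_rng(1)
x = np.cumsum(0.2 * rng.uniform(size=2500)) % 1.0   # serially dependent walk

s_obs = S_stat(x)
s_null = [S_stat(rng.permutation(x)) for _ in range(10)]  # reshuffled copies

# if s_obs beats all 10 reshuffled values, reject independence at ~ the 90% level
reject = s_obs > max(s_null)
```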
<p><strong>2.4. Test data set and results</strong></p>
<p>I tested the methodology on an artificial data set (a discrete dynamical system) created as follows: <em>x</em><span style="font-size: 8pt;">1</span> = log(2) and <em>x</em><span style="font-size: 8pt;"><em>n</em>+1</span> = <em>b</em> <em>x<span style="font-size: 8pt;">n</span></em> - INT(<em>b x<span style="font-size: 8pt;">n</span></em>). Here <em>b</em> is an integer larger than 1, and INT is the integer part function. The data generated behaves like any real time series, and has the following properties.</p>
<ul>
<li>The theoretical distribution of the <em>x<span style="font-size: 8pt;">k</span></em>'s is uniform on [0, 1]</li>
<li>The lag-<em>k</em> autocorrelation is known and equal to 1 / <em>b</em>^<em>k</em> (<em>b</em> at power <em>k</em>)</li>
</ul>
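<p>A word of caution if you want to replicate this data set: iterating <em>x</em> &rarr; <em>b x</em> - INT(<em>b x</em>) in standard floating-point arithmetic destroys about log<sub>2</sub>(<em>b</em>) bits of precision per step, so the sequence degenerates after roughly 52/log<sub>2</sub>(<em>b</em>) iterations. The sketch below uses exact integer arithmetic instead, with a random high-precision seed playing the role of log(2) (an assumption on my part: the dynamics are the same for almost every seed):</p>

```python
import numpy as np

def chaos_series(n, b=4, bits=20_000, seed=42):
    """x_{k+1} = b*x_k - INT(b*x_k), iterated exactly on a seed with `bits`
    binary digits; each step consumes log2(b) of those digits."""
    rng = np.random.default_rng(seed)
    denom = 1 << bits
    x_int = int.from_bytes(rng.bytes(bits // 8), "big")  # seed in [0, 1)
    out = np.empty(n)
    for k in range(n):
        out[k] = x_int / denom
        x_int = (b * x_int) % denom
    return out

x = chaos_series(2500, b=4)
r = np.corrcoef(x[:-1], x[1:])[0, 1]   # theoretical lag-1 autocorrelation: 1/b = 0.25
```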
<p>It is thus easy to test for independence and to benchmark various statistical tests: the larger <em>b</em>, the closer we are to independence. With a pseudo-random number generator, one can generate a time series of independently and identically distributed deviates, uniform on [0, 1], to obtain the distribution of <em>S</em> (or <em>T</em>) and its expectation under true independence, and compare it with the values of <em>S</em> (or <em>T</em>) computed on the artificial data for various values of <em>b</em>. In this test, with <em>N</em> = 100, <em>n</em> = 2500, and <em>b</em> = 4 (corresponding to an autocorrelation of 0.25), the value of <i>S</i> is 6 times larger than the one obtained under full independence. For <em>b</em> = 8 (corresponding to an autocorrelation of 0.125), <i>S</i> is 3 times larger than the one obtained under full independence. This validates the test, at least on this kind of data set, as it correctly detects the lack of independence by yielding abnormally high values of <em>S</em> when the independence assumption is violated.</p>
<p><strong>Note</strong>: Another interesting feature of the dataset used here is this: using <em>b</em>^<em>k</em> (<em>b</em> at power <em>k</em>) instead of <em>b</em>, is equivalent to checking lag-<em>k</em> independence, that is, independence between <em>x</em><span style="font-size: 8pt;">1</span>, <em>x</em><span style="font-size: 8pt;">2</span>, ... and <em>x</em><span style="font-size: 8pt;">1+<em>k</em></span>, <em>x</em><span style="font-size: 8pt;">2+<em>k</em></span>, ... in the original time series corresponding to <em>b</em>. The reason being that in the original series (corresponding to <em>b</em>), we have x<span style="font-size: 8pt;"><i>n</i>+<em>k</em></span> = <em>b</em>^<em>k</em> x<span style="font-size: 10.6667px;"><i>n</i></span> - INT(<em>b</em>^<em>k</em> <em>x<span style="font-size: 10.6667px;">n</span></em>).</p>
<p></p>
<p><span><em>To receive a weekly digest of our new articles, subscribe to our newsletter, <a href="https://www.datasciencecentral.com/profiles/blogs/check-out-our-dsc-newsletter" target="_blank" rel="noopener">here</a>.</em></span></p>
<p><span><em><strong>About the author</strong>: Vincent Granville is a d<span class="lt-line-clamp__raw-line">ata science pioneer, mathematician, book author (Wiley), patent owner, former post-doc at Cambridge University, former VC-funded executive, with 20+ years of corporate experience including CNET, NBC, Visa, Wells Fargo, Microsoft, eBay. Vincent is also self-publisher at <a href="http://datashaping.com/" target="_blank" rel="noopener">DataShaping.com</a>, and founded and co-founded a few start-ups, including one with a successful exit (Data Science Central acquired by Tech Target).</span> You can access Vincent's articles and books, <a href="http://datashaping.com/" target="_blank" rel="noopener">here</a>.</em></span></p>
A Plethora of Machine Learning Tricks, Recipes, and Statistical Models
tag:www.datasciencecentral.com,2021-04-06:6448529:BlogPost:1046327
2021-04-06T03:59:22.000Z
Vincent Granville
https://www.datasciencecentral.com/profile/VincentGranville
<p></p>
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/8760416479?profile=original" rel="noopener" target="_blank"><img class="align-center" src="https://storage.ning.com/topology/rest/1.0/file/get/8760416479?profile=RESIZE_710x" width="400"></img></a></p>
<p style="text-align: center;"><em>Source: See article #5, in section 1</em></p>
<p><span>Part 2 of this short series focused on fundamental techniques, see <a href="https://www.datasciencecentral.com/profiles/blogs/a-plethora-of-machine-learning-articles-part-2" rel="noopener" target="_blank">here</a>. In this Part 3, you will find several…</span></p>
<p></p>
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/8760416479?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/8760416479?profile=RESIZE_710x" width="400" class="align-center"/></a></p>
<p style="text-align: center;"><em>Source: See article #5, in section 1</em></p>
<p><span>Part 2 of this short series focused on fundamental techniques, see <a href="https://www.datasciencecentral.com/profiles/blogs/a-plethora-of-machine-learning-articles-part-2" target="_blank" rel="noopener">here</a>. In this Part 3, you will find several machine learning tricks and recipes, many with a statistical flavor. These are articles that I wrote in the last few years. The whole series will feature articles related to the following aspects of machine learning:</span></p>
<ul>
<li><span>Mathematics, simulations, benchmarking algorithms based on synthetic data (in short, experimental data science)</span></li>
<li><span>Opinions, for instance about the value of a PhD in our field, or the use of some techniques</span></li>
<li><span>Methods, principles, rules of thumb, recipes, tricks</span></li>
<li><span><a href="https://www.datasciencecentral.com/profiles/blogs/a-plethora-of-machine-learning-articles-part-1" target="_blank" rel="noopener">Business analytics</a> </span></li>
<li><span><a href="https://www.datasciencecentral.com/profiles/blogs/a-plethora-of-machine-learning-articles-part-2" target="_blank" rel="noopener">Core Techniques</a> </span></li>
</ul>
<p><span>My articles are always written in simple English and are accessible to professionals with typically one year of calculus or statistical training at the undergraduate level. They are geared towards people who use data but are interested in gaining more practical analytical experience. Managers and decision makers are part of my intended audience. The style is compact, geared towards people who do not have a lot of free time. </span></p>
<p><span>Despite these restrictions, state-of-the-art, off-the-beaten-path results, as well as machine learning trade secrets and research material, are frequently shared. References to more advanced literature (from myself and other authors) are provided for those who want to dig deeper into the topics discussed. </span></p>
<p><span><strong>1. Machine Learning Tricks, Recipes and Statistical Models</strong></span></p>
<p><span>These articles focus on techniques that have wide applications or that are otherwise fundamental or seminal in nature.</span></p>
<ol>
<li><a href="https://www.datasciencecentral.com/profiles/blogs/defining-and-measuring-chaos-in-data-sets-why-and-how-in-simple-w">Defining and Measuring Chaos in Data Sets: Why and How, in Simple Words</a></li>
<li><a href="https://www.datasciencecentral.com/profiles/blogs/hurwitz-riemann-zeta-and-other-special-probability-distributions">Hurwitz-Riemann Zeta And Other Special Probability Distributions</a></li>
<li><a href="https://www.datasciencecentral.com/profiles/blogs/maximum-runs-in-bernoulli-trials">Maximum runs in Bernoulli trials: simulations and results</a></li>
<li><a href="https://www.datasciencecentral.com/profiles/blogs/moving-averages-natural-weights-iterated-convolutions-and-central" target="_blank" rel="noopener">Moving Averages: Natural Weights, Iterated Convolutions, and Central Limit Theorem</a></li>
<li><a href="https://www.datasciencecentral.com/profiles/blogs/things-you-did-not-know-you-could-do-with-excel" target="_blank" rel="noopener">Amazing Things You Did Not Know You Could Do in Excel</a></li>
<li><a href="https://www.datasciencecentral.com/profiles/blogs/a-new-test-of-independence" target="_blank" rel="noopener">New Tests of Randomness and Independence for Sequences of Observations</a></li>
<li><a href="https://www.datasciencecentral.com/profiles/blogs/interesting-application-of-the-poisson-binomial-distribution" target="_blank" rel="noopener">Interesting Application of the Poisson-Binomial Distribution</a></li>
<li><a href="https://www.datasciencecentral.com/profiles/blogs/alternative-to-the-arithmetic-geometric-and-harmonic-means" target="_blank" rel="noopener">Alternative to the Arithmetic, Geometric, and Harmonic Means</a></li>
<li><a href="https://www.datasciencecentral.com/profiles/blogs/bernouilli-lattice-models-connection-to-poisson-processes" target="_blank" rel="noopener">Bernouilli Lattice Models - Connection to Poisson Processes</a></li>
<li><a href="https://www.datasciencecentral.com/forum/topics/simulating-distributions-with-one-line-of-code" target="_blank" rel="noopener">Simulating Distributions with One-Line Formulas, even in Excel</a></li>
<li><a href="https://www.datasciencecentral.com/profiles/blogs/simplified-logistic-regression" target="_blank" rel="noopener">Simplified Logistic Regression</a></li>
<li><a href="https://www.datasciencecentral.com/profiles/blogs/simple-trick-to-normalize-correlations-r-squared-and-so-on" target="_blank" rel="noopener">Simple Trick to Normalize Correlations, R-squared, and so on</a></li>
<li><a href="https://www.datasciencecentral.com/profiles/blogs/simple-trick-to-remove-serial-correlation-in-regression-models" target="_blank" rel="noopener">Simple Trick to Remove Serial Correlation in Regression Models</a></li>
<li><a href="https://www.datasciencecentral.com/profiles/blogs/a-beautiful-result-in-probability-theory" target="_blank" rel="noopener">A Beautiful Result in Probability Theory</a></li>
<li><a href="https://www.datasciencecentral.com/profiles/blogs/long-range-correlation-in-time-series-tutorial-and-case-study" target="_blank" rel="noopener">Long-range Correlations in Time Series: Modeling, Testing, Case Study</a></li>
<li><a href="https://www.datasciencecentral.com/profiles/blogs/difference-between-correlation-and-regression-in-statistics" target="_blank" rel="noopener">Difference Between Correlation and Regression in Statistics</a></li>
</ol>
<p><span><strong>2. Free books</strong></span></p>
<ul>
<li><span><b>Statistics: New Foundations, Toolbox, and Machine Learning Recipes</b></span><p><span>Available <a href="https://www.datasciencecentral.com/profiles/blogs/free-book-statistics-new-foundations-toolbox-and-machine-learning">here</a>. In about 300 pages and 28 chapters it covers many new topics, offering a fresh perspective on the subject, including rules of thumb and recipes that are easy to automate or integrate in black-box systems, as well as new model-free, data-driven foundations to statistical science and predictive analytics. The approach focuses on robust techniques; it is bottom-up (from applications to theory), in contrast to the traditional top-down approach.</span></p>
<p><span>The material is accessible to practitioners with a one-year college-level exposure to statistics and probability. The compact and tutorial style, featuring many applications with numerous illustrations, is aimed at practitioners, researchers, and executives in various quantitative fields.</span></p>
</li>
<li><span><b>Applied Stochastic Processes</b></span><p><span>Available <a href="https://www.datasciencecentral.com/profiles/blogs/fee-book-applied-stochastic-processes">here</a>. Full title: Applied Stochastic Processes, Chaos Modeling, and Probabilistic Properties of Numeration Systems (104 pages, 16 chapters.) This book is intended for professionals in data science, computer science, operations research, statistics, machine learning, big data, and mathematics. In 100 pages, it covers many new topics, offering a fresh perspective on the subject.</span></p>
<p><span>It is accessible to practitioners with a two-year college-level exposure to statistics and probability. The compact and tutorial style, featuring many applications (Blockchain, quantum algorithms, HPC, random number generation, cryptography, Fintech, web crawling, statistical testing) with numerous illustrations, is aimed at practitioners, researchers and executives in various quantitative fields.</span></p>
</li>
</ul>
<p></p>
<p><span><em>To receive a weekly digest of our new articles, subscribe to our newsletter, <a href="https://www.datasciencecentral.com/profiles/blogs/check-out-our-dsc-newsletter" target="_blank" rel="noopener">here</a>.</em></span></p>
<p><span><em><strong>About the author</strong>: Vincent Granville is a d<span class="lt-line-clamp__raw-line">ata science pioneer, mathematician, book author (Wiley), patent owner, former post-doc at Cambridge University, former VC-funded executive, with 20+ years of corporate experience including CNET, NBC, Visa, Wells Fargo, Microsoft, eBay. Vincent is also self-publisher at <a href="http://datashaping.com/" target="_blank" rel="noopener">DataShaping.com</a>, and founded and co-founded a few start-ups, including one with a successful exit (Data Science Central acquired by Tech Target).</span> You can access Vincent's articles and books, <a href="http://datashaping.com/" target="_blank" rel="noopener">here</a>.</em></span></p>
Defining and Measuring Chaos in Data Sets: Why and How, in Simple Words
tag:www.datasciencecentral.com,2021-03-29:6448529:BlogPost:1045635
2021-03-29T00:00:00.000Z
Vincent Granville
https://www.datasciencecentral.com/profile/VincentGranville
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/8735877694?profile=original" rel="noopener" target="_blank"><img class="align-full" src="https://storage.ning.com/topology/rest/1.0/file/get/8735877694?profile=RESIZE_710x" width="720"></img></a></p>
<p>There are many ways chaos is defined, each scientific field and each expert having its own definitions. We share here a few of the most common metrics used to quantify the level of chaos in univariate time series or data sets. We also introduce a new, simple definition based on metrics that are familiar to everyone. Generally speaking, chaos…</p>
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/8735877694?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/8735877694?profile=RESIZE_710x" width="720" class="align-full"/></a></p>
<p>There are many ways chaos is defined, each scientific field and each expert having its own definitions. We share here a few of the most common metrics used to quantify the level of chaos in univariate time series or data sets. We also introduce a new, simple definition based on metrics that are familiar to everyone. Generally speaking, chaos measures how unpredictable a system is, be it the weather, stock prices, economic time series, medical or biological indicators, earthquakes, or anything that has some level of randomness. </p>
<p>In most applications, various statistical models (or data-driven, model-free techniques) are used to make predictions. Model selection and comparison can be based on testing various models, each with its own level of chaos. Sometimes, time series do not have an auto-correlation function due to the high level of variability in the observations: for instance, when the theoretical variance of the model is infinite. An example is provided in section 2.2 <a href="https://www.datasciencecentral.com/profiles/blogs/hurwitz-riemann-zeta-and-other-special-probability-distributions" target="_blank" rel="noopener">in this article</a> (see picture below), used to model extreme events. In this case, chaos is a handy metric, and it allows you to build and use models that are otherwise ignored or unknown by practitioners. </p>
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/8725268092?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/8725268092?profile=RESIZE_710x" width="450" class="align-center"/></a></p>
<p style="text-align: center;"><strong>Figure 1</strong>: <em>Time series with undefined autocorrelation; instead, chaos is used to measure predictability</em></p>
<p>Below are various definitions of chaos, depending on the context in which they are used. References explaining how to compute these metrics are provided in each case.</p>
<p><strong>Hurst exponent</strong></p>
<p>The <a href="https://en.wikipedia.org/wiki/Hurst_exponent" target="_blank" rel="noopener">Hurst exponent</a> <em>H</em> measures the level of smoothness in a time series, and in particular its level of long-term memory. <em>H</em> takes on values between 0 and 1, with <em>H</em> = 1/2 corresponding to Brownian motion and <em>H</em> = 0 corresponding to pure white noise. Higher values correspond to smoother time series, and lower values to more rugged data. Examples of time series with various values of <em>H</em> are found <a href="https://www.datasciencecentral.com/profiles/blogs/long-range-correlation-in-time-series-tutorial-and-case-study" target="_blank" rel="noopener">in this article</a> (see picture below). The same article explains the relation to the <em>detrending moving average</em>, another metric used to measure chaos. Also, <em>H</em> is related to the fractal dimension. Applications include stock price modeling.</p>
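<p>As an illustration, <em>H</em> can be estimated from the scaling of lagged differences: for a self-similar series, the standard deviation of <em>x</em>(<em>t</em>+<em>k</em>) - <em>x</em>(<em>t</em>) grows like <em>k</em>^<em>H</em>. The sketch below uses this simple estimator (one of several possible methods) and recovers <em>H</em> close to 1/2 for simulated Brownian motion:</p>

```python
import numpy as np

def hurst_exponent(x, max_lag=20):
    """Estimate the Hurst exponent H from the scaling law
    std(x[t+k] - x[t]) ~ k^H, via a log-log regression."""
    lags = np.arange(2, max_lag)
    tau = [np.std(x[lag:] - x[:-lag]) for lag in lags]
    # The slope of the log-log fit is the Hurst exponent
    H, _ = np.polyfit(np.log(lags), np.log(tau), 1)
    return H

rng = np.random.default_rng(0)
bm = np.cumsum(rng.standard_normal(10_000))  # Brownian motion: expect H near 0.5
print(round(hurst_exponent(bm), 2))
```

<p>For pure white noise, the lagged differences have the same spread at every lag, so the fitted slope (and hence the estimated <em>H</em>) is close to 0, in line with the description above.</p>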
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/8725551894?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/8725551894?profile=RESIZE_710x" width="350" class="align-center"/></a></p>
<p style="text-align: center;"><strong>Figure 2</strong>: <em>Time series with H = 1/2 (top), and H close to 1 (bottom)</em></p>
<p><strong>Lyapunov exponent</strong></p>
<p>In dynamical systems, the Lyapunov exponent quantifies how sensitive a system is to its initial conditions. Intuitively, the more sensitive to initial conditions, the more chaotic the system is. For instance, the dyadic system <em>x<span style="font-size: 8pt;">n</span></em><span style="font-size: 8pt;">+1</span> = 2<em>x<span style="font-size: 8pt;">n</span></em> - INT(2<em>x<span style="font-size: 8pt;">n</span></em>), where INT represents the integer part function, is very sensitive to the initial condition <em>x</em><span style="font-size: 8pt;">0</span>. A very small change in the value of <em>x</em><span style="font-size: 8pt;">0</span> results in values of <em>x<span style="font-size: 8pt;">n</span></em> that are totally different, even for <em>n</em> as low as 45. See how to compute the Lyapunov exponent <a href="https://en.wikipedia.org/wiki/Lyapunov_exponent" target="_blank" rel="noopener">here</a>.</p>
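<p>A minimal sketch of this sensitivity, using the dyadic map <em>x</em> → 2<em>x</em> - INT(2<em>x</em>) (a standard chaotic example): two orbits starting 10^-12 apart become macroscopically different within a few dozen iterations, since the gap roughly doubles at each step.</p>

```python
import numpy as np

def dyadic(x, n):
    """Iterate the dyadic map x -> 2x - INT(2x) for n steps; return the orbit."""
    orbit = [x]
    for _ in range(n):
        x = (2 * x) % 1.0
        orbit.append(x)
    return np.array(orbit)

a = dyadic(0.1, 50)
b = dyadic(0.1 + 1e-12, 50)   # perturb the initial condition by 10^-12
sep = np.abs(a - b)
# The gap doubles each step: 2^40 * 10^-12 is already of order 1
print(int(np.argmax(sep > 0.1)))
```

<p>The printed value is the first iteration at which the two orbits differ by more than 0.1, illustrating why long-term forecasts of chaotic systems are hopeless even with near-perfect initial data.</p>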
<p><strong>Fractal dimension</strong></p>
<p>A one-dimensional curve can be defined parametrically by a system of two equations. For instance, <em>x</em>(<em>t</em>) = sin(<em>t</em>), <em>y</em>(<em>t</em>) = cos(<em>t</em>) represents a circle of radius 1, centered at the origin. Typically, <em>t</em> is referred to as the time, and the curve itself is called an orbit. In some cases, as <em>t</em> increases, the orbit fills more and more space in the plane, eventually covering an area so dense that it appears to be an object with a dimension strictly between 1 and 2. An example is provided in section 2 <a href="https://www.datasciencecentral.com/profiles/blogs/spectacular-visualization-the-eye-of-the-riemann-zeta-function" target="_blank" rel="noopener">in this article</a>, and pictured below. A formal definition of fractal dimension can be found <a href="https://en.wikipedia.org/wiki/Fractal_dimension" target="_blank" rel="noopener">here</a>.</p>
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/8725489684?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/8725489684?profile=RESIZE_710x" width="500" class="align-center"/></a></p>
<p style="text-align: center;"><strong>Figure 3</strong>: <em>Example of a curve filling a dense area (fractal dimension > 1)</em></p>
<p>The picture in Figure 3 is related to the Riemann hypothesis. A meteorologist who sees the connection to hurricanes and their eye could shed some light on how to attack this infamous mathematical conjecture, based on the physical laws governing hurricanes. Conversely, this picture (and the underlying mathematics) could also be used as a statistical model for hurricane modeling and forecasting. </p>
<p><strong>Approximate entropy</strong></p>
<p>In statistics, the approximate entropy is a metric used to quantify regularity and predictability in time series fluctuations. Applications include medical data, finance, physiology, human factors engineering, and climate sciences. See the Wikipedia entry, <a href="https://en.wikipedia.org/wiki/Approximate_entropy" target="_blank" rel="noopener">here</a>.</p>
<p>It should not be confused with <a href="https://en.wikipedia.org/wiki/Entropy" target="_blank" rel="noopener">entropy</a>, which measures the amount of information attached to a specific probability distribution (with the uniform distribution on [0, 1] achieving maximum entropy among all continuous distributions on [0, 1], and the normal distribution achieving maximum entropy among all continuous distributions defined on the real line, with a specific variance). Entropy is used to compare the efficiency of various encryption systems, and has been used in feature selection strategies in machine learning, see <a href="https://www.datasciencecentral.com/profiles/blogs/feature-selection-a-simple-solution" target="_blank" rel="noopener">here</a>.</p>
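<p>Returning to approximate entropy: it is easy to compute directly from its definition. The sketch below is a compact (unoptimized) implementation following Pincus' formulation, with template length <em>m</em> and tolerance <em>r</em>; a regular signal such as a sine wave scores lower than white noise:</p>

```python
import numpy as np

def approx_entropy(x, m=2, r=0.2):
    """Approximate entropy ApEn(m, r) of a 1-D series (Pincus' definition)."""
    x = np.asarray(x, dtype=float)
    N = len(x)

    def phi(m):
        # All overlapping templates of length m
        templates = np.array([x[i:i + m] for i in range(N - m + 1)])
        # For each template, the fraction of templates within
        # Chebyshev distance r of it (self-matches included)
        dist = np.max(np.abs(templates[:, None] - templates[None, :]), axis=2)
        C = np.mean(dist <= r, axis=1)
        return np.mean(np.log(C))

    return phi(m) - phi(m + 1)

rng = np.random.default_rng(0)
print(approx_entropy(np.sin(np.linspace(0, 8 * np.pi, 300))))  # regular: low
print(approx_entropy(rng.standard_normal(300)))                # irregular: higher
```

<p>In practice <em>r</em> is often taken as a fraction (for example 0.2) of the standard deviation of the series, so that the metric is scale-free.</p>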
<p><strong>Independence metric </strong></p>
<p>Here I discuss some metrics that are of interest in the context of dynamical systems, offering an alternative to the Lyapunov exponent to measure chaos. While the Lyapunov exponent deals with sensitivity to initial conditions, the classic statistics mentioned here deal with measuring predictability for a single instance (an observed time series) of a dynamical system. However, they are most useful for comparing the level of chaos between two different dynamical systems with similar properties. A dynamical system is a sequence <em>x</em><span style="font-size: 8pt;"><em>n</em>+1</span> = <em>T</em>(<em>x<span style="font-size: 8pt;">n</span></em>), with initial condition <em>x</em><span style="font-size: 8pt;">0</span>. Examples are provided in my last two articles, <a href="https://www.datasciencecentral.com/profiles/blogs/an-easy-way-to-solve-complex-optimization-problems" target="_blank" rel="noopener">here</a> and <a href="https://www.datasciencecentral.com/profiles/blogs/hurwitz-riemann-zeta-and-other-special-probability-distributions" target="_blank" rel="noopener">here</a>. See also <a href="https://www.datasciencecentral.com/profiles/blogs/beautiful-mathematical-images" target="_blank" rel="noopener">here</a>. </p>
<p>A natural metric to measure chaos is the maximum autocorrelation in absolute value, between the sequence (<em>x<span style="font-size: 8pt;">n</span></em>) and the shifted sequences (<em>x</em><span style="font-size: 8pt;"><em>n</em>+<em>k</em></span>), for <em>k</em> = 1, 2, and so on. Its value is maximum and equal to 1 in case of periodicity, and minimum and equal to 0 for the most chaotic cases. However, some sequences attached to dynamical systems, such as the digit sequence pictured in Figure 1 in this article, do not have theoretical autocorrelations: these do not exist because the underlying expectation or variance is infinite or undefined. A possible solution with positive sequences is to compute the autocorrelations on <em>y<span style="font-size: 8pt;">n</span></em> = log(<em>x<span style="font-size: 8pt;">n</span></em>) rather than on the <em>x<span style="font-size: 8pt;">n</span></em>'s.</p>
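<p>This metric is straightforward to compute. The sketch below contrasts a periodic sequence (metric near 1) with white noise (metric near 0):</p>

```python
import numpy as np

def max_abs_autocorr(x, max_lag=20):
    """Maximum absolute autocorrelation over lags 1..max_lag:
    close to 1 for periodic sequences, close to 0 for highly chaotic ones."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    denom = np.dot(x, x)
    r = [abs(np.dot(x[:-k], x[k:]) / denom) for k in range(1, max_lag + 1)]
    return max(r)

rng = np.random.default_rng(0)
periodic = np.tile([0.0, 1.0, 0.5], 300)   # period 3: peak at lag 3
print(round(max_abs_autocorr(periodic), 2))
print(round(max_abs_autocorr(rng.random(900)), 2))
```

<p>For sequences with infinite variance, the same function can be applied to the log-transformed values, as suggested above.</p>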
<p>In addition, there may be strong non-linear dependencies, and thus high predictability, in a sequence (<em>x<span style="font-size: 8pt;">n</span></em>) even if all autocorrelations are zero. Hence the desire to build a better metric. In my next article, I will introduce a metric measuring the level of independence, as a proxy for quantifying chaos. It will be similar in some ways to the Kolmogorov-Smirnov metric used to test independence and illustrated <a href="https://projecteuclid.org/journals/electronic-journal-of-statistics/volume-8/issue-2/A-Kolmogorov-Smirnov-type-test-for-independence-between-marks-and/10.1214/14-EJS961.full" target="_blank" rel="noopener">here</a>. However, it will involve little theory, relying instead on a machine learning approach and data-driven, model-free techniques to build confidence intervals and compare the amount of chaos in two dynamical systems: one fully chaotic versus one not fully chaotic. Some of this is discussed <a href="https://math.stackexchange.com/questions/4079669/question-about-a-special-test-of-independence-autocorrelation" target="_blank" rel="noopener">here</a>.</p>
<p>I did not include the variance as a metric to measure chaos, as the variance can always be standardized by a change of scale, unless it is infinite.</p>
<p></p>
<p><span><em>To receive a weekly digest of our new articles, subscribe to our newsletter, <a href="https://www.datasciencecentral.com/profiles/blogs/check-out-our-dsc-newsletter" target="_blank" rel="noopener">here</a>.</em></span></p>
<p><span><em><strong>About the author</strong>: Vincent Granville is a d<span class="lt-line-clamp__raw-line">ata science pioneer, mathematician, book author (Wiley), patent owner, former post-doc at Cambridge University, former VC-funded executive, with 20+ years of corporate experience including CNET, NBC, Visa, Wells Fargo, Microsoft, eBay. Vincent is also self-publisher at <a href="http://datashaping.com/" target="_blank" rel="noopener">DataShaping.com</a>, and founded and co-founded a few start-ups, including one with a successful exit (Data Science Central acquired by Tech Target).</span> You can access Vincent's articles and books, <a href="http://datashaping.com/" target="_blank" rel="noopener">here</a>.</em></span></p>
Hurwitz-Riemann Zeta And Other Special Probability Distributions
tag:www.datasciencecentral.com,2021-03-22:6448529:BlogPost:1044813
2021-03-22T05:30:00.000Z
Vincent Granville
https://www.datasciencecentral.com/profile/VincentGranville
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/8691835652?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/8691835652?profile=RESIZE_710x" width="600" class="align-center"/></a></p>
<p style="text-align: center;"><em>Source: <a href="https://www.datasciencecentral.com/profiles/blogs/babar-mimou" target="_blank" rel="noopener">here</a></em></p>
<p>In my previous article <a href="https://www.datasciencecentral.com/profiles/blogs/an-easy-way-to-solve-complex-optimization-problems" target="_blank" rel="noopener">here</a>, I discussed a simple way to solve complex optimization problems in machine learning. This was illustrated in the case of complex dynamical systems, involving non-linear equations in infinite dimensions, known as functional equations. These equations were solved using a fixed point algorithm, of which the Newton–Raphson method is a well known, widely used example.</p>
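<p>As a one-dimensional illustration of the fixed-point idea: the sketch below solves <em>x</em> = cos(<em>x</em>) by direct iteration, and again via Newton&ndash;Raphson, which is itself a fixed-point scheme applied to <em>f</em>(<em>x</em>) = cos(<em>x</em>) - <em>x</em>. The infinite-dimensional functional equations discussed here are solved in the same spirit, iterating on a function rather than on a number.</p>

```python
import math

def fixed_point(g, x0, tol=1e-12, max_iter=200):
    """Iterate x <- g(x) until successive values differ by less than tol."""
    x = x0
    for _ in range(max_iter):
        x_new = g(x)
        if abs(x_new - x) < tol:
            return x_new
        x = x_new
    return x

# Solve x = cos(x) by direct fixed-point iteration ...
plain = fixed_point(math.cos, 1.0)

# ... and via Newton-Raphson for f(x) = cos(x) - x, f'(x) = -sin(x) - 1
newton = fixed_point(lambda x: x - (math.cos(x) - x) / (-math.sin(x) - 1), 1.0)

print(plain, newton)  # both converge to 0.739085...
```

<p>Newton&ndash;Raphson converges in far fewer iterations than the plain scheme; the trade-off between speed and robustness is the same one faced when iterating on distributions instead of numbers.</p>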
<p>These equations are typically solved numerically, as no theoretical solution is known in most cases. Nevertheless, in our case, a few examples have an exact, known solution. These examples with known solution are very useful, in the sense that they allow you to test your numerical algorithm and assess how fast it converges, or not. All the solutions were probability distributions, and in this article we introduce an even larger, generic class of problems (chaotic discrete dynamical systems) with known solution. The distributions presented here can thus be used as tests to benchmark optimization algorithms, but they also have their own interest for statistical modeling purposes, especially in risk management and extreme event modeling.</p>
<p>Each dynamical system discussed here (or in my previous article) comes with two distributions:</p>
<ul>
<li>A continuous one on [0, 1], known as the <em>invariant distribution</em>.</li>
<li>A discrete one taking on strictly positive integer values, known as the <em>digit distribution</em>.</li>
</ul>
<p>These distributions are also very useful in number theory, though this will not be discussed here. The names Hurwitz and Riemann Zeta are just a reminder of their strong connection to number theory problems such as continued fractions, approximation of irrational numbers by rational ones, the construction and distribution of the digits of random numbers in various numeration systems, and the famous <a href="https://en.wikipedia.org/wiki/Riemann_hypothesis" target="_blank" rel="noopener">Riemann Hypothesis</a> that has a one million dollar prize attached to it. Some of this is discussed <a href="https://mathoverflow.net/questions/383925/about-generalized-continued-fractions" target="_blank" rel="noopener">here</a> and in some of my past MathOverflow questions. However, our focus here is applications in machine learning.</p>
<p><span style="font-size: 14pt;"><strong>1. The Hurwitz-Riemann Zeta distribution</strong></span></p>
<p>Without diving into the details, let me first briefly discuss other Riemann-related distributions invented by different authors. For a definition of the Hurwitz function, see <a href="https://en.wikipedia.org/wiki/Hurwitz_zeta_function" target="_blank" rel="noopener">here</a>. It generalizes the <a href="https://en.wikipedia.org/wiki/Riemann_zeta_function" target="_blank" rel="noopener">Riemann Zeta function</a>. The most well known probability distribution related to these functions is the discrete <a href="https://en.wikipedia.org/wiki/Zipf%27s_law" target="_blank" rel="noopener">Zipf distribution</a>. It is well known by machine learning practitioners, and used to model phenomena such as "the top 10 websites amount to (say) 95% of the Internet traffic". Another example, this time continuous over the set of all positive real numbers, can be found <a href="https://benthamopen.com/FULLTEXT/TOSPJ-7-53" target="_blank" rel="noopener">here</a>. The paper is entitled <em>A New Class of Distributions Based on Hurwitz Zeta Function with Applications for Risk Management</em>. The author defines a family of distributions that generalizes the exponential power, normal, gamma, Weibull, Rayleigh, Maxwell-Boltzmann and chi-squared distributions, with applications in actuarial sciences. Finally, there is also a well known example (for mathematicians) defined on the complex plane, see <a href="https://arxiv.org/pdf/1504.03438.pdf" target="_blank" rel="noopener">here</a>. The paper is entitled <em>A complete Riemann zeta distribution and the Riemann hypothesis</em>.</p>
<p>Our Hurwitz-Riemann Zeta distribution is yet another example arising this time from discrete dynamical systems, continuous on [0, 1]. It also has a sister discrete distribution attached to it, useful for statistical modeling. It is defined as follows.</p>
<p><strong>1.1. Our Hurwitz-Riemann Zeta distribution</strong></p>
<p>The distribution discussed here is the most basic example, from the generic family described in section 2. It depends on one parameter <em>s</em> > 0, and the support domain is [0, 1]. The construction mechanism is defined in section 2, for the general case. Our Hurwitz-Riemann zeta distribution has the following density:</p>
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/8699635072?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/8699635072?profile=RESIZE_710x" width="400" class="align-center"/></a></p>
<p>where <span><em>ζ</em>(<em>s</em>, <em>x</em>) is the Hurwitz function, see <a href="https://en.wikipedia.org/wiki/Hurwitz_zeta_function" target="_blank" rel="noopener">here</a>. It has the following two first moments:</span></p>
<p><span><a href="https://storage.ning.com/topology/rest/1.0/file/get/8691286058?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/8691286058?profile=RESIZE_710x" width="550" class="align-center"/></a></span></p>
<p>where <em>ζ</em>(<em>s</em>) = <em>ζ</em>(<em>s</em>, 1) is the Riemann Zeta function. This allows you to compute its variance. Higher moments can also be computed exactly. The cases <em>s</em> = 0, 1 or 2 are limiting cases, with the limit as <em>s</em> tends to zero, corresponding to the uniform density on [0, 1]. Particular values (<em>s</em> = 1, 2), empirically verified, are:</p>
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/8691307680?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/8691307680?profile=RESIZE_710x" width="400" class="align-center"/></a></p>
<p>Here <span><em>γ</em> = 0.57721... is the Euler-Mascheroni constant, see <a href="https://en.wikipedia.org/wiki/Euler%E2%80%93Mascheroni_constant" target="_blank" rel="noopener">here</a>. </span></p>
<p><strong>1.2. The discrete version</strong></p>
<p>These systems also have a discrete distribution attached to them, called the digit distribution, and described in section 2. For the Hurwitz-Riemann case, the probability that a digit is equal to <em>k</em>, is </p>
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/8691322267?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/8691322267?profile=RESIZE_710x" width="300" class="align-center"/></a></p>
<p>The expectation is finite only if <em>s</em> > 1. Likewise, the variance is finite only if <em>s</em> > 2. By contrast, the Zipf distribution has <em>P</em>(<em>k</em>) = <em>k</em>^(-<em>s</em>) / <em>ζ</em>(<em>s</em>).</p>
<p><span style="font-size: 14pt;"><strong>2. A generic family of distributions, with applications</strong></span></p>
<p><span>We are dealing with a particular type of discrete dynamical system defined by </span><em>x</em><span><span style="font-size: 8pt;"><em>n</em>+1</span> = <em>p</em>(<em>x<span style="font-size: 8pt;">n</span></em>) - INT(<em>p</em>(<em>x<span style="font-size: 8pt;">n</span></em>)), where INT is the integer part function, and <em>x</em><span style="font-size: 8pt;">0</span> in [0, 1] is the initial condition. The function <em>p</em>, defined for real numbers in [0, 1], is strictly decreasing and invertible, with <em>p</em>(1) = 1 and <em>p</em>(0) infinite. The results discussed here are valid for the vast majority of initial conditions, although there are infinitely many exceptions, for instance <em>x</em><span style="font-size: 8pt;">0</span> = 0. These systems are discussed in detail in my previous article, <a href="https://www.datasciencecentral.com/profiles/blogs/an-easy-way-to-solve-complex-optimization-problems" target="_blank" rel="noopener">here</a>. In this section, only the main results are presented. These systems have the following properties:</span></p>
<ul>
<li><span>The <em>n</em>-th digit of <em>x</em><span style="font-size: 8pt;">0</span> is <em>d<span style="font-size: 8pt;">n</span></em> = INT(<em>p</em>(<em>x<span style="font-size: 8pt;">n</span></em>)). These digits are called <a href="https://www.tandfonline.com/doi/abs/10.1080/026811199282100?journalCode=cdss19" target="_blank" rel="noopener">codewords</a> in the context of dynamical systems. The probability that a digit is equal to <em>k</em> (<em>k</em> = 1, 2, 3 and so on) is <em>F</em>(<em>q</em>(<em>k</em>)) - <em>F</em>(<em>q</em>(<em>k</em>+1)) where <em>F</em> and <em>q</em> are defined below. If you know the digits, you can retrieve <em>x</em><span style="font-size: 8pt;">0</span> using the algorithm described in my previous article. </span></li>
<li><span>The invariant distribution <em>F</em>, which is the limit of the empirical distribution of the <em>x<span style="font-size: 8pt;">n</span></em>'s, satisfies the following functional equation: <a href="https://storage.ning.com/topology/rest/1.0/file/get/8691388861?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/8691388861?profile=RESIZE_710x" width="250" class="align-center"/></a></span></li>
</ul>
<p><span>where <em>q</em> is the inverse of the function <em>p, q</em>' denotes the derivative of <em>q</em>, and <em>f</em> (the invariant density) is the derivative of <em>F</em>. We focus only on the results that are of interest to machine learning professionals. </span></p>
<p><span>Typically numerical methods are needed to solve the above functional equation, however here we are dealing with a large class of dynamical systems for which the theoretical solution is known. The purpose is to test numerical algorithms to check how well and how fast they can approach the exact solution, as discussed in section 2 <a href="https://www.datasciencecentral.com/profiles/blogs/an-easy-way-to-solve-complex-optimization-problems" target="_blank" rel="noopener">in my previous article</a>. The invariant distribution <em>F</em> discussed below is far more general than the ones described in my earlier article. </span></p>
<p><strong>2.1. Generalized Hurwitz-Riemann Zeta distribution</strong></p>
<p><span>One way to find a dynamical system with a known invariant distribution is to specify that distribution upfront, and then compute the resulting function <em>p</em>(<em>x</em>) that defines the system in question. Based on the theory discussed <a href="https://www.datasciencecentral.com/profiles/blogs/an-easy-way-to-solve-complex-optimization-problems" target="_blank" rel="noopener">here</a> and <a href="https://mathoverflow.net/questions/385156/exact-invariant-distribution-for-2d-discrete-dynamical-systems-including-contin" target="_blank" rel="noopener">here</a>, one can proceed as follows. Choose a monotonically increasing function <em>r</em>(<em>x</em>) with <em>r</em>(2) = 1 + <em>r</em>(1). Let <em>F</em>(<em>x</em>) = <em>r</em>(<em>x</em>+1) - <em>r</em>(1), and <em>R</em>(<em>x</em>) = <em>r</em>(<em>x</em>+1) - <em>r</em>(<em>x</em>). Then <em>R</em>(<em>x</em>) = <em>F</em>(<em>q</em>(<em>x</em>)), that is, <em>R</em>(<em>p</em>(<em>x</em>)) = <em>F</em>(<em>x</em>) since <em>q</em>(<em>p</em>(<em>x</em>)) = <em>x</em>. You can retrieve <em>p</em>(<em>x</em>) by inverting <em>R</em>(<em>x</em>). </span></p>
<p><span>A simple but generic example is </span></p>
<p><span><a href="https://storage.ning.com/topology/rest/1.0/file/get/8691691652?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/8691691652?profile=RESIZE_710x" width="190" class="align-center"/></a></span></p>
<p><span>where <em>ψ</em> is a strictly decreasing function with <em>ψ</em>(∞) = 0, <em>ψ</em>(1) = 1, and <em>ψ</em>(0) = ∞. Then you have</span></p>
<p><span><a href="https://storage.ning.com/topology/rest/1.0/file/get/8691705091?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/8691705091?profile=RESIZE_710x" width="280" class="align-center"/></a></span></p>
<p><span>It is easy to show that <em>R</em>(<em>x</em>) = <em>ψ</em>(<em>x</em>), thanks to a careful choice for the function <em>r</em>(<em>x</em>). This explains why the system has a simple theoretical solution; it was indeed built for that purpose. As a consequence, the probability for a digit to be equal to <em>k</em> (<em>k</em> = 1, 2, and so on) is simply equal to <em>P</em>(<em>k</em>) = <em>ψ</em>(<i>k</i>) - <em>ψ</em>(<i>k</i>+1). For more details, see Example 5 <a href="https://mathoverflow.net/questions/385156/exact-invariant-distribution-for-2d-discrete-dynamical-systems-including-contin" target="_blank" rel="noopener">in this article</a>, in the section <em>Appendix 1: Exact solution for various 1-D dynamical systems</em>.</span></p>
<p><span>The Hurwitz-Riemann particular case in section 1.1 corresponds to</span></p>
<p><span><a href="https://storage.ning.com/topology/rest/1.0/file/get/8691709297?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/8691709297?profile=RESIZE_710x" width="300" class="align-center"/></a></span></p>
<p>Another particular case corresponds to <span><em>ψ</em>(<em>x</em>) = log<span style="font-size: 8pt;">2</span>(1 + 1/x), where log<span style="font-size: 8pt;">2</span> represents the logarithm in base 2. The associated dynamical system is known as the Gauss map and related to continued fractions. Its digits are the coefficients of continued fractions, and are known to follow a <a href="https://en.wikipedia.org/wiki/Gauss%E2%80%93Kuzmin_distribution" target="_blank" rel="noopener">Gauss-Kuzmin distribution</a>. Also, <em>p</em>(<em>x</em>) = <em>q</em>(x) = 1/<em>x</em>. It is discussed <a href="https://www.datasciencecentral.com/profiles/blogs/an-easy-way-to-solve-complex-optimization-problems" target="_blank" rel="noopener">in my previous article</a>. See also Example 2 <a href="https://mathoverflow.net/questions/385156/exact-invariant-distribution-for-2d-discrete-dynamical-systems-including-contin" target="_blank" rel="noopener">in this article</a>, in the section <em>Appendix 1: Exact solution for various 1-D dynamical systems</em>.</span></p>
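<p>This Gauss map case is easy to verify numerically. The sketch below generates the digits <em>d<span style="font-size: 8pt;">k</span></em> = INT(1/<em>x<span style="font-size: 8pt;">k</span></em>) of an orbit and compares their empirical frequencies to the Gauss-Kuzmin probabilities <em>P</em>(<em>k</em>) = -log<span style="font-size: 8pt;">2</span>(1 - 1/(<em>k</em>+1)^2); floating-point round-off makes the orbit drift away from the exact continued fraction of the starting point, but the orbit remains statistically typical, so the frequencies still match:</p>

```python
import math

def gauss_map_digits(x0, n):
    """Digits d_k = INT(1/x_k) of the Gauss map x -> 1/x - INT(1/x)."""
    x, digits = x0, []
    for _ in range(n):
        y = 1.0 / x
        d = int(y)
        digits.append(d)
        x = y - d
        if x <= 0.0:  # guard: floating-point round-off could hit 0 exactly
            break
    return digits

digits = gauss_map_digits(math.pi - 3, 5000)  # x0 = 0.14159...
for k in (1, 2, 3):
    empirical = digits.count(k) / len(digits)
    gauss_kuzmin = -math.log2(1 - 1 / (k + 1) ** 2)
    print(k, round(empirical, 3), round(gauss_kuzmin, 3))
```

<p>The theoretical values are about 0.415, 0.170 and 0.093 for <em>k</em> = 1, 2, 3, and the empirical frequencies land close to them.</p>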
<p><strong>2.2. Application</strong></p>
<p><span>Besides being useful to test optimization algorithms against the exact solution (such as solving the above functional equation), the digits of the system have applications in simulations, encoding, random number generation, and statistical modeling. In particular, below is a picture featuring the typical behavior of the first 2,000 values of <em>p</em>(<em>x<span style="font-size: 8pt;">n</span></em>), starting with <em>x</em><span style="font-size: 8pt;">0</span> = 0.5. Depending on the choice of the function <em>ψ</em>,<em> </em>these values may or may not be highly autocorrelated, and in some cases expectation and/or variance are infinite, which implies that the autocorrelation does not exist. The picture below features the Hurwitz-Riemann case with <em>s</em> = 2 (expectation for the digits is finite and equal to <em>ζ</em>(2) = π^2 / 6, but variance is infinite).</span></p>
<p><span><a href="https://storage.ning.com/topology/rest/1.0/file/get/8691827873?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/8691827873?profile=RESIZE_710x" width="500" class="align-center"/></a></span></p>
<p><span>Other special distributions are discussed in my previous articles:</span></p>
<ul>
<li><a href="https://www.datasciencecentral.com/profiles/blogs/new-family-of-generalized-gaussian-distributions" target="_blank" rel="noopener">New Family of Generalized Gaussian Distributions</a></li>
<li><a href="https://www.datasciencecentral.com/profiles/blogs/interesting-application-of-the-poisson-binomial-distribution" target="_blank" rel="noopener">Interesting Application of the Poisson-Binomial Distribution</a></li>
<li><a href="https://www.datasciencecentral.com/profiles/blogs/a-strange-family-of-statistical-distributions" target="_blank" rel="noopener">A Strange Family of Statistical Distributions</a></li>
</ul>
<p></p>
An Easy Way to Solve Complex Optimization Problems in Machine Learning
tag:www.datasciencecentral.com,2021-03-08:6448529:BlogPost:1042655
2021-03-08T03:30:00.000Z
Vincent Granville
https://www.datasciencecentral.com/profile/VincentGranville
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/8641667893?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/8641667893?profile=RESIZE_710x" width="400" class="align-center"/></a></p>
<p style="text-align: center;"><em>Source: <a href="https://www.wikiwand.com/en/Test_functions_for_optimization" target="_blank" rel="noopener">here</a></em></p>
<p>There are numerous examples in machine learning, statistics, mathematics and deep learning, requiring an algorithm to solve some complicated equations: for instance, maximum likelihood estimation (think about logistic regression or the EM algorithm) or gradient methods (think about stochastic or swarm optimization). Here we are dealing with even more difficult problems, where the solution is not a set of optimal parameters (a finite dimensional object), but a function (an infinite dimensional object).</p>
<p>The context is discrete, chaotic dynamical systems, with applications to weather forecasting, population growth models, complex econometric systems, image encryption, chemistry (mixtures), physics (how matter reaches an equilibrium temperature), astronomy (how celestial man-made or natural bodies end up having stable or unstable orbits), or stock market prices, to name a few. These are referred to as complex systems.</p>
<p>The solutions to the problems discussed here require numerical methods, as usually no exact solution is known. The type of equation to be solved is called a <em>functional equation</em> or <em>stochastic integral</em> equation. We explore a few cases where the exact solution is actually known: this helps assess the efficiency, accuracy and speed of convergence of the numerical methods discussed in this article. These methods are based on the fixed-point algorithm applied to infinite dimensional problems.</p>
<p><span style="font-size: 14pt;"><strong>1. The general problem</strong></span></p>
<p>We are dealing with a discrete dynamical system defined by <em>x</em><span style="font-size: 8pt;"><em>n</em>+1</span> = <i>T</i>(<em>x<span style="font-size: 8pt;">n</span></em>), where <i>T</i> is a real-valued function, and <em>x</em><span style="font-size: 8pt;">0</span> is the initial condition. For the sake of simplicity, we restrict ourselves to the case where <em>x<span style="font-size: 8pt;">n</span></em> is in [0, 1]. Generalizations, for instance with <em>x<span style="font-size: 8pt;">n</span></em> being a vector, are described <a href="https://mathoverflow.net/questions/385156/exact-invariant-distribution-for-2d-discrete-dynamical-systems-including-contin" target="_blank" rel="noopener">here</a>. The most well known example is the <a href="https://en.wikipedia.org/wiki/Logistic_map" target="_blank" rel="noopener">logistic map</a>, with <i>T</i>(<em>x</em>) = <em>λx</em>(1-<em>x</em>), exhibiting a chaotic behavior or not, depending on the value of the parameter <em><span>λ</span></em>.</p>
<p>In our case, the function <i>T</i>(<em>x</em>) takes the following form: <i>T</i>(<em>x</em>) = <em>p</em>(<em>x</em>) - INT(<em>p</em>(<em>x</em>)), where INT denotes the integer part function, and <em>p</em>(<em>x</em>) is positive, continuous and strictly decreasing (thus bijective), with <em>p</em>(1) = 1 and <em>p</em>(0) infinite. For instance, <em>p</em>(<em>x</em>) = 1 / <em>x</em> corresponds to the Gauss map, associated with continued fractions; it is the most fundamental example, and I discuss it <a href="https://mathoverflow.net/questions/383925/about-generalized-continued-fractions" target="_blank" rel="noopener">here</a> as well as below in this article. Another example is the Hurwitz-Riemann map, discussed <a href="https://www.datasciencecentral.com/profiles/blogs/hurwitz-riemann-zeta-and-other-special-probability-distributions" target="_blank" rel="noopener">here</a>. </p>
<p><strong>1.1. Invariant distribution and ergodicity</strong></p>
<p>The <em>invariant distribution</em> of the system is the one followed by the successive <em>x<span style="font-size: 8pt;">n</span></em>'s, or in other words, the limit of the empirical distribution attached to the <em>x<span style="font-size: 8pt;">n</span></em>'s, given an initial condition <em>x</em><span style="font-size: 8pt;">0</span>. Many interesting properties can be derived if the invariant density <em>f</em>(<em>x</em>) (the derivative of the invariant distribution) is known, assuming it exists. This only works with <a href="https://en.wikipedia.org/wiki/Ergodicity" target="_blank" rel="noopener">ergodic systems</a>. All systems under consideration here are <em>ergodic</em>. The invariant distribution applies to almost all initial conditions <em>x</em><span style="font-size: 8pt;">0</span>, though some <em>x</em><span style="font-size: 8pt;">0</span>'s, called exceptions, violate the law. This is a typical feature of all these systems. For some systems (the <a href="https://en.wikipedia.org/wiki/Dyadic_transformation" target="_blank" rel="noopener">Bernoulli map</a>, for instance), the <em>x</em><span style="font-size: 8pt;">0</span>'s that are not exceptions are called <a href="https://en.wikipedia.org/wiki/Normal_number" target="_blank" rel="noopener">normal numbers</a>. </p>
<p>By ergodic, I mean that for almost any initial condition <em>x</em><span style="font-size: 8pt;">0</span>, the sequence (<em>x<span style="font-size: 8pt;">n</span></em>) eventually visits all parts of [0, 1], in a uniform and random sense. This implies that the average behavior of the system can be deduced from the trajectory of a "typical" sequence (<em>x<span style="font-size: 8pt;">n</span></em>) attached to an initial condition <em>x</em><span style="font-size: 8pt;">0</span>. Equivalently, a sufficiently large collection of random instances of the process (also called orbits) can represent the average statistical properties of the entire process.</p>
<p>Invariant distributions are also called equilibrium or attractor distributions in probability theory.</p>
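<p>To make the ergodic average concrete, here is a minimal sketch (written for this article, not taken from its source code) that iterates the Gauss map and compares the fraction of iterates falling below <em>t</em> = 0.5 with the known invariant distribution <em>F</em>(<em>t</em>) = log2(1 + <em>t</em>), the Gauss-Kuzmin distribution:</p>

```python
import math

# Ergodic time average for the Gauss map x_{n+1} = 1/x_n - INT(1/x_n):
# the fraction of iterates below t converges to F(t) = log2(1 + t),
# the Gauss-Kuzmin distribution, for almost every initial condition.
x = math.pi - 3              # a "typical" (non-exceptional) seed
n, t, hits = 100_000, 0.5, 0
for _ in range(n):
    y = 1.0 / x
    x = y - math.floor(y)    # T(x) = p(x) - INT(p(x)), with p(x) = 1/x
    if x < t:
        hits += 1

print(hits / n, math.log2(1 + t))   # the two values should be close
```

<p>Changing the seed to another non-exceptional value leaves the limit unchanged, which is precisely the ergodic property described above.</p>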
<p><strong>1.2. The functional equation to be solved</strong></p>
<p>Let us assume that the invariant distribution <em>F</em>(<em>x</em>) can be written as <em>F</em>(<em>x</em>) = <em>r</em>(<em>x</em>+1) − <em>r</em>(1) for some function <i>r</i>. The support domain for <em>F</em>(<em>x</em>) is [0, 1], thus <em>F</em>(0) = 0, <em>F</em>(1) = 1, <em>F</em>(<em>x</em>) = 0 if <em>x</em> &lt; 0, and <em>F</em>(<em>x</em>) = 1 if <em>x</em> &gt; 1. Define <em>R</em>(<em>x</em>) = <em>r</em>(<em>x</em>+1) − <em>r</em>(<em>x</em>). Then we can retrieve <em>p</em>(<em>x</em>) (under some conditions) using the formula</p>
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/8641305083?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/8641305083?profile=RESIZE_710x" width="300" class="align-center"/></a></p>
<p>Thus <em>r</em>(<em>x</em>) must be increasing on [1, 2], with <em>r</em>(2) = 1 + <em>r</em>(1). Not every function can be an invariant distribution.</p>
<p>In practice, you know <em>p</em>(<em>x</em>) and you try to find the invariant distribution <em>F</em>(<em>x</em>). So the above formula is not useful, except that it helps you create a table of dynamical systems, defined by their function <em>p</em>(<em>x</em>), with known invariant distribution. Such a table is available <a href="https://mathoverflow.net/questions/385156/exact-invariant-distribution-for-2d-discrete-dynamical-systems-including-contin" target="_blank" rel="noopener">here</a>: see Appendix 1 in that article, in particular example 5, featuring a Riemann zeta system. Such a table is useful for testing the fixed-point algorithm described in section 2 in cases where the exact solution is known. </p>
<p>If you only know <em>p</em>(<em>x</em>), to retrieve <em>F</em>(<em>x</em>) or its derivative <em>f</em>(<em>x</em>), you need to solve the following functional equation, whose unknown is the function <em>f</em>. </p>
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/8641363282?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/8641363282?profile=RESIZE_710x" width="400" class="align-center"/></a></p>
<p>where <em>q</em> is the inverse of the function <em>p</em>. Note that <em>R</em>(<em>x</em>) = <em>F</em>(<em>q</em>(<em>x</em>)) or alternatively, <em>R</em>(<em>p</em>(<em>x</em>)) = <em>F</em>(<em>x</em>), with <em>p</em>(<em>q</em>(<em>x</em>)) = <em>q</em>(<em>p</em>(<em>x</em>)) = <em>x</em>. Also, here <em>x</em> is in [0, 1]. In practice, you get a good approximation if you use the first 1,000 terms in the sum. Typically, the invariant density <em>f</em> is bounded, and the weights |<em>q</em>'(<em>x</em>+<em>k</em>)| are decaying relatively fast as <em>k</em> increases. </p>
<p>The theory behind this is beyond the scope of this article. It is based on the <a href="https://en.wikipedia.org/wiki/Transfer_operator" target="_blank" rel="noopener">transfer operator</a>, and also briefly discussed in one of my previous articles, <a href="https://mathoverflow.net/questions/383925/about-generalized-continued-fractions/383997#383997" target="_blank" rel="noopener">here</a>: see section "Functional equation for <em>f</em>". The invariant density is the eigenfunction of the transfer operator, corresponding to the eigenvalue 1. Also, if <em>x</em> is replaced by a vector (for instance, if working with bivariate dynamical systems), the above formula can be generalized, involving two variables <em>x</em>, <em>y</em>, and the derivative of the (joint) distribution is replaced by a Jacobian. </p>
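<p>As a concrete instance (worked out here for illustration; it is consistent with the exact solution quoted in section 2), take the Gauss map, where <em>p</em>(<em>x</em>) = <em>q</em>(<em>x</em>) = 1 / <em>x</em>, so that |<em>q</em>'(<em>x</em>+<em>k</em>)| = 1 / (<em>x</em>+<em>k</em>)<sup>2</sup>:</p>

```latex
f(x) = \sum_{k=1}^{\infty} \frac{1}{(x+k)^2} \, f\!\left(\frac{1}{x+k}\right),
\qquad x \in [0,1].
% Substituting the Gauss-Kuzmin density f(x) = 1/((1+x) \log 2), the sum telescopes:
\sum_{k=1}^{\infty} \frac{1}{(x+k)^2} \cdot \frac{1}{\left(1 + \frac{1}{x+k}\right) \log 2}
  = \frac{1}{\log 2} \sum_{k=1}^{\infty} \left( \frac{1}{x+k} - \frac{1}{x+k+1} \right)
  = \frac{1}{(1+x)\log 2},
```

<p>confirming that the Gauss-Kuzmin density is indeed a fixed point of the transfer operator.</p>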
<p><span style="font-size: 14pt;"><strong>2. Numerical solution via the fixed point algorithm</strong></span></p>
<p>The last formula in section 1.2. suggests a simple iterative algorithm to solve this type of equation. You need to start with an initial function <em>f</em><span style="font-size: 8pt;">0</span>, and in this case, the uniform distribution on [0, 1] is usually a good starting point. That is, <span style="font-size: 12pt;"><em>f</em></span><span style="font-size: 8pt;">0</span>(<em>x</em>) = 1 if <em>x</em> is in [0, 1], and 0 elsewhere. The iterative step is as follows:</p>
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/8641383454?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/8641383454?profile=RESIZE_710x" width="300" class="align-center"/></a></p>
<p>with <em>x</em> in [0, 1]. Each iteration <em>n</em> generates a whole new function <em>f<span style="font-size: 8pt;">n</span></em> on [0, 1], and the hope is that the algorithm converges as <em>n</em> tends to infinity. If convergence occurs, the limiting function must be the invariant density of the system. This is an example of the <a href="https://en.wikipedia.org/wiki/Fixed-point_iteration" target="_blank" rel="noopener">fixed point algorithm</a>, in infinite dimension.</p>
<p>In practice, you compute <em>f</em>(<em>x</em>) for only (say) 10,000 values of <em>x</em> evenly spaced between 0 and 1. If, for instance, <em>f</em><span style="font-size: 8pt;"><em>n</em>+1</span>(0.5) requires the computation of (say) <em>f<span style="font-size: 8pt;">n</span></em>(0.879237...) and the closest value in your array is <em>f<span style="font-size: 8pt;">n</span></em>(0.8792), you replace <em>f<span style="font-size: 8pt;">n</span></em>(0.879237...) by <em>f<span style="font-size: 8pt;">n</span></em>(0.8792), or you use interpolation techniques. This is more efficient than using a function defined recursively in a programming language. Surprisingly, the convergence is very fast: in the examples tested, the error between the true solution and the one obtained after 3 iterations is very small; see the picture below.</p>
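<p>The scheme can be sketched as follows for the classic case <em>p</em>(<em>x</em>) = <em>q</em>(<em>x</em>) = 1 / <em>x</em> (a minimal illustration written for this article; the article's own implementation, with extra parameters <em>α</em> and <em>λ</em>, is linked below the picture):</p>

```python
import numpy as np

# Fixed-point iteration for the invariant density of the Gauss map,
# p(x) = q(x) = 1/x, so that |q'(x + k)| = 1/(x + k)^2.
grid = np.linspace(0.0, 1.0, 2001)   # f is stored on an evenly spaced grid
f = np.ones_like(grid)               # f_0: uniform density on [0, 1]
K = 1000                             # terms kept in the (truncated) infinite sum

for _ in range(3):                   # three iterations, as in the picture below
    g = np.zeros_like(grid)
    for k in range(1, K + 1):
        y = 1.0 / (grid + k)                          # q(x + k), always in (0, 1]
        g += np.interp(y, grid, f) / (grid + k) ** 2  # interpolated lookup of f_n
    f = g

# Compare with the exact solution f(x) = 1/((1+x) log 2), the Gauss-Kuzmin density
exact = 1.0 / ((1.0 + grid) * np.log(2.0))
print(np.max(np.abs(f - exact)))     # already small after only 3 iterations
```

<p>Increasing the number of iterations, grid points, or terms <em>K</em> kept in the sum tightens the approximation further.</p>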
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/8641440290?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/8641440290?profile=RESIZE_710x" width="400" class="align-center"/></a>In the above picture, <em>p</em>(<em>x</em>) = <em>q</em>(<em>x</em>) = 1 / <em>x</em>, and the invariant distribution is known: <em>f</em>(<em>x</em>) = 1 / ((1+<em>x</em>)(log 2)). It is pictured in red, and it is related to the <a href="https://en.wikipedia.org/wiki/Gauss%E2%80%93Kuzmin_distribution" target="_blank" rel="noopener">Gauss-Kuzmin distribution</a>. Note that we started with the uniform distribution <em>f</em><span style="font-size: 8pt;">0</span> pictured in black (the flat line). The first iterate <em>f</em><span style="font-size: 8pt;">1</span> is in green, the second one <em>f</em><span style="font-size: 8pt;">2</span> is in grey, and the third one <em>f</em><span style="font-size: 8pt;">3</span> is in orange, almost indistinguishable from the exact solution in red (I need magnifying glasses to see it). Source code for these computations is available <a href="http://datashaping.com/solve2b.txt" target="_blank" rel="noopener">here</a>. In the source code, there are two extra parameters <span><em>α</em>, <em>λ</em>. When <em>α</em> = <em>λ</em> = 1, it corresponds to the classic case <em>p</em>(<em>x</em>) = 1 / <em>x</em>.</span></p>
<p><span style="font-size: 14pt;"><strong>3. Applications</strong></span></p>
<p>One interesting concept associated with these dynamical systems is that of <em>digit</em>. The <em>n</em>-th digit <em>d<span style="font-size: 8pt;">n</span></em> is defined as INT(<em>p</em>(<em>x</em><span style="font-size: 8pt;"><em>n</em></span>)), where INT is the integer part function. I call it "digit" because all these systems have a numeration system attached to them, generalizing standard numeration systems, which are just a particular case. If you know the digits attached to an initial condition <em>x</em><span style="font-size: 8pt;">0</span>, you can retrieve <em>x</em><span style="font-size: 8pt;">0</span> with a simple algorithm. Start with <em>n</em> = <em>N</em> large enough and <em>x</em><span style="font-size: 8pt;"><em>N</em>+1</span> = 0 (you will get about <em>N</em> digits of accuracy for <em>x</em><span style="font-size: 8pt;">0</span>), and compute <em>x<span style="font-size: 8pt;">n</span></em> iteratively backward, from <em>n</em> = <em>N</em> to <em>n</em> = 0, using the recursion <em>x<span style="font-size: 8pt;">n</span></em> = <em>q</em>(<em>x</em><span style="font-size: 8pt;"><em>n</em>+1</span> + <em>d<span style="font-size: 8pt;">n</span></em>) - INT(<em>q</em>(<em>x</em><span style="font-size: 8pt;"><em>n</em>+1</span> + <em>d<span style="font-size: 8pt;">n</span></em>)). These digits can be used in encryption systems.</p>
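<p>For the continued fraction system, where <em>p</em>(<em>x</em>) = <em>q</em>(<em>x</em>) = 1 / <em>x</em>, the forward computation of the digits and the backward retrieval of <em>x</em><span style="font-size: 8pt;">0</span> can be sketched as follows (an illustration written for this article; in this case the INT correction in the backward recursion vanishes, since <em>q</em>(<em>x</em> + <em>k</em>) is in (0, 1]):</p>

```python
import math

def digits(x0, n):
    """Forward pass: d_k = INT(p(x_k)) for the Gauss map, p(x) = 1/x."""
    d, x = [], x0
    for _ in range(n):
        y = 1.0 / x
        k = math.floor(y)
        d.append(k)
        x = y - k              # x_{k+1} = T(x_k) = p(x_k) - INT(p(x_k))
    return d

def reconstruct(d):
    """Backward pass: start from x_{N+1} = 0, apply x_n = q(x_{n+1} + d_n)."""
    x = 0.0
    for k in reversed(d):
        x = 1.0 / (x + k)      # q(y) = 1/y; the INT correction is 0 here
    return x

x0 = math.pi - 3
d = digits(x0, 10)
print(d)                          # starts with 7, 15, 1, 292, 1, ...
print(abs(reconstruct(d) - x0))   # tiny reconstruction error
```

<p>With <em>N</em> = 10 digits, the reconstruction already matches <em>x</em><span style="font-size: 8pt;">0</span> to roughly double-precision accuracy.</p>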
<p>This will be described in detail in my upcoming book <em>Gentle Introduction to Discrete Dynamical Systems</em>. However, the interesting part discussed here is related to statistical modeling. As a starter, let's look at the digits of <em>x</em><span style="font-size: 8pt;">0</span> = <span>π - 3 in two different dynamical systems:</span></p>
<ul>
<li><span><strong>Continued fractions</strong>. Here <em>p</em>(<em>x</em>) = 1 / <em>x</em>. The first 20 digits are 7, 15, 1, 292, 1, 1, 1, 2, 1, 3, 1, 14, 3, 3, 23, 1, 1, 7, 4, 35, see <a href="https://oeis.org/A001203" target="_blank" rel="noopener">here</a>. </span></li>
<li><strong>A less chaotic dynamical system</strong>. Here <em>p</em>(<em>x</em>) = (-1 + SQRT(5 + 4/<em>x</em>)) / 2. <span>The first 20 digits are </span>2, 1, 1, 1, 1, 1, 1, 2, 1, 2, 1, 1, 1, 26, 1, 3, 1, 10, 1, 1. We also have <em>F</em>(<em>x</em>) = 2<em>x</em> / (<em>x</em>+1).</li>
</ul>
<p>The distribution of the digits is known in both cases. For continued fractions, it is the <a href="https://en.wikipedia.org/wiki/Gauss%E2%80%93Kuzmin_distribution" target="_blank" rel="noopener">Gauss-Kuzmin distribution</a>. For the second system, the probability that a digit is equal to <em>k</em> is 4 / (<em>k</em>(<em>k</em>+1)(<em>k</em>+2)), see Example 1 <a href="https://mathoverflow.net/questions/385156/exact-invariant-distribution-for-2d-discrete-dynamical-systems-including-contin" target="_blank" rel="noopener">in this article</a>. In general, the probability in question is equal to <em>F</em>(<em>q</em>(<em>k</em>)) - <em>F</em>(<em>q</em>(<em>k</em>+1)) for <em>k</em> = 1, 2, and so on. Clearly, the distribution of these digits can be used to quantify the level of chaos in the system. For continued fractions, the expected value of an arbitrary digit is infinite (though it is finite and well known for the logarithm of a digit, see <a href="https://en.wikipedia.org/wiki/Khinchin%27s_constant" target="_blank" rel="noopener">here</a>), while it is finite (equal to 2) for the second system. Yet each system, given enough time, will produce arbitrarily large digits. Another way to quantify chaos in a dynamical system is to look at the auto-correlation structure of the sequence (<em>x<span style="font-size: 8pt;">n</span></em>). Auto-correlations very close to zero, decaying very fast, are associated with highly chaotic systems. In the case of continued fractions, the lag-1 auto-correlation, defined as the limit of the empirical auto-correlation on a sequence starting with (say) <em>x</em><span style="font-size: 8pt;">0</span> = <span>π - 3, is </span></p>
<p><span><a href="https://storage.ning.com/topology/rest/1.0/file/get/8641579290?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/8641579290?profile=RESIZE_710x" width="250" class="align-center"/></a></span></p>
<p><span>where <em>γ</em> is the <a href="https://en.wikipedia.org/wiki/Euler%E2%80%93Mascheroni_constant" target="_blank" rel="noopener">Euler–Mascheroni constant</a>, see Appendix 2 <a href="https://mathoverflow.net/questions/385156/exact-invariant-distribution-for-2d-discrete-dynamical-systems-including-contin" target="_blank" rel="noopener">in this article</a>. This is probably a new result, never published before.</span></p>
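<p>The digit law of the second, less chaotic system is easy to verify numerically. Below is a sketch (written for this article) that checks that the probabilities 4 / (<em>k</em>(<em>k</em>+1)(<em>k</em>+2)) sum to 1 with expected value 2, and compares them with empirical digit frequencies along an orbit; the inverse <em>q</em>(<em>y</em>) = 1 / (<em>y</em><sup>2</sup> + <em>y</em> - 1) is derived here from <em>p</em>:</p>

```python
import math

def P(k):
    # Digit law of the system p(x) = (-1 + sqrt(5 + 4/x))/2, obtained from
    # P(k) = F(q(k)) - F(q(k+1)) with F(x) = 2x/(x+1) and q(y) = 1/(y^2 + y - 1)
    return 4.0 / (k * (k + 1) * (k + 2))

total = sum(P(k) for k in range(1, 100_000))     # should be close to 1
mean = sum(k * P(k) for k in range(1, 100_000))  # should be close to 2

# Empirical frequency of digit 1 along an orbit started at x0 = pi - 3
x, n, ones = math.pi - 3, 100_000, 0
for _ in range(n):
    p = (-1.0 + math.sqrt(5.0 + 4.0 / x)) / 2.0
    d = math.floor(p)          # the digit INT(p(x_n))
    if d == 1:
        ones += 1
    x = p - d                  # next state x_{n+1} = T(x_n)

print(total, mean, ones / n)   # compare ones/n with P(1) = 2/3
```

<p>The empirical frequency of the digit 1 hovers around 2/3, in agreement with the theoretical law.</p>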
<p><span>Below is a picture featuring the successive values of <em>p</em>(<em>x<span style="font-size: 8pt;">n</span></em>) for the smoother dynamical system mentioned above. These values are close to the digits <em>d<span style="font-size: 8pt;">n</span></em>. The initial condition is <em>x</em><span style="font-size: 8pt;">0</span> = π - 3. In my next article, I will further discuss a new way to define and measure chaos in these various systems.</span></p>
<p><span><a href="https://storage.ning.com/topology/rest/1.0/file/get/8641636094?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/8641636094?profile=RESIZE_710x" width="500" class="align-center"/></a></span></p>
<p><span>The first 5,500 values of <em>p</em>(<em>x<span style="font-size: 8pt;">n</span></em>), for <em>n</em> = 0, 1, 2 and so on, are featured in the above picture. Think about what business, natural or industrial process could be modeled by such kinds of time series! The possibilities are endless. For instance, it could represent meteorite hits over a large time period, with a few large values representing massive impacts. Clearly, it can be used in outlier detection, extreme-event modeling, and risk modeling. </span></p>
<p>Finally, here is another example, this time based on an unrelated bivariate dynamical system on a grid (the cat map), used for image encryption. This is a<span> mapping applied to a picture of a pair of cherries. The image is 74 pixels wide, and takes 114 iterations to be restored, although it appears upside-down at the halfway point (the 57th iteration). Source: <a href="https://en.wikipedia.org/wiki/Arnold%27s_cat_map" target="_blank" rel="noopener">here</a>. </span></p>
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/8641638058?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/8641638058?profile=RESIZE_710x" class="align-center"/></a></p>
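<p>The period of the cat map can be checked directly: the map is linear modulo the image width <em>N</em>, so one just looks for the smallest power of the matrix [[2, 1], [1, 1]] that equals the identity modulo <em>N</em> (a quick sketch written for this article):</p>

```python
def cat_map_period(n):
    """Smallest k such that M^k = identity mod n, for Arnold's cat map
    M = [[2, 1], [1, 1]] acting on an n x n grid of pixels."""
    m = (2 % n, 1 % n, 1 % n, 1 % n)   # current power of M, stored as (a, b, c, d)
    k = 1
    while m != (1 % n, 0, 0, 1 % n):
        a, b, c, d = m
        # Left-multiply by M: [[2, 1], [1, 1]] . [[a, b], [c, d]]
        m = ((2*a + c) % n, (2*b + d) % n, (a + c) % n, (b + d) % n)
        k += 1
    return k

print(cat_map_period(74))   # 114: a 74-pixel image is restored after 114 steps
```

<p>This confirms the 114 iterations quoted above for the 74-pixel cherries image.</p>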
<p><span><em>To receive a weekly digest of our new articles, subscribe to our newsletter, <a href="https://www.datasciencecentral.com/profiles/blogs/check-out-our-dsc-newsletter" target="_blank" rel="noopener">here</a>.</em></span></p>
<p><span><em><strong>About the author</strong>: Vincent Granville is a d<span class="lt-line-clamp__raw-line">ata science pioneer, mathematician, book author (Wiley), patent owner, former post-doc at Cambridge University, former VC-funded executive, with 20+ years of corporate experience including CNET, NBC, Visa, Wells Fargo, Microsoft, eBay. Vincent is also self-publisher at <a href="http://datashaping.com/" target="_blank" rel="noopener">DataShaping.com</a>, and founded and co-founded a few start-ups, including one with a successful exit (Data Science Central acquired by Tech Target).</span> You can access Vincent's articles and books, <a href="http://datashaping.com/" target="_blank" rel="noopener">here</a>.</em></span></p>
A Plethora of Machine Learning Articles: Part 2
tag:www.datasciencecentral.com,2021-03-04:6448529:BlogPost:1041679
2021-03-04T01:44:59.000Z
Vincent Granville
https://www.datasciencecentral.com/profile/VincentGranville
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/8629159091?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/8629159091?profile=RESIZE_710x" width="400" class="align-center"/></a></p>
<div class="xg_module_body"><div class="postbody"><div class="xg_user_generated"><p style="text-align: center;"><em>Source: see<span> </span><a href="https://www.datasciencecentral.com/profiles/blogs/more-beautiful-math-images" target="_blank" rel="noopener">here</a></em></p>
<p><span>Part 1 of this short series focused on the business analytics / BI / operational research aspects, see <a href="https://www.datasciencecentral.com/profiles/blogs/a-plethora-of-machine-learning-articles-part-1" target="_blank" rel="noopener">here</a>. In this Part 2, you will find the most interesting machine learning and statistics articles that I wrote in the last few years, focusing on core technical aspects. The whole series will feature articles related to the following aspects of machine learning:</span></p>
<ul>
<li><span>Mathematics, simulations, benchmarking algorithms based on synthetic data (in short, experimental data science)</span></li>
<li><span>Opinions, for instance about the value of a PhD in our field, or the use of some techniques</span></li>
<li><span>Methods, principles, rules of thumb, recipes, tricks</span></li>
<li><span>Business analytics (Part 1)</span></li>
</ul>
<p><span>My articles are always written in simple English and accessible to professionals with typically one year of calculus or statistical training, at the undergraduate level. They are geared towards people who use data but are interested in gaining more practical analytical experience. Managers and decision makers are part of my intended audience. The style is compact, geared towards people who do not have a lot of free time. </span></p>
<p><span>Despite these restrictions, state-of-the-art, off-the-beaten-path results as well as machine learning trade secrets and research material are frequently shared. References to more advanced literature (from myself and other authors) are provided for those who want to dig deeper into the topics discussed. </span></p>
<p><span style="font-size: 14pt;"><strong>1. Core techniques</strong></span></p>
<p><span>These articles focus on techniques that have wide applications or that are otherwise fundamental or seminal in nature.</span></p>
<ol>
<li><a href="https://www.datasciencecentral.com/profiles/blogs/introducing-an-all-purpose-robust-fast-simple-non-linear-r22" target="_blank" rel="noopener">Introducing an All-purpose, Robust, Fast, Simple Non-linear Regression</a></li>
<li><a href="https://www.datasciencecentral.com/profiles/blogs/chaos-attractors-in-machine-learning-systems" target="_blank" rel="noopener">Variance, Attractors and Behavior of Chaotic Statistical Systems</a></li>
<li><a href="https://www.datasciencecentral.com/profiles/blogs/new-family-of-generalized-gaussian-distributions" target="_blank" rel="noopener">New Family of Generalized Gaussian Distributions</a></li>
<li><a href="https://www.datasciencecentral.com/profiles/blogs/new-approach-to-linear-algebra-in-machine-learning" target="_blank" rel="noopener">Gentle Approach to Linear Algebra, with Machine Learning Applications</a></li>
<li><a href="https://www.datasciencecentral.com/profiles/blogs/confidence-intervals-without-pain" target="_blank" rel="noopener">Confidence Intervals Without Pain</a></li>
<li><a href="https://www.datasciencecentral.com/profiles/blogs/modern-re-sampling-and-statistical-recipes" target="_blank" rel="noopener">Re-sampling: Amazing Results and Applications</a></li>
<li><a href="https://www.datasciencecentral.com/profiles/blogs/how-to-automatically-determine-the-number-of-clusters-in-your-dat" target="_blank" rel="noopener">How to Automatically Determine the Number of Clusters in your Data</a> - and more</li>
<li><a href="https://www.datasciencecentral.com/profiles/blogs/decomposition-of-statistical-distributions-using-mixture-models-a" target="_blank" rel="noopener">New Perspectives on Statistical Distributions and Deep Learning</a></li>
<li><a href="https://www.datasciencecentral.com/profiles/blogs/a-plethora-of-original-underused-statistical-tests" target="_blank" rel="noopener">A Plethora of Original, Not Well-Known Statistical Tests</a></li>
<li><a href="https://www.datasciencecentral.com/profiles/blogs/pattern-recognition-techniques-application-to-new-decimal-systems?xg_source=activity" target="_blank" rel="noopener">New Decimal Systems - Great Sandbox for Data Scientists and Mathematicians</a></li>
<li><a href="https://www.datasciencecentral.com/profiles/blogs/are-the-digits-of-pi-truly-random" target="_blank" rel="noopener">Are the Digits of Pi Truly Random?</a></li>
<li><a href="http://www.datasciencecentral.com/profiles/blogs/data-science-and-machine-learning-without-mathematics" target="_blank" rel="noopener">Data Science and Machine Learning Without Mathematics</a></li>
<li><a href="http://www.datasciencecentral.com/profiles/blogs/advanced-machine-learning-with-basic-excel" target="_blank" rel="noopener">Advanced Machine Learning with Basic Excel</a></li>
<li><a href="http://www.datasciencecentral.com/profiles/blogs/state-of-the-art-machine-learning-automation-with-hdt" target="_blank" rel="noopener">State-of-the-Art Machine Learning Automation with HDT</a></li>
<li><a href="http://www.datasciencecentral.com/profiles/blogs/building-outiler-resistant-centroids-in-any-dimension" target="_blank" rel="noopener">Tutorial: Neutralizing Outliers in Any Dimension</a></li>
<li><a href="http://www.datasciencecentral.com/profiles/blogs/the-fundamental-statistics-theorem-revisited" target="_blank" rel="noopener">The Fundamental Statistics Theorem Revisited</a></li>
<li><a href="http://www.datasciencecentral.com/profiles/blogs/variance-clustering-test-of-hypotheses-and-density-estimation-rev" target="_blank" rel="noopener">Variance, Clustering, and Density Estimation Revisited</a></li>
<li><a href="http://www.datasciencecentral.com/profiles/blogs/the-death-of-the-statistical-test-of-hypothesis" target="_blank" rel="noopener">The Death of the Statistical Tests of Hypotheses</a></li>
<li><a href="http://www.datasciencecentral.com/profiles/blogs/5-easy-steps-to-structure-highly-unstructured-big-data" target="_blank" rel="noopener">4 Easy Steps to Structure Highly Unstructured Big Data, via Automated Indexation</a> </li>
<li><a href="http://www.datasciencecentral.com/profiles/blogs/the-best-kept-secret-about-linear-and-logistic-regression" target="_blank" rel="noopener">The best kept secret about linear and logistic regression</a></li>
<li><a href="http://www.datasciencecentral.com/profiles/blogs/black-box-confidence-intervals-excel-and-perl-implementations-det" target="_blank" rel="noopener">Black-box Confidence Intervals: Excel and Perl Implementation</a></li>
<li><a href="http://www.datasciencecentral.com/profiles/blogs/comparing-linear-regression-with-the-jackknife-method" target="_blank" rel="noopener">Jackknife and linear regression in Excel: implementation and comparison</a></li>
<li><a href="http://www.datasciencecentral.com/profiles/blogs/jackknife-logistic-and-linear-regression" target="_blank" rel="noopener">Jackknife logistic and linear regression for clustering and predictions</a></li>
</ol>
<p><span style="font-size: 14pt;"><strong>2. Free books</strong></span></p>
<ul>
<li><span><b>Statistics: New Foundations, Toolbox, and Machine Learning Recipes</b></span><p><span>Available <a href="https://www.datasciencecentral.com/profiles/blogs/free-book-statistics-new-foundations-toolbox-and-machine-learning">here</a>. In about 300 pages and 28 chapters it covers many new topics, offering a fresh perspective on the subject, including rules of thumb and recipes that are easy to automate or integrate in black-box systems, as well as new model-free, data-driven foundations to statistical science and predictive analytics. The approach focuses on robust techniques; it is bottom-up (from applications to theory), in contrast to the traditional top-down approach.</span></p>
<p><span>The material is accessible to practitioners with a one-year college-level exposure to statistics and probability. The compact and tutorial style, featuring many applications with numerous illustrations, is aimed at practitioners, researchers, and executives in various quantitative fields.</span></p>
</li>
<li><span><b>Applied Stochastic Processes</b></span><p><span>Available <a href="https://www.datasciencecentral.com/profiles/blogs/fee-book-applied-stochastic-processes">here</a>. Full title: Applied Stochastic Processes, Chaos Modeling, and Probabilistic Properties of Numeration Systems (104 pages, 16 chapters.) This book is intended for professionals in data science, computer science, operations research, statistics, machine learning, big data, and mathematics. In 100 pages, it covers many new topics, offering a fresh perspective on the subject.</span></p>
<p><span>It is accessible to practitioners with a two-year college-level exposure to statistics and probability. The compact and tutorial style, featuring many applications (Blockchain, quantum algorithms, HPC, random number generation, cryptography, Fintech, web crawling, statistical testing) with numerous illustrations, is aimed at practitioners, researchers and executives in various quantitative fields.</span></p>
</li>
</ul>
<p></p>
<p><span><em>To receive a weekly digest of our new articles, subscribe to our newsletter, <a href="https://www.datasciencecentral.com/profiles/blogs/check-out-our-dsc-newsletter" target="_blank" rel="noopener">here</a>.</em></span></p>
<p><span><em><strong>About the author</strong>: Vincent Granville is a d<span class="lt-line-clamp__raw-line">ata science pioneer, mathematician, book author (Wiley), patent owner, former post-doc at Cambridge University, former VC-funded executive, with 20+ years of corporate experience including CNET, NBC, Visa, Wells Fargo, Microsoft, eBay. Vincent is also self-publisher at <a href="http://datashaping.com/" target="_blank" rel="noopener">DataShaping.com</a>, and founded and co-founded a few start-ups, including one with a successful exit (Data Science Central acquired by Tech Target).</span> You can access Vincent's articles and books, <a href="http://datashaping.com/" target="_blank" rel="noopener">here</a>.</em></span></p>
</div>
</div>
</div>
A Plethora of Machine Learning Articles: Part 1
tag:www.datasciencecentral.com,2021-02-21:6448529:BlogPost:1034367
2021-02-21T23:30:00.000Z
Vincent Granville
https://www.datasciencecentral.com/profile/VincentGranville
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/8582358874?profile=original" target="_blank" rel="noopener"><img width="400" class="align-center" src="https://storage.ning.com/topology/rest/1.0/file/get/8582358874?profile=RESIZE_710x"/></a></p>
<p><em>Source: see <a href="https://www.datasciencecentral.com/profiles/blogs/more-beautiful-math-images" target="_blank" rel="noopener">here</a></em></p>
<p><span style="font-size: 12pt;">In Part 1 of this short series, I have included the most interesting articles that I wrote in the last few years. This part focuses on the business analytics / BI / operational research aspects. The next parts will focus on</span></p>
<ul>
<li><span style="font-size: 12pt;">Mathematics, simulations, benchmarking algorithms based on synthetic data (in short, experimental data science)</span></li>
<li><span style="font-size: 12pt;">Opinions, for instance about the value of a PhD in our field, or the use of some techniques</span></li>
<li><span style="font-size: 12pt;">Methods, principles, rules of thumb, recipes, tricks</span></li>
</ul>
<p><span style="font-size: 12pt;">My articles are always written in simple English and accessible to professionals with typically one year of calculus or statistical training, at the undergraduate level. They are geared towards people who use data but are interested in gaining more practical analytical experience. Managers and decision makers are part of my intended audience. The style is compact, geared towards people who do not have a lot of free time. </span></p>
<p><span style="font-size: 12pt;">Despite these restrictions, state-of-the-art, off-the-beaten-path results as well as machine learning trade secrets and research material are frequently shared. References to more advanced literature (from myself and other authors) are provided for those who want to dig deeper into the topics discussed. </span></p>
<p><span style="font-size: 12pt;">Before starting, let me mention in section 1 two books that I wrote recently, available to all Data Science Central members.</span></p>
<p><span style="font-size: 14pt;"><strong>1. Free books</strong></span></p>
<ul>
<li><span style="font-size: 12pt;"><b>Statistics: New Foundations, Toolbox, and Machine Learning Recipes</b></span><p><span style="font-size: 12pt;">Available <a href="https://www.datasciencecentral.com/profiles/blogs/free-book-statistics-new-foundations-toolbox-and-machine-learning">here</a>. In about 300 pages and 28 chapters it covers many new topics, offering a fresh perspective on the subject, including rules of thumb and recipes that are easy to automate or integrate in black-box systems, as well as new model-free, data-driven foundations to statistical science and predictive analytics. The approach focuses on robust techniques; it is bottom-up (from applications to theory), in contrast to the traditional top-down approach.</span></p>
<p><span style="font-size: 12pt;">The material is accessible to practitioners with a one-year college-level exposure to statistics and probability. The compact and tutorial style, featuring many applications with numerous illustrations, is aimed at practitioners, researchers, and executives in various quantitative fields.</span></p>
</li>
<li><span style="font-size: 12pt;"><b>Applied Stochastic Processes</b></span><p><span style="font-size: 12pt;">Available <a href="https://www.datasciencecentral.com/profiles/blogs/fee-book-applied-stochastic-processes">here</a>. Full title: Applied Stochastic Processes, Chaos Modeling, and Probabilistic Properties of Numeration Systems (104 pages, 16 chapters). This book is intended for professionals in data science, computer science, operations research, statistics, machine learning, big data, and mathematics. It covers many new topics, offering a fresh perspective on the subject.</span></p>
<p><span style="font-size: 12pt;">It is accessible to practitioners with a two-year college-level exposure to statistics and probability. The compact and tutorial style, featuring many applications (Blockchain, quantum algorithms, HPC, random number generation, cryptography, Fintech, web crawling, statistical testing) with numerous illustrations, is aimed at practitioners, researchers and executives in various quantitative fields.</span></p>
</li>
</ul>
<p><span style="font-size: 14pt;"><strong>2. Business related articles</strong></span></p>
<p><span style="font-size: 12pt;">These articles focus on business applications and other matters relevant to being a data scientist working in the industry. They are accessible to a wide audience, in the sense that they are less technical than many of my 200+ other articles.</span></p>
<ol>
<li><span style="font-size: 12pt;"><a href="https://www.datasciencecentral.com/profiles/blogs/data-science-foundations-for-a-new-stock-market" target="_blank" rel="noopener">New Stock Trading and Lottery Game Rooted in Deep Math</a></span></li>
<li><span style="font-size: 12pt;"><a href="https://www.datasciencecentral.com/profiles/blogs/data-science-wizardry" target="_blank" rel="noopener">Time Series, Growth Modeling and Data Science Wizardry</a> </span></li>
<li><span style="font-size: 12pt;"><a href="https://www.datasciencecentral.com/profiles/blogs/how-to-stabilize-data-to-avoid-decay-in-model-performance" target="_blank" rel="noopener">How to Stabilize Data Systems, to Avoid Decay in Model Performance</a></span></li>
<li><span style="font-size: 12pt;"><a href="https://www.datasciencecentral.com/profiles/blogs/10-differences-between-junior-and-senior-data-scientist" target="_blank" rel="noopener">22 Differences Between Junior and Senior Data Scientists</a></span></li>
<li><span style="font-size: 12pt;"><a href="https://www.datasciencecentral.com/profiles/blogs/the-first-things-you-should-learn-as-a-data-scientist-not-what-yo" target="_blank" rel="noopener">The First Things you Should Learn as a Data Scientist - Not what you Think</a></span></li>
<li><span style="font-size: 12pt;"><a href="https://www.datasciencecentral.com/profiles/blogs/difference-between-machine-learning-data-science-ai-deep-learning" target="_blank" rel="noopener">Difference between Machine Learning, Data Science, AI, Deep Learning, and Statistics</a></span></li>
<li><span style="font-size: 12pt;"><a href="http://www.datasciencecentral.com/profiles/blogs/20-data-science-systems-used-by-amazon-to-operate-its-business" target="_blank" rel="noopener">21 data science systems used by Amazon to operate its business</a></span></li>
<li><span style="font-size: 12pt;"><a href="http://www.datasciencecentral.com/profiles/blogs/life-cycle-of-data-science-projects" target="_blank" rel="noopener">Life Cycle of Data Science Projects</a></span></li>
<li><span style="font-size: 12pt;"><a href="http://www.datasciencecentral.com/profiles/blogs/40-techniques-used-by-data-scientists" target="_blank" rel="noopener">40 Techniques Used by Data Scientists</a></span></li>
<li><span style="font-size: 12pt;"><a href="http://www.datasciencecentral.com/profiles/blogs/helping-facebook-design-better-machine-learning-algorithms" target="_blank" rel="noopener">Designing better algorithms: 5 case studies</a></span></li>
<li><span style="font-size: 12pt;"><a href="http://www.datasciencecentral.com/profiles/blogs/the-data-science-zoo" target="_blank" rel="noopener">Architecture of Data Science Projects</a></span></li>
<li><span style="font-size: 12pt;"><a href="http://www.datasciencecentral.com/profiles/blogs/24-uses-of-statistical-modeling-part-ii" target="_blank" rel="noopener">24 Uses of Statistical Modeling (Part II)</a> | <a href="http://www.datasciencecentral.com/profiles/blogs/top-20-uses-of-statistical-modeling" target="_blank" rel="noopener">(Part I)</a></span></li>
<li><span style="font-size: 12pt;"><a href="http://www.datasciencecentral.com/profiles/blogs/the-abcd-s-of-business-optimization" target="_blank" rel="noopener">The ABCD's of Business Optimization</a></span></li>
<li><span style="font-size: 12pt;"><a href="http://www.datasciencecentral.com/profiles/blogs/is-data-science-a-sin-against-the-norms-of-statisticians" target="_blank" rel="noopener">What you won't learn in stats classes</a></span></li>
<li><span style="font-size: 12pt;"><a href="http://www.datasciencecentral.com/profiles/blogs/biased-vs-unbiased-debunking-statistical-myths" target="_blank" rel="noopener">Biased vs Unbiased: Debunking Statistical Myths</a></span></li>
</ol>
<p></p>
<p><span style="font-size: 12pt;"><em>To receive a weekly digest of our new articles, subscribe to our newsletter, <a href="https://www.datasciencecentral.com/profiles/blogs/check-out-our-dsc-newsletter" target="_blank" rel="noopener">here</a>.</em></span></p>
<p><span style="font-size: 12pt;"><em><strong>About the author</strong>: Vincent Granville is a d<span class="lt-line-clamp__raw-line">ata science pioneer, mathematician, book author (Wiley), patent owner, former post-doc at Cambridge University, former VC-funded executive, with 20+ years of corporate experience including CNET, NBC, Visa, Wells Fargo, Microsoft, eBay. Vincent is also self-publisher at <a href="http://datashaping.com/" target="_blank" rel="noopener">DataShaping.com</a>, and founded and co-founded a few start-ups, including one with a successful exit (Data Science Central acquired by Tech Target).</span> You can access Vincent's articles and books, <a href="http://datashaping.com/" target="_blank" rel="noopener">here</a>.</em></span></p>
<p></p>
Maximum runs in Bernoulli trials: simulations and results
tag:www.datasciencecentral.com,2021-02-16:6448529:BlogPost:1029341
2021-02-16T08:00:00.000Z
Vincent Granville
https://www.datasciencecentral.com/profile/VincentGranville
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/8561683465?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/8561683465?profile=RESIZE_710x" width="720" class="align-full"/></a></p>
<p>Bernoulli trials are random experiments with two possible outcomes: "yes" and "no" (in the case of polls), or "success" and "failure" (in the case of gambling or clinical trials). The trials are independent of each other: for instance, tossing a coin multiple times, or testing the success of a new drug against a specific medical condition on multiple patients. Improvement for a specific patient is viewed as a success, lack of improvement as a failure. </p>
<p><span>Here we are interested in maximum runs of successes (also called record runs): when they are expected to occur, and their expected length. While the classical application is in games of chance, we will discuss an exciting application in number theory, more specifically, very good approximations of irrational numbers by rational numbers, and numeration systems with a non-integer base. We will also consider the case where the trials are not independent, and where there are more than two outcomes. For instance, if throwing a die rather than a coin, there are six rather than two outcomes.</span></p>
<p><span>The data used here is simulated and allows us to get some good approximations for a number of interesting statistics. It is based on an unusual pseudo-random number generator that is very relevant to the problem being studied. A more theoretical approach can be found <a href="https://www.csun.edu/~hcmth031/tspolr.pdf" target="_blank" rel="noopener">here</a>, with connections to extreme value theory and the Gumbel distribution. See also my previous article <em>Distribution of Arrival Times for Extreme Events</em>, posted <a href="https://www.datasciencecentral.com/profiles/blogs/distribution-of-arrival-times-of-extreme-events" target="_blank" rel="noopener">here</a>. </span></p>
<p><span style="font-size: 14pt;"><strong>1. Simulations and theoretical results</strong></span></p>
<p>Bernoulli trials with <em>b</em> potential outcomes, each with the same probability of occurring, can be simulated using the following system. Start with some irrational number <em>x</em><span style="font-size: 8pt;">0</span> in [0, 1], say <em>x</em><span style="font-size: 8pt;">0</span> = log 2 (called the <em>seed</em>), and use the following iterations:</p>
<p style="text-align: center;"><em>a<span style="font-size: 8pt;">n</span></em> = INT(<em>b x<span style="font-size: 8pt;">n</span></em>)</p>
<p style="text-align: center;"><em>x<span style="font-size: 8pt;">n</span></em><span style="font-size: 8pt;">+1</span> = <em>b x<span style="font-size: 8pt;">n</span></em> - INT(<em>b x<span style="font-size: 8pt;">n</span></em>).</p>
<p>INT represents the integer part function. The result of the <em>n</em>-th trial is <em>a<span style="font-size: 8pt;">n</span></em>: a coding integer between 0 and <em>b</em> - 1 inclusive, representing for instance the result of throwing a die with <em>b</em> sides labeled 0, ..., <em>b</em> - 1. Also, <em>a<span style="font-size: 8pt;">n</span></em> is the <em>n</em>-th digit of <em>x</em><span style="font-size: 8pt;">0</span> in base <em>b</em>. These digits are strongly conjectured to be independent of each other, each with the same probability 1 / <em>b</em> of taking on any of the <em>b</em> potential values. Thus this scheme can be used to simulate the Bernoulli trials in question. Also, unlike traditional pseudorandom number generators, it does not produce periodic sequences. Such a system can be viewed as a chaotic dynamical system, just like the sine map discussed in my previous article, <a href="https://www.datasciencecentral.com/profiles/blogs/beautiful-mathematical-images" target="_blank" rel="noopener">here</a>. </p>
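<p>As an illustration, here is a minimal Python sketch of this generator, with the seed <em>x</em><span style="font-size: 8pt;">0</span> = log 2 and base <em>b</em> = 3 used below. The use of Decimal with extra precision is my own implementation choice: the iteration multiplies round-off error by <em>b</em> at each step, so standard floating point would produce wrong digits after about 30 iterations.</p>

```python
from decimal import Decimal, getcontext

def bernoulli_trials(n_trials, b=3):
    """Simulate Bernoulli trials with b equiprobable outcomes as the
    base-b digits of the seed x0 = log 2, via the iterations
    a_n = INT(b x_n), x_{n+1} = b x_n - INT(b x_n)."""
    # Each step multiplies round-off error by b, so carry extra digits.
    getcontext().prec = n_trials + 25
    x = Decimal(2).ln()              # seed x0 = log 2, in [0, 1]
    outcomes = []
    for _ in range(n_trials):
        bx = b * x
        a = int(bx)                  # a_n = INT(b x_n), in {0, ..., b-1}
        outcomes.append(a)
        x = bx - a                   # x_{n+1}: fractional part of b x_n
    return outcomes

digits = bernoulli_trials(30)        # first 30 base-3 digits of log 2
```

<p>Running the experiment with a different seed <em>x</em><span style="font-size: 8pt;">0</span> yields a different, independent digit sequence.</p>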
<p>The Bernoulli trials generated with <em>x</em><span style="font-size: 8pt;">0</span>, that is, the sequence <em>a</em><span style="font-size: 8pt;">0</span>, <em>a</em><span style="font-size: 8pt;">1</span>, and so on, constitute just one instance of a Bernoulli experiment. If you try <em>N</em> different seeds (the number <em>x</em><span style="font-size: 8pt;">0</span>), then you end up with <em>N</em> different, independent instances of Bernoulli experiments sharing the same dynamics, and things start to become interesting.</p>
<p><strong>1.1. Simulations</strong></p>
<p>I performed <em>N</em> = 200 simulations, each representing a Bernoulli experiment starting with a different seed <em>x</em><span style="font-size: 8pt;">0</span>, each consisting of 1,000,000 trials, with <em>b</em> = 3. The possible outcomes of each trial are 0, 1, or 2. I looked at successive record runs of zeros. For one of these experiments (a typical case), I found the following:</p>
<ul>
<li>One isolated zero (the first occurrence of zero) starts at position <em>n</em> = 3</li>
<li>The first run of 2 zeros starts at position 13 in the digits expansion</li>
<li>The next longer run consists of 3 zeros, starting at position 69</li>
<li>The next longer one (4 zeros) starts at position 132</li>
<li>Then we have 5 zeros starting at position 670, then 6 starting at position 743, 8 starting at position 13411, 10 starting at position 58454, and 12 starting at position 384100.</li>
</ul>
<p>The observations can be summarized by the following bivariate sequence:</p>
<p style="text-align: center;">(3,1), (13,2), (69,3), (132,4), (670,5), (743,6), (13411,8), (58454,10), (384100,12), …</p>
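<p>The bookkeeping behind this bivariate sequence can be sketched as follows (a hypothetical helper of my own, not from the original simulation code): it scans one sequence of outcomes and records each run of zeros that beats the previous record, with 1-based positions as above.</p>

```python
def record_runs(outcomes, target=0):
    """Return (position, length) pairs for the successive record runs
    of `target`; positions are 1-based, matching the article."""
    records, best = [], 0
    i, n = 0, len(outcomes)
    while i < n:
        if outcomes[i] != target:
            i += 1
            continue
        j = i
        while j < n and outcomes[j] == target:
            j += 1                   # extend the current run of zeros
        if j - i > best:             # a new record run
            best = j - i
            records.append((i + 1, best))
        i = j
    return records
```

<p>Note that, as in the example above, record lengths need not increase by unit increments: a record run of length 3 can directly follow one of length 1.</p>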
<p>If you blend all the sequences of vectors (<em>X</em>, <em>Y</em>) together, from the 200 experiments, you get the following: </p>
<p></p>
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/8558262452?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/8558262452?profile=RESIZE_710x" width="400" class="align-center"/></a></p>
<p style="text-align: center;"><strong>Figure 1</strong>: <em>Record runs of Y zeros vs the position X at which they occur in a Bernoulli experiment</em></p>
<p>Note that in Figure 1, the plot represents <em>Y</em> versus log(<em>X</em>), and <em>b</em> = 3. A record run equal to <em>Y</em> means that starting at position <em>X</em>, we observe the first instance of a (record) run consisting of <em>Y</em> consecutive zeros, in at least one of the <em>N</em> experiments. In Figure 2 featuring aggregated data, you can see the average log(<em>X</em>) computed across the <em>N</em> = 200 experiments, for any record run of length <em>Y</em> = 0, 1, 2, and so on (up to <em>Y</em> = 13). The chart speaks for itself; in the linear fit in Figure 2, the slope approaches log <em>b</em> as <em>N</em> tends to infinity.</p>
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/8558364664?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/8558364664?profile=RESIZE_710x" width="400" class="align-center"/></a></p>
<p style="text-align: center;"><strong>Figure 2</strong>: <em>Same as Figure 1, with log(X) averaged across the N = 200 experiments</em></p>
<p><strong>1.2. Theory</strong></p>
<p>A lot of theoretical results are known for maximum runs. We present a few of them here, with additional references. Note that in my article, I focus on record runs, which are different from maximum runs: in any Bernoulli experiment, maximum runs correspond to the first occurrence of a run of length 2, 3, 4, and so on. Record runs, as in the example outlined at the beginning of section 1, do not necessarily increase by unit increments: in my example, the first run of length 7 (not a record) occurs after the first (record) run of length 8. In short, you see a run of length 8 before you see one of length 7.</p>
<p>The main theoretical results, provided by <a href="https://mathoverflow.net/questions/383353/distribution-of-the-first-occurrence-of-a-maximum-record-run-of-zeros-in-the-d/383388#383388" target="_blank" rel="noopener">Yuval Peres</a>, are:</p>
<ul>
<li>Let <em>R<span style="font-size: 8pt;">n</span></em> be the length of the longest run in the first <em>n</em> digits. Then <em>R<span style="font-size: 8pt;">n</span></em> log(<em>b</em>) / log(<em>n</em>) tends to 1 almost surely as <em>n</em> tends to infinity. It was first proved by Renyi, see the discussion in reference [1].</li>
<li>The waiting times <em>T<span style="font-size: 8pt;">k</span></em> for the occurrence of a run of length <em>k</em> satisfy that <em>T<span style="font-size: 8pt;">k</span></em> / E(<em>T<span style="font-size: 8pt;">k</span></em>) is asymptotically exponentially distributed with mean 1. See references [2] - [4]. We also have (see reference [5] and [7]): <a href="https://storage.ning.com/topology/rest/1.0/file/get/8558795279?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/8558795279?profile=RESIZE_710x" width="100" class="align-center"/></a></li>
</ul>
<p>All references are in section 3. Note that these theoretical results apply to any run, not just runs of zeros. </p>
<p><span style="font-size: 14pt;"><strong>2. Application and generalization</strong></span></p>
<p>If you replace the integer <em>b</em> by a non-integer (strictly larger than 1), then the Bernoulli trials will inherit the properties of that unusual numeration system:</p>
<ul>
<li>The number of potential outcomes, for any trial, is INT(<em>b</em>), the integer part of <em>b</em></li>
<li>The trials are no longer independent: the <em>n</em>-th outcome <em>a<span style="font-size: 8pt;">n</span></em> is correlated with <em>a<span style="font-size: 8pt;">n</span></em><span style="font-size: 8pt;">+1</span></li>
<li>Outcomes have different probabilities: P(<em>a<span style="font-size: 8pt;">n</span></em> = 0) is not the same as P(<em>a<span style="font-size: 8pt;">n</span></em> = 1)</li>
</ul>
<p>Nevertheless, one can still perform the same simulations to estimate the statistics of interest. If <em>b</em> is a quadratic irrational, the corresponding successive outcomes (the <em>a<span style="font-size: 8pt;">n</span></em>'s) follow a Markov chain model. See <a href="https://www.jstage.jst.go.jp/article/jmath1948/26/1/26_1_33/_pdf" target="_blank" rel="noopener">here</a> for the theoretical details.</p>
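<p>As a quick sketch of this generalization (my own example, not from the article): with <em>b</em> equal to the golden ratio, INT(<em>b</em>) = 1, so the outcomes are 0 or 1, and the correlation between successive outcomes shows up as a hard constraint. Indeed, <em>a<span style="font-size: 8pt;">n</span></em> = 1 forces <em>x<span style="font-size: 8pt;">n</span></em><span style="font-size: 8pt;">+1</span> &lt; <em>b</em> - 1 = 1 / <em>b</em>, so a 1 is never followed by another 1.</p>

```python
from decimal import Decimal, getcontext

def trials_golden_base(n_trials):
    """Digits of x0 = log 2 in the non-integer base b = (1 + sqrt(5))/2.
    Outcomes are 0 or 1, and successive outcomes are correlated."""
    getcontext().prec = n_trials + 25
    b = (1 + Decimal(5).sqrt()) / 2       # golden ratio, about 1.618
    x = Decimal(2).ln()                   # seed x0 = log 2
    out = []
    for _ in range(n_trials):
        bx = b * x
        a = int(bx)                       # INT(b) = 1, so a is 0 or 1
        out.append(a)
        x = bx - a
    return out

digits = trials_golden_base(60)
# a_n = 1 implies x_{n+1} = b*x_n - 1 < b - 1 = 1/b, hence a_{n+1} = 0
```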
<p>Regardless of whether <em>b</em> is an integer or not, the application we are interested in is the approximation of irrational numbers by a specific class of numbers. This is usually done using continued fractions if the class of numbers in question consists of the rational numbers, and there is an abundant literature on this topic, see for instance <a href="https://mathoverflow.net/questions/383142/algebraic-and-rational-parts-of-a-real-number" target="_blank" rel="noopener">here</a>. However, we focus instead on best approximations of an irrational number <em>x</em><span style="font-size: 8pt;">0</span> in [0, 1] by a rational number <em>β<span style="font-size: 8pt;">n</span></em>, where</p>
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/8558617871?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/8558617871?profile=RESIZE_710x" width="120" class="align-center"/></a></p>
<p>Note that <em>β<span style="font-size: 8pt;">n</span></em><span> can be expressed as </span><em>p<span style="font-size: 8pt;">n</span></em> / <em>q<span style="font-size: 8pt;">n</span></em>, a quotient of two integers if <em>b</em> is an integer, with <em>q<span style="font-size: 8pt;">n</span></em> equal to <em>b</em> raised to the power <em>n</em>. The best approximation is obtained when the <em>a<span style="font-size: 8pt;">k</span></em>'s are the successive outcomes of the Bernoulli experiment with seed <em>x</em><span style="font-size: 8pt;">0</span>, or in other words, the first <em>n</em> digits of <em>x</em><span style="font-size: 8pt;">0</span> in base <em>b</em>. The approximation is exceptionally good if the last digit <em>a<span style="font-size: 8pt;">n</span></em><span style="font-size: 8pt;">-1</span> is not zero, and it is followed by a record run of digits equal to zero. The length of that run is expected to be asymptotically of the order of (log <em>n</em>) / (log <em>b</em>); for a fixed <em>n</em>, it cannot be better than that. Therefore, I propose the following conjecture, based on the probability distributions associated with extreme (record) runs discussed in section 1.</p>
<p><strong>Conjecture</strong></p>
<p>For most numbers <em>x</em><span style="font-size: 8pt;">0</span> in [0, 1], and for any <span><em>ε</em> > 0,</span> if <em>p</em> / <em>q</em> is an approximation of <em>x</em><span style="font-size: 8pt;">0</span>, with <em>p</em>, <em>q</em> co-prime positive integers, we have</p>
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/8558683066?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/8558683066?profile=RESIZE_710x" width="150" class="align-center"/></a></p>
<p>The details of how I came to this conjecture are outlined in the section<em> Connection with approximations of irrationals by rational numbers</em>, in <a href="https://mathoverflow.net/questions/383353/distribution-of-the-first-occurrence-of-a-maximum-record-run-of-zeros-in-the-d/" target="_blank" rel="noopener">this article</a>. While this is beyond the scope of this article, a discussion of best approximations by continued fractions leads to a similar conclusion. In particular, if <em>p<span style="font-size: 8pt;">n</span></em> / <em>q<span style="font-size: 8pt;">n</span></em> is the <em>n</em>-th convergent of the number <em>x</em>, we have the following result, see the last theorem in <a href="https://math.colorado.edu/~rohi1040/expository/ergodicthysimplecontfracs.pdf" target="_blank" rel="noopener">this article</a>, pictured below. In short, it says that if <span><em>ε</em> = 0, then only some proportion of all numbers <em>x</em><span style="font-size: 8pt;">0</span> will satisfy the above inequality. With <em>ε</em> > 0, almost all <em>x</em><span style="font-size: 8pt;">0</span> will. </span></p>
<p></p>
<p><span><a href="https://storage.ning.com/topology/rest/1.0/file/get/8585977467?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/8585977467?profile=RESIZE_710x" width="600" class="align-center"/></a></span></p>
<p></p>
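<p>To make <em>β<span style="font-size: 8pt;">n</span></em> concrete, here is a minimal numerical check (my own sketch; the function name is hypothetical): it rebuilds <em>p<span style="font-size: 8pt;">n</span></em> / <em>q<span style="font-size: 8pt;">n</span></em> with <em>q<span style="font-size: 8pt;">n</span></em> = <em>b</em> raised to the power <em>n</em>, from the first <em>n</em> base-<em>b</em> digits of the seed <em>x</em><span style="font-size: 8pt;">0</span> = log 2 used earlier, and confirms that the truncation error is below 1 / <em>q<span style="font-size: 8pt;">n</span></em>.</p>

```python
from fractions import Fraction
from decimal import Decimal, getcontext

def best_approx(n, b=3):
    """beta_n = p_n / q_n with q_n = b**n, built from the first n
    base-b digits a_0, ..., a_{n-1} of x0 = log 2."""
    getcontext().prec = n + 25           # guard against round-off
    x = Decimal(2).ln()
    p = 0
    for _ in range(n):
        bx = b * x
        a = int(bx)
        p = p * b + a                    # p_n = sum of a_k * b**(n-1-k)
        x = bx - a
    return Fraction(p, b ** n)

beta = best_approx(20)                   # approximates log 2 within 3**(-20)
```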
<p>Finally, record runs in Bernoulli trials are a topic in combinatorial analysis, with numerous applications, and thus relevant to machine learning. Also, you can learn more about non-integer bases in <a href="https://www.datasciencecentral.com/profiles/blogs/fascinating-new-results-in-the-theory-of-randomness" target="_blank" rel="noopener">this article</a>. A summary table is available <a href="https://www.datasciencecentral.com/profiles/blogs/number-representation-systems-explained-in-one-picture" target="_blank" rel="noopener">here</a>.</p>
<p><span style="font-size: 14pt;"><strong>3. References</strong></span></p>
<p>[1] Schilling, Mark F. <em>The longest run of heads</em>. The College Mathematics Journal 21, no. 3 (1990): 196-207.</p>
<p>[2] Aldous, David. <em>Probability approximations via the Poisson clumping heuristic</em>. Vol. 77. Springer Science & Business Media, 2013.</p>
<p>[3] Földes, A. <em>The limit distribution of the length of the longest head-run</em>. Periodica Mathematica Hungarica 10 (1979): 301-310.</p>
<p>[4] Godbole, Anant P. <em>Poisson approximations for runs and patterns of rare events</em>. Advances in applied probability (1991): 851-865.</p>
<p>[5] Feller, William. <em>An introduction to probability theory and its applications</em>. 1957.</p>
<p>[6] Gerber, Hans U., and Shuo-Yen Robert Li. <em>The occurrence of sequence patterns in repeated experiments and hitting times in a Markov chain</em>. Stochastic Processes and their Applications 11, no. 1 (1981): 101-108.</p>
<p>[7] Li, Shuo-Yen Robert. <em>A martingale approach to the study of occurrence of sequence patterns in repeated experiments</em>. Annals of Probability 8, no. 6 (1980): 1171-1176.</p>
<p></p>
More Surprising Math Images
tag:www.datasciencecentral.com,2021-02-08:6448529:BlogPost:1022670
2021-02-08T04:30:00.000Z
Vincent Granville
https://www.datasciencecentral.com/profile/VincentGranville
<p><em>To zoom in on any picture, click on the image to get a higher resolution.</em></p>
<p>This is a follow-up to my previous article <a href="https://www.datasciencecentral.com/profiles/blogs/beautiful-mathematical-images" target="_blank" rel="noopener">here</a>, where you can find additional, very different images, the theory behind them, and their relevance to machine learning techniques. What is surprising is that all these images were produced with a formula with a single parameter <em>λ</em>, and they look very different depending on the value of <em>λ</em>. More precisely, they are generated using the following recursion:</p>
<p style="text-align: center;"><em>x</em><span style="font-size: 8pt;"><em>n</em>+1</span><span> </span>=<span> </span><em>x<span style="font-size: 8pt;">n</span></em><span> </span>+ <em>λ</em><span> </span>sin(<em>y<span style="font-size: 8pt;">n</span></em>),</p>
<p style="text-align: center;"><em>y</em><span style="font-size: 8pt;"><em>n</em>+1</span><span> </span>=<span> </span><em>x<span style="font-size: 8pt;">n</span></em><span> </span>+ <em>λ</em><span> </span>sin(<em>x<span style="font-size: 8pt;">n</span></em>),</p>
<p>with initial conditions <em>x</em><span style="font-size: 8pt;">0</span>, <em>y</em><span style="font-size: 8pt;">0</span>. </p>
<p>Seven different groups of three images are displayed. In each group, the leftmost image, a scatterplot (in blue), corresponds to the orbit of (<em>x<span style="font-size: 8pt;">n</span></em>, <em>y<span style="font-size: 8pt;">n</span></em>) in two dimensions, given the initial conditions. The central image features <em>x<span style="font-size: 8pt;">n</span></em> and <em>y<span style="font-size: 8pt;">n</span></em> as two time series, with <em>x<span style="font-size: 8pt;">n</span></em> in blue and <em>y<span style="font-size: 8pt;">n</span></em> in red. In both cases, 20,000 iterations are used. The rightmost image is the same as the leftmost one, except that only the first 25 iterations are displayed, and a green curve connects the 25 dots, to show what the orbit looks like at the beginning. The initial vector (<em>x</em><span style="font-size: 8pt;">0</span>, <em>y</em><span style="font-size: 8pt;">0</span>) is not included in that image.</p>
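<p>The orbits can be sketched as follows; this is a minimal Python implementation of the two equations exactly as stated, with the plotting itself left out.</p>

```python
import math

def orbit(x0, y0, lam, n_iter=20000):
    """Iterate x_{n+1} = x_n + lam*sin(y_n), y_{n+1} = x_n + lam*sin(x_n),
    returning the orbit as a list of (x_n, y_n) points."""
    points = [(x0, y0)]
    x, y = x0, y0
    for _ in range(n_iter):
        # both updates use the current (x_n, y_n), so compute them jointly
        x, y = x + lam * math.sin(y), x + lam * math.sin(x)
        points.append((x, y))
    return points

pts = orbit(1.0, 4.0, 0.04, n_iter=25)   # settings of Figure 1, first 25 steps
```

<p>Scatter-plotting all 20,000 points of orbit(1, 4, 0.04) reproduces the leftmost panel of Figure 1; the first 25 points give the rightmost panel.</p>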
<p></p>
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/8530324885?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/8530324885?profile=RESIZE_710x" width="700" class="align-center"/></a></p>
<p style="text-align: center;"><strong>Figure 1</strong>: <em>x<span style="font-size: 8pt;">0</span> = 1, y<span style="font-size: 8pt;">0</span> = 4, λ = 0.04</em></p>
<p style="text-align: center;"></p>
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/8530326887?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/8530326887?profile=RESIZE_710x" width="700" class="align-center"/></a></p>
<p style="text-align: center;"><strong>Figure 2</strong>: <em>x<span style="font-size: 8pt;">0</span> = 1, y<span style="font-size: 8pt;">0</span> = 4, λ = 0.06</em></p>
<p style="text-align: center;"></p>
<p style="text-align: center;"><a href="https://storage.ning.com/topology/rest/1.0/file/get/8530323258?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/8530323258?profile=RESIZE_710x" width="700" class="align-center"/></a></p>
<p style="text-align: center;"><strong>Figure 3</strong>: <em>x<span style="font-size: 8pt;">0</span> = 3, y<span style="font-size: 8pt;">0</span> = 4, λ = 1.5</em></p>
<p></p>
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/8530331493?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/8530331493?profile=RESIZE_710x" width="700" class="align-center"/></a></p>
<p style="text-align: center;"><strong>Figure 4</strong>: <em>x<span style="font-size: 8pt;">0</span> = 56, y<span style="font-size: 8pt;">0</span> = 4, λ = 0.04</em></p>
<p style="text-align: center;"></p>
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/8530366692?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/8530366692?profile=RESIZE_710x" width="700" class="align-center"/></a></p>
<p style="text-align: center;"><strong>Figure 5</strong>: <em>x<span style="font-size: 8pt;">0</span> = 2, y<span style="font-size: 8pt;">0</span> = 4, λ = 10</em></p>
<p></p>
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/8530385678?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/8530385678?profile=RESIZE_710x" width="700" class="align-center"/></a></p>
<p style="text-align: center;"><strong>Figure 6</strong>: <em>x<span style="font-size: 8pt;">0</span> = 1, y<span style="font-size: 8pt;">0</span> = 4, λ = 2.5</em></p>
<p></p>
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/8530386883?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/8530386883?profile=RESIZE_710x" width="700" class="align-center"/></a></p>
<p style="text-align: center;"><strong>Figure 7</strong>: <em>x<span style="font-size: 8pt;">0</span> = 3, y<span style="font-size: 8pt;">0</span> = 4, λ = 2</em></p>
<p></p>
<p>As a bonus, here is another picture produced with a different type of chaotic dynamical system. It is discussed <a href="https://mathoverflow.net/questions/352967/is-this-a-new-strange-attractor" target="_blank" rel="noopener">here</a>. </p>
<p></p>
<p><em><a href="https://storage.ning.com/topology/rest/1.0/file/get/8582320259?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/8582320259?profile=RESIZE_710x" width="400" class="align-center"/></a></em></p>
<p></p>
<p>Another interesting one can be found <a href="https://arxiv.org/pdf/1508.07814.pdf" target="_blank" rel="noopener">here</a> (page 21):</p>
<p><em><a href="https://storage.ning.com/topology/rest/1.0/file/get/8609092274?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/8609092274?profile=RESIZE_710x" width="400" class="align-center"/></a></em></p>
<p></p>
<p><em>To receive a weekly digest of our new articles, subscribe to our newsletter, <a href="https://www.datasciencecentral.com/profiles/blogs/check-out-our-dsc-newsletter" target="_blank" rel="noopener">here</a>.</em></p>
<p></p>
<p><em><strong>About the author</strong>: Vincent Granville is a d<span class="lt-line-clamp__raw-line">ata science pioneer, mathematician, book author (Wiley), patent owner, former post-doc at Cambridge University, former VC-funded executive, with 20+ years of corporate experience including CNET, NBC, Visa, Wells Fargo, Microsoft, eBay. Vincent is also self-publisher at <a href="http://datashaping.com/" target="_blank" rel="noopener">DataShaping.com</a>, and founded and co-founded a few start-ups, including one with a successful exit (Data Science Central acquired by Tech Target).</span> You can access Vincent's articles and books,<span> </span><a href="http://datashaping.com/" target="_blank" rel="noopener">here</a>.</em></p>
Beautiful Mathematical Images
tag:www.datasciencecentral.com,2021-02-02:6448529:BlogPost:1018503
2021-02-02T19:30:00.000Z
Vincent Granville
https://www.datasciencecentral.com/profile/VincentGranville
<p><em>To zoom in on any picture, click on the image to get a higher resolution.</em></p>
<p></p>
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/8505475867?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/8505475867?profile=RESIZE_710x" width="400" class="align-center"/></a></p>
<p style="text-align: center;"><strong>Figure 1</strong>: <em>The pillow basins (see section 3)</em></p>
<p style="text-align: center;"></p>
<p style="text-align: left;">The topic discussed here is closely related to optimization techniques in machine learning. I will also talk about dynamic systems, especially discrete chaotic ones, in two dimensions. This is a fascinating branch of quantitative science, with numerous applications. This article provides you with an opportunity to gain exposure to this discipline, which is usually overlooked by data scientists but well studied by mathematicians and physicists. The images presented here are selected not just for their beauty, but most importantly for their intrinsic value: the practical insights that can be derived from them, and the implications for machine learning professionals. </p>
<p style="text-align: left;"></p>
<p><span style="font-size: 14pt;"><strong>1. Introduction to dynamical systems</strong></span></p>
<p>A discrete dynamical system is a sequence <em>x</em><span style="font-size: 8pt;"><em>n</em>+1</span> = <em>f</em>(<em>x<span style="font-size: 8pt;">n</span></em>) where <em>n</em> is an integer, starting with <em>n</em> = 0 (the initial condition), and where <em>f</em> is a real-valued function. In the continuous version (not discussed here), the index <em>n</em> (also called time) is a real number. The function <em>f</em> is called the <em>map</em> of the system, and the system itself is also called a <em>mapping</em>. The most studied one is the logistic map, defined by <em>f</em>(<em>x</em>) = <span><em>ρ</em></span><em>x</em> (1 - <em>x</em>), with <em>x</em> in [0, 1]. When <span><em>ρ</em> = 4, it is fully chaotic. </span>The sequence (<em>x<span style="font-size: 8pt;">n</span></em>), for a specific initial condition <em>x</em><span style="font-size: 8pt;">0</span>, is called the <em>orbit</em>. </p>
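<p>As a quick illustration, the orbit of the fully chaotic logistic map (<em>ρ</em> = 4) can be computed in a few lines of Python (a generic sketch, not code from the article):</p>

```python
# Orbit of a discrete dynamical system x_{n+1} = f(x_n), illustrated with the
# logistic map f(x) = rho * x * (1 - x), which is fully chaotic when rho = 4.
def orbit(f, x0, n):
    """Return the first iterates [x0, f(x0), f(f(x0)), ...], n of them after x0."""
    xs = [x0]
    for _ in range(n):
        xs.append(f(xs[-1]))
    return xs

rho = 4.0
xs = orbit(lambda x: rho * x * (1 - x), x0=0.2, n=5)
# xs starts [0.2, 0.64, 0.9216, ...]; the orbit stays inside [0, 1]
```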
<p>Another example of chaotic mapping is the digits in base <em>b</em> of an irrational number <em>z</em> in [0,1]. In this case, <em>x</em><span style="font-size: 8pt;">0</span> = <em>z</em>, <em>f</em>(<em>x</em>) = <em>bx</em> - INT(<em>bx</em>) and the <em>n</em>-th digit of <em>z</em> is INT(<em>bx<span style="font-size: 8pt;">n</span></em>). Here INT is the integer part function. It is studied in detail in my book <em>Applied Stochastic Processes, Chaos Modeling, and Probabilistic Properties of Numeration Systems</em><span>, </span><span>available for free, <a href="https://www.datasciencecentral.com/profiles/blogs/fee-book-applied-stochastic-processes" target="_blank" rel="noopener">here</a>. See also the second, large appendix in my free book </span><span><em>Statistics: New Foundations, Toolbox, and Machine Learning Recipes</em>, available <a href="https://www.datasciencecentral.com/profiles/blogs/free-book-statistics-new-foundations-toolbox-and-machine-learning" target="_blank" rel="noopener">here</a>. Applications include the design of non-periodic pseudo-random number generators, cryptography, and even a new concept of number guessing (gambling or simulated stock market) where the winning numbers can be computed in advance with a public algorithm that requires trillions of years of computing time, while a fast, private algorithm is kept secret. See <a href="https://www.datasciencecentral.com/profiles/blogs/data-science-foundations-for-a-new-stock-market" target="_blank" rel="noopener">here</a>. </span></p>
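<p>A minimal Python sketch of this digit-extraction map follows; using exact rational arithmetic (my choice for this sketch, not something prescribed in the text) sidesteps the round-off problems inherent to chaotic maps:</p>

```python
from fractions import Fraction

# n-th digit of z in base b via the chaotic map f(x) = b*x - INT(b*x),
# where the digit emitted at step n is INT(b * x_n).
def digits(z, b, n):
    x, out = Fraction(z), []
    for _ in range(n):
        out.append(int(b * x))       # current digit
        x = b * x - int(b * x)       # next state of the orbit
    return out

# 1/3 = 0.010101... in base 2
print(digits(Fraction(1, 3), 2, 6))  # → [0, 1, 0, 1, 0, 1]
```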
<p>The concept easily generalizes to two dimensions. In this case, <em>x<span style="font-size: 8pt;">n</span></em> is a vector or a complex number. Mappings in the complex plane are known to produce beautiful fractals; they have been used in fractal compression algorithms to compress images. In one dimension, once in chaotic mode, they produce Brownian-like orbits, with applications in Fintech.</p>
<p><strong>1.1. The sine map</strong></p>
<p>Moving forward, we focus exclusively on a particular case of the <em>sine mapping</em>, both in one and two dimensions. This is one of the simplest nonlinear mappings, yet it is very versatile and produces a large number of varied and intriguing configurations. In one dimension, it is defined as follows:</p>
<p style="text-align: center;"><em>x</em><span style="font-size: 8pt;"><em>n</em>+1</span> = -<em>ρx<span style="font-size: 8pt;">n</span></em> + <em>λ</em> sin(<em>x<span style="font-size: 8pt;">n</span></em>).</p>
<p>In two dimensions, it is defined as</p>
<p style="text-align: center;"><em>x</em><span style="font-size: 8pt;"><em>n</em>+1</span> = -<em>ρx<span style="font-size: 8pt;">n</span></em> + <em>λ</em> sin(<em>y<span style="font-size: 8pt;">n</span></em>),</p>
<p style="text-align: center;"><em>y</em><span style="font-size: 8pt;"><em>n</em>+1</span> = -<em>ρy<span style="font-size: 8pt;">n</span></em> + <em>λ</em> sin(<em>x<span style="font-size: 8pt;">n</span></em>).</p>
<p></p>
<p>This system is governed by two real parameters: <span><em>ρ</em> and</span> <span><em>λ</em>. Some of its properties and references are discussed <a href="https://mathoverflow.net/questions/382610/strange-behavior-of-x-n1-x-n-lambda-sin-x-n" target="_blank" rel="noopener">here</a>. </span></p>
<p><span style="font-size: 14pt;"><strong>2. Connection to machine learning optimization algorithms</strong></span></p>
<p>I need to introduce two more concepts before getting down to the interesting stuff. The first one is the <em>fixed point</em>. A root is simply a value <em>x</em>* such that <em>f</em>(<em>x</em>*) = 0. Some systems don't have any root, some have one, some have several, and some have infinitely many, depending on the values of the parameters (in our case, on <em>ρ</em> and<em> λ</em>, see section 1.1). Some or all roots can be found using the following <em>fixed point</em> recursion: <em>x</em><span style="font-size: 8pt;"><em>n</em>+1</span> = <em>x<span style="font-size: 8pt;">n</span></em> + <em>f</em>(<em>x<span style="font-size: 8pt;">n</span></em>). In our case, this translates to the following algorithm.</p>
<p><strong>2.1. Fixed point algorithm</strong></p>
<p>For our sine mapping defined in section 1.1, proceed as follows</p>
<p style="text-align: center;"><em>x</em><span style="font-size: 8pt;"><em>n</em>+1</span> = <em>x<span style="font-size: 8pt;">n</span></em> - <em>ρx<span style="font-size: 8pt;">n</span></em> + <em>λ</em> sin(<em>x<span style="font-size: 8pt;">n</span></em>)</p>
<p>in one dimension, or </p>
<p style="text-align: center;"><em>x</em><span style="font-size: 8pt;"><em>n</em>+1</span> = <em>x<span style="font-size: 8pt;">n</span></em> - <em>ρx<span style="font-size: 8pt;">n</span></em> + <em>λ</em> sin(<em>y<span style="font-size: 8pt;">n</span></em>),</p>
<p style="text-align: center;"><em>y</em><span style="font-size: 8pt;"><em>n</em>+1</span> = <em>y<span style="font-size: 8pt;">n</span></em> - <em>ρy<span style="font-size: 8pt;">n</span></em> + <em>λ</em> sin(<em>x<span style="font-size: 8pt;">n</span></em>),</p>
<p>in two dimensions. If the sequence converges to some <em>x</em>* (one dimension) or (<em>x</em>*, <em>y</em>*) (two dimensions), then that limit is a fixed point of the system. To find as many fixed points as possible, you need to try many different initial conditions. Some initial conditions lead to one fixed point, some lead to another fixed point, and some lead to nowhere. Some fixed points can never be reached, no matter what initial conditions you use. This is illustrated later in this article. </p>
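<p>The procedure just described can be sketched in Python for the one-dimensional case (the parameter values below are illustrative):</p>

```python
import math

# Fixed point recursion x_{n+1} = x_n - rho*x_n + lam*sin(x_n), scanned
# over several initial conditions to collect the reachable roots.
def fixed_point(x0, rho, lam, iters=1000, tol=1e-9):
    x = x0
    for _ in range(iters):
        x_new = x - rho * x + lam * math.sin(x)
        if abs(x_new - x) < tol:
            return round(x_new, 6)   # converged: a reachable root
        x = x_new
    return None                      # this initial condition leads nowhere

rho, lam = 0.0, 0.5
roots = {fixed_point(x0, rho, lam) for x0 in (-5, -2, 0.5, 2, 5)} - {None}
# with these parameters, only the roots -pi and +pi are reached
```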
<p><strong>2.2. Connection to optimization algorithms</strong></p>
<p>Optimization techniques are widely used in machine learning and statistical science, for instance in deep neural networks, or to find a maximum likelihood estimator.</p>
<p>When looking for the maxima or minima of a function <em>f</em>, you look for the roots of the derivative of <em>f</em> (in one dimension) or the points where its gradient vanishes (in two dimensions). This is typically done using the Newton-Raphson method, a particular type of fixed point algorithm with quadratic convergence.</p>
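<p>A generic Newton-Raphson iteration, applied here to <em>g</em>(<em>x</em>) = sin(<em>x</em>) as an illustrative target (my choice, not a function from the article), looks like this:</p>

```python
import math

# Newton-Raphson is the fixed point iteration x_{n+1} = x_n - g(x_n)/g'(x_n);
# near a simple root, the error roughly squares at every step.
def newton(g, dg, x0, iters=50, tol=1e-12):
    x = x0
    for _ in range(iters):
        step = g(x) / dg(x)
        x -= step
        if abs(step) < tol:
            break
    return x

root = newton(math.sin, math.cos, x0=3.0)  # converges to pi, the nearest root
```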
<p><strong>2.3. Basins of attraction</strong></p>
<p>The second concept I introduce is the <em>basin of attraction</em>. A basin of attraction is the full set of initial conditions such that the fixed point iterations of section 2.1 converge to the same root <em>x</em>* of the system.</p>
<p>Let me illustrate this with the one-dimensional sine mapping, with <em>ρ</em> = 0 and <em>λ </em>= 1. The roots of the system are the solutions to sin(<em>x</em>) = 0, that is <em>x</em>* = <em>k</em><span><em>π</em>, where <em>k</em> is any integer. If the initial condition <em>x</em><span style="font-size: 8pt;">0</span> is anywhere in the open interval ]2<em>kπ</em>, 2(<em>k</em>+1)<em>π</em>[, then the fixed point algorithm always converges to the same <em>x</em>* = (2<em>k</em> + 1)<em>π</em>. So each of these intervals constitutes a distinct basin of attraction, and there are infinitely many of them. However, none of the roots <em>x</em>* = 2<em>kπ</em> can be reached, regardless of the initial condition <em>x</em><span style="font-size: 8pt;">0</span>, unless <em>x</em><span style="font-size: 8pt;">0</span> = <em>x</em>* = 2<em>kπ</em> itself. </span></p>
<p><span>In two dimensions, the basins of attraction look beautiful when plotted. Some have fractal boundaries. I believe none of their boundaries have an explicit, closed-form equation, except in trivial cases. This is illustrated in section 3, featuring the beautiful images promised at the beginning. </span></p>
<p><strong>2.4. Final note about the one-dimensional sine map</strong></p>
<p><span>The sequence <em>x</em><span style="font-size: 8pt;"><em>n</em>+1</span> = <em>x<span style="font-size: 8pt;">n</span></em> + <em>λ</em> sin(<em>x<span style="font-size: 8pt;">n</span></em>) behaves as follows. Here we assume <em>λ</em> > 0 and <em>ρ</em> = 0.</span></p>
<ul>
<li><span>If <em>λ </em> < 1, it converges to a root <em>x</em>*</span></li>
<li><span>If <em>λ =</em> 4, it oscillates constantly in a narrow horizontal band, never converging</span></li>
<li><span>If <em>λ </em> > 6, it behaves chaotically, like an unbounded Brownian motion, with the exception noted below</span></li>
</ul>
<p><span>There is a very narrow interval around <em>λ =</em> 8 where the behavior is non-chaotic. In that case, <em>x<span style="font-size: 8pt;">n</span></em> is asymptotically equivalent to +2<em>π n</em> or -2<em>π n</em>; the sign depends on the initial condition <em>x</em><span style="font-size: 8pt;">0</span>, and is very sensitive to it. For instance, if <em>x</em><span style="font-size: 8pt;">0</span> = 2 and <em>λ </em>= 8, then <em>x</em><span style="font-size: 8pt;">2<em>n</em></span> - <em>x</em><span style="font-size: 8pt;">2<em>n</em>-1</span> gets closer and closer to <em>α</em> = 7.939712..., and <em>x</em><span style="font-size: 8pt;">2<em>n</em>-1</span> - <em>x</em><span style="font-size: 8pt;">2<em>n</em>-2</span> gets closer and closer to <em>β</em> = -1.65653..., as <em>n</em> increases, with <em>α</em> + <em>β</em> = 2<em>π</em>. Furthermore, <em>α</em> satisfies the equation</span></p>
<p><span><a href="https://storage.ning.com/topology/rest/1.0/file/get/8505364456?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/8505364456?profile=RESIZE_710x" width="300" class="align-center"/></a></span></p>
<p><span style="font-size: 12pt;">For details, see <a href="https://mathoverflow.net/questions/382610/strange-behavior-of-x-n1-x-n-lambda-sin-x-n" target="_blank" rel="noopener">here</a>. The phenomenon in question is pictured in Figure 2 below. </span></p>
<p><span style="font-size: 10pt;"><a href="https://storage.ning.com/topology/rest/1.0/file/get/8507389694?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/8507391075?profile=RESIZE_710x" width="400" class="align-center"/></a></span></p>
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/8507394283?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/8507394283?profile=RESIZE_710x" width="400" class="align-center"/></a></p>
<p style="text-align: center;"><span style="font-size: 12pt;"><strong>Figure 2</strong>: <em>x<span style="font-size: 8pt;">n</span> for n = 0 to 20,000 (X-axis), with x<span style="font-size: 8pt;">0</span> = 2; λ = 8 (top), λ = 7.98 (bottom)</em></span></p>
<p><span style="font-size: 14pt;"><strong>3. Beautiful math images and their implications</strong></span></p>
<p>The first picture (Figure 1, at the top of the article) features parts of the four non-degenerate basins of attraction in the 2-dimensional sine map, when <span><em>λ =</em> 2 and <em>ρ </em>= 0.75. This sine map has 49 = 7 x 7 roots (<em>x</em>*, <em>y</em>*), with <em>x</em>* one of the 7 solutions of <em>ρ</em>x = <em>λ </em>sin(<em>λ</em> sin(<em>x</em>) / <em>ρ</em>), and <em>y</em>* also one of the 7 solutions of the same equation. Computations were performed using the fixed point algorithm described in section 2.1. Note that the white zone corresponds to initial conditions (<em>x</em><span style="font-size: 8pt;">0</span>, <em>y</em><span style="font-size: 8pt;">0</span>) that do not lead to convergence of the fixed point algorithm. Each basin is assigned one color (other than white), and is made of same-colored sections scattered across many pillows. I call it the pillow basins. It would be interesting to see if the basin boundaries can be represented by simple mathematical functions. One degenerate basin (the fifth basin), consisting of the diagonal line <em>x</em> = <em>y</em>, is not displayed in Figure 1.</span></p>
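<p>The basin computation behind Figure 1 can be sketched in Python as follows (a low-resolution sketch with an illustrative grid size; the actual figure was produced at much higher resolution):</p>

```python
import math

# Label each initial condition (x0, y0) by the root reached by the fixed point
# algorithm of section 2.1, for the 2-D sine map with lam = 2 and rho = 0.75.
def basin_label(x0, y0, rho, lam, iters=2000, tol=1e-8):
    x, y = x0, y0
    for _ in range(iters):
        xn = x - rho * x + lam * math.sin(y)
        yn = y - rho * y + lam * math.sin(x)
        if abs(xn - x) < tol and abs(yn - y) < tol:
            return (round(xn, 4), round(yn, 4))  # converged: basin label
        x, y = xn, yn
    return None  # white zone: no convergence

rho, lam, grid = 0.75, 2.0, 40
labels = [[basin_label(-4 + 8 * i / grid, -4 + 8 * j / grid, rho, lam)
           for i in range(grid + 1)] for j in range(grid + 1)]
# distinct non-None labels correspond to the colored basins of Figure 1
```

Plotting one color per distinct label over the grid reproduces the pillow pattern; every label returned satisfies the root equations ρ<em>x</em>* = λ sin(<em>y</em>*) and ρ<em>y</em>* = λ sin(<em>x</em>*).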
<p>The picture below (Figure 3) shows parts of 5 of the infinitely many basins of attraction corresponding to <span><em>λ</em></span> = 0.5 and <span><em>ρ</em></span> = 0, for the 2-dimensional sine map. As in Figure 1, the X-axis represents <em>x</em><span style="font-size: 8pt;">0</span> and the Y-axis represents <em>y</em><span style="font-size: 8pt;">0</span>. The range is from -4 to 4 in both Figure 1 and Figure 3. Each basin has its own color.</p>
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/8507115889?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/8507115889?profile=RESIZE_710x" width="400" class="align-center"/></a></p>
<p style="text-align: center;"><span><strong>Figure 3</strong>: <em>The octopus basins</em></span></p>
<p><span>In this case, we have infinitely many roots (with <em>x</em>*, <em>y</em>* being multiples of <em>π</em>), but only one-fourth of them can be reached by the fixed point algorithm. The more roots, the more basins, and as a result, the more interference between basins, making the image look noisy: a very small change in the initial conditions can lead to convergence to a different root, hence the apparent overlap between basins. </span></p>
<p><span>The takeaway is that when dealing with an optimization problem that has many local maxima and minima, the solution you get is very sensitive to the initial conditions. In some cases it matters, and in some cases it does not: if you are looking for a local optimum only, this is not an issue. This is further illustrated in Figure 4 below. It shows the orbits - that is, the locations of (<em>x<span style="font-size: 8pt;">n</span></em>, <em>y<span style="font-size: 8pt;">n</span></em>) - starting with four different initial conditions (<em>x</em><span style="font-size: 8pt;">0</span>, <em>y</em><span style="font-size: 8pt;">0</span>), for the sine map featured in Figure 1. The blue dots represent the roots (<em>x</em>*, <em>y</em>*). Each orbit except the green one converges to a different root; the green one oscillates back and forth, never converging.</span></p>
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/8507235501?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/8507235501?profile=RESIZE_710x" width="400" class="align-center"/></a></p>
<p style="text-align: center;"><span><strong>Figure 4</strong>: <em>Four orbits corresponding to four initial conditions, for the case shown in Figure 1 </em></span></p>
<p><strong>Note:</strong> When the system is very sensitive to initial conditions and highly chaotic, numerically computed orbits may be all wrong, as round-off errors propagate exponentially fast as <em>n</em> increases. In that case, you need high-precision computing to get accurate orbits; see <a href="https://www.datasciencecentral.com/profiles/blogs/high-precision-computing-benchmark-examples-and-tutorial" target="_blank" rel="noopener">here</a>.</p>
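<p>This round-off blow-up is easy to demonstrate in plain double precision (a generic sketch): perturbing the initial condition by 1e-12, roughly the size of round-off noise, leaves the orbit unchanged at first, but completely decorrelates it after a couple hundred iterations.</p>

```python
import math

# Chaotic regime of the 1-D sine map (rho = 0, lam = 10): two orbits whose
# initial conditions differ by only 1e-12 separate exponentially fast.
def sine_orbit(x0, lam, n):
    x = x0
    for _ in range(n):
        x = x + lam * math.sin(x)
    return x

gap_early = abs(sine_orbit(2.0, 10, 5) - sine_orbit(2.0 + 1e-12, 10, 5))
gap_late = abs(sine_orbit(2.0, 10, 200) - sine_orbit(2.0 + 1e-12, 10, 200))
# gap_early is still microscopic; gap_late is of the order of the orbit's
# range, which is why arbitrary-precision arithmetic (for instance via a
# package such as mpmath) is needed for trustworthy long orbits
```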
<p><strong>3.1. Benchmarking clustering algorithms</strong></p>
<p><span>The basins of attraction can be used to benchmark supervised clustering algorithms. For instance, in Figure 1, if you group the red and black basins together, and the yellow and blue basins together, you end up with two well-separated groups whose boundaries can be determined to arbitrary precision. One can sample points from the merged basins to create a training set with two groups, and check how well a clustering algorithm (based, for instance, on nearest neighbors or density estimation) can estimate the true boundaries. Another machine learning problem that you can test on these basins is boundary estimation: finding the boundary of a domain when you know points that are inside and points that are outside the domain. </span></p>
<p><strong>3.2. Interesting probability problem</strong></p>
<p><span>The case pictured in Figure 1 leads to an interesting question. If you randomly pick a vector of initial conditions (<em>x</em><span style="font-size: 8pt;">0</span>, <em>y</em><span style="font-size: 8pt;">0</span>), what is the probability that it will fall in (say) the red basin? It turns out that the probability is the same regardless of the basin. However, the probability of falling outside any basin (the white area) is different.</span></p>
<p><em>More beautiful images can be found in Part 2 of this article, <a href="https://www.datasciencecentral.com/profiles/blogs/more-beautiful-math-images" target="_blank" rel="noopener">here</a>. To not miss them, subscribe to our newsletter, <a href="https://www.datasciencecentral.com/profiles/blogs/check-out-our-dsc-newsletter" target="_blank" rel="noopener">here</a>. See also <a href="https://www.datasciencecentral.com/profiles/blogs/deep-visualizations-riemann-s-conjecture" target="_blank" rel="noopener">this article</a>, featuring an image entitled "the eye of the Riemann Zeta function". See also the Wikipedia article about "Infinite Compositions of Analytic Functions", <a href="https://en.wikipedia.org/wiki/Infinite_compositions_of_analytic_functions#:~:text=In%20mathematics%2C%20infinite%20compositions%20of,convergence%2Fdivergence%20of%20these%20expansions." target="_blank" rel="noopener">here</a>. The picture below is from that article.</em></p>
<p></p>
<p><em><a href="https://storage.ning.com/topology/rest/1.0/file/get/8572990262?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/8572990262?profile=RESIZE_710x" width="400" class="align-center"/></a></em></p>
<p></p>
<p><em><strong>About the author</strong>: Vincent Granville is a d<span class="lt-line-clamp__raw-line">ata science pioneer, mathematician, book author (Wiley), patent owner, former post-doc at Cambridge University, former VC-funded executive, with 20+ years of corporate experience including CNET, NBC, Visa, Wells Fargo, Microsoft, eBay. Vincent also founded and co-founded a few start-ups, including one with a successful exit (Data Science Central acquired by Tech Target).</span> He is also the founder and investor in<span> </span><a href="https://www.parisrestaurantandbar.com/blog" target="_blank" rel="noopener">Paris Restaurant</a><span> </span>in Anacortes, WA. You can access Vincent's articles and books,<span> </span><a href="https://www.datasciencecentral.com/profiles/blogs/my-data-science-machine-learning-and-related-articles" target="_blank" rel="noopener">here</a>. </em></p>
<p></p>
Can a Diploma from a Lower Ranking University Hurt your Data Science Career Prospects?
tag:www.datasciencecentral.com,2021-01-29:6448529:BlogPost:1015350
2021-01-29T04:16:13.000Z
Vincent Granville
https://www.datasciencecentral.com/profile/VincentGranville
<p>Here I specifically discuss the case of a PhD degree from a third-tier university, though to some extent, it also applies to master's degrees. Many professionals joining companies such as Facebook, Microsoft, or Google in a role other than programmer typically have a PhD degree, although there are many exceptions. It is still possible to learn data science on the job, especially if you have a quantitative background (say in physics or engineering) and have experience working with serious data: see <a href="https://www.datasciencecentral.com/profiles/blogs/is-it-still-possible-today-to-become-a-self-taught-data-scientist" target="_blank" rel="noopener">here</a>. After all, learning Python is not that hard and can be done via data camps. What is more difficult to acquire is analytical maturity. </p>
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/8492386293?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/8492386293?profile=RESIZE_710x" width="300" class="align-center"/></a></p>
<p style="text-align: center;"><em>University of Namur</em></p>
<p>In my case, I did my PhD at the University of Namur, a place that nobody has heard of. The topic of my research was computational statistics and image analysis. These were hot topics back then, and thanks to my mentor, I was also lucky to work part-time in the corporate world, as part of my PhD program, for a state-of-the-art GIS (Geographic Information System) company, alongside engineers working on digital satellite images. Much of what I worked on is still very active these days, on a much bigger scale. It was a precursor of automated driving systems, and the math department at my alma mater was young and still very creative back then. This brings me to my first piece of advice on choosing a PhD program.</p>
<p><strong>Advice #1</strong></p>
<ul>
<li>If you come from a poor background, your options might be more limited (this was my case), and you need to leverage everything you can. My parents did not have the money to send me to expensive schools, and I ended up attending the closest one to avoid spending a lot of money on rent. On the plus side, I did not accumulate student loans.</li>
<li>Before deciding on a PhD program, carefully choose your mentor. Mine was not known for his research, but he was well connected to the industry, managed to get money to fund his projects, and was working on exciting, applied projects. </li>
</ul>
<p>A side effect of my last piece of advice is that if your goal is to stay in Academia, you may have to rely on yourself to make your research worthy of publication and likely to land you a tenured position. The way I did it is summarized in my next piece of advice. Ideally, you want to leave all doors open, both Academia and other options.</p>
<p><strong>Advice #2</strong></p>
<ul>
<li>Be proactive about reaching out to well-respected professors in your field. Attend conferences and meet peers from around the world. Accept roles such as reviewer. Start publishing in third-tier journals, move to second-tier, and then get a few papers into first-tier journals before completing your PhD. The one I published in the <em>Journal of the Royal Statistical Society, Series B</em> is what resulted in me being accepted as a postdoc at Cambridge University. Initially, when it was accepted, it only had my name on it. </li>
<li>It helps to be passionate about what you do. My very first paper was in the <em>Journal of Number Theory</em>, during my first year as a PhD student. It happened because I had a passion for number theory that I developed during my middle-school and high-school years. I hated high-school math (repetitive, boring, mechanical exercises) but loved the math that I discovered and taught myself during those years, mostly through reading. I was the only student in my school to participate (and be a finalist) in the national Math Olympiads. When you are young, that's something good to have on your resume. </li>
</ul>
<p>So, to answer the original question - does it hurt coming from a low-ranking school - at this point you know that you can still succeed despite the odds. But it requires patience and perseverance, and you must be very good at what you do. Perhaps the biggest drawback is the lack of great connections that top schools offer; you have to make up for that. Great schools also have state-of-the-art equipment and labs (so you can learn the most modern stuff), but somehow my little math department didn't lack these, so I was not penalized there either. I also cultivated great relationships with the computer science department. In the end, my research was at the intersection of math, statistics, and computer science.</p>
<p>My last piece of advice is about what happens after completing your PhD. In my case, I started a postdoc at Cambridge, then moved to the corporate world (after failing a job interview for a tenured position), and eventually became an entrepreneur and VC-funded executive, recently selling my last venture to a publicly traded company. I still do independent math research, even more so, and of higher caliber, than during my PhD years. </p>
<p><strong>Advice #3</strong></p>
<ul>
<li>Contact other successful professionals who came from a third-tier university to ask for their advice. In my math department, two other PhD students in my cohort ended up having a stellar career: Michel Bierlaire (postdoc MIT after Namur) is now full professor at EPFL; Didier Burton (also postdoc MIT after Namur) ended up as an executive at Yahoo. </li>
<li>If you can, leverage the fact that you are very applied and don't have student loans: you can ask for a lower salary, be more competitive, and gain broad horizontal experience in many places while developing world-class expertise in a few areas. I eventually realized that working for myself (not as a consultant, but as an entrepreneur) was what I liked best.</li>
</ul>
<p>You may argue that you don't need any diploma to create your own self-funded company, not even elementary school, but in the end I believe I got the best I could out of my PhD. In my case, it also implied relocating several times: from Belgium (due to a lack of jobs) to the UK to the United States, and from the East Coast to the Bay Area and finally Seattle. I've been through various bubbles and market crashes; you may use your analytical skills to navigate them as best you can, selling and buying at the right time, understanding the markets, and emerging stronger each time. </p>
<p></p>
<p><em><strong>About the author</strong>: Vincent Granville is a d<span class="lt-line-clamp__raw-line">ata science pioneer, mathematician, book author (Wiley), patent owner, former post-doc at Cambridge University, former VC-funded executive, with 20+ years of corporate experience including CNET, NBC, Visa, Wells Fargo, Microsoft, eBay. Vincent also founded and co-founded a few start-ups, including one with a successful exit (Data Science Central acquired by Tech Target).</span> He is also the founder and investor in<span> </span><a href="https://www.parisrestaurantandbar.com/blog" target="_blank" rel="noopener">Paris Restaurant</a><span> </span>in Anacortes, WA. You can access Vincent's articles and books,<span> </span><a href="https://www.datasciencecentral.com/profiles/blogs/my-data-science-machine-learning-and-related-articles" target="_blank" rel="noopener">here</a>.</em></p>
<p></p>
Moving Averages: Natural Weights, Iterated Convolutions, and Central Limit Theorem
tag:www.datasciencecentral.com,2021-01-26:6448529:BlogPost:1011806
2021-01-26T02:00:00.000Z
Vincent Granville
https://www.datasciencecentral.com/profile/VincentGranville
<p>Convolution is a concept well known to machine learning and signal processing professionals. In this article, we explain in simple English how a moving average is actually a discrete convolution, and we use this fact to build weighted moving averages with natural weights that, in the limit, exhibit a Gaussian behavior guaranteed by the Central Limit Theorem. To signal processing experts, moving averages are nothing more than blurring filters, with a Gaussian-like kernel in the case discussed here. Inverting a moving average to recover the original signal consists of applying the inverse filter, known as a sharpening or enhancing filter. The inverse filter is used, for instance, in image analysis to remove noise or deblur an image, while the original filter (the moving average) does the opposite. We discuss the one-dimensional discrete case, known as time series, along with generalizations and an interesting application in number theory related to the famous unsolved Riemann conjecture.</p>
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/8470334300?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/8470334300?profile=RESIZE_710x" width="400" class="align-center"/></a></p>
<p style="text-align: center;"><strong>Figure 1</strong>: <em>Bell-shaped distribution for re-scaled coefficients (the weights) discussed in section 1.1</em></p>
<p><span style="font-size: 14pt;"><strong>1. Weighted moving averages as convolutions</strong></span></p>
<p>Given a discrete time series with observations <em>X</em>(0), <em>X</em>(1), <em>X</em>(2)<i> </i>and so on, a weighted moving average can be defined by</p>
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/8469896293?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/8469896293?profile=RESIZE_710x" width="350" class="align-center"/></a></p>
<p>Here <em>Y</em>(<em>t</em>) is the smoothed signal and <em>h</em> is a discrete density function (thus summing to one), though negative values of <em>h</em>(<em>k</em>) are sometimes used, for instance in Spencer's 15-point moving average used by actuaries, see <a href="https://mathworld.wolfram.com/Spencers15-PointMovingAverage.html" target="_blank" rel="noopener">here</a>. We assume that <em>t</em> can take on negative integer values. Also, unless otherwise specified, we assume the weights to be symmetrical, that is, <em>h</em>(<em>k</em>) = <em>h</em>(-<em>k</em>). The parameter <em>N</em> can be infinite, but typically the values <em>h</em>(<em>k</em>) decay rapidly as you move away from <em>k</em> = 0. </p>
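<p>The smoothing step above can be sketched in a few lines of Python (a minimal illustration; the function name and the toy series are mine, and edge handling near the boundaries is delegated to NumPy's "same" mode):</p>

```python
import numpy as np

def weighted_moving_average(x, h):
    # Y = h * X as a discrete convolution; h plays the role of the weights
    # h(-N), ..., h(0), ..., h(N) and must sum to one (a discrete density).
    h = np.asarray(h, dtype=float)
    assert np.isclose(h.sum(), 1.0)
    return np.convolve(x, h, mode="same")  # "same" keeps the original length

x = np.array([1.0, 2.0, 6.0, 2.0, 1.0])
y = weighted_moving_average(x, [1/3, 1/3, 1/3])  # simplest case: N = 1, equal weights
# y[2] = (2 + 6 + 2) / 3 = 10/3
```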
<p>The notation used by mathematicians to represent this transformation is as follows: <em>Y</em> = <em>T</em>(<em>X</em>) = <em>h</em> * <em>X</em> where * is the convolution operator. This notation is convenient because it easily allows us to define the iterated moving average as a self-composition of the operator <em>T</em>, acting on the time series <em>X </em>: Start with <em>Y</em><span style="font-size: 8pt;">0</span> = <em>X</em>, <em>Y</em><span style="font-size: 8pt;">1</span> = <em>Y</em>, and let <em>Y</em><span style="font-size: 8pt;"><em>n</em>+1</span> = <em>T</em>(<em>Y<span style="font-size: 8pt;">n</span></em>) = <em>h</em> * <em>Y<span style="font-size: 8pt;">n</span></em>. Likewise, we can define <span style="font-size: 12pt;"><em>h<span style="font-size: 8pt;">n</span></em></span> (with <em>h</em><span style="font-size: 8pt;">1</span> = <em>h</em>) as <em>h</em> * <em>h</em> * ... * <em>h</em>, that is, an <em>n</em>-fold self-convolution of <em>h</em>. Of course, <em>Y<span style="font-size: 8pt;">n</span></em> = <em>h<span style="font-size: 8pt;">n</span></em> * <em>X</em> so that we have</p>
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/8469956688?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/8469956688?profile=RESIZE_710x" width="400" class="align-center"/></a></p>
<p>Note that the sum goes from -<em>N<span style="font-size: 8pt;">n</span></em> to <em>N<span style="font-size: 8pt;">n</span></em> this time, as each additional iteration increases the number of terms in the sum, so <em>N<span style="font-size: 8pt;">n</span></em><span style="font-size: 8pt;">+1</span> > <em>N<span style="font-size: 8pt;">n</span></em>, with <em>N</em><span style="font-size: 8pt;">1</span> = <em>N</em>. This becomes clear in the following illustration.</p>
<p><strong>1.1 Example</strong></p>
<p>The most basic case corresponds to <em>N</em> = 1, with <em>h</em>(-1) = <em>h</em>(0) = <em>h</em>(1) = 1/3. In this case, <em>N<span style="font-size: 8pt;">n</span></em> = <em>n</em>, and the average value of <em>h<span style="font-size: 8pt;">n</span></em>(<em>k</em>) is equal to 1 / (2<em>N<span style="font-size: 8pt;">n</span></em> +1). We have the following table:</p>
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/8470110072?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/8470110072?profile=RESIZE_710x" width="400" class="align-center"/></a></p>
<p>The above table shows how the weights are automatically determined, with no guesswork, rules of thumb, or fine-tuning required. Note that the sum of the elements in the <em>n</em>-th row is always equal to 3^<em>n</em> (3 to the power <em>n</em>). This is very similar to the table of binomial coefficients, and the <em>h<span style="font-size: 8pt;">n</span></em>(<em>k</em>) are known as trinomial coefficients, see <a href="https://oeis.org/search?q=1%2C6%2C21%2C50%2C90%2C126&language=english&go=Search" target="_blank" rel="noopener">here</a>. The difference is that for binomial coefficients, the sum of the elements in the <em>n</em>-th row is always equal to 2^<em>n</em>, and the <em>n</em>-th row only has <em>n</em> + 1 entries, versus 2<em>n</em> + 1 in our table. The values <em>h<span style="font-size: 8pt;">n</span></em>(<em>k</em>) corresponding to <em>n</em> = 100 are displayed in Figure 1, at the top of this article. They have been scaled by a factor equal to the square root of <em>N<span style="font-size: 8pt;">n</span></em>, since otherwise they all tend to zero as <em>n</em> tends to infinity. </p>
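<p>The table can be reproduced with a short Python loop. Working with the unnormalized weights (1, 1, 1) yields the trinomial coefficients directly; dividing row <em>n</em> by 3^<em>n</em> recovers the weights <em>h<span style="font-size: 8pt;">n</span></em>(<em>k</em>):</p>

```python
import numpy as np

# n-fold self-convolution of (1, 1, 1): the n-th row of the table, unnormalized.
# The n-th row has 2n + 1 entries and sums to 3**n.
rows = [np.array([1, 1, 1])]
for n in range(2, 5):
    rows.append(np.convolve(rows[-1], [1, 1, 1]))

for n, row in enumerate(rows, start=1):
    print(f"n={n}: {list(row)}  sum={row.sum()}")
```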
<p><strong>1.2 Link to the Central Limit Theorem</strong></p>
<p>The methodology developed here can be used to prove the central limit theorem in the most classic way. Indeed, the classic proof uses iterated self-convolutions, and the fact that the Fourier transform of a convolution is the product of the Fourier transforms of the individual functions. In probability theory, the Fourier transform is called the characteristic function. Interestingly, this leads to Gaussian approximations for partial sums of coefficients such as those in the <em>n</em>-th row of the above table, when <em>n</em> is large and after proper rescaling. This is already well known for binomial coefficients (see <a href="http://www.ams.org/publicoutreach/feature-column/fcarc-normal" target="_blank" rel="noopener">here</a>), and it easily extends to the coefficients introduced here, as well as to many other types of mathematical coefficients. See also Figure 1.</p>
<p><span style="font-size: 14pt;"><strong>2. Inverting a moving average, and generalizations</strong></span></p>
<p>Inverting a moving average means retrieving the original time series or signal, by applying the inverse filter to the observed data in order to un-smooth it. This is usually not possible, though the true answer is somewhat more nuanced. It is certainly easier when <em>N</em> is small, though usually <em>N</em> is not known, and neither are the weights. However, if the observed data is the result of applying the simple convolution described in section 1.1 with <em>N</em> = 1, you only need to know the values of <em>X</em>(<em>t</em>) at two different times <em>t</em><span style="font-size: 8pt;">0</span> and <em>t</em><span style="font-size: 8pt;">1</span> to retrieve the original signal. This is easiest if you know <em>X</em>(<em>t</em>) at <em>t</em><span style="font-size: 8pt;">0</span> = 0 and at <em>t</em><span style="font-size: 8pt;">1</span> = 1: in this case, there is a simple inversion formula: </p>
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/8470438871?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/8470438871?profile=RESIZE_710x" width="300" class="align-center"/></a></p>
<p>If you know <em>X</em>(0), <em>X</em>(1), and <em>Y</em>(<em>t</em>) for all <em>t</em>'s, you can iteratively retrieve <em>X</em>(2), <em>X</em>(3), and so on with the above recurrence formula. If you don't know <em>X</em>(0), <em>X</em>(1) but instead know the variance and other higher moments of <em>X</em>(<em>t</em>), assuming <em>X</em>(<em>t</em>) is stationary, then you may test various pairs <em>X</em>(0), <em>X</em>(1) until you find one matching these moments when reconstructing the full sequence <em>X</em>(<em>t</em>) with the above recurrence formula. The solution may not be unique. Other known parameters of <em>X</em>(<em>t</em>) may also help with the reconstruction: the period (if any), the slope of a linear trend (if any), and so on. </p>
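<p>Here is a minimal Python sketch of this reconstruction for the equal-weights case <em>N</em> = 1. Rearranging <em>Y</em>(<em>t</em>) = (<em>X</em>(<em>t</em>-1) + <em>X</em>(<em>t</em>) + <em>X</em>(<em>t</em>+1))/3 gives the recurrence <em>X</em>(<em>t</em>+1) = 3<em>Y</em>(<em>t</em>) - <em>X</em>(<em>t</em>) - <em>X</em>(<em>t</em>-1); the toy series is mine:</p>

```python
def invert_simple_average(y, x0, x1):
    # Recover X from Y(t) = (X(t-1) + X(t) + X(t+1)) / 3, seeded with X(0), X(1).
    # Rearranging gives the recurrence X(t+1) = 3*Y(t) - X(t) - X(t-1).
    x = [x0, x1]
    for t in range(1, len(y)):
        x.append(3 * y[t] - x[t] - x[t - 1])
    return x

x = [2.0, 5.0, 3.0, 8.0, 1.0, 4.0]
# Y(t) exists for t = 1 .. len(x) - 2; pad index 0 so that y[t] = Y(t).
y = [None] + [(x[t - 1] + x[t] + x[t + 1]) / 3 for t in range(1, len(x) - 1)]
recovered = invert_simple_average(y, x[0], x[1])
# recovered matches x up to floating-point error
```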
<p><strong>2.1 Generalizations</strong></p>
<p>The moving averages discussed here rely on the classic arithmetic mean as the fundamental convolution operator, corresponding to <em>N</em> = 1. It is possible to use other means such as the harmonic or geometric means, and even more general means such as those defined <a href="https://www.datasciencecentral.com/profiles/blogs/alternative-to-the-arithmetic-geometric-and-harmonic-means" target="_blank" rel="noopener">in this article</a>. The approach can also be generalized to two or more dimensions, and to a time-continuous signal. For prediction or extrapolation, see <a href="https://www.datasciencecentral.com/profiles/blogs/introducing-an-all-purpose-robust-fast-simple-non-linear-r22" target="_blank" rel="noopener">this article</a>. For interpolation, that is, estimating <em>X</em>(<em>t</em>) when <em>t</em> is not an integer, <a href="https://mathoverflow.net/questions/376081/infinite-partial-fraction-expansions-to-compute-fractional-iterations-and-recurr" target="_blank" rel="noopener">see this article</a>. </p>
<p><span style="font-size: 14pt;"><strong>3. Application and source code</strong></span></p>
<p>We applied the above methodology with <em>n</em> = 60 to the following time series, where <em>t</em> is an integer with 60 < <em>t</em> < 240:</p>
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/8477710477?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/8477710477?profile=RESIZE_710x" width="250" class="align-center"/></a></p>
<p>Figure 2 shows <em>Y<span style="font-size: 8pt;">n</span></em>(<em>t</em>) with <em>n</em> = 60 (the red curve), after shifting and rescaling (multiplying) it by a factor of order sqrt(<em>n</em>). In this case, <em>X</em>(2<em>t</em>) represents the real part of the <a href="https://en.wikipedia.org/wiki/Dirichlet_eta_function" target="_blank" rel="noopener">Dirichlet Eta function</a> <span><em>η</em> </span>defined in the complex plane. If you replace the cosine by a sine in the definition of <em>X</em>(<em>t</em>), you get similar results for the imaginary part of <em>η</em>. What is spectacular here is that <em>Y<span style="font-size: 8pt;">n</span></em>(<em>t</em>) is very well approximated by a cosine function (see the bottom of Figure 2). The implication is that thanks to the self-convolution used here, we can approximate the real and imaginary parts of <span><em>η</em> </span>by a simple auto-regressive model. This in turn may have implications for solving the famous <a href="https://www.datasciencecentral.com/profiles/blogs/deep-visualizations-riemann-s-conjecture" target="_blank" rel="noopener">Riemann Hypothesis</a> (RH), which essentially consists of locating the values of <em>t</em> such that <em>X</em>(2<em>t</em>) = 0 simultaneously for the real and imaginary parts of <em>η</em>. RH states that there is no such <em>t</em> in our particular case, where the parameter 0.75 is used in the definition of <em>X</em>(<em>t</em>). It is conjectured to be true as well if you replace 0.75 by any value strictly between 0.5 and 1. See more <a href="https://www.datasciencecentral.com/profiles/blogs/deep-visualizations-riemann-s-conjecture" target="_blank" rel="noopener">here</a> and <a href="https://mathoverflow.net/questions/382043/incredibly-accurate-recursions-for-the-riemann-zeta-function" target="_blank" rel="noopener">here</a>. </p>
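<p>The exact definition of <em>X</em>(<em>t</em>) is given in the image above and is not reproduced here. As a hedged stand-in, the partial sums below compute the real part of the Dirichlet Eta function at <em>s</em> = 0.75 + <em>it</em>, and check that the explicit cosine form agrees with the direct complex-power form:</p>

```python
import math

def eta_partial(sigma, t, K=2000):
    # Partial sum of eta(s) = sum_{n>=1} (-1)**(n+1) / n**s at s = sigma + i*t.
    s = complex(sigma, t)
    return sum((-1) ** (n + 1) / n ** s for n in range(1, K + 1))

def x_real(sigma, t, K=2000):
    # The real part written out explicitly:
    # sum (-1)**(n+1) * cos(t * log n) / n**sigma.
    return sum((-1) ** (n + 1) * math.cos(t * math.log(n)) / n ** sigma
               for n in range(1, K + 1))

z = eta_partial(0.75, 30.0)
# z.real agrees with x_real(0.75, 30.0) to floating-point precision
```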
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/8477209286?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/8477209286?profile=RESIZE_710x" width="500" class="align-center"/></a></p>
<p style="text-align: center;"><strong>Figure 2</strong>: <em>weighted moving average (WMA) with n = 60 (top), model fitting with cosine function (bottom)</em></p>
<p>Note that <em>X</em>(<em>t</em>), the blue curve, is non-periodic, while the red curve is almost perfectly periodic. If you use arbitrary moving averages instead of the one based on the convolution <em>h<span style="font-size: 8pt;">n</span></em> * <em>X</em>, you won't get a perfect fit in the bottom part of Figure 2, and certainly not with a simple cosine function. <a href="https://storage.ning.com/topology/rest/1.0/file/get/8477213652?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/8477213652?profile=RESIZE_710x" width="500" class="align-center"/></a></p>
<p style="text-align: center;"><strong>Figure 3</strong>: <em>same as top part of figure 2, but using a different X(t) for the blue curve</em></p>
<p>Also, the perfect fit cannot be achieved if you replace the logarithm in the definition of <em>X</em>(<em>t</em>) by a much faster-growing function. This is illustrated in Figure 3, where the logarithm in <em>X</em>(<em>t</em>) was replaced by a square root.</p>
<p>The source code can be downloaded <a href="https://storage.ning.com/topology/rest/1.0/file/get/8477763473?profile=original" target="_blank" rel="noopener">here</a> (convol2b.pl.txt). Since it deals with convolutions, it can be further optimized using Fast Fourier Transforms (FFT), see <a href="http://www.dspguide.com/ch18/2.htm" target="_blank" rel="noopener">here</a>. Finally, it would be interesting to treat this case assuming the time <em>t</em> is continuous, using continuous rather than discrete convolutions.</p>
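<p>The FFT speedup mentioned above can be sketched as follows (function name is mine; both inputs are zero-padded to the full output length so that the pointwise FFT product matches the linear convolution):</p>

```python
import numpy as np

def fft_convolve(a, b):
    # Linear convolution via FFT: pad both inputs to length len(a)+len(b)-1,
    # multiply the spectra, and transform back. O(n log n) instead of O(n^2).
    n = len(a) + len(b) - 1
    return np.fft.irfft(np.fft.rfft(a, n) * np.fft.rfft(b, n), n)

a = np.random.default_rng(0).normal(size=200)
h = np.full(3, 1 / 3)
# fft_convolve(a, h) matches np.convolve(a, h) to floating-point precision
```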
<p></p>
<p></p>
<p><em><strong>About the author</strong>: Vincent Granville is a d<span class="lt-line-clamp__raw-line">ata science pioneer, mathematician, book author (Wiley), patent owner, former post-doc at Cambridge University, former VC-funded executive, with 20+ years of corporate experience including CNET, NBC, Visa, Wells Fargo, Microsoft, eBay. Vincent also founded and co-founded a few start-ups, including one with a successful exit (Data Science Central acquired by Tech Target).</span> He is also the founder and investor in <a href="https://www.parisrestaurantandbar.com/blog" target="_blank" rel="noopener">Paris Restaurant</a> in Anacortes, WA. You can access Vincent's articles and books,<span> </span><a href="https://www.datasciencecentral.com/profiles/blogs/my-data-science-machine-learning-and-related-articles" target="_blank" rel="noopener">here</a>.</em></p>
Machine Learning / Stats / BI: Mini Translation Dictionary
tag:www.datasciencecentral.com,2021-01-19:6448529:BlogPost:1008950
2021-01-19T06:12:00.000Z
Vincent Granville
https://www.datasciencecentral.com/profile/VincentGranville
<p>Here I provide translations for various important terms, to help professionals from related backgrounds better understand each other. In particular, machine learning professionals versus statisticians.</p>
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/8438181275?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/8438181275?profile=RESIZE_710x" width="600" class="align-center"/></a></p>
<p style="text-align: center;"><em>Source for picture: <a href="https://www.datasciencecentral.com/profiles/blogs/machine-learning-vs-statistics-in-one-picture" target="_blank" rel="noopener">here</a></em></p>
<p><strong>Feature</strong> (machine learning)</p>
<p>A feature is known as a variable or independent variable in statistics. It is also known as a predictor by predictive analytics professionals. </p>
<p><strong>Response</strong></p>
<p>The response is called dependent variable in statistics. Machine learning professionals sometimes call it the output. </p>
<p><strong>R-square</strong></p>
<p>This is the statistic used by statisticians to measure the performance of a model; many better alternatives exist. Machine learning professionals sometimes call it a goodness-of-fit metric. </p>
<p><strong>Regression</strong></p>
<p>Sometimes called maximum likelihood regression or linear regression by statisticians. Physicists and signal processing / operations research professionals use the term ordinary least squares instead. And yes, it is possible to compute confidence intervals (CI) without an underlying model: such intervals are called data-driven, and rely on simulations and empirical percentile distributions. </p>
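<p>The data-driven intervals mentioned here can be sketched with a percentile bootstrap (a minimal illustration; the sample, seed, and confidence level are arbitrary choices of mine):</p>

```python
import numpy as np

# Percentile bootstrap: a model-free, data-driven confidence interval for the
# mean, built from simulated resamples and empirical percentiles.
rng = np.random.default_rng(42)
data = rng.exponential(scale=2.0, size=500)

boot_means = np.array([
    rng.choice(data, size=data.size, replace=True).mean()
    for _ in range(2000)
])
lo, hi = np.percentile(boot_means, [2.5, 97.5])  # empirical 95% CI for the mean
```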
<p><strong>Logistic transform</strong></p>
<p>The term used in the context of neural networks is sigmoid. Statisticians are more familiar with the word logistic, as in logistic regression.</p>
<p><strong>Neural networks</strong></p>
<p>While not exactly the same thing, statisticians have their own multi-layer hierarchical networks: they are called Bayesian hierarchical networks.</p>
<p><strong>Test of hypothesis</strong></p>
<p>Business intelligence professionals call it A/B testing, or multivariate testing.</p>
<p><strong>Boosted models</strong></p>
<p>Boosted models are used by machine learning professionals to blend multiple models and get the best of each model. Statisticians call them ensemble techniques.</p>
<p><strong>Confidence intervals</strong></p>
<p>We are all familiar with this concept, invented by statisticians. Alternative terms include prediction interval and error (not to be confused with predictive or residual error, which has its own meaning for statisticians).</p>
<p><strong>Grouping</strong></p>
<p>Also known as aggregating; it consists of grouping values of some feature or independent variable, especially in decision trees, to reduce the number of nodes. Machine learning professionals call it feature binning. </p>
<p><strong>Taxonomy</strong></p>
<p>When applied to unstructured text data, the creation of a taxonomy (sometimes called ontology) is referred to as natural language processing. It is basically clustering of text data.</p>
<p><strong>Clustering</strong></p>
<p>Statisticians call it clustering. In machine learning, the concept is referred to as unsupervised classification. By contrast, supervised classification is a learning technique based on training sets and cross-validation. </p>
<p><strong>Control set</strong></p>
<p>Machine learning professionals use control and test sets. Statisticians use the term cross-validation or bootstrapping, as well as training sets. </p>
<p><strong>Model fitting</strong></p>
<p>The terms favored by machine learning professionals are model selection, testing, and feature selection. Model performance has its own related statistical term: the <em>p</em>-value, though it is less used these days. </p>
<p><strong>False positives</strong></p>
<p>Instead of false positives and false negatives, statisticians favor type I and type II errors.</p>
<p>Another similar dictionary can be found <a href="https://insights.sei.cmu.edu/sei_blog/2018/11/translating-between-statistics-and-machine-learning.html" target="_blank" rel="noopener">here</a>. </p>
<p></p>
<p><br/> <em><strong>About the author</strong>: Vincent Granville is a d<span class="lt-line-clamp__raw-line">ata science pioneer, mathematician, book author (Wiley), patent owner, former post-doc at Cambridge University, former VC-funded executive, with 20+ years of corporate experience including CNET, NBC, Visa, Wells Fargo, Microsoft, eBay. Vincent also founded and co-founded a few start-ups, including one with a successful exit (Data Science Central acquired by Tech Target).</span> You can access Vincent's articles and books,<span> </span><a href="https://www.datasciencecentral.com/profiles/blogs/my-data-science-machine-learning-and-related-articles" target="_blank" rel="noopener">here</a>.</em></p>
Deep visualizations to Help Solve Riemann's Conjecture
tag:www.datasciencecentral.com,2021-01-06:6448529:BlogPost:1007807
2021-01-06T06:00:00.000Z
Vincent Granville
https://www.datasciencecentral.com/profile/VincentGranville
<p>This is the second part of my article <a href="https://www.datasciencecentral.com/profiles/blogs/spectacular-visualization-the-eye-of-the-riemann-zeta-function" target="_blank" rel="noopener">Spectacular Visualization: The Eye of the Riemann Zeta Function</a>, focusing on the most infamous unsolved mathematical conjecture, one that has a $1 million prize attached to it. I used the word <em>deep</em> not in the sense of deep neural networks, but because these visualizations have deep consequences for how to solve this conjecture, opening a new path of attack and featuring non-standard generalizations that lead to new perspectives and new approaches to solve RH (as the conjecture is called in mathematical circles). </p>
<p>This work is mostly based on data science, and the results presented here are experimental in nature and still need to be proved formally. The main visualization featuring 6 scatterplots is published here for the first time: it shows the orbits of 3 Riemann-like functions, their <em>eyes</em>, and their surprising ring-shaped error distribution when only the first few hundred terms are used in the series defining these functions. It deviates from classical pure-math approaches in the sense that what I do looks more like stochastic dynamical systems, attractors, wavelets, and should appeal to data analysts, engineers and physicists.</p>
<p>The problem is so popular that there are YouTube videos about it, some having gathered several million views. One of them is also featured here. My own scatterplots show the behavior of a new class of Riemann-like functions, as well as interesting slices of the orbit that are rarely (if ever) displayed in the literature, revealing peculiar features that could help in solving RH.</p>
<p><span style="font-size: 14pt;"><strong>1. Orbits of Riemann-like Functions</strong></span></p>
<p>The main picture in this article consists of the 6 plots below. Click on the picture to zoom in.</p>
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/8392563253?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/8392563253?profile=RESIZE_710x" width="600" class="align-center"/></a></p>
<p style="text-align: center;"><strong>Figure 1</strong><em>: Orbit (top) and residual error (bottom) for cosine (left),</em> <em>triangular (middle) and square wave (right)</em></p>
<p>I explain later in this section what they represent. But first, I need to introduce some material. Let </p>
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/8392571275?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/8392571275?profile=RESIZE_710x" width="350" class="align-center"/></a></p>
<p>be a function of <em>t</em>, with 0.5 < <em>σ</em> < 1 fixed, and <em>α</em>, <em>β</em>, <em>γ</em> three real parameters. This generalizes the function <em>ϕ</em> introduced <a href="https://www.datasciencecentral.com/profiles/blogs/spectacular-visualization-the-eye-of-the-riemann-zeta-function" target="_blank" rel="noopener">in my previous article</a>. This time, <em>λ</em>(<em>n</em>) = log(<em>n</em>) and <em>α</em> = 0, <em>β</em> = 1. Also, we are dealing with two sister functions of <em>t</em>, namely <em>ϕ</em><span style="font-size: 8pt;">1</span>(<em>σ</em>, <em>t</em>) = <em>ϕ</em>(<em>σ</em>, <em>t</em>; <em>α</em>, <em>β</em>, <em>γ</em>) with<em> γ </em>= 0, and the shifted <em>ϕ</em><span style="font-size: 8pt;">2</span>(<em>σ</em>, <em>t</em>) = <em>ϕ</em>(<em>σ</em>, <em>t</em>; <em>α</em>, <em>β</em>, <em>γ</em>) with <em>γ </em>= -π/2. They represent respectively the real and imaginary part of some function defined on the complex plane. The Riemann Hypothesis (RH), corresponding to <em>W</em>(<em>x</em>) = cos <em>x</em>, states that there is no zero of the Riemann zeta function <span><em>ζ</em>(<em>s</em>), with <em>s</em> = <em>σ </em>+ <em>it</em> a complex number, if 0.5 < <em>σ</em> < 1. Here <em>i</em> represents the imaginary unit whose square is -1. In layman's terms, it means that we cannot have <em>ϕ</em><span style="font-size: 8pt;">1</span>(<em>σ</em>, <em>t</em>) = <em>ϕ<span style="font-size: 8pt;">2</span></em>(<em>σ</em>, <em>t</em>) = 0 if 0.5 < <em>σ</em> < 1. You win $1 million if you prove it, see <a href="https://www.claymath.org/millennium-problems/riemann-hypothesis" target="_blank" rel="noopener">here</a>. </span></p>
<p><span>The novelty in my method is the introduction of a periodic wave function <em>W</em> in the definition of <em>ϕ</em>, thus generalizing RH in a way different from what other mathematicians did, that is, without using complicated <a href="https://en.wikipedia.org/wiki/L-function" target="_blank" rel="noopener">L-functions</a>. </span>This offers more hope of solving Riemann's conjecture (RH): first try to prove it for the easiest <em>W</em>, then understand what those <em>W</em>'s that have an RH attached to them (as opposed to those that do not) have in common. </p>
<p>Figure 1 (upper part) displays the spectacular orbits for three different waves (cosine, triangular and alternating-quadratic) in the test case <em>σ</em> = 0.75 and 0 < <em>t</em> < 600, with the hole around the origin (I call it the <em>eye</em>) being the hallmark of RH behavior: that is, no root for that particular value of <em>σ</em>, regardless of <em>t</em>, because of the hole. Though not displayed here, in the case <em>σ</em> = 0.5, the hole is entirely gone and corresponds to the <em>critical line</em> (the name given by mathematicians) where all the zeroes are found.</p>
<p>The orbit consists, for a fixed <em>σ</em>, of the points (<em>X</em>(<em>t</em>),<em>Y</em>(<em>t</em>)) with <em>X</em>(<em>t</em>) = <em>ϕ</em><span style="font-size: 8pt;">1</span>(<em>σ</em>, <em>t</em>) and <em>Y</em>(<em>t</em>) = <em>ϕ</em><span style="font-size: 8pt;">2</span>(<em>σ</em>, <em>t</em>). The bottom three plots represent the error between the true value (<em>X</em>(<em>t</em>),<em>Y</em>(<em>t</em>)) and its approximation based on using only the first 200 terms in the series that defines <em>ϕ</em>. The error distribution is very surprising; I was expecting the points to be radially but randomly distributed around the origin; instead, they are located on a ring. Note that for <em>t</em> > 600 (and for the triangular wave, for <em>t</em> > 80) you need to use more than 200 terms for the pattern to remain strong.</p>
<p>In Figure 1, the left part of the plot corresponds to the cosine wave (that is, classical RH), the middle part corresponds to the triangular wave, and the right part corresponds to the alternating quadratic wave. Interestingly, when <em>σ</em> = 1/2 the orbit does not have a hole anymore as predicted, yet the error points are still distributed on a similar ring.</p>
<p>The wave <em>W</em> is a continuous periodic function of period 2π, with one minimum equal to −1 and one maximum equal to +1 in the interval [0,2π], and the area below the X-axis equal to the area above the X-axis. It must have some symmetry. The waves used here are defined as follows:</p>
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/8392809497?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/8392809497?profile=RESIZE_710x" width="500" class="align-full"/></a></p>
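<p>A hedged Python sketch of one orbit point follows. The series for <em>ϕ</em> and the waves are defined in images above and are not reproduced here; the code assumes an eta-like partial sum with the wave <em>W</em> in place of the cosine (for <em>W</em> = cos and <em>γ</em> = 0 this reduces to the real part of the eta function), and the triangular-wave normalization is also an assumption:</p>

```python
import math

def wave_cos(x):
    return math.cos(x)

def wave_triangular(x):
    # Assumed normalization: period 2*pi, maximum +1 at x = 0, minimum -1 at pi.
    u = (x / (2 * math.pi)) % 1.0
    return 1.0 - 4.0 * min(u, 1.0 - u)

def phi(sigma, t, W, gamma=0.0, K=500):
    # Hypothetical partial sum generalizing the eta series with wave W;
    # gamma = -pi/2 turns the cosine wave into its sine counterpart.
    return sum((-1) ** (n + 1) * W(t * math.log(n) + gamma) / n ** sigma
               for n in range(1, K + 1))

# One orbit point (X(t), Y(t)) for the cosine wave at sigma = 0.75:
t = 25.0
point = (phi(0.75, t, wave_cos), phi(0.75, t, wave_cos, gamma=-math.pi / 2))
```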
<p>For the cosine wave, the Taylor series for <em>ϕ</em> is discussed <a href="https://mathoverflow.net/questions/380308/about-the-coefficients-of-taylor-series-for-the-complex-riemann-zeta-function" target="_blank" rel="noopener">here</a>, while the representation as an infinite product is discussed <a href="https://mathoverflow.net/questions/380327/infinite-products-for-linear-combinations-of-sines-or-cosines" target="_blank" rel="noopener">here</a>.</p>
<p><span style="font-size: 14pt;"><strong>2. Other interesting visualizations</strong></span></p>
<p>The orbit for the standard RH case has been published countless times for <em>σ</em> = 0.5. In that case, there is no eye, as the orbit crosses the origin infinitely many times. Some videos about the orbit trajectory have been posted on YouTube and viewed millions of times. Below is one of them. </p>
<p></p>
<p><iframe width="640" height="360" src="https://www.youtube.com/embed/zlm1aajH6gY?wmode=opaque" frameborder="0" allowfullscreen=""></iframe>
</p>
<p></p>
<p>Other popular visualizations include the time series for <em>ϕ</em><span style="font-size: 8pt;">1</span>(<em>σ</em>, <em>t</em>) and <em>ϕ</em><span style="font-size: 8pt;">2</span>(<em>σ</em>, <em>t</em>) when <em>σ</em> = 0.5. Below (Figure 2) is a version of mine, for <em>σ</em> = 0.75 and 0 < <em>t</em> < 600. Not only does it display the time series for the cosine wave (the standard RH case), but also, for the first time ever, for the triangular wave. The blue curve corresponds to <em>ϕ</em><span style="font-size: 8pt;">1</span>(<em>σ</em>, <em>t</em>), the orange one to <em>ϕ</em><span style="font-size: 8pt;">2</span>(<em>σ</em>, <em>t</em>).</p>
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/8392886055?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/8392886055?profile=RESIZE_710x" width="600" class="align-center"/></a></p>
<p style="text-align: center;"><strong>Figure 2</strong><em>: Time series for ϕ<span style="font-size: 8pt;">1</span>(σ, t) and ϕ<span style="font-size: 8pt;">2</span>(σ, t) when σ = 0.75</em></p>
<p>It is interesting to note that the peaks and valley floors of the triangular and cosine wave frameworks seem to be correlated, occurring at similar times. What's more, for the cosine wave, when a zero of the blue curve is close to a zero of the orange curve (that is, when these curves cross the X-axis at about the same time), the zero of the orange curve occurs first. This also seems to be true for the triangular wave, at least when <em>t</em> < 600.</p>
<p><span style="font-size: 14pt;"><strong>3. Generalization and source code</strong></span></p>
<p><span>The Perl source code is available <a href="https://storage.ning.com/topology/rest/1.0/file/get/8393110255?profile=original" target="_blank" rel="noopener">here</a>. Note that convergence is very slow, as discussed <a href="https://www.datasciencecentral.com/profiles/blogs/spectacular-visualization-the-eye-of-the-riemann-zeta-function" target="_blank" rel="noopener">in my previous article</a>. A table of the first 100,000 zeros of <em>ζ</em>(<em>s</em>) can be found <a href="http://www.dtc.umn.edu/~odlyzko/zeta_tables/index.html" target="_blank" rel="noopener">here</a>. More general results are available <a href="https://mathoverflow.net/questions/380762/some-properties-of-special-dirichlet-series-connection-to-riemann-hypothesis" target="_blank" rel="noopener">here</a>. In short, if 0.5 < <em>σ </em> < 1, the hole around the origin (pictured in Figure 1) is also present in the following case. Let's define </span></p>
<p><span><a href="https://storage.ning.com/topology/rest/1.0/file/get/8409714885?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/8409714885?profile=RESIZE_710x" width="380" class="align-center"/></a></span></p>
<p><span>together with <em>ϕ</em><span style="font-size: 8pt;">1</span>(<em>σ</em>, <em>t</em>) = 1 + <em>ϕ</em>(<em>σ</em>, <em>μ</em>, <em>t</em>; <em>α</em>, <em>β</em>, <em>γ</em>) with<em> γ </em>= 0, and <em>ϕ</em><span style="font-size: 8pt;">2</span>(<em>σ</em>, <em>t</em>) = <em>ϕ</em>(<em>σ</em>, <em>μ</em>, <em>t</em>; <em>α</em>, <em>β</em>, <em>γ</em>) with <em>γ </em>= -π/2. Then we still have a hole around the origin. That hole persists even if <em>σ</em> = 0.5, unless <em>μ</em> = 0. Here <em>μ</em>, <em>σ</em> are fixed but arbitrary, <em>λ</em>(<em>n</em>) = log <em>n</em>, and <em>α </em>= 0, <em>β </em>= 1; only <em>t</em> varies. It has been tested only for <em>W</em>(<em>x</em>) = cos <em>x</em>, and when 0 < <em>t</em> < 200.</span></p>
<p><strong>Exercise 1</strong></p>
<p>Show (numerically) that the cross-correlation between <em>ϕ</em><span style="font-size: 8pt;">1</span>(<em>σ, t</em>) and <em>ϕ</em><span style="font-size: 8pt;">2</span>(<em>σ, t</em>) is apparently zero, for the cosine wave <em>W</em>(<em>x</em>) = cos <em>x</em>. However, if you shift the orange curve in Figure 2, replacing <em>ϕ</em><span style="font-size: 8pt;">2</span>(<em>σ, t</em>) by <em>ϕ</em><span style="font-size: 8pt;">2</span>(<em>σ, t</em> +<em> τ</em>), the correlation may no longer be zero. Find, numerically, the value of <span><em>τ</em> that maximizes the cross-correlation in question. </span></p>
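<p>A minimal numerical sketch of this exercise for the cosine wave. It assumes the standard alternating (Dirichlet eta style) series for <em>ϕ</em>1 and <em>ϕ</em>2; the exact definition is given in the image earlier in this article, and the sampling range, increment, and truncation level are arbitrary choices:</p>

```python
import math

def phi12(sigma, t, n_terms=500):
    # Partial sums of the assumed alternating series for the cosine wave:
    # phi1 ~ sum (-1)^(n+1) cos(t log n) / n^sigma, phi2 the same with sin.
    p1 = p2 = 0.0
    for n in range(1, n_terms + 1):
        w = (-1.0) ** (n + 1) / n ** sigma
        p1 += w * math.cos(t * math.log(n))
        p2 += w * math.sin(t * math.log(n))
    return p1, p2

def cross_corr(sigma, tau, t_max=100.0, dt=0.1):
    # Sample correlation between phi1(sigma, t) and phi2(sigma, t + tau).
    ts = [k * dt for k in range(int(round(t_max / dt)))]
    xs = [phi12(sigma, t)[0] for t in ts]
    ys = [phi12(sigma, t + tau)[1] for t in ts]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    vx = sum((a - mx) ** 2 for a in xs)
    vy = sum((b - my) ** 2 for b in ys)
    return cov / math.sqrt(vx * vy)

# Lag 0 should come out close to zero; scan tau to find the maximum.
print(cross_corr(0.75, 0.0))
```

<p>Scanning <em>τ</em> over a grid (e.g., 0 to 2 in steps of 0.01) and keeping the value with the largest correlation answers the second part of the exercise.</p>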
<p><strong>Exercise 2</strong> </p>
<p>Prove that if <em>ζ</em>(<em>s</em>) = 0, with <em>s</em> = <em>σ</em> + <em>it</em> and 0 < <em>σ</em> < 1, then for all real <em>θ</em>, we have</p>
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/8400519688?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/8400519688?profile=RESIZE_710x" width="250" class="align-center"/></a></p>
<p>See answer <a href="https://mathoverflow.net/questions/380577/on-some-property-of-the-zeros-of-zetas-in-the-complex-plane/" target="_blank" rel="noopener">here</a>. </p>
<p><strong>Exercise 3</strong></p>
<p>Prove that the centroid of the orbits pictured in Figure 1 is always (<em>W</em>(0), <em>W</em>(<span>-π/2)</span>). This is true for the cosine, triangular and alternate square waves. <strong>Hint</strong>: The integral of <em>W</em>(<em>x</em>) between <em>x</em> = 0 and <em>x</em> = 2<span>π (the period) is always zero. The coordinates of the centroid are </span></p>
<p><span><a href="https://storage.ning.com/topology/rest/1.0/file/get/8409760490?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/8409760490?profile=RESIZE_710x" width="450" class="align-center"/></a></span></p>
<p>Since <em>ϕ</em><span style="font-size: 8pt;">1</span>, <em>ϕ</em><span><span style="font-size: 8pt;">2</span> are defined as infinite sums, swap the integral and sum operators, then proceed to the computation. The integral vanishes for all the terms in both series, except for the first one where it is equal to <em>W</em>(0) and <em>W</em>(-π/2), respectively for <em>ϕ</em><span style="font-size: 8pt;">1</span> and <em>ϕ<span style="font-size: 8pt;">2</span></em>.</span></p>
<p></p>
<p><em><strong>About the author</strong>: Vincent Granville is a d<span class="lt-line-clamp__raw-line">ata science pioneer, mathematician, book author (Wiley), patent owner, former post-doc at Cambridge University, former VC-funded executive, with 20+ years of corporate experience including CNET, NBC, Visa, Wells Fargo, Microsoft, eBay. Vincent also founded and co-founded a few start-ups, including one with a successful exit (Data Science Central acquired by Tech Target).</span> You can access Vincent's articles and books,<span> </span><a href="https://www.datasciencecentral.com/profiles/blogs/my-data-science-machine-learning-and-related-articles" target="_blank" rel="noopener">here</a>.</em></p>
Spectacular Visualization: The Eye of the Riemann Zeta Function
tag:www.datasciencecentral.com,2021-01-02:6448529:BlogPost:1006966
2021-01-02T20:30:00.000Z
Vincent Granville
https://www.datasciencecentral.com/profile/VincentGranville
<p>We discuss here one of the most famous unsolved mathematical conjectures of all time, one of seven that each have a $1 million award attached to them, see <a href="https://en.wikipedia.org/wiki/Millennium_Prize_Problems" target="_blank" rel="noopener">here</a>. It is known as the <a href="https://en.wikipedia.org/wiki/Riemann_hypothesis" target="_blank" rel="noopener">Riemann Hypothesis</a> and abbreviated as RH. Of course I did not solve it (yet), but the material presented here offers a new path towards making significant progress. As usual, I wrote this article in such a way as to make it understandable by a large audience. You don't need to know more than relatively simple calculus to read it, and you don't even need to know anything about <a href="https://en.wikipedia.org/wiki/Complex_analysis" target="_blank" rel="noopener">complex analysis</a>: I did the heavy lifting for you.</p>
<p>This is a typical illustration of experimental math blended with data science techniques, resulting in visualizations that provide great actionable insights. It is my hope that after reading this article, you will be tempted to further explore RH, create even better visualizations about it, and find new insights. The techniques used here apply to many other problems, including serious business analytics. </p>
<p><span style="font-size: 14pt;"><strong>1. The problem </strong></span></p>
<p>The Riemann hypothesis, dating back to 1859, states that the zeta function <em>ζ</em>(<em>s</em>), with <em>s</em> = <span><em>σ</em> </span>+ <em>it</em> a complex number (the letter <em>i</em> denoting the imaginary unit), has no zero in the critical strip 0 < <em>σ</em> < 1 other than on the critical line <em>σ</em> = 1/2. If proved, it would have a profound impact not just on number theory, but on many other areas of mathematics and beyond. In layman's terms, it can be re-formulated as follows. </p>
<p>Let us introduce a parametric family of real-valued functions, defined as follows:</p>
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/8375731288?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/8375731288?profile=RESIZE_710x" width="400" class="align-center"/></a></p>
<p>with 0 < <em>σ</em> < 1, <em>t</em> a real number, <em>α</em>, <em>β</em>, <em>γ</em> three real parameters, and <em>λ</em>(⋅) a real-valued function with logarithmic growth. Elementary computations show that <em>s</em> = <em>σ</em> + <em>it</em> is a complex root (also called <em>zero</em>) of <em>ζ</em>(<em>s</em>), with 0 < <em>σ</em> < 1, if and only if</p>
<ul>
<li><em>ϕ</em>(<em>σ</em>, <em>t</em>; 0, 1, 0) = 0,</li>
<li><em>ϕ</em>(<em>σ</em>, <em>t</em>; 0, 1, −π/2) = 0,</li>
<li><em>λ</em>(<em>n</em>) = log <em>n</em>.</li>
</ul>
<p>For details about this formulation, see <a href="https://mathoverflow.net/questions/379650/more-mysteries-about-the-zeros-of-the-riemann-zeta-function" target="_blank" rel="noopener">here</a>. Moving forward, we will treat RH as the problem of finding the zeros (or lack thereof) of a bivariate function in the standard plane: <em>σ</em> is the first variable, attached to the X-axis, and <em>t</em> is the second variable, attached to the Y-axis. A generalized version of RH seems to also be true: it corresponds to arbitrary values for <em>α</em>, <em>β</em>, <em>γ</em>. However, we focus here on the classical RH. For ease of presentation, we use the following notation:</p>
<ul>
<li><em>ϕ</em><span style="font-size: 8pt;">1</span>(<em>σ</em>, <em>t</em>) = <em>ϕ</em>(<em>σ</em>, <em>t</em>; 0, 1, 0)</li>
<li><em>ϕ</em><span style="font-size: 8pt;">2</span>(<em>σ</em>, <em>t</em>) = <em>ϕ</em>(<em>σ</em>, <em>t</em>; 0, 1,−π/2 )</li>
</ul>
<p>Much of the discussion has to do with the orbit of (<em>ϕ</em><span><span style="font-size: 8pt;">1</span>, <em>ϕ</em><span style="font-size: 8pt;">2</span></span>) when <em>σ</em> is fixed but arbitrary, and only <em>t</em> is allowed to vary. The orbit consists of all the points (<em>X</em>(<em>t</em>), <em>Y</em>(<em>t</em>)) with <em>X</em>(<em>t</em>) = <em>ϕ</em><span style="font-size: 8pt;">1</span>(<em>σ</em>, <em>t</em>) and <em>Y</em>(<em>t</em>) = <em>ϕ</em><span style="font-size: 8pt;">2</span>(<em>σ</em>, <em>t</em>). In short, we are dealing with a bivariate time series in continuous time, with strong cross-correlations between <em>X</em>(<em>t</em>) and <em>Y</em>(<em>t</em>). Without loss of generality, we assume that <em>t</em> is positive. The spectacular plot shown in section 2 is just a scatterplot of the orbit, computed for <em>σ</em> = 0.75<em>.</em> It easily generalizes to other values of <em>σ</em> that are strictly greater than 0.5. </p>
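<p>The orbit described above can be sampled with a few lines of code. This is only a sketch: it assumes the standard alternating series (the Dirichlet eta decomposition) for <em>ϕ</em>1 and <em>ϕ</em>2 with the cosine wave, uses plain truncation with no convergence boosting, and the exact definitions are given in the image above:</p>

```python
import math

def orbit(sigma, t_max=10.0, dt=0.01, n_terms=1000):
    # Orbit (phi1(sigma, t), phi2(sigma, t)) computed by plain truncation
    # of the assumed alternating series; no convergence boosting here.
    pts = []
    for k in range(int(round(t_max / dt))):
        t = k * dt
        x = y = 0.0
        for n in range(1, n_terms + 1):
            w = (-1.0) ** (n + 1) / n ** sigma
            x += w * math.cos(t * math.log(n))
            y += w * math.sin(t * math.log(n))
        pts.append((x, y))
    return pts

# A scatterplot of these points, with t_max pushed much higher, reproduces
# the "eye" around the origin for sigma > 0.5.
pts = orbit(0.75)
```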
<p><span style="font-size: 14pt;"><strong>2. The visualization</strong></span></p>
<p>I call the plot below the <em>Eye of the Zeta Function</em>. It is the scatter plot described in the last paragraph of section 1, and probably the first time that such a plot has been created for the Riemann zeta function. It corresponds to <em>σ </em>= 0.75, with <em>t</em> between 0 and 3,000 in increments of 0.01. Thus 300,000 points of the orbit are displayed here. </p>
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/8375847301?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/8375847301?profile=RESIZE_710x" width="500" class="align-center"/></a></p>
<p>The spectacular feature in that plot is the hole around (0, 0). It has deep implications. It suggests that if <em>σ</em> = 0.75, not only can <em>ϕ</em><span style="font-size: 8pt;">1</span>(<em>σ</em>, <em>t</em>) and <em>ϕ</em><span style="font-size: 8pt;">2</span>(<em>σ</em>, <em>t</em>) not be simultaneously equal to zero (this is a particular case of RH, nothing new here), but most importantly, they never jointly get very close to zero. This is new, and suggests that proving RH might be a little less challenging than initially thought. The plot features a similar "eye" for various other values of <em>σ</em>. In particular, the hole gets smaller and smaller as <em>σ</em> gets closer to 0.5. At <em>σ</em> = 0.5, the hole is entirely gone, and infinitely many values of <em>t</em> yield <em>ϕ</em><span style="font-size: 8pt;">1</span>(<em>σ</em>, <em>t</em>) = <em>ϕ</em><span style="font-size: 8pt;">2</span>(<em>σ</em>, <em>t</em>) = 0. The same is true for the generalized version of RH discussed in section 1. </p>
<p>Note that it is very tricky to get the scatterplot right. The series for <em>ϕ</em><span><span style="font-size: 8pt;">1</span> and <em>ϕ</em><span style="font-size: 8pt;">2</span> converge very slowly, and in a chaotic, unpredictable way, </span>see <a href="https://mathoverflow.net/questions/379650/more-mysteries-about-the-zeros-of-the-riemann-zeta-function/380174#380174" target="_blank" rel="noopener">here</a>. This can result in false positives: points very close to zero due to approximation errors, artificially obfuscating the hole. Convergence boosting techniques are required, see <a href="https://www.datasciencecentral.com/profiles/blogs/simple-trick-to-dramatically-improve-speed-of-convergence" target="_blank" rel="noopener">here</a>. In addition, the frequency of oscillations in <em>ϕ</em><span><span style="font-size: 8pt;">1</span> and <em>ϕ</em><span style="font-size: 8pt;">2</span> increases as <em>t</em> gets larger, and thus the <em>t</em> increments should shrink accordingly as <em>t</em> grows, in order to get good coverage of the orbit and not miss potential true zeros.</span></p>
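<p>One simple convergence boosting scheme for alternating series is to repeatedly average consecutive partial sums (an Euler-transform / van Wijngaarden style device). The sketch below demonstrates it on a series with a known limit; the same routine can be applied to the partial sums of <em>ϕ</em>1 and <em>ϕ</em>2, though it is not necessarily the exact technique used in the article linked above:</p>

```python
import math

def accel_alt_sum(terms, k=10):
    # Accelerate an alternating series by averaging consecutive partial
    # sums k times (Euler-transform / van Wijngaarden style scheme).
    partial, s = [], 0.0
    for a in terms:
        s += a
        partial.append(s)
    for _ in range(k):
        partial = [(partial[i] + partial[i + 1]) / 2
                   for i in range(len(partial) - 1)]
    return partial[-1]

# Sanity check on a series with a known limit: log 2 = 1 - 1/2 + 1/3 - ...
terms = [(-1.0) ** (n + 1) / n for n in range(1, 41)]
print(accel_alt_sum(terms))  # ~0.6931 (log 2), far beyond raw truncation accuracy
```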
<p>More plots can be found <a href="https://mathoverflow.net/questions/379650/more-mysteries-about-the-zeros-of-the-riemann-zeta-function" target="_blank" rel="noopener">here</a>. One (not yet published) is even more spectacular, though esthetically speaking, it looks just like a boring ring. I computed the approximation error (<em>E</em><span style="font-size: 8pt;">1</span>(<span style="font-size: 12pt;"><em>t</em></span>), <em>E</em><span style="font-size: 8pt;">2</span>(<span style="font-size: 12pt;"><em>t</em></span>)) when you use only the first 200 terms in the series defining <em>ϕ</em><span><span style="font-size: 8pt;">1</span> and <em>ϕ</em><span style="font-size: 8pt;">2</span>. If <span style="font-size: 12pt;"><em>t</em></span> < 300, these points are located on a very thin ring very close to 0. Their distribution thus has a strong pattern, possibly making it even less challenging to prove that if <em>σ</em> = 0.75, then the Riemann zeta function has no zero with <em>t</em> in [0, 300]. The pattern quickly disappears for larger <em>t</em>, but you can still retrieve it by increasing the number of terms used in the approximation, allowing you to identify an even bigger zero-free zone in the critical strip. Even narrowed down to these zones, though, proving they are zero-free would still be a big challenge. </span></p>
<p></p>
<p><em><strong>About the author</strong>: Vincent Granville is a d<span class="lt-line-clamp__raw-line">ata science pioneer, mathematician, book author (Wiley), patent owner, former post-doc at Cambridge University, former VC-funded executive, with 20+ years of corporate experience including CNET, NBC, Visa, Wells Fargo, Microsoft, eBay. Vincent also founded and co-founded a few start-ups, including one with a successful exit (Data Science Central acquired by Tech Target).</span> You can access Vincent's articles and books,<span> </span><a href="https://www.datasciencecentral.com/profiles/blogs/my-data-science-machine-learning-and-related-articles" target="_blank" rel="noopener">here</a>.</em></p>
<p></p>
Opening a New Restaurant in Covid Times
tag:www.datasciencecentral.com,2020-12-23:6448529:BlogPost:1005865
2020-12-23T06:44:07.000Z
Vincent Granville
https://www.datasciencecentral.com/profile/VincentGranville
<p>I am a data scientist, and decided to open a restaurant last November, 10 days before the governor in my state banned dining-in (who knows for how long) and customers were already rare. Some data scientists in managerial positions dream about exiting the corporate world and envied me, at least before the Covid, when I told them my plan.</p>
<p>Here I explore the options and opportunities available, and this article reflects my optimism. I will also discuss analytics in some detail. The reasons for opening a restaurant are varied; in my case, I saw the opportunity in a wealthy town with many foodies, mostly retired from companies such as Amazon, Boeing or Microsoft, who left the Seattle area to live on a little island where the pace of living is much slower, roads are not clogged with commuters, and the landscape is beautiful: Anacortes on Fidalgo Island, next to the San Juan Islands, in the Pacific Northwest. Despite being next to the ocean, not a single restaurant offers fresh oysters or crab, and there is no truly great restaurant. If nothing else (after selling my company), I thought I would open a restaurant so that there would be a dining venue I really love in Anacortes. I knew from the very beginning that we would fill a void, and that there was no competition.</p>
<p></p>
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/8322989256?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/8322989256?profile=RESIZE_710x" width="400" class="align-center"/></a></p>
<p style="text-align: center;"><em>Our outdoor seating</em></p>
<p>After chasing locations throughout the Puget Sound without any luck, I found by chance the perfect spot in the very heart of historical downtown Anacortes. The landlord did not want a franchise or a chain, and even turned down a bank. Rent here is 3 times cheaper than in Seattle, and hourly rates for restaurant workers are much lower too, though it is impossible to find qualified people to serve fine cuisine (you must train them). We were lucky to find a great chef who worked in great restaurants in Seattle and left the city years ago for the same reasons that I did. We are also very close to farmers, and all our food comes from local farmers. Not exactly cheap, but people are willing to pay a bit more for fresh local ingredients - this is a long-lasting trend in this industry.</p>
<p>We agreed on a few statistics: food cost should be 1/3 of revenue, staff another 1/3, and 15-20% of the revenue should go towards rent, utilities, insurance, etc. Now with Covid, we are operating at a controlled loss, probably for the next three months, but we are on the path to success. Rather than closing for three months like plenty of restaurants do, we thought we should take advantage of this time to develop our brand and become known -- and stay open despite the extra cost. We also decided to stop expensive construction on the second floor, and instead focus on heated outdoor dining and cheaper solutions that have a direct positive impact. At the end of the construction stage, we even looked at purchasing used appliances rather than brand new ones.</p>
<p>Despite having no experience in the restaurant industry, I am a foodie with tremendous experience as a customer. In particular, I decided what the prices should be, given the town we are in and the kind of food we serve. The chef focused on dishes where he could meet the goal of 1/3 of revenue spent on food (that is, a dish sold for $18 costs $6 in ingredients on average), with waste optimization also being a goal (for instance, unsold fresh oysters are served as baked oysters the next day). I even purchased some ingredients myself, such as excellent Icelandic caviar 10 times cheaper than Beluga. People coming from the big city 90 miles south consider our restaurant inexpensive, and capable of successfully competing with hip restaurants in Seattle if we were located in that town.</p>
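<p>The cost split described above can be sanity-checked in a few lines (illustrative figures only, using the $18 dish as an example):</p>

```python
# Back-of-the-envelope check of the cost split above (illustrative figures).
revenue = 18.00             # price of one dish
food = revenue / 3          # 1/3 of revenue on ingredients
staff = revenue / 3         # 1/3 on staff
overhead = revenue * 0.175  # midpoint of the 15-20% rent/utilities/insurance range
margin = revenue - food - staff - overhead
print(food)                        # 6.0 dollars of ingredients per $18 dish
print(round(margin / revenue, 3))  # 0.158, i.e. a ~16% operating margin
```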
<p><strong>Original ideas to succeed</strong></p>
<p>Here are some concepts that we embraced:</p>
<ul>
<li>Having a little retail store within the restaurant, selling home-made preparations made by the chef, and wines</li>
<li>Opening a wine club with paid membership</li>
<li>Using the second floor for storage, for the retail store, rather than for dine-in</li>
<li>Opening the patio in the back, the heated tent on the front street, and some other space outside to maximize occupancy</li>
<li>Discontinuing breakfast except weekends, due to negative ROI</li>
<li>Creating our own home delivery service, to be more affordable than Doordash</li>
<li>Organizing our menu items in such a way as to optimize revenue (by displaying best sellers at the top, revenue increased 5 times on Doordash)</li>
<li>Being the only European restaurant in the county</li>
<li>Using pictures of our dishes when posting on social networks, as well as on our website</li>
<li>Offering family meals to go, serving 2 or 4 people</li>
<li>Partnering with grocery stores to sell our products</li>
<li>Having weekly specials that we can announce in social networks and via our fast-growing mailing list, to keep customers returning</li>
<li>Serving right-sized portions (smaller than the average restaurant's), along with small dishes, on plates that are not as large as in many restaurants (this reduces waste and lets us lower our prices accordingly)</li>
</ul>
<p><strong>Marketing and advertising</strong></p>
<p>We are present and very active on all local Facebook groups, including <a href="https://www.facebook.com/parisrestaurantandbar/" target="_blank" rel="noopener">our Facebook page</a> and the <a href="https://www.facebook.com/groups/424272282275831/" target="_blank" rel="noopener">Skagit Restaurant page</a> that we created for all restaurants in our county. Since our menu has new additions every week, we can post original content all the time. Many people in town use Facebook, thus this is our favorite platform. We also advertise with them. </p>
<p>We created a newsletter, which grew to 500 subscribers in a month. Much of our advertising on Google is geared towards growing the newsletter. We are working on a blog (the first article will be <em>10 tips to help your favorite restaurant</em>, applicable to any restaurant; we hope it will go viral) and in the long term, we plan on selling recipes from our chef on the website. Finally, as we grow, we plan on using the outdoor tent of our restaurant neighbors when they are closed. We may even serve tequila from our neighbor (a Mexican restaurant), with revenue on hard liquor going directly to them, if we use their tent. </p>
<p>Advertising on Yelp was a failure, so we stopped it. Yelp clearly does not help its advertisers regarding reviews (a good thing), but it eliminates reviews randomly, good or bad, with its supposedly smart machine learning algorithm. Maybe to force us to advertise more? Phone calls coming from Yelp advertising rarely came from a local number (unlike calls originating from Google ads), and lasted 2 seconds. No different from click fraud. We are happy that Yelp represents less than 2% of our traffic, as we tried very hard to build our audience organically and via word of mouth, thanks to the excellent and original food that we serve. </p>
<p>We also invited our partners (local farmers, accountant, etc.) for a free dinner during the short window of time when dining-in was allowed. The meal was free, but not the wine. We also plan on having our brochure distributed in all the local hotels, and maybe advertising our restaurant on the receipts people get when they shop at a grocery store. </p>
<p><strong>The results</strong></p>
<p>The last few days have seen revenue growing fast, to the point that we will probably operate at a loss for much less than 3 months, beating expectations. Even before Thanksgiving, when dining-in was still allowed, it was clear that we would be successful: we were almost profitable while operating at 25% capacity.</p>
<p>You can find us at <a href="https://www.parisrestaurantandbar.com/" target="_blank" rel="noopener">ParisRestaurantAndBar.com</a>. </p>
<p></p>
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/8322990662?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/8322990662?profile=RESIZE_710x" width="400" class="align-center"/></a></p>
<p></p>
<p><em><strong>About the author</strong>: Vincent Granville is a d<span class="lt-line-clamp__raw-line">ata science pioneer, mathematician, book author (Wiley), patent owner, former post-doc at Cambridge University, former VC-funded executive, with 20+ years of corporate experience including CNET, NBC, Visa, Wells Fargo, Microsoft, eBay. Vincent also founded and co-founded a few start-ups, including one with a successful exit (Data Science Central acquired by Tech Target).</span> You can access Vincent's articles and books,<span> </span><a href="https://www.datasciencecentral.com/profiles/blogs/my-data-science-machine-learning-and-related-articles" target="_blank" rel="noopener">here</a>.</em></p>
<p></p>
Amazing Things You Did Not Know You Could Do in Excel
tag:www.datasciencecentral.com,2020-12-17:6448529:BlogPost:1005404
2020-12-17T05:00:00.000Z
Vincent Granville
https://www.datasciencecentral.com/profile/VincentGranville
<p>I have included a lot of Excel spreadsheets in the numerous articles and books that I have written in the last 10 years, based either on real-life problems or on simulations to test algorithms, and featuring various machine learning techniques. It is time to create a new blog series focusing on these useful techniques that can easily be handled with Excel. Data scientists typically use programming languages and other visual tools for these techniques, mostly because they are unaware that they can be accomplished with Excel alone. This article is the first one in this new series. The series will appeal to BI analysts, managers presenting insights to decision makers, as well as software engineers or MBA people who do not have a strong data science background. It can also be used as a starting point to learn data science and machine learning: first solve problems in Excel, then, upon discovering Excel's limitations, move to programming languages or AI-based automated coding. </p>
<p>Many of the techniques presented in my spreadsheets are data-driven (as opposed to model-driven), robust, simple yet efficient, sometimes entirely novel, and do not lead to problems such as over-fitting or numerical instability. Even in the absence of statistical models, confidence intervals can still be built - even in Excel - and are more intuitive and easier to understand than traditional ones. See my previous article <a href="https://www.datasciencecentral.com/profiles/blogs/introducing-an-all-purpose-robust-fast-simple-non-linear-r22" target="_blank" rel="noopener">here</a> on general regression, as an example. That article also features traditional regression performed with the little-known Excel built-in function LINEST; with a simple transformation, it could be used for logistic regression. Also, my spreadsheets are just basic Excel, without special Excel libraries or add-ins, and are thus accessible to everyone. </p>
<p>In this first blog, I show you how to simulate clustered data and display it with multi-group scatterplots, things that I used to do with R in the past.</p>
<p><strong>Excel scatterplots in clustering contexts</strong></p>
<p>The pictures below represent a simulation of clustered data: 177 two-dimensional data points spread across three clusters.</p>
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/8296711259?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/8296711259?profile=RESIZE_710x" width="400" class="align-center"/></a></p>
<p style="text-align: center;"><strong>Figure 1:</strong> <em>Well separated clusters</em></p>
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/8296711493?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/8296711493?profile=RESIZE_710x" width="400" class="align-center"/></a></p>
<p style="text-align: center;"><strong>Figure 2:</strong> <em>Overlapping clusters</em></p>
<p>The spreadsheet used to produce these charts is interactive, and you can play with it to generate more clusters, fine-tune the level of overlap, and test various clustering algorithms on the simulated data that you create, using cross-validation techniques, to see how they perform. The points, within each of the three groups, are radially distributed around a center. That is, a random point (<em>X</em>, <em>Y</em>) in group #1, assuming the center of that group (itself randomly distributed) is (<em>X</em><span style="font-size: 8pt;">1</span>, <em>Y</em><span style="font-size: 8pt;">1</span>), is generated as follows:</p>
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/8296725669?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/8296725669?profile=RESIZE_710x" width="200" class="align-center"/></a></p>
<p>Here, fresh random deviates <span><em>ρ</em>, <em>θ</em>, uniformly distributed on [0, 1] and generated with Excel's RAND function, are used for each point (<em>X</em>, <em>Y</em>), and the constant <em>α</em><span style="font-size: 8pt;">1</span> is fixed for all points in group #1. In the spreadsheet, the three centers are uniformly distributed on [0, 1] x [0, 1], and <em>α</em><span style="font-size: 8pt;">1</span>, <em>α</em><span style="font-size: 8pt;">2</span>, <em>α</em><span style="font-size: 8pt;">3</span> are set to 1/3. </span></p>
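<p>The same simulation is easy to reproduce outside Excel. The sketch below assumes the radial generator <em>X</em> = <em>X</em>1 + <em>α</em>1 <em>ρ</em> cos(2π<em>θ</em>), <em>Y</em> = <em>Y</em>1 + <em>α</em>1 <em>ρ</em> sin(2π<em>θ</em>) suggested by the description (the exact formula is shown in the image above), with 59 points per cluster to get 177 in total:</p>

```python
import math, random

def make_cluster(cx, cy, alpha, n):
    # Radially distributed points around (cx, cy), assuming the generator
    # X = cx + alpha*rho*cos(2*pi*theta), Y = cy + alpha*rho*sin(2*pi*theta)
    # with rho, theta uniform on [0, 1] (hypothetical reading of the formula).
    pts = []
    for _ in range(n):
        rho, theta = random.random(), random.random()
        pts.append((cx + alpha * rho * math.cos(2 * math.pi * theta),
                    cy + alpha * rho * math.sin(2 * math.pi * theta)))
    return pts

random.seed(0)  # reproducible simulation
centers = [(random.random(), random.random()) for _ in range(3)]
clusters = [make_cluster(cx, cy, 1 / 3, 59) for cx, cy in centers]  # 3 x 59 = 177
```

<p>Shrinking or growing the <em>α</em> constants controls how much the clusters overlap, mirroring the interactivity of the spreadsheet.</p>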
<p><span>The scatterplots are produced using the scatter graph in Excel, applied to data separated in three groups as illustrated in the screenshot below. For group #1, point coordinates (<em>X</em>, <em>Y</em>) are stored in the first and second column respectively. For group #2, it's in the first and third column, and for group #3, it is in the first and fourth column as illustrated below.</span></p>
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/8296773488?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/8296773488?profile=RESIZE_710x" width="300" class="align-center"/></a></p>
<p style="text-align: center;"><strong>Figure 3:</strong> <em>Organizing the data in Excel to produce the scatterplots</em></p>
<p>The spreadsheet is available for download, <a href="https://storage.ning.com/topology/rest/1.0/file/get/8296781485?profile=original" target="_blank" rel="noopener">here</a> (<strong>scatter-cluster.xlsx</strong>). See also one of my previous spreadsheets to automatically detect the number of clusters, from one of my past articles, <a href="https://www.datasciencecentral.com/profiles/blogs/how-to-automatically-determine-the-number-of-clusters-in-your-dat" target="_blank" rel="noopener">here</a> (<strong>elbow.xlsx</strong>, in the section <em>Elbow Strength with spreadsheet illustration</em>). Finally, many spreadsheets are available for download, from my most recent book <em>Statistics: new foundations, toolkit, and machine learning recipes</em>, <a href="https://www.datasciencecentral.com/profiles/blogs/free-book-statistics-new-foundations-toolbox-and-machine-learning" target="_blank" rel="noopener">here</a>. Some of them even perform NLP algorithms.</p>
<p></p>
<p><em><strong>About the author</strong>: Vincent Granville is a d<span class="lt-line-clamp__raw-line">ata science pioneer, mathematician, book author (Wiley), patent owner, former post-doc at Cambridge University, former VC-funded executive, with 20+ years of corporate experience including CNET, NBC, Visa, Wells Fargo, Microsoft, eBay. Vincent also founded and co-founded a few start-ups, including one with a successful exit (Data Science Central acquired by Tech Target).</span> You can access Vincent's articles and books,<span> </span><a href="https://www.datasciencecentral.com/profiles/blogs/my-data-science-machine-learning-and-related-articles" target="_blank" rel="noopener">here</a>.</em></p>
<p></p>
All-purpose, Robust, Fast, Simple Non-linear Regression
tag:www.datasciencecentral.com,2020-12-16:6448529:BlogPost:1005166
2020-12-16T18:22:17.000Z
Vincent Granville
https://www.datasciencecentral.com/profile/VincentGranville
<p><strong>Announcements</strong></p>
<ul>
<li>Watch APEXX W3: <strong>The Data Science Workstation</strong>, and learn how an NVIDIA-certified BOXX workstation can accelerate your workflow. <a href="https://bit.ly/33Suwni" target="_blank" rel="noopener">Access video here</a>. </li>
<li>Use real-time anomaly detection reference patterns to combat fraud | Google. <a href="http://dsc.news/3gShgUZ" target="_blank" rel="noopener">Read full article</a>.</li>
<li>Merrimack College offers three online master's degrees in data science, business analytics, or healthcare analytics – all designed to accommodate working professionals and developed and taught by industry experts. Gain a deeper understanding of data visualization, statistical analysis, machine learning, and business strategy to deliver data-driven insights that impact real-world decisions. <a href="http://dsc.news/34kc07x" target="_blank" rel="noopener">Learn more here</a>. </li>
</ul>
<p><strong>All-purpose, Robust, Fast, Simple Non-linear Regression</strong></p>
<p><span>The model-free, data-driven technique discussed here is so basic that it can easily be implemented in Excel, and we actually provide an Excel implementation. It is surprising that this technique does not pre-date standard linear regression, and it is rarely if ever used by statisticians and data scientists. It is related to kriging and nearest-neighbor interpolation, and was apparently first mentioned in 1965 by Harvard scientists working on GIS (geographic information systems). It was referred to back then as Shepard's method, or inverse distance weighting, and used for multivariate interpolation on non-regular grids</span><span>. We call this technique </span><em>simple regression</em><span>. Read full article <a href="https://www.datasciencecentral.com/profiles/blogs/introducing-an-all-purpose-robust-fast-simple-non-linear-r22" target="_blank" rel="noopener">here</a>. </span></p>
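<p>For readers who want to experiment, here is a minimal Python sketch of classical inverse distance weighting (Shepard's method), the technique this regression is related to. It is an illustration of the general idea, not the author's exact implementation; the function names and toy data are mine.</p>

```python
import math

def idw_predict(x_new, points, values, power=2.0):
    """Shepard's method (inverse distance weighting): predict the response
    at x_new as a weighted average of the observed values, with weights
    proportional to 1 / distance^power."""
    num, den = 0.0, 0.0
    for x, v in zip(points, values):
        d = math.dist(x_new, x)
        if d == 0.0:          # exact match with a training point
            return v
        w = d ** (-power)
        num += w * v
        den += w
    return num / den

# Toy data on a non-regular grid, with response z = x + y
pts = [(0, 0), (1, 0), (0, 1), (1, 1)]
vals = [0.0, 1.0, 1.0, 2.0]
print(idw_predict((0.5, 0.5), pts, vals))  # equidistant from all 4 points: prints 1.0
```

<p>Because every weight is positive, the prediction always stays within the range of the observed values, which is one reason the method is so robust.</p>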
<p></p>
<p><span><a href="https://storage.ning.com/topology/rest/1.0/file/get/8295194880?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/8295194880?profile=RESIZE_710x" width="400" class="align-center"/></a></span></p>
New Tests of Randomness and Independence for Sequences of Observations
tag:www.datasciencecentral.com,2020-12-03:6448529:BlogPost:1004429
2020-12-03T01:30:00.000Z
Vincent Granville
https://www.datasciencecentral.com/profile/VincentGranville
<p>There is no single statistical test that assesses whether a sequence of observations, such as a time series or the residuals of a regression model, exhibits independence. Typically, data scientists look at auto-correlations and check whether they are close enough to zero. If the data follows a Gaussian distribution, absence of auto-correlation implies independence. Here, however, we are dealing with non-Gaussian observations. The setting is similar to testing whether a pseudo-random number generator is random enough, or whether the digits of a number such as <span>π </span>behave in a way that looks random, even though the sequence of digits is deterministic. Batteries of statistical tests are available to address this problem, but there is no one-size-fits-all solution.</p>
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/8242402469?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/8242402469?profile=RESIZE_710x" width="500" class="align-center"/></a></p>
<p>Here we propose a new approach. Like those batteries of tests, it is not a panacea, but rather a set of additional powerful tools to help test for independence and randomness. The data sets under consideration are specific mathematical sequences, some of which are known to exhibit independence and randomness, and some not. They thus constitute a good setting to benchmark and compare various statistical tests and see how well they perform. This kind of data is also more natural, and looks more real, than synthetic data obtained via simulations. </p>
<p><span style="font-size: 14pt;"><strong>1. Definition of random-like sequences</strong></span></p>
<p>Since we are dealing with deterministic sequences (<em>x<span style="font-size: 8pt;">n</span></em>) indexed by <em>n</em> = 1, 2, and so on, it is worth defining what we mean by <em>independence</em> and <em>random-like</em>. These two elementary concepts are very intuitive, but a formal definition may help; you may skip this section if an intuitive understanding is enough for you. Independence in this context is sometimes called <em>asymptotic independence</em>, see <a href="https://mathoverflow.net/questions/372103/recursive-random-number-generator-based-on-irrational-numbers/" target="_blank" rel="noopener">here</a>. Also, for all the sequences investigated here, <em>x<span style="font-size: 8pt;">n</span></em> ∈ [0,1].</p>
<p><strong>1.1. Definition of random-like and independence</strong></p>
<p>A sequence (<em>x<span style="font-size: 8pt;">n</span></em>) with <em>x<span style="font-size: 8pt;">n</span></em> ∈ [0,1] is <em>random-like</em> if it satisfies the following property. For any finite index family <em>h</em><span style="font-size: 8pt;">1</span>,…, <em>h<span style="font-size: 8pt;">k</span></em> and for any <span style="font-size: 12pt;"><em>t<span style="font-size: 8pt;">1</span></em></span>,…, <em>t<span style="font-size: 8pt;">k</span></em> ∈ [0,1], we have </p>
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/8238499286?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/8238499286?profile=RESIZE_710x" width="400" class="align-center"/></a></p>
<p>The probabilities are empirical probabilities, that is, based on frequency counts. For instance,</p>
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/8238501465?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/8238501465?profile=RESIZE_710x" width="450" class="align-center"/></a></p>
<p>where χ(<em>A</em>) is the indicator function (equal to 1 if the event <em>A</em> is true, and equal to 0 otherwise). Random-like implies independence, but the converse is not true. A sequence is <em>independently distributed</em> if it satisfies the weaker property </p>
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/8238506260?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/8238506260?profile=RESIZE_710x" width="400" class="align-center"/></a></p>
<p>Random-like means that the <em>x<span style="font-size: 8pt;">n</span></em>'s all have the same underlying uniform distribution on [0, 1], and are independently distributed. </p>
<p><strong>1.2. Definition of lag-<em>k</em> autocorrelation</strong></p>
<p>Again, this is just the standard definition of auto-correlations, but applied to infinite deterministic sequences. The lag-<em>k</em> auto-correlation ρ<span style="font-size: 8pt;"><em>k</em></span> is defined as follows. First define ρ<span style="font-size: 8pt;"><em>k</em></span>(<em>n</em>) as the empirical correlation between (<em>x</em><span style="font-size: 8pt;">1</span>,…, <em>x<span style="font-size: 8pt;">n</span></em>) and (<em>x<span style="font-size: 8pt;">k</span></em><span style="font-size: 8pt;">+1</span>,… ,<em>x<span style="font-size: 8pt;">k</span></em><span style="font-size: 8pt;">+<em>n</em></span>). Then ρ<span style="font-size: 8pt;"><em>k</em></span> is the limit (if it exists) of ρ<span style="font-size: 8pt;"><em>k</em></span>(<span style="font-size: 12pt;"><em>n</em></span>) as <em>n</em> tends to infinity. </p>
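<p>This definition translates directly into code. Below is a short Python sketch (names are mine) that estimates ρ<em>k</em>(<em>n</em>) for a finite chunk of the sequence, illustrated on the heavily auto-correlated sequence <em>x<span style="font-size: 8pt;">n</span></em> = { <em>αn</em> } discussed below.</p>

```python
import math

def lag_autocorr(x, k):
    """Empirical lag-k auto-correlation rho_k(n): the correlation between
    (x_1,...,x_n) and (x_{k+1},...,x_{k+n}), with n = len(x) - k."""
    n = len(x) - k
    a, b = x[:n], x[k:k + n]
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b)) / n
    var_a = sum((u - ma) ** 2 for u in a) / n
    var_b = sum((v - mb) ** 2 for v in b) / n
    return cov / math.sqrt(var_a * var_b)

# The equidistributed sequence x_n = { alpha * n } is far from independent:
# its lag-1 auto-correlation is strongly negative for alpha = log 2
x = [math.fmod(math.log(2) * n, 1.0) for n in range(1, 5001)]
print(lag_autocorr(x, 1))
```

<p>ρ<em>k</em> is then approximated by increasing the number of terms until the estimate stabilizes.</p>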
<p><strong>1.3. Equidistribution and fractional part denoted as { }</strong></p>
<p>The fractional part of a positive real number <em>x</em> is denoted as { <em>x</em> }. For instance, { 3.141592 } = 0.141592. The sequences investigated here come from number theory. In that context, concepts such as random-like and identically distributed are rarely used. Instead, mathematicians rely on the weaker concept of <em>equidistribution</em>, also called equidistribution modulo 1. Closer to independence is the concept of equidistribution in higher dimensions, for instance if two successive values (<em>x<span style="font-size: 8pt;">n</span></em>, <em>x<span style="font-size: 8pt;">n</span></em><span style="font-size: 8pt;">+1</span>) are equidistributed on [0, 1] x [0, 1].</p>
<p>A sequence can be equidistributed yet exhibit strong auto-correlations. The most famous example is the sequence <em>x<span style="font-size: 8pt;">n</span></em> = { <em>αn</em> } where <em>α</em> is a positive irrational number. While equidistributed, it has strong lag-<em>k</em> auto-correlations for every strictly positive integer <em>k</em>, and it is anything but random-like. A sequence that looks perfectly random-like is the digits of <span>π</span>: they cannot be distinguished from a realization of a perfect <a href="https://en.wikipedia.org/wiki/Bernoulli_process" target="_blank" rel="noopener">Bernoulli process</a>. Such random-like sequences are very useful in cryptographic applications.</p>
<p><span style="font-size: 14pt;"><strong>2. Testing well-known sequences</strong></span> </p>
<p>The sequences we are interested in are <em>x<span style="font-size: 8pt;">n</span></em> = { <em>α n</em>^<em>p</em> } where { } is the fractional part function (see section 1.3), <em>p</em> > 1 is a real number and <em>α</em> is a positive irrational number. Other sequences are discussed in section 3. It is well known that these sequences are equidistributed. Also, if <em>p</em> = 1, these sequences are highly auto-correlated and thus the terms <em>x<span style="font-size: 8pt;">n</span></em> are not independently distributed, much less random-like; the exact theoretical lag-<em>k</em> auto-correlations are known. The question here is what happens if <em>p</em> > 1. It seems that in that case, there is much more randomness. In this section, we explore three statistical tests (including a new one) to assess how random these sequences can be depending on the parameters <em>p</em> and <em>α</em>. The theoretical answer is known, so this provides a good case study to check how well various statistical tests detect randomness, or the lack of it.</p>
<p><strong>2.1. The gap test</strong></p>
<p>The gap test (sometimes called the run test) proceeds as follows. Define the binary digit <em>d<span style="font-size: 8pt;">n</span></em> as <em>d<span style="font-size: 8pt;">n</span></em> = ⌊2<em>x<span style="font-size: 8pt;">n</span></em>⌋, where the brackets represent the integer part function. Say <em>d<span style="font-size: 8pt;">n</span></em> = 0 and <em>d<span style="font-size: 8pt;">n</span></em><span style="font-size: 8pt;">+1 </span>= 1 for a specific <em>n</em>. If <em>d<span style="font-size: 8pt;">n</span></em> is followed by <em>G</em> successive digits <em>d<span style="font-size: 8pt;">n</span></em><span style="font-size: 8pt;">+1</span>,…, <em>d<span style="font-size: 8pt;">n</span></em><span style="font-size: 8pt;">+<em>G</em></span> all equal to 1, and then <em>d<span style="font-size: 8pt;">n</span></em><span style="font-size: 8pt;">+<em>G</em>+1</span> = 0, we have one instance of a gap of length <em>G</em>. Compute the empirical distribution of these gaps. Assuming 50% of the digits are 0 (this is the case in all our examples), the empirical gap distribution converges to a geometric distribution of parameter 1/2 if the sequence <em>x<span style="font-size: 8pt;">n</span></em> is random-like.</p>
<p>This is best illustrated in chapter 4 of my book <em>Applied Stochastic Processes, Chaos Modeling, and Probabilistic Properties of Numeration Systems, </em>available <a href="https://www.datasciencecentral.com/profiles/blogs/fee-book-applied-stochastic-processes" target="_blank" rel="noopener">here</a>. </p>
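<p>The gap test can be sketched in a few lines of Python (an illustrative implementation, not the book's code): collect, for each 0 in the digit stream, the number of consecutive 1s that follow it, and compare the frequencies to the geometric distribution. Here gaps of length 0 (two consecutive 0s) are included, so for a random-like sequence the expected frequency of a gap of length <em>g</em> is (1/2)^(<em>g</em>+1).</p>

```python
import random

def gap_distribution(digits):
    """Empirical distribution of gap lengths: for each 0 in the binary
    stream, G is the number of consecutive 1s before the next 0.
    Gaps of length 0 (two consecutive 0s) are included."""
    counts = {}
    g = None
    for d in digits:
        if d == 0:
            if g is not None:           # close the previous gap
                counts[g] = counts.get(g, 0) + 1
            g = 0                       # open a new gap
        elif g is not None:
            g += 1                      # extend the current gap
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}

# For a random-like sequence, with d_n = floor(2 * x_n), the gap
# frequencies should approach (1/2)^(g+1): about 0.5, 0.25, 0.125, ...
rng = random.Random(1)
bits = [rng.randint(0, 1) for _ in range(100000)]
dist = gap_distribution(bits)
print(round(dist[0], 3), round(dist[1], 3), round(dist[2], 3))
```

<p>Replacing the simulated bits with the digits ⌊2<em>x<span style="font-size: 8pt;">n</span></em>⌋ of a candidate sequence, and comparing the two distributions (for instance with a chi-squared statistic), gives the test.</p>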
<p><strong>2.2. The collinearity test</strong></p>
<p>Many sequences pass several tests yet fail the collinearity test. This test checks whether there are <em>k</em> constants <em>a</em><span style="font-size: 8pt;">1</span>, ..., <em>a<span style="font-size: 8pt;">k</span></em>, with <em>a<span style="font-size: 8pt;">k</span></em> not equal to zero, such that the linear combination <em>x<span style="font-size: 8pt;">n</span></em><span style="font-size: 8pt;">+<em>k</em></span> - <em>a</em><span style="font-size: 8pt;">1</span> <em>x<span style="font-size: 8pt;">n</span></em><span style="font-size: 8pt;">+<em>k</em>-1</span> - <em>a</em><span style="font-size: 8pt;">2</span> <em>x<span style="font-size: 8pt;">n</span></em><span style="font-size: 8pt;">+<em>k</em>-2</span> - ... - <em>a<span style="font-size: 8pt;">k</span></em> <em>x<span style="font-size: 8pt;">n</span></em> takes on only a finite (usually small) number of values. In short, it addresses this question: do <em>k</em> successive values of the sequence <em>x<span style="font-size: 8pt;">n</span></em> always lie (exactly, approximately, or asymptotically) in a finite number of hyperplanes of dimension <em>k</em> - 1? This test has been used to determine that some congruential pseudo-random number generators were of very poor quality, see <a href="https://en.wikipedia.org/wiki/RANDU" target="_blank" rel="noopener">here</a>. It is illustrated in section 3, with <em>k</em> = 2. </p>
<p>Source code and examples for <em>k</em> = 3 can be found <a href="https://mathoverflow.net/questions/372103/recursive-random-number-generator-based-on-irrational-numbers/" target="_blank" rel="noopener">here</a>. </p>
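<p>As a concrete illustration for <em>k</em> = 2: for <em>x<span style="font-size: 8pt;">n</span></em> = { <em>αn</em> } (the case <em>p</em> = 1), the difference <em>x<span style="font-size: 8pt;">n</span></em><span style="font-size: 8pt;">+1</span> - <em>x<span style="font-size: 8pt;">n</span></em> takes on only two values, { <em>α</em> } and { <em>α</em> } - 1, so successive pairs lie on exactly two parallel lines. The Python sketch below (names are mine) counts the distinct differences; a tiny count flags collinearity.</p>

```python
import math

def distinct_diffs(x, decimals=6):
    """Collinearity check for k = 2 with a_1 = 1: count the distinct
    values of x_{n+1} - x_n (rounded to absorb floating-point noise).
    A small count means successive points lie on a few parallel lines."""
    return len({round(x[n + 1] - x[n], decimals) for n in range(len(x) - 1)})

alpha = math.sqrt(2)
x_p1 = [math.fmod(alpha * n, 1.0) for n in range(1, 2001)]          # p = 1
x_p25 = [math.fmod(alpha * n ** 2.5, 1.0) for n in range(1, 2001)]  # p = 2.5
print(distinct_diffs(x_p1), distinct_diffs(x_p25))  # 2 vs. nearly 2000
```

<p>The general test would scan over candidate constants <em>a</em><span style="font-size: 8pt;">1</span>, ..., <em>a<span style="font-size: 8pt;">k</span></em> rather than fixing <em>a</em><span style="font-size: 8pt;">1</span> = 1.</p>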
<p><strong>2.3. The independence test</strong></p>
<p>This may be a new test: I could not find any reference to it in the literature. It does not test for full independence, but rather for random-like behavior in small dimensions (<em>k</em> = 2, 3, 4). Beyond <em>k</em> = 4 it becomes somewhat impractical, as it requires a number of observations (that is, of computed terms in the sequence) that grows exponentially fast with <em>k</em>. However, it is a very intuitive test. It proceeds as follows, for a fixed <em>k</em>:</p>
<ul>
<li>Let <em>N </em> > 100 be an integer</li>
<li>Let <em>T</em> be a <em>k</em>-uple (<em>t</em><span style="font-size: 8pt;">1</span>,..., <em>t<span style="font-size: 8pt;">k</span></em>) with <i>t<span style="font-size: 8pt;">j</span></i><span style="font-size: 8pt;"> </span>∈ [0,1] for <em>j</em> = 1, ..., <em>k.</em></li>
<li>Compute the following two quantities, with χ being the indicator function as in section 1.2:</li>
</ul>
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/8242040856?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/8242040856?profile=RESIZE_710x" width="400" class="align-center"/></a></p>
<ul>
<li>Repeat this computation for <em>M</em> different <em>k</em>-uples randomly selected in the <em>k</em>-dimensional unit hypercube</li>
</ul>
<p>Now plot the <em>M</em> vectors (<em>P<span style="font-size: 8pt;">T</span>, Q<span style="font-size: 8pt;">T</span></em>), each corresponding to a different <em>k</em>-uple, on a scatterplot. Unless the <em>M</em> points lie very close to the main diagonal, the sequence <em>x<span style="font-size: 8pt;">n</span></em> is not random-like. To see how far away you can be from the main diagonal without violating the random-like assumption, do the same computations for 10 different sequences consisting this time of truly random terms. This will give you a confidence band around the main diagonal, and vectors (<em>P<span style="font-size: 8pt;">T</span>, Q<span style="font-size: 8pt;">T</span></em>) lying outside that band, for the original sequence you are interested in, suggest areas where the randomness assumption is violated. This is illustrated in the picture below, originally posted <a href="https://mathoverflow.net/questions/372103/recursive-random-number-generator-based-on-irrational-numbers/" target="_blank" rel="noopener">here</a>: </p>
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/8242055058?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/8242055058?profile=RESIZE_710x" width="300" class="align-center"/></a></p>
<p style="text-align: center;"><strong>Figure 1</strong></p>
<p>As you can see, there is a strong enough departure from the main diagonal, and the sequence in question (see same reference) is known not to be random-like. The X-axis features <em>P<span style="font-size: 8pt;">T</span></em>, and the Y-axis features <em>Q<span style="font-size: 8pt;">T</span></em>. An example with known random-like behavior, resulting in an almost perfect diagonal, is also featured in the same article. Notice that there are fewer and fewer points as you move towards the upper right corner. The higher <em>k</em>, the more sparse the upper right corner will be. In the above example, <em>k</em> = 3. To address this issue, proceed as follows, stretching the point distribution along the diagonal:</p>
<ul>
<li>Let <em>P*<span style="font-size: 8pt;">T</span></em> = (- 2 log <em>P<span style="font-size: 8pt;">T</span></em>) / <em>k</em> and <em>Q</em>*<span style="font-size: 8pt;"><em>T</em></span> = (- 2 log <em>Q<span style="font-size: 8pt;">T</span></em>) / <em>k</em>. This is a transformation leading to a Gamma(<em>k</em>, 2/<span style="font-size: 10pt;"><em>k</em></span>) distribution. See explanations <a href="https://stats.stackexchange.com/questions/89949/geometric-mean-of-uniform-variables" target="_blank" rel="noopener">here</a>. </li>
<li>Let <em>P</em>**<span style="font-size: 8pt;"><em>T</em></span> = <em>F</em>(<span style="font-size: 12pt;"><em>P</em></span>*<span style="font-size: 8pt;"><em>T</em></span>) and <em>Q</em>**<span style="font-size: 8pt;"><em>T</em></span> = <em>F</em>(<i>Q</i>*<span style="font-size: 8pt;"><em>T</em></span>) where <em>F</em> is the cumulative distribution function of a Gamma(<em>k</em>, 2/<span style="font-size: 10pt;"><em>k</em></span>) random variable.</li>
</ul>
<p>By virtue of the <a href="https://en.wikipedia.org/wiki/Inverse_transform_sampling" target="_blank" rel="noopener">inverse transform sampling theorem</a>, the points (<em>P</em>**<span style="font-size: 8pt;"><em>T</em></span>, <em>Q</em>**<span style="font-size: 8pt;"><em>T</em></span>) are now uniformly stretched along the main diagonal. </p>
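<p>Putting the steps of this section together, here is a Python sketch of the whole procedure. It is illustrative only: the naming is mine, the lag choice <em>h<span style="font-size: 8pt;">j</span></em> = <em>j</em> is an assumption, and for integer <em>k</em> the Gamma(<em>k</em>, 2/<em>k</em>) CDF has a closed form, which avoids external libraries.</p>

```python
import math
import random

def gamma_cdf(x, k, scale):
    """CDF of a Gamma(k, scale) variable, for integer shape k:
    F(x) = 1 - exp(-x/scale) * sum_{i<k} (x/scale)^i / i!"""
    z = x / scale
    return 1.0 - math.exp(-z) * sum(z ** i / math.factorial(i) for i in range(k))

def independence_points(x, k, M, seed=0):
    """For M random k-uples T = (t_1,...,t_k), return the stretched pairs
    (P**_T, Q**_T): joint empirical probability vs. product of empirical
    marginals, both pushed through the Gamma(k, 2/k) transform so the
    points spread uniformly along the main diagonal."""
    rng = random.Random(seed)
    n_max = len(x) - k
    pairs = []
    for _ in range(M):
        t = [rng.random() for _ in range(k)]
        joint = sum(all(x[n + j] < t[j] for j in range(k))
                    for n in range(n_max)) / n_max
        prod = 1.0
        for j in range(k):
            prod *= sum(x[n + j] < t[j] for n in range(n_max)) / n_max
        if joint > 0 and prod > 0:
            pairs.append((gamma_cdf(-2 * math.log(joint) / k, k, 2 / k),
                          gamma_cdf(-2 * math.log(prod) / k, k, 2 / k)))
    return pairs

# Benchmark on truly random terms: points should hug the main diagonal
rng = random.Random(7)
x = [rng.random() for _ in range(2000)]
pts = independence_points(x, k=2, M=200)
print(max(abs(p - q) for p, q in pts))  # small for a random-like sequence
```

<p>Running the same function on a candidate deterministic sequence, and comparing the maximum deviation to the benchmark band, gives an empirical verdict.</p>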
<p><span style="font-size: 14pt;"><strong>3. Results and generalization</strong></span></p>
<p>Let's get back to our sequence <em>x<span style="font-size: 8pt;">n</span></em> = { <em>α n</em>^<em>p</em> } with <em>p</em> > 1 and <em>α</em> irrational. Before showing and discussing some charts, I want to cover a few issues. First, if <em>p</em> is large, machine accuracy will quickly result in erroneous computations for <em>x<span style="font-size: 8pt;">n</span></em>. You need to detect when loss of accuracy becomes a critical problem, usually well below <em>n</em> = 1,000 if <em>p</em> = 5; working with double precision arithmetic helps. Another issue, if <em>p</em> is close to 1, is that randomness does not kick in until <em>n</em> is large enough; you may have to ignore the first few hundred terms of the sequence in that case. If <em>p</em> = 1, randomness never occurs. Also, we have assumed that the marginal distributions are uniform on [0, 1]. From the theoretical point of view they indeed are, and it will show if you compute the empirical percentile distribution of <em>x<span style="font-size: 8pt;">n</span></em>, even in the presence of strong auto-correlations; this follows from the ergodic nature of the sequences in question, a topic beyond the scope of this article. So it would be a good exercise to use various statistical tools or libraries to assess whether they can confirm the uniform distribution assumption.</p>
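<p>A minimal Python sketch for generating <em>x<span style="font-size: 8pt;">n</span></em> = { <em>α n</em>^<em>p</em> } (illustrative; names are mine). The precision caveat above is visible in the numbers: in double precision, <em>α n</em>^<em>p</em> consumes most of the 53-bit mantissa once it gets large, leaving few bits for the fractional part.</p>

```python
import math

def sequence(alpha, p, n_terms, skip=0):
    """Generate x_n = { alpha * n^p } for n = skip + 1, ..., skip + n_terms.
    Use skip > 0 to discard early terms when p is close to 1, where
    randomness has not yet kicked in."""
    return [math.fmod(alpha * n ** p, 1.0)
            for n in range(skip + 1, skip + n_terms + 1)]

x = sequence(math.sqrt(2), 5, 500)
# At n = 500 and p = 5, alpha * n^p is about 4.4e13 (roughly 2^45), so
# only about 8 bits of the mantissa are left for the fractional part:
# the computed x_n is already unreliable.
```

<p>Arbitrary-precision arithmetic (for instance Python's <code>decimal</code> module) is the way out when <em>p</em> or <em>n</em> is large.</p>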
<p><strong>3.1. Examples</strong></p>
<p>The exact theoretical value of the lag-<em>k</em> auto-correlation is known for all <em>k</em> if <em>p</em> = 1. See section 5.4 in <a href="https://www.datasciencecentral.com/profiles/blogs/fascinating-new-results-in-the-theory-of-randomness" target="_blank" rel="noopener">this article</a>. It is almost never equal to zero, but it turns out that if <em>k</em> = 1, <em>p</em> = 1 and <em>α</em> = (3 + SQRT(3))/6, it is indeed equal to zero. Use a statistical package to see if it can detect this fact, or ask your team to run the test. Also, if <em>p</em> is an integer, show (using statistical techniques) that for some <em>a</em><span style="font-size: 8pt;">1</span>, ..., <em>a</em><span style="font-size: 8pt;">k</span>, the linear combination <em>x<span style="font-size: 8pt;">n</span></em><span style="font-size: 8pt;">+<em>k</em></span> - <em>a</em><span style="font-size: 8pt;">1</span> <em>x<span style="font-size: 8pt;">n</span></em><span style="font-size: 8pt;">+<em>k</em>-1</span> - <em>a</em><span style="font-size: 8pt;">2</span> <em>x<span style="font-size: 8pt;">n</span></em><span style="font-size: 8pt;">+<em>k</em>-2</span> - ... - <em>a<span style="font-size: 8pt;">k</span></em> <em>x<span style="font-size: 8pt;">n</span></em> takes on only a finite number of values as discussed in section 2.2; thus, the random-like assumption is always violated. In particular, <em>k</em> = 2 if <em>p</em> = 1. This is also true <em>asymptotically</em> if <em>p</em> is not an integer, see <a href="https://mathoverflow.net/questions/377697/sequences-similar-to-n-alpha-that-are-both-equidistributed-and-truly-rando/377748#377748" target="_blank" rel="noopener">here</a> for details. Yet, if <em>p</em> > 1, the auto-correlations are very close to zero, unlike the case <em>p</em> = 1. But are they truly identical to zero? What about the sequence <em>x<span style="font-size: 8pt;">n</span></em> = { <em>α</em>^<em>n</em> } with, say, <em>α</em> = log 3? Is it random-like? Nobody knows. Of course, if <em>α</em> = (1 + SQRT(5))/2, that sequence is anything but random, so it depends on <em>α</em>. </p>
<p>Below are three scatterplots showing the distribution of (<em>x<span style="font-size: 8pt;">n</span></em>, <em>x<span style="font-size: 8pt;">n</span></em><span style="font-size: 8pt;">+1</span>) for a few hundred values of <em>n</em>, for various <em>α</em> and <em>p</em>, for the sequence <em>x<span style="font-size: 8pt;">n</span></em> = { <em>α</em> <em>n</em>^<em>p</em> }. The X-axis represents <em>x<span style="font-size: 8pt;">n</span></em>, the Y-axis represents <em>x<span style="font-size: 8pt;">n</span></em><span style="font-size: 8pt;">+1</span>. </p>
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/8242305270?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/8242305270?profile=RESIZE_710x" width="400" class="align-center"/></a></p>
<p style="text-align: center;"><strong>Figure 2</strong>: <em>p = SQRT(7), α = 1</em></p>
<p>Even to the trained eye, Figure 2 looks random in two dimensions. Independence may fail in higher dimensions (<em>k</em> > 2), as the sequence is known not to be random-like. There is no apparent collinearity pattern as discussed in section 2.2, at least for <em>k</em> = 2. Can you run a test to detect the lack of randomness in higher dimensions?</p>
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/8242307701?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/8242307701?profile=RESIZE_710x" width="400" class="align-center"/></a></p>
<p style="text-align: center;"><strong>Figure 3</strong>: <em>p = 1.4, α = log 2</em></p>
<p>To the trained naked eye, Figure 3 shows a lack of randomness, highlighted by the red band. Can you run a test to confirm this? If the test is inconclusive or provides the wrong answer, then the naked eye performs better, in this case, than statistical software.</p>
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/8242319869?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/8242319869?profile=RESIZE_710x" width="400" class="align-center"/></a></p>
<p style="text-align: center;"><strong>Figure 4</strong>: <em>p = 1.1, α = log 2</em></p>
<p>Here (Figure 4) any statistical software, and any human being, even the layman, can identify the lack of randomness in more than one way. As <em>p</em> gets closer to 1, the lack of randomness becomes obvious, and the collinearity issue discussed in section 2.2, even if fuzzy, becomes apparent even in two dimensions.</p>
<p><strong>3.2. Independence between two sequences</strong></p>
<p>It is known that if <em>α</em> and <em>β</em> are irrational numbers linearly independent over the set of rational numbers, then the sequences { <em>αn</em> } and { <em>βn</em> } are not correlated, even though each one, taken separately, is heavily auto-correlated. A sketch of the proof can be found in the Appendix of <a href="https://www.datasciencecentral.com/profiles/blogs/state-of-the-art-statistical-science-to-address-famous-number-the" target="_blank" rel="noopener">this article</a>. But are they really independent? Test, using statistical software, the absence of correlation for <em>α </em>= log 2 and <em>β</em> = log 3. How would you go about testing independence? The methodology presented in section 2.3 can be adapted to answer this question empirically (although not theoretically). </p>
<p></p>
Covid-19: My Predictions for 2021
tag:www.datasciencecentral.com,2020-11-30:6448529:BlogPost:1003991
2020-11-30T07:30:00.000Z
Vincent Granville
https://www.datasciencecentral.com/profile/VincentGranville
<p>Here I share my predictions, as well as my personal opinion about the pandemic. My thoughts are not derived from running sophisticated models on vast amounts of data; much of the available data has major issues anyway, something I also discuss below. This article covers what I believe is the good news and the bad news, along with an attempt to explain people's behavior and reactions, and the resulting consequences. My opinion is very different from what you have read in the news, whatever the political color. Mine, I think, has no political color. It offers a different, possibly refreshing perspective to gauge and interpret what is happening.</p>
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/8230291873?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/8230291873?profile=RESIZE_710x" width="500" class="align-center"/></a></p>
<p>I will start with Belgium, one of the countries with the highest death rate. Very recently, it went from 10,000 deaths to 15,000 in the last wave, in a matter of days. The country is back under a partial lock-down, and the situation has dramatically improved in the last few days. But 15,000 deaths out of 10,000,000 people would translate to 500,000 deaths in the US; we are far from there yet. Had they not mandated a new lock-down, killing restaurants and other businesses but keeping schools open along the way, they would probably have 20,000 deaths by now, quickly peaking at 25,000 before things improve. Of course, we are comparing apples and oranges: in Belgium, everyone believed to have died from covid was counted as a covid death, even if untested. Also, the population density is very high compared to the US, and use of public transportation is widespread. Areas with lower population density initially see fewer deaths per 100,000 inhabitants, until complacency eventually creates the same spike.</p>
<p>The bad news is that I think we will surpass 500,000 deaths in the US by the end of February. But I don't think we will reach 1,000,000 by the end of 2021. A vaccine has been announced for months, but it won't be available to the public at large in time: only to specific groups (such as hospital workers) in the next few months. By the time it is widely available, we will all have been infected and either recovered (99.8% of us) or died (0.2% of us). The vaccine will therefore be useless to curtail the pandemic, which by then will have died out on its own for lack of new people to infect. It may still be useful in the future, but not to spare the lives of the additional 300,000 who will have died between now and the end of February. </p>
<p>You may wonder: why not impose a full lock-down until March? Yes, it would save many lives, but it would kill many others in what I think is a zero-sum, sinister game. Economic destruction, suicide, drug abuse, crime, and riots would follow, and would be just as bad. And with surging unemployment and massive losses in tax revenue, I don't think any local or state government has the financial ability to do it; it is simply unsustainable. So I think lock-downs can only last so long, probably about a month at most. What is likely to happen is that more and more people will stop following un-enforced regulations, while those who really need to protect themselves will stay at home and continue to live in a self-imposed state of lock-down.</p>
<p>Now for some good news. It is said that for every person who tests positive, eight go untested because their symptoms are too mild or nonexistent to require medical help, and thus are never diagnosed. My whole family, my close friends, and I fit in that category: never tested, but fully recovered, with no long-term side effects. Have we been re-infected? Possibly, but it was even milder the second time, and again none of us were tested. One reason for not getting tested or treated is that going to a hospital is much riskier than dining in at a restaurant (many hospital workers died from covid; far fewer restaurant workers did). Another reason is to avoid having a potentially worrisome medical record attached to my name. Now you could say we were never infected in the first place, but that is like saying the virus is not contagious at all. Or you could say we will be re-infected again, but that is like saying the vaccine, even two doses six months apart, won't work. Indeed, we are very optimistic about our future, as are all the people currently boosting the stock market to incredible highs. What I am saying here is that probably up to half of the population (150 million Americans) are at the end of the tunnel by now: recovered, for most of us, or dead. </p>
<p>Some people like myself, who had a worse-than-average (still mild) case, find that wearing a mask causes breathing difficulties worse than the virus itself. I don't have time to wash my mask and hands all the time, or to keep buying new masks, when I believe my family and I are done with it. Unwashed, re-used masks are probably full of germs and worse than no mask at all, once immune. As more and more people recover every day in very large numbers (though the media never mention it), you are going to see more and more people spontaneously return to a normal life. These people are not anti-science, anti-social, or anti-government; quite the contrary, they are acting rationally, not driven by fear. They don't believe in conspiracy theories, and they come from all political affiliations, or none. Forcing these people to isolate via mandated lock-downs won't work: some will have big parties in private homes, and a hair-dresser may decide to provide her services privately in her clients' homes, paid under the table. People still want to eat great food with friends, and will continue to do so. People still want to date. Even if the city of Los Angeles makes it illegal to meet in your home with members of another household, you can't stop young (or less young) people from dating, any more than you can stop the law of gravity, no matter how hard you try.</p>
<p>Of course, if all the people acting this way were immune, it would not be an issue. Unfortunately, many people who behave that way today are just careless (or ignorant, maybe not reading the news anymore). But as time goes by, even many of the careless people are going to get infected and then immune; it's a matter of weeks. So the intensity of this situation may peak in a few weeks and then naturally slow down, as dramatically as it rose.</p>
<p>In conclusion, I believe that by the end of March we will be back to much better times, and covid will be a thing of the past for most of us, like the Spanish flu. It is said that the current yearly flu is just a remnant of the 1918 pandemic; the same may apply to covid, but it will be less lethal moving forward, after having killed those who were most susceptible to it. Already, the death rate has plummeted. This of course won't help people who have lost a family member or friend; you can't bring them back. This is the sad part.</p>
<p></p>
<p></p>
Introducing an All-purpose, Robust, Fast, Simple Non-linear Regression
tag:www.datasciencecentral.com,2020-11-24:6448529:BlogPost:1003574
2020-11-24T03:00:00.000Z
Vincent Granville
https://www.datasciencecentral.com/profile/VincentGranville
<p>The model-free, data-driven technique discussed here is so basic that it can easily be implemented in Excel, and we actually provide an Excel implementation. It is surprising that this technique does not pre-date standard linear regression, and it is rarely if ever used by statisticians and data scientists. It is related to kriging and nearest-neighbor interpolation, and was apparently first mentioned in 1965 by Harvard scientists working on GIS (geographic information systems). It was referred to back then as Shepard's method or inverse distance weighting, and used for multivariate interpolation on non-regular grids (see <a href="https://en.wikipedia.org/wiki/Multivariate_interpolation" target="_blank" rel="noopener">here</a> and <a href="https://en.wikipedia.org/wiki/Inverse_distance_weighting" target="_blank" rel="noopener">here</a>). We call this technique <em>simple regression</em>.</p>
<p></p>
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/8209321855?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/8209321855?profile=RESIZE_710x" width="500" class="align-center"/></a></p>
<p style="text-align: center;"><em>Source for picture: <a href="https://www.datasciencecentral.com/profiles/blogs/3-types-of-regression-in-one-picture-baba-png" target="_blank" rel="noopener">here</a></em></p>
<p>In this article, we show how simple regression can be generalized and used in regression problems, especially when standard regression fails due to multi-collinearity or other issues. It can safely be used by non-experts without risking misinterpretation of the results or over-fitting. We also show how to build confidence intervals for predicted values, compare the technique to linear regression on test data sets, and apply it to a non-linear context (regression on a circle) where standard regression fails. Not only does it work for prediction inside the domain (equivalent to interpolation) but also, to a lesser extent and with extra care, outside the domain (equivalent to extrapolation). No matrix inversion or gradient descent is needed in the computations, making it a faster alternative to linear or logistic regression.</p>
<p><span style="font-size: 14pt;"><strong>1. Simple regression explained</strong></span></p>
<p>For ease of presentation, we only discuss the two-dimensional case. Generalization to any dimension is straightforward. Let us assume that the data set (also called training set) consists of <em>n</em> points or locations (<em>X</em><span style="font-size: 8pt;">1</span>, <em>Y</em><span style="font-size: 8pt;">1</span>), ..., (<em>X<span style="font-size: 8pt;">n</span></em>, <em>Y<span style="font-size: 8pt;">n</span></em>) together with the response (also called dependent values) <em>Z</em><span style="font-size: 8pt;">1</span>, ..., <em>Z<span style="font-size: 8pt;">n</span></em> attached to each observation. Then the predicted value <em>Z</em> at an arbitrary location (<em>X</em>, <em>Y</em>) is computed as follows:</p>
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/8208229253?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/8208229253?profile=RESIZE_710x" width="400" class="align-center"/></a></p>
<p>Throughout this article, we used </p>
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/8208207489?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/8208207489?profile=RESIZE_710x" width="370" class="align-center"/></a></p>
<p>with <em>β</em> = 5.<b> </b>The parameter <em>β</em> controls the smoothness and is actually an hyper-parameter. It should be set to at least twice the dimension of the problem. A large value of <em>β </em>decreases the influence of far-away points in the predictions. In a Bayesian framework, a prior could be attached to <em>β</em>. Also note that if (<em>X</em>, <em>Y</em>) is one of the <em>n</em> training set points, say (<em>X</em>, <em>Y</em>) = (<em>X<span style="font-size: 8pt;">j</span></em>, <em>Y<span style="font-size: 8pt;">j</span></em>) for some <em>j</em>, then <em>Z</em> must be set to <em>Z<span style="font-size: 8pt;">j</span></em>. In short, the predicted value is exact for points belonging to the training set. If <span>(<em>X</em>, <em>Y</em>)</span> is very close to say (<em>X<span style="font-size: 8pt;">j</span></em>, <em>Y<span style="font-size: 8pt;">j</span></em>) and further away from the other training set points, then the computed <em>Z</em> is very close to <em>Z<span style="font-size: 8pt;">j</span></em>. It is assumed here that there are no duplicate locations in the training set otherwise, the formula needs adjustments. </p>
<p><span style="font-size: 14pt;"><strong>2. Case studies and Excel spreadsheet with computations</strong></span></p>
<p>We did some simulations to compare the performance of simple regression versus linear regression. In the first example, the training set consists of <em>n</em> = 100 data points generated as follows. The locations are random points (<em>X<span style="font-size: 8pt;">k</span></em>, <em>Y<span style="font-size: 8pt;">k</span></em>) in the two-dimensional unit square [0, 1] x [0, 1]. The response was set to <em>Z<span style="font-size: 8pt;">k</span></em> = SQRT[(<em>X<span style="font-size: 8pt;">k</span></em>)^2 + (<em>Y<span style="font-size: 8pt;">k</span></em>)^2]. The control set consists of another <em>n</em> = 100 points, also randomly distributed on the same unit square. The predicted values were computed on the control set, and the goal is to check how well they approximate the theoretical (true) value SQRT(<em>X</em>^2 + <em>Y</em>^2). Both the simple and linear regression perform well, though the R-squared is a little better for the simple regression, for most training and control sets of this type. The picture below shows the quality of the fit. A perfect fit would correspond to a perfect diagonal line rather than a cloud, with 0.9886 and 0.0089 (the slope and intercept of the red line) replaced respectively by 1 and 0. Note that the R-squared 0.9897 is very close to 1.</p>
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/8208321887?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/8208321887?profile=RESIZE_710x" width="400" class="align-center"/></a></p>
<p style="text-align: center;"><strong>Figure 1</strong>: <em>data set doing well with both simple and linear regression</em></p>
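<p>The unit-square experiment is easy to replicate. Here is a sketch (the seed, function names, and direct R-squared computation are my choices; the linear-regression side of the comparison is omitted):</p>

```python
import math, random

random.seed(1)

def idw_predict(x, y, pts, z, beta=5.0):
    # Inverse-distance-weighted prediction (the formula of section 1)
    num, den = 0.0, 0.0
    for (xk, yk), zk in zip(pts, z):
        d2 = (x - xk) ** 2 + (y - yk) ** 2
        if d2 == 0.0:
            return zk
        w = d2 ** (-beta / 2)
        num += w * zk
        den += w
    return num / den

def r_squared(actual, predicted):
    # Direct R-squared: 1 - SS_res / SS_tot
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot

n = 100
train = [(random.random(), random.random()) for _ in range(n)]
z_train = [math.hypot(px, py) for px, py in train]       # Z = SQRT(X^2 + Y^2)
control = [(random.random(), random.random()) for _ in range(n)]
truth = [math.hypot(px, py) for px, py in control]

pred = [idw_predict(px, py, train, z_train) for px, py in control]
print(r_squared(truth, pred))  # close to 1 for this smooth response
```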
<p><span><strong>2.1. Regression on the circle</strong></span></p>
<p>In this second example, both the training set and control points are located on the unit circle (on the boundary of the circle, not inside or outside, so technically this is a one-dimensional case). As expected, the R-squared for the linear regression is terrible and close to zero, while it is close to one for the simple regression. Note the weird distribution for the linear regression: this is not a glitch, it is expected to look that way.</p>
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/8208423294?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/8208423294?profile=RESIZE_710x" width="400" class="align-center"/></a></p>
<p style="text-align: center;"><strong>Figure 2</strong>: <em>Good fit with simple regression (points distributed on a circle)</em></p>
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/8208428655?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/8208428655?profile=RESIZE_710x" width="400" class="align-center"/></a></p>
<p style="text-align: center;"><strong>Figure 3</strong>: <em>Bad fit with linear regression (points distributed on the same circle as in Figure 2)</em></p>
<p><strong>2.2. Extrapolation</strong></p>
<p>In the third example, we used the same training set with random locations on the unit circle. The control set consists this time of <em>n</em> = 100 points located in a square away from the circle, with no intersection with the circle. This corresponds to extrapolation. Both the linear and simple regression perform badly this time. The R-squared associated with the linear regression is close to zero, so no amount of re-scaling can fix it. The predicted values appear random.</p>
<p>However, even though the simple regression results are almost as far off as those from the linear regression with respect to bias, they can easily be substantially improved. The picture below illustrates this fact. </p>
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/8209018659?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/8209018659?profile=RESIZE_710x" width="400" class="align-center"/></a></p>
<p style="text-align: center;"><strong>Figure 4</strong>: <em>Testing predictions outside the domain (extrapolation)</em></p>
<p>The slope in Figure 4 is 0.3784; for a perfect fit, it should be equal to one. However, the R-squared for the simple regression is pretty good: 0.842. So if we multiply the predicted values by a constant so that the average predicted value in the square outside the circle is no longer heavily biased, we obtain a good fit with the same R-squared. Of course, this assumes that the true average value on the unit square domain is known, at least approximately. It is significantly different from the average value computed on the training set (the circle), hence the bias. This fix won't work for the linear regression: its R-squared stays unchanged and close to zero after rescaling, even if we remove the bias. </p>
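<p>The rescaling fix amounts to one line: multiply the predictions by the ratio of the (approximately known) true mean over the target domain to the mean of the predictions. A minimal sketch, with made-up numbers (not the data from the article):</p>

```python
def rescale_predictions(pred, true_mean):
    """Remove the multiplicative bias: scale predictions so that their
    average matches a known (or estimated) true average over the target
    domain. A linear rescaling leaves the R-squared unchanged."""
    c = true_mean / (sum(pred) / len(pred))
    return [c * p for p in pred]

# Illustrative values only:
adjusted = rescale_predictions([0.30, 0.42, 0.36], true_mean=0.9)
print(sum(adjusted) / len(adjusted))  # mean is now 0.9 (up to floating point)
```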
<p><strong>2.3. Confidence intervals for predicted values</strong></p>
<p>Here, we are back to using the first data set that worked well both for linear and simple regression, doing interpolation rather than extrapolation, as at the beginning of section 2. The control set is fixed, but we split the training set (consisting this time of 500 points) into 5 subsets. This approach is similar to cross-validation or bootstrapping, and allows us to compute confidence intervals for the predicted values. It works as follows:</p>
<ul>
<li>Repeat the whole procedure 5 times, using each time a different subset of the training set</li>
<li>Estimate <em>Z</em> based on the location (<em>X</em>, <em>Y</em>) for each point in the control set, using the formula in section 1: we will have 5 different estimates for each point, one for each subset of the training set</li>
<li>For each point in the control set, compute the minimum and maximum estimated value, out of the 5 predictions</li>
<li>The confidence interval for each point has the minimum predicted value as lower bound, and the maximum as upper bound. </li>
</ul>
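<p>The procedure above can be sketched as follows (the seed, subset split, and synthetic response are my choices; the coverage you obtain will vary with the setup):</p>

```python
import math, random

random.seed(2)

def idw_predict(x, y, pts, z, beta=5.0):
    # Inverse-distance-weighted prediction (the formula of section 1)
    num, den = 0.0, 0.0
    for (xk, yk), zk in zip(pts, z):
        d2 = (x - xk) ** 2 + (y - yk) ** 2
        if d2 == 0.0:
            return zk
        w = d2 ** (-beta / 2)
        num += w * zk
        den += w
    return num / den

# 500 training points on the unit square, split into 5 subsets of 100
train = [(random.random(), random.random()) for _ in range(500)]
resp = [math.hypot(px, py) for px, py in train]
subsets = [(train[i::5], resp[i::5]) for i in range(5)]

# One estimate per subset for each control point; the CI is [min, max]
control = [(random.random(), random.random()) for _ in range(100)]
intervals = []
for cx, cy in control:
    estimates = [idw_predict(cx, cy, pts, zs) for pts, zs in subsets]
    intervals.append((min(estimates), max(estimates)))

# Empirical coverage: fraction of control points whose true value falls
# inside its confidence interval
coverage = sum(lo <= math.hypot(cx, cy) <= hi
               for (cx, cy), (lo, hi) in zip(control, intervals)) / len(control)
print(coverage)
```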
<p>Of course the technique can be further refined, using percentiles rather than minimum and maximum for the bounds of the confidence intervals. The most modern way to do it is described in my book <em>Statistics: New Foundations, Toolkit and Machine Learning Recipes</em>, available <a href="https://www.datasciencecentral.com/profiles/blogs/free-book-statistics-new-foundations-toolbox-and-machine-learning" target="_blank" rel="noopener">here</a> to DSC members. See chapters 15-16, pages 107-132.</p>
<p>The <strong>striking conclusions</strong> based on this test are as follows:</p>
<ul>
<li>The CI (confidence interval) based on simple regression is about 50% larger on average than the one based on linear regression</li>
<li>The CI based on simple regression contains the true value 92% of the time, versus 24% of the time for the linear regression.</li>
</ul>
<p>What is striking is the 92% achieved by the simple regression. Part of it is because the simple regression CI's are larger, but there is more to it. </p>
<p><strong>2.4. Excel spreadsheet</strong></p>
<p>All the data and tests discussed, including the computations, are available in my spreadsheet, allowing you to replicate the results or use it on your own data. You can download it <a href="https://storage.ning.com/topology/rest/1.0/file/get/8209116672?profile=original" target="_blank" rel="noopener">here</a> (krigi2.xlsx). The main tabs in the spreadsheet are</p>
<ul>
<li>Square</li>
<li>Circle-Interpolation</li>
<li>Circle-Extrapolation</li>
<li>Square-CI-Summary</li>
</ul>
<p>The remaining tabs are used for auxiliary computations and can be ignored.</p>
<p><span style="font-size: 14pt;"><strong>3. Generalization</strong></span></p>
<p>If you look at the main formula in section 1, the predicted <em>Z</em> is the quotient of two arithmetic means. The one at the numerator is a weighted mean, and the one at the denominator is a standard mean. But the formula will also work with other types of means, for example with the exponential mean discussed in one of my previous articles, <a href="https://www.datasciencecentral.com/profiles/blogs/alternative-to-the-arithmetic-geometric-and-harmonic-means" target="_blank" rel="noopener">here</a>. The advantage of using such means, over the arithmetic mean, is that there are hyperparameters attached to them, thus allowing for more granular fine-tuning. </p>
<p>For example, the exponential mean of <em>n</em> numbers <em>A</em><span style="font-size: 8pt;">1</span>, ..., <em>A<span style="font-size: 8pt;">n</span></em> is defined as</p>
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/8209146656?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/8209146656?profile=RESIZE_710x" width="350" class="align-center"/></a></p>
<p>When the hyperparameter <em>p</em> tends to 1, it corresponds to the arithmetic mean. Here, use the exponential mean with</p>
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/8209189858?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/8209189858?profile=RESIZE_710x" width="400" class="align-center"/></a></p>
<p>respectively for the numerator and denominator in the first formula in section 1. You can even use a different <em>p</em> for the numerator and denominator.</p>
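<p>In case the formulas above do not render, here is a hedged sketch. I assume the exponential mean is the base-<em>p</em> logarithm of the average of <em>p</em> raised to each value, which is consistent with the stated limit (arithmetic mean as <em>p</em> tends to 1); the weight normalization is my own addition, to keep the exponentials numerically stable, and is not part of the article's formula:</p>

```python
import math

def exp_mean(values, p):
    """Exponential mean: log base p of the average of p^A_k.
    It tends to the arithmetic mean as p -> 1 (form assumed here,
    consistent with that stated limit)."""
    return math.log(sum(p ** a for a in values) / len(values), p)

def generalized_predict(x, y, pts, z, beta=5.0, p=1.5):
    """IDW prediction with the two arithmetic means replaced by
    exponential means. Weights are normalized to (0, 1] before taking
    the means, so that p^w cannot overflow near training points."""
    weights, zs = [], []
    for (xk, yk), zk in zip(pts, z):
        d2 = (x - xk) ** 2 + (y - yk) ** 2
        if d2 == 0.0:
            return zk  # still exact at training locations
        weights.append(d2 ** (-beta / 2))
        zs.append(zk)
    wmax = max(weights)
    num = exp_mean([w / wmax * zk for w, zk in zip(weights, zs)], p)
    den = exp_mean([w / wmax for w in weights], p)
    return num / den

print(exp_mean([1.0, 2.0, 3.0], 1.0001))  # approaches the arithmetic mean 2
```

<p>A different <em>p</em> could be passed for the numerator and the denominator, as suggested above, at the cost of one extra hyperparameter.</p>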
<p>Other original exact interpolation techniques based on Fourier methods, in one dimension and for points equally spaced, are described <a href="https://mathoverflow.net/questions/376081/infinite-partial-fraction-expansions-to-compute-fractional-iterations-and-recurr" target="_blank" rel="noopener">in this article</a>. Indeed, it was this type of interpolation that led me to investigate the material presented here. Robust, simple linear regression techniques are also described in chapter 1 in my book <em>Statistics: New Foundations, Toolkit and Machine Learning Recipes</em>, available <a href="https://www.datasciencecentral.com/profiles/blogs/free-book-statistics-new-foundations-toolbox-and-machine-learning" target="_blank" rel="noopener">here</a> to DSC members.</p>
<p></p>