]]>

Convolution is a concept well known to machine learning and signal processing professionals. In this article, we explain in simple English how a moving average is actually a discrete convolution, and we use this fact to build weighted moving averages with natural weights that at the limit, have a Gaussian behavior guaranteed by the Central Limit Theorem. Moving averages are nothing more than blurring filters for signal processing experts, with a Gaussian-like kernel in the case discussed here. Inverting a moving average to recover the original signal consists in applying the inverse filter, known as a sharpening or enhancing filter. The inverse filter is used for instance in image analysis, to remove noise or deblur an image, while the original filter (the moving average) does the opposite. This is discussed here for one dimensional discrete signals, known as time series. Generalizations are also discussed. An interesting application in number theory, related to the famous unsolved Riemann conjecture, is also discussed.Figure 1: Bell-shaped distribution for re-scaled coefficients (the weights) discussed in section 1.11. Weighted moving averages as convolutionsGiven a discrete time series with observations X(0), X(1), X(2) and so on, a weighted moving average can be defined byHere Y(t) is the smoothed signal and h is a discrete density function (thus summing to one) though negative values of h(k) are sometimes used, for instance in Spencer's 15-point moving average used by actuaries, see here. We assume that t can take on negative integer values. Also, unless otherwise specified, we assume the weights to be symmetrical, that is, h(k) = h(-k). The parameter N can be infinite, but typically, the values h(k) are fast decaying the further away you are from k = 0. The notation used by mathematicians to represent this transformation is as follows: Y = T(X) = h * X where * is the convolution operator. This notation is convenient because it easily allows us to define the iterated moving average as a self-composition of the operator T, acting on the time series X : Start with Y0 = X, Y1 = Y, and let Yn+1 = T(Yn) = h * Yn. Likewise, we can define hn (with h1 = h) as h * h * ... * h, that is, an n-fold self-convolution of h. Of course, Yn = hn * X so that we haveNote that the sum goes from -Nn to Nn this time, as each additional iteration increases the number of terms in the sum, so Nn+1 > Nn, with N1 = N. This becomes clear in the following illustration.1.1 ExampleThe most basic case corresponds to N = 1, with h(-1) = h(0) = h(1) = 1/3. In this case, Nn = n, and the average value of hn(k) is equal to 1 / (2Nn +1). We have the following table:The above table shows how the weights are automatically determined, without guess work, rule of thumb, or fine-tuning required. Note that the sum of the elements in the n-th row is always equal to 3^n (3 at power n). This is very similar to the binomial coefficients table, and hn(k) are known as the trinomial coefficients, see here. The difference is that for binomial coefficients, the sum of the elements in the n-th row is always equal to 2^n, and the n-th row only has n + 1 entries, versus 2n + 1 in our table. The values hn(k) corresponding to n = 100 are displayed in Figure 1, at the top of this article. They have been scaled by a factor equal to the square root of Nn, since otherwise they all tend to zero as n tends to infinity. 1.2 Link to the Central Limit TheoremThe methodology developed here can be used to prove the central limit theorem in the most classic way. Indeed, the classic proof uses iterated self-convolutions, and the fact that the Fourier transform of convolutions is the product of the individual Fourier transform of each convolution. The Fourier transform is called characteristic function in probability theory. Interestingly, this leads to Gaussian approximations for partial sums of coefficients such as those in the n-th row, in the above table, when n is large and after proper rescaling. This is already well known for binomial coefficients (see here), and it easily extends to the coefficients introduced here, as well as to many other types of mathematical coefficients. See also Figure 1.2. Inverting a moving average, and generalizationsInverting a moving average consists in retrieving the original time series or signal. It consists in applying the inverse filter to the observed data, to un-smooth it. It is usually not possible to do it, though the true answer is somewhat more nuanced. It is certainly easier to do when N is small, though usually N is not known, and the weights are also unknown. However if the observed data is the result of applying the simple convolution described in section 1.2 with N = 1, you only need to know the values of X(t) at two different times t0 and t1 to retrieve the original signal. This is easiest if you know X(t) at t0 = 0 and at t1 = 1: in this case, there is a simple inversion formula: If you know X(0), X(1), and Y(t) for all t's, you can iteratively retrieve X(2), X(3), and so on with the above recurrence formula. If you don't know X(0), X(1) but instead you know the variance and other higher moments of X(t), assuming X(t) is stationary, then you may test various X(0), X(1) until you find a pair matching these moments when reconstructing the full sequence X(t) using the above recurrence formula. The solution may not be unique. Other parameters you know about X(t) may be useful too for the reconstruction: the period (if any), the slope of a linear trend (if any), and so on. 2.1 GeneralizationsThe moving averages discussed here rely on the classic arithmetic mean as the fundamental convolution operator, corresponding to N = 1. It is possible to use other means such as the harmonic or geometric means, and even more general as those defined in this article. It can be generalized to two or higher dimensions, and to a time-continuous signal. For prediction or extrapolation, see this article. For interpolation, that is to estimate X(t) when t is not an integer, see this article. 3. Application and source codeWe applied the above methodology with n = 60 to the following time series, with 60 < t < 240 being an integer:Figure 2 shows Yn(t) with n = 60 (the red curve), after shifting and rescaling (multiplying) it by a factor of order sqrt(n). In this case, X(2t) represents the real part of the Dirichlet Eta function η defined in the complex plane. If you replace the cosine by a sine in the definition of X(t), you get similar results for the imaginary part of η. What is spectacular here is that Yn(t) is very well approximated by a cosine function, see bottom of figure 2. The implication is that thanks to the self-convolution used here, we can approximate the real and imaginary parts of η by a simple auto-regressive model. This in turn may have implications to help solve the famous Riemann Hypothesis (RH) which essentially consists in locating the values of t such that X(2t) = 0 simultaneously for the real and imaginary part of η. RH states that there is no such t in our particular case, where a parameter 0.75 is used in the definition of X(t). It is conjectured to also be true if you replace 0.75 by any value strictly between 0.5 and 1. See more here and here. Figure 2: weighted moving average (WMA) with n = 60 (top), model fitting with cosine function (bottom)Note that X(t), the blue curve, is non-periodic, while the red curve is almost perfectly periodic. If you use arbitrary moving averages instead of the one based on a convolution hn * X, you won't get a perfect fit in the bottom part of figure 2, certainly not a perfect fit with a simple cosine function. Figure 3: same as top part of figure 2, but using a different X(t) for the blue curveAlso, the perfect fit can not be achieved if you replace the logarithm in the definition of X(t), by a much faster growing function. This is illustrated in figure 3, where the logarithm in X(t) was replaced by a square root.The source code can be downloaded here (convol2b.pl.txt). Since it is dealing with convolutions, it can be further optimized using Fast Fourier Transforms (FFT), see here. Finally, it would be interesting to treat this case assuming the time t is continuous, using continuous rather than discrete convolutions.About the author: Vincent Granville is a data science pioneer, mathematician, book author (Wiley), patent owner, former post-doc at Cambridge University, former VC-funded executive, with 20+ years of corporate experience including CNET, NBC, Visa, Wells Fargo, Microsoft, eBay. Vincent also founded and co-founded a few start-ups, including one with a successful exit (Data Science Central acquired by Tech Target). He is also the founder and investor in Paris Restaurant in Anacortes, WA. You can access Vincent's articles and books, here.See More

Convolution is a concept well known to machine learning and signal processing professionals. In this article, we explain in simple English how a moving average is actually a discrete convolution, and we use this fact to build weighted moving averages with natural weights that at the limit, have a Gaussian behavior guaranteed by the Central Limit Theorem. Moving averages are nothing more than blurring filters for signal processing experts, with a Gaussian-like kernel in the case discussed here. Inverting a moving average to recover the original signal consists in applying the inverse filter, known as a sharpening or enhancing filter. The inverse filter is used for instance in image analysis, to remove noise or deblur an image, while the original filter (the moving average) does the opposite. This is discussed here for one dimensional discrete signals, known as time series. Generalizations are also discussed. An interesting application in number theory, related to the famous unsolved Riemann conjecture, is also discussed.Figure 1: Bell-shaped distribution for re-scaled coefficients (the weights) discussed in section 1.11. Weighted moving averages as convolutionsGiven a discrete time series with observations X(0), X(1), X(2) and so on, a weighted moving average can be defined byHere Y(t) is the smoothed signal and h is a discrete density function (thus summing to one) though negative values of h(k) are sometimes used, for instance in Spencer's 15-point moving average used by actuaries, see here. We assume that t can take on negative integer values. Also, unless otherwise specified, we assume the weights to be symmetrical, that is, h(k) = h(-k). The parameter N can be infinite, but typically, the values h(k) are fast decaying the further away you are from k = 0. The notation used by mathematicians to represent this transformation is as follows: Y = T(X) = h * X where * is the convolution operator. This notation is convenient because it easily allows us to define the iterated moving average as a self-composition of the operator T, acting on the time series X : Start with Y0 = X, Y1 = Y, and let Yn+1 = T(Yn) = h * Yn. Likewise, we can define hn (with h1 = h) as h * h * ... * h, that is, an n-fold self-convolution of h. Of course, Yn = hn * X so that we haveNote that the sum goes from -Nn to Nn this time, as each additional iteration increases the number of terms in the sum, so Nn+1 > Nn, with N1 = N. This becomes clear in the following illustration.1.1 ExampleThe most basic case corresponds to N = 1, with h(-1) = h(0) = h(1) = 1/3. In this case, Nn = n, and the average value of hn(k) is equal to 1 / (2Nn +1). We have the following table:The above table shows how the weights are automatically determined, without guess work, rule of thumb, or fine-tuning required. Note that the sum of the elements in the n-th row is always equal to 3^n (3 at power n). This is very similar to the binomial coefficients table, and hn(k) are known as the trinomial coefficients, see here. The difference is that for binomial coefficients, the sum of the elements in the n-th row is always equal to 2^n, and the n-th row only has n + 1 entries, versus 2n + 1 in our table. The values hn(k) corresponding to n = 100 are displayed in Figure 1, at the top of this article. They have been scaled by a factor equal to the square root of Nn, since otherwise they all tend to zero as n tends to infinity. 1.2 Link to the Central Limit TheoremThe methodology developed here can be used to prove the central limit theorem in the most classic way. Indeed, the classic proof uses iterated self-convolutions, and the fact that the Fourier transform of convolutions is the product of the individual Fourier transform of each convolution. The Fourier transform is called characteristic function in probability theory. Interestingly, this leads to Gaussian approximations for partial sums of coefficients such as those in the n-th row, in the above table, when n is large and after proper rescaling. This is already well known for binomial coefficients (see here), and it easily extends to the coefficients introduced here, as well as to many other types of mathematical coefficients. See also Figure 1.2. Inverting a moving average, and generalizationsInverting a moving average consists in retrieving the original time series or signal. It consists in applying the inverse filter to the observed data, to un-smooth it. It is usually not possible to do it, though the true answer is somewhat more nuanced. It is certainly easier to do when N is small, though usually N is not known, and the weights are also unknown. However if the observed data is the result of applying the simple convolution described in section 1.2 with N = 1, you only need to know the values of X(t) at two different times t0 and t1 to retrieve the original signal. This is easiest if you know X(t) at t0 = 0 and at t1 = 1: in this case, there is a simple inversion formula: If you know X(0), X(1), and Y(t) for all t's, you can iteratively retrieve X(2), X(3), and so on with the above recurrence formula. If you don't know X(0), X(1) but instead you know the variance and other higher moments of X(t), assuming X(t) is stationary, then you may test various X(0), X(1) until you find a pair matching these moments when reconstructing the full sequence X(t) using the above recurrence formula. The solution may not be unique. Other parameters you know about X(t) may be useful too for the reconstruction: the period (if any), the slope of a linear trend (if any), and so on. 2.1 GeneralizationsThe moving averages discussed here rely on the classic arithmetic mean as the fundamental convolution operator, corresponding to N = 1. It is possible to use other means such as the harmonic or geometric means, and even more general as those defined in this article. It can be generalized to two or higher dimensions, and to a time-continuous signal. For prediction or extrapolation, see this article. For interpolation, that is to estimate X(t) when t is not an integer, see this article. 3. Application and source codeWe applied the above methodology with n = 60 to the following time series, with 60 < t < 240 being an integer:Figure 2 shows Yn(t) with n = 60 (the red curve), after shifting and rescaling (multiplying) it by a factor of order sqrt(n). In this case, X(2t) represents the real part of the Dirichlet Eta function η defined in the complex plane. If you replace the cosine by a sine in the definition of X(t), you get similar results for the imaginary part of η. What is spectacular here is that Yn(t) is very well approximated by a cosine function, see bottom of figure 2. The implication is that thanks to the self-convolution used here, we can approximate the real and imaginary parts of η by a simple auto-regressive model. This in turn may have implications to help solve the famous Riemann Hypothesis (RH) which essentially consists in locating the values of t such that X(2t) = 0 simultaneously for the real and imaginary part of η. RH states that there is no such t in our particular case, where a parameter 0.75 is used in the definition of X(t). It is conjectured to also be true if you replace 0.75 by any value strictly between 0.5 and 1. See more here and here. Figure 2: weighted moving average (WMA) with n = 60 (top), model fitting with cosine function (bottom)Note that X(t), the blue curve, is non-periodic, while the red curve is almost perfectly periodic. If you use arbitrary moving averages instead of the one based on a convolution hn * X, you won't get a perfect fit in the bottom part of figure 2, certainly not a perfect fit with a simple cosine function. Figure 3: same as top part of figure 2, but using a different X(t) for the blue curveAlso, the perfect fit can not be achieved if you replace the logarithm in the definition of X(t), by a much faster growing function. This is illustrated in figure 3, where the logarithm in X(t) was replaced by a square root.The source code can be downloaded here (convol2b.pl.txt). Since it is dealing with convolutions, it can be further optimized using Fast Fourier Transforms (FFT), see here. Finally, it would be interesting to treat this case assuming the time t is continuous, using continuous rather than discrete convolutions.About the author: Vincent Granville is a data science pioneer, mathematician, book author (Wiley), patent owner, former post-doc at Cambridge University, former VC-funded executive, with 20+ years of corporate experience including CNET, NBC, Visa, Wells Fargo, Microsoft, eBay. Vincent also founded and co-founded a few start-ups, including one with a successful exit (Data Science Central acquired by Tech Target). He is also the founder and investor in Paris Restaurant in Anacortes, WA. You can access Vincent's articles and books, here.See More

Here I provide translations for various important terms, to help professionals from related backgrounds better understand each other. In particular, machine learning professionals versus statisticians.Source for picture: hereFeature (machine learning)A feature is known as a variable or independent variable in statistics. It is also known as a predictor by predictive analytics professionals. ResponseThe response is called dependent variable in statistics. Machine learning professionals sometimes call it the output. R-squareThis is the statistics used by statisticians to measure the performance of a model. There are many better alternatives. Machine learning professionals sometimes call it goodness-of-fit metric. RegressionSometimes called maximum likelihood regression or linear regression by statisticians. Physicists and signal processing / operations research professionals use the term ordinary least squares instead. And yes, it is possible to compute confidence intervals (CI) without underlying models. They are called data-driven, and rely on simulations and empirical percentile distributions. Logistic transformThe term used in the context of neural networks is sigmoid. Statisticians are more familiar with the word logistic, as in logistic regression.Neural networksWhile not exactly the same thing, statisticians have they own multi-layers hierarchical networks: they are called Bayesian hierarchical networks.Test of hypothesisBusiness intelligence professionals call it A/B testing, or multivariate testing.Boosted modelsBoosted models are used by machine learning professionals to blend multiple models and get the best of each model. Statisticians call them ensemble techniques.Confidence intervalsWe are all familiar with this concept invented by statisticians. Alternative terms include prediction intervals, or error (not to be confused with predictive or residual error, as it has its own meaning for statisticians).GroupingAlso known as aggregating, and consisting in grouping values of some feature or independent variable, especially in decision trees to reduce the number of nodes. Machine learning professionals call it feature binning. TaxonomyWhen applied to unstructured text data, the creation of a taxonomy (sometimes called ontology) is referred to as natural language processing. It is basically clustering of text data.ClusteringStatisticians call it clustering. In machine learning, the concept is referred to as unsupervised classification. To the contrary, supervised clustering is a learning technique based on training sets and cross-validation. Control setMachine learning professionals use control and test sets. Statisticians use the term cross-validation or bootstrapping, as well as training sets. Model fittingThe terms favored by machine learning professionals is model selection, testing, and feature selection. Model performance has its own statistical related term: p-value, though it less used recently. False positivesInstead of false positives and false negatives, statisticians favor type I and type II errors.Another similar dictionary can be found here. About the author: Vincent Granville is a data science pioneer, mathematician, book author (Wiley), patent owner, former post-doc at Cambridge University, former VC-funded executive, with 20+ years of corporate experience including CNET, NBC, Visa, Wells Fargo, Microsoft, eBay. Vincent also founded and co-founded a few start-ups, including one with a successful exit (Data Science Central acquired by Tech Target). You can access Vincent's articles and books, here.See More

Here I provide translations for various important terms, to help professionals from related backgrounds better understand each other. In particular, machine learning professionals versus statisticians.Source for picture: hereFeature (machine learning)A feature is known as a variable or independent variable in statistics. It is also known as a predictor by predictive analytics professionals. ResponseThe response is called dependent variable in statistics. Machine learning professionals sometimes call it the output. R-squareThis is the statistics used by statisticians to measure the performance of a model. There are many better alternatives. Machine learning professionals sometimes call it goodness-of-fit metric. RegressionSometimes called maximum likelihood regression or linear regression by statisticians. Physicists and signal processing / operations research professionals use the term ordinary least squares instead. And yes, it is possible to compute confidence intervals (CI) without underlying models. They are called data-driven, and rely on simulations and empirical percentile distributions. Logistic transformThe term used in the context of neural networks is sigmoid. Statisticians are more familiar with the word logistic, as in logistic regression.Neural networksWhile not exactly the same thing, statisticians have they own multi-layers hierarchical networks: they are called Bayesian hierarchical networks.Test of hypothesisBusiness intelligence professionals call it A/B testing, or multivariate testing.Boosted modelsBoosted models are used by machine learning professionals to blend multiple models and get the best of each model. Statisticians call them ensemble techniques.Confidence intervalsWe are all familiar with this concept invented by statisticians. Alternative terms include prediction intervals, or error (not to be confused with predictive or residual error, as it has its own meaning for statisticians).GroupingAlso known as aggregating, and consisting in grouping values of some feature or independent variable, especially in decision trees to reduce the number of nodes. Machine learning professionals call it feature binning. TaxonomyWhen applied to unstructured text data, the creation of a taxonomy (sometimes called ontology) is referred to as natural language processing. It is basically clustering of text data.ClusteringStatisticians call it clustering. In machine learning, the concept is referred to as unsupervised classification. To the contrary, supervised clustering is a learning technique based on training sets and cross-validation. Control setMachine learning professionals use control and test sets. Statisticians use the term cross-validation or bootstrapping, as well as training sets. Model fittingThe terms favored by machine learning professionals is model selection, testing, and feature selection. Model performance has its own statistical related term: p-value, though it less used recently. False positivesInstead of false positives and false negatives, statisticians favor type I and type II errors.Another similar dictionary can be found here. About the author: Vincent Granville is a data science pioneer, mathematician, book author (Wiley), patent owner, former post-doc at Cambridge University, former VC-funded executive, with 20+ years of corporate experience including CNET, NBC, Visa, Wells Fargo, Microsoft, eBay. Vincent also founded and co-founded a few start-ups, including one with a successful exit (Data Science Central acquired by Tech Target). You can access Vincent's articles and books, here.See More

]]>

]]>

]]>

]]>

]]>