It is a well-known fact that neural networks can approximate the output of any continuous mathematical function, no matter how complicated it might be. Take, for instance, the function below:
Though it has a pretty complicated shape, the theorems we will discuss shortly guarantee that one can build some neural network that can approximate f(x) as accurately as we want. Neural networks, therefore, display a type of universal behavior.
One of the reasons neural networks have received so much attention is that in addition to these rather remarkable universal properties, they possess many powerful algorithms for learning functions.
This piece will give an informal overview of some of the fundamental mathematical results (theorems) underlying these approximation capabilities of artificial neural networks.
“Almost any process you can imagine can be thought of as function computation… [Examples include] naming a piece of music based on a short sample of the piece […], translating a Chinese text into English […], taking an mp4 movie file and generating a description of the plot of the movie, and a discussion of the quality of the acting.”
— Michael Nielsen
In 1957, the Russian mathematician Andrey Kolmogorov (picture below), known for his contributions to a wide range of mathematical topics (such as probability theory, topology, turbulence, computational complexity, and many others), proved an important theorem about the representation of real functions of many variables. According to Kolmogorov’s theorem, any multivariate continuous function can be expressed as a combination of sums and compositions of a finite number of univariate functions.
A little more formally, the theorem states that a continuous function f of n real variables defined on the n-dimensional hypercube (with n ≥ 2) can be represented as follows (this is the usual modern form of the statement; the exact notation varies slightly between sources):
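f(x_1, \dots, x_n) = \sum_{q=0}^{2n} g_q\!\left( \sum_{p=1}^{n} \phi_{q,p}(x_p) \right)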
In this expression, the gs are univariate functions and the ϕs are continuous, strictly (monotonically) increasing functions (as shown in the figure below) that do not depend on the choice of f.
The Universal Approximation Theorem (UAT) states that feed-forward neural networks containing a single hidden layer with a finite number of nodes can approximate any continuous function, provided rather mild assumptions about the form of the activation function are satisfied. Since almost any process we can imagine can be described by some mathematical function, neural networks can, at least in principle, predict the outcome of nearly every such process.
There are several rigorous proofs of the universality of feed-forward artificial neural nets using different activation functions. Let us, for the sake of brevity, restrict ourselves to the sigmoid function. Sigmoids are “S”-shaped and include as special cases the logistic function, the Gompertz curve, and the ogee curve.
Here is a quick piece of Python code to build and plot a sigmoid function:
import numpy as np
import matplotlib.pyplot as plt

upper, lower = 6, -6
num = 100

def sigmoid_activation(x):
    # clamp the input to keep np.exp from overflowing
    if x > upper:
        x = upper
    elif x < lower:
        x = lower
    return 1.0 / (1.0 + np.exp(-x))
xs = np.linspace(lower, upper, num=num)
vals = [sigmoid_activation(x) for x in xs]
_ = plt.plot(xs, vals)
The proof given by Cybenko (1989) is known for its elegance, simplicity, and conciseness. In his article, he proves the following statement. Let ϕ be any continuous function of the sigmoid type (see the discussion above). Given any multivariate continuous function f(x) defined on a compact subset of N-dimensional real space, and any positive ϵ, there are vectors w_1, …, w_m (the weights), constants b_1, …, b_m (the bias terms) and α_1, …, α_m (the output coefficients) such that

|G(x) - f(x)| < \epsilon

for any x (the NN inputs) inside the compact subset, where the function G is given by:

G(x) = \sum_{i=1}^{m} \alpha_i \, \phi(w_i^{T} x + b_i)
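To make the form of G less abstract, here is a minimal sketch (the parameter names W, b and alpha are my own choices, and the sigmoid_activation defined earlier plays the role of ϕ):

# Sketch: G(x) = sum_i alpha_i * phi(w_i . x + b_i), with phi a sigmoidal function
def G(x, W, b, alpha):
    # W: (m, N) weight vectors, b: (m,) biases, alpha: (m,) output coefficients
    hidden = np.array([sigmoid_activation(np.dot(w_i, x) + b_i)
                       for w_i, b_i in zip(W, b)])
    return np.sum(alpha * hidden)

In other words, G is exactly the output of a feed-forward network with a single hidden layer of m sigmoid nodes and a linear output node.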
By choosing the appropriate parameters, neural nets can therefore be used to represent functions with a wide range of different forms.
To make these statements less abstract, let us build a simple Python code to illustrate what has been discussed so far. The following analysis is based on Michael Nielsen’s great online book and on the exceptional lectures by Matt Brems and Justin Pounders at the General Assembly Data Science Immersive (DSI).
We will build a neural network to approximate the following simple function:

f(x) = 0.2 + 0.4\,x^2 + 0.3\,x\sin(15x) + 0.05\cos(50x)
A few remarks are needed before diving into the code:
- When the weight w of a hidden node is very large, its sigmoid activation behaves essentially like a step function, and the bias b fixes the position s = -b/w at which the step occurs.
- Pairing two such nodes, with output weights h and -h, produces a "bump" of height h on the interval between their two step positions; summing several bumps gives a piecewise-constant approximation of the target (a quick illustrative sketch follows below).
To reproduce this function, we choose the heights of the five bumps to be h = -1.2, -1.6, -0.3, -1 and 1 on the intervals [0, 0.2), [0.2, 0.4), …, [0.8, 1); these values roughly track the inverse sigmoid of f on each interval, and they reappear as the pairs ±h in weights_output below (see the corresponding figure below).
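As a quick illustration of the mechanism (the numbers below are hypothetical and only meant to show a single bump, not part of the network we will build), two hidden nodes with a large weight and opposite output weights h and -h produce a flat bump of height h:

# Hypothetical example: a single "bump" of height h on the interval [s1, s2)
w_demo, s1, s2, h = 500, 0.2, 0.4, -1.6
def bump(x):
    # the first node "switches on" at s1 (contributing h), the second at s2 (contributing -h)
    return (h * sigmoid_activation(w_demo * x - w_demo * s1)
            - h * sigmoid_activation(w_demo * x - w_demo * s2))

Placing several such bumps side by side is all the network below does.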
The code starts with the following definitions:
- the function inversefunc (from the pynverse library), which we will need to build the inverse of the sigmoid function;
- the array weights_output, obtained from the values of the output weights (the pairs ±h) given in the previous section.
from pynverse import inversefunc

# a large weight w makes each sigmoid behave (almost) like a step function
w = 500

def solve_for_bias(s, w=w):
    # bias that places the "step" of the sigmoid at position s
    return -w * s

# step positions: consecutive pairs delimit the intervals of the bumps
steps = [0, .2, .2, .4, .4, .6, .6, .8, .8, 1]

bias_hl = np.array([solve_for_bias(s) for s in steps])
weights_hl = np.array([w] * len(steps))

bias_output = 0
# one output weight per hidden node; each pair (h, -h) sets the height of one bump
weights_output = np.array([-1.2, 1.2, -1.6, 1.6,
                           -.3, .3, -1, 1, 1, -1])
The final steps are:
- a Python function, which I called nn, that builds and runs the network;
- a plot comparing the network output with the target function over [0, 1].
def identity_activation(x): return x  # linear (identity) activation for the output node

def nn(input_value):
    # hidden layer: one near-step sigmoid per node
    Z_hl = input_value * weights_hl + bias_hl
    activation_hl = np.array([sigmoid_activation(Z)
                              for Z in Z_hl])
    Z_output = np.sum(activation_hl * weights_output) + bias_output
    return identity_activation(Z_output)
x_values = np.linspace(0, 1, 1000)

def f(x):
    # the target function we want the network to reproduce
    return 0.2 + 0.4*(x**2) + 0.3*x*np.sin(15*x) + 0.05*np.cos(50*x)

y = [f(x) for x in x_values]
y_hat = [nn(x) for x in x_values]

# following Nielsen's construction, the weighted output of the hidden layer is
# designed to approximate the inverse sigmoid of f, so we compare it with sigma^-1(y)
inv_sigmoid = inversefunc(sigmoid_activation)
y_invsig = [inv_sigmoid(i) for i in y]

_ = plt.plot(x_values, y_invsig)
_ = plt.plot(x_values, y_hat)
_ = plt.xlim((0, 1))
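Equivalently, since the weighted output of the hidden layer is designed to approximate the inverse sigmoid of f, one can push the network output through the sigmoid and compare it directly with f. This is a small variation on the plot above, not part of the original code:

# compare f(x) directly with sigmoid(nn(x))
y_hat_sig = [sigmoid_activation(v) for v in y_hat]
_ = plt.figure()
_ = plt.plot(x_values, y, label='f(x)')
_ = plt.plot(x_values, y_hat_sig, label='sigmoid(nn(x))')
_ = plt.xlim((0, 1))
_ = plt.legend()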
This approximation is far from ideal. However, it is straightforward to improve it, for example by increasing the number of nodes or the number of layers (while at the same time avoiding overfitting), as sketched below.
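As a sketch of one such refinement (the grid size and the *_fine names below are my own choices, not part of the original construction), one can use a finer partition of [0, 1] and read the bump heights directly off the inverse sigmoid of f at the midpoints of the intervals:

# Hypothetical refinement: more bumps give a finer piecewise-constant approximation
n_bumps = 50                                    # my choice of partition size
edges = np.linspace(0, 1, n_bumps + 1)
steps_fine = np.repeat(edges, 2)[1:-1]          # paired interval endpoints, as before
bias_fine = np.array([solve_for_bias(s) for s in steps_fine])
weights_fine = np.array([w] * len(steps_fine))
mids = (edges[:-1] + edges[1:]) / 2
heights = [float(inv_sigmoid(f(m))) for m in mids]   # bump heights from the inverse sigmoid of f
weights_out_fine = np.array([v for h in heights for v in (h, -h)])

def nn_fine(x):
    a = np.array([sigmoid_activation(z) for z in x * weights_fine + bias_fine])
    return np.sum(a * weights_out_fine)

Plotting [nn_fine(x) for x in x_values] against y_invsig, exactly as above, should show a visibly tighter fit.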
In this article, I described some basic mathematics underlying the universal properties of neural networks, and I walked through a simple Python implementation that approximates a simple function.
Thanks for reading and see you soon!
By the way, constructive criticism and feedback are always welcome!