This article was written by Ray of RayOnStorage Blog. Original title: Compressing information through the information bottleneck during deep learning.
Read an article in Quanta Magazine about a talk done a month or so ago given by Professor Naftali (Tali) Tishby on his theory that all deep learning convolutional neural networks (CNN) exhibit an “information bottleneck” during deep learning. This information bottleneck results in compressing the information present, in for example, an image and only working with the relevant information.
The Professor and his researchers used a simple AI problem (like recognizing a dog) and trained a deep learning CNN to perform this task. At the start of the training process the CNN nodes at the top were all connected to the next layer, and those were all connected to the next layer and so on until you got to the output layer.
Essentially, the researchers found that during the deep learning process, the CNN went from recognizing all features of an image to over time just recognizing (processing?) only the relevant features of an image when successfully trained.
Limits of Deep Learning CNNs:
In his talk the Professor identifies two modes of operations of a deep learning CNN: the encoder layers and decoder layers. The encoder function identifies relevant information in the input and the decoder function takes this relevant information and maps this to an output.
This view results in two statistics that can characterize any deep learning CNN:
Where mutual information is defined as how much of the uncertainty of an input is removed when you have an output that is based on that input. (See the talk for a more formal explanation).
The professor states that any complex deep learning CNN can be characterized by these two statistics where sample complexity determines the number of samples required and accuracy determines the precision by which the deep learning CNN can properly interpret those samples. The deep black line in the chart represents the limits of accuracy achievable at some number of training events, with some number of hidden layers and some sample set.
What happens during deep learning:
Moreover, the professor shows an interesting characteristic of all CNNs is that they converge over time in accuracy and that convergence differs based mostly on the number of layers, sample size and training count used.
In the chart, the top row show 3 CNNs with different amounts of training data (5%, 40% and 80% of total). The chart shows the end result and trace of learning within the CNN over the same number of epochs (training cycles). More training data generates more accurate results.
The Professor views those epochs after the farthest right traces (where the trace essentially starts moving up and to the left in the chart), the compression phase of deep learning.