Understanding the maths behind forward and back propagation is not very easy.

There are some very good, but also very technical, explanations.

For example, The Matrix Calculus You Need For Deep Learning by Terence Parr and Jer... is an excellent resource, but it is still too complex for beginners.

I found a much simpler explanation in the ml cheatsheet.

The section below is based on this source, and I have tried to simplify the explanation further.

All diagrams and equations are based on this source in the ml cheatsheet.

Let’s start with forward propagation

Here, input data is “forward propagated” through the network layer by layer to the final layer which outputs a prediction. The simple network can be seen as a series of nested functions.

For the neural network above, a single pass of forward propagation translates mathematically to:

**A(A(X Wh) Wo)**

Where

- A is an activation function like ReLU
- X is the input
- Wh and Wo are the weights for the hidden layer and output layer respectively
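To make the nested-function view concrete, here is a minimal NumPy sketch. The numbers in X, Wh and Wo are hypothetical values I picked so the arithmetic is easy to follow, and the function name `forward` is my own.

```python
import numpy as np

def relu(z):
    """ReLU activation: element-wise max(0, z)."""
    return np.maximum(0, z)

def forward(X, Wh, Wo, A=relu):
    """One forward pass written as nested functions: A(A(X Wh) Wo)."""
    return A(A(X @ Wh) @ Wo)

# Hypothetical values, chosen only to make the arithmetic easy to follow.
X = np.array([[1.0], [2.0], [3.0]])   # 3 samples, 1 feature
Wh = np.array([[1.0, -1.0]])          # 1 input -> 2 hidden units
Wo = np.array([[1.0, 0.5],
               [2.0, 1.0]])           # 2 hidden units -> 2 outputs

prediction = forward(X, Wh, Wo)
print(prediction)
```

The inner call A(X Wh) produces the hidden activations, and the outer call turns them into the prediction, which is exactly the nesting in the formula.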

A more complex network can be shown as below

INPUT_LAYER_SIZE = 1

HIDDEN_LAYER_SIZE = 2

OUTPUT_LAYER_SIZE = 2

In matrix form, this is represented as:

| Var | Name | Dimensions |
| --- | --- | --- |
| X | Input | (3, 1) |
| Wh | Hidden weights | (1, 2) |
| Bh | Hidden bias | (1, 2) |
| Zh | Hidden weighted input | (3, 2) |
| H | Hidden activations | (3, 2) |
| Wo | Output weights | (2, 2) |
| Bo | Output bias | (1, 2) |
| Zo | Output weighted input | (3, 2) |
| O | Output activations | (3, 2) |

Source: https://ml-cheatsheet.readthedocs.io/en/latest/forwardpropagation.html
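A quick way to check the dimensions is to run the forward pass and print the shape of every array. This is only a sketch: the random weights, zero biases and the `N_SAMPLES` name are my assumptions, not part of the source. With 3 input rows, the weighted input Zh comes out with the same (3, 2) shape as H, because the (1, 2) bias row is broadcast across the samples.

```python
import numpy as np

INPUT_LAYER_SIZE = 1
HIDDEN_LAYER_SIZE = 2
OUTPUT_LAYER_SIZE = 2
N_SAMPLES = 3  # assumed batch size, matching the 3 rows of X

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

def relu(z):
    return np.maximum(0, z)

X = rng.normal(size=(N_SAMPLES, INPUT_LAYER_SIZE))
Wh = rng.normal(size=(INPUT_LAYER_SIZE, HIDDEN_LAYER_SIZE))
Bh = np.zeros((1, HIDDEN_LAYER_SIZE))
Wo = rng.normal(size=(HIDDEN_LAYER_SIZE, OUTPUT_LAYER_SIZE))
Bo = np.zeros((1, OUTPUT_LAYER_SIZE))

Zh = X @ Wh + Bh   # bias row is broadcast across the 3 samples
H = relu(Zh)
Zo = H @ Wo + Bo
O = relu(Zo)

shapes = {name: arr.shape for name, arr in
          [("X", X), ("Wh", Wh), ("Bh", Bh), ("Zh", Zh), ("H", H),
           ("Wo", Wo), ("Bo", Bo), ("Zo", Zo), ("O", O)]}
for name, shape in shapes.items():
    print(f"{name}: {shape}")
```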

Now, let us explore backpropagation. In backpropagation, we adjust each weight in the network in proportion to how much it contributes to the overall error. We iteratively reduce each weight’s error. Eventually, when the error is minimised, we have a series of weights that produce good predictions.

As we have seen before, forward propagation can be viewed as a series of nested functions. Hence, backpropagation can be seen as the application of the chain rule to find the derivative of the cost with respect to any weight in the network. This derivative represents how much each weight contributes to the overall error and the direction in which to update each weight to reduce the error. The error is then reduced through gradient descent.

The equations needed to make a prediction and calculate total error, or cost:
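As a sketch, the forward equations and the cost can be written as follows. I am assuming a network without biases, ReLU (written R) as the activation, and a squared-error cost; the cost is my assumption, chosen because its derivative with respect to the prediction is the (ŷ − y) term used in the rest of this section.

```latex
\begin{aligned}
Z_h &= X W_h \\
H &= R(Z_h) \\
Z_o &= H W_o \\
\hat{y} &= R(Z_o) \\
C &= \tfrac{1}{2}\,(\hat{y} - y)^2
\end{aligned}
```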

To use these equations:

- We first calculate the output layer error.
- We pass this error back to the hidden layer before it.
- At that hidden layer, we calculate the error (the hidden layer error) and pass the result to the hidden layer before it, and so on.
- At every layer, we calculate the derivative of cost with respect to that layer’s weights.
- The resulting derivative tells us in which direction to adjust our weights to reduce the overall cost. This step is performed using the gradient descent algorithm.
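The last step above, adjusting a weight against its derivative, can be sketched on a single weight. The cost function here is a toy one I made up (minimum at w = 3), and the learning rate and step count are arbitrary choices for illustration.

```python
def cost(w):
    """Toy cost with its minimum at w = 3 (hypothetical, for illustration)."""
    return (w - 3.0) ** 2

def cost_derivative(w):
    """Derivative of the toy cost with respect to w."""
    return 2.0 * (w - 3.0)

def gradient_descent(w, learning_rate=0.1, steps=50):
    """Repeatedly step against the derivative to reduce the cost."""
    for _ in range(steps):
        w = w - learning_rate * cost_derivative(w)
    return w

w_start = 0.0
w_final = gradient_descent(w_start)
print(cost(w_start), cost(w_final))
```

When the derivative is negative the weight is increased, and when it is positive the weight is decreased, so each step moves towards lower cost as long as the learning rate is small enough.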

Hence,

We first calculate the derivative of cost with respect to the output layer input, Zo. This gives us the impact of the final layer’s weights on the overall error in the network. The derivative is:

**C′(Zo) = (ŷ − y) ⋅ R′(Zo)**

Here

(ŷ − y) is the derivative of the cost with respect to the prediction ŷ (for a squared-error cost, ½(ŷ − y)²), and

R′(Zo) represents the derivative of the ReLU activation for the output layer.

This error is represented by Eo, where

**Eo = (ŷ − y) ⋅ R′(Zo)**
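In NumPy, the output layer error is an element-wise product. This is a minimal sketch with made-up numbers for a single sample; the helper names `relu_prime` and `output_layer_error` are my own.

```python
import numpy as np

def relu_prime(z):
    """Derivative of ReLU: 1 where z > 0, otherwise 0."""
    return (z > 0).astype(float)

def output_layer_error(y_hat, y, Zo):
    """Eo = (y_hat - y) * R'(Zo), computed element-wise."""
    return (y_hat - y) * relu_prime(Zo)

# Hypothetical numbers for a single sample with two outputs.
Zo = np.array([[2.0, -1.0]])
y_hat = np.maximum(0, Zo)        # ReLU output: [[2.0, 0.0]]
y = np.array([[1.0, 1.0]])

Eo = output_layer_error(y_hat, y, Zo)
print(Eo)
```

The second output has a negative weighted input, so its ReLU derivative is 0 and no error flows back through it, even though its prediction is wrong.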

Now, to calculate hidden layer error, we need to find the derivative of cost with respect to the hidden layer input, Zh. Following the same logic, this can be represented as

**Eh = Eo ⋅ Wo ⋅ R′(Zh)**

where R′(Zh) represents the derivative of the ReLU activation for the hidden layer.
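When the hidden layer error is written with matrices, the shapes only line up if Wo is transposed and the final product with R′(Zh) is taken element-wise. That reading of the formula is my assumption, and the values below are hypothetical.

```python
import numpy as np

def relu_prime(z):
    """Derivative of ReLU: 1 where z > 0, otherwise 0."""
    return (z > 0).astype(float)

def hidden_layer_error(Eo, Wo, Zh):
    """Eh = (Eo . Wo^T) * R'(Zh); the transpose makes the shapes line up."""
    return (Eo @ Wo.T) * relu_prime(Zh)

# Hypothetical values: 3 samples, 2 hidden units, 2 outputs.
Eo = np.array([[1.0, 0.0],
               [0.0, 2.0],
               [1.0, 1.0]])           # (3, 2) output layer error
Wo = np.array([[1.0, 2.0],
               [3.0, 4.0]])           # (2, 2) output weights
Zh = np.array([[0.5, -1.0],
               [2.0, 3.0],
               [-0.1, 0.2]])          # (3, 2) hidden weighted input

Eh = hidden_layer_error(Eo, Wo, Zh)
print(Eh)
```

Each hidden unit receives the output errors weighted by the connections it feeds, and the ReLU derivative then switches off the units that were inactive in the forward pass.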

This formula is at the core of backpropagation.

- We calculate the current layer’s error
- We pass the weighted error back to the previous layer
- We continue the process through the hidden layers
- Along the way we update the weights using the derivative of cost with respect to each weight.

The derivative of cost with respect to any weight is represented as

**C′(w) = CurrentLayerError ⋅ CurrentLayerInput**

where Input refers to the activation from the previous layer, not the weighted input, Z.
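A sketch of this rule for both layers is shown below, together with a finite-difference check on one weight. The assumptions are mine: a squared-error cost summed over all samples, no biases, and the matrix form "input transposed times error" so that each gradient has the same shape as its weight matrix. The data is made up and kept positive so that every ReLU stays active.

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def relu_prime(z):
    return (z > 0).astype(float)

def cost(X, y, Wh, Wo):
    """Assumed squared-error cost, summed over samples and outputs."""
    y_hat = relu(relu(X @ Wh) @ Wo)
    return 0.5 * np.sum((y_hat - y) ** 2)

def weight_gradients(X, y, Wh, Wo):
    """C'(w) = layer error x layer input, for both layers."""
    Zh = X @ Wh
    H = relu(Zh)
    Zo = H @ Wo
    y_hat = relu(Zo)
    Eo = (y_hat - y) * relu_prime(Zo)
    Eh = (Eo @ Wo.T) * relu_prime(Zh)
    dWo = H.T @ Eo      # input to the output layer is H
    dWh = X.T @ Eh      # input to the hidden layer is X
    return dWh, dWo

# Hypothetical data, all positive so every ReLU stays in its linear region.
X = np.array([[1.0], [2.0], [3.0]])
y = np.array([[1.0, 0.0], [2.0, 1.0], [3.0, 2.0]])
Wh = np.array([[0.5, 0.3]])
Wo = np.array([[0.4, 0.2], [0.1, 0.6]])

dWh, dWo = weight_gradients(X, y, Wh, Wo)

# Finite-difference estimate for one weight, as an independent check.
eps = 1e-6
Wo_plus, Wo_minus = Wo.copy(), Wo.copy()
Wo_plus[0, 1] += eps
Wo_minus[0, 1] -= eps
numeric = (cost(X, y, Wh, Wo_plus) - cost(X, y, Wh, Wo_minus)) / (2 * eps)
print(dWo[0, 1], numeric)
```

The two printed numbers should agree closely, which is a useful sanity check whenever backpropagation is written by hand.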

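Hence, the 3 equations that together form the foundation of backpropagation are

- Output layer error: **Eo = (ŷ − y) ⋅ R′(Zo)**
- Hidden layer error: **Eh = Eo ⋅ Wo ⋅ R′(Zh)**
- Derivative of cost with respect to any weight: **C′(w) = CurrentLayerError ⋅ CurrentLayerInput**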

The process can be visualised as below:
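The same process can also be written out as a short training loop. This is a sketch under my own assumptions (squared-error cost, no biases, a hand-picked learning rate, and made-up data that the network can fit), not a definitive implementation.

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def relu_prime(z):
    return (z > 0).astype(float)

def train_step(X, y, Wh, Wo, learning_rate):
    """One forward pass, one backward pass, one gradient descent update."""
    Zh = X @ Wh
    H = relu(Zh)
    Zo = H @ Wo
    y_hat = relu(Zo)
    cost = 0.5 * np.sum((y_hat - y) ** 2)

    Eo = (y_hat - y) * relu_prime(Zo)           # output layer error
    Eh = (Eo @ Wo.T) * relu_prime(Zh)           # hidden layer error
    new_Wo = Wo - learning_rate * (H.T @ Eo)    # layer error x layer input
    new_Wh = Wh - learning_rate * (X.T @ Eh)
    return new_Wh, new_Wo, cost

# Hypothetical data: both targets are simple multiples of the input.
X = np.array([[1.0], [2.0], [3.0]])
y = np.array([[1.0, 0.5], [2.0, 1.0], [3.0, 1.5]])
Wh = np.array([[0.5, 0.3]])
Wo = np.array([[0.4, 0.2], [0.1, 0.6]])

costs = []
for _ in range(200):
    Wh, Wo, c = train_step(X, y, Wh, Wo, learning_rate=0.01)
    costs.append(c)

print(costs[0], costs[-1])
```

Each pass through the loop follows the steps listed earlier: compute the output layer error, pass it back to the hidden layer, and update every weight using its layer's error and input.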

These equations are not very easy to understand, and I hope you find the simplified explanation useful.

I keep trying to improve my own understanding and to explain them better.

I welcome your comments.

Source: adapted and simplified from the ml cheatsheet
