.

The gradient decent approach is used in many algorithms to minimize loss functions. In this introduction we will see how exactly a gradient descent works. In addition, some special features will be pointed out. We will be guided by a practical example.

First of all we need a loss function. A very common approach is

where y are the true observations of the target variable and yhat are the predictions. This function looks lika a squared function - and yes, it is. But if we assume a non linear approach for the prediction of yhat, we end up in a higher degree. In deep neural networks for example, it is a function of weights and input-features which is nonlinear.

So, lets assume that the function looks like the following polynom - as an example:

- f is the loss function
- f' is the first derivation to x and
- f'' is the second derivation to x.

We want to find the minimum. But there are two minima and one maximum. So, firstly we find out where the extreme values are. For this intention, we use the Newton iteration approach. If we set a tangent to an arbitrary point x0 of the green function we can calculate where the linear tangent is zero instead. Through this procedure we iteratively approach the zero points of f' and thus obtain the extrema. This intercept with x is supposed to be x1.

To find the three points of intersection with x, we set x0 close to the suspected zero points which we can see in the function-plot.

x0 = c(1, -1, -4)

j = 1

for(x in x0) {

for (i in 1:20) {

fx = 6 * x ^ 5 + 20 * x ^ 4 + 8 * x ^ 3 + 15 * x ^ 2 + 40 * x + 1

dfx = 30 * x ^ 4 + 80 * x ^ 3 + 24 * x ^ 2 + 30 * x + 40

x = x - fx / dfx

}

assign(paste("x", j, sep = ""), x)

j = j + 1

}

**Results are:**

x1 = -0.03

x2 = -1.45

x3 = -2.9

Now that we know where f has its extrema, we can apply the gradient decent - by the way, for the gradient descent you don't need to know where the extremes are, that's only important for this example. To search for the minima, especially the global minimum,we apply the gradient decent.

The gradient decent is an iterative approach. It starts from initial values and moves downhill. The initial values are usually set randomly, although there are other approaches.

Where x is for example the weight of a neuron relation and

is the partial derivation to a certein weight. The *i* is the iteration index.

Why the negative learning_rate times the derivation? Well, if the derivation is positive, it means that an increase of x goes along with an increase of f and vice versa. So if the derivative is positive, x must be decreased and vice versa.

Lets start from:

x = 2

We set the learning rate to:

learning_rate = .001

for(i in 1:2000){

fx = 6*x^5+20*x^4+8*x^3+15*x^2+40*x+1

x = x - fx*learning_rate

}

**Result:**

-0.03

Here it becomes clear why the initial values (weights) are so important. The minimum found corresponds to x1 and not x3. This is due to the initial value of x = 2. Let us now change it to:

x = -2

for(i in 1:2000){

fx = 6*x^5+20*x^4+8*x^3+15*x^2+40*x+1

x = x - fx*learning_rate

}

**Result:**

-2.9

This is the global minimum we have been searching for.

Now, we change the learning rate to:

learning_rate = .007

And we start from:

x = 2

for(i in 1:2000){

fx = 6*x^5+20*x^4+8*x^3+15*x^2+40*x+1

x = x - fx*learning_rate

}

**Result???**

-2.67

How can such an obviously wrong result come about?

**Difference is:**

0

However, oscillation is not only a phenomenon that cannot be influenced, it occurs due to the chosen learning rate and the shape of the loss function. This learning rate causes the algorithm to jump from one side to the other. To counteract this, the learning rate can be reduced. Again, we start from:

x = 2

Initial learning rate:

learning_rate = .007

Learning rate decay:

decay = learning_rate / 2500; decay0.0000028

checkX = NULL

for(i in 1:2000){

fx = 6*x^5+20*x^4+8*x^3+15*x^2+40*x+1

x = x - fx*(learning_rate - decay * i)

A frequently used approach is to add a momentum term. A momentum term is the change of the value x (weight) from one iteration before:

Where M is the weight of the momentum term. This approach is comparable to a ball rolling down the road. In the turns, it swings back with the swing it has gained. In addition, the ball rolls a bit further on flat surfaces. Lets check what happens with the oscillation when we add a momentum term.

Starting again from:

Learning rate is again:

learning_rate = .007

Momentum weight is set to:

M = .2

M = .2

checkX = NULL

for(i in 1:2000){

for(i in 1:2000){

fx = 6*x^5+20*x^4+8*x^3+15*x^2+40*x+1

x = x - fx*learning_rate + M * x_1

x = x - fx*learning_rate + M * x_1

x_1 = - fx*learning_rate

checkX = c(checkX, x)

}

}

**Result:**

-0.03

As you can see, a minimum was reached, but unfortunately the wrong one. A Momentum term can protect against oscillation, but it is no guarantee for the global minimum.

We can see that the process of optimization of the gradiant decent is highly dependent from the initial values (weights). So, it is possible that the best solution you find when using a method based on a gradient decent - such as a neural network - is not optimal. To prevent from oscillation, one can add a momentum term or change the learning rate. But even that is no guarantee of finding the global minimum.

- Demand proliferates for low-code app development platforms
- US Senate mulling bill on data breach notifications
- AIOps network management requires vendor-buyer teamwork
- Deloitte SAP Industry Cloud apps aim to fill ERP gaps
- Compare 7 headless CMS offerings and their key differences
- 9 steps to a dynamic data architecture plan
- Building trustworthy AI is key for enterprises
- IoT and responsibility: Use digital for good
- 4 zero-trust IoT steps to scale security
- Apply hyperscale data center design principles to IT storage

Posted 27 July 2021

© 2021 TechTarget, Inc. Powered by

Badges | Report an Issue | Privacy Policy | Terms of Service

**Most Popular Content on DSC**

To not miss this type of content in the future, subscribe to our newsletter.

- Book: Applied Stochastic Processes
- Long-range Correlations in Time Series: Modeling, Testing, Case Study
- How to Automatically Determine the Number of Clusters in your Data
- New Machine Learning Cheat Sheet | Old one
- Confidence Intervals Without Pain - With Resampling
- Advanced Machine Learning with Basic Excel
- New Perspectives on Statistical Distributions and Deep Learning
- Fascinating New Results in the Theory of Randomness
- Fast Combinatorial Feature Selection

**Other popular resources**

- Comprehensive Repository of Data Science and ML Resources
- Statistical Concepts Explained in Simple English
- Machine Learning Concepts Explained in One Picture
- 100 Data Science Interview Questions and Answers
- Cheat Sheets | Curated Articles | Search | Jobs | Courses
- Post a Blog | Forum Questions | Books | Salaries | News

**Archives:** 2008-2014 |
2015-2016 |
2017-2019 |
Book 1 |
Book 2 |
More

**Most popular articles**

- Free Book and Resources for DSC Members
- New Perspectives on Statistical Distributions and Deep Learning
- Time series, Growth Modeling and Data Science Wizardy
- Statistical Concepts Explained in Simple English
- Machine Learning Concepts Explained in One Picture
- Comprehensive Repository of Data Science and ML Resources
- Advanced Machine Learning with Basic Excel
- Difference between ML, Data Science, AI, Deep Learning, and Statistics
- Selected Business Analytics, Data Science and ML articles
- How to Automatically Determine the Number of Clusters in your Data
- Fascinating New Results in the Theory of Randomness
- Hire a Data Scientist | Search DSC | Find a Job
- Post a Blog | Forum Questions

## You need to be a member of Data Science Central to add comments!

Join Data Science Central