NorthGradient


Gradient descent

The loss is a single number, and it depends on the weights. Change a weight a little and the loss changes a little. Gradient descent is the rule that uses this dependence to improve the weights: nudge each one in the direction that makes the loss go down, then repeat.

To lower the loss, move each weight a small step in the direction opposite to the slope of the loss, and repeat.

Following the slope downhill

For a single weight $w$, the slope of the loss is the derivative $\frac{\partial L}{\partial w}$. If the slope is positive, increasing $w$ raises the loss, so we should decrease $w$, and vice versa. Either way we move against the slope:

$$w \leftarrow w - \eta \, \frac{\partial L}{\partial w}$$

  • $\frac{\partial L}{\partial w}$ is the slope of the loss with respect to $w$, telling us which way is uphill.
  • $\eta$ is the learning rate, a small positive number setting the step size.
  • The minus sign sends us opposite the uphill direction, that is, downhill.
  • $\leftarrow$ means we replace the old $w$ with this new value.
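The sign logic in the update can be checked with a small sketch. The toy loss $L(w) = w^2$ (slope $2w$, minimum at $w = 0$) and the helper names `slope` and `update` are assumptions chosen only for illustration; they are not part of the lesson.

```python
# Hypothetical toy loss L(w) = w**2, used only for illustration.
# Its slope is dL/dw = 2 * w and its minimum sits at w = 0.
def slope(w):
    return 2 * w

def update(w, eta=0.1):
    """One gradient descent step: w <- w - eta * dL/dw."""
    return w - eta * slope(w)

w_right = update(2.0)    # slope is positive here, so w decreases
w_left = update(-2.0)    # slope is negative here, so w increases

print(w_right, w_left)
```

From either side of the minimum, the same rule moves $w$ toward it: the minus sign flips whichever direction the slope points.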

With many weights, every weight has its own slope, and the full collection of slopes is the gradient $\nabla_\theta L$. The update is the same idea applied to all weights at once:

$$\theta \leftarrow \theta - \eta \, \nabla_\theta L$$

where $\theta$ stands for all the weights and biases together, and $\nabla_\theta L$ is the vector of their slopes.
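As a rough sketch of the all-weights-at-once update, the code below uses plain Python lists for $\theta$ and its gradient. The two-weight loss, the step count, and the function names `gradient` and `step` are hypothetical, picked only to keep the example small.

```python
# Hypothetical two-weight loss, used only for illustration:
# L(theta) = (theta[0] - 3)**2 + (theta[1] + 1)**2, minimum at [3, -1].
def loss(theta):
    return (theta[0] - 3) ** 2 + (theta[1] + 1) ** 2

def gradient(theta):
    # One slope per weight, in the same order as theta.
    return [2 * (theta[0] - 3), 2 * (theta[1] + 1)]

def step(theta, eta):
    # theta <- theta - eta * gradient, applied to every weight at once.
    g = gradient(theta)
    return [t - eta * g_i for t, g_i in zip(theta, g)]

theta = [0.0, 0.0]   # starting guess for both weights
eta = 0.1            # learning rate
losses = [loss(theta)]
for _ in range(50):
    theta = step(theta, eta)
    losses.append(loss(theta))

print(theta, losses[-1])
```

Each weight is updated using only its own slope, yet all of them move in the same step, which is what the vector form of the rule expresses.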

A bowl-shaped loss curve over a weight, with points stepping down from the start toward the minimum, each step moving against the slope and shrinking as the minimum nears.

Notice the steps shrink near the bottom. The slope flattens as the minimum approaches, so the same rule automatically takes smaller steps where smaller steps are needed.
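One way to write this down, assuming a fixed learning rate $\eta$: the distance moved in a single update is the learning rate times the magnitude of the slope,

$$|\Delta w| = \eta \left| \frac{\partial L}{\partial w} \right|$$

so as the slope tends to zero near the minimum, the step length tends to zero with it, without any change to $\eta$.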

A worked example

Suppose the loss depends on one weight as $L(w) = (w - 3)^2$, a bowl with its lowest point at $w = 3$. Its slope is $\frac{\partial L}{\partial w} = 2(w - 3)$. Starting at $w = 0$ with learning rate $\eta = 0.1$:

```python
# a toy loss with a single lowest point at w = 3
def loss(w):
    return (w - 3) ** 2

# the slope of the loss with respect to w
def grad(w):
    return 2 * (w - 3)

w = 0.0      # starting guess for the weight
eta = 0.1    # learning rate (step size)

# take five gradient descent steps
for step in range(5):
    w = w - eta * grad(w)   # move against the slope
    print(round(w, 4))
# 0.6
# 1.08
# 1.464
# 1.7712
# 2.017
```
Each step moves $w$ closer to $3$, and the moves get smaller as the slope flattens. Run it longer and $w$ settles at the minimum.
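To see the "run it longer" claim concretely, here is a minimal sketch that keeps stepping until the move becomes negligible. The stopping threshold `tolerance` and the safety cap `max_steps` are assumptions added for this illustration. For this particular loss, substituting the slope into the update gives $w - 3 \leftarrow (1 - 2\eta)(w - 3) = 0.8\,(w - 3)$, so each step shrinks the remaining distance to the minimum by a fixed factor.

```python
def grad(w):
    return 2 * (w - 3)   # slope of L(w) = (w - 3)**2

w = 0.0             # same starting guess as the worked example
eta = 0.1           # same learning rate
tolerance = 1e-8    # hypothetical threshold: stop once moves are this small
max_steps = 1000    # hypothetical safety cap on the number of steps

steps_taken = 0
for _ in range(max_steps):
    move = eta * grad(w)
    if abs(move) < tolerance:
        break       # the slope is nearly flat, so further moves are negligible
    w = w - move    # move against the slope
    steps_taken += 1

print(steps_taken, w)
```

The loop stops well before the cap, with $w$ within a tiny distance of $3$.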

The one piece still missing is how to get $\frac{\partial L}{\partial w}$ for every weight in a real network, where the loss depends on each weight through many layers. In the next lesson, backpropagation will compute exactly these slopes using the chain rule.