Gradient descent · NorthGradient

The loss is a single number, and it depends on the weights. Change a weight a little and the loss changes a little. Gradient descent is the rule that uses this dependence to improve the weights: nudge each one in the direction that makes the loss go down, then repeat.

To lower the loss, move each weight a small step in the direction opposite to the slope of the loss, and repeat.

Following the slope downhill

For a single weight $w$, the slope of the loss is the derivative $\frac{\partial L}{\partial w}$. If the slope is positive, increasing $w$ raises the loss, so we should decrease $w$, and vice versa. Either way we move against the slope:

$$w \leftarrow w - \eta \, \frac{\partial L}{\partial w}$$

$\frac{\partial L}{\partial w}$ is the slope of the loss with respect to $w$, telling us which way is uphill.
$\eta$ is the learning rate, a small positive number setting the step size.
The minus sign sends us opposite the uphill direction, that is, downhill.
$\leftarrow$ means we replace the old $w$ with this new value.
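To see the sign logic in action, here is a minimal sketch of a single update from each side of a minimum. The toy loss $L(w) = w^2$ (slope $2w$) and the helper names `slope` and `update` are assumptions chosen for illustration, not part of the lesson.

```python
def slope(w):
    # derivative of the assumed toy loss L(w) = w ** 2
    return 2 * w

def update(w, eta=0.1):
    # one gradient descent step: move against the slope
    return w - eta * slope(w)

right = update(2.0)    # slope is positive here, so w should decrease
left = update(-2.0)    # slope is negative here, so w should increase

print(right, left)
```

In both cases the new value lands closer to the minimum at $w = 0$, even though one update subtracts and the other adds.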

With many weights, every weight has its own slope, and the full collection of slopes is the gradient $\nabla_\theta L$. The update is the same idea applied to all weights at once:

$$\theta \leftarrow \theta - \eta \, \nabla_\theta L$$

where $\theta$ stands for all the weights and biases together, and $\nabla_\theta L$ is the vector of their slopes.
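The vector form can be sketched with plain Python lists. The two-parameter toy loss $L(a, b) = (a - 1)^2 + (b + 2)^2$ and the function names `grad` and `step` are assumptions made for this example. Each entry of the gradient is the slope for one parameter, and every parameter is updated with the same rule.

```python
def grad(theta):
    # gradient of the assumed toy loss L(a, b) = (a - 1)**2 + (b + 2)**2
    a, b = theta
    return [2 * (a - 1), 2 * (b + 2)]

def step(theta, eta):
    # apply theta <- theta - eta * gradient, one entry at a time
    g = grad(theta)
    return [t - eta * gi for t, gi in zip(theta, g)]

theta = [0.0, 0.0]     # starting guess for both parameters
for _ in range(100):
    theta = step(theta, eta=0.1)

print([round(t, 4) for t in theta])
```

Under these assumptions the parameters should end up very close to the loss's lowest point at $(1, -2)$.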

Figure: A bowl-shaped loss curve over a weight, with points stepping down from the start toward the minimum, each step moving against the slope and shrinking as the minimum nears.

Notice the steps shrink near the bottom. The slope flattens as the minimum approaches, so the same rule automatically takes smaller steps where smaller steps are needed.
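The shrinking steps can be checked numerically. The sketch below records the size of each move, $|\eta \cdot \text{slope}|$, on an assumed toy loss $L(w) = w^2$. The loss, the starting point, and the name `step_sizes` are illustrative choices, not taken from the lesson.

```python
def slope(w):
    # derivative of the assumed toy loss L(w) = w ** 2, minimum at w = 0
    return 2 * w

w = 4.0       # starting guess, chosen for illustration
eta = 0.1     # fixed learning rate; it never changes below
step_sizes = []

for _ in range(6):
    move = eta * slope(w)          # how far this update shifts w
    step_sizes.append(abs(move))
    w = w - move                   # move against the slope

print([round(s, 4) for s in step_sizes])
```

The learning rate stays fixed throughout, so any shrinkage in the recorded sizes comes entirely from the slope getting flatter.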

A worked example

Suppose the loss depends on one weight as $L(w) = (w - 3)^2$, a bowl with its lowest point at $w = 3$. Its slope is $\frac{\partial L}{\partial w} = 2(w - 3)$. Starting at $w = 0$ with learning rate $\eta = 0.1$:

```python
# a toy loss with a single lowest point at w = 3
def loss(w):
    return (w - 3) ** 2

# the slope of the loss with respect to w
def grad(w):
    return 2 * (w - 3)

w = 0.0      # starting guess for the weight
eta = 0.1    # learning rate (step size)

# take five gradient descent steps
for step in range(5):
    w = w - eta * grad(w)   # move against the slope
    print(round(w, 4))

# printed output:
# 0.6
# 1.08
# 1.464
# 1.7712
# 2.017
```

Each step moves $w$ closer to $3$, and the moves get smaller as the slope flattens. Run it longer and $w$ converges to the minimum at $w = 3$.
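For this particular loss, the shrinking can be worked out by hand. Writing $w_t$ for the weight after $t$ steps (notation introduced here, not used elsewhere in the lesson) and substituting the slope $2(w_t - 3)$ into the update rule gives:

$$
w_{t+1} - 3 = w_t - \eta \cdot 2(w_t - 3) - 3 = (1 - 2\eta)(w_t - 3) = 0.8 \, (w_t - 3)
$$

So with $\eta = 0.1$ the distance to the minimum is multiplied by $0.8$ at every step. Starting from $w_0 = 0$, this means $w_t = 3 - 3 \cdot 0.8^t$, and for $t = 5$ that is $3 - 3 \cdot 0.32768 = 2.01696$, which agrees with the last printed value, $2.017$, after rounding.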

The one piece still missing is how to get $\frac{\partial L}{\partial w}$ for every weight in a real network, where the loss depends on each weight through many layers. In the next lesson, backpropagation will compute exactly these slopes using the chain rule.