NorthGradient

Backpropagation

Gradient descent needs the slope $\frac{\partial L}{\partial w}$ for every weight. But a weight does not touch the loss directly. It affects the sum $z$, which affects the activation $a$, which affects the loss $L$. Backpropagation computes the slope through this chain by multiplying the local slopes along the way.

Backpropagation is the chain rule: a weight’s slope is the product of the local slopes along the path from that weight to the loss.

A chain of dependencies

For one weight feeding one sigmoid neuron with cross-entropy loss, the dependency runs $w \to z \to a \to L$. The chain rule says the overall slope is the product of the slopes of each link:

$$\frac{\partial L}{\partial w} = \frac{\partial L}{\partial a} \cdot \frac{\partial a}{\partial z} \cdot \frac{\partial z}{\partial w}$$

  • $\frac{\partial z}{\partial w}$ is how the sum changes as the weight changes.
  • $\frac{\partial a}{\partial z}$ is how the activation changes as the sum changes.
  • $\frac{\partial L}{\partial a}$ is how the loss changes as the activation changes.
  • Multiplying them gives how the loss changes as the weight changes.
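The product rule above can be seen numerically without knowing any derivative formulas. The following is a minimal sketch: it estimates each local slope with a central difference and compares their product to the slope of the whole chain measured directly. The input values and the helper names (`slope`, `z_of_w`, `a_of_z`, `L_of_a`) are assumptions chosen for illustration, not part of the lesson.

```python
import math

def sigmoid(t):
    return 1 / (1 + math.exp(-t))

def slope(f, t, eps=1e-6):
    """Central-difference estimate of df/dt at the point t."""
    return (f(t + eps) - f(t - eps)) / (2 * eps)

# illustrative values (assumptions, not taken from the lesson)
x, w, b, y = 1.5, 0.8, 0.2, 0

# each link of the chain w -> z -> a -> L as a function of one variable
def z_of_w(w_):
    return w_ * x + b

def a_of_z(z_):
    return sigmoid(z_)

def L_of_a(a_):
    return -(y * math.log(a_) + (1 - y) * math.log(1 - a_))

# forward pass, so each local slope is measured at the right point
z = z_of_w(w)
a = a_of_z(z)

# the three local slopes
dz_dw = slope(z_of_w, w)
da_dz = slope(a_of_z, z)
dL_da = slope(L_of_a, a)

# chain rule: multiply the local slopes
chain_product = dL_da * da_dz * dz_dw

# slope of the whole chain, measured in one step
direct = slope(lambda w_: L_of_a(a_of_z(z_of_w(w_))), w)

print(chain_product)
print(direct)
```

The two printed numbers should match to many decimal places; any small gap comes from the finite-difference approximation, not from the chain rule.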

The slopes, and a clean cancellation

Each local slope is known. With $z = wx + b$, sigmoid activation, and binary cross-entropy:

$$\frac{\partial z}{\partial w} = x, \qquad \frac{\partial a}{\partial z} = a(1 - a), \qquad \frac{\partial L}{\partial a} = \frac{a - y}{a(1 - a)}$$

Here $x$ is the input, $a$ the prediction, and $y$ the true label. Multiplying all three, the $a(1 - a)$ terms cancel and leave a strikingly simple result:

$$\frac{\partial L}{\partial w} = (a - y)\,x$$

The slope for a weight is just the prediction error $(a - y)$ scaled by the input that weight carried.
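For readers who want the intermediate steps, the derivation can be written out as follows. This assumes the sigmoid $a = 1/(1 + e^{-z})$ and the binary cross-entropy with natural logarithm, $L = -\left[y \ln a + (1 - y)\ln(1 - a)\right]$, which is the same loss the numeric check in the next section uses.

```latex
\begin{aligned}
\frac{\partial a}{\partial z}
  &= \frac{e^{-z}}{(1 + e^{-z})^2}
   = \frac{1}{1 + e^{-z}} \cdot \frac{e^{-z}}{1 + e^{-z}}
   = a(1 - a) \\[4pt]
\frac{\partial L}{\partial a}
  &= -\frac{y}{a} + \frac{1 - y}{1 - a}
   = \frac{-y(1 - a) + (1 - y)\,a}{a(1 - a)}
   = \frac{a - y}{a(1 - a)} \\[4pt]
\frac{\partial L}{\partial w}
  &= \frac{a - y}{a(1 - a)} \cdot a(1 - a) \cdot x
   = (a - y)\,x
\end{aligned}
```

The cancellation is specific to this pairing of sigmoid and cross-entropy; with a different loss or activation the three factors generally do not simplify this neatly.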

The forward chain w to z to a to L on top, with the gradient flowing backward underneath as a product of the local slopes, multiplying to (a minus y) times x.

Checking it

A numeric finite-difference check confirms the formula. Take $x = 2$, $w = 0.5$, $b = -1$, and true label $y = 1$:

import math

def sigmoid(t):
    return 1 / (1 + math.exp(-t))

# one neuron: input x, weight w, bias b, true label y
x, w, b, y = 2.0, 0.5, -1.0, 1

# the loss as a function of w alone: forward pass, then cross-entropy
def L(w):
    a = sigmoid(w * x + b)
    return -(y * math.log(a) + (1 - y) * math.log(1 - a))

# the chain-rule result for sigmoid plus cross-entropy
a = sigmoid(w * x + b)
analytic = (a - y) * x

# finite-difference check: the slope from two nearby loss values
eps = 1e-6
numeric = (L(w + eps) - L(w - eps)) / (2 * eps)

print(analytic)  # -1.0
print(numeric)   # -1.0000000000842668

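The chain-rule value and the numerical slope agree to about ten decimal places, which is strong evidence that the formula is right. The analytic value is easy to confirm by hand: $z = 0.5 \cdot 2 - 1 = 0$, so $a = 0.5$ and $(a - y)\,x = (0.5 - 1) \cdot 2 = -1$.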

Propagating backward through layers

In a deeper network the same rule applies along a longer path: more links, more factors multiplied together. The key efficiency is that the slope arriving at a neuron is reused for every weight feeding it, and is then passed back to the layer before. Computing these shared slopes once and sending them backward, layer by layer, is what the name backpropagation describes. It also assigns credit: each weight’s slope scales with how much that weight contributed to the error.
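The reuse described above can be sketched on a very small network. The example below is an illustration under stated assumptions, not the lesson's own implementation: a hypothetical network with two inputs, two sigmoid hidden units, and one sigmoid output trained with binary cross-entropy. All weights, inputs, and names (`W1`, `W2`, `delta_out`, `delta_hidden`) are made up for the example.

```python
import math

def sigmoid(t):
    return 1 / (1 + math.exp(-t))

# hypothetical tiny network: 2 inputs -> 2 sigmoid hidden units -> 1 sigmoid output
# (all numbers below are illustrative assumptions)
x = [1.0, -2.0]
W1 = [[0.3, -0.1], [0.5, 0.2]]  # W1[j][i]: weight from input i to hidden unit j
b1 = [0.0, 0.1]
W2 = [0.7, -0.4]                # W2[j]: weight from hidden unit j to the output
b2 = 0.05
y = 1

def forward(W1, W2):
    # hidden activations, then the output prediction
    h = [sigmoid(W1[j][0] * x[0] + W1[j][1] * x[1] + b1[j]) for j in range(2)]
    a = sigmoid(W2[0] * h[0] + W2[1] * h[1] + b2)
    return h, a

def loss(W1, W2):
    _, a = forward(W1, W2)
    return -(y * math.log(a) + (1 - y) * math.log(1 - a))

h, a = forward(W1, W2)

# slope arriving at the output neuron's sum: the (a - y) from the cancellation
delta_out = a - y

# that one slope is reused for every weight feeding the output neuron
grad_W2 = [delta_out * h[j] for j in range(2)]

# pass it back: through W2[j], then through the hidden sigmoid's local slope
delta_hidden = [delta_out * W2[j] * h[j] * (1 - h[j]) for j in range(2)]

# each hidden slope is again reused for every weight feeding that hidden unit
grad_W1 = [[delta_hidden[j] * x[i] for i in range(2)] for j in range(2)]

# finite-difference check on one first-layer weight, W1[0][1]
eps = 1e-6
W1_plus = [row[:] for row in W1]
W1_minus = [row[:] for row in W1]
W1_plus[0][1] += eps
W1_minus[0][1] -= eps
numeric = (loss(W1_plus, W2) - loss(W1_minus, W2)) / (2 * eps)

print(grad_W1[0][1])
print(numeric)
```

Note that `delta_out` is computed once and used three times: twice for the output weights and once more when it is sent back to form `delta_hidden`. In a wider or deeper network that sharing is where the savings come from.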

In the next lesson, we will assemble the forward pass, the loss, these gradients, and the update into the loop that actually trains a network.