Neural Networks · Neural Networks · 6 min read
Backpropagation
Gradient descent needs the slope for every weight. But a weight does not touch the loss directly. It affects the sum , which affects the activation , which affects the loss . Backpropagation computes the slope through this chain by multiplying the local slopes along the way.
Backpropagation is the chain rule: a weight’s slope is the product of the local slopes along the path from that weight to the loss.
A chain of dependencies
For one weight feeding one sigmoid neuron with cross-entropy loss, the dependency runs . The chain rule says the overall slope is the product of the slopes of each link:
- is how the sum changes as the weight changes.
- is how the activation changes as the sum changes.
- is how the loss changes as the activation changes.
- Multiplying them gives how the loss changes as the weight changes.
The slopes, and a clean cancellation
Each local slope is known. With , sigmoid activation, and binary cross-entropy:
Here is the input, the prediction, and the true label. Multiplying all three, the terms cancel and leave a strikingly simple result:
The slope for a weight is just the prediction error scaled by the input that weight carried.
Checking it
A numeric finite-difference check confirms the formula. Take , , , and true label :
import math
def sigmoid(t):
return 1 / (1 + math.exp(-t))
# one neuron: input x, weight w, bias b, true label y
x, w, b, y = 2.0, 0.5, -1.0, 1
# the loss as a function of w alone: forward pass, then cross-entropy
def L(w):
a = sigmoid(w * x + b)
return -(y * math.log(a) + (1 - y) * math.log(1 - a))
# the chain-rule result for sigmoid plus cross-entropy
a = sigmoid(w * x + b)
analytic = (a - y) * x
# finite-difference check: the slope from two nearby loss values
eps = 1e-6
numeric = (L(w + eps) - L(w - eps)) / (2 * eps)
print(analytic) # -1.0
print(numeric) # -1.0000000000842668
The chain-rule value and the numerical slope agree, so the formula is right.
Propagating backward through layers
In a deeper network the same rule applies along a longer path: more links, more factors multiplied together. The key efficiency is that the slope arriving at a neuron is reused for every weight feeding it, and is then passed back to the layer before. Computing these shared slopes once and sending them backward, layer by layer, is what the name backpropagation describes. It also assigns credit: each weight’s slope scales with how much that weight contributed to the error.
In the next lesson, we will assemble the forward pass, the loss, these gradients, and the update into the loop that actually trains a network.