The training loop · NorthGradient

Every piece is now in place. The forward pass produces a prediction, the loss measures its error, backpropagation gives the slope for each weight, and gradient descent updates the weights. The training loop simply runs these in order and repeats until the loss stops falling.

Training repeats four steps (forward pass, loss, gradients, update) over the data until the loss settles.

One step

A single training step chains the previous lessons together in four stages:

Forward pass: run the inputs through the network to get predictions (lessons 1 to 4).
Compute loss: measure how wrong those predictions are (lesson 6).
Backprop: compute the slope of the loss for every weight (lesson 8).
Update: move every weight a small step against its slope (lesson 7).
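
The four stages above can be sketched as one function for a single sigmoid neuron. This is a minimal illustration, not the course's reference code: the name train_step is hypothetical, and the loss is assumed to be binary cross-entropy, which is the loss for which the error signal on a sigmoid output is $(a - y)$.

```python
import math

def sigmoid(t):
    return 1 / (1 + math.exp(-t))

def train_step(w, b, x, y, eta):
    """One update on a single example (x, y). Returns new weights, new bias, and the loss."""
    # 1. forward pass: weighted sum, then activation
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    a = sigmoid(z)
    # 2. loss: binary cross-entropy (an assumption, see the note above)
    loss = -(y * math.log(a) + (1 - y) * math.log(1 - a))
    # 3. backprop: slope of the loss for each weight and for the bias
    err = a - y
    grad_w = [err * xi for xi in x]
    grad_b = err
    # 4. update: step against the slope
    new_w = [wi - eta * gi for wi, gi in zip(w, grad_w)]
    new_b = b - eta * grad_b
    return new_w, new_b, loss

# two steps on the same example: the loss should fall
w, b, loss_before = train_step([0.0, 0.0], 0.0, (1, 1), 1, eta=0.5)
_, _, loss_after = train_step(w, b, (1, 1), 1, eta=0.5)
print(loss_after < loss_before)  # True
```

The loss value is not needed for the update itself; it is returned only so that progress can be monitored.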

Batches and epochs

Running all four steps on the whole dataset at once is one option, but data is usually split into smaller groups. A batch is the set of examples used for one update. One epoch is one full pass through all the batches, so every example has been seen once. Training runs for many epochs, and the gradient for each update is averaged over its batch:

$$\theta \leftarrow \theta - \eta \cdot \frac{1}{m}\sum_{i=1}^{m} \nabla_\theta L_i$$

where $m$ is the number of examples in the batch, $L_i$ is the loss on example $i$, $\eta$ is the learning rate, and $\theta$ stands for all the weights and biases.
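
To make the bookkeeping concrete, here is a small sketch of how batches and epochs relate, followed by one averaged update with made-up numbers. The helper name batches, the batch size of 4, and the shuffling before each epoch are assumptions for illustration; shuffling is common practice but is not required by the definitions above.

```python
import random

def batches(examples, batch_size, rng):
    """Yield shuffled batches that together cover every example exactly once."""
    order = list(examples)
    rng.shuffle(order)
    for start in range(0, len(order), batch_size):
        yield order[start:start + batch_size]

data = list(range(10))   # ten stand-in examples
rng = random.Random(0)   # fixed seed so runs are repeatable
updates = 0
for epoch in range(3):
    seen = []
    for batch in batches(data, 4, rng):
        seen.extend(batch)
        updates += 1     # one weight update per batch
    assert sorted(seen) == data  # one epoch sees every example once
print(updates)  # 9: three batches per epoch (4 + 4 + 2 examples), three epochs

# One averaged update for a single parameter, using hypothetical gradients.
theta, eta = 1.0, 0.5
grads = [0.25, -0.5, 0.75, 0.5]         # per-example gradients, m = 4
theta -= eta * sum(grads) / len(grads)  # mean gradient is 0.25
print(theta)  # 0.875
```

Note that the last batch of an epoch can be smaller than the rest when the dataset size is not a multiple of the batch size; the average then uses that batch's own $m$.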

A complete example

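Here is a single neuron learning the AND function from scratch. It uses the error signal $(a - y)$ from lesson 8 to build the gradient for each weight, and it treats the whole dataset as one batch, so each epoch performs exactly one update: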

import math

def sigmoid(t):
    return 1 / (1 + math.exp(-t))

# training data for the AND function (output 1 only when both inputs are 1)
X = [(0, 0), (0, 1), (1, 0), (1, 1)]
Y = [0, 0, 0, 1]

# parameters to learn: two weights and a bias
w = [0.0, 0.0]
b = 0.0
eta = 0.5  # learning rate

# train for many epochs over the whole dataset
for epoch in range(2000):
    gw = [0.0, 0.0]
    gb = 0.0
    for (x1, x2), y in zip(X, Y):
        a = sigmoid(w[0] * x1 + w[1] * x2 + b)  # forward pass
        err = a - y                             # error signal from lesson 8
        gw[0] += err * x1                       # gradient for weight 0
        gw[1] += err * x2                       # gradient for weight 1
        gb    += err                            # gradient for the bias
    n = len(X)
    w[0] -= eta * gw[0] / n                      # gradient-descent update
    w[1] -= eta * gw[1] / n
    b    -= eta * gb / n

# check the trained neuron on each input
for (x1, x2), y in zip(X, Y):
    a = sigmoid(w[0] * x1 + w[1] * x2 + b)
    print((x1, x2), round(a, 3), "->", 1 if a > 0.5 else 0)
# (0, 0) 0.0 -> 0
# (0, 1) 0.02 -> 0
# (1, 0) 0.02 -> 0
# (1, 1) 0.972 -> 1

Starting from all zeros, the loop alone discovers weights that classify AND correctly. The same four steps, scaled up to millions of weights and examples, are at the core of how neural networks are trained in practice.
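
The example above runs for a fixed 2000 epochs. As a variation, the loop can instead watch the loss and stop once it has settled. The sketch below is one possible way to do that, under stated assumptions: the loss is taken to be binary cross-entropy, and the function name train_until_settled, the tolerance tol, and the cap max_epochs are all hypothetical choices, not values from the earlier lessons.

```python
import math

def sigmoid(t):
    return 1 / (1 + math.exp(-t))

X = [(0, 0), (0, 1), (1, 0), (1, 1)]
Y = [0, 0, 0, 1]

def train_until_settled(eta=0.5, tol=1e-5, max_epochs=10000):
    """Full-batch training that stops when the mean loss improves by less than tol."""
    w, b = [0.0, 0.0], 0.0
    history = []  # mean loss at the start of each epoch
    for epoch in range(max_epochs):
        gw, gb, total = [0.0, 0.0], 0.0, 0.0
        for (x1, x2), y in zip(X, Y):
            a = sigmoid(w[0] * x1 + w[1] * x2 + b)                      # forward pass
            total += -(y * math.log(a) + (1 - y) * math.log(1 - a))    # assumed loss
            err = a - y                                                 # error signal
            gw[0] += err * x1
            gw[1] += err * x2
            gb += err
        n = len(X)
        history.append(total / n)
        # stop once the loss has (nearly) stopped falling
        if len(history) > 1 and history[-2] - history[-1] < tol:
            break
        w[0] -= eta * gw[0] / n                                         # update
        w[1] -= eta * gw[1] / n
        b -= eta * gb / n
    return w, b, history

w, b, history = train_until_settled()
print(len(history), "epochs, final mean loss", round(history[-1], 4))
```

A smaller tol trains longer and pushes the outputs closer to 0 and 1; a larger one stops sooner with less confident predictions.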

In the final lesson, a short quiz checks the whole chain, from what one neuron computes to how a network learns.