Measuring how wrong the network is

A network produces predictions, but to improve it we first need to say how wrong those predictions are, and we need to say it as a single number. That number is the loss. A smaller loss means better predictions, so the entire goal of training becomes one thing: make the loss as small as possible.

A loss function turns the gap between the network’s prediction and the true answer into one number, and learning is the search for weights that make it small.

Cross-entropy for classification

When the network outputs a probability $\hat{y}$ between $0$ and $1$ for a yes-or-no label, the standard loss is binary cross-entropy. For one example:

$$L = -\left[\, y \log(\hat{y}) + (1 - y)\log(1 - \hat{y}) \,\right]$$

$y$ is the true label, either $0$ or $1$.
$\hat{y}$ is the network’s predicted probability that the label is $1$.
$\log$ is the natural logarithm.
When $y = 1$, only the first term survives and $L = -\log(\hat{y})$. A prediction near $1$ gives a loss near $0$; a prediction near $0$ gives a very large loss.
When $y = 0$, only the second term survives and $L = -\log(1 - \hat{y})$, which behaves the same way in reverse.
The leading minus sign keeps $L$ non-negative, because the logarithm of a probability, a number between $0$ and $1$, is never positive.

Cross-entropy loss against the predicted probability: for true label 1 the loss falls as the prediction rises toward 1; for true label 0 it rises. The curves cross at 0.5, where both losses equal $\log 2 \approx 0.69$.

The curves show the rule: confident and correct gives almost no loss, while confident and wrong gives a steeply growing loss. At a hesitant prediction near $0.5$, both labels cost roughly the same.
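To make the rule concrete, here is a minimal sketch that evaluates the formula above at a few predictions under both labels. The function name `bce` and the sample probabilities are illustrative choices, not values taken from the network.

```python
import math

def bce(y, p):
    """Binary cross-entropy for one example: true label y, predicted probability p."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

# Hypothetical predictions, from confidently low to confidently high.
predictions = [0.01, 0.1, 0.5, 0.9, 0.99]

print("p      loss if y=1   loss if y=0")
for p in predictions:
    print(f"{p:<6} {bce(1, p):<13.4f} {bce(0, p):.4f}")
```

Reading down the `y=1` column, the loss shrinks toward $0$ as the prediction approaches $1$, and the `y=0` column mirrors it. The sketch assumes $0 < p < 1$ strictly; at exactly $0$ or $1$, `math.log` would be asked for the logarithm of $0$ and raise an error.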

A worked example

Use the network output from lesson 4, $\hat{y} \approx 0.4878$, and check the loss under each possible true label:

```python
import math

# the network's predicted probability that the label is 1 (from lesson 4)
y_hat = 0.4878441320948962

# binary cross-entropy for one example and a given true label
def bce(y, p):
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

print(bce(1, y_hat))  # 0.7177593255933803  (true label is 1)
print(bce(0, y_hat))  # 0.6691262707697319  (true label is 0)
```

Both losses sit near $0.69 \approx \log 2$, exactly what you expect from a prediction that barely commits either way.
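The value $\log 2$ can be checked by hand. Setting $\hat{y} = 0.5$ in the single-example formula gives the same loss whichever label is true:

$$L\big|_{\hat{y} = 0.5} = -\log(0.5) = \log 2 \approx 0.6931 \quad \text{for both } y = 1 \text{ and } y = 0$$

The two computed losses sit on either side of this value, $0.7178 > 0.6931 > 0.6691$, because $\hat{y} \approx 0.4878$ leans very slightly toward label $0$: it is penalized a little more than a coin flip when the true label is $1$ and a little less when it is $0$.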

Averaging over a dataset

Training does not care about one example but about all of them. The loss for the whole dataset is the average of the per-example losses:

$$L = -\frac{1}{N}\sum_{i=1}^{N}\left[\, y_i \log(\hat{y}_i) + (1 - y_i)\log(1 - \hat{y}_i) \,\right]$$

Here $N$ is the number of examples, and $y_i$ and $\hat{y}_i$ are the true label and predicted probability for example $i$. For predicting a number rather than a class, the analogous choice is mean squared error, the average of $(\hat{y}_i - y_i)^2$.
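Both averages can be sketched in a few lines of plain Python. The function names `mean_bce` and `mse` and all the data below are made up for illustration; they are not from the earlier lessons.

```python
import math

def mean_bce(labels, predictions):
    """Average binary cross-entropy over N examples."""
    n = len(labels)
    total = 0.0
    for y, p in zip(labels, predictions):
        # per-example loss, same formula as for a single example
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / n

def mse(targets, predictions):
    """Mean squared error: the average of (prediction - target) squared."""
    n = len(targets)
    return sum((p - t) ** 2 for t, p in zip(targets, predictions)) / n

# Hypothetical classification data: true labels and predicted probabilities.
labels = [1, 0, 1, 0]
probs = [0.9, 0.2, 0.6, 0.4]
print(mean_bce(labels, probs))

# Hypothetical regression data: true values and predicted values.
targets = [2.0, 3.5, 5.0]
outputs = [2.5, 3.0, 5.0]
print(mse(targets, outputs))
```

As a design note, this sketch assumes every predicted probability lies strictly between $0$ and $1$. A prediction of exactly $0$ or $1$ would make `math.log` fail, which is why practical implementations commonly clip probabilities to a small margin before taking the logarithm.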

In the next lesson, we will use this single number to improve the weights, by stepping each weight in the direction that makes the loss smaller. That procedure is gradient descent.