NorthGradient


Measuring how wrong the network is

A network produces predictions, but to improve it we first need to say how wrong those predictions are, and we need to say it as a single number. That number is the loss. A smaller loss means better predictions, so the entire goal of training becomes one thing: make the loss as small as possible.

A loss function turns the gap between the network’s prediction and the true answer into one number, and learning is the search for weights that make it small.

Cross-entropy for classification

When the network outputs a probability $\hat{y}$ between $0$ and $1$ for a yes-or-no label, the standard loss is binary cross-entropy. For one example:

$$L = -\left[\, y \log(\hat{y}) + (1 - y)\log(1 - \hat{y}) \,\right]$$
  • $y$ is the true label, either $0$ or $1$.
  • $\hat{y}$ is the network’s predicted probability that the label is $1$.
  • $\log$ is the natural logarithm.
  • When $y = 1$, only the first term survives and $L = -\log(\hat{y})$. A prediction near $1$ gives a loss near $0$; a prediction near $0$ gives a very large loss.
  • When $y = 0$, only the second term survives and $L = -\log(1 - \hat{y})$, which behaves the same way in reverse.
  • The leading minus sign keeps $L$ non-negative, because the logarithm of a probability (a number between $0$ and $1$) is never positive.
Cross-entropy loss against the predicted probability: for true label 1 the loss falls as the prediction rises toward 1; for true label 0 it rises. The curves cross near 0.5.

The curves show the rule: confident and correct gives almost no loss, while confident and wrong gives a steeply growing loss. At a hesitant prediction near $0.5$, both labels cost roughly the same.
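To put numbers on that rule, the sketch below evaluates the single-example formula at a few predicted probabilities. The function name `cross_entropy` and the sample probabilities are illustrative choices, not values taken from the lesson.

```python
import math

# single-example binary cross-entropy (same formula as above);
# the name cross_entropy is an illustrative choice
def cross_entropy(y, p):
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

# illustrative probabilities, from confidently "0" to confidently "1"
probs = [0.01, 0.25, 0.5, 0.75, 0.99]

for p in probs:
    # loss if the true label is 1, then loss if the true label is 0
    print(f"p={p:.2f}  label 1: {cross_entropy(1, p):.4f}  label 0: {cross_entropy(0, p):.4f}")
```

At a prediction of 0.99 the loss is about 0.01 when the true label is 1 and about 4.61 when it is 0. At 0.5 both labels give the same loss of about 0.69.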

A worked example

Use the network output from lesson 4, $\hat{y} \approx 0.4878$, and check the loss under each possible true label:

import math

# the network's predicted probability that the label is 1 (from lesson 4)
y_hat = 0.4878441320948962

# binary cross-entropy for one example and a given true label
def bce(y, p):
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

print(bce(1, y_hat))  # 0.7177593255933803  (true label is 1)
print(bce(0, y_hat))  # 0.6691262707697319  (true label is 0)

Both losses sit near $\log 2 \approx 0.693$, the loss that a prediction of exactly $0.5$ receives under either label. That is what you expect from a prediction that barely commits either way.
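One practical edge case is worth noting. The `bce` function above calls `math.log`, which raises `ValueError` when its argument is 0, so a prediction of exactly 0 or 1 cannot be scored. A common workaround is to clip the probability into a slightly narrower range first. This is sketched here as an assumption, not something the lesson prescribes, and the name `bce_clipped` and the default value of `eps` are hypothetical.

```python
import math

# hypothetical helper: binary cross-entropy with the probability clipped
# into [eps, 1 - eps] so that math.log never receives 0
def bce_clipped(y, p, eps=1e-12):
    p = min(max(p, eps), 1 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

print(bce_clipped(1, 0.0))  # large but finite instead of an error
print(bce_clipped(1, 1.0))  # very close to 0
```

For predictions well inside the open interval, such as the 0.4878 used above, clipping changes nothing and the result matches `bce`.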

Averaging over a dataset

Training is concerned with all the examples, not just one. The loss for the whole dataset is the average of the per-example losses:

$$L = -\frac{1}{N}\sum_{i=1}^{N}\left[\, y_i \log(\hat{y}_i) + (1 - y_i)\log(1 - \hat{y}_i) \,\right]$$

Here $N$ is the number of examples, and $y_i$ and $\hat{y}_i$ are the true label and predicted probability for example $i$. For predicting a number rather than a class, the analogous choice is mean squared error, the average of $(\hat{y}_i - y_i)^2$.
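As a minimal sketch, both averages can be written in a few lines of plain Python. The function names `mean_bce` and `mse`, and the small lists of labels and predictions, are made up for illustration and do not come from the lesson.

```python
import math

# mean binary cross-entropy over a dataset, following the averaged formula
def mean_bce(ys, ps):
    n = len(ys)
    total = sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                for y, p in zip(ys, ps))
    return -total / n

# mean squared error for numeric targets
def mse(ys, preds):
    return sum((p - y) ** 2 for y, p in zip(ys, preds)) / len(ys)

# made-up labels and predicted probabilities, purely for illustration
labels = [1, 0, 1, 0]
probs = [0.9, 0.2, 0.6, 0.4]

print(mean_bce(labels, probs))
print(mse([2.0, 3.0], [2.5, 2.0]))  # (0.25 + 1.0) / 2 = 0.625
```

Each function returns a single number for the whole dataset, which is the quantity that training tries to reduce.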

In the next lesson, we will use this single number to improve the weights, by stepping each weight in the direction that makes the loss smaller. That procedure is gradient descent.