Measuring how wrong the network is
A network produces predictions, but to improve it we first need to say how wrong those predictions are, and we need to say it as a single number. That number is the loss. A smaller loss means better predictions, so the entire goal of training becomes one thing: make the loss as small as possible.
A loss function turns the gap between the network’s prediction and the true answer into one number, and learning is the search for weights that make it small.
Cross-entropy for classification
When the network outputs a probability between $0$ and $1$ for a yes-or-no label, the standard loss is binary cross-entropy. For one example:
$$L = -\big[\, y \log \hat{y} + (1 - y) \log(1 - \hat{y}) \,\big]$$
- $y$ is the true label, either $0$ or $1$.
- $\hat{y}$ is the network’s predicted probability that the label is $1$.
- $\log$ is the natural logarithm.
- When $y = 1$, only the first term survives and $L = -\log \hat{y}$. A prediction near $1$ gives a loss near $0$; a prediction near $0$ gives a very large loss.
- When $y = 0$, only the second term survives and $L = -\log(1 - \hat{y})$, which behaves the same way in reverse.
- The leading minus sign keeps $L$ positive, because the logarithm of a number below $1$ is negative.
The loss curves for the two labels show the rule: confident and correct gives almost no loss, while confident and wrong gives a steeply growing loss. At a hesitant prediction near $0.5$, both labels cost roughly the same.
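To see the shape of those curves in numbers, here is a minimal sketch that evaluates the per-example loss at a handful of predictions. The probabilities in `predictions` are illustrative values chosen for this sketch, not outputs of the lesson's network.

```python
import math

def bce(y, p):
    """Binary cross-entropy for one example: true label y, predicted probability p."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

# Illustrative predictions (assumed values, not taken from the lesson's network)
predictions = [0.01, 0.1, 0.5, 0.9, 0.99]

for p in predictions:
    # Print the loss under each possible true label for the same prediction
    print(f"p={p:<5} loss if y=1: {bce(1, p):.4f}   loss if y=0: {bce(0, p):.4f}")
```

With a true label of 1, the loss falls from about 4.6 at a prediction of 0.01 to about 0.01 at a prediction of 0.99, and the two columns mirror each other around 0.5.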
A worked example
Use the network output from lesson 4, $\hat{y} \approx 0.4878$, and check the loss under each possible true label:
import math
# the network's predicted probability that the label is 1 (from lesson 4)
y_hat = 0.4878441320948962
# binary cross-entropy for one example and a given true label
def bce(y, p):
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))
print(bce(1, y_hat)) # 0.7177593255933803 (true label is 1)
print(bce(0, y_hat)) # 0.6691262707697319 (true label is 0)
Both losses sit near $0.693$, which is $-\log 0.5 = \log 2$, exactly what you expect from a prediction that is barely committing either way.
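As a quick check on that baseline, the sketch below compares the loss of a perfectly undecided prediction of 0.5 with the natural logarithm of 2, and then measures how far the lesson 4 output sits from it. The name `baseline` is introduced here for illustration only.

```python
import math

def bce(y, p):
    """Binary cross-entropy for one example: true label y, predicted probability p."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

# Loss of a fully undecided prediction; the label does not matter at p = 0.5
baseline = bce(1, 0.5)
print(baseline, math.log(2))

# The network output from lesson 4
y_hat = 0.4878441320948962

# Signed distance of each loss from the undecided baseline
print(bce(1, y_hat) - baseline)
print(bce(0, y_hat) - baseline)
```

Because the prediction is slightly below 0.5, the loss for a true label of 1 lands a little above the baseline and the loss for a true label of 0 a little below it.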
Averaging over a dataset
Training does not care about one example but about all of them. The loss for the whole dataset is the average of the per-example losses:
$$\bar{L} = -\frac{1}{N} \sum_{i=1}^{N} \big[\, y_i \log \hat{y}_i + (1 - y_i) \log(1 - \hat{y}_i) \,\big]$$
Here $N$ is the number of examples, and $y_i$ and $\hat{y}_i$ are the true label and predicted probability for example $i$. For predicting a number rather than a class, the analogous choice is mean squared error, the average of $(y_i - \hat{y}_i)^2$.
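The averaging step can be sketched in a few lines. The helper names `mean_bce` and `mse` and the toy data are assumptions made up for this example; they are not part of the lesson's network.

```python
import math

def bce(y, p):
    """Binary cross-entropy for one example: true label y, predicted probability p."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def mean_bce(labels, probs):
    """Average the per-example cross-entropy over a dataset."""
    return sum(bce(y, p) for y, p in zip(labels, probs)) / len(labels)

def mse(targets, preds):
    """Mean squared error: the average of the squared differences."""
    return sum((t - q) ** 2 for t, q in zip(targets, preds)) / len(targets)

# Toy classification data (made up for illustration)
labels = [1, 0, 1, 0]
probs = [0.9, 0.2, 0.6, 0.4]
print(mean_bce(labels, probs))

# Toy regression data (made up for illustration)
targets = [2.0, 3.5, 5.0]
preds = [2.5, 3.0, 4.0]
print(mse(targets, preds))  # 0.5
```

Either way the result is a single number for the whole dataset, which is what training will try to reduce.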
In the next lesson, we will use this single number to improve the weights, by stepping each weight in the direction that makes the loss smaller. That procedure is gradient descent.