Stacking layers · NorthGradient

A single layer maps a list of inputs to a list of outputs. Nothing stops us from treating that output as the input to another layer. Chaining layers this way is what makes a network deep, and computing the result from start to finish is the operation a network performs every time it runs.

A network is a chain of layers: each layer’s output is the next layer’s input. Evaluating them in order, from input to output, is the forward pass.

Layers feeding layers

Number the layers $1, 2, \dots, L$. Write the network’s input as $\mathbf{a}^{(0)} = \mathbf{x}$. Then every layer takes the previous layer’s output and applies the same rule from the last lesson:

$$\mathbf{a}^{(l)} = \sigma\!\left( W^{(l)} \mathbf{a}^{(l-1)} + \mathbf{b}^{(l)} \right)$$

Reading the symbols:

$\mathbf{a}^{(0)} = \mathbf{x}$ is the network’s input.
$\mathbf{a}^{(l-1)}$ is the output of the layer before, which serves as the input to layer $l$.
$W^{(l)}$ and $\mathbf{b}^{(l)}$ are the weight matrix and bias vector belonging to layer $l$.
$\mathbf{a}^{(l)}$ is the output of layer $l$ .
$L$ is the number of layers, so $\mathbf{a}^{(L)}$ is the final output of the whole network.
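
The recurrence also fixes the shapes involved. As a sketch, suppose layer $l$ has $n_l$ neurons and the input has $n_0$ entries. The symbol $n_l$ is notation introduced here for illustration and is not used elsewhere in the lesson. Then:

```latex
\mathbf{a}^{(l-1)} \in \mathbb{R}^{n_{l-1}}, \qquad
W^{(l)} \in \mathbb{R}^{n_l \times n_{l-1}}, \qquad
\mathbf{b}^{(l)} \in \mathbb{R}^{n_l}, \qquad
\mathbf{a}^{(l)} \in \mathbb{R}^{n_l}
```

Each row of $W^{(l)}$ holds one neuron’s weights, so its number of columns must match the length of the previous layer’s output. For example, a layer with one neuron reading two values has a $1 \times 2$ weight matrix and a bias vector with one entry.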

The forward pass as composition

Substituting one layer’s equation into the next writes the whole network as a single nested expression. For two layers:

$$\mathbf{a}^{(2)} = \sigma\!\left( W^{(2)} \, \sigma\!\left( W^{(1)} \mathbf{x} + \mathbf{b}^{(1)} \right) + \mathbf{b}^{(2)} \right)$$

Each layer is a function, and the network applies them one after another. A network is therefore function composition: layer 1’s function wrapped inside layer 2’s, and so on. The forward pass is just evaluating this from the inside out, one layer at a time.

The forward pass: the input flows through Layer 1, then Layer 2, with each layer's output feeding the next, evaluated left to right.
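
To make the composition view concrete, here is a minimal sketch rather than the lesson’s own listing. It treats each layer as a function and compares the nested call with a step-by-step loop. The helper names `make_layer` and `forward`, and the weight values, are assumptions chosen only for illustration.

```python
import math

def sigmoid(t):
    return 1 / (1 + math.exp(-t))

def make_layer(W, b):
    # hypothetical helper: returns the function v -> sigma(W v + b)
    def f(v):
        return [sigmoid(sum(w_i * v_i for w_i, v_i in zip(row, v)) + b_j)
                for row, b_j in zip(W, b)]
    return f

def forward(fs, x):
    # evaluate the composition from the inside out: first layer first
    a = x
    for f in fs:
        a = f(a)
    return a

# illustrative weights, not taken from the lesson
f1 = make_layer([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
f2 = make_layer([[1.0, 1.0]], [-1.0])

x = [0.0, 0.0]
nested = f2(f1(x))                # layer 1 wrapped inside layer 2
stepwise = forward([f1, f2], x)   # same layers, applied one at a time
print(nested == stepwise)  # True
```

Both routes perform the same arithmetic in the same order, which is why the loop form extends to any number of layers without changing the result.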

A worked example

Reuse the layer from lesson 3 as layer 1, then add a second layer with one neuron that reads layer 1’s two outputs:

$$W^{(2)} = \begin{bmatrix} 1 & -1 \end{bmatrix}, \qquad \mathbf{b}^{(2)} = \begin{bmatrix} 0.5 \end{bmatrix}$$

In code, one layer function is called twice, the output of the first feeding the second:

import math

def sigmoid(t):
    return 1 / (1 + math.exp(-t))

# apply one layer to an input vector: each row of W is one neuron
def layer(W, b, inp):
    return [sigmoid(sum(w_i * v_i for w_i, v_i in zip(row, inp)) + b_j)
            for row, b_j in zip(W, b)]

# the network input
x = [2.0, 3.0]

# layer 1: two neurons reading the two inputs (same as lesson 3)
W1 = [[0.5, -1.0], [1.0, 0.5]]
b1 = [1.0, -2.0]

# layer 2: one neuron reading layer 1's two outputs
W2 = [[1.0, -1.0]]
b2 = [0.5]

# forward pass: input through layer 1, then that output through layer 2
a1 = layer(W1, b1, x)
a2 = layer(W2, b2, a1)

print(a1)  # [0.2689414213699951, 0.8175744761936437]
print(a2)  # [0.4878441320948962]

The first layer reproduces lesson 3’s output exactly, and the second layer turns those two numbers into the network’s single final value.
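
The printed values can be checked by hand. The worked arithmetic below uses the weights, biases, and input from the listing. The symbol $\mathbf{z}^{(l)}$ is shorthand introduced here for the value inside $\sigma$, and the decimals are rounded to four places.

```latex
\begin{aligned}
\mathbf{z}^{(1)} &= W^{(1)}\mathbf{x} + \mathbf{b}^{(1)}
  = \begin{bmatrix} 0.5(2) - 1(3) + 1 \\ 1(2) + 0.5(3) - 2 \end{bmatrix}
  = \begin{bmatrix} -1 \\ 1.5 \end{bmatrix}, \\
\mathbf{a}^{(1)} &= \sigma\!\left(\mathbf{z}^{(1)}\right)
  \approx \begin{bmatrix} 0.2689 \\ 0.8176 \end{bmatrix}, \\
z^{(2)} &= 1 \cdot a^{(1)}_1 - 1 \cdot a^{(1)}_2 + 0.5 \approx -0.0486, \\
a^{(2)} &= \sigma\!\left(z^{(2)}\right) \approx 0.4878.
\end{aligned}
```

Because $z^{(2)}$ is slightly negative, the final output lands just below $0.5$, the value the sigmoid takes at zero.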

In the next lesson, we will ask what these stacked layers actually represent: the shapes they can carve in the input space, called decision boundaries.