Why it works · NorthGradient

The transformer’s advantages are not incidental. They follow directly from the decision to replace recurrence with attention. Three properties explain most of the performance gain: shorter paths between tokens, full parallelism during training, and the deliberate removal of architectural assumptions that limited earlier models.

Shorter paths between distant tokens

Figure: Top, an RNN processes ten words in a sequential chain, labeled "path length O(n) = 10 steps". Bottom, a transformer connects the first and last word directly with a single arc, labeled "path length O(1) = 1 step". A note reads: "shorter path = easier to learn long-range relationships".

In an RNN, information from the first token must travel through every intermediate token to reach the last one. For a sequence of $n$ tokens, the path length between the two most distant tokens is $O(n)$. In the transformer, every token is directly connected to every other token through attention, so within a single attention layer the path length between any two tokens is $O(1)$, regardless of how far apart they are in the sequence. Shorter paths mean gradients flow more directly during training, making it easier for the model to learn dependencies between distant words.
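The path-length argument can be checked on a toy graph. The sketch below is only an illustration, not part of any transformer implementation: it models each architecture as a graph over token positions and counts the hops between the first and last position with a breadth-first search. The helper names (`shortest_path`, `rnn_neighbors`, `attention_neighbors`) and the sequence length are assumptions made for the example.

```python
from collections import deque


def shortest_path(neighbors, start, goal):
    """Breadth-first search: number of edges on the shortest path from start to goal."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            return dist[node]
        for nxt in neighbors(node):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    raise ValueError("goal is unreachable")


def rnn_neighbors(n):
    # Recurrent chain: position t hands information only to position t + 1.
    return lambda t: [t + 1] if t + 1 < n else []


def attention_neighbors(n):
    # Self-attention: every position can read every other position directly.
    return lambda t: [u for u in range(n) if u != t]


n = 10  # hypothetical sequence length
rnn_hops = shortest_path(rnn_neighbors(n), 0, n - 1)
attn_hops = shortest_path(attention_neighbors(n), 0, n - 1)
print(rnn_hops, attn_hops)  # prints "9 1"
```

For the chain, the hop count is $n - 1$ and grows linearly with the sequence; for the fully connected graph it stays at 1 whatever value `n` takes.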

Full parallelism during training

Figure: Left panel, "RNN training": five words in a vertical sequential chain, with time steps 1 to 5 labeled and a clock marked "must wait"; a tortoise sits below. Right panel, "Transformer training": five words in a horizontal row, all processed simultaneously, with a checkmark and a fast arrow below.

RNN training is inherently sequential: the hidden state at step $t$ depends on the hidden state at step $t-1$, so no step can begin before the previous one finishes. The transformer has no such dependency. Because attention operates on the full sequence at once using matrix operations, all positions are processed in parallel on modern hardware. A sentence of 100 tokens therefore needs the same number of sequential steps per layer as a sentence of 10 tokens, even though the total amount of computation still grows with length (quadratically, in the case of the attention scores). As long as the hardware can exploit that parallelism, training is much faster in wall-clock terms, which makes it practical to scale to much larger datasets and model sizes.
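A small sketch can make the dependency difference concrete. It uses scalar "tokens" and a toy scalar attention with no learned weights, both simplifications assumed purely for illustration. The names `rnn_states`, `attention_row` and `score_count` are hypothetical. The point is that each attention output is a function of the inputs alone, so positions can be evaluated in any order (or all at once), whereas each RNN state needs the one before it.

```python
import math


def rnn_states(xs, w_h=0.5, w_x=1.0):
    """Toy scalar RNN: h_t depends on h_{t-1}, so the loop must run in order."""
    h, states = 0.0, []
    for x in xs:
        h = math.tanh(w_h * h + w_x * x)
        states.append(h)
    return states


def attention_row(xs, i):
    """Toy scalar self-attention output for position i, computed from the inputs alone."""
    scores = [xs[i] * x for x in xs]            # one score per (i, j) pair
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # numerically stable softmax
    total = sum(exps)
    return sum(e / total * x for e, x in zip(exps, xs))


def score_count(n):
    """Attention scores computed per layer for n tokens: one per ordered pair."""
    return n * n


xs = [0.1 * t for t in range(10)]

# Evaluate the attention outputs in two different orders; no output needs another output.
in_order = [attention_row(xs, i) for i in range(len(xs))]
reversed_order = [attention_row(xs, i) for i in reversed(range(len(xs)))][::-1]
same = in_order == reversed_order

print(same, score_count(10), score_count(100))  # prints "True 100 10000"
```

The order independence is what lets hardware process all positions together. The score counts show the other half of the picture: going from 10 to 100 tokens leaves the number of sequential steps unchanged but multiplies the attention work by 100.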

What was removed and what was kept

Figure: Two-column diagram. Left column, "Removed" (marked with a cross): recurrent connections, convolutions, positional recurrence. Right column, "Kept" (marked with a checkmark): attention mechanism, residual connections, layer normalization, feed-forward layers. A note at the bottom reads: "result: parallelizable, scalable, strong on long sequences".

The transformer is defined as much by what it removes as by what it adds. Recurrent connections are gone, eliminating the sequential bottleneck. Convolutions are gone, removing the assumption that relevant context is always local. Positional recurrence, in which word order is implied by the order of processing, is gone too, replaced by explicit positional encodings added to the input. What remains are the components that make deep networks trainable at scale: attention for global context, residual connections for gradient flow, layer normalization for training stability, and feed-forward layers for per-token transformations. The result is an architecture with no built-in bias about locality or sequence order: any sensitivity to order comes from the positional encodings rather than from the wiring itself. That makes it more general and more scalable than its predecessors.
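The claim about sequence order can be tested directly. The sketch below makes several assumptions for brevity: a single attention head, identity projections (queries, keys and values are all the raw inputs), made-up token vectors, and sinusoidal positional encodings, which the text above does not specify. Under those assumptions it checks two things: without positional encodings, shuffling the tokens merely shuffles the outputs; with them, the same tokens in a different order produce different outputs.

```python
import math


def self_attention(X):
    """Single-head self-attention with identity projections (Q = K = V = X)."""
    d = len(X[0])
    out = []
    for q in X:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in X]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        out.append([sum(w * v[c] for w, v in zip(weights, X)) for c in range(d)])
    return out


def sinusoidal_encoding(n, d):
    """Sinusoidal positional encodings (an assumed scheme, chosen for illustration)."""
    return [[math.sin(pos / 10000 ** (c / d)) if c % 2 == 0
             else math.cos(pos / 10000 ** ((c - 1) / d))
             for c in range(d)]
            for pos in range(n)]


def add_positions(X):
    """Add the encoding for position i to the token vector at position i."""
    pe = sinusoidal_encoding(len(X), len(X[0]))
    return [[x + p for x, p in zip(row, pe_row)] for row, pe_row in zip(X, pe)]


def close(A, B, tol=1e-9):
    return all(abs(a - b) <= tol for ra, rb in zip(A, B) for a, b in zip(ra, rb))


# Three made-up token vectors and one reordering of them.
X = [[1.0, 0.0, 0.5, -0.5], [0.0, 1.0, -1.0, 0.5], [0.5, 0.5, 0.0, 1.0]]
perm = [2, 0, 1]
X_perm = [X[p] for p in perm]

# No positional encodings: permuting the inputs only permutes the outputs.
plain = self_attention(X)
order_blind = close(self_attention(X_perm), [plain[p] for p in perm])

# With positional encodings: the same tokens in a new order give different outputs.
coded = self_attention(add_positions(X))
order_sensitive = not close(self_attention(add_positions(X_perm)), [coded[p] for p in perm])

print(order_blind, order_sensitive)  # prints "True True"
```

Learned projection matrices would not change the first result, since the same weights are applied at every position; the identity choice here only keeps the example short.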