Paper explainer · intermediate
Attention Is All You Need
A visual, plain-language walkthrough of the 2017 paper that introduced the transformer, from the limits of RNNs to the architecture behind nearly every modern LLM.
Vaswani et al. · 2017 · NeurIPS
9 parts · ~27 min
- 1. The problem with RNNs: RNNs forced sequential, one-word-at-a-time processing and squeezed long sentences through a fixed-size hidden state, two limits the transformer removes with attention. (3 min)
- 2. The transformer at a glance: The transformer reads a source sentence with an encoder and writes the target with a decoder, each built from a stack of six identical layers, moving data through tokens, embeddings, attention, and output probabilities. (3 min)
- 3. Input representation: The transformer turns text into numbers by looking up a learned embedding vector for each token, then adding a sine-cosine positional encoding so the model knows where each token sits. (3 min)
- 4. Scaled dot-product attention: Attention scores each pair of tokens with a query-key dot product, scales by the square root of the key dimension, applies softmax, and uses the resulting weights to mix the value vectors. (3 min)
- 5. Multi-head attention: Eight attention heads run in parallel, each with its own learned projections into a reduced dimension, so the model can capture many relationship types at once; the head outputs are then concatenated and projected back to the model dimension. (3 min)
- 6. The encoder: Each of six encoder layers applies unrestricted multi-head self-attention and a feed-forward network, each wrapped in a residual connection and layer norm, turning input tokens into rich contextual vectors. (3 min)
- 7. The decoder: Each decoder layer uses three sublayers to generate the output one token at a time: masked self-attention that hides future tokens, cross-attention that reads the encoder output, and a feed-forward network. (3 min)
- 8. Why it works: Replacing recurrence with attention gives the transformer a constant, O(1), path length between any two tokens, full parallelism across positions during training, and freedom from the locality and ordering assumptions that constrained earlier models. (3 min)
- 9. Results and lasting impact: The transformer set new BLEU records on WMT 2014 English-to-German and English-to-French translation at lower training cost than prior models, and became the architectural foundation for BERT, GPT, T5, the Vision Transformer, and nearly every major language model since. (3 min)
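
As a preview of parts 3, 4, and 7, here is a minimal sketch of the two formulas those parts describe: the sine-cosine positional encoding and scaled dot-product attention, with an optional causal mask like the one the decoder uses. It is written in plain Python with nested lists rather than a tensor library, purely for readability. The function names and the `causal` flag are illustrative choices, not taken from the paper.

```python
import math

def positional_encoding(position, d_model):
    """Sine-cosine encoding for one position (hypothetical helper name).

    Even dimensions use sin, odd dimensions use cos, with wavelengths
    that grow geometrically up to a base of 10000, as in the paper.
    """
    encoding = []
    for even_dim in range(0, d_model, 2):
        angle = position / (10000 ** (even_dim / d_model))
        encoding.append(math.sin(angle))
        encoding.append(math.cos(angle))
    return encoding[:d_model]  # trim in case d_model is odd

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(queries, keys, values, causal=False):
    """softmax(Q K^T / sqrt(d_k)) V on lists of row vectors.

    With causal=True (an assumption of this sketch), query i may only
    attend to keys 0..i, mimicking the decoder's masked self-attention.
    """
    d_k = len(keys[0])
    outputs = []
    for i, q in enumerate(queries):
        # Query-key dot products, scaled by the square root of d_k.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k)
                  for k in keys]
        if causal:
            # Future positions get -inf, so softmax gives them weight 0.
            scores = [s if j <= i else float("-inf")
                      for j, s in enumerate(scores)]
        weights = softmax(scores)
        # Weighted mix of the value vectors.
        outputs.append([sum(w * v[c] for w, v in zip(weights, values))
                        for c in range(len(values[0]))])
    return outputs
```

For example, a single all-zero query scores every key equally, so `scaled_dot_product_attention([[0, 0]], [[1, 0], [0, 1]], [[1, 2], [3, 4]])` returns the plain average of the value vectors. Real implementations batch these operations as matrix multiplications, which is what makes training parallel across positions.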