Scaled dot-product attention

Attention is the mechanism that lets every token gather information from every other token. It does this by computing a score between each pair of tokens, converting those scores into weights, and using the weights to mix together a set of information vectors. Three vectors drive the whole process: a query, a key, and a value.

Query, key, and value

Each word is projected into three vectors: Q (what am I looking for?), K (what do I offer?), and V (what information do I carry?).

Every token in the sequence is transformed into three separate vectors by multiplying its embedding with three learned weight matrices. The query $Q$ represents what the token is searching for in the rest of the sequence. The key $K$ represents what the token advertises about itself. The value $V$ carries the actual content that will be passed on if another token attends to this one. These three projections are independent: each is learned separately during training.

Computing attention weights

Q and K are multiplied to produce a score S, which is divided by the square root of d_k to prevent large values, then passed through softmax to produce attention weights that sum to 1.

To decide how much attention token $i$ should pay to token $j$ , the model computes the dot product of their query and key vectors. A larger dot product means the two tokens are more compatible. The result is a raw score $S = Q \cdot K$ , where $Q$ is the query of token $i$ and $K$ is the key of token $j$ .

This score is then divided by $\sqrt{d_k}$ , where $d_k$ is the dimension of the key vectors. Without this scaling step, dot products in high-dimensional spaces tend to grow very large, pushing the softmax output toward extreme values close to 0 or 1 and making gradients vanish during training. Dividing by $\sqrt{d_k}$ keeps the scores in a stable range.

The scaled score passes through a softmax function, which converts the full set of scores for token $i$ across all other tokens into a probability distribution. Every weight is between 0 and 1, and all weights sum to exactly 1. This is the complete formula:

\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V

where $Q$ is the matrix of query vectors, $K$ is the matrix of key vectors, $K^T$ is its transpose, $d_k$ is the key dimension, and $V$ is the matrix of value vectors.

Weighted sum over values

Four value vectors V1 to V4 with weights 0.6, 0.3, 0.08, and 0.02 are combined proportionally into a single attention output vector.

The attention weights are applied to the value vectors. Each value vector is multiplied by its corresponding weight, and the results are summed. A token with weight 0.6 contributes much more to the output than a token with weight 0.02. The final result is a single vector that is a weighted mixture of all value vectors in the sequence. This is the attention output for token $i$ : a new representation that blends information from across the entire sequence, with each source weighted by how relevant it was judged to be.