Understanding Attention Mechanisms in Transformers

Query (Q): what this token is looking for
Key (K): what each token offers
Value (V): what each token contributes if selected

The Core Idea

Attention lets each token in a sequence “look at” every other token and decide how much to weight its contribution.

Three projections of the input:

$$\text{Attention}(Q, K, V) = \text{softmax}!\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

The dot product of Q and K measures compatibility. Dividing by $\sqrt{d_k}$ prevents softmax from saturating in high dimensions.

Running attention in parallel across multiple learned subspaces lets the model attend to different relationship types simultaneously.