The Core Idea
Attention lets each token in a sequence “look at” every other token and decide how much to weight its contribution.
Query, Key, Value
Each token's input embedding is mapped by three learned linear projections:
- Query (Q): what this token is looking for
- Key (K): what each token offers
- Value (V): what each token contributes if selected
$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$
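The formula above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the helper names (`softmax`, `scaled_dot_product_attention`), the random projection matrices `W_q`, `W_k`, `W_v`, and the toy sizes are all assumptions made for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max along the axis for numerical stability; the result is unchanged.
    shifted = x - np.max(x, axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v).

    Returns the attended output (n_q, d_v) and the attention weights (n_q, n_k).
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # compatibility of every query with every key
    weights = softmax(scores, axis=-1)   # each row is a distribution over keys
    return weights @ V, weights

# Toy input: 4 tokens with 8-dimensional embeddings (sizes are illustrative).
rng = np.random.default_rng(0)
n, d_model = 4, 8
X = rng.normal(size=(n, d_model))

# Hypothetical projection matrices; in a real model these are learned.
W_q = rng.normal(size=(d_model, d_model))
W_k = rng.normal(size=(d_model, d_model))
W_v = rng.normal(size=(d_model, d_model))

Q, K, V = X @ W_q, X @ W_k, X @ W_v
output, weights = scaled_dot_product_attention(Q, K, V)

print(output.shape)    # (4, 8): one output vector per token
print(weights.shape)   # (4, 4): one weight per (query, key) pair
```

Each row of `weights` sums to 1, so every output vector is a weighted average of the value vectors.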
Why It Works
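The dot product of Q and K measures compatibility: a larger value means the query and key point in more similar directions. If the components of a query and a key are independent with zero mean and unit variance, their dot product has variance $d_k$, so raw scores grow with dimension. Dividing by $\sqrt{d_k}$ keeps the score variance near 1, which prevents softmax from saturating (and its gradients from becoming very small) in high dimensions.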
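The effect of the scaling can be checked numerically. The sketch below assumes queries and keys drawn from a standard normal distribution, which is an idealization rather than a property of any trained model; `score_stats` and its parameters are hypothetical names for this example.

```python
import numpy as np

def softmax(x, axis=-1):
    shifted = x - np.max(x, axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=axis, keepdims=True)

def score_stats(d_k, n_keys=8, trials=1000, seed=0):
    """Compare raw and scaled dot-product scores for random q and k vectors."""
    rng = np.random.default_rng(seed)
    q = rng.normal(size=(trials, d_k))          # one query per trial
    k = rng.normal(size=(trials, n_keys, d_k))  # n_keys keys per trial
    raw = np.einsum("td,tkd->tk", q, k)         # (trials, n_keys) dot products
    scaled = raw / np.sqrt(d_k)
    return {
        "raw_std": raw.std(),
        "scaled_std": scaled.std(),
        # Average size of the largest attention weight: near 1 means saturated.
        "raw_max_weight": softmax(raw).max(axis=-1).mean(),
        "scaled_max_weight": softmax(scaled).max(axis=-1).mean(),
    }

for d_k in (16, 256):
    stats = score_stats(d_k)
    print(d_k, {name: round(float(value), 3) for name, value in stats.items()})
```

Under these assumptions the raw standard deviation tracks $\sqrt{d_k}$ while the scaled one stays near 1, and the largest unscaled softmax weight moves toward 1 as $d_k$ grows.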
Multi-Head Attention
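Running attention in parallel across multiple learned subspaces (heads), each with its own Q, K, and V projections, lets the model attend to different relationship types simultaneously. The head outputs are concatenated and projected back to the model dimension.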
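A minimal NumPy sketch of multi-head self-attention follows. It assumes the standard transformer layout, in which the model dimension is split evenly across heads and the concatenated head outputs pass through an output projection `W_o`; the function name, the random weights, and the sizes are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    shifted = x - np.max(x, axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=axis, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads):
    """X: (n, d_model); all weight matrices: (d_model, d_model)."""
    n, d_model = X.shape
    assert d_model % num_heads == 0, "d_model must divide evenly across heads"
    d_head = d_model // num_heads

    def split_heads(M):
        # (n, d_model) -> (num_heads, n, d_head)
        return M.reshape(n, num_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split_heads(X @ W_q), split_heads(X @ W_k), split_heads(X @ W_v)

    # Scaled dot-product attention, computed independently in each head.
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)  # (num_heads, n, n)
    weights = softmax(scores, axis=-1)
    heads = weights @ V                                  # (num_heads, n, d_head)

    # Concatenate the heads along the feature axis, then mix them with W_o.
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)
    return concat @ W_o, weights

rng = np.random.default_rng(0)
n, d_model, num_heads = 5, 16, 4
X = rng.normal(size=(n, d_model))
# Hypothetical weights; in a real model these are learned.
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))

out, head_weights = multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads)
print(out.shape)           # (5, 16)
print(head_weights.shape)  # (4, 5, 5): a separate attention pattern per head
```

Because each head has its own slice of the projections, the attention patterns in `head_weights` can differ from head to head.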
Key Takeaways
- Self-attention is permutation-equivariant, so positional encodings are needed to supply order information.
- Time and memory are O(n²) in sequence length n, because every token attends to every other token.
- The transformer replaced recurrence with attention entirely.
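The permutation-equivariance claim can be verified directly: shuffling the input tokens shuffles the output rows in the same way and changes nothing else. The sketch below uses single-head self-attention without positional encodings; the weights and sizes are arbitrary assumptions for the check.

```python
import numpy as np

def softmax(x, axis=-1):
    shifted = x - np.max(x, axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    weights = softmax(Q @ K.T / np.sqrt(Q.shape[-1]), axis=-1)
    return weights @ V

rng = np.random.default_rng(0)
n, d_model = 6, 8
X = rng.normal(size=(n, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

perm = rng.permutation(n)  # a random reordering of the tokens

out = self_attention(X, W_q, W_k, W_v)
out_perm = self_attention(X[perm], W_q, W_k, W_v)

# Equivariance: attending over permuted input equals permuting the original output.
equivariant = np.allclose(out_perm, out[perm])
print(equivariant)  # True
```

Without positional information, the model has no way to tell the original ordering from the shuffled one, which is why order has to be added to the inputs.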
