Understanding Attention Mechanisms in Transformers

The Core Idea Attention lets each token in a sequence “look at” every other token and decide how much to weight its contribution. Query, Key, Value Three projections of the input: Query (Q): what this token is looking for Key (K): what each token offers Value (V): what each token contributes if selected $$\text{Attention}(Q, K, V) = \text{softmax}!\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$ Why It Works The dot product of Q and K measures compatibility. Dividing by $\sqrt{d_k}$ prevents softmax from saturating in high dimensions. ...

June 18, 2026 · 1 min

Gradient Descent: From Intuition to Implementation

What is Gradient Descent? Gradient descent minimizes a function by iteratively stepping in the direction of steepest descent. The Intuition Imagine standing blindfolded on a hilly landscape. You want the lowest point. Strategy: feel the slope, step downhill, repeat. The Math Given a loss function $L(\theta)$, update parameters as: $$\theta = \theta - \alpha \nabla L(\theta)$$ where $\alpha$ is the learning rate. A Simple Implementation def gradient_descent(grad_fn, theta, lr=0.01, steps=100): for _ in range(steps): theta -= lr * grad_fn(theta) return theta Key Takeaways Learning rate too large: diverges. Too small: slow. Vanilla GD uses the full dataset per step. Mini-batch GD is the practical default in deep learning.

May 19, 2026 · 1 min
pixel cat