Learning Journal

Notes on AI/ML concepts, coding, and ideas — written down while they’re still fresh.

Understanding Attention Mechanisms in Transformers

The Core Idea Attention lets each token in a sequence “look at” every other token and decide how much to weight its contribution. Query, Key, Value Three projections of the input: Query (Q): what this token is looking for Key (K): what each token offers Value (V): what each token contributes if selected $$\text{Attention}(Q, K, V) = \text{softmax}!\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$ Why It Works The dot product of Q and K measures compatibility. Dividing by $\sqrt{d_k}$ prevents softmax from saturating in high dimensions. ...

Rust Ownership: Mental Models That Finally Clicked

The Problem Rust Solves Memory bugs — use-after-free, double-free, dangling pointers — are the root of most systems security vulnerabilities. Rust eliminates them at compile time. The Three Rules Every value has exactly one owner. When the owner goes out of scope, the value is dropped. Ownership can be transferred (moved) or temporarily lent (borrowed). What Finally Clicked I stopped thinking about the borrow checker as a restriction and started thinking about it as a guarantee: there is always exactly one place responsible for cleanup. ...

Notes: Four Thousand Weeks — Oliver Burkeman

The Core Argument The average human life is about four thousand weeks. You will never clear your to-do list. Accepting this — really accepting it — changes how you make decisions. What Struck Me The book’s central move is to reframe the productivity trap: the reason “getting on top of things” never works is that it’s based on a false premise — that a state of being on top of things is achievable. ...

Gradient Descent: From Intuition to Implementation

What is Gradient Descent? Gradient descent minimizes a function by iteratively stepping in the direction of steepest descent. The Intuition Imagine standing blindfolded on a hilly landscape. You want the lowest point. Strategy: feel the slope, step downhill, repeat. The Math Given a loss function $L(\theta)$, update parameters as: $$\theta = \theta - \alpha \nabla L(\theta)$$ where $\alpha$ is the learning rate. A Simple Implementation def gradient_descent(grad_fn, theta, lr=0.01, steps=100): for _ in range(steps): theta -= lr * grad_fn(theta) return theta Key Takeaways Learning rate too large: diverges. Too small: slow. Vanilla GD uses the full dataset per step. Mini-batch GD is the practical default in deep learning.