Learning Journal

Understanding Attention Mechanisms in Transformers

Thu, 18 Jun 2026 00:00:00 +0000

The Core Idea

Attention lets each token in a sequence “look at” every other token and decide how much to weight its contribution.

Query, Key, Value

Three projections of the input:

Query (Q): what this token is looking for
Key (K): what each token offers
Value (V): what each token contributes if selected

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$
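To make the formula concrete, here is a minimal pure-Python sketch of single-head scaled dot-product attention. It assumes no masking and no batching, and it takes Q, K, V as already-projected lists of lists. The function names and the toy numbers are made up for illustration; a real implementation would use a tensor library.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V.

    Q is n_q x d_k, K is n_k x d_k, V is n_k x d_v (lists of lists).
    Returns (output, weights).
    """
    d_k = len(K[0])
    scale = math.sqrt(d_k)
    weights = []
    for q in Q:
        # One score per key: dot(q, k) / sqrt(d_k)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in K]
        weights.append(softmax(scores))
    d_v = len(V[0])
    # Each output row is a weighted average of the value rows
    output = [
        [sum(w[j] * V[j][c] for j in range(len(V))) for c in range(d_v)]
        for w in weights
    ]
    return output, weights

# Toy input: 3 tokens, d_k = d_v = 2 (illustrative values only)
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
out, w = attention(Q, K, V)
```

Each row of `w` sums to 1, so each row of `out` is a weighted average of the rows of `V`. In this toy input the first query scores keys 0 and 2 equally, so it weights their values equally.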

Why It Works

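The dot product of Q and K measures compatibility: a larger $q \cdot k$ means that key is a better match for the query. If the components of q and k are independent with mean 0 and variance 1, the dot product has variance $d_k$, so dividing by $\sqrt{d_k}$ brings the scores back to unit scale and prevents softmax from saturating in high dimensions.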
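A rough way to see the saturation effect is to compare the largest softmax weight with and without the scaling. This sketch assumes query and key components are independent standard normal draws, which is an assumption of the demo and not a measurement from a real model. `peak_weights` is a hypothetical helper name.

```python
import math
import random

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def peak_weights(d_k, n_keys, rng):
    """Return the largest attention weight (unscaled, scaled) for one random query."""
    q = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
    keys = [[rng.gauss(0.0, 1.0) for _ in range(d_k)] for _ in range(n_keys)]
    scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
    unscaled = max(softmax(scores))
    scaled = max(softmax([s / math.sqrt(d_k) for s in scores]))
    return unscaled, scaled

rng = random.Random(0)
for d_k in (4, 64, 512):
    unscaled, scaled = peak_weights(d_k, n_keys=8, rng=rng)
    print(f"d_k={d_k:4d}  unscaled peak={unscaled:.3f}  scaled peak={scaled:.3f}")
```

As `d_k` grows, the unscaled peak should drift toward 1 (nearly one-hot attention, where gradients through softmax are tiny), while the scaled peak stays spread across several keys.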

Rust Ownership: Mental Models That Finally Clicked

Wed, 10 Jun 2026 00:00:00 +0000

The Problem Rust Solves

Memory bugs (use-after-free, double-free, dangling pointers) are behind a large share of security vulnerabilities in systems software. Safe Rust rules them out at compile time.

The Three Rules

Every value has exactly one owner.
When the owner goes out of scope, the value is dropped.
Ownership can be transferred (moved) or temporarily lent (borrowed).

What Finally Clicked

I stopped thinking about the borrow checker as a restriction and started thinking about it as a guarantee: there is always exactly one place responsible for cleanup.

Notes: Four Thousand Weeks — Oliver Burkeman

Wed, 03 Jun 2026 00:00:00 +0000

The Core Argument

The average human life is about four thousand weeks. You will never clear your to-do list. Accepting this — really accepting it — changes how you make decisions.

What Struck Me

The book’s central move is to reframe the productivity trap: the reason “getting on top of things” never works is that it’s based on a false premise — that a state of being on top of things is achievable.

Gradient Descent: From Intuition to Implementation

Tue, 19 May 2026 00:00:00 +0000

What is Gradient Descent?

Gradient descent minimizes a function by iteratively stepping in the direction of steepest descent, which is the negative of the gradient.

The Intuition

Imagine standing blindfolded on a hilly landscape. You want the lowest point. Strategy: feel the slope, step downhill, repeat.

The Math

Given a loss function $L(\theta)$, update parameters as:

$$\theta = \theta - \alpha \nabla L(\theta)$$

where $\alpha$ is the learning rate.
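
A worked example helps here. The loss, starting point, and learning rate below are picked for illustration: take $L(\theta) = (\theta - 3)^2$, so $\nabla L(\theta) = 2(\theta - 3)$, with $\theta_0 = 0$ and $\alpha = 0.1$. The first two updates are:

$$\theta_1 = \theta_0 - \alpha \nabla L(\theta_0) = 0 - 0.1 \cdot 2(0 - 3) = 0.6$$

$$\theta_2 = \theta_1 - \alpha \nabla L(\theta_1) = 0.6 - 0.1 \cdot 2(0.6 - 3) = 1.08$$

Each step shrinks the distance to the minimum at $\theta = 3$ by a factor of $1 - 2\alpha = 0.8$ (from 3 to 2.4 to 1.92).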

A Simple Implementation

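def gradient_descent(grad_fn, theta, lr=0.01, steps=100):
    """Run `steps` gradient descent updates from `theta`; `grad_fn` returns the gradient."""
    for _ in range(steps):
        # Rebind instead of using `-=` so an array passed in by the caller is not modified in place.
        theta = theta - lr * grad_fn(theta)
    return theta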
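
A quick way to try the function is to run it on a loss whose minimum is known. The sketch below repeats the function so it runs on its own, and uses $L(\theta) = (\theta - 3)^2$ as an illustrative loss. The three learning rates are arbitrary picks to show the different regimes.

```python
def gradient_descent(grad_fn, theta, lr=0.01, steps=100):
    for _ in range(steps):
        theta = theta - lr * grad_fn(theta)
    return theta

def grad(theta):
    # Gradient of the illustrative loss L(theta) = (theta - 3)^2
    return 2.0 * (theta - 3.0)

good = gradient_descent(grad, theta=0.0, lr=0.1)       # converges toward 3
slow = gradient_descent(grad, theta=0.0, lr=0.0001)    # barely moves in 100 steps
diverged = gradient_descent(grad, theta=0.0, lr=1.1)   # overshoots further each step

print(good, slow, diverged)
```

For this loss each update multiplies the distance to the minimum by `1 - 2 * lr`, so any `lr` above 1 makes that factor larger than 1 in magnitude and the iterates blow up.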

Key Takeaways

Learning rate too large: diverges. Too small: slow.
Vanilla GD uses the full dataset per step.
Mini-batch GD is the practical default in deep learning.
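
To make the full-batch versus mini-batch distinction concrete, here is a small sketch that fits a one-parameter model `y_hat = w * x` both ways. The data is synthetic, and the function names, learning rate, and batch size are illustrative assumptions, not tuned values.

```python
import random

def grad_mse(w, batch):
    """Gradient of mean squared error for the 1-D model y_hat = w * x."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)

def full_batch_gd(data, w=0.0, lr=0.3, epochs=50):
    for _ in range(epochs):
        # One update per pass, using every example
        w -= lr * grad_mse(w, data)
    return w

def mini_batch_gd(data, w=0.0, lr=0.3, epochs=50, batch_size=8, seed=0):
    rng = random.Random(seed)
    data = list(data)  # copy so the caller's list is not shuffled
    for _ in range(epochs):
        rng.shuffle(data)
        for i in range(0, len(data), batch_size):
            # One update per batch, so several updates per pass
            w -= lr * grad_mse(w, data[i:i + batch_size])
    return w

# Synthetic data: y = 2x plus a little Gaussian noise
rng = random.Random(42)
xs = [rng.uniform(-1.0, 1.0) for _ in range(64)]
data = [(x, 2.0 * x + rng.gauss(0.0, 0.05)) for x in xs]

w_full = full_batch_gd(data)
w_mini = mini_batch_gd(data)
print(w_full, w_mini)
```

Both should land close to the true slope of 2. The difference is cost per update: the full-batch version touches all 64 examples for each step, while the mini-batch version makes 8 noisier steps per pass over the same data.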

About


Hi, I’m Hieu, a CS student working as an AI/ML engineer.

I spend my days training models that occasionally work, reading papers that occasionally make sense, and doing problems that occasionally don’t make me want to close my laptop forever. The keyword here is occasionally.

This journal exists because my brain leaks. I read something, think “ah, that’s clever,” and then forget it completely within 48 hours. So now I write it down. If it happens to be useful to you too, great — we can be confused together.