<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Learning Journal</title><link>https://learning-notes-8ef.pages.dev/</link><description>Recent content on Learning Journal</description><generator>Hugo</generator><language>en</language><lastBuildDate>Thu, 18 Jun 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://learning-notes-8ef.pages.dev/index.xml" rel="self" type="application/rss+xml"/><item><title>Understanding Attention Mechanisms in Transformers</title><link>https://learning-notes-8ef.pages.dev/posts/machine-learning/understanding-attention/</link><pubDate>Thu, 18 Jun 2026 00:00:00 +0000</pubDate><guid>https://learning-notes-8ef.pages.dev/posts/machine-learning/understanding-attention/</guid><description>&lt;h2 id="the-core-idea"&gt;The Core Idea&lt;/h2&gt;
&lt;p&gt;Attention lets each token in a sequence &amp;ldquo;look at&amp;rdquo; every other token and decide how much to weight its contribution.&lt;/p&gt;
&lt;h2 id="query-key-value"&gt;Query, Key, Value&lt;/h2&gt;
&lt;p&gt;Three projections of the input:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Query (Q):&lt;/strong&gt; what this token is looking for&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Key (K):&lt;/strong&gt; what each token offers&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Value (V):&lt;/strong&gt; what each token contributes if selected&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$&lt;/p&gt;
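&lt;p&gt;The formula above can be sketched in plain Python. This is a minimal illustration rather than how any real library does it: the function names &lt;code&gt;softmax&lt;/code&gt; and &lt;code&gt;attention&lt;/code&gt; and the list-of-rows representation are assumptions made for this example, and there is no batching, masking, or multiple heads.&lt;/p&gt;

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)  # subtract the max so exp() cannot overflow
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Scaled dot-product attention on plain lists of row vectors.

    Q is n_queries x d_k, K is n_keys x d_k, V is n_keys x d_v.
    Returns a list of n_queries rows, each of length d_v.
    """
    d_k = len(K[0])
    d_v = len(V[0])
    output = []
    for q in Q:
        # Compatibility of this query with every key, scaled by sqrt(d_k).
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
            for k in K
        ]
        weights = softmax(scores)
        # Each output row is a weighted average of the value vectors.
        output.append(
            [sum(w * v[j] for w, v in zip(weights, V)) for j in range(d_v)]
        )
    return output
```

&lt;p&gt;As a sanity check, if every key is identical the weights are uniform, so &lt;code&gt;attention([[1.0, 0.0]], [[1.0, 0.0], [1.0, 0.0]], [[1.0, 2.0], [3.0, 4.0]])&lt;/code&gt; returns the mean of the value rows, &lt;code&gt;[[2.0, 3.0]]&lt;/code&gt;.&lt;/p&gt;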
&lt;h2 id="why-it-works"&gt;Why It Works&lt;/h2&gt;
&lt;p&gt;The dot product of a query and a key measures their compatibility. If the components of the two vectors are independent with zero mean and unit variance, their dot product has variance $d_k$, so unscaled scores grow with dimension and push softmax into a saturated regime where gradients are very small. Dividing by $\sqrt{d_k}$ keeps the score variance near 1.&lt;/p&gt;</description></item><item><title>Rust Ownership: Mental Models That Finally Clicked</title><link>https://learning-notes-8ef.pages.dev/posts/programming/rust-ownership/</link><pubDate>Wed, 10 Jun 2026 00:00:00 +0000</pubDate><guid>https://learning-notes-8ef.pages.dev/posts/programming/rust-ownership/</guid><description>&lt;h2 id="the-problem-rust-solves"&gt;The Problem Rust Solves&lt;/h2&gt;
&lt;p&gt;Memory bugs (use-after-free, double-free, dangling pointers) are the root of a large share of security vulnerabilities in systems software. Safe Rust rules them out at compile time.&lt;/p&gt;
&lt;h2 id="the-three-rules"&gt;The Three Rules&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;Every value has exactly one owner.&lt;/li&gt;
&lt;li&gt;When the owner goes out of scope, the value is dropped.&lt;/li&gt;
&lt;li&gt;Ownership can be transferred (moved) or temporarily lent (borrowed).&lt;/li&gt;
&lt;/ol&gt;
&lt;h2 id="what-finally-clicked"&gt;What Finally Clicked&lt;/h2&gt;
&lt;p&gt;I stopped thinking about the borrow checker as a restriction and started thinking about it as a guarantee: &lt;strong&gt;there is always exactly one place responsible for cleanup&lt;/strong&gt;.&lt;/p&gt;</description></item><item><title>Notes: Four Thousand Weeks — Oliver Burkeman</title><link>https://learning-notes-8ef.pages.dev/posts/books/four-thousand-weeks/</link><pubDate>Wed, 03 Jun 2026 00:00:00 +0000</pubDate><guid>https://learning-notes-8ef.pages.dev/posts/books/four-thousand-weeks/</guid><description>&lt;h2 id="the-core-argument"&gt;The Core Argument&lt;/h2&gt;
&lt;p&gt;The average human life is about four thousand weeks. You will never clear your to-do list. Accepting this — really accepting it — changes how you make decisions.&lt;/p&gt;
&lt;h2 id="what-struck-me"&gt;What Struck Me&lt;/h2&gt;
&lt;p&gt;The book&amp;rsquo;s central move is to reframe the productivity trap: the reason &amp;ldquo;getting on top of things&amp;rdquo; never works is that it&amp;rsquo;s based on a false premise — that a state of being on top of things is achievable.&lt;/p&gt;</description></item><item><title>Gradient Descent: From Intuition to Implementation</title><link>https://learning-notes-8ef.pages.dev/posts/machine-learning/gradient-descent/</link><pubDate>Tue, 19 May 2026 00:00:00 +0000</pubDate><guid>https://learning-notes-8ef.pages.dev/posts/machine-learning/gradient-descent/</guid><description>&lt;h2 id="what-is-gradient-descent"&gt;What is Gradient Descent?&lt;/h2&gt;
&lt;p&gt;Gradient descent minimizes a function by iteratively stepping in the direction of steepest descent.&lt;/p&gt;
&lt;h2 id="the-intuition"&gt;The Intuition&lt;/h2&gt;
&lt;p&gt;Imagine standing blindfolded on a hilly landscape. You want the lowest point. Strategy: feel the slope, step downhill, repeat.&lt;/p&gt;
&lt;h2 id="the-math"&gt;The Math&lt;/h2&gt;
&lt;p&gt;Given a loss function $L(\theta)$, update parameters as:&lt;/p&gt;
&lt;p&gt;$$\theta = \theta - \alpha \nabla L(\theta)$$&lt;/p&gt;
&lt;p&gt;where $\alpha$ is the learning rate.&lt;/p&gt;
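&lt;p&gt;As a small worked example (the quadratic loss and the step index $t$ are assumptions introduced here for illustration), take $L(\theta) = \theta^2$, so $\nabla L(\theta) = 2\theta$. The update rule becomes:&lt;/p&gt;
&lt;p&gt;$$\theta_{t+1} = \theta_t - \alpha \cdot 2\theta_t = (1 - 2\alpha)\,\theta_t \quad\Rightarrow\quad \theta_t = (1 - 2\alpha)^t\,\theta_0$$&lt;/p&gt;
&lt;p&gt;With $\alpha = 0.1$ and $\theta_0 = 1$ the iterates are $0.8$, $0.64$, $0.512$, and so on toward the minimum at $0$. For this particular loss the iterates shrink only when $\alpha$ lies strictly between $0$ and $1$; with $\alpha$ above $1$ they grow in magnitude at every step.&lt;/p&gt;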
&lt;h2 id="a-simple-implementation"&gt;A Simple Implementation&lt;/h2&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#66d9ef"&gt;def&lt;/span&gt; &lt;span style="color:#a6e22e"&gt;gradient_descent&lt;/span&gt;(grad_fn, theta, lr&lt;span style="color:#f92672"&gt;=&lt;/span&gt;&lt;span style="color:#ae81ff"&gt;0.01&lt;/span&gt;, steps&lt;span style="color:#f92672"&gt;=&lt;/span&gt;&lt;span style="color:#ae81ff"&gt;100&lt;/span&gt;):
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#66d9ef"&gt;for&lt;/span&gt; _ &lt;span style="color:#f92672"&gt;in&lt;/span&gt; range(steps):
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; theta &lt;span style="color:#f92672"&gt;-=&lt;/span&gt; lr &lt;span style="color:#f92672"&gt;*&lt;/span&gt; grad_fn(theta)
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#66d9ef"&gt;return&lt;/span&gt; theta
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h2 id="key-takeaways"&gt;Key Takeaways&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Learning rate too large: diverges. Too small: slow.&lt;/li&gt;
&lt;li&gt;Vanilla GD uses the full dataset per step.&lt;/li&gt;
&lt;li&gt;Mini-batch GD is the practical default in deep learning.&lt;/li&gt;
&lt;/ul&gt;</description></item><item><title>About</title><link>https://learning-notes-8ef.pages.dev/about/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://learning-notes-8ef.pages.dev/about/</guid><description>&lt;p&gt;Hi, I&amp;rsquo;m Hieu, a CS student working as an AI/ML Engineer.&lt;/p&gt;
&lt;p&gt;I spend my days training models that occasionally work, reading papers that occasionally make sense, and doing problems that occasionally don&amp;rsquo;t make me want to close my laptop forever. The keyword here is &lt;em&gt;occasionally&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;This journal exists because my brain leaks. I read something, think &amp;ldquo;ah, that&amp;rsquo;s clever,&amp;rdquo; and then forget it completely within 48 hours. So now I write it down. If it happens to be useful to you too, great — we can be confused together.&lt;/p&gt;</description></item></channel></rss>