<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Optimization on Learning Journal</title><link>https://learning-notes-8ef.pages.dev/tags/optimization/</link><description>Recent content in Optimization on Learning Journal</description><generator>Hugo</generator><language>en</language><lastBuildDate>Tue, 19 May 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://learning-notes-8ef.pages.dev/tags/optimization/index.xml" rel="self" type="application/rss+xml"/><item><title>Gradient Descent: From Intuition to Implementation</title><link>https://learning-notes-8ef.pages.dev/posts/machine-learning/gradient-descent/</link><pubDate>Tue, 19 May 2026 00:00:00 +0000</pubDate><guid>https://learning-notes-8ef.pages.dev/posts/machine-learning/gradient-descent/</guid><description>&lt;h2 id="what-is-gradient-descent"&gt;What is Gradient Descent?&lt;/h2&gt;
&lt;p&gt;Gradient descent minimizes a function by iteratively stepping in the direction of steepest descent.&lt;/p&gt;
&lt;h2 id="the-intuition"&gt;The Intuition&lt;/h2&gt;
&lt;p&gt;Imagine standing blindfolded on a hilly landscape. You want the lowest point. Strategy: feel the slope, step downhill, repeat.&lt;/p&gt;
&lt;h2 id="the-math"&gt;The Math&lt;/h2&gt;
&lt;p&gt;Given a loss function $L(\theta)$, update parameters as:&lt;/p&gt;
&lt;p&gt;$$\theta = \theta - \alpha \nabla L(\theta)$$&lt;/p&gt;
&lt;p&gt;where $\alpha$ is the learning rate.&lt;/p&gt;
&lt;h2 id="a-simple-implementation"&gt;A Simple Implementation&lt;/h2&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#66d9ef"&gt;def&lt;/span&gt; &lt;span style="color:#a6e22e"&gt;gradient_descent&lt;/span&gt;(grad_fn, theta, lr&lt;span style="color:#f92672"&gt;=&lt;/span&gt;&lt;span style="color:#ae81ff"&gt;0.01&lt;/span&gt;, steps&lt;span style="color:#f92672"&gt;=&lt;/span&gt;&lt;span style="color:#ae81ff"&gt;100&lt;/span&gt;):
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#66d9ef"&gt;for&lt;/span&gt; _ &lt;span style="color:#f92672"&gt;in&lt;/span&gt; range(steps):
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; theta &lt;span style="color:#f92672"&gt;-=&lt;/span&gt; lr &lt;span style="color:#f92672"&gt;*&lt;/span&gt; grad_fn(theta)
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#66d9ef"&gt;return&lt;/span&gt; theta
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h2 id="key-takeaways"&gt;Key Takeaways&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Learning rate too large: diverges. Too small: slow.&lt;/li&gt;
&lt;li&gt;Vanilla GD uses the full dataset per step.&lt;/li&gt;
&lt;li&gt;Mini-batch GD is the practical default in deep learning.&lt;/li&gt;
&lt;/ul&gt;</description></item></channel></rss>