<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Nlp on Learning Journal</title><link>https://learning-notes-8ef.pages.dev/tags/nlp/</link><description>Recent content in Nlp on Learning Journal</description><generator>Hugo</generator><language>en</language><lastBuildDate>Thu, 18 Jun 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://learning-notes-8ef.pages.dev/tags/nlp/index.xml" rel="self" type="application/rss+xml"/><item><title>Understanding Attention Mechanisms in Transformers</title><link>https://learning-notes-8ef.pages.dev/posts/machine-learning/understanding-attention/</link><pubDate>Thu, 18 Jun 2026 00:00:00 +0000</pubDate><guid>https://learning-notes-8ef.pages.dev/posts/machine-learning/understanding-attention/</guid><description>&lt;h2 id="the-core-idea"&gt;The Core Idea&lt;/h2&gt;
&lt;p&gt;Attention lets each token in a sequence &amp;ldquo;look at&amp;rdquo; every token in that sequence, itself included, and decide how much weight to give each one&amp;rsquo;s contribution.&lt;/p&gt;
&lt;h2 id="query-key-value"&gt;Query, Key, Value&lt;/h2&gt;
&lt;p&gt;Three projections of the input:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Query (Q):&lt;/strong&gt; what this token is looking for&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Key (K):&lt;/strong&gt; what each token offers&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Value (V):&lt;/strong&gt; what each token contributes if selected&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$&lt;/p&gt;
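&lt;p&gt;To make the formula concrete, here is a sketch of the shapes involved. It assumes a single sequence of $n$ tokens, queries and keys of width $d_k$, and values of width $d_v$; the symbols $n$ and $d_v$ are introduced here for illustration and do not appear in the formula above.&lt;/p&gt;
&lt;p&gt;$$Q \in \mathbb{R}^{n \times d_k}, \quad K \in \mathbb{R}^{n \times d_k}, \quad V \in \mathbb{R}^{n \times d_v}$$&lt;/p&gt;
&lt;p&gt;$$QK^T \in \mathbb{R}^{n \times n}, \qquad \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V \in \mathbb{R}^{n \times d_v}$$&lt;/p&gt;
&lt;p&gt;Under these assumptions, row $i$ of the $n \times n$ score matrix holds the weights that token $i$ assigns to every token. Softmax is applied to each row separately, so each row sums to 1, and the output row for token $i$ is a weighted average of the rows of $V$.&lt;/p&gt;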
&lt;h2 id="why-it-works"&gt;Why It Works&lt;/h2&gt;
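&lt;p&gt;The dot product of Q and K measures compatibility: the larger the score, the more weight that key&amp;rsquo;s value receives. Dividing by $\sqrt{d_k}$ keeps the scores in a stable range as the dimension grows. If the components of a query and a key are independent with zero mean and unit variance, their dot product has variance $d_k$, so unscaled scores spread out with dimension and push softmax toward a near one-hot output where gradients become very small.&lt;/p&gt;</description></item></channel></rss>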