<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Prefix-Caching on lo0 — Blog Técnico</title><link>https://blog.lo0.es/tags/prefix-caching/</link><description>Recent content in Prefix-Caching on lo0 — Blog Técnico</description><generator>Hugo -- gohugo.io</generator><language>es</language><lastBuildDate>Fri, 05 Jun 2026 04:00:00 +0000</lastBuildDate><atom:link href="https://blog.lo0.es/tags/prefix-caching/index.xml" rel="self" type="application/rss+xml"/><item><title>Prefix cache: ingeniería del hit rate para pasar del 15% al 75%</title><link>https://blog.lo0.es/posts/prefix-cache-hit-rate-engineering/</link><pubDate>Fri, 05 Jun 2026 04:00:00 +0000</pubDate><guid>https://blog.lo0.es/posts/prefix-cache-hit-rate-engineering/</guid><description>&lt;h2 id="tldr">TL;DR&lt;/h2>
&lt;p>El prefix cache de vLLM almacena los bloques de KV cache de prefijos compartidos y los reutiliza en requests posteriores. Un hit evita recalcular ese prefijo: el TTFT cae al coste del sufijo variable únicamente. En workloads enterprise con system prompts fijos —RAG, chatbots de dominio, asistentes con instrucciones largas— el hit rate debería ser &amp;gt;70%. En la práctica es 10-20% por razones completamente evitables. Este artículo las identifica, las corrige y da las queries OTel para confirmar el resultado.&lt;/p>
&lt;hr>
&lt;h2 id="la-analogía">La analogía&lt;/h2>
&lt;p>Un intérprete de conferencias simultáneas que tiene que traducir los discursos de veinte ponentes. Todos empiezan con el mismo preámbulo protocolar de dos páginas: la declaración de la conferencia, las reglas de conducta, el programa del día. Un intérprete sin memoria relee las dos páginas para cada ponente antes de empezar a traducir su discurso específico. Un intérprete con notas buenas las lee una vez, las archiva, y cuando empieza el segundo ponente pasa directamente al discurso.&lt;/p>
&lt;p>El prefix cache es ese archivo. El hash del prefijo es la referencia que permite saltar a la parte nueva. Pero si el preámbulo cambia aunque sea en una palabra — porque alguien pone la fecha del día — el intérprete tiene que releer todo desde el principio.&lt;/p>
&lt;hr>
&lt;h2 id="cómo-funciona-el-hash-de-prefix-cache">Cómo funciona el hash de prefix cache&lt;/h2>
&lt;p>vLLM divide el KV cache en bloques de 16 tokens. Cada bloque tiene un hash calculado sobre su contenido exacto. Cuando llega un nuevo request, vLLM comprueba si algún bloque inicial del prompt ya está en cache comparando hashes.&lt;/p>
&lt;p>El hash se calcula sobre &lt;strong>el contenido byte a byte de los tokens&lt;/strong>. Cualquier diferencia — un espacio, un carácter diferente, un token de más — produce un hash completamente distinto. No hay matching parcial dentro de un bloque.&lt;/p>
&lt;p>Consecuencia directa: si tu system prompt tiene 512 tokens y el token número 3 cambia entre requests (porque interpolas una fecha, un ID, un número de versión), &lt;strong>ningún bloque hace hit&lt;/strong> aunque el 99% del texto sea idéntico.&lt;/p>
&lt;pre tabindex="0">&lt;code>Bloque 0 (tokens 0-15): hash = a3f7... ← ¿en cache?
Bloque 1 (tokens 16-31): hash = 9d2c... ← ¿en cache?
...
Bloque 31 (tokens 496-511): hash = 7e1a... ← ¿en cache?
&lt;/code>&lt;/pre>&lt;p>Si el bloque 0 no hace hit (porque su contenido cambió), los bloques 1-31 tampoco se comprueban aunque sean idénticos — el prefix cache es secuencial.&lt;/p>
&lt;hr>
&lt;h2 id="auditoría-por-qué-tu-hit-rate-real-es-bajo">Auditoría: por qué tu hit rate real es bajo&lt;/h2>
&lt;p>Antes de cambiar nada, hay que saber &lt;em>qué&lt;/em> está rompiendo el hash. El método más directo: extraer los últimos 1000 prompts de producción y calcular qué fracción del prefix varía.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># audit_prefix_cache.py&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="kn">import&lt;/span> &lt;span class="nn">langfuse&lt;/span>&lt;span class="o">,&lt;/span> &lt;span class="nn">hashlib&lt;/span>&lt;span class="o">,&lt;/span> &lt;span class="nn">collections&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="kn">from&lt;/span> &lt;span class="nn">transformers&lt;/span> &lt;span class="kn">import&lt;/span> &lt;span class="n">AutoTokenizer&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">client&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">langfuse&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">Langfuse&lt;/span>&lt;span class="p">()&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">tokenizer&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">AutoTokenizer&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">from_pretrained&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;Qwen/Qwen2.5-14B-Instruct&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">traces&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">client&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">fetch_traces&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">limit&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="mi">1000&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">data&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">prompts&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="p">[&lt;/span>&lt;span class="n">t&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">input&lt;/span> &lt;span class="k">for&lt;/span> &lt;span class="n">t&lt;/span> &lt;span class="ow">in&lt;/span> &lt;span class="n">traces&lt;/span> &lt;span class="k">if&lt;/span> &lt;span class="n">t&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">input&lt;/span>&lt;span class="p">]&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># Tokenizar y extraer los primeros 512 tokens (el system prompt típico)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">prefixes&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="p">[]&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="k">for&lt;/span> &lt;span class="n">prompt&lt;/span> &lt;span class="ow">in&lt;/span> &lt;span class="n">prompts&lt;/span>&lt;span class="p">:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">tokens&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">tokenizer&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">encode&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">prompt&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">add_special_tokens&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="kc">False&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">prefix_tokens&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="nb">tuple&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">tokens&lt;/span>&lt;span class="p">[:&lt;/span>&lt;span class="mi">512&lt;/span>&lt;span class="p">])&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">prefixes&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">append&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">prefix_tokens&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># ¿Cuántos prefixes únicos hay?&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">unique&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="nb">len&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="nb">set&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">prefixes&lt;/span>&lt;span class="p">))&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">total&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="nb">len&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">prefixes&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="nb">print&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="sa">f&lt;/span>&lt;span class="s2">&amp;#34;Prefixes únicos: &lt;/span>&lt;span class="si">{&lt;/span>&lt;span class="n">unique&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s2">/&lt;/span>&lt;span class="si">{&lt;/span>&lt;span class="n">total&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s2"> (&lt;/span>&lt;span class="si">{&lt;/span>&lt;span class="n">unique&lt;/span>&lt;span class="o">/&lt;/span>&lt;span class="n">total&lt;/span>&lt;span class="o">*&lt;/span>&lt;span class="mi">100&lt;/span>&lt;span class="si">:&lt;/span>&lt;span class="s2">.1f&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s2">%)&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="nb">print&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="sa">f&lt;/span>&lt;span class="s2">&amp;#34;Hit rate teórico si todos fueran iguales: &lt;/span>&lt;span class="si">{&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="mi">1&lt;/span> &lt;span class="o">-&lt;/span> &lt;span class="n">unique&lt;/span>&lt;span class="o">/&lt;/span>&lt;span class="n">total&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="o">*&lt;/span>&lt;span class="mi">100&lt;/span>&lt;span class="si">:&lt;/span>&lt;span class="s2">.1f&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s2">%&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># Encontrar qué token difiere entre el prefix más común y los demás&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="kn">from&lt;/span> &lt;span class="nn">collections&lt;/span> &lt;span class="kn">import&lt;/span> &lt;span class="n">Counter&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">most_common_prefix&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">Counter&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">prefixes&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">most_common&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="mi">1&lt;/span>&lt;span class="p">)[&lt;/span>&lt;span class="mi">0&lt;/span>&lt;span class="p">][&lt;/span>&lt;span class="mi">0&lt;/span>&lt;span class="p">]&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">divergence_positions&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="p">[]&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="k">for&lt;/span> &lt;span class="n">prefix&lt;/span> &lt;span class="ow">in&lt;/span> &lt;span class="n">prefixes&lt;/span>&lt;span class="p">:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">if&lt;/span> &lt;span class="n">prefix&lt;/span> &lt;span class="o">==&lt;/span> &lt;span class="n">most_common_prefix&lt;/span>&lt;span class="p">:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">continue&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">for&lt;/span> &lt;span class="n">i&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="p">(&lt;/span>&lt;span class="n">a&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">b&lt;/span>&lt;span class="p">)&lt;/span> &lt;span class="ow">in&lt;/span> &lt;span class="nb">enumerate&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="nb">zip&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">most_common_prefix&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">prefix&lt;/span>&lt;span class="p">)):&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">if&lt;/span> &lt;span class="n">a&lt;/span> &lt;span class="o">!=&lt;/span> &lt;span class="n">b&lt;/span>&lt;span class="p">:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">divergence_positions&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">append&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">i&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">break&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="k">if&lt;/span> &lt;span class="n">divergence_positions&lt;/span>&lt;span class="p">:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">pos&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">Counter&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">divergence_positions&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">most_common&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="mi">1&lt;/span>&lt;span class="p">)[&lt;/span>&lt;span class="mi">0&lt;/span>&lt;span class="p">][&lt;/span>&lt;span class="mi">0&lt;/span>&lt;span class="p">]&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">token_text&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">tokenizer&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">decode&lt;/span>&lt;span class="p">([&lt;/span>&lt;span class="n">most_common_prefix&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="n">pos&lt;/span>&lt;span class="p">]])&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nb">print&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="sa">f&lt;/span>&lt;span class="s2">&amp;#34;&lt;/span>&lt;span class="se">\n&lt;/span>&lt;span class="s2">Divergencia más frecuente en posición &lt;/span>&lt;span class="si">{&lt;/span>&lt;span class="n">pos&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s2">: &amp;#39;&lt;/span>&lt;span class="si">{&lt;/span>&lt;span class="n">token_text&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s2">&amp;#39;&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nb">print&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;→ El token en esa posición varía entre requests&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Los culpables más comunes, en orden de frecuencia:&lt;/p>
&lt;p>&lt;strong>1. Timestamps y fechas:&lt;/strong>&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># ❌ Rompe el hash en cada request&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">system&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="sa">f&lt;/span>&lt;span class="s2">&amp;#34;Fecha actual: &lt;/span>&lt;span class="si">{&lt;/span>&lt;span class="n">datetime&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">now&lt;/span>&lt;span class="p">()&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">strftime&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;%Y-%m-&lt;/span>&lt;span class="si">%d&lt;/span>&lt;span class="s1"> %H:%M&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s2">. Eres un asistente...&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># ✅ Sacar la fecha del system prompt&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">system&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="s2">&amp;#34;Eres un asistente especializado en infraestructura cloud.&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># Pasar la fecha como parte del mensaje del usuario si es necesaria&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>&lt;strong>2. IDs de sesión y usuarios:&lt;/strong>&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># ❌&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">system&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="sa">f&lt;/span>&lt;span class="s2">&amp;#34;Usuario ID: &lt;/span>&lt;span class="si">{&lt;/span>&lt;span class="n">user_id&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s2">. Preferencias: &lt;/span>&lt;span class="si">{&lt;/span>&lt;span class="n">user_prefs&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s2">. Eres un asistente...&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># ✅ Separar lo estático de lo contextual&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">system&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="s2">&amp;#34;Eres un asistente especializado.&amp;#34;&lt;/span> &lt;span class="c1"># siempre igual&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># Agregar contexto de usuario como primer mensaje del historial&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>&lt;strong>3. Versiones de prompt interpoladas:&lt;/strong>&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># ❌&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">system&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="sa">f&lt;/span>&lt;span class="s2">&amp;#34;[v&lt;/span>&lt;span class="si">{&lt;/span>&lt;span class="n">PROMPT_VERSION&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s2">] Eres un asistente...&amp;#34;&lt;/span> &lt;span class="c1"># cambia con cada deploy&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># ✅ No versionar en el texto, versionar en el nombre del prompt en Langfuse&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">system&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="s2">&amp;#34;Eres un asistente...&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>&lt;strong>4. Few-shots dinámicos:&lt;/strong>&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># ❌ Ejemplos recuperados aleatoriamente de un pool&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">examples&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">random&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">sample&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">example_pool&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">k&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="mi">3&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">system&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="sa">f&lt;/span>&lt;span class="s2">&amp;#34;Ejemplos:&lt;/span>&lt;span class="se">\n&lt;/span>&lt;span class="si">{&lt;/span>&lt;span class="n">format_examples&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">examples&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="se">\n\n&lt;/span>&lt;span class="s2">Eres un asistente...&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># ✅ Few-shots fijos ordenados siempre igual&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">FIXED_EXAMPLES&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="p">[&lt;/span>&lt;span class="n">example_pool&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="mi">0&lt;/span>&lt;span class="p">],&lt;/span> &lt;span class="n">example_pool&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="mi">1&lt;/span>&lt;span class="p">],&lt;/span> &lt;span class="n">example_pool&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="mi">2&lt;/span>&lt;span class="p">]]&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">system&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="sa">f&lt;/span>&lt;span class="s2">&amp;#34;Ejemplos:&lt;/span>&lt;span class="se">\n&lt;/span>&lt;span class="si">{&lt;/span>&lt;span class="n">format_examples&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">FIXED_EXAMPLES&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="se">\n\n&lt;/span>&lt;span class="s2">Eres un asistente...&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;hr>
&lt;h2 id="ingeniería-de-templates-la-estructura-que-maximiza-hits">Ingeniería de templates: la estructura que maximiza hits&lt;/h2>
&lt;p>El principio es simple: &lt;strong>todo lo estático va antes, todo lo dinámico va después&lt;/strong>. El prefix cache es secuencial — una vez que un bloque no hace hit, el resto tampoco se busca.&lt;/p>
&lt;pre tabindex="0">&lt;code>ESTRUCTURA ÓPTIMA para maximizar prefix cache:
┌──────────────────────────────────────────────┐
│ BLOQUE ESTÁTICO (tokens 0-511) │ ← hit rate ~100%
│ System prompt invariante │
│ Instrucciones fijas │
│ Few-shots ordenados siempre igual │
├──────────────────────────────────────────────┤
│ BLOQUE SEMI-ESTÁTICO (tokens 512-1023) │ ← hit rate ~60-80%
│ Documentos RAG para esta sesión │
│ Historial de conversación hasta ahora │
├──────────────────────────────────────────────┤
│ BLOQUE DINÁMICO (tokens 1024+) │ ← hit rate ~0% (esperado)
│ Mensaje actual del usuario │
│ Contexto específico de este request │
└──────────────────────────────────────────────┘
&lt;/code>&lt;/pre>&lt;p>Para RAG específicamente: si los documentos recuperados son los mismos para un conjunto de queries similares (muy frecuente en RAG sobre documentos corporativos fijos), ordenarlos &lt;strong>siempre en el mismo orden&lt;/strong> (por ID, por score fijo, no por score variable) multiplica el hit rate del bloque semi-estático.&lt;/p>
&lt;hr>
&lt;h2 id="routing-prefix-aware-el-siguiente-nivel">Routing prefix-aware: el siguiente nivel&lt;/h2>
&lt;p>Con una sola instancia de vLLM, el prefix cache funciona automáticamente. El problema aparece con múltiples réplicas: el load balancer distribuye requests round-robin, y el prefix cacheado en la réplica A no sirve de nada cuando el request llega a la réplica B.&lt;/p>
&lt;p>La solución es &lt;strong>prefix-aware routing&lt;/strong>: enviar requests con el mismo prefix al mismo nodo.&lt;/p>
&lt;p>&lt;strong>Con Ray Serve (integración nativa):&lt;/strong>&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># ray_serve_prefix_router.py&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="kn">from&lt;/span> &lt;span class="nn">ray&lt;/span> &lt;span class="kn">import&lt;/span> &lt;span class="n">serve&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="kn">from&lt;/span> &lt;span class="nn">ray.serve.llm&lt;/span> &lt;span class="kn">import&lt;/span> &lt;span class="n">LLMConfig&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">build_llm_deployment&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="nd">@serve.deployment&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="k">class&lt;/span> &lt;span class="nc">PrefixAwareRouter&lt;/span>&lt;span class="p">:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">def&lt;/span> &lt;span class="fm">__init__&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="bp">self&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">replicas&lt;/span>&lt;span class="p">):&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="bp">self&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">replicas&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">replicas&lt;/span> &lt;span class="c1"># lista de handles de vLLM&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">async&lt;/span> &lt;span class="k">def&lt;/span> &lt;span class="fm">__call__&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="bp">self&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">request&lt;/span>&lt;span class="p">):&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">body&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="k">await&lt;/span> &lt;span class="n">request&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">json&lt;/span>&lt;span class="p">()&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">messages&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">body&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">get&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;messages&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="p">[])&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="c1"># Calcular hash del system prompt (prefix estático)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">system_content&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="s2">&amp;#34;&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">for&lt;/span> &lt;span class="n">msg&lt;/span> &lt;span class="ow">in&lt;/span> &lt;span class="n">messages&lt;/span>&lt;span class="p">:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">if&lt;/span> &lt;span class="n">msg&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="s2">&amp;#34;role&amp;#34;&lt;/span>&lt;span class="p">]&lt;/span> &lt;span class="o">==&lt;/span> &lt;span class="s2">&amp;#34;system&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">system_content&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">msg&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="s2">&amp;#34;content&amp;#34;&lt;/span>&lt;span class="p">]&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">break&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">prefix_hash&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="nb">hash&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">system_content&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="c1"># Routing determinístico: mismo hash → mismo nodo&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">replica_idx&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">prefix_hash&lt;/span> &lt;span class="o">%&lt;/span> &lt;span class="nb">len&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="bp">self&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">replicas&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">return&lt;/span> &lt;span class="k">await&lt;/span> &lt;span class="bp">self&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">replicas&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="n">replica_idx&lt;/span>&lt;span class="p">]&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">remote&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">request&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>&lt;strong>Con un gateway L7 (Nginx/Traefik):&lt;/strong>&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-nginx" data-lang="nginx">&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># nginx.conf — routing por header X-Prefix-Hash
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1">&lt;/span>&lt;span class="k">upstream&lt;/span> &lt;span class="s">vllm_backends&lt;/span> &lt;span class="p">{&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="kn">hash&lt;/span> &lt;span class="nv">$http_x_prefix_hash&lt;/span> &lt;span class="s">consistent&lt;/span>&lt;span class="p">;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="kn">server&lt;/span> &lt;span class="n">vllm-0&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="mi">8000&lt;/span>&lt;span class="p">;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="kn">server&lt;/span> &lt;span class="n">vllm-1&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="mi">8000&lt;/span>&lt;span class="p">;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="kn">server&lt;/span> &lt;span class="n">vllm-2&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="mi">8000&lt;/span>&lt;span class="p">;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="kn">server&lt;/span> &lt;span class="n">vllm-3&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="mi">8000&lt;/span>&lt;span class="p">;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="p">}&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>El cliente calcula el hash del prefix estático y lo incluye como header:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="kn">import&lt;/span> &lt;span class="nn">hashlib&lt;/span>&lt;span class="o">,&lt;/span> &lt;span class="nn">requests&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="k">def&lt;/span> &lt;span class="nf">llm_request&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">messages&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">base_url&lt;/span>&lt;span class="p">):&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">system_msg&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="nb">next&lt;/span>&lt;span class="p">((&lt;/span>&lt;span class="n">m&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="s2">&amp;#34;content&amp;#34;&lt;/span>&lt;span class="p">]&lt;/span> &lt;span class="k">for&lt;/span> &lt;span class="n">m&lt;/span> &lt;span class="ow">in&lt;/span> &lt;span class="n">messages&lt;/span> &lt;span class="k">if&lt;/span> &lt;span class="n">m&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="s2">&amp;#34;role&amp;#34;&lt;/span>&lt;span class="p">]&lt;/span> &lt;span class="o">==&lt;/span> &lt;span class="s2">&amp;#34;system&amp;#34;&lt;/span>&lt;span class="p">),&lt;/span> &lt;span class="s2">&amp;#34;&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">prefix_hash&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">hashlib&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">sha256&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">system_msg&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">encode&lt;/span>&lt;span class="p">())&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">hexdigest&lt;/span>&lt;span class="p">()[:&lt;/span>&lt;span class="mi">16&lt;/span>&lt;span class="p">]&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">return&lt;/span> &lt;span class="n">requests&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">post&lt;/span>&lt;span class="p">(&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="sa">f&lt;/span>&lt;span class="s2">&amp;#34;&lt;/span>&lt;span class="si">{&lt;/span>&lt;span class="n">base_url&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s2">/v1/chat/completions&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">json&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="p">{&lt;/span>&lt;span class="s2">&amp;#34;messages&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="n">messages&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="s2">&amp;#34;model&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;mi-modelo&amp;#34;&lt;/span>&lt;span class="p">},&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">headers&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="p">{&lt;/span>&lt;span class="s2">&amp;#34;X-Prefix-Hash&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="n">prefix_hash&lt;/span>&lt;span class="p">}&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;hr>
&lt;h2 id="medir-el-impacto-con-otel">Medir el impacto con OTel&lt;/h2>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-promql" data-lang="promql">&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># Hit rate actual (0.0 a 1.0) — objetivo &amp;gt; 0.70 con workloads enterprise&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="nv">vllm&lt;/span>&lt;span class="err">:&lt;/span>&lt;span class="nv">gpu_prefix_cache_hit_rate&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="c1"># TTFT por percentil — debe caer cuando el hit rate sube&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="kr">histogram_quantile&lt;/span>&lt;span class="o">(&lt;/span>&lt;span class="mf">0.50&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="kr">rate&lt;/span>&lt;span class="o">(&lt;/span>&lt;span class="nv">vllm&lt;/span>&lt;span class="err">:&lt;/span>&lt;span class="nv">time_to_first_token_seconds_bucket&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="s">5m&lt;/span>&lt;span class="p">]&lt;/span>&lt;span class="o">))&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="kr">histogram_quantile&lt;/span>&lt;span class="o">(&lt;/span>&lt;span class="mf">0.95&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="kr">rate&lt;/span>&lt;span class="o">(&lt;/span>&lt;span class="nv">vllm&lt;/span>&lt;span class="err">:&lt;/span>&lt;span class="nv">time_to_first_token_seconds_bucket&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="s">5m&lt;/span>&lt;span class="p">]&lt;/span>&lt;span class="o">))&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>La correlación inversa entre hit rate y TTFT es la prueba de que el cache está funcionando. Si el hit rate sube del 15% al 70% y el TTFT p50 no cambia, hay un problema de configuración: el cache puede estar desactivado o el routing no está enviando los requests al nodo correcto.&lt;/p>
&lt;p>&lt;strong>Query de correlación en Grafana&lt;/strong> (panel de dos ejes):&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-promql" data-lang="promql">&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># Eje Y izquierdo: hit rate&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="nv">vllm&lt;/span>&lt;span class="err">:&lt;/span>&lt;span class="nv">gpu_prefix_cache_hit_rate&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="c1"># Eje Y derecho: TTFT p50 (invertido)&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="kr">histogram_quantile&lt;/span>&lt;span class="o">(&lt;/span>&lt;span class="mf">0.50&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="kr">rate&lt;/span>&lt;span class="o">(&lt;/span>&lt;span class="nv">vllm&lt;/span>&lt;span class="err">:&lt;/span>&lt;span class="nv">time_to_first_token_seconds_bucket&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="s">5m&lt;/span>&lt;span class="p">]&lt;/span>&lt;span class="o">))&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>La pendiente inversa debe ser visible: cuando el hit rate baja (pico de requests con prompts nuevos), el TTFT sube. Cuando el hit rate se estabiliza (usuarios repitiendo el mismo flujo), el TTFT baja.&lt;/p>
&lt;hr>
&lt;h2 id="el-impacto-en-números">El impacto en números&lt;/h2>
&lt;p>Para un sistema con 100 req/min, system prompt de 512 tokens y hit rate antes/después:&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th style="text-align:left">Métrica&lt;/th>
&lt;th style="text-align:right">Hit rate 15%&lt;/th>
&lt;th style="text-align:right">Hit rate 75%&lt;/th>
&lt;th style="text-align:right">Diferencia&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td style="text-align:left">Tokens de prefill por minuto&lt;/td>
&lt;td style="text-align:right">5.100&lt;/td>
&lt;td style="text-align:right">12.800 — 50% cacheados → 6.400 efectivos&lt;/td>
&lt;td style="text-align:right">−37% carga&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td style="text-align:left">TTFT p50 (prompt 512 + sufijo 100)&lt;/td>
&lt;td style="text-align:right">~820 ms&lt;/td>
&lt;td style="text-align:right">~180 ms (sólo sufijo)&lt;/td>
&lt;td style="text-align:right">−78%&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td style="text-align:left">Capacidad de prefill liberada&lt;/td>
&lt;td style="text-align:right">—&lt;/td>
&lt;td style="text-align:right">+1.200 tok/min&lt;/td>
&lt;td style="text-align:right">disponible para más requests&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>El 75% de hit rate en este ejemplo equivale a poder atender un 37% más de requests con el mismo hardware, porque el trabajo de prefill de 3 de cada 4 requests ya está hecho.&lt;/p>
&lt;hr>
&lt;h2 id="cuándo-el-prefix-cache-no-ayuda">Cuándo el prefix cache no ayuda&lt;/h2>
&lt;p>El prefix cache es ineficaz en workloads donde cada request tiene un prompt completamente único: traducciones de documentos distintos cada vez, análisis de código con contexto siempre diferente, generación creativa sin sistema. En estos casos, el hit rate estructuralmente no puede superar el 5-10% y el esfuerzo de ingeniería de templates no compensa.&lt;/p>
&lt;p>La señal: si tu p99 de longitud de input es mayor que el p50, tienes alta varianza de prompts y el prefix cache aporta poco. Si el p50 y el p99 son similares (prompts consistentes), el prefix cache es la palanca más barata disponible.&lt;/p>
&lt;hr>
&lt;h2 id="ver-también">Ver también&lt;/h2>
&lt;ul>
&lt;li>https://blog.lo0.es/posts/prefill-optimizaciones-vllm/ — &lt;code>--enable-prefix-caching&lt;/code> y la interacción con chunked prefill: sólo el primer chunk se beneficia del cache, lo que afecta al presupuesto óptimo de &lt;code>max-num-batched-tokens&lt;/code>&lt;/li>
&lt;li>https://blog.lo0.es/posts/kv-cache-fundamentos/ — la estructura de bloques sobre la que opera el prefix cache: por qué la granularidad de 16 tokens importa para el diseño de templates&lt;/li>
&lt;li>https://blog.lo0.es/posts/batch-sizing-vllm-grid-search/ — el grid search que determina el &lt;code>max-num-seqs&lt;/code> óptimo, que interactúa con el número de bloques disponibles para el cache&lt;/li>
&lt;li>https://blog.lo0.es/posts/router-inferencia-llm-gateway-l7/ — el gateway L7 donde se implementa el routing prefix-aware via header&lt;/li>
&lt;li>https://blog.lo0.es/posts/vllm-otel-instrumentacion-optimizaciones/ — cómo configurar &lt;code>gpu_prefix_cache_hit_rate&lt;/code> en el dashboard de Grafana y la alerta cuando cae por debajo del umbral objetivo&lt;/li>
&lt;/ul>
&lt;h3 id="en-esta-misma-serie">En esta misma serie&lt;/h3>
&lt;ul>
&lt;li>https://blog.lo0.es/posts/batch-sizing-vllm-grid-search/ — la primera optimización de la serie: el grid search de max-num-seqs × max-num-batched-tokens&lt;/li>
&lt;li>https://blog.lo0.es/posts/fp8-end-to-end-pesos-kv-calidad/ — FP8 en pesos y KV cache: doblar la VRAM disponible para cache y medir la degradación de calidad antes de ir a producción&lt;/li>
&lt;li>https://blog.lo0.es/posts/tp-replicas-una-grande-vs-n-pequenas/ — TP=4×1 vs TP=2×2: el routing por sesión que complementa el prefix-aware routing de este artículo&lt;/li>
&lt;/ul>
&lt;hr>
&lt;h2 id="referencias">Referencias&lt;/h2>
&lt;ul>
&lt;li>&lt;a href="https://docs.vllm.ai/en/stable/design/prefix_caching/">vLLM Automatic Prefix Caching — documentación oficial&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://docs.ray.io/en/latest/serve/llm/user-guides/prefix-aware-routing.html">Prefix-aware routing — Ray Serve&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://llm-d.ai/blog/kvcache-wins-you-can-see">KV-Cache Wins You Can See — llm-d blog&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://www.digitalocean.com/blog/reduce-llm-inference-costs-prefix-caching">The Inference Tax: Prefix-Aware Routing — DigitalOcean&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://github.com/vllm-project/vllm/issues/24394">vLLM issue #24394: Improve Prefix Cache Hit Rate&lt;/a>&lt;/li>
&lt;/ul></description></item></channel></rss>