<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Calidad on lo0 — Blog Técnico</title><link>https://blog.lo0.es/tags/calidad/</link><description>Recent content in Calidad on lo0 — Blog Técnico</description><generator>Hugo -- gohugo.io</generator><language>es</language><lastBuildDate>Fri, 05 Jun 2026 04:00:00 +0000</lastBuildDate><atom:link href="https://blog.lo0.es/tags/calidad/index.xml" rel="self" type="application/rss+xml"/><item><title>FP8 end-to-end: enable, measure quality, and decide with data</title><link>https://blog.lo0.es/posts/fp8-end-to-end-pesos-kv-calidad/</link><pubDate>Fri, 05 Jun 2026 04:00:00 +0000</pubDate><guid>https://blog.lo0.es/posts/fp8-end-to-end-pesos-kv-calidad/</guid><description>&lt;h2 id="tldr">TL;DR&lt;/h2>
&lt;p>FP8 is the configuration change with the best impact-to-effort ratio available on H100 and Ada Lovelace hardware. On H100 it engages native FP8 tensor cores: +40-60% decode throughput and twice as many KV cache tokens per GB of VRAM. On RTX 4090 and L40 the compute benefit is smaller, but the 2× KV cache capacity is real and translates directly into roughly twice the concurrency. The risk is quality degradation, which in well-calibrated modern models is &amp;lt;0.5% on standard benchmarks but can be higher on formal reasoning. The right workflow is not to enable it and pray: enable it in staging, run the eval suite, correlate quality with throughput in OTel, and decide with data.&lt;/p>
&lt;hr>
&lt;h2 id="la-analogía">The analogy&lt;/h2>
&lt;p>Picture a photographer who moves from 35 mm negatives to digital. Digital photos take less space and are faster to process. A low-resolution photo of a landscape can be indistinguishable from the high-resolution one to the human eye, whereas a low-resolution photo of text loses letters. The same trade-off applies to FP8: for tasks where numerical imprecision averages out over thousands of activations (conversation, summarization, RAG), it is practically invisible. For tasks where a single wrong multiplication propagates into an incorrect answer (formal mathematics, critical code), it can be decisive.&lt;/p>
&lt;hr>
&lt;h2 id="las-tres-capas-de-fp8-en-vllm">The three layers of FP8 in vLLM&lt;/h2>
&lt;p>FP8 is not a single flag: it is three layers that are enabled separately and bring different benefits.&lt;/p>
&lt;p>&lt;strong>Layer 1, model weights (&lt;code>--quantization fp8&lt;/code>):&lt;/strong>
Model weights are stored and computed in FP8 E4M3. The checkpoint can be pre-quantized (published on HuggingFace with a &lt;code>-FP8&lt;/code> or &lt;code>-fp8&lt;/code> suffix) or quantized at load time, in which case vLLM derives the scales dynamically rather than from a calibration dataset. The benefit: the model takes half the VRAM, and weight matmuls run up to 2× faster on H100.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-bash" data-lang="bash">&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># Pre-quantized model (recommended for production)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">vllm serve neuralmagic/Meta-Llama-3.1-70B-Instruct-FP8 &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span> --quantization fp8
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># Or on-the-fly quantization (no extra files, somewhat slower on the first tokens)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">vllm serve meta-llama/Meta-Llama-3.1-70B-Instruct &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span> --quantization fp8 &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span> --kv-cache-dtype auto
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>&lt;strong>Layer 2, KV cache (&lt;code>--kv-cache-dtype fp8&lt;/code>):&lt;/strong>
The K and V tensors of the KV cache are stored in FP8 instead of BF16. This halves the size of the KV cache, which doubles the number of tokens that fit in VRAM. It does not affect the model weights.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-bash" data-lang="bash">&lt;span class="line">&lt;span class="cl">vllm serve mi-modelo &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span> --kv-cache-dtype fp8 &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span> --calculate-kv-scales &lt;span class="c1"># dynamic calibration, required to minimize degradation&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>&lt;strong>Layer 3, activations (automatic on H100):&lt;/strong>
On Hopper GPUs, vLLM automatically uses FP8 for the intermediate activations when both previous layers are enabled. No additional flag is required.&lt;/p>
&lt;p>&lt;strong>Full configuration for production:&lt;/strong>&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-bash" data-lang="bash">&lt;span class="line">&lt;span class="cl">vllm serve neuralmagic/Meta-Llama-3.1-70B-Instruct-FP8 &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span> --quantization fp8 &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span> --kv-cache-dtype fp8 &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span> --calculate-kv-scales &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span> --gpu-memory-utilization 0.92 &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span> --max-model-len &lt;span class="m">16384&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;hr>
&lt;h2 id="el-impacto-medible-por-hardware">Measurable impact by hardware&lt;/h2>
&lt;h3 id="h100-sxm-hopper-tensor-cores-fp8-nativos">H100 SXM (Hopper, native FP8 tensor cores)&lt;/h3>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th style="text-align:left">Metric&lt;/th>
&lt;th style="text-align:right">BF16 baseline&lt;/th>
&lt;th style="text-align:right">FP8 enabled&lt;/th>
&lt;th style="text-align:right">Delta&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td style="text-align:left">Decode throughput (tok/s, 70B, batch 32)&lt;/td>
&lt;td style="text-align:right">~1,800&lt;/td>
&lt;td style="text-align:right">~2,700&lt;/td>
&lt;td style="text-align:right">+50%&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td style="text-align:left">Model VRAM (70B)&lt;/td>
&lt;td style="text-align:right">140 GB&lt;/td>
&lt;td style="text-align:right">70 GB&lt;/td>
&lt;td style="text-align:right">−50%&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td style="text-align:left">VRAM available for KV cache (on 4×H100)&lt;/td>
&lt;td style="text-align:right">180 GB&lt;/td>
&lt;td style="text-align:right">250 GB&lt;/td>
&lt;td style="text-align:right">+39%&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td style="text-align:left">Maximum concurrency (ctx 8K)&lt;/td>
&lt;td style="text-align:right">~22,500 tok&lt;/td>
&lt;td style="text-align:right">~31,250 tok&lt;/td>
&lt;td style="text-align:right">+39%&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>In terms of KV cache capacity, this is comparable to getting an extra replica for free.&lt;/p>
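&lt;p>The KV cache row of the table follows from a simple memory budget: total VRAM minus the model weights. A minimal sketch of that arithmetic, assuming 80 GB per H100 and ignoring activations and runtime overhead (the helper &lt;code>kv_budget_gb&lt;/code> is illustrative, not part of vLLM):&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python"># Sketch: VRAM left for KV cache = total VRAM - model weights.
# Assumes 80 GB per H100 and ignores activations and runtime overhead.

def kv_budget_gb(num_gpus, gpu_vram_gb, weights_gb):
    """Return the VRAM (GB) left for KV cache after loading the weights."""
    return num_gpus * gpu_vram_gb - weights_gb


bf16 = kv_budget_gb(4, 80, 140)  # 70B parameters at 2 bytes each
fp8 = kv_budget_gb(4, 80, 70)    # 70B parameters at 1 byte each
gain_pct = (fp8 - bf16) / bf16 * 100

print(f"BF16: {bf16} GB, FP8: {fp8} GB, gain: {gain_pct:.0f}%")
&lt;/code>&lt;/pre>&lt;/div>&lt;p>With these inputs the sketch reproduces the 180 GB, 250 GB and +39% figures in the table.&lt;/p>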
&lt;h3 id="rtx-4090-ada-lovelace-fp8-cuda-pero-sin-tensor-cores-dedicados">RTX 4090 (Ada Lovelace, FP8 tensor cores with lower throughput than Hopper)&lt;/h3>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th style="text-align:left">Metric&lt;/th>
&lt;th style="text-align:right">BF16/Q4 baseline&lt;/th>
&lt;th style="text-align:right">FP8 KV cache added&lt;/th>
&lt;th style="text-align:right">Delta&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td style="text-align:left">Decode throughput (tok/s, 14B Q4)&lt;/td>
&lt;td style="text-align:right">~45&lt;/td>
&lt;td style="text-align:right">~47&lt;/td>
&lt;td style="text-align:right">+4%&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td style="text-align:left">VRAM available for KV cache&lt;/td>
&lt;td style="text-align:right">15 GB&lt;/td>
&lt;td style="text-align:right">15 GB (model unchanged)&lt;/td>
&lt;td style="text-align:right">0%&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td style="text-align:left">Total cached tokens (ctx 8K)&lt;/td>
&lt;td style="text-align:right">~46,000&lt;/td>
&lt;td style="text-align:right">~92,000&lt;/td>
&lt;td style="text-align:right">+100%&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td style="text-align:left">Maximum concurrency (ctx 8K)&lt;/td>
&lt;td style="text-align:right">~5 users&lt;/td>
&lt;td style="text-align:right">~11 users&lt;/td>
&lt;td style="text-align:right">+120%&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>On Ada the compute benefit is smaller (the FP8 tensor cores are not as wide as Hopper's), but the 2× KV cache capacity is entirely real and translates into roughly twice as many concurrent users.&lt;/p>
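&lt;p>The 2× capacity follows from the per-token size of the KV cache, which scales linearly with the bytes per element. A sketch of the usual estimate, using hypothetical model dimensions (40 layers, 8 KV heads, head dimension 128) that are placeholders and not the configuration behind the table:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python"># Sketch: KV cache bytes per token =
#   2 (K and V) * layers * kv_heads * head_dim * bytes per element.
# The model dimensions below are hypothetical placeholders.

def kv_bytes_per_token(layers, kv_heads, head_dim, dtype_bytes):
    """Bytes needed to cache the K and V tensors of one token."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes


def tokens_that_fit(budget_gb, bytes_per_token):
    """Whole tokens that fit in a KV cache budget given in GB (10**9 bytes)."""
    return int(budget_gb * 1e9 // bytes_per_token)


LAYERS, KV_HEADS, HEAD_DIM = 40, 8, 128
BUDGET_GB = 15  # VRAM left for KV cache, as in the table

bf16_tokens = tokens_that_fit(BUDGET_GB, kv_bytes_per_token(LAYERS, KV_HEADS, HEAD_DIM, 2))
fp8_tokens = tokens_that_fit(BUDGET_GB, kv_bytes_per_token(LAYERS, KV_HEADS, HEAD_DIM, 1))
ratio = fp8_tokens / bf16_tokens

print(f"BF16: {bf16_tokens} tokens, FP8: {fp8_tokens} tokens, ratio: {ratio:.2f}")
&lt;/code>&lt;/pre>&lt;/div>&lt;p>Whatever the real dimensions are, they cancel out in the ratio: going from 2 bytes to 1 byte per element doubles the tokens that fit in the same budget.&lt;/p>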
&lt;hr>
&lt;h2 id="el-workflow-correcto-activar-medir-decidir">The right workflow: enable, measure, decide&lt;/h2>
&lt;p>Enabling FP8 directly in production without validating quality is the wrong approach. The right workflow has four steps.&lt;/p>
&lt;h3 id="paso-1-baseline-en-staging">Step 1: baseline in staging&lt;/h3>
&lt;p>Before enabling FP8, record the quality metrics of the current BF16 model. The most reproducible way is to run an eval suite over a fixed dataset and save the results:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-bash" data-lang="bash">&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># Install lm-evaluation-harness&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">pip install lm-eval
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># Baseline BF16&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">lm_eval --model vllm &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span> --model_args &lt;span class="nv">pretrained&lt;/span>&lt;span class="o">=&lt;/span>meta-llama/Meta-Llama-3.1-70B-Instruct,dtype&lt;span class="o">=&lt;/span>bfloat16 &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span> --tasks mmlu,hellaswag,gsm8k &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span> --num_fewshot &lt;span class="m">5&lt;/span> &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span> --output_path ./results/baseline_bf16.json
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h3 id="paso-2-activar-fp8-y-correr-la-misma-eval-suite">Step 2: enable FP8 and run the same eval suite&lt;/h3>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-bash" data-lang="bash">&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># FP8&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">lm_eval --model vllm &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span> --model_args &lt;span class="nv">pretrained&lt;/span>&lt;span class="o">=&lt;/span>neuralmagic/Meta-Llama-3.1-70B-Instruct-FP8,quantization&lt;span class="o">=&lt;/span>fp8,kv_cache_dtype&lt;span class="o">=&lt;/span>fp8,calculate_kv_scales&lt;span class="o">=&lt;/span>&lt;span class="nb">true&lt;/span> &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span> --tasks mmlu,hellaswag,gsm8k &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span> --num_fewshot &lt;span class="m">5&lt;/span> &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span> --output_path ./results/fp8_full.json
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h3 id="paso-3-calcular-la-degradación">Step 3: compute the degradation&lt;/h3>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># compare_eval.py&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="kn">import&lt;/span> &lt;span class="nn">json&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="k">with&lt;/span> &lt;span class="nb">open&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;results/baseline_bf16.json&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span> &lt;span class="k">as&lt;/span> &lt;span class="n">f&lt;/span>&lt;span class="p">:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">baseline&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">json&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">load&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">f&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="k">with&lt;/span> &lt;span class="nb">open&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;results/fp8_full.json&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span> &lt;span class="k">as&lt;/span> &lt;span class="n">f&lt;/span>&lt;span class="p">:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">fp8&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">json&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">load&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">f&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">tasks&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="p">[&lt;/span>&lt;span class="s2">&amp;#34;mmlu&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="s2">&amp;#34;hellaswag&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="s2">&amp;#34;gsm8k&amp;#34;&lt;/span>&lt;span class="p">]&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="nb">print&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="sa">f&lt;/span>&lt;span class="s2">&amp;#34;&lt;/span>&lt;span class="si">{&lt;/span>&lt;span class="s1">&amp;#39;Task&amp;#39;&lt;/span>&lt;span class="si">:&lt;/span>&lt;span class="s2">&amp;lt;15&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s2"> &lt;/span>&lt;span class="si">{&lt;/span>&lt;span class="s1">&amp;#39;BF16&amp;#39;&lt;/span>&lt;span class="si">:&lt;/span>&lt;span class="s2">&amp;gt;8&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s2"> &lt;/span>&lt;span class="si">{&lt;/span>&lt;span class="s1">&amp;#39;FP8&amp;#39;&lt;/span>&lt;span class="si">:&lt;/span>&lt;span class="s2">&amp;gt;8&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s2"> &lt;/span>&lt;span class="si">{&lt;/span>&lt;span class="s1">&amp;#39;Delta&amp;#39;&lt;/span>&lt;span class="si">:&lt;/span>&lt;span class="s2">&amp;gt;8&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s2"> &lt;/span>&lt;span class="si">{&lt;/span>&lt;span class="s1">&amp;#39;OK?&amp;#39;&lt;/span>&lt;span class="si">:&lt;/span>&lt;span class="s2">&amp;gt;6&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s2">&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="nb">print&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;-&amp;#34;&lt;/span> &lt;span class="o">*&lt;/span> &lt;span class="mi">50&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="k">for&lt;/span> &lt;span class="n">task&lt;/span> &lt;span class="ow">in&lt;/span> &lt;span class="n">tasks&lt;/span>&lt;span class="p">:&lt;/span>
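&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="c1"># gsm8k reports exact_match rather than acc (key names as emitted by lm-eval 0.4.x)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">key&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="s2">&amp;#34;exact_match,strict-match&amp;#34;&lt;/span> &lt;span class="k">if&lt;/span> &lt;span class="n">task&lt;/span> &lt;span class="o">==&lt;/span> &lt;span class="s2">&amp;#34;gsm8k&amp;#34;&lt;/span> &lt;span class="k">else&lt;/span> &lt;span class="s2">&amp;#34;acc,none&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">b&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">baseline&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="s2">&amp;#34;results&amp;#34;&lt;/span>&lt;span class="p">][&lt;/span>&lt;span class="n">task&lt;/span>&lt;span class="p">][&lt;/span>&lt;span class="n">key&lt;/span>&lt;span class="p">]&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">f&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">fp8&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="s2">&amp;#34;results&amp;#34;&lt;/span>&lt;span class="p">][&lt;/span>&lt;span class="n">task&lt;/span>&lt;span class="p">][&lt;/span>&lt;span class="n">key&lt;/span>&lt;span class="p">]&lt;/span>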
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">delta&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="p">(&lt;/span>&lt;span class="n">f&lt;/span> &lt;span class="o">-&lt;/span> &lt;span class="n">b&lt;/span>&lt;span class="p">)&lt;/span> &lt;span class="o">/&lt;/span> &lt;span class="n">b&lt;/span> &lt;span class="o">*&lt;/span> &lt;span class="mi">100&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">ok&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="s2">&amp;#34;✓&amp;#34;&lt;/span> &lt;span class="k">if&lt;/span> &lt;span class="nb">abs&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">delta&lt;/span>&lt;span class="p">)&lt;/span> &lt;span class="o">&amp;lt;&lt;/span> &lt;span class="mf">1.0&lt;/span> &lt;span class="k">else&lt;/span> &lt;span class="s2">&amp;#34;✗ REVISAR&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nb">print&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="sa">f&lt;/span>&lt;span class="s2">&amp;#34;&lt;/span>&lt;span class="si">{&lt;/span>&lt;span class="n">task&lt;/span>&lt;span class="si">:&lt;/span>&lt;span class="s2">&amp;lt;15&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s2"> &lt;/span>&lt;span class="si">{&lt;/span>&lt;span class="n">b&lt;/span>&lt;span class="si">:&lt;/span>&lt;span class="s2">&amp;gt;8.3f&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s2"> &lt;/span>&lt;span class="si">{&lt;/span>&lt;span class="n">f&lt;/span>&lt;span class="si">:&lt;/span>&lt;span class="s2">&amp;gt;8.3f&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s2"> &lt;/span>&lt;span class="si">{&lt;/span>&lt;span class="n">delta&lt;/span>&lt;span class="si">:&lt;/span>&lt;span class="s2">&amp;gt;+7.1f&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s2">% &lt;/span>&lt;span class="si">{&lt;/span>&lt;span class="n">ok&lt;/span>&lt;span class="si">:&lt;/span>&lt;span class="s2">&amp;gt;6&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s2">&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Decision thresholds documented in MLPerf Inference 2025:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>&amp;lt; 0.5% degradation&lt;/strong>: enable in production without restrictions.&lt;/li>
&lt;li>&lt;strong>0.5% – 1.5%&lt;/strong>: enable with active quality monitoring via LLM-as-judge.&lt;/li>
&lt;li>&lt;strong>&amp;gt; 1.5%&lt;/strong>: investigate before enabling, since this points to a calibration problem or an incompatible model.&lt;/li>
&lt;/ul>
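&lt;p>These bands can be applied mechanically to the relative drop computed by the comparison script. A minimal sketch: the function name and the action labels are illustrative, and the cut-offs are the ones listed above.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">from bisect import bisect_right

# Cut-offs from the list above, in percent of relative degradation.
CUTOFFS = [0.5, 1.5]
ACTIONS = ["enable", "enable-with-monitoring", "investigate"]


def decide(baseline_score, fp8_score):
    """Map the relative quality drop (in %) to one of the three actions.

    An improvement counts as zero drop, and a drop that lands exactly on
    a cut-off is assigned to the stricter band.
    """
    drop_pct = max(0.0, (baseline_score - fp8_score) / baseline_score * 100)
    return ACTIONS[bisect_right(CUTOFFS, drop_pct)]


print(decide(0.820, 0.818))  # drop of about 0.24%
print(decide(0.820, 0.812))  # drop of about 0.98%
print(decide(0.820, 0.800))  # drop of about 2.44%
&lt;/code>&lt;/pre>&lt;/div>&lt;p>Applying the rule per task, rather than to an average across tasks, keeps a large drop on a reasoning benchmark from being hidden by stable scores elsewhere.&lt;/p>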
&lt;h3 id="paso-4-eval-de-dominio-con-llm-as-judge">Step 4: domain eval with LLM-as-judge&lt;/h3>
&lt;p>Academic benchmarks measure what they measure, and your use case may differ. Adding 200 representative samples from your own domain, scored by an LLM judge, closes the gap:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># domain_eval.py&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="kn">from&lt;/span> &lt;span class="nn">langfuse&lt;/span> &lt;span class="kn">import&lt;/span> &lt;span class="n">Langfuse&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="kn">from&lt;/span> &lt;span class="nn">openai&lt;/span> &lt;span class="kn">import&lt;/span> &lt;span class="n">OpenAI&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">client&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">Langfuse&lt;/span>&lt;span class="p">()&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">judge&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">OpenAI&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">base_url&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="s2">&amp;#34;http://judge-llm:8000/v1&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">api_key&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="s2">&amp;#34;token&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># Load the 200 curated production samples (prompt + expected answer); load_domain_samples and call_model are helpers assumed to be defined elsewhere&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">samples&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">load_domain_samples&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;eval_dataset_200.json&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">scores_bf16&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">scores_fp8&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="p">[],&lt;/span> &lt;span class="p">[]&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="k">for&lt;/span> &lt;span class="n">sample&lt;/span> &lt;span class="ow">in&lt;/span> &lt;span class="n">samples&lt;/span>&lt;span class="p">:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">for&lt;/span> &lt;span class="n">model_type&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">endpoint&lt;/span> &lt;span class="ow">in&lt;/span> &lt;span class="p">[(&lt;/span>&lt;span class="s2">&amp;#34;bf16&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="s2">&amp;#34;http://staging-bf16:8000&amp;#34;&lt;/span>&lt;span class="p">),&lt;/span> &lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;fp8&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="s2">&amp;#34;http://staging-fp8:8000&amp;#34;&lt;/span>&lt;span class="p">)]:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">response&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">call_model&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">endpoint&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">sample&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="s2">&amp;#34;prompt&amp;#34;&lt;/span>&lt;span class="p">])&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">score&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">judge&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">chat&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">completions&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">create&lt;/span>&lt;span class="p">(&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">model&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="s2">&amp;#34;Qwen/Qwen2.5-72B-Instruct&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">messages&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="p">[{&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="s2">&amp;#34;role&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;user&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="s2">&amp;#34;content&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="sa">f&lt;/span>&lt;span class="s2">&amp;#34;Evalúa esta respuesta del 1 al 5 según precisión y completitud.&lt;/span>&lt;span class="se">\n&lt;/span>&lt;span class="s2">Pregunta: &lt;/span>&lt;span class="si">{&lt;/span>&lt;span class="n">sample&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="s1">&amp;#39;prompt&amp;#39;&lt;/span>&lt;span class="p">]&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="se">\n&lt;/span>&lt;span class="s2">Respuesta esperada: &lt;/span>&lt;span class="si">{&lt;/span>&lt;span class="n">sample&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="s1">&amp;#39;expected&amp;#39;&lt;/span>&lt;span class="p">]&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="se">\n&lt;/span>&lt;span class="s2">Respuesta modelo: &lt;/span>&lt;span class="si">{&lt;/span>&lt;span class="n">response&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="se">\n\n&lt;/span>&lt;span class="s2">Responde solo con un número del 1 al 5.&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">}]&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">)&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">choices&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="mi">0&lt;/span>&lt;span class="p">]&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">message&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">content&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">strip&lt;/span>&lt;span class="p">()&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">if&lt;/span> &lt;span class="n">model_type&lt;/span> &lt;span class="o">==&lt;/span> &lt;span class="s2">&amp;#34;bf16&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">scores_bf16&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">append&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="nb">int&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">score&lt;/span>&lt;span class="p">))&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">else&lt;/span>&lt;span class="p">:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">scores_fp8&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">append&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="nb">int&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">score&lt;/span>&lt;span class="p">))&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="kn">import&lt;/span> &lt;span class="nn">numpy&lt;/span> &lt;span class="k">as&lt;/span> &lt;span class="nn">np&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="nb">print&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="sa">f&lt;/span>&lt;span class="s2">&amp;#34;Score medio BF16: &lt;/span>&lt;span class="si">{&lt;/span>&lt;span class="n">np&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">mean&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">scores_bf16&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="si">:&lt;/span>&lt;span class="s2">.2f&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s2">&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="nb">print&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="sa">f&lt;/span>&lt;span class="s2">&amp;#34;Score medio FP8: &lt;/span>&lt;span class="si">{&lt;/span>&lt;span class="n">np&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">mean&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">scores_fp8&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="si">:&lt;/span>&lt;span class="s2">.2f&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s2">&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="nb">print&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="sa">f&lt;/span>&lt;span class="s2">&amp;#34;Degradación: &lt;/span>&lt;span class="si">{&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">np&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">mean&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">scores_fp8&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="o">-&lt;/span>&lt;span class="n">np&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">mean&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">scores_bf16&lt;/span>&lt;span class="p">))&lt;/span>&lt;span class="o">/&lt;/span>&lt;span class="n">np&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">mean&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">scores_bf16&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="o">*&lt;/span>&lt;span class="mi">100&lt;/span>&lt;span class="si">:&lt;/span>&lt;span class="s2">.1f&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s2">%&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;hr>
&lt;h2 id="correlación-otel--langfuse-el-dashboard-que-decide">OTel + Langfuse correlation: the dashboard that decides&lt;/h2>
&lt;p>The decision rests on a single dashboard with two signals on the same time axis:&lt;/p>
&lt;p>&lt;strong>Signal 1, throughput (Prometheus):&lt;/strong>&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-promql" data-lang="promql">&lt;span class="line">&lt;span class="cl">&lt;span class="kr">rate&lt;/span>&lt;span class="o">(&lt;/span>&lt;span class="nv">vllm&lt;/span>&lt;span class="err">:&lt;/span>&lt;span class="nv">generation_tokens_total&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="s">5m&lt;/span>&lt;span class="p">]&lt;/span>&lt;span class="o">)&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>&lt;strong>Signal 2, mean quality (Langfuse → Prometheus via exporter):&lt;/strong>&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-promql" data-lang="promql">&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># If you have configured Langfuse to export scores via OTel&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="nv">langfuse_score_value&lt;/span>&lt;span class="p">{&lt;/span>&lt;span class="nl">name&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="p">&amp;#34;&lt;/span>&lt;span class="s">llm_judge_domain&lt;/span>&lt;span class="p">&amp;#34;}&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>The expected pattern after enabling FP8: throughput rises by 40-60% and quality stays within ±0.1 points of its previous level. If quality drops by more than 0.3 points and stays there, there is a real problem. The alert below fires earlier, at a sustained drop of 0.2 points, as a warning.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-yaml" data-lang="yaml">&lt;span class="line">&lt;span class="cl">&lt;span class="c"># Prometheus 2.x rule file. Alert: quality drops more than 0.2 points, sustained, after the change&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">groups:
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">  - name: fp8-quality  &lt;span class="c"># group name is arbitrary&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">    rules:
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">      - alert: FP8CalidadDegradada
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">        expr: |
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">          avg_over_time(langfuse_score_value{name=&amp;#34;llm_judge_domain&amp;#34;}[30m])
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">            &amp;lt; (avg_over_time(langfuse_score_value{name=&amp;#34;llm_judge_domain&amp;#34;}[1d] offset 2h) - 0.2)
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">        for: 15m
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">        labels:
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">          severity: warning
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">        annotations:
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">          summary: &amp;#34;Possible quality degradation after FP8 configuration change&amp;#34;
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;hr>
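&lt;p>For a staging check outside Prometheus, the comparison that the alert makes can be sketched in Python. This is a minimal sketch under assumptions, not part of the original setup: &lt;code>quality_degraded&lt;/code> and its parameters are hypothetical names, the two score lists stand in for the 30-minute and 1-day windows, and the 15-minute hold is not modelled.&lt;/p>

```python
from statistics import mean


def quality_degraded(recent_scores, baseline_scores, max_drop=0.2):
    """Return True when the recent mean is more than max_drop below the baseline mean.

    recent_scores and baseline_scores are hypothetical inputs: lists of judge
    scores from the recent window and from the baseline window.
    """
    if not recent_scores or not baseline_scores:
        raise ValueError("both windows need at least one score")
    # Same comparison as the alert expression: recent average vs baseline average minus a margin.
    return mean(baseline_scores) - mean(recent_scores) > max_drop


if __name__ == "__main__":
    # Placeholder scores for illustration only, not measured data.
    print(quality_degraded([4.0, 4.1], [4.4, 4.5]))
```

&lt;p>Running it on the eval-suite scores before promoting a configuration change gives the same yes/no answer the alert would give in production, without waiting for the alert window to fill.&lt;/p>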
&lt;h2 id="cuándo-no-activar-fp8">When NOT to enable FP8&lt;/h2>
&lt;p>FP8 is not always the right answer. These are the cases where degradation can exceed the acceptable threshold:&lt;/p>
&lt;p>&lt;strong>Formal mathematical reasoning:&lt;/strong> GSM8K and MATH are the benchmarks most sensitive to FP8. If your use case is solving math problems or precise financial calculation, measure on these benchmarks specifically before enabling it.&lt;/p>
&lt;p>&lt;strong>Critical code with tests:&lt;/strong> numerical precision affects token probabilities at key positions in a function. The risk is not that the code &amp;ldquo;looks&amp;rdquo; bad, but that it passes superficial tests while carrying subtle bugs.&lt;/p>
&lt;p>&lt;strong>Very long contexts without &lt;code>--calculate-kv-scales&lt;/code>:&lt;/strong> without dynamic scale calibration, the accumulated numerical error in the KV cache grows with context length. With &lt;code>--calculate-kv-scales&lt;/code> enabled, the impact is minimal up to 32K tokens.&lt;/p>
&lt;p>&lt;strong>Small models (&amp;lt;7B):&lt;/strong> the FP8 conversion overhead can outweigh the throughput gain. The break-even point is around 7B parameters.&lt;/p>
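&lt;p>The rule of measuring on the sensitive benchmarks before enabling FP8 can be sketched as a per-benchmark gate. This is an illustrative sketch under assumptions: &lt;code>fp8_gate&lt;/code>, the 0.5% default threshold, and the sample scores are hypothetical placeholders, not measured results.&lt;/p>

```python
def fp8_gate(scores, max_drop_pct=0.5):
    """Return the benchmarks whose relative drop from BF16 to FP8 exceeds max_drop_pct.

    scores maps a benchmark name to a (bf16_score, fp8_score) pair.
    The result maps each failing benchmark to its drop in percent.
    """
    failures = {}
    for name, (bf16, fp8) in scores.items():
        # Relative drop in percent; positive means FP8 scored lower than BF16.
        drop_pct = (bf16 - fp8) / bf16 * 100
        if drop_pct > max_drop_pct:
            failures[name] = round(drop_pct, 2)
    return failures


# Placeholder numbers for illustration only.
results = {
    "gsm8k": (80.0, 78.0),
    "domain_judge": (4.50, 4.49),
}

if __name__ == "__main__":
    print(fp8_gate(results))
```

&lt;p>An empty result means every benchmark stayed within the threshold. Any entry is a reason to keep that workload on BF16 or to investigate before rollout.&lt;/p>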
&lt;hr>
&lt;h2 id="ver-también">See also&lt;/h2>
&lt;ul>
&lt;li>https://blog.lo0.es/posts/quantization-fundamentos-inferencia/ - the math of FP8 E4M3: what the 4-bit exponent and the 3-bit mantissa are, and why this specific format was chosen over INT8&lt;/li>
&lt;li>https://blog.lo0.es/posts/kv-cache-fundamentos/ - the KV cache size formula: why moving to FP8 halves it exactly&lt;/li>
&lt;li>https://blog.lo0.es/posts/decode-optimizaciones-vllm/ - &lt;code>--kv-cache-dtype fp8&lt;/code> and &lt;code>--calculate-kv-scales&lt;/code> in the context of full decode tuning&lt;/li>
&lt;li>https://blog.lo0.es/posts/vllm-otel-instrumentacion-optimizaciones/ - how to set up the Langfuse + Prometheus correlation in a single dashboard for the FP8 before/after comparison&lt;/li>
&lt;li>https://blog.lo0.es/posts/evals-llm-la-capa-despues-de-tracing/ - the complete eval suite: how to build the 200-sample domain dataset and the LLM judge that verifies quality&lt;/li>
&lt;/ul>
&lt;h3 id="en-esta-misma-serie">In this series&lt;/h3>
&lt;ul>
&lt;li>https://blog.lo0.es/posts/batch-sizing-vllm-grid-search/ - the max-num-seqs × max-num-batched-tokens grid search: the free optimization with the biggest impact before touching quantization&lt;/li>
&lt;li>https://blog.lo0.es/posts/prefix-cache-hit-rate-engineering/ - prefix cache hit rate engineering: going from 15% to 75% without adding hardware&lt;/li>
&lt;li>https://blog.lo0.es/posts/tp-replicas-una-grande-vs-n-pequenas/ - TP=4×1 vs TP=2×2: the architectural decision that determines how to scale what FP8 frees up&lt;/li>
&lt;/ul>
&lt;hr>
&lt;h2 id="referencias">References&lt;/h2>
&lt;ul>
&lt;li>&lt;a href="https://vllm.ai/blog/2026-04-22-fp8-kvcache">The State of FP8 KV-Cache and Attention Quantization in vLLM, vLLM Blog (April 2026)&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://docs.vllm.ai/en/v0.8.5/features/quantization/fp8.html">FP8 W8A8, vLLM Documentation&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://mlcommons.org/benchmarks/inference-datacenter/">MLPerf Inference v5.1: FP8 quality results (Sep 2025)&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://ifactoryapp.com/sap-integration/on-prem-ai/fp4-vs-fp8-vs-fp16-llm-inference">FP4 vs FP8 vs FP16 LLM Inference: Quality and Speed Tradeoffs&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://www.spheron.network/blog/vllm-production-deployment-2026/">vLLM Production Deployment 2026: FP8 Docker Setup on H100&lt;/a>&lt;/li>
&lt;/ul></description></item></channel></rss>