<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Hpa on lo0 — Blog Técnico</title><link>https://blog.lo0.es/tags/hpa/</link><description>Recent content in Hpa on lo0 — Blog Técnico</description><generator>Hugo -- gohugo.io</generator><language>es</language><lastBuildDate>Mon, 01 Jun 2026 16:00:00 +0200</lastBuildDate><atom:link href="https://blog.lo0.es/tags/hpa/index.xml" rel="self" type="application/rss+xml"/><item><title>Autoscaling de inferencia LLM en Kubernetes: HPA con custom metrics y KEDA para vLLM</title><link>https://blog.lo0.es/posts/autoscaling-llm-kubernetes-keda/</link><pubDate>Mon, 01 Jun 2026 16:00:00 +0200</pubDate><guid>https://blog.lo0.es/posts/autoscaling-llm-kubernetes-keda/</guid><description>&lt;blockquote>
&lt;p>Este post complementa los de &lt;a href="https://blog.lo0.es/posts/observabilidad-gpu-dcgm-llm/">Observabilidad GPU para inferencia LLM&lt;/a> (de donde vienen las métricas que alimentan al HPA), &lt;a href="https://blog.lo0.es/posts/capacity-planning-inferencia-llm-on-premise/">Capacity planning&lt;/a> (qué techo y qué head-room presupone el autoscaler) y &lt;a href="https://blog.lo0.es/posts/continuous-batching-fundamentos/">Continuous batching&lt;/a> (lo que explica por qué &lt;code>num_requests_waiting&lt;/code> es la métrica primaria).&lt;/p>
&lt;/blockquote>
&lt;h2 id="tldr">TL;DR&lt;/h2>
&lt;p>El autoscaling clásico de Kubernetes —HPA sobre &lt;code>cpu&lt;/code> o &lt;code>memory&lt;/code>— &lt;strong>no sirve para inferencia LLM&lt;/strong>. Razón: el pod vLLM consume poco CPU (el trabajo lo hace la GPU) y la memoria RSS del proceso es plana; ambas métricas pueden quedarse al 30 % mientras la GPU está saturada y la cola de requests crece sin freno. Las cuatro señales viables que sí responden a la carga real son: &lt;strong>&lt;code>vllm:num_requests_waiting&lt;/code>&lt;/strong> (la cola, la métrica primaria), &lt;strong>&lt;code>vllm:gpu_cache_usage_perc&lt;/code>&lt;/strong> (presión sobre el KV cache pool), &lt;strong>TTFT P95&lt;/strong> vía histogram de &lt;code>vllm:time_to_first_token_seconds_bucket&lt;/code> (la garantía del SLO) y el &lt;strong>batch fill ratio&lt;/strong> &lt;code>num_requests_running / max_num_seqs&lt;/code> (utilización del techo de concurrencia). Para que un HPA pueda consumir métricas Prometheus hace falta un adaptador; en mayo 2026 hay dos opciones maduras: &lt;strong>&lt;code>prometheus-adapter&lt;/code>&lt;/strong> (sigma de cluster, configuración estática, output &lt;code>external.metrics.k8s.io&lt;/code>) y &lt;strong>KEDA&lt;/strong> (&lt;code>ScaledObject&lt;/code> con trigger Prometheus, polling configurable, escalado a cero opcional, integración con cron). KEDA es la opción dominante para LLM en cluster genérico porque resuelve el patrón &amp;ldquo;warm pool + cron + métrica del motor&amp;rdquo; en un solo CRD. El reto operacional dominante no es la lógica de escalado sino el &lt;strong>cold start&lt;/strong>: un pod vLLM con Llama 70B BF16 (140 GB) tarda entre &lt;strong>90 segundos&lt;/strong> (modelo precacheado en PV local) y &lt;strong>6 minutos&lt;/strong> (image pull + descarga del modelo desde object store) hasta servir el primer token. Las cinco palancas que lo recortan son imagen pre-pulled vía DaemonSet, modelo cacheado en PV o tmpfs regional, &lt;strong>warm pool&lt;/strong> con &lt;code>minReplicaCount &amp;gt; 0&lt;/code>, &lt;strong>predictive scaling&lt;/strong> vía KEDA cron cuando el patrón de tráfico es predecible (oficinas 9–18 h), y descarga paralela del modelo. Los tres pitfalls específicos del scale-down LLM: cortar conexiones SSE de streaming a media respuesta (drain elegante con &lt;code>terminationGracePeriodSeconds&lt;/code> ≥ 60 s), oscilación de scale-out/in por stabilization window mal calibrada, y olvidar que el HPA solo escala pods — los &lt;strong>nodos GPU&lt;/strong> se escalan con cluster-autoscaler sobre nodepools etiquetados. Este post incluye los manifests YAML mínimos.&lt;/p>
&lt;h2 id="estás-aquí-deploy">Estás aquí: DEPLOY&lt;/h2>
&lt;div class="diagram" style="max-width:780px;margin:1rem auto;">
&lt;svg viewBox="0 0 780 90" xmlns="http://www.w3.org/2000/svg" role="img" aria-label="estás aquí: Deploy">
&lt;style>.box{stroke:#444;stroke-width:1.4;rx:6}.active{fill:#7ad88f;stroke-width:3}.idle{fill:#f4f4f4}.lbl{font:600 12px sans-serif;fill:#222}.arr{stroke:#666;stroke-width:1.4;fill:none;marker-end:url(#asm)}.cyc{stroke:#888;stroke-width:1.2;fill:none;stroke-dasharray:4 2;marker-end:url(#asm)}&lt;/style>
&lt;defs>&lt;marker id="asm" viewBox="0 0 10 10" refX="9" refY="5" markerWidth="6" markerHeight="6" orient="auto">&lt;path d="M0,0 L10,5 L0,10 z" fill="#666"/>&lt;/marker>&lt;/defs>
&lt;text x="390" y="20" text-anchor="middle" class="lbl">Estás aquí: DEPLOY · autoscaling sobre las métricas que mide OBSERVE&lt;/text>
&lt;rect x="30" y="35" width="110" height="35" class="box idle"/>&lt;text x="85" y="58" text-anchor="middle" class="lbl">1 · Data&lt;/text>
&lt;rect x="155" y="35" width="110" height="35" class="box idle"/>&lt;text x="210" y="58" text-anchor="middle" class="lbl">2 · Tune&lt;/text>
&lt;rect x="280" y="35" width="110" height="35" class="box idle"/>&lt;text x="335" y="58" text-anchor="middle" class="lbl">3 · Eval&lt;/text>
&lt;rect x="405" y="35" width="110" height="35" class="box active"/>&lt;text x="460" y="58" text-anchor="middle" class="lbl">4 · Deploy&lt;/text>
&lt;rect x="530" y="35" width="110" height="35" class="box idle"/>&lt;text x="585" y="58" text-anchor="middle" class="lbl">5 · Observe&lt;/text>
&lt;rect x="655" y="35" width="110" height="35" class="box idle"/>&lt;text x="710" y="58" text-anchor="middle" class="lbl">6 · Retrain&lt;/text>
&lt;path class="arr" d="M140,52 L155,52"/>&lt;path class="arr" d="M265,52 L280,52"/>&lt;path class="arr" d="M390,52 L405,52"/>&lt;path class="arr" d="M515,52 L530,52"/>&lt;path class="arr" d="M640,52 L655,52"/>
&lt;path class="cyc" d="M710,72 L710,82 L85,82 L85,72"/>
&lt;/svg>
&lt;/div>
&lt;h2 id="la-analogía-la-panadería-con-hornos-de-leña">La analogía: la panadería con hornos de leña&lt;/h2>
&lt;p>Una panadería artesanal tiene tres hornos de leña. Cada horno tarda 25 minutos en alcanzar temperatura desde frío. Una vez caliente, hornea pan continuamente con una tirada de 18 minutos por hornada. La encargada quiere maximizar pan vendido por día sin gastar leña inútil, y sabe tres cosas: que hay un pico de demanda a las 7:30 cada mañana, que los lunes no se vende casi nada, y que cuando se acaba el pan en mostrador los clientes se van al supermercado de al lado.&lt;/p>
&lt;p>La estrategia barata —encender hornos cuando hay cola en la tienda— &lt;strong>no funciona&lt;/strong>. Para cuando la cola crece y la encargada enciende el segundo horno, ese horno no estará listo hasta 25 minutos después; los clientes de esa ventana se perdieron. La señal &amp;ldquo;cola en mostrador&amp;rdquo; llega tarde.&lt;/p>
&lt;p>La estrategia inteligente: encender el segundo horno a las 6:55, antes del pico previsible de las 7:30, y dejarlo activo hasta las 10:00 aunque la cola baje a las 8:15. Mantener el tercer horno apagado entre lunes y miércoles porque la demanda no llega; encenderlo proactivamente los jueves a las 12:00 porque históricamente sube. Tener una bolsa de masa cruda pre-fermentada en cámara para que cuando el horno esté listo, el pan entre en 30 segundos y no haya que esperar dos horas de fermentación.&lt;/p>
&lt;p>El autoscaling de un cluster de inferencia LLM funciona igual:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Encender hornos en frío&lt;/strong> = scale-out reactivo cuando la cola crece (lento, pierde clientes).&lt;/li>
&lt;li>&lt;strong>Cron proactivo&lt;/strong> = predictive scaling cuando el patrón es conocido (horario laboral, picos previstos).&lt;/li>
&lt;li>&lt;strong>Masa pre-fermentada&lt;/strong> = warm pool de réplicas con modelo cargado pero a 0 carga.&lt;/li>
&lt;li>&lt;strong>Apagar hornos sin pan en curso&lt;/strong> = scale-down respetando las streamings activas (no se cierra el horno con pan dentro).&lt;/li>
&lt;/ul>
&lt;p>La métrica clave —&amp;ldquo;cuántos clientes hay en cola&amp;rdquo;— se llama &lt;code>num_requests_waiting&lt;/code>. La métrica que dice &amp;ldquo;el horno se va a quedar sin masa para nuevos panes&amp;rdquo; se llama &lt;code>gpu_cache_usage_perc&lt;/code>. Y la métrica de calidad de servicio —&amp;ldquo;cuánto tarda el primer pan en salir cuando un cliente nuevo entra&amp;rdquo;— se llama TTFT.&lt;/p>
&lt;h2 id="por-qué-hpa-sobre-cpu-no-sirve">Por qué HPA sobre CPU no sirve&lt;/h2>
&lt;p>El HPA clásico de Kubernetes mira &lt;code>resource.cpu&lt;/code> del pod. Para un servicio HTTP convencional —Node.js, una API REST— la CPU se mueve linealmente con el tráfico y el HPA escala con razonable acierto. Para un pod vLLM o SGLang sobre GPU, la CPU del pod típicamente vive entre 5 % y 15 % independientemente de si la GPU está al 30 % o al 99 % de carga: el trabajo real lo hace el dispositivo, no el proceso. Resultado: el HPA basado en CPU &lt;strong>nunca&lt;/strong> dispara scale-out aunque la GPU esté reventando, y los clientes acumulan en la cola hasta que TTFT P95 cruza el SLO. El operador descubre el problema por la alerta de TTFT, no por el HPA.&lt;/p>
&lt;p>&lt;code>memory&lt;/code> tampoco sirve: la RSS del proceso vLLM es plana después del arranque (modelo + buffers cargados de una vez); no refleja la presión real sobre la GPU. Lo único que crece y baja con la carga útil de inferencia son métricas que el motor publica explícitamente: cola de requests, KV cache pool, latencias del SLO. Sin un adaptador que las haga visibles al HPA, el autoscaling es ciego.&lt;/p>
&lt;h2 id="las-cuatro-señales-viables">Las cuatro señales viables&lt;/h2>
&lt;div class="diagram" style="max-width:820px;margin:1.5rem auto;">
&lt;svg viewBox="0 0 820 320" xmlns="http://www.w3.org/2000/svg" role="img" aria-label="cuatro señales de autoscaling LLM">
&lt;style>.b{stroke:#333;stroke-width:1.4;rx:6}.p{fill:#dfe9f5;stroke:#356}.s{fill:#eef0d0;stroke:#7a3}.t{fill:#f4e3cf;stroke:#a63}.f{fill:#ead8f5;stroke:#634}.title{font:600 13px sans-serif;fill:#222}.h{font:700 12px sans-serif;fill:#222}.m{font:11px monospace;fill:#222}.n{font:italic 11px sans-serif;fill:#444}&lt;/style>
&lt;text x="410" y="20" text-anchor="middle" class="title">Las cuatro métricas que alimentan el HPA LLM&lt;/text>
&lt;rect x="30" y="40" width="370" height="120" class="b p"/>
&lt;text x="40" y="62" class="h">1 · COLA (PRIMARIA)&lt;/text>
&lt;text x="40" y="82" class="m">vllm:num_requests_waiting&lt;/text>
&lt;text x="40" y="105" class="n">¿Hay requests esperando entrar al batch?&lt;/text>
&lt;text x="40" y="125" class="n">Reacciona al instante. Robusta a cambios de modelo.&lt;/text>
&lt;text x="40" y="145" class="n">Umbral típico HPA: target = 5 (scale-out si &amp;gt; 5 sostenido).&lt;/text>
&lt;rect x="420" y="40" width="370" height="120" class="b s"/>
&lt;text x="430" y="62" class="h">2 · KV CACHE POOL&lt;/text>
&lt;text x="430" y="82" class="m">vllm:gpu_cache_usage_perc&lt;/text>
&lt;text x="430" y="105" class="n">¿Cuánta VRAM de KV cache se usa?&lt;/text>
&lt;text x="430" y="125" class="n">Predictiva: avisa antes de que la cola empiece.&lt;/text>
&lt;text x="430" y="145" class="n">Umbral típico: target = 0.85 (scale-out si &amp;gt; 0.85).&lt;/text>
&lt;rect x="30" y="170" width="370" height="120" class="b t"/>
&lt;text x="40" y="192" class="h">3 · TTFT P95 (SLO)&lt;/text>
&lt;text x="40" y="212" class="m">histogram_quantile(0.95,&lt;/text>
&lt;text x="40" y="226" class="m"> rate(vllm:time_to_first_token_seconds_bucket[5m]))&lt;/text>
&lt;text x="40" y="246" class="n">La garantía contractual al cliente.&lt;/text>
&lt;text x="40" y="266" class="n">Backup de las dos anteriores; reacciona tarde pero defiende SLO.&lt;/text>
&lt;rect x="420" y="170" width="370" height="120" class="b f"/>
&lt;text x="430" y="192" class="h">4 · BATCH FILL RATIO&lt;/text>
&lt;text x="430" y="212" class="m">vllm:num_requests_running&lt;/text>
&lt;text x="430" y="226" class="m"> / max_num_seqs (config)&lt;/text>
&lt;text x="430" y="246" class="n">Utilización del techo de concurrencia del motor.&lt;/text>
&lt;text x="430" y="266" class="n">Útil para scale-down: si ratio &amp;lt; 0.4 sostenido, sobra réplica.&lt;/text>
&lt;text x="410" y="310" text-anchor="middle" class="n">Política recomendada: cola como primaria, KV cache como secundaria, TTFT como guardrail&lt;/text>
&lt;/svg>
&lt;/div>
&lt;p>&lt;strong>Señal 1 — &lt;code>vllm:num_requests_waiting&lt;/code> (cola).&lt;/strong> Es la métrica más directa: cuántas requests esperan entrar al batch. Reacciona en el instante en que la concurrencia objetivo se satura. Es robusta frente a cambios de modelo (el número de requests es el mismo concepto sea Llama 7B o 70B). Es la &lt;strong>métrica primaria&lt;/strong> del HPA LLM. Umbral típico: &lt;code>target = 5&lt;/code> requests waiting de media; si la cola crece por encima de 5 sostenido durante 2 minutos, scale-out.&lt;/p>
&lt;p>&lt;strong>Señal 2 — &lt;code>vllm:gpu_cache_usage_perc&lt;/code> (KV pool).&lt;/strong> Se mueve &lt;strong>antes&lt;/strong> que la cola: el KV pool se va llenando mientras los slots del batch aún están libres, hasta que el motor empieza a rechazar nuevas requests por OOM-prevention y se forma la cola. Por tanto es &lt;strong>predictiva&lt;/strong>: dispara scale-out antes de que el cliente note degradación. Umbral típico: &lt;code>target = 0.85&lt;/code> (85 % de pool usado).&lt;/p>
&lt;p>&lt;strong>Señal 3 — TTFT P95.&lt;/strong> La garantía contractual. Si TTFT P95 sale del SLO, scale-out aunque cola y KV pool parezcan razonables (puede haber un pico de prompts largos). Es &lt;strong>reactiva&lt;/strong> —sale del SLO antes de que tu HPA reaccione— pero sirve de guardrail final.&lt;/p>
&lt;p>&lt;strong>Señal 4 — batch fill ratio.&lt;/strong> El cociente &lt;code>num_requests_running / max_num_seqs&lt;/code> (este último es config del motor, no métrica). Útil para &lt;strong>scale-down&lt;/strong>: si el ratio queda por debajo de 0.4 durante 10 minutos, sobra capacidad y se puede reducir réplicas con seguridad.&lt;/p>
&lt;p>La política recomendada combina las cuatro: la cola y el KV pool disparan scale-out (lo que llegue antes), TTFT lo confirma como guardrail, y el batch fill ratio gestiona scale-down. Implementarlo en un único HPA exige métricas externas; KEDA hace esto manejable.&lt;/p>
&lt;h2 id="el-cableado-keda-como-adaptador-prometheus">El cableado: KEDA como adaptador Prometheus&lt;/h2>
&lt;p>KEDA introduce dos CRDs principales: &lt;code>TriggerAuthentication&lt;/code> (cómo autenticarse contra la fuente) y &lt;code>ScaledObject&lt;/code> (qué deployment escalar con qué triggers). Para un deployment vLLM con Prometheus como fuente:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-yaml" data-lang="yaml">&lt;span class="line">&lt;span class="cl">&lt;span class="nt">apiVersion&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">keda.sh/v1alpha1&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="nt">kind&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">ScaledObject&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="nt">metadata&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">name&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">vllm-llama70b-scaler&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">namespace&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">inference&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="nt">spec&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">scaleTargetRef&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">name&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">vllm-llama70b&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">minReplicaCount&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="m">2&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="c"># warm pool&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">maxReplicaCount&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="m">20&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">pollingInterval&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="m">15&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">cooldownPeriod&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="m">300&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="c"># 5 min antes de scale-down&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">advanced&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">horizontalPodAutoscalerConfig&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">behavior&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">scaleDown&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">stabilizationWindowSeconds&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="m">600&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="c"># ventana grande para evitar oscilación&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">policies&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>- &lt;span class="nt">type&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">Pods&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">value&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="m">1&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">periodSeconds&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="m">120&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">scaleUp&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">stabilizationWindowSeconds&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="m">30&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">policies&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>- &lt;span class="nt">type&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">Pods&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">value&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="m">2&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">periodSeconds&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="m">60&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">triggers&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>- &lt;span class="nt">type&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">prometheus&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">metadata&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">serverAddress&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">http://prometheus.observability.svc:9090&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">metricName&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">vllm_queue_depth&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">threshold&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s2">&amp;#34;5&amp;#34;&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">query&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="p">|&lt;/span>&lt;span class="sd">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="sd"> avg(vllm:num_requests_waiting{deployment=&amp;#34;vllm-llama70b&amp;#34;})&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>- &lt;span class="nt">type&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">prometheus&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">metadata&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">serverAddress&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">http://prometheus.observability.svc:9090&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">metricName&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">vllm_kv_cache&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">threshold&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s2">&amp;#34;0.85&amp;#34;&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">query&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="p">|&lt;/span>&lt;span class="sd">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="sd"> avg(vllm:gpu_cache_usage_perc{deployment=&amp;#34;vllm-llama70b&amp;#34;})&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>- &lt;span class="nt">type&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">prometheus&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">metadata&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">serverAddress&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">http://prometheus.observability.svc:9090&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">metricName&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">vllm_ttft_p95&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">threshold&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s2">&amp;#34;1.5&amp;#34;&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">query&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="p">|&lt;/span>&lt;span class="sd">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="sd"> histogram_quantile(0.95,
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="sd"> sum by(le)(rate(vllm:time_to_first_token_seconds_bucket{deployment=&amp;#34;vllm-llama70b&amp;#34;}[5m])))&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Tres detalles operativos no obvios:&lt;/p>
&lt;p>&lt;strong>&lt;code>minReplicaCount: 2&lt;/code>.&lt;/strong> Es el warm pool. Mantener al menos dos réplicas garantiza disponibilidad ante pérdida de un nodo y absorbe spikes sin esperar al cold start del primer escalado. Bajarlo a 0 ahorra GPU en off-peak pero introduce 90 s–6 min de latencia al primer cliente nuevo.&lt;/p>
&lt;p>&lt;strong>&lt;code>stabilizationWindowSeconds: 600&lt;/code> en scale-down.&lt;/strong> Diez minutos. Los modelos no son &lt;code>nginx&lt;/code>: si una réplica se cierra prematuramente y a los dos minutos hay otro pico, el cold start de un nuevo pod tarda lo que el cliente espera. Mejor mantener réplicas extra el doble de lo que mantendrías para un servicio web normal.&lt;/p>
&lt;p>&lt;strong>&lt;code>scaleUp: stabilizationWindowSeconds: 30&lt;/code>.&lt;/strong> Treinta segundos. El scale-out tiene que ser rápido — el cold start del nuevo pod añade su propio retraso, y si encima el HPA espera otros minutos antes de disparar, el SLO ya está roto.&lt;/p>
&lt;h2 id="el-gran-problema-operativo-cold-start">El gran problema operativo: cold start&lt;/h2>
&lt;p>Un pod vLLM cargando Llama 70B pasa por estas fases antes de servir el primer token:&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Fase&lt;/th>
&lt;th>Tiempo típico&lt;/th>
&lt;th>Acelerable con&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>Image pull (4–6 GB)&lt;/td>
&lt;td>30–90 s&lt;/td>
&lt;td>DaemonSet pre-pull&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Descarga del modelo (140 GB BF16)&lt;/td>
&lt;td>60–300 s&lt;/td>
&lt;td>PV regional cacheado, S3 + multi-thread&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Carga del modelo a HBM&lt;/td>
&lt;td>30–90 s&lt;/td>
&lt;td>tmpfs o NVMe local&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Capture de CUDA graphs&lt;/td>
&lt;td>20–60 s&lt;/td>
&lt;td>&lt;code>--enforce-eager&lt;/code> (más lento en runtime pero arranque rápido)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Warmup de PagedAttention&lt;/td>
&lt;td>5–15 s&lt;/td>
&lt;td>—&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Health check ready&lt;/td>
&lt;td>10–30 s&lt;/td>
&lt;td>tuning de probe&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>&lt;strong>Total sin optimización: 4–10 minutos.&lt;/strong> Una réplica nueva tarda eso en absorber tráfico. Con todas las palancas combinadas: &lt;strong>45–90 segundos&lt;/strong>. La diferencia entre los dos números es el principal trabajo de plataforma para autoscaling LLM.&lt;/p>
&lt;h3 id="las-cinco-palancas">Las cinco palancas&lt;/h3>
&lt;p>&lt;strong>Palanca 1 — imagen pre-pulled.&lt;/strong> Un DaemonSet trivial corre &lt;code>ctr image pull&lt;/code> (o &lt;code>crictl pull&lt;/code>) sobre los nodos GPU en cuanto se incorporan al cluster. La imagen del motor de inferencia queda en disco; los nuevos pods saltan los 30–90 s de pull. Coste: ~6 GB de disco por nodo.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-yaml" data-lang="yaml">&lt;span class="line">&lt;span class="cl">&lt;span class="nt">apiVersion&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">apps/v1&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="nt">kind&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">DaemonSet&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="nt">metadata&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>{&lt;span class="w"> &lt;/span>&lt;span class="nt">name&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">vllm-image-warmer }&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="nt">spec&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">selector&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>{&lt;span class="w"> &lt;/span>&lt;span class="nt">matchLabels&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>{&lt;span class="w"> &lt;/span>&lt;span class="nt">app&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">vllm-warmer } }&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">template&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">metadata&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>{&lt;span class="w"> &lt;/span>&lt;span class="nt">labels&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>{&lt;span class="w"> &lt;/span>&lt;span class="nt">app&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">vllm-warmer } }&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">spec&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">nodeSelector&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>{&lt;span class="w"> &lt;/span>&lt;span class="nt">workload&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">gpu }&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">initContainers&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>- &lt;span class="nt">name&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">pull&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">image&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">vllm/vllm-openai:v0.10.0&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">command&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="s2">&amp;#34;/bin/true&amp;#34;&lt;/span>&lt;span class="p">]&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">containers&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>- &lt;span class="nt">name&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">pause&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">image&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">registry.k8s.io/pause:3.10&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>&lt;strong>Palanca 2 — modelo en PV regional.&lt;/strong> El download del modelo (140 GB BF16 o 35 GB FP8) desde object storage central es el componente dominante del cold start. Cachear el modelo en un &lt;strong>PV de zona/rack&lt;/strong> —Rook-Ceph RBD, o NVMe local provisionado por el operador— recorta 60–300 s a 5–15 s. El antipatrón: descargar el modelo en cada arranque desde S3 externo.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-yaml" data-lang="yaml">&lt;span class="line">&lt;span class="cl">&lt;span class="nt">volumeMounts&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>- &lt;span class="nt">name&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">model-cache&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">mountPath&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">/models&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">readOnly&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="kc">true&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="nt">volumes&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>- &lt;span class="nt">name&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">model-cache&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">persistentVolumeClaim&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">claimName&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">llama70b-fp8-pvc &lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="c"># RWX shared, llenado offline&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>&lt;strong>Palanca 3 — warm pool.&lt;/strong> &lt;code>minReplicaCount &amp;gt; 0&lt;/code> mantiene réplicas pre-cargadas en idle. El coste es GPU ociosa; el beneficio es 0 s de cold start para el primer cliente de un pico. Para clusters productivos con tráfico continuo: warm pool de 2–3 réplicas. Para clusters batch nocturnos con tráfico 0: warm pool 0 y aceptar el cold start, o KEDA con cron que pre-encienda 10 minutos antes.&lt;/p>
&lt;p>&lt;strong>Palanca 4 — predictive scaling con cron.&lt;/strong> Cuando el patrón es predecible (oficinas 9–18 h):&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-yaml" data-lang="yaml">&lt;span class="line">&lt;span class="cl">&lt;span class="nt">triggers&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>- &lt;span class="nt">type&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">cron&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">metadata&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">timezone&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">Europe/Madrid&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">start&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s2">&amp;#34;30 8 * * 1-5&amp;#34;&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="c"># 8:30 lunes–viernes&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">end&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s2">&amp;#34;0 19 * * 1-5&amp;#34;&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="c"># 19:00&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">desiredReplicas&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s2">&amp;#34;6&amp;#34;&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Combinado con triggers reactivos. El HPA escala según el máximo de las señales: si la cron pide 6 y la cola pide 10, el resultado es 10.&lt;/p>
&lt;p>&lt;strong>Palanca 5 — descarga paralela y formato eficiente.&lt;/strong> Para PVs no pre-cargados, herramientas como &lt;code>nvidia-modelmanager&lt;/code>, &lt;code>s5cmd&lt;/code> o &lt;code>aria2c&lt;/code> paralelizan la descarga del modelo. Pasar de descarga serial (~150 MB/s) a paralela 8 threads (~1.2 GB/s) divide entre 8 el tiempo. Y formatos como &lt;strong>safetensors&lt;/strong> se cargan en HBM más rápido que &lt;strong>PyTorch pickle&lt;/strong> original.&lt;/p>
&lt;h2 id="cuándo-escalar-nodos-no-solo-pods">Cuándo escalar nodos, no solo pods&lt;/h2>
&lt;p>El HPA escala &lt;strong>pods&lt;/strong>. Si el cluster no tiene nodos GPU libres, el nuevo pod se queda en &lt;code>Pending&lt;/code> por falta de recursos. Para escalar &lt;strong>nodos&lt;/strong>, hace falta cluster-autoscaler con un nodepool GPU específico, etiquetado:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-yaml" data-lang="yaml">&lt;span class="line">&lt;span class="cl">&lt;span class="c"># nodepool config (Karpenter o cluster-autoscaler equivalent)&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="nt">labels&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">workload&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">gpu&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">gpu-model&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">h100-sxm-80gb&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="nt">taints&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>- &lt;span class="nt">key&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">nvidia.com/gpu&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">effect&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">NoSchedule&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="nt">limits&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">min&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="m">2&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">nodes&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">max&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="m">8&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">nodes&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Sin esto, el HPA puede pedir 10 réplicas pero el cluster solo entrega las que caben en nodos ya levantados. El cold start de un nodo nuevo (provisioning bare metal o cloud, PXE, OS boot, drivers NVIDIA, join del cluster) es &lt;strong>mucho&lt;/strong> mayor que el cold start de un pod: típicamente 5–15 minutos en bare metal preconfigurado, 30–60 minutos en provisioning real. Para clusters on-premise, el nodepool debe estar &lt;strong>siempre dimensionado al máximo previsto&lt;/strong>, y el &amp;ldquo;scaling&amp;rdquo; es solo del lado de pods. El concepto de scale-out reactivo de nodos solo aplica a clouds; en on-premise hay que comprar para el pico.&lt;/p>
&lt;h2 id="tres-pitfalls-específicos-del-scale-down-llm">Tres pitfalls específicos del scale-down LLM&lt;/h2>
&lt;p>&lt;strong>Pitfall 1 — cortar conexiones SSE de streaming.&lt;/strong> Cuando una réplica entra en &lt;code>Terminating&lt;/code>, Kubernetes envía SIGTERM al pod y, por defecto, lo mata 30 segundos después. Para vLLM eso significa &lt;strong>cortar conexiones SSE de streaming a la mitad de la respuesta&lt;/strong>. El cliente recibe un error 502 con el output parcial perdido. Solución: &lt;code>terminationGracePeriodSeconds: 120&lt;/code> + un preStop hook que avise al motor de no aceptar nuevas requests pero terminar las en curso:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-yaml" data-lang="yaml">&lt;span class="line">&lt;span class="cl">&lt;span class="nt">spec&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">terminationGracePeriodSeconds&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="m">120&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">containers&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>- &lt;span class="nt">name&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">vllm&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">lifecycle&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">preStop&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">httpGet&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">path&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">/shutdown&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">port&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="m">8000&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Esto requiere que el motor exponga un endpoint de shutdown elegante; vLLM v1 lo soporta vía &lt;code>--enable-graceful-shutdown&lt;/code>. Sin esto, el scale-down rompe SLO aunque las métricas no lo capturen (las requests cortadas no entran al histograma de TTFT).&lt;/p>
&lt;p>&lt;strong>Pitfall 2 — oscilación scale-up/scale-down.&lt;/strong> Si la &lt;code>stabilizationWindowSeconds&lt;/code> del scale-down es corta (~60 s default), la siguiente bajada de cola dispara scale-down, y dos minutos después el siguiente pico dispara scale-up. El sistema oscila, paga cold starts repetidos, y nunca alcanza un régimen estable. Solución: scale-down con ventana de 10 minutos como mínimo y políticas conservadoras (&lt;code>type: Pods, value: 1, periodSeconds: 120&lt;/code> — máximo una réplica menos cada 2 minutos).&lt;/p>
&lt;p>&lt;strong>Pitfall 3 — &lt;code>vllm:num_requests_waiting&lt;/code> con &lt;code>avg&lt;/code> cuando hay rebalanceo.&lt;/strong> Si dos réplicas están desbalanceadas (una con cola 20, otra con cola 0), &lt;code>avg&lt;/code> da 10 — el HPA dispara scale-out cuando lo correcto sería rebalancear vía el load balancer. Para detectarlo: añadir una alerta sobre &lt;code>stddev(vllm:num_requests_waiting)&lt;/code> por deployment. Si la dispersión es alta, el problema no es de capacidad sino de routing.&lt;/p>
&lt;h2 id="manifest-completo-de-ejemplo">Manifest completo de ejemplo&lt;/h2>
&lt;p>Para un deployment vLLM con Llama 70B FP8 en 4×H100 SXM por réplica, KEDA con warm pool 2:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-yaml" data-lang="yaml">&lt;span class="line">&lt;span class="cl">&lt;span class="nt">apiVersion&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">apps/v1&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="nt">kind&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">Deployment&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="nt">metadata&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">name&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">vllm-llama70b&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">namespace&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">inference&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="nt">spec&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">replicas&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="m">2&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="c"># gestionado por KEDA después&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">selector&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>{&lt;span class="w"> &lt;/span>&lt;span class="nt">matchLabels&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>{&lt;span class="w"> &lt;/span>&lt;span class="nt">app&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">vllm-llama70b } }&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">template&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">metadata&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">labels&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>{&lt;span class="w"> &lt;/span>&lt;span class="nt">app: vllm-llama70b, deployment&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">vllm-llama70b }&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">spec&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">terminationGracePeriodSeconds&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="m">120&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">nodeSelector&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>{&lt;span class="w"> &lt;/span>&lt;span class="nt">workload: gpu, gpu-model&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">h100-sxm-80gb }&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">tolerations&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>- &lt;span class="nt">key&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">nvidia.com/gpu&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">operator&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">Exists&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">effect&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">NoSchedule&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">containers&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>- &lt;span class="nt">name&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">vllm&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">image&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">vllm/vllm-openai:v0.10.0&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">args&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>- --&lt;span class="l">model=/models/llama-3.3-70b-fp8&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>- --&lt;span class="l">tensor-parallel-size=4&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>- --&lt;span class="l">max-num-seqs=64&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>- --&lt;span class="l">enable-prefix-caching&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>- --&lt;span class="l">enable-graceful-shutdown&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">ports&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>- {&lt;span class="w"> &lt;/span>&lt;span class="nt">name: http, containerPort&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="m">8000&lt;/span>&lt;span class="w"> &lt;/span>}&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>- {&lt;span class="w"> &lt;/span>&lt;span class="nt">name: metrics, containerPort&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="m">8000&lt;/span>&lt;span class="w"> &lt;/span>}&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">resources&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">limits&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">nvidia.com/gpu&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s2">&amp;#34;4&amp;#34;&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">memory&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">200Gi&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">readinessProbe&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">httpGet&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>{&lt;span class="w"> &lt;/span>&lt;span class="nt">path: /health, port&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="m">8000&lt;/span>&lt;span class="w"> &lt;/span>}&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">initialDelaySeconds&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="m">60&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">periodSeconds&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="m">10&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">failureThreshold&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="m">30&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="c"># tolera el warmup&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">lifecycle&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">preStop&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">httpGet&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>{&lt;span class="w"> &lt;/span>&lt;span class="nt">path: /shutdown, port&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="m">8000&lt;/span>&lt;span class="w"> &lt;/span>}&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">volumeMounts&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>- {&lt;span class="w"> &lt;/span>&lt;span class="nt">name: model-cache, mountPath: /models, readOnly&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="kc">true&lt;/span>&lt;span class="w"> &lt;/span>}&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>- {&lt;span class="w"> &lt;/span>&lt;span class="nt">name: dshm, mountPath&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">/dev/shm }&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">volumes&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>- &lt;span class="nt">name&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">model-cache&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">persistentVolumeClaim&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>{&lt;span class="w"> &lt;/span>&lt;span class="nt">claimName&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">llama70b-fp8-pvc }&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>- &lt;span class="nt">name&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">dshm&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">emptyDir&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>{&lt;span class="w"> &lt;/span>&lt;span class="nt">medium: Memory, sizeLimit&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">16Gi }&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="nn">---&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="nt">apiVersion&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">monitoring.coreos.com/v1&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="nt">kind&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">PodMonitor&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="nt">metadata&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>{&lt;span class="w"> &lt;/span>&lt;span class="nt">name: vllm-llama70b-metrics, namespace&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">inference }&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="nt">spec&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">selector&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>{&lt;span class="w"> &lt;/span>&lt;span class="nt">matchLabels&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>{&lt;span class="w"> &lt;/span>&lt;span class="nt">app&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">vllm-llama70b } }&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">podMetricsEndpoints&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>- &lt;span class="nt">port&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">metrics&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">path&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">/metrics&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">interval&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">15s&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="nn">---&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="nt">apiVersion&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">keda.sh/v1alpha1&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="nt">kind&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">ScaledObject&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="nt">metadata&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>{&lt;span class="w"> &lt;/span>&lt;span class="nt">name: vllm-llama70b-scaler, namespace&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">inference }&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="nt">spec&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">scaleTargetRef&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>{&lt;span class="w"> &lt;/span>&lt;span class="nt">name&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">vllm-llama70b }&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">minReplicaCount&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="m">2&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">maxReplicaCount&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="m">20&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">pollingInterval&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="m">15&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">cooldownPeriod&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="m">300&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">advanced&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">horizontalPodAutoscalerConfig&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">behavior&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">scaleDown&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">stabilizationWindowSeconds&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="m">600&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">policies&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>- {&lt;span class="w"> &lt;/span>&lt;span class="nt">type: Pods, value: 1, periodSeconds&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="m">120&lt;/span>&lt;span class="w"> &lt;/span>}&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">scaleUp&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">stabilizationWindowSeconds&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="m">30&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">policies&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>- {&lt;span class="w"> &lt;/span>&lt;span class="nt">type: Pods, value: 2, periodSeconds&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="m">60&lt;/span>&lt;span class="w"> &lt;/span>}&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">triggers&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>- &lt;span class="nt">type&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">prometheus&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">metadata&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">serverAddress&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">http://prometheus.observability.svc:9090&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">metricName&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">vllm_queue&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">threshold&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s2">&amp;#34;5&amp;#34;&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">query&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">avg(vllm:num_requests_waiting{deployment=&amp;#34;vllm-llama70b&amp;#34;})&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>- &lt;span class="nt">type&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">prometheus&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">metadata&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">serverAddress&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">http://prometheus.observability.svc:9090&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">metricName&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">vllm_kv&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">threshold&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s2">&amp;#34;0.85&amp;#34;&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">query&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">avg(vllm:gpu_cache_usage_perc{deployment=&amp;#34;vllm-llama70b&amp;#34;})&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>- &lt;span class="nt">type&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">cron&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">metadata&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">timezone&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">Europe/Madrid&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">start&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s2">&amp;#34;30 8 * * 1-5&amp;#34;&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">end&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s2">&amp;#34;0 19 * * 1-5&amp;#34;&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">desiredReplicas&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s2">&amp;#34;6&amp;#34;&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Este conjunto es el mínimo viable para autoscaling LLM en cluster genérico con NVIDIA GPU Operator. Cada equipo lo adapta a su SLO concreto.&lt;/p>
&lt;h2 id="aplicado-a-hardware-on-premise-típico">Aplicado a hardware on-premise típico&lt;/h2>
&lt;p>Para un cluster genérico de &lt;strong>4×H100 SXM 80 GB por nodo, 4 nodos GPU&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>Cada nodo aloja una réplica vLLM TP=4 con Llama 70B FP8 (un modelo por nodo, no se comparten).&lt;/li>
&lt;li>Warm pool de 2 réplicas en off-peak; KEDA cron eleva a 4 en horario laboral.&lt;/li>
&lt;li>Cluster-autoscaler &lt;strong>no aplica&lt;/strong> (4 nodos físicos comprados; el escalado es solo de pods). El número de réplicas concurrentes es como máximo el número de nodos disponibles (si cada réplica usa los 4 GPUs del nodo entero).&lt;/li>
&lt;li>Si el dimensionamiento requiere más réplicas simultáneas que nodos, hay dos vías: (a) bajar el TP de cada réplica para que entren dos por nodo, (b) ampliar el nodepool físico. La decisión la dicta el capacity planning —ver &lt;a href="https://blog.lo0.es/posts/capacity-planning-inferencia-llm-on-premise/">Capacity planning para inferencia LLM on-premise&lt;/a>—.&lt;/li>
&lt;/ul>
&lt;p>Volumen de eventos KEDA: ~5 evaluations/min por ScaledObject. Para 10 modelos servidos en paralelo, 3 000 evaluations/h. Manejable con un KEDA operator por cluster.&lt;/p>
&lt;h2 id="lo-que-no-hemos-cubierto-próximos-artículos">Lo que no hemos cubierto (próximos artículos)&lt;/h2>
&lt;ul>
&lt;li>&lt;strong>Cluster-autoscaler para nodos GPU on-premise&lt;/strong>: cómo orquestar provisioning bare metal (Tinkerbell, Metal³) en función de demanda.&lt;/li>
&lt;li>&lt;strong>Multi-cluster autoscaling&lt;/strong>: escalar entre clusters de DCs distintos para resiliencia geográfica.&lt;/li>
&lt;li>&lt;strong>Cost-aware autoscaling&lt;/strong>: priorizar nodos según coste energético horario (en clusters con tarifa indexada).&lt;/li>
&lt;li>&lt;strong>Predictive ML-based scaling&lt;/strong>: en lugar de cron estático, entrenar un modelo que prediga demanda con 30 minutos de antelación.&lt;/li>
&lt;li>&lt;strong>Quotas y fairness multi-tenant&lt;/strong>: KEDA con namespace quotas para que un tenant no acapare el HPA.&lt;/li>
&lt;/ul>
&lt;h2 id="ver-también">Ver también&lt;/h2>
&lt;ul>
&lt;li>&lt;a href="https://blog.lo0.es/posts/observabilidad-gpu-dcgm-llm/">Observabilidad GPU para inferencia LLM&lt;/a> — fuente de las métricas que alimentan al HPA.&lt;/li>
&lt;li>&lt;a href="https://blog.lo0.es/posts/capacity-planning-inferencia-llm-on-premise/">Capacity planning para inferencia LLM on-premise&lt;/a> — qué techo y qué head-room presupone el autoscaler.&lt;/li>
&lt;li>&lt;a href="https://blog.lo0.es/posts/continuous-batching-fundamentos/">Continuous batching&lt;/a> — explica &lt;code>num_requests_running&lt;/code>, &lt;code>num_requests_waiting&lt;/code> y &lt;code>gpu_cache_usage_perc&lt;/code>.&lt;/li>
&lt;li>&lt;a href="https://blog.lo0.es/posts/kv-cache-fundamentos/">KV cache&lt;/a> — domina el KV pool y por tanto los thresholds.&lt;/li>
&lt;li>&lt;a href="https://blog.lo0.es/posts/cinco-niveles-madurez-plataforma-llm-on-premise/">Cinco niveles de madurez&lt;/a> — KEDA es pieza del nivel 4.&lt;/li>
&lt;li>&lt;a href="https://blog.lo0.es/posts/canary-blue-green-shadow-modelos-llm/">Canary, blue-green y shadow&lt;/a> — el autoscaler convive con la estrategia de despliegue.&lt;/li>
&lt;li>&lt;a href="https://blog.lo0.es/posts/router-inferencia-llm-gateway-l7/">El router de inferencia LLM&lt;/a> — el router consume &lt;code>vllm:num_requests_running&lt;/code> y &lt;code>vllm:gpu_cache_usage_perc&lt;/code> (mismas métricas que el autoscaler) para decidir réplica con token-aware LB y prefix-aware routing; los dos componentes comparten cabina pero deciden cosas distintas.&lt;/li>
&lt;li>&lt;a href="https://blog.lo0.es/posts/runbooks-incident-response-llm-keep-kafka/">Runbooks de incident response para LLM con Keep + Kafka&lt;/a> — los runbooks RB-01 (&lt;code>GpuHbmNearOom&lt;/code>) y RB-05 (&lt;code>VllmKvCachePoolNearFull&lt;/code>) usan el autoscaler como palanca de mitigación inmediata.&lt;/li>
&lt;/ul>
&lt;h2 id="referencias">Referencias&lt;/h2>
&lt;ul>
&lt;li>KEDA project — &lt;code>keda.sh&lt;/code> (documentación oficial de triggers Prometheus y cron).&lt;/li>
&lt;li>Kubernetes — &lt;em>Horizontal Pod Autoscaler walkthrough&lt;/em> (kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale).&lt;/li>
&lt;li>NVIDIA — &lt;em>GPU Operator on Kubernetes&lt;/em> (Helm chart oficial con DaemonSet de drivers y DCGM).&lt;/li>
&lt;li>vLLM project — &lt;code>production_monitoring/&lt;/code> (métricas Prometheus expuestas por el servidor).&lt;/li>
&lt;li>Karpenter — &lt;em>NodePool spec&lt;/em> (etiquetado y taints para nodepools GPU).&lt;/li>
&lt;li>Cluster Autoscaler — &lt;em>Scaling GPU nodes&lt;/em> (caveats de descubrimiento de recursos GPU).&lt;/li>
&lt;li>Kubernetes — &lt;em>Pod lifecycle and termination&lt;/em> (preStop, terminationGracePeriodSeconds).&lt;/li>
&lt;/ul></description></item></channel></rss>