<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Volcano on lo0 — Blog Técnico</title><link>https://blog.lo0.es/tags/volcano/</link><description>Recent content in Volcano on lo0 — Blog Técnico</description><generator>Hugo -- gohugo.io</generator><language>es</language><lastBuildDate>Tue, 16 Jun 2026 13:00:00 +0200</lastBuildDate><atom:link href="https://blog.lo0.es/tags/volcano/index.xml" rel="self" type="application/rss+xml"/><item><title>Volcano and Kueue: gang scheduling, queues, and GPU quotas for distributed workloads on Kubernetes</title><link>https://blog.lo0.es/posts/volcano-vs-kueue-scheduling-gpu-kubernetes/</link><pubDate>Tue, 16 Jun 2026 13:00:00 +0200</pubDate><guid>https://blog.lo0.es/posts/volcano-vs-kueue-scheduling-gpu-kubernetes/</guid><description>&lt;h2 id="tldr">TL;DR&lt;/h2>
&lt;p>&lt;strong>Volcano&lt;/strong> (volcano-sh, CNCF incubating) is a &lt;strong>full batch scheduler&lt;/strong> that replaces or complements kube-scheduler: it places pods with gang semantics (all-or-nothing via &lt;code>PodGroup&lt;/code>/&lt;code>minMember&lt;/code>), manages queues with priority, DRF fair-share, and preemption between queues, and understands network and NUMA topology.&lt;/p>
&lt;p>&lt;strong>Kueue&lt;/strong> (kubernetes-sigs/kueue) is a &lt;strong>Job-level queue and quota manager&lt;/strong>: it does NOT place pods (it delegates to kube-scheduler or to Volcano), but it decides when a workload may be admitted based on available quota (&lt;code>ClusterQueue&lt;/code>/&lt;code>LocalQueue&lt;/code>/&lt;code>Cohort&lt;/code>), with fair sharing, borrowing between teams, and priority-based preemption. It integrates natively with Job, JobSet, RayJob, all the Kubeflow operators, and more.&lt;/p>
&lt;p>The winning combination in production for multi-tenant GPU workloads is: &lt;strong>Kueue for quota/queues + Volcano (or the coscheduling plugin from SIG Scheduling) for the job's gang&lt;/strong>.&lt;/p>
&lt;hr>
&lt;h2 id="la-analogía">The analogy&lt;/h2>
&lt;p>Picture a club with limited capacity and a dance hall inside.&lt;/p>
&lt;p>&lt;strong>Kueue is the doorman and the reservations manager&lt;/strong>: it checks whether the team's quota (reserved capacity) lets the group in, applies a fair waiting list, lends capacity from other teams when theirs sits idle, and takes the spot back when its owner needs it. But the doorman does not decide where each person sits inside the venue.&lt;/p>
&lt;p>&lt;strong>Volcano is the maître d'&lt;/strong>: once the group has permission to enter, it decides which tables they get, makes sure the whole group is seated at once or nobody enters (gang), picks tables according to topology (who needs to talk to whom), and evicts lower-priority groups to make room if necessary.&lt;/p>
&lt;p>Without a doorman (Kueue), the maître d' does not know how many groups the venue can host at once, or whether a team is exceeding its capacity. Without a maître d' (Volcano), the doorman lets the group in but its members scatter across whatever tables are free, and the group of 8 that needs to sit together never manages to.&lt;/p>
&lt;hr>
&lt;h2 id="el-problema-que-ninguno-resuelve-por-defecto-el-kube-scheduler">The problem that goes unsolved by default: kube-scheduler&lt;/h2>
&lt;p>The Kubernetes &lt;code>kube-scheduler&lt;/code> is a pod scheduler, not a job scheduler. It assigns pods one at a time to the most suitable node based on available resources and affinity constraints. For a distributed training workload that needs, say, 8 pods with 4 GPUs each (32 GPUs in total across 8 nodes with 4×H100), the standard scheduler does the following:&lt;/p>
&lt;ol>
&lt;li>It looks for a node with 4 free GPUs. It finds one. It schedules pod 1.&lt;/li>
&lt;li>It looks for another node with 4 GPUs. It finds one. It schedules pod 2.&lt;/li>
&lt;li>It continues until it reaches pod 6 and finds no node with 4 free GPUs left: the cluster has exactly 32 GPUs and other workloads are using some of them.&lt;/li>
&lt;li>Pods 1–5 are &lt;code>Running&lt;/code>. Pods 6–8 are &lt;code>Pending&lt;/code>.&lt;/li>
&lt;li>Pods 1–5 cannot do anything without the rest: a distributed PyTorch job needs every worker to start before &lt;code>torchrun&lt;/code> can begin training. &lt;strong>It waits while holding its resources. Deadlock.&lt;/strong>&lt;/li>
&lt;/ol>
&lt;p>This is not a bug, it is the design: kube-scheduler has no concept of &amp;ldquo;schedule this group of pods only if you can schedule all of them&amp;rdquo;. As a consequence:&lt;/p>
&lt;ul>
&lt;li>The resources held by pods 1–5 are locked up without producing any work.&lt;/li>
&lt;li>Other jobs that could run on those partial resources wait as well.&lt;/li>
&lt;li>If several jobs end up in this situation, the cluster can be left with fragmented resources, no job running, and everyone stuck in a circular deadlock.&lt;/li>
&lt;/ul>
&lt;p>In addition, kube-scheduler has no notion of:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Queues&lt;/strong> per team/project with relative priority.&lt;/li>
&lt;li>&lt;strong>Quotas&lt;/strong> of resources per team, with the ability to borrow from idle teams.&lt;/li>
&lt;li>&lt;strong>Fair-share&lt;/strong>: if team A has been using 80 % of the cluster for weeks, it should wait longer than team B, which has been idle for weeks.&lt;/li>
&lt;li>&lt;strong>Inter-queue preemption&lt;/strong>: evicting a lower-priority job from another team to make room for this team's urgent job.&lt;/li>
&lt;/ul>
&lt;p>Solving any of these problems requires adding a layer on top of the scheduler. Volcano and Kueue are the two dominant OSS solutions in 2026, with complementary architectural approaches.&lt;/p>
&lt;hr>
&lt;h2 id="volcano-el-scheduler-batch">Volcano: the batch scheduler&lt;/h2>
&lt;h3 id="qué-es-y-qué-reemplaza">What it is and what it replaces&lt;/h3>
&lt;p>Volcano (volcano-sh) is a &lt;strong>Kubernetes-native batch scheduler&lt;/strong>, accepted by CNCF as its first and only official container batch scheduling project (&lt;a href="https://volcano.sh/en/docs/">volcano.sh/en/docs&lt;/a>). It is at v1.15.x as of June 2026, with CNCF incubating status.&lt;/p>
&lt;p>Volcano &lt;strong>is not an add-on to kube-scheduler&lt;/strong>: it is an alternative (or complementary, depending on configuration) scheduler that places pods. It is installed as a Deployment, and jobs that want its capabilities must set the scheduler name &lt;code>volcano&lt;/code> in their pod spec (&lt;code>schedulerName: volcano&lt;/code>) or use the &lt;code>VolcanoJob&lt;/code> CRD.&lt;/p>
&lt;p>Volcano's core value proposition is that it treats groups of pods, not individual pods, as atomic scheduling units. This is what resolves the deadlock described above.&lt;/p>
&lt;h3 id="gang-scheduling-via-podgroup">Gang scheduling via PodGroup&lt;/h3>
&lt;p>Volcano's central mechanism is the &lt;code>PodGroup&lt;/code>: a CRD that groups the pods of a job and defines how many of them must be schedulable before Volcano starts any of them.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-yaml" data-lang="yaml">&lt;span class="line">&lt;span class="cl">&lt;span class="c"># PodGroup for a distributed PyTorch training job: 8 workers, minimum 8&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="nt">apiVersion&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">scheduling.volcano.sh/v1beta1&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="nt">kind&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">PodGroup&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="nt">metadata&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">  &lt;/span>&lt;span class="nt">name&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">pytorch-train-pg&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">  &lt;/span>&lt;span class="nt">namespace&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">ml-training&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="nt">spec&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">  &lt;/span>&lt;span class="nt">minMember&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="m">8&lt;/span>&lt;span class="w">  &lt;/span>&lt;span class="c"># all-or-nothing: if there is no room for 8, none of them start&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">  &lt;/span>&lt;span class="nt">minResources&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">    &lt;/span>&lt;span class="nt">nvidia.com/gpu&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s2">&amp;#34;32&amp;#34;&lt;/span>&lt;span class="w">  &lt;/span>&lt;span class="c"># 8 pods × 4 GPUs = 32 GPUs minimum in the cluster&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">  &lt;/span>&lt;span class="nt">queue&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">team-datos&lt;/span>&lt;span class="w">  &lt;/span>&lt;span class="c"># Volcano queue this job is assigned to&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">  &lt;/span>&lt;span class="nt">priorityClassName&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">high-priority&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>The &lt;code>minMember&lt;/code> parameter implements &lt;strong>all-or-nothing&lt;/strong> semantics: Volcano only binds the group's pods to nodes when it can place at least &lt;code>minMember&lt;/code> pods at once. If the cluster does not have capacity for 8 GPU pods right now, no pod in the group leaves &lt;code>Pending&lt;/code>. Nothing is blocked, nothing is fragmented.&lt;/p>
&lt;p>&lt;code>minMember&lt;/code> may be lower than the job's total pod count: this enables &lt;strong>elastic gang scheduling&lt;/strong>, where the job can start with fewer workers and scale up later, which is useful for jobs that tolerate a reduced worker count.&lt;/p>
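&lt;p>As a rough sketch of the elastic case, a &lt;code>VolcanoJob&lt;/code> can declare 8 worker replicas while setting the gang threshold to 4 through &lt;code>minAvailable&lt;/code>. With a &lt;code>VolcanoJob&lt;/code>, the PodGroup is normally created by Volcano's job controller rather than by hand. The job name is hypothetical, and whether the training code copes with workers joining later is an assumption about the workload, not something Volcano provides.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-yaml" data-lang="yaml"># Illustrative sketch only: elastic gang with a threshold below the replica count
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: pytorch-elastic        # hypothetical name
  namespace: ml-training
spec:
  schedulerName: volcano
  queue: team-datos
  minAvailable: 4              # gang threshold: start once 4 workers can be placed together
  tasks:
  - name: worker
    replicas: 8                # desired total; the other 4 are placed as capacity frees up
    template:
      spec:
        restartPolicy: OnFailure
        containers:
        - name: trainer
          image: pytorch/pytorch:2.5-cuda12.4
          resources:
            limits:
              nvidia.com/gpu: &amp;#34;4&amp;#34;
&lt;/code>&lt;/pre>&lt;/div>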
&lt;p>For a job's pods to join the PodGroup, they carry this annotation:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-yaml" data-lang="yaml">&lt;span class="line">&lt;span class="cl">&lt;span class="c"># Pod spec of the PyTorch worker&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="nt">metadata&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">  &lt;/span>&lt;span class="nt">annotations&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">    &lt;/span>&lt;span class="nt">scheduling.k8s.io/group-name&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">pytorch-train-pg&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="nt">spec&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">  &lt;/span>&lt;span class="nt">schedulerName&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">volcano&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">  &lt;/span>&lt;span class="nt">containers&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">  &lt;/span>- &lt;span class="nt">name&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">trainer&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">    &lt;/span>&lt;span class="nt">image&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">pytorch/pytorch:2.5-cuda12.4&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">    &lt;/span>&lt;span class="nt">resources&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">      &lt;/span>&lt;span class="nt">limits&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">        &lt;/span>&lt;span class="nt">nvidia.com/gpu&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s2">&amp;#34;4&amp;#34;&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h3 id="queue-colas-con-cuotas-y-prioridad">Queue: queues with quotas and priority&lt;/h3>
&lt;p>Volcano introduces the &lt;code>Queue&lt;/code> CRD to manage multiple tenants with independent quotas:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-yaml" data-lang="yaml">&lt;span class="line">&lt;span class="cl">&lt;span class="nt">apiVersion&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">scheduling.volcano.sh/v1beta1&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="nt">kind&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">Queue&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="nt">metadata&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">  &lt;/span>&lt;span class="nt">name&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">team-datos&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="nt">spec&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">  &lt;/span>&lt;span class="nt">weight&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="m">4&lt;/span>&lt;span class="w">  &lt;/span>&lt;span class="c"># relative weight for fair-share between queues (proportion plugin)&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">  &lt;/span>&lt;span class="nt">capability&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">  &lt;/span>&lt;span class="c"># absolute ceiling on the resources this queue can use&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">    &lt;/span>&lt;span class="nt">nvidia.com/gpu&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s2">&amp;#34;16&amp;#34;&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">  &lt;/span>&lt;span class="nt">guarantee&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">  &lt;/span>&lt;span class="c"># guaranteed resources, never lent to other queues&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">    &lt;/span>&lt;span class="nt">resource&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">      &lt;/span>&lt;span class="nt">nvidia.com/gpu&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s2">&amp;#34;8&amp;#34;&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">  &lt;/span>&lt;span class="nt">reclaimable&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="kc">true&lt;/span>&lt;span class="w">  &lt;/span>&lt;span class="c"># if true, other queues can reclaim what this queue uses beyond its share when they need it&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-yaml" data-lang="yaml">&lt;span class="line">&lt;span class="cl">&lt;span class="nt">apiVersion&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">scheduling.volcano.sh/v1beta1&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="nt">kind&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">Queue&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="nt">metadata&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">  &lt;/span>&lt;span class="nt">name&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">team-ia&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="nt">spec&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">  &lt;/span>&lt;span class="nt">weight&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="m">2&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">  &lt;/span>&lt;span class="nt">capability&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">    &lt;/span>&lt;span class="nt">nvidia.com/gpu&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s2">&amp;#34;8&amp;#34;&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">  &lt;/span>&lt;span class="nt">guarantee&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">    &lt;/span>&lt;span class="nt">resource&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">      &lt;/span>&lt;span class="nt">nvidia.com/gpu&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s2">&amp;#34;4&amp;#34;&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">  &lt;/span>&lt;span class="nt">reclaimable&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="kc">true&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>The &lt;code>weight&lt;/code> field feeds the &lt;strong>proportion plugin&lt;/strong>: queues compete for the cluster's available resources in proportion to their weight. In a cluster with 32 GPUs and two queues with weights 4 and 2, the GPUs are split 4:2 when both queues are saturated: 32 × 4/6 ≈ 21 GPUs and 32 × 2/6 ≈ 11 GPUs, as long as neither share is cut short by a lower &lt;code>capability&lt;/code> ceiling.&lt;/p>
&lt;h3 id="plugins-del-scheduler-drf-binpack-topology-aware">Scheduler plugins: DRF, binpack, topology-aware&lt;/h3>
&lt;p>Volcano implements its scheduling logic as a pipeline of actions and plugins:&lt;/p>
&lt;p>&lt;strong>Actions&lt;/strong> (what the scheduler does in each cycle):&lt;/p>
&lt;ul>
&lt;li>&lt;code>enqueue&lt;/code>: moves jobs from waiting to schedulable when quota is available.&lt;/li>
&lt;li>&lt;code>allocate&lt;/code>: assigns nodes to schedulable pods.&lt;/li>
&lt;li>&lt;code>preempt&lt;/code>: evicts lower-priority pods to make room for higher-priority ones within the same queue.&lt;/li>
&lt;li>&lt;code>reclaim&lt;/code>: evicts pods from other queues that are using more than their &lt;code>guarantee&lt;/code>, returning the resources to their owner.&lt;/li>
&lt;li>&lt;code>backfill&lt;/code>: fills idle resources with best-effort jobs that do not interfere with the rest.&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Plugins&lt;/strong> relevant to GPU workloads:&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Plugin&lt;/th>
&lt;th>What it does&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>&lt;code>gang&lt;/code>&lt;/td>
&lt;td>Implements the PodGroup's all-or-nothing semantics&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>proportion&lt;/code>&lt;/td>
&lt;td>Fair-share by queue weight (proportional quota)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>capacity&lt;/code>&lt;/td>
&lt;td>Quotas with &lt;code>guarantee&lt;/code>/&lt;code>capability&lt;/code> and reclaim; a more expressive alternative to proportion&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>drf&lt;/code>&lt;/td>
&lt;td>Dominant Resource Fairness: multi-dimensional fair-share (CPU, memory, GPU)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>binpack&lt;/code>&lt;/td>
&lt;td>Packs pods onto the fullest nodes; reduces GPU fragmentation&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>priority&lt;/code>&lt;/td>
&lt;td>Orders jobs by priority within the same queue&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>nodeorder&lt;/code>&lt;/td>
&lt;td>Scores nodes on multiple criteria (affinity, resources, spread)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>task-topology&lt;/code>&lt;/td>
&lt;td>Affinity between pods of the same job (inter-GPU communication)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>numa-aware&lt;/code>&lt;/td>
&lt;td>NUMA affinity: aligns pods with a NUMA socket on the node to reduce memory latency&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
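&lt;p>To make the pipeline concrete, here is a rough sketch of the scheduler configuration that wires actions and plugins together (Volcano typically reads it from a ConfigMap). The selection and the grouping into tiers are illustrative assumptions, not a recommended production setup, and &lt;code>predicates&lt;/code> is an additional plugin that does not appear in the table above.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-yaml" data-lang="yaml"># Illustrative sketch only: adjust actions and plugins to your cluster
actions: &amp;#34;enqueue, allocate, preempt, reclaim, backfill&amp;#34;
tiers:
- plugins:              # first tier: job ordering and gang semantics
  - name: priority
  - name: gang
- plugins:              # second tier: fairness and node selection
  - name: drf
  - name: predicates    # not listed in the table; filters out nodes that cannot host the pod
  - name: proportion
  - name: nodeorder
  - name: binpack
&lt;/code>&lt;/pre>&lt;/div>
&lt;p>Actions run in the listed order on every scheduling cycle, and each action consults the enabled plugins for its ordering, filtering, and eviction decisions.&lt;/p>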
&lt;h3 id="topología-de-red-y-numa-v111">Network topology and NUMA (v1.11+)&lt;/h3>
&lt;p>Volcano v1.11 (February 2025) introduced &lt;strong>Network Topology Aware Scheduling&lt;/strong> as a first-class feature (&lt;a href="https://www.cncf.io/blog/2025/03/05/volcano-v1-11-released-a-new-era-of-cloud-native-scheduling-for-ai-and-big-data/">CNCF blog, March 2025&lt;/a>). Distributed training jobs in a datacenter with a hierarchical network (spine/leaf, blocks of nodes with NVSwitch) can declare topology constraints:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-yaml" data-lang="yaml">&lt;span class="line">&lt;span class="cl">&lt;span class="c"># VolcanoJob with a network topology constraint&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="nt">apiVersion&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">batch.volcano.sh/v1alpha1&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="nt">kind&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">Job&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="nt">metadata&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">  &lt;/span>&lt;span class="nt">name&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">llm-pretrain&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">  &lt;/span>&lt;span class="nt">namespace&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">ml-training&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="nt">spec&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">  &lt;/span>&lt;span class="nt">minAvailable&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="m">8&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">  &lt;/span>&lt;span class="nt">schedulerName&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">volcano&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">  &lt;/span>&lt;span class="nt">queue&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">team-datos&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">  &lt;/span>&lt;span class="nt">plugins&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">    &lt;/span>&lt;span class="nt">ssh&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="p">[]&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">    &lt;/span>&lt;span class="nt">env&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="p">[]&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">    &lt;/span>&lt;span class="nt">svc&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="p">[]&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">  &lt;/span>&lt;span class="nt">networkTopology&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">    &lt;/span>&lt;span class="nt">mode&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">hard&lt;/span>&lt;span class="w">  &lt;/span>&lt;span class="c"># hard: the job MUST satisfy the constraint&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">    &lt;/span>&lt;span class="nt">highestTierAllowed&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">block&lt;/span>&lt;span class="w">  &lt;/span>&lt;span class="c"># pods may not span beyond a single network &amp;#34;block&amp;#34;&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">  &lt;/span>&lt;span class="nt">tasks&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">  &lt;/span>- &lt;span class="nt">replicas&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="m">8&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">    &lt;/span>&lt;span class="nt">name&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">worker&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">    &lt;/span>&lt;span class="nt">template&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">      &lt;/span>&lt;span class="nt">spec&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">        &lt;/span>&lt;span class="nt">containers&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">        &lt;/span>- &lt;span class="nt">name&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">trainer&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">          &lt;/span>&lt;span class="nt">image&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">nvcr.io/nvidia/pytorch:25.01-py3&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">          &lt;/span>&lt;span class="nt">resources&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">            &lt;/span>&lt;span class="nt">limits&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">              &lt;/span>&lt;span class="nt">nvidia.com/gpu&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s2">&amp;#34;4&amp;#34;&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>The &lt;code>highestTierAllowed: block&lt;/code> setting tells Volcano to place all 8 pods within the same network block (for example, all nodes under the same access switch), minimizing the inter-block traffic that degrades distributed all-reduce.&lt;/p>
&lt;p>NUMA awareness works in a similar way: with the &lt;code>numa-aware&lt;/code> plugin, pods request a NUMA policy (&lt;code>single-numa-node&lt;/code>, &lt;code>restricted&lt;/code>, &lt;code>best-effort&lt;/code>) and Volcano selects nodes where the requested CPU/memory/GPU resources sit in the same NUMA domain. This avoids the overhead of remote memory access (NUMA crossing), which can degrade training throughput by 15-30 % on multi-socket nodes.&lt;/p>
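&lt;p>A minimal sketch of how a job might request such a policy, assuming the per-task &lt;code>topologyPolicy&lt;/code> field described in Volcano's NUMA-aware user guide and a scheduler with the &lt;code>numa-aware&lt;/code> plugin enabled. The field name and its placement should be checked against the Volcano version in use; the job name is hypothetical.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-yaml" data-lang="yaml"># Illustrative sketch only: field name assumed from the upstream NUMA-aware guide
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: numa-train               # hypothetical name
  namespace: ml-training
spec:
  schedulerName: volcano
  minAvailable: 8
  tasks:
  - name: worker
    replicas: 8
    topologyPolicy: single-numa-node   # or restricted / best-effort
    template:
      spec:
        containers:
        - name: trainer
          image: nvcr.io/nvidia/pytorch:25.01-py3
          resources:
            limits:
              nvidia.com/gpu: &amp;#34;4&amp;#34;
&lt;/code>&lt;/pre>&lt;/div>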
&lt;h3 id="gpu-virtualization-en-v111">GPU virtualization in v1.11+&lt;/h3>
&lt;p>Volcano v1.11 also introduces support for &lt;strong>dynamic MIG and vCUDA&lt;/strong>: instead of declaring &lt;code>nvidia.com/gpu: 1&lt;/code> for a whole GPU, workloads can declare &lt;code>nvidia.com/gpu-memory: 20Gi&lt;/code>, and Volcano (with the corresponding device plugin) dynamically provisions the MIG instance or the vCUDA partition. Treat this as project marketing: no independent benchmarks had been published as of June 2026, although the architecture of the feature is documented in the code.&lt;/p>
&lt;h3 id="qué-reemplazaañade-al-scheduler-por-defecto">Qué reemplaza/añade al scheduler por defecto&lt;/h3>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Capability&lt;/th>
&lt;th>kube-scheduler&lt;/th>
&lt;th>Volcano&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>Place pods on nodes&lt;/td>
&lt;td>Yes&lt;/td>
&lt;td>Yes (replaces it for workloads marked with &lt;code>schedulerName: volcano&lt;/code>)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Gang scheduling (all-or-nothing)&lt;/td>
&lt;td>No&lt;/td>
&lt;td>Yes (PodGroup + minMember)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Priority queues&lt;/td>
&lt;td>No&lt;/td>
&lt;td>Yes (Queue CRD)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Inter-queue fair share&lt;/td>
&lt;td>No&lt;/td>
&lt;td>Yes (DRF, proportion, capacity plugins)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Inter-queue preemption&lt;/td>
&lt;td>No&lt;/td>
&lt;td>Yes (reclaim action)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Topology-aware (NUMA, network)&lt;/td>
&lt;td>Partial (node affinity)&lt;/td>
&lt;td>Yes (dedicated plugins)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Elastic gang (minMember &amp;lt; total)&lt;/td>
&lt;td>No&lt;/td>
&lt;td>Yes&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Best-effort backfill&lt;/td>
&lt;td>No&lt;/td>
&lt;td>Yes&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Framework integrations&lt;/td>
&lt;td>Partial&lt;/td>
&lt;td>MPI, PyTorch, Ray, TensorFlow, Spark, Flink, Horovod&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;hr>
&lt;h2 id="kueue-el-gestor-de-colas-y-cuotas">Kueue: the queue and quota manager&lt;/h2>
&lt;h3 id="qué-es-y-qué-no-hace">What it is and what it does NOT do&lt;/h3>
&lt;p>Kueue (kubernetes-sigs/kueue) is a &lt;strong>Kubernetes-native system that manages quotas and how jobs consume them&lt;/strong> (&lt;a href="https://kueue.sigs.k8s.io/docs/overview/">kueue.sigs.k8s.io&lt;/a>). Kueue decides when a job should wait, when it should be admitted (its pods can be created) and when it should be preempted (its active pods must be deleted).&lt;/p>
&lt;p>Kueue's central design principle is stated explicitly in its documentation: &lt;strong>avoid duplicating mature functionality of existing Kubernetes components&lt;/strong>. Autoscaling is the responsibility of the cluster-autoscaler. Pod-to-node scheduling is the responsibility of the kube-scheduler. Job lifecycle management is the responsibility of the kube-controller-manager. Kueue replaces none of them: it sits on top as a layer of &lt;strong>admission control and quota management at the Job level&lt;/strong>.&lt;/p>
&lt;p>This is the fundamental distinction: &lt;strong>Kueue does not place pods on nodes&lt;/strong>. When Kueue admits a workload, it simply allows the corresponding job controller to create the pods, and those pods are scheduled by the kube-scheduler (or by Volcano, if it is configured as the scheduler).&lt;/p>
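&lt;p>As a rough illustration of that hand-off, the sketch below shows a plain &lt;code>batch/v1&lt;/code> Job submitted through Kueue. The job name, image and resource figures are hypothetical; the queue name reuses the &lt;code>lq-datos&lt;/code> LocalQueue used throughout this post. The Job is created suspended, and Kueue's built-in Job integration sets &lt;code>spec.suspend&lt;/code> to &lt;code>false&lt;/code> on admission; only then does the Job controller create the pods for the scheduler to place.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-yaml" data-lang="yaml">apiVersion: batch/v1
kind: Job
metadata:
  name: sample-train                  # hypothetical name
  namespace: ns-datos
  labels:
    kueue.x-k8s.io/queue-name: lq-datos   # LocalQueue that receives the workload
spec:
  suspend: true                       # Kueue sets this to false when it admits the workload
  parallelism: 2
  completions: 2
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: trainer
        image: registry.example.com/trainer:latest   # placeholder image
        resources:
          requests:
            cpu: 8
            memory: 32Gi
            nvidia.com/gpu: 1
          limits:
            nvidia.com/gpu: 1
&lt;/code>&lt;/pre>&lt;/div>&lt;p>While the Job stays suspended no pods exist, so neither kube-scheduler nor Volcano has anything to place: queueing happens entirely before scheduling.&lt;/p>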
&lt;h3 id="los-cuatro-objetos-nucleares">The four core objects&lt;/h3>
&lt;p>&lt;strong>ResourceFlavor&lt;/strong>: maps abstract resources to concrete groups of physical nodes.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-yaml" data-lang="yaml">&lt;span class="line">&lt;span class="cl">&lt;span class="nt">apiVersion&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">kueue.x-k8s.io/v1beta1&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="nt">kind&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">ResourceFlavor&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="nt">metadata&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">name&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">h100-sxm&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="nt">spec&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">nodeLabels&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">accelerator&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">h100-sxm&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">node-pool&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">gpu-training&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">tolerations&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>- &lt;span class="nt">key&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s2">&amp;#34;nvidia.com/gpu&amp;#34;&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">operator&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s2">&amp;#34;Exists&amp;#34;&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">effect&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s2">&amp;#34;NoSchedule&amp;#34;&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>&lt;strong>ClusterQueue&lt;/strong>: defines the per-flavor resource quota for a tenant. It is a cluster-scoped object.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-yaml" data-lang="yaml">&lt;span class="line">&lt;span class="cl">&lt;span class="nt">apiVersion&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">kueue.x-k8s.io/v1beta1&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="nt">kind&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">ClusterQueue&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="nt">metadata&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">name&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">cq-team-datos&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="nt">spec&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">cohort&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">llm-platform &lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="c"># cohort it belongs to (can lend/borrow quota)&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">queueingStrategy&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">BestEffortFIFO&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">namespaceSelector&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">matchLabels&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">kubernetes.io/metadata.name&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">ns-datos&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">resourceGroups&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>- &lt;span class="nt">coveredResources&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="s2">&amp;#34;nvidia.com/gpu&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s2">&amp;#34;cpu&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s2">&amp;#34;memory&amp;#34;&lt;/span>&lt;span class="p">]&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">flavors&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>- &lt;span class="nt">name&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">h100-sxm&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">resources&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>- &lt;span class="nt">name&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s2">&amp;#34;nvidia.com/gpu&amp;#34;&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">nominalQuota&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="m">16&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="c"># GPUs guaranteed to this team&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">borrowingLimit&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="m">8&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="c"># can take up to 8 additional GPUs from the cohort&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">lendingLimit&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="m">8&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="c"># can lend up to 8 of its 16 nominal GPUs&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>- &lt;span class="nt">name&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s2">&amp;#34;cpu&amp;#34;&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">nominalQuota&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s2">&amp;#34;128&amp;#34;&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>- &lt;span class="nt">name&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s2">&amp;#34;memory&amp;#34;&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">nominalQuota&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s2">&amp;#34;512Gi&amp;#34;&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">preemption&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">reclaimWithinCohort&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">LowerPriority &lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="c"># reclaims lent quota by evicting lower-priority jobs&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">borrowWithinCohort&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">policy&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">LowerPriority&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">withinClusterQueue&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">LowerPriority&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>&lt;strong>LocalQueue&lt;/strong>: the namespace-scoped entry point for a team's workloads. Jobs point to their LocalQueue; Kueue maps them to the corresponding ClusterQueue.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-yaml" data-lang="yaml">&lt;span class="line">&lt;span class="cl">&lt;span class="nt">apiVersion&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">kueue.x-k8s.io/v1beta1&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="nt">kind&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">LocalQueue&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="nt">metadata&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">name&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">lq-datos&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">namespace&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">ns-datos&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="nt">spec&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">clusterQueue&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">cq-team-datos&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>&lt;strong>Cohort&lt;/strong>: groups ClusterQueues that can lend quota to one another. In the basic setup it needs no object of its own: a ClusterQueue joins a cohort by naming it in &lt;code>spec.cohort&lt;/code> (recent Kueue releases also offer a standalone &lt;code>Cohort&lt;/code> resource, used for hierarchical cohorts). Kueue aggregates the available quota of all ClusterQueues in the cohort and lets any of them borrow what the others are not using, within the configured &lt;code>borrowingLimit&lt;/code> and &lt;code>lendingLimit&lt;/code>.&lt;/p>
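&lt;p>To make the borrowing arithmetic concrete, here is a minimal sketch of a second ClusterQueue in the same cohort. The name &lt;code>cq-team-research&lt;/code> and its quota figures are hypothetical; only the cohort name and the flavor come from the examples above.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-yaml" data-lang="yaml">apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: cq-team-research          # hypothetical second tenant
spec:
  cohort: llm-platform            # same cohort as cq-team-datos
  namespaceSelector: {}           # matches every namespace; tighten in practice
  resourceGroups:
  - coveredResources: [nvidia.com/gpu]
    flavors:
    - name: h100-sxm
      resources:
      - name: nvidia.com/gpu
        nominalQuota: 16          # guaranteed to this tenant
        borrowingLimit: 8         # may borrow at most 8 from the cohort
        lendingLimit: 8           # lends at most 8; the other 8 are never lent
&lt;/code>&lt;/pre>&lt;/div>&lt;p>With these numbers, &lt;code>cq-team-datos&lt;/code> can run on up to 16 + 8 = 24 GPUs while &lt;code>cq-team-research&lt;/code> leaves at least 8 of its GPUs idle. The remaining 8 GPUs of &lt;code>cq-team-research&lt;/code> stay reserved for it regardless of demand elsewhere in the cohort.&lt;/p>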
&lt;h3 id="fair-sharing-y-preemption">Fair sharing and preemption&lt;/h3>
&lt;p>Kueue implements &lt;strong>Fair Sharing&lt;/strong> as a policy for ordering the queue of pending workloads (&lt;a href="https://kueue.sigs.k8s.io/docs/concepts/fair_sharing/">kueue.sigs.k8s.io/docs/concepts/fair_sharing&lt;/a>): when several workloads compete for quota in the cohort, those belonging to ClusterQueues that already consume a larger share of the cohort's borrowable resources get lower admission priority. This yields an equitable split without permanently blocking any team.&lt;/p>
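&lt;p>A hedged sketch of how this is switched on: Fair Sharing is enabled in the Kueue controller configuration, and each ClusterQueue can carry a weight. The weight value below is an illustrative assumption, and the second document is a fragment to merge into an existing ClusterQueue, not a complete object.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-yaml" data-lang="yaml"># Kueue controller configuration (usually supplied through the manager's ConfigMap)
apiVersion: config.kueue.x-k8s.io/v1beta1
kind: Configuration
fairSharing:
  enable: true
  preemptionStrategies: [LessThanOrEqualToFinalShare, LessThanInitialShare]
---
# Fragment: fields to merge into the cq-team-datos ClusterQueue spec
spec:
  cohort: llm-platform
  fairSharing:
    weight: 2          # hypothetical: entitled to twice the share of a weight-1 queue
&lt;/code>&lt;/pre>&lt;/div>&lt;p>A higher weight lowers the computed share of a ClusterQueue for the same amount of borrowed resources, so its workloads are favored when the cohort is contended.&lt;/p>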
&lt;p>Preemption in Kueue operates along two dimensions:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>&lt;code>reclaimWithinCohort&lt;/code>&lt;/strong>: the ClusterQueue that lent quota takes it back by evicting workloads that are running on borrowed quota, according to the priority policy.&lt;/li>
&lt;li>&lt;strong>&lt;code>withinClusterQueue&lt;/code>&lt;/strong>: within the same ClusterQueue, lower-priority workloads are evicted to make room for higher-priority workloads from the same team.&lt;/li>
&lt;/ul>
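&lt;p>Both policies compare workload priorities, which Kueue can take from a &lt;code>WorkloadPriorityClass&lt;/code>. The following is a minimal sketch; the class names and values are hypothetical.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-yaml" data-lang="yaml">apiVersion: kueue.x-k8s.io/v1beta1
kind: WorkloadPriorityClass
metadata:
  name: prod-training            # hypothetical class for production runs
value: 10000
description: Production training runs
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: WorkloadPriorityClass
metadata:
  name: experiments              # hypothetical class for preemptible work
value: 100
description: Exploratory runs that may be preempted
&lt;/code>&lt;/pre>&lt;/div>&lt;p>A job opts in with the label &lt;code>kueue.x-k8s.io/priority-class: prod-training&lt;/code>. With &lt;code>withinClusterQueue: LowerPriority&lt;/code>, a pending &lt;code>prod-training&lt;/code> workload can then evict running &lt;code>experiments&lt;/code> workloads in the same ClusterQueue.&lt;/p>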
&lt;h3 id="gang-semantics-en-kueue-all-or-nothing-con-ready-pods">Gang semantics in Kueue: all-or-nothing with ready Pods&lt;/h3>
&lt;p>Kueue provides gang admission at the Job level: it admits the whole workload only when all the quota it needs is available. If a RayJob needs 8 GPUs (1 head + 7 workers), Kueue does not admit the workload until 8 GPUs are available in the ClusterQueue (or can be borrowed from the cohort). &lt;code>waitForPodsReady&lt;/code> with a timeout adds a second safeguard: if the created pods do not reach &lt;code>Ready&lt;/code> within the configured time, Kueue requeues the workload and releases its quota (&lt;a href="https://kueue.sigs.k8s.io/docs/tasks/manage/setup_wait_for_pods_ready/">kueue.sigs.k8s.io/docs/tasks/manage/setup_wait_for_pods_ready&lt;/a>).&lt;/p>
&lt;p>This is &lt;strong>gang semantics at the admission level&lt;/strong>, not at the pod placement level. It guarantees that the quota is available before the pods are created, but it does not guarantee that the kube-scheduler can place all of them on concrete nodes at the same time. That second guarantee requires Volcano or the coscheduling plugin.&lt;/p>
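&lt;p>The &lt;code>waitForPodsReady&lt;/code> behaviour described above is set in the Kueue controller configuration. A minimal sketch follows; the timeout and backoff values are illustrative assumptions, not recommendations.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-yaml" data-lang="yaml">apiVersion: config.kueue.x-k8s.io/v1beta1
kind: Configuration
waitForPodsReady:
  enable: true
  timeout: 10m              # evict and requeue if pods are not Ready within 10 minutes
  blockAdmission: true      # hold further admissions until admitted workloads are Ready
  requeuingStrategy:
    timestamp: Eviction     # requeue position is based on the eviction time
    backoffLimitCount: 5    # stop requeuing (deactivate the workload) after 5 attempts
&lt;/code>&lt;/pre>&lt;/div>&lt;p>The timeout bounds how long a partially placed job can hold quota and GPUs, but it is a recovery mechanism: it does not prevent the partial placement in the first place.&lt;/p>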
&lt;h3 id="topology-aware-scheduling-tas">Topology-Aware Scheduling (TAS)&lt;/h3>
&lt;p>Kueue v0.10+ introduces &lt;strong>Topology-Aware Scheduling&lt;/strong> (&lt;a href="https://kueue.sigs.k8s.io/docs/concepts/topology_aware_scheduling/">kueue.sigs.k8s.io/docs/concepts/topology_aware_scheduling&lt;/a>): it lets you define node topologies (blocks, sub-blocks, hosts) and lets workloads request a level of co-location. Kueue admits the workload only when it can satisfy the topology constraint, and at admission time it adds node selectors (plus the flavor's tolerations) to the pods so that the scheduler places them in the right topology domain.&lt;/p>
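&lt;p>On the workload side, the co-location request is expressed as an annotation on the pod template. The sketch below assumes a topology whose levels include the node label &lt;code>topology.kubernetes.io/block&lt;/code>; the job name, image and sizes are hypothetical.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-yaml" data-lang="yaml">apiVersion: batch/v1
kind: Job
metadata:
  name: tas-sample                    # hypothetical name
  namespace: ns-datos
  labels:
    kueue.x-k8s.io/queue-name: lq-datos
spec:
  suspend: true
  parallelism: 8
  completions: 8
  template:
    metadata:
      annotations:
        # Hard requirement: all 8 pods must fit inside a single block
        kueue.x-k8s.io/podset-required-topology: topology.kubernetes.io/block
    spec:
      restartPolicy: Never
      containers:
      - name: worker
        image: registry.example.com/trainer:latest   # placeholder image
        resources:
          requests:
            nvidia.com/gpu: 1
          limits:
            nvidia.com/gpu: 1
&lt;/code>&lt;/pre>&lt;/div>&lt;p>Replacing the annotation with &lt;code>kueue.x-k8s.io/podset-preferred-topology&lt;/code> turns the constraint into a preference: Kueue tries the requested level first and falls back to wider levels if the pods do not fit.&lt;/p>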
&lt;p>TAS is configured with the &lt;code>Topology&lt;/code> CRD:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-yaml" data-lang="yaml">&lt;span class="line">&lt;span class="cl">&lt;span class="nt">apiVersion&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">kueue.x-k8s.io/v1beta1&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="nt">kind&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">Topology&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="nt">metadata&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">name&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">datacenter-topology&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="nt">spec&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">levels&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>- &lt;span class="nt">nodeLabel&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s2">&amp;#34;topology.kubernetes.io/block&amp;#34;&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>- &lt;span class="nt">nodeLabel&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s2">&amp;#34;topology.kubernetes.io/rack&amp;#34;&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>- &lt;span class="nt">nodeLabel&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s2">&amp;#34;kubernetes.io/hostname&amp;#34;&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>The ResourceFlavor then references the topology:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-yaml" data-lang="yaml">&lt;span class="line">&lt;span class="cl">&lt;span class="nt">apiVersion&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">kueue.x-k8s.io/v1beta1&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="nt">kind&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">ResourceFlavor&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="nt">metadata&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">name&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">h100-sxm&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="nt">spec&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">nodeLabels&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">accelerator&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">h100-sxm&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">topologyName&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">datacenter-topology&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h3 id="integraciones-de-frameworks">Framework integrations&lt;/h3>
&lt;p>Kueue has built-in integration (no extra code required) for the following workload types, enabled with a label on the job:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-yaml" data-lang="yaml">&lt;span class="line">&lt;span class="cl">&lt;span class="nt">metadata&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">labels&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">kueue.x-k8s.io/queue-name&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">lq-datos &lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="c"># points to the team's LocalQueue&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Supported workload types include: &lt;code>batch/Job&lt;/code>, &lt;code>JobSet&lt;/code>, &lt;code>RayJob&lt;/code>, &lt;code>RayCluster&lt;/code>, &lt;code>PyTorchJob&lt;/code>, &lt;code>TFJob&lt;/code>, &lt;code>MPIJob&lt;/code>, &lt;code>JAXJob&lt;/code>, &lt;code>PaddleJob&lt;/code>, &lt;code>XGBoostJob&lt;/code>, &lt;code>TrainJob&lt;/code>, &lt;code>AppWrapper&lt;/code>, &lt;code>LeaderWorkerSet&lt;/code>, &lt;code>Deployment&lt;/code>, &lt;code>StatefulSet&lt;/code>, and plain &lt;code>Pod&lt;/code>/&lt;code>PodGroup&lt;/code>.&lt;/p>
&lt;p>For LLM workloads, the directly relevant cases are:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>RayJob&lt;/strong> (distributed training with Ray Train): Kueue admits the RayJob when there is quota for the entire Ray cluster (head + workers). Documented at &lt;a href="https://docs.ray.io/en/latest/cluster/kubernetes/examples/rayjob-kueue-gang-scheduling.html">docs.ray.io&lt;/a>.&lt;/li>
&lt;li>&lt;strong>PyTorchJob&lt;/strong> (Kubeflow Training Operator): gang admission of the whole job.&lt;/li>
&lt;li>&lt;strong>JobSet&lt;/strong>: for coordinated multi-replica jobs (LWS, multi-step pipelines).&lt;/li>
&lt;li>&lt;strong>Deployment/StatefulSet&lt;/strong>: for continuous inference, so that inference GPU quota is managed the same way as training GPU quota.&lt;/li>
&lt;/ul>
&lt;hr>
&lt;h2 id="la-distinción-clave-volcano-coloca-kueue-admite">The key distinction: Volcano places, Kueue admits&lt;/h2>
&lt;p>This table summarizes the fundamental architectural difference:&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Dimension&lt;/th>
&lt;th>Volcano&lt;/th>
&lt;th>Kueue&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>&lt;strong>Primary role&lt;/strong>&lt;/td>
&lt;td>Scheduler (places pods on nodes)&lt;/td>
&lt;td>Admission controller + quota manager (decides when pods may be created)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>Gang scheduling&lt;/strong>&lt;/td>
&lt;td>Yes, at placement level (PodGroup/minMember)&lt;/td>
&lt;td>Yes, at admission level (all-or-nothing on quota)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>Multi-tenant quotas&lt;/strong>&lt;/td>
&lt;td>Yes (Queue with capability/guarantee)&lt;/td>
&lt;td>Yes (ClusterQueue with nominalQuota/borrowingLimit)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>Cohorts / borrowing&lt;/strong>&lt;/td>
&lt;td>Partial (reclaimable between queues)&lt;/td>
&lt;td>Yes (Cohort with explicit lendingLimit/borrowingLimit)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>Fair sharing&lt;/strong>&lt;/td>
&lt;td>Yes (DRF plugin)&lt;/td>
&lt;td>Yes (Fair Sharing based on each queue's weighted share of usage)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>Preemption&lt;/strong>&lt;/td>
&lt;td>Yes (preempt + reclaim actions)&lt;/td>
&lt;td>Yes (reclaimWithinCohort, withinClusterQueue)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>Topology/NUMA&lt;/strong>&lt;/td>
&lt;td>Yes (dedicated plugins, network topology, NUMA-aware)&lt;/td>
&lt;td>Yes (TAS, topology levels referenced from the ResourceFlavor)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>Framework integrations&lt;/strong>&lt;/td>
&lt;td>Volcano Job (MPI, PyTorch, Ray, TF, Spark, Flink)&lt;/td>
&lt;td>Native: Job, JobSet, RayJob, Kubeflow, LWS, AppWrapper, Deployment, StatefulSet&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>Who places the pods&lt;/strong>&lt;/td>
&lt;td>Volcano&lt;/td>
&lt;td>kube-scheduler (or Volcano if configured)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>Installation footprint&lt;/strong>&lt;/td>
&lt;td>Medium to high (own scheduler, CRDs, webhook, metrics)&lt;/td>
&lt;td>Light (controller, CRDs, webhook; does not replace the scheduler)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>Maturity / status&lt;/strong>&lt;/td>
&lt;td>CNCF incubating; v1.15 (June 2026); in production at Huawei, Baidu, DiDi&lt;/td>
&lt;td>kubernetes-sigs; API v1beta2; in production at Google GKE, Red Hat OpenShift 4.20, Runway ML&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>Adoption curve&lt;/strong>&lt;/td>
&lt;td>Steeper (requires changing schedulerName or using the VolcanoJob CRD)&lt;/td>
&lt;td>Gentler (adds labels to existing jobs; the scheduler stays unchanged)&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;h3 id="cómo-coexisten-el-patrón-producción">How they coexist: the production pattern&lt;/h3>
&lt;p>In production, Kueue and Volcano are &lt;strong>complementary, not mutually exclusive&lt;/strong>. The most common pattern in 2026 for multi-tenant GPU clusters is:&lt;/p>
&lt;ol>
&lt;li>&lt;strong>Kueue&lt;/strong> manages the global quota: which team can use how many GPUs, how much it can borrow, and when a job is queued versus admitted.&lt;/li>
&lt;li>&lt;strong>Volcano&lt;/strong> performs gang scheduling at the pod level: once Kueue admits the job (the quota is available), Volcano places the pods, ensuring that all of them are placed simultaneously on compatible nodes.&lt;/li>
&lt;/ol>
&lt;p>The integration is configured by setting &lt;code>schedulerName: volcano&lt;/code> in the pod specs of the workloads managed by Kueue. Kueue sees the Job/RayJob/PyTorchJob and manages its quota; when it admits the workload, the pods are created and Volcano places them with gang semantics. Volcano PodGroups are created automatically by the Volcano Job controller, or by the Kubeflow Training Operator itself when it detects that the scheduler is Volcano.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-yaml" data-lang="yaml">&lt;span class="line">&lt;span class="cl">&lt;span class="c"># PyTorchJob managed by Kueue (quota) + Volcano (gang placement)&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="nt">apiVersion&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">kubeflow.org/v1&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="nt">kind&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">PyTorchJob&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="nt">metadata&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">name&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">llm-finetune-70b&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">namespace&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">ns-datos&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">labels&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">kueue.x-k8s.io/queue-name&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">lq-datos &lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="c"># Kueue manages the quota&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">annotations&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">scheduling.volcano.sh/queue-name&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">team-datos &lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="c"># Volcano usa su Queue&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="nt">spec&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">pytorchReplicaSpecs&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">Master&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">replicas&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="m">1&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">restartPolicy&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">OnFailure&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">template&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">spec&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">schedulerName&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">volcano &lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="c"># Volcano hace el placement&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">containers&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>- &lt;span class="nt">name&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">pytorch&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">image&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">nvcr.io/nvidia/pytorch:25.01-py3&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">resources&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">limits&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">nvidia.com/gpu&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s2">&amp;#34;4&amp;#34;&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">Worker&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">replicas&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="m">7&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">restartPolicy&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">OnFailure&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">template&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">spec&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">schedulerName&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">volcano&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">containers&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>- &lt;span class="nt">name&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">pytorch&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">image&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">nvcr.io/nvidia/pytorch:25.01-py3&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">resources&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">limits&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">nvidia.com/gpu&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s2">&amp;#34;4&amp;#34;&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h3 id="la-tercera-vía-plugin-coscheduling-de-sig-scheduler">The third way: the coscheduling plugin from sig-scheduling&lt;/h3>
&lt;p>If you need gang scheduling but do not want to deploy a full alternative scheduler, there is the &lt;strong>coscheduling plugin&lt;/strong> from &lt;a href="https://github.com/kubernetes-sigs/scheduler-plugins/blob/master/pkg/coscheduling/README.md">kubernetes-sigs/scheduler-plugins&lt;/a>. It extends kube-scheduler with a PodGroup mechanism similar to Volcano's, implemented as a scheduling framework plugin (a Permit plugin). The advantage is that it does not replace the scheduler; the drawback is that it offers less functionality than Volcano (no DRF, no Queue/fair-share, no network topology awareness). It is the right choice for simple clusters that only need gang scheduling and do not want Volcano's operational complexity. Kueue can also work alongside this plugin.&lt;/p>
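&lt;p>As a rough sketch of what this looks like in practice, the manifest below defines a PodGroup for the coscheduling plugin and one pod that joins it. The API group &lt;code>scheduling.x-k8s.io/v1alpha1&lt;/code> and the &lt;code>scheduling.x-k8s.io/pod-group&lt;/code> label follow the scheduler-plugins README; the object names, the timeout and the &lt;code>schedulerName&lt;/code> value are assumptions for illustration, so check them against the release you actually deploy.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-yaml" data-lang="yaml"># PodGroup for the coscheduling plugin (not the Volcano CRD)
apiVersion: scheduling.x-k8s.io/v1alpha1
kind: PodGroup
metadata:
  name: finetune-70b-pg        # hypothetical name
  namespace: ns-datos
spec:
  minMember: 8                 # all-or-nothing: 8 pods must fit before any is bound
  scheduleTimeoutSeconds: 60   # assumed value; how long the gang may wait in Permit
---
# One of the 8 pods; each member carries the same pod-group label
apiVersion: v1
kind: Pod
metadata:
  name: finetune-70b-worker-0  # hypothetical name
  namespace: ns-datos
  labels:
    scheduling.x-k8s.io/pod-group: finetune-70b-pg
spec:
  # Assumption: the name of the scheduler (or profile) that has coscheduling enabled
  schedulerName: scheduler-plugins-scheduler
  containers:
  - name: trainer
    image: nvcr.io/nvidia/pytorch:25.01-py3
    resources:
      limits:
        nvidia.com/gpu: &amp;#34;4&amp;#34;
&lt;/code>&lt;/pre>&lt;/div>
&lt;p>Unlike the Volcano examples, there is no Queue object here: the plugin only holds pods back until &lt;code>minMember&lt;/code> of them can be placed together, so quotas and fair sharing would still have to come from Kueue.&lt;/p>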
&lt;hr>
&lt;h2 id="ejemplos-yaml-completos">Complete YAML examples&lt;/h2>
&lt;h3 id="volcano-queue--podgroup--volcanojob">Volcano: Queue + PodGroup + VolcanoJob&lt;/h3>
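&lt;p>The PodGroup and the Job in this section reference a PriorityClass named &lt;code>training-high&lt;/code>. If the cluster does not define it yet, a minimal sketch could look like the following; the numeric value and the description are assumptions, so pick them relative to the other priority classes you already have.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-yaml" data-lang="yaml"># PriorityClass referenced by priorityClassName in the manifests of this section
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: training-high
value: 100000                  # assumed value; higher numbers mean higher priority
globalDefault: false           # only workloads that name this class get it
preemptionPolicy: PreemptLowerPriority
description: &amp;#34;High-priority distributed training jobs (illustrative)&amp;#34;
&lt;/code>&lt;/pre>&lt;/div>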
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-yaml" data-lang="yaml">&lt;span class="line">&lt;span class="cl">&lt;span class="c"># 1. Queue for the datos team (16 nominal GPUs, ceiling at 24)&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="nt">apiVersion&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">scheduling.volcano.sh/v1beta1&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="nt">kind&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">Queue&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="nt">metadata&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">  &lt;/span>&lt;span class="nt">name&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">team-datos&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="nt">spec&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">  &lt;/span>&lt;span class="nt">weight&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="m">4&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">  &lt;/span>&lt;span class="nt">capability&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">    &lt;/span>&lt;span class="nt">nvidia.com/gpu&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s2">&amp;#34;24&amp;#34;&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">    &lt;/span>&lt;span class="nt">cpu&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s2">&amp;#34;192&amp;#34;&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">    &lt;/span>&lt;span class="nt">memory&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s2">&amp;#34;768Gi&amp;#34;&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">  &lt;/span>&lt;span class="nt">guarantee&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">    &lt;/span>&lt;span class="nt">resource&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">      &lt;/span>&lt;span class="nt">nvidia.com/gpu&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s2">&amp;#34;16&amp;#34;&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">  &lt;/span>&lt;span class="nt">reclaimable&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="kc">true&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="nn">---&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="c"># 2. Queue for the ia team (8 nominal GPUs, ceiling at 16)&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="nt">apiVersion&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">scheduling.volcano.sh/v1beta1&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="nt">kind&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">Queue&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="nt">metadata&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">  &lt;/span>&lt;span class="nt">name&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">team-ia&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="nt">spec&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">  &lt;/span>&lt;span class="nt">weight&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="m">2&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">  &lt;/span>&lt;span class="nt">capability&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">    &lt;/span>&lt;span class="nt">nvidia.com/gpu&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s2">&amp;#34;16&amp;#34;&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">    &lt;/span>&lt;span class="nt">cpu&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s2">&amp;#34;64&amp;#34;&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">    &lt;/span>&lt;span class="nt">memory&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s2">&amp;#34;256Gi&amp;#34;&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">  &lt;/span>&lt;span class="nt">guarantee&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">    &lt;/span>&lt;span class="nt">resource&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">      &lt;/span>&lt;span class="nt">nvidia.com/gpu&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s2">&amp;#34;8&amp;#34;&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">  &lt;/span>&lt;span class="nt">reclaimable&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="kc">true&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="nn">---&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="c"># 3. PodGroup: 70B fine-tuning job, 8 workers × 4 GPUs = 32 GPUs&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="nt">apiVersion&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">scheduling.volcano.sh/v1beta1&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="nt">kind&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">PodGroup&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="nt">metadata&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">  &lt;/span>&lt;span class="nt">name&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">finetune-70b-pg&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">  &lt;/span>&lt;span class="nt">namespace&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">ns-datos&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="nt">spec&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">  &lt;/span>&lt;span class="nt">minMember&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="m">8&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">  &lt;/span>&lt;span class="nt">minResources&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">    &lt;/span>&lt;span class="nt">nvidia.com/gpu&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s2">&amp;#34;32&amp;#34;&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">  &lt;/span>&lt;span class="nt">queue&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">team-datos&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">  &lt;/span>&lt;span class="nt">priorityClassName&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">training-high&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="nn">---&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="c"># 4. VolcanoJob (wrapper that Volcano understands natively)&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="nt">apiVersion&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">batch.volcano.sh/v1alpha1&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="nt">kind&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">Job&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="nt">metadata&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">  &lt;/span>&lt;span class="nt">name&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">finetune-70b&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">  &lt;/span>&lt;span class="nt">namespace&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">ns-datos&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="nt">spec&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">  &lt;/span>&lt;span class="nt">minAvailable&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="m">8&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">  &lt;/span>&lt;span class="nt">schedulerName&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">volcano&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">  &lt;/span>&lt;span class="nt">queue&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">team-datos&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">  &lt;/span>&lt;span class="nt">priorityClassName&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">training-high&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">  &lt;/span>&lt;span class="nt">plugins&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">    &lt;/span>&lt;span class="nt">env&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="p">[]&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">    &lt;/span>&lt;span class="nt">svc&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="p">[]&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">  &lt;/span>&lt;span class="nt">policies&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">  &lt;/span>- &lt;span class="nt">event&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">PodEvicted&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">    &lt;/span>&lt;span class="nt">action&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">RestartJob&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">  &lt;/span>&lt;span class="nt">tasks&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">  &lt;/span>- &lt;span class="nt">replicas&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="m">8&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">    &lt;/span>&lt;span class="nt">name&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">worker&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">    &lt;/span>&lt;span class="nt">policies&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">    &lt;/span>- &lt;span class="nt">event&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">TaskCompleted&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">      &lt;/span>&lt;span class="nt">action&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">CompleteJob&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">    &lt;/span>&lt;span class="nt">template&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">      &lt;/span>&lt;span class="nt">metadata&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">        &lt;/span>&lt;span class="nt">annotations&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">          &lt;/span>&lt;span class="nt">scheduling.volcano.sh/pod-group-name&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">finetune-70b-pg&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">      &lt;/span>&lt;span class="nt">spec&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">        &lt;/span>&lt;span class="nt">schedulerName&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">volcano&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">        &lt;/span>&lt;span class="nt">containers&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">        &lt;/span>- &lt;span class="nt">name&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">trainer&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">          &lt;/span>&lt;span class="nt">image&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">nvcr.io/nvidia/pytorch:25.01-py3&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">          &lt;/span>&lt;span class="nt">command&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="s2">&amp;#34;torchrun&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s2">&amp;#34;--nproc_per_node=4&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s2">&amp;#34;--nnodes=8&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">                    &lt;/span>&lt;span class="s2">&amp;#34;--node_rank=$(RANK)&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s2">&amp;#34;--master_addr=$(MASTER_ADDR)&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">                    &lt;/span>&lt;span class="s2">&amp;#34;--master_port=23456&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s2">&amp;#34;train.py&amp;#34;&lt;/span>&lt;span class="p">]&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">          &lt;/span>&lt;span class="nt">env&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">          &lt;/span>- &lt;span class="nt">name&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">NCCL_DEBUG&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">            &lt;/span>&lt;span class="nt">value&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s2">&amp;#34;INFO&amp;#34;&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">          &lt;/span>&lt;span class="nt">resources&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">            &lt;/span>&lt;span class="nt">requests&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">              &lt;/span>&lt;span class="nt">nvidia.com/gpu&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s2">&amp;#34;4&amp;#34;&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">              &lt;/span>&lt;span class="nt">cpu&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s2">&amp;#34;16&amp;#34;&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">              &lt;/span>&lt;span class="nt">memory&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s2">&amp;#34;64Gi&amp;#34;&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">            &lt;/span>&lt;span class="nt">limits&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">              &lt;/span>&lt;span class="nt">nvidia.com/gpu&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s2">&amp;#34;4&amp;#34;&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">              &lt;/span>&lt;span class="nt">cpu&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s2">&amp;#34;16&amp;#34;&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">              &lt;/span>&lt;span class="nt">memory&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s2">&amp;#34;64Gi&amp;#34;&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">        &lt;/span>&lt;span class="nt">restartPolicy&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">Never&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h3 id="kueue-resourceflavor--clusterqueue--localqueue--job-anotado">Kueue: ResourceFlavor + ClusterQueue + LocalQueue + labeled Job&lt;/h3>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-yaml" data-lang="yaml">&lt;span class="line">&lt;span class="cl">&lt;span class="c"># 1. ResourceFlavor: nodos con H100 SXM&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="nt">apiVersion&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">kueue.x-k8s.io/v1beta1&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="nt">kind&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">ResourceFlavor&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="nt">metadata&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">name&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">h100-sxm&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="nt">spec&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">nodeLabels&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">accelerator&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">h100-sxm&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">tolerations&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>- &lt;span class="nt">key&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s2">&amp;#34;nvidia.com/gpu&amp;#34;&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">operator&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s2">&amp;#34;Exists&amp;#34;&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">effect&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s2">&amp;#34;NoSchedule&amp;#34;&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="nn">---&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="c"># 2. ClusterQueue for the data team: 16 nominal GPUs, may borrow up to 8 more&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="nt">apiVersion&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">kueue.x-k8s.io/v1beta1&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="nt">kind&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">ClusterQueue&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="nt">metadata&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">name&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">cq-team-datos&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="nt">spec&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">cohort&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">llm-platform&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">queueingStrategy&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">BestEffortFIFO&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">namespaceSelector&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">matchLabels&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">kubernetes.io/metadata.name&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">ns-datos&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">resourceGroups&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>- &lt;span class="nt">coveredResources&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="s2">&amp;#34;nvidia.com/gpu&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s2">&amp;#34;cpu&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s2">&amp;#34;memory&amp;#34;&lt;/span>&lt;span class="p">]&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">flavors&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>- &lt;span class="nt">name&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">h100-sxm&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">resources&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>- &lt;span class="nt">name&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s2">&amp;#34;nvidia.com/gpu&amp;#34;&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">nominalQuota&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="m">16&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">borrowingLimit&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="m">8&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">lendingLimit&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="m">8&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>- &lt;span class="nt">name&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s2">&amp;#34;cpu&amp;#34;&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">nominalQuota&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s2">&amp;#34;128&amp;#34;&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>- &lt;span class="nt">name&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s2">&amp;#34;memory&amp;#34;&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">nominalQuota&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s2">&amp;#34;512Gi&amp;#34;&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">preemption&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">reclaimWithinCohort&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">LowerPriority&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">borrowWithinCohort&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">policy&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">LowerPriority&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">withinClusterQueue&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">LowerPriority&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="nn">---&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="c"># 3. ClusterQueue for the AI team: 8 nominal GPUs&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="nt">apiVersion&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">kueue.x-k8s.io/v1beta1&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="nt">kind&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">ClusterQueue&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="nt">metadata&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">name&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">cq-team-ia&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="nt">spec&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">cohort&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">llm-platform&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">queueingStrategy&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">BestEffortFIFO&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">namespaceSelector&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">matchLabels&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">kubernetes.io/metadata.name&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">ns-ia&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">resourceGroups&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>- &lt;span class="nt">coveredResources&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="s2">&amp;#34;nvidia.com/gpu&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s2">&amp;#34;cpu&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s2">&amp;#34;memory&amp;#34;&lt;/span>&lt;span class="p">]&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">flavors&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>- &lt;span class="nt">name&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">h100-sxm&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">resources&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>- &lt;span class="nt">name&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s2">&amp;#34;nvidia.com/gpu&amp;#34;&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">nominalQuota&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="m">8&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">borrowingLimit&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="m">8&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">lendingLimit&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="m">4&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>- &lt;span class="nt">name&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s2">&amp;#34;cpu&amp;#34;&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">nominalQuota&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s2">&amp;#34;64&amp;#34;&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>- &lt;span class="nt">name&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s2">&amp;#34;memory&amp;#34;&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">nominalQuota&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s2">&amp;#34;256Gi&amp;#34;&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">preemption&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">reclaimWithinCohort&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">LowerPriority&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">withinClusterQueue&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">LowerPriority&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="nn">---&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="c"># 4. LocalQueue in the data team's namespace&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="nt">apiVersion&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">kueue.x-k8s.io/v1beta1&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="nt">kind&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">LocalQueue&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="nt">metadata&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">name&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">lq-datos&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">namespace&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">ns-datos&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="nt">spec&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">clusterQueue&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">cq-team-datos&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="nn">---&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="c"># 5. LocalQueue in the AI team's namespace&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="nt">apiVersion&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">kueue.x-k8s.io/v1beta1&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="nt">kind&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">LocalQueue&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="nt">metadata&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">name&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">lq-ia&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">namespace&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">ns-ia&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="nt">spec&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">clusterQueue&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">cq-team-ia&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="nn">---&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="c"># 6. Batch-inference RayJob managed by Kueue&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="nt">apiVersion&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">ray.io/v1&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="nt">kind&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">RayJob&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="nt">metadata&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">name&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">batch-eval-llama70b&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">namespace&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">ns-datos&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">labels&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">kueue.x-k8s.io/queue-name&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">lq-datos&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="c"># Kueue handles admission&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="nt">spec&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
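&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">entrypoint&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s2">&amp;#34;python batch_eval.py --model /models/llama-70b&amp;#34;&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">shutdownAfterJobFinishes&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="kc">true&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="c"># Kueue expects a managed RayJob to delete its RayCluster on completion&lt;/span>&lt;span class="w">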
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">rayClusterSpec&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">headGroupSpec&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">rayStartParams&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">num-gpus&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s2">&amp;#34;4&amp;#34;&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">template&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">spec&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">containers&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>- &lt;span class="nt">name&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">ray-head&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">image&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">rayproject/ray-ml:2.40.0-gpu&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">resources&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">limits&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">nvidia.com/gpu&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s2">&amp;#34;4&amp;#34;&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">cpu&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s2">&amp;#34;16&amp;#34;&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">memory&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s2">&amp;#34;64Gi&amp;#34;&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">workerGroupSpecs&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>- &lt;span class="nt">replicas&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="m">3&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">minReplicas&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="m">3&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="c"># gang: Kueue will not admit the job unless quota covers all 3 workers&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">maxReplicas&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="m">3&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">groupName&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">gpu-worker&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">rayStartParams&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">num-gpus&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s2">&amp;#34;4&amp;#34;&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">template&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">spec&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">containers&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>- &lt;span class="nt">name&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">ray-worker&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">image&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">rayproject/ray-ml:2.40.0-gpu&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">resources&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">limits&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">nvidia.com/gpu&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s2">&amp;#34;4&amp;#34;&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">cpu&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s2">&amp;#34;16&amp;#34;&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">memory&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s2">&amp;#34;64Gi&amp;#34;&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h3 id="kueue--volcano-juntos-pytorchjob-con-cuota-kueue-y-gang-placement-volcano">Kueue + Volcano together: PyTorchJob with Kueue quota and Volcano gang placement&lt;/h3>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-yaml" data-lang="yaml">&lt;span class="line">&lt;span class="cl">&lt;span class="c"># PyTorchJob: Kueue enforces the quota, Volcano does the gang placement&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="nt">apiVersion&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">kubeflow.org/v1&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="nt">kind&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">PyTorchJob&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="nt">metadata&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">name&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">distributed-finetune&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">namespace&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">ns-datos&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">labels&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">kueue.x-k8s.io/queue-name&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">lq-datos&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="c"># Kueue: quota and admission&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">annotations&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="c"># Volcano creates the PodGroup automatically when schedulerName=volcano&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">scheduling.volcano.sh/queue-name&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">team-datos&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="nt">spec&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">pytorchReplicaSpecs&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">Master&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">replicas&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="m">1&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">restartPolicy&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">OnFailure&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">template&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">spec&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">schedulerName: volcano # Volcano&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">placement gang&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">containers&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>- &lt;span class="nt">name&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">pytorch&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">image&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">nvcr.io/nvidia/pytorch:25.01-py3&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">resources&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">limits&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">nvidia.com/gpu&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s2">&amp;#34;4&amp;#34;&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">cpu&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s2">&amp;#34;16&amp;#34;&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">memory&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s2">&amp;#34;64Gi&amp;#34;&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">Worker&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">replicas&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="m">7&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">restartPolicy&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">OnFailure&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">template&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">spec&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">schedulerName&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">volcano&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">containers&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>- &lt;span class="nt">name&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">pytorch&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">image&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">nvcr.io/nvidia/pytorch:25.01-py3&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">resources&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">limits&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">nvidia.com/gpu&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s2">&amp;#34;4&amp;#34;&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">cpu&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s2">&amp;#34;16&amp;#34;&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">memory&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s2">&amp;#34;64Gi&amp;#34;&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;hr>
&lt;h2 id="para-cargas-llm-entrenamiento-fine-tuning-e-inferencia-batch">For LLM workloads: training, fine-tuning, and batch inference&lt;/h2>
&lt;h3 id="entrenamiento-y-fine-tuning-distribuido-multi-gpu">Distributed multi-GPU training and fine-tuning&lt;/h3>
&lt;p>Gang scheduling is &lt;strong>essential&lt;/strong> for any distributed training job that relies on NCCL all-reduce (PyTorch DDP, FSDP, DeepSpeed ZeRO). If a single worker in the group fails to start, the &lt;code>torchrun&lt;/code> rendezvous blocks waiting for it while the workers that did start hold their GPUs idle. With the standard kube-scheduler, which places pods one at a time, this can happen whenever the cluster is under contention.&lt;/p>
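&lt;p>A toy model makes the failure mode concrete. This sketch is not how kube-scheduler or Volcano are implemented: the function names, the slot model (one slot holds one pod) and the round-robin order are illustrative assumptions. It only contrasts pod-by-pod placement with all-or-nothing placement.&lt;/p>

```python
def place_pod_by_pod(free_slots: int, jobs: dict[str, int]) -> dict[str, int]:
    """Place one pod at a time, round-robin across jobs (no gang awareness)."""
    placed = {name: 0 for name in jobs}
    progress = True
    while free_slots > 0 and progress:
        progress = False
        for name, size in jobs.items():
            if free_slots > 0 and size > placed[name]:
                placed[name] += 1
                free_slots -= 1
                progress = True
    return placed


def place_gang(free_slots: int, jobs: dict[str, int]) -> dict[str, int]:
    """Place a job only if all of its pods fit at once (all-or-nothing)."""
    placed = {}
    for name, size in jobs.items():
        if free_slots >= size:
            placed[name] = size
            free_slots -= size
        else:
            placed[name] = 0
    return placed


def can_start(placed: dict[str, int], jobs: dict[str, int]) -> list[str]:
    """Jobs whose full gang is running, so the all-reduce can begin."""
    return [name for name, size in jobs.items() if placed[name] == size]


jobs = {"job-a": 8, "job-b": 8}        # hypothetical: two 8-pod jobs
naive = place_pod_by_pod(8, jobs)      # 8 free slots in the cluster
gang = place_gang(8, jobs)
print(naive, can_start(naive, jobs))
print(gang, can_start(gang, jobs))
```

&lt;p>In the pod-by-pod run each job ends up with 4 of its 8 pods and neither can start, although every slot is taken. In the gang run &lt;code>job-a&lt;/code> gets all 8 slots and &lt;code>job-b&lt;/code> holds nothing while it waits.&lt;/p>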
&lt;p>In a generic cluster of 4 nodes with 4×H100 SXM each (16 GPUs in total), fine-tuning a 70B model typically needs 8 GPUs with tensor parallelism 8 (TP=8), or 16 GPUs with TP=4 plus 4-way data parallelism. With Volcano, a &lt;code>PodGroup&lt;/code> with &lt;code>minMember: 8&lt;/code> (&lt;code>minMember&lt;/code> counts pods, not GPUs) guarantees that either all 8 pods are placed at once or none of them holds resources. With Kueue on top, the quota ensures that the team does not exceed its 16 nominal GPUs and that other teams with available quota are not blocked by a job that is still waiting.&lt;/p>
&lt;p>The link to &lt;a href="https://blog.lo0.es/posts/capacity-planning-inferencia-llm-on-premise/">capacity planning for inference&lt;/a> is direct: the model's VRAM budget (weights + KV cache) determines the minimum TP degree and therefore the &lt;code>minMember&lt;/code> of the PodGroup.&lt;/p>
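&lt;p>As a back-of-the-envelope sketch of that chain (VRAM budget, then TP degree, then &lt;code>minMember&lt;/code>): the helper names, the power-of-two rounding and the 560 GiB budget are assumptions made for illustration, not measurements from this post.&lt;/p>

```python
import math


def min_tensor_parallel(budget_gib: float, gpu_vram_gib: float) -> int:
    """Smallest power-of-two TP degree whose pooled VRAM covers the budget."""
    gpus_needed = math.ceil(budget_gib / gpu_vram_gib)
    tp = 1
    while gpus_needed > tp:
        tp *= 2
    return tp


def min_member(tp: int, dp: int, gpus_per_pod: int) -> int:
    """Pods the gang needs: total GPUs (TP x DP) divided by GPUs per pod."""
    total_gpus = tp * dp
    if total_gpus % gpus_per_pod != 0:
        raise ValueError("total GPUs must be a multiple of GPUs per pod")
    return total_gpus // gpus_per_pod


# Hypothetical budget: 560 GiB for weights, optimizer state and activations.
# 80 GiB per GPU approximates an 80 GB H100.
tp = min_tensor_parallel(budget_gib=560, gpu_vram_gib=80)
pods_one_gpu_each = min_member(tp, dp=1, gpus_per_pod=1)
pods_four_gpus_each = min_member(tp=4, dp=4, gpus_per_pod=4)
print(tp, pods_one_gpu_each, pods_four_gpus_each)
```

&lt;p>Under these assumptions the budget needs TP=8, which gives a gang of 8 pods at one GPU per pod. The TP=4 plus 4-way data-parallel layout uses 16 GPUs, which is 4 pods if each pod takes a whole 4-GPU node.&lt;/p>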
&lt;h3 id="inferencia-batch-y-evaluaciones-evals">Batch inference and evaluations (evals)&lt;/h3>
&lt;p>Batch inference jobs (generating answers for an evaluation dataset, computing embeddings in bulk, offline re-ranking) are naturally parallel workloads. They do not necessarily need strict gang scheduling, since each request is independent, but they do benefit from quota and fair-share.&lt;/p>
&lt;p>For these workloads, &lt;strong>Kueue alone is sufficient&lt;/strong>: a &lt;code>batch/Job&lt;/code> with multiple independent pods is governed by the ClusterQueue quota, with no need for Volcano. If several teams submit evaluation jobs at the same time, Kueue orders admission by fair-share and priority, and with cohort borrowing a team's jobs can run on quota that other teams leave unused instead of waiting.&lt;/p>
&lt;p>Chargeback for these workloads ties directly into &lt;a href="https://blog.lo0.es/posts/chargeback-showback-multitenancy-gpu/">GPU chargeback and showback&lt;/a>: the ClusterQueue's &lt;code>nominalQuota&lt;/code> expresses the GPU budget in Kubernetes, and OpenCost can attribute cost by namespace or label for the monthly report.&lt;/p>
&lt;h3 id="cuota-gpu-multi-tenant-y-chargeback">Multi-tenant GPU quota and chargeback&lt;/h3>
&lt;p>Kueue maps directly onto the chargeback system:&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>FinOps concept&lt;/th>
&lt;th>Kueue mechanism&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>Guaranteed GPU budget&lt;/td>
&lt;td>&lt;code>nominalQuota&lt;/code> per ClusterQueue&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Maximum spending ceiling&lt;/td>
&lt;td>&lt;code>nominalQuota + borrowingLimit&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Lending idle capacity&lt;/td>
&lt;td>Cohort + &lt;code>lendingLimit&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Reclaiming your own quota&lt;/td>
&lt;td>&lt;code>preemption.reclaimWithinCohort: LowerPriority&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Fair-share across teams&lt;/td>
&lt;td>Fair Sharing policy on the ClusterQueue&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Chargeback for borrowed capacity&lt;/td>
&lt;td>borrowed GPU hours × cost per GPU-hour (OpenCost)&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
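&lt;p>Two rows of the table reduce to simple arithmetic. The sketch below is only an illustration: the function names, the hourly usage samples and the 2.50 price per GPU-hour are hypothetical, and in practice the usage series would come from OpenCost or Prometheus.&lt;/p>

```python
def spending_ceiling(nominal_quota: int, borrowing_limit: int) -> int:
    """Most GPUs the ClusterQueue can hold: nominalQuota + borrowingLimit."""
    return nominal_quota + borrowing_limit


def borrowed_gpu_hours(hourly_usage: list[int], nominal_quota: int) -> int:
    """GPU-hours used above the nominal quota, one usage sample per hour."""
    return sum(max(0, used - nominal_quota) for used in hourly_usage)


# Hypothetical team: 16 nominal GPUs, may borrow up to 8 more from the cohort.
ceiling = spending_ceiling(nominal_quota=16, borrowing_limit=8)
usage = [12, 20, 24, 16]               # GPUs in use during four hours
borrowed = borrowed_gpu_hours(usage, nominal_quota=16)
cost = borrowed * 2.50                 # assumed price per GPU-hour
print(ceiling, borrowed, cost)
```

&lt;p>With these numbers the ceiling is 24 GPUs, and the team is charged for 12 borrowed GPU-hours (4 in the second hour and 8 in the third).&lt;/p>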
&lt;p>For utilization as a FinOps lever, see &lt;a href="https://blog.lo0.es/posts/utilizacion-gpu-como-finops/">GPU utilization as a FinOps lever&lt;/a>: together, Kueue and Volcano let you maximize utilization without giving up quota guarantees, which is exactly the FinOps goal.&lt;/p>
&lt;p>Managing MIG partitions within this system (declaring &lt;code>nvidia.com/mig-4g.40gb&lt;/code> as a resource in the ClusterQueue) fits naturally: the ResourceFlavor can map to nodes with a specific MIG profile, as explained in &lt;a href="https://blog.lo0.es/posts/compartir-gpu-time-slicing-mps-mig/">sharing a GPU: time-slicing, MPS, and MIG&lt;/a>.&lt;/p>
&lt;hr>
&lt;h2 id="diagrama-flujo-de-un-job-de-entrenamiento-con-kueue--volcano">Diagram: flow of a training job with Kueue + Volcano&lt;/h2>
&lt;div class="diagram" style="max-width:800px;margin:1rem auto;">
&lt;svg viewBox="0 0 800 380" role="img" aria-label="Admission and scheduling flow of a distributed training job with Kueue and Volcano on Kubernetes" xmlns="http://www.w3.org/2000/svg">
&lt;style>.bx{fill:none;stroke:currentColor;stroke-width:1.4}.dsh{fill:none;stroke:currentColor;stroke-width:1.4;stroke-dasharray:5 3}.tl{font:600 12px sans-serif;fill:currentColor}.ts{font:11px sans-serif;fill:currentColor}.ar{fill:none;stroke:currentColor;stroke-width:1.4;marker-end:url(#arm)}&lt;/style>
&lt;defs>&lt;marker id="arm" viewBox="0 0 10 10" refX="9" refY="5" markerWidth="7" markerHeight="7" orient="auto">&lt;path d="M0,0 L10,5 L0,10 z" fill="currentColor"/>&lt;/marker>&lt;/defs>
&lt;text x="400" y="22" text-anchor="middle" font-size="13" font-weight="700" fill="currentColor">Distributed training job: Kueue (quota) + Volcano (gang placement)&lt;/text>
&lt;rect class="bx" x="20" y="40" width="140" height="52" rx="5"/>
&lt;text x="90" y="62" text-anchor="middle" class="tl">User / CI&lt;/text>
&lt;text x="90" y="78" text-anchor="middle" class="ts">kubectl apply PyTorchJob&lt;/text>
&lt;path class="ar" d="M160,66 L200,66"/>
&lt;rect class="bx" x="200" y="40" width="155" height="52" rx="5"/>
&lt;text x="277" y="62" text-anchor="middle" class="tl">Kueue controller&lt;/text>
&lt;text x="277" y="78" text-anchor="middle" class="ts">quota available in CQ?&lt;/text>
&lt;path class="ar" d="M277,92 L277,130"/>
&lt;rect class="dsh" x="200" y="130" width="155" height="40" rx="5"/>
&lt;text x="277" y="146" text-anchor="middle" class="tl">Waiting queue&lt;/text>
&lt;text x="277" y="162" text-anchor="middle" class="ts">fair-share / priority&lt;/text>
&lt;path class="ar" d="M355,66 L400,66"/>
&lt;rect class="bx" x="400" y="40" width="155" height="52" rx="5"/>
&lt;text x="477" y="62" text-anchor="middle" class="tl">Admission (quota OK)&lt;/text>
&lt;text x="477" y="78" text-anchor="middle" class="ts">pods allowed; CQ reserves GPUs&lt;/text>
&lt;path class="ar" d="M477,92 L477,140"/>
&lt;rect class="bx" x="400" y="140" width="155" height="52" rx="5"/>
&lt;text x="477" y="162" text-anchor="middle" class="tl">Volcano scheduler&lt;/text>
&lt;text x="477" y="178" text-anchor="middle" class="ts">PodGroup: minMember=8 (gang)&lt;/text>
&lt;path class="ar" d="M477,192 L477,240"/>
&lt;rect class="bx" x="400" y="240" width="155" height="52" rx="5"/>
&lt;text x="477" y="262" text-anchor="middle" class="tl">4×H100 SXM nodes&lt;/text>
&lt;text x="477" y="278" text-anchor="middle" class="ts">8 pods × 4 GPUs, all at once&lt;/text>
&lt;path class="ar" d="M555,166 L625,166"/>
&lt;rect class="dsh" x="625" y="140" width="150" height="52" rx="5"/>
&lt;text x="700" y="162" text-anchor="middle" class="tl">If no nodes fit&lt;/text>
&lt;text x="700" y="178" text-anchor="middle" class="ts">no pod is placed → wait&lt;/text>
&lt;text x="90" y="320" class="ts">Kueue: manages quota, cohorts, fair-share, preemption between queues&lt;/text>
&lt;text x="90" y="338" class="ts">Volcano: gang placement (all-or-nothing), topology-aware, NUMA, DRF across Queues&lt;/text>
&lt;text x="90" y="356" class="ts">The two levels are orthogonal: Kueue does not see nodes, Volcano does not see team quota&lt;/text>
&lt;/svg>
&lt;/div>
&lt;hr>
&lt;h2 id="tabla-comparativa-completa">Full comparison table&lt;/h2>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Criterion&lt;/th>
&lt;th>Volcano&lt;/th>
&lt;th>Kueue&lt;/th>
&lt;th>Coscheduling plugin&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>&lt;strong>Role&lt;/strong>&lt;/td>
&lt;td>Scheduler (placement)&lt;/td>
&lt;td>Admission + quota (no placement)&lt;/td>
&lt;td>kube-scheduler plugin (placement)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>Gang scheduling&lt;/strong>&lt;/td>
&lt;td>Yes, at pod level (PodGroup/minMember)&lt;/td>
&lt;td>Yes, at admission level (all-or-nothing quota)&lt;/td>
&lt;td>Yes, at pod level (PodGroup)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>Multi-tenant quotas&lt;/strong>&lt;/td>
&lt;td>Yes (Queue capability/guarantee)&lt;/td>
&lt;td>Yes (ClusterQueue nominalQuota)&lt;/td>
&lt;td>No&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>Cohorts / borrowing&lt;/strong>&lt;/td>
&lt;td>Limited (reclaimable)&lt;/td>
&lt;td>Yes (Cohort with borrowingLimit/lendingLimit)&lt;/td>
&lt;td>No&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>Fair-share&lt;/strong>&lt;/td>
&lt;td>Yes (DRF, proportion)&lt;/td>
&lt;td>Yes (Fair Sharing; historical usage via Admission Fair Sharing)&lt;/td>
&lt;td>No&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>Inter-queue preemption&lt;/strong>&lt;/td>
&lt;td>Yes (reclaim action)&lt;/td>
&lt;td>Yes (reclaimWithinCohort)&lt;/td>
&lt;td>No&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>Network topology / NUMA&lt;/strong>&lt;/td>
&lt;td>Yes (v1.11+, dedicated plugins)&lt;/td>
&lt;td>Yes (TAS, Topology CRD)&lt;/td>
&lt;td>No&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>Native integrations&lt;/strong>&lt;/td>
&lt;td>MPI, PyTorch, Ray, TF, Spark, Flink, Horovod&lt;/td>
&lt;td>Job, JobSet, RayJob, Kubeflow, LWS, AppWrapper, Deployment, StatefulSet&lt;/td>
&lt;td>Any job with a PodGroup&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>Who places the pods&lt;/strong>&lt;/td>
&lt;td>Volcano&lt;/td>
&lt;td>kube-scheduler (or Volcano)&lt;/td>
&lt;td>kube-scheduler&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>Elastic gang&lt;/strong>&lt;/td>
&lt;td>Yes (minMember &amp;lt; total replicas)&lt;/td>
&lt;td>Partial (partial admission for batch/Job)&lt;/td>
&lt;td>Limited&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>Footprint&lt;/strong>&lt;/td>
&lt;td>Medium to high (its own scheduler)&lt;/td>
&lt;td>Light (one additional controller)&lt;/td>
&lt;td>Minimal (scheduler plugin)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>Compatibility with Kueue&lt;/strong>&lt;/td>
&lt;td>Yes (as the scheduler under Kueue)&lt;/td>
&lt;td>n/a&lt;/td>
&lt;td>Yes (complementary option)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>CNCF status / maturity&lt;/strong>&lt;/td>
&lt;td>CNCF incubating, v1.15&lt;/td>
&lt;td>kubernetes-sigs, v1beta2, adopted in GKE/OpenShift&lt;/td>
&lt;td>kubernetes-sigs/scheduler-plugins, experimental&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>When to choose it&lt;/strong>&lt;/td>
&lt;td>HPC-like distributed training, NUMA, network topology, MPI&lt;/td>
&lt;td>Multi-tenancy with flexible quota, heterogeneous workloads, inference and batch together&lt;/td>
&lt;td>Simple clusters that only need gang scheduling without a separate scheduler&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;hr>
&lt;h2 id="pitfalls-operativos-y-escepticismo-honesto">Operational pitfalls and honest skepticism&lt;/h2>
&lt;h3 id="1-deadlock-por-gang-mal-configurado">1. Deadlock from a misconfigured gang&lt;/h3>
&lt;p>The most common scenario: &lt;code>minMember&lt;/code> is set equal to the total number of replicas in a cluster where several jobs compete for the same resources. Suppose two jobs of 8 pods each target a cluster with 8 GPU nodes, and there is room for only 4 pods of job A and 4 pods of job B. Because neither gang reaches 8, Volcano keeps the pods of both jobs in Pending and &lt;strong>nobody makes progress&lt;/strong>. Volcano is doing its job: it places no pod until there is room for all 8. But if the &lt;code>nominalQuota&lt;/code> values of the queues are badly sized relative to the real capacity of the cluster, this leads to indefinite waits.&lt;/p>
&lt;p>Fix: size the queue quotas so that the sum of the &lt;code>guarantee&lt;/code> values does not exceed real capacity, and make sure the &lt;code>minMember&lt;/code> of the active jobs fits within the available quota. Node autoscaling with ProvisioningRequest (Kueue + cluster-autoscaler) helps, but it adds provisioning latency that has to be accounted for in the job's SLA.&lt;/p>
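&lt;p>That sizing rule can be checked mechanically before jobs are submitted. This is a minimal sketch under stated assumptions: &lt;code>check_sizing&lt;/code>, the queue name &lt;code>team-ml&lt;/code> and all the numbers are hypothetical, and the inputs would normally be read from the Queue and job manifests.&lt;/p>

```python
def check_sizing(capacity_gpus, guarantees, gangs):
    """Return human-readable problems with queue guarantees and gang sizes.

    guarantees: queue name -> guaranteed GPUs
    gangs: list of (queue name, minMember, GPUs per pod)
    """
    problems = []
    total = sum(guarantees.values())
    if total > capacity_gpus:
        problems.append(
            f"guarantees sum to {total} GPUs, cluster has {capacity_gpus}"
        )
    for queue, min_member, gpus_per_pod in gangs:
        need = min_member * gpus_per_pod
        have = guarantees.get(queue, 0)
        if need > have:
            problems.append(
                f"gang in {queue} needs {need} GPUs, guarantee is {have}"
            )
    return problems


# Hypothetical cluster: 8 nodes x 4 GPUs, two queues, one 8-pod gang each.
issues = check_sizing(
    capacity_gpus=32,
    guarantees={"team-datos": 32, "team-ml": 32},
    gangs=[("team-datos", 8, 4), ("team-ml", 8, 4)],
)
for line in issues:
    print(line)
```

&lt;p>Here each gang fits its own guarantee, but the two guarantees add up to 64 GPUs on a 32-GPU cluster, so the check reports one problem: both jobs can be admitted on paper while only one can ever be placed.&lt;/p>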
&lt;h3 id="2-cuota-vs-capacidad-real-el-drift-silencioso">2. Quota vs. real capacity: the silent drift&lt;/h3>
&lt;p>Kueue's &lt;code>nominalQuota&lt;/code> and Volcano's &lt;code>guarantee&lt;/code> are administrative declarations. They do not guarantee that the nodes holding those GPUs are available and healthy, or that the device plugin has registered the GPUs correctly. A node in &lt;code>NotReady&lt;/code> with 4 GPUs reduces real capacity without Kueue knowing: the ClusterQueue keeps admitting workloads that then cannot be placed.&lt;/p>
&lt;p>Recommended monitoring: cross-check Kueue metrics (&lt;code>kueue_admitted_workloads_total&lt;/code>, &lt;code>kueue_pending_workloads&lt;/code>) against the cluster's real capacity metrics (GPUs registered by the device plugin) to detect the drift. Kueue exposes native Prometheus metrics, and so does Volcano.&lt;/p>
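&lt;p>The comparison itself is a subtraction. In this sketch the node snapshot is hard-coded and the helper names are made up; a real check would pull node readiness and allocatable GPUs from the Kubernetes API or Prometheus, which is outside what is shown here.&lt;/p>

```python
def real_gpu_capacity(nodes):
    """GPUs on Ready nodes only; NotReady nodes contribute nothing."""
    return sum(node["gpus"] for node in nodes if node["ready"])


def quota_drift(nominal_quota, nodes):
    """GPUs that the quota promises but the cluster cannot provide right now."""
    return max(0, nominal_quota - real_gpu_capacity(nodes))


# Hypothetical snapshot: 4 nodes x 4 GPUs, one node is NotReady.
nodes = [
    {"name": "gpu-0", "ready": True, "gpus": 4},
    {"name": "gpu-1", "ready": True, "gpus": 4},
    {"name": "gpu-2", "ready": False, "gpus": 4},
    {"name": "gpu-3", "ready": True, "gpus": 4},
]
capacity = real_gpu_capacity(nodes)
drift = quota_drift(nominal_quota=16, nodes=nodes)
print(capacity, drift)
```

&lt;p>With one node down, real capacity is 12 GPUs against a nominal quota of 16, so 4 GPUs of admitted work would have nowhere to run. A non-zero drift is a reasonable alerting condition.&lt;/p>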
&lt;h3 id="3-naming-de-recursos-gpu-mig-time-slicing-y-resourceflavor">3. GPU resource naming: MIG, time-slicing, and ResourceFlavor&lt;/h3>
&lt;p>If the cluster uses MIG with the mixed strategy, the resources in pod specs change from &lt;code>nvidia.com/gpu&lt;/code> to &lt;code>nvidia.com/mig-Xg.Ygb&lt;/code> (for example, &lt;code>nvidia.com/mig-3g.40gb&lt;/code>). Kueue's ClusterQueues (with their ResourceFlavors) and Volcano's Queues must declare the correct resource name, or the quota will not match the pods. With time-slicing, the resource is still &lt;code>nvidia.com/gpu&lt;/code>, but the device plugin advertises more instances than there are physical GPUs. Quota is then expressed in virtual replicas, which can lead to over-admission if the VRAM budget is not taken into account (see &lt;a href="https://blog.lo0.es/posts/compartir-gpu-time-slicing-mps-mig/">sharing a GPU: time-slicing, MPS, and MIG&lt;/a>).&lt;/p>
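&lt;p>The time-slicing risk is easy to quantify. The following sketch assumes a hypothetical 80 GiB GPU advertised as 4 replicas and pods that each expect about 30 GiB of VRAM; the function names are illustrative.&lt;/p>

```python
def overcommitted_vram_gib(gpu_vram_gib, pod_vram_gib):
    """VRAM requested beyond what one physical GPU has (0 if it all fits)."""
    return max(0, sum(pod_vram_gib) - gpu_vram_gib)


def safe_replicas(gpu_vram_gib, per_pod_vram_gib):
    """How many time-sliced pods of a given footprint fit on one GPU."""
    return int(gpu_vram_gib // per_pod_vram_gib)


# Quota counts 4 virtual replicas, so 4 pods are admitted onto one GPU.
excess = overcommitted_vram_gib(80, [30, 30, 30, 30])
fits = safe_replicas(80, 30)
print(excess, fits)
```

&lt;p>The quota sees 4 free replicas, yet only 2 such pods fit in memory and the other two would overshoot by 40 GiB in total. Sizing the replica count from the largest expected footprint avoids this.&lt;/p>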
&lt;h3 id="4-preemption-en-producción-el-workload-expulsado-pierde-progreso">4. Preemption in production: the evicted workload loses progress&lt;/h3>
&lt;p>When Kueue or Volcano evicts a training job that has been running for hours, the job loses its progress unless checkpointing is configured. Preemption is correct from the quota standpoint, but it destroys work if the job is not prepared for it. Before enabling aggressive preemption, verify that every training job checkpoints periodically and restores automatically. PyTorch with torchrun and DeepSpeed both provide the save and load primitives, though the training script still has to call them and resume from the latest checkpoint. Stateless batch inference jobs do not have this problem.&lt;/p>
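&lt;p>The pattern to verify in each job is "save periodically, resume on start". The sketch below shows it with the standard library only: JSON files stand in for &lt;code>torch.save&lt;/code> and &lt;code>torch.load&lt;/code>, the loop body is a placeholder for a real training step, and &lt;code>stop_at&lt;/code> is a hypothetical knob that simulates an eviction.&lt;/p>

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write to a temp file, then rename, so a kill never leaves a torn file."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)


def load_checkpoint(path):
    """Resume from the last checkpoint, or start from scratch."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0, "loss_sum": 0.0}


def train(path, total_steps, every, stop_at=None):
    """Toy loop; stop_at simulates a preemption when that step is reached."""
    state = load_checkpoint(path)
    while total_steps > state["step"]:
        if stop_at is not None and state["step"] == stop_at:
            return state                  # evicted: unsaved steps are lost
        state["step"] += 1
        state["loss_sum"] += 1.0          # stand-in for a real training step
        if state["step"] % every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state


ckpt = os.path.join(tempfile.mkdtemp(), "ckpt.json")
train(ckpt, total_steps=10, every=2, stop_at=5)   # preempted at step 5
resumed = load_checkpoint(ckpt)                   # what survived on disk
final = train(ckpt, total_steps=10, every=2)      # restart resumes from there
print(resumed["step"], final["step"])
```

&lt;p>The eviction at step 5 costs one step, because the last checkpoint on disk is step 4. The restarted run continues from step 4 to step 10 instead of starting over. The checkpoint interval is therefore the upper bound on work lost per preemption.&lt;/p>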
&lt;h3 id="5-volcano-como-scheduler-único-vs-coexistencia-con-kube-scheduler">5. Volcano as the only scheduler vs. coexistence with kube-scheduler&lt;/h3>
&lt;p>Volcano can be configured as the cluster's default scheduler (every pod goes through it) or as an alternative scheduler (only pods with &lt;code>schedulerName: volcano&lt;/code>). The first option simplifies configuration but can break system pods that assume kube-scheduler behavior. The second (recommended) requires ML jobs to set &lt;code>schedulerName: volcano&lt;/code> explicitly, which can be a non-trivial operator or chart change for existing workloads. Kueue handles this more transparently: it only needs a label on the job, with no scheduler change.&lt;/p>
&lt;h3 id="6-complexidad-operativa-real-en-2026">6. Real operational complexity in 2026&lt;/h3>
&lt;p>Running Kueue + Volcano + the Training Operator + the GPU Operator in production means four components, each with its own CRDs, webhooks, versions, and release cycle. A Kubernetes upgrade may require upgrading all four in sequence. The operational debt is real. For a small team without the capacity to maintain this stack, a managed Kubernetes offering (GKE with native Kueue, OpenShift with the Red Hat build of Kueue) can be more pragmatic than building the full stack from scratch.&lt;/p>
&lt;p>The coscheduling plugin from sig-scheduling is a deliberately simpler option when gang scheduling is all you need: fewer features, less complexity, fewer things to maintain.&lt;/p>
&lt;hr>
&lt;h2 id="ver-también">See also&lt;/h2>
&lt;ul>
&lt;li>&lt;a href="https://blog.lo0.es/posts/chargeback-showback-multitenancy-gpu/">GPU chargeback and showback in multi-tenancy&lt;/a>: how to connect Kueue's &lt;code>nominalQuota&lt;/code> to the monthly per-team OpenCost report.&lt;/li>
&lt;li>&lt;a href="https://blog.lo0.es/posts/utilizacion-gpu-como-finops/">GPU utilization as a FinOps lever&lt;/a>: gang scheduling plus quotas as a tool to raise utilization without idle capacity locked up by deadlocks.&lt;/li>
&lt;li>&lt;a href="https://blog.lo0.es/posts/compartir-gpu-time-slicing-mps-mig/">Sharing a GPU: time-slicing, MPS, and MIG&lt;/a>: how to declare MIG resources in Kueue's ResourceFlavor and in Volcano's Queue.&lt;/li>
&lt;li>&lt;a href="https://blog.lo0.es/posts/capacity-planning-inferencia-llm-on-premise/">Capacity planning for on-premise LLM inference&lt;/a>: the model's VRAM budget determines the PodGroup's minMember and the minimum viable nominalQuota.&lt;/li>
&lt;li>&lt;a href="https://blog.lo0.es/posts/autoscaling-llm-kubernetes-keda/">LLM autoscaling on Kubernetes with KEDA&lt;/a>: how to combine autoscaling of inference replicas with Kueue quotas so that the GPU budget is not exceeded.&lt;/li>
&lt;li>&lt;a href="https://blog.lo0.es/posts/operators-llm-kubernetes/">LLM operators on Kubernetes&lt;/a>: the Training Operator (Kubeflow) is the controller behind the PyTorchJob/TFJob resources managed by Kueue + Volcano.&lt;/li>
&lt;li>&lt;a href="https://blog.lo0.es/posts/cluster-h100-plataforma-multi-tenant/">H100 cluster: a multi-tenant platform&lt;/a>: the full architecture of a multi-tenant GPU cluster in which Volcano and Kueue are parts of the scheduling stack.&lt;/li>
&lt;/ul>
&lt;hr>
&lt;h2 id="fuentes">Sources&lt;/h2>
&lt;ul>
&lt;li>Volcano. Introduction and features (CNCF incubating, v1.15). &lt;a href="https://volcano.sh/en/docs/">https://volcano.sh/en/docs/&lt;/a>&lt;/li>
&lt;li>Volcano. Volcano v1.11 Released: A New Era of Cloud-Native Scheduling (CNCF blog, March 2025). &lt;a href="https://www.cncf.io/blog/2025/03/05/volcano-v1-11-released-a-new-era-of-cloud-native-scheduling-for-ai-and-big-data/">https://www.cncf.io/blog/2025/03/05/volcano-v1-11-released-a-new-era-of-cloud-native-scheduling-for-ai-and-big-data/&lt;/a>&lt;/li>
&lt;li>Volcano. Release v1.11.0 (GitHub). &lt;a href="https://github.com/volcano-sh/volcano/releases/tag/v1.11.0">https://github.com/volcano-sh/volcano/releases/tag/v1.11.0&lt;/a>&lt;/li>
&lt;li>Volcano. Network Topology Aware Scheduling design doc. &lt;a href="https://github.com/volcano-sh/volcano/blob/master/docs/design/Network%20Topology%20Aware%20Scheduling.md">https://github.com/volcano-sh/volcano/blob/master/docs/design/Network%20Topology%20Aware%20Scheduling.md&lt;/a>&lt;/li>
&lt;li>Volcano. NUMA-aware scheduling design. &lt;a href="https://github.com/volcano-sh/volcano/blob/master/docs/design/numa-aware.md">https://github.com/volcano-sh/volcano/blob/master/docs/design/numa-aware.md&lt;/a>&lt;/li>
&lt;li>Volcano. Unified Scheduling (official docs). &lt;a href="https://volcano.sh/en/docs/unified_scheduling/">https://volcano.sh/en/docs/unified_scheduling/&lt;/a>&lt;/li>
&lt;li>Volcano. Capacity scheduling design. &lt;a href="https://github.com/volcano-sh/volcano/blob/master/docs/design/capacity-scheduling.md">https://github.com/volcano-sh/volcano/blob/master/docs/design/capacity-scheduling.md&lt;/a>&lt;/li>
&lt;li>NVIDIA Technical Blog. Practical Tips for Preventing GPU Fragmentation for Volcano Scheduler. &lt;a href="https://developer.nvidia.com/blog/practical-tips-for-preventing-gpu-fragmentation-for-volcano-scheduler/">https://developer.nvidia.com/blog/practical-tips-for-preventing-gpu-fragmentation-for-volcano-scheduler/&lt;/a>&lt;/li>
&lt;li>Kueue. Overview (kueue.sigs.k8s.io, updated February 2026). &lt;a href="https://kueue.sigs.k8s.io/docs/overview/">https://kueue.sigs.k8s.io/docs/overview/&lt;/a>&lt;/li>
&lt;li>Kueue. Cluster Queue (nominalQuota, borrowingLimit, lendingLimit, preemption). &lt;a href="https://kueue.sigs.k8s.io/docs/concepts/cluster_queue/">https://kueue.sigs.k8s.io/docs/concepts/cluster_queue/&lt;/a>&lt;/li>
&lt;li>Kueue. Cohort. &lt;a href="https://kueue.sigs.k8s.io/docs/concepts/cohort/">https://kueue.sigs.k8s.io/docs/concepts/cohort/&lt;/a>&lt;/li>
&lt;li>Kueue. Fair Sharing. &lt;a href="https://kueue.sigs.k8s.io/docs/concepts/fair_sharing/">https://kueue.sigs.k8s.io/docs/concepts/fair_sharing/&lt;/a>&lt;/li>
&lt;li>Kueue. Topology Aware Scheduling. &lt;a href="https://kueue.sigs.k8s.io/docs/concepts/topology_aware_scheduling/">https://kueue.sigs.k8s.io/docs/concepts/topology_aware_scheduling/&lt;/a>&lt;/li>
&lt;li>Kueue. Preemption. &lt;a href="https://kueue.sigs.k8s.io/docs/concepts/preemption/">https://kueue.sigs.k8s.io/docs/concepts/preemption/&lt;/a>&lt;/li>
&lt;li>Kueue. Setup All-or-nothing with ready Pods. &lt;a href="https://kueue.sigs.k8s.io/docs/tasks/manage/setup_wait_for_pods_ready/">https://kueue.sigs.k8s.io/docs/tasks/manage/setup_wait_for_pods_ready/&lt;/a>&lt;/li>
&lt;li>Kueue. GitHub (kubernetes-sigs/kueue). &lt;a href="https://github.com/kubernetes-sigs/kueue">https://github.com/kubernetes-sigs/kueue&lt;/a>&lt;/li>
&lt;li>Ray. Gang Scheduling with RayJob and Kueue. &lt;a href="https://docs.ray.io/en/latest/cluster/kubernetes/examples/rayjob-kueue-gang-scheduling.html">https://docs.ray.io/en/latest/cluster/kubernetes/examples/rayjob-kueue-gang-scheduling.html&lt;/a>&lt;/li>
&lt;li>Kubeflow. Volcano scheduler integration. &lt;a href="https://www.kubeflow.org/docs/components/trainer/operator-guides/job-scheduling/volcano/">https://www.kubeflow.org/docs/components/trainer/operator-guides/job-scheduling/volcano/&lt;/a>&lt;/li>
&lt;li>kubernetes-sigs/scheduler-plugins. Coscheduling plugin README. &lt;a href="https://github.com/kubernetes-sigs/scheduler-plugins/blob/master/pkg/coscheduling/README.md">https://github.com/kubernetes-sigs/scheduler-plugins/blob/master/pkg/coscheduling/README.md&lt;/a>&lt;/li>
&lt;li>kubernetes/enhancements. KEP-583 Coscheduling (sig-scheduling). &lt;a href="https://github.com/kubernetes/enhancements/tree/master/keps/sig-scheduling/583-coscheduling">https://github.com/kubernetes/enhancements/tree/master/keps/sig-scheduling/583-coscheduling&lt;/a>&lt;/li>
&lt;li>InfraCloud. Batch Scheduling on Kubernetes: Comparing YuniKorn, Volcano and Kueue. &lt;a href="https://www.infracloud.io/blogs/batch-scheduling-on-kubernetes/">https://www.infracloud.io/blogs/batch-scheduling-on-kubernetes/&lt;/a>&lt;/li>
&lt;li>Red Hat. Red Hat build of Kueue (OpenShift 4.20). &lt;a href="https://docs.redhat.com/en/documentation/openshift_container_platform/4.20/html/ai_workloads/red-hat-build-of-kueue">https://docs.redhat.com/en/documentation/openshift_container_platform/4.20/html/ai_workloads/red-hat-build-of-kueue&lt;/a>&lt;/li>
&lt;li>AceCloud. Multi GPU Orchestration in Kubernetes 2026: Kueue, Volcano, DRA. &lt;a href="https://acecloud.ai/blog/multi-gpu-orchestration-kubernetes/">https://acecloud.ai/blog/multi-gpu-orchestration-kubernetes/&lt;/a>&lt;/li>
&lt;li>CloudOptimo. Kubernetes AI Infrastructure in 2026: GPU Scheduling and Production Realities. &lt;a href="https://www.cloudoptimo.com/blog/kubernetes-ai-infrastructure-in-2026-gpu-scheduling-and-production-realities/">https://www.cloudoptimo.com/blog/kubernetes-ai-infrastructure-in-2026-gpu-scheduling-and-production-realities/&lt;/a>&lt;/li>
&lt;/ul></description></item></channel></rss>