<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Keep on lo0 — Blog Técnico</title><link>https://blog.lo0.es/tags/keep/</link><description>Recent content in Keep on lo0 — Blog Técnico</description><generator>Hugo -- gohugo.io</generator><language>es</language><lastBuildDate>Tue, 02 Jun 2026 04:30:00 +0200</lastBuildDate><atom:link href="https://blog.lo0.es/tags/keep/index.xml" rel="self" type="application/rss+xml"/><item><title>Runbooks de incident response para inferencia LLM: cada alerta a una acción concreta con Kafka y Keep</title><link>https://blog.lo0.es/posts/runbooks-incident-response-llm-keep-kafka/</link><pubDate>Tue, 02 Jun 2026 04:30:00 +0200</pubDate><guid>https://blog.lo0.es/posts/runbooks-incident-response-llm-keep-kafka/</guid><description>&lt;blockquote>
&lt;p>This post closes the observability trilogy opened by &lt;a href="https://blog.lo0.es/posts/observabilidad-gpu-dcgm-llm/">GPU observability for LLM inference&lt;/a> (which metrics) and &lt;a href="https://blog.lo0.es/posts/anatomia-metricas-dcgm-vllm-anomalias/">Anatomy of the twelve DCGM metrics and five vLLM metrics&lt;/a> (which documented anomaly maps to each metric). Here each anomaly gets a concrete action and is fitted into the incident-management machinery that compliance requires.&lt;/p>
&lt;/blockquote>
&lt;h2 id="tldr">TL;DR&lt;/h2>
&lt;p>Alerts from &lt;a href="https://blog.lo0.es/posts/observabilidad-gpu-dcgm-llm/">GPU observability&lt;/a> are useless without a codified procedure for each one; an operator who interprets them by hand every time is operating on intuition. The right combination has &lt;strong>three indispensable pieces&lt;/strong>. (1) &lt;strong>Runbook catalog&lt;/strong>: for each of the six critical alerts (&lt;code>GpuHbmNearOom&lt;/code>, &lt;code>GpuThermalOrPowerThrottle&lt;/code>, &lt;code>GpuXidErrorDetected&lt;/code>, &lt;code>GpuEccDoubleBit&lt;/code>, &lt;code>VllmKvCachePoolNearFull&lt;/code>, &lt;code>VllmTtftP95OutOfSlo&lt;/code>), a severity, an immediate mitigation, the evidence to capture &lt;strong>before&lt;/strong> remediating, a resolution action, a closure criterion and a postmortem trigger. (2) &lt;strong>Reproducible pipeline&lt;/strong>: Prometheus + DCGM → Alertmanager → &lt;strong>Kafka as event bus&lt;/strong> (topics &lt;code>gpu.alerts.enriched&lt;/code>, &lt;code>incidents.lifecycle&lt;/code>, &lt;code>audit.actions&lt;/code> with WORM retention) → &lt;strong>Keep as workflow engine&lt;/strong> (declarative YAML workflows versioned in git) → executors (Kubernetes jobs, scripts, ChatOps). (3) &lt;strong>Formal fit into incident management&lt;/strong> according to the regulatory corpus: &lt;strong>ISO/IEC 27035&lt;/strong> phases &lt;code>identify → report → assess → respond → learn&lt;/code>; &lt;strong>ENS&lt;/strong> controls &lt;code>op.exp.7&lt;/code> (incident management), &lt;code>op.exp.8&lt;/code> (activity logging), &lt;code>op.exp.9&lt;/code> (incident management records); &lt;strong>NIS2&lt;/strong> art. 23 with an early warning within &lt;strong>24 h&lt;/strong>, an incident notification within &lt;strong>72 h&lt;/strong> and a final report within &lt;strong>1 month&lt;/strong>; &lt;strong>EU AI Act&lt;/strong> art. 73 for a serious incident in a high-risk system, with deadlines of &lt;strong>2 to 15 days&lt;/strong> depending on severity; &lt;strong>ISO/IEC 42001&lt;/strong> clause 10 (continual improvement of the AIMS).
The action taxonomy is &lt;strong>immediate mitigation&lt;/strong> (drain, throttle, scale-down: contains the damage in seconds) → &lt;strong>diagnosis&lt;/strong> (evidence capture with &lt;code>nvidia-smi -q&lt;/code>, &lt;code>dmesg&lt;/code>, a vLLM &lt;code>/metrics&lt;/code> snapshot and the related OTel trace; without it the postmortem is not defensible) → &lt;strong>resolution&lt;/strong> (restart, reset, RMA, rollback) → &lt;strong>postmortem&lt;/strong> (RCA by 5-whys, prevention plan, runbook update). Kafka provides the &lt;strong>immutable audit trail&lt;/strong> that ENS and the EU AI Act require: every action executed by Keep or by a human is published as an event in &lt;code>audit.actions&lt;/code> with timestamp, actor, decision and evidence, retained WORM for at least 6 months. Keep provides &lt;strong>workflows as code&lt;/strong>: this post includes three complete workflows (XID with drain + Jira ticket, ECC DBE with immediate paging and node blocking in the scheduler, automatic canary rollback when TTFT P95 is out of SLO). Four anti-patterns close the material: alerts without a runbook (the majority), runbooks without prior evidence capture (which perpetuates the incident because the root cause is lost), escalation by seniority instead of severity (a junior operator ends up handling an ECC DBE), and no human gate for destructive actions (Keep running &lt;code>nvidia-smi --gpu-reset&lt;/code> without confirmation). Applicable to a generic cluster of 4×H100 SXM with Kafka and Keep already deployed.&lt;/p>
&lt;h2 id="estás-aquí-observe--deploy-incident-response-cierra-el-bucle">You are here: OBSERVE → DEPLOY (incident response closes the loop)&lt;/h2>
&lt;div class="diagram" style="max-width:780px;margin:1rem auto;">
&lt;svg viewBox="0 0 780 90" xmlns="http://www.w3.org/2000/svg" role="img" aria-label="incident response: bucle Observe-Deploy">
&lt;style>.box{stroke:#444;stroke-width:1.4;rx:6}.active{fill:#c9a8e9;stroke-width:3}.semiactive{fill:#cfead0;stroke-width:2}.idle{fill:#f4f4f4}.lbl{font:600 12px sans-serif;fill:#222}.arr{stroke:#666;stroke-width:1.4;fill:none;marker-end:url(#rbm)}.cyc{stroke:#888;stroke-width:1.2;fill:none;stroke-dasharray:4 2;marker-end:url(#rbm)}.loop{stroke:#c33;stroke-width:1.8;fill:none;stroke-dasharray:5 3;marker-end:url(#rbmc)}&lt;/style>
&lt;defs>&lt;marker id="rbm" viewBox="0 0 10 10" refX="9" refY="5" markerWidth="6" markerHeight="6" orient="auto">&lt;path d="M0,0 L10,5 L0,10 z" fill="#666"/>&lt;/marker>&lt;marker id="rbmc" viewBox="0 0 10 10" refX="9" refY="5" markerWidth="6" markerHeight="6" orient="auto">&lt;path d="M0,0 L10,5 L0,10 z" fill="#c33"/>&lt;/marker>&lt;/defs>
&lt;text x="390" y="20" text-anchor="middle" class="lbl">Incident response: closes the loop from OBSERVE to DEPLOY (action)&lt;/text>
&lt;rect x="30" y="35" width="110" height="35" class="box idle"/>&lt;text x="85" y="58" text-anchor="middle" class="lbl">1 · Data&lt;/text>
&lt;rect x="155" y="35" width="110" height="35" class="box idle"/>&lt;text x="210" y="58" text-anchor="middle" class="lbl">2 · Tune&lt;/text>
&lt;rect x="280" y="35" width="110" height="35" class="box idle"/>&lt;text x="335" y="58" text-anchor="middle" class="lbl">3 · Eval&lt;/text>
&lt;rect x="405" y="35" width="110" height="35" class="box semiactive"/>&lt;text x="460" y="58" text-anchor="middle" class="lbl">4 · Deploy&lt;/text>
&lt;rect x="530" y="35" width="110" height="35" class="box active"/>&lt;text x="585" y="58" text-anchor="middle" class="lbl">5 · Observe&lt;/text>
&lt;rect x="655" y="35" width="110" height="35" class="box idle"/>&lt;text x="710" y="58" text-anchor="middle" class="lbl">6 · Retrain&lt;/text>
&lt;path class="arr" d="M140,52 L155,52"/>&lt;path class="arr" d="M265,52 L280,52"/>&lt;path class="arr" d="M390,52 L405,52"/>&lt;path class="arr" d="M515,52 L530,52"/>&lt;path class="arr" d="M640,52 L655,52"/>
&lt;path class="loop" d="M530,40 C500,5 480,5 460,40"/>
&lt;/svg>
&lt;/div>
&lt;h2 id="la-analogía-la-sala-de-control-de-un-reactor-nuclear">The analogy: a nuclear reactor control room&lt;/h2>
&lt;p>In a nuclear power plant control room, the operator on shift &lt;strong>never decides what to do on seeing an alarm&lt;/strong>. The decision has already been made and codified in a written standard operating procedure (SOP) that covers every alarm on the panel: if alarm X sounds, open book X, read steps 1 to N, execute them exactly, call the supervisor at step M, and escalate to the plant director at the step the procedure specifies. The reason is strict: critical alarms are rare but catastrophic if mishandled; an operator improvising in an emergency makes worse decisions than one applying a procedure reviewed by experts and validated in simulation.&lt;/p>
&lt;p>The reactor does not expect the operator to be a genius. It expects the operator to know the procedures to the letter, and the operations management system to hand over the right procedure at the right moment. If procedures are not written, not versioned, or not integrated with the alarms that trigger them, the control room runs on intuition. The difference between the two modes, procedural versus intuitive, is the difference between a plant that runs for 30 years without incidents and one that ends up blacklisted.&lt;/p>
&lt;p>Incident response for an LLM inference cluster works the same way. The DCGM and vLLM alerts listed in the previous posts are the alarms on the panel. Each one needs its own SOP: written, versioned, integrated with the alert that triggers it, and reviewed after every incident. Without that codification, the operator on shift improvises in the middle of an ECC DBE failure at 4 a.m.; with it, the operator executes the steps of runbook RB-04 and the incident is contained in 20 minutes.&lt;/p>
&lt;h2 id="la-arquitectura-del-incident-pipeline">The incident pipeline architecture&lt;/h2>
&lt;div class="diagram" style="max-width:840px;margin:1.5rem auto;">
&lt;svg viewBox="0 0 840 320" xmlns="http://www.w3.org/2000/svg" role="img" aria-label="pipeline de incident response">
&lt;style>.b{stroke:#333;stroke-width:1.4;rx:6}.src{fill:#dfe9f5;stroke:#356}.am{fill:#eef0d0;stroke:#7a3}.k{fill:#f4e3cf;stroke:#a63}.kp{fill:#ead8f5;stroke:#634}.ex{fill:#d8eecf;stroke:#373}.au{fill:#f6e2e2;stroke:#a33}.title{font:600 13px sans-serif;fill:#222}.h{font:700 12px sans-serif;fill:#222}.l{font:11px sans-serif;fill:#222}.n{font:italic 10px sans-serif;fill:#444}.arr{stroke:#666;stroke-width:1.4;fill:none;marker-end:url(#pim)}.dbl{stroke:#666;stroke-width:1.4;fill:none;stroke-dasharray:4 2;marker-end:url(#pim)}&lt;/style>
&lt;defs>&lt;marker id="pim" viewBox="0 0 10 10" refX="9" refY="5" markerWidth="6" markerHeight="6" orient="auto">&lt;path d="M0,0 L10,5 L0,10 z" fill="#666"/>&lt;/marker>&lt;/defs>
&lt;text x="420" y="20" text-anchor="middle" class="title">Pipeline: Prometheus → Alertmanager → Kafka → Keep → Executors · WORM audit in parallel&lt;/text>
&lt;rect x="20" y="45" width="140" height="60" class="b src"/>&lt;text x="90" y="65" text-anchor="middle" class="h">Prometheus&lt;/text>&lt;text x="90" y="82" text-anchor="middle" class="l">DCGM + vLLM&lt;/text>&lt;text x="90" y="98" text-anchor="middle" class="n">scrape 15s&lt;/text>
&lt;rect x="190" y="45" width="140" height="60" class="b am"/>&lt;text x="260" y="65" text-anchor="middle" class="h">Alertmanager&lt;/text>&lt;text x="260" y="82" text-anchor="middle" class="l">PrometheusRule&lt;/text>&lt;text x="260" y="98" text-anchor="middle" class="n">webhook → kafka&lt;/text>
&lt;rect x="360" y="45" width="160" height="60" class="b k"/>&lt;text x="440" y="65" text-anchor="middle" class="h">Kafka&lt;/text>&lt;text x="440" y="82" text-anchor="middle" class="l">gpu.alerts.enriched&lt;/text>&lt;text x="440" y="98" text-anchor="middle" class="n">incidents.lifecycle&lt;/text>
&lt;rect x="550" y="45" width="140" height="60" class="b kp"/>&lt;text x="620" y="65" text-anchor="middle" class="h">Keep&lt;/text>&lt;text x="620" y="82" text-anchor="middle" class="l">workflows YAML&lt;/text>&lt;text x="620" y="98" text-anchor="middle" class="n">git-versioned&lt;/text>
&lt;rect x="720" y="45" width="100" height="60" class="b ex"/>&lt;text x="770" y="65" text-anchor="middle" class="h">Executors&lt;/text>&lt;text x="770" y="82" text-anchor="middle" class="l">kubectl · API&lt;/text>&lt;text x="770" y="98" text-anchor="middle" class="n">ChatOps&lt;/text>
&lt;path class="arr" d="M160,75 L190,75"/>
&lt;path class="arr" d="M330,75 L360,75"/>
&lt;path class="arr" d="M520,75 L550,75"/>
&lt;path class="arr" d="M690,75 L720,75"/>
&lt;rect x="360" y="160" width="160" height="60" class="b au"/>&lt;text x="440" y="180" text-anchor="middle" class="h">audit.actions&lt;/text>&lt;text x="440" y="197" text-anchor="middle" class="l">WORM topic&lt;/text>&lt;text x="440" y="213" text-anchor="middle" class="n">retention 6 months+&lt;/text>
&lt;path class="dbl" d="M620,105 L520,160"/>
&lt;path class="dbl" d="M770,105 L520,168"/>
&lt;text x="420" y="252" text-anchor="middle" class="n">Every action by Keep or a human is published to audit.actions: WORM required by ENS op.exp.8 + EU AI Act art. 12.&lt;/text>
&lt;rect x="20" y="240" width="220" height="60" class="b kp"/>&lt;text x="130" y="260" text-anchor="middle" class="h">Compliance consumers&lt;/text>&lt;text x="130" y="277" text-anchor="middle" class="l">DPO · ENS audit · NIS2 reporting&lt;/text>&lt;text x="130" y="293" text-anchor="middle" class="n">consume audit.actions read-only&lt;/text>
&lt;path class="arr" d="M360,180 L240,260"/>
&lt;rect x="600" y="240" width="220" height="60" class="b ex"/>&lt;text x="710" y="260" text-anchor="middle" class="h">Postmortem tooling&lt;/text>&lt;text x="710" y="277" text-anchor="middle" class="l">Jira · MLflow · Langfuse&lt;/text>&lt;text x="710" y="293" text-anchor="middle" class="n">enriched with the timeline&lt;/text>
&lt;path class="arr" d="M520,180 L600,260"/>
&lt;/svg>
&lt;/div>
&lt;p>&lt;strong>Prometheus + DCGM.&lt;/strong> Collects the metrics described in the two previous posts. PrometheusRules define the six critical alerts, each with a &lt;code>for: &amp;lt;duration&amp;gt;&lt;/code> clause to suppress noise.&lt;/p>
&lt;p>&lt;strong>Alertmanager.&lt;/strong> Receives raw alerts, deduplicates them, groups them by labels (&lt;code>{cluster, node, gpu, model}&lt;/code>) and routes them. Instead of sending directly to PagerDuty or Slack, it &lt;strong>sends to Kafka&lt;/strong> through a webhook receiver (Alertmanager only POSTs the payload over HTTP, so a small bridge service has to produce it to the topic). That turns the alert into a bus event that multiple consumers process: Keep for action, the audit topic for compliance, dashboards for visualization.&lt;/p>
&lt;p>&lt;strong>Kafka as event bus.&lt;/strong> Three canonical topics:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>&lt;code>gpu.alerts.enriched&lt;/code>&lt;/strong>: alerts with added context (tenant, model, version, namespace owner, effective severity). Retention: 7 days, replication factor 3.&lt;/li>
&lt;li>&lt;strong>&lt;code>incidents.lifecycle&lt;/code>&lt;/strong>: incident lifecycle events: &lt;code>incident.opened&lt;/code>, &lt;code>incident.acknowledged&lt;/code>, &lt;code>action.proposed&lt;/code>, &lt;code>action.executed&lt;/code>, &lt;code>incident.escalated&lt;/code>, &lt;code>incident.resolved&lt;/code>, &lt;code>postmortem.attached&lt;/code>. Retention: 90 days.&lt;/li>
&lt;li>&lt;strong>&lt;code>audit.actions&lt;/code>&lt;/strong>: immutable record of every action executed (by Keep automatically or by a human confirming). Retention: &lt;strong>6 months minimum with compaction off + tiered storage&lt;/strong>, on WORM storage. This is the topic that ENS &lt;code>op.exp.8&lt;/code>, EU AI Act art. 12 and NIS2 require you to retain.&lt;/li>
&lt;/ul>
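&lt;p>As a rough illustration of what one &lt;code>audit.actions&lt;/code> record could look like, the sketch below builds an event carrying the four fields the post names (timestamp, actor, decision, evidence) plus an &lt;code>incident_id&lt;/code>. The class name, the exact field names and the JSON encoding are assumptions made for the example, not a schema defined by Keep or Kafka; producing the bytes to the topic is left out so the snippet stays self-contained.&lt;/p>

```python
import json
from dataclasses import dataclass, asdict, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class AuditAction:
    """One record destined for audit.actions (hypothetical field names)."""
    incident_id: str
    actor: str       # "keep" or a human operator id
    decision: str    # e.g. "cordon", "drain", "gpu-reset"
    evidence: dict   # captured command output, or references to it
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


def serialize(action: AuditAction) -> bytes:
    """Encode with sorted keys so the same action always yields the same bytes."""
    return json.dumps(asdict(action), sort_keys=True, ensure_ascii=False).encode("utf-8")


# Example record; the incident id and evidence values are made up.
event = AuditAction(
    incident_id="INC-0001",
    actor="keep",
    decision="cordon",
    evidence={"cmd": "nvidia-smi -q -d ECC", "exit_code": 0},
    timestamp="2026-06-02T04:30:00+00:00",
)
payload = serialize(event)
```

&lt;p>Deterministic serialization is a design choice here: it makes it easy to hash or compare records later, which fits a topic that is only ever appended to and read.&lt;/p>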
&lt;p>&lt;strong>Keep as workflow engine.&lt;/strong> Consumes from &lt;code>gpu.alerts.enriched&lt;/code>, triggers YAML workflows versioned in git, executes actions (HTTP calls, kubectl jobs, Slack messages, Jira tickets) and publishes the result to &lt;code>incidents.lifecycle&lt;/code> + &lt;code>audit.actions&lt;/code>. Choosing Keep over Alertmanager alone (or PagerDuty alone) is deliberate: Keep separates the &lt;strong>runbook declaration&lt;/strong> (readable, reviewable YAML) from &lt;strong>notification delivery&lt;/strong> (PagerDuty). The runbook is versioned code; notifications are operational details.&lt;/p>
&lt;p>&lt;strong>Executors.&lt;/strong> What actually moves the cluster:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Kubernetes jobs&lt;/strong>: &lt;code>kubectl drain&lt;/code>, &lt;code>kubectl cordon&lt;/code>, &lt;code>kubectl rollout undo&lt;/code>.&lt;/li>
&lt;li>&lt;strong>NVIDIA CLI tools&lt;/strong>: &lt;code>nvidia-smi --gpu-reset&lt;/code>, &lt;code>dcgmi diag -r &amp;lt;level&amp;gt;&lt;/code>.&lt;/li>
&lt;li>&lt;strong>ChatOps&lt;/strong>: human confirmations through Slack interactive messages before executing a destructive action.&lt;/li>
&lt;li>&lt;strong>External tooling&lt;/strong>: Jira ticket, PagerDuty notification, CMDB call.&lt;/li>
&lt;/ul>
&lt;h2 id="las-seis-alertas-críticas-y-sus-runbooks">The six critical alerts and their runbooks&lt;/h2>
&lt;p>For each alert: severity, immediate mitigation (seconds), evidence to capture &lt;strong>before remediating&lt;/strong>, resolution action, closure criteria, postmortem trigger.&lt;/p>
&lt;h3 id="rb-01--gpuhbmnearoom--hbm--92--sostenido">RB-01 · &lt;code>GpuHbmNearOom&lt;/code>: HBM &amp;gt; 92 % sustained&lt;/h3>
&lt;p>&lt;strong>Severity&lt;/strong>: WARNING. Risk of OOM on the next PagedAttention allocation.&lt;/p>
&lt;p>&lt;strong>Immediate mitigation.&lt;/strong> Temporarily reduce admission by lowering &lt;code>max_num_seqs&lt;/code> on the affected engine via hot reload (if the engine supports it) or a staggered restart of replicas. Trigger additional scale-out via KEDA if free GPU nodes are available. Draining the node is not necessary.&lt;/p>
&lt;p>&lt;strong>Evidence to capture.&lt;/strong>&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-bash" data-lang="bash">&lt;span class="line">&lt;span class="cl">nvidia-smi --query-gpu&lt;span class="o">=&lt;/span>index,memory.used,memory.free,memory.total --format&lt;span class="o">=&lt;/span>csv
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">nvidia-smi -q -d ROW_REMAPPER &lt;span class="p">|&lt;/span> grep -i pending
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">curl http://vllm-pod:8000/metrics &lt;span class="p">|&lt;/span> grep -E &lt;span class="s2">&amp;#34;gpu_cache_usage|num_requests&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">kubectl logs &amp;lt;pod&amp;gt; --tail&lt;span class="o">=&lt;/span>&lt;span class="m">200&lt;/span> &lt;span class="p">|&lt;/span> grep -i &lt;span class="s2">&amp;#34;preempt\|swap&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Save the snapshot to &lt;code>audit.actions&lt;/code> with a timestamp and the &lt;code>incident_id&lt;/code>.&lt;/p>
&lt;p>&lt;strong>Resolution.&lt;/strong> If the cause is a traffic spike: let the autoscaler scale to steady state and monitor for 30 min. If the cause is a model regression (canary v2 consumes more KV cache than v1): roll back the canary (see RB-06). If it is a leak (the metric grows while traffic does not): capture a heap dump, then restart the pod.&lt;/p>
&lt;p>&lt;strong>Closure.&lt;/strong> &lt;code>gpu_cache_usage_perc &amp;lt; 80 %&lt;/code> sustained for 15 min AND &lt;code>num_requests_waiting == 0&lt;/code>.&lt;/p>
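&lt;p>The closure rule can be checked mechanically rather than by eye. The following is a minimal sketch, assuming the samples have already been pulled from Prometheus as time-ordered tuples; the function name, the tuple layout and the choice to read the queue depth from the latest sample only are assumptions for illustration.&lt;/p>

```python
from typing import Sequence, Tuple

# (unix_ts, gpu_cache_usage_perc as a 0..1 fraction, num_requests_waiting)
Sample = Tuple[float, float, int]


def rb01_can_close(samples: Sequence[Sample], window_s: float = 900.0,
                   usage_limit: float = 0.80) -> bool:
    """True when usage stayed under the limit for the whole window and the queue is empty.

    Assumes samples are sorted by timestamp, oldest first.
    """
    if not samples:
        return False
    newest = samples[-1][0]
    oldest = samples[0][0]
    if window_s > newest - oldest:
        return False  # not enough history to claim "sustained"
    recent = [s for s in samples if s[0] >= newest - window_s]
    usage_ok = all(usage_limit > usage for _, usage, _ in recent)
    queue_empty = samples[-1][2] == 0
    return usage_ok and queue_empty
```

&lt;p>Refusing to close when the history is shorter than the window avoids declaring victory right after a restart, when only a few healthy samples exist.&lt;/p>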
&lt;p>&lt;strong>Postmortem.&lt;/strong> Not mandatory unless the incident lasted &amp;gt; 30 min or affected the SLO.&lt;/p>
&lt;h3 id="rb-02--gputhermalorpowerthrottle--bit--0-ni-idle-en-clock_throttle_reasons">RB-02 · &lt;code>GpuThermalOrPowerThrottle&lt;/code>: bitmask neither 0 nor Idle-only in CLOCK_THROTTLE_REASONS&lt;/h3>
&lt;p>&lt;strong>Severity&lt;/strong>: WARNING (thermal) or CRITICAL (sustained HW Power Brake, PDU risk).&lt;/p>
&lt;p>&lt;strong>Immediate mitigation.&lt;/strong> Identify the bit (decode the bitmask). If it is &lt;strong>&lt;code>0x40 HW_THERMAL&lt;/code>&lt;/strong> or &lt;strong>&lt;code>0x20 SW_THERMAL&lt;/code>&lt;/strong>: prevent new pods on that node (&lt;code>kubectl cordon&lt;/code>) and, if the temperature does not drop within 2 min, drain its workload to other replicas. If it is &lt;strong>&lt;code>0x80 HW_POWER_BRAKE&lt;/code>&lt;/strong>: alert DC infrastructure immediately (probably an overcommitted PDU; see Dell KB 000220508 / Lenovo HT514380) and lower the power limit of the rack's GPUs with &lt;code>nvidia-smi -pl&lt;/code> to take load off the breaker.&lt;/p>
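&lt;p>Decoding the bitmask by hand at 4 a.m. is error-prone, so a small helper is worth having. The sketch below maps each bit to a name following NVML's &lt;code>nvmlClocksThrottleReasons&lt;/code> constants and derives the RB-02 severity from it. The function names, the shortened reason labels and the &lt;code>UNCLASSIFIED&lt;/code> bucket for bits the runbook does not discuss are assumptions, and the check is instantaneous: the "sustained" condition for Power Brake still has to come from the alert's &lt;code>for:&lt;/code> clause.&lt;/p>

```python
# Bit position -> short label, following NVML's nvmlClocksThrottleReasons values.
THROTTLE_BITS = {
    0: "GPU_IDLE",               # 0x1
    1: "APPLICATIONS_CLOCKS",    # 0x2
    2: "SW_POWER_CAP",           # 0x4
    3: "HW_SLOWDOWN",            # 0x8
    4: "SYNC_BOOST",             # 0x10
    5: "SW_THERMAL",             # 0x20
    6: "HW_THERMAL",             # 0x40
    7: "HW_POWER_BRAKE",         # 0x80
    8: "DISPLAY_CLOCK_SETTING",  # 0x100
}


def decode_throttle_reasons(mask: int) -> list:
    """Return the labels of every bit set in the mask, lowest bit first."""
    return [name for bit, name in sorted(THROTTLE_BITS.items())
            if (mask >> bit) % 2 == 1]


def rb02_severity(mask: int) -> str:
    """Map a throttle bitmask to the RB-02 severity (Idle alone does not alert)."""
    reasons = set(decode_throttle_reasons(mask)) - {"GPU_IDLE"}
    if not reasons:
        return "NONE"
    if "HW_POWER_BRAKE" in reasons:
        return "CRITICAL"
    if "HW_THERMAL" in reasons or "SW_THERMAL" in reasons:
        return "WARNING"
    return "UNCLASSIFIED"  # assumption: bits RB-02 does not cover go to a human
```

&lt;p>For example, &lt;code>decode_throttle_reasons(0x60)&lt;/code> reports both thermal bits, while &lt;code>rb02_severity(0x80)&lt;/code> escalates straight to CRITICAL.&lt;/p>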
&lt;p>&lt;strong>Evidence.&lt;/strong>&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-bash" data-lang="bash">&lt;span class="line">&lt;span class="cl">nvidia-smi --query-gpu&lt;span class="o">=&lt;/span>index,temperature.gpu,temperature.memory,power.draw,clocks_throttle_reasons.active --format&lt;span class="o">=&lt;/span>csv
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">ipmitool sdr &lt;span class="p">|&lt;/span> grep -i &lt;span class="s2">&amp;#34;fan\|temp\|inlet&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># PDU data if instrumented (Modbus / SNMP)&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>&lt;strong>Resolution.&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Thermal&lt;/strong>: check rack airflow, verify the rear-door heat exchanger, inlet temperature and DGX fans. This is an infrastructure issue, not an engine issue.&lt;/li>
&lt;li>&lt;strong>Power Brake&lt;/strong>: review the sizing of the branch PDU, the breaker and the 415 VAC distribution. The likely fix is redistributing load to another branch or temporarily limiting TDP.&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Closure.&lt;/strong> &lt;code>CLOCK_THROTTLE_REASONS == 0x1&lt;/code> (Idle only) or &lt;code>0x0&lt;/code> for 30 min under normal load.&lt;/p>
&lt;p>&lt;strong>Postmortem.&lt;/strong> Mandatory if it was HW Power Brake, since that involves the DC's electrical infrastructure.&lt;/p>
&lt;h3 id="rb-03--gpuxiderrordetected--increasedcgm_fi_dev_xid_errors5m--0">RB-03 · &lt;code>GpuXidErrorDetected&lt;/code> — &lt;code>increase(DCGM_FI_DEV_XID_ERRORS[5m]) &amp;gt; 0&lt;/code>&lt;/h3>
&lt;p>&lt;strong>Severity&lt;/strong>: CRITICAL.&lt;/p>
&lt;p>&lt;strong>Immediate mitigation.&lt;/strong> &lt;code>kubectl cordon&lt;/code> the node (no new pods). If the XID is 31/48/79/94/95 (hardware or cascade): drain the existing pods from the node. If the XID is 13/43 (possibly software): keep the pods but block new ones, and capture the trace and the active workload.&lt;/p>
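&lt;p>The triage rule above is a lookup, which makes it easy to encode and review. A minimal sketch, assuming action labels such as &lt;code>cordon&lt;/code> and &lt;code>drain&lt;/code> that a workflow would later translate into real commands; the function name, the labels and the &lt;code>manual-review&lt;/code> fallback for XIDs outside the two listed groups are assumptions, not part of the runbook.&lt;/p>

```python
# XID groups exactly as listed in RB-03.
DRAIN_XIDS = frozenset({31, 48, 79, 94, 95})  # hardware or cascade
HOLD_XIDS = frozenset({13, 43})               # possibly software


def rb03_immediate_actions(xid: int) -> list:
    """Return the ordered immediate actions for an XID; the node is always cordoned."""
    actions = ["cordon"]
    if xid in DRAIN_XIDS:
        actions.append("drain")
    elif xid in HOLD_XIDS:
        actions.append("capture-trace")   # keep running pods, collect evidence
    else:
        actions.append("manual-review")   # assumption: unknown XIDs go to a human
    return actions
```

&lt;p>Keeping the groups as named constants means a postmortem that reclassifies an XID becomes a one-line, reviewable change.&lt;/p>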
&lt;p>&lt;strong>Evidence.&lt;/strong>&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-bash" data-lang="bash">&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># The specific XID from dmesg&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">dmesg &lt;span class="p">|&lt;/span> grep -i xid &lt;span class="p">|&lt;/span> tail -30
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">nvidia-smi -q -d ECC
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">nvidia-smi --query-gpu&lt;span class="o">=&lt;/span>index,pcie.link.gen.current,pcie.link.width.current --format&lt;span class="o">=&lt;/span>csv
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># Retired-page / row-remapping state&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">nvidia-smi -q -d ROW_REMAPPER
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># Workload that was running&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">kubectl get pods -o wide &lt;span class="p">|&lt;/span> grep &amp;lt;node&amp;gt;
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">kubectl logs &amp;lt;pod&amp;gt; --previous --tail&lt;span class="o">=&lt;/span>&lt;span class="m">500&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>&lt;strong>Resolution.&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>XID 13/43&lt;/strong> (software exception / channel verif): if it recurs only with one specific model, it is a workload bug; open an issue with the models team. If it is transient, restarting the pod is enough.&lt;/li>
&lt;li>&lt;strong>XID 31&lt;/strong> (MMU fault): often a cascade from an earlier XID 48. Reset the GPU (&lt;code>nvidia-smi --gpu-reset -i &amp;lt;index&amp;gt;&lt;/code>), or reboot the node if the reset does not resolve it.&lt;/li>
&lt;li>&lt;strong>XID 48 / 95&lt;/strong> (DBE / uncontained ECC): see RB-04. The node goes into quarantine.&lt;/li>
&lt;li>&lt;strong>XID 79&lt;/strong> (fallen off the bus): reboot the node. If it recurs after the reboot, open an RMA for the GPU. ByteDance reports 43 % co-occurrence with PCIe errors, so also check the slot and the cable.&lt;/li>
&lt;li>&lt;strong>XID 94 / 145 / 149&lt;/strong>: listed in NVIDIA's Xid Catalog with a specific procedure.&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Closure.&lt;/strong> Node smoke test passed (&lt;code>dcgmi diag -r 3&lt;/code>), 24 h without new XIDs, node returned to the pool.&lt;/p>
&lt;p>&lt;strong>Postmortem.&lt;/strong> &lt;strong>Mandatory&lt;/strong>. Include the specific XID, the distribution of XIDs across the cluster and the updated MTBE.&lt;/p>
&lt;h3 id="rb-04--gpueccdoublebit--dcgm_fi_dev_ecc_dbe_vol_total--0">RB-04 · &lt;code>GpuEccDoubleBit&lt;/code> — &lt;code>DCGM_FI_DEV_ECC_DBE_VOL_TOTAL &amp;gt; 0&lt;/code>&lt;/h3>
&lt;p>&lt;strong>Severity&lt;/strong>: CRITICAL (data corruption in progress).&lt;/p>
&lt;p>&lt;strong>Immediate mitigation.&lt;/strong> &lt;strong>Drain the node immediately, without waiting for additional evidence&lt;/strong>. Page the primary on-call (PagerDuty / OpsGenie). Mark the node &lt;code>unschedulable&lt;/code> and &lt;code>failed&lt;/code>. XID 48 has a &lt;strong>100 % probability of killing the running job&lt;/strong> according to the &lt;em>Story of Two GPUs&lt;/em> dataset; any in-flight inference is already compromised.&lt;/p>
&lt;p>&lt;strong>Evidence (in parallel with the mitigation).&lt;/strong>&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-bash" data-lang="bash">&lt;span class="line">&lt;span class="cl">nvidia-smi -q -d ECC
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">nvidia-smi -q -d ROW_REMAPPER &lt;span class="c1"># Pending: Yes expected&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">dmesg &lt;span class="p">|&lt;/span> grep -E &lt;span class="s2">&amp;#34;Xid.*48|DBE|double-bit&amp;#34;&lt;/span> &lt;span class="p">|&lt;/span> tail -50
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># Full capture of the GPU state&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">dcgmi diag -r &lt;span class="m">4&lt;/span> -i &amp;lt;gpu_index&amp;gt;
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>&lt;strong>Resolution.&lt;/strong> Full GPU reset (&lt;code>nvidia-smi --gpu-reset&lt;/code>), or a node reboot if the reset does not complete. The reset applies the pending row remap. After the reset or reboot:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-bash" data-lang="bash">&lt;span class="line">&lt;span class="cl">nvidia-smi -q -d ROW_REMAPPER &lt;span class="c1"># Pending: No expected&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">nvidia-smi -q -d ECC &lt;span class="c1"># volatile counters at 0&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>If &lt;code>RETIRED_DBE &amp;gt; 8&lt;/code> pages after the remap: plan a &lt;strong>GPU replacement&lt;/strong> in the next window, because silicon degradation is progressive. The published real-world case documents a typical downtime of &lt;em>about 19 hours&lt;/em>.&lt;/p>
&lt;p>&lt;strong>Closure.&lt;/strong> Node back in the pool after 48 h without new DBEs.&lt;/p>
&lt;p>&lt;strong>Postmortem.&lt;/strong> &lt;strong>Mandatory&lt;/strong>. If the incident affected a request carrying personal or classified data, have the DPO assess whether it is a breach notifiable under GDPR art. 33 (it is not necessarily a breach, but it must be assessed).&lt;/p>
&lt;h3 id="rb-05--vllmkvcachepoolnearfull--gpu_cache_usage_perc--95--sostenido-3-min">RB-05 · &lt;code>VllmKvCachePoolNearFull&lt;/code>: &lt;code>gpu_cache_usage_perc &amp;gt; 95 %&lt;/code> sustained for 3 min&lt;/h3>
&lt;p>&lt;strong>Severity&lt;/strong>: WARNING (risk of preempt-on-OOM, not of a real OOM).&lt;/p>
&lt;p>&lt;strong>Immediate mitigation.&lt;/strong> Trigger autoscaler scale-out by temporarily lowering the KEDA threshold (from 0.85 to 0.75) for 30 min. In &lt;code>recompute&lt;/code> mode, preemptions raise TTFT but do not break requests, which is acceptable in the short term. In &lt;code>swap&lt;/code> mode, latency goes through the roof, so it is preferable to shed new traffic (return 503 from the &lt;a href="https://blog.lo0.es/posts/router-inferencia-llm-gateway-l7/">router&lt;/a>) for 5 min.&lt;/p>
&lt;p>&lt;strong>Evidence.&lt;/strong>&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-bash" data-lang="bash">&lt;span class="line">&lt;span class="cl">curl http://vllm-pod:8000/metrics &lt;span class="p">|&lt;/span> grep -E &lt;span class="s2">&amp;#34;gpu_cache|num_requests|num_preemptions&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">kubectl get hpa vllm-llama70b
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">kubectl logs &amp;lt;pod&amp;gt; --tail&lt;span class="o">=&lt;/span>&lt;span class="m">200&lt;/span> &lt;span class="p">|&lt;/span> grep -i preempt
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>&lt;strong>Resolution.&lt;/strong> If it recurs regularly: revisit capacity planning, possibly lowering &lt;code>max_num_seqs&lt;/code> or raising the stable replica count. See &lt;a href="https://blog.lo0.es/posts/capacity-planning-inferencia-llm-on-premise/">Capacity planning&lt;/a>.&lt;/p>
&lt;p>&lt;strong>Closure.&lt;/strong> Pool &amp;lt; 85 % sustained for 30 min, with no preemptions in the last 15 min.&lt;/p>
&lt;p>&lt;strong>Postmortem.&lt;/strong> Not mandatory unless it recurs &amp;gt; 3 times per week.&lt;/p>
&lt;h3 id="rb-06--vllmttftp95outofslo--ttft-p95--15-s-durante-5-min">RB-06 · &lt;code>VllmTtftP95OutOfSlo&lt;/code>: TTFT P95 &amp;gt; 1.5 s for 5 min&lt;/h3>
&lt;p>&lt;strong>Severity&lt;/strong>: CRITICAL (contractual SLO violation).&lt;/p>
&lt;p>&lt;strong>Immediate mitigation.&lt;/strong> Quick diagnosis of the regime (in order of likelihood):&lt;/p>
&lt;ol>
&lt;li>If a canary v2 is active and the ratio &lt;code>ttft_p95(v2)/ttft_p95(v1) &amp;gt; 1.30&lt;/code>: &lt;strong>automatic rollback&lt;/strong> of the canary via Argo Rollouts (&lt;code>kubectl argo rollouts abort vllm-llama70b&lt;/code>).&lt;/li>
&lt;li>If &lt;code>num_requests_waiting &amp;gt; 5&lt;/code>: scale out via KEDA.&lt;/li>
&lt;li>If &lt;code>DRAM_ACTIVE &amp;gt; 90 %&lt;/code> + &lt;code>gpu_cache_usage_perc &amp;gt; 90 %&lt;/code>: HBM bottleneck; the levers are quantization or a shorter context.&lt;/li>
&lt;li>If &lt;code>CLOCK_THROTTLE_REASONS != 0&lt;/code>: see RB-02.&lt;/li>
&lt;/ol>
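&lt;p>The ordered checks above amount to a first-match decision list. A minimal sketch of that logic follows, assuming the inputs have already been read from Prometheus and expressed as fractions between 0 and 1; the dataclass, the function name and the returned labels (including the &lt;code>undetermined&lt;/code> fallback) are assumptions for illustration.&lt;/p>

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TtftSnapshot:
    """Inputs for the RB-06 quick diagnosis (hypothetical structure)."""
    canary_active: bool
    ttft_p95_v1: float             # seconds
    ttft_p95_v2: Optional[float]   # seconds, None when no canary is running
    num_requests_waiting: int
    dram_active: float             # 0..1 fraction
    gpu_cache_usage: float         # 0..1 fraction
    throttle_mask: int


def rb06_first_lever(s: TtftSnapshot) -> str:
    """Return the first matching lever, checking in the runbook's order."""
    if s.canary_active and s.ttft_p95_v2 is not None and s.ttft_p95_v1 > 0:
        if s.ttft_p95_v2 / s.ttft_p95_v1 > 1.30:
            return "rollback-canary"
    if s.num_requests_waiting > 5:
        return "scale-out"
    if s.dram_active > 0.90 and s.gpu_cache_usage > 0.90:
        return "hbm-bottleneck"
    if s.throttle_mask != 0:
        return "see-RB-02"
    return "undetermined"  # assumption: nothing matched, escalate to a human
```

&lt;p>Because the order encodes likelihood, a canary regression wins even when the queue is also backed up, which matches the intent of rolling back before scaling out.&lt;/p>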
&lt;p>&lt;strong>Evidence.&lt;/strong>&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-bash" data-lang="bash">&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># Histogram snapshot&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">curl http://vllm-pod:8000/metrics &lt;span class="p">|&lt;/span> grep time_to_first_token
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># Per-version distribution if a canary is running&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># DCGM state at that moment&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">curl http://dcgm-exporter:9400/metrics &lt;span class="p">|&lt;/span> grep -E &lt;span class="s2">&amp;#34;PIPE_TENSOR|DRAM_ACTIVE|THROTTLE&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># Resource usage of the active pods&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">kubectl top pods -n inference
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>&lt;strong>Resolution.&lt;/strong> Depends on the diagnosis. Typical cases:&lt;/p>
&lt;ul>
&lt;li>Canary regression → full rollback (see &lt;a href="https://blog.lo0.es/posts/canary-blue-green-shadow-modelos-llm/">Canary&lt;/a>).&lt;/li>
&lt;li>Capacity saturation → scale replicas or accept temporary 503s with &lt;code>Retry-After&lt;/code>.&lt;/li>
&lt;li>Prefill bound → enable or tune chunked prefill or disaggregated serving (see &lt;a href="https://blog.lo0.es/posts/disaggregated-serving-prefill-decode/">Disaggregated serving&lt;/a>).&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Closure.&lt;/strong> TTFT P95 stays within SLO for a sustained 30 min.&lt;/p>
&lt;p>&lt;strong>Postmortem.&lt;/strong> &lt;strong>Mandatory&lt;/strong>. Document the root cause and the lever applied; update the runbook.&lt;/p>
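&lt;p>The 30-minute closure criterion for TTFT P95 can be checked mechanically before an incident is resolved. The sketch below assumes the P95 samples have already been pulled from Prometheus as &lt;code>(unix_ts, seconds)&lt;/code> pairs sorted oldest first. The function &lt;code>can_close&lt;/code>, its parameters and the SLO value passed in are illustrative. The 1800 s default mirrors the 30 min window. It also assumes samples arrive without long scrape gaps.&lt;/p>

```python
def can_close(samples, slo_seconds, window_seconds=1800):
    """Return True when the trailing run of in-SLO samples spans the full window.

    samples: list of (unix_ts, ttft_p95_seconds) tuples, sorted oldest first.
    """
    if not samples:
        return False
    newest_ts = samples[-1][0]
    run_start = None
    # Walk backwards from the newest sample until the first SLO violation.
    for ts, p95 in reversed(samples):
        if p95 > slo_seconds:
            break
        run_start = ts
    # The newest sample itself may violate the SLO, leaving no run at all.
    return run_start is not None and newest_ts - run_start >= window_seconds
```

&lt;p>A single out-of-SLO sample resets the run, so the incident stays open until a fresh uninterrupted window has elapsed.&lt;/p>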
&lt;h2 id="workflows-keep-yaml--tres-ejemplos-completos">Keep YAML workflows: three complete examples&lt;/h2>
&lt;p>Runbooks are useful only if they are &lt;strong>codified&lt;/strong> in the workflow engine. Keep lets you declare them as YAML files versioned in git.&lt;/p>
&lt;h3 id="workflow-1--xid-detectedyaml">Workflow 1 — &lt;code>xid-detected.yaml&lt;/code>&lt;/h3>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-yaml" data-lang="yaml">&lt;span class="line">&lt;span class="cl">&lt;span class="nt">workflow&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">id&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">xid-detected-drain&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">name&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s2">&amp;#34;XID error detected — cordon node and capture evidence&amp;#34;&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">description&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s2">&amp;#34;RB-03 implementation&amp;#34;&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">triggers&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>- &lt;span class="nt">type&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">alert&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">filters&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>- &lt;span class="nt">key&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">alertname&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">value&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">GpuXidErrorDetected&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">steps&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>- &lt;span class="nt">name&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">capture-evidence&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">provider&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">type&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">bash&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">with&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">command&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="p">|&lt;/span>&lt;span class="sd">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="sd"> set -e
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="sd"> NODE=&amp;#34;{{ alert.labels.node }}&amp;#34;
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="sd"> GPU=&amp;#34;{{ alert.labels.gpu }}&amp;#34;
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="sd"> INC_ID=&amp;#34;{{ alert.fingerprint }}&amp;#34;
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="sd"> mkdir -p /var/evidence/$INC_ID
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="sd"> kubectl debug node/$NODE -it --image=nvcr.io/nvidia/cuda:12.4.0-base-ubuntu22.04 -- \
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="sd"> bash -c &amp;#34;mkdir -p /host/var/evidence/$INC_ID &amp;amp;&amp;amp; nvidia-smi -q -d ERROR,PCIE,ROW_REMAPPER &amp;gt; /host/var/evidence/$INC_ID/smi.txt&amp;#34;
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="sd"> kubectl describe node $NODE &amp;gt; /var/evidence/$INC_ID/node.txt&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>- &lt;span class="nt">name&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">cordon-node&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">provider&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">type&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">kubernetes&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">with&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">action&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">cordon&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">name&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s2">&amp;#34;{{ alert.labels.node }}&amp;#34;&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">if&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s2">&amp;#34;{{ alert.labels.severity == &amp;#39;critical&amp;#39; }}&amp;#34;&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">actions&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>- &lt;span class="nt">name&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">open-jira-ticket&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">provider&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">type&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">jira&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">config&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s2">&amp;#34;{{ providers.jira-prod }}&amp;#34;&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">with&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">project&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">GPUOPS&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">issuetype&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">Incident&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">summary&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s2">&amp;#34;RB-03: XID {{ alert.annotations.xid_code }} on {{ alert.labels.node }}/{{ alert.labels.gpu }}&amp;#34;&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">description&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="p">|&lt;/span>&lt;span class="sd">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="sd"> Severity: {{ alert.labels.severity }}
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="sd"> XID: {{ alert.annotations.xid_code }}
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="sd"> Evidence: /var/evidence/{{ alert.fingerprint }}
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="sd"> Runbook: https://runbooks.example.local/RB-03&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>- &lt;span class="nt">name&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">notify-slack&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">provider&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">type&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">slack&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">config&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s2">&amp;#34;{{ providers.slack-gpu-incidents }}&amp;#34;&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">with&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">message&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="p">|&lt;/span>&lt;span class="sd">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="sd"> :warning: *RB-03 triggered*
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="sd"> Node: `{{ alert.labels.node }}` GPU: `{{ alert.labels.gpu }}`
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="sd"> XID: `{{ alert.annotations.xid_code }}`
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="sd"> &amp;lt;{{ jira.url }}|Jira ticket&amp;gt;&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>- &lt;span class="nt">name&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">emit-audit&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">provider&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">type&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">kafka&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">config&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s2">&amp;#34;{{ providers.kafka-audit }}&amp;#34;&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">with&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">topic&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">audit.actions&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">message&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">incident_id&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s2">&amp;#34;{{ alert.fingerprint }}&amp;#34;&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">action&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s2">&amp;#34;cordon_node&amp;#34;&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">actor&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s2">&amp;#34;keep-workflow&amp;#34;&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">workflow_id&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s2">&amp;#34;xid-detected-drain&amp;#34;&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">target&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s2">&amp;#34;{{ alert.labels.node }}&amp;#34;&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">timestamp&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s2">&amp;#34;{{ now }}&amp;#34;&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h3 id="workflow-2--ecc-dbeyaml--paginación-inmediata">Workflow 2: &lt;code>ecc-dbe.yaml&lt;/code>, immediate paging&lt;/h3>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-yaml" data-lang="yaml">&lt;span class="line">&lt;span class="cl">&lt;span class="nt">workflow&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">id&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">ecc-dbe-critical&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">name&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s2">&amp;#34;ECC double-bit — page on-call and quarantine node&amp;#34;&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">triggers&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>- &lt;span class="nt">type&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">alert&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">filters&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>- &lt;span class="nt">key&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">alertname&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">value&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">GpuEccDoubleBit&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">steps&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>- &lt;span class="nt">name&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">cordon-immediately&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">provider&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">type&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">kubernetes&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">with&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">action&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">cordon&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">name&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s2">&amp;#34;{{ alert.labels.node }}&amp;#34;&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>- &lt;span class="nt">name&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">drain-workload&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">provider&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">type&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">kubernetes&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">with&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">action&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">drain&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">name&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s2">&amp;#34;{{ alert.labels.node }}&amp;#34;&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">options&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">ignore-daemonsets&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="kc">true&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">delete-emptydir-data&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="kc">true&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">grace-period&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="m">120&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>- &lt;span class="nt">name&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">page-oncall&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">provider&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">type&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">pagerduty&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">config&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s2">&amp;#34;{{ providers.pagerduty-critical }}&amp;#34;&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">with&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">service_key&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s2">&amp;#34;{{ env.PD_SERVICE_KEY }}&amp;#34;&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">severity&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">critical&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">summary&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s2">&amp;#34;RB-04 ECC DBE on {{ alert.labels.node }}/{{ alert.labels.gpu }} — node drained&amp;#34;&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>- &lt;span class="nt">name&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">emit-lifecycle&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">provider&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">type&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">kafka&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">config&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s2">&amp;#34;{{ providers.kafka-incidents }}&amp;#34;&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">with&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">topic&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">incidents.lifecycle&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">message&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">incident_id&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s2">&amp;#34;{{ alert.fingerprint }}&amp;#34;&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">event&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">incident.opened&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">severity&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">critical&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">runbook&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">RB-04&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">requires_postmortem&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="kc">true&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>- &lt;span class="nt">name&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">notify-dpo&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">provider&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">type&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">email&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">with&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">to&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">dpo@example.local&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">subject&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s2">&amp;#34;ECC DBE on production GPU: assessment required&amp;#34;&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">body&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="p">|&lt;/span>&lt;span class="sd">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="sd"> RB-04 incident: ECC DBE detected on {{ alert.labels.node }}.
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="sd"> Affected model: {{ alert.labels.model }}.
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="sd"> Please assess whether personal or classified data was processed
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="sd"> during the error window and whether a GDPR art. 33 notification is required.&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h3 id="workflow-3--canary-rollbackyaml--ttft-p95-fuera-de-slo">Workflow 3: &lt;code>canary-rollback.yaml&lt;/code>, TTFT P95 out of SLO&lt;/h3>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-yaml" data-lang="yaml">&lt;span class="line">&lt;span class="cl">&lt;span class="nt">workflow&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">id&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">canary-rollback-ttft&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">name&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s2">&amp;#34;Rollback canary when TTFT P95 ratio v2/v1 &amp;gt; 1.30&amp;#34;&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">triggers&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>- &lt;span class="nt">type&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">alert&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">filters&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>- &lt;span class="nt">key&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">alertname&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">value&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">VllmTtftP95OutOfSlo&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>- &lt;span class="nt">key&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">canary_active&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">value&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s2">&amp;#34;true&amp;#34;&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">steps&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>- &lt;span class="nt">name&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">check-ratio&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">provider&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">type&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">prometheus&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">config&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s2">&amp;#34;{{ providers.prom-prod }}&amp;#34;&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">with&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">query&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="p">|&lt;/span>&lt;span class="sd">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="sd"> histogram_quantile(0.95, sum by(le)(rate(vllm:time_to_first_token_seconds_bucket{version=&amp;#34;v2&amp;#34;}[5m])))
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="sd"> /
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="sd"> histogram_quantile(0.95, sum by(le)(rate(vllm:time_to_first_token_seconds_bucket{version=&amp;#34;v1&amp;#34;}[5m])))&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">condition&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">result &amp;gt; 1.30&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">actions&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>- &lt;span class="nt">name&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">argo-rollback&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">provider&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">type&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">kubernetes&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">with&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">action&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">exec&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">command&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>- &lt;span class="l">kubectl&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>- &lt;span class="l">argo&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>- &lt;span class="l">rollouts&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>- &lt;span class="l">abort&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>- &lt;span class="s2">&amp;#34;{{ alert.labels.rollout }}&amp;#34;&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>- -&lt;span class="kc">n&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>- &lt;span class="s2">&amp;#34;{{ alert.labels.namespace }}&amp;#34;&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>- &lt;span class="nt">name&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">notify-and-audit&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">provider&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">type&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">kafka&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">config&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s2">&amp;#34;{{ providers.kafka-audit }}&amp;#34;&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">with&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">topic&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">audit.actions&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">message&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">incident_id&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s2">&amp;#34;{{ alert.fingerprint }}&amp;#34;&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">action&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">canary_rollback&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">ratio&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s2">&amp;#34;{{ steps.check-ratio.result }}&amp;#34;&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">actor&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">keep-workflow&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">timestamp&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s2">&amp;#34;{{ now }}&amp;#34;&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Cada workflow se guarda en &lt;code>repos/keep-workflows/&lt;/code> versionado en git, revisado por pull request, validado por CI (&lt;code>keep workflow validate&lt;/code>). El runbook escrito vive como &lt;code>docs/runbooks/RB-XX.md&lt;/code> enlazado desde el workflow YAML — los dos siempre evolucionan juntos.&lt;/p>
&lt;h2 id="el-schema-canónico-de-eventos-kafka">The canonical Kafka event schema&lt;/h2>
&lt;p>So that compliance, postmortem tooling and dashboards can consume the topics without each consumer guessing the shape, the schema is fixed with Avro or Protobuf.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-json" data-lang="json">&lt;span class="line">&lt;span class="cl">&lt;span class="p">{&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;name&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;IncidentLifecycleEvent&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;type&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;record&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;fields&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="p">[&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">{&lt;/span> &lt;span class="nt">&amp;#34;name&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;incident_id&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="nt">&amp;#34;type&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;string&amp;#34;&lt;/span> &lt;span class="p">},&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">{&lt;/span> &lt;span class="nt">&amp;#34;name&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;event&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="nt">&amp;#34;type&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="p">{&lt;/span> &lt;span class="nt">&amp;#34;type&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;enum&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="nt">&amp;#34;symbols&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="p">[&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="s2">&amp;#34;incident.opened&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="s2">&amp;#34;incident.acknowledged&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="s2">&amp;#34;action.proposed&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="s2">&amp;#34;action.executed&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="s2">&amp;#34;action.failed&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="s2">&amp;#34;incident.escalated&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="s2">&amp;#34;incident.resolved&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="s2">&amp;#34;postmortem.attached&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">]}},&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">{&lt;/span> &lt;span class="nt">&amp;#34;name&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;timestamp&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="nt">&amp;#34;type&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="p">{&lt;/span> &lt;span class="nt">&amp;#34;type&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;long&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="nt">&amp;#34;logicalType&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;timestamp-millis&amp;#34;&lt;/span> &lt;span class="p">}&lt;/span> &lt;span class="p">},&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">{&lt;/span> &lt;span class="nt">&amp;#34;name&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;actor&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="nt">&amp;#34;type&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;string&amp;#34;&lt;/span> &lt;span class="p">},&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">{&lt;/span> &lt;span class="nt">&amp;#34;name&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;severity&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="nt">&amp;#34;type&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="p">{&lt;/span> &lt;span class="nt">&amp;#34;type&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;enum&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="nt">&amp;#34;symbols&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="p">[&lt;/span>&lt;span class="s2">&amp;#34;low&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="s2">&amp;#34;warning&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="s2">&amp;#34;critical&amp;#34;&lt;/span>&lt;span class="p">]&lt;/span> &lt;span class="p">}&lt;/span> &lt;span class="p">},&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">{&lt;/span> &lt;span class="nt">&amp;#34;name&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;runbook&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="nt">&amp;#34;type&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="p">[&lt;/span>&lt;span class="s2">&amp;#34;null&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="s2">&amp;#34;string&amp;#34;&lt;/span>&lt;span class="p">],&lt;/span> &lt;span class="nt">&amp;#34;default&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="kc">null&lt;/span> &lt;span class="p">},&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">{&lt;/span> &lt;span class="nt">&amp;#34;name&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;alert_name&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="nt">&amp;#34;type&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;string&amp;#34;&lt;/span> &lt;span class="p">},&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">{&lt;/span> &lt;span class="nt">&amp;#34;name&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;labels&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="nt">&amp;#34;type&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="p">{&lt;/span> &lt;span class="nt">&amp;#34;type&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;map&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="nt">&amp;#34;values&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;string&amp;#34;&lt;/span> &lt;span class="p">}&lt;/span> &lt;span class="p">},&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">{&lt;/span> &lt;span class="nt">&amp;#34;name&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;annotations&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="nt">&amp;#34;type&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="p">{&lt;/span> &lt;span class="nt">&amp;#34;type&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;map&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="nt">&amp;#34;values&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;string&amp;#34;&lt;/span> &lt;span class="p">}&lt;/span> &lt;span class="p">},&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">{&lt;/span> &lt;span class="nt">&amp;#34;name&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;evidence_uri&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="nt">&amp;#34;type&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="p">[&lt;/span>&lt;span class="s2">&amp;#34;null&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="s2">&amp;#34;string&amp;#34;&lt;/span>&lt;span class="p">],&lt;/span> &lt;span class="nt">&amp;#34;default&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="kc">null&lt;/span> &lt;span class="p">},&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">{&lt;/span> &lt;span class="nt">&amp;#34;name&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;requires_postmortem&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="nt">&amp;#34;type&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;boolean&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="nt">&amp;#34;default&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="kc">false&lt;/span> &lt;span class="p">}&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">]&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="p">}&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>For &lt;code>audit.actions&lt;/code> (WORM), a separate, stricter schema with immutable fields:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-json" data-lang="json">&lt;span class="line">&lt;span class="cl">&lt;span class="p">{&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;name&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;AuditAction&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;type&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;record&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;fields&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="p">[&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">{&lt;/span> &lt;span class="nt">&amp;#34;name&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;incident_id&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="nt">&amp;#34;type&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;string&amp;#34;&lt;/span> &lt;span class="p">},&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">{&lt;/span> &lt;span class="nt">&amp;#34;name&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;action&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="nt">&amp;#34;type&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;string&amp;#34;&lt;/span> &lt;span class="p">},&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">{&lt;/span> &lt;span class="nt">&amp;#34;name&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;actor&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="nt">&amp;#34;type&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;string&amp;#34;&lt;/span> &lt;span class="p">},&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">{&lt;/span> &lt;span class="nt">&amp;#34;name&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;actor_type&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="nt">&amp;#34;type&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="p">{&lt;/span> &lt;span class="nt">&amp;#34;type&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;enum&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="nt">&amp;#34;symbols&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="p">[&lt;/span>&lt;span class="s2">&amp;#34;human&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="s2">&amp;#34;workflow&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="s2">&amp;#34;scheduler&amp;#34;&lt;/span>&lt;span class="p">]&lt;/span> &lt;span class="p">}&lt;/span> &lt;span class="p">},&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">{&lt;/span> &lt;span class="nt">&amp;#34;name&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;workflow_id&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="nt">&amp;#34;type&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="p">[&lt;/span>&lt;span class="s2">&amp;#34;null&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="s2">&amp;#34;string&amp;#34;&lt;/span>&lt;span class="p">],&lt;/span> &lt;span class="nt">&amp;#34;default&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="kc">null&lt;/span> &lt;span class="p">},&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">{&lt;/span> &lt;span class="nt">&amp;#34;name&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;target&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="nt">&amp;#34;type&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;string&amp;#34;&lt;/span> &lt;span class="p">},&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">{&lt;/span> &lt;span class="nt">&amp;#34;name&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;command&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="nt">&amp;#34;type&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="p">[&lt;/span>&lt;span class="s2">&amp;#34;null&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="s2">&amp;#34;string&amp;#34;&lt;/span>&lt;span class="p">],&lt;/span> &lt;span class="nt">&amp;#34;default&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="kc">null&lt;/span> &lt;span class="p">},&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">{&lt;/span> &lt;span class="nt">&amp;#34;name&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;result&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="nt">&amp;#34;type&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="p">{&lt;/span> &lt;span class="nt">&amp;#34;type&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;enum&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="nt">&amp;#34;symbols&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="p">[&lt;/span>&lt;span class="s2">&amp;#34;success&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="s2">&amp;#34;failure&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="s2">&amp;#34;partial&amp;#34;&lt;/span>&lt;span class="p">]&lt;/span> &lt;span class="p">}&lt;/span> &lt;span class="p">},&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">{&lt;/span> &lt;span class="nt">&amp;#34;name&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;timestamp&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="nt">&amp;#34;type&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="p">{&lt;/span> &lt;span class="nt">&amp;#34;type&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;long&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="nt">&amp;#34;logicalType&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;timestamp-millis&amp;#34;&lt;/span> &lt;span class="p">}&lt;/span> &lt;span class="p">},&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">{&lt;/span> &lt;span class="nt">&amp;#34;name&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;evidence_uri&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="nt">&amp;#34;type&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="p">[&lt;/span>&lt;span class="s2">&amp;#34;null&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="s2">&amp;#34;string&amp;#34;&lt;/span>&lt;span class="p">],&lt;/span> &lt;span class="nt">&amp;#34;default&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="kc">null&lt;/span> &lt;span class="p">},&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">{&lt;/span> &lt;span class="nt">&amp;#34;name&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;approver&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="nt">&amp;#34;type&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="p">[&lt;/span>&lt;span class="s2">&amp;#34;null&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="s2">&amp;#34;string&amp;#34;&lt;/span>&lt;span class="p">],&lt;/span> &lt;span class="nt">&amp;#34;default&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="kc">null&lt;/span> &lt;span class="p">}&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">]&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="p">}&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>The topic is configured with &lt;code>cleanup.policy=delete&lt;/code>, &lt;code>retention.ms=15552000000&lt;/code> (180 days, roughly 6 months) and &lt;code>min.insync.replicas=2&lt;/code>, and producers write with &lt;code>acks=all&lt;/code> to guarantee durability. For longer retention without paying for broker disks, use &lt;strong>tiered storage&lt;/strong> on Ceph RGW or another S3-compatible store: recent log segments stay in the hot tier and older ones move to a cold tier, transparently to the consumer.&lt;/p>
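&lt;p>The &lt;code>retention.ms&lt;/code> value is easy to get wrong by a factor of 1000, so it helps to derive it rather than type it. A minimal sketch, assuming a 180-day reading of the 6-month retention; the helper &lt;code>retention_ms&lt;/code> and the &lt;code>AUDIT_TOPIC_CONFIG&lt;/code> dict are hypothetical names that simply mirror the settings quoted above:&lt;/p>

```python
def retention_ms(days: int) -> int:
    """Convert a retention period in days to the milliseconds Kafka expects."""
    return days * 24 * 60 * 60 * 1000


# Hypothetical dict mirroring the topic-level settings for audit.actions.
# acks=all is a producer setting, so it is deliberately not listed here.
AUDIT_TOPIC_CONFIG = {
    "cleanup.policy": "delete",
    "retention.ms": str(retention_ms(180)),
    "min.insync.replicas": "2",
}

print(AUDIT_TOPIC_CONFIG["retention.ms"])  # 15552000000
```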
&lt;h2 id="encaje-formal-en-gestión-de-incidentes">Formal fit into incident management&lt;/h2>
&lt;p>Runbooks are not an isolated SRE practice: they fit into the regulatory and standards frameworks that production LLM platforms deal with daily.&lt;/p>
&lt;h3 id="isoiec-27035--gestión-de-incidentes-de-seguridad-de-la-información">ISO/IEC 27035: information security incident management&lt;/h3>
&lt;p>It defines the formal cycle in five phases: &lt;strong>plan &amp;amp; prepare&lt;/strong> → &lt;strong>detect &amp;amp; report&lt;/strong> → &lt;strong>assess &amp;amp; decide&lt;/strong> → &lt;strong>respond&lt;/strong> → &lt;strong>lessons learned&lt;/strong>. Each phase has outputs that must be documented. The mapping onto the stack:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Plan &amp;amp; prepare&lt;/strong>: runbooks RB-01 to RB-06 plus the Keep workflows are part of the &lt;em>Information Security Incident Management Plan&lt;/em>. Versioned in git, reviewed annually.&lt;/li>
&lt;li>&lt;strong>Detect &amp;amp; report&lt;/strong>: the Prometheus alerts that flow into Kafka are the concrete implementation of this phase.&lt;/li>
&lt;li>&lt;strong>Assess &amp;amp; decide&lt;/strong>: the severity in &lt;code>gpu.alerts.enriched&lt;/code> plus the logic of the Keep workflow.&lt;/li>
&lt;li>&lt;strong>Respond&lt;/strong>: execution of the workflow &lt;code>steps&lt;/code> and &lt;code>actions&lt;/code>.&lt;/li>
&lt;li>&lt;strong>Lessons learned&lt;/strong>: mandatory postmortem for the runbooks that require one, with the output documented in the postmortems repo and the runbook updated accordingly.&lt;/li>
&lt;/ul>
&lt;h3 id="ens-esquema-nacional-de-seguridad--controles-opexp">ENS (Esquema Nacional de Seguridad): op.exp controls&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>&lt;code>op.exp.7&lt;/code> Gestión de incidentes (incident management)&lt;/strong>: the runbook catalog plus the Keep / Kafka pipeline implement the &amp;ldquo;organized, procedure-driven response&amp;rdquo;.&lt;/li>
&lt;li>&lt;strong>&lt;code>op.exp.8&lt;/code> Registro de actividad (activity logging)&lt;/strong>: the &lt;code>audit.actions&lt;/code> topic with 6-month WORM retention (the minimum for level ALTO).&lt;/li>
&lt;li>&lt;strong>&lt;code>op.exp.9&lt;/code> Registro de la gestión de incidentes (incident management records)&lt;/strong>: the &lt;code>incidents.lifecycle&lt;/code> topic with the full lifecycle of each incident.&lt;/li>
&lt;li>&lt;strong>&lt;code>op.exp.10&lt;/code> Protección de los registros de actividad (protection of activity records)&lt;/strong>: WORM, encryption at rest and access control (compliance consumers are read-only).&lt;/li>
&lt;/ul>
&lt;h3 id="nis2--notificación-a-autoridad-competente">NIS2: notification to the competent authority&lt;/h3>
&lt;p>For essential and important entities, art. 23 sets three deadlines for a &lt;em>significant incident&lt;/em>:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>24 hours&lt;/strong> after becoming aware of the incident: early warning to the national CSIRT (INCIBE-CERT in Spain).&lt;/li>
&lt;li>&lt;strong>72 hours&lt;/strong> after becoming aware of the incident: incident notification with an initial assessment.&lt;/li>
&lt;li>&lt;strong>1 month&lt;/strong> after submitting the incident notification: final report with root cause, impact and corrective measures.&lt;/li>
&lt;/ul>
&lt;p>The data for those reports comes directly from &lt;code>incidents.lifecycle&lt;/code> and &lt;code>audit.actions&lt;/code>, through a consumer that generates the dossier in the required format. Without the auditable pipeline, the NIS2 deadlines are practically unattainable.&lt;/p>
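&lt;p>The deadline arithmetic is simple enough to attach to the incident as soon as it is classified as significant. The following is a sketch under stated assumptions, not a compliance tool: the function names are hypothetical, &amp;ldquo;one month&amp;rdquo; is interpreted as one calendar month, and the final report is counted from the incident notification (or from its latest allowed filing time while it has not been submitted yet).&lt;/p>

```python
import calendar
from datetime import datetime, timedelta, timezone
from typing import Optional


def add_one_month(ts: datetime) -> datetime:
    """Add one calendar month, clamping the day to the target month's length."""
    year = ts.year + 1 if ts.month == 12 else ts.year
    month = 1 if ts.month == 12 else ts.month + 1
    last_day = calendar.monthrange(year, month)[1]
    return ts.replace(year=year, month=month, day=min(ts.day, last_day))


def nis2_deadlines(aware_at: datetime, notified_at: Optional[datetime] = None) -> dict:
    """Compute the three art. 23 deadlines for one significant incident."""
    early_warning = aware_at + timedelta(hours=24)
    incident_notification = aware_at + timedelta(hours=72)
    # Assumption: until the notification is actually filed, plan the final
    # report against the latest allowed filing time.
    final_base = notified_at if notified_at is not None else incident_notification
    return {
        "early_warning": early_warning,
        "incident_notification": incident_notification,
        "final_report": add_one_month(final_base),
    }


aware = datetime(2026, 1, 30, 8, 0, tzinfo=timezone.utc)
for name, due in nis2_deadlines(aware).items():
    print(name, due.isoformat())
```

&lt;p>A consumer of &lt;code>incidents.lifecycle&lt;/code> could emit these timestamps next to the incident so that the on-call dashboard shows the time remaining for each filing.&lt;/p>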
&lt;h3 id="eu-ai-act--art-73-serious-incident-reporting">EU AI Act: art. 73 (serious incident reporting)&lt;/h3>
&lt;p>Applies to high-risk systems. Deadlines, counted from the moment the provider becomes aware of the incident:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>2 days&lt;/strong>: widespread infringements, or serious and irreversible disruption of the management or operation of critical infrastructure.&lt;/li>
&lt;li>&lt;strong>10 days&lt;/strong>: incidents involving the death of a person.&lt;/li>
&lt;li>&lt;strong>15 days&lt;/strong>: all other &amp;ldquo;serious incidents&amp;rdquo;.&lt;/li>
&lt;/ul>
&lt;p>The definition of &amp;ldquo;serious incident&amp;rdquo; (art. 3(49)) covers death or serious harm to the health of a person, serious and irreversible disruption of critical infrastructure, infringement of obligations that protect fundamental rights, and serious harm to property or the environment. Runbooks must flag which alerts can lead to a serious incident (typically anything that affects the model output in a high-risk context) and trigger a dedicated legal-assessment sub-workflow.&lt;/p>
&lt;h3 id="isoiec-42001--aims-cláusula-10-mejora-continua">ISO/IEC 42001: AIMS clause 10, continual improvement&lt;/h3>
&lt;p>The mandatory post-incident postmortem feeds clause 10. Updating the runbook after every incident that uncovers a new pattern is the &amp;ldquo;corrective action with verification of effectiveness&amp;rdquo; that the standard requires. See &lt;a href="https://blog.lo0.es/posts/iso-42001-aims-llm-on-premise/">ISO 42001 AIMS&lt;/a>.&lt;/p>
&lt;h2 id="cuatro-anti-patrones">Cuatro anti-patrones&lt;/h2>
&lt;p>&lt;strong>Anti-patrón 1 — alertas sin runbook.&lt;/strong> La alerta dispara, el operador junior de guardia mira el dashboard, busca en Confluence, no encuentra nada actualizado, llama al senior por Slack, espera 20 minutos. En ese tiempo el incidente ha crecido. Regla: &lt;strong>ninguna alerta entra a producción sin runbook publicado y workflow Keep aprobado&lt;/strong>. CI valida que cada &lt;code>PrometheusRule&lt;/code> con severity ≥ warning tiene su &lt;code>keep workflow&lt;/code> correspondiente.&lt;/p>
&lt;p>&lt;strong>Anti-pattern 2: runbook without prior evidence capture.&lt;/strong> The workflow runs &lt;code>nvidia-smi --gpu-reset&lt;/code> as soon as the XID arrives, destroying the state that would have diagnosed the root cause. The next identical XID forces the diagnosis to start from scratch. Rule: &lt;strong>&lt;code>steps&lt;/code> before &lt;code>actions&lt;/code>&lt;/strong>; all evidence is captured first, destructive actions run afterwards.&lt;/p>
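&lt;p>The ordering rule can be illustrated outside Keep with a tiny runner. This is a sketch, not Keep's engine: &lt;code>run_runbook&lt;/code> and both callbacks are hypothetical, and the captured text is a placeholder rather than real driver output.&lt;/p>

```python
def run_runbook(evidence_steps, destructive_actions):
    """Run every evidence step first, then the destructive actions.

    If any capture raises, the exception propagates and no action runs,
    so the state needed for root-cause analysis is never destroyed early.
    """
    evidence = {}
    for name, capture in evidence_steps:
        evidence[name] = capture()
    results = [(name, act(evidence)) for name, act in destructive_actions]
    return evidence, results


log = []  # records the order in which callbacks actually ran


def capture_xid_log():
    log.append("capture:xid_log")
    return "placeholder kernel log excerpt"


def reset_gpu(evidence):
    log.append("action:gpu_reset")
    return "reset requested with %d evidence item(s)" % len(evidence)


evidence, results = run_runbook(
    [("xid_log", capture_xid_log)],
    [("gpu_reset", reset_gpu)],
)
print(log)
```

&lt;p>Passing the evidence dictionary into each action also makes it easy to attach the captured state to the ticket that the action opens.&lt;/p>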
&lt;p>&lt;strong>Anti-pattern 3: escalation by seniority instead of severity.&lt;/strong> The junior operator on call handles an ECC DBE because it is &amp;ldquo;their turn&amp;rdquo;. They lack the context to understand row remapping, retired pages, or the risk of data corruption. Rule: &lt;strong>paging by severity, not by rotation&lt;/strong>: RB-04 and RB-03 page the senior primary on-call, with automatic escalation to infra/hardware if the page is not acknowledged within 10 min.&lt;/p>
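&lt;p>One way to express that routing rule as data plus a small function. The target names (&lt;code>oncall-senior&lt;/code>, &lt;code>infra-hardware&lt;/code>, &lt;code>oncall-rotation&lt;/code>) and the default route for other runbooks are assumptions made for this sketch.&lt;/p>

```python
from dataclasses import dataclass

ACK_TIMEOUT_MIN = 10  # acknowledgement window stated in the rule above


@dataclass(frozen=True)
class Route:
    primary: str
    escalation: str


# RB-03 and RB-04 go straight to the senior on-call; names are hypothetical.
ROUTES = {
    "RB-03": Route("oncall-senior", "infra-hardware"),
    "RB-04": Route("oncall-senior", "infra-hardware"),
}
# Assumed fallback for runbooks that do not require senior handling.
DEFAULT_ROUTE = Route("oncall-rotation", "oncall-senior")


def page_target(runbook_id, minutes_since_page, acknowledged):
    """Return who should be paged right now for this runbook."""
    route = ROUTES.get(runbook_id, DEFAULT_ROUTE)
    if not acknowledged and minutes_since_page >= ACK_TIMEOUT_MIN:
        return route.escalation
    return route.primary


print(page_target("RB-04", 0, False))
print(page_target("RB-04", 12, False))
```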
&lt;p>&lt;strong>Anti-pattern 4: no human gate for destructive actions.&lt;/strong> The workflow runs &lt;code>kubectl drain&lt;/code> automatically on any alert marked CRITICAL. On the first false alarm (a transient that resolved itself within 30 s), Keep drained a production node during peak hours. Rule: &lt;strong>destructive actions (drain, reset, RMA, full rollback) require human confirmation&lt;/strong> via a Slack interactive message, with a configurable timeout. Justified exception: an ECC DBE confirmed by more than one measurement, because the risk of data corruption outweighs the risk of a false alarm.&lt;/p>
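&lt;p>The gate and its single exception fit in a few lines. In this sketch the action labels and the function name are hypothetical, and a timed-out prompt is assumed to mean &amp;ldquo;do not execute&amp;rdquo;; a team may prefer a different timeout policy.&lt;/p>

```python
# Destructive action labels, mirroring the list in the rule above.
DESTRUCTIVE = {"drain", "reset", "rma", "full_rollback"}


def may_execute(action, alert_name, confirmations, human_approved):
    """Decide whether the workflow may run the action.

    confirmations: number of independent measurements that saw the fault.
    human_approved: True, False, or None when the prompt timed out.
    """
    if action not in DESTRUCTIVE:
        return True
    # Exception: an ECC double-bit error seen by more than one measurement.
    if alert_name == "GpuEccDoubleBit" and confirmations > 1:
        return True
    # Everything else needs an explicit yes; a timeout (None) is a no.
    return human_approved is True


print(may_execute("drain", "GpuHbmNearOom", 1, None))
print(may_execute("drain", "GpuEccDoubleBit", 2, None))
```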
&lt;h2 id="aplicado-a-hardware-on-premise-típico">Applied to typical on-premise hardware&lt;/h2>
&lt;p>For a generic cluster of &lt;strong>4 nodes × 4×H100 SXM 80 GB&lt;/strong> with &lt;strong>Kafka and Keep already deployed&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Kafka&lt;/strong>: a 3-broker cluster on non-GPU nodes of the K8s cluster; topics &lt;code>gpu.alerts.enriched&lt;/code>, &lt;code>incidents.lifecycle&lt;/code>, &lt;code>audit.actions&lt;/code> configured with replication factor 3 and min.insync.replicas 2. The audit topic uses tiered storage on Ceph RGW to retain data for &amp;gt; 6 months without excessive cost.&lt;/li>
&lt;li>&lt;strong>Keep&lt;/strong>: 2 operator replicas + 1 worker replica in a &lt;code>keep&lt;/code> namespace; connected to Prometheus (read provider), Kafka (read + write provider), Slack, PagerDuty, Jira, and Kubernetes (provider with a dedicated SA granted &lt;code>get/list/patch nodes&lt;/code> and &lt;code>create jobs&lt;/code>).&lt;/li>
&lt;li>&lt;strong>Workflows&lt;/strong>: ~25-40 YAML files in the repo &lt;code>infra/keep-workflows/&lt;/code>, synced to the cluster via Flux or Argo CD. Validated by CI (&lt;code>keep workflow validate&lt;/code>) on every PR.&lt;/li>
&lt;li>&lt;strong>Event volume&lt;/strong>: for 16 GPUs in normal operation with debounced alerts, ~50-200 events/day on &lt;code>gpu.alerts.enriched&lt;/code>. During a typical incident, peaks of 500-2000 events/day.&lt;/li>
&lt;li>&lt;strong>Compliance consumers&lt;/strong>: a Python consumer in the &lt;code>compliance&lt;/code> namespace that generates weekly NIS2 / ENS / EU AI Act reports by reading &lt;code>audit.actions&lt;/code> and &lt;code>incidents.lifecycle&lt;/code>.&lt;/li>
&lt;/ul>
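&lt;p>The Kafka settings above can be written down as topic-level configuration. The sketch below only builds the config dictionaries; it does not talk to a broker. The retention periods (30 and 365 days) and the 7-day local retention are assumptions chosen for illustration, and &lt;code>remote.storage.enable&lt;/code> only takes effect on clusters where tiered storage is enabled at broker level.&lt;/p>

```python
DAY_MS = 24 * 60 * 60 * 1000

# Replication factor is set when the topic is created, not as a topic config.
REPLICATION_FACTOR = 3


def topic_config(retention_days, tiered=False, local_retention_days=7):
    """Build topic-level Kafka settings as the string map Kafka expects."""
    config = {
        "min.insync.replicas": "2",
        "retention.ms": str(retention_days * DAY_MS),
    }
    if tiered:
        # Older segments move to remote storage; brokers keep a short tail.
        config["remote.storage.enable"] = "true"
        config["local.retention.ms"] = str(local_retention_days * DAY_MS)
    return config


# Retention values are illustrative assumptions, not from the cluster.
TOPICS = {
    "gpu.alerts.enriched": topic_config(retention_days=30),
    "incidents.lifecycle": topic_config(retention_days=365),
    "audit.actions": topic_config(retention_days=365, tiered=True),
}

for name, config in TOPICS.items():
    print(name, config)
```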
&lt;h2 id="lo-que-no-hemos-cubierto-próximos-posts">What we have not covered (upcoming posts)&lt;/h2>
&lt;ul>
&lt;li>&lt;strong>Postmortem playbooks&lt;/strong>: the mechanics of RCA with 5-whys, Ishikawa adapted to LLMs, and integration with MLflow tracking for re-training when the postmortem yields an enriched dataset.&lt;/li>
&lt;li>&lt;strong>Chaos engineering for LLMs&lt;/strong>: controlled injection of XID errors, simulated ECC errors, and artificial HBM latency to validate runbooks &lt;strong>before&lt;/strong> the real incident.&lt;/li>
&lt;li>&lt;strong>Multi-cluster incident coordination&lt;/strong>: how to coordinate Keep across geographically distributed clusters when an incident affects multiple regions.&lt;/li>
&lt;li>&lt;strong>Integration with CMDB and procurement&lt;/strong>: the &lt;code>RMA → ticket → ServiceNow → reposición de hardware&lt;/code> cycle (ending in hardware replacement), automated via workflow.&lt;/li>
&lt;li>&lt;strong>LLM forensics&lt;/strong>: extracting the complete OTel trace of a request affected by an incident, with PII redacted, and preserving it in an evidence vault.&lt;/li>
&lt;/ul>
&lt;h2 id="ver-también">See also&lt;/h2>
&lt;ul>
&lt;li>&lt;a href="https://blog.lo0.es/posts/anatomia-metricas-dcgm-vllm-anomalias/">Anatomía de las doce métricas DCGM y cinco vLLM&lt;/a>: the documented anomaly per metric that these runbooks resolve.&lt;/li>
&lt;li>&lt;a href="https://blog.lo0.es/posts/observabilidad-gpu-dcgm-llm/">Observabilidad GPU para inferencia LLM&lt;/a>: the compact list and the six critical alerts.&lt;/li>
&lt;li>&lt;a href="https://blog.lo0.es/posts/tracing-llm-otel-genai/">Tracing LLM con OpenTelemetry GenAI&lt;/a>: the OTel trace that is captured as evidence.&lt;/li>
&lt;li>&lt;a href="https://blog.lo0.es/posts/canary-blue-green-shadow-modelos-llm/">Canary, blue-green y shadow&lt;/a>: the rollback mechanism that RB-06 invokes.&lt;/li>
&lt;li>&lt;a href="https://blog.lo0.es/posts/autoscaling-llm-kubernetes-keda/">Autoscaling LLM en Kubernetes&lt;/a>: the scaling lever that RB-01 and RB-05 invoke.&lt;/li>
&lt;li>&lt;a href="https://blog.lo0.es/posts/capacity-planning-inferencia-llm-on-premise/">Capacity planning&lt;/a>: the headroom budgeted to absorb incidents without breaking the SLO.&lt;/li>
&lt;li>&lt;a href="https://blog.lo0.es/posts/iso-42001-aims-llm-on-premise/">ISO/IEC 42001 AIMS para LLM on-premise&lt;/a>: the clause 10 that these postmortems put into practice.&lt;/li>
&lt;li>&lt;a href="https://blog.lo0.es/posts/controles-tecnicos-ens-42001-eu-ai-act/">Controles técnicos ENS × 42001 × EU AI Act&lt;/a>: the control mapping that these runbooks satisfy.&lt;/li>
&lt;li>&lt;a href="https://blog.lo0.es/posts/eu-ai-act-mapeo-arquitectura-llm-on-premise/">EU AI Act: mapeo a arquitectura LLM&lt;/a>: the art. 73 on serious incidents that triggers the legal sub-workflow.&lt;/li>
&lt;li>&lt;a href="https://blog.lo0.es/posts/cinco-niveles-madurez-plataforma-llm-on-premise/">Cinco niveles de madurez&lt;/a>: codified runbooks are a requirement of levels 3-4.&lt;/li>
&lt;/ul>
&lt;h2 id="referencias">References&lt;/h2>
&lt;ul>
&lt;li>ISO/IEC 27035-1:2023, &lt;em>Information security incident management: Principles and process&lt;/em>.&lt;/li>
&lt;li>ISO/IEC 27035-2:2023, &lt;em>Information security incident management: Guidelines to plan and prepare for incident response&lt;/em>.&lt;/li>
&lt;li>ENS, &lt;em>Real Decreto 311/2022&lt;/em>, Annex II, controls &lt;code>op.exp.7&lt;/code> to &lt;code>op.exp.10&lt;/code>.&lt;/li>
&lt;li>NIS2 Directive, (EU) 2022/2555, art. 23 (notification of significant incidents).&lt;/li>
&lt;li>EU AI Act, Regulation (EU) 2024/1689, art. 73 (reporting of serious incidents).&lt;/li>
&lt;li>ISO/IEC 42001:2023, &lt;em>AI management system&lt;/em>, clause 10 (improvement).&lt;/li>
&lt;li>Keep project, &lt;code>keephq.dev&lt;/code> and &lt;code>github.com/keephq/keep&lt;/code> (documentation for YAML workflows and providers).&lt;/li>
&lt;li>Apache Kafka, &lt;em>Tiered Storage&lt;/em> and &lt;code>cleanup.policy&lt;/code> (docs.confluent.io / kafka.apache.org).&lt;/li>
&lt;li>Confluent, &lt;em>Schema Registry&lt;/em> and best practices for lifecycle events.&lt;/li>
&lt;li>NVIDIA, &lt;em>Xid Errors Documentation&lt;/em> and remediation procedures.&lt;/li>
&lt;li>Google SRE Book, &lt;em>Effective Troubleshooting&lt;/em> and &lt;em>Postmortem Culture&lt;/em>.&lt;/li>
&lt;li>Atlassian, &lt;em>Incident Management Handbook&lt;/em> (reference for severity matrices).&lt;/li>
&lt;/ul></description></item></channel></rss>