Agentic Notes Library

Chapter 21: Fault Tolerance, Idempotency, and Graceful Degradation

March 20, 2026

Preamble

Agentic AI systems operate at the intersection of stochastic inference, distributed tool execution, external API dependencies, and human-in-the-loop governance. Every one of these boundaries is a failure surface. A system that cannot tolerate failures deterministically is not a production system—it is a demonstration. This chapter formalizes fault tolerance for agentic platforms with the same rigor applied to avionics, financial trading systems, and distributed databases. Every failure mode is taxonomized, every mitigation is mathematically characterized, every recovery protocol is specified as a bounded, instrumented, auditable procedure. The objective is not merely to survive failures but to operate predictably, safely, and cost-efficiently through them—maintaining measurable service-level guarantees while preserving the correctness invariants on which agentic reliability depends.


21.1 Failure Taxonomy: Transient, Persistent, Cascading, Byzantine, and Semantic Failures

21.1.1 The Necessity of Formal Taxonomy

Effective fault tolerance demands that failures are not treated as a monolithic category. Each failure class has distinct detection signatures, propagation characteristics, recovery strategies, and cost profiles. A retry strategy appropriate for a transient network timeout is catastrophically wrong for a persistent authorization failure. A circuit breaker tuned for external API flakiness is useless against a semantic hallucination that passes all schema checks.

21.1.2 Failure Class Definitions

Let $\mathcal{F}$ be the space of all failures. Define a classification function:

$$\text{Classify}: \mathcal{F} \rightarrow \{\texttt{TRANSIENT}, \texttt{PERSISTENT}, \texttt{CASCADING}, \texttt{BYZANTINE}, \texttt{SEMANTIC}\}$$

| Failure Class | Definition | Detection Signature | Recovery Strategy | Example |
|---|---|---|---|---|
| Transient | Temporary condition that resolves without intervention | Succeeds on retry; error code in retriable set | Retry with backoff | Network timeout, rate limit 429, temporary overload |
| Persistent | Stable failure that will not self-resolve | Repeated identical failure across retries | Escalate, substitute, or fail | Invalid credentials, deleted resource, schema mismatch |
| Cascading | Failure in one component propagates to dependent components | Correlated failures across services within temporal window | Isolate, shed load, break dependency chain | Database overload → retrieval failure → agent stall → queue backup |
| Byzantine | Component produces incorrect results without signaling error | Output passes schema validation but is semantically wrong | Redundant verification, voting, cross-validation | LLM hallucination, corrupted cache returning stale data, tool silently returning wrong result |
| Semantic | Output is structurally valid but violates task-level correctness or safety requirements | Detected only by domain-specific verification | Critique → repair → re-verify; escalate to human | Factually incorrect answer, unsafe recommendation, logically inconsistent plan |

21.1.3 Formal Failure Model

Define a failure event $f \in \mathcal{F}$ as a tuple:

$$f = (\text{id}, \text{class}, \text{source}, \text{timestamp}, \text{severity}, \text{retriable}, \text{context}, \text{propagation\_risk})$$

The failure rate for component $c$ over window $\Delta t$ is:

$$\lambda_c(\Delta t) = \frac{|\{f \in \mathcal{F}_c : t_f \in [t - \Delta t, t]\}|}{\Delta t}$$

The mean time between failures (MTBF) and mean time to recovery (MTTR) for a component are:

$$\text{MTBF}_c = \frac{1}{\lambda_c}, \quad \text{MTTR}_c = \mathbb{E}[T_{\text{recovery}} \mid f \in \mathcal{F}_c]$$

Component availability is:

$$A_c = \frac{\text{MTBF}_c}{\text{MTBF}_c + \text{MTTR}_c}$$

For a serial chain of $n$ components, system availability is:

$$A_{\text{system}} = \prod_{i=1}^{n} A_i$$

For parallel redundancy with $k$ independent replicas:

$$A_{\text{parallel}} = 1 - (1 - A_c)^k$$
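These availability identities can be sanity-checked numerically. The sketch below is illustrative (the function names are not part of the chapter's formal model):

```python
def availability(mtbf: float, mttr: float) -> float:
    """Component availability A = MTBF / (MTBF + MTTR)."""
    return mtbf / (mtbf + mttr)

def serial_availability(components: list[float]) -> float:
    """Serial chain: the product of component availabilities."""
    result = 1.0
    for a in components:
        result *= a
    return result

def parallel_availability(a: float, k: int) -> float:
    """k independent replicas: 1 - (1 - A)^k."""
    return 1.0 - (1.0 - a) ** k

# A component failing once per 1000 hours and recovering in 1 hour:
a = availability(mtbf=1000.0, mttr=1.0)      # ~0.999
chain = serial_availability([a] * 5)          # ~0.995 -- a series erodes availability
replicated = parallel_availability(a, k=3)    # redundancy restores it
```

Note how quickly serial composition erodes availability: five ~0.999 components in series already drop below 0.996, which is why long agentic tool chains need redundancy or isolation at the weakest links.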

21.1.4 Cascading Failure Propagation Model

Model the agentic system as a directed dependency graph $G = (V, E)$ where vertices are components and edges are dependencies. A failure at component $v_i$ propagates to $v_j$ if $(v_i, v_j) \in E$ and $v_j$ lacks isolation.

The blast radius of a failure at $v_i$ is:

$$\text{BlastRadius}(v_i) = |\text{Reachable}(v_i, G)| = |\{v_j : v_i \rightsquigarrow v_j \text{ in } G\}|$$

The cascade probability depends on the failure propagation probability $p_{ij}$ along each edge:

$$P(\text{cascade to } v_j \mid \text{fail at } v_i) = 1 - \prod_{\text{paths } v_i \rightsquigarrow v_j} \left(1 - \prod_{(u,w) \in \text{path}} p_{uw}\right)$$

Isolation mechanisms (bulkheads, circuit breakers) reduce $p_{ij}$ toward zero on specific edges.
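The blast radius reduces to graph reachability. A minimal sketch as a breadth-first traversal, using a hypothetical dependency graph whose edges point from a component to its dependents:

```python
from collections import deque

def blast_radius(graph: dict[str, list[str]], source: str) -> set[str]:
    """Components reachable from `source` along dependency edges.
    An edge u -> v means v depends on u, so a failure at u can reach v."""
    seen: set[str] = set()
    frontier = deque([source])
    while frontier:
        node = frontier.popleft()
        for dependent in graph.get(node, []):
            if dependent not in seen:
                seen.add(dependent)
                frontier.append(dependent)
    return seen

# Hypothetical components mirroring the cascade example in the taxonomy table:
deps = {
    "database": ["retrieval"],
    "retrieval": ["agent_loop"],
    "agent_loop": ["response_queue"],
    "cache": ["retrieval"],
}
# A database failure can cascade through retrieval to the agent loop and queue:
print(sorted(blast_radius(deps, "database")))
```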

21.1.5 Pseudo-Algorithm: Failure Classification Engine

ALGORITHM ClassifyFailure
  INPUT:
    error: ErrorEvent
    history: RecentFailureHistory
    component: ComponentID
    dependency_graph: DependencyGraph
 
  OUTPUT:
    classification: FailureClassification
 
  BEGIN:
    // ─── Step 1: Check retriability from error code taxonomy ───
    IF error.code IN TRANSIENT_ERROR_CODES THEN
      // Verify not persistent by checking history
      recent_identical ← history.count(
        component=component,
        error_code=error.code,
        window=config.persistence_window
      )
      IF recent_identical ≥ config.persistence_threshold THEN
        RETURN FailureClassification(
          class=PERSISTENT,
          confidence=0.9,
          evidence="repeated_identical_failure",
          recommended_action=ESCALATE
        )
      ELSE
        RETURN FailureClassification(
          class=TRANSIENT,
          confidence=0.85,
          evidence="retriable_error_code",
          recommended_action=RETRY_WITH_BACKOFF
        )
      END IF
    END IF
 
    // ─── Step 2: Check for cascade indicators ───
    correlated_failures ← history.correlated_failures(
      temporal_window=config.cascade_detection_window,
      min_components=2
    )
    IF LEN(correlated_failures) ≥ config.cascade_threshold THEN
      upstream_root ← IDENTIFY_CASCADE_ROOT(correlated_failures, dependency_graph)
      RETURN FailureClassification(
        class=CASCADING,
        confidence=0.8,
        evidence="correlated_multi_component_failure",
        root_cause=upstream_root,
        recommended_action=ISOLATE_AND_SHED_LOAD
      )
    END IF
 
    // ─── Step 3: Check for Byzantine indicators ───
    IF error.type = VERIFICATION_FAILURE AND error.schema_valid = TRUE THEN
      RETURN FailureClassification(
        class=BYZANTINE,
        confidence=0.75,
        evidence="schema_valid_but_semantically_incorrect",
        recommended_action=REDUNDANT_VERIFICATION
      )
    END IF
 
    // ─── Step 4: Check for semantic failures ───
    IF error.type IN {HALLUCINATION, SAFETY_VIOLATION, LOGIC_ERROR,
                       FACTUAL_ERROR, COHERENCE_FAILURE} THEN
      RETURN FailureClassification(
        class=SEMANTIC,
        confidence=0.85,
        evidence=error.verification_details,
        recommended_action=CRITIQUE_AND_REPAIR
      )
    END IF
 
    // ─── Step 5: Default to persistent ───
    RETURN FailureClassification(
      class=PERSISTENT,
      confidence=0.5,
      evidence="unclassified_error",
      recommended_action=FAIL_WITH_DIAGNOSTICS
    )
  END

21.2 Retry Engineering

21.2.1 Exponential Backoff with Jitter: Configuration, Bounds, and Anti-Thundering-Herd

Formal Backoff Function

The backoff delay for retry attempt $r$ is:

$$d(r) = \min\left(d_{\max}, \; d_0 \cdot b^r + J(r)\right)$$

where:

  • $d_0$: base delay (e.g., 100 ms)
  • $b$: exponential base (typically $b = 2$)
  • $d_{\max}$: ceiling delay to prevent unbounded waits
  • $J(r)$: jitter function to decorrelate concurrent retriers

Jitter Strategies

| Strategy | Formula | Properties |
|---|---|---|
| Full Jitter | $d(r) = \text{Uniform}(0, d_0 \cdot b^r)$ | Maximum decorrelation; wide delay variance |
| Equal Jitter | $d(r) = \frac{d_0 \cdot b^r}{2} + \text{Uniform}\left(0, \frac{d_0 \cdot b^r}{2}\right)$ | Balanced: guaranteed minimum wait plus jitter |
| Decorrelated Jitter | $d(r) = \min\left(d_{\max}, \text{Uniform}(d_0, d(r-1) \cdot 3)\right)$ | Self-adapting; depends on previous delay |

Note that for full and equal jitter the randomized expression is the entire delay, replacing the additive form $d_0 \cdot b^r + J(r)$ above.

The expected total wait across $R$ retries with full jitter is:

$$\mathbb{E}\left[\sum_{r=0}^{R-1} d(r)\right] = \sum_{r=0}^{R-1} \min\left(d_{\max}, \frac{d_0 \cdot b^r}{2}\right)$$

For $d_{\max} \gg d_0 \cdot b^{R-1}$:

$$\mathbb{E}[T_{\text{total}}] = \frac{d_0}{2} \cdot \frac{b^R - 1}{b - 1}$$

Anti-Thundering-Herd Analysis

When $N$ concurrent clients retry the same failing service, without jitter every client's attempt-$r$ retry fires at the same instant ($d_0 \cdot b^r$ after the shared failure), producing a synchronized spike of $N$ simultaneous requests.

With full jitter, the $N$ retries are instead spread uniformly over the interval $[0, d_0 \cdot b^r]$, smoothing the aggregate retry rate to:

$$\lambda_{\text{jittered}}(r) = \frac{N}{d_0 \cdot b^r}$$

eliminating the synchronization entirely.

21.2.2 Retry Budgets: Per-Request, Per-Session, and System-Wide Limits

Unbounded retries convert transient failures into persistent resource exhaustion. The system enforces layered retry budgets:

$$\text{RetryBudget} = (R_{\text{request}}, R_{\text{session}}, R_{\text{system}}, T_{\text{retry\_window}})$$

| Level | Scope | Budget | Typical Value |
|---|---|---|---|
| Per-Request | Single tool call or API invocation | $R_{\text{request}}$ | 3–5 attempts |
| Per-Session | All retries within one session | $R_{\text{session}}$ | 20–50 attempts |
| System-Wide | Total retries across all sessions per window | $R_{\text{system}} / T_{\text{window}}$ | 1000 retries/minute |

The retry budget utilization at time $t$ is:

$$U_{\text{retry}}(t) = \frac{\text{retries\_consumed}(t - T_{\text{window}}, t)}{R_{\text{system}}}$$

When $U_{\text{retry}}(t) > U_{\text{threshold}}$ (e.g., 0.8), the system enters retry backpressure mode: new retry attempts are rejected or deferred, and the incident response pipeline is triggered.

21.2.3 Idempotency Keys: Generation, Propagation, and Server-Side Deduplication

Idempotency Key Specification

An idempotency key $\kappa$ uniquely identifies a logical operation such that repeated submission has the effect of a single execution:

$$\underbrace{\text{Execute}(\text{op}, \kappa) \circ \cdots \circ \text{Execute}(\text{op}, \kappa)}_{n \text{ invocations}} \equiv \text{Execute}(\text{op}, \kappa) \quad \forall n \geq 1$$

i.e., the effect is applied exactly once regardless of how many times the request is submitted.

Key Generation

$$\kappa = \text{HMAC-SHA256}\left(K_{\text{session}}, \; \text{operation\_type} \| \text{canonical\_params} \| \text{sequence\_number}\right)$$

where:

  • $K_{\text{session}}$ is the session-scoped secret
  • $\text{canonical\_params}$ is the deterministic serialization of operation parameters
  • $\text{sequence\_number}$ is a monotonically increasing counter within the session
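A minimal sketch of this derivation with Python's standard `hmac` module, assuming JSON with sorted keys and fixed separators as the canonical serialization (the chapter does not mandate a specific format):

```python
import hashlib
import hmac
import json

def idempotency_key(session_secret: bytes, operation_type: str,
                    params: dict, sequence_number: int) -> str:
    """HMAC-SHA256 over operation_type || canonical_params || sequence_number."""
    # Canonical serialization: sorted keys, no whitespace
    canonical_params = json.dumps(params, sort_keys=True, separators=(",", ":"))
    message = f"{operation_type}|{canonical_params}|{sequence_number}".encode()
    return hmac.new(session_secret, message, hashlib.sha256).hexdigest()

# The same logical operation always yields the same key, so a retry deduplicates:
k1 = idempotency_key(b"secret", "create_ticket", {"title": "x", "prio": 2}, 7)
k2 = idempotency_key(b"secret", "create_ticket", {"prio": 2, "title": "x"}, 7)
assert k1 == k2  # parameter order is irrelevant after canonicalization
```

Canonicalization is the load-bearing detail: if two serializations of the same parameters differ byte-for-byte, the dedup store sees two distinct operations.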

Key Propagation

Idempotency keys propagate through the call chain. When an agent invokes a tool, which invokes a downstream service:

$$\kappa_{\text{downstream}} = \text{Derive}(\kappa_{\text{parent}}, \text{step\_index})$$

This ensures that a retry of the parent operation generates the same derived key, enabling deduplication at every layer.

Server-Side Deduplication

The server maintains a deduplication store $\mathcal{D}$ mapping keys to results:

$$\mathcal{D}: \kappa \rightarrow (\text{result}, \text{timestamp}, \text{expiry})$$

ALGORITHM IdempotentExecute
  INPUT:
    operation: Operation
    idempotency_key: IdempotencyKey
    dedup_store: DeduplicationStore
    config: IdempotencyConfig
 
  OUTPUT:
    result: OperationResult
 
  BEGIN:
    // ─── Check deduplication store ───
    existing ← dedup_store.get(idempotency_key)
 
    IF existing IS NOT NULL THEN
      IF existing.status = COMPLETED THEN
        EMIT_METRIC("idempotency.dedup_hit", {key: idempotency_key})
        RETURN existing.result  // Return cached result
      ELSE IF existing.status = IN_PROGRESS THEN
        // Another invocation is in flight
        IF NOW() - existing.started_at > config.in_progress_timeout THEN
          // Stale in-progress record; likely crashed
          dedup_store.update(idempotency_key, status=EXPIRED)
          // Fall through to execute
        ELSE
          RETURN OperationResult(status=PENDING, retry_after=existing.estimated_completion)
        END IF
      END IF
    END IF
 
    // ─── Claim the key ───
    claimed ← dedup_store.claim(idempotency_key, {
      status: IN_PROGRESS,
      started_at: NOW(),
      expiry: NOW() + config.key_ttl
    })
 
    IF NOT claimed THEN
      // Race condition: another instance claimed first
      RETURN OperationResult(status=PENDING, retry_after=1s)
    END IF
 
    // ─── Execute operation ───
    TRY:
      result ← EXECUTE(operation)
      dedup_store.update(idempotency_key, {
        status: COMPLETED,
        result: result,
        completed_at: NOW(),
        expiry: NOW() + config.result_ttl
      })
      RETURN result
 
    CATCH error:
      IF IS_RETRIABLE(error) THEN
        dedup_store.update(idempotency_key, status=FAILED_RETRIABLE)
        // Allow future retries with same key
      ELSE
        dedup_store.update(idempotency_key, {
          status: FAILED_PERMANENT,
          error: error,
          expiry: NOW() + config.result_ttl
        })
      END IF
      RAISE error
    END TRY
  END

Deduplication Store TTL

Keys expire after a configurable TTL to prevent unbounded storage growth:

$$\text{TTL}(\kappa) = \begin{cases} T_{\text{result}} & \text{if status} = \texttt{COMPLETED} \quad (\text{typically 24h}) \\ T_{\text{in\_progress}} & \text{if status} = \texttt{IN\_PROGRESS} \quad (\text{typically 5min}) \\ T_{\text{failed}} & \text{if status} = \texttt{FAILED} \quad (\text{typically 1h}) \end{cases}$$

Storage cost:

$$C_{\text{dedup}} = |\mathcal{D}| \cdot (\text{key\_size} + \text{result\_size} + \text{metadata\_size}) \cdot C_{\text{per\_byte}}$$

21.3 Circuit Breakers: Open/Half-Open/Closed States, Failure Rate Thresholds, and Recovery Probes

21.3.1 Circuit Breaker State Machine

The circuit breaker is a protective state machine that prevents a failing downstream dependency from consuming unbounded resources:

$$\mathcal{CB} = (S, s_0, \Sigma, \delta)$$

where:

  • $S = \{\texttt{CLOSED}, \texttt{OPEN}, \texttt{HALF\_OPEN}\}$
  • $s_0 = \texttt{CLOSED}$ (normal operation)
  • $\Sigma = \{\texttt{success}, \texttt{failure}, \texttt{timeout\_elapsed}, \texttt{probe\_success}, \texttt{probe\_failure}\}$

Transition function:

$$\delta(\texttt{CLOSED}, \texttt{failure\_threshold\_exceeded}) = \texttt{OPEN}$$
$$\delta(\texttt{OPEN}, \texttt{timeout\_elapsed}) = \texttt{HALF\_OPEN}$$
$$\delta(\texttt{HALF\_OPEN}, \texttt{probe\_success}) = \texttt{CLOSED}$$
$$\delta(\texttt{HALF\_OPEN}, \texttt{probe\_failure}) = \texttt{OPEN}$$

21.3.2 Failure Rate Computation

The failure rate is computed over a sliding window of $W$ requests:

$$\phi(t) = \frac{|\{r \in \text{Window}(t, W) : r.\text{failed}\}|}{|\text{Window}(t, W)|}$$

The circuit opens when:

$$\phi(t) > \phi_{\text{threshold}} \quad \wedge \quad |\text{Window}(t, W)| \geq N_{\min}$$

where $\phi_{\text{threshold}}$ (e.g., 0.5) is the failure rate threshold and $N_{\min}$ is the minimum sample size to prevent premature tripping on low traffic.

21.3.3 Recovery Probing

In HALF_OPEN state, the circuit breaker admits a limited number of probe requests $P_{\text{probes}}$ (typically 1–3):

$$\text{RecoveryDecision} = \begin{cases} \texttt{CLOSED} & \text{if } \frac{\text{probe\_successes}}{P_{\text{probes}}} \geq \rho_{\text{recovery}} \\ \texttt{OPEN} & \text{otherwise} \end{cases}$$

The open duration before transitioning to HALF_OPEN follows an exponential backoff:

$$T_{\text{open}}(n) = \min\left(T_{\text{open\_max}}, \; T_{\text{open\_base}} \cdot 2^{n-1}\right)$$

where $n$ is the number of consecutive open→half-open→open cycles. This prevents rapid oscillation (circuit "flapping").

21.3.4 Circuit Breaker Metrics

| Metric | Formula | Operational Significance |
|---|---|---|
| Trip Rate | $\lambda_{\text{trip}} = \frac{\text{trips}}{\Delta t}$ | Frequency of dependency degradation |
| Open Duration | $\bar{T}_{\text{open}} = \mathbb{E}[T_{\text{open}}]$ | Average time dependency is unavailable |
| Recovery Success Rate | $\rho_{\text{recovery}} = \frac{\text{successful\_recoveries}}{\text{recovery\_attempts}}$ | Dependency stability after incidents |
| Requests Shed | $N_{\text{shed}} = \sum_{\text{open periods}} \lambda_{\text{incoming}} \cdot T_{\text{open}}$ | Requests rejected during outages |

21.3.5 Pseudo-Algorithm: Circuit Breaker

ALGORITHM CircuitBreaker
  INPUT:
    dependency: DependencyID
    config: CircuitBreakerConfig
 
  STATE:
    state ← CLOSED
    failure_window ← SlidingWindow(size=config.window_size)
    open_since ← NULL
    consecutive_opens ← 0
    probe_results ← []
 
  METHOD execute(operation):
    MATCH state:
 
      CLOSED:
        TRY:
          result ← operation.execute(timeout=config.call_timeout)
          failure_window.record(SUCCESS)
          RETURN result
        CATCH error:
          failure_window.record(FAILURE)
          // Check trip condition
          IF failure_window.count() ≥ config.N_min THEN
            failure_rate ← failure_window.failure_rate()
            IF failure_rate > config.phi_threshold THEN
              TRIP_OPEN()
            END IF
          END IF
          RAISE error
        END TRY
 
      OPEN:
        // Check if open timeout has elapsed
        elapsed ← NOW() - open_since
        T_open ← MIN(config.T_open_max,
                      config.T_open_base * 2^(consecutive_opens - 1))
        IF elapsed ≥ T_open THEN
          state ← HALF_OPEN
          probe_results ← []
          EMIT_TRACE("circuit_breaker.half_open", dependency)
          // Fall through to HALF_OPEN handling below
          RETURN EXECUTE_HALF_OPEN(operation)
        ELSE
          // Fast-fail: do not attempt the call
          EMIT_METRIC("circuit_breaker.rejected", dependency)
          RAISE CircuitOpenError(dependency, retry_after=T_open - elapsed)
        END IF
 
      HALF_OPEN:
        RETURN EXECUTE_HALF_OPEN(operation)
 
  METHOD EXECUTE_HALF_OPEN(operation):
    IF LEN(probe_results) ≥ config.P_probes THEN
      // Already collected enough probes; wait for decision
      RAISE CircuitOpenError(dependency, retry_after=1s)
    END IF
 
    TRY:
      result ← operation.execute(timeout=config.call_timeout)
      APPEND probe_results, SUCCESS
 
      IF COUNT(SUCCESS IN probe_results) ≥ config.probes_to_close THEN
        state ← CLOSED
        consecutive_opens ← 0
        failure_window.reset()
        EMIT_TRACE("circuit_breaker.closed", dependency)
      END IF
 
      RETURN result
 
    CATCH error:
      APPEND probe_results, FAILURE
      TRIP_OPEN()
      RAISE error
    END TRY
 
  METHOD TRIP_OPEN():
    state ← OPEN
    open_since ← NOW()
    consecutive_opens ← consecutive_opens + 1
    EMIT_TRACE("circuit_breaker.opened", {
      dependency: dependency,
      failure_rate: failure_window.failure_rate(),
      consecutive_opens: consecutive_opens
    })
    EMIT_ALERT_IF(consecutive_opens ≥ config.alert_threshold)

21.4 Bulkhead Isolation: Partitioning Resources to Prevent Cross-Concern Failure Propagation

21.4.1 Bulkhead Principle

Borrowed from naval engineering, the bulkhead pattern partitions system resources into isolated compartments such that failure in one compartment cannot drain resources from another.

Formally, let the system's resource pool $\mathcal{R}$ (thread pools, connection pools, memory, token budgets) be partitioned:

$$\mathcal{R} = \bigsqcup_{i=1}^{k} \mathcal{R}_i, \quad \mathcal{R}_i \cap \mathcal{R}_j = \varnothing \; \forall i \neq j$$

Each partition $\mathcal{R}_i$ has a hard capacity limit $C_i$:

$$\text{usage}(\mathcal{R}_i, t) \leq C_i \quad \forall t$$

A resource-exhaustion failure in partition $i$ does not affect partitions $j \neq i$:

$$\text{usage}(\mathcal{R}_i, t) = C_i \;\not\Rightarrow\; \text{degradation}(\mathcal{R}_j) \quad \forall j \neq i$$

21.4.2 Bulkhead Dimensions for Agentic Systems

| Dimension | Partition By | Rationale |
|---|---|---|
| Tool invocation pools | Per-tool or per-tool-class | Slow tool cannot exhaust pool used by fast tools |
| LLM inference queues | Per-priority-tier | Low-priority background tasks cannot block high-priority user requests |
| Retrieval connections | Per-source (vector DB, graph DB, cache) | One slow source cannot block others |
| Agent execution slots | Per-session or per-user | One user's runaway agent cannot consume all compute |
| Token budget pools | Per-session, per-task | One expensive task cannot drain organization budget |
| Memory allocation | Per-session working memory | One session's large context cannot cause OOM for others |

21.4.3 Bulkhead Sizing

The capacity of each bulkhead is determined by:

$$C_i = \max\left(C_{\min}, \; \left\lfloor \frac{w_i \cdot C_{\text{total}}}{\sum_j w_j} \right\rfloor \right)$$

where $w_i$ is the weight assigned based on priority and expected demand, $C_{\text{total}}$ is the total resource pool, and $C_{\min}$ guarantees a minimum viable allocation per partition.

The utilization-adjusted sizing dynamically reallocates unused capacity:

$$C_i^{\text{effective}}(t) = C_i + \alpha \cdot \sum_{j \neq i} \max\left(0, C_j - \text{usage}(\mathcal{R}_j, t)\right)$$

where $\alpha \in [0, 1)$ is the borrowing fraction that controls how much slack from other partitions can be temporarily utilized. Setting $\alpha = 0$ provides strict isolation; $\alpha > 0$ provides elastic isolation with guaranteed minimums.

21.4.4 Pseudo-Algorithm: Bulkhead Resource Manager

ALGORITHM BulkheadResourceManager
  INPUT:
    partitions: Map<PartitionID, BulkheadConfig>
    total_capacity: Int
 
  STATE:
    allocations: Map<PartitionID, Semaphore>
    usage_counters: Map<PartitionID, AtomicInt>
 
  METHOD initialize():
    FOR EACH (id, config) IN partitions DO
      capacity ← MAX(config.C_min,
                      FLOOR(config.weight * total_capacity / total_weight))
      allocations[id] ← Semaphore(capacity)
      usage_counters[id] ← AtomicInt(0)
    END FOR
 
  METHOD acquire(partition_id, timeout):
    semaphore ← allocations[partition_id]
 
    // Try primary allocation
    acquired ← semaphore.try_acquire(timeout=timeout)
 
    IF acquired THEN
      usage_counters[partition_id].increment()
      EMIT_METRIC("bulkhead.acquired", partition_id)
      RETURN AcquisitionToken(partition_id, PRIMARY)
    END IF
 
    // Try borrowing from slack partitions
    IF config.alpha > 0 THEN
      FOR EACH (other_id, other_sem) IN allocations DO
        IF other_id = partition_id THEN CONTINUE END IF
        slack ← other_sem.available_permits()
        IF slack > config.borrow_min_slack THEN
          borrowed ← other_sem.try_acquire(timeout=0)
          IF borrowed THEN
            usage_counters[partition_id].increment()
            EMIT_METRIC("bulkhead.borrowed", {
              partition: partition_id,
              from: other_id
            })
            RETURN AcquisitionToken(partition_id, BORROWED, donor=other_id)
          END IF
        END IF
      END FOR
    END IF
 
    // Allocation failed
    EMIT_METRIC("bulkhead.rejected", partition_id)
    RAISE BulkheadFullError(partition_id, usage=usage_counters[partition_id].get())
 
  METHOD release(token):
    MATCH token.type:
      PRIMARY:
        allocations[token.partition_id].release()
      BORROWED:
        allocations[token.donor].release()
    usage_counters[token.partition_id].decrement()
    EMIT_METRIC("bulkhead.released", token.partition_id)

21.5 Timeout Engineering: Deadline Propagation, Cascading Timeout Budgets, and Deadline-Aware Scheduling

21.5.1 The Timeout Problem in Agentic Systems

Agentic systems compose multiple asynchronous operations: LLM inference, tool invocations, retrieval queries, human approvals. Without disciplined timeout engineering, a single slow operation blocks the entire agent loop indefinitely, consuming resources and violating latency SLAs.

21.5.2 Deadline Propagation

Every request entering the system carries a deadline $D$:

$$D = t_{\text{request}} + T_{\text{SLA}}$$

As the request flows through the call chain, each component consumes time and propagates a remaining deadline:

$$D_{\text{downstream}} = D - T_{\text{consumed\_so\_far}} - T_{\text{overhead\_budget}}$$

where $T_{\text{overhead\_budget}}$ reserves time for post-processing, serialization, and response transmission.

The effective timeout for a downstream call at depth $d$ is:

$$\tau_d = D - t_{\text{current}} - \sum_{j=d+1}^{n} \hat{T}_j^{\min}$$

where $\hat{T}_j^{\min}$ is the estimated minimum time for all remaining downstream steps. This ensures that even if the current call uses its full timeout, sufficient time remains for subsequent steps.

21.5.3 Cascading Timeout Budget Allocation

For a sequential chain of $n$ operations, the total budget is:

$$T_{\text{SLA}} = \sum_{i=1}^{n} \tau_i + T_{\text{overhead}}$$

The optimal allocation minimizes the probability of timeout given per-operation latency distributions $L_i$:

$$\min_{\tau_1, \ldots, \tau_n} \sum_{i=1}^{n} P(L_i > \tau_i) \quad \text{s.t.} \quad \sum_{i=1}^{n} \tau_i \leq T_{\text{SLA}} - T_{\text{overhead}}, \quad \tau_i \geq \tau_i^{\min} \; \forall i$$

For operations with exponentially distributed latencies $L_i \sim \text{Exp}(\mu_i)$:

$$P(L_i > \tau_i) = e^{-\mu_i \tau_i}$$

The Lagrangian yields:

$$\tau_i^* = \frac{1}{\mu_i} \ln\left(\frac{\mu_i}{\lambda}\right)$$

where $\lambda$ is the Lagrange multiplier determined by the budget constraint. Intuitively, more budget is allocated to operations with higher latency variance (smaller $\mu_i$).
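The multiplier $\lambda$ has no closed form in general, but since the total allocation is monotonically decreasing in $\lambda$, it can be found numerically. A sketch that bisects on $\lambda$ (the $\tau_i^{\min}$ floors are ignored for brevity; negative allocations are clamped to zero):

```python
import math

def allocate_timeouts(mus: list[float], budget: float, iters: int = 200) -> list[float]:
    """Split a latency budget across sequential steps with exponential latencies
    Exp(mu_i): tau_i = max(0, ln(mu_i / lam) / mu_i), with lam chosen by bisection
    so that sum(tau_i) equals the budget."""
    def taus(lam: float) -> list[float]:
        return [max(0.0, math.log(m / lam) / m) for m in mus]

    # sum(taus(lam)) is decreasing in lam: huge for lam -> 0, zero at lam = max(mus)
    lo, hi = 1e-12, max(mus)
    for _ in range(iters):
        mid = math.sqrt(lo * hi)  # geometric bisection handles the wide range
        if sum(taus(mid)) > budget:
            lo = mid
        else:
            hi = mid
    return taus(hi)

# Two steps sharing a 3-second budget: a fast one (mean 0.1 s, mu = 10)
# and a slow one (mean 1 s, mu = 1). The slow step receives most of the budget.
alloc = allocate_timeouts([10.0, 1.0], budget=3.0)
```

For this example the solver converges to roughly $\tau_1 \approx 0.48$ s and $\tau_2 \approx 2.52$ s, matching the intuition that high-variance steps need the larger share.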

21.5.4 Deadline-Aware Scheduling

The agent loop scheduler must be deadline-aware: actions approaching their deadline receive scheduling priority:

$$\text{Priority}(a) = \frac{1}{D(a) - t_{\text{current}} - \hat{T}_{\text{remaining}}(a)}$$

Actions with $D(a) - t_{\text{current}} < \hat{T}_{\text{remaining}}(a)$ are infeasible and should be preemptively cancelled:

$$\text{Cancel}(a) \Leftrightarrow D(a) - t_{\text{current}} < \hat{T}_{\text{remaining}}(a)$$

21.5.5 Pseudo-Algorithm: Deadline-Propagating Invocation

ALGORITHM DeadlinePropagatingInvoke
  INPUT:
    operation: Operation
    deadline: Timestamp
    remaining_steps: List<OperationSpec>
 
  OUTPUT:
    result: OperationResult
 
  BEGIN:
    // Calculate time needed for remaining steps
    T_remaining_minimum ← SUM(step.min_latency FOR step IN remaining_steps)
    T_overhead ← config.response_overhead
 
    // Available time for this operation
    T_available ← deadline - NOW() - T_remaining_minimum - T_overhead
 
    IF T_available ≤ 0 THEN
      RAISE DeadlineExceededError(
        "insufficient_time_for_operation",
        deficit=ABS(T_available)
      )
    END IF
 
    // Set operation timeout
    operation_timeout ← MIN(T_available, operation.spec.max_timeout)
 
    IF operation_timeout < operation.spec.min_viable_timeout THEN
      // Not enough time for meaningful execution
      RETURN DEGRADE_OR_SKIP(operation, deadline)
    END IF
 
    // Execute with propagated deadline
    downstream_deadline ← NOW() + operation_timeout
 
    TRY:
      result ← operation.execute(
        timeout=operation_timeout,
        propagated_deadline=downstream_deadline
      )
      RETURN result
 
    CATCH TimeoutException:
      EMIT_METRIC("deadline.timeout", {
        operation: operation.name,
        allocated: operation_timeout,
        deadline_remaining: deadline - NOW()
      })
 
      // Decide whether to return partial result or propagate failure
      IF operation.supports_partial_result THEN
        RETURN operation.get_partial_result()
      ELSE
        RAISE TimeoutException(operation.name, allocated=operation_timeout)
      END IF
    END TRY
  END

21.6 Queue Isolation and Backpressure: Rate Limiting, Admission Control, and Load Shedding

21.6.1 Queue Architecture for Agentic Systems

Agentic workloads are bursty and heterogeneous. A code-generation task consumes 10× more tokens than a simple Q&A. Without queue isolation, heavy tasks starve light tasks.

Define a multi-queue architecture with $Q$ queues:

$$\mathcal{Q} = \{q_1, q_2, \ldots, q_Q\}$$

Each queue $q_i$ has:

$$q_i = \left(\text{priority}_i, \text{capacity}_i, \text{rate\_limit}_i, \text{admission\_policy}_i, \text{shedding\_policy}_i\right)$$

21.6.2 Rate Limiting

Rate limits are enforced using a token bucket algorithm:

$$\text{Bucket}(t) = \min\left(B_{\max}, \; \text{Bucket}(t - \Delta t) + r \cdot \Delta t\right)$$

where $B_{\max}$ is the bucket capacity (burst limit) and $r$ is the refill rate (sustained throughput limit).

A request of cost $c$ is admitted if:

$$\text{Bucket}(t) \geq c$$

After admission, the bucket is decremented:

$$\text{Bucket}(t) \leftarrow \text{Bucket}(t) - c$$

For agentic systems, cost $c$ is measured in token units rather than raw request count:

$$c(\text{request}) = \hat{T}_{\text{input}} + \hat{T}_{\text{output}}$$

This prevents a single high-token request from being treated equivalently to a simple health check.
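A token-cost bucket is a few lines of state. The sketch below is illustrative (injecting `now` keeps the refill arithmetic deterministic for testing; capacities are hypothetical):

```python
class TokenBucket:
    """Token-cost bucket: capacity B_max (burst limit) and refill rate r per
    second. Request cost is estimated in token units, not request count."""

    def __init__(self, capacity: float, refill_rate: float):
        self.capacity = capacity
        self.refill_rate = refill_rate
        self.level = capacity      # start full
        self.last_refill = 0.0

    def try_consume(self, cost: float, now: float) -> bool:
        """Admit the request and decrement the bucket if enough tokens remain."""
        # Refill proportionally to elapsed time, capped at capacity
        elapsed = now - self.last_refill
        self.level = min(self.capacity, self.level + self.refill_rate * elapsed)
        self.last_refill = now
        if self.level >= cost:
            self.level -= cost
            return True
        return False

bucket = TokenBucket(capacity=10_000, refill_rate=1_000)  # 10k-token burst, 1k tokens/s
assert bucket.try_consume(cost=8_000, now=0.0)        # large code-gen request admitted
assert not bucket.try_consume(cost=8_000, now=1.0)    # only 3k tokens available
assert bucket.try_consume(cost=50, now=1.0)           # cheap request still fits
```

The final two calls show exactly the property the text describes: under the same bucket state, an expensive request is throttled while a cheap one passes, because admission is costed in tokens rather than requests.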

21.6.3 Admission Control

Admission control decides whether to accept a new request based on current system load:

$$\text{Admit}(r, t) = \begin{cases} \texttt{ACCEPT} & \text{if } \text{Load}(t) < L_{\text{accept}} \\ \texttt{THROTTLE} & \text{if } L_{\text{accept}} \leq \text{Load}(t) < L_{\text{shed}} \\ \texttt{REJECT} & \text{if } \text{Load}(t) \geq L_{\text{shed}} \end{cases}$$

where:

$$\text{Load}(t) = w_q \cdot \frac{|\text{queued}(t)|}{\text{capacity}} + w_c \cdot \frac{\text{active\_tokens}(t)}{\text{token\_budget}} + w_l \cdot \frac{\bar{L}(t)}{L_{\text{SLA}}}$$

is a weighted composite load signal combining queue depth, token consumption, and observed latency.
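A compact numeric sketch of the composite load signal and the three-way admission decision; the weights and thresholds below are illustrative values, not ones the text prescribes:

```python
def composite_load(queued, capacity, active_tokens, token_budget,
                   p95_latency, latency_sla, w_q=0.4, w_c=0.4, w_l=0.2):
    # Load(t) = w_q * queue depth + w_c * token pressure + w_l * latency ratio
    return (w_q * queued / capacity
            + w_c * active_tokens / token_budget
            + w_l * p95_latency / latency_sla)

def admit(load, l_accept=0.7, l_shed=0.9):
    # Three-way decision from the Admit(r, t) case analysis
    if load < l_accept:
        return "ACCEPT"
    if load < l_shed:
        return "THROTTLE"
    return "REJECT"
```

For example, a half-full queue, 40% token utilization, and a 4 s P95 against a 10 s SLA yield a load of 0.44, comfortably inside the acceptance region.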

21.6.4 Load Shedding Strategies#

When admission control cannot prevent overload, load shedding drops requests to protect system stability:

| Strategy | Selection Criterion | Properties |
| --- | --- | --- |
| LIFO (Last In, First Out) | Drop newest requests | Preserves older, likely-more-invested requests |
| Priority-Based | Drop lowest-priority first | Protects business-critical workloads |
| Cost-Based | Drop most expensive requests | Maximizes throughput in request count |
| Deadline-Based | Drop requests closest to expiry | Drops requests unlikely to complete anyway |
| Random | Drop uniformly at random | Fair, prevents starvation, simple |

The optimal shedding policy maximizes aggregate value delivered:

\max \sum_{r \in \text{admitted}} V(r) \quad \text{s.t.} \quad \sum_{r \in \text{admitted}} c(r) \leq C_{\text{system}}

This is a knapsack problem; for online decision-making, the greedy approximation sorts requests by value density V(r)/c(r) and admits them in decreasing order until capacity is exhausted.
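The greedy approximation is a few lines; this sketch takes requests as `(id, value, cost)` tuples, a representation chosen here for illustration:

```python
def greedy_admit(requests, capacity):
    """Greedy approximation to the value-maximizing knapsack:
    sort by value density V(r)/c(r), admit until capacity is exhausted."""
    admitted, used = [], 0.0
    for rid, value, cost in sorted(requests, key=lambda r: r[1] / r[2],
                                   reverse=True):
        if used + cost <= capacity:
            admitted.append(rid)
            used += cost
    return admitted
```

The greedy order can skip a dense-but-large request in favor of several small ones, which is the usual (and here desirable) behavior of the density heuristic under load.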

21.6.5 Pseudo-Algorithm: Admission Controller with Load Shedding#

ALGORITHM AdmissionController
  INPUT:
    request: IncomingRequest
    queues: Map<Priority, Queue>
    system_state: SystemState
 
  OUTPUT:
    decision: ACCEPT | THROTTLE | REJECT
 
  BEGIN:
    // ─── Compute composite load ───
    queue_load ← system_state.total_queued / system_state.total_capacity
    token_load ← system_state.active_tokens / system_state.token_budget
    latency_load ← system_state.p95_latency / system_state.latency_sla
 
    load ← config.w_q * queue_load
          + config.w_c * token_load
          + config.w_l * latency_load
 
    // ─── Rate limit check ───
    estimated_cost ← ESTIMATE_TOKEN_COST(request)
    IF NOT TOKEN_BUCKET.try_consume(estimated_cost) THEN
      EMIT_METRIC("admission.rate_limited", request.priority)
      RETURN THROTTLE(retry_after=TOKEN_BUCKET.time_to_refill(estimated_cost))
    END IF
 
    // ─── Load-based admission ───
    IF load < config.L_accept THEN
      target_queue ← queues[request.priority]
 
      IF target_queue.size() < target_queue.capacity THEN
        target_queue.enqueue(request)
        EMIT_METRIC("admission.accepted", request.priority)
        RETURN ACCEPT
      ELSE
        // Queue full; try shedding lower-priority items
        IF SHED_LOWER_PRIORITY(target_queue, request) THEN
          target_queue.enqueue(request)
          RETURN ACCEPT
        ELSE
          RETURN THROTTLE(retry_after=config.throttle_delay)
        END IF
      END IF
 
    ELSE IF load < config.L_shed THEN
      // Throttle: accept only high-priority
      IF request.priority ≥ PRIORITY_HIGH THEN
        queues[request.priority].enqueue(request)
        EMIT_METRIC("admission.accepted_under_pressure", request.priority)
        RETURN ACCEPT
      ELSE
        EMIT_METRIC("admission.throttled", request.priority)
        RETURN THROTTLE(retry_after=config.throttle_delay)
      END IF
 
    ELSE
      // Critical load: shed
      IF request.priority = PRIORITY_CRITICAL THEN
        // Even critical requests enter only if queue has room
        IF queues[PRIORITY_CRITICAL].size() < queues[PRIORITY_CRITICAL].capacity THEN
          queues[PRIORITY_CRITICAL].enqueue(request)
          RETURN ACCEPT
        END IF
      END IF
 
      EMIT_METRIC("admission.shed", request.priority)
      RETURN REJECT(reason="system_overloaded", load=load)
    END IF
  END

21.7 Graceful Degradation Strategies#

21.7.1 Reduced-Capability Modes: Simpler Models, Cached Responses, and Partial Results#

Graceful degradation maintains service availability by reducing capability rather than failing entirely. The system defines a hierarchy of capability levels:

\text{CapabilityLevel} = \{C_0, C_1, C_2, C_3, C_4\}

ordered from full capability to minimal viable service:

| Level | Description | Operational Mode |
| --- | --- | --- |
| C_0 | Full | All features, primary model, real-time retrieval, full verification |
| C_1 | Reduced Verification | Primary model, retrieval, but skip adversarial critique and self-consistency |
| C_2 | Fallback Model | Smaller/faster model, basic verification, cached retrieval where possible |
| C_3 | Cache-First | Return cached or pre-computed responses; LLM only for cache misses |
| C_4 | Static Fallback | Return templated responses, documentation links, or "service degraded" messages |

The degradation trigger function maps system health to capability level:

\text{Level}(t) = \min\left\{C_i : H(t) \geq H_i^{\min}\right\}

i.e., the least-degraded level whose health requirement is met, where H(t) is the composite health score and H_i^{\min} is the minimum health required for level C_i:

H(t) = \min\left(\frac{A_{\text{model}}(t)}{A_{\text{target}}}, \; \frac{1}{\bar{L}(t) / L_{\text{SLA}}}, \; 1 - \phi_{\text{error}}(t), \; \frac{\text{budget\_remaining}(t)}{\text{budget\_threshold}}\right)
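As a concrete instance of the health-score composition, where the weakest dimension dominates through the outer min; the threshold values in `capability_level` are illustrative, not prescribed by the text:

```python
def health_score(model_availability, a_target, p95_latency, l_sla,
                 error_rate, budget_remaining, budget_threshold):
    # H(t): min over availability ratio, inverse latency ratio,
    # error-rate complement, and budget ratio, clamped to [0, 1]
    h = min(model_availability / a_target,
            l_sla / max(p95_latency, 1e-9),   # equals 1 / (L / L_SLA)
            1.0 - error_rate,
            budget_remaining / budget_threshold)
    return max(0.0, min(1.0, h))

def capability_level(h, thresholds=(0.9, 0.75, 0.5, 0.25)):
    """Return the least-degraded level C_i whose minimum health is met
    (illustrative thresholds for C_0..C_3; below all of them -> C_4)."""
    for level, h_min in enumerate(thresholds):
        if h >= h_min:
            return f"C_{level}"
    return "C_4"
```

A system with full availability and budget but a 2% error rate scores H = 0.98 and stays at C_0; drop H to 0.6 and it lands at C_2.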

21.7.2 Feature Flags for Progressive Agent Capability Reduction#

Each degradation level activates or deactivates specific feature flags:

| Feature Flag | C_0 | C_1 | C_2 | C_3 | C_4 |
| --- | --- | --- | --- | --- | --- |
| primary_model | ✓ | ✓ | ✗ | △ | ✗ |
| fallback_model | ✗ | ✗ | ✓ | △ | ✗ |
| real_time_retrieval | ✓ | ✓ | △ | ✗ | ✗ |
| cached_retrieval | ✓ | ✓ | ✓ | ✓ | ✗ |
| self_consistency | ✓ | ✗ | ✗ | ✗ | ✗ |
| adversarial_critique | ✓ | ✗ | ✗ | ✗ | ✗ |
| rubric_verification | ✓ | ✓ | △ | ✗ | ✗ |
| tool_invocation | ✓ | ✓ | ✓ | ✗ | ✗ |
| human_escalation | ✓ | ✓ | ✓ | ✓ | ✓ |

(✓ = enabled, ✗ = disabled, △ = simplified version)

Feature flags are evaluated at every phase boundary of the agent loop, allowing mid-execution degradation if system health declines during processing.

21.7.3 User-Facing Degradation Communication: Transparent Status and ETA#

Users must be transparently informed of degraded operation:

\text{DegradationNotice} = \left(\text{level}, \text{affected\_capabilities}, \text{estimated\_impact}, \text{eta\_normal}, \text{alternatives}\right)

The system computes an estimated time to recovery (ETR) based on historical degradation durations:

\text{ETR} = \text{Percentile}_{75}\left(\{T_{\text{recovery}} : \text{similar past incidents}\}\right)

21.7.4 Pseudo-Algorithm: Graceful Degradation Controller#
ALGORITHM GracefulDegradationController
  INPUT:
    system_health: HealthMetrics
    config: DegradationConfig
 
  OUTPUT:
    level: CapabilityLevel
    feature_flags: Map<FeatureFlag, Bool>
 
  STATE:
    current_level ← C_0
    level_entry_time ← NOW()
 
  METHOD evaluate():
    // Compute composite health score
    H ← MIN(
      system_health.model_availability / config.A_target,
      1.0 / MAX(system_health.p95_latency / config.L_sla, 0.01),
      1.0 - system_health.error_rate,
      system_health.budget_remaining / config.budget_threshold
    )
    H ← CLAMP(H, 0.0, 1.0)
 
    // Determine target level
    target_level ← C_4  // Default to most degraded
    FOR EACH level IN [C_0, C_1, C_2, C_3] DO  // Check best to worst
      IF H ≥ config.health_thresholds[level] THEN
        target_level ← level
        BREAK
      END IF
    END FOR
 
    // Hysteresis: prevent flapping
    IF target_level > current_level THEN
      // Degrading: apply immediately
      current_level ← target_level
      level_entry_time ← NOW()
      EMIT_ALERT("degradation.level_changed", current_level)
    ELSE IF target_level < current_level THEN
      // Recovering: require stability period
      IF NOW() - level_entry_time > config.stability_period THEN
        current_level ← target_level
        level_entry_time ← NOW()
        EMIT_TRACE("degradation.level_improved", current_level)
      END IF
      // Else: wait for stability confirmation
    END IF
 
    // Resolve feature flags for current level
    feature_flags ← config.flag_matrix[current_level]
 
    RETURN (current_level, feature_flags)
  END

21.8 Compensating Transactions: Undo, Rollback, and Saga Coordination for Multi-Step Agent Actions#

21.8.1 The Compensation Problem#

Agent loops execute multi-step plans where each step may mutate external state. When step k fails after steps 1, \ldots, k-1 have committed, the system cannot atomically roll back. Compensating transactions provide eventual consistency by executing semantic inverses of completed steps.

21.8.2 Saga Pattern for Agent Actions#

A saga \mathcal{S} is a sequence of transactions with compensating counterparts:

\mathcal{S} = \langle (T_1, C_1), (T_2, C_2), \ldots, (T_n, C_n) \rangle

where T_i is the forward transaction and C_i is its compensating transaction.

Forward execution: T_1 \rightarrow T_2 \rightarrow \ldots \rightarrow T_n

Backward recovery after failure at T_k: C_{k-1} \rightarrow C_{k-2} \rightarrow \ldots \rightarrow C_1

The saga invariant requires:

\forall i: \text{Apply}(C_i, \text{Apply}(T_i, s)) \approx s

where \approx denotes semantic equivalence (exact state reversal may be impossible for irreversible operations).
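The forward/backward structure can be sketched in a few lines. This is a deliberately minimal version, omitting the WAL persistence and idempotency keys that the full coordinator in §21.8.5 requires; steps here mutate a shared dict for illustration:

```python
def run_saga(steps, state):
    """Execute (transaction, compensation) pairs in order; on failure,
    run the compensations of committed steps in reverse order.
    A compensation of None marks an irreversible step."""
    committed = []
    try:
        for forward, compensate in steps:
            forward(state)
            committed.append(compensate)
    except Exception:
        for compensate in reversed(committed):
            if compensate is not None:
                compensate(state)
        return "COMPENSATED"
    return "COMPLETED"
```

A file create/delete pair is the canonical exactly-reversible example: if a later step raises, the delete restores the pre-saga state.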

21.8.3 Compensation Classification#

| Action Type | Compensation Type | Example |
| --- | --- | --- |
| Exactly Reversible | Exact inverse | File create → file delete |
| Approximately Reversible | Best-effort undo | Code commit → revert commit (history preserved) |
| Compensable Only | Counter-action, not undo | Sent notification → send correction notification |
| Irreversible | No compensation possible | External API call with permanent effect |

For irreversible actions, the system must do one of the following:

  1. Gate execution with human approval before the irreversible step.
  2. Accept the risk and document it in the saga's compensation plan.
  3. Simulate the action in a sandbox before committing to production.

21.8.4 Saga Coordinator State Machine#

The saga coordinator tracks the execution state:

\text{SagaState} = \left(\text{saga\_id}, \text{steps}: \text{List}\langle\text{StepRecord}\rangle, \text{phase}: \text{FORWARD} \mid \text{COMPENSATING} \mid \text{COMPLETED} \mid \text{FAILED}\right)

Each StepRecord:

\text{StepRecord} = (\text{step\_id}, \text{status}: \text{PENDING} \mid \text{COMMITTED} \mid \text{COMPENSATED} \mid \text{COMPENSATION\_FAILED}, \text{result}, \text{idempotency\_key})

21.8.5 Pseudo-Algorithm: Saga Coordinator#

ALGORITHM SagaCoordinator
  INPUT:
    saga: SagaDefinition            // List of (transaction, compensation) pairs
    context: ExecutionContext
 
  OUTPUT:
    saga_result: SagaResult
 
  STATE:
    step_records ← []
    phase ← FORWARD
 
  BEGIN:
    // ─── Forward Execution ───
    FOR i ← 0 TO LEN(saga.steps) - 1 DO
      (transaction, compensation) ← saga.steps[i]
      idempotency_key ← DERIVE_KEY(saga.id, i)
 
      record ← StepRecord(
        step_id=i,
        status=PENDING,
        idempotency_key=idempotency_key
      )
 
      TRY:
        // Pre-flight check for irreversible actions
        IF transaction.reversibility = IRREVERSIBLE THEN
          IF NOT context.has_approval(transaction) THEN
            approval ← REQUEST_HUMAN_APPROVAL(transaction)
            IF NOT approval.granted THEN
              RAISE SagaAbortedError("approval_denied", step=i)
            END IF
          END IF
        END IF
 
        result ← IDEMPOTENT_EXECUTE(transaction, idempotency_key)
        record.status ← COMMITTED
        record.result ← result
        APPEND step_records, record
 
        // Persist saga state (WAL)
        PERSIST_SAGA_STATE(saga.id, step_records, phase=FORWARD)
 
      CATCH error:
        record.status ← FAILED
        record.error ← error
        APPEND step_records, record
        PERSIST_SAGA_STATE(saga.id, step_records, phase=FORWARD)
 
        // Enter compensation phase
        phase ← COMPENSATING
        BREAK
      END TRY
    END FOR
 
    IF phase = FORWARD THEN
      // All steps succeeded
      PERSIST_SAGA_STATE(saga.id, step_records, phase=COMPLETED)
      RETURN SagaResult(status=COMPLETED, steps=step_records)
    END IF
 
    // ─── Compensation Phase ───
    compensation_failures ← []
 
    FOR i ← LEN(step_records) - 2 DOWNTO 0 DO
      // Compensate all committed steps before the failed one
      record ← step_records[i]
 
      IF record.status ≠ COMMITTED THEN
        CONTINUE
      END IF
 
      (_, compensation) ← saga.steps[i]
      comp_key ← DERIVE_KEY(saga.id, i, "compensate")
 
      IF compensation IS NULL THEN
        // No compensation defined (irreversible)
        APPEND compensation_failures, CompensationFailure(
          step=i,
          reason="no_compensation_defined"
        )
        CONTINUE
      END IF
 
      TRY:
        IDEMPOTENT_EXECUTE(compensation, comp_key)
        record.status ← COMPENSATED
        PERSIST_SAGA_STATE(saga.id, step_records, phase=COMPENSATING)
 
      CATCH comp_error:
        record.status ← COMPENSATION_FAILED
        record.compensation_error ← comp_error
        APPEND compensation_failures, CompensationFailure(
          step=i,
          error=comp_error
        )
        PERSIST_SAGA_STATE(saga.id, step_records, phase=COMPENSATING)
 
        // Compensation failure is critical: alert operator
        ALERT_OPERATOR("saga_compensation_failed", {
          saga_id: saga.id,
          step: i,
          error: comp_error
        })
      END TRY
    END FOR
 
    final_status ← IF LEN(compensation_failures) = 0 THEN COMPENSATED
                    ELSE COMPENSATION_INCOMPLETE
 
    PERSIST_SAGA_STATE(saga.id, step_records, phase=final_status)
 
    RETURN SagaResult(
      status=final_status,
      steps=step_records,
      compensation_failures=compensation_failures
    )
  END

21.8.6 Saga Consistency Guarantees#

The saga pattern provides eventual consistency, not ACID atomicity. The guarantees are:

\text{Saga Guarantee}: \quad \text{Either all } T_i \text{ commit, OR all committed } T_i \text{ are compensated by } C_i

The caveat is that compensation may itself fail, requiring a further compensating action or human intervention. The saga coordinator must itself be crash-recoverable via the WAL, which is why saga state is persisted after every step.


21.9 Crash Recovery: Checkpointed State, Write-Ahead Logs, and Deterministic Replay#

21.9.1 Crash Recovery Architecture#

Crash recovery ensures that the agent system can resume from a consistent state after any unplanned termination. The architecture combines three mechanisms:

\text{Recovery} = \text{Checkpoint} \oplus \text{WAL Replay} \oplus \text{Deterministic Re-execution}

21.9.2 Write-Ahead Log (WAL) Specification#

The WAL is an append-only, durably persisted log of all state mutations:

\text{WAL} = \langle e_1, e_2, \ldots, e_N \rangle

Each entry:

e_k = (\text{seq}, \text{session\_id}, \text{op\_type}, \text{before\_image}, \text{after\_image}, \text{timestamp}, \text{idempotency\_key}, \text{checksum})

Durability guarantee: e_k is fsync'd to persistent storage before the mutation is applied to in-memory state.

Integrity: Each entry's checksum covers the previous entry's checksum, forming a hash chain:

\text{checksum}(e_k) = \text{SHA256}(\text{checksum}(e_{k-1}) \| \text{payload}(e_k))
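The hash chain is straightforward to implement and to verify during replay; a minimal sketch, assuming a fixed all-zero genesis checksum (a convention chosen here, not specified in the text):

```python
import hashlib

GENESIS = b"\x00" * 32  # checksum "before" the first entry (assumption)

def chain_checksum(prev_checksum: bytes, payload: bytes) -> bytes:
    # checksum(e_k) = SHA256(checksum(e_{k-1}) || payload(e_k))
    return hashlib.sha256(prev_checksum + payload).digest()

def append_entry(wal: list, payload: bytes) -> None:
    prev = wal[-1][1] if wal else GENESIS
    wal.append((payload, chain_checksum(prev, payload)))

def verify_chain(wal: list) -> int:
    """Return the count of entries that verify; replay truncates at the
    first break, matching TRUNCATE_AT_BREAK in the recovery protocol."""
    prev = GENESIS
    for i, (payload, checksum) in enumerate(wal):
        if chain_checksum(prev, payload) != checksum:
            return i
        prev = checksum
    return len(wal)
```

Because each checksum folds in its predecessor, tampering with any entry invalidates that entry and everything after it, so recovery can trust exactly the verified prefix.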

21.9.3 Recovery Protocol#

ALGORITHM CrashRecovery
  INPUT:
    wal: WriteAheadLog
    checkpoint_store: CheckpointStore
    session_registry: SessionRegistry
 
  OUTPUT:
    recovered_sessions: List<Session>
 
  BEGIN:
    recovered ← []
 
    // ─── Step 1: Identify sessions needing recovery ───
    active_sessions ← session_registry.sessions_in_state(
      {ACTIVE, RESUMED, DISPATCHED}
    )
 
    FOR EACH session_id IN active_sessions DO
      // ─── Step 2: Find latest checkpoint ───
      checkpoint ← checkpoint_store.latest_consistent(session_id)
 
      IF checkpoint IS NULL THEN
        WARN("no_checkpoint_found", session_id)
        MARK_SESSION_FAILED(session_id, "no_recovery_checkpoint")
        CONTINUE
      END IF
 
      // ─── Step 3: Validate checkpoint integrity ───
      IF NOT VERIFY_CHECKSUM(checkpoint) THEN
        // Try previous checkpoint
        checkpoint ← checkpoint_store.latest_consistent_before(
          session_id, checkpoint.seq
        )
        IF checkpoint IS NULL THEN
          MARK_SESSION_FAILED(session_id, "all_checkpoints_corrupt")
          CONTINUE
        END IF
      END IF
 
      // ─── Step 4: Replay WAL from checkpoint ───
      state ← DESERIALIZE(checkpoint.state)
      wal_entries ← wal.entries_after(session_id, checkpoint.wal_seq)
 
      // Verify WAL chain integrity
      IF NOT VERIFY_WAL_CHAIN(wal_entries) THEN
        WARN("wal_chain_broken", session_id)
        // Recover up to the break point
        wal_entries ← TRUNCATE_AT_BREAK(wal_entries)
      END IF
 
      FOR EACH entry IN wal_entries DO
        MATCH entry.op_type:
 
          STATE_MUTATION:
            state ← APPLY_MUTATION(state, entry)
 
          TOOL_INVOCATION_INTENT:
            // Check if completed
            completion ← FIND_COMPLETION(wal_entries, entry.idempotency_key)
            IF completion IS NOT NULL THEN
              state ← APPLY_COMPLETION(state, completion)
            ELSE
              // In-flight at crash: verify via idempotency key
              external_result ← CHECK_IDEMPOTENT_RESULT(
                entry.tool, entry.idempotency_key
              )
              IF external_result IS NOT NULL THEN
                state ← APPLY_COMPLETION(state, external_result)
                wal.append(COMPLETION_ENTRY(entry, external_result))
              ELSE
                // Action did not execute or result unknown
                state ← MARK_ACTION_PENDING(state, entry)
              END IF
            END IF
 
          SAGA_STEP:
            // Handled by saga coordinator recovery
            SAGA_RECOVERY_QUEUE.enqueue(entry)
 
      END FOR
 
      // ─── Step 5: Reconstruct session ───
      session ← RECONSTRUCT_SESSION(session_id, state)
      session.lifecycle_phase ← RESUMED
 
      APPEND recovered, session
 
      EMIT_METRIC("crash_recovery.session_recovered", {
        session_id: session_id,
        checkpoint_seq: checkpoint.seq,
        wal_entries_replayed: LEN(wal_entries),
        pending_actions: COUNT_PENDING(state)
      })
    END FOR
 
    // ─── Step 6: Recover sagas ───
    FOR EACH saga_entry IN SAGA_RECOVERY_QUEUE DO
      SAGA_COORDINATOR.recover(saga_entry)
    END FOR
 
    RETURN recovered
  END

21.9.4 Recovery Time Objective (RTO) Analysis#

The total recovery time is:

T_{\text{RTO}} = T_{\text{detection}} + T_{\text{checkpoint\_load}} + T_{\text{wal\_replay}} + T_{\text{state\_reconstruction}} + T_{\text{tool\_rebind}}

where:

T_{\text{wal\_replay}} = \sum_{k=1}^{N_{\text{entries}}} t_{\text{replay}}(e_k)

Minimizing N_{\text{entries}} (through frequent checkpointing) directly reduces RTO:

N_{\text{entries}} \leq \frac{T_{\text{since\_last\_checkpoint}}}{\bar{t}_{\text{between\_entries}}}

The checkpoint frequency is therefore an RTO–I/O cost trade-off: the expected replay work after a crash grows with the interval \Delta T, while the checkpoint I/O cost is amortized over it:

\Delta T^* = \arg\min_{\Delta T} \left(\alpha \cdot \Delta T + \beta \cdot \frac{C_{\text{checkpoint}}}{\Delta T}\right)

yielding:

\Delta T^* = \sqrt{\frac{\beta \cdot C_{\text{checkpoint}}}{\alpha}}
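The square-root law is a one-liner; α weights expected replay time (which grows with the interval) and β weights checkpoint I/O cost (amortized over it):

```python
import math

def optimal_checkpoint_interval(alpha: float, beta: float,
                                c_checkpoint: float) -> float:
    """Minimize alpha * dT + beta * C_checkpoint / dT over dT > 0;
    the stationary point is dT* = sqrt(beta * C_checkpoint / alpha)."""
    return math.sqrt(beta * c_checkpoint / alpha)
```

With equal weights and a checkpoint cost of 900 (in consistent time units), the optimum lands at an interval of 30; intervals on either side cost strictly more.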

21.10 Chaos Engineering for Agents: Fault Injection, Latency Injection, and Resource Starvation Testing#

21.10.1 Chaos Engineering Principles for Agentic Systems#

Chaos engineering proactively identifies failure modes by intentionally injecting faults into the system under controlled conditions. For agentic systems, this extends beyond infrastructure chaos to include semantic chaos: injecting hallucinations, degraded retrieval, contradictory evidence, and adversarial tool responses.

21.10.2 Fault Injection Taxonomy#

| Injection Type | Target | Mechanism | Purpose |
| --- | --- | --- | --- |
| Network Partition | Inter-service communication | Drop/delay packets between agent and tool servers | Test circuit breaker and retry behavior |
| Latency Injection | Any RPC call | Add configurable delay to tool invocations, retrieval, or LLM calls | Test deadline propagation and timeout handling |
| Error Injection | Tool responses | Return error codes for a configured fraction of requests | Test retry budgets and error classification |
| Resource Starvation | Token budgets, memory, CPU | Artificially constrain available resources | Test graceful degradation and load shedding |
| State Corruption | Session state, cache | Inject invalid data into cache or corrupt checkpoints | Test integrity verification and fallback |
| Model Degradation | LLM inference | Route to a deliberately poor model or inject noise | Test quality gates and repair loops |
| Tool Unavailability | MCP tool servers | Shut down tool servers or return 503 | Test tool substitution and degraded mode |
| Concurrent Conflict | Shared state | Inject conflicting concurrent mutations | Test isolation and conflict resolution |

21.10.3 Experiment Specification#

A chaos experiment is formally specified as:

\mathcal{E} = (\text{hypothesis}, \text{scope}, \text{injection}, \text{duration}, \text{blast\_radius\_limit}, \text{abort\_conditions}, \text{metrics})

where:

  • Hypothesis: "The system maintains P_{99} latency < L_{\text{SLA}} when 30% of retrieval requests fail."
  • Scope: Specific service, region, or traffic percentage.
  • Injection: Fault type and parameters.
  • Duration: Fixed time window or event count.
  • Blast radius limit: Maximum percentage of traffic affected.
  • Abort conditions: Safety criteria that trigger immediate experiment termination.
  • Metrics: Observable quantities measured during the experiment.

21.10.4 Chaos Experiment Safety Envelope#

The experiment operates within a safety envelope:

\text{Safe}(\mathcal{E}, t) \Leftrightarrow \bigwedge \begin{cases} A(t) \geq A_{\min} \\ \phi_{\text{error}}(t) \leq \phi_{\max} \\ L_{P99}(t) \leq L_{\max} \\ \text{blast\_radius}(t) \leq B_{\max} \\ \text{no\_data\_loss}(t) \\ \text{no\_user\_impact\_beyond\_scope}(t) \end{cases}

If any condition is violated, the experiment is automatically aborted and all injections are removed.
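The envelope reduces to a conjunction of threshold checks. A minimal sketch; the metric field names and the dataclass shape are illustrative assumptions, not an interface the text defines:

```python
from dataclasses import dataclass

@dataclass
class SafetyEnvelope:
    a_min: float    # minimum availability A_min
    phi_max: float  # maximum error rate phi_max
    l_max: float    # maximum P99 latency L_max (seconds)
    b_max: float    # maximum blast radius B_max (fraction of traffic)

def is_safe(metrics: dict, env: SafetyEnvelope) -> bool:
    """True iff every envelope condition holds; any violation
    should trigger immediate experiment abort and injection removal."""
    return (metrics["availability"] >= env.a_min
            and metrics["error_rate"] <= env.phi_max
            and metrics["p99_latency"] <= env.l_max
            and metrics["blast_radius"] <= env.b_max
            and not metrics["data_loss"]
            and not metrics["out_of_scope_impact"])
```

This predicate is what the experiment runner's CHECK_SAFETY_ENVELOPE step would evaluate on each monitoring tick.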

21.10.5 Pseudo-Algorithm: Chaos Experiment Runner#

ALGORITHM ChaosExperimentRunner
  INPUT:
    experiment: ChaosExperiment
    system: SystemUnderTest
    safety_config: SafetyEnvelope
 
  OUTPUT:
    report: ChaosExperimentReport
 
  BEGIN:
    // ─── Pre-experiment validation ───
    baseline_metrics ← COLLECT_BASELINE_METRICS(system, duration=config.baseline_window)
 
    IF NOT VALIDATE_SYSTEM_HEALTHY(baseline_metrics) THEN
      RETURN ExperimentReport(status=ABORTED, reason="system_unhealthy_before_start")
    END IF
 
    // ─── Install fault injection ───
    injection_handle ← INSTALL_FAULT_INJECTION(experiment.injection, experiment.scope)
 
    experiment_start ← NOW()
    metrics_during ← []
    abort_triggered ← FALSE
 
    // ─── Monitor during experiment ───
    WHILE NOW() - experiment_start < experiment.duration DO
      SLEEP(config.monitor_interval)
 
      current_metrics ← COLLECT_METRICS(system)
      APPEND metrics_during, current_metrics
 
      // Safety envelope check
      IF NOT CHECK_SAFETY_ENVELOPE(current_metrics, safety_config) THEN
        EMIT_ALERT("chaos.safety_violation", experiment.id)
        abort_triggered ← TRUE
        BREAK
      END IF
 
      // Blast radius check
      IF MEASURE_BLAST_RADIUS(current_metrics) > experiment.blast_radius_limit THEN
        EMIT_ALERT("chaos.blast_radius_exceeded", experiment.id)
        abort_triggered ← TRUE
        BREAK
      END IF
    END WHILE
 
    // ─── Remove fault injection ───
    REMOVE_FAULT_INJECTION(injection_handle)
 
    // ─── Post-experiment recovery verification ───
    WAIT(config.recovery_observation_window)
    recovery_metrics ← COLLECT_METRICS(system)
 
    // ─── Analyze results ───
    hypothesis_validated ← EVALUATE_HYPOTHESIS(
      experiment.hypothesis,
      baseline_metrics,
      metrics_during,
      recovery_metrics
    )
 
    report ← ChaosExperimentReport(
      experiment_id=experiment.id,
      status=IF abort_triggered THEN ABORTED ELSE COMPLETED,
      hypothesis_validated=hypothesis_validated,
      baseline_metrics=baseline_metrics,
      during_metrics=AGGREGATE(metrics_during),
      recovery_metrics=recovery_metrics,
      abort_triggered=abort_triggered,
      abort_reason=IF abort_triggered THEN IDENTIFY_VIOLATION() ELSE NULL,
      duration_actual=NOW() - experiment_start,
      recommendations=GENERATE_RECOMMENDATIONS(
        hypothesis_validated, metrics_during, baseline_metrics
      )
    )
 
    PERSIST_EXPERIMENT_REPORT(report)
 
    // ─── Feed into evaluation infrastructure ───
    IF NOT hypothesis_validated THEN
      CREATE_IMPROVEMENT_TASK(report)
      UPDATE_RUNBOOK(experiment.injection.type, report)
    END IF
 
    RETURN report
  END

21.11 Operational Runbooks: Automated Incident Response, Escalation, and Post-Mortem Integration#

21.11.1 Runbook as Executable Policy#

An operational runbook is not a static document. It is an executable policy that maps observed symptoms to diagnostic steps, automated remediation actions, and escalation paths:

\text{Runbook} = \langle (\text{condition}_i, \text{actions}_i, \text{escalation}_i) \rangle_{i=1}^{m}

where each entry is a response rule:

\text{condition}_i: \text{MetricVector} \times \text{AlertContext} \rightarrow \{0, 1\}

\text{actions}_i: \text{List}\langle\text{RemediationAction}\rangle

\text{escalation}_i: \text{EscalationPolicy}

21.11.2 Automated Incident Response Pipeline#

ALGORITHM AutomatedIncidentResponse
  INPUT:
    alert: Alert
    runbook: Runbook
    system: SystemState
 
  OUTPUT:
    incident_record: IncidentRecord
 
  BEGIN:
    incident ← CREATE_INCIDENT(alert, severity=alert.severity)
 
    // ─── Step 1: Classify incident ───
    matched_rules ← []
    FOR EACH rule IN runbook.rules DO
      IF rule.condition(system.metrics, alert.context) THEN
        APPEND matched_rules, rule
      END IF
    END FOR
 
    IF LEN(matched_rules) = 0 THEN
      // Unknown incident type
      ESCALATE_TO_ONCALL(incident, reason="no_matching_runbook")
      RETURN incident
    END IF
 
    // Select highest-priority matching rule
    rule ← MAX(matched_rules, key=λr: r.priority)
 
    // ─── Step 2: Execute automated remediation ───
    FOR EACH action IN rule.actions DO
      IF action.requires_approval AND NOT AUTO_APPROVE(action, incident.severity) THEN
        approval ← REQUEST_APPROVAL(action, timeout=rule.approval_timeout)
        IF NOT approval.granted THEN
          CONTINUE  // Skip this action
        END IF
      END IF
 
      TRY:
        result ← EXECUTE_REMEDIATION(action, system)
        incident.add_action_record(action, result)
        EMIT_TRACE("incident.remediation_executed", action.name)
 
        // Check if remediation resolved the issue
        WAIT(rule.verification_delay)
        IF CHECK_RESOLUTION(alert, system) THEN
          incident.status ← RESOLVED
          incident.resolution ← action.name
          BREAK
        END IF
 
      CATCH error:
        incident.add_action_record(action, error)
        EMIT_ALERT("incident.remediation_failed", action.name, error)
      END TRY
    END FOR
 
    // ─── Step 3: Escalation if not resolved ───
    IF incident.status ≠ RESOLVED THEN
      MATCH rule.escalation.policy:
        HUMAN_ONCALL:
          PAGE_ONCALL(incident, rule.escalation.team)
        MANAGEMENT:
          NOTIFY_MANAGEMENT(incident)
        EXTERNAL_VENDOR:
          OPEN_SUPPORT_TICKET(incident, rule.escalation.vendor)
 
      incident.status ← ESCALATED
    END IF
 
    // ─── Step 4: Persist and schedule post-mortem ───
    PERSIST_INCIDENT(incident)
 
    IF incident.severity ≤ SEVERITY_HIGH THEN
      SCHEDULE_POST_MORTEM(incident, within=config.post_mortem_deadline)
    END IF
 
    RETURN incident
  END

21.11.3 Post-Mortem Integration#

Post-mortems produce actionable artifacts that feed back into the system:

\text{PostMortem} \rightarrow \begin{cases} \text{New runbook rules} & \text{(expand automated response)} \\ \text{New chaos experiments} & \text{(validate fixes)} \\ \text{New monitoring alerts} & \text{(improve detection)} \\ \text{Architecture changes} & \text{(eliminate root cause)} \\ \text{Evaluation tasks} & \text{(regression tests for CI/CD)} \end{cases}

Each post-mortem output is tracked as a work item with an owner, deadline, and verification criterion. The closure criterion for a post-mortem action item is:

\text{Closed}(\text{item}) \Leftrightarrow \text{Implemented}(\text{item}) \wedge \text{Tested}(\text{item}) \wedge \text{Verified\_In\_Production}(\text{item})

21.12 SLA Definition and Enforcement: Availability, Latency P50/P95/P99, Error Budget, and Burn Rate#

21.12.1 SLA Specification#

The Service Level Agreement (SLA) for an agentic platform is defined as a set of Service Level Objectives (SLOs), each with a Service Level Indicator (SLI) and a target:

\text{SLO}_i = (\text{SLI}_i, \text{target}_i, \text{window}_i, \text{consequence}_i)

21.12.2 Core SLOs for Agentic Systems#

| SLO | SLI Definition | Target | Window |
| --- | --- | --- | --- |
| Availability | successful_requests / total_requests | ≥ 99.9% | 30 days |
| Latency P50 | 50th percentile response time | ≤ 2 s | 30 days |
| Latency P95 | 95th percentile response time | ≤ 10 s | 30 days |
| Latency P99 | 99th percentile response time | ≤ 30 s | 30 days |
| Correctness | verified_correct_outputs / total_outputs | ≥ 95% | 30 days |
| Error Rate | errors / total_requests | ≤ 0.1% | 30 days |
| Throughput | Requests per second at target latency | ≥ T_min | 1 hour |

21.12.3 Error Budget#

The error budget is the allowed failure margin:

\text{ErrorBudget}_i = (1 - \text{target}_i) \times N_{\text{total\_requests}}(\text{window})

For 99.9% availability over 30 days with 1M requests:

\text{ErrorBudget} = (1 - 0.999) \times 10^6 = 1000 \text{ failed requests}

The remaining error budget at time t within the window:

B_{\text{remaining}}(t) = \text{ErrorBudget} - \text{failures\_so\_far}(t)

21.12.4 Burn Rate#

The burn rate measures how quickly the error budget is being consumed:

BurnRate(t)=observed_error_rate(t)1target\text{BurnRate}(t) = \frac{\text{observed\_error\_rate}(t)}{1 - \text{target}}

A burn rate of 1.0 means the budget is consumed at exactly the planned rate. A burn rate >1.0> 1.0 means the budget will be exhausted before the window ends.

Time to budget exhaustion at current burn rate:

Texhaust=Bremaining(t)BurnRate(t)×(1target)×λrequestsT_{\text{exhaust}} = \frac{B_{\text{remaining}}(t)}{\text{BurnRate}(t) \times (1 - \text{target}) \times \lambda_{\text{requests}}}

where $\lambda_{\text{requests}}$ is the current request rate.
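Putting the two formulas together, a minimal Python sketch (names illustrative; rates are per hour):

```python
def burn_rate(observed_error_rate: float, target: float) -> float:
    """How many times faster than planned the budget is being consumed."""
    return observed_error_rate / (1.0 - target)

def time_to_exhaustion(budget_remaining: float, rate: float,
                       target: float, request_rate: float) -> float:
    """Hours until the budget runs out if the current burn rate holds.

    rate * (1 - target) recovers the observed error fraction; multiplied
    by the request rate it gives failures per hour.
    """
    failures_per_hour = rate * (1.0 - target) * request_rate
    if failures_per_hour == 0:
        return float("inf")
    return budget_remaining / failures_per_hour

# 0.5% observed errors against a 99.9% target -> burn rate 5x;
# 500 failures of budget left at 1000 req/h -> roughly 100 hours remain
br = burn_rate(0.005, 0.999)
hours = time_to_exhaustion(500, br, 0.999, request_rate=1000)
```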

21.12.5 Multi-Window Burn Rate Alerts#

To balance between catching fast-burning incidents and slow degradation, use multi-window burn rate alerts:

| Alert Severity | Fast Window | Slow Window | Burn Rate Threshold | Budget Consumed |
| --- | --- | --- | --- | --- |
| Page (Critical) | 5 min | 1 hour | 14.4 | 2% in 1 hour |
| Page (High) | 30 min | 6 hours | 6 | 5% in 6 hours |
| Ticket (Medium) | 2 hours | 1 day | 3 | 10% in 1 day |
| Ticket (Low) | 6 hours | 3 days | 1 | 100% in 30 days |

An alert fires when the burn rate exceeds the threshold in both the fast and slow windows simultaneously, preventing false alarms from transient spikes.
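The dual-window check can be sketched in Python. The rule table mirrors the one above (thresholds assumed to match the enforcement tiers of §21.12.6; windows are in minutes):

```python
# (severity, fast_window_min, slow_window_min, burn_rate_threshold)
ALERT_RULES = [
    ("page_critical",  5,    60,   14.4),
    ("page_high",      30,   360,  6.0),
    ("ticket_medium",  120,  1440, 3.0),
    ("ticket_low",     360,  4320, 1.0),
]

def evaluate_alerts(burn_by_window: dict[int, float]) -> list[str]:
    """Return severities whose fast AND slow windows both exceed the
    threshold; a spike in only one window never fires."""
    fired = []
    for severity, fast, slow, threshold in ALERT_RULES:
        if (burn_by_window.get(fast, 0.0) > threshold
                and burn_by_window.get(slow, 0.0) > threshold):
            fired.append(severity)
    return fired

# A sustained 16x burn pages; the calmer long windows keep tickets quiet
fired = evaluate_alerts(
    {5: 20.0, 30: 10.0, 60: 16.0, 120: 2.0, 360: 7.0, 1440: 1.5, 4320: 0.5}
)
```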

21.12.6 SLA Enforcement Mechanism#

$$\text{Enforcement}(\text{BurnRate}, t) = \begin{cases} \texttt{NORMAL\_OPERATION} & \text{if } \text{BurnRate} < 1.0 \\ \texttt{INCREASED\_MONITORING} & \text{if } 1.0 \leq \text{BurnRate} < 3.0 \\ \texttt{FREEZE\_DEPLOYMENTS} & \text{if } 3.0 \leq \text{BurnRate} < 6.0 \\ \texttt{ACTIVATE\_DEGRADATION} & \text{if } 6.0 \leq \text{BurnRate} < 14.4 \\ \texttt{INCIDENT\_RESPONSE} & \text{if } \text{BurnRate} \geq 14.4 \end{cases}$$

21.12.7 Pseudo-Algorithm: SLA Monitor and Enforcer#

ALGORITHM SLAMonitorAndEnforcer
  INPUT:
    slos: List<SLO>
    metrics_stream: MetricsStream
    config: SLAConfig
 
  STATE:
    error_budgets: Map<SLO_ID, ErrorBudget>
    burn_rates: Map<SLO_ID, BurnRateTracker>
    enforcement_level: EnforcementLevel ← NORMAL_OPERATION
 
  // Runs continuously
  METHOD monitor():
    LOOP EVERY config.evaluation_interval DO
 
      FOR EACH slo IN slos DO
        // ─── Compute SLI ───
        sli_value ← COMPUTE_SLI(slo, metrics_stream, slo.window)
 
        // ─── Update error budget ───
        total_events ← metrics_stream.count(slo.window)
        budget_total ← (1 - slo.target) * total_events
        budget_consumed ← total_events * (1 - sli_value)
        budget_remaining ← budget_total - budget_consumed
 
        error_budgets[slo.id] ← ErrorBudget(
          total=budget_total,
          consumed=budget_consumed,
          remaining=budget_remaining,
          remaining_fraction=budget_remaining / MAX(budget_total, 1)
        )
 
        // ─── Compute burn rates across multiple windows ───
        FOR EACH (alert_level, fast_window, slow_window, threshold) IN config.burn_rate_windows DO
          burn_fast ← COMPUTE_BURN_RATE(slo, metrics_stream, fast_window)
          burn_slow ← COMPUTE_BURN_RATE(slo, metrics_stream, slow_window)
 
          IF burn_fast > threshold AND burn_slow > threshold THEN
            FIRE_ALERT(alert_level, slo, burn_fast, burn_slow, budget_remaining)
          END IF
        END FOR
 
        // ─── Emit dashboard metrics ───
        EMIT_METRIC("sla.sli", {slo: slo.id, value: sli_value})
        EMIT_METRIC("sla.error_budget_remaining",
                     {slo: slo.id, value: budget_remaining})
        EMIT_METRIC("sla.burn_rate_1h",
                     {slo: slo.id, value: COMPUTE_BURN_RATE(slo, metrics_stream, 1h)})
      END FOR

      // ─── Determine enforcement level from the worst burn rate across SLOs ───
      max_burn ← MAX(COMPUTE_BURN_RATE(s, metrics_stream, 1h) FOR s IN slos)

      new_level ← MATCH max_burn:
        < 1.0  → NORMAL_OPERATION
        < 3.0  → INCREASED_MONITORING
        < 6.0  → FREEZE_DEPLOYMENTS
        < 14.4 → ACTIVATE_DEGRADATION
        ≥ 14.4 → INCIDENT_RESPONSE

      IF new_level > enforcement_level THEN
        // Escalate immediately
        enforcement_level ← new_level
        EXECUTE_ENFORCEMENT(enforcement_level)
      ELSE IF new_level < enforcement_level THEN
        // De-escalate only after a stability period (hysteresis)
        IF STABLE_FOR(new_level, config.stability_window) THEN
          enforcement_level ← new_level
          EXECUTE_ENFORCEMENT(enforcement_level)
        END IF
      END IF
    END LOOP
 
  METHOD EXECUTE_ENFORCEMENT(level):
    MATCH level:
      NORMAL_OPERATION:
        SET_DEGRADATION_LEVEL(C_0)
        UNFREEZE_DEPLOYMENTS()
 
      INCREASED_MONITORING:
        INCREASE_MONITORING_FREQUENCY(2x)
        NOTIFY_TEAM("sla_burn_rate_elevated")
 
      FREEZE_DEPLOYMENTS:
        FREEZE_DEPLOYMENT_PIPELINE()
        NOTIFY_TEAM("deployments_frozen_sla_risk")
 
      ACTIVATE_DEGRADATION:
        SET_DEGRADATION_LEVEL(C_1)  // Reduce verification overhead
        FREEZE_DEPLOYMENT_PIPELINE()
        NOTIFY_ONCALL("degradation_activated")
 
      INCIDENT_RESPONSE:
        SET_DEGRADATION_LEVEL(C_2)  // Fallback model
        FREEZE_DEPLOYMENT_PIPELINE()
        PAGE_ONCALL("sla_critical_burn_rate")
        CREATE_INCIDENT(severity=CRITICAL)
  END
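The asymmetric escalate/de-escalate logic at the heart of the algorithm can be isolated in a compact Python sketch. Level names, the integer encoding, and the injectable clock are illustrative choices, not part of any specific framework:

```python
import time

# (upper_bound_exclusive, level); 0=NORMAL_OPERATION ... 4=INCIDENT_RESPONSE
THRESHOLDS = [(1.0, 0), (3.0, 1), (6.0, 2), (14.4, 3)]

def level_for(burn: float) -> int:
    """Map a burn rate onto the enforcement tiers of Section 21.12.6."""
    for bound, level in THRESHOLDS:
        if burn < bound:
            return level
    return 4  # INCIDENT_RESPONSE

class EnforcementState:
    """Escalate immediately; de-escalate only after a stability window."""

    def __init__(self, stability_window_s: float, clock=time.monotonic):
        self.level = 0
        self._stable_since = None
        self._window = stability_window_s
        self._clock = clock

    def update(self, burn: float) -> int:
        new = level_for(burn)
        if new > self.level:
            self.level = new            # escalate without delay
            self._stable_since = None
        elif new < self.level:
            now = self._clock()
            if self._stable_since is None:
                self._stable_since = now
            elif now - self._stable_since >= self._window:
                self.level = new        # de-escalate after sustained calm
                self._stable_since = None
        else:
            self._stable_since = None
        return self.level

# Simulated timeline via an injected clock (illustrative usage)
t = [0.0]
state = EnforcementState(stability_window_s=60, clock=lambda: t[0])
state.update(7.0)   # 7x burn -> jumps straight to ACTIVATE_DEGRADATION (3)
state.update(0.5)   # calm, but stability window not yet elapsed -> stays at 3
t[0] = 61.0
final = state.update(0.5)  # sustained calm -> de-escalates
```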

21.12.8 SLA Reporting and Accountability#

Monthly SLA reports are auto-generated:

$$\text{SLAReport}(\text{month}) = \left\{(\text{SLO}_i, \text{SLI}_i^{\text{actual}}, \text{target}_i, \text{met}_i, \text{budget\_consumed\_pct}_i)\right\}_{i=1}^{|\text{SLOs}|}$$

The report includes:

  • Achieved SLI vs. target for each SLO.
  • Error budget consumed and remaining.
  • Top contributing incidents to budget consumption.
  • Burn rate trends over the period.
  • Recommendations from chaos experiments and post-mortems.
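The per-SLO report rows can be generated mechanically from already-computed SLI values; a minimal Python sketch (field names are illustrative):

```python
def sla_report(slos: list[dict]) -> list[dict]:
    """One report row per SLO. Each input dict needs: id, target,
    actual_sli, budget_consumed, budget_total."""
    rows = []
    for s in slos:
        rows.append({
            "slo": s["id"],
            "target": s["target"],
            "actual": s["actual_sli"],
            "met": s["actual_sli"] >= s["target"],
            "budget_consumed_pct":
                100.0 * s["budget_consumed"] / max(s["budget_total"], 1),
        })
    return rows

report = sla_report([
    {"id": "availability", "target": 0.999, "actual_sli": 0.9995,
     "budget_consumed": 500, "budget_total": 1000},
    {"id": "correctness", "target": 0.95, "actual_sli": 0.94,
     "budget_consumed": 60_000, "budget_total": 50_000},
])
# availability met at 50% budget consumption; correctness missed (overrun)
```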

Synthesis: Fault Tolerance as an Architectural Discipline#

The fault tolerance architecture presented in this chapter treats failures not as exceptions but as first-class system events with typed classifications, bounded recovery procedures, and measurable impact on service-level guarantees. The key architectural contributions:

| Principle | Mechanism | Guarantee |
| --- | --- | --- |
| Typed failure classification | 5-class taxonomy with automated classification engine | Correct recovery strategy selection |
| Bounded retry | Exponential backoff with jitter, layered budgets, idempotency keys | No unbounded resource consumption; at-most-once semantics |
| Circuit isolation | Circuit breakers with adaptive recovery probing | Failing dependencies do not cascade |
| Resource partitioning | Bulkhead isolation with elastic borrowing | Cross-concern failure containment |
| Deadline discipline | Propagated deadlines, optimal budget allocation, preemptive cancellation | Latency SLA compliance under composition |
| Admission governance | Token-bucket rate limits, composite load signals, priority-based shedding | System stability under overload |
| Graceful degradation | 5-level capability hierarchy, health-driven feature flags, hysteresis | Continued service at reduced capability |
| Compensating transactions | Saga pattern with WAL-persisted coordination | Eventually consistent rollback of multi-step mutations |
| Crash recovery | Checkpoint + WAL replay with idempotent re-execution | Resumable sessions after unplanned termination |
| Proactive validation | Chaos engineering with safety envelopes | Discovery of unknown failure modes before production incidents |
| Operational automation | Executable runbooks, post-mortem → improvement pipeline | Reduced MTTR, institutional learning |
| SLA enforcement | Multi-window burn rate monitoring, automated enforcement escalation | Measurable, enforceable service guarantees |

A system that cannot quantify its failure tolerance cannot claim to be production-grade. The architecture in this chapter ensures every failure mode is classified, every recovery is bounded, every degradation is transparent, and every service guarantee is continuously measured against an explicit error budget. This is the engineering standard required for agentic AI systems operating at sustained enterprise scale.


End of Chapter 21.