Preamble#
Agentic AI systems operate at the intersection of stochastic inference, distributed tool execution, external API dependencies, and human-in-the-loop governance. Every one of these boundaries is a failure surface. A system that cannot tolerate failures deterministically is not a production system—it is a demonstration. This chapter formalizes fault tolerance for agentic platforms with the same rigor applied to avionics, financial trading systems, and distributed databases. Every failure mode is taxonomized, every mitigation is mathematically characterized, every recovery protocol is specified as a bounded, instrumented, auditable procedure. The objective is not merely to survive failures but to operate predictably, safely, and cost-efficiently through them—maintaining measurable service-level guarantees while preserving the correctness invariants on which agentic reliability depends.
21.1 Failure Taxonomy: Transient, Persistent, Cascading, Byzantine, and Semantic Failures#
21.1.1 The Necessity of Formal Taxonomy#
Effective fault tolerance demands that failures are not treated as a monolithic category. Each failure class has distinct detection signatures, propagation characteristics, recovery strategies, and cost profiles. A retry strategy appropriate for a transient network timeout is catastrophically wrong for a persistent authorization failure. A circuit breaker tuned for external API flakiness is useless against a semantic hallucination that passes all schema checks.
21.1.2 Failure Class Definitions#
Let $\mathcal{F}$ be the space of all failures. Define a classification function:

$$\kappa : \mathcal{F} \to \{\text{Transient},\ \text{Persistent},\ \text{Cascading},\ \text{Byzantine},\ \text{Semantic}\}$$
| Failure Class | Definition | Detection Signature | Recovery Strategy | Example |
|---|---|---|---|---|
| Transient | Temporary condition that resolves without intervention | Succeeds on retry; error code in retriable set | Retry with backoff | Network timeout, rate limit 429, temporary overload |
| Persistent | Stable failure that will not self-resolve | Repeated identical failure across retries | Escalate, substitute, or fail | Invalid credentials, deleted resource, schema mismatch |
| Cascading | Failure in one component propagates to dependent components | Correlated failures across services within temporal window | Isolate, shed load, break dependency chain | Database overload → retrieval failure → agent stall → queue backup |
| Byzantine | Component produces incorrect results without signaling error | Output passes schema validation but is semantically wrong | Redundant verification, voting, cross-validation | LLM hallucination, corrupted cache returning stale data, tool returning wrong result silently |
| Semantic | Output is structurally valid but violates task-level correctness or safety requirements | Detected only by domain-specific verification | Critique → repair → re-verify; escalate to human | Factually incorrect answer, unsafe recommendation, logically inconsistent plan |
21.1.3 Formal Failure Model#
Define a failure event as a tuple:

$$f = (\text{component},\ \text{timestamp},\ \text{error\_code},\ \text{context})$$

The failure rate for component $c$ over window $W$ is:

$$\lambda_c(W) = \frac{\big|\{f : f.\text{component} = c,\ f.\text{timestamp} \in W\}\big|}{|W|}$$

The mean time between failures (MTBF) and mean time to recovery (MTTR) for a component are:

$$\text{MTBF}_c = \frac{1}{\lambda_c}, \qquad \text{MTTR}_c = \mathbb{E}[t_{\text{recovered}} - t_{\text{failed}}]$$

Component availability is:

$$A_c = \frac{\text{MTBF}_c}{\text{MTBF}_c + \text{MTTR}_c}$$

For a serial chain of $n$ components, system availability is:

$$A_{\text{sys}} = \prod_{i=1}^{n} A_i$$

For parallel redundancy with $n$ independent replicas:

$$A_{\text{sys}} = 1 - \prod_{i=1}^{n} (1 - A_i)$$
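As a concrete sketch, the availability relationships above translate directly into a few lines of Python (assuming independent component failures, as the parallel-redundancy formula does):

```python
def availability(mtbf: float, mttr: float) -> float:
    """Component availability: A = MTBF / (MTBF + MTTR)."""
    return mtbf / (mtbf + mttr)

def serial_availability(components: list[float]) -> float:
    """Serial chain: the product of component availabilities."""
    a = 1.0
    for x in components:
        a *= x
    return a

def parallel_availability(replicas: list[float]) -> float:
    """Independent replicas: 1 minus the probability all are down."""
    p_all_down = 1.0
    for x in replicas:
        p_all_down *= (1.0 - x)
    return 1.0 - p_all_down
```

Note how redundancy inverts the effect of composition: three 99%-available components in series yield roughly 97% availability, while two 90%-available replicas in parallel yield 99%.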
21.1.4 Cascading Failure Propagation Model#
Model the agentic system as a directed dependency graph $G = (V, E)$ where vertices are components and edges are dependencies. A failure at component $v$ propagates to $u$ if $u$ depends (directly or transitively) on $v$ and lacks isolation.

The blast radius of a failure at $v$ is:

$$\text{BR}(v) = \{u \in V : u \text{ transitively depends on } v\}$$

The cascade probability depends on the failure propagation probability $p_{ij}$ along each edge: for a dependency path $\pi$ from $v$ to $u$,

$$P(v \rightsquigarrow u \mid \pi) = \prod_{(i,j) \in \pi} p_{ij}$$

Isolation mechanisms (bulkheads, circuit breakers) reduce $p_{ij}$ toward zero on specific edges.
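The blast radius is a plain reachability computation over the dependency graph. A minimal sketch in Python, assuming the graph is encoded as an adjacency map from each component to its direct dependents (the encoding is illustrative):

```python
from collections import deque

def blast_radius(dependents: dict[str, list[str]], root: str) -> set[str]:
    """BFS over the dependents map: every component whose failure
    the root's failure can reach, excluding the root itself."""
    seen = {root}
    frontier = deque([root])
    while frontier:
        v = frontier.popleft()
        for u in dependents.get(v, ()):
            if u not in seen:
                seen.add(u)
                frontier.append(u)
    return seen - {root}
```

For the cascade example in the table (database overload → retrieval failure → agent stall → queue backup), `blast_radius({"db": ["retrieval"], "retrieval": ["agent"], "agent": ["queue"]}, "db")` returns all three downstream components.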
21.1.5 Pseudo-Algorithm: Failure Classification Engine#
ALGORITHM ClassifyFailure
INPUT:
error: ErrorEvent
history: RecentFailureHistory
component: ComponentID
OUTPUT:
classification: FailureClassification
BEGIN:
// ─── Step 1: Check retriability from error code taxonomy ───
IF error.code IN TRANSIENT_ERROR_CODES THEN
// Verify not persistent by checking history
recent_identical ← history.count(
component=component,
error_code=error.code,
window=config.persistence_window
)
IF recent_identical ≥ config.persistence_threshold THEN
RETURN FailureClassification(
class=PERSISTENT,
confidence=0.9,
evidence="repeated_identical_failure",
recommended_action=ESCALATE
)
ELSE
RETURN FailureClassification(
class=TRANSIENT,
confidence=0.85,
evidence="retriable_error_code",
recommended_action=RETRY_WITH_BACKOFF
)
END IF
END IF
// ─── Step 2: Check for cascade indicators ───
correlated_failures ← history.correlated_failures(
temporal_window=config.cascade_detection_window,
min_components=2
)
IF LEN(correlated_failures) ≥ config.cascade_threshold THEN
upstream_root ← IDENTIFY_CASCADE_ROOT(correlated_failures, dependency_graph)
RETURN FailureClassification(
class=CASCADING,
confidence=0.8,
evidence="correlated_multi_component_failure",
root_cause=upstream_root,
recommended_action=ISOLATE_AND_SHED_LOAD
)
END IF
// ─── Step 3: Check for Byzantine indicators ───
IF error.type = VERIFICATION_FAILURE AND error.schema_valid = TRUE THEN
RETURN FailureClassification(
class=BYZANTINE,
confidence=0.75,
evidence="schema_valid_but_semantically_incorrect",
recommended_action=REDUNDANT_VERIFICATION
)
END IF
// ─── Step 4: Check for semantic failures ───
IF error.type IN {HALLUCINATION, SAFETY_VIOLATION, LOGIC_ERROR,
FACTUAL_ERROR, COHERENCE_FAILURE} THEN
RETURN FailureClassification(
class=SEMANTIC,
confidence=0.85,
evidence=error.verification_details,
recommended_action=CRITIQUE_AND_REPAIR
)
END IF
// ─── Step 5: Default to persistent ───
RETURN FailureClassification(
class=PERSISTENT,
confidence=0.5,
evidence="unclassified_error",
recommended_action=FAIL_WITH_DIAGNOSTICS
)
END
21.2 Retry Engineering#
21.2.1 Exponential Backoff with Jitter: Configuration, Bounds, and Anti-Thundering-Herd#
Formal Backoff Function#
The backoff delay for retry attempt $k$ is:

$$d_k = J\big(\min(d_{\max},\ d_0 \cdot b^{k})\big)$$

where:
- $d_0$: base delay (e.g., 100ms)
- $b$: exponential base (typically $b = 2$)
- $d_{\max}$: ceiling delay to prevent unbounded waits
- $J$: jitter function to decorrelate concurrent retriers
Jitter Strategies#
| Strategy | Formula | Properties |
|---|---|---|
| Full Jitter | $d_k = U\big(0,\ \min(d_{\max},\ d_0 b^k)\big)$ | Maximum decorrelation; wide delay variance |
| Equal Jitter | $d_k = \tfrac{1}{2}\min(d_{\max},\ d_0 b^k) + U\big(0,\ \tfrac{1}{2}\min(d_{\max},\ d_0 b^k)\big)$ | Balanced: guaranteed minimum wait + jitter |
| Decorrelated Jitter | $d_k = \min\big(d_{\max},\ U(d_0,\ 3\,d_{k-1})\big)$ | Self-adapting; depends on previous delay |
The expected total wait across $n$ retries with full jitter is:

$$\mathbb{E}\Big[\sum_{k=0}^{n-1} d_k\Big] = \sum_{k=0}^{n-1} \frac{\min(d_{\max},\ d_0 b^{k})}{2}$$

For example, with $d_0 = 100$ ms, $b = 2$, $d_{\max} = 30$ s, and $n = 5$:

$$\mathbb{E}[\text{total wait}] = \frac{100 + 200 + 400 + 800 + 1600}{2}\ \text{ms} = 1.55\ \text{s}$$
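A minimal full-jitter backoff helper in Python (the 100 ms base, base-2 growth, and 30 s cap are illustrative defaults, not mandated values):

```python
import random

def backoff_delay(attempt: int, base: float = 0.1,
                  cap: float = 30.0, factor: float = 2.0) -> float:
    """Full-jitter delay in seconds: U(0, min(cap, base * factor**attempt))."""
    ceiling = min(cap, base * factor ** attempt)
    return random.uniform(0.0, ceiling)
```

Each caller draws independently from the interval, which is exactly what decorrelates a fleet of concurrent retriers.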
Anti-Thundering-Herd Analysis#
When $N$ concurrent clients retry the same failing service simultaneously, without jitter the aggregate retry rate spikes: all $N$ retries for attempt $k$ arrive in a single burst at $t = d_k$. With full jitter, the aggregate rate smooths to approximately:

$$R(t) \approx \frac{N}{\min(d_{\max},\ d_0 b^{k})}$$

effectively distributing retries uniformly over the interval $[0,\ \min(d_{\max},\ d_0 b^k)]$, eliminating synchronization.
21.2.2 Retry Budgets: Per-Request, Per-Session, and System-Wide Limits#
Unbounded retries convert transient failures into persistent resource exhaustion. The system enforces layered retry budgets:
| Level | Scope | Budget | Typical Value |
|---|---|---|---|
| Per-Request | Single tool call or API invocation | $B_{\text{req}}$ | 3–5 attempts |
| Per-Session | All retries within one session | $B_{\text{sess}}$ | 20–50 attempts |
| System-Wide | Total retries across all sessions per window | $B_{\text{sys}}$ | 1000 retries / minute |
The retry budget utilization at time $t$ is:

$$U(t) = \frac{\text{retries consumed in the current window}}{\text{budget}}$$

When $U(t) > \theta$ (e.g., $\theta = 0.8$), the system enters retry backpressure mode: new retry attempts are rejected or deferred, and the incident response pipeline is triggered.
21.2.3 Idempotency Keys: Generation, Propagation, and Server-Side Deduplication#
Idempotency Key Specification#
An idempotency key $k$ uniquely identifies a logical operation such that submitting the request $m$ times has the same effect as submitting it once:

$$\text{apply}(op, k)^{\,m} = \text{apply}(op, k) \quad \forall m \ge 1$$

i.e., the effect is applied exactly once regardless of how many times the request is submitted.
Key Generation#
$$k = \text{HMAC}_{s}\big(\text{serialize}(\text{params}) \,\|\, n\big)$$

where:
- $s$ is the session-scoped secret
- $\text{serialize}(\text{params})$ is the deterministic serialization of operation parameters
- $n$ is a monotonically increasing counter within the session
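A sketch of this derivation in Python using HMAC-SHA256 over a canonical JSON serialization (the serialization scheme and the `|` delimiter are illustrative choices, not a mandated wire format):

```python
import hashlib
import hmac
import json

def idempotency_key(session_secret: bytes, params: dict, counter: int) -> str:
    """Deterministic key: HMAC(secret, canonical_params || counter).
    Sorted keys and fixed separators make the serialization canonical,
    so semantically identical params always yield the same key."""
    canonical = json.dumps(params, sort_keys=True, separators=(",", ":"))
    message = canonical.encode() + b"|" + str(counter).encode()
    return hmac.new(session_secret, message, hashlib.sha256).hexdigest()
```

Because the serialization is canonical, a retry that rebuilds the same parameter dict in a different insertion order still produces the same key, which is what makes server-side deduplication possible.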
Key Propagation#
Idempotency keys propagate through the call chain. When an agent invokes a tool, which invokes a downstream service:

$$k_{\text{child}} = \text{KDF}(k_{\text{parent}},\ \text{step\_id})$$

This ensures that a retry of the parent operation generates the same derived key, enabling deduplication at every layer.
Server-Side Deduplication#
The server maintains a deduplication store mapping keys to results:
ALGORITHM IdempotentExecute
INPUT:
operation: Operation
idempotency_key: IdempotencyKey
dedup_store: DeduplicationStore
config: IdempotencyConfig
OUTPUT:
result: OperationResult
BEGIN:
// ─── Check deduplication store ───
existing ← dedup_store.get(idempotency_key)
IF existing IS NOT NULL THEN
IF existing.status = COMPLETED THEN
EMIT_METRIC("idempotency.dedup_hit", {key: idempotency_key})
RETURN existing.result // Return cached result
ELSE IF existing.status = IN_PROGRESS THEN
// Another invocation is in flight
IF NOW() - existing.started_at > config.in_progress_timeout THEN
// Stale in-progress record; likely crashed
dedup_store.update(idempotency_key, status=EXPIRED)
// Fall through to execute
ELSE
RETURN OperationResult(status=PENDING, retry_after=existing.estimated_completion)
END IF
END IF
END IF
// ─── Claim the key ───
claimed ← dedup_store.claim(idempotency_key, {
status: IN_PROGRESS,
started_at: NOW(),
expiry: NOW() + config.key_ttl
})
IF NOT claimed THEN
// Race condition: another instance claimed first
RETURN OperationResult(status=PENDING, retry_after=1s)
END IF
// ─── Execute operation ───
TRY:
result ← EXECUTE(operation)
dedup_store.update(idempotency_key, {
status: COMPLETED,
result: result,
completed_at: NOW(),
expiry: NOW() + config.result_ttl
})
RETURN result
CATCH error:
IF IS_RETRIABLE(error) THEN
dedup_store.update(idempotency_key, status=FAILED_RETRIABLE)
// Allow future retries with same key
ELSE
dedup_store.update(idempotency_key, {
status: FAILED_PERMANENT,
error: error,
expiry: NOW() + config.result_ttl
})
END IF
RAISE error
END TRY
END
Deduplication Store TTL#
Keys expire after a configurable TTL to prevent unbounded storage growth:

$$\text{TTL}_{\text{key}} \ge \text{maximum retry window} + \text{safety margin}$$

Storage cost:

$$\text{storage} \approx \text{request rate} \times \text{TTL} \times \text{average entry size}$$
21.3 Circuit Breakers: Open/Half-Open/Closed States, Failure Rate Thresholds, and Recovery Probes#
21.3.1 Circuit Breaker State Machine#
The circuit breaker is a protective state machine that prevents a failing downstream dependency from consuming unbounded resources:
$$S \in \{\text{CLOSED},\ \text{OPEN},\ \text{HALF\_OPEN}\}$$

where:
- CLOSED: requests flow through and outcomes are recorded (normal operation)
- OPEN: requests fast-fail immediately without touching the dependency
- HALF_OPEN: a limited number of probe requests test whether the dependency has recovered

Transition function:

$$\text{CLOSED} \xrightarrow{\ \phi > \Phi\ } \text{OPEN} \xrightarrow{\ T_{\text{open}}\ \text{elapsed}\ } \text{HALF\_OPEN} \xrightarrow{\ \text{probes succeed}\ } \text{CLOSED}, \qquad \text{HALF\_OPEN} \xrightarrow{\ \text{probe fails}\ } \text{OPEN}$$
21.3.2 Failure Rate Computation#
The failure rate $\phi$ is computed over a sliding window of the last $N$ requests:

$$\phi = \frac{|\{\text{failures in window}\}|}{N}$$

The circuit opens when:

$$\phi > \Phi \quad \text{and} \quad N \ge N_{\min}$$

where $\Phi$ (e.g., 0.5) is the failure rate threshold and $N_{\min}$ is the minimum sample size to prevent premature tripping on low traffic.
21.3.3 Recovery Probing#
In HALF_OPEN state, the circuit breaker admits a limited number $P$ of probe requests (typically 1–3): the circuit closes once enough probes succeed and reopens on any probe failure.

The open duration before transitioning to HALF_OPEN follows an exponential backoff:

$$T_{\text{open}}(c) = \min\big(T_{\text{open}}^{\max},\ T_{\text{open}}^{\text{base}} \cdot 2^{\,c-1}\big)$$

where $c$ is the number of consecutive open→half-open→open cycles. This prevents rapid oscillation (circuit "flapping").
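The open-duration backoff reduces to a one-line helper; a sketch with illustrative defaults (a 5 s base and a 300 s ceiling are assumptions, not prescribed values):

```python
def open_duration(consecutive_opens: int,
                  t_base: float = 5.0, t_max: float = 300.0) -> float:
    """OPEN-state dwell time: min(T_max, T_base * 2**(c - 1)),
    doubling with each consecutive open cycle up to the ceiling."""
    return min(t_max, t_base * 2 ** (consecutive_opens - 1))
```

With these defaults a dependency that keeps failing its probes waits 5 s, 10 s, 20 s, 40 s, … between recovery attempts, capping at 5 minutes.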
21.3.4 Circuit Breaker Metrics#
| Metric | Formula | Operational Significance |
|---|---|---|
| Trip Rate | $\frac{\text{transitions to OPEN}}{\text{time window}}$ | Frequency of dependency degradation |
| Open Duration | $\mathbb{E}[t_{\text{half-open}} - t_{\text{open}}]$ | Average time dependency is unavailable |
| Recovery Success Rate | $\frac{\text{HALF\_OPEN} \to \text{CLOSED transitions}}{\text{HALF\_OPEN transitions}}$ | Dependency stability after incidents |
| Requests Shed | $\sum \text{rejections while OPEN}$ | Requests rejected during outages |
21.3.5 Pseudo-Algorithm: Circuit Breaker#
ALGORITHM CircuitBreaker
INPUT:
dependency: DependencyID
config: CircuitBreakerConfig
STATE:
state ← CLOSED
failure_window ← SlidingWindow(size=config.window_size)
open_since ← NULL
consecutive_opens ← 0
probe_results ← []
METHOD execute(operation):
MATCH state:
CLOSED:
TRY:
result ← operation.execute(timeout=config.call_timeout)
failure_window.record(SUCCESS)
RETURN result
CATCH error:
failure_window.record(FAILURE)
// Check trip condition
IF failure_window.count() ≥ config.N_min THEN
failure_rate ← failure_window.failure_rate()
IF failure_rate > config.phi_threshold THEN
TRIP_OPEN()
END IF
END IF
RAISE error
END TRY
OPEN:
// Check if open timeout has elapsed
elapsed ← NOW() - open_since
T_open ← MIN(config.T_open_max,
config.T_open_base * 2^(consecutive_opens - 1))
IF elapsed ≥ T_open THEN
state ← HALF_OPEN
probe_results ← []
EMIT_TRACE("circuit_breaker.half_open", dependency)
// Fall through to HALF_OPEN handling below
RETURN EXECUTE_HALF_OPEN(operation)
ELSE
// Fast-fail: do not attempt the call
EMIT_METRIC("circuit_breaker.rejected", dependency)
RAISE CircuitOpenError(dependency, retry_after=T_open - elapsed)
END IF
HALF_OPEN:
RETURN EXECUTE_HALF_OPEN(operation)
METHOD EXECUTE_HALF_OPEN(operation):
IF LEN(probe_results) ≥ config.P_probes THEN
// Already collected enough probes; wait for decision
RAISE CircuitOpenError(dependency, retry_after=1s)
END IF
TRY:
result ← operation.execute(timeout=config.call_timeout)
APPEND probe_results, SUCCESS
IF COUNT(SUCCESS IN probe_results) ≥ config.probes_to_close THEN
state ← CLOSED
consecutive_opens ← 0
failure_window.reset()
EMIT_TRACE("circuit_breaker.closed", dependency)
END IF
RETURN result
CATCH error:
APPEND probe_results, FAILURE
TRIP_OPEN()
RAISE error
END TRY
METHOD TRIP_OPEN():
state ← OPEN
open_since ← NOW()
consecutive_opens ← consecutive_opens + 1
EMIT_TRACE("circuit_breaker.opened", {
dependency: dependency,
failure_rate: failure_window.failure_rate(),
consecutive_opens: consecutive_opens
})
EMIT_ALERT_IF(consecutive_opens ≥ config.alert_threshold)
21.4 Bulkhead Isolation: Partitioning Resources to Prevent Cross-Concern Failure Propagation#
21.4.1 Bulkhead Principle#
Borrowed from naval engineering, the bulkhead pattern partitions system resources into isolated compartments such that failure in one compartment cannot drain resources from another.
Formally, let the system's resource pool $R$ (thread pools, connection pools, memory, token budgets) be partitioned:

$$R = R_1 \sqcup R_2 \sqcup \cdots \sqcup R_n$$

Each partition $R_i$ has a hard capacity limit $C_i$:

$$\text{usage}(R_i) \le C_i \quad \forall i$$

A resource-exhaustion failure in partition $R_i$ does not affect partitions $R_j$:

$$\text{usage}(R_i) = C_i \;\Rightarrow\; \text{available}(R_j) \text{ unchanged} \quad \forall j \ne i$$
21.4.2 Bulkhead Dimensions for Agentic Systems#
| Dimension | Partition By | Rationale |
|---|---|---|
| Tool invocation pools | Per-tool or per-tool-class | Slow tool cannot exhaust pool used by fast tools |
| LLM inference queues | Per-priority-tier | Low-priority background tasks cannot block high-priority user requests |
| Retrieval connections | Per-source (vector DB, graph DB, cache) | One slow source cannot block others |
| Agent execution slots | Per-session or per-user | One user's runaway agent cannot consume all compute |
| Token budget pools | Per-session, per-task | One expensive task cannot drain organization budget |
| Memory allocation | Per-session working memory | One session's large context cannot cause OOM for others |
21.4.3 Bulkhead Sizing#
The capacity $C_i$ of each bulkhead is determined by:

$$C_i = \max\left(C_{\min},\ \left\lfloor \frac{w_i}{\sum_j w_j} \cdot C_{\text{total}} \right\rfloor\right)$$

where $w_i$ is the weight assigned based on priority and expected demand, $C_{\text{total}}$ is the total resource pool, and $C_{\min}$ guarantees a minimum viable allocation per partition.

The utilization-adjusted sizing dynamically reallocates unused capacity:

$$C_i^{\text{eff}} = C_i + \alpha \cdot \sum_{j \ne i} \max\big(0,\ C_j - \text{usage}(R_j)\big)$$

where $\alpha \in [0, 1]$ is the borrowing fraction that controls how much slack from other partitions can be temporarily utilized. Setting $\alpha = 0$ provides strict isolation; $\alpha > 0$ provides elastic isolation with guaranteed minimums.
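The static sizing rule can be sketched directly from the formula (partition names and weights here are illustrative; $C_{\min}$ defaults to 1 for the sketch):

```python
import math

def bulkhead_capacities(weights: dict[str, float],
                        total: int, c_min: int = 1) -> dict[str, int]:
    """Weighted split of the pool with a guaranteed per-partition minimum:
    C_i = max(C_min, floor(w_i * total / sum(w)))."""
    w_sum = sum(weights.values())
    return {pid: max(c_min, math.floor(w * total / w_sum))
            for pid, w in weights.items()}
```

Note that the $C_{\min}$ floor means the capacities can sum to slightly more than the pool when many low-weight partitions exist; the initialize() method in the pseudo-algorithm below applies the same rule per partition.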
21.4.4 Pseudo-Algorithm: Bulkhead Resource Manager#
ALGORITHM BulkheadResourceManager
INPUT:
partitions: Map<PartitionID, BulkheadConfig>
total_capacity: Int
STATE:
allocations: Map<PartitionID, Semaphore>
usage_counters: Map<PartitionID, AtomicInt>
METHOD initialize():
FOR EACH (id, config) IN partitions DO
capacity ← MAX(config.C_min,
FLOOR(config.weight * total_capacity / total_weight))
allocations[id] ← Semaphore(capacity)
usage_counters[id] ← AtomicInt(0)
END FOR
METHOD acquire(partition_id, timeout):
semaphore ← allocations[partition_id]
// Try primary allocation
acquired ← semaphore.try_acquire(timeout=timeout)
IF acquired THEN
usage_counters[partition_id].increment()
EMIT_METRIC("bulkhead.acquired", partition_id)
RETURN AcquisitionToken(partition_id, PRIMARY)
END IF
// Try borrowing from slack partitions
IF config.alpha > 0 THEN
FOR EACH (other_id, other_sem) IN allocations DO
IF other_id = partition_id THEN CONTINUE END IF
slack ← other_sem.available_permits()
IF slack > config.borrow_min_slack THEN
borrowed ← other_sem.try_acquire(timeout=0)
IF borrowed THEN
usage_counters[partition_id].increment()
EMIT_METRIC("bulkhead.borrowed", {
partition: partition_id,
from: other_id
})
RETURN AcquisitionToken(partition_id, BORROWED, donor=other_id)
END IF
END IF
END FOR
END IF
// Allocation failed
EMIT_METRIC("bulkhead.rejected", partition_id)
RAISE BulkheadFullError(partition_id, usage=usage_counters[partition_id].get())
METHOD release(token):
MATCH token.type:
PRIMARY:
allocations[token.partition_id].release()
BORROWED:
allocations[token.donor].release()
usage_counters[token.partition_id].decrement()
EMIT_METRIC("bulkhead.released", token.partition_id)
21.5 Timeout Engineering: Deadline Propagation, Cascading Timeout Budgets, and Deadline-Aware Scheduling#
21.5.1 The Timeout Problem in Agentic Systems#
Agentic systems compose multiple asynchronous operations: LLM inference, tool invocations, retrieval queries, human approvals. Without disciplined timeout engineering, a single slow operation blocks the entire agent loop indefinitely, consuming resources and violating latency SLAs.
21.5.2 Deadline Propagation#
Every request entering the system carries a deadline $D$:

$$D = t_{\text{arrival}} + \text{SLA}$$

As the request flows through the call chain, each component consumes time and propagates a remaining deadline:

$$D_{\text{remaining}} = D - t_{\text{now}} - T_{\text{overhead}}$$

where $T_{\text{overhead}}$ reserves time for post-processing, serialization, and response transmission.

The effective timeout for a downstream call at depth $d$ is:

$$T_{\text{timeout}}(d) = D - t_{\text{now}} - T_{\text{remaining}}^{\min} - T_{\text{overhead}}$$

where $T_{\text{remaining}}^{\min}$ is the estimated minimum time for all remaining downstream steps. This ensures that even if the current call uses its full timeout, sufficient time remains for subsequent steps.
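The deadline arithmetic is small enough to sketch as a helper (using a monotonic clock so wall-clock adjustments cannot corrupt the budget; a non-positive result signals an infeasible call):

```python
import time

def effective_timeout(deadline: float, min_remaining: float,
                      overhead: float, max_timeout: float) -> float:
    """Time this call may consume: whatever remains of the deadline
    after reserving the minimum for remaining steps and response
    overhead, capped at the operation's own maximum timeout."""
    available = deadline - time.monotonic() - min_remaining - overhead
    return min(available, max_timeout)
```

A caller checks the sign before dispatching: a non-positive value means the step should be cancelled or degraded rather than started, which is exactly the pre-flight check in the pseudo-algorithm below.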
21.5.3 Cascading Timeout Budget Allocation#
For a sequential chain of $n$ operations, the total budget is:

$$\sum_{i=1}^{n} t_i \le T_{\text{budget}} = \text{SLA} - T_{\text{overhead}}$$

The optimal allocation minimizes the probability of timeout given per-operation latency distributions $F_i$:

$$\min_{t_1,\ldots,t_n} \sum_{i=1}^{n} \big(1 - F_i(t_i)\big) \quad \text{subject to} \quad \sum_{i=1}^{n} t_i = T_{\text{budget}}$$

For operations with exponentially distributed latencies $X_i \sim \text{Exp}(\lambda_i)$:

$$P(X_i > t_i) = e^{-\lambda_i t_i}$$

The Lagrangian yields:

$$t_i^{*} = \frac{1}{\lambda_i} \ln \frac{\lambda_i}{\mu}$$

where $\mu$ is the Lagrange multiplier determined by the budget constraint $\sum_i t_i^{*} = T_{\text{budget}}$. Intuitively, more budget is allocated to operations with higher latency variance (smaller $\lambda_i$).
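A numeric sketch of this allocation: since the total time allocated by $t_i^{*} = \frac{1}{\lambda_i}\ln\frac{\lambda_i}{\mu}$ decreases monotonically in $\mu$, the multiplier can be found by bisection (negative allocations are clamped to zero, which handles steps whose rate exceeds $\mu$):

```python
import math

def allocate_budget(lambdas: list[float], total: float) -> list[float]:
    """Split `total` across steps with Exp(lambda_i) latencies to
    minimize the summed timeout probabilities. Solves for the
    Lagrange multiplier mu by bisection on the budget constraint."""
    def times(mu: float) -> list[float]:
        return [max(0.0, math.log(lam / mu) / lam) for lam in lambdas]
    lo, hi = 1e-12, max(lambdas)          # sum(times) is huge at lo, 0 at hi
    for _ in range(200):
        mu = (lo + hi) / 2
        if sum(times(mu)) > total:
            lo = mu                        # allocating too much: raise mu
        else:
            hi = mu                        # allocating too little: lower mu
    return times((lo + hi) / 2)
```

Running `allocate_budget([1.0, 2.0], 3.0)` gives the slower step ($\lambda = 1$) the larger share of the 3-second budget, matching the intuition stated above.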
21.5.4 Deadline-Aware Scheduling#
The agent loop scheduler must be deadline-aware: actions approaching their deadline receive scheduling priority:

$$\text{priority}(a) \propto \frac{1}{D_a - t_{\text{now}}}$$

Actions with $D_a - t_{\text{now}} < T_a^{\min}$ (less remaining time than the action's minimum viable execution time) are infeasible and should be preemptively cancelled rather than allowed to consume resources and then time out.
21.5.5 Pseudo-Algorithm: Deadline-Propagating Invocation#
ALGORITHM DeadlinePropagatingInvoke
INPUT:
operation: Operation
deadline: Timestamp
remaining_steps: List<OperationSpec>
OUTPUT:
result: OperationResult
BEGIN:
// Calculate time needed for remaining steps
T_remaining_minimum ← SUM(step.min_latency FOR step IN remaining_steps)
T_overhead ← config.response_overhead
// Available time for this operation
T_available ← deadline - NOW() - T_remaining_minimum - T_overhead
IF T_available ≤ 0 THEN
RAISE DeadlineExceededError(
"insufficient_time_for_operation",
deficit=ABS(T_available)
)
END IF
// Set operation timeout
operation_timeout ← MIN(T_available, operation.spec.max_timeout)
IF operation_timeout < operation.spec.min_viable_timeout THEN
// Not enough time for meaningful execution
RETURN DEGRADE_OR_SKIP(operation, deadline)
END IF
// Execute with propagated deadline
downstream_deadline ← NOW() + operation_timeout
TRY:
result ← operation.execute(
timeout=operation_timeout,
propagated_deadline=downstream_deadline
)
RETURN result
CATCH TimeoutException:
EMIT_METRIC("deadline.timeout", {
operation: operation.name,
allocated: operation_timeout,
deadline_remaining: deadline - NOW()
})
// Decide whether to return partial result or propagate failure
IF operation.supports_partial_result THEN
RETURN operation.get_partial_result()
ELSE
RAISE TimeoutException(operation.name, allocated=operation_timeout)
END IF
END TRY
END
21.6 Queue Isolation and Backpressure: Rate Limiting, Admission Control, and Load Shedding#
21.6.1 Queue Architecture for Agentic Systems#
Agentic workloads are bursty and heterogeneous. A code-generation task consumes 10× more tokens than a simple Q&A. Without queue isolation, heavy tasks starve light tasks.
Define a multi-queue architecture with $K$ queues:

$$\mathcal{Q} = \{Q_1, Q_2, \ldots, Q_K\}$$

Each queue $Q_k$ has a bounded capacity $c_k$, a priority level $p_k$, and a rate limit $r_k$.
21.6.2 Rate Limiting#
Rate limits are enforced using a token bucket algorithm:

$$\text{tokens}(t) = \min\big(B,\ \text{tokens}(t_{\text{last}}) + r \cdot (t - t_{\text{last}})\big)$$

where $B$ is the bucket capacity (burst limit) and $r$ is the refill rate (sustained throughput limit).

A request of cost $c$ is admitted if:

$$c \le \text{tokens}(t)$$

After admission, the bucket is decremented:

$$\text{tokens}(t) \leftarrow \text{tokens}(t) - c$$

For agentic systems, cost is measured in token units rather than raw request count:

$$c = \text{tokens}_{\text{prompt}} + \mathbb{E}[\text{tokens}_{\text{completion}}]$$

This prevents a single high-token request from being treated equivalently to a simple health check.
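A minimal token bucket in cost units (a single-threaded sketch; a production version needs a lock around refill-and-consume for concurrent callers):

```python
import time

class TokenBucket:
    """Token bucket: capacity B (burst), refill rate r per second.
    Costs are arbitrary units, e.g. estimated LLM tokens."""

    def __init__(self, capacity: float, refill_rate: float):
        self.capacity = capacity
        self.rate = refill_rate
        self.tokens = capacity
        self.last = time.monotonic()

    def try_consume(self, cost: float) -> bool:
        """Refill lazily based on elapsed time, then admit iff affordable."""
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if cost <= self.tokens:
            self.tokens -= cost
            return True
        return False
```

The lazy refill avoids a background timer: the bucket's level is only materialized at admission time, which is the standard implementation trick.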
21.6.3 Admission Control#
Admission control decides whether to accept a new request based on current system load:

$$\text{decision} = \begin{cases} \text{ACCEPT} & L < L_{\text{accept}} \\ \text{THROTTLE} & L_{\text{accept}} \le L < L_{\text{shed}} \\ \text{REJECT} & L \ge L_{\text{shed}} \end{cases}$$

where

$$L = w_q \cdot \frac{\text{queued}}{\text{capacity}} + w_c \cdot \frac{\text{active tokens}}{\text{token budget}} + w_l \cdot \frac{p_{95}\ \text{latency}}{\text{latency SLA}}$$

is a weighted composite load signal combining queue depth, token consumption, and observed latency.
21.6.4 Load Shedding Strategies#
When admission control cannot prevent overload, load shedding drops requests to protect system stability:
| Strategy | Selection Criterion | Properties |
|---|---|---|
| LIFO (Last In, First Out) | Drop newest requests | Preserves older, likely-more-invested requests |
| Priority-Based | Drop lowest-priority first | Protects business-critical workloads |
| Cost-Based | Drop most expensive requests | Maximizes throughput in request count |
| Deadline-Based | Drop requests closest to expiry | Drops requests unlikely to complete anyway |
| Random | Drop uniformly at random | Fair, prevents starvation, simple |
The optimal shedding policy maximizes aggregate value delivered:

$$\max_{x \in \{0,1\}^n} \sum_{i} v_i x_i \quad \text{subject to} \quad \sum_{i} c_i x_i \le C$$

This is a knapsack problem; for online decision-making, the greedy approximation sorts by value density $v_i / c_i$ and admits in decreasing order until capacity is exhausted.
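The greedy density heuristic fits in a few lines (requests here are illustrative `(id, value, cost)` triples):

```python
def greedy_admit(requests: list[tuple[str, float, float]],
                 capacity: float) -> list[str]:
    """Greedy knapsack approximation: admit requests in decreasing
    value density v_i / c_i until the capacity budget is exhausted."""
    admitted: list[str] = []
    used = 0.0
    for rid, value, cost in sorted(requests,
                                   key=lambda r: r[1] / r[2], reverse=True):
        if used + cost <= capacity:
            admitted.append(rid)
            used += cost
    return admitted
```

The greedy order is not always optimal for 0/1 knapsack, but it is a standard constant-factor online approximation and costs only a sort.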
21.6.5 Pseudo-Algorithm: Admission Controller with Load Shedding#
ALGORITHM AdmissionController
INPUT:
request: IncomingRequest
queues: Map<Priority, Queue>
system_state: SystemState
OUTPUT:
decision: ACCEPT | THROTTLE | REJECT
BEGIN:
// ─── Compute composite load ───
queue_load ← system_state.total_queued / system_state.total_capacity
token_load ← system_state.active_tokens / system_state.token_budget
latency_load ← system_state.p95_latency / system_state.latency_sla
load ← config.w_q * queue_load
+ config.w_c * token_load
+ config.w_l * latency_load
// ─── Rate limit check ───
estimated_cost ← ESTIMATE_TOKEN_COST(request)
IF NOT TOKEN_BUCKET.try_consume(estimated_cost) THEN
EMIT_METRIC("admission.rate_limited", request.priority)
RETURN THROTTLE(retry_after=TOKEN_BUCKET.time_to_refill(estimated_cost))
END IF
// ─── Load-based admission ───
IF load < config.L_accept THEN
target_queue ← queues[request.priority]
IF target_queue.size() < target_queue.capacity THEN
target_queue.enqueue(request)
EMIT_METRIC("admission.accepted", request.priority)
RETURN ACCEPT
ELSE
// Queue full; try shedding lower-priority items
IF SHED_LOWER_PRIORITY(target_queue, request) THEN
target_queue.enqueue(request)
RETURN ACCEPT
ELSE
RETURN THROTTLE(retry_after=config.throttle_delay)
END IF
END IF
ELSE IF load < config.L_shed THEN
// Throttle: accept only high-priority
IF request.priority ≥ PRIORITY_HIGH THEN
queues[request.priority].enqueue(request)
EMIT_METRIC("admission.accepted_under_pressure", request.priority)
RETURN ACCEPT
ELSE
EMIT_METRIC("admission.throttled", request.priority)
RETURN THROTTLE(retry_after=config.throttle_delay)
END IF
ELSE
// Critical load: shed
IF request.priority = PRIORITY_CRITICAL THEN
// Even critical requests enter only if queue has room
IF queues[PRIORITY_CRITICAL].size() < queues[PRIORITY_CRITICAL].capacity THEN
queues[PRIORITY_CRITICAL].enqueue(request)
RETURN ACCEPT
END IF
END IF
EMIT_METRIC("admission.shed", request.priority)
RETURN REJECT(reason="system_overloaded", load=load)
END IF
END
21.7 Graceful Degradation Strategies#
21.7.1 Reduced-Capability Modes: Simpler Models, Cached Responses, and Partial Results#
Graceful degradation maintains service availability by reducing capability rather than failing entirely. The system defines a hierarchy of capability levels:

$$C_0 \succ C_1 \succ C_2 \succ C_3 \succ C_4$$

ordered from full capability to minimal viable service:
| Level | Description | Operational Mode |
|---|---|---|
| $C_0$ | Full | All features, primary model, real-time retrieval, full verification |
| $C_1$ | Reduced Verification | Primary model, retrieval, but skip adversarial critique and self-consistency |
| $C_2$ | Fallback Model | Smaller/faster model, basic verification, cached retrieval where possible |
| $C_3$ | Cache-First | Return cached or pre-computed responses; LLM only for cache misses |
| $C_4$ | Static Fallback | Return templated responses, documentation links, or "service degraded" messages |
The degradation trigger function maps system health to capability level:

$$\text{level}(H) = C_{\min\{i \,:\, H \ge h_i\}}$$

defaulting to $C_4$ when no threshold is met, where $H \in [0,1]$ is the composite health score and $h_i$ is the minimum health required for level $C_i$, with $h_0 > h_1 > h_2 > h_3$.
21.7.2 Feature Flags for Progressive Agent Capability Reduction#
Each degradation level activates or deactivates specific feature flags:
| Feature Flag | $C_0$ | $C_1$ | $C_2$ | $C_3$ | $C_4$ |
|---|---|---|---|---|---|
| primary_model | ✓ | ✓ | ✗ | ✗ | ✗ |
| fallback_model | ✗ | ✗ | ✓ | △ | ✗ |
| real_time_retrieval | ✓ | ✓ | ✓ | ✗ | ✗ |
| cached_retrieval | ✓ | ✓ | ✓ | ✓ | ✗ |
| self_consistency | ✓ | ✗ | ✗ | ✗ | ✗ |
| adversarial_critique | ✓ | ✗ | ✗ | ✗ | ✗ |
| rubric_verification | ✓ | ✓ | △ | ✗ | ✗ |
| tool_invocation | ✓ | ✓ | ✓ | ✗ | ✗ |
| human_escalation | ✓ | ✓ | ✓ | ✓ | ✓ |
(✓ = enabled, ✗ = disabled, △ = simplified version)
Feature flags are evaluated at every phase boundary of the agent loop, allowing mid-execution degradation if system health declines during processing.
21.7.3 User-Facing Degradation Communication: Transparent Status and ETA#
Users must be transparently informed of degraded operation: responses should carry the current capability level, the features affected, and an estimated time to recovery.

The system computes an estimated time to recovery (ETR) based on historical degradation durations; a robust estimator is the median duration of past incidents at the same level:

$$\text{ETR} = \text{median}\{\Delta t : \text{past degradations at this level}\}$$
ALGORITHM GracefulDegradationController
INPUT:
system_health: HealthMetrics
config: DegradationConfig
OUTPUT:
level: CapabilityLevel
feature_flags: Map<FeatureFlag, Bool>
STATE:
current_level ← C_0
level_entry_time ← NOW()
METHOD evaluate():
// Compute composite health score
H ← MIN(
system_health.model_availability / config.A_target,
1.0 / MAX(system_health.p95_latency / config.L_sla, 0.01),
1.0 - system_health.error_rate,
system_health.budget_remaining / config.budget_threshold
)
H ← CLAMP(H, 0.0, 1.0)
// Determine target level
target_level ← C_4 // Default to most degraded
FOR EACH level IN [C_0, C_1, C_2, C_3] DO // Check best to worst
IF H ≥ config.health_thresholds[level] THEN
target_level ← level
BREAK
END IF
END FOR
// Hysteresis: prevent flapping
IF target_level > current_level THEN
// Degrading: apply immediately
current_level ← target_level
level_entry_time ← NOW()
EMIT_ALERT("degradation.level_changed", current_level)
ELSE IF target_level < current_level THEN
// Recovering: require stability period
IF NOW() - level_entry_time > config.stability_period THEN
current_level ← target_level
level_entry_time ← NOW()
EMIT_TRACE("degradation.level_improved", current_level)
END IF
// Else: wait for stability confirmation
END IF
// Resolve feature flags for current level
feature_flags ← config.flag_matrix[current_level]
RETURN (current_level, feature_flags)
END
21.8 Compensating Transactions: Undo, Rollback, and Saga Coordination for Multi-Step Agent Actions#
21.8.1 The Compensation Problem#
Agent loops execute multi-step plans where each step may mutate external state. When step $T_i$ fails after steps $T_1, \ldots, T_{i-1}$ have committed, the system cannot atomically roll back. Compensating transactions provide eventual consistency by executing semantic inverses of completed steps.
21.8.2 Saga Pattern for Agent Actions#
A saga is a sequence of transactions with compensating counterparts:

$$S = \langle (T_1, C_1),\ (T_2, C_2),\ \ldots,\ (T_n, C_n) \rangle$$

where $T_i$ is the forward transaction and $C_i$ is its compensating transaction.

Forward execution:

$$T_1 \to T_2 \to \cdots \to T_n$$

Backward recovery after failure at $T_i$:

$$C_{i-1} \to C_{i-2} \to \cdots \to C_1$$

The saga invariant requires:

$$C_i \circ T_i \approx \text{id}$$

where $\approx$ denotes semantic equivalence (exact state reversal may be impossible for irreversible operations).
21.8.3 Compensation Classification#
| Action Type | Compensation Type | Example |
|---|---|---|
| Exactly Reversible | Exact inverse | File create → file delete |
| Approximately Reversible | Best-effort undo | Code commit → revert commit (history preserved) |
| Compensable Only | Counter-action, not undo | Sent notification → send correction notification |
| Irreversible | No compensation possible | External API call with permanent effect |
For irreversible actions, the system must either:
- Gate execution with human approval before the irreversible step.
- Accept the risk and document it in the saga's compensation plan.
- Simulate the action in a sandbox before committing to production.
21.8.4 Saga Coordinator State Machine#
The saga coordinator tracks the execution state:

$$\text{phase} \in \{\text{FORWARD},\ \text{COMPENSATING},\ \text{COMPLETED},\ \text{COMPENSATED},\ \text{COMPENSATION\_INCOMPLETE}\}$$

Each step record carries a step identifier, a status in {PENDING, COMMITTED, FAILED, COMPENSATED, COMPENSATION_FAILED}, an idempotency key, and the step's result or error.
21.8.5 Pseudo-Algorithm: Saga Coordinator#
ALGORITHM SagaCoordinator
INPUT:
saga: SagaDefinition // List of (transaction, compensation) pairs
context: ExecutionContext
OUTPUT:
saga_result: SagaResult
STATE:
step_records ← []
phase ← FORWARD
BEGIN:
// ─── Forward Execution ───
FOR i ← 0 TO LEN(saga.steps) - 1 DO
(transaction, compensation) ← saga.steps[i]
idempotency_key ← DERIVE_KEY(saga.id, i)
record ← StepRecord(
step_id=i,
status=PENDING,
idempotency_key=idempotency_key
)
TRY:
// Pre-flight check for irreversible actions
IF transaction.reversibility = IRREVERSIBLE THEN
IF NOT context.has_approval(transaction) THEN
approval ← REQUEST_HUMAN_APPROVAL(transaction)
IF NOT approval.granted THEN
RAISE SagaAbortedError("approval_denied", step=i)
END IF
END IF
END IF
result ← IDEMPOTENT_EXECUTE(transaction, idempotency_key)
record.status ← COMMITTED
record.result ← result
APPEND step_records, record
// Persist saga state (WAL)
PERSIST_SAGA_STATE(saga.id, step_records, phase=FORWARD)
CATCH error:
record.status ← FAILED
record.error ← error
APPEND step_records, record
PERSIST_SAGA_STATE(saga.id, step_records, phase=FORWARD)
// Enter compensation phase
phase ← COMPENSATING
BREAK
END TRY
END FOR
IF phase = FORWARD THEN
// All steps succeeded
PERSIST_SAGA_STATE(saga.id, step_records, phase=COMPLETED)
RETURN SagaResult(status=COMPLETED, steps=step_records)
END IF
// ─── Compensation Phase ───
compensation_failures ← []
FOR i ← LEN(step_records) - 2 DOWNTO 0 DO
// Compensate all committed steps before the failed one
record ← step_records[i]
IF record.status ≠ COMMITTED THEN
CONTINUE
END IF
(_, compensation) ← saga.steps[i]
comp_key ← DERIVE_KEY(saga.id, i, "compensate")
IF compensation IS NULL THEN
// No compensation defined (irreversible)
APPEND compensation_failures, CompensationFailure(
step=i,
reason="no_compensation_defined"
)
CONTINUE
END IF
TRY:
IDEMPOTENT_EXECUTE(compensation, comp_key)
record.status ← COMPENSATED
PERSIST_SAGA_STATE(saga.id, step_records, phase=COMPENSATING)
CATCH comp_error:
record.status ← COMPENSATION_FAILED
record.compensation_error ← comp_error
APPEND compensation_failures, CompensationFailure(
step=i,
error=comp_error
)
PERSIST_SAGA_STATE(saga.id, step_records, phase=COMPENSATING)
// Compensation failure is critical: alert operator
ALERT_OPERATOR("saga_compensation_failed", {
saga_id: saga.id,
step: i,
error: comp_error
})
END TRY
END FOR
final_status ← IF LEN(compensation_failures) = 0 THEN COMPENSATED
ELSE COMPENSATION_INCOMPLETE
PERSIST_SAGA_STATE(saga.id, step_records, phase=final_status)
RETURN SagaResult(
status=final_status,
steps=step_records,
compensation_failures=compensation_failures
)
END
21.8.6 Saga Consistency Guarantees#
The saga pattern provides eventual consistency, not ACID atomicity. The guarantee is that one of two outcomes holds:

$$T_1, T_2, \ldots, T_n \ \ \text{(all committed)} \qquad \text{or} \qquad T_1, \ldots, T_{i-1},\ C_{i-1}, \ldots, C_1 \ \ \text{(committed prefix compensated)}$$

with the caveat that compensation may itself fail, requiring compensating compensation or human intervention. The saga coordinator must itself be crash-recoverable via WAL, which is why saga state is persisted after every step.
21.9 Crash Recovery: Checkpointed State, Write-Ahead Logs, and Deterministic Replay#
21.9.1 Crash Recovery Architecture#
Crash recovery ensures that the agent system can resume from a consistent state after any unplanned termination. The architecture combines three mechanisms: (1) periodic checkpoints of session state, (2) a write-ahead log of every mutation since the last checkpoint, and (3) deterministic replay of logged operations against the restored checkpoint.
21.9.2 Write-Ahead Log (WAL) Specification#
The WAL is an append-only, durably persisted log of all state mutations:
Each entry carries a monotonically increasing sequence number, the owning session ID, an operation type (e.g., STATE_MUTATION, TOOL_INVOCATION_INTENT, COMPLETION, SAGA_STEP), an opaque payload, an idempotency key where applicable, and a checksum.
Durability guarantee: every entry is fsync'd to persistent storage before the mutation is applied to in-memory state.
Integrity: each entry's checksum covers the previous entry's checksum, forming a hash chain:

c_i = H(c_{i−1} ‖ e_i)

so truncation or in-place modification anywhere before the tail is detectable during replay.
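A minimal sketch of the hash-chained checksum scheme, assuming SHA-256 and JSON-serializable entries (the function names and the `"genesis"` sentinel are illustrative, not part of the chapter's specification):

```python
import hashlib
import json

def chain_checksum(prev_checksum: str, entry: dict) -> str:
    """Checksum covering the previous link's checksum plus this entry."""
    payload = prev_checksum + json.dumps(entry, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def append_entry(wal: list, entry: dict) -> None:
    """Append an entry, chaining it to the current tail checksum."""
    prev = wal[-1]["checksum"] if wal else "genesis"
    wal.append({"entry": entry, "checksum": chain_checksum(prev, entry)})

def verify_wal_chain(wal: list) -> int:
    """Return the index of the first broken link, or len(wal) if intact.

    Corresponds to VERIFY_WAL_CHAIN / TRUNCATE_AT_BREAK in the recovery
    protocol: recovery replays entries [0, break_index).
    """
    prev = "genesis"
    for i, record in enumerate(wal):
        if record["checksum"] != chain_checksum(prev, record["entry"]):
            return i
        prev = record["checksum"]
    return len(wal)

wal = []
append_entry(wal, {"op": "STATE_MUTATION", "seq": 1})
append_entry(wal, {"op": "TOOL_INVOCATION_INTENT", "seq": 2})
assert verify_wal_chain(wal) == 2      # chain intact
wal[0]["entry"]["seq"] = 99            # tamper with an earlier entry
assert verify_wal_chain(wal) == 0      # break detected at index 0
```

A production WAL would checksum the serialized bytes rather than a re-encoded dict, but the chaining property, that modifying any entry invalidates every later link, is the same.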
21.9.3 Recovery Protocol#
ALGORITHM CrashRecovery
INPUT:
wal: WriteAheadLog
checkpoint_store: CheckpointStore
session_registry: SessionRegistry
OUTPUT:
recovered_sessions: List<Session>
BEGIN:
recovered ← []
// ─── Step 1: Identify sessions needing recovery ───
active_sessions ← session_registry.sessions_in_state(
{ACTIVE, RESUMED, DISPATCHED}
)
FOR EACH session_id IN active_sessions DO
// ─── Step 2: Find latest checkpoint ───
checkpoint ← checkpoint_store.latest_consistent(session_id)
IF checkpoint IS NULL THEN
WARN("no_checkpoint_found", session_id)
MARK_SESSION_FAILED(session_id, "no_recovery_checkpoint")
CONTINUE
END IF
// ─── Step 3: Validate checkpoint integrity ───
IF NOT VERIFY_CHECKSUM(checkpoint) THEN
// Try previous checkpoint
checkpoint ← checkpoint_store.latest_consistent_before(
session_id, checkpoint.seq
)
IF checkpoint IS NULL THEN
MARK_SESSION_FAILED(session_id, "all_checkpoints_corrupt")
CONTINUE
END IF
END IF
// ─── Step 4: Replay WAL from checkpoint ───
state ← DESERIALIZE(checkpoint.state)
wal_entries ← wal.entries_after(session_id, checkpoint.wal_seq)
// Verify WAL chain integrity
IF NOT VERIFY_WAL_CHAIN(wal_entries) THEN
WARN("wal_chain_broken", session_id)
// Recover up to the break point
wal_entries ← TRUNCATE_AT_BREAK(wal_entries)
END IF
FOR EACH entry IN wal_entries DO
MATCH entry.op_type:
STATE_MUTATION:
state ← APPLY_MUTATION(state, entry)
TOOL_INVOCATION_INTENT:
// Check if completed
completion ← FIND_COMPLETION(wal_entries, entry.idempotency_key)
IF completion IS NOT NULL THEN
state ← APPLY_COMPLETION(state, completion)
ELSE
// In-flight at crash: verify via idempotency key
external_result ← CHECK_IDEMPOTENT_RESULT(
entry.tool, entry.idempotency_key
)
IF external_result IS NOT NULL THEN
state ← APPLY_COMPLETION(state, external_result)
wal.append(COMPLETION_ENTRY(entry, external_result))
ELSE
// Action did not execute or result unknown
state ← MARK_ACTION_PENDING(state, entry)
END IF
END IF
SAGA_STEP:
// Handled by saga coordinator recovery
SAGA_RECOVERY_QUEUE.enqueue(entry)
END FOR
// ─── Step 5: Reconstruct session ───
session ← RECONSTRUCT_SESSION(session_id, state)
session.lifecycle_phase ← RESUMED
APPEND recovered, session
EMIT_METRIC("crash_recovery.session_recovered", {
session_id: session_id,
checkpoint_seq: checkpoint.seq,
wal_entries_replayed: LEN(wal_entries),
pending_actions: COUNT_PENDING(state)
})
END FOR
// ─── Step 6: Recover sagas ───
FOR EACH saga_entry IN SAGA_RECOVERY_QUEUE DO
SAGA_COORDINATOR.recover(saga_entry)
END FOR
RETURN recovered
END
21.9.4 Recovery Time Objective (RTO) Analysis#
The total recovery time is:

T_RTO = T_detect + T_load + T_replay

where:

- T_detect is the time to detect the crash and initiate recovery,
- T_load is the time to locate, verify, and deserialize the latest consistent checkpoint,
- T_replay ≈ N_wal · t_apply is the time to replay the N_wal WAL entries written since that checkpoint, at an average of t_apply per entry.

Minimizing N_wal (through frequent checkpointing) directly reduces RTO: with mutation rate λ and checkpoint interval Δ, the expected backlog at a uniformly random crash time is E[N_wal] = λΔ/2.

The checkpoint frequency is therefore an RTO–I/O cost trade-off. With per-checkpoint cost c_ckpt, the combined cost rate is

J(Δ) = c_ckpt / Δ + (λΔ / 2) · t_apply

yielding, by setting dJ/dΔ = 0, the familiar square-root law for the optimal interval:

Δ* = √(2 c_ckpt / (λ t_apply))
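The checkpoint-interval trade-off can be computed directly. A sketch under the assumptions stated in the analysis: c_ckpt is the cost of writing one checkpoint, λ the WAL entry rate, and t_apply the per-entry replay time (all symbol names here are illustrative):

```python
import math

def optimal_checkpoint_interval(c_ckpt: float, rate: float, t_apply: float) -> float:
    """Square-root-law interval minimizing checkpoint overhead
    plus expected replay time: sqrt(2 * c_ckpt / (rate * t_apply)).

    c_ckpt  -- cost of writing one checkpoint (seconds)
    rate    -- WAL entries written per second (lambda)
    t_apply -- time to replay one WAL entry (seconds)
    """
    return math.sqrt(2.0 * c_ckpt / (rate * t_apply))

def expected_replay_time(interval: float, rate: float, t_apply: float) -> float:
    """Expected replay work at a uniformly random crash time:
    E[N_wal] * t_apply = (rate * interval / 2) * t_apply."""
    return rate * interval / 2.0 * t_apply

# e.g. checkpoints cost 2 s, 50 mutations/s, 5 ms to replay each entry:
delta = optimal_checkpoint_interval(2.0, 50.0, 0.005)
assert abs(delta - 4.0) < 1e-9                              # checkpoint every 4 s
assert abs(expected_replay_time(delta, 50.0, 0.005) - 0.5) < 1e-9  # ~0.5 s replay
```

At the optimum, checkpoint overhead per unit time and expected replay cost are balanced; checkpointing more often than Δ* pays more in I/O than it saves in RTO.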
21.10 Chaos Engineering for Agents: Fault Injection, Latency Injection, and Resource Starvation Testing#
21.10.1 Chaos Engineering Principles for Agentic Systems#
Chaos engineering proactively identifies failure modes by intentionally injecting faults into the system under controlled conditions. For agentic systems, this extends beyond infrastructure chaos to include semantic chaos: injecting hallucinations, degraded retrieval, contradictory evidence, and adversarial tool responses.
21.10.2 Fault Injection Taxonomy#
| Injection Type | Target | Mechanism | Purpose |
|---|---|---|---|
| Network Partition | Inter-service communication | Drop/delay packets between agent and tool servers | Test circuit breaker and retry behavior |
| Latency Injection | Any RPC call | Add configurable delay to tool invocations, retrieval, or LLM calls | Test deadline propagation and timeout handling |
| Error Injection | Tool responses | Return error codes for configured fraction of requests | Test retry budgets and error classification |
| Resource Starvation | Token budgets, memory, CPU | Artificially constrain available resources | Test graceful degradation and load shedding |
| State Corruption | Session state, cache | Inject invalid data into cache or corrupt checkpoint | Test integrity verification and fallback |
| Model Degradation | LLM inference | Route to a deliberately poor model or inject noise | Test quality gates and repair loops |
| Tool Unavailability | MCP tool servers | Shut down tool servers or return 503 | Test tool substitution and degraded mode |
| Concurrent Conflict | Shared state | Inject conflicting concurrent mutations | Test isolation and conflict resolution |
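Latency and error injection from the taxonomy above can be implemented as a thin wrapper around any tool call. A minimal sketch (the `FaultInjector` class and its parameters are illustrative; a seeded RNG keeps experiments reproducible):

```python
import random
import time
from typing import Any, Callable, Optional

class FaultInjector:
    """Wraps a callable with probabilistic error and latency injection.

    error_rate    -- fraction of calls that raise an injected error
    added_latency -- fixed delay (seconds) added to every call
    """
    def __init__(self, fn: Callable[..., Any], error_rate: float = 0.0,
                 added_latency: float = 0.0,
                 rng: Optional[random.Random] = None):
        self.fn = fn
        self.error_rate = error_rate
        self.added_latency = added_latency
        self.rng = rng or random.Random()

    def __call__(self, *args: Any, **kwargs: Any) -> Any:
        if self.added_latency > 0:
            time.sleep(self.added_latency)       # latency injection
        if self.rng.random() < self.error_rate:
            raise RuntimeError("injected fault")  # error injection
        return self.fn(*args, **kwargs)

# Roughly 30% of calls fail, exercising retry budgets downstream.
flaky = FaultInjector(lambda x: x * 2, error_rate=0.3, rng=random.Random(7))
errors = 0
for i in range(100):
    try:
        flaky(i)
    except RuntimeError:
        errors += 1
assert 15 <= errors <= 45  # near the configured 30% rate
```

In practice the wrapper would be installed at the RPC boundary by `INSTALL_FAULT_INJECTION` and removed by handle, so the system under test is unaware it is being exercised.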
21.10.3 Experiment Specification#
A chaos experiment is formally specified as a tuple:

E = (hypothesis, scope, injection, duration, blast_radius_limit, abort_conditions, metrics)

where:
- Hypothesis: "The system maintains its P95 latency target when 30% of retrieval requests fail."
- Scope: Specific service, region, or traffic percentage.
- Injection: Fault type and parameters.
- Duration: Fixed time window or event count.
- Blast radius limit: Maximum percentage of traffic affected.
- Abort conditions: Safety criteria that trigger immediate experiment termination.
- Metrics: Observable quantities measured during the experiment.
21.10.4 Chaos Experiment Safety Envelope#
The experiment operates within a safety envelope: a conjunction of guard conditions over live metrics (e.g., error rate below a hard ceiling, P99 latency below a hard ceiling, availability above a floor), evaluated continuously throughout the run.
If any condition is violated, the experiment is automatically aborted and all injections are removed.
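The envelope check used by `CHECK_SAFETY_ENVELOPE` can be sketched as a pure function over current metrics. Assumed here: the envelope is a map from metric name to a bound type and limit, and a missing metric fails closed (all names are illustrative):

```python
def check_safety_envelope(metrics: dict, envelope: dict) -> list:
    """Return the list of violated guard conditions (empty list = safe).

    envelope maps metric name -> (bound_type, limit), e.g.
    {"error_rate": ("max", 0.05), "availability": ("min", 0.99)}.
    """
    violations = []
    for name, (bound, limit) in envelope.items():
        value = metrics.get(name)
        if value is None:
            violations.append(f"{name}: missing")  # fail closed on gaps
        elif bound == "max" and value > limit:
            violations.append(f"{name}: {value} > {limit}")
        elif bound == "min" and value < limit:
            violations.append(f"{name}: {value} < {limit}")
    return violations

envelope = {"error_rate": ("max", 0.05), "p95_latency_s": ("max", 2.0)}
assert check_safety_envelope({"error_rate": 0.01, "p95_latency_s": 1.2}, envelope) == []
assert check_safety_envelope({"error_rate": 0.09, "p95_latency_s": 1.2}, envelope) \
       == ["error_rate: 0.09 > 0.05"]
```

Returning the full violation list rather than a boolean gives the abort path a concrete reason to record in the experiment report.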
21.10.5 Pseudo-Algorithm: Chaos Experiment Runner#
ALGORITHM ChaosExperimentRunner
INPUT:
experiment: ChaosExperiment
system: SystemUnderTest
safety_config: SafetyEnvelope
config: RunnerConfig
OUTPUT:
report: ChaosExperimentReport
BEGIN:
// ─── Pre-experiment validation ───
baseline_metrics ← COLLECT_BASELINE_METRICS(system, duration=config.baseline_window)
IF NOT VALIDATE_SYSTEM_HEALTHY(baseline_metrics) THEN
RETURN ChaosExperimentReport(status=ABORTED, reason="system_unhealthy_before_start")
END IF
// ─── Install fault injection ───
injection_handle ← INSTALL_FAULT_INJECTION(experiment.injection, experiment.scope)
experiment_start ← NOW()
metrics_during ← []
abort_triggered ← FALSE
// ─── Monitor during experiment ───
WHILE NOW() - experiment_start < experiment.duration DO
SLEEP(config.monitor_interval)
current_metrics ← COLLECT_METRICS(system)
APPEND metrics_during, current_metrics
// Safety envelope check
IF NOT CHECK_SAFETY_ENVELOPE(current_metrics, safety_config) THEN
EMIT_ALERT("chaos.safety_violation", experiment.id)
abort_triggered ← TRUE
BREAK
END IF
// Blast radius check
IF MEASURE_BLAST_RADIUS(current_metrics) > experiment.blast_radius_limit THEN
EMIT_ALERT("chaos.blast_radius_exceeded", experiment.id)
abort_triggered ← TRUE
BREAK
END IF
END WHILE
// ─── Remove fault injection ───
REMOVE_FAULT_INJECTION(injection_handle)
// ─── Post-experiment recovery verification ───
WAIT(config.recovery_observation_window)
recovery_metrics ← COLLECT_METRICS(system)
// ─── Analyze results ───
hypothesis_validated ← EVALUATE_HYPOTHESIS(
experiment.hypothesis,
baseline_metrics,
metrics_during,
recovery_metrics
)
report ← ChaosExperimentReport(
experiment_id=experiment.id,
status=IF abort_triggered THEN ABORTED ELSE COMPLETED,
hypothesis_validated=hypothesis_validated,
baseline_metrics=baseline_metrics,
during_metrics=AGGREGATE(metrics_during),
recovery_metrics=recovery_metrics,
abort_triggered=abort_triggered,
abort_reason=IF abort_triggered THEN IDENTIFY_VIOLATION() ELSE NULL,
duration_actual=NOW() - experiment_start,
recommendations=GENERATE_RECOMMENDATIONS(
hypothesis_validated, metrics_during, baseline_metrics
)
)
PERSIST_EXPERIMENT_REPORT(report)
// ─── Feed into evaluation infrastructure ───
IF NOT hypothesis_validated THEN
CREATE_IMPROVEMENT_TASK(report)
UPDATE_RUNBOOK(experiment.injection.type, report)
END IF
RETURN report
END
21.11 Operational Runbooks: Automated Incident Response, Escalation, and Post-Mortem Integration#
21.11.1 Runbook as Executable Policy#
An operational runbook is not a static document. It is an executable policy that maps observed symptoms to diagnostic steps, automated remediation actions, and escalation paths. Formally, a runbook is a rule set R = {r_1, …, r_n}, where each entry is a response rule r = (condition, actions, escalation, priority): condition is a predicate over current metrics and alert context, actions is an ordered list of remediation steps, escalation names the path taken if automated remediation fails, and priority selects among multiple matching rules.
21.11.2 Automated Incident Response Pipeline#
ALGORITHM AutomatedIncidentResponse
INPUT:
alert: Alert
runbook: Runbook
system: SystemState
OUTPUT:
incident_record: IncidentRecord
BEGIN:
incident ← CREATE_INCIDENT(alert, severity=alert.severity)
// ─── Step 1: Classify incident ───
matched_rules ← []
FOR EACH rule IN runbook.rules DO
IF rule.condition(system.metrics, alert.context) THEN
APPEND matched_rules, rule
END IF
END FOR
IF LEN(matched_rules) = 0 THEN
// Unknown incident type
ESCALATE_TO_ONCALL(incident, reason="no_matching_runbook")
RETURN incident
END IF
// Select highest-priority matching rule
rule ← MAX(matched_rules, key=λr: r.priority)
// ─── Step 2: Execute automated remediation ───
FOR EACH action IN rule.actions DO
IF action.requires_approval AND NOT AUTO_APPROVE(action, incident.severity) THEN
approval ← REQUEST_APPROVAL(action, timeout=rule.approval_timeout)
IF NOT approval.granted THEN
CONTINUE // Skip this action
END IF
END IF
TRY:
result ← EXECUTE_REMEDIATION(action, system)
incident.add_action_record(action, result)
EMIT_TRACE("incident.remediation_executed", action.name)
// Check if remediation resolved the issue
WAIT(rule.verification_delay)
IF CHECK_RESOLUTION(alert, system) THEN
incident.status ← RESOLVED
incident.resolution ← action.name
BREAK
END IF
CATCH error:
incident.add_action_record(action, error)
EMIT_ALERT("incident.remediation_failed", action.name, error)
END TRY
END FOR
// ─── Step 3: Escalation if not resolved ───
IF incident.status ≠ RESOLVED THEN
MATCH rule.escalation.policy:
HUMAN_ONCALL:
PAGE_ONCALL(incident, rule.escalation.team)
MANAGEMENT:
NOTIFY_MANAGEMENT(incident)
EXTERNAL_VENDOR:
OPEN_SUPPORT_TICKET(incident, rule.escalation.vendor)
incident.status ← ESCALATED
END IF
// ─── Step 4: Persist and schedule post-mortem ───
PERSIST_INCIDENT(incident)
IF incident.severity ≤ SEVERITY_HIGH THEN  // severities ordered most-severe first
SCHEDULE_POST_MORTEM(incident, within=config.post_mortem_deadline)
END IF
RETURN incident
END
21.11.3 Post-Mortem Integration#
Post-mortems produce actionable artifacts that feed back into the system: new or refined runbook rules, new chaos experiments that reproduce the failure, regression tests, and adjusted alert thresholds.
Each post-mortem output is tracked as a work item with an owner, deadline, and verification criterion. The closure criterion for a post-mortem action item is that its verification criterion passes, typically a chaos experiment or regression test demonstrating that the original failure mode no longer recurs.
21.12 SLA Definition and Enforcement: Availability, Latency P50/P95/P99, Error Budget, and Burn Rate#
21.12.1 SLA Specification#
The Service Level Agreement (SLA) for an agentic platform is defined as a set of Service Level Objectives (SLOs), each with a Service Level Indicator (SLI) and a target: SLA = {(SLI_i, target_i, window_i)}, where each SLI is a measurable ratio or percentile, the target is the threshold it must satisfy, and the window is the period over which it is evaluated.
21.12.2 Core SLOs for Agentic Systems#
| SLO | SLI Definition | Target | Window |
|---|---|---|---|
| Availability | successful requests / total requests | ≥ 99.9% | 30 days |
| Latency P50 | 50th percentile response time | deployment-specific | 30 days |
| Latency P95 | 95th percentile response time | deployment-specific | 30 days |
| Latency P99 | 99th percentile response time | deployment-specific | 30 days |
| Correctness | fraction of responses passing quality gates | deployment-specific | 30 days |
| Error Rate | failed requests / total requests | deployment-specific | 30 days |
| Throughput | requests per second at target latency | deployment-specific | 1 hour |
21.12.3 Error Budget#
The error budget is the allowed failure margin:

B = (1 − target) × N

where N is the total number of requests in the SLO window. For 99.9% availability over 30 days with 1M requests:

B = (1 − 0.999) × 10⁶ = 1,000 failed requests

The remaining error budget at time t within the window is:

B_remaining(t) = B − F(t)

where F(t) is the cumulative number of failed requests from the window start through t.
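The budget arithmetic is trivial but worth making executable so dashboards and tests agree on it. A sketch (function names are illustrative):

```python
def error_budget(target: float, total_requests: int) -> int:
    """Allowed number of failed requests in the window: (1 - target) * N."""
    return round((1.0 - target) * total_requests)

def budget_remaining(target: float, total_requests: int, failures: int) -> int:
    """Budget left after the observed failures; negative means SLO breach."""
    return error_budget(target, total_requests) - failures

# 99.9% availability over 1M requests leaves a budget of 1,000 failures:
assert error_budget(0.999, 1_000_000) == 1000
assert budget_remaining(0.999, 1_000_000, 250) == 750
assert budget_remaining(0.999, 1_000_000, 1200) == -200  # SLO breached
```

Using `round` rather than truncation avoids floating-point artifacts such as `(1.0 - 0.999) * 1e6` landing a hair off an integer.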
21.12.4 Burn Rate#
The burn rate measures how quickly the error budget is being consumed:

burn(w) = (failures(w) / requests(w)) / (1 − target)

i.e., the observed error rate over window w divided by the budgeted error rate. A burn rate of 1.0 means the budget is consumed at exactly the planned rate. A burn rate greater than 1.0 means the budget will be exhausted before the window ends.

Time to budget exhaustion at the current burn rate:

t_exhaust = B_remaining / (burn × (1 − target) × r)

where r is the current request rate.
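Both quantities follow directly from the definitions above. A sketch, assuming a 99.9% availability target (the function names and example numbers are illustrative):

```python
def burn_rate(failures: int, requests: int, target: float) -> float:
    """Observed error rate over a window divided by the budgeted rate."""
    observed = failures / requests
    budgeted = 1.0 - target
    return observed / budgeted

def hours_to_exhaustion(budget_remaining: float, burn: float,
                        target: float, request_rate_per_hour: float) -> float:
    """Time until the remaining budget is gone at the current burn rate."""
    failures_per_hour = burn * (1.0 - target) * request_rate_per_hour
    return budget_remaining / failures_per_hour

# 3% of requests failing against a 99.9% target is a 30x burn:
b = burn_rate(failures=30, requests=1000, target=0.999)
assert abs(b - 30.0) < 1e-9
# With 800 budget units left at 10,000 req/h, exhaustion in 800/300 hours:
t = hours_to_exhaustion(800, b, 0.999, 10_000)
assert abs(t - 800 / 300) < 1e-9
```

Exhaustion time, not burn rate alone, is what determines whether an incident warrants a page or a ticket, which motivates the multi-window thresholds below.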
21.12.5 Multi-Window Burn Rate Alerts#
To balance between catching fast-burning incidents and slow degradation, use multi-window burn rate alerts:
| Alert Severity | Fast Window | Slow Window | Burn Rate Threshold | Budget Consumed |
|---|---|---|---|---|
| Page (Critical) | 5 min | 1 hour | 14.4× | 2% in 1 hour |
| Page (High) | 30 min | 6 hours | 6× | 5% in 6 hours |
| Ticket (Medium) | 2 hours | 1 day | 3× | 10% in 1 day |
| Ticket (Low) | 6 hours | 3 days | 1× | 100% in 30 days |
An alert fires when the burn rate exceeds the threshold in both the fast and slow windows simultaneously, preventing false alarms from transient spikes.
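The two-window AND condition is a one-liner, but its behavior is worth pinning down with examples. A sketch (function name illustrative):

```python
def should_alert(burn_fast: float, burn_slow: float, threshold: float) -> bool:
    """Fire only when both the fast and slow windows exceed the threshold.

    The fast window gives quick detection; requiring the slow window too
    suppresses false alarms from short transient spikes.
    """
    return burn_fast > threshold and burn_slow > threshold

# A 20x spike confined to the fast window does not page:
assert not should_alert(burn_fast=20.0, burn_slow=4.0, threshold=14.4)
# A sustained 15x burn visible in both windows does:
assert should_alert(burn_fast=15.0, burn_slow=15.0, threshold=14.4)
# A decaying incident (slow window still hot, fast window recovered) stops paging:
assert not should_alert(burn_fast=2.0, burn_slow=15.0, threshold=14.4)
```

The third case is the reset property: once the fast window recovers, the alert clears even though the slow window has not yet drained, shortening alert-fatigue tails after remediation.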
21.12.6 SLA Enforcement Mechanism#
Enforcement couples the error budget to operational controls: as the burn rate rises, the system escalates from normal operation through increased monitoring and deployment freezes to active degradation and incident response, and de-escalates only after a stability period.
21.12.7 Pseudo-Algorithm: SLA Monitor and Enforcer#
ALGORITHM SLAMonitorAndEnforcer
INPUT:
slos: List<SLO>
metrics_stream: MetricsStream
config: SLAConfig
STATE:
error_budgets: Map<SLO_ID, ErrorBudget>
burn_rates: Map<SLO_ID, BurnRateTracker>
enforcement_level: EnforcementLevel ← NORMAL_OPERATION
// Runs continuously
METHOD monitor():
LOOP EVERY config.evaluation_interval DO
FOR EACH slo IN slos DO
// ─── Compute SLI ───
sli_value ← COMPUTE_SLI(slo, metrics_stream, slo.window)
// ─── Update error budget ───
total_events ← metrics_stream.count(slo.window)
budget_total ← (1 - slo.target) * total_events
budget_consumed ← total_events * (1 - sli_value)
budget_remaining ← budget_total - budget_consumed
error_budgets[slo.id] ← ErrorBudget(
total=budget_total,
consumed=budget_consumed,
remaining=budget_remaining,
remaining_fraction=budget_remaining / MAX(budget_total, 1)
)
// ─── Compute burn rates across multiple windows ───
FOR EACH (alert_level, fast_window, slow_window, threshold) IN config.burn_rate_windows DO
burn_fast ← COMPUTE_BURN_RATE(slo, metrics_stream, fast_window)
burn_slow ← COMPUTE_BURN_RATE(slo, metrics_stream, slow_window)
IF burn_fast > threshold AND burn_slow > threshold THEN
FIRE_ALERT(alert_level, slo, burn_fast, burn_slow, budget_remaining)
END IF
END FOR
// ─── Determine enforcement level ───
max_burn ← MAX(COMPUTE_BURN_RATE(slo, metrics_stream, 1h) FOR slo IN slos)
new_level ← MATCH max_burn:
< 1.0 → NORMAL_OPERATION
< 3.0 → INCREASED_MONITORING
< 6.0 → FREEZE_DEPLOYMENTS
< 14.4 → ACTIVATE_DEGRADATION
≥ 14.4 → INCIDENT_RESPONSE
IF new_level > enforcement_level THEN
// Escalate immediately
enforcement_level ← new_level
EXECUTE_ENFORCEMENT(enforcement_level)
ELSE IF new_level < enforcement_level THEN
// De-escalate only after stability period
IF STABLE_FOR(new_level, config.stability_window) THEN
enforcement_level ← new_level
EXECUTE_ENFORCEMENT(enforcement_level)
END IF
END IF
// ─── Emit dashboard metrics ───
EMIT_METRIC("sla.sli", {slo: slo.id, value: sli_value})
EMIT_METRIC("sla.error_budget_remaining",
{slo: slo.id, value: budget_remaining})
EMIT_METRIC("sla.burn_rate_1h",
{slo: slo.id, value: COMPUTE_BURN_RATE(slo, metrics_stream, 1h)})
END FOR
END LOOP
METHOD EXECUTE_ENFORCEMENT(level):
MATCH level:
NORMAL_OPERATION:
SET_DEGRADATION_LEVEL(C_0)
UNFREEZE_DEPLOYMENTS()
INCREASED_MONITORING:
INCREASE_MONITORING_FREQUENCY(2x)
NOTIFY_TEAM("sla_burn_rate_elevated")
FREEZE_DEPLOYMENTS:
FREEZE_DEPLOYMENT_PIPELINE()
NOTIFY_TEAM("deployments_frozen_sla_risk")
ACTIVATE_DEGRADATION:
SET_DEGRADATION_LEVEL(C_1) // Reduce verification overhead
FREEZE_DEPLOYMENT_PIPELINE()
NOTIFY_ONCALL("degradation_activated")
INCIDENT_RESPONSE:
SET_DEGRADATION_LEVEL(C_2) // Fallback model
FREEZE_DEPLOYMENT_PIPELINE()
PAGE_ONCALL("sla_critical_burn_rate")
CREATE_INCIDENT(severity=CRITICAL)
END
21.12.8 SLA Reporting and Accountability#
Monthly SLA reports are auto-generated from the same metrics stream that drives enforcement. Each report includes:
- Achieved SLI vs. target for each SLO.
- Error budget consumed and remaining.
- Top contributing incidents to budget consumption.
- Burn rate trends over the period.
- Recommendations from chaos experiments and post-mortems.
Synthesis: Fault Tolerance as an Architectural Discipline#
The fault tolerance architecture presented in this chapter treats failures not as exceptions but as first-class system events with typed classifications, bounded recovery procedures, and measurable impact on service-level guarantees. The key architectural contributions:
| Principle | Mechanism | Guarantee |
|---|---|---|
| Typed failure classification | 5-class taxonomy with automated classification engine | Correct recovery strategy selection |
| Bounded retry | Exponential backoff with jitter, layered budgets, idempotency keys | No unbounded resource consumption; at-most-once semantics |
| Circuit isolation | Circuit breakers with adaptive recovery probing | Failing dependencies do not cascade |
| Resource partitioning | Bulkhead isolation with elastic borrowing | Cross-concern failure containment |
| Deadline discipline | Propagated deadlines, optimal budget allocation, preemptive cancellation | Latency SLA compliance under composition |
| Admission governance | Token-bucket rate limits, composite load signals, priority-based shedding | System stability under overload |
| Graceful degradation | 5-level capability hierarchy, health-driven feature flags, hysteresis | Continued service at reduced capability |
| Compensating transactions | Saga pattern with WAL-persisted coordination | Eventually consistent rollback of multi-step mutations |
| Crash recovery | Checkpoint + WAL replay with idempotent re-execution | Resumable sessions after unplanned termination |
| Proactive validation | Chaos engineering with safety envelopes | Discovery of unknown failure modes before production incidents |
| Operational automation | Executable runbooks, post-mortem → improvement pipeline | Reduced MTTR, institutional learning |
| SLA enforcement | Multi-window burn rate monitoring, automated enforcement escalation | Measurable, enforceable service guarantees |
A system that cannot quantify its failure tolerance cannot claim to be production-grade. The architecture in this chapter ensures every failure mode is classified, every recovery is bounded, every degradation is transparent, and every service guarantee is continuously measured against an explicit error budget. This is the engineering standard required for agentic AI systems operating at sustained enterprise scale.
End of Chapter 21.