Preamble#
Agentic AI systems operate at the intersection of stochastic inference, distributed tool execution, external API dependencies, and human-in-the-loop governance. Every one of these boundaries is a failure surface. A system that cannot tolerate failures deterministically is not a production system—it is a demonstration. This chapter formalizes fault tolerance for agentic platforms with the same rigor applied to avionics, financial trading systems, and distributed databases. Every failure mode is taxonomized, every mitigation is mathematically characterized, every recovery protocol is specified as a bounded, instrumented, auditable procedure. The objective is not merely to survive failures but to operate predictably, safely, and cost-efficiently through them—maintaining measurable service-level guarantees while preserving the correctness invariants on which agentic reliability depends.
21.1 Failure Taxonomy: Transient, Persistent, Cascading, Byzantine, and Semantic Failures#
21.1.1 The Necessity of Formal Taxonomy#
Effective fault tolerance demands that failures are not treated as a monolithic category. Each failure class has distinct detection signatures, propagation characteristics, recovery strategies, and cost profiles. A retry strategy appropriate for a transient network timeout is catastrophically wrong for a persistent authorization failure. A circuit breaker tuned for external API flakiness is useless against a semantic hallucination that passes all schema checks.
21.1.2 Failure Class Definitions#
Let $\mathcal{F}$ be the space of all failures. Define a classification function:

$$\kappa : \mathcal{F} \to \{\text{Transient},\ \text{Persistent},\ \text{Cascading},\ \text{Byzantine},\ \text{Semantic}\}$$
| Failure Class | Definition | Detection Signature | Recovery Strategy | Example |
|---|---|---|---|---|
| Transient | Temporary condition that resolves without intervention | Succeeds on retry; error code in retriable set | Retry with backoff | Network timeout, rate limit 429, temporary overload |
| Persistent | Stable failure that will not self-resolve | Repeated identical failure across retries | Escalate, substitute, or fail | Invalid credentials, deleted resource, schema mismatch |
| Cascading | Failure in one component propagates to dependent components | Correlated failures across services within temporal window | Isolate, shed load, break dependency chain | Database overload → retrieval failure → agent stall → queue backup |
| Byzantine | Component produces incorrect results without signaling error | Output passes schema validation but is semantically wrong | Redundant verification, voting, cross-validation | LLM hallucination, corrupted cache returning stale data, tool returning wrong result silently |
| Semantic | Output is structurally valid but violates task-level correctness or safety requirements | Detected only by domain-specific verification | Critique → repair → re-verify; escalate to human | Factually incorrect answer, unsafe recommendation, logically inconsistent plan |
21.1.3 Formal Failure Model#
Define a failure event as a tuple:

$$f = (\text{component},\ \text{timestamp},\ \text{error\_code},\ \text{context})$$

The failure rate for component $c$ over window $W$ is:

$$\lambda_c(W) = \frac{\big|\{f : f.\text{component} = c,\ f.\text{timestamp} \in W\}\big|}{|W|}$$

The mean time between failures (MTBF) and mean time to recovery (MTTR) for a component are:

$$\text{MTBF}_c = \frac{1}{\lambda_c}, \qquad \text{MTTR}_c = \mathbb{E}[t_{\text{recovered}} - t_{\text{failed}}]$$

Component availability is:

$$A_c = \frac{\text{MTBF}_c}{\text{MTBF}_c + \text{MTTR}_c}$$

For a serial chain of $n$ components, system availability is:

$$A_{\text{sys}} = \prod_{i=1}^{n} A_i$$

For parallel redundancy with $n$ independent replicas:

$$A_{\text{sys}} = 1 - \prod_{i=1}^{n} (1 - A_i)$$
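As a concrete sketch, the availability relationships above translate directly into a few lines of Python (assuming independent component failures, as the parallel-redundancy formula does):

```python
def availability(mtbf: float, mttr: float) -> float:
    """Component availability: A = MTBF / (MTBF + MTTR)."""
    return mtbf / (mtbf + mttr)

def serial_availability(components: list[float]) -> float:
    """Serial chain: the product of component availabilities."""
    a = 1.0
    for x in components:
        a *= x
    return a

def parallel_availability(replicas: list[float]) -> float:
    """Independent replicas: 1 minus the probability all are down."""
    p_all_down = 1.0
    for x in replicas:
        p_all_down *= (1.0 - x)
    return 1.0 - p_all_down
```

Note how redundancy inverts the effect of composition: three 99%-available components in series yield roughly 97% availability, while two 90%-available replicas in parallel yield 99%.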
21.1.4 Cascading Failure Propagation Model#
Model the agentic system as a directed dependency graph $G = (V, E)$ where vertices are components and edges are dependencies. A failure at component $v$ propagates to $u$ if $u$ depends (directly or transitively) on $v$ and lacks isolation.

The blast radius of a failure at $v$ is:

$$\text{BR}(v) = \{u \in V : u \text{ transitively depends on } v\}$$

The cascade probability depends on the failure propagation probability $p_{ij}$ along each edge: for a dependency path $\pi$ from $v$ to $u$,

$$P(v \rightsquigarrow u \mid \pi) = \prod_{(i,j) \in \pi} p_{ij}$$

Isolation mechanisms (bulkheads, circuit breakers) reduce $p_{ij}$ toward zero on specific edges.
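The blast radius is a plain reachability computation over the dependency graph. A minimal sketch in Python, assuming the graph is encoded as an adjacency map from each component to its direct dependents (the encoding is illustrative):

```python
from collections import deque

def blast_radius(dependents: dict[str, list[str]], root: str) -> set[str]:
    """BFS over the dependents map: every component whose failure
    the root's failure can reach, excluding the root itself."""
    seen = {root}
    frontier = deque([root])
    while frontier:
        v = frontier.popleft()
        for u in dependents.get(v, ()):
            if u not in seen:
                seen.add(u)
                frontier.append(u)
    return seen - {root}
```

For the cascade example in the table (database overload → retrieval failure → agent stall → queue backup), `blast_radius({"db": ["retrieval"], "retrieval": ["agent"], "agent": ["queue"]}, "db")` returns all three downstream components.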
21.1.5 Pseudo-Algorithm: Failure Classification Engine#
ALGORITHM ClassifyFailure
INPUT:
error: ErrorEvent
history: RecentFailureHistory
component: ComponentID
OUTPUT:
classification: FailureClassification
BEGIN:
// ─── Step 1: Check retriability from error code taxonomy ───
IF error.code IN TRANSIENT_ERROR_CODES THEN
// Verify not persistent by checking history
recent_identical ← history.count(
component=component,
error_code=error.code,
window=config.persistence_window
)
IF recent_identical ≥ config.persistence_threshold THEN
RETURN FailureClassification(
class=PERSISTENT,
confidence=0.9,
evidence="repeated_identical_failure",
recommended_action=ESCALATE
)
ELSE
RETURN FailureClassification(
class=TRANSIENT,
confidence=0.85,
evidence="retriable_error_code",
recommended_action=RETRY_WITH_BACKOFF
)
END IF
END IF
// ─── Step 2: Check for cascade indicators ───
correlated_failures ← history.correlated_failures(
temporal_window=config.cascade_detection_window,
min_components=2
)
IF LEN(correlated_failures) ≥ config.cascade_threshold THEN
upstream_root ← IDENTIFY_CASCADE_ROOT(correlated_failures, dependency_graph)
RETURN FailureClassification(
class=CASCADING,
confidence=0.8,
evidence="correlated_multi_component_failure",
root_cause=upstream_root,
recommended_action=ISOLATE_AND_SHED_LOAD
)
END IF
// ─── Step 3: Check for Byzantine indicators ───
IF error.type = VERIFICATION_FAILURE AND error.schema_valid = TRUE THEN
RETURN FailureClassification(
class=BYZANTINE,
confidence=0.75,
evidence="schema_valid_but_semantically_incorrect",
recommended_action=REDUNDANT_VERIFICATION
)
END IF
// ─── Step 4: Check for semantic failures ───
IF error.type IN {HALLUCINATION, SAFETY_VIOLATION, LOGIC_ERROR,
FACTUAL_ERROR, COHERENCE_FAILURE} THEN
RETURN FailureClassification(
class=SEMANTIC,
confidence=0.85,
evidence=error.verification_details,
recommended_action=CRITIQUE_AND_REPAIR
)
END IF
// ─── Step 5: Default to persistent ───
RETURN FailureClassification(
class=PERSISTENT,
confidence=0.5,
evidence="unclassified_error",
recommended_action=FAIL_WITH_DIAGNOSTICS
)
END
21.2 Retry Engineering#
21.2.1 Exponential Backoff with Jitter: Configuration, Bounds, and Anti-Thundering-Herd#
Formal Backoff Function#
The backoff delay for retry attempt $k$ is:

$$d_k = J\big(\min(d_{\max},\ d_0 \cdot b^{k})\big)$$

where:
- $d_0$: base delay (e.g., 100ms)
- $b$: exponential base (typically $b = 2$)
- $d_{\max}$: ceiling delay to prevent unbounded waits
- $J$: jitter function to decorrelate concurrent retriers
Jitter Strategies#
| Strategy | Formula | Properties |
|---|---|---|
| Full Jitter | $d_k = U\big(0,\ \min(d_{\max},\ d_0 b^k)\big)$ | Maximum decorrelation; wide delay variance |
| Equal Jitter | $d_k = \tfrac{1}{2}\min(d_{\max},\ d_0 b^k) + U\big(0,\ \tfrac{1}{2}\min(d_{\max},\ d_0 b^k)\big)$ | Balanced: guaranteed minimum wait + jitter |
| Decorrelated Jitter | $d_k = \min\big(d_{\max},\ U(d_0,\ 3\,d_{k-1})\big)$ | Self-adapting; depends on previous delay |
The expected total wait across $n$ retries with full jitter is:

$$\mathbb{E}\Big[\sum_{k=0}^{n-1} d_k\Big] = \sum_{k=0}^{n-1} \frac{\min(d_{\max},\ d_0 b^{k})}{2}$$

For example, with $d_0 = 100$ ms, $b = 2$, $d_{\max} = 30$ s, and $n = 5$:

$$\mathbb{E}[\text{total wait}] = \frac{100 + 200 + 400 + 800 + 1600}{2}\ \text{ms} = 1.55\ \text{s}$$
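A minimal full-jitter backoff helper in Python (the 100 ms base, base-2 growth, and 30 s cap are illustrative defaults, not mandated values):

```python
import random

def backoff_delay(attempt: int, base: float = 0.1,
                  cap: float = 30.0, factor: float = 2.0) -> float:
    """Full-jitter delay in seconds: U(0, min(cap, base * factor**attempt))."""
    ceiling = min(cap, base * factor ** attempt)
    return random.uniform(0.0, ceiling)
```

Each caller draws independently from the interval, which is exactly what decorrelates a fleet of concurrent retriers.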
Anti-Thundering-Herd Analysis#
When $N$ concurrent clients retry the same failing service simultaneously, without jitter the aggregate retry rate spikes: all $N$ retries for attempt $k$ arrive in a single burst at $t = d_k$. With full jitter, the aggregate rate smooths to approximately:

$$R(t) \approx \frac{N}{\min(d_{\max},\ d_0 b^{k})}$$

effectively distributing retries uniformly over the interval $[0,\ \min(d_{\max},\ d_0 b^k)]$, eliminating synchronization.
21.2.2 Retry Budgets: Per-Request, Per-Session, and System-Wide Limits#
Unbounded retries convert transient failures into persistent resource exhaustion. The system enforces layered retry budgets:
| Level | Scope | Budget | Typical Value |
|---|---|---|---|
| Per-Request | Single tool call or API invocation | $B_{\text{req}}$ | 3–5 attempts |
| Per-Session | All retries within one session | $B_{\text{sess}}$ | 20–50 attempts |
| System-Wide | Total retries across all sessions per window | $B_{\text{sys}}$ | 1000 retries / minute |
The retry budget utilization at time $t$ is:

$$U(t) = \frac{\text{retries consumed in the current window}}{\text{budget}}$$

When $U(t) > \theta$ (e.g., $\theta = 0.8$), the system enters retry backpressure mode: new retry attempts are rejected or deferred, and the incident response pipeline is triggered.
21.2.3 Idempotency Keys: Generation, Propagation, and Server-Side Deduplication#
Idempotency Key Specification#
An idempotency key $k$ uniquely identifies a logical operation such that submitting the request $m$ times has the same effect as submitting it once:

$$\text{apply}(op, k)^{\,m} = \text{apply}(op, k) \quad \forall m \ge 1$$

i.e., the effect is applied exactly once regardless of how many times the request is submitted.
Key Generation#
$$k = \text{HMAC}_{s}\big(\text{serialize}(\text{params}) \,\|\, n\big)$$

where:
- $s$ is the session-scoped secret
- $\text{serialize}(\text{params})$ is the deterministic serialization of operation parameters
- $n$ is a monotonically increasing counter within the session
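A sketch of this derivation in Python using HMAC-SHA256 over a canonical JSON serialization (the serialization scheme and the `|` delimiter are illustrative choices, not a mandated wire format):

```python
import hashlib
import hmac
import json

def idempotency_key(session_secret: bytes, params: dict, counter: int) -> str:
    """Deterministic key: HMAC(secret, canonical_params || counter).
    Sorted keys and fixed separators make the serialization canonical,
    so semantically identical params always yield the same key."""
    canonical = json.dumps(params, sort_keys=True, separators=(",", ":"))
    message = canonical.encode() + b"|" + str(counter).encode()
    return hmac.new(session_secret, message, hashlib.sha256).hexdigest()
```

Because the serialization is canonical, a retry that rebuilds the same parameter dict in a different insertion order still produces the same key, which is what makes server-side deduplication possible.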
Key Propagation#
Idempotency keys propagate through the call chain. When an agent invokes a tool, which invokes a downstream service:

$$k_{\text{child}} = \text{KDF}(k_{\text{parent}},\ \text{step\_id})$$

This ensures that a retry of the parent operation generates the same derived key, enabling deduplication at every layer.
Server-Side Deduplication#
The server maintains a deduplication store mapping keys to results:
ALGORITHM IdempotentExecute
INPUT:
operation: Operation
idempotency_key: IdempotencyKey
dedup_store: DeduplicationStore
config: IdempotencyConfig
OUTPUT:
result: OperationResult
BEGIN:
// ─── Check deduplication store ───
existing ← dedup_store.get(idempotency_key)
IF existing IS NOT NULL THEN
IF existing.status = COMPLETED THEN
EMIT_METRIC("idempotency.dedup_hit", {key: idempotency_key})
RETURN existing.result // Return cached result
ELSE IF existing.status = IN_PROGRESS THEN
// Another invocation is in flight
IF NOW() - existing.started_at > config.in_progress_timeout THEN
// Stale in-progress record; likely crashed
dedup_store.update(idempotency_key, status=EXPIRED)
// Fall through to execute
ELSE
RETURN OperationResult(status=PENDING, retry_after=existing.estimated_completion)
END IF
END IF
END IF
// ─── Claim the key ───
claimed ← dedup_store.claim(idempotency_key, {
status: IN_PROGRESS,
started_at: NOW(),
expiry: NOW() + config.key_ttl
})
IF NOT claimed THEN
// Race condition: another instance claimed first
RETURN OperationResult(status=PENDING, retry_after=1s)
END IF
// ─── Execute operation ───
TRY:
result ← EXECUTE(operation)
dedup_store.update(idempotency_key, {
status: COMPLETED,
result: result,
completed_at: NOW(),
expiry: NOW() + config.result_ttl
})
RETURN result
CATCH error:
IF IS_RETRIABLE(error) THEN
dedup_store.update(idempotency_key, status=FAILED_RETRIABLE)
// Allow future retries with same key
ELSE
dedup_store.update(idempotency_key, {
status: FAILED_PERMANENT,
error: error,
expiry: NOW() + config.result_ttl
})
END IF
RAISE error
END TRY
END
Deduplication Store TTL#
Keys expire after a configurable TTL to prevent unbounded storage growth:

$$\text{TTL}_{\text{key}} \ge \text{maximum retry window} + \text{safety margin}$$

Storage cost:

$$\text{storage} \approx \text{request rate} \times \text{TTL} \times \text{average entry size}$$
21.3 Circuit Breakers: Open/Half-Open/Closed States, Failure Rate Thresholds, and Recovery Probes#
21.3.1 Circuit Breaker State Machine#
The circuit breaker is a protective state machine that prevents a failing downstream dependency from consuming unbounded resources:
$$S \in \{\text{CLOSED},\ \text{OPEN},\ \text{HALF\_OPEN}\}$$

where:
- CLOSED: requests flow through and outcomes are recorded (normal operation)
- OPEN: requests fast-fail immediately without touching the dependency
- HALF_OPEN: a limited number of probe requests test whether the dependency has recovered

Transition function:

$$\text{CLOSED} \xrightarrow{\ \phi > \Phi\ } \text{OPEN} \xrightarrow{\ T_{\text{open}}\ \text{elapsed}\ } \text{HALF\_OPEN} \xrightarrow{\ \text{probes succeed}\ } \text{CLOSED}, \qquad \text{HALF\_OPEN} \xrightarrow{\ \text{probe fails}\ } \text{OPEN}$$
21.3.2 Failure Rate Computation#
The failure rate $\phi$ is computed over a sliding window of the last $N$ requests:

$$\phi = \frac{|\{\text{failures in window}\}|}{N}$$

The circuit opens when:

$$\phi > \Phi \quad \text{and} \quad N \ge N_{\min}$$

where $\Phi$ (e.g., 0.5) is the failure rate threshold and $N_{\min}$ is the minimum sample size to prevent premature tripping on low traffic.
21.3.3 Recovery Probing#
In HALF_OPEN state, the circuit breaker admits a limited number $P$ of probe requests (typically 1–3): the circuit closes once enough probes succeed and reopens on any probe failure.

The open duration before transitioning to HALF_OPEN follows an exponential backoff:

$$T_{\text{open}}(c) = \min\big(T_{\text{open}}^{\max},\ T_{\text{open}}^{\text{base}} \cdot 2^{\,c-1}\big)$$

where $c$ is the number of consecutive open→half-open→open cycles. This prevents rapid oscillation (circuit "flapping").
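The open-duration backoff reduces to a one-line helper; a sketch with illustrative defaults (a 5 s base and a 300 s ceiling are assumptions, not prescribed values):

```python
def open_duration(consecutive_opens: int,
                  t_base: float = 5.0, t_max: float = 300.0) -> float:
    """OPEN-state dwell time: min(T_max, T_base * 2**(c - 1)),
    doubling with each consecutive open cycle up to the ceiling."""
    return min(t_max, t_base * 2 ** (consecutive_opens - 1))
```

With these defaults a dependency that keeps failing its probes waits 5 s, 10 s, 20 s, 40 s, … between recovery attempts, capping at 5 minutes.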
21.3.4 Circuit Breaker Metrics#
| Metric | Formula | Operational Significance |
|---|---|---|
| Trip Rate | $\frac{\text{transitions to OPEN}}{\text{time window}}$ | Frequency of dependency degradation |
| Open Duration | $\mathbb{E}[t_{\text{half-open}} - t_{\text{open}}]$ | Average time dependency is unavailable |
| Recovery Success Rate | $\frac{\text{HALF\_OPEN} \to \text{CLOSED transitions}}{\text{HALF\_OPEN transitions}}$ | Dependency stability after incidents |
| Requests Shed | $\sum \text{rejections while OPEN}$ | Requests rejected during outages |
21.3.5 Pseudo-Algorithm: Circuit Breaker#
ALGORITHM CircuitBreaker
INPUT:
dependency: DependencyID
config: CircuitBreakerConfig
STATE:
state ← CLOSED
failure_window ← SlidingWindow(size=config.window_size)
open_since ← NULL
consecutive_opens ← 0
probe_results ← []
METHOD execute(operation):
MATCH state:
CLOSED:
TRY:
result ← operation.execute(timeout=config.call_timeout)
failure_window.record(SUCCESS)
RETURN result
CATCH error:
failure_window.record(FAILURE)
// Check trip condition
IF failure_window.count() ≥ config.N_min THEN
failure_rate ← failure_window.failure_rate()
IF failure_rate > config.phi_threshold THEN
TRIP_OPEN()
END IF
END IF
RAISE error
END TRY
OPEN:
// Check if open timeout has elapsed
elapsed ← NOW() - open_since
T_open ← MIN(config.T_open_max,
config.T_open_base * 2^(consecutive_opens - 1))
IF elapsed ≥ T_open THEN
state ← HALF_OPEN
probe_results ← []
EMIT_TRACE("circuit_breaker.half_open", dependency)
// Fall through to HALF_OPEN handling below
RETURN EXECUTE_HALF_OPEN(operation)
ELSE
// Fast-fail: do not attempt the call
EMIT_METRIC("circuit_breaker.rejected", dependency)
RAISE CircuitOpenError(dependency, retry_after=T_open - elapsed)
END IF
HALF_OPEN:
RETURN EXECUTE_HALF_OPEN(operation)
METHOD EXECUTE_HALF_OPEN(operation):
IF LEN(probe_results) ≥ config.P_probes THEN
// Already collected enough probes; wait for decision
RAISE CircuitOpenError(dependency, retry_after=1s)
END IF
TRY:
result ← operation.execute(timeout=config.call_timeout)
APPEND probe_results, SUCCESS
IF COUNT(SUCCESS IN probe_results) ≥ config.probes_to_close THEN
state ← CLOSED
consecutive_opens ← 0
failure_window.reset()
EMIT_TRACE("circuit_breaker.closed", dependency)
END IF
RETURN result
CATCH error:
APPEND probe_results, FAILURE
TRIP_OPEN()
RAISE error
END TRY
METHOD TRIP_OPEN():
state ← OPEN
open_since ← NOW()
consecutive_opens ← consecutive_opens + 1
EMIT_TRACE("circuit_breaker.opened", {
dependency: dependency,
failure_rate: failure_window.failure_rate(),
consecutive_opens: consecutive_opens
})
EMIT_ALERT_IF(consecutive_opens ≥ config.alert_threshold)
21.4 Bulkhead Isolation: Partitioning Resources to Prevent Cross-Concern Failure Propagation#
21.4.1 Bulkhead Principle#
Borrowed from naval engineering, the bulkhead pattern partitions system resources into isolated compartments such that failure in one compartment cannot drain resources from another.
Formally, let the system's resource pool $R$ (thread pools, connection pools, memory, token budgets) be partitioned:

$$R = R_1 \sqcup R_2 \sqcup \cdots \sqcup R_n$$

Each partition $R_i$ has a hard capacity limit $C_i$:

$$\text{usage}(R_i) \le C_i \quad \forall i$$

A resource-exhaustion failure in partition $R_i$ does not affect partitions $R_j$:

$$\text{usage}(R_i) = C_i \;\Rightarrow\; \text{available}(R_j) \text{ unchanged} \quad \forall j \ne i$$
21.4.2 Bulkhead Dimensions for Agentic Systems#
| Dimension | Partition By | Rationale |
|---|---|---|
| Tool invocation pools | Per-tool or per-tool-class | Slow tool cannot exhaust pool used by fast tools |
| LLM inference queues | Per-priority-tier | Low-priority background tasks cannot block high-priority user requests |
| Retrieval connections | Per-source (vector DB, graph DB, cache) | One slow source cannot block others |
| Agent execution slots | Per-session or per-user | One user's runaway agent cannot consume all compute |
| Token budget pools | Per-session, per-task | One expensive task cannot drain organization budget |
| Memory allocation | Per-session working memory | One session's large context cannot cause OOM for others |
21.4.3 Bulkhead Sizing#
The capacity $C_i$ of each bulkhead is determined by:

$$C_i = \max\left(C_{\min},\ \left\lfloor \frac{w_i}{\sum_j w_j} \cdot C_{\text{total}} \right\rfloor\right)$$

where $w_i$ is the weight assigned based on priority and expected demand, $C_{\text{total}}$ is the total resource pool, and $C_{\min}$ guarantees a minimum viable allocation per partition.

The utilization-adjusted sizing dynamically reallocates unused capacity:

$$C_i^{\text{eff}} = C_i + \alpha \cdot \sum_{j \ne i} \max\big(0,\ C_j - \text{usage}(R_j)\big)$$

where $\alpha \in [0, 1]$ is the borrowing fraction that controls how much slack from other partitions can be temporarily utilized. Setting $\alpha = 0$ provides strict isolation; $\alpha > 0$ provides elastic isolation with guaranteed minimums.
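The static sizing rule can be sketched directly from the formula (partition names and weights here are illustrative; $C_{\min}$ defaults to 1 for the sketch):

```python
import math

def bulkhead_capacities(weights: dict[str, float],
                        total: int, c_min: int = 1) -> dict[str, int]:
    """Weighted split of the pool with a guaranteed per-partition minimum:
    C_i = max(C_min, floor(w_i * total / sum(w)))."""
    w_sum = sum(weights.values())
    return {pid: max(c_min, math.floor(w * total / w_sum))
            for pid, w in weights.items()}
```

Note that the $C_{\min}$ floor means the capacities can sum to slightly more than the pool when many low-weight partitions exist; the initialize() method in the pseudo-algorithm below applies the same rule per partition.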
21.4.4 Pseudo-Algorithm: Bulkhead Resource Manager#
ALGORITHM BulkheadResourceManager
INPUT:
partitions: Map<PartitionID, BulkheadConfig>
total_capacity: Int
STATE:
allocations: Map<PartitionID, Semaphore>
usage_counters: Map<PartitionID, AtomicInt>
METHOD initialize():
FOR EACH (id, config) IN partitions DO
capacity ← MAX(config.C_min,
FLOOR(config.weight * total_capacity / total_weight))
allocations[id] ← Semaphore(capacity)
usage_counters[id] ← AtomicInt(0)
END FOR
METHOD acquire(partition_id, timeout):
semaphore ← allocations[partition_id]
// Try primary allocation
acquired ← semaphore.try_acquire(timeout=timeout)
IF acquired THEN
usage_counters[partition_id].increment()
EMIT_METRIC("bulkhead.acquired", partition_id)
RETURN AcquisitionToken(partition_id, PRIMARY)
END IF
// Try borrowing from slack partitions
IF config.alpha > 0 THEN
FOR EACH (other_id, other_sem) IN allocations DO
IF other_id = partition_id THEN CONTINUE END IF
slack ← other_sem.available_permits()
IF slack > config.borrow_min_slack THEN
borrowed ← other_sem.try_acquire(timeout=0)
IF borrowed THEN
usage_counters[partition_id].increment()
EMIT_METRIC("bulkhead.borrowed", {
partition: partition_id,
from: other_id
})
RETURN AcquisitionToken(partition_id, BORROWED, donor=other_id)
END IF
END IF
END FOR
END IF
// Allocation failed
EMIT_METRIC("bulkhead.rejected", partition_id)
RAISE BulkheadFullError(partition_id, usage=usage_counters[partition_id].get())
METHOD release(token):
MATCH token.type:
PRIMARY:
allocations[token.partition_id].release()
BORROWED:
allocations[token.donor].release()
usage_counters[token.partition_id].decrement()
EMIT_METRIC("bulkhead.released", token.partition_id)
21.5 Timeout Engineering: Deadline Propagation, Cascading Timeout Budgets, and Deadline-Aware Scheduling#
21.5.1 The Timeout Problem in Agentic Systems#
Agentic systems compose multiple asynchronous operations: LLM inference, tool invocations, retrieval queries, human approvals. Without disciplined timeout engineering, a single slow operation blocks the entire agent loop indefinitely, consuming resources and violating latency SLAs.
21.5.2 Deadline Propagation#
Every request entering the system carries a deadline $D$:

$$D = t_{\text{arrival}} + \text{SLA}$$

As the request flows through the call chain, each component consumes time and propagates a remaining deadline:

$$D_{\text{remaining}} = D - t_{\text{now}} - T_{\text{overhead}}$$

where $T_{\text{overhead}}$ reserves time for post-processing, serialization, and response transmission.

The effective timeout for a downstream call at depth $d$ is:

$$T_{\text{timeout}}(d) = D - t_{\text{now}} - T_{\text{remaining}}^{\min} - T_{\text{overhead}}$$

where $T_{\text{remaining}}^{\min}$ is the estimated minimum time for all remaining downstream steps. This ensures that even if the current call uses its full timeout, sufficient time remains for subsequent steps.
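The deadline arithmetic is small enough to sketch as a helper (using a monotonic clock so wall-clock adjustments cannot corrupt the budget; a non-positive result signals an infeasible call):

```python
import time

def effective_timeout(deadline: float, min_remaining: float,
                      overhead: float, max_timeout: float) -> float:
    """Time this call may consume: whatever remains of the deadline
    after reserving the minimum for remaining steps and response
    overhead, capped at the operation's own maximum timeout."""
    available = deadline - time.monotonic() - min_remaining - overhead
    return min(available, max_timeout)
```

A caller checks the sign before dispatching: a non-positive value means the step should be cancelled or degraded rather than started, which is exactly the pre-flight check in the pseudo-algorithm below.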
21.5.3 Cascading Timeout Budget Allocation#
For a sequential chain of $n$ operations, the total budget is:

$$\sum_{i=1}^{n} t_i \le T_{\text{budget}} = \text{SLA} - T_{\text{overhead}}$$

The optimal allocation minimizes the probability of timeout given per-operation latency distributions $F_i$:

$$\min_{t_1,\ldots,t_n} \sum_{i=1}^{n} \big(1 - F_i(t_i)\big) \quad \text{subject to} \quad \sum_{i=1}^{n} t_i = T_{\text{budget}}$$

For operations with exponentially distributed latencies $X_i \sim \text{Exp}(\lambda_i)$:

$$P(X_i > t_i) = e^{-\lambda_i t_i}$$

The Lagrangian yields:

$$t_i^{*} = \frac{1}{\lambda_i} \ln \frac{\lambda_i}{\mu}$$

where $\mu$ is the Lagrange multiplier determined by the budget constraint $\sum_i t_i^{*} = T_{\text{budget}}$. Intuitively, more budget is allocated to operations with higher latency variance (smaller $\lambda_i$).
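A numeric sketch of this allocation: since the total time allocated by $t_i^{*} = \frac{1}{\lambda_i}\ln\frac{\lambda_i}{\mu}$ decreases monotonically in $\mu$, the multiplier can be found by bisection (negative allocations are clamped to zero, which handles steps whose rate exceeds $\mu$):

```python
import math

def allocate_budget(lambdas: list[float], total: float) -> list[float]:
    """Split `total` across steps with Exp(lambda_i) latencies to
    minimize the summed timeout probabilities. Solves for the
    Lagrange multiplier mu by bisection on the budget constraint."""
    def times(mu: float) -> list[float]:
        return [max(0.0, math.log(lam / mu) / lam) for lam in lambdas]
    lo, hi = 1e-12, max(lambdas)          # sum(times) is huge at lo, 0 at hi
    for _ in range(200):
        mu = (lo + hi) / 2
        if sum(times(mu)) > total:
            lo = mu                        # allocating too much: raise mu
        else:
            hi = mu                        # allocating too little: lower mu
    return times((lo + hi) / 2)
```

Running `allocate_budget([1.0, 2.0], 3.0)` gives the slower step ($\lambda = 1$) the larger share of the 3-second budget, matching the intuition stated above.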
21.5.4 Deadline-Aware Scheduling#
The agent loop scheduler must be deadline-aware: actions approaching their deadline receive scheduling priority:

$$\text{priority}(a) \propto \frac{1}{D_a - t_{\text{now}}}$$

Actions with $D_a - t_{\text{now}} < T_a^{\min}$ (less remaining time than the action's minimum viable execution time) are infeasible and should be preemptively cancelled rather than allowed to consume resources and then time out.
21.5.5 Pseudo-Algorithm: Deadline-Propagating Invocation#
ALGORITHM DeadlinePropagatingInvoke
INPUT:
operation: Operation
deadline: Timestamp
remaining_steps: List<OperationSpec>
OUTPUT:
result: OperationResult
BEGIN:
// Calculate time needed for remaining steps
T_remaining_minimum ← SUM(step.min_latency FOR step IN remaining_steps)
T_overhead ← config.response_overhead
// Available time for this operation
T_available ← deadline - NOW() - T_remaining_minimum - T_overhead
IF T_available ≤ 0 THEN
RAISE DeadlineExceededError(
"insufficient_time_for_operation",
deficit=ABS(T_available)
)
END IF
// Set operation timeout
operation_timeout ← MIN(T_available, operation.spec.max_timeout)
IF operation_timeout < operation.spec.min_viable_timeout THEN
// Not enough time for meaningful execution
RETURN DEGRADE_OR_SKIP(operation, deadline)
END IF
// Execute with propagated deadline
downstream_deadline ← NOW() + operation_timeout
TRY:
result ← operation.execute(
timeout=operation_timeout,
propagated_deadline=downstream_deadline
)
RETURN result
CATCH TimeoutException:
EMIT_METRIC("deadline.timeout", {
operation: operation.name,
allocated: operation_timeout,
deadline_remaining: deadline - NOW()
})
// Decide whether to return partial result or propagate failure
IF operation.supports_partial_result THEN
RETURN operation.get_partial_result()
ELSE
RAISE TimeoutException(operation.name, allocated=operation_timeout)
END IF
END TRY
END
21.6 Queue Isolation and Backpressure: Rate Limiting, Admission Control, and Load Shedding#
21.6.1 Queue Architecture for Agentic Systems#
Agentic workloads are bursty and heterogeneous. A code-generation task consumes 10× more tokens than a simple Q&A. Without queue isolation, heavy tasks starve light tasks.
Define a multi-queue architecture with $K$ queues:

$$\mathcal{Q} = \{Q_1, Q_2, \ldots, Q_K\}$$

Each queue $Q_k$ has a bounded capacity $c_k$, a priority level $p_k$, and a rate limit $r_k$.
21.6.2 Rate Limiting#
Rate limits are enforced using a token bucket algorithm:

$$\text{tokens}(t) = \min\big(B,\ \text{tokens}(t_{\text{last}}) + r \cdot (t - t_{\text{last}})\big)$$

where $B$ is the bucket capacity (burst limit) and $r$ is the refill rate (sustained throughput limit).

A request of cost $c$ is admitted if:

$$c \le \text{tokens}(t)$$

After admission, the bucket is decremented:

$$\text{tokens}(t) \leftarrow \text{tokens}(t) - c$$

For agentic systems, cost is measured in token units rather than raw request count:

$$c = \text{tokens}_{\text{prompt}} + \mathbb{E}[\text{tokens}_{\text{completion}}]$$

This prevents a single high-token request from being treated equivalently to a simple health check.
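A minimal token bucket in cost units (a single-threaded sketch; a production version needs a lock around refill-and-consume for concurrent callers):

```python
import time

class TokenBucket:
    """Token bucket: capacity B (burst), refill rate r per second.
    Costs are arbitrary units, e.g. estimated LLM tokens."""

    def __init__(self, capacity: float, refill_rate: float):
        self.capacity = capacity
        self.rate = refill_rate
        self.tokens = capacity
        self.last = time.monotonic()

    def try_consume(self, cost: float) -> bool:
        """Refill lazily based on elapsed time, then admit iff affordable."""
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if cost <= self.tokens:
            self.tokens -= cost
            return True
        return False
```

The lazy refill avoids a background timer: the bucket's level is only materialized at admission time, which is the standard implementation trick.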
21.6.3 Admission Control#
Admission control decides whether to accept a new request based on current system load:

$$\text{decision} = \begin{cases} \text{ACCEPT} & L < L_{\text{accept}} \\ \text{THROTTLE} & L_{\text{accept}} \le L < L_{\text{shed}} \\ \text{REJECT} & L \ge L_{\text{shed}} \end{cases}$$

where

$$L = w_q \cdot \frac{\text{queued}}{\text{capacity}} + w_c \cdot \frac{\text{active tokens}}{\text{token budget}} + w_l \cdot \frac{p_{95}\ \text{latency}}{\text{latency SLA}}$$

is a weighted composite load signal combining queue depth, token consumption, and observed latency.
21.6.4 Load Shedding Strategies#
When admission control cannot prevent overload, load shedding drops requests to protect system stability:
| Strategy | Selection Criterion | Properties |
|---|---|---|
| LIFO (Last In, First Out) | Drop newest requests | Preserves older, likely-more-invested requests |
| Priority-Based | Drop lowest-priority first | Protects business-critical workloads |
| Cost-Based | Drop most expensive requests | Maximizes throughput in request count |
| Deadline-Based | Drop requests closest to expiry | Drops requests unlikely to complete anyway |
| Random | Drop uniformly at random | Fair, prevents starvation, simple |
The optimal shedding policy maximizes aggregate value delivered:

$$\max_{x \in \{0,1\}^n} \sum_{i} v_i x_i \quad \text{subject to} \quad \sum_{i} c_i x_i \le C$$

This is a knapsack problem; for online decision-making, the greedy approximation sorts by value density $v_i / c_i$ and admits in decreasing order until capacity is exhausted.
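The greedy density heuristic fits in a few lines (requests here are illustrative `(id, value, cost)` triples):

```python
def greedy_admit(requests: list[tuple[str, float, float]],
                 capacity: float) -> list[str]:
    """Greedy knapsack approximation: admit requests in decreasing
    value density v_i / c_i until the capacity budget is exhausted."""
    admitted: list[str] = []
    used = 0.0
    for rid, value, cost in sorted(requests,
                                   key=lambda r: r[1] / r[2], reverse=True):
        if used + cost <= capacity:
            admitted.append(rid)
            used += cost
    return admitted
```

The greedy order is not always optimal for 0/1 knapsack, but it is a standard constant-factor online approximation and costs only a sort.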
21.6.5 Pseudo-Algorithm: Admission Controller with Load Shedding#
ALGORITHM AdmissionController
INPUT:
request: IncomingRequest
queues: Map<Priority, Queue>
system_state: SystemState
OUTPUT:
decision: ACCEPT | THROTTLE | REJECT
BEGIN:
// ─── Compute composite load ───
queue_load ← system_state.total_queued / system_state.total_capacity
token_load ← system_state.active_tokens / system_state.token_budget
latency_load ← system_state.p95_latency / system_state.latency_sla
load ← config.w_q * queue_load
+ config.w_c * token_load
+ config.w_l * latency_load
// ─── Rate limit check ───
estimated_cost ← ESTIMATE_TOKEN_COST(request)
IF NOT TOKEN_BUCKET.try_consume(estimated_cost) THEN
EMIT_METRIC("admission.rate_limited", request.priority)
RETURN THROTTLE(retry_after=TOKEN_BUCKET.time_to_refill(estimated_cost))
END IF
// ─── Load-based admission ───
IF load < config.L_accept THEN
target_queue ← queues[request.priority]
IF target_queue.size() < target_queue.capacity THEN
target_queue.enqueue(request)
EMIT_METRIC("admission.accepted", request.priority)
RETURN ACCEPT
ELSE
// Queue full; try shedding lower-priority items
IF SHED_LOWER_PRIORITY(target_queue, request) THEN
target_queue.enqueue(request)
RETURN ACCEPT
ELSE
RETURN THROTTLE(retry_after=config.throttle_delay)
END IF
END IF
ELSE IF load < config.L_shed THEN
// Throttle: accept only high-priority
IF request.priority ≥ PRIORITY_HIGH THEN
queues[request.priority].enqueue(request)
EMIT_METRIC("admission.accepted_under_pressure", request.priority)
RETURN ACCEPT
ELSE
EMIT_METRIC("admission.throttled", request.priority)
RETURN THROTTLE(retry_after=config.throttle_delay)
END IF
ELSE
// Critical load: shed
IF request.priority = PRIORITY_CRITICAL THEN
// Even critical requests enter only if queue has room
IF queues[PRIORITY_CRITICAL].size() < queues[PRIORITY_CRITICAL].capacity THEN
queues[PRIORITY_CRITICAL].enqueue(request)
RETURN ACCEPT
END IF
END IF
EMIT_METRIC("admission.shed", request.priority)
RETURN REJECT(reason="system_overloaded", load=load)
END IF
END
21.7 Graceful Degradation Strategies#
21.7.1 Reduced-Capability Modes: Simpler Models, Cached Responses, and Partial Results#
Graceful degradation maintains service availability by reducing capability rather than failing entirely. The system defines a hierarchy of capability levels:

$$C_0 \succ C_1 \succ C_2 \succ C_3 \succ C_4$$

ordered from full capability to minimal viable service:
| Level | Description | Operational Mode |
|---|---|---|
| $C_0$ | Full | All features, primary model, real-time retrieval, full verification |
| $C_1$ | Reduced Verification | Primary model, retrieval, but skip adversarial critique and self-consistency |
| $C_2$ | Fallback Model | Smaller/faster model, basic verification, cached retrieval where possible |
| $C_3$ | Cache-First | Return cached or pre-computed responses; LLM only for cache misses |
| $C_4$ | Static Fallback | Return templated responses, documentation links, or "service degraded" messages |
The degradation trigger function maps system health to capability level:

$$\text{level}(H) = C_{\min\{i \,:\, H \ge h_i\}}$$

defaulting to $C_4$ when no threshold is met, where $H \in [0,1]$ is the composite health score and $h_i$ is the minimum health required for level $C_i$, with $h_0 > h_1 > h_2 > h_3$.
21.7.2 Feature Flags for Progressive Agent Capability Reduction#
Each degradation level activates or deactivates specific feature flags:
| Feature Flag | $C_0$ | $C_1$ | $C_2$ | $C_3$ | $C_4$ |
|---|---|---|---|---|---|
| primary_model | ✓ | ✓ | ✗ | ✗ | ✗ |
| fallback_model | ✗ | ✗ | ✓ | △ | ✗ |
| real_time_retrieval | ✓ | ✓ | ✓ | ✗ | ✗ |
| cached_retrieval | ✓ | ✓ | ✓ | ✓ | ✗ |
| self_consistency | ✓ | ✗ | ✗ | ✗ | ✗ |
| adversarial_critique | ✓ | ✗ | ✗ | ✗ | ✗ |
| rubric_verification | ✓ | ✓ | △ | ✗ | ✗ |
| tool_invocation | ✓ | ✓ | ✓ | ✗ | ✗ |
| human_escalation | ✓ | ✓ | ✓ | ✓ | ✓ |
(✓ = enabled, ✗ = disabled, △ = simplified version)
Feature flags are evaluated at every phase boundary of the agent loop, allowing mid-execution degradation if system health declines during processing.
21.7.3 User-Facing Degradation Communication: Transparent Status and ETA#
Users must be transparently informed of degraded operation: responses should carry the current capability level, the features affected, and an estimated time to recovery.

The system computes an estimated time to recovery (ETR) based on historical degradation durations; a robust estimator is the median duration of past incidents at the same level:

$$\text{ETR} = \text{median}\{\Delta t : \text{past degradations at this level}\}$$
ALGORITHM GracefulDegradationController
INPUT:
system_health: HealthMetrics
config: DegradationConfig
OUTPUT:
level: CapabilityLevel
feature_flags: Map<FeatureFlag, Bool>
STATE:
current_level ← C_0
level_entry_time ← NOW()
METHOD evaluate():
// Compute composite health score
H ← MIN(
system_health.model_availability / config.A_target,
1.0 / MAX(system_health.p95_latency / config.L_sla, 0.01),
1.0 - system_health.error_rate,
system_health.budget_remaining / config.budget_threshold
)
H ← CLAMP(H, 0.0, 1.0)
// Determine target level
target_level ← C_4 // Default to most degraded
FOR EACH level IN [C_0, C_1, C_2, C_3] DO // Check best to worst
IF H ≥ config.health_thresholds[level] THEN
target_level ← level
BREAK
END IF
END FOR
// Hysteresis: prevent flapping
IF target_level > current_level THEN
// Degrading: apply immediately
current_level ← target_level
level_entry_time ← NOW()
EMIT_ALERT("degradation.level_changed", current_level)
ELSE IF target_level < current_level THEN
// Recovering: require stability period
IF NOW() - level_entry_time > config.stability_period THEN
current_level ← target_level
level_entry_time ← NOW()
EMIT_TRACE("degradation.level_improved", current_level)
END IF
// Else: wait for stability confirmation
END IF
// Resolve feature flags for current level
feature_flags ← config.flag_matrix[current_level]
RETURN (current_level, feature_flags)
END
21.8 Compensating Transactions: Undo, Rollback, and Saga Coordination for Multi-Step Agent Actions#
21.8.1 The Compensation Problem#
Agent loops execute multi-step plans where each step may mutate external state. When step $T_i$ fails after steps $T_1, \ldots, T_{i-1}$ have committed, the system cannot atomically roll back. Compensating transactions provide eventual consistency by executing semantic inverses of completed steps.
21.8.2 Saga Pattern for Agent Actions#
A saga is a sequence of transactions with compensating counterparts:

$$S = \langle (T_1, C_1),\ (T_2, C_2),\ \ldots,\ (T_n, C_n) \rangle$$

where $T_i$ is the forward transaction and $C_i$ is its compensating transaction.

Forward execution:

$$T_1 \to T_2 \to \cdots \to T_n$$

Backward recovery after failure at $T_i$:

$$C_{i-1} \to C_{i-2} \to \cdots \to C_1$$

The saga invariant requires:

$$C_i \circ T_i \approx \text{id}$$

where $\approx$ denotes semantic equivalence (exact state reversal may be impossible for irreversible operations).
21.8.3 Compensation Classification#
| Action Type | Compensation Type | Example |
|---|---|---|
| Exactly Reversible | Exact inverse | File create → file delete |
| Approximately Reversible | Best-effort undo | Code commit → revert commit (history preserved) |
| Compensable Only | Counter-action, not undo | Sent notification → send correction notification |
| Irreversible | No compensation possible | External API call with permanent effect |
For irreversible actions, the system must either:
- Gate execution with human approval before the irreversible step.
- Accept the risk and document it in the saga's compensation plan.
- Simulate the action in a sandbox before committing to production.
21.8.4 Saga Coordinator State Machine#
The saga coordinator tracks the execution state:

$$\text{phase} \in \{\text{FORWARD},\ \text{COMPENSATING},\ \text{COMPLETED},\ \text{COMPENSATED},\ \text{COMPENSATION\_INCOMPLETE}\}$$

Each step record carries a step identifier, a status in {PENDING, COMMITTED, FAILED, COMPENSATED, COMPENSATION_FAILED}, an idempotency key, and the step's result or error.
21.8.5 Pseudo-Algorithm: Saga Coordinator#
ALGORITHM SagaCoordinator
INPUT:
saga: SagaDefinition // List of (transaction, compensation) pairs
context: ExecutionContext
OUTPUT:
saga_result: SagaResult
STATE:
step_records ← []
phase ← FORWARD
BEGIN:
// ─── Forward Execution ───
FOR i ← 0 TO LEN(saga.steps) - 1 DO
(transaction, compensation) ← saga.steps[i]
idempotency_key ← DERIVE_KEY(saga.id, i)
record ← StepRecord(
step_id=i,
status=PENDING,
idempotency_key=idempotency_key
)
TRY:
// Pre-flight check for irreversible actions
IF transaction.reversibility = IRREVERSIBLE THEN
IF NOT context.has_approval(transaction) THEN
approval ← REQUEST_HUMAN_APPROVAL(transaction)
IF NOT approval.granted THEN
RAISE SagaAbortedError("approval_denied", step=i)
END IF
END IF
END IF
result ← IDEMPOTENT_EXECUTE(transaction, idempotency_key)
record.status ← COMMITTED
record.result ← result
APPEND step_records, record
// Persist saga state (WAL)
PERSIST_SAGA_STATE(saga.id, step_records, phase=FORWARD)
CATCH error:
record.status ← FAILED
record.error ← error
APPEND step_records, record
PERSIST_SAGA_STATE(saga.id, step_records, phase=FORWARD)
// Enter compensation phase
phase ← COMPENSATING
BREAK
END TRY
END FOR
IF phase = FORWARD THEN
// All steps succeeded
PERSIST_SAGA_STATE(saga.id, step_records, phase=COMPLETED)
RETURN SagaResult(status=COMPLETED, steps=step_records)
END IF
// ─── Compensation Phase ───
compensation_failures ← []
FOR i ← LEN(step_records) - 2 DOWNTO 0 DO
// Compensate all committed steps before the failed one
record ← step_records[i]
IF record.status ≠ COMMITTED THEN
CONTINUE
END IF
(_, compensation) ← saga.steps[i]
comp_key ← DERIVE_KEY(saga.id, i, "compensate")
IF compensation IS NULL THEN
// No compensation defined (irreversible)
APPEND compensation_failures, CompensationFailure(
step=i,
reason="no_compensation_defined"
)
CONTINUE
END IF
TRY:
IDEMPOTENT_EXECUTE(compensation, comp_key)
record.status ← COMPENSATED
PERSIST_SAGA_STATE(saga.id, step_records, phase=COMPENSATING)
CATCH comp_error:
record.status ← COMPENSATION_FAILED
record.compensation_error ← comp_error
APPEND compensation_failures, CompensationFailure(
step=i,
error=comp_error
)
PERSIST_SAGA_STATE(saga.id, step_records, phase=COMPENSATING)
// Compensation failure is critical: alert operator
ALERT_OPERATOR("saga_compensation_failed", {
saga_id: saga.id,
step: i,
error: comp_error
})
END TRY
END FOR
final_status ← IF LEN(compensation_failures) = 0 THEN COMPENSATED
ELSE COMPENSATION_INCOMPLETE
PERSIST_SAGA_STATE(saga.id, step_records, phase=final_status)
RETURN SagaResult(
status=final_status,
steps=step_records,
compensation_failures=compensation_failures
)
END
21.8.6 Saga Consistency Guarantees#
The saga pattern provides eventual consistency, not ACID atomicity. The guarantee is that one of two outcomes holds:

$$T_1, T_2, \ldots, T_n \ \ \text{(all committed)} \qquad \text{or} \qquad T_1, \ldots, T_{i-1},\ C_{i-1}, \ldots, C_1 \ \ \text{(committed prefix compensated)}$$

with the caveat that compensation may itself fail, requiring compensating compensation or human intervention. The saga coordinator must itself be crash-recoverable via WAL, which is why saga state is persisted after every step.
21.9 Crash Recovery: Checkpointed State, Write-Ahead Logs, and Deterministic Replay#
21.9.1 Crash Recovery Architecture#
Crash recovery ensures that the agent system can resume from a consistent state after any unplanned termination. The architecture combines three mechanisms: (1) periodic checkpoints of session state, (2) a write-ahead log of every mutation since the last checkpoint, and (3) deterministic replay of logged operations against the restored checkpoint.
21.9.2 Write-Ahead Log (WAL) Specification#
The WAL is an append-only, durably persisted log of all state mutations:
Each entry carries a monotonically increasing sequence number, the owning session ID, an operation type (e.g., STATE_MUTATION, TOOL_INVOCATION_INTENT, COMPLETION, SAGA_STEP), an opaque payload, an idempotency key where applicable, and a checksum.
Durability guarantee: every entry is fsync'd to persistent storage before the mutation is applied to in-memory state.
Integrity: each entry's checksum covers the previous entry's checksum, forming a hash chain:

c_i = H(c_{i−1} ‖ e_i)

so truncation or in-place modification anywhere before the tail is detectable during replay.
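A minimal sketch of the hash-chained checksum scheme, assuming SHA-256 and JSON-serializable entries (the function names and the `"genesis"` sentinel are illustrative, not part of the chapter's specification):

```python
import hashlib
import json

def chain_checksum(prev_checksum: str, entry: dict) -> str:
    """Checksum covering the previous link's checksum plus this entry."""
    payload = prev_checksum + json.dumps(entry, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def append_entry(wal: list, entry: dict) -> None:
    """Append an entry, chaining it to the current tail checksum."""
    prev = wal[-1]["checksum"] if wal else "genesis"
    wal.append({"entry": entry, "checksum": chain_checksum(prev, entry)})

def verify_wal_chain(wal: list) -> int:
    """Return the index of the first broken link, or len(wal) if intact.

    Corresponds to VERIFY_WAL_CHAIN / TRUNCATE_AT_BREAK in the recovery
    protocol: recovery replays entries [0, break_index).
    """
    prev = "genesis"
    for i, record in enumerate(wal):
        if record["checksum"] != chain_checksum(prev, record["entry"]):
            return i
        prev = record["checksum"]
    return len(wal)

wal = []
append_entry(wal, {"op": "STATE_MUTATION", "seq": 1})
append_entry(wal, {"op": "TOOL_INVOCATION_INTENT", "seq": 2})
assert verify_wal_chain(wal) == 2      # chain intact
wal[0]["entry"]["seq"] = 99            # tamper with an earlier entry
assert verify_wal_chain(wal) == 0      # break detected at index 0
```

A production WAL would checksum the serialized bytes rather than a re-encoded dict, but the chaining property, that modifying any entry invalidates every later link, is the same.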
21.9.3 Recovery Protocol#
ALGORITHM CrashRecovery
INPUT:
wal: WriteAheadLog
checkpoint_store: CheckpointStore
session_registry: SessionRegistry
OUTPUT:
recovered_sessions: List<Session>
BEGIN:
recovered ← []
// ─── Step 1: Identify sessions needing recovery ───
active_sessions ← session_registry.sessions_in_state(
{ACTIVE, RESUMED, DISPATCHED}
)
FOR EACH session_id IN active_sessions DO
// ─── Step 2: Find latest checkpoint ───
checkpoint ← checkpoint_store.latest_consistent(session_id)
IF checkpoint IS NULL THEN
WARN("no_checkpoint_found", session_id)
MARK_SESSION_FAILED(session_id, "no_recovery_checkpoint")
CONTINUE
END IF
// ─── Step 3: Validate checkpoint integrity ───
IF NOT VERIFY_CHECKSUM(checkpoint) THEN
// Try previous checkpoint
checkpoint ← checkpoint_store.latest_consistent_before(
session_id, checkpoint.seq
)
IF checkpoint IS NULL THEN
MARK_SESSION_FAILED(session_id, "all_checkpoints_corrupt")
CONTINUE
END IF
END IF
// ─── Step 4: Replay WAL from checkpoint ───
state ← DESERIALIZE(checkpoint.state)
wal_entries ← wal.entries_after(session_id, checkpoint.wal_seq)
// Verify WAL chain integrity
IF NOT VERIFY_WAL_CHAIN(wal_entries) THEN
WARN("wal_chain_broken", session_id)
// Recover up to the break point
wal_entries ← TRUNCATE_AT_BREAK(wal_entries)
END IF
FOR EACH entry IN wal_entries DO
MATCH entry.op_type:
STATE_MUTATION:
state ← APPLY_MUTATION(state, entry)
TOOL_INVOCATION_INTENT:
// Check if completed
completion ← FIND_COMPLETION(wal_entries, entry.idempotency_key)
IF completion IS NOT NULL THEN
state ← APPLY_COMPLETION(state, completion)
ELSE
// In-flight at crash: verify via idempotency key
external_result ← CHECK_IDEMPOTENT_RESULT(
entry.tool, entry.idempotency_key
)
IF external_result IS NOT NULL THEN
state ← APPLY_COMPLETION(state, external_result)
wal.append(COMPLETION_ENTRY(entry, external_result))
ELSE
// Action did not execute or result unknown
state ← MARK_ACTION_PENDING(state, entry)
END IF
END IF
SAGA_STEP:
// Handled by saga coordinator recovery
SAGA_RECOVERY_QUEUE.enqueue(entry)
END FOR
// ─── Step 5: Reconstruct session ───
session ← RECONSTRUCT_SESSION(session_id, state)
session.lifecycle_phase ← RESUMED
APPEND recovered, session
EMIT_METRIC("crash_recovery.session_recovered", {
session_id: session_id,
checkpoint_seq: checkpoint.seq,
wal_entries_replayed: LEN(wal_entries),
pending_actions: COUNT_PENDING(state)
})
END FOR
// ─── Step 6: Recover sagas ───
FOR EACH saga_entry IN SAGA_RECOVERY_QUEUE DO
SAGA_COORDINATOR.recover(saga_entry)
END FOR
RETURN recovered
END
21.9.4 Recovery Time Objective (RTO) Analysis#
The total recovery time is:

T_RTO = T_detect + T_load + T_replay

where:

- T_detect is the time to detect the crash and initiate recovery,
- T_load is the time to locate, verify, and deserialize the latest consistent checkpoint,
- T_replay ≈ N_wal · t_apply is the time to replay the N_wal WAL entries written since that checkpoint, at an average of t_apply per entry.

Minimizing N_wal (through frequent checkpointing) directly reduces RTO: with mutation rate λ and checkpoint interval Δ, the expected backlog at a uniformly random crash time is E[N_wal] = λΔ/2.

The checkpoint frequency is therefore an RTO–I/O cost trade-off. With per-checkpoint cost c_ckpt, the combined cost rate is

J(Δ) = c_ckpt / Δ + (λΔ / 2) · t_apply

yielding, by setting dJ/dΔ = 0, the familiar square-root law for the optimal interval:

Δ* = √(2 c_ckpt / (λ t_apply))
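The checkpoint-interval trade-off can be computed directly. A sketch under the assumptions stated in the analysis: c_ckpt is the cost of writing one checkpoint, λ the WAL entry rate, and t_apply the per-entry replay time (all symbol names here are illustrative):

```python
import math

def optimal_checkpoint_interval(c_ckpt: float, rate: float, t_apply: float) -> float:
    """Square-root-law interval minimizing checkpoint overhead
    plus expected replay time: sqrt(2 * c_ckpt / (rate * t_apply)).

    c_ckpt  -- cost of writing one checkpoint (seconds)
    rate    -- WAL entries written per second (lambda)
    t_apply -- time to replay one WAL entry (seconds)
    """
    return math.sqrt(2.0 * c_ckpt / (rate * t_apply))

def expected_replay_time(interval: float, rate: float, t_apply: float) -> float:
    """Expected replay work at a uniformly random crash time:
    E[N_wal] * t_apply = (rate * interval / 2) * t_apply."""
    return rate * interval / 2.0 * t_apply

# e.g. checkpoints cost 2 s, 50 mutations/s, 5 ms to replay each entry:
delta = optimal_checkpoint_interval(2.0, 50.0, 0.005)
assert abs(delta - 4.0) < 1e-9                              # checkpoint every 4 s
assert abs(expected_replay_time(delta, 50.0, 0.005) - 0.5) < 1e-9  # ~0.5 s replay
```

At the optimum, checkpoint overhead per unit time and expected replay cost are balanced; checkpointing more often than Δ* pays more in I/O than it saves in RTO.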
21.10 Chaos Engineering for Agents: Fault Injection, Latency Injection, and Resource Starvation Testing#
21.10.1 Chaos Engineering Principles for Agentic Systems#
Chaos engineering proactively identifies failure modes by intentionally injecting faults into the system under controlled conditions. For agentic systems, this extends beyond infrastructure chaos to include semantic chaos: injecting hallucinations, degraded retrieval, contradictory evidence, and adversarial tool responses.
21.10.2 Fault Injection Taxonomy#
| Injection Type | Target | Mechanism | Purpose |
|---|---|---|---|
| Network Partition | Inter-service communication | Drop/delay packets between agent and tool servers | Test circuit breaker and retry behavior |
| Latency Injection | Any RPC call | Add configurable delay to tool invocations, retrieval, or LLM calls | Test deadline propagation and timeout handling |
| Error Injection | Tool responses | Return error codes for configured fraction of requests | Test retry budgets and error classification |
| Resource Starvation | Token budgets, memory, CPU | Artificially constrain available resources | Test graceful degradation and load shedding |
| State Corruption | Session state, cache | Inject invalid data into cache or corrupt checkpoint | Test integrity verification and fallback |
| Model Degradation | LLM inference | Route to a deliberately poor model or inject noise | Test quality gates and repair loops |
| Tool Unavailability | MCP tool servers | Shut down tool servers or return 503 | Test tool substitution and degraded mode |
| Concurrent Conflict | Shared state | Inject conflicting concurrent mutations | Test isolation and conflict resolution |
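Latency and error injection from the taxonomy above can be implemented as a thin wrapper around any tool call. A minimal sketch (the `FaultInjector` class and its parameters are illustrative; a seeded RNG keeps experiments reproducible):

```python
import random
import time
from typing import Any, Callable, Optional

class FaultInjector:
    """Wraps a callable with probabilistic error and latency injection.

    error_rate    -- fraction of calls that raise an injected error
    added_latency -- fixed delay (seconds) added to every call
    """
    def __init__(self, fn: Callable[..., Any], error_rate: float = 0.0,
                 added_latency: float = 0.0,
                 rng: Optional[random.Random] = None):
        self.fn = fn
        self.error_rate = error_rate
        self.added_latency = added_latency
        self.rng = rng or random.Random()

    def __call__(self, *args: Any, **kwargs: Any) -> Any:
        if self.added_latency > 0:
            time.sleep(self.added_latency)       # latency injection
        if self.rng.random() < self.error_rate:
            raise RuntimeError("injected fault")  # error injection
        return self.fn(*args, **kwargs)

# Roughly 30% of calls fail, exercising retry budgets downstream.
flaky = FaultInjector(lambda x: x * 2, error_rate=0.3, rng=random.Random(7))
errors = 0
for i in range(100):
    try:
        flaky(i)
    except RuntimeError:
        errors += 1
assert 15 <= errors <= 45  # near the configured 30% rate
```

In practice the wrapper would be installed at the RPC boundary by `INSTALL_FAULT_INJECTION` and removed by handle, so the system under test is unaware it is being exercised.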
21.10.3 Experiment Specification#
A chaos experiment is formally specified as a tuple:

E = (hypothesis, scope, injection, duration, blast_radius_limit, abort_conditions, metrics)

where:
- Hypothesis: "The system maintains its P95 latency target when 30% of retrieval requests fail."
- Scope: Specific service, region, or traffic percentage.
- Injection: Fault type and parameters.
- Duration: Fixed time window or event count.
- Blast radius limit: Maximum percentage of traffic affected.
- Abort conditions: Safety criteria that trigger immediate experiment termination.
- Metrics: Observable quantities measured during the experiment.
21.10.4 Chaos Experiment Safety Envelope#
The experiment operates within a safety envelope: a conjunction of guard conditions over live metrics (e.g., error rate below a hard ceiling, P99 latency below a hard ceiling, availability above a floor), evaluated continuously throughout the run.
If any condition is violated, the experiment is automatically aborted and all injections are removed.
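The envelope check used by `CHECK_SAFETY_ENVELOPE` can be sketched as a pure function over current metrics. Assumed here: the envelope is a map from metric name to a bound type and limit, and a missing metric fails closed (all names are illustrative):

```python
def check_safety_envelope(metrics: dict, envelope: dict) -> list:
    """Return the list of violated guard conditions (empty list = safe).

    envelope maps metric name -> (bound_type, limit), e.g.
    {"error_rate": ("max", 0.05), "availability": ("min", 0.99)}.
    """
    violations = []
    for name, (bound, limit) in envelope.items():
        value = metrics.get(name)
        if value is None:
            violations.append(f"{name}: missing")  # fail closed on gaps
        elif bound == "max" and value > limit:
            violations.append(f"{name}: {value} > {limit}")
        elif bound == "min" and value < limit:
            violations.append(f"{name}: {value} < {limit}")
    return violations

envelope = {"error_rate": ("max", 0.05), "p95_latency_s": ("max", 2.0)}
assert check_safety_envelope({"error_rate": 0.01, "p95_latency_s": 1.2}, envelope) == []
assert check_safety_envelope({"error_rate": 0.09, "p95_latency_s": 1.2}, envelope) \
       == ["error_rate: 0.09 > 0.05"]
```

Returning the full violation list rather than a boolean gives the abort path a concrete reason to record in the experiment report.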
21.10.5 Pseudo-Algorithm: Chaos Experiment Runner#
ALGORITHM ChaosExperimentRunner
INPUT:
experiment: ChaosExperiment
system: SystemUnderTest
safety_config: SafetyEnvelope
config: RunnerConfig
OUTPUT:
report: ChaosExperimentReport
BEGIN:
// ─── Pre-experiment validation ───
baseline_metrics ← COLLECT_BASELINE_METRICS(system, duration=config.baseline_window)
IF NOT VALIDATE_SYSTEM_HEALTHY(baseline_metrics) THEN
RETURN ChaosExperimentReport(status=ABORTED, reason="system_unhealthy_before_start")
END IF
// ─── Install fault injection ───
injection_handle ← INSTALL_FAULT_INJECTION(experiment.injection, experiment.scope)
experiment_start ← NOW()
metrics_during ← []
abort_triggered ← FALSE
// ─── Monitor during experiment ───
WHILE NOW() - experiment_start < experiment.duration DO
SLEEP(config.monitor_interval)
current_metrics ← COLLECT_METRICS(system)
APPEND metrics_during, current_metrics
// Safety envelope check
IF NOT CHECK_SAFETY_ENVELOPE(current_metrics, safety_config) THEN
EMIT_ALERT("chaos.safety_violation", experiment.id)
abort_triggered ← TRUE
BREAK
END IF
// Blast radius check
IF MEASURE_BLAST_RADIUS(current_metrics) > experiment.blast_radius_limit THEN
EMIT_ALERT("chaos.blast_radius_exceeded", experiment.id)
abort_triggered ← TRUE
BREAK
END IF
END WHILE
// ─── Remove fault injection ───
REMOVE_FAULT_INJECTION(injection_handle)
// ─── Post-experiment recovery verification ───
WAIT(config.recovery_observation_window)
recovery_metrics ← COLLECT_METRICS(system)
// ─── Analyze results ───
hypothesis_validated ← EVALUATE_HYPOTHESIS(
experiment.hypothesis,
baseline_metrics,
metrics_during,
recovery_metrics
)
report ← ChaosExperimentReport(
experiment_id=experiment.id,
status=IF abort_triggered THEN ABORTED ELSE COMPLETED,
hypothesis_validated=hypothesis_validated,
baseline_metrics=baseline_metrics,
during_metrics=AGGREGATE(metrics_during),
recovery_metrics=recovery_metrics,
abort_triggered=abort_triggered,
abort_reason=IF abort_triggered THEN IDENTIFY_VIOLATION() ELSE NULL,
duration_actual=NOW() - experiment_start,
recommendations=GENERATE_RECOMMENDATIONS(
hypothesis_validated, metrics_during, baseline_metrics
)
)
PERSIST_EXPERIMENT_REPORT(report)
// ─── Feed into evaluation infrastructure ───
IF NOT hypothesis_validated THEN
CREATE_IMPROVEMENT_TASK(report)
UPDATE_RUNBOOK(experiment.injection.type, report)
END IF
RETURN report
END
21.11 Operational Runbooks: Automated Incident Response, Escalation, and Post-Mortem Integration#
21.11.1 Runbook as Executable Policy#
An operational runbook is not a static document. It is an executable policy that maps observed symptoms to diagnostic steps, automated remediation actions, and escalation paths. Formally, a runbook is a rule set R = {r_1, …, r_n}, where each entry is a response rule r = (condition, actions, escalation, priority): condition is a predicate over current metrics and alert context, actions is an ordered list of remediation steps, escalation names the path taken if automated remediation fails, and priority selects among multiple matching rules.
21.11.2 Automated Incident Response Pipeline#
ALGORITHM AutomatedIncidentResponse
INPUT:
alert: Alert
runbook: Runbook
system: SystemState
OUTPUT:
incident_record: IncidentRecord
BEGIN:
incident ← CREATE_INCIDENT(alert, severity=alert.severity)
// ─── Step 1: Classify incident ───
matched_rules ← []
FOR EACH rule IN runbook.rules DO
IF rule.condition(system.metrics, alert.context) THEN
APPEND matched_rules, rule
END IF
END FOR
IF LEN(matched_rules) = 0 THEN
// Unknown incident type
ESCALATE_TO_ONCALL(incident, reason="no_matching_runbook")
RETURN incident
END IF
// Select highest-priority matching rule
rule ← MAX(matched_rules, key=λr: r.priority)
// ─── Step 2: Execute automated remediation ───
FOR EACH action IN rule.actions DO
IF action.requires_approval AND NOT AUTO_APPROVE(action, incident.severity) THEN
approval ← REQUEST_APPROVAL(action, timeout=rule.approval_timeout)
IF NOT approval.granted THEN
CONTINUE // Skip this action
END IF
END IF
TRY:
result ← EXECUTE_REMEDIATION(action, system)
incident.add_action_record(action, result)
EMIT_TRACE("incident.remediation_executed", action.name)
// Check if remediation resolved the issue
WAIT(rule.verification_delay)
IF CHECK_RESOLUTION(alert, system) THEN
incident.status ← RESOLVED
incident.resolution ← action.name
BREAK
END IF
CATCH error:
incident.add_action_record(action, error)
EMIT_ALERT("incident.remediation_failed", action.name, error)
END TRY
END FOR
// ─── Step 3: Escalation if not resolved ───
IF incident.status ≠ RESOLVED THEN
MATCH rule.escalation.policy:
HUMAN_ONCALL:
PAGE_ONCALL(incident, rule.escalation.team)
MANAGEMENT:
NOTIFY_MANAGEMENT(incident)
EXTERNAL_VENDOR:
OPEN_SUPPORT_TICKET(incident, rule.escalation.vendor)
incident.status ← ESCALATED
END IF
// ─── Step 4: Persist and schedule post-mortem ───
PERSIST_INCIDENT(incident)
IF incident.severity ≤ SEVERITY_HIGH THEN  // severities ordered most-severe first
SCHEDULE_POST_MORTEM(incident, within=config.post_mortem_deadline)
END IF
RETURN incident
END
21.11.3 Post-Mortem Integration#
Post-mortems produce actionable artifacts that feed back into the system: new or refined runbook rules, new chaos experiments that reproduce the failure, regression tests, and adjusted alert thresholds.
Each post-mortem output is tracked as a work item with an owner, deadline, and verification criterion. The closure criterion for a post-mortem action item is that its verification criterion passes, typically a chaos experiment or regression test demonstrating that the original failure mode no longer recurs.
21.12 SLA Definition and Enforcement: Availability, Latency P50/P95/P99, Error Budget, and Burn Rate#
21.12.1 SLA Specification#
The Service Level Agreement (SLA) for an agentic platform is defined as a set of Service Level Objectives (SLOs), each with a Service Level Indicator (SLI) and a target: SLA = {(SLI_i, target_i, window_i)}, where each SLI is a measurable ratio or percentile, the target is the threshold it must satisfy, and the window is the period over which it is evaluated.
21.12.2 Core SLOs for Agentic Systems#
| SLO | SLI Definition | Target | Window |
|---|---|---|---|
| Availability | successful requests / total requests | ≥ 99.9% | 30 days |
| Latency P50 | 50th percentile response time | deployment-specific | 30 days |
| Latency P95 | 95th percentile response time | deployment-specific | 30 days |
| Latency P99 | 99th percentile response time | deployment-specific | 30 days |
| Correctness | fraction of responses passing quality gates | deployment-specific | 30 days |
| Error Rate | failed requests / total requests | deployment-specific | 30 days |
| Throughput | requests per second at target latency | deployment-specific | 1 hour |
21.12.3 Error Budget#
The error budget is the allowed failure margin:

B = (1 − target) × N

where N is the total number of requests in the SLO window. For 99.9% availability over 30 days with 1M requests:

B = (1 − 0.999) × 10⁶ = 1,000 failed requests

The remaining error budget at time t within the window is:

B_remaining(t) = B − F(t)

where F(t) is the cumulative number of failed requests from the window start through t.
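The budget arithmetic is trivial but worth making executable so dashboards and tests agree on it. A sketch (function names are illustrative):

```python
def error_budget(target: float, total_requests: int) -> int:
    """Allowed number of failed requests in the window: (1 - target) * N."""
    return round((1.0 - target) * total_requests)

def budget_remaining(target: float, total_requests: int, failures: int) -> int:
    """Budget left after the observed failures; negative means SLO breach."""
    return error_budget(target, total_requests) - failures

# 99.9% availability over 1M requests leaves a budget of 1,000 failures:
assert error_budget(0.999, 1_000_000) == 1000
assert budget_remaining(0.999, 1_000_000, 250) == 750
assert budget_remaining(0.999, 1_000_000, 1200) == -200  # SLO breached
```

Using `round` rather than truncation avoids floating-point artifacts such as `(1.0 - 0.999) * 1e6` landing a hair off an integer.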
21.12.4 Burn Rate#
The burn rate measures how quickly the error budget is being consumed:

burn(w) = (failures(w) / requests(w)) / (1 − target)

i.e., the observed error rate over window w divided by the budgeted error rate. A burn rate of 1.0 means the budget is consumed at exactly the planned rate. A burn rate greater than 1.0 means the budget will be exhausted before the window ends.

Time to budget exhaustion at the current burn rate:

t_exhaust = B_remaining / (burn × (1 − target) × r)

where r is the current request rate.
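Both quantities follow directly from the definitions above. A sketch, assuming a 99.9% availability target (the function names and example numbers are illustrative):

```python
def burn_rate(failures: int, requests: int, target: float) -> float:
    """Observed error rate over a window divided by the budgeted rate."""
    observed = failures / requests
    budgeted = 1.0 - target
    return observed / budgeted

def hours_to_exhaustion(budget_remaining: float, burn: float,
                        target: float, request_rate_per_hour: float) -> float:
    """Time until the remaining budget is gone at the current burn rate."""
    failures_per_hour = burn * (1.0 - target) * request_rate_per_hour
    return budget_remaining / failures_per_hour

# 3% of requests failing against a 99.9% target is a 30x burn:
b = burn_rate(failures=30, requests=1000, target=0.999)
assert abs(b - 30.0) < 1e-9
# With 800 budget units left at 10,000 req/h, exhaustion in 800/300 hours:
t = hours_to_exhaustion(800, b, 0.999, 10_000)
assert abs(t - 800 / 300) < 1e-9
```

Exhaustion time, not burn rate alone, is what determines whether an incident warrants a page or a ticket, which motivates the multi-window thresholds below.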
21.12.5 Multi-Window Burn Rate Alerts#
To balance between catching fast-burning incidents and slow degradation, use multi-window burn rate alerts:
| Alert Severity | Fast Window | Slow Window | Burn Rate Threshold | Budget Consumed |
|---|---|---|---|---|
| Page (Critical) | 5 min | 1 hour | 14.4× | 2% in 1 hour |
| Page (High) | 30 min | 6 hours | 6× | 5% in 6 hours |
| Ticket (Medium) | 2 hours | 1 day | 3× | 10% in 1 day |
| Ticket (Low) | 6 hours | 3 days | 1× | 100% in 30 days |
An alert fires when the burn rate exceeds the threshold in both the fast and slow windows simultaneously, preventing false alarms from transient spikes.
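The two-window AND condition is a one-liner, but its behavior is worth pinning down with examples. A sketch (function name illustrative):

```python
def should_alert(burn_fast: float, burn_slow: float, threshold: float) -> bool:
    """Fire only when both the fast and slow windows exceed the threshold.

    The fast window gives quick detection; requiring the slow window too
    suppresses false alarms from short transient spikes.
    """
    return burn_fast > threshold and burn_slow > threshold

# A 20x spike confined to the fast window does not page:
assert not should_alert(burn_fast=20.0, burn_slow=4.0, threshold=14.4)
# A sustained 15x burn visible in both windows does:
assert should_alert(burn_fast=15.0, burn_slow=15.0, threshold=14.4)
# A decaying incident (slow window still hot, fast window recovered) stops paging:
assert not should_alert(burn_fast=2.0, burn_slow=15.0, threshold=14.4)
```

The third case is the reset property: once the fast window recovers, the alert clears even though the slow window has not yet drained, shortening alert-fatigue tails after remediation.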
21.12.6 SLA Enforcement Mechanism#
Enforcement couples the error budget to operational controls: as the burn rate rises, the system escalates from normal operation through increased monitoring and deployment freezes to active degradation and incident response, and de-escalates only after a stability period.
21.12.7 Pseudo-Algorithm: SLA Monitor and Enforcer#
ALGORITHM SLAMonitorAndEnforcer
INPUT:
slos: List<SLO>
metrics_stream: MetricsStream
config: SLAConfig
STATE:
error_budgets: Map<SLO_ID, ErrorBudget>
burn_rates: Map<SLO_ID, BurnRateTracker>
enforcement_level: EnforcementLevel ← NORMAL_OPERATION
// Runs continuously
METHOD monitor():
LOOP EVERY config.evaluation_interval DO
FOR EACH slo IN slos DO
// ─── Compute SLI ───
sli_value ← COMPUTE_SLI(slo, metrics_stream, slo.window)
// ─── Update error budget ───
total_events ← metrics_stream.count(slo.window)
budget_total ← (1 - slo.target) * total_events
budget_consumed ← total_events * (1 - sli_value)
budget_remaining ← budget_total - budget_consumed
error_budgets[slo.id] ← ErrorBudget(
total=budget_total,
consumed=budget_consumed,
remaining=budget_remaining,
remaining_fraction=budget_remaining / MAX(budget_total, 1)
)
// ─── Compute burn rates across multiple windows ───
FOR EACH (alert_level, fast_window, slow_window, threshold) IN config.burn_rate_windows DO
burn_fast ← COMPUTE_BURN_RATE(slo, metrics_stream, fast_window)
burn_slow ← COMPUTE_BURN_RATE(slo, metrics_stream, slow_window)
IF burn_fast > threshold AND burn_slow > threshold THEN
FIRE_ALERT(alert_level, slo, burn_fast, burn_slow, budget_remaining)
END IF
END FOR
// ─── Determine enforcement level ───
max_burn ← MAX(COMPUTE_BURN_RATE(slo, metrics_stream, 1h) FOR slo IN slos)
new_level ← MATCH max_burn:
< 1.0 → NORMAL_OPERATION
< 3.0 → INCREASED_MONITORING
< 6.0 → FREEZE_DEPLOYMENTS
< 14.4 → ACTIVATE_DEGRADATION
≥ 14.4 → INCIDENT_RESPONSE
IF new_level > enforcement_level THEN
// Escalate immediately
enforcement_level ← new_level
EXECUTE_ENFORCEMENT(enforcement_level)
ELSE IF new_level < enforcement_level THEN
// De-escalate only after stability period
IF STABLE_FOR(new_level, config.stability_window) THEN
enforcement_level ← new_level
EXECUTE_ENFORCEMENT(enforcement_level)
END IF
END IF
// ─── Emit dashboard metrics ───
EMIT_METRIC("sla.sli", {slo: slo.id, value: sli_value})
EMIT_METRIC("sla.error_budget_remaining",
{slo: slo.id, value: budget_remaining})
EMIT_METRIC("sla.burn_rate_1h",
{slo: slo.id, value: COMPUTE_BURN_RATE(slo, metrics_stream, 1h)})
END FOR
END LOOP
METHOD EXECUTE_ENFORCEMENT(level):
MATCH level:
NORMAL_OPERATION:
SET_DEGRADATION_LEVEL(C_0)
UNFREEZE_DEPLOYMENTS()
INCREASED_MONITORING:
INCREASE_MONITORING_FREQUENCY(2x)
NOTIFY_TEAM("sla_burn_rate_elevated")
FREEZE_DEPLOYMENTS:
FREEZE_DEPLOYMENT_PIPELINE()
NOTIFY_TEAM("deployments_frozen_sla_risk")
ACTIVATE_DEGRADATION:
SET_DEGRADATION_LEVEL(C_1) // Reduce verification overhead
FREEZE_DEPLOYMENT_PIPELINE()
NOTIFY_ONCALL("degradation_activated")
INCIDENT_RESPONSE:
SET_DEGRADATION_LEVEL(C_2) // Fallback model
FREEZE_DEPLOYMENT_PIPELINE()
PAGE_ONCALL("sla_critical_burn_rate")
CREATE_INCIDENT(severity=CRITICAL)
END
21.12.8 SLA Reporting and Accountability#
Monthly SLA reports are auto-generated from the same metrics stream that drives enforcement. Each report includes:
- Achieved SLI vs. target for each SLO.
- Error budget consumed and remaining.
- Top contributing incidents to budget consumption.
- Burn rate trends over the period.
- Recommendations from chaos experiments and post-mortems.
Synthesis: Fault Tolerance as an Architectural Discipline#
The fault tolerance architecture presented in this chapter treats failures not as exceptions but as first-class system events with typed classifications, bounded recovery procedures, and measurable impact on service-level guarantees. The key architectural contributions:
| Principle | Mechanism | Guarantee |
|---|---|---|
| Typed failure classification | 5-class taxonomy with automated classification engine | Correct recovery strategy selection |
| Bounded retry | Exponential backoff with jitter, layered budgets, idempotency keys | No unbounded resource consumption; at-most-once semantics |
| Circuit isolation | Circuit breakers with adaptive recovery probing | Failing dependencies do not cascade |
| Resource partitioning | Bulkhead isolation with elastic borrowing | Cross-concern failure containment |
| Deadline discipline | Propagated deadlines, optimal budget allocation, preemptive cancellation | Latency SLA compliance under composition |
| Admission governance | Token-bucket rate limits, composite load signals, priority-based shedding | System stability under overload |
| Graceful degradation | 5-level capability hierarchy, health-driven feature flags, hysteresis | Continued service at reduced capability |
| Compensating transactions | Saga pattern with WAL-persisted coordination | Eventually consistent rollback of multi-step mutations |
| Crash recovery | Checkpoint + WAL replay with idempotent re-execution | Resumable sessions after unplanned termination |
| Proactive validation | Chaos engineering with safety envelopes | Discovery of unknown failure modes before production incidents |
| Operational automation | Executable runbooks, post-mortem → improvement pipeline | Reduced MTTR, institutional learning |
| SLA enforcement | Multi-window burn rate monitoring, automated enforcement escalation | Measurable, enforceable service guarantees |
A system that cannot quantify its failure tolerance cannot claim to be production-grade. The architecture in this chapter ensures every failure mode is classified, every recovery is bounded, every degradation is transparent, and every service guarantee is continuously measured against an explicit error budget. This is the engineering standard required for agentic AI systems operating at sustained enterprise scale.
End of Chapter 21.