Preface to Chapter 13#
An agentic system that cannot actuate the external world through well-governed, observable, and fault-tolerant tool interfaces is merely a text generator. This chapter formally defines the architecture of tool infrastructure in production agentic systems: from the protocol-level design of Model Context Protocol (MCP) servers, through typed contract enforcement, invocation lifecycle management, authorization propagation, and operational observability, to the governance mechanisms that keep autonomous agents safe under mutation pressure. Every tool interaction is treated as a first-class distributed systems problem — with schemas, deadlines, idempotency guarantees, versioning contracts, and human-interruptible control surfaces.
13.1 Tools as First-Class Infrastructure: Beyond Simple Function Calling#
13.1.1 The Inadequacy of Naive Function Calling#
Conventional LLM function calling — where a model emits a JSON object matching a function signature and the runtime blindly dispatches it — is structurally inadequate for production agentic systems. The deficiencies are categorical:
- No schema negotiation at runtime: The model receives a static list of function signatures at prompt compilation time; there is no mechanism for capability discovery, version awareness, or conditional availability.
- No authorization boundary: The calling model implicitly inherits the credentials of the host process rather than operating under caller-scoped, least-privilege authorization.
- No invocation lifecycle: There is no pre-validation, no post-verification, no timeout classification, and no structured error recovery — only success or opaque failure.
- No observability contract: Invocations are fire-and-forget from the model's perspective; latency, cost, side-effect auditing, and trace propagation are afterthoughts rather than architectural invariants.
- No idempotency guarantee: Retries after transient failures may duplicate state-changing mutations, with no deduplication mechanism.
13.1.2 The Tool-as-Infrastructure Thesis#
Tools in a principled agentic architecture must be treated as typed, versioned, observable, authorization-scoped, lifecycle-managed infrastructure components — equivalent in rigor to microservice APIs in distributed systems engineering. Formally, a tool is defined as a tuple:

$$T = (\mathrm{id},\ v,\ \Sigma_{\mathrm{in}},\ \Sigma_{\mathrm{out}},\ E,\ A,\ \tau,\ I,\ \mu,\ O)$$

where:
| Symbol | Definition |
|---|---|
| $\mathrm{id}$ | Globally unique, stable tool identifier (namespace-scoped) |
| $v$ | Semantic version of the tool contract |
| $\Sigma_{\mathrm{in}}$ | JSON Schema–validated input type |
| $\Sigma_{\mathrm{out}}$ | Typed structured output schema (including pagination metadata) |
| $E$ | Error envelope with classified error codes, retryability flag, and human-readable diagnostics |
| $A$ | Authorization contract: required scopes, credential propagation rules |
| $\tau$ | Timeout class assignment: interactive, standard, long-running, or async |
| $I$ | Idempotency specification: key derivation, deduplication window, at-least-once semantics |
| $\mu$ | Read-only vs. state-changing classification with side-effect manifest |
| $O$ | Trace context propagation, metric emission points, audit log contract |
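The contract tuple above can be sketched as a typed object. The following Python dataclass is a minimal illustration — all class and field names (`ToolContract`, `TimeoutClass`, and so on) are hypothetical, not part of MCP or any particular runtime:

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any

class TimeoutClass(Enum):          # tau: timeout class assignment (see 13.7)
    INTERACTIVE = "interactive"    # < 500 ms
    STANDARD = "standard"          # < 5 s
    LONG_RUNNING = "long_running"  # < 5 min
    ASYNC = "async"                # > 5 min

class MutationClass(Enum):         # mu: side-effect classification
    READ_ONLY = "read_only"
    WRITE = "write"

@dataclass(frozen=True)
class ToolContract:
    """One registered tool: (id, v, Sigma_in, Sigma_out, E, A, tau, I, mu, O)."""
    tool_id: str                     # id: globally unique, namespace-scoped
    version: str                     # v: semantic version of the contract
    input_schema: dict[str, Any]     # Sigma_in: JSON Schema for inputs
    output_schema: dict[str, Any]    # Sigma_out: typed output schema
    error_codes: frozenset[str]      # E: classified error envelope codes
    required_scopes: frozenset[str]  # A: authorization contract
    timeout_class: TimeoutClass      # tau
    idempotency_window_s: int        # I: deduplication window, in seconds
    mutation_class: MutationClass    # mu
    emits_audit_log: bool = True     # O: observability contract (simplified)

search_tool = ToolContract(
    tool_id="files.search",
    version="1.2.0",
    input_schema={"type": "object",
                  "properties": {"query": {"type": "string"}},
                  "required": ["query"], "additionalProperties": False},
    output_schema={"type": "object"},
    error_codes=frozenset({"INVALID_INPUT", "DEADLINE_EXCEEDED"}),
    required_scopes=frozenset({"files:read"}),
    timeout_class=TimeoutClass.STANDARD,
    idempotency_window_s=3600,
    mutation_class=MutationClass.READ_ONLY,
)
```

A registry would reject any construction missing a field, which is the code-level analogue of the admission predicate in §13.1.4.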
13.1.3 Protocol Layering for Tool Communication#
The tool protocol stack mirrors the broader platform architecture:
- MCP (Model Context Protocol) — Discovery, schema announcement, prompt surface exposure, resource listing, and change notification for tools accessible to the agent runtime.
- JSON-RPC 2.0 — The user/application boundary protocol for tool invocations initiated by external orchestrators or human-in-the-loop interfaces.
- gRPC/Protobuf — Internal service-to-service execution path for low-latency tool dispatch between orchestration agents and tool servers, with strongly typed message contracts, deadline propagation, and bidirectional streaming for long-running operations.
Every tool invocation carries a trace context (W3C Trace Context or equivalent), a deadline (derived from the timeout class), and a caller identity (propagated, never impersonated).
13.1.4 Formal Quality Gate for Tool Admission#
A tool is admitted into the agent's available toolset if and only if it satisfies the following admission predicate:

$$\mathrm{Admit}(T) \iff \mathrm{Valid}(\Sigma_{\mathrm{in}}) \land \mathrm{Valid}(\Sigma_{\mathrm{out}}) \land \mathrm{Defined}(E) \land \mathrm{Defined}(A) \land \mathrm{Assigned}(\tau) \land \big(\mu = \text{read-only} \lor \mathrm{Specified}(I)\big) \land \mathrm{Defined}(O)$$
Tools that fail any conjunct are rejected at registration time — never silently degraded.
13.2 MCP Tool Server Design Patterns#
The Model Context Protocol (MCP) defines a standardized interface through which agents discover, negotiate, and invoke tools. MCP servers are categorized by state management requirements and composition complexity.
13.2.1 Stateless Tool Servers: Pure Computation and Data Retrieval#
Definition. A stateless MCP tool server processes each request as an isolated, side-effect-free computation. No server-side session state persists between invocations.
Characteristics:
- Referential transparency: For identical inputs and identical external data snapshots, the output is deterministic.
- Horizontal scalability: Any replica can serve any request; no affinity routing required.
- Trivial idempotency: Re-execution is safe by construction since no mutation occurs.
- Cache-friendly: Responses can be cached by input hash with TTL governed by data freshness requirements.
Canonical Examples:
- Mathematical computation servers (symbolic algebra, unit conversion, statistical calculation)
- Read-only database query executors (with parameterized, pre-authorized query templates)
- Document format converters (Markdown → PDF, CSV → JSON)
- Embedding computation servers (text → vector, with model version pinning)
Pseudo-Algorithm 13.1: Stateless MCP Tool Server Request Handler
PROCEDURE HandleStatelessRequest(request, toolRegistry, cache):
// Phase 1: Schema validation
tool ← toolRegistry.Resolve(request.toolId, request.version)
IF tool = NULL THEN
RETURN ErrorEnvelope(code=TOOL_NOT_FOUND, retryable=FALSE)
validationResult ← ValidateAgainstSchema(request.params, tool.inputSchema)
IF NOT validationResult.valid THEN
RETURN ErrorEnvelope(code=INVALID_INPUT, details=validationResult.errors, retryable=FALSE)
// Phase 2: Cache lookup
cacheKey ← DeriveCanonicalHash(request.toolId, request.version, request.params)
cachedResult ← cache.Get(cacheKey)
IF cachedResult ≠ NULL AND cachedResult.age < tool.cacheTTL THEN
RETURN SuccessEnvelope(data=cachedResult.value, source="cache", traceId=request.traceId)
// Phase 3: Execute
deadline ← MIN(request.deadline, tool.timeoutClass.maxDuration)
result ← ExecuteWithDeadline(tool.handler, request.params, deadline)
IF result.timedOut THEN
RETURN ErrorEnvelope(code=DEADLINE_EXCEEDED, retryable=TRUE)
// Phase 4: Output validation
outputValid ← ValidateAgainstSchema(result.value, tool.outputSchema)
IF NOT outputValid.valid THEN
EmitAlert(TOOL_OUTPUT_SCHEMA_VIOLATION, tool.id, result.value)
RETURN ErrorEnvelope(code=INTERNAL_TOOL_ERROR, retryable=FALSE)
// Phase 5: Cache store and return
cache.Put(cacheKey, result.value, ttl=tool.cacheTTL)
EmitMetric(tool.id, latency=result.duration, status=SUCCESS)
RETURN SuccessEnvelope(data=result.value, source="compute", traceId=request.traceId)

Design Constraints:
- Stateless servers MUST NOT write to any persistent store as a side effect of tool execution.
- All external data reads MUST be through read-only connections or pre-authorized query templates.
- Cache invalidation MUST be event-driven (via MCP change notifications) or TTL-bounded; stale data is a correctness hazard in agentic reasoning.
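The cache-by-input-hash and TTL constraints above can be sketched as follows. `canonical_hash` and `TTLCache` are illustrative names, and a production server would use a shared store (e.g., a distributed cache) rather than an in-process dict:

```python
import hashlib
import json
import time

def canonical_hash(tool_id: str, version: str, params: dict) -> str:
    """Cache key: hash of tool identity plus canonically serialized params."""
    payload = json.dumps([tool_id, version, params],
                         sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

class TTLCache:
    """Response cache for stateless tools; entries expire after ttl_s seconds."""
    def __init__(self, ttl_s: float, clock=time.monotonic):
        self._ttl, self._clock, self._store = ttl_s, clock, {}

    def get(self, key: str):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, stored_at = entry
        if self._clock() - stored_at >= self._ttl:
            del self._store[key]       # TTL-bounded: stale data is a hazard
            return None
        return value

    def put(self, key: str, value):
        self._store[key] = (value, self._clock())

    def invalidate(self, key: str):
        """Event-driven path: called on an MCP change notification."""
        self._store.pop(key, None)
```

Because the canonical serialization sorts keys, two logically identical requests with different JSON key order hash to the same cache entry.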
13.2.2 Stateful Tool Servers: Session-Aware, Transaction-Capable Services#
Definition. A stateful MCP tool server maintains mutable state across invocations within a bounded session or transaction scope. State transitions must be explicit, auditable, and reversible where feasible.
Characteristics:
- Session affinity: Requests within a session must route to the same server instance or to a shared session store.
- Transaction boundaries: State mutations occur within explicit transaction scopes with commit/rollback semantics.
- Idempotency enforcement: Every mutating operation MUST accept an idempotency key; duplicate submissions within the deduplication window MUST return the original result without re-execution.
- Compensating actions: For mutations that cannot be atomically rolled back, a compensating action (inverse operation) MUST be registered at commit time.
Canonical Examples:
- Code execution sandboxes (stateful REPL sessions with persistent variable scope)
- Database write proxies (INSERT, UPDATE, DELETE with transactional guarantees)
- File system manipulation servers (create, modify, delete with journaling)
- External API integrators that perform multi-step workflows (e.g., create a cloud resource, then configure it)
Formal State Transition Model:
Let $S$ be the state space of a stateful tool server, and let $a \in A$ be an action (tool invocation). The transition function is:

$$\delta : S \times A \times K \to S \times R \times J, \qquad \delta(s, a, k) = (s', r, j)$$

where $k \in K$ is the idempotency key, $r \in R$ is the result, and $j \in J$ is the journal entry recording the transition for auditability and rollback.

The idempotency invariant is:

$$\delta(s', a, k) = (s', r, \varnothing)$$

That is, re-applying the same action with the same idempotency key from the post-state $s'$ yields no further state change and replays the recorded result $r$.
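A toy model makes the invariant concrete. The sketch below (a hypothetical `CounterServer`) shows how an idempotency store turns a duplicate submission into a replay of the recorded result, with no additional state change and no additional journal entry:

```python
class CounterServer:
    """Toy stateful tool: increments a counter. The idempotency key makes a
    repeated application a no-op that replays the recorded result."""
    def __init__(self):
        self.state = 0          # s: server state
        self.journal = []       # j: audit journal entries
        self._seen = {}         # idempotency store: key -> recorded result

    def apply(self, amount: int, key: str) -> dict:
        if key in self._seen:               # replay: no state change, same r
            return self._seen[key]
        self.state += amount                # s -> s'
        result = {"new_value": self.state}  # r
        self.journal.append(("increment", amount, key))
        self._seen[key] = result
        return result
```

Submitting the same (action, key) pair twice yields one state transition, one journal entry, and two identical results — the "effectively-once" behavior formalized in §13.8.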
Pseudo-Algorithm 13.2: Stateful MCP Tool Server with Idempotency
PROCEDURE HandleStatefulRequest(request, toolRegistry, sessionStore, idempotencyStore):
// Phase 1: Resolve tool and session
tool ← toolRegistry.Resolve(request.toolId, request.version)
session ← sessionStore.GetOrCreate(request.sessionId)
// Phase 2: Idempotency check
IF request.idempotencyKey ≠ NULL THEN
prior ← idempotencyStore.Lookup(request.toolId, request.idempotencyKey)
IF prior ≠ NULL AND prior.age < tool.idempotencyWindow THEN
EmitMetric(tool.id, status=IDEMPOTENT_REPLAY)
RETURN prior.result // Return cached result without re-execution
// Phase 3: Authorization (caller-scoped)
authDecision ← EvaluatePolicy(
callerIdentity=request.callerToken,
requiredScopes=tool.authContract.requiredScopes,
resourceContext=session.resourceScope
)
IF authDecision = DENY THEN
RETURN ErrorEnvelope(code=AUTHORIZATION_DENIED, retryable=FALSE)
IF authDecision = REQUIRES_APPROVAL THEN
approvalId ← SubmitApprovalRequest(request, tool, session)
RETURN PendingEnvelope(approvalId=approvalId, pollEndpoint=...)
// Phase 4: Input validation
validationResult ← ValidateAgainstSchema(request.params, tool.inputSchema)
IF NOT validationResult.valid THEN
RETURN ErrorEnvelope(code=INVALID_INPUT, details=validationResult.errors)
// Phase 5: Begin transaction, execute, commit
deadline ← MIN(request.deadline, tool.timeoutClass.maxDuration)
txn ← session.BeginTransaction()
TRY:
result ← ExecuteWithDeadline(tool.handler, request.params, session.state, txn, deadline)
outputValid ← ValidateAgainstSchema(result.value, tool.outputSchema)
IF NOT outputValid.valid THEN
txn.Rollback()
RETURN ErrorEnvelope(code=INTERNAL_TOOL_ERROR)
compensatingAction ← tool.DeriveCompensation(request.params, result.value)
journalEntry ← JournalEntry(tool.id, request.params, result.value, compensatingAction, timestamp=NOW())
txn.Commit()
session.AppendJournal(journalEntry)
// Phase 6: Store idempotency record
IF request.idempotencyKey ≠ NULL THEN
idempotencyStore.Store(request.toolId, request.idempotencyKey,
SuccessEnvelope(data=result.value), ttl=tool.idempotencyWindow)
EmitMetric(tool.id, latency=result.duration, status=SUCCESS, mutation=TRUE)
RETURN SuccessEnvelope(data=result.value, traceId=request.traceId)
CATCH exception:
txn.Rollback()
EmitMetric(tool.id, status=FAILURE, error=exception.class)
RETURN ErrorEnvelope(code=EXECUTION_FAILED, retryable=IsTransient(exception))

13.2.3 Composite Tool Servers: Orchestrating Multi-Step Tool Chains#
Definition. A composite MCP tool server composes multiple atomic tool invocations into a coherent multi-step workflow, managing inter-step dependencies, partial failure recovery, and aggregate result synthesis.
Motivation. Agents frequently require tool chains — e.g., "search for relevant files, then read their contents, then run static analysis on the results." Delegating chain orchestration entirely to the LLM planner introduces fragility: the model may mis-sequence steps, fail to handle partial failures, or lose intermediate results due to context window pressure. Composite tool servers internalize the orchestration logic while exposing a single, high-level tool interface to the agent.
Formal Composition Model:
A composite tool is defined as a directed acyclic graph (DAG) of atomic tools:

$$C = (V, E, \beta, \phi)$$

where:
- $V = \{T_1, \dots, T_n\}$ is the set of constituent atomic tools.
- $E \subseteq V \times V$ defines execution dependencies ($(T_i, T_j) \in E$ means $T_j$ depends on $T_i$'s output).
- $\beta$ is the binding function that maps upstream outputs to downstream inputs.
- $\phi$ is the merge function that synthesizes the composite output.
Execution Semantics:
- Topologically sort $V$ according to $E$.
- Execute each level of the topological sort in parallel where no data dependencies exist.
- Apply $\beta$ to route intermediate results.
- On any step failure, evaluate the step's retry budget; if exhausted, invoke compensating actions for all committed prior steps (reverse topological order) and report aggregate failure.
- Apply $\phi$ to produce the composite result.
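The first step — sorting the DAG into levels that can execute in parallel, with cycle rejection at registration time — can be sketched with Kahn's algorithm. `execution_levels` is an illustrative helper, not part of MCP:

```python
from collections import deque

def execution_levels(tools: set, edges: set) -> list:
    """Group tools into topological levels: tools within one level share no
    dependencies and may execute in parallel. Raises ValueError on cycles."""
    indegree = {t: 0 for t in tools}
    downstream = {t: [] for t in tools}
    for upstream, dependent in edges:   # (u, d): d depends on u's output
        downstream[upstream].append(dependent)
        indegree[dependent] += 1
    frontier = deque(sorted(t for t, d in indegree.items() if d == 0))
    levels, visited = [], 0
    while frontier:
        level = sorted(frontier)        # deterministic ordering for clarity
        frontier.clear()
        for t in level:
            visited += 1
            for d in downstream[t]:
                indegree[d] -= 1
                if indegree[d] == 0:    # all upstream results now available
                    frontier.append(d)
        levels.append(level)
    if visited != len(tools):
        raise ValueError("composite DAG contains a cycle")  # reject at registration
    return levels
```

For the chain "search → read → analyze" this yields three sequential levels; two independent fetches feeding one merge step yield a two-level plan with the fetches side by side.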
Pseudo-Algorithm 13.3: Composite Tool Execution with Partial Failure Recovery
PROCEDURE ExecuteComposite(compositeSpec, inputParams, deadline):
dag ← BuildDAG(compositeSpec.tools, compositeSpec.edges)
levels ← TopologicalSort(dag)
intermediateResults ← EmptyMap()
committedSteps ← EmptyStack()
FOR EACH level IN levels:
parallelTasks ← []
FOR EACH tool IN level:
boundInput ← ApplyBindings(compositeSpec.bindings[tool.id], inputParams, intermediateResults)
parallelTasks.Add(ScheduleToolInvocation(tool, boundInput, deadline))
results ← AwaitAll(parallelTasks, deadline)
FOR EACH (tool, result) IN results:
IF result.status = FAILURE THEN
IF tool.retryBudget > 0 THEN
retryResult ← RetryWithBackoff(tool, result.input, tool.retryBudget, deadline)
IF retryResult.status = FAILURE THEN
GOTO CompensateAndFail
result ← retryResult
ELSE
GOTO CompensateAndFail
intermediateResults[tool.id] ← result.value
committedSteps.Push(tool)
compositeOutput ← ApplyMergeFunction(compositeSpec.mergeFunction, intermediateResults)
RETURN SuccessEnvelope(data=compositeOutput)
CompensateAndFail:
completedIds ← committedSteps.AsList() // snapshot before compensation drains the stack
WHILE committedSteps IS NOT EMPTY:
completedTool ← committedSteps.Pop()
compensation ← completedTool.DeriveCompensation(intermediateResults[completedTool.id])
IF compensation ≠ NULL THEN
ExecuteCompensation(compensation, deadline)
RETURN ErrorEnvelope(code=COMPOSITE_PARTIAL_FAILURE,
failedStep=tool.id,
completedSteps=completedIds)

Design Constraints for Composite Servers:
- The composite DAG MUST be acyclic; cycle detection is performed at registration time.
- Maximum DAG depth is bounded (to a small, fixed number of steps) to enforce latency predictability.
- Each constituent tool invocation inherits the caller's authorization scope — the composite server MUST NOT escalate privileges.
- Intermediate results are held in ephemeral working memory and MUST NOT be persisted beyond the composite execution scope unless explicitly required by the output contract.
13.3 Tool Schema Design: JSON Schema Input Validation, Structured Output Types, Error Envelopes#
13.3.1 Input Schema: JSON Schema with Semantic Annotations#
Every tool input MUST be described by a JSON Schema document (draft 2020-12 or later) augmented with semantic annotations that guide both the LLM's parameter generation and the runtime's validation logic.
Required Schema Properties:
| Property | Purpose |
|---|---|
| `type`, `properties`, `required` | Standard structural validation |
| `description` (per-property) | Semantic guidance for the LLM planner — must be precise, unambiguous, and include valid example values |
| `enum` / `const` | Constrained value sets; reduces hallucination risk in parameter generation |
| `format` | Semantic format hints (`date-time`, `uri`, `email`, `uuid`) with runtime validation |
| `minLength`, `maxLength`, `pattern` | String constraint enforcement |
| `minimum`, `maximum`, `multipleOf` | Numeric constraint enforcement |
| `x-sensitivity` | Custom extension: marks fields containing PII, secrets, or credentials for audit redaction |
| `x-token-cost` | Custom extension: estimated token cost of including this field in context, used by the prefill compiler |
| `default` | Default values for optional parameters; reduces LLM burden for common cases |
Validation Rigor:
Input validation is strict by default — additional properties not declared in the schema are rejected (i.e., additionalProperties: false). This prevents the LLM from inventing parameters that the tool does not support, which is a common hallucination pattern.
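A minimal sketch of this strict-by-default validation, using hand-rolled checks for required fields, primitive types, and undeclared properties. A production server would use a full JSON Schema 2020-12 validator library; `validate_strict` is a hypothetical name:

```python
def validate_strict(params: dict, schema: dict) -> list:
    """Return a list of validation errors (empty list means valid).
    Enforces required properties, basic type checks, and the
    additionalProperties: false rule against hallucinated parameters."""
    errors = []
    declared = schema.get("properties", {})
    type_map = {"string": str, "integer": int, "number": (int, float),
                "boolean": bool, "object": dict, "array": list}
    for name in schema.get("required", []):
        if name not in params:
            errors.append(f"missing required property: {name}")
    for name, value in params.items():
        if name not in declared:
            errors.append(f"undeclared property: {name}")  # invented parameter
            continue
        expected = type_map.get(declared[name].get("type"))
        if expected and not isinstance(value, expected):
            errors.append(f"wrong type for {name}: "
                          f"expected {declared[name]['type']}")
    return errors
```

The undeclared-property branch is the code-level form of `additionalProperties: false` — a parameter the LLM invented is rejected with an actionable diagnostic instead of being silently passed through.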
13.3.2 Output Schema: Typed Structured Responses#
Tool outputs are not freeform text. Every tool defines a typed output schema that includes:
- Result payload: The primary structured data returned by the tool.
- Pagination metadata (where applicable):
cursor,hasMore,totalCount— enabling the agent to request subsequent pages without re-specifying the query. - Provenance metadata:
source,retrievedAt,dataVersion— enabling the agent and downstream verifiers to assess result authority and freshness. - Execution metadata:
executionDurationMs,resourceCost— enabling the orchestrator to update latency and cost models.
13.3.3 Error Envelope: Classified, Retryable, Diagnostic#
All tool failures MUST be returned within a standardized error envelope, never as raw exceptions or unstructured text.
Error Envelope Structure:
| Field | Type | Description |
|---|---|---|
| `errorCode` | Enum | Machine-readable error class (see taxonomy below) |
| `retryable` | Boolean | Whether the caller should retry with the same parameters |
| `retryAfterMs` | Integer (optional) | Suggested backoff duration before retry |
| `humanMessage` | String | Human-readable diagnostic (never leaks credentials or internal paths) |
| `details` | Object (optional) | Structured diagnostic data (validation errors, partial results, trace references) |
| `traceId` | String | Correlation identifier for distributed tracing |
Error Code Taxonomy:

| Error Code | Retryable | Meaning |
|---|---|---|
| `TOOL_NOT_FOUND` | No | The requested tool ID or version is not registered |
| `INVALID_INPUT` | No | Parameters failed schema or semantic validation |
| `AUTHORIZATION_DENIED` | No | The caller lacks the required scopes |
| `DEADLINE_EXCEEDED` | Yes | Execution exceeded the effective deadline |
| `EXECUTION_FAILED` | Conditional | Handler failure; retryable only if classified as transient |
| `INTERNAL_TOOL_ERROR` | No | Output schema violation or side-effect inconsistency |
| `RETRY_EXHAUSTED` | No | The orchestrator's retry budget was consumed |
| `COMPOSITE_PARTIAL_FAILURE` | No | A composite step failed and compensating actions were applied |

The retryable flag is the authoritative signal — the agent loop MUST NOT attempt retries on non-retryable errors, regardless of heuristic judgment.
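A sketch of the envelope and the authoritative-flag rule, with illustrative Python names (`ErrorEnvelope`, `should_retry`):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class ErrorEnvelope:
    error_code: str                    # machine-readable error class
    retryable: bool                    # authoritative retry signal
    human_message: str                 # diagnostic; no secrets, no internal paths
    trace_id: str                      # distributed-tracing correlation id
    retry_after_ms: Optional[int] = None  # suggested backoff, if any

def should_retry(envelope: ErrorEnvelope,
                 attempts_used: int, max_attempts: int) -> bool:
    """The retryable flag is authoritative: never retry a non-retryable
    error, regardless of any heuristic about the error code."""
    return envelope.retryable and attempts_used < max_attempts
```

Note that the decision function never inspects `error_code` — classification happens once, server-side, and the caller obeys the flag.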
13.4 Tool Discovery and Registration: Dynamic Capability Announcement, Schema Negotiation#
13.4.1 Discovery Protocol#
Tool discovery follows the MCP capability announcement pattern. When an agent runtime connects to an MCP server (or when an MCP server is dynamically registered with the orchestrator), the following handshake occurs:
Pseudo-Algorithm 13.4: MCP Tool Discovery Handshake
PROCEDURE DiscoverTools(mcpEndpoint, agentCapabilities):
// Step 1: Capability negotiation
serverCapabilities ← mcpEndpoint.Initialize(
protocolVersion=CURRENT_MCP_VERSION,
clientCapabilities=agentCapabilities
)
IF NOT CompatibleVersion(serverCapabilities.protocolVersion, CURRENT_MCP_VERSION) THEN
RETURN DiscoveryFailure(reason="Protocol version mismatch")
// Step 2: List available tools with schemas
toolList ← mcpEndpoint.ListTools(cursor=NULL)
allTools ← []
WHILE toolList IS NOT NULL:
FOR EACH toolDescriptor IN toolList.tools:
// Step 3: Schema validation at discovery time
schemaValid ← ValidateToolDescriptor(toolDescriptor)
IF schemaValid THEN
allTools.Add(toolDescriptor)
ELSE
EmitWarning(INVALID_TOOL_SCHEMA, toolDescriptor.id)
IF toolList.hasMore THEN
toolList ← mcpEndpoint.ListTools(cursor=toolList.nextCursor)
ELSE
toolList ← NULL
// Step 4: Register discovered tools in agent's tool registry
FOR EACH tool IN allTools:
agentToolRegistry.Register(tool, source=mcpEndpoint.id, discoveredAt=NOW())
// Step 5: Subscribe to change notifications
mcpEndpoint.SubscribeToChanges(callback=OnToolSetChanged)
RETURN allTools

13.4.2 Registration Invariants#
Every registered tool MUST satisfy the admission predicate defined in §13.1.4. Additionally:
- Namespace isolation: Tool IDs are scoped by MCP server namespace to prevent collisions across independently managed servers.
- Version coexistence: Multiple versions of the same tool may be registered simultaneously. The agent's prefill compiler selects the appropriate version based on task requirements and deprecation status.
- Liveness probing: The orchestrator periodically probes registered MCP servers for health. Tools from unresponsive servers are marked `UNAVAILABLE` in the registry but not deregistered (to preserve discovery history and enable rapid re-availability).
- Change propagation: When an MCP server emits a change notification (tool added, removed, or schema modified), the agent's tool registry is updated atomically, and the prefill compiler's tool affordance cache is invalidated.
13.4.3 Schema Negotiation#
Schema negotiation resolves capability mismatches between agent expectations and tool offerings at discovery time, before any invocation is attempted.
Partial matches are acceptable when missing capabilities are optional. The agent loop adapts its plan accordingly.
13.5 Lazy Tool Loading: Minimizing Context Cost by Deferring Schema Injection#
13.5.1 The Context Cost Problem#
Each tool schema injected into the agent's context window consumes tokens. For a system with $N$ registered tools, each with an average schema size of $\bar{s}$ tokens, the total tool-affordance context cost is:

$$C_{\mathrm{tools}} = N \cdot \bar{s}$$

For production systems with dozens to hundreds of tools, each schema costing several hundred tokens, $C_{\mathrm{tools}}$ reaches tens of thousands of tokens — a substantial fraction of even large context windows. This directly reduces the budget available for task-relevant instructions, retrieved evidence, memory summaries, and reasoning.
13.5.2 Lazy Loading Strategy#
Lazy tool loading defers schema injection until the agent's planner determines that a tool is likely needed for the current task. The strategy operates in three tiers:
Tier 0 — Tool Index (Always Loaded): A compressed index of all available tools, containing only (toolId, oneSentenceDescription, mutationClass). Cost: on the order of tens of tokens per tool.
Tier 1 — Selected Tool Schemas (Loaded on Plan): When the agent's planner identifies relevant tools during the planning phase, full schemas for only those tools are injected.
Tier 2 — Extended Tool Details (Loaded on Demand): For tools with complex schemas (e.g., multi-page input structures), only the top-level schema is injected at Tier 1; nested details are loaded only when the agent begins constructing the specific invocation.
Pseudo-Algorithm 13.5: Lazy Tool Loading in Prefill Compilation
PROCEDURE CompileToolAffordances(taskObjective, toolRegistry, tokenBudget):
// Tier 0: Always include compressed index
toolIndex ← toolRegistry.GetCompressedIndex()
indexTokens ← CountTokens(toolIndex)
remainingBudget ← tokenBudget - indexTokens
affordanceBlock ← [toolIndex]
// Tier 1: Select relevant tools via planner query
relevantToolIds ← PlannerSelectTools(taskObjective, toolIndex)
// Rank by predicted utility for current task
rankedTools ← RankByTaskUtility(relevantToolIds, taskObjective)
FOR EACH toolId IN rankedTools:
schema ← toolRegistry.GetSchema(toolId, detail=STANDARD)
schemaTokens ← CountTokens(schema)
IF schemaTokens ≤ remainingBudget THEN
affordanceBlock.Add(schema)
remainingBudget ← remainingBudget - schemaTokens
ELSE
// Tier 2: Include compressed schema reference only
compressedRef ← toolRegistry.GetSchema(toolId, detail=MINIMAL)
refTokens ← CountTokens(compressedRef)
IF refTokens ≤ remainingBudget THEN
affordanceBlock.Add(compressedRef)
remainingBudget ← remainingBudget - refTokens
ELSE
BREAK // Budget exhausted
RETURN affordanceBlock, remainingBudget

13.5.3 Token Budget Optimization#
The optimal tool selection under a token budget $B$ is formulated as a 0-1 knapsack problem:

$$\max_{x \in \{0,1\}^N} \sum_{i=1}^{N} u_i x_i \quad \text{subject to} \quad \sum_{i=1}^{N} c_i x_i \le B$$

where $u_i$ is the estimated utility of tool $i$ for the current task (derived from planner relevance scoring) and $c_i$ is the token cost of tool $i$'s schema. In practice, a greedy approximation (sort by $u_i / c_i$ descending, select greedily) is sufficient given that tool counts are moderate.
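The greedy approximation fits in a few lines. `select_tools` and the candidate triples are illustrative names under the assumption that utilities and token costs come from the planner and the registry respectively:

```python
def select_tools(candidates: list, budget: int) -> list:
    """Greedy 0-1 knapsack approximation for tool selection.
    candidates: (tool_id, utility u_i, token cost c_i) triples.
    Sort by utility-per-token, then take each tool whose schema
    still fits in the remaining token budget."""
    selected, remaining = [], budget
    ranked = sorted(candidates, key=lambda t: t[1] / t[2], reverse=True)
    for tool_id, utility, cost in ranked:
        if cost <= remaining:
            selected.append(tool_id)
            remaining -= cost
    return selected
```

With a budget of 800 tokens and candidates costing 400, 300, and 600 tokens, the two highest-ratio tools fit and the third is dropped rather than truncated.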
13.6 Tool Invocation Lifecycle: Request → Validate → Authorize → Execute → Verify → Return#
13.6.1 Lifecycle Overview#
Every tool invocation — regardless of tool type, server pattern, or timeout class — traverses a deterministic six-phase lifecycle. Skipping any phase constitutes a protocol violation.
13.6.2 Phase-by-Phase Specification#
Phase 1 — Request Construction. The agent's action generator produces a structured tool invocation request containing:
- `toolId` and `version`
- `params` (JSON object matching the input schema $\Sigma_{\mathrm{in}}$)
- `idempotencyKey` (for mutating operations)
- `callerToken` (propagated credential)
- `traceId` (from current execution trace)
- `deadline` (computed from timeout class and remaining agent loop budget)
Phase 2 — Validate. The tool server validates the request:
- Schema validation against the input schema $\Sigma_{\mathrm{in}}$
- Semantic validation (e.g., referenced resource exists, date range is valid)
- Idempotency key format validation
Validation failures produce INVALID_INPUT error envelopes with actionable diagnostic details.
Phase 3 — Authorize. The tool server evaluates the authorization policy against the caller's identity and requested operation:
Possible decisions: `ALLOW`, `DENY`, `REQUIRES_APPROVAL`.
Phase 4 — Execute. The tool handler executes the core logic within the assigned deadline. For stateful tools, execution occurs within a transaction scope. The tool server emits structured telemetry (start timestamp, intermediate checkpoints for long-running operations, completion status).
Phase 5 — Verify. Post-execution verification:
- Output schema validation against the output schema $\Sigma_{\mathrm{out}}$
- Side-effect verification (for mutating tools): confirm the intended state change occurred correctly
- Integrity checks: hash verification for file operations, row count verification for database operations
Phase 6 — Return. The tool server constructs and returns either a SuccessEnvelope or an ErrorEnvelope, with full telemetry (latency, cost attribution, trace references).
Pseudo-Algorithm 13.6: Complete Tool Invocation Lifecycle
PROCEDURE InvokeTool(request):
span ← StartTraceSpan("tool.invoke", traceId=request.traceId, toolId=request.toolId)
// PHASE 1: Request already constructed by agent action generator
// PHASE 2: Validate
tool ← ResolveToolOrFail(request.toolId, request.version)
validationErrors ← ValidateInput(request.params, tool.inputSchema)
IF validationErrors ≠ EMPTY THEN
span.SetStatus(INVALID_INPUT)
RETURN ErrorEnvelope(code=INVALID_INPUT, details=validationErrors)
// PHASE 3: Authorize
authDecision ← EvaluateAuthPolicy(tool.authContract, request.callerToken, request.params)
IF authDecision = DENY THEN
span.SetStatus(AUTHORIZATION_DENIED)
EmitAuditLog(DENIED, request.callerToken, tool.id, request.params)
RETURN ErrorEnvelope(code=AUTHORIZATION_DENIED)
IF authDecision = REQUIRES_APPROVAL THEN
approvalTicket ← CreateApprovalTicket(request, tool)
span.AddEvent("approval_requested", approvalTicket.id)
RETURN PendingEnvelope(approvalTicket)
// PHASE 4: Execute
executionContext ← PrepareExecutionContext(tool, request, span)
result ← ExecuteWithDeadline(tool.handler, executionContext, request.deadline)
IF result.status = TIMEOUT THEN
span.SetStatus(DEADLINE_EXCEEDED)
RETURN ErrorEnvelope(code=DEADLINE_EXCEEDED, retryable=TRUE)
IF result.status = ERROR THEN
span.SetStatus(EXECUTION_FAILED)
RETURN ErrorEnvelope(code=MapErrorCode(result.error), retryable=IsTransient(result.error))
// PHASE 5: Verify
outputErrors ← ValidateOutput(result.value, tool.outputSchema)
IF outputErrors ≠ EMPTY THEN
span.SetStatus(OUTPUT_SCHEMA_VIOLATION)
EmitAlert(TOOL_OUTPUT_CORRUPTION, tool.id)
RETURN ErrorEnvelope(code=INTERNAL_TOOL_ERROR)
IF tool.mutationClass = WRITE THEN
sideEffectVerification ← VerifySideEffects(tool, request.params, result.value)
IF NOT sideEffectVerification.consistent THEN
AttemptRollback(tool, request.params, result.value)
span.SetStatus(SIDE_EFFECT_INCONSISTENCY)
RETURN ErrorEnvelope(code=INTERNAL_TOOL_ERROR, details=sideEffectVerification)
// PHASE 6: Return
EmitAuditLog(SUCCESS, request.callerToken, tool.id, request.params, result.value)
span.SetStatus(SUCCESS)
span.SetAttribute("latencyMs", result.durationMs)
span.SetAttribute("cost", result.resourceCost)
span.End()
RETURN SuccessEnvelope(
data=result.value,
provenance={source: tool.id, version: tool.version, executedAt: NOW()},
executionMetadata={latencyMs: result.durationMs, cost: result.resourceCost}
)

13.7 Tool Timeout Classes: Interactive (<500ms), Standard (<5s), Long-Running (<5min), Async (>5min)#
13.7.1 Classification Rationale#
Not all tool invocations have the same latency profile. A single timeout value is either too aggressive (causing spurious failures for legitimately slow operations) or too permissive (allowing stuck invocations to block the agent loop). Timeout classes formalize latency expectations and enable the orchestrator to make informed scheduling decisions.
13.7.2 Timeout Class Definitions#
| Class | Symbol | Max Duration | Use Cases | Agent Loop Behavior |
|---|---|---|---|---|
| Interactive | $\tau_{\mathrm{int}}$ | < 500 ms | In-memory computation, cache lookups, simple data retrieval | Synchronous wait; no intermediate feedback |
| Standard | $\tau_{\mathrm{std}}$ | < 5 s | Database queries, API calls, file reads, search operations | Synchronous wait; may emit progress indicator |
| Long-Running | $\tau_{\mathrm{long}}$ | < 5 min | Code execution, large file processing, complex analysis, CI runs | Checkpoint-based progress; agent may interleave other work |
| Async | $\tau_{\mathrm{async}}$ | > 5 min | Deployments, training jobs, batch processing, external approvals | Submit-and-poll; agent continues other tasks; callback on completion |
13.7.3 Deadline Propagation#
The effective deadline for a tool invocation is the minimum of the tool's timeout class maximum and the agent loop's remaining time budget:

$$d_{\mathrm{eff}} = \min\big(\tau_{\mathrm{class}}.\mathrm{max},\ B_{\mathrm{task}} - t_{\mathrm{elapsed}}\big)$$

where $B_{\mathrm{task}}$ is the total time budget for the agent's current task iteration and $t_{\mathrm{elapsed}}$ is the time already consumed.
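A sketch of the deadline computation, using the class maxima from the table in §13.7.2 (function name, units, and the table constant are illustrative):

```python
TIMEOUT_CLASS_MAX_MS = {
    "interactive": 500,        # < 500 ms
    "standard": 5_000,         # < 5 s
    "long_running": 300_000,   # < 5 min
    "async": None,             # submit-and-poll; no synchronous deadline
}

def effective_deadline_ms(timeout_class: str,
                          task_budget_ms: int, elapsed_ms: int) -> int:
    """Effective deadline = min(class maximum, remaining agent-loop budget)."""
    remaining = max(task_budget_ms - elapsed_ms, 0)
    class_max = TIMEOUT_CLASS_MAX_MS[timeout_class]
    return remaining if class_max is None else min(class_max, remaining)
```

Near the end of a task iteration the remaining budget dominates, so even a standard-class tool may receive a deadline far below its 5-second class maximum.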
13.7.4 Long-Running and Async Tool Patterns#
Pseudo-Algorithm 13.7: Async Tool Invocation with Polling
PROCEDURE InvokeAsyncTool(request, pollingPolicy):
// Submit the async operation
submission ← ToolServer.SubmitAsync(request)
IF submission.status = REJECTED THEN
RETURN ErrorEnvelope(code=submission.errorCode)
operationId ← submission.operationId
estimatedDuration ← submission.estimatedDurationMs
// Register callback if supported
IF ToolServer.SupportsCallbacks() THEN
ToolServer.RegisterCallback(operationId, callback=OnAsyncToolComplete)
RETURN AsyncPendingEnvelope(operationId, estimatedCompletion=NOW()+estimatedDuration)
// Fallback: polling with exponential backoff
pollInterval ← pollingPolicy.initialIntervalMs
totalWait ← 0
WHILE totalWait < pollingPolicy.maxWaitMs:
Sleep(pollInterval)
totalWait ← totalWait + pollInterval
status ← ToolServer.PollStatus(operationId)
IF status.state = COMPLETED THEN
RETURN SuccessEnvelope(data=status.result)
IF status.state = FAILED THEN
RETURN ErrorEnvelope(code=status.errorCode, retryable=status.retryable)
IF status.state = IN_PROGRESS THEN
// Optionally report progress to agent
EmitProgress(operationId, status.progressPercent, status.checkpoint)
pollInterval ← MIN(pollInterval * pollingPolicy.backoffMultiplier, pollingPolicy.maxIntervalMs)
// Max wait exceeded
ToolServer.RequestCancellation(operationId)
RETURN ErrorEnvelope(code=DEADLINE_EXCEEDED, retryable=TRUE)

13.7.5 Timeout Class Assignment Rules#
A tool's timeout class is assigned at registration time based on empirical latency profiling and operational characteristics: the smallest class whose maximum duration bounds the tool's observed tail latency, with headroom for variance, is selected.
Misclassification is detected via observability (§13.13) and triggers automatic reclassification alerts.
13.8 Tool Idempotency Requirements: Safe Retries, Deduplication Keys, and At-Least-Once Semantics#
13.8.1 The Idempotency Imperative#
In distributed agentic systems, transient failures (network partitions, deadline exceedances, process restarts) are routine. Without idempotency guarantees, retrying a failed tool invocation may duplicate state-changing mutations (double-writes, duplicate resource creation, repeated financial transactions). Idempotency is not optional for mutating tools — it is a correctness requirement.
13.8.2 Formal Definition#
A tool action $a$ is idempotent with respect to idempotency key $k$ if, for every state $s$ and every $n \ge 1$:

$$\delta^{(n)}(s, a, k) = \delta(s, a, k)$$

where $\delta^{(n)}$ denotes $n$ successive applications of $\delta$ with the same action and key. In operational terms: applying the same action with the same idempotency key $n$ times produces the same final state and the same result as applying it once.
13.8.3 Idempotency Key Design#
The idempotency key must be:
- Caller-generated: The agent (or orchestrator) generates the key before the first attempt. The tool server MUST NOT generate idempotency keys — this defeats the purpose.
- Deterministically derivable: For a given (task, step, invocation intent) tuple, the same key is produced. This ensures that retries after agent process restart use the same key.
- Scoped: Keys are scoped to (toolId, callerSessionId) to prevent cross-session collisions.
Key derivation function:

key = SHA-256(sessionId ‖ toolId ‖ stepId ‖ canonical(params))

where canonical(params) is a deterministic JSON serialization of the input parameters (keys sorted, whitespace normalized).
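The derivation above can be sketched in a few lines of Python; the field-separator byte and parameter names are illustrative choices, not prescribed by the text:

```python
import hashlib
import json

def canonical_json(params: dict) -> str:
    # Deterministic serialization: keys sorted, no insignificant whitespace.
    return json.dumps(params, sort_keys=True, separators=(",", ":"))

def derive_idempotency_key(session_id: str, tool_id: str,
                           step_id: str, params: dict) -> str:
    # Same (session, tool, step, params) tuple always yields the same key,
    # so a retry after a process restart reuses the original key.
    material = "\x1f".join([session_id, tool_id, step_id, canonical_json(params)])
    return hashlib.sha256(material.encode("utf-8")).hexdigest()
```

Because the serialization is canonical, semantically identical parameter dicts with different key orderings derive the same key.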
13.8.4 Deduplication Window#
Idempotency records are stored with a bounded TTL (the deduplication window) T_dedup ≥ T_min, where T_min is a configurable minimum (typically 1 hour). Records older than T_dedup are evicted — subsequent invocations with the same key are treated as new requests.
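A minimal in-memory sketch of such a store, with an injectable clock so the TTL eviction is testable (the class and method names are assumptions for illustration):

```python
import time

class IdempotencyStore:
    """Deduplication cache with a bounded TTL window (in-memory sketch)."""

    def __init__(self, ttl_seconds: float = 3600.0, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._records = {}  # key -> (stored_at, cached_result)

    def get(self, key: str):
        entry = self._records.get(key)
        if entry is None:
            return None
        stored_at, result = entry
        if self._clock() - stored_at > self._ttl:
            del self._records[key]  # expired: treat as a new request
            return None
        return result

    def put(self, key: str, result) -> None:
        self._records[key] = (self._clock(), result)
```

A production store would be durable and shared across tool-server replicas; the eviction semantics are the same.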
13.8.5 At-Least-Once Semantics with Idempotent Receivers#
The tool invocation protocol guarantees at-least-once delivery: the orchestrator will retry until it receives a definitive response (success or non-retryable error). Combined with idempotent tool implementations, this yields effectively-once execution semantics:
Pseudo-Algorithm 13.8: Idempotent Retry Logic in Agent Orchestrator
PROCEDURE InvokeWithIdempotentRetry(tool, params, retryPolicy):
idempotencyKey ← DeriveIdempotencyKey(currentSession, tool.id, currentStep, params)
attempt ← 0
lastError ← NULL
WHILE attempt < retryPolicy.maxAttempts:
attempt ← attempt + 1
request ← BuildToolRequest(
toolId=tool.id,
version=tool.version,
params=params,
idempotencyKey=idempotencyKey,
callerToken=currentSession.callerToken,
traceId=currentTrace.id,
deadline=ComputeDeadline(tool.timeoutClass, retryPolicy, attempt)
)
response ← ToolServer.Invoke(request)
IF response.status = SUCCESS THEN
RETURN response
IF response.status = ERROR THEN
IF NOT response.retryable THEN
RETURN response // Terminal failure — do not retry
lastError ← response
backoff ← ComputeBackoff(retryPolicy.baseMs, attempt, retryPolicy.jitterFactor)
Sleep(backoff)
// Exhausted retry budget
EmitAlert(RETRY_BUDGET_EXHAUSTED, tool.id, attempt, lastError)
RETURN ErrorEnvelope(code=RETRY_EXHAUSTED, lastError=lastError)
FUNCTION ComputeBackoff(baseMs, attempt, jitterFactor):
exponentialMs ← baseMs * 2^(attempt - 1)
cappedMs ← MIN(exponentialMs, MAX_BACKOFF_MS)
jitter ← RANDOM_UNIFORM(-jitterFactor * cappedMs, +jitterFactor * cappedMs)
RETURN MAX(0, cappedMs + jitter)
13.9 Read vs. Write Tool Classification: Mutation Detection, Side-Effect Auditing#
13.9.1 Classification Taxonomy#
Every tool is classified at registration time into one of three mutation classes:
| Class | Symbol | Characteristics | Governance |
|---|---|---|---|
| Read-Only | R | No state changes; referentially transparent within data freshness window | No approval gates; freely cacheable; unlimited retries |
| Write (Reversible) | W_rev | State-changing with compensating action available | Approval gates configurable; idempotency required; audit logged |
| Write (Irreversible) | W_irr | State-changing with no compensating action (e.g., email sent, funds transferred) | Mandatory approval gate; dry-run mode required; enhanced audit |
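The classification and its governance consequences can be encoded directly in the registry; the following sketch is illustrative (the flag names are assumptions, not part of the chapter's schema):

```python
from enum import Enum

class MutationClass(Enum):
    READ_ONLY = "read_only"
    WRITE_REVERSIBLE = "write_reversible"
    WRITE_IRREVERSIBLE = "write_irreversible"

# Governance consequences per class, mirroring the table above.
GOVERNANCE = {
    MutationClass.READ_ONLY: {
        "approval_gate": False, "cacheable": True,
        "idempotency_required": False, "dry_run_required": False,
    },
    MutationClass.WRITE_REVERSIBLE: {
        "approval_gate": "configurable", "cacheable": False,
        "idempotency_required": True, "dry_run_required": False,
    },
    MutationClass.WRITE_IRREVERSIBLE: {
        "approval_gate": True, "cacheable": False,
        "idempotency_required": True, "dry_run_required": True,
    },
}

def governance_for(mutation_class: MutationClass) -> dict:
    return GOVERNANCE[mutation_class]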
13.9.2 Mutation Detection#
Mutation class is declared in the tool schema but verified at runtime through side-effect auditing:
Pseudo-Algorithm 13.9: Runtime Mutation Detection
PROCEDURE AuditSideEffects(tool, request, preState, postState, declaredClass):
detectedMutations ← DiffState(preState, postState)
IF declaredClass = READ_ONLY AND detectedMutations ≠ EMPTY THEN
EmitAlert(MUTATION_CLASS_VIOLATION, tool.id, "Declared READ_ONLY but mutations detected")
QuarantineTool(tool.id) // Immediate quarantine until human review
RETURN VIOLATION
IF declaredClass ∈ {WRITE_REVERSIBLE, WRITE_IRREVERSIBLE} AND detectedMutations = EMPTY THEN
EmitWarning(WRITE_TOOL_NO_MUTATION, tool.id, "Declared WRITE but no mutations observed")
// Non-critical — may indicate a no-op invocation
// Log all detected mutations for audit trail
FOR EACH mutation IN detectedMutations:
AuditLog.Record(
toolId=tool.id,
callerToken=request.callerToken,
timestamp=NOW(),
mutationType=mutation.type,
affectedResource=mutation.resource,
previousValue=mutation.before,
newValue=mutation.after,
idempotencyKey=request.idempotencyKey
)
RETURN CONSISTENT
13.9.3 Side-Effect Manifests#
Write tools MUST declare a side-effect manifest at registration time — a structured enumeration of all state changes the tool may produce:
The side-effect manifest enables:
- Static analysis: The orchestrator can predict which resources will be affected before invocation.
- Conflict detection: Parallel agent executions operating on overlapping resource scopes are detected and serialized.
- Compensating action derivation: The compensation engine uses the manifest to generate inverse operations.
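A side-effect manifest and the conflict-detection use case can be sketched as follows; the glob-prefix overlap test is a deliberately simplistic assumption standing in for a real resource-scope matcher:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SideEffect:
    resource_type: str   # e.g. "file", "db_row" (illustrative values)
    scope_pattern: str   # resource scope the tool may touch, e.g. "repo/src/*"
    reversible: bool

@dataclass
class SideEffectManifest:
    tool_id: str
    effects: list

def scopes_overlap(a: str, b: str) -> bool:
    # Simplistic prefix-based overlap check on glob-like scopes (assumption).
    a_prefix, b_prefix = a.rstrip("*"), b.rstrip("*")
    return a_prefix.startswith(b_prefix) or b_prefix.startswith(a_prefix)

def conflicts(m1: SideEffectManifest, m2: SideEffectManifest) -> bool:
    # Two parallel executions conflict if any declared effects touch
    # overlapping resource scopes of the same resource type.
    return any(
        e1.resource_type == e2.resource_type
        and scopes_overlap(e1.scope_pattern, e2.scope_pattern)
        for e1 in m1.effects
        for e2 in m2.effects
    )
```

When `conflicts` returns true, the orchestrator serializes the two executions instead of running them in parallel.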
13.10 Human-in-the-Loop Tool Governance#
13.10.1 Approval Gates for State-Changing Operations#
State-changing tool invocations in production agentic systems MUST pass through configurable approval gates. The approval gate policy is a function of tool mutation class, operation scope, caller trust level, and environmental risk:
where τ_min is the minimum trust level for auto-approval and σ_max is the scope threshold above which all mutations require approval.
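A minimal sketch of such a policy function, assuming numeric trust scores and scope sizes with hypothetical thresholds (the specific values are assumptions):

```python
def evaluate_approval_policy(mutation_class: str, caller_trust: float,
                             scope_size: int,
                             trust_min: float = 0.8,
                             scope_max: int = 100) -> str:
    """Return 'auto' or 'human_review' for a proposed invocation (sketch)."""
    if mutation_class == "read_only":
        return "auto"                  # read-only tools bypass approval gates
    if mutation_class == "write_irreversible":
        return "human_review"          # irreversible writes are always gated
    if scope_size > scope_max:
        return "human_review"          # wide blast radius is always gated
    if caller_trust >= trust_min:
        return "auto"
    return "human_review"
```

The ordering matters: irreversibility and blast radius override caller trust, so a highly trusted caller still cannot auto-approve an irreversible or wide-scope mutation.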
Pseudo-Algorithm 13.10: Approval Gate Execution
PROCEDURE ExecuteApprovalGate(request, tool, approvalPolicy):
decision ← EvaluateApprovalPolicy(tool, request.operation, request.callerToken, CURRENT_ENV)
IF decision = NOT_REQUIRED THEN
RETURN ApprovalResult(approved=TRUE, mechanism="auto")
// Create approval ticket
ticket ← ApprovalTicket(
id=GenerateTicketId(),
toolId=tool.id,
operation=request.operation,
params=RedactSensitiveFields(request.params, tool.inputSchema),
requestedBy=request.callerToken.identity,
requestedAt=NOW(),
expiresAt=NOW() + approvalPolicy.timeoutDuration,
status=PENDING,
escalationChain=approvalPolicy.escalationChain
)
// Notify approvers
NotifyApprovers(ticket, approvalPolicy.approverGroups)
// Store ticket and wait
ApprovalStore.Save(ticket)
RETURN ApprovalResult(
approved=FALSE,
mechanism="human_review",
ticketId=ticket.id,
pollEndpoint=BuildPollEndpoint(ticket.id),
expiresAt=ticket.expiresAt
)
13.10.2 Dry-Run / Preview Modes for Destructive Actions#
For tools classified as W_irr (irreversible write), a dry-run mode MUST be supported. In dry-run mode, the tool executes all validation, authorization, and planning logic but does not commit the mutation. Instead, it returns a preview envelope describing:
- The exact state changes that would occur
- The resources that would be affected
- Any preconditions that would fail
- The estimated cost or impact of the operation
The agent or human reviewer examines the preview envelope and explicitly authorizes the live execution, or the preview is used as input to the approval gate.
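The dry-run pattern can be sketched with a hypothetical irreversible tool; `delete_records` and its parameters are illustrative assumptions, not an API from the chapter:

```python
from dataclasses import dataclass

@dataclass
class PreviewEnvelope:
    state_changes: list
    affected_resources: list
    failed_preconditions: list
    estimated_impact: str

def delete_records(ids: list, *, dry_run: bool, store: dict) -> PreviewEnvelope:
    """Hypothetical irreversible tool supporting a dry-run mode."""
    missing = [i for i in ids if i not in store]
    preview = PreviewEnvelope(
        state_changes=[f"delete {i}" for i in ids if i in store],
        affected_resources=[i for i in ids if i in store],
        failed_preconditions=[f"{i} not found" for i in missing],
        estimated_impact=f"{len(ids) - len(missing)} records permanently removed",
    )
    if dry_run or missing:
        return preview       # validation and planning only; nothing committed
    for i in ids:
        del store[i]         # live execution after explicit authorization
    return preview
```

Both paths return the same envelope shape, so a reviewer compares the dry-run preview against the eventual live result directly.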
13.10.3 Approval Escalation Policies and Timeout-Based Auto-Deny#
Approval requests that are not acted upon within the configured timeout are auto-denied — never auto-approved. This is a fundamental safety invariant:

status(ticket) = AUTO_DENIED whenever NOW() > ticket.expiresAt and no explicit human decision was recorded.
Escalation chains define a sequence of approver groups with increasing authority. If the primary approver group does not respond within a fraction of the timeout (e.g., 50%), the request escalates to the next group:
Pseudo-Algorithm 13.11: Escalation and Timeout Logic
PROCEDURE MonitorApprovalTicket(ticket, escalationPolicy):
escalationIndex ← 0
WHILE NOW() < ticket.expiresAt:
// Check for resolution
currentStatus ← ApprovalStore.GetStatus(ticket.id)
IF currentStatus ∈ {APPROVED, DENIED} THEN
RETURN currentStatus
// Check for escalation
escalationThreshold ← ticket.createdAt + (escalationPolicy.escalationFraction * (ticket.expiresAt - ticket.createdAt)) * (escalationIndex + 1) / LENGTH(escalationPolicy.chain)
IF NOW() > escalationThreshold AND escalationIndex < LENGTH(escalationPolicy.chain) - 1 THEN
escalationIndex ← escalationIndex + 1
NotifyApprovers(ticket, escalationPolicy.chain[escalationIndex])
EmitEvent(APPROVAL_ESCALATED, ticket.id, escalationIndex)
Sleep(escalationPolicy.pollIntervalMs)
// Timeout reached — auto-deny
ApprovalStore.UpdateStatus(ticket.id, AUTO_DENIED)
EmitEvent(APPROVAL_AUTO_DENIED, ticket.id)
RETURN AUTO_DENIED
13.11 Caller-Scoped Authorization: Credential Propagation, Least Privilege, and Audit Trails#
13.11.1 Principle of Least Privilege for Agent Tool Access#
Agents MUST NOT operate with broad, ambient credentials. Instead, every tool invocation is authorized against the caller's identity and scopes — the human user or system principal that initiated the agentic task. This is the fundamental difference between "agent-owned credentials" (dangerous) and "caller-scoped authorization" (safe).
The effective authorization is the intersection of the caller's granted scopes and the tool's required scopes. If the intersection does not cover all required scopes, the invocation is denied.
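The intersection rule can be expressed directly with set operations; the return shape is an illustrative assumption:

```python
def authorize(caller_scopes: set, required_scopes: set) -> tuple:
    """Effective authorization = caller scopes ∩ tool's required scopes.
    Deny whenever the intersection does not cover every required scope."""
    effective = caller_scopes & required_scopes
    if effective != required_scopes:
        return ("DENY", required_scopes - caller_scopes)  # the missing scopes
    return ("ALLOW", effective)
```

Returning the missing scopes on denial gives the agent an actionable error instead of an opaque rejection.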
13.11.2 Credential Propagation Architecture#
Three-Tier Credential Model:
- User Credential (C_user): The original credential (OAuth token, API key, session token) of the human user who initiated the task. This credential is the root of trust.
- Agent Credential (C_agent): A derived, time-limited, scope-restricted credential issued to the agent for the duration of the task. Derived via delegation: C_agent = delegate(C_user, taskScopes, taskTTL), with scopes(C_agent) ⊆ scopes(C_user).
- Tool Invocation Credential (C_tool): A further-restricted credential for a specific tool invocation, derived from C_agent: C_tool = restrict(C_agent, tool.requiredScopes), with scopes(C_tool) ⊆ scopes(C_agent).

Invariants:

scopes(C_tool) ⊆ scopes(C_agent) ⊆ scopes(C_user), and each credential in the chain expires no later than its parent.
13.11.3 Audit Trail Requirements#
Every tool invocation generates an immutable audit record:
| Field | Content |
|---|---|
timestamp | ISO-8601 UTC timestamp of invocation |
traceId | Distributed trace correlation ID |
callerId | Identity of the originating caller (from C_user) |
agentId | Identity of the executing agent (from C_agent) |
toolId | Tool identifier and version |
operation | Specific operation within the tool |
inputHash | SHA-256 hash of canonical input (not raw input, for PII protection) |
outputSummary | Structured summary of output (not full output, for cost/privacy) |
mutationClass | Read / Write-Reversible / Write-Irreversible |
sideEffects | List of detected state changes (from side-effect auditing) |
authDecision | ALLOW / DENY / REQUIRES_APPROVAL |
approvalTicketId | If approval was required, the ticket ID |
latencyMs | Execution duration |
status | SUCCESS / ERROR / TIMEOUT |
errorCode | If failed, the classified error code |
Audit records are written to an append-only, tamper-evident log (write-once storage or cryptographically chained log). Retention policies are governed by organizational compliance requirements (typically 1–7 years).
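One standard construction of a tamper-evident log is a hash chain, where each record embeds the hash of its predecessor; the following is a minimal in-memory sketch of that technique, not a production storage design:

```python
import hashlib
import json

class ChainedAuditLog:
    """Append-only log; each record's hash covers the previous record's hash,
    so any after-the-fact modification breaks verification downstream."""

    GENESIS = "0" * 64

    def __init__(self):
        self._records = []

    def append(self, record: dict) -> None:
        prev_hash = self._records[-1]["hash"] if self._records else self.GENESIS
        body = json.dumps(record, sort_keys=True)
        digest = hashlib.sha256((prev_hash + body).encode()).hexdigest()
        self._records.append({"body": body, "prev": prev_hash, "hash": digest})

    def verify(self) -> bool:
        prev = self.GENESIS
        for rec in self._records:
            if rec["prev"] != prev:
                return False
            if hashlib.sha256((prev + rec["body"]).encode()).hexdigest() != rec["hash"]:
                return False
            prev = rec["hash"]
        return True
```

Editing any historical record invalidates every subsequent hash, which is what makes tampering evident rather than merely forbidden.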
13.12 Tool Versioning and Backward Compatibility: Schema Evolution, Deprecation Notices#
13.12.1 Semantic Versioning for Tool Contracts#
Tools follow strict semantic versioning:
| Version Component | Change Semantics |
|---|---|
| MAJOR | Breaking changes to input schema, output schema, or behavioral semantics. Requires agent-side adaptation. |
| MINOR | Backward-compatible additions (new optional input fields, new output fields, new capabilities). Existing agents function without modification. |
| PATCH | Bug fixes, performance improvements, documentation updates. No schema changes. |
13.12.2 Schema Evolution Rules#
Backward-compatible (MINOR) changes:
- Adding optional input fields with defaults
- Adding new fields to the output schema
- Relaxing input constraints (e.g., increasing maxLength)
- Adding new enum values (if the consumer handles unknown values gracefully)
Breaking (MAJOR) changes:
- Adding required input fields
- Removing or renaming existing fields (input or output)
- Changing field types
- Tightening input constraints
- Changing behavioral semantics (same input produces different output)
Formal Compatibility Check: a new version v' is backward compatible with v iff every input valid under v's input schema remains valid under v''s input schema, and every output v' can produce validates against v's output schema:

valid_in(v) ⊆ valid_in(v') ∧ outputs(v') ⊆ valid_out(v)
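A mechanical check over a deliberately simplified schema shape — `{'required': set, 'properties': {name: type}}`, an assumption standing in for full JSON Schema — can enforce the MINOR rules above in CI:

```python
def is_backward_compatible(old: dict, new: dict) -> bool:
    """MINOR-compatibility check over simplified input schemas (sketch).
    Schema shape assumed: {'required': set, 'properties': {name: type}}."""
    # Rule 1: the required set must not grow (no new required input fields).
    if not set(new["required"]) <= set(old["required"]):
        return False
    # Rule 2: every existing field must survive with an unchanged type
    # (no removals, renames, or type changes).
    for name, typ in old["properties"].items():
        if new["properties"].get(name) != typ:
            return False
    return True
```

New optional fields pass (rule 2 only inspects old fields), while renames, removals, type changes, and new required fields all fail.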
13.12.3 Deprecation Protocol#
Pseudo-Algorithm 13.12: Tool Deprecation Lifecycle
PROCEDURE DeprecateTool(toolId, oldVersion, newVersion, deprecationPolicy):
// Phase 1: Announce deprecation
toolRegistry.SetDeprecationNotice(toolId, oldVersion, DeprecationNotice(
deprecatedAt=NOW(),
removalDate=NOW() + deprecationPolicy.gracePeriod,
successor=ToolRef(toolId, newVersion),
migrationGuide=deprecationPolicy.migrationGuide
))
// Phase 2: Emit deprecation warnings on invocation
// (Handled by tool invocation lifecycle — deprecated tools return
// a `deprecation` field in the response metadata)
// Phase 3: Monitor migration progress
SCHEDULE PeriodicTask(interval=deprecationPolicy.reportInterval):
usageStats ← MetricsStore.GetToolUsage(toolId, oldVersion, window=7d)
IF usageStats.invocationCount = 0 THEN
EmitEvent(DEPRECATION_MIGRATION_COMPLETE, toolId, oldVersion)
ELSE
EmitReport(DEPRECATION_MIGRATION_PROGRESS, toolId, oldVersion, usageStats)
// Phase 4: Remove after grace period
SCHEDULE Task(at=NOW() + deprecationPolicy.gracePeriod):
finalUsage ← MetricsStore.GetToolUsage(toolId, oldVersion, window=24h)
IF finalUsage.invocationCount = 0 THEN
toolRegistry.Deregister(toolId, oldVersion)
EmitEvent(TOOL_VERSION_REMOVED, toolId, oldVersion)
ELSE
EmitAlert(DEPRECATION_REMOVAL_BLOCKED, toolId, oldVersion, finalUsage)
// Extend grace period or force-remove based on policy
13.12.4 Multi-Version Coexistence#
The prefill compiler and agent planner are version-aware. When multiple versions of a tool are registered:
- Prefer latest non-deprecated version by default.
- Pin to specific version when the task requires behavioral stability (e.g., reproducing a prior result).
- Reject deprecated versions when removalDate has passed.
13.13 Tool Observability: Invocation Traces, Success/Failure Rates, Latency Distributions, Cost Attribution#
13.13.1 Observability Architecture#
Tool observability is structured along the three pillars of production telemetry: traces, metrics, and logs. Each operates at a distinct granularity and serves a distinct consumer.
13.13.2 Distributed Traces#
Every tool invocation is a span within the broader agent execution trace. The trace structure is:
AgentTask (root span)
├── Plan (span)
├── Retrieve (span)
├── ToolInvocation: search_codebase (span)
│ ├── Validate (span)
│ ├── Authorize (span)
│ ├── Execute (span)
│ │ └── ExternalAPI: github_search (span)
│ ├── Verify (span)
│ └── Return (span)
├── ToolInvocation: read_file (span)
│ └── ...
├── Synthesize (span)
└── Verify (span)
Trace context (W3C traceparent header or equivalent) is propagated across all protocol boundaries (MCP, JSON-RPC, gRPC).
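Propagation of the W3C `traceparent` header — version, 16-byte trace id, 8-byte span id, flags — can be sketched as follows; the helper names are illustrative:

```python
import secrets

def make_traceparent(trace_id=None):
    """Build a W3C traceparent value: '00-<traceid>-<spanid>-01'."""
    trace_id = trace_id or secrets.token_hex(16)  # 16 bytes, hex-encoded
    span_id = secrets.token_hex(8)                # fresh span for this hop
    return f"00-{trace_id}-{span_id}-01", trace_id, span_id

def propagate(headers: dict, incoming_traceparent=None):
    """Reuse the incoming trace id (field 2 of the header) so tool-server
    spans join the agent's trace; mint a new span id for this hop."""
    trace_id = incoming_traceparent.split("-")[1] if incoming_traceparent else None
    value, trace_id, span_id = make_traceparent(trace_id)
    out = dict(headers)
    out["traceparent"] = value
    return out, trace_id, span_id
```

The invariant worth noting: the trace id is stable end to end across every protocol boundary, while each hop contributes its own span id.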
13.13.3 Metrics#
The following metrics are emitted at every tool invocation boundary:
| Metric | Type | Labels | Purpose |
|---|---|---|---|
tool.invocation.count | Counter | toolId, version, status, mutationClass | Invocation volume and success rate |
tool.invocation.latency_ms | Histogram | toolId, version, timeoutClass | Latency distribution (p50, p95, p99) |
tool.invocation.cost | Counter | toolId, version, costCategory | Cost attribution (compute, API calls, tokens) |
tool.invocation.retries | Counter | toolId, version | Retry frequency (indicates reliability issues) |
tool.invocation.timeout_rate | Gauge | toolId, version, timeoutClass | Fraction of invocations exceeding deadline |
tool.idempotency.replay_rate | Gauge | toolId | Fraction of invocations resolved via idempotency cache |
tool.approval.pending_count | Gauge | toolId, approverGroup | Pending approval queue depth |
tool.approval.resolution_time_ms | Histogram | toolId, resolution | Time to approval/denial |
tool.schema.validation_failure_rate | Gauge | toolId, phase (input/output) | Schema compliance rate |
13.13.4 Alerting Rules#
Derived alerting conditions from the metrics above include: sustained error rate above the tool's configured threshold (ToolDegraded), p99 latency exceeding twice the tool's baseline (LatencyRegression), and timeout rate above the tolerable threshold for the tool's timeout class (TimeoutClassMismatch). These conditions are evaluated continuously by the health monitor in §13.13.6.
13.13.5 Cost Attribution Model#
Tool cost attribution enables the agentic platform to account for resource consumption at the task level. The total cost of a tool invocation is the sum of its metered components:

C_invocation = C_compute + C_api + C_tokens

where each component is reported by the tool server or derived from infrastructure metering. Cost is aggregated upward from invocation to step to task:

C_task = Σ over all invocations in the task of C_invocation

This enables per-task cost budgets, anomaly detection on runaway agent loops, and cost-aware tool selection in the planner.
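The upward aggregation reduces to a grouped sum; the record shape below is an assumption for illustration:

```python
from collections import defaultdict

def aggregate_costs(invocations: list) -> dict:
    """Roll per-invocation cost components (compute, api, tokens, ...)
    up to per-task totals. Each record: {'task_id': str, 'costs': dict}."""
    by_task = defaultdict(float)
    for inv in invocations:
        by_task[inv["task_id"]] += sum(inv["costs"].values())
    return dict(by_task)
```

The same grouping keyed on (task_id, tool_id) instead of task_id alone yields the per-tool attribution the planner needs for cost-aware selection.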
13.13.6 Observability-Driven Feedback Loops#
Pseudo-Algorithm 13.13: Observability-Driven Tool Health Monitor
PROCEDURE MonitorToolHealth(toolRegistry, metricsStore, alertManager):
FOR EACH tool IN toolRegistry.GetAllActive():
window ← LAST_15_MINUTES
errorRate ← metricsStore.Query(
"rate(tool.invocation.count{toolId=$tool.id, status='error'}[$window])"
) / metricsStore.Query(
"rate(tool.invocation.count{toolId=$tool.id}[$window])"
)
p99Latency ← metricsStore.Query(
"histogram_quantile(0.99, tool.invocation.latency_ms{toolId=$tool.id}[$window])"
)
timeoutRate ← metricsStore.Query(
"tool.invocation.timeout_rate{toolId=$tool.id}[$window]"
)
// Update tool health status in registry
healthScore ← ComputeHealthScore(errorRate, p99Latency, timeoutRate, tool.timeoutClass)
toolRegistry.UpdateHealth(tool.id, healthScore)
// Trigger alerts
IF errorRate > tool.alertThresholds.errorRate THEN
alertManager.Fire(ToolDegraded(tool.id, errorRate))
IF p99Latency > 2 * tool.baselineLatency.p99 THEN
alertManager.Fire(LatencyRegression(tool.id, p99Latency))
IF timeoutRate > tool.alertThresholds.timeoutRate THEN
alertManager.Fire(TimeoutClassMismatch(tool.id, timeoutRate))
// Suggest reclassification
suggestedClass ← InferTimeoutClass(p99Latency)
IF suggestedClass ≠ tool.timeoutClass THEN
EmitRecommendation(RECLASSIFY_TIMEOUT, tool.id, tool.timeoutClass, suggestedClass)
FUNCTION ComputeHealthScore(errorRate, p99Latency, timeoutRate, timeoutClass):
// Normalized health score ∈ [0, 1], where 1 = perfect health
errorPenalty ← CLAMP(errorRate / MAX_TOLERABLE_ERROR_RATE, 0, 1)
latencyPenalty ← CLAMP(p99Latency / timeoutClass.maxDuration, 0, 1)
timeoutPenalty ← CLAMP(timeoutRate / MAX_TOLERABLE_TIMEOUT_RATE, 0, 1)
RETURN 1.0 - (0.5 * errorPenalty + 0.3 * latencyPenalty + 0.2 * timeoutPenalty)
13.14 Tool Testing: Unit Tests, Integration Tests, Chaos Tests, and Behavioral Contract Verification#
13.14.1 Testing Pyramid for Tool Infrastructure#
Tool testing follows a layered pyramid with increasing scope and decreasing execution frequency:
          ╱   Chaos Tests    ╲        ← Rare (weekly/pre-release)
         ╱────────────────────╲
        ╱    Contract Tests    ╲      ← Per-PR / Per-deploy
       ╱────────────────────────╲
      ╱    Integration Tests     ╲    ← Per-commit
     ╱────────────────────────────╲
    ╱         Unit Tests           ╲  ← Continuous (every build)
   ╱────────────────────────────────╲
13.14.2 Unit Tests#
Unit tests validate individual tool handler logic in isolation, with all external dependencies mocked or stubbed.
Coverage Requirements:
- Input validation: Every schema constraint is tested with valid, boundary, and invalid inputs.
- Core computation: The tool's primary logic is tested against known input-output pairs.
- Error handling: Every classified error code in the tool's declared error contract is produced by at least one test case.
- Idempotency: For stateful tools, duplicate invocations with the same idempotency key return identical results.
Pseudo-Algorithm 13.14: Unit Test Generation from Tool Schema
PROCEDURE GenerateUnitTests(tool):
testCases ← []
// Valid input tests
FOR EACH exampleInput IN tool.inputSchema.examples:
testCases.Add(TestCase(
name="valid_input_" + Hash(exampleInput),
input=exampleInput,
expectedStatus=SUCCESS,
validateOutput=LAMBDA(out): ValidateAgainstSchema(out, tool.outputSchema)
))
// Boundary tests
FOR EACH property IN tool.inputSchema.properties:
IF property.type = "string" AND property.maxLength IS DEFINED THEN
testCases.Add(TestCase(
name="boundary_maxlength_" + property.name,
input=GenerateInputWithPropertyValue(property.name, RandomString(property.maxLength)),
expectedStatus=SUCCESS
))
testCases.Add(TestCase(
name="exceed_maxlength_" + property.name,
input=GenerateInputWithPropertyValue(property.name, RandomString(property.maxLength + 1)),
expectedStatus=ERROR,
expectedErrorCode=INVALID_INPUT
))
IF property.type = "integer" AND property.minimum IS DEFINED THEN
testCases.Add(TestCase(
name="boundary_minimum_" + property.name,
input=GenerateInputWithPropertyValue(property.name, property.minimum),
expectedStatus=SUCCESS
))
testCases.Add(TestCase(
name="below_minimum_" + property.name,
input=GenerateInputWithPropertyValue(property.name, property.minimum - 1),
expectedStatus=ERROR,
expectedErrorCode=INVALID_INPUT
))
// Required field tests
FOR EACH requiredField IN tool.inputSchema.required:
testCases.Add(TestCase(
name="missing_required_" + requiredField,
input=GenerateInputWithoutField(requiredField),
expectedStatus=ERROR,
expectedErrorCode=INVALID_INPUT
))
// Additional properties test
testCases.Add(TestCase(
name="reject_additional_properties",
input=GenerateValidInput() ∪ {"unknownField": "hallucinated_value"},
expectedStatus=ERROR,
expectedErrorCode=INVALID_INPUT
))
// Idempotency test (for stateful tools)
IF tool.mutationClass ≠ READ_ONLY THEN
validInput ← GenerateValidInput()
idempotencyKey ← GenerateIdempotencyKey()
testCases.Add(TestCase(
name="idempotency_duplicate",
steps=[
Invoke(tool, validInput, idempotencyKey) → result1,
Invoke(tool, validInput, idempotencyKey) → result2,
Assert(result1.data = result2.data),
Assert(StateMutationCount() = 1) // Only one actual mutation
]
))
RETURN testCases
13.14.3 Integration Tests#
Integration tests validate end-to-end tool invocation through the full lifecycle (§13.6) against real or realistic dependencies.
Scope:
- Protocol integration: Verify that MCP discovery, JSON-RPC invocation, and gRPC execution paths produce correct results end to end.
- Authorization integration: Verify that caller-scoped credentials are correctly propagated and evaluated.
- Transaction integration: For stateful tools, verify that commit/rollback semantics work correctly under concurrent access.
- Timeout integration: Verify that timeout class enforcement correctly terminates operations and returns appropriate error envelopes.
- Observability integration: Verify that traces, metrics, and audit logs are correctly emitted for each invocation.
Pseudo-Algorithm 13.15: Integration Test for Full Invocation Lifecycle
PROCEDURE IntegrationTest_FullLifecycle(tool, testEnvironment):
// Setup: Provision test credentials with minimal scopes
testCredential ← testEnvironment.ProvisionCredential(scopes=tool.authContract.requiredScopes)
// Phase 1: Discovery
discoveredTools ← DiscoverTools(testEnvironment.mcpEndpoint, DEFAULT_CAPABILITIES)
Assert(tool.id IN discoveredTools.Map(t → t.id))
// Phase 2: Valid invocation
request ← BuildToolRequest(tool.id, tool.version, GenerateValidInput(),
GenerateIdempotencyKey(), testCredential)
response ← InvokeTool(request)
Assert(response.status = SUCCESS)
Assert(ValidateAgainstSchema(response.data, tool.outputSchema))
// Phase 3: Verify trace was emitted
trace ← testEnvironment.traceCollector.GetTrace(request.traceId)
Assert(trace.spans.Any(s → s.name = "tool.invoke" AND s.toolId = tool.id))
// Phase 4: Verify metrics were emitted
metrics ← testEnvironment.metricsCollector.GetRecent(toolId=tool.id)
Assert(metrics.invocationCount ≥ 1)
Assert(metrics.latencyMs > 0)
// Phase 5: Verify audit log was written
auditRecord ← testEnvironment.auditLog.GetLatest(toolId=tool.id, traceId=request.traceId)
Assert(auditRecord ≠ NULL)
Assert(auditRecord.callerId = testCredential.identity)
Assert(auditRecord.status = SUCCESS)
// Phase 6: Insufficient authorization
limitedCredential ← testEnvironment.ProvisionCredential(scopes=EMPTY_SET)
unauthorizedRequest ← BuildToolRequest(tool.id, tool.version, GenerateValidInput(),
GenerateIdempotencyKey(), limitedCredential)
unauthorizedResponse ← InvokeTool(unauthorizedRequest)
Assert(unauthorizedResponse.status = ERROR)
Assert(unauthorizedResponse.errorCode = AUTHORIZATION_DENIED)
// Phase 7: Timeout enforcement
IF tool.timeoutClass = INTERACTIVE THEN
slowRequest ← BuildToolRequest(tool.id, tool.version,
GenerateSlowInput(targetLatency=2*tool.timeoutClass.maxDuration),
GenerateIdempotencyKey(), testCredential)
slowResponse ← InvokeTool(slowRequest)
Assert(slowResponse.errorCode = DEADLINE_EXCEEDED)
13.14.4 Chaos Tests#
Chaos tests validate tool resilience under adverse conditions. These are executed in isolated environments and simulate:
| Chaos Scenario | Injection Method | Expected Behavior |
|---|---|---|
| Network partition | Drop/delay packets between agent and tool server | Retry with backoff; return UPSTREAM_FAILURE after budget exhaustion |
| Dependency failure | Kill upstream service the tool depends on | Return UPSTREAM_FAILURE with retryable flag; no partial mutations |
| Resource exhaustion | Limit CPU/memory available to tool server | Return RESOURCE_EXHAUSTED; no crash, no data corruption |
| Clock skew | Shift system clock on tool server | Idempotency windows still function correctly; no duplicate mutations |
| Slow responses | Inject latency into tool execution | Timeout class enforcement triggers; agent does not hang indefinitely |
| Concurrent mutations | Submit conflicting mutations simultaneously | Optimistic concurrency control detects conflict; return CONFLICT |
| Partial failure in composite | Fail one step of a composite tool | Compensating actions execute for completed steps; aggregate failure returned |
Pseudo-Algorithm 13.16: Chaos Test — Concurrent Mutation Conflict
PROCEDURE ChaosTest_ConcurrentMutationConflict(tool, testEnvironment):
// Setup: Create a shared resource
resource ← testEnvironment.CreateTestResource()
// Attempt two concurrent mutations on the same resource
key1 ← GenerateIdempotencyKey()
key2 ← GenerateIdempotencyKey()
request1 ← BuildToolRequest(tool.id, tool.version,
{resourceId: resource.id, value: "A"}, key1, testCredential)
request2 ← BuildToolRequest(tool.id, tool.version,
{resourceId: resource.id, value: "B"}, key2, testCredential)
// Execute concurrently
(response1, response2) ← ExecuteConcurrently(InvokeTool(request1), InvokeTool(request2))
// Exactly one should succeed, the other should receive CONFLICT or succeed after retry
successCount ← COUNT(r IN [response1, response2] WHERE r.status = SUCCESS)
conflictCount ← COUNT(r IN [response1, response2] WHERE r.errorCode = CONFLICT)
Assert(successCount ≥ 1)
Assert(successCount + conflictCount = 2)
// Verify final state is consistent (reflects one of the two values, not corrupted)
finalState ← testEnvironment.ReadResource(resource.id)
Assert(finalState.value ∈ {"A", "B"})
13.14.5 Behavioral Contract Verification#
Beyond schema compliance, tools must satisfy behavioral contracts — invariants about the relationship between inputs, outputs, and state changes that hold across all valid invocations. Behavioral contracts are formalized as properties and verified through property-based testing (generative testing).
Contract Categories:
- Determinism contract (for stateless tools): invoking the tool with identical input x at any two times within the data freshness window returns identical output.
- Idempotency contract (for stateful tools): invoke^n(s, p, k) = invoke(s, p, k) for any repetition count n ≥ 1 and idempotency key k.
- Monotonicity contract (for append-only tools): the post-invocation state contains the pre-invocation state: S_pre ⊆ S_post.
- Conservation contract (for transfer operations): the conserved quantity is unchanged in aggregate: balance(src) + balance(dst) is identical before and after the transfer.
- Reversibility contract (for reversible write tools): executing the derived compensating action restores the pre-invocation state: compensate(invoke(s, p)) = s.
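A minimal executable sketch of one such property check (idempotency) follows; the `tool(state, params, key)` call shape and generator protocol are assumptions for illustration, standing in for a full property-based framework:

```python
import random

def check_idempotency_contract(tool, gen_input, iterations=100, seed=0):
    """Generative check: invoking `tool` twice with the same idempotency key
    must yield the same result and leave the same final state as invoking once."""
    rng = random.Random(seed)
    for _ in range(iterations):
        params = gen_input(rng)
        key = f"k{rng.getrandbits(64):x}"
        state_a, state_b = {}, {}
        first = tool(state_a, params, key)
        replay = tool(state_a, params, key)   # duplicate delivery, same key
        single = tool(state_b, params, key)   # reference: exactly-once run
        if first != replay or first != single or state_a != state_b:
            return ("VIOLATION", params)      # a real framework would shrink this
    return ("PASSED", None)
```

A production framework additionally shrinks the failing input to a minimal counterexample; here the raw failing input is simply reported.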
Pseudo-Algorithm 13.17: Property-Based Behavioral Contract Verification
PROCEDURE VerifyBehavioralContracts(tool, contractSpec, generatorConfig):
generator ← PropertyBasedGenerator(tool.inputSchema, generatorConfig)
failures ← []
FOR iteration IN 1..generatorConfig.maxIterations:
randomInput ← generator.Generate()
// Contract 1: Schema compliance
response ← InvokeTool(BuildRequest(tool, randomInput))
IF response.status = SUCCESS THEN
schemaValid ← ValidateAgainstSchema(response.data, tool.outputSchema)
IF NOT schemaValid THEN
failures.Add(ContractViolation("output_schema", randomInput, response.data))
// Contract 2: Determinism (stateless tools)
IF tool.mutationClass = READ_ONLY AND contractSpec.determinism THEN
response2 ← InvokeTool(BuildRequest(tool, randomInput))
IF response.status = SUCCESS AND response2.status = SUCCESS THEN
IF response.data ≠ response2.data THEN
failures.Add(ContractViolation("determinism", randomInput,
{first: response.data, second: response2.data}))
// Contract 3: Idempotency (stateful tools)
IF tool.mutationClass ≠ READ_ONLY AND contractSpec.idempotency THEN
key ← GenerateIdempotencyKey()
r1 ← InvokeTool(BuildRequest(tool, randomInput, key))
r2 ← InvokeTool(BuildRequest(tool, randomInput, key))
IF r1.status = SUCCESS AND r2.status = SUCCESS THEN
IF r1.data ≠ r2.data THEN
failures.Add(ContractViolation("idempotency", randomInput, {r1: r1.data, r2: r2.data}))
// Contract 4: Reversibility (reversible write tools)
IF tool.mutationClass = WRITE_REVERSIBLE AND contractSpec.reversibility THEN
preState ← CaptureState()
r ← InvokeTool(BuildRequest(tool, randomInput))
IF r.status = SUCCESS THEN
compensation ← tool.DeriveCompensation(randomInput, r.data)
ExecuteCompensation(compensation)
postState ← CaptureState()
IF preState ≠ postState THEN
failures.Add(ContractViolation("reversibility", randomInput,
{pre: preState, post: postState}))
// Report
IF failures ≠ EMPTY THEN
EmitTestFailure(BEHAVIORAL_CONTRACT_VIOLATION, tool.id, failures)
RETURN TEST_FAILED
RETURN TEST_PASSED
13.14.6 Continuous Contract Enforcement in CI/CD#
Tool tests are integrated into the CI/CD pipeline as mandatory quality gates, following the pyramid in §13.14.1: unit tests run on every build, integration tests on every commit, and behavioral contract tests on every PR and deploy; a failure at any gate blocks the corresponding promotion.
Chaos tests are executed on a periodic schedule (weekly or pre-release) and do not block individual deployments but may block release promotions.
Regression Detection:
When a tool's behavioral contract test fails after a code change, the CI system:
- Identifies the minimal failing input (via shrinking in the property-based test framework).
- Records the failure as a regression test case (permanently added to the unit test suite).
- Blocks deployment until the regression is resolved.
- Notifies the tool owner and the agentic platform team.
Chapter 13 Summary: Architectural Invariants#
The following invariants define the non-negotiable properties of a production-grade tool architecture for agentic AI systems:
| # | Invariant | Enforcement Mechanism |
|---|---|---|
| 1 | Every tool has a typed, versioned, schema-validated contract | Admission predicate at registration |
| 2 | Every invocation traverses the full six-phase lifecycle | Protocol enforcement in tool server framework |
| 3 | Every mutating tool is idempotent with caller-generated keys | Idempotency store + deduplication window |
| 4 | Authorization is caller-scoped, never agent-ambient | Credential delegation chain + policy evaluation |
| 5 | Irreversible mutations require human approval or dry-run preview | Approval gates + timeout-based auto-deny |
| 6 | Tool schemas are lazily loaded under explicit token budgets | Prefill compiler with knapsack-optimized selection |
| 7 | Every invocation is traced, metered, and audit-logged | Observability infrastructure at every boundary |
| 8 | Schema evolution follows semantic versioning with backward-compatibility verification | Automated compatibility checks in CI |
| 9 | Side effects are declared, detected, and audited | Side-effect manifests + runtime mutation detection |
| 10 | Behavioral contracts are verified through property-based testing in CI/CD | Mandatory deploy gates |
These invariants collectively ensure that tool infrastructure in agentic systems operates with the same reliability, governance, and observability expectations as any production-grade distributed system — because that is precisely what it is.
End of Chapter 13.