Preface to Chapter 13#
An agentic system that cannot actuate the external world through well-governed, observable, and fault-tolerant tool interfaces is merely a text generator. This chapter formally defines the architecture of tool infrastructure in production agentic systems: from the protocol-level design of Model Context Protocol (MCP) servers, through typed contract enforcement, invocation lifecycle management, authorization propagation, and operational observability, to the governance mechanisms that keep autonomous agents safe under mutation pressure. Every tool interaction is treated as a first-class distributed systems problem — with schemas, deadlines, idempotency guarantees, versioning contracts, and human-interruptible control surfaces.
13.1 Tools as First-Class Infrastructure: Beyond Simple Function Calling#
13.1.1 The Inadequacy of Naive Function Calling#
Conventional LLM function calling — where a model emits a JSON object matching a function signature and the runtime blindly dispatches it — is structurally inadequate for production agentic systems. The deficiencies are categorical:
- No schema negotiation at runtime: The model receives a static list of function signatures at prompt compilation time; there is no mechanism for capability discovery, version awareness, or conditional availability.
- No authorization boundary: The calling model implicitly inherits the credentials of the host process rather than operating under caller-scoped, least-privilege authorization.
- No invocation lifecycle: There is no pre-validation, no post-verification, no timeout classification, and no structured error recovery — only success or opaque failure.
- No observability contract: Invocations are fire-and-forget from the model's perspective; latency, cost, side-effect auditing, and trace propagation are afterthoughts rather than architectural invariants.
- No idempotency guarantee: Retries after transient failures may duplicate state-changing mutations, with no deduplication mechanism.
13.1.2 The Tool-as-Infrastructure Thesis#
Tools in a principled agentic architecture must be treated as typed, versioned, observable, authorization-scoped, lifecycle-managed infrastructure components — equivalent in rigor to microservice APIs in distributed systems engineering. Formally, a tool is defined as a tuple:

$$T = (\mathrm{id},\ v,\ \Sigma_{\mathrm{in}},\ \Sigma_{\mathrm{out}},\ E,\ A,\ \tau,\ I,\ \mu,\ O)$$

where:
| Symbol | Definition |
|---|---|
| $\mathrm{id}$ | Globally unique, stable tool identifier (namespace-scoped) |
| $v$ | Semantic version of the tool contract |
| $\Sigma_{\mathrm{in}}$ | JSON Schema–validated input type |
| $\Sigma_{\mathrm{out}}$ | Typed structured output schema (including pagination metadata) |
| $E$ | Error envelope with classified error codes, retryability flag, and human-readable diagnostics |
| $A$ | Authorization contract: required scopes, credential propagation rules |
| $\tau$ | Timeout class assignment: interactive, standard, long-running, or async |
| $I$ | Idempotency specification: key derivation, deduplication window, at-least-once semantics |
| $\mu$ | Read-only vs. state-changing classification with side-effect manifest |
| $O$ | Trace context propagation, metric emission points, audit log contract |
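The contract tuple above can be sketched as a typed object. The following Python dataclass is a minimal illustration — all class and field names (`ToolContract`, `TimeoutClass`, and so on) are hypothetical, not part of MCP or any particular runtime:

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any

class TimeoutClass(Enum):          # tau: timeout class assignment (see 13.7)
    INTERACTIVE = "interactive"    # < 500 ms
    STANDARD = "standard"          # < 5 s
    LONG_RUNNING = "long_running"  # < 5 min
    ASYNC = "async"                # > 5 min

class MutationClass(Enum):         # mu: side-effect classification
    READ_ONLY = "read_only"
    WRITE = "write"

@dataclass(frozen=True)
class ToolContract:
    """One registered tool: (id, v, Sigma_in, Sigma_out, E, A, tau, I, mu, O)."""
    tool_id: str                     # id: globally unique, namespace-scoped
    version: str                     # v: semantic version of the contract
    input_schema: dict[str, Any]     # Sigma_in: JSON Schema for inputs
    output_schema: dict[str, Any]    # Sigma_out: typed output schema
    error_codes: frozenset[str]      # E: classified error envelope codes
    required_scopes: frozenset[str]  # A: authorization contract
    timeout_class: TimeoutClass      # tau
    idempotency_window_s: int        # I: deduplication window, in seconds
    mutation_class: MutationClass    # mu
    emits_audit_log: bool = True     # O: observability contract (simplified)

search_tool = ToolContract(
    tool_id="files.search",
    version="1.2.0",
    input_schema={"type": "object",
                  "properties": {"query": {"type": "string"}},
                  "required": ["query"], "additionalProperties": False},
    output_schema={"type": "object"},
    error_codes=frozenset({"INVALID_INPUT", "DEADLINE_EXCEEDED"}),
    required_scopes=frozenset({"files:read"}),
    timeout_class=TimeoutClass.STANDARD,
    idempotency_window_s=3600,
    mutation_class=MutationClass.READ_ONLY,
)
```

A registry would reject any construction missing a field, which is the code-level analogue of the admission predicate in §13.1.4.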
13.1.3 Protocol Layering for Tool Communication#
The tool protocol stack mirrors the broader platform architecture:
- MCP (Model Context Protocol) — Discovery, schema announcement, prompt surface exposure, resource listing, and change notification for tools accessible to the agent runtime.
- JSON-RPC 2.0 — The user/application boundary protocol for tool invocations initiated by external orchestrators or human-in-the-loop interfaces.
- gRPC/Protobuf — Internal service-to-service execution path for low-latency tool dispatch between orchestration agents and tool servers, with strongly typed message contracts, deadline propagation, and bidirectional streaming for long-running operations.
Every tool invocation carries a trace context (W3C Trace Context or equivalent), a deadline (derived from the timeout class), and a caller identity (propagated, never impersonated).
13.1.4 Formal Quality Gate for Tool Admission#
A tool is admitted into the agent's available toolset if and only if it satisfies the following admission predicate:

$$\mathrm{Admit}(T) \iff \mathrm{Valid}(\Sigma_{\mathrm{in}}) \land \mathrm{Valid}(\Sigma_{\mathrm{out}}) \land \mathrm{Defined}(E) \land \mathrm{Defined}(A) \land \mathrm{Assigned}(\tau) \land \big(\mu = \text{read-only} \lor \mathrm{Specified}(I)\big) \land \mathrm{Defined}(O)$$
Tools that fail any conjunct are rejected at registration time — never silently degraded.
13.2 MCP Tool Server Design Patterns#
The Model Context Protocol (MCP) defines a standardized interface through which agents discover, negotiate, and invoke tools. MCP servers are categorized by state management requirements and composition complexity.
13.2.1 Stateless Tool Servers: Pure Computation and Data Retrieval#
Definition. A stateless MCP tool server processes each request as an isolated, side-effect-free computation. No server-side session state persists between invocations.
Characteristics:
- Referential transparency: For identical inputs and identical external data snapshots, the output is deterministic.
- Horizontal scalability: Any replica can serve any request; no affinity routing required.
- Trivial idempotency: Re-execution is safe by construction since no mutation occurs.
- Cache-friendly: Responses can be cached by input hash with TTL governed by data freshness requirements.
Canonical Examples:
- Mathematical computation servers (symbolic algebra, unit conversion, statistical calculation)
- Read-only database query executors (with parameterized, pre-authorized query templates)
- Document format converters (Markdown → PDF, CSV → JSON)
- Embedding computation servers (text → vector, with model version pinning)
Pseudo-Algorithm 13.1: Stateless MCP Tool Server Request Handler
PROCEDURE HandleStatelessRequest(request, toolRegistry, cache):
// Phase 1: Schema validation
tool ← toolRegistry.Resolve(request.toolId, request.version)
IF tool = NULL THEN
RETURN ErrorEnvelope(code=TOOL_NOT_FOUND, retryable=FALSE)
validationResult ← ValidateAgainstSchema(request.params, tool.inputSchema)
IF NOT validationResult.valid THEN
RETURN ErrorEnvelope(code=INVALID_INPUT, details=validationResult.errors, retryable=FALSE)
// Phase 2: Cache lookup
cacheKey ← DeriveCanonicalHash(request.toolId, request.version, request.params)
cachedResult ← cache.Get(cacheKey)
IF cachedResult ≠ NULL AND cachedResult.age < tool.cacheTTL THEN
RETURN SuccessEnvelope(data=cachedResult.value, source="cache", traceId=request.traceId)
// Phase 3: Execute
deadline ← MIN(request.deadline, tool.timeoutClass.maxDuration)
result ← ExecuteWithDeadline(tool.handler, request.params, deadline)
IF result.timedOut THEN
RETURN ErrorEnvelope(code=DEADLINE_EXCEEDED, retryable=TRUE)
// Phase 4: Output validation
outputValid ← ValidateAgainstSchema(result.value, tool.outputSchema)
IF NOT outputValid.valid THEN
EmitAlert(TOOL_OUTPUT_SCHEMA_VIOLATION, tool.id, result.value)
RETURN ErrorEnvelope(code=INTERNAL_TOOL_ERROR, retryable=FALSE)
// Phase 5: Cache store and return
cache.Put(cacheKey, result.value, ttl=tool.cacheTTL)
EmitMetric(tool.id, latency=result.duration, status=SUCCESS)
RETURN SuccessEnvelope(data=result.value, source="compute", traceId=request.traceId)

Design Constraints:
- Stateless servers MUST NOT write to any persistent store as a side effect of tool execution.
- All external data reads MUST be through read-only connections or pre-authorized query templates.
- Cache invalidation MUST be event-driven (via MCP change notifications) or TTL-bounded; stale data is a correctness hazard in agentic reasoning.
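The cache-by-input-hash and TTL constraints above can be sketched as follows. `canonical_hash` and `TTLCache` are illustrative names, and a production server would use a shared store (e.g., a distributed cache) rather than an in-process dict:

```python
import hashlib
import json
import time

def canonical_hash(tool_id: str, version: str, params: dict) -> str:
    """Cache key: hash of tool identity plus canonically serialized params."""
    payload = json.dumps([tool_id, version, params],
                         sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

class TTLCache:
    """Response cache for stateless tools; entries expire after ttl_s seconds."""
    def __init__(self, ttl_s: float, clock=time.monotonic):
        self._ttl, self._clock, self._store = ttl_s, clock, {}

    def get(self, key: str):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, stored_at = entry
        if self._clock() - stored_at >= self._ttl:
            del self._store[key]       # TTL-bounded: stale data is a hazard
            return None
        return value

    def put(self, key: str, value):
        self._store[key] = (value, self._clock())

    def invalidate(self, key: str):
        """Event-driven path: called on an MCP change notification."""
        self._store.pop(key, None)
```

Because the canonical serialization sorts keys, two logically identical requests with different JSON key order hash to the same cache entry.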
13.2.2 Stateful Tool Servers: Session-Aware, Transaction-Capable Services#
Definition. A stateful MCP tool server maintains mutable state across invocations within a bounded session or transaction scope. State transitions must be explicit, auditable, and reversible where feasible.
Characteristics:
- Session affinity: Requests within a session must route to the same server instance or to a shared session store.
- Transaction boundaries: State mutations occur within explicit transaction scopes with commit/rollback semantics.
- Idempotency enforcement: Every mutating operation MUST accept an idempotency key; duplicate submissions within the deduplication window MUST return the original result without re-execution.
- Compensating actions: For mutations that cannot be atomically rolled back, a compensating action (inverse operation) MUST be registered at commit time.
Canonical Examples:
- Code execution sandboxes (stateful REPL sessions with persistent variable scope)
- Database write proxies (INSERT, UPDATE, DELETE with transactional guarantees)
- File system manipulation servers (create, modify, delete with journaling)
- External API integrators that perform multi-step workflows (e.g., create a cloud resource, then configure it)
Formal State Transition Model:
Let $S$ be the state space of a stateful tool server, and let $a \in A$ be an action (tool invocation). The transition function is:

$$\delta : S \times A \times K \to S \times R \times J, \qquad \delta(s, a, k) = (s', r, j)$$

where $k \in K$ is the idempotency key, $r \in R$ is the result, and $j \in J$ is the journal entry recording the transition for auditability and rollback.

The idempotency invariant is:

$$\delta(s', a, k) = (s', r, \varnothing)$$

That is, re-applying the same action with the same idempotency key from the post-state $s'$ yields no further state change and replays the recorded result $r$.
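A toy model makes the invariant concrete. The sketch below (a hypothetical `CounterServer`) shows how an idempotency store turns a duplicate submission into a replay of the recorded result, with no additional state change and no additional journal entry:

```python
class CounterServer:
    """Toy stateful tool: increments a counter. The idempotency key makes a
    repeated application a no-op that replays the recorded result."""
    def __init__(self):
        self.state = 0          # s: server state
        self.journal = []       # j: audit journal entries
        self._seen = {}         # idempotency store: key -> recorded result

    def apply(self, amount: int, key: str) -> dict:
        if key in self._seen:               # replay: no state change, same r
            return self._seen[key]
        self.state += amount                # s -> s'
        result = {"new_value": self.state}  # r
        self.journal.append(("increment", amount, key))
        self._seen[key] = result
        return result
```

Submitting the same (action, key) pair twice yields one state transition, one journal entry, and two identical results — the "effectively-once" behavior formalized in §13.8.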
Pseudo-Algorithm 13.2: Stateful MCP Tool Server with Idempotency
PROCEDURE HandleStatefulRequest(request, toolRegistry, sessionStore, idempotencyStore):
// Phase 1: Resolve tool and session
tool ← toolRegistry.Resolve(request.toolId, request.version)
session ← sessionStore.GetOrCreate(request.sessionId)
// Phase 2: Idempotency check
IF request.idempotencyKey ≠ NULL THEN
prior ← idempotencyStore.Lookup(request.toolId, request.idempotencyKey)
IF prior ≠ NULL AND prior.age < tool.idempotencyWindow THEN
EmitMetric(tool.id, status=IDEMPOTENT_REPLAY)
RETURN prior.result // Return cached result without re-execution
// Phase 3: Authorization (caller-scoped)
authDecision ← EvaluatePolicy(
callerIdentity=request.callerToken,
requiredScopes=tool.authContract.requiredScopes,
resourceContext=session.resourceScope
)
IF authDecision = DENY THEN
RETURN ErrorEnvelope(code=AUTHORIZATION_DENIED, retryable=FALSE)
IF authDecision = REQUIRES_APPROVAL THEN
approvalId ← SubmitApprovalRequest(request, tool, session)
RETURN PendingEnvelope(approvalId=approvalId, pollEndpoint=...)
// Phase 4: Input validation
validationResult ← ValidateAgainstSchema(request.params, tool.inputSchema)
IF NOT validationResult.valid THEN
RETURN ErrorEnvelope(code=INVALID_INPUT, details=validationResult.errors)
// Phase 5: Begin transaction, execute, commit
deadline ← MIN(request.deadline, tool.timeoutClass.maxDuration)
txn ← session.BeginTransaction()
TRY:
result ← ExecuteWithDeadline(tool.handler, request.params, session.state, txn, deadline)
outputValid ← ValidateAgainstSchema(result.value, tool.outputSchema)
IF NOT outputValid.valid THEN
txn.Rollback()
RETURN ErrorEnvelope(code=INTERNAL_TOOL_ERROR)
compensatingAction ← tool.DeriveCompensation(request.params, result.value)
journalEntry ← JournalEntry(tool.id, request.params, result.value, compensatingAction, timestamp=NOW())
txn.Commit()
session.AppendJournal(journalEntry)
// Phase 6: Store idempotency record
IF request.idempotencyKey ≠ NULL THEN
idempotencyStore.Store(request.toolId, request.idempotencyKey,
SuccessEnvelope(data=result.value), ttl=tool.idempotencyWindow)
EmitMetric(tool.id, latency=result.duration, status=SUCCESS, mutation=TRUE)
RETURN SuccessEnvelope(data=result.value, traceId=request.traceId)
CATCH exception:
txn.Rollback()
EmitMetric(tool.id, status=FAILURE, error=exception.class)
RETURN ErrorEnvelope(code=EXECUTION_FAILED, retryable=IsTransient(exception))

13.2.3 Composite Tool Servers: Orchestrating Multi-Step Tool Chains#
Definition. A composite MCP tool server composes multiple atomic tool invocations into a coherent multi-step workflow, managing inter-step dependencies, partial failure recovery, and aggregate result synthesis.
Motivation. Agents frequently require tool chains — e.g., "search for relevant files, then read their contents, then run static analysis on the results." Delegating chain orchestration entirely to the LLM planner introduces fragility: the model may mis-sequence steps, fail to handle partial failures, or lose intermediate results due to context window pressure. Composite tool servers internalize the orchestration logic while exposing a single, high-level tool interface to the agent.
Formal Composition Model:
A composite tool is defined as a directed acyclic graph (DAG) of atomic tools:

$$C = (V, E, \beta, \phi)$$

where:
- $V = \{T_1, \dots, T_n\}$ is the set of constituent atomic tools.
- $E \subseteq V \times V$ defines execution dependencies ($(T_i, T_j) \in E$ means $T_j$ depends on $T_i$'s output).
- $\beta$ is the binding function that maps upstream outputs to downstream inputs.
- $\phi$ is the merge function that synthesizes the composite output.
Execution Semantics:
- Topologically sort $V$ according to $E$.
- Execute each level of the topological sort in parallel where no data dependencies exist.
- Apply $\beta$ to route intermediate results.
- On any step failure, evaluate the step's retry budget; if exhausted, invoke compensating actions for all committed prior steps (reverse topological order) and report aggregate failure.
- Apply $\phi$ to produce the composite result.
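The first step — sorting the DAG into levels that can execute in parallel, with cycle rejection at registration time — can be sketched with Kahn's algorithm. `execution_levels` is an illustrative helper, not part of MCP:

```python
from collections import deque

def execution_levels(tools: set, edges: set) -> list:
    """Group tools into topological levels: tools within one level share no
    dependencies and may execute in parallel. Raises ValueError on cycles."""
    indegree = {t: 0 for t in tools}
    downstream = {t: [] for t in tools}
    for upstream, dependent in edges:   # (u, d): d depends on u's output
        downstream[upstream].append(dependent)
        indegree[dependent] += 1
    frontier = deque(sorted(t for t, d in indegree.items() if d == 0))
    levels, visited = [], 0
    while frontier:
        level = sorted(frontier)        # deterministic ordering for clarity
        frontier.clear()
        for t in level:
            visited += 1
            for d in downstream[t]:
                indegree[d] -= 1
                if indegree[d] == 0:    # all upstream results now available
                    frontier.append(d)
        levels.append(level)
    if visited != len(tools):
        raise ValueError("composite DAG contains a cycle")  # reject at registration
    return levels
```

For the chain "search → read → analyze" this yields three sequential levels; two independent fetches feeding one merge step yield a two-level plan with the fetches side by side.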
Pseudo-Algorithm 13.3: Composite Tool Execution with Partial Failure Recovery
PROCEDURE ExecuteComposite(compositeSpec, inputParams, deadline):
dag ← BuildDAG(compositeSpec.tools, compositeSpec.edges)
levels ← TopologicalSort(dag)
intermediateResults ← EmptyMap()
committedSteps ← EmptyStack()
FOR EACH level IN levels:
parallelTasks ← []
FOR EACH tool IN level:
boundInput ← ApplyBindings(compositeSpec.bindings[tool.id], inputParams, intermediateResults)
parallelTasks.Add(ScheduleToolInvocation(tool, boundInput, deadline))
results ← AwaitAll(parallelTasks, deadline)
FOR EACH (tool, result) IN results:
IF result.status = FAILURE THEN
IF tool.retryBudget > 0 THEN
retryResult ← RetryWithBackoff(tool, result.input, tool.retryBudget, deadline)
IF retryResult.status = FAILURE THEN
GOTO CompensateAndFail
result ← retryResult
ELSE
GOTO CompensateAndFail
intermediateResults[tool.id] ← result.value
committedSteps.Push(tool)
compositeOutput ← ApplyMergeFunction(compositeSpec.mergeFunction, intermediateResults)
RETURN SuccessEnvelope(data=compositeOutput)
CompensateAndFail:
completedIds ← committedSteps.AsList() // snapshot before compensation drains the stack
WHILE committedSteps IS NOT EMPTY:
completedTool ← committedSteps.Pop()
compensation ← completedTool.DeriveCompensation(intermediateResults[completedTool.id])
IF compensation ≠ NULL THEN
ExecuteCompensation(compensation, deadline)
RETURN ErrorEnvelope(code=COMPOSITE_PARTIAL_FAILURE,
failedStep=tool.id,
completedSteps=completedIds)

Design Constraints for Composite Servers:
- The composite DAG MUST be acyclic; cycle detection is performed at registration time.
- Maximum DAG depth is bounded (to a small, fixed number of steps) to enforce latency predictability.
- Each constituent tool invocation inherits the caller's authorization scope — the composite server MUST NOT escalate privileges.
- Intermediate results are held in ephemeral working memory and MUST NOT be persisted beyond the composite execution scope unless explicitly required by the output contract.
13.3 Tool Schema Design: JSON Schema Input Validation, Structured Output Types, Error Envelopes#
13.3.1 Input Schema: JSON Schema with Semantic Annotations#
Every tool input MUST be described by a JSON Schema document (draft 2020-12 or later) augmented with semantic annotations that guide both the LLM's parameter generation and the runtime's validation logic.
Required Schema Properties:
| Property | Purpose |
|---|---|
| `type`, `properties`, `required` | Standard structural validation |
| `description` (per-property) | Semantic guidance for the LLM planner — must be precise, unambiguous, and include valid example values |
| `enum` / `const` | Constrained value sets; reduces hallucination risk in parameter generation |
| `format` | Semantic format hints (`date-time`, `uri`, `email`, `uuid`) with runtime validation |
| `minLength`, `maxLength`, `pattern` | String constraint enforcement |
| `minimum`, `maximum`, `multipleOf` | Numeric constraint enforcement |
| `x-sensitivity` | Custom extension: marks fields containing PII, secrets, or credentials for audit redaction |
| `x-token-cost` | Custom extension: estimated token cost of including this field in context, used by the prefill compiler |
| `default` | Default values for optional parameters; reduces LLM burden for common cases |
Validation Rigor:
Input validation is strict by default — additional properties not declared in the schema are rejected (i.e., additionalProperties: false). This prevents the LLM from inventing parameters that the tool does not support, which is a common hallucination pattern.
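A minimal sketch of this strict-by-default validation, using hand-rolled checks for required fields, primitive types, and undeclared properties. A production server would use a full JSON Schema 2020-12 validator library; `validate_strict` is a hypothetical name:

```python
def validate_strict(params: dict, schema: dict) -> list:
    """Return a list of validation errors (empty list means valid).
    Enforces required properties, basic type checks, and the
    additionalProperties: false rule against hallucinated parameters."""
    errors = []
    declared = schema.get("properties", {})
    type_map = {"string": str, "integer": int, "number": (int, float),
                "boolean": bool, "object": dict, "array": list}
    for name in schema.get("required", []):
        if name not in params:
            errors.append(f"missing required property: {name}")
    for name, value in params.items():
        if name not in declared:
            errors.append(f"undeclared property: {name}")  # invented parameter
            continue
        expected = type_map.get(declared[name].get("type"))
        if expected and not isinstance(value, expected):
            errors.append(f"wrong type for {name}: "
                          f"expected {declared[name]['type']}")
    return errors
```

The undeclared-property branch is the code-level form of `additionalProperties: false` — a parameter the LLM invented is rejected with an actionable diagnostic instead of being silently passed through.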
13.3.2 Output Schema: Typed Structured Responses#
Tool outputs are not freeform text. Every tool defines a typed output schema that includes:
- Result payload: The primary structured data returned by the tool.
- Pagination metadata (where applicable):
cursor,hasMore,totalCount— enabling the agent to request subsequent pages without re-specifying the query. - Provenance metadata:
source,retrievedAt,dataVersion— enabling the agent and downstream verifiers to assess result authority and freshness. - Execution metadata:
executionDurationMs,resourceCost— enabling the orchestrator to update latency and cost models.
13.3.3 Error Envelope: Classified, Retryable, Diagnostic#
All tool failures MUST be returned within a standardized error envelope, never as raw exceptions or unstructured text.
Error Envelope Structure:
| Field | Type | Description |
|---|---|---|
| `errorCode` | Enum | Machine-readable error class (see taxonomy below) |
| `retryable` | Boolean | Whether the caller should retry with the same parameters |
| `retryAfterMs` | Integer (optional) | Suggested backoff duration before retry |
| `humanMessage` | String | Human-readable diagnostic (never leaks credentials or internal paths) |
| `details` | Object (optional) | Structured diagnostic data (validation errors, partial results, trace references) |
| `traceId` | String | Correlation identifier for distributed tracing |
Error Code Taxonomy:

| Error Code | Retryable | Meaning |
|---|---|---|
| `TOOL_NOT_FOUND` | No | The requested tool ID or version is not registered |
| `INVALID_INPUT` | No | Parameters failed schema or semantic validation |
| `AUTHORIZATION_DENIED` | No | The caller lacks the required scopes |
| `DEADLINE_EXCEEDED` | Yes | Execution exceeded the effective deadline |
| `EXECUTION_FAILED` | Conditional | Handler failure; retryable only if classified as transient |
| `INTERNAL_TOOL_ERROR` | No | Output schema violation or side-effect inconsistency |
| `RETRY_EXHAUSTED` | No | The orchestrator's retry budget was consumed |
| `COMPOSITE_PARTIAL_FAILURE` | No | A composite step failed and compensating actions were applied |

The retryable flag is the authoritative signal — the agent loop MUST NOT attempt retries on non-retryable errors, regardless of heuristic judgment.
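A sketch of the envelope and the authoritative-flag rule, with illustrative Python names (`ErrorEnvelope`, `should_retry`):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class ErrorEnvelope:
    error_code: str                    # machine-readable error class
    retryable: bool                    # authoritative retry signal
    human_message: str                 # diagnostic; no secrets, no internal paths
    trace_id: str                      # distributed-tracing correlation id
    retry_after_ms: Optional[int] = None  # suggested backoff, if any

def should_retry(envelope: ErrorEnvelope,
                 attempts_used: int, max_attempts: int) -> bool:
    """The retryable flag is authoritative: never retry a non-retryable
    error, regardless of any heuristic about the error code."""
    return envelope.retryable and attempts_used < max_attempts
```

Note that the decision function never inspects `error_code` — classification happens once, server-side, and the caller obeys the flag.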
13.4 Tool Discovery and Registration: Dynamic Capability Announcement, Schema Negotiation#
13.4.1 Discovery Protocol#
Tool discovery follows the MCP capability announcement pattern. When an agent runtime connects to an MCP server (or when an MCP server is dynamically registered with the orchestrator), the following handshake occurs:
Pseudo-Algorithm 13.4: MCP Tool Discovery Handshake
PROCEDURE DiscoverTools(mcpEndpoint, agentCapabilities):
// Step 1: Capability negotiation
serverCapabilities ← mcpEndpoint.Initialize(
protocolVersion=CURRENT_MCP_VERSION,
clientCapabilities=agentCapabilities
)
IF NOT CompatibleVersion(serverCapabilities.protocolVersion, CURRENT_MCP_VERSION) THEN
RETURN DiscoveryFailure(reason="Protocol version mismatch")
// Step 2: List available tools with schemas
toolList ← mcpEndpoint.ListTools(cursor=NULL)
allTools ← []
WHILE toolList IS NOT NULL:
FOR EACH toolDescriptor IN toolList.tools:
// Step 3: Schema validation at discovery time
schemaValid ← ValidateToolDescriptor(toolDescriptor)
IF schemaValid THEN
allTools.Add(toolDescriptor)
ELSE
EmitWarning(INVALID_TOOL_SCHEMA, toolDescriptor.id)
IF toolList.hasMore THEN
toolList ← mcpEndpoint.ListTools(cursor=toolList.nextCursor)
ELSE
toolList ← NULL
// Step 4: Register discovered tools in agent's tool registry
FOR EACH tool IN allTools:
agentToolRegistry.Register(tool, source=mcpEndpoint.id, discoveredAt=NOW())
// Step 5: Subscribe to change notifications
mcpEndpoint.SubscribeToChanges(callback=OnToolSetChanged)
RETURN allTools

13.4.2 Registration Invariants#
Every registered tool MUST satisfy the admission predicate defined in §13.1.4. Additionally:
- Namespace isolation: Tool IDs are scoped by MCP server namespace to prevent collisions across independently managed servers.
- Version coexistence: Multiple versions of the same tool may be registered simultaneously. The agent's prefill compiler selects the appropriate version based on task requirements and deprecation status.
- Liveness probing: The orchestrator periodically probes registered MCP servers for health. Tools from unresponsive servers are marked `UNAVAILABLE` in the registry but not deregistered (to preserve discovery history and enable rapid re-availability).
- Change propagation: When an MCP server emits a change notification (tool added, removed, or schema modified), the agent's tool registry is updated atomically, and the prefill compiler's tool affordance cache is invalidated.
13.4.3 Schema Negotiation#
Schema negotiation resolves capability mismatches between agent expectations and tool offerings at discovery time, before any invocation is attempted.
Partial matches are acceptable when missing capabilities are optional. The agent loop adapts its plan accordingly.
13.5 Lazy Tool Loading: Minimizing Context Cost by Deferring Schema Injection#
13.5.1 The Context Cost Problem#
Each tool schema injected into the agent's context window consumes tokens. For a system with $N$ registered tools, each with an average schema size of $\bar{s}$ tokens, the total tool-affordance context cost is:

$$C_{\mathrm{tools}} = N \cdot \bar{s}$$

For production systems with dozens to hundreds of tools, each schema costing several hundred tokens, $C_{\mathrm{tools}}$ reaches tens of thousands of tokens — a substantial fraction of even large context windows. This directly reduces the budget available for task-relevant instructions, retrieved evidence, memory summaries, and reasoning.
13.5.2 Lazy Loading Strategy#
Lazy tool loading defers schema injection until the agent's planner determines that a tool is likely needed for the current task. The strategy operates in three tiers:
Tier 0 — Tool Index (Always Loaded): A compressed index of all available tools, containing only (toolId, oneSentenceDescription, mutationClass). Cost: on the order of tens of tokens per tool.
Tier 1 — Selected Tool Schemas (Loaded on Plan): When the agent's planner identifies relevant tools during the planning phase, full schemas for only those tools are injected.
Tier 2 — Extended Tool Details (Loaded on Demand): For tools with complex schemas (e.g., multi-page input structures), only the top-level schema is injected at Tier 1; nested details are loaded only when the agent begins constructing the specific invocation.
Pseudo-Algorithm 13.5: Lazy Tool Loading in Prefill Compilation
PROCEDURE CompileToolAffordances(taskObjective, toolRegistry, tokenBudget):
// Tier 0: Always include compressed index
toolIndex ← toolRegistry.GetCompressedIndex()
indexTokens ← CountTokens(toolIndex)
remainingBudget ← tokenBudget - indexTokens
affordanceBlock ← [toolIndex]
// Tier 1: Select relevant tools via planner query
relevantToolIds ← PlannerSelectTools(taskObjective, toolIndex)
// Rank by predicted utility for current task
rankedTools ← RankByTaskUtility(relevantToolIds, taskObjective)
FOR EACH toolId IN rankedTools:
schema ← toolRegistry.GetSchema(toolId, detail=STANDARD)
schemaTokens ← CountTokens(schema)
IF schemaTokens ≤ remainingBudget THEN
affordanceBlock.Add(schema)
remainingBudget ← remainingBudget - schemaTokens
ELSE
// Tier 2: Include compressed schema reference only
compressedRef ← toolRegistry.GetSchema(toolId, detail=MINIMAL)
refTokens ← CountTokens(compressedRef)
IF refTokens ≤ remainingBudget THEN
affordanceBlock.Add(compressedRef)
remainingBudget ← remainingBudget - refTokens
ELSE
BREAK // Budget exhausted
RETURN affordanceBlock, remainingBudget

13.5.3 Token Budget Optimization#
The optimal tool selection under a token budget $B$ is formulated as a 0-1 knapsack problem:

$$\max_{x \in \{0,1\}^N} \sum_{i=1}^{N} u_i x_i \quad \text{subject to} \quad \sum_{i=1}^{N} c_i x_i \le B$$

where $u_i$ is the estimated utility of tool $i$ for the current task (derived from planner relevance scoring) and $c_i$ is the token cost of tool $i$'s schema. In practice, a greedy approximation (sort by $u_i / c_i$ descending, select greedily) is sufficient given that tool counts are moderate.
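The greedy approximation fits in a few lines. `select_tools` and the candidate triples are illustrative names under the assumption that utilities and token costs come from the planner and the registry respectively:

```python
def select_tools(candidates: list, budget: int) -> list:
    """Greedy 0-1 knapsack approximation for tool selection.
    candidates: (tool_id, utility u_i, token cost c_i) triples.
    Sort by utility-per-token, then take each tool whose schema
    still fits in the remaining token budget."""
    selected, remaining = [], budget
    ranked = sorted(candidates, key=lambda t: t[1] / t[2], reverse=True)
    for tool_id, utility, cost in ranked:
        if cost <= remaining:
            selected.append(tool_id)
            remaining -= cost
    return selected
```

With a budget of 800 tokens and candidates costing 400, 300, and 600 tokens, the two highest-ratio tools fit and the third is dropped rather than truncated.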
13.6 Tool Invocation Lifecycle: Request → Validate → Authorize → Execute → Verify → Return#
13.6.1 Lifecycle Overview#
Every tool invocation — regardless of tool type, server pattern, or timeout class — traverses a deterministic six-phase lifecycle. Skipping any phase constitutes a protocol violation.
13.6.2 Phase-by-Phase Specification#
Phase 1 — Request Construction. The agent's action generator produces a structured tool invocation request containing:
- `toolId` and `version`
- `params` (JSON object matching the input schema $\Sigma_{\mathrm{in}}$)
- `idempotencyKey` (for mutating operations)
- `callerToken` (propagated credential)
- `traceId` (from current execution trace)
- `deadline` (computed from timeout class and remaining agent loop budget)
Phase 2 — Validate. The tool server validates the request:
- Schema validation against the input schema $\Sigma_{\mathrm{in}}$
- Semantic validation (e.g., referenced resource exists, date range is valid)
- Idempotency key format validation
Validation failures produce INVALID_INPUT error envelopes with actionable diagnostic details.
Phase 3 — Authorize. The tool server evaluates the authorization policy against the caller's identity and requested operation:
Possible decisions: `ALLOW`, `DENY`, `REQUIRES_APPROVAL`.
Phase 4 — Execute. The tool handler executes the core logic within the assigned deadline. For stateful tools, execution occurs within a transaction scope. The tool server emits structured telemetry (start timestamp, intermediate checkpoints for long-running operations, completion status).
Phase 5 — Verify. Post-execution verification:
- Output schema validation against the output schema $\Sigma_{\mathrm{out}}$
- Side-effect verification (for mutating tools): confirm the intended state change occurred correctly
- Integrity checks: hash verification for file operations, row count verification for database operations
Phase 6 — Return. The tool server constructs and returns either a SuccessEnvelope or an ErrorEnvelope, with full telemetry (latency, cost attribution, trace references).
Pseudo-Algorithm 13.6: Complete Tool Invocation Lifecycle
PROCEDURE InvokeTool(request):
span ← StartTraceSpan("tool.invoke", traceId=request.traceId, toolId=request.toolId)
// PHASE 1: Request already constructed by agent action generator
// PHASE 2: Validate
tool ← ResolveToolOrFail(request.toolId, request.version)
validationErrors ← ValidateInput(request.params, tool.inputSchema)
IF validationErrors ≠ EMPTY THEN
span.SetStatus(INVALID_INPUT)
RETURN ErrorEnvelope(code=INVALID_INPUT, details=validationErrors)
// PHASE 3: Authorize
authDecision ← EvaluateAuthPolicy(tool.authContract, request.callerToken, request.params)
IF authDecision = DENY THEN
span.SetStatus(AUTHORIZATION_DENIED)
EmitAuditLog(DENIED, request.callerToken, tool.id, request.params)
RETURN ErrorEnvelope(code=AUTHORIZATION_DENIED)
IF authDecision = REQUIRES_APPROVAL THEN
approvalTicket ← CreateApprovalTicket(request, tool)
span.AddEvent("approval_requested", approvalTicket.id)
RETURN PendingEnvelope(approvalTicket)
// PHASE 4: Execute
executionContext ← PrepareExecutionContext(tool, request, span)
result ← ExecuteWithDeadline(tool.handler, executionContext, request.deadline)
IF result.status = TIMEOUT THEN
span.SetStatus(DEADLINE_EXCEEDED)
RETURN ErrorEnvelope(code=DEADLINE_EXCEEDED, retryable=TRUE)
IF result.status = ERROR THEN
span.SetStatus(EXECUTION_FAILED)
RETURN ErrorEnvelope(code=MapErrorCode(result.error), retryable=IsTransient(result.error))
// PHASE 5: Verify
outputErrors ← ValidateOutput(result.value, tool.outputSchema)
IF outputErrors ≠ EMPTY THEN
span.SetStatus(OUTPUT_SCHEMA_VIOLATION)
EmitAlert(TOOL_OUTPUT_CORRUPTION, tool.id)
RETURN ErrorEnvelope(code=INTERNAL_TOOL_ERROR)
IF tool.mutationClass = WRITE THEN
sideEffectVerification ← VerifySideEffects(tool, request.params, result.value)
IF NOT sideEffectVerification.consistent THEN
AttemptRollback(tool, request.params, result.value)
span.SetStatus(SIDE_EFFECT_INCONSISTENCY)
RETURN ErrorEnvelope(code=INTERNAL_TOOL_ERROR, details=sideEffectVerification)
// PHASE 6: Return
EmitAuditLog(SUCCESS, request.callerToken, tool.id, request.params, result.value)
span.SetStatus(SUCCESS)
span.SetAttribute("latencyMs", result.durationMs)
span.SetAttribute("cost", result.resourceCost)
span.End()
RETURN SuccessEnvelope(
data=result.value,
provenance={source: tool.id, version: tool.version, executedAt: NOW()},
executionMetadata={latencyMs: result.durationMs, cost: result.resourceCost}
)

13.7 Tool Timeout Classes: Interactive (<500ms), Standard (<5s), Long-Running (<5min), Async (>5min)#
13.7.1 Classification Rationale#
Not all tool invocations have the same latency profile. A single timeout value is either too aggressive (causing spurious failures for legitimately slow operations) or too permissive (allowing stuck invocations to block the agent loop). Timeout classes formalize latency expectations and enable the orchestrator to make informed scheduling decisions.
13.7.2 Timeout Class Definitions#
| Class | Symbol | Max Duration | Use Cases | Agent Loop Behavior |
|---|---|---|---|---|
| Interactive | $\tau_{\mathrm{int}}$ | < 500 ms | In-memory computation, cache lookups, simple data retrieval | Synchronous wait; no intermediate feedback |
| Standard | $\tau_{\mathrm{std}}$ | < 5 s | Database queries, API calls, file reads, search operations | Synchronous wait; may emit progress indicator |
| Long-Running | $\tau_{\mathrm{long}}$ | < 5 min | Code execution, large file processing, complex analysis, CI runs | Checkpoint-based progress; agent may interleave other work |
| Async | $\tau_{\mathrm{async}}$ | > 5 min | Deployments, training jobs, batch processing, external approvals | Submit-and-poll; agent continues other tasks; callback on completion |
13.7.3 Deadline Propagation#
The effective deadline for a tool invocation is the minimum of the tool's timeout class maximum and the agent loop's remaining time budget:

$$d_{\mathrm{eff}} = \min\big(\tau_{\mathrm{class}}.\mathrm{max},\ B_{\mathrm{task}} - t_{\mathrm{elapsed}}\big)$$

where $B_{\mathrm{task}}$ is the total time budget for the agent's current task iteration and $t_{\mathrm{elapsed}}$ is the time already consumed.
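A sketch of the deadline computation, using the class maxima from the table in §13.7.2 (function name, units, and the table constant are illustrative):

```python
TIMEOUT_CLASS_MAX_MS = {
    "interactive": 500,        # < 500 ms
    "standard": 5_000,         # < 5 s
    "long_running": 300_000,   # < 5 min
    "async": None,             # submit-and-poll; no synchronous deadline
}

def effective_deadline_ms(timeout_class: str,
                          task_budget_ms: int, elapsed_ms: int) -> int:
    """Effective deadline = min(class maximum, remaining agent-loop budget)."""
    remaining = max(task_budget_ms - elapsed_ms, 0)
    class_max = TIMEOUT_CLASS_MAX_MS[timeout_class]
    return remaining if class_max is None else min(class_max, remaining)
```

Near the end of a task iteration the remaining budget dominates, so even a standard-class tool may receive a deadline far below its 5-second class maximum.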
13.7.4 Long-Running and Async Tool Patterns#
Pseudo-Algorithm 13.7: Async Tool Invocation with Polling
PROCEDURE InvokeAsyncTool(request, pollingPolicy):
// Submit the async operation
submission ← ToolServer.SubmitAsync(request)
IF submission.status = REJECTED THEN
RETURN ErrorEnvelope(code=submission.errorCode)
operationId ← submission.operationId
estimatedDuration ← submission.estimatedDurationMs
// Register callback if supported
IF ToolServer.SupportsCallbacks() THEN
ToolServer.RegisterCallback(operationId, callback=OnAsyncToolComplete)
RETURN AsyncPendingEnvelope(operationId, estimatedCompletion=NOW()+estimatedDuration)
// Fallback: polling with exponential backoff
pollInterval ← pollingPolicy.initialIntervalMs
totalWait ← 0
WHILE totalWait < pollingPolicy.maxWaitMs:
Sleep(pollInterval)
totalWait ← totalWait + pollInterval
status ← ToolServer.PollStatus(operationId)
IF status.state = COMPLETED THEN
RETURN SuccessEnvelope(data=status.result)
IF status.state = FAILED THEN
RETURN ErrorEnvelope(code=status.errorCode, retryable=status.retryable)
IF status.state = IN_PROGRESS THEN
// Optionally report progress to agent
EmitProgress(operationId, status.progressPercent, status.checkpoint)
pollInterval ← MIN(pollInterval * pollingPolicy.backoffMultiplier, pollingPolicy.maxIntervalMs)
// Max wait exceeded
ToolServer.RequestCancellation(operationId)
RETURN ErrorEnvelope(code=DEADLINE_EXCEEDED, retryable=TRUE)

13.7.5 Timeout Class Assignment Rules#
A tool's timeout class is assigned at registration time based on empirical latency profiling and operational characteristics: the smallest class whose maximum duration bounds the tool's observed tail latency, with headroom for variance, is selected.
Misclassification is detected via observability (§13.13) and triggers automatic reclassification alerts.
13.8 Tool Idempotency Requirements: Safe Retries, Deduplication Keys, and At-Least-Once Semantics#
13.8.1 The Idempotency Imperative#
In distributed agentic systems, transient failures (network partitions, deadline exceedances, process restarts) are routine. Without idempotency guarantees, retrying a failed tool invocation may duplicate state-changing mutations (double-writes, duplicate resource creation, repeated financial transactions). Idempotency is not optional for mutating tools — it is a correctness requirement.
13.8.2 Formal Definition#
A tool action $a$ is idempotent with respect to idempotency key $k$ if, for every state $s$ and every $n \ge 1$:

$$\delta^{(n)}(s, a, k) = \delta(s, a, k)$$

where $\delta^{(n)}$ denotes $n$ successive applications of $\delta$ with the same action and key. In operational terms: applying the same action with the same idempotency key $n$ times produces the same final state and the same result as applying it once.
13.8.3 Idempotency Key Design#
The idempotency key must be:
- Caller-generated: The agent (or orchestrator) generates the key before the first attempt. The tool server MUST NOT generate idempotency keys — this defeats the purpose.
- Deterministically derivable: For a given (task, step, invocation intent) tuple, the same key is produced. This ensures that retries after agent process restart use the same key.
- Scoped: Keys are scoped to (toolId, callerSessionId) to prevent cross-session collisions.
Key derivation function:

key = SHA-256(sessionId ‖ toolId ‖ stepId ‖ canonical(params))

where canonical(params) is a deterministic JSON serialization of the input parameters (keys sorted, whitespace normalized).
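The derivation above can be sketched in a few lines of Python; the field-separator byte and parameter names are illustrative choices, not prescribed by the text:

```python
import hashlib
import json

def canonical_json(params: dict) -> str:
    # Deterministic serialization: keys sorted, no insignificant whitespace.
    return json.dumps(params, sort_keys=True, separators=(",", ":"))

def derive_idempotency_key(session_id: str, tool_id: str,
                           step_id: str, params: dict) -> str:
    # Same (session, tool, step, params) tuple always yields the same key,
    # so a retry after a process restart reuses the original key.
    material = "\x1f".join([session_id, tool_id, step_id, canonical_json(params)])
    return hashlib.sha256(material.encode("utf-8")).hexdigest()
```

Because the serialization is canonical, semantically identical parameter dicts with different key orderings derive the same key.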
13.8.4 Deduplication Window#
Idempotency records are stored with a bounded TTL (the deduplication window) T_dedup ≥ T_min, where T_min is a configurable minimum (typically 1 hour). Records older than T_dedup are evicted — subsequent invocations with the same key are treated as new requests.
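A minimal in-memory sketch of such a store, with an injectable clock so the TTL eviction is testable (the class and method names are assumptions for illustration):

```python
import time

class IdempotencyStore:
    """Deduplication cache with a bounded TTL window (in-memory sketch)."""

    def __init__(self, ttl_seconds: float = 3600.0, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._records = {}  # key -> (stored_at, cached_result)

    def get(self, key: str):
        entry = self._records.get(key)
        if entry is None:
            return None
        stored_at, result = entry
        if self._clock() - stored_at > self._ttl:
            del self._records[key]  # expired: treat as a new request
            return None
        return result

    def put(self, key: str, result) -> None:
        self._records[key] = (self._clock(), result)
```

A production store would be durable and shared across tool-server replicas; the eviction semantics are the same.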
13.8.5 At-Least-Once Semantics with Idempotent Receivers#
The tool invocation protocol guarantees at-least-once delivery: the orchestrator will retry until it receives a definitive response (success or non-retryable error). Combined with idempotent tool implementations, this yields effectively-once execution semantics:
Pseudo-Algorithm 13.8: Idempotent Retry Logic in Agent Orchestrator
PROCEDURE InvokeWithIdempotentRetry(tool, params, retryPolicy):
idempotencyKey ← DeriveIdempotencyKey(currentSession, tool.id, currentStep, params)
attempt ← 0
lastError ← NULL
WHILE attempt < retryPolicy.maxAttempts:
attempt ← attempt + 1
request ← BuildToolRequest(
toolId=tool.id,
version=tool.version,
params=params,
idempotencyKey=idempotencyKey,
callerToken=currentSession.callerToken,
traceId=currentTrace.id,
deadline=ComputeDeadline(tool.timeoutClass, retryPolicy, attempt)
)
response ← ToolServer.Invoke(request)
IF response.status = SUCCESS THEN
RETURN response
IF response.status = ERROR THEN
IF NOT response.retryable THEN
RETURN response // Terminal failure — do not retry
lastError ← response
backoff ← ComputeBackoff(retryPolicy.baseMs, attempt, retryPolicy.jitterFactor)
Sleep(backoff)
// Exhausted retry budget
EmitAlert(RETRY_BUDGET_EXHAUSTED, tool.id, attempt, lastError)
RETURN ErrorEnvelope(code=RETRY_EXHAUSTED, lastError=lastError)
FUNCTION ComputeBackoff(baseMs, attempt, jitterFactor):
exponentialMs ← baseMs * 2^(attempt - 1)
cappedMs ← MIN(exponentialMs, MAX_BACKOFF_MS)
jitter ← RANDOM_UNIFORM(-jitterFactor * cappedMs, +jitterFactor * cappedMs)
RETURN MAX(0, cappedMs + jitter)
13.9 Read vs. Write Tool Classification: Mutation Detection, Side-Effect Auditing#
13.9.1 Classification Taxonomy#
Every tool is classified at registration time into one of three mutation classes:
| Class | Symbol | Characteristics | Governance |
|---|---|---|---|
| Read-Only | R | No state changes; referentially transparent within data freshness window | No approval gates; freely cacheable; unlimited retries |
| Write (Reversible) | W_rev | State-changing with compensating action available | Approval gates configurable; idempotency required; audit logged |
| Write (Irreversible) | W_irr | State-changing with no compensating action (e.g., email sent, funds transferred) | Mandatory approval gate; dry-run mode required; enhanced audit |
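The classification and its governance consequences can be encoded directly in the registry; the following sketch is illustrative (the flag names are assumptions, not part of the chapter's schema):

```python
from enum import Enum

class MutationClass(Enum):
    READ_ONLY = "read_only"
    WRITE_REVERSIBLE = "write_reversible"
    WRITE_IRREVERSIBLE = "write_irreversible"

# Governance consequences per class, mirroring the table above.
GOVERNANCE = {
    MutationClass.READ_ONLY: {
        "approval_gate": False, "cacheable": True,
        "idempotency_required": False, "dry_run_required": False,
    },
    MutationClass.WRITE_REVERSIBLE: {
        "approval_gate": "configurable", "cacheable": False,
        "idempotency_required": True, "dry_run_required": False,
    },
    MutationClass.WRITE_IRREVERSIBLE: {
        "approval_gate": True, "cacheable": False,
        "idempotency_required": True, "dry_run_required": True,
    },
}

def governance_for(mutation_class: MutationClass) -> dict:
    return GOVERNANCE[mutation_class]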
13.9.2 Mutation Detection#
Mutation class is declared in the tool schema but verified at runtime through side-effect auditing:
Pseudo-Algorithm 13.9: Runtime Mutation Detection
PROCEDURE AuditSideEffects(tool, request, preState, postState, declaredClass):
detectedMutations ← DiffState(preState, postState)
IF declaredClass = READ_ONLY AND detectedMutations ≠ EMPTY THEN
EmitAlert(MUTATION_CLASS_VIOLATION, tool.id, "Declared READ_ONLY but mutations detected")
QuarantineTool(tool.id) // Immediate quarantine until human review
RETURN VIOLATION
IF declaredClass ∈ {WRITE_REVERSIBLE, WRITE_IRREVERSIBLE} AND detectedMutations = EMPTY THEN
EmitWarning(WRITE_TOOL_NO_MUTATION, tool.id, "Declared WRITE but no mutations observed")
// Non-critical — may indicate a no-op invocation
// Log all detected mutations for audit trail
FOR EACH mutation IN detectedMutations:
AuditLog.Record(
toolId=tool.id,
callerToken=request.callerToken,
timestamp=NOW(),
mutationType=mutation.type,
affectedResource=mutation.resource,
previousValue=mutation.before,
newValue=mutation.after,
idempotencyKey=request.idempotencyKey
)
RETURN CONSISTENT
13.9.3 Side-Effect Manifests#
Write tools MUST declare a side-effect manifest at registration time — a structured enumeration of all state changes the tool may produce:
The side-effect manifest enables:
- Static analysis: The orchestrator can predict which resources will be affected before invocation.
- Conflict detection: Parallel agent executions operating on overlapping resource scopes are detected and serialized.
- Compensating action derivation: The compensation engine uses the manifest to generate inverse operations.
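A side-effect manifest and the conflict-detection use case can be sketched as follows; the glob-prefix overlap test is a deliberately simplistic assumption standing in for a real resource-scope matcher:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SideEffect:
    resource_type: str   # e.g. "file", "db_row" (illustrative values)
    scope_pattern: str   # resource scope the tool may touch, e.g. "repo/src/*"
    reversible: bool

@dataclass
class SideEffectManifest:
    tool_id: str
    effects: list

def scopes_overlap(a: str, b: str) -> bool:
    # Simplistic prefix-based overlap check on glob-like scopes (assumption).
    a_prefix, b_prefix = a.rstrip("*"), b.rstrip("*")
    return a_prefix.startswith(b_prefix) or b_prefix.startswith(a_prefix)

def conflicts(m1: SideEffectManifest, m2: SideEffectManifest) -> bool:
    # Two parallel executions conflict if any declared effects touch
    # overlapping resource scopes of the same resource type.
    return any(
        e1.resource_type == e2.resource_type
        and scopes_overlap(e1.scope_pattern, e2.scope_pattern)
        for e1 in m1.effects
        for e2 in m2.effects
    )
```

When `conflicts` returns true, the orchestrator serializes the two executions instead of running them in parallel.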
13.10 Human-in-the-Loop Tool Governance#
13.10.1 Approval Gates for State-Changing Operations#
State-changing tool invocations in production agentic systems MUST pass through configurable approval gates. The approval gate policy is a function of tool mutation class, operation scope, caller trust level, and environmental risk:
where τ_min is the minimum trust level for auto-approval and σ_max is the scope threshold above which all mutations require approval.
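A minimal sketch of such a policy function, assuming numeric trust scores and scope sizes with hypothetical thresholds (the specific values are assumptions):

```python
def evaluate_approval_policy(mutation_class: str, caller_trust: float,
                             scope_size: int,
                             trust_min: float = 0.8,
                             scope_max: int = 100) -> str:
    """Return 'auto' or 'human_review' for a proposed invocation (sketch)."""
    if mutation_class == "read_only":
        return "auto"                  # read-only tools bypass approval gates
    if mutation_class == "write_irreversible":
        return "human_review"          # irreversible writes are always gated
    if scope_size > scope_max:
        return "human_review"          # wide blast radius is always gated
    if caller_trust >= trust_min:
        return "auto"
    return "human_review"
```

The ordering matters: irreversibility and blast radius override caller trust, so a highly trusted caller still cannot auto-approve an irreversible or wide-scope mutation.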
Pseudo-Algorithm 13.10: Approval Gate Execution
PROCEDURE ExecuteApprovalGate(request, tool, approvalPolicy):
decision ← EvaluateApprovalPolicy(tool, request.operation, request.callerToken, CURRENT_ENV)
IF decision = NOT_REQUIRED THEN
RETURN ApprovalResult(approved=TRUE, mechanism="auto")
// Create approval ticket
ticket ← ApprovalTicket(
id=GenerateTicketId(),
toolId=tool.id,
operation=request.operation,
params=RedactSensitiveFields(request.params, tool.inputSchema),
requestedBy=request.callerToken.identity,
requestedAt=NOW(),
expiresAt=NOW() + approvalPolicy.timeoutDuration,
status=PENDING,
escalationChain=approvalPolicy.escalationChain
)
// Notify approvers
NotifyApprovers(ticket, approvalPolicy.approverGroups)
// Store ticket and wait
ApprovalStore.Save(ticket)
RETURN ApprovalResult(
approved=FALSE,
mechanism="human_review",
ticketId=ticket.id,
pollEndpoint=BuildPollEndpoint(ticket.id),
expiresAt=ticket.expiresAt
)
13.10.2 Dry-Run / Preview Modes for Destructive Actions#
For tools classified as W_irr (irreversible write), a dry-run mode MUST be supported. In dry-run mode, the tool executes all validation, authorization, and planning logic but does not commit the mutation. Instead, it returns a preview envelope describing:
- The exact state changes that would occur
- The resources that would be affected
- Any preconditions that would fail
- The estimated cost or impact of the operation
The agent or human reviewer examines the preview envelope and explicitly authorizes the live execution, or the preview is used as input to the approval gate.
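The dry-run pattern can be sketched with a hypothetical irreversible tool; `delete_records` and its parameters are illustrative assumptions, not an API from the chapter:

```python
from dataclasses import dataclass

@dataclass
class PreviewEnvelope:
    state_changes: list
    affected_resources: list
    failed_preconditions: list
    estimated_impact: str

def delete_records(ids: list, *, dry_run: bool, store: dict) -> PreviewEnvelope:
    """Hypothetical irreversible tool supporting a dry-run mode."""
    missing = [i for i in ids if i not in store]
    preview = PreviewEnvelope(
        state_changes=[f"delete {i}" for i in ids if i in store],
        affected_resources=[i for i in ids if i in store],
        failed_preconditions=[f"{i} not found" for i in missing],
        estimated_impact=f"{len(ids) - len(missing)} records permanently removed",
    )
    if dry_run or missing:
        return preview       # validation and planning only; nothing committed
    for i in ids:
        del store[i]         # live execution after explicit authorization
    return preview
```

Both paths return the same envelope shape, so a reviewer compares the dry-run preview against the eventual live result directly.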
13.10.3 Approval Escalation Policies and Timeout-Based Auto-Deny#
Approval requests that are not acted upon within the configured timeout are auto-denied — never auto-approved. This is a fundamental safety invariant:

status(ticket) = AUTO_DENIED whenever NOW() > ticket.expiresAt and no explicit human decision was recorded.
Escalation chains define a sequence of approver groups with increasing authority. If the primary approver group does not respond within a fraction of the timeout (e.g., 50%), the request escalates to the next group:
Pseudo-Algorithm 13.11: Escalation and Timeout Logic
PROCEDURE MonitorApprovalTicket(ticket, escalationPolicy):
escalationIndex ← 0
WHILE NOW() < ticket.expiresAt:
// Check for resolution
currentStatus ← ApprovalStore.GetStatus(ticket.id)
IF currentStatus ∈ {APPROVED, DENIED} THEN
RETURN currentStatus
// Check for escalation
escalationThreshold ← ticket.createdAt + (escalationPolicy.escalationFraction * (ticket.expiresAt - ticket.createdAt)) * (escalationIndex + 1) / LENGTH(escalationPolicy.chain)
IF NOW() > escalationThreshold AND escalationIndex < LENGTH(escalationPolicy.chain) - 1 THEN
escalationIndex ← escalationIndex + 1
NotifyApprovers(ticket, escalationPolicy.chain[escalationIndex])
EmitEvent(APPROVAL_ESCALATED, ticket.id, escalationIndex)
Sleep(escalationPolicy.pollIntervalMs)
// Timeout reached — auto-deny
ApprovalStore.UpdateStatus(ticket.id, AUTO_DENIED)
EmitEvent(APPROVAL_AUTO_DENIED, ticket.id)
RETURN AUTO_DENIED
13.11 Caller-Scoped Authorization: Credential Propagation, Least Privilege, and Audit Trails#
13.11.1 Principle of Least Privilege for Agent Tool Access#
Agents MUST NOT operate with broad, ambient credentials. Instead, every tool invocation is authorized against the caller's identity and scopes — the human user or system principal that initiated the agentic task. This is the fundamental difference between "agent-owned credentials" (dangerous) and "caller-scoped authorization" (safe).
The effective authorization is the intersection of the caller's granted scopes and the tool's required scopes. If the intersection does not cover all required scopes, the invocation is denied.
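The intersection rule can be expressed directly with set operations; the return shape is an illustrative assumption:

```python
def authorize(caller_scopes: set, required_scopes: set) -> tuple:
    """Effective authorization = caller scopes ∩ tool's required scopes.
    Deny whenever the intersection does not cover every required scope."""
    effective = caller_scopes & required_scopes
    if effective != required_scopes:
        return ("DENY", required_scopes - caller_scopes)  # the missing scopes
    return ("ALLOW", effective)
```

Returning the missing scopes on denial gives the agent an actionable error instead of an opaque rejection.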
13.11.2 Credential Propagation Architecture#
Three-Tier Credential Model:
- User Credential (C_user): The original credential (OAuth token, API key, session token) of the human user who initiated the task. This credential is the root of trust.
- Agent Credential (C_agent): A derived, time-limited, scope-restricted credential issued to the agent for the duration of the task. Derived via delegation: C_agent = delegate(C_user, taskScopes, taskTTL), with scopes(C_agent) ⊆ scopes(C_user).
- Tool Invocation Credential (C_tool): A further-restricted credential for a specific tool invocation, derived from C_agent: C_tool = restrict(C_agent, tool.requiredScopes), with scopes(C_tool) ⊆ scopes(C_agent).

Invariants:

scopes(C_tool) ⊆ scopes(C_agent) ⊆ scopes(C_user), and each credential in the chain expires no later than its parent.
13.11.3 Audit Trail Requirements#
Every tool invocation generates an immutable audit record:
| Field | Content |
|---|---|
timestamp | ISO-8601 UTC timestamp of invocation |
traceId | Distributed trace correlation ID |
callerId | Identity of the originating caller (from C_user) |
agentId | Identity of the executing agent (from C_agent) |
toolId | Tool identifier and version |
operation | Specific operation within the tool |
inputHash | SHA-256 hash of canonical input (not raw input, for PII protection) |
outputSummary | Structured summary of output (not full output, for cost/privacy) |
mutationClass | Read / Write-Reversible / Write-Irreversible |
sideEffects | List of detected state changes (from side-effect auditing) |
authDecision | ALLOW / DENY / REQUIRES_APPROVAL |
approvalTicketId | If approval was required, the ticket ID |
latencyMs | Execution duration |
status | SUCCESS / ERROR / TIMEOUT |
errorCode | If failed, the classified error code |
Audit records are written to an append-only, tamper-evident log (write-once storage or cryptographically chained log). Retention policies are governed by organizational compliance requirements (typically 1–7 years).
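One standard construction of a tamper-evident log is a hash chain, where each record embeds the hash of its predecessor; the following is a minimal in-memory sketch of that technique, not a production storage design:

```python
import hashlib
import json

class ChainedAuditLog:
    """Append-only log; each record's hash covers the previous record's hash,
    so any after-the-fact modification breaks verification downstream."""

    GENESIS = "0" * 64

    def __init__(self):
        self._records = []

    def append(self, record: dict) -> None:
        prev_hash = self._records[-1]["hash"] if self._records else self.GENESIS
        body = json.dumps(record, sort_keys=True)
        digest = hashlib.sha256((prev_hash + body).encode()).hexdigest()
        self._records.append({"body": body, "prev": prev_hash, "hash": digest})

    def verify(self) -> bool:
        prev = self.GENESIS
        for rec in self._records:
            if rec["prev"] != prev:
                return False
            if hashlib.sha256((prev + rec["body"]).encode()).hexdigest() != rec["hash"]:
                return False
            prev = rec["hash"]
        return True
```

Editing any historical record invalidates every subsequent hash, which is what makes tampering evident rather than merely forbidden.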
13.12 Tool Versioning and Backward Compatibility: Schema Evolution, Deprecation Notices#
13.12.1 Semantic Versioning for Tool Contracts#
Tools follow strict semantic versioning:
| Version Component | Change Semantics |
|---|---|
| MAJOR | Breaking changes to input schema, output schema, or behavioral semantics. Requires agent-side adaptation. |
| MINOR | Backward-compatible additions (new optional input fields, new output fields, new capabilities). Existing agents function without modification. |
| PATCH | Bug fixes, performance improvements, documentation updates. No schema changes. |
13.12.2 Schema Evolution Rules#
Backward-compatible (MINOR) changes:
- Adding optional input fields with defaults
- Adding new fields to the output schema
- Relaxing input constraints (e.g., increasing maxLength)
- Adding new enum values (if the consumer handles unknown values gracefully)
Breaking (MAJOR) changes:
- Adding required input fields
- Removing or renaming existing fields (input or output)
- Changing field types
- Tightening input constraints
- Changing behavioral semantics (same input produces different output)
Formal Compatibility Check: a new version v' is backward compatible with v iff every input valid under v's input schema remains valid under v''s input schema, and every output v' can produce validates against v's output schema:

valid_in(v) ⊆ valid_in(v') ∧ outputs(v') ⊆ valid_out(v)
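A mechanical check over a deliberately simplified schema shape — `{'required': set, 'properties': {name: type}}`, an assumption standing in for full JSON Schema — can enforce the MINOR rules above in CI:

```python
def is_backward_compatible(old: dict, new: dict) -> bool:
    """MINOR-compatibility check over simplified input schemas (sketch).
    Schema shape assumed: {'required': set, 'properties': {name: type}}."""
    # Rule 1: the required set must not grow (no new required input fields).
    if not set(new["required"]) <= set(old["required"]):
        return False
    # Rule 2: every existing field must survive with an unchanged type
    # (no removals, renames, or type changes).
    for name, typ in old["properties"].items():
        if new["properties"].get(name) != typ:
            return False
    return True
```

New optional fields pass (rule 2 only inspects old fields), while renames, removals, type changes, and new required fields all fail.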
13.12.3 Deprecation Protocol#
Pseudo-Algorithm 13.12: Tool Deprecation Lifecycle
PROCEDURE DeprecateTool(toolId, oldVersion, newVersion, deprecationPolicy):
// Phase 1: Announce deprecation
toolRegistry.SetDeprecationNotice(toolId, oldVersion, DeprecationNotice(
deprecatedAt=NOW(),
removalDate=NOW() + deprecationPolicy.gracePeriod,
successor=ToolRef(toolId, newVersion),
migrationGuide=deprecationPolicy.migrationGuide
))
// Phase 2: Emit deprecation warnings on invocation
// (Handled by tool invocation lifecycle — deprecated tools return
// a `deprecation` field in the response metadata)
// Phase 3: Monitor migration progress
SCHEDULE PeriodicTask(interval=deprecationPolicy.reportInterval):
usageStats ← MetricsStore.GetToolUsage(toolId, oldVersion, window=7d)
IF usageStats.invocationCount = 0 THEN
EmitEvent(DEPRECATION_MIGRATION_COMPLETE, toolId, oldVersion)
ELSE
EmitReport(DEPRECATION_MIGRATION_PROGRESS, toolId, oldVersion, usageStats)
// Phase 4: Remove after grace period
SCHEDULE Task(at=NOW() + deprecationPolicy.gracePeriod):
finalUsage ← MetricsStore.GetToolUsage(toolId, oldVersion, window=24h)
IF finalUsage.invocationCount = 0 THEN
toolRegistry.Deregister(toolId, oldVersion)
EmitEvent(TOOL_VERSION_REMOVED, toolId, oldVersion)
ELSE
EmitAlert(DEPRECATION_REMOVAL_BLOCKED, toolId, oldVersion, finalUsage)
// Extend grace period or force-remove based on policy
13.12.4 Multi-Version Coexistence#
The prefill compiler and agent planner are version-aware. When multiple versions of a tool are registered:
- Prefer latest non-deprecated version by default.
- Pin to specific version when the task requires behavioral stability (e.g., reproducing a prior result).
- Reject deprecated versions when removalDate has passed.
13.13 Tool Observability: Invocation Traces, Success/Failure Rates, Latency Distributions, Cost Attribution#
13.13.1 Observability Architecture#
Tool observability is structured along the three pillars of production telemetry: traces, metrics, and logs. Each operates at a distinct granularity and serves a distinct consumer.
13.13.2 Distributed Traces#
Every tool invocation is a span within the broader agent execution trace. The trace structure is:
AgentTask (root span)
├── Plan (span)
├── Retrieve (span)
├── ToolInvocation: search_codebase (span)
│ ├── Validate (span)
│ ├── Authorize (span)
│ ├── Execute (span)
│ │ └── ExternalAPI: github_search (span)
│ ├── Verify (span)
│ └── Return (span)
├── ToolInvocation: read_file (span)
│ └── ...
├── Synthesize (span)
└── Verify (span)
Trace context (W3C traceparent header or equivalent) is propagated across all protocol boundaries (MCP, JSON-RPC, gRPC).
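Propagation of the W3C `traceparent` header — version, 16-byte trace id, 8-byte span id, flags — can be sketched as follows; the helper names are illustrative:

```python
import secrets

def make_traceparent(trace_id=None):
    """Build a W3C traceparent value: '00-<traceid>-<spanid>-01'."""
    trace_id = trace_id or secrets.token_hex(16)  # 16 bytes, hex-encoded
    span_id = secrets.token_hex(8)                # fresh span for this hop
    return f"00-{trace_id}-{span_id}-01", trace_id, span_id

def propagate(headers: dict, incoming_traceparent=None):
    """Reuse the incoming trace id (field 2 of the header) so tool-server
    spans join the agent's trace; mint a new span id for this hop."""
    trace_id = incoming_traceparent.split("-")[1] if incoming_traceparent else None
    value, trace_id, span_id = make_traceparent(trace_id)
    out = dict(headers)
    out["traceparent"] = value
    return out, trace_id, span_id
```

The invariant worth noting: the trace id is stable end to end across every protocol boundary, while each hop contributes its own span id.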
13.13.3 Metrics#
The following metrics are emitted at every tool invocation boundary:
| Metric | Type | Labels | Purpose |
|---|---|---|---|
tool.invocation.count | Counter | toolId, version, status, mutationClass | Invocation volume and success rate |
tool.invocation.latency_ms | Histogram | toolId, version, timeoutClass | Latency distribution (p50, p95, p99) |
tool.invocation.cost | Counter | toolId, version, costCategory | Cost attribution (compute, API calls, tokens) |
tool.invocation.retries | Counter | toolId, version | Retry frequency (indicates reliability issues) |
tool.invocation.timeout_rate | Gauge | toolId, version, timeoutClass | Fraction of invocations exceeding deadline |
tool.idempotency.replay_rate | Gauge | toolId | Fraction of invocations resolved via idempotency cache |
tool.approval.pending_count | Gauge | toolId, approverGroup | Pending approval queue depth |
tool.approval.resolution_time_ms | Histogram | toolId, resolution | Time to approval/denial |
tool.schema.validation_failure_rate | Gauge | toolId, phase (input/output) | Schema compliance rate |
13.13.4 Alerting Rules#
Derived alerting conditions from the metrics above include: sustained error rate above the tool's configured threshold (ToolDegraded), p99 latency exceeding twice the tool's baseline (LatencyRegression), and timeout rate above the tolerable threshold for the tool's timeout class (TimeoutClassMismatch). These conditions are evaluated continuously by the health monitor in §13.13.6.
13.13.5 Cost Attribution Model#
Tool cost attribution enables the agentic platform to account for resource consumption at the task level. The total cost of a tool invocation is the sum of its metered components:

C_invocation = C_compute + C_api + C_tokens

where each component is reported by the tool server or derived from infrastructure metering. Cost is aggregated upward from invocation to step to task:

C_task = Σ over all invocations in the task of C_invocation

This enables per-task cost budgets, anomaly detection on runaway agent loops, and cost-aware tool selection in the planner.
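The upward aggregation reduces to a grouped sum; the record shape below is an assumption for illustration:

```python
from collections import defaultdict

def aggregate_costs(invocations: list) -> dict:
    """Roll per-invocation cost components (compute, api, tokens, ...)
    up to per-task totals. Each record: {'task_id': str, 'costs': dict}."""
    by_task = defaultdict(float)
    for inv in invocations:
        by_task[inv["task_id"]] += sum(inv["costs"].values())
    return dict(by_task)
```

The same grouping keyed on (task_id, tool_id) instead of task_id alone yields the per-tool attribution the planner needs for cost-aware selection.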
13.13.6 Observability-Driven Feedback Loops#
Pseudo-Algorithm 13.13: Observability-Driven Tool Health Monitor
PROCEDURE MonitorToolHealth(toolRegistry, metricsStore, alertManager):
FOR EACH tool IN toolRegistry.GetAllActive():
window ← LAST_15_MINUTES
errorRate ← metricsStore.Query(
"rate(tool.invocation.count{toolId=$tool.id, status='error'}[$window])"
) / metricsStore.Query(
"rate(tool.invocation.count{toolId=$tool.id}[$window])"
)
p99Latency ← metricsStore.Query(
"histogram_quantile(0.99, tool.invocation.latency_ms{toolId=$tool.id}[$window])"
)
timeoutRate ← metricsStore.Query(
"tool.invocation.timeout_rate{toolId=$tool.id}[$window]"
)
// Update tool health status in registry
healthScore ← ComputeHealthScore(errorRate, p99Latency, timeoutRate, tool.timeoutClass)
toolRegistry.UpdateHealth(tool.id, healthScore)
// Trigger alerts
IF errorRate > tool.alertThresholds.errorRate THEN
alertManager.Fire(ToolDegraded(tool.id, errorRate))
IF p99Latency > 2 * tool.baselineLatency.p99 THEN
alertManager.Fire(LatencyRegression(tool.id, p99Latency))
IF timeoutRate > tool.alertThresholds.timeoutRate THEN
alertManager.Fire(TimeoutClassMismatch(tool.id, timeoutRate))
// Suggest reclassification
suggestedClass ← InferTimeoutClass(p99Latency)
IF suggestedClass ≠ tool.timeoutClass THEN
EmitRecommendation(RECLASSIFY_TIMEOUT, tool.id, tool.timeoutClass, suggestedClass)
FUNCTION ComputeHealthScore(errorRate, p99Latency, timeoutRate, timeoutClass):
// Normalized health score ∈ [0, 1], where 1 = perfect health
errorPenalty ← CLAMP(errorRate / MAX_TOLERABLE_ERROR_RATE, 0, 1)
latencyPenalty ← CLAMP(p99Latency / timeoutClass.maxDuration, 0, 1)
timeoutPenalty ← CLAMP(timeoutRate / MAX_TOLERABLE_TIMEOUT_RATE, 0, 1)
RETURN 1.0 - (0.5 * errorPenalty + 0.3 * latencyPenalty + 0.2 * timeoutPenalty)
13.14 Tool Testing: Unit Tests, Integration Tests, Chaos Tests, and Behavioral Contract Verification#
13.14.1 Testing Pyramid for Tool Infrastructure#
Tool testing follows a layered pyramid with increasing scope and decreasing execution frequency:
          ╱   Chaos Tests    ╲        ← Rare (weekly/pre-release)
         ╱────────────────────╲
        ╱    Contract Tests    ╲      ← Per-PR / Per-deploy
       ╱────────────────────────╲
      ╱    Integration Tests     ╲    ← Per-commit
     ╱────────────────────────────╲
    ╱         Unit Tests           ╲  ← Continuous (every build)
   ╱────────────────────────────────╲
13.14.2 Unit Tests#
Unit tests validate individual tool handler logic in isolation, with all external dependencies mocked or stubbed.
Coverage Requirements:
- Input validation: Every schema constraint is tested with valid, boundary, and invalid inputs.
- Core computation: The tool's primary logic is tested against known input-output pairs.
- Error handling: Every classified error code in the tool's declared error contract is produced by at least one test case.
- Idempotency: For stateful tools, duplicate invocations with the same idempotency key return identical results.
Pseudo-Algorithm 13.14: Unit Test Generation from Tool Schema
PROCEDURE GenerateUnitTests(tool):
testCases ← []
// Valid input tests
FOR EACH exampleInput IN tool.inputSchema.examples:
testCases.Add(TestCase(
name="valid_input_" + Hash(exampleInput),
input=exampleInput,
expectedStatus=SUCCESS,
validateOutput=LAMBDA(out): ValidateAgainstSchema(out, tool.outputSchema)
))
// Boundary tests
FOR EACH property IN tool.inputSchema.properties:
IF property.type = "string" AND property.maxLength IS DEFINED THEN
testCases.Add(TestCase(
name="boundary_maxlength_" + property.name,
input=GenerateInputWithPropertyValue(property.name, RandomString(property.maxLength)),
expectedStatus=SUCCESS
))
testCases.Add(TestCase(
name="exceed_maxlength_" + property.name,
input=GenerateInputWithPropertyValue(property.name, RandomString(property.maxLength + 1)),
expectedStatus=ERROR,
expectedErrorCode=INVALID_INPUT
))
IF property.type = "integer" AND property.minimum IS DEFINED THEN
testCases.Add(TestCase(
name="boundary_minimum_" + property.name,
input=GenerateInputWithPropertyValue(property.name, property.minimum),
expectedStatus=SUCCESS
))
testCases.Add(TestCase(
name="below_minimum_" + property.name,
input=GenerateInputWithPropertyValue(property.name, property.minimum - 1),
expectedStatus=ERROR,
expectedErrorCode=INVALID_INPUT
))
// Required field tests
FOR EACH requiredField IN tool.inputSchema.required:
testCases.Add(TestCase(
name="missing_required_" + requiredField,
input=GenerateInputWithoutField(requiredField),
expectedStatus=ERROR,
expectedErrorCode=INVALID_INPUT
))
// Additional properties test
testCases.Add(TestCase(
name="reject_additional_properties",
input=GenerateValidInput() ∪ {"unknownField": "hallucinated_value"},
expectedStatus=ERROR,
expectedErrorCode=INVALID_INPUT
))
// Idempotency test (for stateful tools)
IF tool.mutationClass ≠ READ_ONLY THEN
validInput ← GenerateValidInput()
idempotencyKey ← GenerateIdempotencyKey()
testCases.Add(TestCase(
name="idempotency_duplicate",
steps=[
Invoke(tool, validInput, idempotencyKey) → result1,
Invoke(tool, validInput, idempotencyKey) → result2,
Assert(result1.data = result2.data),
Assert(StateMutationCount() = 1) // Only one actual mutation
]
))
RETURN testCases
13.14.3 Integration Tests#
Integration tests validate end-to-end tool invocation through the full lifecycle (§13.6) against real or realistic dependencies.
Scope:
- Protocol integration: Verify that MCP discovery, JSON-RPC invocation, and gRPC execution paths produce correct results end to end.
- Authorization integration: Verify that caller-scoped credentials are correctly propagated and evaluated.
- Transaction integration: For stateful tools, verify that commit/rollback semantics work correctly under concurrent access.
- Timeout integration: Verify that timeout class enforcement correctly terminates operations and returns appropriate error envelopes.
- Observability integration: Verify that traces, metrics, and audit logs are correctly emitted for each invocation.
Pseudo-Algorithm 13.15: Integration Test for Full Invocation Lifecycle
PROCEDURE IntegrationTest_FullLifecycle(tool, testEnvironment):
// Setup: Provision test credentials with minimal scopes
testCredential ← testEnvironment.ProvisionCredential(scopes=tool.authContract.requiredScopes)
// Phase 1: Discovery
discoveredTools ← DiscoverTools(testEnvironment.mcpEndpoint, DEFAULT_CAPABILITIES)
Assert(tool.id IN discoveredTools.Map(t → t.id))
// Phase 2: Valid invocation
request ← BuildToolRequest(tool.id, tool.version, GenerateValidInput(),
GenerateIdempotencyKey(), testCredential)
response ← InvokeTool(request)
Assert(response.status = SUCCESS)
Assert(ValidateAgainstSchema(response.data, tool.outputSchema))
// Phase 3: Verify trace was emitted
trace ← testEnvironment.traceCollector.GetTrace(request.traceId)
Assert(trace.spans.Any(s → s.name = "tool.invoke" AND s.toolId = tool.id))
// Phase 4: Verify metrics were emitted
metrics ← testEnvironment.metricsCollector.GetRecent(toolId=tool.id)
Assert(metrics.invocationCount ≥ 1)
Assert(metrics.latencyMs > 0)
// Phase 5: Verify audit log was written
auditRecord ← testEnvironment.auditLog.GetLatest(toolId=tool.id, traceId=request.traceId)
Assert(auditRecord ≠ NULL)
Assert(auditRecord.callerId = testCredential.identity)
Assert(auditRecord.status = SUCCESS)
// Phase 6: Insufficient authorization
limitedCredential ← testEnvironment.ProvisionCredential(scopes=EMPTY_SET)
unauthorizedRequest ← BuildToolRequest(tool.id, tool.version, GenerateValidInput(),
GenerateIdempotencyKey(), limitedCredential)
unauthorizedResponse ← InvokeTool(unauthorizedRequest)
Assert(unauthorizedResponse.status = ERROR)
Assert(unauthorizedResponse.errorCode = AUTHORIZATION_DENIED)
// Phase 7: Timeout enforcement
IF tool.timeoutClass = INTERACTIVE THEN
slowRequest ← BuildToolRequest(tool.id, tool.version,
GenerateSlowInput(targetLatency=2*tool.timeoutClass.maxDuration),
GenerateIdempotencyKey(), testCredential)
slowResponse ← InvokeTool(slowRequest)
Assert(slowResponse.errorCode = DEADLINE_EXCEEDED)
13.14.4 Chaos Tests#
Chaos tests validate tool resilience under adverse conditions. These are executed in isolated environments and simulate:
| Chaos Scenario | Injection Method | Expected Behavior |
|---|---|---|
| Network partition | Drop/delay packets between agent and tool server | Retry with backoff; return UPSTREAM_FAILURE after budget exhaustion |
| Dependency failure | Kill upstream service the tool depends on | Return UPSTREAM_FAILURE with retryable flag; no partial mutations |
| Resource exhaustion | Limit CPU/memory available to tool server | Return RESOURCE_EXHAUSTED; no crash, no data corruption |
| Clock skew | Shift system clock on tool server | Idempotency windows still function correctly; no duplicate mutations |
| Slow responses | Inject latency into tool execution | Timeout class enforcement triggers; agent does not hang indefinitely |
| Concurrent mutations | Submit conflicting mutations simultaneously | Optimistic concurrency control detects conflict; return CONFLICT |
| Partial failure in composite | Fail one step of a composite tool | Compensating actions execute for completed steps; aggregate failure returned |
Pseudo-Algorithm 13.16: Chaos Test — Concurrent Mutation Conflict
PROCEDURE ChaosTest_ConcurrentMutationConflict(tool, testEnvironment):
// Setup: Create a shared resource
resource ← testEnvironment.CreateTestResource()
// Attempt two concurrent mutations on the same resource
key1 ← GenerateIdempotencyKey()
key2 ← GenerateIdempotencyKey()
request1 ← BuildToolRequest(tool.id, tool.version,
{resourceId: resource.id, value: "A"}, key1, testCredential)
request2 ← BuildToolRequest(tool.id, tool.version,
{resourceId: resource.id, value: "B"}, key2, testCredential)
// Execute concurrently
(response1, response2) ← ExecuteConcurrently(InvokeTool(request1), InvokeTool(request2))
// Exactly one should succeed, the other should receive CONFLICT or succeed after retry
successCount ← COUNT(r IN [response1, response2] WHERE r.status = SUCCESS)
conflictCount ← COUNT(r IN [response1, response2] WHERE r.errorCode = CONFLICT)
Assert(successCount ≥ 1)
Assert(successCount + conflictCount = 2)
// Verify final state is consistent (reflects one of the two values, not corrupted)
finalState ← testEnvironment.ReadResource(resource.id)
Assert(finalState.value ∈ {"A", "B"})
13.14.5 Behavioral Contract Verification#
Beyond schema compliance, tools must satisfy behavioral contracts — invariants about the relationship between inputs, outputs, and state changes that hold across all valid invocations. Behavioral contracts are formalized as properties and verified through property-based testing (generative testing).
Contract Categories:
- Determinism contract (for stateless tools): invoking the tool with identical input x at any two times within the data freshness window returns identical output.
- Idempotency contract (for stateful tools): invoke^n(s, p, k) = invoke(s, p, k) for any repetition count n ≥ 1 and idempotency key k.
- Monotonicity contract (for append-only tools): the post-invocation state contains the pre-invocation state: S_pre ⊆ S_post.
- Conservation contract (for transfer operations): the conserved quantity is unchanged in aggregate: balance(src) + balance(dst) is identical before and after the transfer.
- Reversibility contract (for reversible write tools): executing the derived compensating action restores the pre-invocation state: compensate(invoke(s, p)) = s.
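A minimal executable sketch of one such property check (idempotency) follows; the `tool(state, params, key)` call shape and generator protocol are assumptions for illustration, standing in for a full property-based framework:

```python
import random

def check_idempotency_contract(tool, gen_input, iterations=100, seed=0):
    """Generative check: invoking `tool` twice with the same idempotency key
    must yield the same result and leave the same final state as invoking once."""
    rng = random.Random(seed)
    for _ in range(iterations):
        params = gen_input(rng)
        key = f"k{rng.getrandbits(64):x}"
        state_a, state_b = {}, {}
        first = tool(state_a, params, key)
        replay = tool(state_a, params, key)   # duplicate delivery, same key
        single = tool(state_b, params, key)   # reference: exactly-once run
        if first != replay or first != single or state_a != state_b:
            return ("VIOLATION", params)      # a real framework would shrink this
    return ("PASSED", None)
```

A production framework additionally shrinks the failing input to a minimal counterexample; here the raw failing input is simply reported.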
Pseudo-Algorithm 13.17: Property-Based Behavioral Contract Verification
PROCEDURE VerifyBehavioralContracts(tool, contractSpec, generatorConfig):
generator ← PropertyBasedGenerator(tool.inputSchema, generatorConfig)
failures ← []
FOR iteration IN 1..generatorConfig.maxIterations:
randomInput ← generator.Generate()
// Contract 1: Schema compliance
response ← InvokeTool(BuildRequest(tool, randomInput))
IF response.status = SUCCESS THEN
schemaValid ← ValidateAgainstSchema(response.data, tool.outputSchema)
IF NOT schemaValid THEN
failures.Add(ContractViolation("output_schema", randomInput, response.data))
// Contract 2: Determinism (stateless tools)
IF tool.mutationClass = READ_ONLY AND contractSpec.determinism THEN
response2 ← InvokeTool(BuildRequest(tool, randomInput))
IF response.status = SUCCESS AND response2.status = SUCCESS THEN
IF response.data ≠ response2.data THEN
failures.Add(ContractViolation("determinism", randomInput,
{first: response.data, second: response2.data}))
// Contract 3: Idempotency (stateful tools)
IF tool.mutationClass ≠ READ_ONLY AND contractSpec.idempotency THEN
key ← GenerateIdempotencyKey()
r1 ← InvokeTool(BuildRequest(tool, randomInput, key))
r2 ← InvokeTool(BuildRequest(tool, randomInput, key))
IF r1.status = SUCCESS AND r2.status = SUCCESS THEN
IF r1.data ≠ r2.data THEN
failures.Add(ContractViolation("idempotency", randomInput, {r1: r1.data, r2: r2.data}))
// Contract 4: Reversibility (reversible write tools)
IF tool.mutationClass = WRITE_REVERSIBLE AND contractSpec.reversibility THEN
preState ← CaptureState()
r ← InvokeTool(BuildRequest(tool, randomInput))
IF r.status = SUCCESS THEN
compensation ← tool.DeriveCompensation(randomInput, r.data)
ExecuteCompensation(compensation)
postState ← CaptureState()
IF preState ≠ postState THEN
failures.Add(ContractViolation("reversibility", randomInput,
{pre: preState, post: postState}))
// Report
IF failures ≠ EMPTY THEN
EmitTestFailure(BEHAVIORAL_CONTRACT_VIOLATION, tool.id, failures)
RETURN TEST_FAILED
RETURN TEST_PASSED
13.14.6 Continuous Contract Enforcement in CI/CD#
Tool tests are integrated into the CI/CD pipeline as mandatory quality gates, following the pyramid in §13.14.1: unit tests run on every build, integration tests on every commit, and behavioral contract tests on every PR and deploy; a failure at any gate blocks the corresponding promotion.
Chaos tests are executed on a periodic schedule (weekly or pre-release) and do not block individual deployments but may block release promotions.
Regression Detection:
When a tool's behavioral contract test fails after a code change, the CI system:
- Identifies the minimal failing input (via shrinking in the property-based test framework).
- Records the failure as a regression test case (permanently added to the unit test suite).
- Blocks deployment until the regression is resolved.
- Notifies the tool owner and the agentic platform team.
Chapter 13 Summary: Architectural Invariants#
The following invariants define the non-negotiable properties of a production-grade tool architecture for agentic AI systems:
| # | Invariant | Enforcement Mechanism |
|---|---|---|
| 1 | Every tool has a typed, versioned, schema-validated contract | Admission predicate at registration |
| 2 | Every invocation traverses the full six-phase lifecycle | Protocol enforcement in tool server framework |
| 3 | Every mutating tool is idempotent with caller-generated keys | Idempotency store + deduplication window |
| 4 | Authorization is caller-scoped, never agent-ambient | Credential delegation chain + policy evaluation |
| 5 | Irreversible mutations require human approval or dry-run preview | Approval gates + timeout-based auto-deny |
| 6 | Tool schemas are lazily loaded under explicit token budgets | Prefill compiler with knapsack-optimized selection |
| 7 | Every invocation is traced, metered, and audit-logged | Observability infrastructure at every boundary |
| 8 | Schema evolution follows semantic versioning with backward-compatibility verification | Automated compatibility checks in CI |
| 9 | Side effects are declared, detected, and audited | Side-effect manifests + runtime mutation detection |
| 10 | Behavioral contracts are verified through property-based testing in CI/CD | Mandatory deploy gates |
These invariants collectively ensure that tool infrastructure in agentic systems operates with the same reliability, governance, and observability expectations as any production-grade distributed system — because that is precisely what it is.
End of Chapter 13.