Preamble#
Tool use elevates a language model from a stateless text generator into an actuating agent capable of observing, mutating, and reasoning over external state. Yet the difference between a demonstration-grade tool-calling agent and a production-grade agentic system lies entirely in the composition discipline: how tools are sequenced, how their outputs are routed into downstream reasoning, how failures are absorbed, how transactions maintain consistency, and how the tool surface itself evolves under agent-driven creation and community governance. This chapter provides the complete engineering treatment of these advanced patterns — formalized mathematically, specified as typed protocols, and rendered as bounded pseudo-algorithms suitable for implementation in enterprise-scale agentic platforms.
Throughout, we adopt the following foundational abstraction. A tool is a typed function:
$\tau : \mathcal{X} \to \mathcal{Y} \cup \mathcal{E}$
where $\mathcal{X}$ is the validated input schema, $\mathcal{Y}$ is the structured output schema, and $\mathcal{E}$ denotes failure (with typed error class). A tool invocation is a tuple $(\mathrm{id}, \tau, x, d, a, t)$ submitted through a protocol boundary (MCP, JSON-RPC, or gRPC), where $d$ is an explicit deadline, $a$ a caller-scoped authorization, and $t$ a distributed trace identity.
14.1 Tool Chaining: Sequential, Conditional, and Parallel Composition Patterns#
14.1.1 Foundational Definitions#
A tool chain is a directed acyclic graph (DAG) of tool invocations where edges encode data dependencies and control-flow predicates. We distinguish three canonical composition patterns:
| Pattern | Structure | Data Flow | Concurrency |
|---|---|---|---|
| Sequential | Linear pipeline | Output of $\tau_i$ feeds input of $\tau_{i+1}$ | None |
| Conditional | Branch node selects among subchains | Predicate determines successor | None (selection) |
| Parallel | Fan-out set with join barrier | Independent inputs; outputs merged at barrier | Full within fan-out |
Formally, a chain is a tuple $C = (V, E, \kappa, \Phi)$ where:
- $V = \{\tau_1, \dots, \tau_n\}$ is the tool vertex set.
- $E \subseteq V \times V$ is the dependency edge set.
- $\kappa : V \to \{\text{seq}, \text{cond}, \text{par}\}$ assigns a composition type.
- $\Phi = \{\varphi_{ij} : (\tau_i, \tau_j) \in E\}$ is the set of junction transforms mapping source outputs to destination inputs.
14.1.2 Sequential Composition#
Sequential chaining is the simplest and most common pattern. Each tool $\tau_i$ produces output $y_i$, which is transformed by the junction $\varphi_i$ into the input for $\tau_{i+1}$:
$x_{i+1} = \varphi_i(y_i), \qquad y_{i+1} = \tau_{i+1}(x_{i+1})$
The total latency of a sequential chain is:
$L_{\text{seq}} = \sum_{i=1}^{n} \big( \ell(\tau_i) + \ell(\varphi_i) \big)$
where $\ell(\tau_i)$ is tool execution latency and $\ell(\varphi_i)$ is junction transform latency (typically negligible for schema mappings, non-trivial for LLM-mediated transforms).
Pseudo-Algorithm 14.1: Sequential Chain Executor
PROCEDURE ExecuteSequentialChain(chain: List<ToolNode>, initial_input, trace_id, deadline):
current_input ← initial_input
results ← []
remaining_budget ← deadline - now()
FOR i ← 0 TO len(chain) - 1:
node ← chain[i]
tool ← node.tool
junction ← node.junction_transform
// Deadline subdivision
estimated_remaining_latency ← SUM(chain[j].estimated_latency FOR j IN [i..len(chain)-1])
local_deadline ← now() + (remaining_budget * chain[i].estimated_latency / estimated_remaining_latency)
// Schema validation on input
validated_input ← ValidateSchema(current_input, tool.input_schema)
IF validated_input IS SchemaError:
RETURN ChainFailure(step=i, error=validated_input, partial_results=results)
// Invoke with deadline, auth, trace
invocation ← ToolInvocation(
id=GenerateUUID(),
tool=tool,
input=validated_input,
deadline=local_deadline,
auth=ScopedAuth(tool.required_permissions),
trace_id=trace_id
)
result ← InvokeTool(invocation)
IF result IS Failure:
RETURN ChainFailure(step=i, error=result.error, partial_results=results)
// Validate output schema
validated_output ← ValidateSchema(result.output, tool.output_schema)
IF validated_output IS SchemaError:
RETURN ChainFailure(step=i, error="output_schema_violation", partial_results=results)
results.APPEND(validated_output)
// Apply junction transform for next step
IF i < len(chain) - 1:
current_input ← junction(validated_output)
remaining_budget ← deadline - now()
IF remaining_budget <= 0:
RETURN ChainFailure(step=i, error="deadline_exceeded", partial_results=results)
RETURN ChainSuccess(results=results, final_output=results[-1])
14.1.3 Conditional Composition#
Conditional branching introduces a predicate function $p : \mathcal{Y} \to B$ that maps the output of a predecessor tool to a branch selector. Each branch $b \in B$ identifies a distinct subchain $C_b$.
Predicates may be:
- Deterministic: schema field comparison, threshold evaluation, enum matching.
- LLM-mediated: the agent reasons over $y$ and selects the branch. This introduces non-determinism and must be bounded by a selection confidence threshold $\theta$:
$b^{*} = \arg\max_{b \in B} \Pr(b \mid y), \qquad \text{accept iff } \Pr(b^{*} \mid y) \geq \theta$
If no branch exceeds $\theta$, the chain enters a disambiguation subroutine (human escalation, additional retrieval, or retry with enriched context).
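A minimal sketch of the confidence gate, assuming branch scores have already been obtained from the model (branch names and the threshold value are illustrative):

```python
from typing import Optional

def select_branch(branch_scores: dict[str, float], theta: float) -> Optional[str]:
    """Return the highest-confidence branch, or None to trigger disambiguation."""
    best_branch, best_score = max(branch_scores.items(), key=lambda kv: kv[1])
    if best_score >= theta:
        return best_branch
    return None  # caller escalates: human review, extra retrieval, or enriched retry
```

A `None` result is the signal that the chain must pause and disambiguate rather than guess.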
14.1.4 Parallel Composition#
Parallel fan-out distributes independent tool invocations across concurrent execution slots. A join barrier collects outputs and merges them before the chain continues.
Let the fan-out set be $P = \{\tau_1, \dots, \tau_k\}$. Each tool $\tau_j$ receives an independently constructed input:
$x_j = \varphi_j(y_{\text{pred}})$
The join barrier produces:
$y_{\text{join}} = \mu(y_1, \dots, y_k)$
The latency of the parallel segment is:
$L_{\text{par}} = \max_{j} \ell(\tau_j) + \ell(\mu)$
Join barrier strategies:
- All-success: Wait for all results. Fail if any fails.
- Quorum: Wait for successes where is the quorum fraction.
- First-success: Return the first successful result; cancel remaining.
- Best-of-N: Wait for all, then select the highest-quality result by a scoring function .
Pseudo-Algorithm 14.2: Parallel Fan-Out with Quorum Join
PROCEDURE ExecuteParallelFanOut(fan_out_nodes: List<ToolNode>, predecessor_output, quorum_fraction, deadline, trace_id):
invocations ← []
FOR node IN fan_out_nodes:
input_j ← node.fan_out_transform(predecessor_output)
validated ← ValidateSchema(input_j, node.tool.input_schema)
inv ← ToolInvocation(id=UUID(), tool=node.tool, input=validated, deadline=deadline, trace_id=trace_id)
invocations.APPEND(inv)
// Dispatch all concurrently
futures ← DispatchConcurrent(invocations)
required_successes ← CEIL(quorum_fraction * len(futures))
successes ← []
failures ← []
WHILE len(successes) < required_successes AND (len(successes) + len(futures)) >= required_successes:
completed ← AwaitAny(futures, timeout=deadline - now())
IF completed IS None:
BREAK // Deadline
IF completed.result IS Success:
successes.APPEND(completed)
ELSE:
failures.APPEND(completed)
futures.REMOVE(completed)
IF len(successes) >= required_successes:
CancelRemaining(futures)
merged ← JoinBarrierMerge(successes)
RETURN ParallelSuccess(merged=merged, individual=successes, failures=failures)
ELSE:
CancelRemaining(futures)
RETURN ParallelFailure(successes=successes, failures=failures, quorum_not_met=TRUE)
14.1.5 Composition Algebra#
Tool chains compose algebraically. Define operators:
- Sequence: $\tau_a \triangleright_{\varphi} \tau_b$ — execute $\tau_a$ then $\tau_b$, routing output through junction $\varphi$.
- Branch: $p \,?\, (C_1 \mid C_2 \mid \cdots \mid C_m)$ — conditional dispatch on predicate $p$.
- Parallel: $(\tau_1 \parallel \cdots \parallel \tau_k) \triangleright \mu$ — fan-out, join via $\mu$, continue.
Any composition of these operators produces a valid DAG. The chain compiler validates:
- Type compatibility: for every edge $(\tau_a, \tau_b) \in E$, the junction $\varphi_{ab}$ must map $\mathcal{Y}_a$ into $\mathcal{X}_b$.
- Acyclicity: No directed cycle exists (bounded recursion is modeled as iteration with explicit depth counters).
- Deadline feasibility: $\sum_{\tau_i \in \text{critical path}} \hat{\ell}(\tau_i) \leq d$, where $\hat{\ell}$ is estimated latency.
14.2 Tool Output Routing: Feeding Tool Results as Context to Subsequent Reasoning Steps#
14.2.1 The Output Routing Problem#
When a tool returns output $y$, the agentic system must decide where and how that output enters the reasoning pipeline. The output may serve as:
- Direct input to the next tool (junction transform, §14.1).
- Context injection into the LLM's next reasoning step.
- Memory write to working, session, or episodic memory.
- Observation record for verification or audit.
- Discard if the output has been fully consumed by an earlier transform.
This routing decision is itself a function:
$\rho : \mathcal{Y} \times S \to 2^{D}$
where $S$ is the current agent state, $D$ is the set of destinations, and $2^{D}$ denotes the power set (multiple destinations are common).
14.2.2 Context Injection Strategies#
When tool output is routed into the LLM context window, the key trade-off is information density vs. token cost. Raw tool outputs (e.g., a full database result set) may consume thousands of tokens while contributing marginal reasoning value.
Compression hierarchy for tool outputs entering context:
| Level | Method | Compression Ratio | Information Loss |
|---|---|---|---|
| L0 | Raw output verbatim | 1× | None |
| L1 | Schema-filtered projection (select relevant fields) | Low | Low (structural) |
| L2 | Summarization by secondary LLM call | Moderate | Moderate (semantic) |
| L3 | Key-value extraction (facts only) | High | Moderate–High |
| L4 | Boolean/scalar signal ("found"/"not found", count) | Very high | High |
The optimal level is selected by:
$\ell^{*} = \arg\min_{\ell} \big( w_{\text{cost}} \cdot \text{tokens}(\ell) + w_{\text{loss}} \cdot \text{loss}(\ell) \big)$
where $w_{\text{cost}}, w_{\text{loss}}$ are task-specific weights. For high-stakes reasoning (medical, financial), $w_{\text{loss}}$ dominates; for high-throughput batch processing, $w_{\text{cost}}$ dominates.
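A sketch of the weighted selection, with illustrative (not empirical) per-level cost and loss scores:

```python
# Illustrative normalized token cost and information-loss scores per level.
LEVELS = {
    "L0": {"cost": 1.00, "loss": 0.00},  # raw output verbatim
    "L1": {"cost": 0.40, "loss": 0.10},  # schema-filtered projection
    "L2": {"cost": 0.15, "loss": 0.35},  # secondary-LLM summary
    "L3": {"cost": 0.05, "loss": 0.55},  # key-value extraction
    "L4": {"cost": 0.01, "loss": 0.90},  # boolean/scalar signal
}

def select_compression_level(w_cost: float, w_loss: float) -> str:
    """Pick the level minimizing the weighted sum of token cost and loss."""
    return min(LEVELS, key=lambda lv: w_cost * LEVELS[lv]["cost"]
                                      + w_loss * LEVELS[lv]["loss"])
```

With loss-dominant weights the selector stays at L0; with cost-dominant weights it collapses to the scalar signal.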
14.2.3 Structured Output Placement in the Context Window#
Tool outputs should be placed in the context window using structured delimiters and provenance tags:
<tool_result tool_id="τ_3" invocation_id="inv-8a2f" timestamp="2025-01-15T10:32:00Z"
source="database/orders" freshness="live" confidence="verified">
{ "order_count": 1247, "total_revenue": 89340.50, "currency": "USD" }
</tool_result>
This enables the LLM to:
- Distinguish tool-provided facts from its own prior knowledge (hallucination control).
- Attribute claims to specific tool invocations (provenance).
- Assess freshness and confidence for downstream reasoning.
14.2.4 Output Routing Decision Algorithm#
Pseudo-Algorithm 14.3: Tool Output Router
PROCEDURE RouteToolOutput(output, tool_metadata, agent_state, chain_plan, token_budget):
destinations ← {}
// 1. Check if next tool in chain needs this output
IF chain_plan.HasSuccessor(tool_metadata.step_id):
successor ← chain_plan.GetSuccessor(tool_metadata.step_id)
junction ← chain_plan.GetJunction(tool_metadata.step_id, successor.step_id)
transformed ← junction.Transform(output)
destinations.ADD("tool_input", transformed)
// 2. Determine context injection need
IF agent_state.RequiresReasoningOverOutput(tool_metadata):
compression_level ← SelectCompressionLevel(
output_size=TokenCount(output),
remaining_budget=token_budget.remaining,
task_criticality=agent_state.task.criticality,
output_schema=tool_metadata.output_schema
)
compressed ← CompressOutput(output, compression_level, tool_metadata)
provenance_tagged ← AttachProvenance(compressed, tool_metadata)
destinations.ADD("context", provenance_tagged)
// 3. Memory write evaluation
IF MemoryPolicy.ShouldPersist(output, tool_metadata, agent_state):
memory_record ← ConstructMemoryRecord(
content=ExtractPersistableContent(output),
provenance=tool_metadata,
expiry=MemoryPolicy.ComputeExpiry(tool_metadata),
layer=MemoryPolicy.SelectLayer(output, agent_state)
)
IF NOT MemoryStore.IsDuplicate(memory_record):
destinations.ADD("memory", memory_record)
// 4. Observation log (always, for audit)
destinations.ADD("observation", ObservationRecord(output, tool_metadata, timestamp=now()))
RETURN destinations
14.2.5 Token Budget Accounting#
Every tool output routed into context must be deducted from the active token budget $B$. Let $W$ be the total context window capacity. Then:
$B = W - \big( T_{\text{sys}} + T_{\text{hist}} + T_{\text{mem}} + T_{\text{gen}} \big)$
where $T_{\text{sys}}$ is the system prompt / role policy cost, $T_{\text{hist}}$ is conversation history, $T_{\text{mem}}$ is injected memory summaries, and $T_{\text{gen}}$ is the minimum generation budget (typically 2048–4096 tokens). A tool output of size $T_{\text{out}}$ tokens is admitted into context only if:
$T_{\text{out}} \leq B$
If $T_{\text{out}}$ exceeds this bound, the output is either compressed at a more aggressive level or offloaded to external memory with a pointer summary injected into context.
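The admission check is a direct transcription of the budget identity; the window and component sizes in the comment are hypothetical:

```python
def admit_tool_output(window: int, t_sys: int, t_hist: int, t_mem: int,
                      t_gen: int, t_out: int) -> bool:
    """Admit a tool output into context only if it fits the remaining budget."""
    budget = window - (t_sys + t_hist + t_mem + t_gen)
    return t_out <= budget

# e.g., a 128k window with 2k system prompt, 40k history, 4k memory, and a
# 4k generation reserve leaves a 78k admission budget.
```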
14.3 Tool Selection Strategies: LLM-Driven, Rule-Based, Policy-Gated, and Learned Tool Routing#
14.3.1 The Tool Selection Problem#
Given a set of available tools $\mathcal{T} = \{\tau_1, \dots, \tau_N\}$ and a current agent state $s$, the tool selection problem is to choose a subset $T^{*} \subseteq \mathcal{T}$ (possibly a singleton) and construct the invocation parameters for each selected tool:
$\pi(s) = \big( T^{*}, \{ x_\tau \}_{\tau \in T^{*}} \big)$
This is a policy $\pi$ mapping states to tool invocations. The quality of the policy is evaluated over a distribution of tasks:
$J(\pi) = \mathbb{E}\left[ \sum_{t=0}^{H} \gamma^{t} r_t \right]$
where $r_t$ is the reward at step $t$, $\gamma$ is the discount factor, and $H$ is the episode horizon.
14.3.2 Strategy Taxonomy#
A. LLM-Driven Selection (Native Function Calling)#
The LLM receives tool descriptions in its context and generates a structured tool call as part of its output.
Advantages: Flexible, handles novel tool combinations, benefits from world knowledge. Risks: Hallucinated tool names, invalid parameters, suboptimal selection under large $N$.
Token cost scaling: If each tool schema consumes $k$ tokens, the total schema injection cost is $O(Nk)$. For large $N$, this can consume a significant fraction of the context window.
Mitigation — Lazy Tool Loading: Only inject schemas for tools relevant to the current task phase, determined by a lightweight pre-classifier:
$\mathcal{T}_{\text{active}} = \{ \tau \in \mathcal{T} : \text{rel}(\tau, s) \geq \theta_{\text{rel}} \}$
where $\text{rel}$ may be computed by embedding similarity between the task description and tool descriptions, or by a trained classifier.
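A sketch of lazy tool loading via embedding similarity; the two-dimensional vectors are toy stand-ins for real description embeddings:

```python
import math

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def lazy_tool_set(task_vec, tool_vecs, theta_rel):
    """Inject schemas only for tools whose description embedding clears theta_rel."""
    return [name for name, vec in tool_vecs.items()
            if cosine(task_vec, vec) >= theta_rel]
```

Only the surviving tools' schemas are injected into context, bounding the $O(Nk)$ cost.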
B. Rule-Based Selection#
Deterministic rules map state patterns to tool choices of the form "if the state matches pattern $p_i$, invoke $\tau_i$ with templated parameters."
Advantages: Deterministic, auditable, zero LLM cost for selection. Risks: Brittle under novel tasks, requires manual maintenance as tool set evolves.
C. Policy-Gated Selection#
A policy layer intercepts LLM-proposed tool calls and applies authorization, safety, and cost constraints before execution:
$g(\tau, x, s) \in \{\text{ALLOW}, \text{REQUIRE\_APPROVAL}, \text{DENY}\}$
The gate enforces:
- Authorization: Caller-scoped permissions, not agent-owned credentials.
- Safety: Input sanitization, dangerous operation detection (e.g., `DROP TABLE`).
- Budget: Cost ceiling per invocation and cumulative per session.
- Rate limiting: Per-tool invocation rate within sliding windows.
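A condensed gate sketch covering three of the four checks; the tool names, dangerous-keyword list, and cost ceiling are illustrative, and rate limiting is omitted for brevity:

```python
from dataclasses import dataclass

DANGEROUS = ("DROP TABLE", "DELETE FROM", "TRUNCATE")

@dataclass
class PolicyGate:
    allowed_tools: set
    cost_ceiling: float
    session_spend: float = 0.0

    def evaluate(self, tool: str, params: dict, cost: float) -> str:
        if tool not in self.allowed_tools:
            return "DENY"                        # authorization
        if any(kw in str(params).upper() for kw in DANGEROUS):
            return "REQUIRE_APPROVAL"            # safety: dangerous operation
        if self.session_spend + cost > self.cost_ceiling:
            return "DENY"                        # cumulative session budget
        self.session_spend += cost
        return "ALLOW"
```

The three return values mirror the gate outcomes consumed by the hybrid selector (Algorithm 14.4).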
D. Learned Tool Routing#
A trained router model maps task embeddings to tool selection distributions:
$f_{\theta} : \mathbb{R}^{d} \to \Delta^{N}, \qquad P(\tau \mid s) = f_{\theta}(\mathrm{embed}(s))$
Training data is derived from successful traces:
$\mathcal{D} = \{ (\mathrm{embed}(s_i), \tau_i^{*}) \}_{i=1}^{M}$
The loss function is cross-entropy:
$\mathcal{L}(\theta) = -\sum_{i} \log f_{\theta}(\mathrm{embed}(s_i))[\tau_i^{*}]$
Advantages: Adapts to usage patterns, fast inference, low token cost. Risks: Cold-start problem, distribution shift as tools are added/removed.
14.3.3 Hybrid Selection Architecture#
Production systems combine all four strategies in a layered architecture:
User Query → Intent Classifier → Learned Router (candidate set)
↓
Rule Override (domain-specific hardcoded rules)
↓
LLM Selection (from candidate set)
↓
Policy Gate (auth, safety, budget)
↓
Approved Invocation → Executor
Pseudo-Algorithm 14.4: Hybrid Tool Selector
PROCEDURE SelectTool(agent_state, available_tools, token_budget, policy):
// Phase 1: Learned routing (fast pre-filter)
relevance_scores ← LearnedRouter.Score(agent_state.task_embedding, available_tools)
candidate_set ← TopK(available_tools, relevance_scores, k=min(10, len(available_tools)))
// Phase 2: Rule-based override
rule_forced ← RuleEngine.Evaluate(agent_state)
IF rule_forced IS NOT NULL:
candidate_set ← {rule_forced} ∪ candidate_set[:2] // Keep rule choice primary
// Phase 3: LLM-driven selection from candidate set
tool_schemas ← [tool.schema FOR tool IN candidate_set]
schema_tokens ← SUM(TokenCount(s) FOR s IN tool_schemas)
IF schema_tokens > token_budget.tool_schema_allocation:
candidate_set ← candidate_set[:FLOOR(token_budget.tool_schema_allocation / avg_schema_tokens)]
tool_schemas ← [tool.schema FOR tool IN candidate_set]
llm_selection ← LLM.SelectTool(
context=agent_state.context,
tool_schemas=tool_schemas,
instruction="Select the most appropriate tool and construct valid parameters."
)
// Phase 4: Policy gate
gate_result ← policy.Evaluate(llm_selection.tool, llm_selection.params, agent_state)
SWITCH gate_result:
CASE ALLOW:
RETURN ApprovedInvocation(llm_selection.tool, llm_selection.params)
CASE REQUIRE_APPROVAL:
approval ← RequestHumanApproval(llm_selection, timeout=policy.approval_timeout)
IF approval.granted:
RETURN ApprovedInvocation(llm_selection.tool, llm_selection.params)
ELSE:
RETURN Denied(reason=approval.reason)
CASE DENY:
RETURN Denied(reason=gate_result.reason)
14.3.4 Tool Selection Under Large Catalogs#
When $N$ exceeds 50 tools, neither full schema injection nor flat relevance scoring scales. A hierarchical tool index partitions tools into $M$ categories $\{\mathcal{C}_1, \dots, \mathcal{C}_M\}$ where $\mathcal{T} = \bigcup_{m} \mathcal{C}_m$ and $M \ll N$. Selection proceeds in two phases:
- Category selection: $m^{*} = \arg\max_{m} \text{rel}(\mathcal{C}_m, s)$ — low cost, operates over category descriptions.
- Tool selection within category: Standard hybrid selection over $\mathcal{C}_{m^{*}}$.
This reduces schema injection cost from $O(Nk)$ to $O(M k_{\text{cat}} + |\mathcal{C}_{m^{*}}| k)$ per step.
14.4 Multi-Tool Transactions: Compensation, Rollback, and Saga Patterns for Tool Chains#
14.4.1 Transactional Semantics for Tool Chains#
Unlike database transactions backed by ACID guarantees, tool chains span heterogeneous systems (APIs, file systems, databases, external services) where global atomicity is unavailable. We therefore adopt the Saga pattern: a sequence of local transactions $t_1, \dots, t_n$, each with a compensating action $c_i$ that semantically undoes the effect of $t_i$.
Definition: A tool saga is a pair of sequences:
$S = \big\langle (t_1, \dots, t_n),\ (c_1, \dots, c_n) \big\rangle$
where $t_i$ is the forward tool invocation and $c_i$ is its compensating counterpart. If $t_k$ fails after $t_1, \dots, t_{k-1}$ have succeeded, the saga executes compensations in reverse order:
$c_{k-1}, c_{k-2}, \dots, c_1$
14.4.2 Compensation Design Constraints#
Not all tool operations are reversible. We classify tools by their compensation capability:
| Class | Compensation | Example |
|---|---|---|
| Fully reversible | $c_i$ exactly undoes $t_i$ | File write → file delete |
| Semantically reversible | $c_i$ achieves approximate undo | API order creation → order cancellation |
| Compensable with side effects | $c_i$ undoes the primary effect but leaves traces | Email sent → follow-up retraction email |
| Irreversible | No $c_i$ exists | Published tweet, physical actuation |
For irreversible tools, the saga must place them last in the chain (after all fallible steps) or require pre-commitment approval: a human gate must grant the step before it executes.
14.4.3 Saga Orchestrator#
Pseudo-Algorithm 14.5: Saga Orchestrator with Compensation
PROCEDURE ExecuteSaga(saga: SagaDefinition, initial_input, trace_id):
completed_steps ← []
current_input ← initial_input
FOR i ← 0 TO len(saga.forward_steps) - 1:
step ← saga.forward_steps[i]
// Pre-flight check for irreversible steps
IF step.tool.compensation_class = IRREVERSIBLE:
IF i < len(saga.forward_steps) - 1:
LOG.WARN("Irreversible step not at end of saga; requesting approval")
approval ← RequestHumanApproval(step, context=current_input)
IF NOT approval.granted:
ExecuteCompensation(completed_steps)
RETURN SagaAborted(reason="irreversible_step_denied", compensated=TRUE)
// Execute forward step
result ← InvokeToolWithRetry(
tool=step.tool,
input=current_input,
retry_budget=step.retry_budget,
trace_id=trace_id
)
IF result IS Failure:
LOG.ERROR("Saga step failed", step=i, error=result.error)
compensation_result ← ExecuteCompensation(completed_steps)
RETURN SagaFailed(
failed_step=i,
error=result.error,
compensation_result=compensation_result,
partial_results=[s.result FOR s IN completed_steps]
)
completed_steps.APPEND(CompletedStep(
index=i,
tool=step.tool,
input=current_input,
result=result.output,
compensation=step.compensation
))
current_input ← step.junction_transform(result.output)
RETURN SagaSuccess(results=[s.result FOR s IN completed_steps])
PROCEDURE ExecuteCompensation(completed_steps: List<CompletedStep>):
compensation_results ← []
// Reverse order
FOR step IN REVERSED(completed_steps):
IF step.compensation IS NOT NULL:
comp_result ← InvokeToolWithRetry(
tool=step.compensation.tool,
input=step.compensation.construct_input(step.result, step.input),
retry_budget=step.compensation.retry_budget
)
compensation_results.APPEND(CompensationOutcome(step=step.index, result=comp_result))
IF comp_result IS Failure:
LOG.CRITICAL("Compensation failed — manual intervention required",
step=step.index, error=comp_result.error)
AlertOnCall("saga_compensation_failure", step=step.index)
ELSE:
LOG.WARN("No compensation defined for step", step=step.index)
RETURN compensation_results14.4.4 Idempotency Requirements#
Every forward step and compensation must be idempotent:
$t_i(t_i(x)) = t_i(x), \qquad c_i(c_i(y)) = c_i(y)$
Implementation techniques:
- Idempotency keys: Each invocation carries a unique key; the tool server deduplicates.
- Conditional mutations: Use version vectors or ETags; the operation succeeds only if the precondition matches.
- Upsert semantics: Prefer `INSERT ... ON CONFLICT UPDATE` over blind `INSERT`.
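A minimal sketch of server-side idempotency-key deduplication; the record shape and counter are hypothetical, and a real server would also bound the key cache's lifetime:

```python
class IdempotentToolServer:
    """Deduplicates invocations by idempotency key, replaying the cached result."""

    def __init__(self):
        self._seen = {}        # idempotency_key -> cached result
        self.side_effects = 0  # counts how many real mutations occurred

    def invoke(self, idempotency_key: str, params: dict):
        if idempotency_key in self._seen:
            return self._seen[idempotency_key]  # replay; no new side effect
        self.side_effects += 1                  # the real mutation happens once
        result = {"status": "created", "params": params}
        self._seen[idempotency_key] = result
        return result
```

A retried saga step that reuses its key therefore observes the original result instead of duplicating the mutation.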
14.4.5 Saga State Machine#
A saga's lifecycle is a finite state machine over the states RUNNING, COMPLETED, COMPENSATING, COMPENSATED, and COMPENSATION_FAILED.
Transitions:
- RUNNING → COMPLETED: all forward steps succeed.
- RUNNING → COMPENSATING: a forward step fails.
- COMPENSATING → COMPENSATED: all compensations succeed.
- COMPENSATING → COMPENSATION_FAILED: a compensation fails.
The COMPENSATION_FAILED state triggers manual intervention via alerting infrastructure.
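The forward/compensate lifecycle can be sketched as follows, with plain callables standing in for tool invocations; terminal states match the state machine above:

```python
def run_saga(steps):
    """steps: list of (forward, compensate) callables.

    On a forward failure, run the compensations of all completed steps
    in reverse order, then report the terminal state."""
    completed = []
    for forward, compensate in steps:
        try:
            forward()
            completed.append(compensate)
        except Exception:
            for comp in reversed(completed):   # reverse-order compensation
                try:
                    comp()
                except Exception:
                    return "COMPENSATION_FAILED"  # manual intervention required
            return "COMPENSATED"
    return "COMPLETED"
```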
14.5 Tool Fallback Hierarchies: Primary → Secondary → Degraded → Manual Escalation#
14.5.1 Motivation and Architecture#
External tools are inherently unreliable: APIs have outages, rate limits are exhausted, and services degrade. A production agentic system must define fallback hierarchies for every critical tool capability so that agent execution degrades gracefully rather than failing catastrophically.
A fallback hierarchy for a capability $c$ is an ordered list of tool implementations:
$H(c) = \big[ \tau^{(1)}, \tau^{(2)}, \dots, \tau^{(m)}, \bot_{\text{human}} \big]$
where superscripts denote priority (1 = primary), and $\bot_{\text{human}}$ is the human escalation sentinel.
Each tool in the hierarchy is annotated with:
| Property | Type | Description |
|---|---|---|
| `priority` | Integer | Lower is preferred |
| `latency_p99` | Duration | Expected worst-case latency |
| `cost_per_call` | Float | Monetary cost |
| `accuracy` | Float | Expected output correctness |
| `availability` | Float | Historical uptime |
| `circuit_state` | Enum | Current circuit breaker state |
14.5.2 Fallback Selection Criteria#
The selection function evaluates candidates in priority order, skipping those with open circuits or exceeded rate limits: the chosen tool is $\tau^{(j)}$ for the smallest $j$ whose circuit is not open, whose rate limit has remaining quota, and whose estimated latency fits the deadline.
If no automated tool is available, the capability escalates to $\bot_{\text{human}}$.
14.5.3 Circuit Breaker Integration#
Each tool's circuit breaker tracks the failure rate $f$ within a sliding window of recent invocations.
State transitions:
- CLOSED → OPEN: $f$ exceeds the failure threshold.
- OPEN → HALF_OPEN: a cool-down interval elapses.
- HALF_OPEN → CLOSED: a probe invocation succeeds.
- HALF_OPEN → OPEN: the probe fails.
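These transitions can be sketched as a minimal count-based breaker; a production breaker would track a failure rate over a sliding window rather than a raw consecutive count:

```python
import time

class CircuitBreaker:
    def __init__(self, failure_threshold: int, cooldown: float):
        self.failure_threshold = failure_threshold
        self.cooldown = cooldown
        self.failures = 0
        self.state = "CLOSED"
        self.opened_at = 0.0

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.state, self.opened_at = "OPEN", time.monotonic()

    def record_success(self):
        self.failures, self.state = 0, "CLOSED"

    def allow_request(self) -> bool:
        if self.state == "OPEN":
            if time.monotonic() - self.opened_at >= self.cooldown:
                self.state = "HALF_OPEN"  # admit one probe invocation
                return True
            return False
        return True
```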
Pseudo-Algorithm 14.6: Fallback Hierarchy Executor
PROCEDURE ExecuteWithFallback(capability, input, hierarchy, trace_id, deadline):
attempted ← []
FOR level ← 0 TO len(hierarchy) - 1:
tool ← hierarchy[level]
// Skip if circuit open
IF tool.circuit_breaker.state = OPEN:
attempted.APPEND(Skipped(tool, reason="circuit_open"))
CONTINUE
// Skip if rate limit exhausted
IF tool.rate_limiter.remaining = 0:
attempted.APPEND(Skipped(tool, reason="rate_limited"))
CONTINUE
// Check deadline feasibility
IF now() + tool.latency_p99_estimate > deadline:
attempted.APPEND(Skipped(tool, reason="deadline_infeasible"))
CONTINUE
result ← InvokeTool(tool, input, deadline=deadline, trace_id=trace_id)
IF result IS Success:
EmitMetric("tool_fallback_level", level, capability=capability)
RETURN FallbackSuccess(result=result.output, level=level, attempted=attempted)
ELSE:
tool.circuit_breaker.RecordFailure()
attempted.APPEND(Failed(tool, error=result.error))
// All automated tools exhausted — manual escalation
IF hierarchy.manual_escalation_enabled:
ticket ← CreateEscalationTicket(
capability=capability,
input=input,
attempted=attempted,
trace_id=trace_id,
urgency=ComputeUrgency(deadline)
)
RETURN ManualEscalation(ticket=ticket, attempted=attempted)
ELSE:
RETURN FallbackExhausted(attempted=attempted)
14.5.4 Degradation Modes#
When falling back from primary to lower-tier tools, the agent must adjust its expectations:
- Accuracy degradation: Lower-tier tools may provide approximate results. The agent should annotate downstream reasoning with confidence decrements.
- Latency degradation: Synchronous fallbacks increase total chain latency; the agent should re-evaluate deadline feasibility.
- Feature degradation: Secondary tools may lack capabilities (e.g., pagination, filtering). The agent must compensate with post-processing.
- Manual escalation: The agent loop pauses at a human gate, persists state, and resumes upon human response. This requires durable state serialization and session resumption protocols.
14.6 Tool Result Validation: Schema Conformance, Sanity Checks, Cross-Tool Consistency Verification#
14.6.1 Validation Layers#
Tool outputs cannot be trusted blindly. A production-grade validation pipeline applies three layers:
- Schema conformance — structural correctness.
- Semantic sanity checks — value-level plausibility.
- Cross-tool consistency — agreement across corroborating sources.
14.6.2 Schema Conformance Validation#
Every tool declares an output schema (JSON Schema, Protobuf message, or equivalent). Validation checks:
- Required fields present.
- Types correct (string, integer, array, nested object).
- Value constraints met (ranges, regex patterns, enum membership).
- Array bounds respected (e.g., `minItems`, `maxItems`).
Non-conforming outputs are rejected, and the invocation is treated as a failure (triggering retry or fallback).
14.6.3 Semantic Sanity Checks#
Beyond schema correctness, outputs must be semantically plausible. Sanity checks are domain-specific predicates:
$\sigma : \mathcal{Y} \to \{\text{true}, \text{false}\}$
Examples of sanity predicates:
- Temporal: timestamps must not lie in the future, and intervals must end after they start.
- Numeric range: values must fall within plausible domain bounds (e.g., non-negative counts and totals).
- Referential: returned identifiers must resolve to entities that exist.
- Cardinality: result counts must fall within expected bounds for the query.
- Self-consistency: derived fields must agree (e.g., line items sum to the reported total).
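A sketch of sanity predicates over a hypothetical order record; the field names and the 0.01 currency tolerance are illustrative:

```python
from datetime import datetime, timezone

def sanity_check(order: dict) -> list[str]:
    """Domain-specific plausibility predicates; returns violation descriptions."""
    violations = []
    now = datetime.now(timezone.utc)
    # Temporal: no future timestamps.
    if datetime.fromisoformat(order["created_at"]) > now:
        violations.append("temporal: created_at lies in the future")
    # Numeric range: totals must be non-negative.
    if order["total"] < 0:
        violations.append("numeric_range: negative total")
    # Self-consistency: line items must sum to the reported total.
    if abs(sum(order["line_items"]) - order["total"]) > 0.01:
        violations.append("self_consistency: line items do not sum to total")
    return violations
```

An empty list means the output passes this layer; any entry feeds the validator's issue list.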
14.6.4 Cross-Tool Consistency Verification#
When multiple tools produce overlapping information, consistency checks detect conflicting outputs. Given outputs $y_a$ from $\tau_a$ and $y_b$ from $\tau_b$ covering the same entity or fact:
$\text{consistent}(y_a, y_b) \iff \forall f \in F_{\text{shared}} : \big| y_a[f] - y_b[f] \big| \leq \epsilon_f$
where $\epsilon_f$ is a field-specific tolerance (exact match for identifiers, $\epsilon$-tolerance for floating-point values, fuzzy match for strings).
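A sketch of field-wise agreement with per-field tolerances (the helper names are illustrative; string fuzzy matching is reduced to exact equality here):

```python
def agree(value_a, value_b, tolerance: float = 0.0) -> bool:
    """Field-level agreement: epsilon-tolerance for floats, exact otherwise."""
    if isinstance(value_a, float) or isinstance(value_b, float):
        return abs(value_a - value_b) <= tolerance
    return value_a == value_b

def consistent(out_a: dict, out_b: dict, tolerances: dict) -> bool:
    """Two outputs are consistent if every shared field agrees within tolerance."""
    shared = out_a.keys() & out_b.keys()
    return all(agree(out_a[f], out_b[f], tolerances.get(f, 0.0)) for f in shared)
```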
Conflict resolution strategies:
| Strategy | Rule |
|---|---|
| Authority ranking | Prefer the output from the higher-authority source |
| Freshness | Prefer the more recently retrieved value |
| Majority vote | If $k$ out of $n$ corroborating tools agree, adopt the majority value |
| LLM arbitration | Present conflicts to the LLM with provenance for reasoned resolution |
| Human escalation | Flag unresolvable conflicts for human review |
Pseudo-Algorithm 14.7: Multi-Layer Tool Output Validator
PROCEDURE ValidateToolOutput(output, tool, context, validation_policy):
issues ← []
// Layer 1: Schema conformance
schema_result ← ValidateJsonSchema(output, tool.output_schema)
IF schema_result.has_errors:
RETURN ValidationFailure(layer="schema", errors=schema_result.errors, severity=CRITICAL)
// Layer 2: Semantic sanity
sanity_predicates ← validation_policy.GetSanityPredicates(tool.id)
FOR predicate IN sanity_predicates:
IF NOT predicate.evaluate(output):
issues.APPEND(SanityViolation(predicate=predicate.name, value=predicate.extract(output),
expected=predicate.expected_range, severity=predicate.severity))
IF ANY(issue.severity = CRITICAL FOR issue IN issues):
RETURN ValidationFailure(layer="sanity", errors=issues, severity=CRITICAL)
// Layer 3: Cross-tool consistency
corroborating_outputs ← context.GetCorroboratingOutputs(tool.id, output)
FOR corr IN corroborating_outputs:
shared_fields ← IntersectFields(output, corr.output)
FOR field IN shared_fields:
IF NOT AgreeWithTolerance(output[field], corr.output[field], tolerance=validation_policy.GetTolerance(field)):
issues.APPEND(ConsistencyConflict(
field=field,
value_a=output[field], source_a=tool.id,
value_b=corr.output[field], source_b=corr.tool_id,
severity=validation_policy.GetConflictSeverity(field)
))
IF ANY(issue.severity = CRITICAL FOR issue IN issues):
resolution ← ResolveConflicts(issues, validation_policy.conflict_resolution_strategy)
RETURN ValidationConditional(output=resolution.resolved_output, issues=issues, resolution=resolution)
IF len(issues) > 0:
RETURN ValidationWarning(output=output, issues=issues)
ELSE:
RETURN ValidationSuccess(output=output)
14.7 Self-Healing Tool Use: Automatic Retry with Parameter Adjustment, Error-Guided Correction#
14.7.1 Error Taxonomy for Tool Invocations#
Tool failures are categorized by their amenability to automated correction:
| Error Class | Retryable | Self-Healable | Example |
|---|---|---|---|
| Transient | Yes (same params) | N/A | Network timeout, 503 |
| Rate-limited | Yes (after delay) | N/A | 429 Too Many Requests |
| Input validation | No (same params) | Yes (adjust params) | Invalid date format, missing field |
| Authorization | No | No (escalate) | 403 Forbidden |
| Semantic error | No (same params) | Yes (rewrite query) | SQL syntax error, empty result set |
| Resource not found | No (same params) | Yes (search + retry) | 404, entity doesn't exist |
| Server error | Conditional | No | 500 with corruption |
14.7.2 Self-Healing Loop#
The self-healing mechanism uses the LLM as an error diagnostician that reads the error response and adjusts parameters:
$x' = \mathrm{repair}(x, e, \text{schema})$
This forms a repair loop bounded by a maximum iteration count $A_{\max}$:
Pseudo-Algorithm 14.8: Self-Healing Tool Invocation
PROCEDURE InvokeWithSelfHealing(tool, initial_params, context, max_attempts, trace_id):
current_params ← initial_params
attempt_history ← []
FOR attempt ← 1 TO max_attempts:
// Invoke tool
result ← InvokeTool(tool, current_params, trace_id=trace_id)
IF result IS Success:
validated ← ValidateToolOutput(result.output, tool, context)
IF validated IS ValidationSuccess OR validated IS ValidationWarning:
RETURN SelfHealSuccess(output=validated.output, attempts=attempt, history=attempt_history)
ELSE:
// Output validation failed — treat as semantic error
error_info ← ConstructErrorInfo(type="output_validation_failure", details=validated.errors)
ELSE:
error_info ← ClassifyError(result.error)
attempt_history.APPEND(AttemptRecord(
attempt=attempt, params=current_params, error=error_info, timestamp=now()
))
// Determine if error is self-healable
IF error_info.class IN {AUTHORIZATION, SERVER_ERROR_PERMANENT}:
RETURN SelfHealFailure(error=error_info, attempts=attempt, history=attempt_history)
IF error_info.class = TRANSIENT:
// Simple retry with exponential backoff + jitter
delay ← MIN(BASE_DELAY * 2^attempt + RandomJitter(), MAX_DELAY)
Sleep(delay)
CONTINUE // Same params
IF error_info.class = RATE_LIMITED:
delay ← ParseRetryAfter(result.headers) OR DEFAULT_RATE_LIMIT_DELAY
Sleep(delay)
CONTINUE // Same params
// Self-healing: LLM-driven parameter adjustment
repair_context ← ConstructRepairContext(
tool_schema=tool.input_schema,
original_params=initial_params,
current_params=current_params,
error=error_info,
attempt_history=attempt_history,
task_context=context
)
repaired_params ← LLM.RepairToolParams(repair_context)
IF repaired_params = current_params:
// LLM couldn't find a different parameterization
RETURN SelfHealFailure(error=error_info, attempts=attempt, history=attempt_history,
reason="no_alternative_params")
// Validate repaired params against schema before retrying
schema_valid ← ValidateSchema(repaired_params, tool.input_schema)
IF schema_valid IS SchemaError:
// LLM produced invalid repair — try once more with explicit schema guidance
repaired_params ← LLM.RepairToolParams(repair_context, include_schema_error=schema_valid)
IF ValidateSchema(repaired_params, tool.input_schema) IS SchemaError:
RETURN SelfHealFailure(error="repair_produced_invalid_schema", attempts=attempt,
history=attempt_history)
current_params ← repaired_params
RETURN SelfHealExhausted(attempts=max_attempts, history=attempt_history)
14.7.3 Repair Quality Metrics#
The quality of the self-healing loop is measured by:
- Repair success rate: the fraction of initially failed invocations that ultimately succeed within $A_{\max}$ attempts.
- Mean attempts to heal: the average attempt count over successful repairs.
- Repair latency overhead: the wall-clock time added by the repair loop relative to a first-attempt success.
- False repair rate: cases where the healed output passed validation but was semantically incorrect.
These metrics feed into the evaluation infrastructure (§14.13) for continuous quality monitoring.
14.7.4 Error-Guided Correction Patterns#
Specific error classes invoke specialized correction strategies:
- SQL syntax errors: Extract the error message, re-invoke the LLM with the error and original query for targeted rewriting. Include the database schema as additional context.
- Empty result sets: Broaden search criteria (relax filters, expand date ranges, use synonyms). The repair prompt explicitly instructs broadening.
- Type mismatches: Extract expected vs. actual types from the error; apply type coercion or reformatting.
- Pagination exhaustion: Adjust offset/cursor parameters to access the correct page.
- Encoding errors: Detect encoding issues (UTF-8 vs. ASCII) and apply appropriate encoding before retry.
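These dispatch rules can be sketched as a classification function; the status codes and message substrings are illustrative, and real classifiers would also inspect structured error payloads:

```python
def healing_strategy(status_code: int, error_message: str) -> str:
    """Map an error to a correction strategy per the taxonomy above."""
    if status_code in (502, 503, 504):
        return "retry_backoff"            # transient: same params, backoff
    if status_code == 429:
        return "retry_after_delay"        # rate-limited: honor Retry-After
    if status_code == 403:
        return "escalate"                 # authorization: not self-healable
    if status_code == 404:
        return "search_then_retry"        # resource not found
    message = error_message.lower()
    if "syntax error" in message:
        return "llm_rewrite_with_error"   # semantic: targeted query rewrite
    if "empty result" in message:
        return "broaden_criteria"         # relax filters, expand ranges
    return "classify_manually"
```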
14.8 Tool Creation by Agents: Dynamic Code Generation, Sandboxed Execution, and Promotion to Permanent Tools#
14.8.1 The Tool Creation Lifecycle#
When an agent encounters a task for which no existing tool suffices, it may create a new tool dynamically. This capability transforms the agent from a tool consumer into a tool producer, but introduces severe safety, correctness, and governance risks that must be addressed mechanically.
The tool creation lifecycle consists of five phases: gap identification, code generation, sandboxed execution, validation and testing, and promotion to a permanent tool.
14.8.2 Gap Identification#
The agent identifies tool gaps through:
- Tool selection failure: No existing tool matches the required capability.
- Composition inefficiency: The task requires $k$ separate tool invocations that could be replaced by a single custom tool.
- Repeated patterns: The same multi-tool sequence appears $\geq n$ times in recent traces.

Formally, a gap is detected when:

$$\max_{T \in \mathcal{T}} \operatorname{rel}(T, c) < \theta_{\min}$$

where $c$ is the required capability and $\theta_{\min}$ is the minimum relevance threshold.
14.8.3 Code Generation with Typed Contracts#
The agent generates tool code that conforms to a tool template contract:
ToolTemplate:
name: string // Unique identifier
description: string // Natural language description for discovery
input_schema: JSONSchema // Typed input specification
output_schema: JSONSchema // Typed output specification
implementation: CodeBlock // Generated code (Python, TypeScript, etc.)
dependencies: List<Dependency> // Required libraries (allowlisted)
resource_limits: ResourceLimits // CPU, memory, network, time bounds
test_cases: List<TestCase> // Agent-generated test cases
security_classification: SecurityLevel // Determines sandbox tier

The LLM generates the implementation subject to these constraints:
- Allowed dependencies: Only from an approved allowlist (no arbitrary package installation).
- No network access during initial sandbox testing (unless explicitly approved).
- Deterministic behavior: Same input should produce same output (excluding time-dependent operations).
- Bounded execution time: Hard timeout enforced by the sandbox.
14.8.4 Sandboxed Execution Environment#
Generated tools execute in a multi-tier sandbox:
| Tier | Capabilities | Use Case |
|---|---|---|
| Tier 0 (Pure compute) | CPU, memory only; no I/O, no network | Mathematical operations, data transforms |
| Tier 1 (Read-only I/O) | Filesystem read, environment variables | File parsing, config reading |
| Tier 2 (Controlled network) | Allowlisted HTTP endpoints only | API consumption |
| Tier 3 (Full access) | Requires human approval per invocation | System administration tools |
Sandbox enforcement uses OS-level isolation: containers, seccomp-bpf syscall filtering, and namespace isolation.
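A minimal sketch of the execution boundary, assuming only a child-interpreter process with a hard wall-clock timeout; real Tier 0 enforcement adds containers, seccomp-bpf, namespaces, and memory/CPU rlimits, all of which this omits:

```python
import subprocess
import sys

def run_sandboxed(code: str, timeout_s: float = 5.0) -> dict:
    """Run untrusted code in a separate interpreter with a hard timeout.
    Illustrative only: process isolation alone is NOT a security boundary."""
    try:
        proc = subprocess.run(
            [sys.executable, "-I", "-c", code],  # -I: isolated mode, no user site-packages
            capture_output=True, text=True, timeout=timeout_s,
        )
        status = "ok" if proc.returncode == 0 else "error"
        return {"status": status, "stdout": proc.stdout, "stderr": proc.stderr}
    except subprocess.TimeoutExpired:
        # The child is killed; the agent sees a structured timeout, not a hang.
        return {"status": "timeout", "stdout": "", "stderr": ""}
```

The structured `{status, stdout, stderr}` result is what feeds the validation phase below: a crash or timeout is an observation, never an exception that escapes into the agent loop.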
14.8.5 Validation and Testing#
Before a generated tool is used, it must pass:
- Static analysis: Lint, type check, security scan (no eval, no shell injection, no hardcoded credentials).
- Unit tests: Agent-generated test cases executed in sandbox.
- Property-based tests: Automatically generated inputs (fuzzing) to detect crashes, hangs, and out-of-bounds behavior.
- Output schema conformance: Every test execution's output must validate against the declared output schema.
The validation score is a weighted combination:

$$V = w_{\text{static}} V_{\text{static}} + w_{\text{unit}} V_{\text{unit}} + w_{\text{prop}} V_{\text{prop}} + w_{\text{schema}} V_{\text{schema}}$$

where $\sum_i w_i = 1$ and each $V_i \in [0, 1]$. The tool is usable only if $V \geq V_{\min}$.
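The weighted score reduces to a dot product over component scores. The component names, weights, and threshold below are illustrative values chosen only so the weights sum to 1:

```python
def validation_score(scores: dict, weights: dict) -> float:
    """Weighted validation score; each component score lies in [0, 1]."""
    assert abs(sum(weights.values()) - 1.0) < 1e-9, "weights must sum to 1"
    return sum(weights[k] * scores[k] for k in weights)

# Illustrative weights and per-phase scores (not prescribed values).
weights = {"static": 0.2, "unit": 0.4, "property": 0.2, "conformance": 0.2}
scores = {"static": 1.0, "unit": 0.9, "property": 0.8, "conformance": 1.0}

v = validation_score(scores, weights)   # 0.2 + 0.36 + 0.16 + 0.2 = 0.92
usable = v >= 0.85                      # hypothetical promotion threshold
```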
14.8.6 Promotion Pipeline#
Pseudo-Algorithm 14.9: Tool Promotion Pipeline
PROCEDURE PromoteTool(generated_tool, promotion_policy):
// Phase 1: Static analysis
static_result ← RunStaticAnalysis(generated_tool.implementation, generated_tool.dependencies)
IF static_result.has_critical_issues:
RETURN PromotionRejected(reason="static_analysis_failure", details=static_result)
// Phase 2: Sandboxed test execution
sandbox ← CreateSandbox(tier=DetermineSandboxTier(generated_tool))
test_results ← []
FOR test IN generated_tool.test_cases:
result ← sandbox.Execute(generated_tool.implementation, test.input, timeout=generated_tool.resource_limits.timeout)
schema_valid ← ValidateSchema(result.output, generated_tool.output_schema)
test_results.APPEND(TestResult(input=test.input, expected=test.expected, actual=result.output,
passed=test.expected_matches(result.output) AND schema_valid, duration=result.duration))
// Phase 3: Fuzz testing
fuzz_inputs ← GenerateFuzzInputs(generated_tool.input_schema, count=100)
FOR fuzz_input IN fuzz_inputs:
result ← sandbox.Execute(generated_tool.implementation, fuzz_input, timeout=generated_tool.resource_limits.timeout)
IF result IS Crash OR result IS Timeout:
test_results.APPEND(TestResult(input=fuzz_input, passed=FALSE, error=result.error))
// Phase 4: Compute validation score
scores ← ComputeValidationScores(static_result, test_results)
IF scores.overall < promotion_policy.min_validation_score:
RETURN PromotionRejected(reason="insufficient_validation_score", score=scores.overall)
// Phase 5: Promotion decision
IF promotion_policy.requires_human_review:
review ← RequestHumanReview(generated_tool, test_results, scores)
IF NOT review.approved:
RETURN PromotionRejected(reason="human_review_denied")
// Phase 6: Register in tool catalog
tool_record ← ToolCatalogRecord(
tool=generated_tool,
status=PROMOTED,
validation_score=scores.overall,
created_by="agent:" + agent_id,
created_at=now(),
version="1.0.0",
trust_score=ComputeInitialTrustScore(scores)
)
ToolCatalog.Register(tool_record)
RETURN PromotionSuccess(tool_id=tool_record.id, trust_score=tool_record.trust_score)
14.8.7 Ephemeral vs. Permanent Tools#
| Attribute | Ephemeral | Permanent |
|---|---|---|
| Lifetime | Single session | Indefinite (versioned) |
| Storage | Working memory | Tool catalog / registry |
| Discovery | Not discoverable by other agents | Discoverable via MCP |
| Governance | Self-validated | Human-reviewed, CI-tested |
| Trust score | Minimal | Accumulates with usage |
Ephemeral tools serve as a rapid prototyping mechanism; only tools that prove their value over multiple sessions and pass the promotion pipeline become permanent.
14.9 Browser and GUI Tools: Playwright, Puppeteer, Desktop Automation, Vision-Language Tool Agents#
14.9.1 Architecture of Browser/GUI Tool Agents#
Browser and GUI automation tools enable agents to interact with web applications and desktop software that lack APIs. These tools bridge the gap between structured tool invocation and unstructured visual interfaces.
The architecture comprises three layers:
Agent Reasoning Layer (LLM)
↕ Structured commands / observations
Automation Abstraction Layer (Tool Server)
↕ Low-level browser/GUI commands
Execution Engine (Playwright / Puppeteer / OS Accessibility APIs)
14.9.2 Browser Automation Protocol#
Browser tools expose a typed interface for web interaction. Core operations:
| Operation | Input | Output | Side Effect |
|---|---|---|---|
| `navigate` | URL | Page metadata, status code | Browser navigates |
| `extract_text` | CSS selector / XPath | Extracted text content | None |
| `extract_structured` | Selector + schema | Structured data | None |
| `click` | Selector | Updated page state | DOM mutation |
| `fill_form` | Selector + value | Confirmation | DOM mutation |
| `screenshot` | Viewport/selector | Image (base64) | None |
| `execute_js` | JavaScript code | Execution result | Arbitrary |
| `wait_for` | Condition + timeout | Success/timeout | Blocks |
14.9.3 Vision-Language Integration#
For applications where DOM access is unreliable or unavailable (canvas-based apps, remote desktops, PDFs rendered as images), the agent uses vision-language models (VLMs) to interpret screenshots:
The VLM produces structured action commands: an action type, a semantic target, and action parameters.
Coordinate grounding is the critical challenge: the VLM must map semantic targets ("the Submit button") to pixel coordinates $(x, y)$. Approaches:
- Set-of-Mark (SoM): Overlay numbered labels on interactive elements in the screenshot; the VLM references labels instead of raw coordinates.
- Bounding box prediction: The VLM outputs bounding box coordinates for the target element.
- DOM-augmented vision: Combine the screenshot with a simplified DOM tree to provide both visual and structural grounding.
14.9.4 Observation Compression for GUI Agents#
Raw screenshots consume significant token budget when processed by VLMs. Compression strategies:
- DOM-to-text: Convert the accessibility tree to a compact text representation (element type, label, state, relative position).
- Selective screenshot: Capture only the relevant viewport region, not the full page.
- Delta encoding: After the first screenshot, transmit only changed regions.
- Structured observation: Extract form field states, button labels, and error messages as structured data rather than relying on visual parsing.
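The DOM-to-text strategy can be sketched directly. The node record below (`type`/`label`/`state`/`pos` fields) is a simplified stand-in for a real accessibility-tree node, not an actual browser API:

```python
def dom_to_text(nodes: list) -> str:
    """Compress simplified accessibility-tree nodes into compact one-line
    records: element type, label, optional state, relative position."""
    lines = []
    for n in nodes:
        state = f" [{n['state']}]" if n.get("state") else ""
        lines.append(f"{n['type']} \"{n['label']}\"{state} @{n.get('pos', '?')}")
    return "\n".join(lines)
```

A full-page screenshot can cost thousands of image tokens; a compact text rendering like this often carries the same actionable information in a few dozen.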
Pseudo-Algorithm 14.10: Browser Agent Action Loop
PROCEDURE BrowserAgentLoop(task, browser_session, max_steps, deadline):
action_history ← []
FOR step ← 1 TO max_steps:
// Observe current state
observation ← ConstructObservation(browser_session)
// Observation = { dom_summary, screenshot_if_needed, url, page_title, error_messages }
// Compress observation for context
compressed_obs ← CompressObservation(observation, method=SelectCompressionMethod(task, step))
// Agent reasons about next action
agent_context ← AssembleContext(
task=task,
current_observation=compressed_obs,
action_history=TruncateHistory(action_history, max_tokens=2048),
available_actions=BROWSER_ACTION_SCHEMA
)
action_decision ← LLM.DecideAction(agent_context)
// Validate action
IF action_decision.action = "done":
result ← ExtractResult(observation, task.expected_output_schema)
RETURN BrowserTaskSuccess(result=result, steps=step, history=action_history)
IF NOT IsValidAction(action_decision, BROWSER_ACTION_SCHEMA):
action_history.APPEND(InvalidAction(action_decision, step=step))
CONTINUE // Let agent self-correct on next iteration
// Execute action with safety checks
IF action_decision.action IN {"navigate", "click", "fill_form", "execute_js"}:
safety_check ← BrowserSafetyPolicy.Check(action_decision, browser_session)
IF safety_check = DENY:
action_history.APPEND(BlockedAction(action_decision, reason=safety_check.reason))
CONTINUE
execution_result ← browser_session.Execute(action_decision)
action_history.APPEND(ExecutedAction(
step=step, action=action_decision, result=execution_result, timestamp=now()
))
// Wait for page stability
browser_session.WaitForStable(timeout=5000)
IF now() > deadline:
RETURN BrowserTaskTimeout(steps=step, history=action_history)
RETURN BrowserTaskExhausted(steps=max_steps, history=action_history)
14.9.5 Desktop Automation#
Desktop automation extends browser patterns to native applications using:
- OS Accessibility APIs (Windows UI Automation, macOS Accessibility, Linux AT-SPI): Provide structured access to UI element trees.
- Computer Vision: When accessibility APIs are unavailable, use screenshot-based interaction with VLMs.
- Keyboard/Mouse simulation: Low-level input injection for actions not exposed through accessibility APIs.
The same action loop (Pseudo-Algorithm 14.10) applies, with the browser session replaced by a desktop session abstraction and the DOM summary replaced by the accessibility tree.
14.10 File System and Repository Tools: Git Operations, File Manipulation, Build System Integration#
14.10.1 File System Tool Taxonomy#
File system tools provide the agent with the ability to read, write, search, and manipulate files and directories. These tools are critical for software engineering agents, document processing agents, and data pipeline agents.
| Tool | Input | Output | Mutation |
|---|---|---|---|
| `read_file` | Path, encoding, byte range | File content (text/binary) | None |
| `write_file` | Path, content, mode (create/overwrite/append) | Success + metadata | Yes |
| `list_directory` | Path, pattern, recursive | File listing with metadata | None |
| `search_files` | Pattern (glob/regex), root path | Matching file paths | None |
| `search_content` | Query (text/regex), file set | Matching lines with context | None |
| `move_file` | Source, destination | Success | Yes |
| `delete_file` | Path, require_confirmation | Success | Yes (irreversible) |
| `file_diff` | Path A, Path B | Unified diff | None |
| `file_metadata` | Path | Size, timestamps, permissions | None |
14.10.2 Git Operations#
Git tools enable agents to operate within version-controlled repositories with full branching, committing, and merging capabilities.
Core Git tool operations:
| Tool | Input | Output | Mutation |
|---|---|---|---|
| `git_status` | Repo path | Modified/staged/untracked files | None |
| `git_diff` | Repo, ref range, path filter | Diff output | None |
| `git_log` | Repo, ref, count, path filter | Commit history | None |
| `git_branch` | Repo, branch name, base ref | Branch created | Yes |
| `git_checkout` | Repo, ref | Working tree updated | Yes |
| `git_commit` | Repo, message, file list | Commit SHA | Yes |
| `git_merge` | Repo, source branch, strategy | Merge result or conflicts | Yes |
| `git_push` | Repo, remote, branch | Push result | Yes (remote state) |
| `git_blame` | Repo, file, line range | Authorship per line | None |
Branching discipline for agentic execution: agents must operate on isolated branches, never directly on shared ones. This ensures:
- No direct mutation of `main` or protected branches.
- All changes are reviewable via pull request.
- Parallel agents do not create merge conflicts on the same branch.
- Rollback is trivial (delete branch).
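The discipline can be enforced mechanically before any mutating git tool runs. The `agent/<id>/<task>` naming scheme below is an assumption for illustration, not a prescribed convention:

```python
import re

# Branch names agents may never mutate directly (illustrative policy set).
PROTECTED = {"main", "master", "release"}

def agent_branch_name(agent_id: str, task_id: str) -> str:
    """Derive an isolated, git-safe branch name, e.g. agent/planner-7/task-1423."""
    safe = lambda s: re.sub(r"[^A-Za-z0-9._-]", "-", s)
    return f"agent/{safe(agent_id)}/{safe(task_id)}"

def assert_not_protected(branch: str) -> None:
    """Guard invoked before git_commit / git_push on the target branch."""
    if branch in PROTECTED or branch.split("/")[0] in PROTECTED:
        raise PermissionError(f"agents may not mutate protected branch {branch!r}")
```

Placing the guard in the tool server, rather than trusting the agent's plan, turns the branching policy into an invariant instead of a convention.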
14.10.3 Build System Integration#
Agents that modify code must verify their changes compile and pass tests:
| Tool | Input | Output |
|---|---|---|
| `build` | Repo, target, config | Build result (success/failure + logs) |
| `test` | Repo, test suite, filter | Test results (pass/fail/skip per test) |
| `lint` | Repo, file set, config | Lint results (warnings, errors per file) |
| `typecheck` | Repo, config | Type errors (file, line, message) |
Verification loop after file mutation:
PROCEDURE VerifyAfterMutation(repo, changed_files, verification_policy):
results ← {}
IF verification_policy.require_typecheck:
results["typecheck"] ← RunTypecheck(repo)
IF verification_policy.require_lint:
results["lint"] ← RunLint(repo, changed_files)
IF verification_policy.require_build:
results["build"] ← RunBuild(repo)
IF verification_policy.require_tests:
test_scope ← DetermineAffectedTests(repo, changed_files)
results["tests"] ← RunTests(repo, test_scope)
all_passed ← ALL(r.passed FOR r IN results.values())
RETURN VerificationResult(passed=all_passed, details=results)
14.11 Database Tools: Query Generation, Schema Introspection, Migration Planning, and Data Validation#
14.11.1 Database Tool Architecture#
Database tools enable agents to interact with relational and non-relational databases through a safety-layered interface:
Agent (LLM)
↕ Natural language → SQL/query intent
Query Generator (LLM or template engine)
↕ Generated query
Safety Layer (parser, validator, policy gate)
↕ Approved query
Execution Layer (connection pool, timeout, result limit)
↕ Result set
Result Formatter (projection, truncation, type conversion)
↕ Structured output to agent
14.11.2 Schema Introspection#
Before generating queries, the agent must understand the database schema. The schema introspection tool provides:
- Tables/collections: Names, descriptions (from comments), row counts.
- Columns/fields: Name, type, nullability, constraints, foreign keys, indices.
- Relationships: Foreign key graph, junction tables.
- Sample data: $k$ representative rows per table (anonymized if sensitive).
- Statistics: Value distributions, cardinality estimates, NULL rates.
Schema is injected into context in compressed form:
TABLE orders (
id: INT PK AUTO_INCREMENT,
user_id: INT FK→users.id NOT NULL INDEX,
total: DECIMAL(10,2) NOT NULL CHECK(>0),
status: ENUM('pending','shipped','delivered','cancelled') DEFAULT 'pending',
created_at: TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP
) -- ~1.2M rows, ~340 orders/day
14.11.3 Query Generation Safety#
Generated SQL must be validated before execution:
- Parse validation: The query must parse as valid SQL for the target dialect.
- Read-only enforcement: For read operations, reject any query containing `INSERT`, `UPDATE`, `DELETE`, `DROP`, `ALTER`, `TRUNCATE`, `CREATE`, or `GRANT`.
- Query complexity bounds: Reject queries with:
  - More than $J_{\max}$ joins (default: $J_{\max} = 5$).
  - Subqueries nested deeper than $D_{\max}$ levels (default: $D_{\max} = 3$).
  - Missing `WHERE` clause on large tables (prevents full table scans).
  - Missing `LIMIT` clause (enforce maximum result set size).
- Parameterization: All user-derived values must be parameterized (prevent SQL injection).
- Execution cost estimation: Use `EXPLAIN` to estimate query cost before execution; reject queries above a cost threshold.
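A naive pre-filter for the keyword and complexity checks can be sketched with regular expressions. This is a cheap first gate only: keyword matching misses string literals, comments, and dialect quirks, so a production gate must parse the query into an AST as Pseudo-Algorithm 14.11 does:

```python
import re

WRITE_KEYWORDS = re.compile(
    r"\b(INSERT|UPDATE|DELETE|DROP|ALTER|TRUNCATE|CREATE|GRANT)\b", re.IGNORECASE
)

def quick_safety_check(sql: str, max_joins: int = 5) -> list:
    """Return a list of violation tags for a candidate read query (sketch)."""
    violations = []
    if WRITE_KEYWORDS.search(sql):
        violations.append("write_operation")
    if len(re.findall(r"\bJOIN\b", sql, re.IGNORECASE)) > max_joins:
        violations.append("too_many_joins")
    if not re.search(r"\bLIMIT\b", sql, re.IGNORECASE):
        violations.append("missing_limit")
    return violations
```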
14.11.4 Query Generation Algorithm#
Pseudo-Algorithm 14.11: Safe Database Query Generator
PROCEDURE GenerateAndExecuteQuery(natural_language_query, database_connection, schema_cache, safety_policy):
// Step 1: Schema retrieval
relevant_tables ← IdentifyRelevantTables(natural_language_query, schema_cache)
schema_context ← FormatSchemaForContext(relevant_tables, include_sample_data=TRUE, max_tables=10)
// Step 2: Query generation
generated_sql ← LLM.GenerateSQL(
instruction="Generate a SQL query for the following request. Use only the provided schema. Include LIMIT clause.",
user_query=natural_language_query,
schema=schema_context,
dialect=database_connection.dialect
)
// Step 3: Safety validation
parse_result ← ParseSQL(generated_sql, dialect=database_connection.dialect)
IF parse_result IS ParseError:
// Self-heal: retry with error feedback
generated_sql ← LLM.RepairSQL(generated_sql, parse_result.error, schema_context)
parse_result ← ParseSQL(generated_sql, dialect=database_connection.dialect)
IF parse_result IS ParseError:
RETURN QueryFailure(reason="parse_error_after_repair", error=parse_result.error)
IF NOT safety_policy.IsReadOnly(parse_result.ast):
RETURN QueryFailure(reason="write_operation_not_permitted")
IF NOT safety_policy.ComplexityWithinBounds(parse_result.ast):
RETURN QueryFailure(reason="query_too_complex", details=safety_policy.GetComplexityReport(parse_result.ast))
// Step 4: Cost estimation
explain_result ← database_connection.Explain(generated_sql)
IF explain_result.estimated_cost > safety_policy.max_query_cost:
RETURN QueryFailure(reason="estimated_cost_too_high", cost=explain_result.estimated_cost)
// Step 5: Execute with timeout and row limit
result ← database_connection.Execute(
query=generated_sql,
timeout=safety_policy.query_timeout,
max_rows=safety_policy.max_result_rows
)
IF result IS Timeout:
RETURN QueryFailure(reason="execution_timeout")
// Step 6: Format and return
formatted ← FormatResultSet(result.rows, result.columns, max_display_rows=safety_policy.max_display_rows)
RETURN QuerySuccess(
sql=generated_sql,
result=formatted,
row_count=result.row_count,
execution_time=result.duration,
truncated=result.row_count > safety_policy.max_display_rows
)
14.11.5 Migration Planning#
For schema migrations, the agent:
- Analyzes the current schema and the desired end state.
- Generates migration scripts (DDL statements) with up and down paths.
- Validates migrations against a copy of the schema (dry run).
- Estimates data migration duration and locking impact.
- Produces a migration plan for human review (never auto-executes DDL in production).
Migration tools are always classified as Tier 3 (human-approval-gated) in the safety hierarchy.
14.11.6 Data Validation#
Data validation tools enable agents to verify data quality:
$$\operatorname{validate}(D, R) \to \{(\rho, \mathrm{pass} \mid \mathrm{fail}) : \rho \in R\}$$

where $D$ is the dataset under test and $R$ is the set of validation rules. Common rules include:
- Completeness: NULL rate per column $\leq$ a configured threshold.
- Uniqueness: Duplicate rate on candidate keys $= 0$.
- Referential integrity: All foreign keys resolve.
- Value distributions: Statistical tests for anomalous shifts (KL divergence from baseline).
- Freshness: Most recent record timestamp within expected recency window.
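The first three rule classes can be sketched over rows represented as dicts; column names and the null-rate threshold are illustrative:

```python
def completeness(rows: list, column: str, max_null_rate: float = 0.05) -> bool:
    """NULL rate on a column must not exceed the threshold."""
    nulls = sum(1 for r in rows if r.get(column) is None)
    return nulls / max(len(rows), 1) <= max_null_rate

def uniqueness(rows: list, key: str) -> bool:
    """Candidate key must have zero duplicates."""
    values = [r.get(key) for r in rows]
    return len(values) == len(set(values))

def referential_integrity(rows: list, fk: str, parent_keys: set) -> bool:
    """Every foreign key value must resolve to a parent key."""
    return all(r.get(fk) in parent_keys for r in rows)
```

Each rule returns a boolean verdict, so the agent can report per-rule pass/fail results in the structured form the validation mapping above describes.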
14.12 Communication Tools: Email, Chat, Notification, and Workflow Trigger Integrations#
14.12.1 Communication Tool Categories#
Communication tools enable agents to send, receive, and manage messages across external channels. These are inherently state-changing and externally visible, requiring strict governance.
| Category | Tools | Risk Level |
|---|---|---|
| Email | send_email, read_inbox, search_email, draft_email | High (external recipients) |
| Chat | send_message, read_channel, create_thread, react | Medium (internal channels) |
| Notification | send_notification, schedule_reminder, create_alert | Medium |
| Workflow | trigger_pipeline, create_ticket, update_status, assign_task | Medium–High |
14.12.2 Governance Framework for Communication Tools#
All communication tools that produce externally visible artifacts must pass through a multi-gate approval pipeline:
$$\operatorname{approve}(m, s) = \operatorname{content\_ok}(m) \wedge \operatorname{recipient\_ok}(m) \wedge \operatorname{rate\_ok}(m, s)$$

where $m$ is the message payload and $s$ is the agent state.
Content policy checks:
- No personally identifiable information (PII) leakage outside authorized boundaries.
- No confidential or classified information in external-facing messages.
- Tone and professionalism scoring (LLM-evaluated).
- No impersonation (messages clearly attributed to the agentic system).
Recipient policy checks:
- Internal-only recipients: auto-approved (subject to rate limits).
- External recipients: require human approval unless the recipient is on a pre-approved allowlist.
- Broadcast (all-channel): always require explicit human approval.
Rate policy checks:
- Per-recipient rate limits (e.g., max 3 emails per recipient per hour).
- Per-channel rate limits (e.g., max 10 messages per channel per hour).
- Global daily caps.
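The per-recipient limit can be sketched as a fixed-window limiter; the 3-per-hour default mirrors the example policy above, and the class and method names are assumptions:

```python
import time
from collections import defaultdict

class RecipientRateLimiter:
    """Fixed-window rate limiter keyed by recipient (in-memory sketch)."""

    def __init__(self, max_per_window: int = 3, window_s: float = 3600.0):
        self.max_per_window = max_per_window
        self.window_s = window_s
        self._sent = defaultdict(list)  # recipient -> send timestamps

    def try_acquire(self, recipient, now=None) -> bool:
        """Record and allow a send, or refuse if the window is exhausted."""
        now = time.monotonic() if now is None else now
        recent = [t for t in self._sent[recipient] if now - t < self.window_s]
        self._sent[recipient] = recent
        if len(recent) >= self.max_per_window:
            return False
        recent.append(now)
        return True
```

A refused acquire maps to the `CommunicationDeferred` outcome in Pseudo-Algorithm 14.12 rather than an error, so the agent can retry after the window rolls over.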
14.12.3 Draft-Review-Send Pattern#
For high-stakes communications, the agent produces a draft that is reviewed before sending:
Pseudo-Algorithm 14.12: Draft-Review-Send Communication
PROCEDURE SendCommunication(message_spec, comm_channel, agent_state, governance_policy):
// Step 1: Generate draft
draft ← LLM.ComposeDraft(
intent=message_spec.intent,
recipient=message_spec.recipient,
context=message_spec.context,
tone=governance_policy.required_tone,
constraints=governance_policy.content_constraints
)
// Step 2: Content policy validation
content_check ← ContentPolicy.Evaluate(draft, governance_policy)
IF content_check.has_violations:
draft ← LLM.ReviseDraft(draft, violations=content_check.violations)
content_check ← ContentPolicy.Evaluate(draft, governance_policy)
IF content_check.has_violations:
RETURN CommunicationBlocked(reason="content_policy_violation", violations=content_check.violations)
// Step 3: PII scan
pii_scan ← PIIDetector.Scan(draft.body)
IF pii_scan.found AND NOT governance_policy.AllowsPII(message_spec.recipient):
draft ← RedactPII(draft, pii_scan.findings)
// Step 4: Recipient policy
recipient_check ← RecipientPolicy.Evaluate(message_spec.recipient, agent_state)
IF recipient_check = EXTERNAL_REQUIRES_APPROVAL:
approval ← RequestHumanApproval(
type="external_communication",
draft=draft,
recipient=message_spec.recipient,
timeout=governance_policy.approval_timeout
)
IF NOT approval.granted:
RETURN CommunicationBlocked(reason="human_denied", draft=draft)
// Step 5: Rate limit check
IF NOT RateLimiter.TryAcquire(comm_channel, message_spec.recipient):
RETURN CommunicationDeferred(reason="rate_limited", retry_after=RateLimiter.RetryAfter())
// Step 6: Send
send_result ← comm_channel.Send(draft)
AuditLog.Record(action="communication_sent", draft=draft, result=send_result, agent=agent_state.agent_id)
RETURN CommunicationSuccess(message_id=send_result.id, draft=draft)
14.12.4 Workflow Trigger Tools#
Workflow triggers connect the agent to external automation systems (CI/CD pipelines, ticketing systems, orchestration platforms):
- Ticket creation: The agent creates JIRA/Linear/GitHub Issues with structured fields.
- Pipeline triggers: The agent initiates build/deploy pipelines with specified parameters.
- Status updates: The agent updates task status in project management tools.
- Escalation chains: The agent triggers PagerDuty/OpsGenie alerts for critical issues.
Each workflow trigger must be idempotent (repeated invocations with the same idempotency key produce the same result) and auditable (every trigger is logged with full context and trace ID).
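Idempotency can be sketched as a keyed result cache in front of the real trigger. The in-memory store stands in for a durable one shared across agent replicas, and all names here are illustrative:

```python
class IdempotentTrigger:
    """Deduplicate workflow triggers by idempotency key (sketch)."""

    def __init__(self, backend):
        self.backend = backend   # callable that performs the real side effect
        self._results = {}       # idempotency_key -> cached result

    def fire(self, idempotency_key: str, payload: dict):
        if idempotency_key in self._results:
            # Replay: return the original result, perform no side effect.
            return self._results[idempotency_key]
        result = self.backend(payload)
        self._results[idempotency_key] = result
        return result

# Fake ticketing backend that counts real invocations.
calls = []
trigger = IdempotentTrigger(lambda p: calls.append(p) or {"ticket": len(calls)})
```

With this shape, a saga retry or a duplicated agent step re-issues the same key and observes the cached result instead of creating a second ticket.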
14.13 Tool Ecosystem Management: Marketplace, Rating, Trust Scoring, and Community Tool Servers#
14.13.1 Tool Ecosystem Architecture#
As the number of available tools grows beyond what a single organization manages, a tool ecosystem emerges — a marketplace of tool servers contributed by internal teams, vendors, and the open-source community. Managing this ecosystem requires:
- Discovery: Agents find tools by capability, not by name.
- Trust: Tools are scored by reliability, safety, and quality.
- Versioning: Tool contracts evolve without breaking consumers.
- Governance: Policies control which tools are available to which agents.
14.13.2 Tool Registry Schema#
Each tool in the ecosystem is registered with:
ToolRegistryEntry:
id: UUID
name: string
version: SemanticVersion // e.g., "2.3.1"
provider: ProviderIdentity // Organization, team, or individual
description: string // Natural language for LLM consumption
capabilities: List<CapabilityTag> // Standardized taxonomy (e.g., "data.query.sql")
input_schema: JSONSchema
output_schema: JSONSchema
protocol: {MCP | JSON-RPC | gRPC}
endpoint: URI
auth_requirements: AuthSpec
sla: SLASpec // Latency p50/p95/p99, availability target
trust_score: Float ∈ [0, 1]
usage_count: Int
rating: Float ∈ [0, 5]
last_verified: Timestamp
deprecation_status: {ACTIVE | DEPRECATED | SUNSET}
compatibility: List<CompatibilityConstraint>
14.13.3 Trust Scoring Model#
Trust scores are computed from multiple signals:
$$\operatorname{trust}(T) = \sum_{i} w_i \, s_i(T)$$

where the scoring functions $s_i$ and their weights $w_i$ are:
| Signal | Weight | Computation |
|---|---|---|
| Reliability | 0.25 | Success rate over a sliding window of recent invocations |
| Schema compliance | 0.15 | Fraction of invocations with valid output schemas |
| Latency SLA adherence | 0.15 | Fraction of invocations within declared latency SLA |
| Security audit status | 0.15 | $s \in \{0, 0.5, 1.0\}$ for {unaudited, self-assessed, third-party audited} |
| Community rating | 0.10 | Normalized average user rating |
| Provider reputation | 0.10 | Provider-level aggregate trust score |
| Freshness | 0.05 | Recency of last successful verification |
| Usage volume | 0.05 | Log-scaled invocation count, normalized |
Trust scores decay over time if not refreshed:

$$\operatorname{trust}(T, t) = \operatorname{trust}(T) \cdot e^{-\lambda (t - t_v)}$$

where $\lambda$ is the decay rate and $t_v$ is the last verification timestamp. This incentivizes tool providers to maintain active verification.
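Putting the weight table and the decay formula together; the weights come from the table above, while the decay rate $\lambda$ and day-based time unit are illustrative assumptions:

```python
import math

WEIGHTS = {  # weights from the trust-signal table; they sum to 1.0
    "reliability": 0.25, "schema": 0.15, "latency": 0.15, "security": 0.15,
    "rating": 0.10, "provider": 0.10, "freshness": 0.05, "usage": 0.05,
}

def trust_score(signals: dict) -> float:
    """Weighted sum of per-signal scores, each in [0, 1]."""
    return sum(WEIGHTS[k] * signals[k] for k in WEIGHTS)

def decayed_trust(base: float, days_since_verified: float,
                  decay_rate: float = 0.01) -> float:
    """Exponential decay toward zero until the provider re-verifies."""
    return base * math.exp(-decay_rate * days_since_verified)
```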
14.13.4 Tool Discovery Protocol#
Tool discovery follows MCP's resource discovery pattern, extended with capability-based search:
Pseudo-Algorithm 14.13: Capability-Based Tool Discovery
PROCEDURE DiscoverTools(capability_query, constraints, agent_context):
// Step 1: Parse capability query into structured search
parsed ← ParseCapabilityQuery(capability_query)
// parsed = { capability_tags: ["data.query.sql"], constraints: { max_latency: 500ms, min_trust: 0.7 } }
// Step 2: Search registry
candidates ← ToolRegistry.Search(
capability_tags=parsed.capability_tags,
min_trust_score=constraints.min_trust OR 0.5,
status=ACTIVE,
compatible_with=agent_context.platform_version
)
// Step 3: Filter by constraints
filtered ← []
FOR tool IN candidates:
IF tool.sla.latency_p95 <= constraints.max_latency
AND tool.trust_score >= constraints.min_trust
AND agent_context.auth.HasPermission(tool.auth_requirements)
AND tool.version.SatisfiesConstraint(constraints.version_constraint):
filtered.APPEND(tool)
// Step 4: Rank by composite score
FOR tool IN filtered:
tool.discovery_score ← (
0.4 * tool.trust_score
+ 0.3 * CapabilityRelevance(tool.capabilities, parsed.capability_tags)
+ 0.2 * (1.0 - Normalize(tool.sla.latency_p95, max=constraints.max_latency))
+ 0.1 * (1.0 - Normalize(tool.cost_per_call, max=constraints.max_cost))
)
ranked ← SortDescending(filtered, key=discovery_score)
// Step 5: Return top-K with schemas
RETURN DiscoveryResult(
tools=ranked[:constraints.max_results OR 10],
total_candidates=len(candidates),
total_filtered=len(filtered)
)
14.13.5 Version Governance#
Tool versions follow semantic versioning:
- Patch ($x.y.z \to x.y.(z{+}1)$): Bug fixes, no schema changes.
- Minor ($x.y.z \to x.(y{+}1).0$): Additive schema changes (new optional fields), backward compatible.
- Major ($x.y.z \to (x{+}1).0.0$): Breaking schema changes, requires consumer migration.
The registry enforces these semantics at registration time: a backward-incompatible schema change must ship as a new major version. Agents pin to version ranges (e.g., `^2.0.0` for any `2.x.x`) and receive notifications when a major version sunset approaches.
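A minimal check of the caret-range pin (`^x.y.z` meaning same major, at least the pinned version). Real semver ranges have additional cases, such as `^0.y.z` semantics, that this sketch deliberately ignores:

```python
def parse(version: str) -> tuple:
    """Parse 'x.y.z' into an integer tuple for lexicographic comparison."""
    major, minor, patch = (int(part) for part in version.split("."))
    return (major, minor, patch)

def satisfies_caret(version: str, pin: str) -> bool:
    """True iff `version` satisfies a caret pin like '^2.0.0' (sketch)."""
    v, p = parse(version), parse(pin.lstrip("^"))
    return v[0] == p[0] and v >= p
```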
14.13.6 Community Contribution Pipeline#
Community-contributed tools follow a graduated promotion path:
| Stage | Requirements |
|---|---|
| Submitted | Schema provided, basic metadata |
| Sandbox | Passes automated schema validation, deploys in isolated sandbox |
| Beta | Passes integration tests, sustains the required trust score across 100+ invocations |
| GA | Passes security audit (self-assessed), sustains the required trust score across 1,000+ invocations |
| Verified | Third-party security audit, SLA commitment, provider identity verified |
14.13.7 Marketplace Governance and Abuse Prevention#
Governance mechanisms protect the ecosystem:
- Rate limiting per tool provider: Prevents monopolization of compute/network resources.
- Abuse detection: Monitors for tools that exfiltrate data, inject malicious payloads, or exhibit inconsistent behavior.
- Automated regression testing: Continuous invocation of registered tools with known inputs to detect behavioral drift.
- Revocation: Tools can be immediately revoked (circuit-broken globally) if critical issues are detected.
When a tool is revoked, all agents using it receive a notification and automatically fall back to the next tool in their fallback hierarchy (§14.5).
Chapter Summary and Cross-Cutting Concerns#
Architectural Invariants Across All Tool Patterns#
The following invariants must hold across every tool pattern described in this chapter:
- Typed contracts at every boundary: Every tool exposes JSON Schema–validated input/output. No untyped data flows between tools or into agent context.
- Provenance on every output: Every tool result carries its invocation ID, timestamp, source, freshness, and confidence — enabling downstream reasoning to attribute and assess.
- Human-interruptible mutation: Any state-changing tool path includes an interception point where human approval can be required based on policy.
- Idempotent operations: All mutations are keyed by idempotency tokens. Retries and saga compensations are safe to re-execute.
- Bounded execution: Every tool invocation carries an explicit deadline. No tool can run indefinitely.
- Observability: Every invocation produces structured traces consumable by the agent runtime, enabling self-diagnosis and continuous evaluation.
- Least privilege: Tools execute with caller-scoped authorization, not agent-global credentials. Permissions are the minimum required for the specific operation.
- Graceful degradation: Every critical capability has a fallback hierarchy. Total system failure requires simultaneous failure of all tiers plus human escalation timeout.
Token Budget Accounting Across Tool Operations#
The total token cost of tool use in a single agent step is:

$$C_{\text{tool}} = C_{\text{schemas}} + C_{\text{params}} + C_{\text{results}} + C_{\text{history}}$$

This must satisfy:

$$C_{\text{tool}} \leq B_{\text{tool}}$$

where $B_{\text{tool}}$ is the step's tool-token budget. When $C_{\text{tool}}$ approaches the budget, the context compiler must:
- Prune older tool results from history.
- Compress current results to higher compression levels.
- Reduce the number of tool schemas in context (lazy loading).
- Summarize invocation history rather than including verbatim records.
Reliability Equation#
The end-to-end reliability of a tool chain of $n$ sequential steps with per-step reliability $r_i$ and fallback depth $d$ (each fallback $j$ having independent reliability $r_{f,j}$) is:

$$R = \prod_{i=1}^{n} \left(1 - (1 - r_i) \prod_{j=1}^{d} (1 - r_{f,j})\right)$$

For a chain of 5 steps, each with a primary tool at $r = 0.95$ and a secondary at $r_f = 0.90$:

$$R = (1 - 0.05 \times 0.10)^5 = 0.995^5 \approx 0.975$$

Compared to single-tool reliability: $0.95^5 \approx 0.774$. Fallback hierarchies provide a roughly 26% relative reliability improvement in this configuration.
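The reliability arithmetic can be verified directly, assuming illustrative per-step reliabilities of 0.95 (primary) and 0.90 (secondary):

```python
def chain_reliability(n_steps: int, primary: float, fallbacks: list) -> float:
    """Reliability of a sequential chain where each step succeeds unless the
    primary AND every fallback tier fail (failures assumed independent)."""
    step_fail = 1.0 - primary
    for r in fallbacks:
        step_fail *= (1.0 - r)
    return (1.0 - step_fail) ** n_steps

with_fallback = chain_reliability(5, 0.95, [0.90])  # ≈ 0.975
single_tool = chain_reliability(5, 0.95, [])        # ≈ 0.774
relative_gain = with_fallback / single_tool         # ≈ 1.26
```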
This chapter establishes the complete engineering framework for advanced tool use in agentic systems. The patterns described — composition, routing, selection, transactions, fallbacks, validation, self-healing, creation, and ecosystem management — form the operational backbone of any production-grade agentic platform. Each pattern is specified with typed contracts, bounded algorithms, and measurable quality criteria, ensuring that tool use is not an ad hoc capability but a disciplined, observable, and governable infrastructure layer.