Agentic Notes Library

Chapter 14: Advanced Tool Patterns — Composition, Chaining, and Agentic Tool Use

March 20, 2026

Preamble#

Tool use elevates a language model from a stateless text generator into an actuating agent capable of observing, mutating, and reasoning over external state. Yet the difference between a demonstration-grade tool-calling agent and a production-grade agentic system lies entirely in the composition discipline: how tools are sequenced, how their outputs are routed into downstream reasoning, how failures are absorbed, how transactions maintain consistency, and how the tool surface itself evolves under agent-driven creation and community governance. This chapter provides the complete engineering treatment of these advanced patterns — formalized mathematically, specified as typed protocols, and rendered as bounded pseudo-algorithms suitable for implementation in enterprise-scale agentic platforms.

Throughout, we adopt the following foundational abstraction. A tool $\tau$ is a typed function:

$$\tau : \mathcal{I}_\tau \rightarrow \mathcal{O}_\tau \cup \{\bot\}$$

where $\mathcal{I}_\tau$ is the validated input schema, $\mathcal{O}_\tau$ is the structured output schema, and $\bot$ denotes failure (with a typed error class). A tool invocation is a tuple $(id, \tau, \mathbf{x}, t_{\text{submit}}, t_{\text{deadline}}, \text{auth}, \text{trace\_id})$ submitted through a protocol boundary (MCP, JSON-RPC, or gRPC) with an explicit deadline, caller-scoped authorization, and distributed trace identity.
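This abstraction maps directly onto code. The following minimal Python sketch (class and field names are illustrative, not taken from any specific framework) models a tool as a schema-validated callable and an invocation as the tuple above:

```python
import time
import uuid
from dataclasses import dataclass, field
from typing import Any, Callable

@dataclass
class ToolError:
    """Typed error class standing in for the failure value ⊥."""
    kind: str
    message: str

@dataclass
class Tool:
    """A typed function I_tau -> O_tau ∪ {⊥}, with schemas as predicates."""
    name: str
    input_schema: Callable[[dict], bool]
    output_schema: Callable[[Any], bool]
    fn: Callable[[dict], Any]

    def __call__(self, x: dict) -> Any:
        if not self.input_schema(x):
            return ToolError("schema", f"invalid input for {self.name}")
        out = self.fn(x)
        if not self.output_schema(out):
            return ToolError("schema", f"invalid output from {self.name}")
        return out

@dataclass
class ToolInvocation:
    """The (id, tool, x, t_submit, t_deadline, auth, trace_id) tuple."""
    tool: Tool
    x: dict
    t_deadline: float                 # absolute epoch seconds
    auth: frozenset = frozenset()     # caller-scoped permissions
    trace_id: str = ""
    id: str = field(default_factory=lambda: uuid.uuid4().hex)
    t_submit: float = field(default_factory=time.time)

# A trivial tool: validated addition.
add = Tool(
    name="add",
    input_schema=lambda x: all(isinstance(x.get(k), (int, float)) for k in ("a", "b")),
    output_schema=lambda o: isinstance(o, (int, float)),
    fn=lambda x: x["a"] + x["b"],
)
```

Invalid inputs return a `ToolError` rather than raising, which keeps failure a first-class value the chain executor can route on.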


14.1 Tool Chaining: Sequential, Conditional, and Parallel Composition Patterns#

14.1.1 Foundational Definitions#

A tool chain $\mathcal{C}$ is a directed acyclic graph (DAG) of tool invocations where edges encode data dependencies and control-flow predicates. We distinguish three canonical composition patterns:

| Pattern | Structure | Data Flow | Concurrency |
| --- | --- | --- | --- |
| Sequential | Linear pipeline $\tau_1 \to \tau_2 \to \cdots \to \tau_n$ | Output of $\tau_i$ feeds input of $\tau_{i+1}$ | None |
| Conditional | Branch node $\phi$ selects among subchains | Predicate $\phi(o_i)$ determines successor | None (selection) |
| Parallel | Fan-out set $\{\tau_a, \tau_b, \ldots\}$ with join barrier | Independent inputs; outputs merged at barrier | Full within fan-out |

Formally, a chain $\mathcal{C} = (V, E, \Phi, \mathcal{J})$ where:

  • $V = \{\tau_1, \ldots, \tau_n\}$ is the tool vertex set.
  • $E \subseteq V \times V$ is the dependency edge set.
  • $\Phi : V \to \{\text{seq}, \text{cond}, \text{par}, \text{join}\}$ assigns a composition type.
  • $\mathcal{J} : E \to (\mathcal{O}_{\text{src}} \to \mathcal{I}_{\text{dst}})$ is the junction transform mapping source outputs to destination inputs.

14.1.2 Sequential Composition#

Sequential chaining is the simplest and most common pattern. Each tool $\tau_i$ produces output $o_i$, which is transformed by $\mathcal{J}_{i \to i+1}$ into the input for $\tau_{i+1}$:

$$o_i = \tau_i(\mathbf{x}_i), \quad \mathbf{x}_{i+1} = \mathcal{J}_{i \to i+1}(o_i)$$

The total latency of a sequential chain is:

$$L_{\text{seq}} = \sum_{i=1}^{n} l_{\tau_i} + \sum_{i=1}^{n-1} l_{\mathcal{J}_{i \to i+1}}$$

where $l_{\tau_i}$ is tool execution latency and $l_{\mathcal{J}}$ is junction transform latency (typically negligible for schema mappings, non-trivial for LLM-mediated transforms). The junction sum runs only to $n-1$ because the final tool has no outgoing junction.

Pseudo-Algorithm 14.1: Sequential Chain Executor

PROCEDURE ExecuteSequentialChain(chain: List<ToolNode>, initial_input, trace_id, deadline):
    current_input ← initial_input
    results ← []
    remaining_budget ← deadline - now()
 
    FOR i ← 0 TO len(chain) - 1:
        node ← chain[i]
        tool ← node.tool
        junction ← node.junction_transform
 
        // Deadline subdivision
        estimated_remaining_latency ← SUM(chain[j].estimated_latency FOR j IN [i..len(chain)-1])
        local_deadline ← now() + (remaining_budget * chain[i].estimated_latency / estimated_remaining_latency)
 
        // Schema validation on input
        validated_input ← ValidateSchema(current_input, tool.input_schema)
        IF validated_input IS SchemaError:
            RETURN ChainFailure(step=i, error=validated_input, partial_results=results)
 
        // Invoke with deadline, auth, trace
        invocation ← ToolInvocation(
            id=GenerateUUID(),
            tool=tool,
            input=validated_input,
            deadline=local_deadline,
            auth=ScopedAuth(tool.required_permissions),
            trace_id=trace_id
        )
        result ← InvokeTool(invocation)
 
        IF result IS Failure:
            RETURN ChainFailure(step=i, error=result.error, partial_results=results)
 
        // Validate output schema
        validated_output ← ValidateSchema(result.output, tool.output_schema)
        IF validated_output IS SchemaError:
            RETURN ChainFailure(step=i, error="output_schema_violation", partial_results=results)
 
        results.APPEND(validated_output)
 
        // Apply junction transform for next step
        IF i < len(chain) - 1:
            current_input ← junction(validated_output)
            remaining_budget ← deadline - now()
            IF remaining_budget <= 0:
                RETURN ChainFailure(step=i, error="deadline_exceeded", partial_results=results)
 
    RETURN ChainSuccess(results=results, final_output=results[-1])

14.1.3 Conditional Composition#

Conditional branching introduces a predicate function $\phi : \mathcal{O}_{\tau_i} \to \{b_1, b_2, \ldots, b_k\}$ that maps the output of a predecessor tool to a branch selector. Each branch $b_j$ identifies a distinct subchain $\mathcal{C}_j$:

$$\text{next}(\tau_i, o_i) = \mathcal{C}_{\phi(o_i)}$$

Predicates may be:

  • Deterministic: schema field comparison, threshold evaluation, enum matching.
  • LLM-mediated: the agent reasons over $o_i$ and selects the branch. This introduces non-determinism and must be bounded by a selection confidence threshold $\theta_{\text{branch}}$:

$$\phi_{\text{LLM}}(o_i) = \arg\max_{b_j} P(b_j \mid o_i, \text{context}), \quad \text{subject to } \max_j P(b_j \mid o_i, \text{context}) \geq \theta_{\text{branch}}$$

If no branch exceeds $\theta_{\text{branch}}$, the chain enters a disambiguation subroutine (human escalation, additional retrieval, or retry with enriched context).
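A minimal sketch of this guarded selection, assuming the branch distribution has already been obtained (for example, from logprobs over branch labels); the threshold value is illustrative:

```python
def select_branch(branch_probs: dict, theta_branch: float = 0.7):
    """Return the argmax branch if its probability clears theta_branch;
    return None to signal that disambiguation is required."""
    best = max(branch_probs, key=branch_probs.get)
    if branch_probs[best] >= theta_branch:
        return best
    return None  # caller escalates: human review, more retrieval, or enriched retry
```

A `None` result routes the chain into the disambiguation subroutine instead of committing to a low-confidence branch.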

14.1.4 Parallel Composition#

Parallel fan-out distributes independent tool invocations across concurrent execution slots. A join barrier $\mathcal{B}$ collects outputs and merges them before the chain continues.

Let the fan-out set be $F = \{\tau_{a_1}, \tau_{a_2}, \ldots, \tau_{a_m}\}$. Each tool receives independently constructed inputs:

$$\mathbf{x}_{a_j} = \mathcal{J}_{\text{fan-out}}^{(j)}(o_{\text{predecessor}})$$

The join barrier produces:

$$o_{\text{merged}} = \mathcal{B}(o_{a_1}, o_{a_2}, \ldots, o_{a_m})$$

The latency of the parallel segment is:

$$L_{\text{par}} = \max_{j \in [1, m]} l_{\tau_{a_j}} + l_{\mathcal{B}}$$

Join barrier strategies:

  1. All-success: Wait for all $m$ results. Fail if any $\tau_{a_j}$ fails.
  2. Quorum: Wait for $\lceil q \cdot m \rceil$ successes, where $q \in (0, 1]$ is the quorum fraction.
  3. First-success: Return the first successful result; cancel remaining.
  4. Best-of-N: Wait for all, then select the highest-quality result by a scoring function $s : \mathcal{O} \to \mathbb{R}$.

Pseudo-Algorithm 14.2: Parallel Fan-Out with Quorum Join

PROCEDURE ExecuteParallelFanOut(fan_out_nodes: List<ToolNode>, predecessor_output, quorum_fraction, deadline, trace_id):
    invocations ← []
    FOR node IN fan_out_nodes:
        input_j ← node.fan_out_transform(predecessor_output)
        validated ← ValidateSchema(input_j, node.tool.input_schema)
        inv ← ToolInvocation(id=UUID(), tool=node.tool, input=validated, deadline=deadline, trace_id=trace_id)
        invocations.APPEND(inv)
 
    // Dispatch all concurrently
    futures ← DispatchConcurrent(invocations)
    required_successes ← CEIL(quorum_fraction * len(futures))
    successes ← []
    failures ← []
 
    WHILE len(successes) < required_successes AND (len(successes) + len(futures)) >= required_successes:
        completed ← AwaitAny(futures, timeout=deadline - now())
        IF completed IS None:
            BREAK  // Deadline
        IF completed.result IS Success:
            successes.APPEND(completed)
        ELSE:
            failures.APPEND(completed)
        futures.REMOVE(completed)
 
    IF len(successes) >= required_successes:
        CancelRemaining(futures)
        merged ← JoinBarrierMerge(successes)
        RETURN ParallelSuccess(merged=merged, individual=successes, failures=failures)
    ELSE:
        CancelRemaining(futures)
        RETURN ParallelFailure(successes=successes, failures=failures, quorum_not_met=TRUE)

14.1.5 Composition Algebra#

Tool chains compose algebraically. Define operators:

  • Sequence: $\mathcal{C}_1 \triangleright \mathcal{C}_2$ — execute $\mathcal{C}_1$ then $\mathcal{C}_2$, routing output through the junction.
  • Branch: $\mathcal{C}_1 \triangleright_\phi \{b_1 : \mathcal{C}_a, b_2 : \mathcal{C}_b, \ldots\}$ — conditional dispatch.
  • Parallel: $\mathcal{C}_1 \triangleright (\mathcal{C}_a \| \mathcal{C}_b \| \ldots) \triangleright_{\mathcal{B}} \mathcal{C}_2$ — fan-out, join, continue.

Any composition of these operators produces a valid DAG. The chain compiler validates:

  1. Type compatibility: $\forall (u, v) \in E : \text{range}(\mathcal{J}_{u \to v}) \subseteq \mathcal{I}_v$.
  2. Acyclicity: No directed cycle exists (bounded recursion is modeled as iteration with explicit depth counters).
  3. Deadline feasibility: $\sum_{\text{critical path}} \hat{l}_{\tau_i} \leq t_{\text{deadline}}$, where $\hat{l}$ is estimated latency.
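These three checks can be sketched directly. The compiler below is a simplified stand-in that treats schemas as field-name sets and relies on the standard library's `graphlib` for cycle detection; all structure names are illustrative:

```python
from graphlib import TopologicalSorter, CycleError

def validate_chain(edges, junction_range, input_fields, est_latency, deadline):
    """Validate the three chain-compiler invariants over a DAG.

    edges: list of (src, dst) tool-name pairs.
    junction_range[(src, dst)]: fields the junction J_{src->dst} emits.
    input_fields[dst]: fields tool dst requires (schemas as field sets).
    est_latency[v]: estimated per-tool latency; deadline: total budget."""
    preds = {}
    for src, dst in edges:
        preds.setdefault(dst, set()).add(src)
        preds.setdefault(src, set())

    # Acyclicity: graphlib raises CycleError on a directed cycle.
    try:
        order = list(TopologicalSorter(preds).static_order())
    except CycleError:
        return False, "cycle detected"

    # Type compatibility: range(J_{u->v}) must cover I_v.
    for src, dst in edges:
        if not input_fields[dst] <= junction_range[(src, dst)]:
            return False, f"junction {src}->{dst} does not satisfy {dst} inputs"

    # Deadline feasibility: longest (critical) path through the DAG.
    finish = {}
    for v in order:  # predecessors precede v in topological order
        finish[v] = est_latency[v] + max((finish[p] for p in preds[v]), default=0.0)
    if max(finish.values()) > deadline:
        return False, "critical path exceeds deadline"
    return True, "ok"
```

The topological order doubles as the evaluation order for the longest-path dynamic program, so one traversal serves both checks 2 and 3.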

14.2 Tool Output Routing: Feeding Tool Results as Context to Subsequent Reasoning Steps#

14.2.1 The Output Routing Problem#

When a tool $\tau_i$ returns output $o_i$, the agentic system must decide where and how that output enters the reasoning pipeline. The output may serve as:

  1. Direct input to the next tool (junction transform, §14.1).
  2. Context injection into the LLM's next reasoning step.
  3. Memory write to working, session, or episodic memory.
  4. Observation record for verification or audit.
  5. Discard if the output has been fully consumed by an earlier transform.

This routing decision is itself a function:

$$\mathcal{R} : \mathcal{O}_\tau \times \mathcal{S}_{\text{agent}} \to \mathcal{P}(\{\text{tool\_input}, \text{context}, \text{memory}, \text{observation}, \text{discard}\})$$

where $\mathcal{S}_{\text{agent}}$ is the current agent state and $\mathcal{P}$ denotes the power set (multiple destinations are common).

14.2.2 Context Injection Strategies#

When tool output is routed into the LLM context window, the key trade-off is information density vs. token cost. Raw tool outputs (e.g., a full database result set) may consume thousands of tokens while contributing marginal reasoning value.

Compression hierarchy for tool outputs entering context:

| Level | Method | Compression Ratio | Information Loss |
| --- | --- | --- | --- |
| L0 | Raw output verbatim | $1:1$ | None |
| L1 | Schema-filtered projection (select relevant fields) | $2\text{–}10\times$ | Low (structural) |
| L2 | Summarization by secondary LLM call | $10\text{–}100\times$ | Moderate (semantic) |
| L3 | Key-value extraction (facts only) | $20\text{–}200\times$ | Moderate–High |
| L4 | Boolean/scalar signal ("found"/"not found", count) | $100\text{–}1000\times$ | High |

The optimal level $\ell^*$ is selected by:

$$\ell^* = \arg\min_{\ell} \left[ \alpha \cdot \text{TokenCost}(\ell) + \beta \cdot \text{InfoLoss}(\ell) + \gamma \cdot \text{Latency}(\ell) \right]$$

where $\alpha, \beta, \gamma$ are task-specific weights. For high-stakes reasoning (medical, financial), $\beta$ dominates; for high-throughput batch processing, $\alpha$ dominates.
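The argmin is a straightforward weighted minimization once per-level estimates exist. In the sketch below the (token cost, information loss, latency) triples are invented placeholder estimates; a real system would measure them per tool:

```python
def select_compression_level(levels, alpha, beta, gamma):
    """Pick the level minimizing alpha*TokenCost + beta*InfoLoss + gamma*Latency.
    `levels` maps level name -> (token_cost, info_loss, latency)."""
    def cost(name):
        tokens, loss, latency = levels[name]
        return alpha * tokens + beta * loss + gamma * latency
    return min(levels, key=cost)

# Placeholder estimates for one hypothetical tool output.
levels = {
    "L0": (4000, 0.00, 0.0),   # raw verbatim
    "L1": (800,  0.05, 0.0),   # schema-filtered projection
    "L2": (150,  0.20, 1.2),   # secondary-LLM summary (extra latency)
    "L4": (10,   0.70, 0.0),   # scalar signal
}
```

With these placeholder numbers, a heavily weighted $\beta$ (high-stakes) selects L0, while a dominant $\alpha$ (throughput) selects L4.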

14.2.3 Structured Output Placement in the Context Window#

Tool outputs should be placed in the context window using structured delimiters and provenance tags:

<tool_result tool_id="τ_3" invocation_id="inv-8a2f" timestamp="2025-01-15T10:32:00Z"
             source="database/orders" freshness="live" confidence="verified">
  { "order_count": 1247, "total_revenue": 89340.50, "currency": "USD" }
</tool_result>

This enables the LLM to:

  • Distinguish tool-provided facts from its own prior knowledge (hallucination control).
  • Attribute claims to specific tool invocations (provenance).
  • Assess freshness and confidence for downstream reasoning.
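A small helper can produce this envelope mechanically. The function below is a sketch mirroring the attribute set shown above; the `freshness` and `confidence` vocabularies are illustrative defaults, not a fixed standard:

```python
import json
from datetime import datetime, timezone

def tag_tool_result(payload: dict, tool_id: str, invocation_id: str,
                    source: str, freshness: str = "live",
                    confidence: str = "verified") -> str:
    """Wrap a tool output in the provenance-tagged delimiter format so the
    LLM can attribute and trust-weight the enclosed facts."""
    ts = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    attrs = (f'tool_id="{tool_id}" invocation_id="{invocation_id}" '
             f'timestamp="{ts}" source="{source}" '
             f'freshness="{freshness}" confidence="{confidence}"')
    return f"<tool_result {attrs}>\n  {json.dumps(payload)}\n</tool_result>"
```

Serializing the payload as JSON inside the tagged block keeps the fact boundary unambiguous even when the payload itself contains prose.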

14.2.4 Output Routing Decision Algorithm#

Pseudo-Algorithm 14.3: Tool Output Router

PROCEDURE RouteToolOutput(output, tool_metadata, agent_state, chain_plan, token_budget):
    destinations ← {}
 
    // 1. Check if next tool in chain needs this output
    IF chain_plan.HasSuccessor(tool_metadata.step_id):
        successor ← chain_plan.GetSuccessor(tool_metadata.step_id)
        junction ← chain_plan.GetJunction(tool_metadata.step_id, successor.step_id)
        transformed ← junction.Transform(output)
        destinations.ADD("tool_input", transformed)
 
    // 2. Determine context injection need
    IF agent_state.RequiresReasoningOverOutput(tool_metadata):
        compression_level ← SelectCompressionLevel(
            output_size=TokenCount(output),
            remaining_budget=token_budget.remaining,
            task_criticality=agent_state.task.criticality,
            output_schema=tool_metadata.output_schema
        )
        compressed ← CompressOutput(output, compression_level, tool_metadata)
        provenance_tagged ← AttachProvenance(compressed, tool_metadata)
        destinations.ADD("context", provenance_tagged)
 
    // 3. Memory write evaluation
    IF MemoryPolicy.ShouldPersist(output, tool_metadata, agent_state):
        memory_record ← ConstructMemoryRecord(
            content=ExtractPersistableContent(output),
            provenance=tool_metadata,
            expiry=MemoryPolicy.ComputeExpiry(tool_metadata),
            layer=MemoryPolicy.SelectLayer(output, agent_state)
        )
        IF NOT MemoryStore.IsDuplicate(memory_record):
            destinations.ADD("memory", memory_record)
 
    // 4. Observation log (always, for audit)
    destinations.ADD("observation", ObservationRecord(output, tool_metadata, timestamp=now()))
 
    RETURN destinations

14.2.5 Token Budget Accounting#

Every tool output routed into context must be deducted from the active token budget $B_{\text{active}}$. Let $C_{\text{window}}$ be the total context window capacity. Then:

$$B_{\text{active}} = C_{\text{window}} - T_{\text{system}} - T_{\text{history}} - T_{\text{memory}} - T_{\text{reserved\_generation}}$$

where $T_{\text{system}}$ is the system prompt / role policy cost, $T_{\text{history}}$ is conversation history, $T_{\text{memory}}$ is injected memory summaries, and $T_{\text{reserved\_generation}}$ is the minimum generation budget (typically 2048–4096 tokens). A tool output of size $t_o$ tokens is admitted into context only if:

$$t_o \leq B_{\text{active}} - T_{\text{safety\_margin}}$$

If $t_o$ exceeds this bound, the output is either compressed to a lower level $\ell$ or offloaded to external memory with a pointer summary injected into context.
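The accounting reduces to a few fields and one inequality. A minimal sketch, with the reserved-generation and safety-margin defaults chosen arbitrarily within the ranges mentioned above:

```python
from dataclasses import dataclass

@dataclass
class TokenBudget:
    window: int                      # C_window
    system: int                      # T_system
    history: int                     # T_history
    memory: int                      # T_memory
    reserved_generation: int = 3072  # T_reserved_generation
    safety_margin: int = 256         # T_safety_margin

    @property
    def active(self) -> int:
        """B_active = C_window - T_system - T_history - T_memory - T_reserved."""
        return (self.window - self.system - self.history
                - self.memory - self.reserved_generation)

    def admit(self, output_tokens: int) -> bool:
        """True iff the tool output fits under B_active minus the margin."""
        return output_tokens <= self.active - self.safety_margin
```

A rejected output triggers the compress-or-offload branch rather than blind truncation.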


14.3 Tool Selection Strategies: LLM-Driven, Rule-Based, Policy-Gated, and Learned Tool Routing#

14.3.1 The Tool Selection Problem#

Given a set of available tools $\mathcal{T} = \{\tau_1, \tau_2, \ldots, \tau_N\}$ and a current agent state $s \in \mathcal{S}$, the tool selection problem is to choose a subset $A \subseteq \mathcal{T}$ (possibly a singleton) and construct the invocation parameters $\mathbf{x}_\tau$ for each selected tool:

$$\pi(s) = \{(\tau, \mathbf{x}_\tau) \mid \tau \in A \subseteq \mathcal{T}\}$$

This is a policy $\pi$ mapping states to tool invocations. The quality of the policy is evaluated over a distribution of tasks:

$$Q(\pi) = \mathbb{E}_{s \sim \mathcal{D}} \left[ \sum_{t=0}^{T} \gamma^t \cdot r(s_t, \pi(s_t)) \right]$$

where $r(s_t, \pi(s_t))$ is the reward at step $t$, $\gamma \in (0, 1]$ is the discount factor, and $T$ is the episode horizon.

14.3.2 Strategy Taxonomy#

A. LLM-Driven Selection (Native Function Calling)#

The LLM receives tool descriptions in its context and generates a structured tool call as part of its output:

$$\tau^*, \mathbf{x}^* = \text{LLM}(\text{context} \,\|\, \text{ToolSchemas}(\mathcal{T}))$$

Advantages: Flexible, handles novel tool combinations, benefits from world knowledge. Risks: Hallucinated tool names, invalid parameters, suboptimal selection under large $|\mathcal{T}|$.

Token cost scaling: If each tool schema consumes $\bar{t}$ tokens, the total schema injection cost is $|\mathcal{T}| \cdot \bar{t}$. For $|\mathcal{T}| > 50$, this can consume a significant fraction of the context window.

Mitigation — Lazy Tool Loading: Only inject schemas for tools relevant to the current task phase, determined by a lightweight pre-classifier:

$$\mathcal{T}_{\text{active}} = \text{TopK}\left(\{\tau \in \mathcal{T} \mid \text{relevance}(\tau, s) \geq \theta_{\text{tool}}\}, k\right)$$

where $\text{relevance}(\tau, s)$ may be computed by embedding similarity between the task description and tool descriptions, or by a trained classifier.
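A dependency-free sketch of the relevance filter, using cosine similarity over precomputed description embeddings; the two-dimensional vectors in the test below are toy placeholders for real encoder outputs:

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 for zero vectors)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def active_tools(task_emb, tool_embs: dict, theta: float = 0.3, k: int = 5):
    """T_active: tools whose description embedding clears theta,
    kept in descending relevance order and truncated to top-k."""
    scored = [(name, cosine(task_emb, emb)) for name, emb in tool_embs.items()]
    eligible = [(n, s) for n, s in scored if s >= theta]
    eligible.sort(key=lambda pair: pair[1], reverse=True)
    return [name for name, _ in eligible[:k]]
```

Only the surviving tools' schemas are injected, capping schema cost at $k \cdot \bar{t}$ instead of $|\mathcal{T}| \cdot \bar{t}$.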

B. Rule-Based Selection#

Deterministic rules map state patterns to tool choices:

$$\pi_{\text{rule}}(s) = \begin{cases} \tau_{\text{search}} & \text{if } s.\text{intent} = \text{LOOKUP} \\ \tau_{\text{calculate}} & \text{if } s.\text{intent} = \text{COMPUTE} \\ \tau_{\text{write\_file}} & \text{if } s.\text{intent} = \text{PERSIST} \\ \varnothing & \text{otherwise} \end{cases}$$

Advantages: Deterministic, auditable, zero LLM cost for selection. Risks: Brittle under novel tasks, requires manual maintenance as the tool set evolves.

C. Policy-Gated Selection#

A policy layer intercepts LLM-proposed tool calls and applies authorization, safety, and cost constraints before execution:

$$\text{Gate}(\tau, \mathbf{x}, s) = \begin{cases} \text{ALLOW} & \text{if } \text{Auth}(\tau, s.\text{caller}) \wedge \text{Safe}(\tau, \mathbf{x}) \wedge \text{Budget}(\tau, s) \\ \text{REQUIRE\_APPROVAL} & \text{if } \tau \in \mathcal{T}_{\text{sensitive}} \\ \text{DENY} & \text{otherwise} \end{cases}$$

The gate enforces:

  • Authorization: Caller-scoped permissions, not agent-owned credentials.
  • Safety: Input sanitization, dangerous operation detection (e.g., DROP TABLE).
  • Budget: Cost ceiling per invocation and cumulative per session.
  • Rate limiting: Per-tool invocation rate within sliding windows.
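A compact sketch of the gate's case analysis. The permission model, cost fields, and dangerous-pattern list are all illustrative; a production gate would use a real policy engine and parser-level sanitization rather than substring matching:

```python
def gate(tool, params, caller_perms, sensitive, spent, ceiling,
         dangerous=("DROP TABLE", "DELETE FROM", "rm -rf")):
    """Return ALLOW / REQUIRE_APPROVAL / DENY for a proposed invocation.
    Sensitive tools require approval even when all three checks pass."""
    authorized = tool["required_permission"] in caller_perms  # caller-scoped
    text = " ".join(str(v) for v in params.values()).upper()
    safe = not any(p.upper() in text for p in dangerous)
    within_budget = spent + tool["cost"] <= ceiling
    if authorized and safe and within_budget:
        return "REQUIRE_APPROVAL" if tool["name"] in sensitive else "ALLOW"
    return "DENY"
```

Note the design choice: failing any of the three checks short-circuits to DENY before the sensitivity test, so approval can never override a missing permission.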

D. Learned Tool Routing#

A trained router model $f_\theta$ maps task embeddings to tool selection distributions:

$$P(\tau \mid s) = \text{softmax}(f_\theta(\text{Embed}(s)))$$

Training data is derived from successful traces:

$$\mathcal{D}_{\text{train}} = \{(s_t, \tau_t^*) \mid \text{trace } t \text{ was successful}\}$$

The loss function is cross-entropy:

$$\mathcal{L}(\theta) = -\sum_{(s, \tau^*) \in \mathcal{D}_{\text{train}}} \log P_\theta(\tau^* \mid s)$$

Advantages: Adapts to usage patterns, fast inference, low token cost. Risks: Cold-start problem, distribution shift as tools are added/removed.

14.3.3 Hybrid Selection Architecture#

Production systems combine all four strategies in a layered architecture:

User Query → Intent Classifier → Learned Router (candidate set)
                                        ↓
                                 LLM Selection (from candidate set)
                                        ↓
                                 Policy Gate (auth, safety, budget)
                                        ↓
                                 Rule Override (domain-specific hardcoded rules)
                                        ↓
                                 Approved Invocation → Executor

Pseudo-Algorithm 14.4: Hybrid Tool Selector

PROCEDURE SelectTool(agent_state, available_tools, token_budget, policy):
    // Phase 1: Learned routing (fast pre-filter)
    relevance_scores ← LearnedRouter.Score(agent_state.task_embedding, available_tools)
    candidate_set ← TopK(available_tools, relevance_scores, k=min(10, len(available_tools)))
 
    // Phase 2: Rule-based override
    rule_forced ← RuleEngine.Evaluate(agent_state)
    IF rule_forced IS NOT NULL:
        candidate_set ← {rule_forced} ∪ candidate_set[:2]  // Keep rule choice primary
 
    // Phase 3: LLM-driven selection from candidate set
    tool_schemas ← [tool.schema FOR tool IN candidate_set]
    schema_tokens ← SUM(TokenCount(s) FOR s IN tool_schemas)
    IF schema_tokens > token_budget.tool_schema_allocation:
        avg_schema_tokens ← schema_tokens / len(candidate_set)
        candidate_set ← candidate_set[:FLOOR(token_budget.tool_schema_allocation / avg_schema_tokens)]
        tool_schemas ← [tool.schema FOR tool IN candidate_set]
 
    llm_selection ← LLM.SelectTool(
        context=agent_state.context,
        tool_schemas=tool_schemas,
        instruction="Select the most appropriate tool and construct valid parameters."
    )
 
    // Phase 4: Policy gate
    gate_result ← policy.Evaluate(llm_selection.tool, llm_selection.params, agent_state)
    SWITCH gate_result:
        CASE ALLOW:
            RETURN ApprovedInvocation(llm_selection.tool, llm_selection.params)
        CASE REQUIRE_APPROVAL:
            approval ← RequestHumanApproval(llm_selection, timeout=policy.approval_timeout)
            IF approval.granted:
                RETURN ApprovedInvocation(llm_selection.tool, llm_selection.params)
            ELSE:
                RETURN Denied(reason=approval.reason)
        CASE DENY:
            RETURN Denied(reason=gate_result.reason)

14.3.4 Tool Selection Under Large Catalogs#

When $|\mathcal{T}|$ exceeds ~50 tools, neither full schema injection nor flat relevance scoring scales. A hierarchical tool index partitions tools into categories $\mathcal{G}_1, \ldots, \mathcal{G}_m$, where $\bigcup_j \mathcal{G}_j = \mathcal{T}$ and $\mathcal{G}_i \cap \mathcal{G}_j = \varnothing$ for $i \neq j$. Selection proceeds in two phases:

  1. Category selection: $g^* = \arg\max_j \text{relevance}(\mathcal{G}_j, s)$ — low cost, operates over category descriptions.
  2. Tool selection within category: Standard hybrid selection over $\mathcal{G}_{g^*}$.

This reduces schema injection cost from $O(|\mathcal{T}|)$ to $O(|\mathcal{G}_{g^*}|)$ per step.
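The two-phase lookup can be sketched with any relevance scorer; here a naive term-overlap score stands in for the embedding or classifier scorer of a real system, and the catalog layout is hypothetical:

```python
def select_hierarchical(task_terms: set, catalog: dict):
    """Two-phase selection over a partitioned tool catalog.
    catalog: category -> {"desc_terms": set, "tools": {tool_name: desc_terms}}."""
    def overlap(terms):
        return len(task_terms & terms)

    # Phase 1: pick the category, cheaply, over category descriptions only.
    g_star = max(catalog, key=lambda g: overlap(catalog[g]["desc_terms"]))

    # Phase 2: standard selection restricted to the chosen category.
    tools = catalog[g_star]["tools"]
    tool = max(tools, key=lambda t: overlap(tools[t]))
    return g_star, tool
```

Only phase 2 ever touches individual tool schemas, which is where the $O(|\mathcal{G}_{g^*}|)$ bound comes from.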


14.4 Multi-Tool Transactions: Compensation, Rollback, and Saga Patterns for Tool Chains#

14.4.1 Transactional Semantics for Tool Chains#

Unlike database transactions backed by ACID guarantees, tool chains span heterogeneous systems (APIs, file systems, databases, external services) where global atomicity is unavailable. We therefore adopt the Saga pattern: a sequence of local transactions $T_1, T_2, \ldots, T_n$, each with a compensating action $C_i$ that semantically undoes the effect of $T_i$.

Definition: A tool saga $\mathcal{S}$ is a pair of sequences:

$$\mathcal{S} = \langle (T_1, T_2, \ldots, T_n), (C_1, C_2, \ldots, C_n) \rangle$$

where $T_i$ is the forward tool invocation and $C_i$ is its compensating counterpart. If $T_k$ fails after $T_1, \ldots, T_{k-1}$ have succeeded, the saga executes compensations in reverse order:

$$C_{k-1}, C_{k-2}, \ldots, C_1$$

14.4.2 Compensation Design Constraints#

Not all tool operations are reversible. We classify tools by their compensation capability:

| Class | Compensation | Example |
| --- | --- | --- |
| Fully reversible | $C_i$ exactly undoes $T_i$ | File write → file delete |
| Semantically reversible | $C_i$ achieves approximate undo | API order creation → order cancellation |
| Compensable with side effects | $C_i$ undoes primary effect but leaves traces | Email sent → follow-up retraction email |
| Irreversible | No $C_i$ exists | Published tweet, physical actuation |

For irreversible tools, the saga must place them last in the chain (after all fallible steps) or require pre-commitment approval:

$$\text{Position constraint}: \forall\, \tau_i \in \mathcal{T}_{\text{irreversible}},\ \nexists\, \tau_j \text{ with } j > i \text{ such that } \tau_j \text{ is fallible}$$

14.4.3 Saga Orchestrator#

Pseudo-Algorithm 14.5: Saga Orchestrator with Compensation

PROCEDURE ExecuteSaga(saga: SagaDefinition, initial_input, trace_id):
    completed_steps ← []
    current_input ← initial_input
 
    FOR i ← 0 TO len(saga.forward_steps) - 1:
        step ← saga.forward_steps[i]
 
        // Pre-flight check for irreversible steps
        IF step.tool.compensation_class = IRREVERSIBLE:
            IF i < len(saga.forward_steps) - 1:
                LOG.WARN("Irreversible step not at end of saga; requesting approval")
                approval ← RequestHumanApproval(step, context=current_input)
                IF NOT approval.granted:
                    ExecuteCompensation(completed_steps)
                    RETURN SagaAborted(reason="irreversible_step_denied", compensated=TRUE)
 
        // Execute forward step
        result ← InvokeToolWithRetry(
            tool=step.tool,
            input=current_input,
            retry_budget=step.retry_budget,
            trace_id=trace_id
        )
 
        IF result IS Failure:
            LOG.ERROR("Saga step failed", step=i, error=result.error)
            compensation_result ← ExecuteCompensation(completed_steps)
            RETURN SagaFailed(
                failed_step=i,
                error=result.error,
                compensation_result=compensation_result,
                partial_results=[s.result FOR s IN completed_steps]
            )
 
        completed_steps.APPEND(CompletedStep(
            index=i,
            tool=step.tool,
            input=current_input,
            result=result.output,
            compensation=step.compensation
        ))
        current_input ← step.junction_transform(result.output)
 
    RETURN SagaSuccess(results=[s.result FOR s IN completed_steps])
 
 
PROCEDURE ExecuteCompensation(completed_steps: List<CompletedStep>):
    compensation_results ← []
    // Reverse order
    FOR step IN REVERSED(completed_steps):
        IF step.compensation IS NOT NULL:
            comp_result ← InvokeToolWithRetry(
                tool=step.compensation.tool,
                input=step.compensation.construct_input(step.result, step.input),
                retry_budget=step.compensation.retry_budget
            )
            compensation_results.APPEND(CompensationOutcome(step=step.index, result=comp_result))
            IF comp_result IS Failure:
                LOG.CRITICAL("Compensation failed — manual intervention required",
                             step=step.index, error=comp_result.error)
                AlertOnCall("saga_compensation_failure", step=step.index)
        ELSE:
            LOG.WARN("No compensation defined for step", step=step.index)
 
    RETURN compensation_results

14.4.4 Idempotency Requirements#

Every forward step $T_i$ and compensation $C_i$ must be idempotent:

$$T_i(T_i(\mathbf{x})) \equiv T_i(\mathbf{x}), \quad C_i(C_i(\mathbf{x})) \equiv C_i(\mathbf{x})$$

Implementation techniques:

  • Idempotency keys: Each invocation carries a unique key; the tool server deduplicates.
  • Conditional mutations: Use version vectors or ETags; the operation succeeds only if the precondition matches.
  • Upsert semantics: Prefer INSERT ... ON CONFLICT DO UPDATE over a blind INSERT.
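The idempotency-key technique is the simplest of the three to sketch: the server caches results by key, so a retried forward step or compensation replays the recorded outcome instead of re-executing the side effect. Class and field names below are illustrative:

```python
class IdempotentToolServer:
    """Deduplicate invocations by idempotency key: a replayed key returns
    the cached result and triggers no new side effect."""
    def __init__(self, fn):
        self.fn = fn
        self._seen = {}       # idempotency_key -> cached result
        self.executions = 0   # side-effect counter, for illustration only

    def invoke(self, idempotency_key: str, payload: dict):
        if idempotency_key in self._seen:
            return self._seen[idempotency_key]  # replay: no re-execution
        self.executions += 1
        result = self.fn(payload)
        self._seen[idempotency_key] = result
        return result
```

With keys in place, the saga orchestrator's retry loop can safely re-send any step after a network timeout without risking double execution.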

14.4.5 Saga State Machine#

A saga's lifecycle is a finite state machine:

$$\text{States} = \{\text{PENDING}, \text{RUNNING}, \text{COMPENSATING}, \text{COMPLETED}, \text{FAILED}, \text{COMPENSATION\_FAILED}\}$$

Transitions:

$$\text{PENDING} \xrightarrow{\text{start}} \text{RUNNING} \xrightarrow{\text{all\_succeed}} \text{COMPLETED}$$

$$\text{RUNNING} \xrightarrow{\text{step\_fails}} \text{COMPENSATING} \xrightarrow{\text{all\_compensated}} \text{FAILED}$$

$$\text{COMPENSATING} \xrightarrow{\text{comp\_fails}} \text{COMPENSATION\_FAILED}$$

The COMPENSATION_FAILED state triggers manual intervention via alerting infrastructure.
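The lifecycle can be enforced with an explicit transition table, so an orchestrator bug cannot drive a saga through an illegal edge (for example, completing a saga that is mid-compensation). A minimal sketch:

```python
TRANSITIONS = {
    ("PENDING", "start"): "RUNNING",
    ("RUNNING", "all_succeed"): "COMPLETED",
    ("RUNNING", "step_fails"): "COMPENSATING",
    ("COMPENSATING", "all_compensated"): "FAILED",
    ("COMPENSATING", "comp_fails"): "COMPENSATION_FAILED",
}

class SagaStateMachine:
    """Saga lifecycle FSM; any (state, event) pair absent from the table
    is rejected rather than silently coerced."""
    def __init__(self):
        self.state = "PENDING"

    def fire(self, event: str) -> str:
        key = (self.state, event)
        if key not in TRANSITIONS:
            raise ValueError(f"illegal transition {key}")
        self.state = TRANSITIONS[key]
        return self.state
```

Terminal states (COMPLETED, FAILED, COMPENSATION_FAILED) have no outgoing entries, so any further event raises.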


14.5 Tool Fallback Hierarchies: Primary → Secondary → Degraded → Manual Escalation#

14.5.1 Motivation and Architecture#

External tools are inherently unreliable: APIs have outages, rate limits are exhausted, and services degrade. A production agentic system must define fallback hierarchies for every critical tool capability so that agent execution degrades gracefully rather than failing catastrophically.

A fallback hierarchy for a capability $\kappa$ is an ordered list of tool implementations:

$$\mathcal{H}_\kappa = [\tau_{\kappa}^{(1)}, \tau_{\kappa}^{(2)}, \ldots, \tau_{\kappa}^{(d)}, \tau_{\kappa}^{(\text{manual})}]$$

where superscripts denote priority (1 = primary), and $\tau_{\kappa}^{(\text{manual})}$ is the human escalation sentinel.

Each tool in the hierarchy is annotated with:

| Property | Type | Description |
| --- | --- | --- |
| priority | $\mathbb{Z}^+$ | Lower is preferred |
| latency_p99 | Duration | Expected worst-case latency |
| cost_per_call | Float | Monetary cost |
| accuracy | Float $\in [0, 1]$ | Expected output correctness |
| availability | Float $\in [0, 1]$ | Historical uptime |
| circuit_state | $\{\text{CLOSED}, \text{OPEN}, \text{HALF\_OPEN}\}$ | Current circuit breaker state |

14.5.2 Fallback Selection Criteria#

The selection function $\sigma$ evaluates candidates in priority order, skipping those with open circuits or exceeded rate limits:

$$\sigma(\mathcal{H}_\kappa) = \min_{i} \left\{ i \mid \tau_{\kappa}^{(i)}.\text{circuit} \neq \text{OPEN} \wedge \tau_{\kappa}^{(i)}.\text{rate\_remaining} > 0 \right\}$$

If no automated tool is available:

$$\sigma(\mathcal{H}_\kappa) = (\text{manual}) \quad \text{if } \forall i \in [1, d] : \tau_{\kappa}^{(i)} \text{ is unavailable}$$

14.5.3 Circuit Breaker Integration#

Each tool's circuit breaker tracks failure rates within sliding windows:

$$\text{failure\_rate}(\tau, w) = \frac{|\{t \in [\text{now} - w, \text{now}] : \text{invocation}(t) \text{ failed}\}|}{|\{t \in [\text{now} - w, \text{now}] : \text{invocation}(t)\}|}$$

State transitions:

$$\text{CLOSED} \xrightarrow{\text{failure\_rate} > \theta_{\text{open}}} \text{OPEN} \xrightarrow{\text{cooldown expired}} \text{HALF\_OPEN} \xrightarrow{\text{probe succeeds}} \text{CLOSED}$$

$$\text{HALF\_OPEN} \xrightarrow{\text{probe fails}} \text{OPEN}$$
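A sliding-window breaker implementing these transitions can be sketched as follows; the threshold, window, and cooldown defaults are illustrative, and the clock is injectable so the state machine can be tested without sleeping:

```python
import time

class CircuitBreaker:
    """Sliding-window circuit breaker with CLOSED / OPEN / HALF_OPEN states."""
    def __init__(self, theta_open=0.5, window=60.0, cooldown=30.0, now=time.monotonic):
        self.theta_open = theta_open
        self.window = window
        self.cooldown = cooldown
        self.now = now
        self.events = []      # (timestamp, succeeded) within the window
        self.state = "CLOSED"
        self.opened_at = None

    def failure_rate(self) -> float:
        if not self.events:
            return 0.0
        failures = sum(1 for _, ok in self.events if not ok)
        return failures / len(self.events)

    def record(self, succeeded: bool) -> None:
        t = self.now()
        # Evict events that have slid out of the window.
        self.events = [(ts, ok) for ts, ok in self.events if ts >= t - self.window]
        self.events.append((t, succeeded))
        if self.state == "HALF_OPEN":
            # Probe outcome decides: success closes, failure reopens.
            self.state = "CLOSED" if succeeded else "OPEN"
            self.opened_at = None if succeeded else t
        elif self.state == "CLOSED" and self.failure_rate() > self.theta_open:
            self.state, self.opened_at = "OPEN", t

    def allow_request(self) -> bool:
        if self.state == "OPEN" and self.now() - self.opened_at >= self.cooldown:
            self.state = "HALF_OPEN"  # admit a single probe after cooldown
        return self.state != "OPEN"
```

The fallback executor from Pseudo-Algorithm 14.6 consults `allow_request()` before each invocation and calls `record()` with the outcome.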

Pseudo-Algorithm 14.6: Fallback Hierarchy Executor

PROCEDURE ExecuteWithFallback(capability, input, hierarchy, trace_id, deadline):
    attempted ← []
 
    FOR level ← 0 TO len(hierarchy) - 1:
        tool ← hierarchy[level]
 
        // Skip if circuit open
        IF tool.circuit_breaker.state = OPEN:
            attempted.APPEND(Skipped(tool, reason="circuit_open"))
            CONTINUE
 
        // Skip if rate limit exhausted
        IF tool.rate_limiter.remaining = 0:
            attempted.APPEND(Skipped(tool, reason="rate_limited"))
            CONTINUE
 
        // Check deadline feasibility
        IF now() + tool.latency_p99_estimate > deadline:
            attempted.APPEND(Skipped(tool, reason="deadline_infeasible"))
            CONTINUE
 
        result ← InvokeTool(tool, input, deadline=deadline, trace_id=trace_id)
 
        IF result IS Success:
            EmitMetric("tool_fallback_level", level, capability=capability)
            RETURN FallbackSuccess(result=result.output, level=level, attempted=attempted)
        ELSE:
            tool.circuit_breaker.RecordFailure()
            attempted.APPEND(Failed(tool, error=result.error))
 
    // All automated tools exhausted — manual escalation
    IF hierarchy.manual_escalation_enabled:
        ticket ← CreateEscalationTicket(
            capability=capability,
            input=input,
            attempted=attempted,
            trace_id=trace_id,
            urgency=ComputeUrgency(deadline)
        )
        RETURN ManualEscalation(ticket=ticket, attempted=attempted)
    ELSE:
        RETURN FallbackExhausted(attempted=attempted)

14.5.4 Degradation Modes#

When falling back from primary to lower-tier tools, the agent must adjust its expectations:

  • Accuracy degradation: Lower-tier tools may provide approximate results. The agent should annotate downstream reasoning with confidence decrements.
  • Latency degradation: Synchronous fallbacks increase total chain latency; the agent should re-evaluate deadline feasibility.
  • Feature degradation: Secondary tools may lack capabilities (e.g., pagination, filtering). The agent must compensate with post-processing.
  • Manual escalation: The agent loop pauses at a human gate, persists state, and resumes upon human response. This requires durable state serialization and session resumption protocols.

14.6 Tool Result Validation: Schema Conformance, Sanity Checks, Cross-Tool Consistency Verification#

14.6.1 Validation Layers#

Tool outputs cannot be trusted blindly. A production-grade validation pipeline applies three layers:

  1. Schema conformance — structural correctness.
  2. Semantic sanity checks — value-level plausibility.
  3. Cross-tool consistency — agreement across corroborating sources.

14.6.2 Schema Conformance Validation#

Every tool declares an output schema $\mathcal{O}_\tau$ (JSON Schema, Protobuf message, or equivalent). Validation checks:

  • Required fields present.
  • Types correct (string, integer, array, nested object).
  • Value constraints met (ranges, regex patterns, enum membership).
  • Array bounds respected (`minItems`, `maxItems`).

$$\text{SchemaValid}(o, \mathcal{O}_\tau) = \begin{cases} \text{TRUE} & \text{if } o \text{ conforms to } \mathcal{O}_\tau \\ \text{FALSE} & \text{otherwise} \end{cases}$$

Non-conforming outputs are rejected, and the invocation is treated as a failure (triggering retry or fallback).
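A minimal illustration of the first two checks (required fields and primitive types) using only the standard library. This is a simplified stand-in, not a full JSON Schema validator, which a production deployment would use instead; the `TYPE_MAP` and flat-object assumption are illustrative.

```python
# Simplified structural check over a flat object schema with
# 'required' and 'properties' keys; a real deployment would use
# a complete JSON Schema validator instead.
TYPE_MAP = {"string": str, "integer": int, "number": (int, float),
            "boolean": bool, "array": list, "object": dict}

def schema_valid(output, schema):
    """Return (ok, errors) for required-field, type, and enum checks."""
    errors = []
    for field in schema.get("required", []):
        if field not in output:
            errors.append(f"missing required field: {field}")
    for field, spec in schema.get("properties", {}).items():
        if field not in output:
            continue
        expected = TYPE_MAP.get(spec.get("type"))
        if expected and not isinstance(output[field], expected):
            errors.append(f"wrong type for {field}: expected {spec['type']}")
        if "enum" in spec and output[field] not in spec["enum"]:
            errors.append(f"{field} not in enum {spec['enum']}")
    return (len(errors) == 0, errors)
```

A failing check yields a structured error list, which is exactly what the retry/fallback machinery of §14.5 consumes.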

14.6.3 Semantic Sanity Checks#

Beyond schema correctness, outputs must be semantically plausible. Sanity checks are domain-specific predicates:

$$\text{SanityValid}(o) = \bigwedge_{p \in \mathcal{P}_{\text{sanity}}} p(o)$$

Examples of sanity predicates:

  • Temporal: $o.\text{created\_at} \leq \text{now}()$
  • Numeric range: $o.\text{price} > 0 \wedge o.\text{price} < 10^7$
  • Referential: $o.\text{user\_id} \in \text{KnownUsers}$
  • Cardinality: $|o.\text{results}| \leq \text{expected\_max}$
  • Self-consistency: $o.\text{total} = \sum_{i} o.\text{items}[i].\text{amount}$
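The conjunction maps directly onto a registry of named predicate functions; the predicates below mirror the bullets, and the field names are illustrative.

```python
import time

# Each sanity predicate is a named boolean function over the output dict.
# Field names mirror the examples in the text and are illustrative.
SANITY_PREDICATES = {
    "temporal": lambda o: o["created_at"] <= time.time(),
    "numeric": lambda o: 0 < o["price"] < 10**7,
    "cardinality": lambda o: len(o["results"]) <= o["expected_max"],
    "self_consistent":
        lambda o: abs(o["total"] - sum(i["amount"] for i in o["items"])) < 1e-9,
}

def sanity_valid(output, predicates=SANITY_PREDICATES):
    """Return names of violated predicates; empty list means SanityValid(o)."""
    return [name for name, p in predicates.items() if not p(output)]
```

Returning the violated predicate names (rather than a bare boolean) lets the validator attach per-predicate severities, as Pseudo-Algorithm 14.7 requires.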

14.6.4 Cross-Tool Consistency Verification#

When multiple tools produce overlapping information, consistency checks detect conflicting outputs. Given outputs $o_a$ from $\tau_a$ and $o_b$ from $\tau_b$ covering the same entity or fact:

$$\text{Consistent}(o_a, o_b) = \forall f \in \text{SharedFields}(o_a, o_b) : \text{Agree}(o_a.f, o_b.f, \epsilon_f)$$

where $\epsilon_f$ is a field-specific tolerance (exact match for identifiers, $\epsilon$-tolerance for floating-point values, fuzzy match for strings).
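A field-kind-aware agreement check might look like the following; the kind labels, tolerance defaults, and the use of `difflib` for fuzzy string matching are illustrative choices.

```python
import difflib

def agree(a, b, field_kind, eps=1e-6, fuzzy_threshold=0.9):
    """Field-specific Agree(a, b, ε): exact for identifiers,
    ε-tolerance for floats, fuzzy ratio for strings."""
    if field_kind == "identifier":
        return a == b
    if field_kind == "float":
        return abs(a - b) <= eps
    if field_kind == "string":
        return difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio() >= fuzzy_threshold
    return a == b  # default: exact match
```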

Conflict resolution strategies:

| Strategy | Rule |
| --- | --- |
| Authority ranking | Prefer the output from the higher-authority source |
| Freshness | Prefer the more recently retrieved value |
| Majority vote | If $\geq k$ out of $n$ tools agree, adopt the majority value |
| LLM arbitration | Present conflicts to the LLM with provenance for reasoned resolution |
| Human escalation | Flag unresolvable conflicts for human review |

Pseudo-Algorithm 14.7: Multi-Layer Tool Output Validator

PROCEDURE ValidateToolOutput(output, tool, context, validation_policy):
    issues ← []
 
    // Layer 1: Schema conformance
    schema_result ← ValidateJsonSchema(output, tool.output_schema)
    IF schema_result.has_errors:
        RETURN ValidationFailure(layer="schema", errors=schema_result.errors, severity=CRITICAL)
 
    // Layer 2: Semantic sanity
    sanity_predicates ← validation_policy.GetSanityPredicates(tool.id)
    FOR predicate IN sanity_predicates:
        IF NOT predicate.evaluate(output):
            issues.APPEND(SanityViolation(predicate=predicate.name, value=predicate.extract(output),
                                          expected=predicate.expected_range, severity=predicate.severity))
 
    IF ANY(issue.severity = CRITICAL FOR issue IN issues):
        RETURN ValidationFailure(layer="sanity", errors=issues, severity=CRITICAL)
 
    // Layer 3: Cross-tool consistency
    corroborating_outputs ← context.GetCorroboratingOutputs(tool.id, output)
    FOR corr IN corroborating_outputs:
        shared_fields ← IntersectFields(output, corr.output)
        FOR field IN shared_fields:
            IF NOT AgreeWithTolerance(output[field], corr.output[field], tolerance=validation_policy.GetTolerance(field)):
                issues.APPEND(ConsistencyConflict(
                    field=field,
                    value_a=output[field], source_a=tool.id,
                    value_b=corr.output[field], source_b=corr.tool_id,
                    severity=validation_policy.GetConflictSeverity(field)
                ))
 
    IF ANY(issue.severity = CRITICAL FOR issue IN issues):
        resolution ← ResolveConflicts(issues, validation_policy.conflict_resolution_strategy)
        RETURN ValidationConditional(output=resolution.resolved_output, issues=issues, resolution=resolution)
 
    IF len(issues) > 0:
        RETURN ValidationWarning(output=output, issues=issues)
    ELSE:
        RETURN ValidationSuccess(output=output)

14.7 Self-Healing Tool Use: Automatic Retry with Parameter Adjustment, Error-Guided Correction#

14.7.1 Error Taxonomy for Tool Invocations#

Tool failures are categorized by their amenability to automated correction:

| Error Class | Retryable | Self-Healable | Example |
| --- | --- | --- | --- |
| Transient | Yes (same params) | N/A | Network timeout, 503 |
| Rate-limited | Yes (after delay) | N/A | 429 Too Many Requests |
| Input validation | No (same params) | Yes (adjust params) | Invalid date format, missing field |
| Authorization | No | No (escalate) | 403 Forbidden |
| Semantic error | No (same params) | Yes (rewrite query) | SQL syntax error, empty result set |
| Resource not found | No (same params) | Yes (search + retry) | 404, entity doesn't exist |
| Server error | Conditional | No | 500 with corruption |

14.7.2 Self-Healing Loop#

The self-healing mechanism uses the LLM as an error diagnostician that reads the error response and adjusts parameters:

$$\mathbf{x}' = \text{LLM}_{\text{repair}}(\text{error}(o), \mathbf{x}, \tau.\text{schema}, \text{context})$$

This forms a repair loop bounded by a maximum iteration count $K_{\text{max}}$:

Pseudo-Algorithm 14.8: Self-Healing Tool Invocation

PROCEDURE InvokeWithSelfHealing(tool, initial_params, context, max_attempts, trace_id):
    current_params ← initial_params
    attempt_history ← []
 
    FOR attempt ← 1 TO max_attempts:
        // Invoke tool
        result ← InvokeTool(tool, current_params, trace_id=trace_id)
 
        IF result IS Success:
            validated ← ValidateToolOutput(result.output, tool, context)
            IF validated IS ValidationSuccess OR validated IS ValidationWarning:
                RETURN SelfHealSuccess(output=validated.output, attempts=attempt, history=attempt_history)
            ELSE:
                // Output validation failed — treat as semantic error
                error_info ← ConstructErrorInfo(type="output_validation_failure", details=validated.errors)
        ELSE:
            error_info ← ClassifyError(result.error)
 
        attempt_history.APPEND(AttemptRecord(
            attempt=attempt, params=current_params, error=error_info, timestamp=now()
        ))
 
        // Determine if error is self-healable
        IF error_info.class IN {AUTHORIZATION, SERVER_ERROR_PERMANENT}:
            RETURN SelfHealFailure(error=error_info, attempts=attempt, history=attempt_history)
 
        IF error_info.class = TRANSIENT:
            // Simple retry with exponential backoff + jitter
            delay ← MIN(BASE_DELAY * 2^attempt + RandomJitter(), MAX_DELAY)
            Sleep(delay)
            CONTINUE  // Same params
 
        IF error_info.class = RATE_LIMITED:
            delay ← ParseRetryAfter(result.headers) OR DEFAULT_RATE_LIMIT_DELAY
            Sleep(delay)
            CONTINUE  // Same params
 
        // Self-healing: LLM-driven parameter adjustment
        repair_context ← ConstructRepairContext(
            tool_schema=tool.input_schema,
            original_params=initial_params,
            current_params=current_params,
            error=error_info,
            attempt_history=attempt_history,
            task_context=context
        )
 
        repaired_params ← LLM.RepairToolParams(repair_context)
 
        IF repaired_params = current_params:
            // LLM couldn't find a different parameterization
            RETURN SelfHealFailure(error=error_info, attempts=attempt, history=attempt_history,
                                   reason="no_alternative_params")
 
        // Validate repaired params against schema before retrying
        schema_valid ← ValidateSchema(repaired_params, tool.input_schema)
        IF schema_valid IS SchemaError:
            // LLM produced invalid repair — try once more with explicit schema guidance
            repaired_params ← LLM.RepairToolParams(repair_context, include_schema_error=schema_valid)
            IF ValidateSchema(repaired_params, tool.input_schema) IS SchemaError:
                RETURN SelfHealFailure(error="repair_produced_invalid_schema", attempts=attempt,
                                       history=attempt_history)
 
        current_params ← repaired_params
 
    RETURN SelfHealExhausted(attempts=max_attempts, history=attempt_history)

14.7.3 Repair Quality Metrics#

The quality of the self-healing loop is measured by:

  • Repair success rate: $R_{\text{success}} = \frac{|\text{healed invocations}|}{|\text{attempted healings}|}$
  • Mean attempts to heal: $\bar{A} = \mathbb{E}[\text{attempts} \mid \text{healed}]$
  • Repair latency overhead: $\Delta L = L_{\text{healed}} - L_{\text{first\_attempt}}$
  • False repair rate: cases where the healed output passed validation but was semantically incorrect.

These metrics feed into the evaluation infrastructure (§14.13) for continuous quality monitoring.

14.7.4 Error-Guided Correction Patterns#

Specific error classes invoke specialized correction strategies:

  1. SQL syntax errors: Extract the error message, re-invoke the LLM with the error and original query for targeted rewriting. Include the database schema as additional context.
  2. Empty result sets: Broaden search criteria (relax filters, expand date ranges, use synonyms). The repair prompt explicitly instructs broadening.
  3. Type mismatches: Extract expected vs. actual types from the error; apply type coercion or reformatting.
  4. Pagination exhaustion: Adjust offset/cursor parameters to access the correct page.
  5. Encoding errors: Detect encoding issues (UTF-8 vs. ASCII) and apply appropriate encoding before retry.
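The dispatch from error class to correction strategy can be expressed as a small policy table; the class and strategy names below follow the taxonomy in §14.7.1, and the exact labels are illustrative.

```python
# Maps an error class from the taxonomy in §14.7.1 to
# (retryable_with_same_params, correction_strategy).
ERROR_POLICY = {
    "transient":        (True,  None),                # backoff, same params
    "rate_limited":     (True,  None),                # honor Retry-After, same params
    "input_validation": (False, "adjust_params"),
    "authorization":    (False, "escalate"),
    "semantic":         (False, "rewrite_query"),
    "not_found":        (False, "search_then_retry"),
    "server_error":     (False, None),                # conditional; no self-heal
}

def correction_strategy(error_class):
    """Resolve an error class to the action Pseudo-Algorithm 14.8 would take."""
    retry_same, strategy = ERROR_POLICY.get(error_class, (False, None))
    if retry_same:
        return "retry_same_params"
    return strategy or "fail"
```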

14.8 Tool Creation by Agents: Dynamic Code Generation, Sandboxed Execution, and Promotion to Permanent Tools#

14.8.1 The Tool Creation Lifecycle#

When an agent encounters a task for which no existing tool suffices, it may create a new tool dynamically. This capability transforms the agent from a tool consumer into a tool producer, but introduces severe safety, correctness, and governance risks that must be addressed mechanically.

The tool creation lifecycle consists of five phases:

$$\text{Identify Gap} \to \text{Generate Code} \to \text{Sandbox Test} \to \text{Validate} \to \text{Promote or Discard}$$

14.8.2 Gap Identification#

The agent identifies tool gaps through:

  1. Tool selection failure: No existing tool matches the required capability.
  2. Composition inefficiency: The task requires more than $k$ tool invocations that could be replaced by a single custom tool.
  3. Repeated patterns: The same multi-tool sequence appears $\geq n_{\text{threshold}}$ times in recent traces.

Formally, a gap is detected when:

$$\forall \tau \in \mathcal{T} : \text{relevance}(\tau, \kappa_{\text{required}}) < \theta_{\text{gap}}$$

where $\kappa_{\text{required}}$ is the required capability and $\theta_{\text{gap}}$ is the minimum relevance threshold.
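Under this definition, gap detection reduces to a maximum over relevance scores. A sketch, assuming a relevance-scoring function supplied by the tool-selection layer:

```python
def tool_gap_exists(tools, required_capability, relevance, theta_gap=0.6):
    """A gap exists when no registered tool clears the relevance threshold.
    `relevance` is an assumed callable (tool, capability) -> [0, 1]."""
    if not tools:
        return True
    return max(relevance(t, required_capability) for t in tools) < theta_gap
```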

14.8.3 Code Generation with Typed Contracts#

The agent generates tool code that conforms to a tool template contract:

ToolTemplate:
    name: string                     // Unique identifier
    description: string              // Natural language description for discovery
    input_schema: JSONSchema          // Typed input specification
    output_schema: JSONSchema         // Typed output specification
    implementation: CodeBlock         // Generated code (Python, TypeScript, etc.)
    dependencies: List<Dependency>    // Required libraries (allowlisted)
    resource_limits: ResourceLimits   // CPU, memory, network, time bounds
    test_cases: List<TestCase>        // Agent-generated test cases
    security_classification: SecurityLevel  // Determines sandbox tier

The LLM generates the implementation subject to constraints:

  • Allowed dependencies: Only from an approved allowlist (no arbitrary package installation).
  • No network access during initial sandbox testing (unless explicitly approved).
  • Deterministic behavior: Same input should produce same output (excluding time-dependent operations).
  • Bounded execution time: Hard timeout enforced by the sandbox.

14.8.4 Sandboxed Execution Environment#

Generated tools execute in a multi-tier sandbox:

| Tier | Capabilities | Use Case |
| --- | --- | --- |
| Tier 0 (Pure compute) | CPU, memory only; no I/O, no network | Mathematical operations, data transforms |
| Tier 1 (Read-only I/O) | Filesystem read, environment variables | File parsing, config reading |
| Tier 2 (Controlled network) | Allowlisted HTTP endpoints only | API consumption |
| Tier 3 (Full access) | Requires human approval per invocation | System administration tools |

Sandbox enforcement uses OS-level isolation (containers, seccomp-bpf, namespace isolation):

$$\text{SandboxPolicy}(\tau_{\text{new}}) = \text{Tier}\left[\min\left(i : \text{capabilities}(\tau_{\text{new}}) \subseteq \text{allowed}(\text{Tier}_i)\right)\right]$$
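The rule selects the least-capable tier that still covers the tool's declared capabilities. A sketch in which the capability names are illustrative, not a fixed vocabulary:

```python
# Allowed capability sets per tier, least privileged first.
# Capability names are illustrative placeholders.
TIER_CAPS = [
    set(),                                        # Tier 0: pure compute
    {"fs_read", "env_read"},                      # Tier 1: read-only I/O
    {"fs_read", "env_read", "http_allowlisted"},  # Tier 2: controlled network
    {"fs_read", "fs_write", "env_read",
     "http_allowlisted", "network", "shell"},     # Tier 3: full access
]

def sandbox_tier(required_caps):
    """Smallest tier i such that required capabilities ⊆ allowed(Tier_i)."""
    for i, allowed in enumerate(TIER_CAPS):
        if set(required_caps) <= allowed:
            return i
    raise ValueError("capabilities exceed all sandbox tiers")
```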

14.8.5 Validation and Testing#

Before a generated tool is used, it must pass:

  1. Static analysis: Lint, type check, security scan (no eval, no shell injection, no hardcoded credentials).
  2. Unit tests: Agent-generated test cases executed in sandbox.
  3. Property-based tests: Automatically generated inputs (fuzzing) to detect crashes, hangs, and out-of-bounds behavior.
  4. Output schema conformance: Every test execution's output must validate against the declared output schema.

The validation score:

$$V(\tau_{\text{new}}) = w_{\text{static}} \cdot s_{\text{static}} + w_{\text{unit}} \cdot s_{\text{unit}} + w_{\text{fuzz}} \cdot s_{\text{fuzz}} + w_{\text{schema}} \cdot s_{\text{schema}}$$

where $s_i \in [0, 1]$ and $\sum_i w_i = 1$. The tool is usable only if $V(\tau_{\text{new}}) \geq \theta_{\text{validation}}$.
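The weighted score and threshold gate compute directly; the weight values and the 0.8 threshold below are illustrative defaults, not prescribed by the text.

```python
def validation_score(scores, weights):
    """Weighted validation score V(τ); weights must sum to 1, scores in [0, 1]."""
    assert abs(sum(weights.values()) - 1.0) < 1e-9, "weights must sum to 1"
    return sum(weights[k] * scores[k] for k in weights)

def is_usable(scores, weights, theta_validation=0.8):
    """Gate: the tool is usable only if V >= θ_validation."""
    return validation_score(scores, weights) >= theta_validation

# Illustrative weighting that favors unit-test evidence.
WEIGHTS = {"static": 0.2, "unit": 0.4, "fuzz": 0.2, "schema": 0.2}
```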

14.8.6 Promotion Pipeline#

Pseudo-Algorithm 14.9: Tool Promotion Pipeline

PROCEDURE PromoteTool(generated_tool, promotion_policy):
    // Phase 1: Static analysis
    static_result ← RunStaticAnalysis(generated_tool.implementation, generated_tool.dependencies)
    IF static_result.has_critical_issues:
        RETURN PromotionRejected(reason="static_analysis_failure", details=static_result)
 
    // Phase 2: Sandboxed test execution
    sandbox ← CreateSandbox(tier=DetermineSandboxTier(generated_tool))
    test_results ← []
    FOR test IN generated_tool.test_cases:
        result ← sandbox.Execute(generated_tool.implementation, test.input, timeout=generated_tool.resource_limits.timeout)
        schema_valid ← ValidateSchema(result.output, generated_tool.output_schema)
        test_results.APPEND(TestResult(input=test.input, expected=test.expected, actual=result.output,
                                        passed=test.expected_matches(result.output) AND schema_valid, duration=result.duration))
 
    // Phase 3: Fuzz testing
    fuzz_inputs ← GenerateFuzzInputs(generated_tool.input_schema, count=100)
    FOR fuzz_input IN fuzz_inputs:
        result ← sandbox.Execute(generated_tool.implementation, fuzz_input, timeout=generated_tool.resource_limits.timeout)
        IF result IS Crash OR result IS Timeout:
            test_results.APPEND(TestResult(input=fuzz_input, passed=FALSE, error=result.error))
 
    // Phase 4: Compute validation score
    scores ← ComputeValidationScores(static_result, test_results)
    IF scores.overall < promotion_policy.min_validation_score:
        RETURN PromotionRejected(reason="insufficient_validation_score", score=scores.overall)
 
    // Phase 5: Promotion decision
    IF promotion_policy.requires_human_review:
        review ← RequestHumanReview(generated_tool, test_results, scores)
        IF NOT review.approved:
            RETURN PromotionRejected(reason="human_review_denied")
 
    // Phase 6: Register in tool catalog
    tool_record ← ToolCatalogRecord(
        tool=generated_tool,
        status=PROMOTED,
        validation_score=scores.overall,
        created_by="agent:" + agent_id,
        created_at=now(),
        version="1.0.0",
        trust_score=ComputeInitialTrustScore(scores)
    )
    ToolCatalog.Register(tool_record)
    RETURN PromotionSuccess(tool_id=tool_record.id, trust_score=tool_record.trust_score)

14.8.7 Ephemeral vs. Permanent Tools#

| Attribute | Ephemeral | Permanent |
| --- | --- | --- |
| Lifetime | Single session | Indefinite (versioned) |
| Storage | Working memory | Tool catalog / registry |
| Discovery | Not discoverable by other agents | Discoverable via MCP |
| Governance | Self-validated | Human-reviewed, CI-tested |
| Trust score | Minimal | Accumulates with usage |

Ephemeral tools serve as a rapid prototyping mechanism; only tools that prove their value over multiple sessions and pass the promotion pipeline become permanent.


14.9 Browser and GUI Tools: Playwright, Puppeteer, Desktop Automation, Vision-Language Tool Agents#

14.9.1 Architecture of Browser/GUI Tool Agents#

Browser and GUI automation tools enable agents to interact with web applications and desktop software that lack APIs. These tools bridge the gap between structured tool invocation and unstructured visual interfaces.

The architecture comprises three layers:

Agent Reasoning Layer (LLM)
        ↕ Structured commands / observations
Automation Abstraction Layer (Tool Server)
        ↕ Low-level browser/GUI commands
Execution Engine (Playwright / Puppeteer / OS Accessibility APIs)

14.9.2 Browser Automation Protocol#

Browser tools expose a typed interface for web interaction:

Core operations:

| Operation | Input | Output | Side Effect |
| --- | --- | --- | --- |
| navigate | URL | Page metadata, status code | Browser navigates |
| extract_text | CSS selector / XPath | Extracted text content | None |
| extract_structured | Selector + schema | Structured data | None |
| click | Selector | Updated page state | DOM mutation |
| fill_form | Selector + value | Confirmation | DOM mutation |
| screenshot | Viewport/selector | Image (base64) | None |
| execute_js | JavaScript code | Execution result | Arbitrary |
| wait_for | Condition + timeout | Success/timeout | Blocks |

14.9.3 Vision-Language Integration#

For applications where DOM access is unreliable or unavailable (canvas-based apps, remote desktops, PDFs rendered as images), the agent uses vision-language models (VLMs) to interpret screenshots:

$$\text{action} = \text{VLM}(\text{screenshot}, \text{task\_instruction}, \text{action\_history})$$

The VLM produces structured action commands:

$$\text{action} \in \{\text{click}(x, y),\ \text{type}(x, y, \text{text}),\ \text{scroll}(dx, dy),\ \text{wait}(t),\ \text{done}\}$$

Coordinate grounding is the critical challenge: the VLM must map semantic targets ("the Submit button") to pixel coordinates $(x, y)$. Approaches:

  1. Set-of-Mark (SoM): Overlay numbered labels on interactive elements in the screenshot; the VLM references labels instead of raw coordinates.
  2. Bounding box prediction: The VLM outputs bounding box coordinates for the target element.
  3. DOM-augmented vision: Combine the screenshot with a simplified DOM tree to provide both visual and structural grounding.

14.9.4 Observation Compression for GUI Agents#

Raw screenshots consume significant token budget when processed by VLMs. Compression strategies:

  • DOM-to-text: Convert the accessibility tree to a compact text representation (element type, label, state, relative position).
  • Selective screenshot: Capture only the relevant viewport region, not the full page.
  • Delta encoding: After the first screenshot, transmit only changed regions.
  • Structured observation: Extract form field states, button labels, and error messages as structured data rather than relying on visual parsing.

Pseudo-Algorithm 14.10: Browser Agent Action Loop

PROCEDURE BrowserAgentLoop(task, browser_session, max_steps, deadline):
    action_history ← []
    
    FOR step ← 1 TO max_steps:
        // Observe current state
        observation ← ConstructObservation(browser_session)
        // Observation = { dom_summary, screenshot_if_needed, url, page_title, error_messages }
 
        // Compress observation for context
        compressed_obs ← CompressObservation(observation, method=SelectCompressionMethod(task, step))
 
        // Agent reasons about next action
        agent_context ← AssembleContext(
            task=task,
            current_observation=compressed_obs,
            action_history=TruncateHistory(action_history, max_tokens=2048),
            available_actions=BROWSER_ACTION_SCHEMA
        )
 
        action_decision ← LLM.DecideAction(agent_context)
 
        // Validate action
        IF action_decision.action = "done":
            result ← ExtractResult(observation, task.expected_output_schema)
            RETURN BrowserTaskSuccess(result=result, steps=step, history=action_history)
 
        IF NOT IsValidAction(action_decision, BROWSER_ACTION_SCHEMA):
            action_history.APPEND(InvalidAction(action_decision, step=step))
            CONTINUE  // Let agent self-correct on next iteration
 
        // Execute action with safety checks
        IF action_decision.action IN {"navigate", "click", "fill_form", "execute_js"}:
            safety_check ← BrowserSafetyPolicy.Check(action_decision, browser_session)
            IF safety_check = DENY:
                action_history.APPEND(BlockedAction(action_decision, reason=safety_check.reason))
                CONTINUE
 
        execution_result ← browser_session.Execute(action_decision)
        action_history.APPEND(ExecutedAction(
            step=step, action=action_decision, result=execution_result, timestamp=now()
        ))
 
        // Wait for page stability
        browser_session.WaitForStable(timeout=5000)
 
        IF now() > deadline:
            RETURN BrowserTaskTimeout(steps=step, history=action_history)
 
    RETURN BrowserTaskExhausted(steps=max_steps, history=action_history)

14.9.5 Desktop Automation#

Desktop automation extends browser patterns to native applications using:

  • OS Accessibility APIs (Windows UI Automation, macOS Accessibility, Linux AT-SPI): Provide structured access to UI element trees.
  • Computer Vision: When accessibility APIs are unavailable, use screenshot-based interaction with VLMs.
  • Keyboard/Mouse simulation: Low-level input injection for actions not exposed through accessibility APIs.

The same action loop (Pseudo-Algorithm 14.10) applies, with the browser session replaced by a desktop session abstraction and the DOM summary replaced by the accessibility tree.


14.10 File System and Repository Tools: Git Operations, File Manipulation, Build System Integration#

14.10.1 File System Tool Taxonomy#

File system tools provide the agent with the ability to read, write, search, and manipulate files and directories. These tools are critical for software engineering agents, document processing agents, and data pipeline agents.

| Tool | Input | Output | Mutation |
| --- | --- | --- | --- |
| read_file | Path, encoding, byte range | File content (text/binary) | None |
| write_file | Path, content, mode (create/overwrite/append) | Success + metadata | Yes |
| list_directory | Path, pattern, recursive | File listing with metadata | None |
| search_files | Pattern (glob/regex), root path | Matching file paths | None |
| search_content | Query (text/regex), file set | Matching lines with context | None |
| move_file | Source, destination | Success | Yes |
| delete_file | Path, require_confirmation | Success | Yes (irreversible) |
| file_diff | Path A, Path B | Unified diff | None |
| file_metadata | Path | Size, timestamps, permissions | None |

14.10.2 Git Operations#

Git tools enable agents to operate within version-controlled repositories with full branching, committing, and merging capabilities:

Core Git tool operations:

| Tool | Input | Output | Mutation |
| --- | --- | --- | --- |
| git_status | Repo path | Modified/staged/untracked files | None |
| git_diff | Repo, ref range, path filter | Diff output | None |
| git_log | Repo, ref, count, path filter | Commit history | None |
| git_branch | Repo, branch name, base ref | Branch created | Yes |
| git_checkout | Repo, ref | Working tree updated | Yes |
| git_commit | Repo, message, file list | Commit SHA | Yes |
| git_merge | Repo, source branch, strategy | Merge result or conflicts | Yes |
| git_push | Repo, remote, branch | Push result | Yes (remote state) |
| git_blame | Repo, file, line range | Authorship per line | None |

Branching discipline for agentic execution: Agents must operate on isolated branches:

$$\text{branch}_{\text{agent}} = \text{agent/}\langle\text{task\_id}\rangle\text{/}\langle\text{timestamp}\rangle$$

This ensures:

  • No direct mutation of main or protected branches.
  • All changes are reviewable via pull request.
  • Parallel agents do not create merge conflicts on the same branch.
  • Rollback is trivial (delete branch).
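Generating the isolated branch name is mechanical. A sketch of the naming scheme above, in which the sanitization rules (replacing characters disallowed in git refs) and the timestamp format are assumptions:

```python
import re
from datetime import datetime, timezone

def agent_branch_name(task_id, now=None):
    """Build agent/<task_id>/<timestamp>, sanitizing the task id so the
    result is a valid git branch name. Sanitization rules are illustrative."""
    safe_task = re.sub(r"[^A-Za-z0-9._-]+", "-", task_id).strip("-.")
    ts = (now or datetime.now(timezone.utc)).strftime("%Y%m%dT%H%M%SZ")
    return f"agent/{safe_task}/{ts}"
```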

14.10.3 Build System Integration#

Agents that modify code must verify their changes compile and pass tests:

| Tool | Input | Output |
| --- | --- | --- |
| build | Repo, target, config | Build result (success/failure + logs) |
| test | Repo, test suite, filter | Test results (pass/fail/skip per test) |
| lint | Repo, file set, config | Lint results (warnings, errors per file) |
| typecheck | Repo, config | Type errors (file, line, message) |

Verification loop after file mutation:

PROCEDURE VerifyAfterMutation(repo, changed_files, verification_policy):
    results ← {}
 
    IF verification_policy.require_typecheck:
        results["typecheck"] ← RunTypecheck(repo)
    IF verification_policy.require_lint:
        results["lint"] ← RunLint(repo, changed_files)
    IF verification_policy.require_build:
        results["build"] ← RunBuild(repo)
    IF verification_policy.require_tests:
        test_scope ← DetermineAffectedTests(repo, changed_files)
        results["tests"] ← RunTests(repo, test_scope)
 
    all_passed ← ALL(r.passed FOR r IN results.values())
    RETURN VerificationResult(passed=all_passed, details=results)

14.11 Database Tools: Query Generation, Schema Introspection, Migration Planning, and Data Validation#

14.11.1 Database Tool Architecture#

Database tools enable agents to interact with relational and non-relational databases through a safety-layered interface:

Agent (LLM)
    ↕ Natural language → SQL/query intent
Query Generator (LLM or template engine)
    ↕ Generated query
Safety Layer (parser, validator, policy gate)
    ↕ Approved query
Execution Layer (connection pool, timeout, result limit)
    ↕ Result set
Result Formatter (projection, truncation, type conversion)
    ↕ Structured output to agent

14.11.2 Schema Introspection#

Before generating queries, the agent must understand the database schema. The schema introspection tool provides:

  • Tables/collections: Names, descriptions (from comments), row counts.
  • Columns/fields: Name, type, nullability, constraints, foreign keys, indices.
  • Relationships: Foreign key graph, junction tables.
  • Sample data: $k$ representative rows per table (anonymized if sensitive).
  • Statistics: Value distributions, cardinality estimates, NULL rates.

Schema is injected into context in compressed form:

TABLE orders (
  id: INT PK AUTO_INCREMENT,
  user_id: INT FK→users.id NOT NULL INDEX,
  total: DECIMAL(10,2) NOT NULL CHECK(>0),
  status: ENUM('pending','shipped','delivered','cancelled') DEFAULT 'pending',
  created_at: TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP
) -- ~1.2M rows, ~340 orders/day

14.11.3 Query Generation Safety#

Generated SQL must be validated before execution:

  1. Parse validation: The query must parse as valid SQL for the target dialect.
  2. Read-only enforcement: For read operations, reject any query containing INSERT, UPDATE, DELETE, DROP, ALTER, TRUNCATE, CREATE, GRANT.
  3. Query complexity bounds: Reject queries with:
    • More than $n_{\text{join}}$ joins (default: 5).
    • Subqueries nested deeper than $d_{\text{max}}$ levels (default: 3).
    • Missing WHERE clause on large tables (prevents full table scans).
    • Missing LIMIT clause (enforce maximum result set size).
  4. Parameterization: All user-derived values must be parameterized (prevent SQL injection).
  5. Execution cost estimation: Use EXPLAIN to estimate query cost before execution; reject queries above a cost threshold.

$$\text{QueryApproved}(q) = \text{ParseValid}(q) \wedge \text{ReadOnly}(q) \wedge \text{ComplexityBound}(q) \wedge \text{CostBound}(q)$$
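A deliberately naive sketch of the read-only, join-bound, and LIMIT gates using token matching. A production gate must parse the SQL into an AST, as rule 1 requires, since comments and string literals defeat pattern matching; the helper below only illustrates the shape of the conjunction.

```python
import re

WRITE_KEYWORDS = {"INSERT", "UPDATE", "DELETE", "DROP", "ALTER",
                  "TRUNCATE", "CREATE", "GRANT"}

def query_approved(sql, max_joins=5):
    """Naive gate: SELECT-only, no write keywords, bounded joins, LIMIT present.
    Token-level only — a real implementation must inspect the parsed AST."""
    upper = sql.upper()
    tokens = set(re.findall(r"[A-Za-z_]+", upper))
    if not upper.lstrip().startswith("SELECT"):
        return False, "not a SELECT"
    if tokens & WRITE_KEYWORDS:
        return False, "write keyword present"
    if len(re.findall(r"\bJOIN\b", upper)) > max_joins:
        return False, "too many joins"
    if "LIMIT" not in tokens:
        return False, "missing LIMIT"
    return True, "ok"
```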

14.11.4 Query Generation Algorithm#

Pseudo-Algorithm 14.11: Safe Database Query Generator

PROCEDURE GenerateAndExecuteQuery(natural_language_query, database_connection, schema_cache, safety_policy):
    // Step 1: Schema retrieval
    relevant_tables ← IdentifyRelevantTables(natural_language_query, schema_cache)
    schema_context ← FormatSchemaForContext(relevant_tables, include_sample_data=TRUE, max_tables=10)
 
    // Step 2: Query generation
    generated_sql ← LLM.GenerateSQL(
        instruction="Generate a SQL query for the following request. Use only the provided schema. Include LIMIT clause.",
        user_query=natural_language_query,
        schema=schema_context,
        dialect=database_connection.dialect
    )
 
    // Step 3: Safety validation
    parse_result ← ParseSQL(generated_sql, dialect=database_connection.dialect)
    IF parse_result IS ParseError:
        // Self-heal: retry with error feedback
        generated_sql ← LLM.RepairSQL(generated_sql, parse_result.error, schema_context)
        parse_result ← ParseSQL(generated_sql, dialect=database_connection.dialect)
        IF parse_result IS ParseError:
            RETURN QueryFailure(reason="parse_error_after_repair", error=parse_result.error)
 
    IF NOT safety_policy.IsReadOnly(parse_result.ast):
        RETURN QueryFailure(reason="write_operation_not_permitted")
 
    IF NOT safety_policy.ComplexityWithinBounds(parse_result.ast):
        RETURN QueryFailure(reason="query_too_complex", details=safety_policy.GetComplexityReport(parse_result.ast))
 
    // Step 4: Cost estimation
    explain_result ← database_connection.Explain(generated_sql)
    IF explain_result.estimated_cost > safety_policy.max_query_cost:
        RETURN QueryFailure(reason="estimated_cost_too_high", cost=explain_result.estimated_cost)
 
    // Step 5: Execute with timeout and row limit
    result ← database_connection.Execute(
        query=generated_sql,
        timeout=safety_policy.query_timeout,
        max_rows=safety_policy.max_result_rows
    )
 
    IF result IS Timeout:
        RETURN QueryFailure(reason="execution_timeout")
 
    // Step 6: Format and return
    formatted ← FormatResultSet(result.rows, result.columns, max_display_rows=safety_policy.max_display_rows)
    RETURN QuerySuccess(
        sql=generated_sql,
        result=formatted,
        row_count=result.row_count,
        execution_time=result.duration,
        truncated=result.row_count > safety_policy.max_display_rows
    )

14.11.5 Migration Planning#

For schema migrations, the agent:

  1. Analyzes the current schema and the desired end state.
  2. Generates migration scripts (DDL statements) with up and down paths.
  3. Validates migrations against a copy of the schema (dry run).
  4. Estimates data migration duration and locking impact.
  5. Produces a migration plan for human review (never auto-executes DDL in production).

Migration tools are always classified as Tier 3 (human-approval-gated) in the safety hierarchy.

14.11.6 Data Validation#

Data validation tools enable agents to verify data quality:

DataQuality(D)=1RrR1[r(D)=PASS]\text{DataQuality}(D) = \frac{1}{|\mathcal{R}|} \sum_{r \in \mathcal{R}} \mathbb{1}[r(D) = \text{PASS}]

where R\mathcal{R} is the set of validation rules. Common rules include:

  • Completeness: NULL rate per column \leq threshold.
  • Uniqueness: Duplicate rate on candidate keys =0= 0.
  • Referential integrity: All foreign keys resolve.
  • Value distributions: Statistical tests for anomalous shifts (KL divergence from baseline).
  • Freshness: Most recent record timestamp within expected recency window.
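The DataQuality metric above is just the fraction of rules that pass. A minimal sketch, with illustrative rule constructors for the completeness and uniqueness checks (the sample rows and thresholds are invented for demonstration):

```python
from typing import Callable, Dict, List

# A validation rule takes a dataset (list of row dicts) and returns PASS/FAIL.
Rule = Callable[[List[Dict]], bool]

def completeness(column: str, max_null_rate: float) -> Rule:
    """PASS if the NULL rate for `column` is within the threshold."""
    def rule(rows: List[Dict]) -> bool:
        nulls = sum(1 for r in rows if r.get(column) is None)
        return (nulls / len(rows)) <= max_null_rate
    return rule

def uniqueness(column: str) -> Rule:
    """PASS if `column` (a candidate key) has zero duplicates."""
    def rule(rows: List[Dict]) -> bool:
        values = [r[column] for r in rows]
        return len(values) == len(set(values))
    return rule

def data_quality(rows: List[Dict], rules: List[Rule]) -> float:
    # DataQuality(D) = (1/|R|) * sum of indicator[r(D) = PASS]
    return sum(1 for rule in rules if rule(rows)) / len(rules)

rows = [{"id": 1, "email": "a@x.io"},
        {"id": 2, "email": None},
        {"id": 3, "email": "c@x.io"}]
score = data_quality(rows, [completeness("email", 0.10), uniqueness("id")])
# Uniqueness passes, completeness fails (null rate 1/3 > 0.10), so score = 0.5
```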

14.12 Communication Tools: Email, Chat, Notification, and Workflow Trigger Integrations#

14.12.1 Communication Tool Categories#

Communication tools enable agents to send, receive, and manage messages across external channels. These are inherently state-changing and externally visible, requiring strict governance.

| Category | Tools | Risk Level |
| --- | --- | --- |
| Email | send_email, read_inbox, search_email, draft_email | High (external recipients) |
| Chat | send_message, read_channel, create_thread, react | Medium (internal channels) |
| Notification | send_notification, schedule_reminder, create_alert | Medium |
| Workflow | trigger_pipeline, create_ticket, update_status, assign_task | Medium–High |

14.12.2 Governance Framework for Communication Tools#

All communication tools that produce externally visible artifacts must pass through a multi-gate approval pipeline:

CommsGate(m,τcomm,s)=ContentPolicy(m)RecipientPolicy(m.to,s)RatePolicy(τcomm,s)ApprovalPolicy(m,s)\text{CommsGate}(m, \tau_{\text{comm}}, s) = \text{ContentPolicy}(m) \wedge \text{RecipientPolicy}(m.\text{to}, s) \wedge \text{RatePolicy}(\tau_{\text{comm}}, s) \wedge \text{ApprovalPolicy}(m, s)

where mm is the message payload and ss is the agent state.

Content policy checks:

  • No personally identifiable information (PII) leakage outside authorized boundaries.
  • No confidential or classified information in external-facing messages.
  • Tone and professionalism scoring (LLM-evaluated).
  • No impersonation (messages clearly attributed to the agentic system).

Recipient policy checks:

  • Internal-only recipients: auto-approved (subject to rate limits).
  • External recipients: require human approval unless the recipient is on a pre-approved allowlist.
  • Broadcast (all-channel): always require explicit human approval.

Rate policy checks:

  • Per-recipient rate limits (e.g., max 3 emails per recipient per hour).
  • Per-channel rate limits (e.g., max 10 messages per channel per hour).
  • Global daily caps.
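The per-recipient rate policy can be sketched as a sliding-window limiter. This is a minimal in-memory version (a production limiter would use a shared store such as Redis); the class and method names are illustrative:

```python
import time
from collections import defaultdict, deque

class RecipientRateLimiter:
    """Sliding-window limiter, e.g. max 3 sends per recipient per hour."""

    def __init__(self, max_sends: int, window_seconds: float):
        self.max_sends = max_sends
        self.window = window_seconds
        self.sends = defaultdict(deque)  # recipient -> timestamps of recent sends

    def try_acquire(self, recipient: str, now: float = None) -> bool:
        now = time.monotonic() if now is None else now
        q = self.sends[recipient]
        while q and now - q[0] >= self.window:  # evict sends outside the window
            q.popleft()
        if len(q) >= self.max_sends:
            return False                        # defer: CommunicationDeferred
        q.append(now)
        return True

limiter = RecipientRateLimiter(max_sends=3, window_seconds=3600)
# The first three sends to a recipient pass; the fourth inside the
# hour window is deferred until an earlier send ages out.
```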

14.12.3 Draft-Review-Send Pattern#

For high-stakes communications, the agent produces a draft that is reviewed before sending:

Pseudo-Algorithm 14.12: Draft-Review-Send Communication

PROCEDURE SendCommunication(message_spec, comm_channel, agent_state, governance_policy):
    // Step 1: Generate draft
    draft ← LLM.ComposeDraft(
        intent=message_spec.intent,
        recipient=message_spec.recipient,
        context=message_spec.context,
        tone=governance_policy.required_tone,
        constraints=governance_policy.content_constraints
    )
 
    // Step 2: Content policy validation
    content_check ← ContentPolicy.Evaluate(draft, governance_policy)
    IF content_check.has_violations:
        draft ← LLM.ReviseDraft(draft, violations=content_check.violations)
        content_check ← ContentPolicy.Evaluate(draft, governance_policy)
        IF content_check.has_violations:
            RETURN CommunicationBlocked(reason="content_policy_violation", violations=content_check.violations)
 
    // Step 3: PII scan
    pii_scan ← PIIDetector.Scan(draft.body)
    IF pii_scan.found AND NOT governance_policy.AllowsPII(message_spec.recipient):
        draft ← RedactPII(draft, pii_scan.findings)
 
    // Step 4: Recipient policy
    recipient_check ← RecipientPolicy.Evaluate(message_spec.recipient, agent_state)
    IF recipient_check = EXTERNAL_REQUIRES_APPROVAL:
        approval ← RequestHumanApproval(
            type="external_communication",
            draft=draft,
            recipient=message_spec.recipient,
            timeout=governance_policy.approval_timeout
        )
        IF NOT approval.granted:
            RETURN CommunicationBlocked(reason="human_denied", draft=draft)
 
    // Step 5: Rate limit check
    IF NOT RateLimiter.TryAcquire(comm_channel, message_spec.recipient):
        RETURN CommunicationDeferred(reason="rate_limited", retry_after=RateLimiter.RetryAfter())
 
    // Step 6: Send
    send_result ← comm_channel.Send(draft)
    AuditLog.Record(action="communication_sent", draft=draft, result=send_result, agent=agent_state.agent_id)
    RETURN CommunicationSuccess(message_id=send_result.id, draft=draft)

14.12.4 Workflow Trigger Tools#

Workflow triggers connect the agent to external automation systems (CI/CD pipelines, ticketing systems, orchestration platforms):

  • Ticket creation: The agent creates JIRA/Linear/GitHub Issues with structured fields.
  • Pipeline triggers: The agent initiates build/deploy pipelines with specified parameters.
  • Status updates: The agent updates task status in project management tools.
  • Escalation chains: The agent triggers PagerDuty/OpsGenie alerts for critical issues.

Each workflow trigger must be idempotent (repeated invocations with the same idempotency key produce the same result) and auditable (every trigger is logged with full context and trace ID).
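The idempotency requirement can be sketched as a wrapper that derives a key from the canonical trigger parameters and returns the cached first result on repeat invocations. The in-memory store and all names here are illustrative; a real system would persist keys durably:

```python
import hashlib
import json

class IdempotentTrigger:
    """Repeated fires with the same idempotency key return the first result
    instead of re-firing the workflow."""

    def __init__(self, trigger_fn):
        self.trigger_fn = trigger_fn
        self.results = {}  # idempotency_key -> first result

    @staticmethod
    def make_key(workflow: str, params: dict) -> str:
        # Canonical JSON (sorted keys) so equivalent params hash identically.
        canonical = json.dumps({"workflow": workflow, "params": params},
                               sort_keys=True)
        return hashlib.sha256(canonical.encode()).hexdigest()

    def fire(self, workflow: str, params: dict):
        key = self.make_key(workflow, params)
        if key not in self.results:
            self.results[key] = self.trigger_fn(workflow, params)
        return self.results[key]

calls = []
def fake_pipeline(workflow, params):      # stand-in for the real trigger
    calls.append(workflow)
    return {"run_id": len(calls)}

trigger = IdempotentTrigger(fake_pipeline)
first = trigger.fire("deploy", {"env": "staging"})
second = trigger.fire("deploy", {"env": "staging"})  # same key: no second run
# first == second, and the underlying pipeline fired exactly once
```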


14.13 Tool Ecosystem Management: Marketplace, Rating, Trust Scoring, and Community Tool Servers#

14.13.1 Tool Ecosystem Architecture#

As the number of available tools grows beyond what a single organization manages, a tool ecosystem emerges — a marketplace of tool servers contributed by internal teams, vendors, and the open-source community. Managing this ecosystem requires:

  1. Discovery: Agents find tools by capability, not by name.
  2. Trust: Tools are scored by reliability, safety, and quality.
  3. Versioning: Tool contracts evolve without breaking consumers.
  4. Governance: Policies control which tools are available to which agents.

14.13.2 Tool Registry Schema#

Each tool in the ecosystem is registered with:

ToolRegistryEntry:
    id: UUID
    name: string
    version: SemanticVersion         // e.g., "2.3.1"
    provider: ProviderIdentity       // Organization, team, or individual
    description: string              // Natural language for LLM consumption
    capabilities: List<CapabilityTag>  // Standardized taxonomy (e.g., "data.query.sql")
    input_schema: JSONSchema
    output_schema: JSONSchema
    protocol: {MCP | JSON-RPC | gRPC}
    endpoint: URI
    auth_requirements: AuthSpec
    sla: SLASpec                     // Latency p50/p95/p99, availability target
    trust_score: Float ∈ [0, 1]
    usage_count: Int
    rating: Float ∈ [0, 5]
    last_verified: Timestamp
    deprecation_status: {ACTIVE | DEPRECATED | SUNSET}
    compatibility: List<CompatibilityConstraint>

14.13.3 Trust Scoring Model#

Trust scores are computed from multiple signals:

Trust(τ)=iwifi(τ)\text{Trust}(\tau) = \sum_{i} w_i \cdot f_i(\tau)

where the scoring functions fif_i and their weights wiw_i are:

| Signal fif_i | Weight wiw_i | Computation |
| --- | --- | --- |
| Reliability | 0.25 | 1failure_rate(τ,window=30d)1 - \text{failure\_rate}(\tau, \text{window=30d}) |
| Schema compliance | 0.15 | Fraction of invocations with valid output schemas |
| Latency SLA adherence | 0.15 | Fraction of invocations within declared latency SLA |
| Security audit status | 0.15 | {0.0,0.5,1.0}\{0.0, 0.5, 1.0\} for {unaudited, self-assessed, third-party audited} |
| Community rating | 0.10 | Normalized average user rating [0,1]\in [0, 1] |
| Provider reputation | 0.10 | Provider-level aggregate trust score |
| Freshness | 0.05 | Recency of last successful verification |
| Usage volume | 0.05 | min(1,log(usage_count)/log(threshold))\min(1, \log(\text{usage\_count}) / \log(\text{threshold})) |

Trust scores decay over time if not refreshed:

Trust(τ,t)=Trust(τ,t0)eλ(tt0)\text{Trust}(\tau, t) = \text{Trust}(\tau, t_0) \cdot e^{-\lambda (t - t_0)}

where λ\lambda is the decay rate and t0t_0 is the last verification timestamp. This incentivizes tool providers to maintain active verification.
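The weighted score and its decay can be sketched directly from the table and the decay equation. The weights below mirror the table; the signal values, decay rate, and verification age are illustrative:

```python
import math

# Weights from the trust-scoring table (they sum to 1.0).
WEIGHTS = {
    "reliability": 0.25, "schema_compliance": 0.15, "latency_sla": 0.15,
    "security_audit": 0.15, "community_rating": 0.10,
    "provider_reputation": 0.10, "freshness": 0.05, "usage_volume": 0.05,
}

def trust_score(signals: dict) -> float:
    """Trust(tau) = sum_i w_i * f_i(tau); each signal value is in [0, 1]."""
    return sum(WEIGHTS[name] * value for name, value in signals.items())

def decayed_trust(base: float, days_since_verification: float,
                  decay_rate: float) -> float:
    """Trust(tau, t) = Trust(tau, t0) * exp(-lambda * (t - t0))."""
    return base * math.exp(-decay_rate * days_since_verification)

signals = {name: 1.0 for name in WEIGHTS}  # a perfect tool on every signal
base = trust_score(signals)                # weights sum to 1.0, so base == 1.0
stale = decayed_trust(base, days_since_verification=30, decay_rate=0.01)
# With lambda = 0.01/day, 30 days without re-verification decays trust to ~0.74
```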

14.13.4 Tool Discovery Protocol#

Tool discovery follows MCP's resource discovery pattern, extended with capability-based search:

Pseudo-Algorithm 14.13: Capability-Based Tool Discovery

PROCEDURE DiscoverTools(capability_query, constraints, agent_context):
    // Step 1: Parse capability query into structured search
    parsed ← ParseCapabilityQuery(capability_query)
    // parsed = { capability_tags: ["data.query.sql"], constraints: { max_latency: 500ms, min_trust: 0.7 } }
 
    // Step 2: Search registry
    candidates ← ToolRegistry.Search(
        capability_tags=parsed.capability_tags,
        min_trust_score=constraints.min_trust OR 0.5,
        status=ACTIVE,
        compatible_with=agent_context.platform_version
    )
 
    // Step 3: Filter by constraints
    filtered ← []
    FOR tool IN candidates:
        IF tool.sla.latency_p95 <= constraints.max_latency
           AND tool.trust_score >= constraints.min_trust
           AND agent_context.auth.HasPermission(tool.auth_requirements)
           AND tool.version.SatisfiesConstraint(constraints.version_constraint):
            filtered.APPEND(tool)
 
    // Step 4: Rank by composite score
    FOR tool IN filtered:
        tool.discovery_score ← (
            0.4 * tool.trust_score
            + 0.3 * CapabilityRelevance(tool.capabilities, parsed.capability_tags)
            + 0.2 * (1.0 - Normalize(tool.sla.latency_p95, max=constraints.max_latency))
            + 0.1 * (1.0 - Normalize(tool.cost_per_call, max=constraints.max_cost))
        )
 
    ranked ← SortDescending(filtered, key=discovery_score)
 
    // Step 5: Return top-K with schemas
    RETURN DiscoveryResult(
        tools=ranked[:constraints.max_results OR 10],
        total_candidates=len(candidates),
        total_filtered=len(filtered)
    )

14.13.5 Version Governance#

Tool versions follow semantic versioning:

  • Patch (x.y.Zx.y.Z): Bug fixes, no schema changes.
  • Minor (x.Y.zx.Y.z): Additive schema changes (new optional fields), backward compatible.
  • Major (X.y.zX.y.z): Breaking schema changes, requires consumer migration.

The registry enforces:

BreakingChangePolicy:Major version bumps require 30d deprecation notice\text{BreakingChangePolicy}: \text{Major version bumps require } \geq 30\text{d deprecation notice}

Agents pin to version ranges (e.g., ^2.0.0 for any 2.x.x) and receive notifications when a major version sunset approaches.
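A caret-range check in the npm-style convention the registry is assumed to follow (same major version, at least the pinned minor.patch) can be sketched as follows. Note that full npm semantics handle `0.x` pins differently; this sketch covers only the major-version case described above:

```python
def parse_version(v: str):
    """'2.3.1' -> (2, 3, 1); tuples compare lexicographically."""
    return tuple(int(part) for part in v.split("."))

def satisfies_caret(version: str, pin: str) -> bool:
    """True when `version` is within the caret range `pin` (e.g. '^2.0.0')."""
    ver = parse_version(version)
    base = parse_version(pin.lstrip("^"))
    return ver[0] == base[0] and ver >= base

# A consumer pinned to ^2.0.0 accepts any 2.x.x but not 3.0.0 or 1.9.9:
assert satisfies_caret("2.3.1", "^2.0.0")
assert not satisfies_caret("3.0.0", "^2.0.0")
assert not satisfies_caret("1.9.9", "^2.0.0")
```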

14.13.6 Community Contribution Pipeline#

Community-contributed tools follow a graduated promotion path:

SUBMITTEDSANDBOXBETAGENERAL_AVAILABILITYVERIFIED\text{SUBMITTED} \to \text{SANDBOX} \to \text{BETA} \to \text{GENERAL\_AVAILABILITY} \to \text{VERIFIED}
| Stage | Requirements |
| --- | --- |
| Submitted | Schema provided, basic metadata |
| Sandbox | Passes automated schema validation, deploys in isolated sandbox |
| Beta | Passes integration tests, achieves Trust0.5\text{Trust} \geq 0.5 over 100+ invocations |
| GA | Passes security audit (self-assessed), achieves Trust0.7\text{Trust} \geq 0.7 over 1000+ invocations |
| Verified | Third-party security audit, SLA commitment, provider identity verified |
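The graduated promotion check can be sketched as a cascade over the stage requirements. The thresholds mirror the table above; the record fields are illustrative:

```python
def eligible_stage(tool: dict) -> str:
    """Highest promotion stage whose requirements the tool record satisfies.
    Each stage's requirements are cumulative over the previous stage's."""
    stage = "SUBMITTED"
    if tool.get("schema_valid") and tool.get("sandbox_deployed"):
        stage = "SANDBOX"
        if (tool.get("integration_tests_pass")
                and tool.get("trust", 0.0) >= 0.5
                and tool.get("invocations", 0) >= 100):
            stage = "BETA"
            if (tool.get("self_assessed_audit")
                    and tool.get("trust", 0.0) >= 0.7
                    and tool.get("invocations", 0) >= 1000):
                stage = "GENERAL_AVAILABILITY"
                if (tool.get("third_party_audit")
                        and tool.get("sla_commitment")
                        and tool.get("identity_verified")):
                    stage = "VERIFIED"
    return stage

tool = {"schema_valid": True, "sandbox_deployed": True,
        "integration_tests_pass": True, "trust": 0.62, "invocations": 450}
# Meets the Beta bar but not GA (trust < 0.7, invocations < 1000)
```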

14.13.7 Marketplace Governance and Abuse Prevention#

Governance mechanisms protect the ecosystem:

  • Rate limiting per tool provider: Prevents monopolization of compute/network resources.
  • Abuse detection: Monitors for tools that exfiltrate data, inject malicious payloads, or exhibit inconsistent behavior.
  • Automated regression testing: Continuous invocation of registered tools with known inputs to detect behavioral drift.
  • Revocation: Tools can be immediately revoked (circuit-broken globally) if critical issues are detected.
RevocationCondition(τ)=Trust(τ)<θrevokeSecurityAlert(τ)AbuseDetected(τ)\text{RevocationCondition}(\tau) = \text{Trust}(\tau) < \theta_{\text{revoke}} \vee \text{SecurityAlert}(\tau) \vee \text{AbuseDetected}(\tau)

When a tool is revoked, all agents using it receive a notification and automatically fall back to the next tool in their fallback hierarchy (§14.5).


Chapter Summary and Cross-Cutting Concerns#

Architectural Invariants Across All Tool Patterns#

The following invariants must hold across every tool pattern described in this chapter:

  1. Typed contracts at every boundary: Every tool exposes JSON Schema–validated input/output. No untyped data flows between tools or into agent context.

  2. Provenance on every output: Every tool result carries its invocation ID, timestamp, source, freshness, and confidence — enabling downstream reasoning to attribute and assess.

  3. Human-interruptible mutation: Any state-changing tool path includes an interception point where human approval can be required based on policy.

  4. Idempotent operations: All mutations are keyed by idempotency tokens. Retries and saga compensations are safe to re-execute.

  5. Bounded execution: Every tool invocation carries an explicit deadline. No tool can run indefinitely.

  6. Observability: Every invocation produces structured traces consumable by the agent runtime, enabling self-diagnosis and continuous evaluation.

  7. Least privilege: Tools execute with caller-scoped authorization, not agent-global credentials. Permissions are the minimum required for the specific operation.

  8. Graceful degradation: Every critical capability has a fallback hierarchy. Total system failure requires simultaneous failure of all tiers plus human escalation timeout.

Token Budget Accounting Across Tool Operations#

The total token cost of tool use in a single agent step is:

Ttool=Tschemas+Tinvocation_history+Tresults+Tvalidation_outputT_{\text{tool}} = T_{\text{schemas}} + T_{\text{invocation\_history}} + T_{\text{results}} + T_{\text{validation\_output}}

This must satisfy:

TtoolCwindowTsystemTtaskTmemoryTreservedT_{\text{tool}} \leq C_{\text{window}} - T_{\text{system}} - T_{\text{task}} - T_{\text{memory}} - T_{\text{reserved}}

When TtoolT_{\text{tool}} approaches the budget, the context compiler must:

  • Prune older tool results from history.
  • Compress current results to higher compression levels.
  • Reduce the number of tool schemas in context (lazy loading).
  • Summarize invocation history rather than including verbatim records.
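The budget constraint reduces to simple arithmetic. A minimal sketch with illustrative numbers and a hypothetical 90% headroom threshold for triggering the compression actions above:

```python
def tool_token_budget(c_window: int, t_system: int, t_task: int,
                      t_memory: int, t_reserved: int) -> int:
    """Upper bound on T_tool: C_window - T_system - T_task - T_memory - T_reserved."""
    return c_window - t_system - t_task - t_memory - t_reserved

def needs_compression(t_tool: int, budget: int, headroom: float = 0.9) -> bool:
    """Trigger pruning/compression once usage passes `headroom` of the budget."""
    return t_tool > headroom * budget

budget = tool_token_budget(c_window=128_000, t_system=2_000, t_task=4_000,
                           t_memory=8_000, t_reserved=16_000)
# budget = 98,000 tokens; at 95,000 tool tokens we exceed the 90% headroom
# and the context compiler must start pruning.
```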

Reliability Equation#

The end-to-end reliability of a tool chain of nn sequential steps with per-step reliability pip_i and fallback depth did_i (each fallback having independent reliability pi,jp_{i,j}) is:

Rchain=i=1n(1j=1di(1pi,j))R_{\text{chain}} = \prod_{i=1}^{n} \left(1 - \prod_{j=1}^{d_i}(1 - p_{i,j})\right)

For a chain of 5 steps, each with a primary tool at p=0.95p = 0.95 and a secondary at p=0.90p = 0.90:

Rchain=(1(10.95)(10.90))5=(10.005)5=0.99550.975R_{\text{chain}} = \left(1 - (1-0.95)(1-0.90)\right)^5 = (1 - 0.005)^5 = 0.995^5 \approx 0.975

Compared to single-tool reliability: 0.9550.7740.95^5 \approx 0.774. In this configuration, fallback hierarchies improve chain reliability by roughly 20 percentage points (about a 26% relative gain).
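The worked numbers above follow directly from the chain-reliability equation: a step succeeds if any of its fallbacks succeeds, and the chain succeeds only if every step does. A quick numeric check:

```python
import math

def chain_reliability(step_fallbacks):
    """R_chain = product over steps i of (1 - product over fallbacks j of (1 - p_ij))."""
    r = 1.0
    for fallbacks in step_fallbacks:
        fail_all = math.prod(1 - p for p in fallbacks)  # all fallbacks fail
        r *= 1 - fail_all                               # step succeeds otherwise
    return r

with_fallback = chain_reliability([[0.95, 0.90]] * 5)  # = 0.995**5, ~0.975
primary_only = chain_reliability([[0.95]] * 5)         # = 0.95**5,  ~0.774
```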


This chapter establishes the complete engineering framework for advanced tool use in agentic systems. The patterns described — composition, routing, selection, transactions, fallbacks, validation, self-healing, creation, and ecosystem management — form the operational backbone of any production-grade agentic platform. Each pattern is specified with typed contracts, bounded algorithms, and measurable quality criteria, ensuring that tool use is not an ad hoc capability but a disciplined, observable, and governable infrastructure layer.