Agentic Notes Library

Chapter 14: Advanced Tool Patterns — Composition, Chaining, and Agentic Tool Use

March 20, 2026

Preamble#

Tool use elevates a language model from a stateless text generator into an actuating agent capable of observing, mutating, and reasoning over external state. Yet the difference between a demonstration-grade tool-calling agent and a production-grade agentic system lies entirely in the composition discipline: how tools are sequenced, how their outputs are routed into downstream reasoning, how failures are absorbed, how transactions maintain consistency, and how the tool surface itself evolves under agent-driven creation and community governance. This chapter provides the complete engineering treatment of these advanced patterns — formalized mathematically, specified as typed protocols, and rendered as bounded pseudo-algorithms suitable for implementation in enterprise-scale agentic platforms.

Throughout, we adopt the following foundational abstraction. A tool $\tau$ is a typed function:

$$\tau : \mathcal{I}_\tau \rightarrow \mathcal{O}_\tau \cup \{\bot\}$$

where $\mathcal{I}_\tau$ is the validated input schema, $\mathcal{O}_\tau$ is the structured output schema, and $\bot$ denotes failure (with a typed error class). A tool invocation is a tuple $(id, \tau, \mathbf{x}, t_{\text{submit}}, t_{\text{deadline}}, \text{auth}, \text{trace\_id})$ submitted through a protocol boundary (MCP, JSON-RPC, or gRPC) with an explicit deadline, caller-scoped authorization, and distributed trace identity.
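This abstraction maps directly onto code. The following minimal Python sketch (class and field names are illustrative, not taken from any specific framework) models a tool as a schema-validated callable and an invocation as the tuple above:

```python
import time
import uuid
from dataclasses import dataclass, field
from typing import Any, Callable

@dataclass
class ToolError:
    """Typed error class standing in for the failure value ⊥."""
    kind: str
    message: str

@dataclass
class Tool:
    """A typed function I_tau -> O_tau ∪ {⊥}, with schemas as predicates."""
    name: str
    input_schema: Callable[[dict], bool]
    output_schema: Callable[[Any], bool]
    fn: Callable[[dict], Any]

    def __call__(self, x: dict) -> Any:
        if not self.input_schema(x):
            return ToolError("schema", f"invalid input for {self.name}")
        out = self.fn(x)
        if not self.output_schema(out):
            return ToolError("schema", f"invalid output from {self.name}")
        return out

@dataclass
class ToolInvocation:
    """The (id, tool, x, t_submit, t_deadline, auth, trace_id) tuple."""
    tool: Tool
    x: dict
    t_deadline: float                 # absolute epoch seconds
    auth: frozenset = frozenset()     # caller-scoped permissions
    trace_id: str = ""
    id: str = field(default_factory=lambda: uuid.uuid4().hex)
    t_submit: float = field(default_factory=time.time)

# A trivial tool: validated addition.
add = Tool(
    name="add",
    input_schema=lambda x: all(isinstance(x.get(k), (int, float)) for k in ("a", "b")),
    output_schema=lambda o: isinstance(o, (int, float)),
    fn=lambda x: x["a"] + x["b"],
)
```

Invalid inputs return a `ToolError` rather than raising, which keeps failure a first-class value the chain executor can route on.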


14.1 Tool Chaining: Sequential, Conditional, and Parallel Composition Patterns#

14.1.1 Foundational Definitions#

A tool chain $\mathcal{C}$ is a directed acyclic graph (DAG) of tool invocations where edges encode data dependencies and control-flow predicates. We distinguish three canonical composition patterns:

| Pattern | Structure | Data Flow | Concurrency |
| --- | --- | --- | --- |
| Sequential | Linear pipeline $\tau_1 \to \tau_2 \to \cdots \to \tau_n$ | Output of $\tau_i$ feeds input of $\tau_{i+1}$ | None |
| Conditional | Branch node $\phi$ selects among subchains | Predicate $\phi(o_i)$ determines successor | None (selection) |
| Parallel | Fan-out set $\{\tau_a, \tau_b, \ldots\}$ with join barrier | Independent inputs; outputs merged at barrier | Full within fan-out |

Formally, a chain $\mathcal{C} = (V, E, \Phi, \mathcal{J})$ where:

  • $V = \{\tau_1, \ldots, \tau_n\}$ is the tool vertex set.
  • $E \subseteq V \times V$ is the dependency edge set.
  • $\Phi : V \to \{\text{seq}, \text{cond}, \text{par}, \text{join}\}$ assigns a composition type.
  • $\mathcal{J} : E \to (\mathcal{O}_{\text{src}} \to \mathcal{I}_{\text{dst}})$ is the junction transform mapping source outputs to destination inputs.

14.1.2 Sequential Composition#

Sequential chaining is the simplest and most common pattern. Each tool $\tau_i$ produces output $o_i$, which is transformed by $\mathcal{J}_{i \to i+1}$ into the input for $\tau_{i+1}$:

$$o_i = \tau_i(\mathbf{x}_i), \quad \mathbf{x}_{i+1} = \mathcal{J}_{i \to i+1}(o_i)$$

The total latency of a sequential chain is:

$$L_{\text{seq}} = \sum_{i=1}^{n} l_{\tau_i} + \sum_{i=1}^{n-1} l_{\mathcal{J}_{i \to i+1}}$$

where $l_{\tau_i}$ is tool execution latency and $l_{\mathcal{J}}$ is junction transform latency (typically negligible for schema mappings, non-trivial for LLM-mediated transforms). The junction sum runs only to $n-1$ because the final tool has no outgoing junction.

Pseudo-Algorithm 14.1: Sequential Chain Executor

PROCEDURE ExecuteSequentialChain(chain: List<ToolNode>, initial_input, trace_id, deadline):
    current_input ← initial_input
    results ← []
    remaining_budget ← deadline - now()
 
    FOR i ← 0 TO len(chain) - 1:
        node ← chain[i]
        tool ← node.tool
        junction ← node.junction_transform
 
        // Deadline subdivision
        estimated_remaining_latency ← SUM(chain[j].estimated_latency FOR j IN [i..len(chain)-1])
        local_deadline ← now() + (remaining_budget * chain[i].estimated_latency / estimated_remaining_latency)
 
        // Schema validation on input
        validated_input ← ValidateSchema(current_input, tool.input_schema)
        IF validated_input IS SchemaError:
            RETURN ChainFailure(step=i, error=validated_input, partial_results=results)
 
        // Invoke with deadline, auth, trace
        invocation ← ToolInvocation(
            id=GenerateUUID(),
            tool=tool,
            input=validated_input,
            deadline=local_deadline,
            auth=ScopedAuth(tool.required_permissions),
            trace_id=trace_id
        )
        result ← InvokeTool(invocation)
 
        IF result IS Failure:
            RETURN ChainFailure(step=i, error=result.error, partial_results=results)
 
        // Validate output schema
        validated_output ← ValidateSchema(result.output, tool.output_schema)
        IF validated_output IS SchemaError:
            RETURN ChainFailure(step=i, error="output_schema_violation", partial_results=results)
 
        results.APPEND(validated_output)
 
        // Apply junction transform for next step
        IF i < len(chain) - 1:
            current_input ← junction(validated_output)
            remaining_budget ← deadline - now()
            IF remaining_budget <= 0:
                RETURN ChainFailure(step=i, error="deadline_exceeded", partial_results=results)
 
    RETURN ChainSuccess(results=results, final_output=results[-1])

14.1.3 Conditional Composition#

Conditional branching introduces a predicate function $\phi : \mathcal{O}_{\tau_i} \to \{b_1, b_2, \ldots, b_k\}$ that maps the output of a predecessor tool to a branch selector. Each branch $b_j$ identifies a distinct subchain $\mathcal{C}_j$:

$$\text{next}(\tau_i, o_i) = \mathcal{C}_{\phi(o_i)}$$

Predicates may be:

  • Deterministic: schema field comparison, threshold evaluation, enum matching.
  • LLM-mediated: the agent reasons over $o_i$ and selects the branch. This introduces non-determinism and must be bounded by a selection confidence threshold $\theta_{\text{branch}}$:

$$\phi_{\text{LLM}}(o_i) = \arg\max_{b_j} P(b_j \mid o_i, \text{context}), \quad \text{subject to } \max_j P(b_j \mid o_i, \text{context}) \geq \theta_{\text{branch}}$$

If no branch exceeds $\theta_{\text{branch}}$, the chain enters a disambiguation subroutine (human escalation, additional retrieval, or retry with enriched context).
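A minimal sketch of this guarded selection, assuming the branch distribution has already been obtained (for example, from logprobs over branch labels); the threshold value is illustrative:

```python
def select_branch(branch_probs: dict, theta_branch: float = 0.7):
    """Return the argmax branch if its probability clears theta_branch;
    return None to signal that disambiguation is required."""
    best = max(branch_probs, key=branch_probs.get)
    if branch_probs[best] >= theta_branch:
        return best
    return None  # caller escalates: human review, more retrieval, or enriched retry
```

A `None` result routes the chain into the disambiguation subroutine instead of committing to a low-confidence branch.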

14.1.4 Parallel Composition#

Parallel fan-out distributes independent tool invocations across concurrent execution slots. A join barrier $\mathcal{B}$ collects outputs and merges them before the chain continues.

Let the fan-out set be $F = \{\tau_{a_1}, \tau_{a_2}, \ldots, \tau_{a_m}\}$. Each tool receives independently constructed inputs:

$$\mathbf{x}_{a_j} = \mathcal{J}_{\text{fan-out}}^{(j)}(o_{\text{predecessor}})$$

The join barrier produces:

$$o_{\text{merged}} = \mathcal{B}(o_{a_1}, o_{a_2}, \ldots, o_{a_m})$$

The latency of the parallel segment is:

$$L_{\text{par}} = \max_{j \in [1, m]} l_{\tau_{a_j}} + l_{\mathcal{B}}$$

Join barrier strategies:

  1. All-success: Wait for all $m$ results. Fail if any $\tau_{a_j}$ fails.
  2. Quorum: Wait for $\lceil q \cdot m \rceil$ successes, where $q \in (0, 1]$ is the quorum fraction.
  3. First-success: Return the first successful result; cancel remaining.
  4. Best-of-N: Wait for all, then select the highest-quality result by a scoring function $s : \mathcal{O} \to \mathbb{R}$.

Pseudo-Algorithm 14.2: Parallel Fan-Out with Quorum Join

PROCEDURE ExecuteParallelFanOut(fan_out_nodes: List<ToolNode>, predecessor_output, quorum_fraction, deadline, trace_id):
    invocations ← []
    FOR node IN fan_out_nodes:
        input_j ← node.fan_out_transform(predecessor_output)
        validated ← ValidateSchema(input_j, node.tool.input_schema)
        inv ← ToolInvocation(id=UUID(), tool=node.tool, input=validated, deadline=deadline, trace_id=trace_id)
        invocations.APPEND(inv)
 
    // Dispatch all concurrently
    futures ← DispatchConcurrent(invocations)
    required_successes ← CEIL(quorum_fraction * len(futures))
    successes ← []
    failures ← []
 
    WHILE len(successes) < required_successes AND (len(successes) + len(futures)) >= required_successes:
        completed ← AwaitAny(futures, timeout=deadline - now())
        IF completed IS None:
            BREAK  // Deadline
        IF completed.result IS Success:
            successes.APPEND(completed)
        ELSE:
            failures.APPEND(completed)
        futures.REMOVE(completed)
 
    IF len(successes) >= required_successes:
        CancelRemaining(futures)
        merged ← JoinBarrierMerge(successes)
        RETURN ParallelSuccess(merged=merged, individual=successes, failures=failures)
    ELSE:
        CancelRemaining(futures)
        RETURN ParallelFailure(successes=successes, failures=failures, quorum_not_met=TRUE)

14.1.5 Composition Algebra#

Tool chains compose algebraically. Define operators:

  • Sequence: $\mathcal{C}_1 \triangleright \mathcal{C}_2$ — execute $\mathcal{C}_1$ then $\mathcal{C}_2$, routing output through the junction.
  • Branch: $\mathcal{C}_1 \triangleright_\phi \{b_1 : \mathcal{C}_a, b_2 : \mathcal{C}_b, \ldots\}$ — conditional dispatch.
  • Parallel: $\mathcal{C}_1 \triangleright (\mathcal{C}_a \| \mathcal{C}_b \| \ldots) \triangleright_{\mathcal{B}} \mathcal{C}_2$ — fan-out, join, continue.

Any composition of these operators produces a valid DAG. The chain compiler validates:

  1. Type compatibility: $\forall (u, v) \in E : \text{range}(\mathcal{J}_{u \to v}) \subseteq \mathcal{I}_v$.
  2. Acyclicity: No directed cycle exists (bounded recursion is modeled as iteration with explicit depth counters).
  3. Deadline feasibility: $\sum_{\text{critical path}} \hat{l}_{\tau_i} \leq t_{\text{deadline}}$, where $\hat{l}$ is estimated latency.
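These three checks can be sketched directly. The compiler below is a simplified stand-in that treats schemas as field-name sets and relies on the standard library's `graphlib` for cycle detection; all structure names are illustrative:

```python
from graphlib import TopologicalSorter, CycleError

def validate_chain(edges, junction_range, input_fields, est_latency, deadline):
    """Validate the three chain-compiler invariants over a DAG.

    edges: list of (src, dst) tool-name pairs.
    junction_range[(src, dst)]: fields the junction J_{src->dst} emits.
    input_fields[dst]: fields tool dst requires (schemas as field sets).
    est_latency[v]: estimated per-tool latency; deadline: total budget."""
    preds = {}
    for src, dst in edges:
        preds.setdefault(dst, set()).add(src)
        preds.setdefault(src, set())

    # Acyclicity: graphlib raises CycleError on a directed cycle.
    try:
        order = list(TopologicalSorter(preds).static_order())
    except CycleError:
        return False, "cycle detected"

    # Type compatibility: range(J_{u->v}) must cover I_v.
    for src, dst in edges:
        if not input_fields[dst] <= junction_range[(src, dst)]:
            return False, f"junction {src}->{dst} does not satisfy {dst} inputs"

    # Deadline feasibility: longest (critical) path through the DAG.
    finish = {}
    for v in order:  # predecessors precede v in topological order
        finish[v] = est_latency[v] + max((finish[p] for p in preds[v]), default=0.0)
    if max(finish.values()) > deadline:
        return False, "critical path exceeds deadline"
    return True, "ok"
```

The topological order doubles as the evaluation order for the longest-path dynamic program, so one traversal serves both checks 2 and 3.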

14.2 Tool Output Routing: Feeding Tool Results as Context to Subsequent Reasoning Steps#

14.2.1 The Output Routing Problem#

When a tool $\tau_i$ returns output $o_i$, the agentic system must decide where and how that output enters the reasoning pipeline. The output may serve as:

  1. Direct input to the next tool (junction transform, §14.1).
  2. Context injection into the LLM's next reasoning step.
  3. Memory write to working, session, or episodic memory.
  4. Observation record for verification or audit.
  5. Discard if the output has been fully consumed by an earlier transform.

This routing decision is itself a function:

$$\mathcal{R} : \mathcal{O}_\tau \times \mathcal{S}_{\text{agent}} \to \mathcal{P}(\{\text{tool\_input}, \text{context}, \text{memory}, \text{observation}, \text{discard}\})$$

where $\mathcal{S}_{\text{agent}}$ is the current agent state and $\mathcal{P}$ denotes the power set (multiple destinations are common).

14.2.2 Context Injection Strategies#

When tool output is routed into the LLM context window, the key trade-off is information density vs. token cost. Raw tool outputs (e.g., a full database result set) may consume thousands of tokens while contributing marginal reasoning value.

Compression hierarchy for tool outputs entering context:

| Level | Method | Compression Ratio | Information Loss |
| --- | --- | --- | --- |
| L0 | Raw output verbatim | $1:1$ | None |
| L1 | Schema-filtered projection (select relevant fields) | $2\text{–}10\times$ | Low (structural) |
| L2 | Summarization by secondary LLM call | $10\text{–}100\times$ | Moderate (semantic) |
| L3 | Key-value extraction (facts only) | $20\text{–}200\times$ | Moderate–High |
| L4 | Boolean/scalar signal ("found"/"not found", count) | $100\text{–}1000\times$ | High |

The optimal level $\ell^*$ is selected by:

$$\ell^* = \arg\min_{\ell} \left[ \alpha \cdot \text{TokenCost}(\ell) + \beta \cdot \text{InfoLoss}(\ell) + \gamma \cdot \text{Latency}(\ell) \right]$$

where $\alpha, \beta, \gamma$ are task-specific weights. For high-stakes reasoning (medical, financial), $\beta$ dominates; for high-throughput batch processing, $\alpha$ dominates.
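The argmin is a straightforward weighted minimization once per-level estimates exist. In the sketch below the (token cost, information loss, latency) triples are invented placeholder estimates; a real system would measure them per tool:

```python
def select_compression_level(levels, alpha, beta, gamma):
    """Pick the level minimizing alpha*TokenCost + beta*InfoLoss + gamma*Latency.
    `levels` maps level name -> (token_cost, info_loss, latency)."""
    def cost(name):
        tokens, loss, latency = levels[name]
        return alpha * tokens + beta * loss + gamma * latency
    return min(levels, key=cost)

# Placeholder estimates for one hypothetical tool output.
levels = {
    "L0": (4000, 0.00, 0.0),   # raw verbatim
    "L1": (800,  0.05, 0.0),   # schema-filtered projection
    "L2": (150,  0.20, 1.2),   # secondary-LLM summary (extra latency)
    "L4": (10,   0.70, 0.0),   # scalar signal
}
```

With these placeholder numbers, a heavily weighted $\beta$ (high-stakes) selects L0, while a dominant $\alpha$ (throughput) selects L4.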

14.2.3 Structured Output Placement in the Context Window#

Tool outputs should be placed in the context window using structured delimiters and provenance tags:

<tool_result tool_id="τ_3" invocation_id="inv-8a2f" timestamp="2025-01-15T10:32:00Z"
             source="database/orders" freshness="live" confidence="verified">
  { "order_count": 1247, "total_revenue": 89340.50, "currency": "USD" }
</tool_result>

This enables the LLM to:

  • Distinguish tool-provided facts from its own prior knowledge (hallucination control).
  • Attribute claims to specific tool invocations (provenance).
  • Assess freshness and confidence for downstream reasoning.
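A small helper can produce this envelope mechanically. The function below is a sketch mirroring the attribute set shown above; the `freshness` and `confidence` vocabularies are illustrative defaults, not a fixed standard:

```python
import json
from datetime import datetime, timezone

def tag_tool_result(payload: dict, tool_id: str, invocation_id: str,
                    source: str, freshness: str = "live",
                    confidence: str = "verified") -> str:
    """Wrap a tool output in the provenance-tagged delimiter format so the
    LLM can attribute and trust-weight the enclosed facts."""
    ts = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    attrs = (f'tool_id="{tool_id}" invocation_id="{invocation_id}" '
             f'timestamp="{ts}" source="{source}" '
             f'freshness="{freshness}" confidence="{confidence}"')
    return f"<tool_result {attrs}>\n  {json.dumps(payload)}\n</tool_result>"
```

Serializing the payload as JSON inside the tagged block keeps the fact boundary unambiguous even when the payload itself contains prose.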

14.2.4 Output Routing Decision Algorithm#

Pseudo-Algorithm 14.3: Tool Output Router

PROCEDURE RouteToolOutput(output, tool_metadata, agent_state, chain_plan, token_budget):
    destinations ← {}
 
    // 1. Check if next tool in chain needs this output
    IF chain_plan.HasSuccessor(tool_metadata.step_id):
        successor ← chain_plan.GetSuccessor(tool_metadata.step_id)
        junction ← chain_plan.GetJunction(tool_metadata.step_id, successor.step_id)
        transformed ← junction.Transform(output)
        destinations.ADD("tool_input", transformed)
 
    // 2. Determine context injection need
    IF agent_state.RequiresReasoningOverOutput(tool_metadata):
        compression_level ← SelectCompressionLevel(
            output_size=TokenCount(output),
            remaining_budget=token_budget.remaining,
            task_criticality=agent_state.task.criticality,
            output_schema=tool_metadata.output_schema
        )
        compressed ← CompressOutput(output, compression_level, tool_metadata)
        provenance_tagged ← AttachProvenance(compressed, tool_metadata)
        destinations.ADD("context", provenance_tagged)
 
    // 3. Memory write evaluation
    IF MemoryPolicy.ShouldPersist(output, tool_metadata, agent_state):
        memory_record ← ConstructMemoryRecord(
            content=ExtractPersistableContent(output),
            provenance=tool_metadata,
            expiry=MemoryPolicy.ComputeExpiry(tool_metadata),
            layer=MemoryPolicy.SelectLayer(output, agent_state)
        )
        IF NOT MemoryStore.IsDuplicate(memory_record):
            destinations.ADD("memory", memory_record)
 
    // 4. Observation log (always, for audit)
    destinations.ADD("observation", ObservationRecord(output, tool_metadata, timestamp=now()))
 
    RETURN destinations

14.2.5 Token Budget Accounting#

Every tool output routed into context must be deducted from the active token budget $B_{\text{active}}$. Let $C_{\text{window}}$ be the total context window capacity. Then:

$$B_{\text{active}} = C_{\text{window}} - T_{\text{system}} - T_{\text{history}} - T_{\text{memory}} - T_{\text{reserved\_generation}}$$

where $T_{\text{system}}$ is the system prompt / role policy cost, $T_{\text{history}}$ is conversation history, $T_{\text{memory}}$ is injected memory summaries, and $T_{\text{reserved\_generation}}$ is the minimum generation budget (typically 2048–4096 tokens). A tool output of size $t_o$ tokens is admitted into context only if:

$$t_o \leq B_{\text{active}} - T_{\text{safety\_margin}}$$

If $t_o$ exceeds this bound, the output is either compressed to a lower level $\ell$ or offloaded to external memory with a pointer summary injected into context.
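The accounting reduces to a few fields and one inequality. A minimal sketch, with the reserved-generation and safety-margin defaults chosen arbitrarily within the ranges mentioned above:

```python
from dataclasses import dataclass

@dataclass
class TokenBudget:
    window: int                      # C_window
    system: int                      # T_system
    history: int                     # T_history
    memory: int                      # T_memory
    reserved_generation: int = 3072  # T_reserved_generation
    safety_margin: int = 256         # T_safety_margin

    @property
    def active(self) -> int:
        """B_active = C_window - T_system - T_history - T_memory - T_reserved."""
        return (self.window - self.system - self.history
                - self.memory - self.reserved_generation)

    def admit(self, output_tokens: int) -> bool:
        """True iff the tool output fits under B_active minus the margin."""
        return output_tokens <= self.active - self.safety_margin
```

A rejected output triggers the compress-or-offload branch rather than blind truncation.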


14.3 Tool Selection Strategies: LLM-Driven, Rule-Based, Policy-Gated, and Learned Tool Routing#

14.3.1 The Tool Selection Problem#

Given a set of available tools $\mathcal{T} = \{\tau_1, \tau_2, \ldots, \tau_N\}$ and a current agent state $s \in \mathcal{S}$, the tool selection problem is to choose a subset $A \subseteq \mathcal{T}$ (possibly a singleton) and construct the invocation parameters $\mathbf{x}_\tau$ for each selected tool:

$$\pi(s) = \{(\tau, \mathbf{x}_\tau) \mid \tau \in A \subseteq \mathcal{T}\}$$

This is a policy $\pi$ mapping states to tool invocations. The quality of the policy is evaluated over a distribution of tasks:

$$Q(\pi) = \mathbb{E}_{s \sim \mathcal{D}} \left[ \sum_{t=0}^{T} \gamma^t \cdot r(s_t, \pi(s_t)) \right]$$

where $r(s_t, \pi(s_t))$ is the reward at step $t$, $\gamma \in (0, 1]$ is the discount factor, and $T$ is the episode horizon.

14.3.2 Strategy Taxonomy#

A. LLM-Driven Selection (Native Function Calling)#

The LLM receives tool descriptions in its context and generates a structured tool call as part of its output:

$$\tau^*, \mathbf{x}^* = \text{LLM}(\text{context} \,\|\, \text{ToolSchemas}(\mathcal{T}))$$

Advantages: Flexible, handles novel tool combinations, benefits from world knowledge. Risks: Hallucinated tool names, invalid parameters, suboptimal selection under large $|\mathcal{T}|$.

Token cost scaling: If each tool schema consumes $\bar{t}$ tokens, the total schema injection cost is $|\mathcal{T}| \cdot \bar{t}$. For $|\mathcal{T}| > 50$, this can consume a significant fraction of the context window.

Mitigation — Lazy Tool Loading: Only inject schemas for tools relevant to the current task phase, determined by a lightweight pre-classifier:

$$\mathcal{T}_{\text{active}} = \text{TopK}\left(\{\tau \in \mathcal{T} \mid \text{relevance}(\tau, s) \geq \theta_{\text{tool}}\}, k\right)$$

where $\text{relevance}(\tau, s)$ may be computed by embedding similarity between the task description and tool descriptions, or by a trained classifier.
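A dependency-free sketch of the relevance filter, using cosine similarity over precomputed description embeddings; the two-dimensional vectors in the test below are toy placeholders for real encoder outputs:

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 for zero vectors)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def active_tools(task_emb, tool_embs: dict, theta: float = 0.3, k: int = 5):
    """T_active: tools whose description embedding clears theta,
    kept in descending relevance order and truncated to top-k."""
    scored = [(name, cosine(task_emb, emb)) for name, emb in tool_embs.items()]
    eligible = [(n, s) for n, s in scored if s >= theta]
    eligible.sort(key=lambda pair: pair[1], reverse=True)
    return [name for name, _ in eligible[:k]]
```

Only the surviving tools' schemas are injected, capping schema cost at $k \cdot \bar{t}$ instead of $|\mathcal{T}| \cdot \bar{t}$.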

B. Rule-Based Selection#

Deterministic rules map state patterns to tool choices:

$$\pi_{\text{rule}}(s) = \begin{cases} \tau_{\text{search}} & \text{if } s.\text{intent} = \text{LOOKUP} \\ \tau_{\text{calculate}} & \text{if } s.\text{intent} = \text{COMPUTE} \\ \tau_{\text{write\_file}} & \text{if } s.\text{intent} = \text{PERSIST} \\ \varnothing & \text{otherwise} \end{cases}$$

Advantages: Deterministic, auditable, zero LLM cost for selection. Risks: Brittle under novel tasks, requires manual maintenance as the tool set evolves.

C. Policy-Gated Selection#

A policy layer intercepts LLM-proposed tool calls and applies authorization, safety, and cost constraints before execution:

$$\text{Gate}(\tau, \mathbf{x}, s) = \begin{cases} \text{ALLOW} & \text{if } \text{Auth}(\tau, s.\text{caller}) \wedge \text{Safe}(\tau, \mathbf{x}) \wedge \text{Budget}(\tau, s) \\ \text{REQUIRE\_APPROVAL} & \text{if } \tau \in \mathcal{T}_{\text{sensitive}} \\ \text{DENY} & \text{otherwise} \end{cases}$$

The gate enforces:

  • Authorization: Caller-scoped permissions, not agent-owned credentials.
  • Safety: Input sanitization, dangerous operation detection (e.g., DROP TABLE).
  • Budget: Cost ceiling per invocation and cumulative per session.
  • Rate limiting: Per-tool invocation rate within sliding windows.
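A compact sketch of the gate's case analysis. The permission model, cost fields, and dangerous-pattern list are all illustrative; a production gate would use a real policy engine and parser-level sanitization rather than substring matching:

```python
def gate(tool, params, caller_perms, sensitive, spent, ceiling,
         dangerous=("DROP TABLE", "DELETE FROM", "rm -rf")):
    """Return ALLOW / REQUIRE_APPROVAL / DENY for a proposed invocation.
    Sensitive tools require approval even when all three checks pass."""
    authorized = tool["required_permission"] in caller_perms  # caller-scoped
    text = " ".join(str(v) for v in params.values()).upper()
    safe = not any(p.upper() in text for p in dangerous)
    within_budget = spent + tool["cost"] <= ceiling
    if authorized and safe and within_budget:
        return "REQUIRE_APPROVAL" if tool["name"] in sensitive else "ALLOW"
    return "DENY"
```

Note the design choice: failing any of the three checks short-circuits to DENY before the sensitivity test, so approval can never override a missing permission.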

D. Learned Tool Routing#

A trained router model $f_\theta$ maps task embeddings to tool selection distributions:

$$P(\tau \mid s) = \text{softmax}(f_\theta(\text{Embed}(s)))$$

Training data is derived from successful traces:

$$\mathcal{D}_{\text{train}} = \{(s_t, \tau_t^*) \mid \text{trace } t \text{ was successful}\}$$

The loss function is cross-entropy:

$$\mathcal{L}(\theta) = -\sum_{(s, \tau^*) \in \mathcal{D}_{\text{train}}} \log P_\theta(\tau^* \mid s)$$

Advantages: Adapts to usage patterns, fast inference, low token cost. Risks: Cold-start problem, distribution shift as tools are added/removed.

14.3.3 Hybrid Selection Architecture#

Production systems combine all four strategies in a layered architecture:

User Query → Intent Classifier → Learned Router (candidate set)
                                        ↓
                                 LLM Selection (from candidate set)
                                        ↓
                                 Policy Gate (auth, safety, budget)
                                        ↓
                                 Rule Override (domain-specific hardcoded rules)
                                        ↓
                                 Approved Invocation → Executor

Pseudo-Algorithm 14.4: Hybrid Tool Selector

PROCEDURE SelectTool(agent_state, available_tools, token_budget, policy):
    // Phase 1: Learned routing (fast pre-filter)
    relevance_scores ← LearnedRouter.Score(agent_state.task_embedding, available_tools)
    candidate_set ← TopK(available_tools, relevance_scores, k=min(10, len(available_tools)))
 
    // Phase 2: Rule-based override
    rule_forced ← RuleEngine.Evaluate(agent_state)
    IF rule_forced IS NOT NULL:
        candidate_set ← {rule_forced} ∪ candidate_set[:2]  // Keep rule choice primary
 
    // Phase 3: LLM-driven selection from candidate set
    tool_schemas ← [tool.schema FOR tool IN candidate_set]
    schema_tokens ← SUM(TokenCount(s) FOR s IN tool_schemas)
    IF schema_tokens > token_budget.tool_schema_allocation:
        avg_schema_tokens ← schema_tokens / len(candidate_set)
        candidate_set ← candidate_set[:FLOOR(token_budget.tool_schema_allocation / avg_schema_tokens)]
        tool_schemas ← [tool.schema FOR tool IN candidate_set]
 
    llm_selection ← LLM.SelectTool(
        context=agent_state.context,
        tool_schemas=tool_schemas,
        instruction="Select the most appropriate tool and construct valid parameters."
    )
 
    // Phase 4: Policy gate
    gate_result ← policy.Evaluate(llm_selection.tool, llm_selection.params, agent_state)
    SWITCH gate_result:
        CASE ALLOW:
            RETURN ApprovedInvocation(llm_selection.tool, llm_selection.params)
        CASE REQUIRE_APPROVAL:
            approval ← RequestHumanApproval(llm_selection, timeout=policy.approval_timeout)
            IF approval.granted:
                RETURN ApprovedInvocation(llm_selection.tool, llm_selection.params)
            ELSE:
                RETURN Denied(reason=approval.reason)
        CASE DENY:
            RETURN Denied(reason=gate_result.reason)

14.3.4 Tool Selection Under Large Catalogs#

When $|\mathcal{T}|$ exceeds ~50 tools, neither full schema injection nor flat relevance scoring scales. A hierarchical tool index partitions tools into categories $\mathcal{G}_1, \ldots, \mathcal{G}_m$, where $\bigcup_j \mathcal{G}_j = \mathcal{T}$ and $\mathcal{G}_i \cap \mathcal{G}_j = \varnothing$ for $i \neq j$. Selection proceeds in two phases:

  1. Category selection: $g^* = \arg\max_j \text{relevance}(\mathcal{G}_j, s)$ — low cost, operates over category descriptions.
  2. Tool selection within category: Standard hybrid selection over $\mathcal{G}_{g^*}$.

This reduces schema injection cost from $O(|\mathcal{T}|)$ to $O(|\mathcal{G}_{g^*}|)$ per step.
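The two-phase lookup can be sketched with any relevance scorer; here a naive term-overlap score stands in for the embedding or classifier scorer of a real system, and the catalog layout is hypothetical:

```python
def select_hierarchical(task_terms: set, catalog: dict):
    """Two-phase selection over a partitioned tool catalog.
    catalog: category -> {"desc_terms": set, "tools": {tool_name: desc_terms}}."""
    def overlap(terms):
        return len(task_terms & terms)

    # Phase 1: pick the category, cheaply, over category descriptions only.
    g_star = max(catalog, key=lambda g: overlap(catalog[g]["desc_terms"]))

    # Phase 2: standard selection restricted to the chosen category.
    tools = catalog[g_star]["tools"]
    tool = max(tools, key=lambda t: overlap(tools[t]))
    return g_star, tool
```

Only phase 2 ever touches individual tool schemas, which is where the $O(|\mathcal{G}_{g^*}|)$ bound comes from.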


14.4 Multi-Tool Transactions: Compensation, Rollback, and Saga Patterns for Tool Chains#

14.4.1 Transactional Semantics for Tool Chains#

Unlike database transactions backed by ACID guarantees, tool chains span heterogeneous systems (APIs, file systems, databases, external services) where global atomicity is unavailable. We therefore adopt the Saga pattern: a sequence of local transactions $T_1, T_2, \ldots, T_n$, each with a compensating action $C_i$ that semantically undoes the effect of $T_i$.

Definition: A tool saga $\mathcal{S}$ is a pair of sequences:

$$\mathcal{S} = \langle (T_1, T_2, \ldots, T_n), (C_1, C_2, \ldots, C_n) \rangle$$

where $T_i$ is the forward tool invocation and $C_i$ is its compensating counterpart. If $T_k$ fails after $T_1, \ldots, T_{k-1}$ have succeeded, the saga executes compensations in reverse order:

$$C_{k-1}, C_{k-2}, \ldots, C_1$$

14.4.2 Compensation Design Constraints#

Not all tool operations are reversible. We classify tools by their compensation capability:

| Class | Compensation | Example |
| --- | --- | --- |
| Fully reversible | $C_i$ exactly undoes $T_i$ | File write → file delete |
| Semantically reversible | $C_i$ achieves approximate undo | API order creation → order cancellation |
| Compensable with side effects | $C_i$ undoes primary effect but leaves traces | Email sent → follow-up retraction email |
| Irreversible | No $C_i$ exists | Published tweet, physical actuation |

For irreversible tools, the saga must place them last in the chain (after all fallible steps) or require pre-commitment approval:

$$\text{Position constraint}: \forall\, \tau_i \in \mathcal{T}_{\text{irreversible}},\ \nexists\, \tau_j \text{ with } j > i \text{ such that } \tau_j \text{ is fallible}$$

14.4.3 Saga Orchestrator#

Pseudo-Algorithm 14.5: Saga Orchestrator with Compensation

PROCEDURE ExecuteSaga(saga: SagaDefinition, initial_input, trace_id):
    completed_steps ← []
    current_input ← initial_input
 
    FOR i ← 0 TO len(saga.forward_steps) - 1:
        step ← saga.forward_steps[i]
 
        // Pre-flight check for irreversible steps
        IF step.tool.compensation_class = IRREVERSIBLE:
            IF i < len(saga.forward_steps) - 1:
                LOG.WARN("Irreversible step not at end of saga; requesting approval")
                approval ← RequestHumanApproval(step, context=current_input)
                IF NOT approval.granted:
                    ExecuteCompensation(completed_steps)
                    RETURN SagaAborted(reason="irreversible_step_denied", compensated=TRUE)
 
        // Execute forward step
        result ← InvokeToolWithRetry(
            tool=step.tool,
            input=current_input,
            retry_budget=step.retry_budget,
            trace_id=trace_id
        )
 
        IF result IS Failure:
            LOG.ERROR("Saga step failed", step=i, error=result.error)
            compensation_result ← ExecuteCompensation(completed_steps)
            RETURN SagaFailed(
                failed_step=i,
                error=result.error,
                compensation_result=compensation_result,
                partial_results=[s.result FOR s IN completed_steps]
            )
 
        completed_steps.APPEND(CompletedStep(
            index=i,
            tool=step.tool,
            input=current_input,
            result=result.output,
            compensation=step.compensation
        ))
        current_input ← step.junction_transform(result.output)
 
    RETURN SagaSuccess(results=[s.result FOR s IN completed_steps])
 
 
PROCEDURE ExecuteCompensation(completed_steps: List<CompletedStep>):
    compensation_results ← []
    // Reverse order
    FOR step IN REVERSED(completed_steps):
        IF step.compensation IS NOT NULL:
            comp_result ← InvokeToolWithRetry(
                tool=step.compensation.tool,
                input=step.compensation.construct_input(step.result, step.input),
                retry_budget=step.compensation.retry_budget
            )
            compensation_results.APPEND(CompensationOutcome(step=step.index, result=comp_result))
            IF comp_result IS Failure:
                LOG.CRITICAL("Compensation failed — manual intervention required",
                             step=step.index, error=comp_result.error)
                AlertOnCall("saga_compensation_failure", step=step.index)
        ELSE:
            LOG.WARN("No compensation defined for step", step=step.index)
 
    RETURN compensation_results

14.4.4 Idempotency Requirements#

Every forward step $T_i$ and compensation $C_i$ must be idempotent:

$$T_i(T_i(\mathbf{x})) \equiv T_i(\mathbf{x}), \quad C_i(C_i(\mathbf{x})) \equiv C_i(\mathbf{x})$$

Implementation techniques:

  • Idempotency keys: Each invocation carries a unique key; the tool server deduplicates.
  • Conditional mutations: Use version vectors or ETags; the operation succeeds only if the precondition matches.
  • Upsert semantics: Prefer INSERT ... ON CONFLICT DO UPDATE over a blind INSERT.
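The idempotency-key technique is the simplest of the three to sketch: the server caches results by key, so a retried forward step or compensation replays the recorded outcome instead of re-executing the side effect. Class and field names below are illustrative:

```python
class IdempotentToolServer:
    """Deduplicate invocations by idempotency key: a replayed key returns
    the cached result and triggers no new side effect."""
    def __init__(self, fn):
        self.fn = fn
        self._seen = {}       # idempotency_key -> cached result
        self.executions = 0   # side-effect counter, for illustration only

    def invoke(self, idempotency_key: str, payload: dict):
        if idempotency_key in self._seen:
            return self._seen[idempotency_key]  # replay: no re-execution
        self.executions += 1
        result = self.fn(payload)
        self._seen[idempotency_key] = result
        return result
```

With keys in place, the saga orchestrator's retry loop can safely re-send any step after a network timeout without risking double execution.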

14.4.5 Saga State Machine#

A saga's lifecycle is a finite state machine:

$$\text{States} = \{\text{PENDING}, \text{RUNNING}, \text{COMPENSATING}, \text{COMPLETED}, \text{FAILED}, \text{COMPENSATION\_FAILED}\}$$

Transitions:

$$\text{PENDING} \xrightarrow{\text{start}} \text{RUNNING} \xrightarrow{\text{all\_succeed}} \text{COMPLETED}$$

$$\text{RUNNING} \xrightarrow{\text{step\_fails}} \text{COMPENSATING} \xrightarrow{\text{all\_compensated}} \text{FAILED}$$

$$\text{COMPENSATING} \xrightarrow{\text{comp\_fails}} \text{COMPENSATION\_FAILED}$$

The COMPENSATION_FAILED state triggers manual intervention via alerting infrastructure.
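The lifecycle can be enforced with an explicit transition table, so an orchestrator bug cannot drive a saga through an illegal edge (for example, completing a saga that is mid-compensation). A minimal sketch:

```python
TRANSITIONS = {
    ("PENDING", "start"): "RUNNING",
    ("RUNNING", "all_succeed"): "COMPLETED",
    ("RUNNING", "step_fails"): "COMPENSATING",
    ("COMPENSATING", "all_compensated"): "FAILED",
    ("COMPENSATING", "comp_fails"): "COMPENSATION_FAILED",
}

class SagaStateMachine:
    """Saga lifecycle FSM; any (state, event) pair absent from the table
    is rejected rather than silently coerced."""
    def __init__(self):
        self.state = "PENDING"

    def fire(self, event: str) -> str:
        key = (self.state, event)
        if key not in TRANSITIONS:
            raise ValueError(f"illegal transition {key}")
        self.state = TRANSITIONS[key]
        return self.state
```

Terminal states (COMPLETED, FAILED, COMPENSATION_FAILED) have no outgoing entries, so any further event raises.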


14.5 Tool Fallback Hierarchies: Primary → Secondary → Degraded → Manual Escalation#

14.5.1 Motivation and Architecture#

External tools are inherently unreliable: APIs have outages, rate limits are exhausted, and services degrade. A production agentic system must define fallback hierarchies for every critical tool capability so that agent execution degrades gracefully rather than failing catastrophically.

A fallback hierarchy for a capability $\kappa$ is an ordered list of tool implementations:

$$\mathcal{H}_\kappa = [\tau_{\kappa}^{(1)}, \tau_{\kappa}^{(2)}, \ldots, \tau_{\kappa}^{(d)}, \tau_{\kappa}^{(\text{manual})}]$$

where superscripts denote priority (1 = primary), and $\tau_{\kappa}^{(\text{manual})}$ is the human escalation sentinel.

Each tool in the hierarchy is annotated with:

| Property | Type | Description |
| --- | --- | --- |
| priority | $\mathbb{Z}^+$ | Lower is preferred |
| latency_p99 | Duration | Expected worst-case latency |
| cost_per_call | Float | Monetary cost |
| accuracy | Float $\in [0, 1]$ | Expected output correctness |
| availability | Float $\in [0, 1]$ | Historical uptime |
| circuit_state | $\{\text{CLOSED}, \text{OPEN}, \text{HALF\_OPEN}\}$ | Current circuit breaker state |

14.5.2 Fallback Selection Criteria#

The selection function $\sigma$ evaluates candidates in priority order, skipping those with open circuits or exceeded rate limits:

$$\sigma(\mathcal{H}_\kappa) = \min_{i} \left\{ i \mid \tau_{\kappa}^{(i)}.\text{circuit} \neq \text{OPEN} \wedge \tau_{\kappa}^{(i)}.\text{rate\_remaining} > 0 \right\}$$

If no automated tool is available:

$$\sigma(\mathcal{H}_\kappa) = (\text{manual}) \quad \text{if } \forall i \in [1, d] : \tau_{\kappa}^{(i)} \text{ is unavailable}$$

14.5.3 Circuit Breaker Integration#

Each tool's circuit breaker tracks failure rates within sliding windows:

$$\text{failure\_rate}(\tau, w) = \frac{|\{t \in [\text{now} - w, \text{now}] : \text{invocation}(t) \text{ failed}\}|}{|\{t \in [\text{now} - w, \text{now}] : \text{invocation}(t)\}|}$$

State transitions:

$$\text{CLOSED} \xrightarrow{\text{failure\_rate} > \theta_{\text{open}}} \text{OPEN} \xrightarrow{\text{cooldown expired}} \text{HALF\_OPEN} \xrightarrow{\text{probe succeeds}} \text{CLOSED}$$

$$\text{HALF\_OPEN} \xrightarrow{\text{probe fails}} \text{OPEN}$$
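A sliding-window breaker implementing these transitions can be sketched as follows; the threshold, window, and cooldown defaults are illustrative, and the clock is injectable so the state machine can be tested without sleeping:

```python
import time

class CircuitBreaker:
    """Sliding-window circuit breaker with CLOSED / OPEN / HALF_OPEN states."""
    def __init__(self, theta_open=0.5, window=60.0, cooldown=30.0, now=time.monotonic):
        self.theta_open = theta_open
        self.window = window
        self.cooldown = cooldown
        self.now = now
        self.events = []      # (timestamp, succeeded) within the window
        self.state = "CLOSED"
        self.opened_at = None

    def failure_rate(self) -> float:
        if not self.events:
            return 0.0
        failures = sum(1 for _, ok in self.events if not ok)
        return failures / len(self.events)

    def record(self, succeeded: bool) -> None:
        t = self.now()
        # Evict events that have slid out of the window.
        self.events = [(ts, ok) for ts, ok in self.events if ts >= t - self.window]
        self.events.append((t, succeeded))
        if self.state == "HALF_OPEN":
            # Probe outcome decides: success closes, failure reopens.
            self.state = "CLOSED" if succeeded else "OPEN"
            self.opened_at = None if succeeded else t
        elif self.state == "CLOSED" and self.failure_rate() > self.theta_open:
            self.state, self.opened_at = "OPEN", t

    def allow_request(self) -> bool:
        if self.state == "OPEN" and self.now() - self.opened_at >= self.cooldown:
            self.state = "HALF_OPEN"  # admit a single probe after cooldown
        return self.state != "OPEN"
```

The fallback executor from Pseudo-Algorithm 14.6 consults `allow_request()` before each invocation and calls `record()` with the outcome.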

Pseudo-Algorithm 14.6: Fallback Hierarchy Executor

PROCEDURE ExecuteWithFallback(capability, input, hierarchy, trace_id, deadline):
    attempted ← []
 
    FOR level ← 0 TO len(hierarchy) - 1:
        tool ← hierarchy[level]
 
        // Skip if circuit open
        IF tool.circuit_breaker.state = OPEN:
            attempted.APPEND(Skipped(tool, reason="circuit_open"))
            CONTINUE
 
        // Skip if rate limit exhausted
        IF tool.rate_limiter.remaining = 0:
            attempted.APPEND(Skipped(tool, reason="rate_limited"))
            CONTINUE
 
        // Check deadline feasibility
        IF now() + tool.latency_p99_estimate > deadline:
            attempted.APPEND(Skipped(tool, reason="deadline_infeasible"))
            CONTINUE
 
        result ← InvokeTool(tool, input, deadline=deadline, trace_id=trace_id)
 
        IF result IS Success:
            EmitMetric("tool_fallback_level", level, capability=capability)
            RETURN FallbackSuccess(result=result.output, level=level, attempted=attempted)
        ELSE:
            tool.circuit_breaker.RecordFailure()
            attempted.APPEND(Failed(tool, error=result.error))
 
    // All automated tools exhausted — manual escalation
    IF hierarchy.manual_escalation_enabled:
        ticket ← CreateEscalationTicket(
            capability=capability,
            input=input,
            attempted=attempted,
            trace_id=trace_id,
            urgency=ComputeUrgency(deadline)
        )
        RETURN ManualEscalation(ticket=ticket, attempted=attempted)
    ELSE:
        RETURN FallbackExhausted(attempted=attempted)

14.5.4 Degradation Modes#

When falling back from primary to lower-tier tools, the agent must adjust its expectations:

  • Accuracy degradation: Lower-tier tools may provide approximate results. The agent should annotate downstream reasoning with confidence decrements.
  • Latency degradation: Synchronous fallbacks increase total chain latency; the agent should re-evaluate deadline feasibility.
  • Feature degradation: Secondary tools may lack capabilities (e.g., pagination, filtering). The agent must compensate with post-processing.
  • Manual escalation: The agent loop pauses at a human gate, persists state, and resumes upon human response. This requires durable state serialization and session resumption protocols.

14.6 Tool Result Validation: Schema Conformance, Sanity Checks, Cross-Tool Consistency Verification#

14.6.1 Validation Layers#

Tool outputs cannot be trusted blindly. A production-grade validation pipeline applies three layers:

  1. Schema conformance — structural correctness.
  2. Semantic sanity checks — value-level plausibility.
  3. Cross-tool consistency — agreement across corroborating sources.

14.6.2 Schema Conformance Validation#

Every tool declares an output schema $\mathcal{O}_\tau$ (JSON Schema, Protobuf message, or equivalent). Validation checks:

  • Required fields present.
  • Types correct (string, integer, array, nested object).
  • Value constraints met (ranges, regex patterns, enum membership).
  • Array bounds respected (`minItems`, `maxItems`).

$$\text{SchemaValid}(o, \mathcal{O}_\tau) = \begin{cases} \text{TRUE} & \text{if } o \text{ conforms to } \mathcal{O}_\tau \\ \text{FALSE} & \text{otherwise} \end{cases}$$

Non-conforming outputs are rejected, and the invocation is treated as a failure (triggering retry or fallback).
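A minimal illustration of the first two checks (required fields and primitive types) using only the standard library. This is a simplified stand-in, not a full JSON Schema validator, which a production deployment would use instead; the `TYPE_MAP` and flat-object assumption are illustrative.

```python
# Simplified structural check over a flat object schema with
# 'required' and 'properties' keys; a real deployment would use
# a complete JSON Schema validator instead.
TYPE_MAP = {"string": str, "integer": int, "number": (int, float),
            "boolean": bool, "array": list, "object": dict}

def schema_valid(output, schema):
    """Return (ok, errors) for required-field, type, and enum checks."""
    errors = []
    for field in schema.get("required", []):
        if field not in output:
            errors.append(f"missing required field: {field}")
    for field, spec in schema.get("properties", {}).items():
        if field not in output:
            continue
        expected = TYPE_MAP.get(spec.get("type"))
        if expected and not isinstance(output[field], expected):
            errors.append(f"wrong type for {field}: expected {spec['type']}")
        if "enum" in spec and output[field] not in spec["enum"]:
            errors.append(f"{field} not in enum {spec['enum']}")
    return (len(errors) == 0, errors)
```

A failing check yields a structured error list, which is exactly what the retry/fallback machinery of §14.5 consumes.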

14.6.3 Semantic Sanity Checks#

Beyond schema correctness, outputs must be semantically plausible. Sanity checks are domain-specific predicates:

$$\text{SanityValid}(o) = \bigwedge_{p \in \mathcal{P}_{\text{sanity}}} p(o)$$

Examples of sanity predicates:

  • Temporal: $o.\text{created\_at} \leq \text{now}()$
  • Numeric range: $o.\text{price} > 0 \wedge o.\text{price} < 10^7$
  • Referential: $o.\text{user\_id} \in \text{KnownUsers}$
  • Cardinality: $|o.\text{results}| \leq \text{expected\_max}$
  • Self-consistency: $o.\text{total} = \sum_{i} o.\text{items}[i].\text{amount}$
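The conjunction maps directly onto a registry of named predicate functions; the predicates below mirror the bullets, and the field names are illustrative.

```python
import time

# Each sanity predicate is a named boolean function over the output dict.
# Field names mirror the examples in the text and are illustrative.
SANITY_PREDICATES = {
    "temporal": lambda o: o["created_at"] <= time.time(),
    "numeric": lambda o: 0 < o["price"] < 10**7,
    "cardinality": lambda o: len(o["results"]) <= o["expected_max"],
    "self_consistent":
        lambda o: abs(o["total"] - sum(i["amount"] for i in o["items"])) < 1e-9,
}

def sanity_valid(output, predicates=SANITY_PREDICATES):
    """Return names of violated predicates; empty list means SanityValid(o)."""
    return [name for name, p in predicates.items() if not p(output)]
```

Returning the violated predicate names (rather than a bare boolean) lets the validator attach per-predicate severities, as Pseudo-Algorithm 14.7 requires.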

14.6.4 Cross-Tool Consistency Verification#

When multiple tools produce overlapping information, consistency checks detect conflicting outputs. Given outputs $o_a$ from $\tau_a$ and $o_b$ from $\tau_b$ covering the same entity or fact:

$$\text{Consistent}(o_a, o_b) = \forall f \in \text{SharedFields}(o_a, o_b) : \text{Agree}(o_a.f, o_b.f, \epsilon_f)$$

where $\epsilon_f$ is a field-specific tolerance (exact match for identifiers, $\epsilon$-tolerance for floating-point values, fuzzy match for strings).
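A field-kind-aware agreement check might look like the following; the kind labels, tolerance defaults, and the use of `difflib` for fuzzy string matching are illustrative choices.

```python
import difflib

def agree(a, b, field_kind, eps=1e-6, fuzzy_threshold=0.9):
    """Field-specific Agree(a, b, ε): exact for identifiers,
    ε-tolerance for floats, fuzzy ratio for strings."""
    if field_kind == "identifier":
        return a == b
    if field_kind == "float":
        return abs(a - b) <= eps
    if field_kind == "string":
        return difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio() >= fuzzy_threshold
    return a == b  # default: exact match
```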

Conflict resolution strategies:

| Strategy | Rule |
| --- | --- |
| Authority ranking | Prefer the output from the higher-authority source |
| Freshness | Prefer the more recently retrieved value |
| Majority vote | If $\geq k$ out of $n$ tools agree, adopt the majority value |
| LLM arbitration | Present conflicts to the LLM with provenance for reasoned resolution |
| Human escalation | Flag unresolvable conflicts for human review |

Pseudo-Algorithm 14.7: Multi-Layer Tool Output Validator

PROCEDURE ValidateToolOutput(output, tool, context, validation_policy):
    issues ← []
 
    // Layer 1: Schema conformance
    schema_result ← ValidateJsonSchema(output, tool.output_schema)
    IF schema_result.has_errors:
        RETURN ValidationFailure(layer="schema", errors=schema_result.errors, severity=CRITICAL)
 
    // Layer 2: Semantic sanity
    sanity_predicates ← validation_policy.GetSanityPredicates(tool.id)
    FOR predicate IN sanity_predicates:
        IF NOT predicate.evaluate(output):
            issues.APPEND(SanityViolation(predicate=predicate.name, value=predicate.extract(output),
                                          expected=predicate.expected_range, severity=predicate.severity))
 
    IF ANY(issue.severity = CRITICAL FOR issue IN issues):
        RETURN ValidationFailure(layer="sanity", errors=issues, severity=CRITICAL)
 
    // Layer 3: Cross-tool consistency
    corroborating_outputs ← context.GetCorroboratingOutputs(tool.id, output)
    FOR corr IN corroborating_outputs:
        shared_fields ← IntersectFields(output, corr.output)
        FOR field IN shared_fields:
            IF NOT AgreeWithTolerance(output[field], corr.output[field], tolerance=validation_policy.GetTolerance(field)):
                issues.APPEND(ConsistencyConflict(
                    field=field,
                    value_a=output[field], source_a=tool.id,
                    value_b=corr.output[field], source_b=corr.tool_id,
                    severity=validation_policy.GetConflictSeverity(field)
                ))
 
    IF ANY(issue.severity = CRITICAL FOR issue IN issues):
        resolution ← ResolveConflicts(issues, validation_policy.conflict_resolution_strategy)
        RETURN ValidationConditional(output=resolution.resolved_output, issues=issues, resolution=resolution)
 
    IF len(issues) > 0:
        RETURN ValidationWarning(output=output, issues=issues)
    ELSE:
        RETURN ValidationSuccess(output=output)

14.7 Self-Healing Tool Use: Automatic Retry with Parameter Adjustment, Error-Guided Correction#

14.7.1 Error Taxonomy for Tool Invocations#

Tool failures are categorized by their amenability to automated correction:

| Error Class | Retryable | Self-Healable | Example |
| --- | --- | --- | --- |
| Transient | Yes (same params) | N/A | Network timeout, 503 |
| Rate-limited | Yes (after delay) | N/A | 429 Too Many Requests |
| Input validation | No (same params) | Yes (adjust params) | Invalid date format, missing field |
| Authorization | No | No (escalate) | 403 Forbidden |
| Semantic error | No (same params) | Yes (rewrite query) | SQL syntax error, empty result set |
| Resource not found | No (same params) | Yes (search + retry) | 404, entity doesn't exist |
| Server error | Conditional | No | 500 with corruption |

14.7.2 Self-Healing Loop#

The self-healing mechanism uses the LLM as an error diagnostician that reads the error response and adjusts parameters:

$$\mathbf{x}' = \text{LLM}_{\text{repair}}(\text{error}(o), \mathbf{x}, \tau.\text{schema}, \text{context})$$

This forms a repair loop bounded by a maximum iteration count $K_{\text{max}}$:

Pseudo-Algorithm 14.8: Self-Healing Tool Invocation

PROCEDURE InvokeWithSelfHealing(tool, initial_params, context, max_attempts, trace_id):
    current_params ← initial_params
    attempt_history ← []
 
    FOR attempt ← 1 TO max_attempts:
        // Invoke tool
        result ← InvokeTool(tool, current_params, trace_id=trace_id)
 
        IF result IS Success:
            validated ← ValidateToolOutput(result.output, tool, context)
            IF validated IS ValidationSuccess OR validated IS ValidationWarning:
                RETURN SelfHealSuccess(output=validated.output, attempts=attempt, history=attempt_history)
            ELSE:
                // Output validation failed — treat as semantic error
                error_info ← ConstructErrorInfo(type="output_validation_failure", details=validated.errors)
        ELSE:
            error_info ← ClassifyError(result.error)
 
        attempt_history.APPEND(AttemptRecord(
            attempt=attempt, params=current_params, error=error_info, timestamp=now()
        ))
 
        // Determine if error is self-healable
        IF error_info.class IN {AUTHORIZATION, SERVER_ERROR_PERMANENT}:
            RETURN SelfHealFailure(error=error_info, attempts=attempt, history=attempt_history)
 
        IF error_info.class = TRANSIENT:
            // Simple retry with exponential backoff + jitter
            delay ← MIN(BASE_DELAY * 2^attempt + RandomJitter(), MAX_DELAY)
            Sleep(delay)
            CONTINUE  // Same params
 
        IF error_info.class = RATE_LIMITED:
            delay ← ParseRetryAfter(result.headers) OR DEFAULT_RATE_LIMIT_DELAY
            Sleep(delay)
            CONTINUE  // Same params
 
        // Self-healing: LLM-driven parameter adjustment
        repair_context ← ConstructRepairContext(
            tool_schema=tool.input_schema,
            original_params=initial_params,
            current_params=current_params,
            error=error_info,
            attempt_history=attempt_history,
            task_context=context
        )
 
        repaired_params ← LLM.RepairToolParams(repair_context)
 
        IF repaired_params = current_params:
            // LLM couldn't find a different parameterization
            RETURN SelfHealFailure(error=error_info, attempts=attempt, history=attempt_history,
                                   reason="no_alternative_params")
 
        // Validate repaired params against schema before retrying
        schema_valid ← ValidateSchema(repaired_params, tool.input_schema)
        IF schema_valid IS SchemaError:
            // LLM produced invalid repair — try once more with explicit schema guidance
            repaired_params ← LLM.RepairToolParams(repair_context, include_schema_error=schema_valid)
            IF ValidateSchema(repaired_params, tool.input_schema) IS SchemaError:
                RETURN SelfHealFailure(error="repair_produced_invalid_schema", attempts=attempt,
                                       history=attempt_history)
 
        current_params ← repaired_params
 
    RETURN SelfHealExhausted(attempts=max_attempts, history=attempt_history)

14.7.3 Repair Quality Metrics#

The quality of the self-healing loop is measured by:

  • Repair success rate: $R_{\text{success}} = \frac{|\text{healed invocations}|}{|\text{attempted healings}|}$
  • Mean attempts to heal: $\bar{A} = \mathbb{E}[\text{attempts} \mid \text{healed}]$
  • Repair latency overhead: $\Delta L = L_{\text{healed}} - L_{\text{first\_attempt}}$
  • False repair rate: cases where the healed output passed validation but was semantically incorrect.

These metrics feed into the evaluation infrastructure (§14.13) for continuous quality monitoring.

14.7.4 Error-Guided Correction Patterns#

Specific error classes invoke specialized correction strategies:

  1. SQL syntax errors: Extract the error message, re-invoke the LLM with the error and original query for targeted rewriting. Include the database schema as additional context.
  2. Empty result sets: Broaden search criteria (relax filters, expand date ranges, use synonyms). The repair prompt explicitly instructs broadening.
  3. Type mismatches: Extract expected vs. actual types from the error; apply type coercion or reformatting.
  4. Pagination exhaustion: Adjust offset/cursor parameters to access the correct page.
  5. Encoding errors: Detect encoding issues (UTF-8 vs. ASCII) and apply appropriate encoding before retry.
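The dispatch from error class to correction strategy can be expressed as a small policy table; the class and strategy names below follow the taxonomy in §14.7.1, and the exact labels are illustrative.

```python
# Maps an error class from the taxonomy in §14.7.1 to
# (retryable_with_same_params, correction_strategy).
ERROR_POLICY = {
    "transient":        (True,  None),                # backoff, same params
    "rate_limited":     (True,  None),                # honor Retry-After, same params
    "input_validation": (False, "adjust_params"),
    "authorization":    (False, "escalate"),
    "semantic":         (False, "rewrite_query"),
    "not_found":        (False, "search_then_retry"),
    "server_error":     (False, None),                # conditional; no self-heal
}

def correction_strategy(error_class):
    """Resolve an error class to the action Pseudo-Algorithm 14.8 would take."""
    retry_same, strategy = ERROR_POLICY.get(error_class, (False, None))
    if retry_same:
        return "retry_same_params"
    return strategy or "fail"
```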

14.8 Tool Creation by Agents: Dynamic Code Generation, Sandboxed Execution, and Promotion to Permanent Tools#

14.8.1 The Tool Creation Lifecycle#

When an agent encounters a task for which no existing tool suffices, it may create a new tool dynamically. This capability transforms the agent from a tool consumer into a tool producer, but introduces severe safety, correctness, and governance risks that must be addressed mechanically.

The tool creation lifecycle consists of five phases:

$$\text{Identify Gap} \to \text{Generate Code} \to \text{Sandbox Test} \to \text{Validate} \to \text{Promote or Discard}$$

14.8.2 Gap Identification#

The agent identifies tool gaps through:

  1. Tool selection failure: No existing tool matches the required capability.
  2. Composition inefficiency: The task requires more than $k$ tool invocations that could be replaced by a single custom tool.
  3. Repeated patterns: The same multi-tool sequence appears $\geq n_{\text{threshold}}$ times in recent traces.

Formally, a gap is detected when:

$$\forall \tau \in \mathcal{T} : \text{relevance}(\tau, \kappa_{\text{required}}) < \theta_{\text{gap}}$$

where $\kappa_{\text{required}}$ is the required capability and $\theta_{\text{gap}}$ is the minimum relevance threshold.
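Under this definition, gap detection reduces to a maximum over relevance scores. A sketch, assuming a relevance-scoring function supplied by the tool-selection layer:

```python
def tool_gap_exists(tools, required_capability, relevance, theta_gap=0.6):
    """A gap exists when no registered tool clears the relevance threshold.
    `relevance` is an assumed callable (tool, capability) -> [0, 1]."""
    if not tools:
        return True
    return max(relevance(t, required_capability) for t in tools) < theta_gap
```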

14.8.3 Code Generation with Typed Contracts#

The agent generates tool code that conforms to a tool template contract:

ToolTemplate:
    name: string                     // Unique identifier
    description: string              // Natural language description for discovery
    input_schema: JSONSchema          // Typed input specification
    output_schema: JSONSchema         // Typed output specification
    implementation: CodeBlock         // Generated code (Python, TypeScript, etc.)
    dependencies: List<Dependency>    // Required libraries (allowlisted)
    resource_limits: ResourceLimits   // CPU, memory, network, time bounds
    test_cases: List<TestCase>        // Agent-generated test cases
    security_classification: SecurityLevel  // Determines sandbox tier

The LLM generates the implementation subject to constraints:

  • Allowed dependencies: Only from an approved allowlist (no arbitrary package installation).
  • No network access during initial sandbox testing (unless explicitly approved).
  • Deterministic behavior: Same input should produce same output (excluding time-dependent operations).
  • Bounded execution time: Hard timeout enforced by the sandbox.

14.8.4 Sandboxed Execution Environment#

Generated tools execute in a multi-tier sandbox:

| Tier | Capabilities | Use Case |
| --- | --- | --- |
| Tier 0 (Pure compute) | CPU, memory only; no I/O, no network | Mathematical operations, data transforms |
| Tier 1 (Read-only I/O) | Filesystem read, environment variables | File parsing, config reading |
| Tier 2 (Controlled network) | Allowlisted HTTP endpoints only | API consumption |
| Tier 3 (Full access) | Requires human approval per invocation | System administration tools |

Sandbox enforcement uses OS-level isolation (containers, seccomp-bpf, namespace isolation):

$$\text{SandboxPolicy}(\tau_{\text{new}}) = \text{Tier}\left[\min\left(i : \text{capabilities}(\tau_{\text{new}}) \subseteq \text{allowed}(\text{Tier}_i)\right)\right]$$
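The rule selects the least-capable tier that still covers the tool's declared capabilities. A sketch in which the capability names are illustrative, not a fixed vocabulary:

```python
# Allowed capability sets per tier, least privileged first.
# Capability names are illustrative placeholders.
TIER_CAPS = [
    set(),                                        # Tier 0: pure compute
    {"fs_read", "env_read"},                      # Tier 1: read-only I/O
    {"fs_read", "env_read", "http_allowlisted"},  # Tier 2: controlled network
    {"fs_read", "fs_write", "env_read",
     "http_allowlisted", "network", "shell"},     # Tier 3: full access
]

def sandbox_tier(required_caps):
    """Smallest tier i such that required capabilities ⊆ allowed(Tier_i)."""
    for i, allowed in enumerate(TIER_CAPS):
        if set(required_caps) <= allowed:
            return i
    raise ValueError("capabilities exceed all sandbox tiers")
```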

14.8.5 Validation and Testing#

Before a generated tool is used, it must pass:

  1. Static analysis: Lint, type check, security scan (no eval, no shell injection, no hardcoded credentials).
  2. Unit tests: Agent-generated test cases executed in sandbox.
  3. Property-based tests: Automatically generated inputs (fuzzing) to detect crashes, hangs, and out-of-bounds behavior.
  4. Output schema conformance: Every test execution's output must validate against the declared output schema.

The validation score:

$$V(\tau_{\text{new}}) = w_{\text{static}} \cdot s_{\text{static}} + w_{\text{unit}} \cdot s_{\text{unit}} + w_{\text{fuzz}} \cdot s_{\text{fuzz}} + w_{\text{schema}} \cdot s_{\text{schema}}$$

where $s_i \in [0, 1]$ and $\sum_i w_i = 1$. The tool is usable only if $V(\tau_{\text{new}}) \geq \theta_{\text{validation}}$.
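The weighted score and threshold gate compute directly; the weight values and the 0.8 threshold below are illustrative defaults, not prescribed by the text.

```python
def validation_score(scores, weights):
    """Weighted validation score V(τ); weights must sum to 1, scores in [0, 1]."""
    assert abs(sum(weights.values()) - 1.0) < 1e-9, "weights must sum to 1"
    return sum(weights[k] * scores[k] for k in weights)

def is_usable(scores, weights, theta_validation=0.8):
    """Gate: the tool is usable only if V >= θ_validation."""
    return validation_score(scores, weights) >= theta_validation

# Illustrative weighting that favors unit-test evidence.
WEIGHTS = {"static": 0.2, "unit": 0.4, "fuzz": 0.2, "schema": 0.2}
```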

14.8.6 Promotion Pipeline#

Pseudo-Algorithm 14.9: Tool Promotion Pipeline

PROCEDURE PromoteTool(generated_tool, promotion_policy):
    // Phase 1: Static analysis
    static_result ← RunStaticAnalysis(generated_tool.implementation, generated_tool.dependencies)
    IF static_result.has_critical_issues:
        RETURN PromotionRejected(reason="static_analysis_failure", details=static_result)
 
    // Phase 2: Sandboxed test execution
    sandbox ← CreateSandbox(tier=DetermineSandboxTier(generated_tool))
    test_results ← []
    FOR test IN generated_tool.test_cases:
        result ← sandbox.Execute(generated_tool.implementation, test.input, timeout=generated_tool.resource_limits.timeout)
        schema_valid ← ValidateSchema(result.output, generated_tool.output_schema)
        test_results.APPEND(TestResult(input=test.input, expected=test.expected, actual=result.output,
                                        passed=test.expected_matches(result.output) AND schema_valid, duration=result.duration))
 
    // Phase 3: Fuzz testing
    fuzz_inputs ← GenerateFuzzInputs(generated_tool.input_schema, count=100)
    FOR fuzz_input IN fuzz_inputs:
        result ← sandbox.Execute(generated_tool.implementation, fuzz_input, timeout=generated_tool.resource_limits.timeout)
        IF result IS Crash OR result IS Timeout:
            test_results.APPEND(TestResult(input=fuzz_input, passed=FALSE, error=result.error))
 
    // Phase 4: Compute validation score
    scores ← ComputeValidationScores(static_result, test_results)
    IF scores.overall < promotion_policy.min_validation_score:
        RETURN PromotionRejected(reason="insufficient_validation_score", score=scores.overall)
 
    // Phase 5: Promotion decision
    IF promotion_policy.requires_human_review:
        review ← RequestHumanReview(generated_tool, test_results, scores)
        IF NOT review.approved:
            RETURN PromotionRejected(reason="human_review_denied")
 
    // Phase 6: Register in tool catalog
    tool_record ← ToolCatalogRecord(
        tool=generated_tool,
        status=PROMOTED,
        validation_score=scores.overall,
        created_by="agent:" + agent_id,
        created_at=now(),
        version="1.0.0",
        trust_score=ComputeInitialTrustScore(scores)
    )
    ToolCatalog.Register(tool_record)
    RETURN PromotionSuccess(tool_id=tool_record.id, trust_score=tool_record.trust_score)

14.8.7 Ephemeral vs. Permanent Tools#

| Attribute | Ephemeral | Permanent |
| --- | --- | --- |
| Lifetime | Single session | Indefinite (versioned) |
| Storage | Working memory | Tool catalog / registry |
| Discovery | Not discoverable by other agents | Discoverable via MCP |
| Governance | Self-validated | Human-reviewed, CI-tested |
| Trust score | Minimal | Accumulates with usage |

Ephemeral tools serve as a rapid prototyping mechanism; only tools that prove their value over multiple sessions and pass the promotion pipeline become permanent.


14.9 Browser and GUI Tools: Playwright, Puppeteer, Desktop Automation, Vision-Language Tool Agents#

14.9.1 Architecture of Browser/GUI Tool Agents#

Browser and GUI automation tools enable agents to interact with web applications and desktop software that lack APIs. These tools bridge the gap between structured tool invocation and unstructured visual interfaces.

The architecture comprises three layers:

Agent Reasoning Layer (LLM)
        ↕ Structured commands / observations
Automation Abstraction Layer (Tool Server)
        ↕ Low-level browser/GUI commands
Execution Engine (Playwright / Puppeteer / OS Accessibility APIs)

14.9.2 Browser Automation Protocol#

Browser tools expose a typed interface for web interaction:

Core operations:

| Operation | Input | Output | Side Effect |
| --- | --- | --- | --- |
| navigate | URL | Page metadata, status code | Browser navigates |
| extract_text | CSS selector / XPath | Extracted text content | None |
| extract_structured | Selector + schema | Structured data | None |
| click | Selector | Updated page state | DOM mutation |
| fill_form | Selector + value | Confirmation | DOM mutation |
| screenshot | Viewport/selector | Image (base64) | None |
| execute_js | JavaScript code | Execution result | Arbitrary |
| wait_for | Condition + timeout | Success/timeout | Blocks |

14.9.3 Vision-Language Integration#

For applications where DOM access is unreliable or unavailable (canvas-based apps, remote desktops, PDFs rendered as images), the agent uses vision-language models (VLMs) to interpret screenshots:

$$\text{action} = \text{VLM}(\text{screenshot}, \text{task\_instruction}, \text{action\_history})$$

The VLM produces structured action commands:

$$\text{action} \in \{\text{click}(x, y),\ \text{type}(x, y, \text{text}),\ \text{scroll}(dx, dy),\ \text{wait}(t),\ \text{done}\}$$

Coordinate grounding is the critical challenge: the VLM must map semantic targets ("the Submit button") to pixel coordinates $(x, y)$. Approaches:

  1. Set-of-Mark (SoM): Overlay numbered labels on interactive elements in the screenshot; the VLM references labels instead of raw coordinates.
  2. Bounding box prediction: The VLM outputs bounding box coordinates for the target element.
  3. DOM-augmented vision: Combine the screenshot with a simplified DOM tree to provide both visual and structural grounding.

14.9.4 Observation Compression for GUI Agents#

Raw screenshots consume significant token budget when processed by VLMs. Compression strategies:

  • DOM-to-text: Convert the accessibility tree to a compact text representation (element type, label, state, relative position).
  • Selective screenshot: Capture only the relevant viewport region, not the full page.
  • Delta encoding: After the first screenshot, transmit only changed regions.
  • Structured observation: Extract form field states, button labels, and error messages as structured data rather than relying on visual parsing.

Pseudo-Algorithm 14.10: Browser Agent Action Loop

PROCEDURE BrowserAgentLoop(task, browser_session, max_steps, deadline):
    action_history ← []
    
    FOR step ← 1 TO max_steps:
        // Observe current state
        observation ← ConstructObservation(browser_session)
        // Observation = { dom_summary, screenshot_if_needed, url, page_title, error_messages }
 
        // Compress observation for context
        compressed_obs ← CompressObservation(observation, method=SelectCompressionMethod(task, step))
 
        // Agent reasons about next action
        agent_context ← AssembleContext(
            task=task,
            current_observation=compressed_obs,
            action_history=TruncateHistory(action_history, max_tokens=2048),
            available_actions=BROWSER_ACTION_SCHEMA
        )
 
        action_decision ← LLM.DecideAction(agent_context)
 
        // Validate action
        IF action_decision.action = "done":
            result ← ExtractResult(observation, task.expected_output_schema)
            RETURN BrowserTaskSuccess(result=result, steps=step, history=action_history)
 
        IF NOT IsValidAction(action_decision, BROWSER_ACTION_SCHEMA):
            action_history.APPEND(InvalidAction(action_decision, step=step))
            CONTINUE  // Let agent self-correct on next iteration
 
        // Execute action with safety checks
        IF action_decision.action IN {"navigate", "click", "fill_form", "execute_js"}:
            safety_check ← BrowserSafetyPolicy.Check(action_decision, browser_session)
            IF safety_check = DENY:
                action_history.APPEND(BlockedAction(action_decision, reason=safety_check.reason))
                CONTINUE
 
        execution_result ← browser_session.Execute(action_decision)
        action_history.APPEND(ExecutedAction(
            step=step, action=action_decision, result=execution_result, timestamp=now()
        ))
 
        // Wait for page stability
        browser_session.WaitForStable(timeout=5000)
 
        IF now() > deadline:
            RETURN BrowserTaskTimeout(steps=step, history=action_history)
 
    RETURN BrowserTaskExhausted(steps=max_steps, history=action_history)

14.9.5 Desktop Automation#

Desktop automation extends browser patterns to native applications using:

  • OS Accessibility APIs (Windows UI Automation, macOS Accessibility, Linux AT-SPI): Provide structured access to UI element trees.
  • Computer Vision: When accessibility APIs are unavailable, use screenshot-based interaction with VLMs.
  • Keyboard/Mouse simulation: Low-level input injection for actions not exposed through accessibility APIs.

The same action loop (Pseudo-Algorithm 14.10) applies, with the browser session replaced by a desktop session abstraction and the DOM summary replaced by the accessibility tree.


14.10 File System and Repository Tools: Git Operations, File Manipulation, Build System Integration#

14.10.1 File System Tool Taxonomy#

File system tools provide the agent with the ability to read, write, search, and manipulate files and directories. These tools are critical for software engineering agents, document processing agents, and data pipeline agents.

| Tool | Input | Output | Mutation |
| --- | --- | --- | --- |
| read_file | Path, encoding, byte range | File content (text/binary) | None |
| write_file | Path, content, mode (create/overwrite/append) | Success + metadata | Yes |
| list_directory | Path, pattern, recursive | File listing with metadata | None |
| search_files | Pattern (glob/regex), root path | Matching file paths | None |
| search_content | Query (text/regex), file set | Matching lines with context | None |
| move_file | Source, destination | Success | Yes |
| delete_file | Path, require_confirmation | Success | Yes (irreversible) |
| file_diff | Path A, Path B | Unified diff | None |
| file_metadata | Path | Size, timestamps, permissions | None |

14.10.2 Git Operations#

Git tools enable agents to operate within version-controlled repositories with full branching, committing, and merging capabilities:

Core Git tool operations:

| Tool | Input | Output | Mutation |
| --- | --- | --- | --- |
| git_status | Repo path | Modified/staged/untracked files | None |
| git_diff | Repo, ref range, path filter | Diff output | None |
| git_log | Repo, ref, count, path filter | Commit history | None |
| git_branch | Repo, branch name, base ref | Branch created | Yes |
| git_checkout | Repo, ref | Working tree updated | Yes |
| git_commit | Repo, message, file list | Commit SHA | Yes |
| git_merge | Repo, source branch, strategy | Merge result or conflicts | Yes |
| git_push | Repo, remote, branch | Push result | Yes (remote state) |
| git_blame | Repo, file, line range | Authorship per line | None |

Branching discipline for agentic execution: Agents must operate on isolated branches:

$$\text{branch}_{\text{agent}} = \text{agent/}\langle\text{task\_id}\rangle\text{/}\langle\text{timestamp}\rangle$$

This ensures:

  • No direct mutation of main or protected branches.
  • All changes are reviewable via pull request.
  • Parallel agents do not create merge conflicts on the same branch.
  • Rollback is trivial (delete branch).
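Generating the isolated branch name is mechanical. A sketch of the naming scheme above, in which the sanitization rules (replacing characters disallowed in git refs) and the timestamp format are assumptions:

```python
import re
from datetime import datetime, timezone

def agent_branch_name(task_id, now=None):
    """Build agent/<task_id>/<timestamp>, sanitizing the task id so the
    result is a valid git branch name. Sanitization rules are illustrative."""
    safe_task = re.sub(r"[^A-Za-z0-9._-]+", "-", task_id).strip("-.")
    ts = (now or datetime.now(timezone.utc)).strftime("%Y%m%dT%H%M%SZ")
    return f"agent/{safe_task}/{ts}"
```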

14.10.3 Build System Integration#

Agents that modify code must verify their changes compile and pass tests:

| Tool | Input | Output |
| --- | --- | --- |
| build | Repo, target, config | Build result (success/failure + logs) |
| test | Repo, test suite, filter | Test results (pass/fail/skip per test) |
| lint | Repo, file set, config | Lint results (warnings, errors per file) |
| typecheck | Repo, config | Type errors (file, line, message) |

Verification loop after file mutation:

PROCEDURE VerifyAfterMutation(repo, changed_files, verification_policy):
    results ← {}
 
    IF verification_policy.require_typecheck:
        results["typecheck"] ← RunTypecheck(repo)
    IF verification_policy.require_lint:
        results["lint"] ← RunLint(repo, changed_files)
    IF verification_policy.require_build:
        results["build"] ← RunBuild(repo)
    IF verification_policy.require_tests:
        test_scope ← DetermineAffectedTests(repo, changed_files)
        results["tests"] ← RunTests(repo, test_scope)
 
    all_passed ← ALL(r.passed FOR r IN results.values())
    RETURN VerificationResult(passed=all_passed, details=results)

14.11 Database Tools: Query Generation, Schema Introspection, Migration Planning, and Data Validation#

14.11.1 Database Tool Architecture#

Database tools enable agents to interact with relational and non-relational databases through a safety-layered interface:

Agent (LLM)
    ↕ Natural language → SQL/query intent
Query Generator (LLM or template engine)
    ↕ Generated query
Safety Layer (parser, validator, policy gate)
    ↕ Approved query
Execution Layer (connection pool, timeout, result limit)
    ↕ Result set
Result Formatter (projection, truncation, type conversion)
    ↕ Structured output to agent

14.11.2 Schema Introspection#

Before generating queries, the agent must understand the database schema. The schema introspection tool provides:

  • Tables/collections: Names, descriptions (from comments), row counts.
  • Columns/fields: Name, type, nullability, constraints, foreign keys, indices.
  • Relationships: Foreign key graph, junction tables.
  • Sample data: $k$ representative rows per table (anonymized if sensitive).
  • Statistics: Value distributions, cardinality estimates, NULL rates.

Schema is injected into context in compressed form:

TABLE orders (
  id: INT PK AUTO_INCREMENT,
  user_id: INT FK→users.id NOT NULL INDEX,
  total: DECIMAL(10,2) NOT NULL CHECK(>0),
  status: ENUM('pending','shipped','delivered','cancelled') DEFAULT 'pending',
  created_at: TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP
) -- ~1.2M rows, ~340 orders/day

14.11.3 Query Generation Safety#

Generated SQL must be validated before execution:

  1. Parse validation: The query must parse as valid SQL for the target dialect.
  2. Read-only enforcement: For read operations, reject any query containing INSERT, UPDATE, DELETE, DROP, ALTER, TRUNCATE, CREATE, GRANT.
  3. Query complexity bounds: Reject queries with:
    • More than $n_{\text{join}}$ joins (default: 5).
    • Subqueries nested deeper than $d_{\text{max}}$ levels (default: 3).
    • Missing WHERE clause on large tables (prevents full table scans).
    • Missing LIMIT clause (enforce maximum result set size).
  4. Parameterization: All user-derived values must be parameterized (prevent SQL injection).
  5. Execution cost estimation: Use EXPLAIN to estimate query cost before execution; reject queries above a cost threshold.

$$\text{QueryApproved}(q) = \text{ParseValid}(q) \wedge \text{ReadOnly}(q) \wedge \text{ComplexityBound}(q) \wedge \text{CostBound}(q)$$
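A deliberately naive sketch of the read-only, join-bound, and LIMIT gates using token matching. A production gate must parse the SQL into an AST, as rule 1 requires, since comments and string literals defeat pattern matching; the helper below only illustrates the shape of the conjunction.

```python
import re

WRITE_KEYWORDS = {"INSERT", "UPDATE", "DELETE", "DROP", "ALTER",
                  "TRUNCATE", "CREATE", "GRANT"}

def query_approved(sql, max_joins=5):
    """Naive gate: SELECT-only, no write keywords, bounded joins, LIMIT present.
    Token-level only — a real implementation must inspect the parsed AST."""
    upper = sql.upper()
    tokens = set(re.findall(r"[A-Za-z_]+", upper))
    if not upper.lstrip().startswith("SELECT"):
        return False, "not a SELECT"
    if tokens & WRITE_KEYWORDS:
        return False, "write keyword present"
    if len(re.findall(r"\bJOIN\b", upper)) > max_joins:
        return False, "too many joins"
    if "LIMIT" not in tokens:
        return False, "missing LIMIT"
    return True, "ok"
```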

14.11.4 Query Generation Algorithm#

Pseudo-Algorithm 14.11: Safe Database Query Generator

PROCEDURE GenerateAndExecuteQuery(natural_language_query, database_connection, schema_cache, safety_policy):
    // Step 1: Schema retrieval
    relevant_tables ← IdentifyRelevantTables(natural_language_query, schema_cache)
    schema_context ← FormatSchemaForContext(relevant_tables, include_sample_data=TRUE, max_tables=10)
 
    // Step 2: Query generation
    generated_sql ← LLM.GenerateSQL(
        instruction="Generate a SQL query for the following request. Use only the provided schema. Include LIMIT clause.",
        user_query=natural_language_query,
        schema=schema_context,
        dialect=database_connection.dialect
    )
 
    // Step 3: Safety validation
    parse_result ← ParseSQL(generated_sql, dialect=database_connection.dialect)
    IF parse_result IS ParseError:
        // Self-heal: retry with error feedback
        generated_sql ← LLM.RepairSQL(generated_sql, parse_result.error, schema_context)
        parse_result ← ParseSQL(generated_sql, dialect=database_connection.dialect)
        IF parse_result IS ParseError:
            RETURN QueryFailure(reason="parse_error_after_repair", error=parse_result.error)
 
    IF NOT safety_policy.IsReadOnly(parse_result.ast):
        RETURN QueryFailure(reason="write_operation_not_permitted")
 
    IF NOT safety_policy.ComplexityWithinBounds(parse_result.ast):
        RETURN QueryFailure(reason="query_too_complex", details=safety_policy.GetComplexityReport(parse_result.ast))
 
    // Step 4: Cost estimation
    explain_result ← database_connection.Explain(generated_sql)
    IF explain_result.estimated_cost > safety_policy.max_query_cost:
        RETURN QueryFailure(reason="estimated_cost_too_high", cost=explain_result.estimated_cost)
 
    // Step 5: Execute with timeout and row limit
    result ← database_connection.Execute(
        query=generated_sql,
        timeout=safety_policy.query_timeout,
        max_rows=safety_policy.max_result_rows
    )
 
    IF result IS Timeout:
        RETURN QueryFailure(reason="execution_timeout")
 
    // Step 6: Format and return
    formatted ← FormatResultSet(result.rows, result.columns, max_display_rows=safety_policy.max_display_rows)
    RETURN QuerySuccess(
        sql=generated_sql,
        result=formatted,
        row_count=result.row_count,
        execution_time=result.duration,
        truncated=result.row_count > safety_policy.max_display_rows
    )

14.11.5 Migration Planning#

For schema migrations, the agent:

  1. Analyzes the current schema and the desired end state.
  2. Generates migration scripts (DDL statements) with up and down paths.
  3. Validates migrations against a copy of the schema (dry run).
  4. Estimates data migration duration and locking impact.
  5. Produces a migration plan for human review (never auto-executes DDL in production).

Migration tools are always classified as Tier 3 (human-approval-gated) in the safety hierarchy.

14.11.6 Data Validation#

Data validation tools enable agents to verify data quality:

DataQuality(D)=1RrR1[r(D)=PASS]\text{DataQuality}(D) = \frac{1}{|\mathcal{R}|} \sum_{r \in \mathcal{R}} \mathbb{1}[r(D) = \text{PASS}]

where R\mathcal{R} is the set of validation rules. Common rules include:

  • Completeness: NULL rate per column \leq threshold.
  • Uniqueness: Duplicate rate on candidate keys =0= 0.
  • Referential integrity: All foreign keys resolve.
  • Value distributions: Statistical tests for anomalous shifts (KL divergence from baseline).
  • Freshness: Most recent record timestamp within expected recency window.
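The DataQuality metric above is just the fraction of rules that pass. A minimal sketch, with illustrative rule constructors for the completeness and uniqueness checks (the sample rows and thresholds are invented for demonstration):

```python
from typing import Callable, Dict, List

# A validation rule takes a dataset (list of row dicts) and returns PASS/FAIL.
Rule = Callable[[List[Dict]], bool]

def completeness(column: str, max_null_rate: float) -> Rule:
    """PASS if the NULL rate for `column` is within the threshold."""
    def rule(rows: List[Dict]) -> bool:
        nulls = sum(1 for r in rows if r.get(column) is None)
        return (nulls / len(rows)) <= max_null_rate
    return rule

def uniqueness(column: str) -> Rule:
    """PASS if `column` (a candidate key) has zero duplicates."""
    def rule(rows: List[Dict]) -> bool:
        values = [r[column] for r in rows]
        return len(values) == len(set(values))
    return rule

def data_quality(rows: List[Dict], rules: List[Rule]) -> float:
    # DataQuality(D) = (1/|R|) * sum of indicator[r(D) = PASS]
    return sum(1 for rule in rules if rule(rows)) / len(rules)

rows = [{"id": 1, "email": "a@x.io"},
        {"id": 2, "email": None},
        {"id": 3, "email": "c@x.io"}]
score = data_quality(rows, [completeness("email", 0.10), uniqueness("id")])
# Uniqueness passes, completeness fails (null rate 1/3 > 0.10), so score = 0.5
```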

14.12 Communication Tools: Email, Chat, Notification, and Workflow Trigger Integrations#

14.12.1 Communication Tool Categories#

Communication tools enable agents to send, receive, and manage messages across external channels. These are inherently state-changing and externally visible, requiring strict governance.

| Category | Tools | Risk Level |
| --- | --- | --- |
| Email | send_email, read_inbox, search_email, draft_email | High (external recipients) |
| Chat | send_message, read_channel, create_thread, react | Medium (internal channels) |
| Notification | send_notification, schedule_reminder, create_alert | Medium |
| Workflow | trigger_pipeline, create_ticket, update_status, assign_task | Medium–High |

14.12.2 Governance Framework for Communication Tools#

All communication tools that produce externally visible artifacts must pass through a multi-gate approval pipeline:

CommsGate(m,τcomm,s)=ContentPolicy(m)RecipientPolicy(m.to,s)RatePolicy(τcomm,s)ApprovalPolicy(m,s)\text{CommsGate}(m, \tau_{\text{comm}}, s) = \text{ContentPolicy}(m) \wedge \text{RecipientPolicy}(m.\text{to}, s) \wedge \text{RatePolicy}(\tau_{\text{comm}}, s) \wedge \text{ApprovalPolicy}(m, s)

where mm is the message payload and ss is the agent state.

Content policy checks:

  • No personally identifiable information (PII) leakage outside authorized boundaries.
  • No confidential or classified information in external-facing messages.
  • Tone and professionalism scoring (LLM-evaluated).
  • No impersonation (messages clearly attributed to the agentic system).

Recipient policy checks:

  • Internal-only recipients: auto-approved (subject to rate limits).
  • External recipients: require human approval unless the recipient is on a pre-approved allowlist.
  • Broadcast (all-channel): always require explicit human approval.

Rate policy checks:

  • Per-recipient rate limits (e.g., max 3 emails per recipient per hour).
  • Per-channel rate limits (e.g., max 10 messages per channel per hour).
  • Global daily caps.
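The per-recipient rate policy can be sketched as a sliding-window limiter. This is a minimal in-memory version (a production limiter would use a shared store such as Redis); the class and method names are illustrative:

```python
import time
from collections import defaultdict, deque

class RecipientRateLimiter:
    """Sliding-window limiter, e.g. max 3 sends per recipient per hour."""

    def __init__(self, max_sends: int, window_seconds: float):
        self.max_sends = max_sends
        self.window = window_seconds
        self.sends = defaultdict(deque)  # recipient -> timestamps of recent sends

    def try_acquire(self, recipient: str, now: float = None) -> bool:
        now = time.monotonic() if now is None else now
        q = self.sends[recipient]
        while q and now - q[0] >= self.window:  # evict sends outside the window
            q.popleft()
        if len(q) >= self.max_sends:
            return False                        # defer: CommunicationDeferred
        q.append(now)
        return True

limiter = RecipientRateLimiter(max_sends=3, window_seconds=3600)
# The first three sends to a recipient pass; the fourth inside the
# hour window is deferred until an earlier send ages out.
```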

14.12.3 Draft-Review-Send Pattern#

For high-stakes communications, the agent produces a draft that is reviewed before sending:

Pseudo-Algorithm 14.12: Draft-Review-Send Communication

PROCEDURE SendCommunication(message_spec, comm_channel, agent_state, governance_policy):
    // Step 1: Generate draft
    draft ← LLM.ComposeDraft(
        intent=message_spec.intent,
        recipient=message_spec.recipient,
        context=message_spec.context,
        tone=governance_policy.required_tone,
        constraints=governance_policy.content_constraints
    )
 
    // Step 2: Content policy validation
    content_check ← ContentPolicy.Evaluate(draft, governance_policy)
    IF content_check.has_violations:
        draft ← LLM.ReviseDraft(draft, violations=content_check.violations)
        content_check ← ContentPolicy.Evaluate(draft, governance_policy)
        IF content_check.has_violations:
            RETURN CommunicationBlocked(reason="content_policy_violation", violations=content_check.violations)
 
    // Step 3: PII scan
    pii_scan ← PIIDetector.Scan(draft.body)
    IF pii_scan.found AND NOT governance_policy.AllowsPII(message_spec.recipient):
        draft ← RedactPII(draft, pii_scan.findings)
 
    // Step 4: Recipient policy
    recipient_check ← RecipientPolicy.Evaluate(message_spec.recipient, agent_state)
    IF recipient_check = EXTERNAL_REQUIRES_APPROVAL:
        approval ← RequestHumanApproval(
            type="external_communication",
            draft=draft,
            recipient=message_spec.recipient,
            timeout=governance_policy.approval_timeout
        )
        IF NOT approval.granted:
            RETURN CommunicationBlocked(reason="human_denied", draft=draft)
 
    // Step 5: Rate limit check
    IF NOT RateLimiter.TryAcquire(comm_channel, message_spec.recipient):
        RETURN CommunicationDeferred(reason="rate_limited", retry_after=RateLimiter.RetryAfter())
 
    // Step 6: Send
    send_result ← comm_channel.Send(draft)
    AuditLog.Record(action="communication_sent", draft=draft, result=send_result, agent=agent_state.agent_id)
    RETURN CommunicationSuccess(message_id=send_result.id, draft=draft)

14.12.4 Workflow Trigger Tools#

Workflow triggers connect the agent to external automation systems (CI/CD pipelines, ticketing systems, orchestration platforms):

  • Ticket creation: The agent creates JIRA/Linear/GitHub Issues with structured fields.
  • Pipeline triggers: The agent initiates build/deploy pipelines with specified parameters.
  • Status updates: The agent updates task status in project management tools.
  • Escalation chains: The agent triggers PagerDuty/OpsGenie alerts for critical issues.

Each workflow trigger must be idempotent (repeated invocations with the same idempotency key produce the same result) and auditable (every trigger is logged with full context and trace ID).
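The idempotency requirement can be sketched as a wrapper that derives a key from the canonical trigger parameters and returns the cached first result on repeat invocations. The in-memory store and all names here are illustrative; a real system would persist keys durably:

```python
import hashlib
import json

class IdempotentTrigger:
    """Repeated fires with the same idempotency key return the first result
    instead of re-firing the workflow."""

    def __init__(self, trigger_fn):
        self.trigger_fn = trigger_fn
        self.results = {}  # idempotency_key -> first result

    @staticmethod
    def make_key(workflow: str, params: dict) -> str:
        # Canonical JSON (sorted keys) so equivalent params hash identically.
        canonical = json.dumps({"workflow": workflow, "params": params},
                               sort_keys=True)
        return hashlib.sha256(canonical.encode()).hexdigest()

    def fire(self, workflow: str, params: dict):
        key = self.make_key(workflow, params)
        if key not in self.results:
            self.results[key] = self.trigger_fn(workflow, params)
        return self.results[key]

calls = []
def fake_pipeline(workflow, params):      # stand-in for the real trigger
    calls.append(workflow)
    return {"run_id": len(calls)}

trigger = IdempotentTrigger(fake_pipeline)
first = trigger.fire("deploy", {"env": "staging"})
second = trigger.fire("deploy", {"env": "staging"})  # same key: no second run
# first == second, and the underlying pipeline fired exactly once
```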


14.13 Tool Ecosystem Management: Marketplace, Rating, Trust Scoring, and Community Tool Servers#

14.13.1 Tool Ecosystem Architecture#

As the number of available tools grows beyond what a single organization manages, a tool ecosystem emerges — a marketplace of tool servers contributed by internal teams, vendors, and the open-source community. Managing this ecosystem requires:

  1. Discovery: Agents find tools by capability, not by name.
  2. Trust: Tools are scored by reliability, safety, and quality.
  3. Versioning: Tool contracts evolve without breaking consumers.
  4. Governance: Policies control which tools are available to which agents.

14.13.2 Tool Registry Schema#

Each tool in the ecosystem is registered with:

ToolRegistryEntry:
    id: UUID
    name: string
    version: SemanticVersion         // e.g., "2.3.1"
    provider: ProviderIdentity       // Organization, team, or individual
    description: string              // Natural language for LLM consumption
    capabilities: List<CapabilityTag>  // Standardized taxonomy (e.g., "data.query.sql")
    input_schema: JSONSchema
    output_schema: JSONSchema
    protocol: {MCP | JSON-RPC | gRPC}
    endpoint: URI
    auth_requirements: AuthSpec
    sla: SLASpec                     // Latency p50/p95/p99, availability target
    trust_score: Float ∈ [0, 1]
    usage_count: Int
    rating: Float ∈ [0, 5]
    last_verified: Timestamp
    deprecation_status: {ACTIVE | DEPRECATED | SUNSET}
    compatibility: List<CompatibilityConstraint>

14.13.3 Trust Scoring Model#

Trust scores are computed from multiple signals:

Trust(τ)=iwifi(τ)\text{Trust}(\tau) = \sum_{i} w_i \cdot f_i(\tau)

where the scoring functions fif_i and their weights wiw_i are:

| Signal fif_i | Weight wiw_i | Computation |
| --- | --- | --- |
| Reliability | 0.25 | 1failure_rate(τ,window=30d)1 - \text{failure\_rate}(\tau, \text{window=30d}) |
| Schema compliance | 0.15 | Fraction of invocations with valid output schemas |
| Latency SLA adherence | 0.15 | Fraction of invocations within declared latency SLA |
| Security audit status | 0.15 | {0.0,0.5,1.0}\{0.0, 0.5, 1.0\} for {unaudited, self-assessed, third-party audited} |
| Community rating | 0.10 | Normalized average user rating [0,1]\in [0, 1] |
| Provider reputation | 0.10 | Provider-level aggregate trust score |
| Freshness | 0.05 | Recency of last successful verification |
| Usage volume | 0.05 | min(1,log(usage_count)/log(threshold))\min(1, \log(\text{usage\_count}) / \log(\text{threshold})) |

Trust scores decay over time if not refreshed:

Trust(τ,t)=Trust(τ,t0)eλ(tt0)\text{Trust}(\tau, t) = \text{Trust}(\tau, t_0) \cdot e^{-\lambda (t - t_0)}

where λ\lambda is the decay rate and t0t_0 is the last verification timestamp. This incentivizes tool providers to maintain active verification.
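The weighted score and its decay can be sketched directly from the table and the decay equation. The weights below mirror the table; the signal values, decay rate, and verification age are illustrative:

```python
import math

# Weights from the trust-scoring table (they sum to 1.0).
WEIGHTS = {
    "reliability": 0.25, "schema_compliance": 0.15, "latency_sla": 0.15,
    "security_audit": 0.15, "community_rating": 0.10,
    "provider_reputation": 0.10, "freshness": 0.05, "usage_volume": 0.05,
}

def trust_score(signals: dict) -> float:
    """Trust(tau) = sum_i w_i * f_i(tau); each signal value is in [0, 1]."""
    return sum(WEIGHTS[name] * value for name, value in signals.items())

def decayed_trust(base: float, days_since_verification: float,
                  decay_rate: float) -> float:
    """Trust(tau, t) = Trust(tau, t0) * exp(-lambda * (t - t0))."""
    return base * math.exp(-decay_rate * days_since_verification)

signals = {name: 1.0 for name in WEIGHTS}  # a perfect tool on every signal
base = trust_score(signals)                # weights sum to 1.0, so base == 1.0
stale = decayed_trust(base, days_since_verification=30, decay_rate=0.01)
# With lambda = 0.01/day, 30 days without re-verification decays trust to ~0.74
```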

14.13.4 Tool Discovery Protocol#

Tool discovery follows MCP's resource discovery pattern, extended with capability-based search:

Pseudo-Algorithm 14.13: Capability-Based Tool Discovery

PROCEDURE DiscoverTools(capability_query, constraints, agent_context):
    // Step 1: Parse capability query into structured search
    parsed ← ParseCapabilityQuery(capability_query)
    // parsed = { capability_tags: ["data.query.sql"], constraints: { max_latency: 500ms, min_trust: 0.7 } }
 
    // Step 2: Search registry
    candidates ← ToolRegistry.Search(
        capability_tags=parsed.capability_tags,
        min_trust_score=constraints.min_trust OR 0.5,
        status=ACTIVE,
        compatible_with=agent_context.platform_version
    )
 
    // Step 3: Filter by constraints
    filtered ← []
    FOR tool IN candidates:
        IF tool.sla.latency_p95 <= constraints.max_latency
           AND tool.trust_score >= constraints.min_trust
           AND agent_context.auth.HasPermission(tool.auth_requirements)
           AND tool.version.SatisfiesConstraint(constraints.version_constraint):
            filtered.APPEND(tool)
 
    // Step 4: Rank by composite score
    FOR tool IN filtered:
        tool.discovery_score ← (
            0.4 * tool.trust_score
            + 0.3 * CapabilityRelevance(tool.capabilities, parsed.capability_tags)
            + 0.2 * (1.0 - Normalize(tool.sla.latency_p95, max=constraints.max_latency))
            + 0.1 * (1.0 - Normalize(tool.cost_per_call, max=constraints.max_cost))
        )
 
    ranked ← SortDescending(filtered, key=discovery_score)
 
    // Step 5: Return top-K with schemas
    RETURN DiscoveryResult(
        tools=ranked[:constraints.max_results OR 10],
        total_candidates=len(candidates),
        total_filtered=len(filtered)
    )

14.13.5 Version Governance#

Tool versions follow semantic versioning:

  • Patch (x.y.Zx.y.Z): Bug fixes, no schema changes.
  • Minor (x.Y.zx.Y.z): Additive schema changes (new optional fields), backward compatible.
  • Major (X.y.zX.y.z): Breaking schema changes, requires consumer migration.

The registry enforces:

BreakingChangePolicy:Major version bumps require 30d deprecation notice\text{BreakingChangePolicy}: \text{Major version bumps require } \geq 30\text{d deprecation notice}

Agents pin to version ranges (e.g., ^2.0.0 for any 2.x.x) and receive notifications when a major version sunset approaches.
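A caret-range check in the npm-style convention the registry is assumed to follow (same major version, at least the pinned minor.patch) can be sketched as follows. Note that full npm semantics handle `0.x` pins differently; this sketch covers only the major-version case described above:

```python
def parse_version(v: str):
    """'2.3.1' -> (2, 3, 1); tuples compare lexicographically."""
    return tuple(int(part) for part in v.split("."))

def satisfies_caret(version: str, pin: str) -> bool:
    """True when `version` is within the caret range `pin` (e.g. '^2.0.0')."""
    ver = parse_version(version)
    base = parse_version(pin.lstrip("^"))
    return ver[0] == base[0] and ver >= base

# A consumer pinned to ^2.0.0 accepts any 2.x.x but not 3.0.0 or 1.9.9:
assert satisfies_caret("2.3.1", "^2.0.0")
assert not satisfies_caret("3.0.0", "^2.0.0")
assert not satisfies_caret("1.9.9", "^2.0.0")
```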

14.13.6 Community Contribution Pipeline#

Community-contributed tools follow a graduated promotion path:

SUBMITTEDSANDBOXBETAGENERAL_AVAILABILITYVERIFIED\text{SUBMITTED} \to \text{SANDBOX} \to \text{BETA} \to \text{GENERAL\_AVAILABILITY} \to \text{VERIFIED}
| Stage | Requirements |
| --- | --- |
| Submitted | Schema provided, basic metadata |
| Sandbox | Passes automated schema validation, deploys in isolated sandbox |
| Beta | Passes integration tests, achieves Trust0.5\text{Trust} \geq 0.5 over 100+ invocations |
| GA | Passes security audit (self-assessed), achieves Trust0.7\text{Trust} \geq 0.7 over 1000+ invocations |
| Verified | Third-party security audit, SLA commitment, provider identity verified |
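The graduated promotion check can be sketched as a cascade over the stage requirements. The thresholds mirror the table above; the record fields are illustrative:

```python
def eligible_stage(tool: dict) -> str:
    """Highest promotion stage whose requirements the tool record satisfies.
    Each stage's requirements are cumulative over the previous stage's."""
    stage = "SUBMITTED"
    if tool.get("schema_valid") and tool.get("sandbox_deployed"):
        stage = "SANDBOX"
        if (tool.get("integration_tests_pass")
                and tool.get("trust", 0.0) >= 0.5
                and tool.get("invocations", 0) >= 100):
            stage = "BETA"
            if (tool.get("self_assessed_audit")
                    and tool.get("trust", 0.0) >= 0.7
                    and tool.get("invocations", 0) >= 1000):
                stage = "GENERAL_AVAILABILITY"
                if (tool.get("third_party_audit")
                        and tool.get("sla_commitment")
                        and tool.get("identity_verified")):
                    stage = "VERIFIED"
    return stage

tool = {"schema_valid": True, "sandbox_deployed": True,
        "integration_tests_pass": True, "trust": 0.62, "invocations": 450}
# Meets the Beta bar but not GA (trust < 0.7, invocations < 1000)
```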

14.13.7 Marketplace Governance and Abuse Prevention#

Governance mechanisms protect the ecosystem:

  • Rate limiting per tool provider: Prevents monopolization of compute/network resources.
  • Abuse detection: Monitors for tools that exfiltrate data, inject malicious payloads, or exhibit inconsistent behavior.
  • Automated regression testing: Continuous invocation of registered tools with known inputs to detect behavioral drift.
  • Revocation: Tools can be immediately revoked (circuit-broken globally) if critical issues are detected.
RevocationCondition(τ)=Trust(τ)<θrevokeSecurityAlert(τ)AbuseDetected(τ)\text{RevocationCondition}(\tau) = \text{Trust}(\tau) < \theta_{\text{revoke}} \vee \text{SecurityAlert}(\tau) \vee \text{AbuseDetected}(\tau)

When a tool is revoked, all agents using it receive a notification and automatically fall back to the next tool in their fallback hierarchy (§14.5).


Chapter Summary and Cross-Cutting Concerns#

Architectural Invariants Across All Tool Patterns#

The following invariants must hold across every tool pattern described in this chapter:

  1. Typed contracts at every boundary: Every tool exposes JSON Schema–validated input/output. No untyped data flows between tools or into agent context.

  2. Provenance on every output: Every tool result carries its invocation ID, timestamp, source, freshness, and confidence — enabling downstream reasoning to attribute and assess.

  3. Human-interruptible mutation: Any state-changing tool path includes an interception point where human approval can be required based on policy.

  4. Idempotent operations: All mutations are keyed by idempotency tokens. Retries and saga compensations are safe to re-execute.

  5. Bounded execution: Every tool invocation carries an explicit deadline. No tool can run indefinitely.

  6. Observability: Every invocation produces structured traces consumable by the agent runtime, enabling self-diagnosis and continuous evaluation.

  7. Least privilege: Tools execute with caller-scoped authorization, not agent-global credentials. Permissions are the minimum required for the specific operation.

  8. Graceful degradation: Every critical capability has a fallback hierarchy. Total system failure requires simultaneous failure of all tiers plus human escalation timeout.

Token Budget Accounting Across Tool Operations#

The total token cost of tool use in a single agent step is:

Ttool=Tschemas+Tinvocation_history+Tresults+Tvalidation_outputT_{\text{tool}} = T_{\text{schemas}} + T_{\text{invocation\_history}} + T_{\text{results}} + T_{\text{validation\_output}}

This must satisfy:

TtoolCwindowTsystemTtaskTmemoryTreservedT_{\text{tool}} \leq C_{\text{window}} - T_{\text{system}} - T_{\text{task}} - T_{\text{memory}} - T_{\text{reserved}}

When TtoolT_{\text{tool}} approaches the budget, the context compiler must:

  • Prune older tool results from history.
  • Compress current results to higher compression levels.
  • Reduce the number of tool schemas in context (lazy loading).
  • Summarize invocation history rather than including verbatim records.
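The budget constraint reduces to simple arithmetic. A minimal sketch with illustrative numbers and a hypothetical 90% headroom threshold for triggering the compression actions above:

```python
def tool_token_budget(c_window: int, t_system: int, t_task: int,
                      t_memory: int, t_reserved: int) -> int:
    """Upper bound on T_tool: C_window - T_system - T_task - T_memory - T_reserved."""
    return c_window - t_system - t_task - t_memory - t_reserved

def needs_compression(t_tool: int, budget: int, headroom: float = 0.9) -> bool:
    """Trigger pruning/compression once usage passes `headroom` of the budget."""
    return t_tool > headroom * budget

budget = tool_token_budget(c_window=128_000, t_system=2_000, t_task=4_000,
                           t_memory=8_000, t_reserved=16_000)
# budget = 98,000 tokens; at 95,000 tool tokens we exceed the 90% headroom
# and the context compiler must start pruning.
```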

Reliability Equation#

The end-to-end reliability of a tool chain of nn sequential steps with per-step reliability pip_i and fallback depth did_i (each fallback having independent reliability pi,jp_{i,j}) is:

Rchain=i=1n(1j=1di(1pi,j))R_{\text{chain}} = \prod_{i=1}^{n} \left(1 - \prod_{j=1}^{d_i}(1 - p_{i,j})\right)

For a chain of 5 steps, each with a primary tool at p=0.95p = 0.95 and a secondary at p=0.90p = 0.90:

Rchain=(1(10.95)(10.90))5=(10.005)5=0.99550.975R_{\text{chain}} = \left(1 - (1-0.95)(1-0.90)\right)^5 = (1 - 0.005)^5 = 0.995^5 \approx 0.975

Compared to single-tool reliability: 0.9550.7740.95^5 \approx 0.774. In this configuration, fallback hierarchies improve chain reliability by roughly 20 percentage points (about a 26% relative gain).
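The worked numbers above follow directly from the chain-reliability equation: a step succeeds if any of its fallbacks succeeds, and the chain succeeds only if every step does. A quick numeric check:

```python
import math

def chain_reliability(step_fallbacks):
    """R_chain = product over steps i of (1 - product over fallbacks j of (1 - p_ij))."""
    r = 1.0
    for fallbacks in step_fallbacks:
        fail_all = math.prod(1 - p for p in fallbacks)  # all fallbacks fail
        r *= 1 - fail_all                               # step succeeds otherwise
    return r

with_fallback = chain_reliability([[0.95, 0.90]] * 5)  # = 0.995**5, ~0.975
primary_only = chain_reliability([[0.95]] * 5)         # = 0.95**5,  ~0.774
```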


This chapter establishes the complete engineering framework for advanced tool use in agentic systems. The patterns described — composition, routing, selection, transactions, fallbacks, validation, self-healing, creation, and ecosystem management — form the operational backbone of any production-grade agentic platform. Each pattern is specified with typed contracts, bounded algorithms, and measurable quality criteria, ensuring that tool use is not an ad hoc capability but a disciplined, observable, and governable infrastructure layer.