An LLM becomes agentic only when embedded inside a closed-loop execution architecture with explicit goals, bounded planning, tool-mediated actuation, state management, verification, and governed commit semantics. The boundary between a predictive model and an autonomous cognitive architecture is therefore not linguistic fluency; it is control topology.
Scope and Assumptions#
This chapter assumes a production-grade agentic platform operating under the following conditions:
- Partially observable environments: the true task state is not fully visible to the model at any step.
- Mixed read/write tasks: some actions are advisory, some mutate external systems.
- Enterprise constraints: latency, cost, safety, auditability, regulatory compliance, and multi-tenant isolation are first-class concerns.
- Typed infrastructure: JSON-RPC at user/application boundaries, gRPC/Protobuf for internal execution, MCP for discoverable tool/resource/prompt surfaces.
- Deterministic orchestration around stochastic models: the model proposes; the system governs.
- No one-shot correctness assumption: nontrivial tasks must execute through bounded verify/repair loops.
1.1 Definitional Taxonomy: Agents, Assistants, Copilots, Autonomous Systems — Formal Boundaries#
1.1.1 Canonical System Model#
A production agentic system can be formalized as the tuple

$$\mathcal{A} = \langle \mathcal{E}, \mathcal{O}, \mathcal{S}, \mathcal{G}, \pi, \mathcal{T}, \mathcal{M}, \mathcal{V}, \mathcal{C} \rangle$$

where:
- $\mathcal{E}$: environment
- $\mathcal{O}$: observation space
- $\mathcal{S}$: internal state or belief state
- $\mathcal{G}$: goal set and acceptance criteria
- $\pi$: policy/planning function
- $\mathcal{T}$: tool/action interface set
- $\mathcal{M}$: memory hierarchy
- $\mathcal{V}$: verification and critique mechanisms
- $\mathcal{C}$: commit protocol governing irreversible effects
A system is not agentic merely because it emits text that resembles planning. It is agentic only if it operates as a closed-loop controller over environment-facing actions and state transitions.
1.1.2 Predictive Model vs. Agentic System#
A predictive model is a conditional estimator:

$$p_\theta(y \mid x)$$

It maps an input context $x$ to an output distribution $p_\theta(y \mid x)$. It does not inherently define:
- persistent state transitions,
- tool execution semantics,
- recovery behavior,
- verification contracts,
- or commit authority.
Agentic behavior emerges only when the model is embedded in a control architecture that interprets output as a candidate action rather than as authoritative truth.
1.1.3 Formal Boundary Conditions#
A system qualifies as an agent only if all of the following hold:
1. Objective-conditioned execution: there exists an explicit task objective or acceptance criterion beyond generic dialogue continuation.
2. Stateful multi-step control: the system maintains task state across steps and can condition future actions on prior outcomes.
3. Action selection over a tool/action set: the system selects among external actions, not merely among phrasings.
4. Verification-mediated adaptation: the system can inspect consequences, detect failure, and repair or escalate.
5. Commit semantics: the system can stage or perform side effects under governance.
If conditions 1–4 hold but 5 is absent, the system is a bounded planning assistant, not a fully operational agent.
1.1.4 Taxonomy Table#
| Class | Goal Ownership | Planning Horizon | Tool Use | Commit Authority | Memory Scope | Recovery Ownership | Primary Failure Mode |
|---|---|---|---|---|---|---|---|
| Predictive model | None explicit | 0 | None intrinsic | None | None intrinsic | External wrapper | Unsupported fluent output |
| Assistant | Human | 1 turn to short session | Optional, user-steered | Human only | Session-local | Human | Advice without execution grounding |
| Copilot | Human, domain-bounded | Short multi-step | Domain-coupled, proactive suggestion | Usually human-approved | Session + narrow task memory | Shared | Over-suggestion, partial context mismatch |
| Agent | Delegated task objective | Multi-step bounded | Autonomous routing among typed tools | Conditional, policy-bound | Working + session + episodic | System loop, human on exception | Tool misuse, hallucinated action plans, failed recovery |
| Autonomous system | Delegated domain objective, may derive subgoals | Long-horizon | Broad, scheduled, event-driven | Policy-governed direct execution | Durable, validated hierarchical memory | System by default | Goal drift, specification gaming, cascading failures |
1.1.5 Assistant vs. Copilot vs. Agent#
The boundaries are operational, not marketing labels:
- Assistant: language-centric, user-driven, mostly advisory.
- Copilot: proactive within a sharply bounded domain and UI surface.
- Agent: task-centric, closed-loop, tool-using, stateful, verifier-mediated.
- Autonomous system: agent with delegated authority over longer horizons, asynchronous execution, and policy-bounded self-scheduling.
1.1.6 Necessary Negative Definitions#
A system is not meaningfully agentic if:
- it cannot observe execution outcomes,
- it cannot distinguish draft output from committed side effects,
- it cannot recover from tool or retrieval failure,
- it cannot justify outputs with provenance,
- or it cannot abstain and escalate under uncertainty.
1.2 The Agent as a Control System: Sense–Plan–Act–Verify–Repair–Commit Loop Formalization#
1.2.1 Control-Theoretic Framing#
The correct abstraction for an agent is a closed-loop controller operating over a partially observable environment. Let:
- $s_t$: latent environment state
- $o_t$: observation at time $t$
- $b_t$: belief state
- $a_t$: action
- $g$: active goal
- $\tau_t$: execution trace up to time $t$

Belief update is ideally:

$$b_{t+1} = f(b_t, a_t, o_{t+1})$$

In real agentic platforms, $b_t$ is approximated by structured execution state, retrieved evidence, tool outputs, and memory summaries rather than exact Bayesian inference.
1.2.2 Objective Function with Operational Constraints#
The agent should optimize expected utility subject to safety, latency, and cost constraints:

$$\max_\pi \; \mathbb{E}\left[ U(\tau, g) \right]$$

subject to

$$C(\tau) \le B, \qquad L(\tau) \le D, \qquad R(\tau) \le r_{\max}$$

A practical per-step decision rule is:

$$a^* = \arg\max_{a \in \mathcal{A}} \; \mathbb{E}[U(a)] - \lambda_c C(a) - \lambda_l L(a) - \lambda_r R(a)$$

where:
- $C(a)$ is monetary or token cost,
- $L(a)$ is latency impact,
- $R(a)$ is estimated risk.
The action set must include abstain, escalate, and request clarification. A production agent without these actions will systematically overcommit.
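As a concrete illustration, here is a minimal sketch of such a decision rule with an always-available abstain action. The candidate names, weights, and the zero-score abstain floor are illustrative assumptions, not prescribed values.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    expected_utility: float  # E[U(a)]
    cost: float              # C(a): monetary/token cost
    latency: float           # L(a): latency impact
    risk: float              # R(a): estimated risk

def select_action(candidates, lc=1.0, ll=1.0, lr=1.0, floor=0.0):
    """Pick the argmax of E[U] - lc*C - ll*L - lr*R; abstain if no candidate clears the floor."""
    best, best_score = "abstain", floor  # abstain is always in the action set
    for a in candidates:
        score = a.expected_utility - lc * a.cost - ll * a.latency - lr * a.risk
        if score > best_score:
            best, best_score = a.name, score
    return best

acts = [Candidate("call_tool", 1.0, 0.2, 0.1, 0.3),
        Candidate("risky_write", 1.5, 0.2, 0.1, 2.0)]
choice = select_action(acts)  # risky_write has higher raw utility but is risk-penalized
```

Note that the risk penalty flips the ranking: the higher-utility write loses to the safer tool call, and an empty candidate set degrades to abstention rather than forcing an action.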
1.2.3 Loop Semantics#
The minimal safe loop is:
1. Sense: collect user inputs, system state, tool results, runtime telemetry, repository state, logs, tests, UI state, and retrieved evidence.
2. Plan: produce a bounded plan with subgoals, tool routing, success criteria, and rollback conditions.
3. Act: execute a typed tool call or emit a structured intermediate artifact.
4. Verify: check schema validity, evidence sufficiency, policy compliance, consistency, and task-specific tests.
5. Repair: if verification fails, critique the root cause, update plan/context/tool choice, or escalate.
6. Commit: only then write to external systems or emit final user-facing outputs with provenance.
1.2.4 Verification as a First-Class Transition Gate#
A proposed action $a$ is valid only if

$$V(a, \tau) = \bigwedge_{i=1}^{n} v_i(a, \tau) = \text{true}$$

where each $v_i$ is a machine-checkable predicate. If $V(a, \tau) = \text{false}$, the system must not commit. It must either repair, downgrade, or escalate.
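A minimal sketch of such a gate follows; the three predicates and the tool allowlist are hypothetical stand-ins for real schema, policy, and evidence checks.

```python
# Hypothetical predicates; real verifiers would consult schemas, policy engines, and evidence stores.
def schema_valid(action, trace):
    return isinstance(action.get("args"), dict)

def policy_compliant(action, trace):
    return action.get("tool") in {"search", "read_file"}  # assumed tool allowlist

def evidence_sufficient(action, trace):
    return len(trace.get("evidence", [])) > 0

VERIFIERS = [schema_valid, policy_compliant, evidence_sufficient]

def gate(action, trace):
    """Conjunction of machine-checkable predicates; any failure blocks commit."""
    failures = [v.__name__ for v in VERIFIERS if not v(action, trace)]
    return len(failures) == 0, failures

ok, failed = gate({"tool": "search", "args": {"q": "error rate"}}, {"evidence": ["doc:17"]})
blocked, why = gate({"tool": "delete_db", "args": {}}, {"evidence": []})
```

Returning the list of failed predicate names, rather than a bare boolean, gives the repair step a root-cause signal to act on.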
1.2.5 Pseudo-Algorithm 1 — Bounded Agent Execution Loop#
Inputs: task specification $T$, deadline $D$, budget $B$, idempotency key $k$
- Initialize execution state , trace , retry budget, and recursion depth counter.
- Sense environment:
- normalize user request,
- ingest current runtime state,
- load session state and prior verified artifacts.
- Compile a bounded context artifact under token budget.
- Produce a task plan with explicit acceptance tests and rollback conditions.
- Decompose into claimable work units.
- For each work unit:
- retrieve provenance-tagged evidence,
- select tool or model action,
- execute under deadline and capability constraints,
- verify output against contracts.
- If verification fails:
- classify error as transient, deterministic, policy, or epistemic,
- persist failure state,
- repair via replan, reroute, clarification, or escalation,
- enforce bounded retries with jitter and maximum depth.
- If verification succeeds:
- stage mutation if effectful,
- require approval if policy requires,
- commit with idempotency key and audit trace.
- Synthesize final result only from verified state and evidence.
- Emit response, trace handle, provenance, and postconditions.
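The steps above can be condensed into a minimal Python loop. This is a sketch under simplifying assumptions: the error taxonomy collapses to a single transient `ToolError`, the act/verify/commit callables and the `u1`/`u2` work units are hypothetical stubs, and budgets are toy values.

```python
import random
import time

class ToolError(Exception):
    """Transient tool/runtime failure; eligible for bounded retry."""

def run_task(spec, act, verify, commit, max_retries=2, step_budget=20):
    """Bounded verify/repair loop: act, verify, retry with jittered backoff, commit only verified units."""
    trace, steps = [], 0
    for unit in spec["units"]:
        attempt = 0
        while True:
            steps += 1
            if steps > step_budget:
                return {"status": "escalated", "reason": "budget_exhausted", "trace": trace}
            try:
                result = act(unit)
                ok, reason = verify(unit, result)
            except ToolError as exc:
                ok, reason, result = False, f"transient: {exc}", None
            trace.append({"unit": unit, "ok": ok, "reason": reason})
            if ok:
                # Commit is keyed: replays with the same idempotency key must be safe.
                commit(unit, result, idempotency_key=f"{spec['id']}:{unit}")
                break
            attempt += 1
            if attempt > max_retries:
                return {"status": "escalated", "reason": reason, "trace": trace}
            time.sleep(random.random() * min(0.01, 0.001 * 2 ** attempt))  # capped jittered backoff
    return {"status": "committed", "trace": trace}

# Usage with stub tools: the second unit fails once, then succeeds on retry.
calls = {"n": 0}
def act(unit):
    if unit == "u2" and calls["n"] == 0:
        calls["n"] += 1
        raise ToolError("timeout")
    return unit.upper()

store = {}
outcome = run_task(
    {"id": "t1", "units": ["u1", "u2"]},
    act,
    verify=lambda unit, result: (result == unit.upper(), "mismatch"),
    commit=lambda unit, result, idempotency_key: store.__setitem__(idempotency_key, result),
)
```

The structural point: the loop, not the model, owns retries, budgets, and the commit boundary, matching the engineering implication stated below.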
1.2.6 Critical Engineering Implication#
The language model is not the control loop. It is a stochastic component inside the loop. The orchestrator owns:
- deadlines,
- retry policy,
- tool authorization,
- state transitions,
- memory promotion,
- verification gates,
- and final commit.
That distinction is foundational.
1.3 Levels of Agentic Autonomy: L0 → L5#
1.3.1 Why a Leveling Model Is Necessary#
“Autonomous” is overloaded. A useful maturity model must distinguish systems by delegated authority, planning horizon, mutation rights, recovery competence, and human governance mode.
1.3.2 Autonomy Levels#
| Level | Name | Planning / Control | Tooling | Memory | Human Role | Mutation Rights |
|---|---|---|---|---|---|---|
| L0 | Tool-Augmented LLM | Single-turn or shallow multi-turn generation | Manual or wrapper-triggered tools | Minimal session | User in the loop for every step | None |
| L1 | Reactive Assistant | Responds with local adaptations | Read-heavy tools, limited execution | Session memory | User directs sequence | Draft only |
| L2 | Task Agent | Multi-step bounded planning with retries | Typed tool routing | Working + session + limited episodic | Human approves high-risk actions | Narrow, approval-gated |
| L3 | Delegated Operator | Asynchronous task handling, monitors and replans | Broad domain tools | Durable episodic memory | Human on exceptions | Policy-bound writes |
| L4 | Domain Autonomous System | Long-horizon domain objectives, event-driven operation | Multi-system orchestration | Episodic + semantic + procedural | Human on the loop | Broad within domain policy |
| L5 | Fully Autonomous Cognitive Agent | Cross-domain self-directed operation with self-maintenance | Dynamic multi-domain tooling | Hierarchical durable memory and self-model | Human only for governance | Broad delegated authority |
1.3.3 Promotion Criteria Between Levels#
A system should not be promoted to the next level based on anecdotal demos. Promotion requires evidence across:
- verifier pass rate,
- factual grounding rate,
- rollback success rate,
- human override rate,
- p95 latency,
- cost per accepted task,
- failure containment,
- and policy compliance under adversarial evaluation.
1.3.4 Practical Interpretation#
- Most current production systems are between L1 and L2.
- High-value enterprise systems are beginning to operate at constrained L3 in narrow domains.
- L4 is feasible only where tooling, policy contracts, and environment observability are unusually strong.
- L5 remains aspirational and is not presently compatible with stringent production safety expectations across open domains.
1.3.5 Architectural Consequence#
Autonomy level is not a model property. It is a system property resulting from the interaction of:
- planning depth,
- memory design,
- effect controls,
- environment observability,
- and verification strength.
1.4 Theoretical Foundations: Rational Agency, Bounded Rationality, Satisficing under Uncertainty#
1.4.1 Rational Agency#
Classical rational agency selects actions that maximize expected utility under belief uncertainty. In a POMDP-like framing:

$$\pi^* = \arg\max_\pi \; \mathbb{E}\left[ \sum_{t=0}^{T} \gamma^t \, U(s_t, a_t) \;\middle|\; b_0, \pi \right]$$

This framing remains useful, but direct optimization is computationally intractable for real agentic systems.
1.4.2 Bounded Rationality#
Real systems are bounded by:
- context window size,
- inference latency,
- tool round-trip time,
- retrieval noise,
- budget ceilings,
- and verification capacity.
A more realistic objective is resource-rational:

$$\pi^* = \arg\max_\pi \; \mathbb{E}[U(\text{outcome})] - \text{cost}(\text{computation}, \text{latency}, \text{tokens})$$

This formulation captures a central systems truth: more reasoning is not always better. The value of additional deliberation must exceed its cost.
1.4.3 Metareasoning: Think vs. Act#
The system should choose to deliberate further only when the expected value of deliberation exceeds its cost:

$$\mathbb{E}[\Delta U \mid \text{deliberate}] > c(\text{deliberation})$$
This yields architecture-level policies such as:
- skip critique passes for low-risk, deterministic tasks,
- expand retrieval fan-out only under unresolved uncertainty,
- escalate rather than continue if evidence quality remains low after bounded attempts.
1.4.4 Satisficing Under Uncertainty#
In many enterprise contexts, perfect optimality is unnecessary or unobtainable. A satisficing rule is: accept the first candidate action $a$ such that

$$\hat{U}(a) \ge \theta \quad \text{and} \quad \kappa(a) \ge \kappa_{\min}$$

where:
- $\theta$ is the minimum acceptable utility threshold,
- $\kappa(a)$ is confidence.
This is operationally superior to naive maximization in settings with incomplete information, long-tail exceptions, or tight latency budgets.
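A sketch of the satisficing rule, with illustrative thresholds and hypothetical candidate names:

```python
def satisfice(candidates, theta=0.7, kappa_min=0.8):
    """Accept the first candidate whose estimated utility and confidence clear fixed thresholds."""
    for name, u_hat, kappa in candidates:
        if u_hat >= theta and kappa >= kappa_min:
            return name
    return "escalate"  # nothing clears the bar: escalate rather than pick the max blindly

# The cheaper option is accepted first even though a better one exists later in the list.
choice = satisfice([("fast_heuristic", 0.75, 0.9), ("deep_analysis", 0.95, 0.97)])
fallback = satisfice([("weak_guess", 0.4, 0.5)])
```

The key contrast with maximization: candidate order encodes cost, so the first acceptable option wins, and an empty acceptance set escalates instead of committing to the least-bad action.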
1.4.5 Uncertainty Classes#
A production agent must distinguish:
1. Epistemic uncertainty: missing or incomplete knowledge. Mitigation: retrieval, tool inspection, clarification, human escalation.
2. Aleatoric uncertainty: intrinsic environmental variability. Mitigation: probabilistic planning, risk bounds, staged commits.
3. Model uncertainty: unreliable inference or calibration. Mitigation: verifier ensembles, model routing, abstention.
4. Specification uncertainty: ambiguous or underspecified user intent. Mitigation: clarification, preference inference bounded by policy, human confirmation.
1.4.6 Why This Matters Architecturally#
Retrieval, memory, verification, and governance are not auxiliary features. They are the concrete mechanisms by which bounded rationality becomes operationally viable.
1.5 Cybernetic Feedback Loops and Homeostatic Agent Stability#
1.5.1 Cybernetic Framing#
An agentic system is a cybernetic system with:
- reference signals: goals, SLAs, policy thresholds,
- sensors: observations, tool outputs, telemetry, tests,
- controller: planner/orchestrator,
- actuators: tools and commits,
- plant: external environment plus internal execution substrate,
- feedback: verification, monitoring, user correction, evaluator output.
1.5.2 Homeostatic Variables#
Stable agents maintain internal variables within controlled bands. A typical homeostatic state vector is:

$$h_t = (u_t, d_t, c_t, e_t, q_t, \rho_t, m_t)$$

where, for example:
- $u_t$: uncertainty estimate
- $d_t$: deadline slack or latency pressure
- $c_t$: cumulative task cost
- $e_t$: error rate
- $q_t$: queue depth
- $\rho_t$: service utilization
- $m_t$: context or memory pressure

The system seeks to regulate $h_t$ toward target setpoints $h^*$.
1.5.3 Stability Criterion#
A practical stability objective can be expressed through a Lyapunov-like function:

$$V(h_t) = \lVert h_t - h^* \rVert^2$$

and the controller should enforce

$$\mathbb{E}\left[ V(h_{t+1}) - V(h_t) \right] \le -\epsilon + \delta$$

for some $\epsilon > 0$, with $\delta$ representing bounded disturbance. In operational terms, the system should damp deviations rather than amplify them.
1.5.4 Negative Feedback Mechanisms#
Key stabilizers include:
- bounded recursion depth,
- retry budgets with exponential backoff and jitter,
- tool circuit breakers,
- context pruning and summarization,
- rate limiting and queue isolation,
- confidence-triggered escalation,
- contradiction detection,
- cache hierarchies,
- and budget-aware model routing.
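One of these stabilizers, retry budgets with exponential backoff and jitter, can be sketched directly. The base, factor, and cap values are illustrative defaults, not recommendations.

```python
import random

def backoff_schedule(base=0.5, factor=2.0, cap=30.0, retries=5, rng=random.random):
    """Full-jitter exponential backoff: delay_i is drawn from U(0, min(cap, base * factor**i))."""
    return [rng() * min(cap, base * factor ** i) for i in range(retries)]

# Passing a constant rng exposes the deterministic upper envelope of the schedule.
envelope = backoff_schedule(rng=lambda: 1.0)
```

Jitter matters because many agents retrying a failed tool in lockstep produce synchronized load spikes; randomizing within the growing envelope spreads the retries out.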
1.5.5 Instability Modes#
Common unstable regimes include:
- tool thrashing: repeated low-value tool calls,
- retrieval storms: unbounded fan-out under uncertainty,
- reflection loops: critique/repair cycles without convergence,
- memory contamination: erroneous facts promoted into durable memory,
- goal drift: subgoals begin optimizing for local completion rather than task truth,
- latency collapse under load: interactive and background workloads contend on shared resources.
1.5.6 Queueing-Theoretic Constraint#
For sustained stable operation, critical queues must satisfy

$$\rho = \frac{\lambda}{\mu} < 1$$

where $\lambda$ is arrival rate and $\mu$ is service rate. Since agentic workloads are bursty and heavy-tailed, practical design targets require headroom well below saturation and priority-based isolation between:
- interactive inference,
- batch agent jobs,
- verification workloads,
- and offline evaluation.
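A tiny admission-control sketch of this constraint; the 0.7 headroom target is an illustrative assumption, not a universal threshold.

```python
def utilization(arrival_rate, service_rate):
    """rho = lambda / mu; a queue is unstable at or above 1."""
    return arrival_rate / service_rate

def admit(arrival_rate, service_rate, headroom=0.7):
    """Admit new background work only while utilization stays under a headroom target well below saturation."""
    return utilization(arrival_rate, service_rate) < headroom

# At 6 tasks/s against a 10 tasks/s service rate there is headroom; at 8/s there is not.
ok_to_admit = admit(6, 10)
too_hot = admit(8, 10)
```

In practice each priority class (interactive, batch, verification, evaluation) gets its own queue and its own headroom check, so batch bursts cannot starve interactive inference.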
1.5.7 Homeostasis as a Safety Mechanism#
Homeostatic control is not only about uptime. It reduces hallucination and unsafe behavior by preventing the system from operating under:
- context overload,
- stale evidence,
- degraded tool reliability,
- and extreme time pressure without fallback.
1.6 The Competence–Alignment–Control Trilemma in Agentic Systems#
1.6.1 Definitions#
Let:
- $C$ (competence): the probability of producing a useful, correct task outcome
- $A$ (alignment): fidelity to operator intent, policy, and normative constraints
- $K$ (control): the ability to predict, bound, interrupt, audit, and reverse behavior
A deployable system requires all three. In practice, weakness in any one dimension collapses operational value.
1.6.2 Why It Is a Trilemma#
Increasing competence often requires:
- broader tool access,
- larger action spaces,
- longer planning horizons,
- richer memory,
- more adaptive behavior.
These raise the space of possible failure modes and therefore reduce control unless counterbalanced mechanically. Similarly, overly strong control through rigid gating can suppress competence and increase operator burden.
1.6.3 Operational Form#
A rough deployment utility can be represented as

$$U_{\text{deploy}} \approx C \times A \times K$$

This multiplicative framing is useful because near-zero performance on any axis makes the system nonviable regardless of strength on the others.
1.6.4 Failure Examples#
- High competence, weak alignment: the system solves the wrong problem efficiently; specification gaming emerges.
- High competence, weak control: the system acts effectively but opaquely; operators cannot reliably stop, audit, or attribute behavior.
- High alignment, weak competence: the system is safe but operationally irrelevant.
- High control, weak competence: the system devolves into a human-operated workflow with AI ornamentation.
1.6.5 Architectural Response#
The trilemma is managed by systems design, not by prompt text alone:
1. Externalize control from the model: the orchestrator enforces budgets, tool scope, and commit rights.
2. Use typed capability boundaries: different effects require different guards.
3. Separate planning from committing: candidate plans are low-trust artifacts until verified.
4. Bind competence to evidence: high competence must mean high performance under grounding and testing, not high rhetorical fluency.
5. Make risky actions reversible where possible: use staged writes, drafts, branch-based changes, and compensation workflows.
6. Escalate under unresolved ambiguity: alignment cannot be inferred robustly from underspecified tasks without a bounded clarification mechanism.
1.6.6 Strategic Implication#
There is no single-model solution to the trilemma. The stable frontier is achieved by combining:
- model capability,
- typed orchestration,
- verification infrastructure,
- human governance,
- and operational feedback loops.
1.7 Formal Verification of Agent Behavioral Contracts#
1.7.1 Why Formal Verification Is Necessary#
An agent that can mutate external systems without enforceable behavioral contracts is not deployable at enterprise scale. Since the model is stochastic and the environment is open-ended, verification must target observable behavior, not inaccessible internal reasoning.
1.7.2 Contract Structure#
A behavioral contract can be defined as

$$\Phi = \langle P, I, Q, E, L \rangle$$

where:
- $P$: preconditions
- $I$: invariants
- $Q$: postconditions
- $E$: permissible side-effect classes
- $L$: liveness and temporal requirements

An execution trace $\tau$ satisfies the contract iff

$$P(\tau_0) \;\wedge\; \forall t \, I(\tau_t) \;\wedge\; Q(\tau_{\text{end}}) \;\wedge\; \text{effects}(\tau) \subseteq E \;\wedge\; \tau \models L$$
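A trace-level satisfaction check can be sketched as follows; the ledger example, field names, and effect labels are hypothetical.

```python
def satisfies(trace, pre, invariant, post, allowed_effects):
    """Check P on the first state, I on every state, Q on the final state, and effects within E."""
    if not trace or not pre(trace[0]):
        return False
    if not all(invariant(state) for state in trace):
        return False
    if not post(trace[-1]):
        return False
    observed = {e for state in trace for e in state.get("effects", [])}
    return observed <= allowed_effects

trace = [
    {"authorized": True, "balance": 100, "effects": []},
    {"authorized": True, "balance": 90, "effects": ["ledger_write"]},
]
ok = satisfies(
    trace,
    pre=lambda s: s["authorized"],          # P: actor was authorized at the start
    invariant=lambda s: s["balance"] >= 0,  # I: balance never goes negative
    post=lambda s: s["balance"] == 90,      # Q: final state matches the intended mutation
    allowed_effects={"ledger_write", "notification"},
)
```

The liveness component $L$ is checked separately over the event stream (see the temporal examples below in this section); this function covers only the safety clauses.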
1.7.3 Contract Layers#
1. Schema contracts: inputs and outputs must conform to versioned types.
2. Authority contracts: only allowed tools and scopes may be invoked.
3. Data contracts: provenance, privacy class, retention policy, and field-level sensitivity must be honored.
4. Process contracts: required steps must occur in order, e.g. retrieve before claim, verify before commit.
5. Postcondition contracts: the artifact or mutation must satisfy task-specific requirements.
6. Temporal contracts: safety and liveness properties over traces must hold.
1.7.4 Temporal Logic Examples#
Examples of enforceable policies, expressed in linear temporal logic ($\mathbf{G}$ = always, $\mathbf{F}$ = eventually, $\mathbf{W}$ = weak until):

- Approval before mutation: $\neg\,\text{mutate} \;\mathbf{W}\; \text{approved}$
- Verification before commit: $\neg\,\text{commit} \;\mathbf{W}\; \text{verified}$
- Eventual safe termination: $\mathbf{F}(\text{terminated} \wedge \text{safe})$
- Tool calls must be schema-valid: $\mathbf{G}(\text{tool\_call} \Rightarrow \text{schema\_valid})$
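At runtime, such properties reduce to simple monitors over the event stream. A sketch of the verification-before-commit check, using a slightly stricter reading in which each commit consumes its own verification (an assumption, not a requirement of the policy as stated):

```python
def verify_before_commit(events):
    """Safety monitor over an event trace: every 'commit' must be preceded by a fresh 'verified'."""
    verified = False
    for event in events:
        if event == "verified":
            verified = True
        elif event == "commit":
            if not verified:
                return False
            verified = False  # each commit consumes its verification
    return True

good = verify_before_commit(["plan", "act", "verified", "commit"])
bad = verify_before_commit(["plan", "act", "commit"])
```

Monitors like this run in the verification plane and block the commit plane; they check the order of observable events, not the model's internal reasoning, which is exactly the envelope this section argues is verifiable.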
1.7.5 Verification Modalities#
A robust stack uses multiple verification modes:
1. Static: typed schemas, effect typing, policy compilation, workflow model checking.
2. Dynamic: runtime guards, authorization checks, rate limits, timeout classes, output validators.
3. Semantic: evidence sufficiency checks, contradiction analysis, regression evaluators, unit/integration tests.
4. Probabilistic: calibration curves, confidence thresholds, abstention policies.
1.7.6 Limits of Formal Verification#
Full semantic correctness over open-world natural language tasks is not generally decidable. Therefore the platform should verify:
- the structure of behavior,
- the legality of actions,
- the provenance of claims,
- and the satisfaction of explicit acceptance tests.
This is sufficient for strong practical assurance when combined with controlled effect surfaces.
1.7.7 Design Rule#
Verify the contracted envelope of behavior, not the model’s hidden cognition. The system boundary is where assurance must live.
1.8 Agentic vs. Workflow Automation: Architectural Decision Boundaries#
1.8.1 Core Distinction#
Workflow automation is appropriate when the action graph is mostly enumerable. Agentic execution is appropriate when the system must reason over open-world ambiguity, long-tail exceptions, or dynamically selected tools and evidence.
1.8.2 Decision Matrix#
| Criterion | Workflow Automation | Agentic Execution | Hybrid Pattern |
|---|---|---|---|
| Input variability | Low | High | Medium |
| Branching structure | Known | Emergent | Known skeleton, agentic nodes |
| Exception rate | Low | High | Moderate |
| Tool selection | Fixed | Dynamic | Fixed core, dynamic augmentation |
| Correctness basis | Deterministic rules | Evidence + verification | Rules for control, agent for interpretation |
| Latency predictability | High | Lower, variable | High for core path |
| Cost predictability | High | Lower | Controlled |
| Best use case | ETL, BPM, CRUD, compliance steps | Investigations, diagnosis, synthesis, adaptive operations | Enterprise default |
1.8.3 Economic Decision Rule#
Use agentic execution when the expected value of adaptivity exceeds its additional control and failure-handling cost:

$$\mathbb{E}[V_{\text{adaptivity}}] > C_{\text{control}} + \mathbb{E}[R_{\text{residual}}]$$

where:
- $C_{\text{control}}$ is the cost of guardrails, observability, and evaluation,
- $R_{\text{residual}}$ is the remaining risk after mitigation.
1.8.4 Common Anti-Patterns#
Do not use a free-form agent when:
- the process is deterministic and stable,
- the domain is heavily regulated and rules are fully encodable,
- the cost of a false positive mutation is extreme,
- or the latency SLO is incompatible with iterative reasoning.
Do not use pure workflows when:
- the long-tail exception burden dominates engineering cost,
- the task requires synthesis from heterogeneous evidence,
- the operator cannot predefine the branching logic economically,
- or the environment changes faster than workflow maintenance can keep up.
1.8.5 Recommended Enterprise Pattern#
The dominant production pattern is workflow skeleton + agentic interior:
- workflow handles identity, deadlines, retries, routing, approvals, and commits;
- agents handle interpretation, retrieval, diagnosis, synthesis, and adaptive subplanning.
This preserves control while exploiting model flexibility where it has comparative advantage.
1.9 The 10-Year Trajectory: From LLM-Centered Agents to Substrate-Independent Cognitive Architectures#
1.9.1 Direction of Travel#
The field is moving away from “prompted model as application” and toward protocol-oriented cognitive systems in which models are replaceable components within a durable execution substrate.
1.9.2 What Will Change#
Over the next decade, the system center of gravity will shift from the frontier model to:
- typed control planes,
- persistent structured memory,
- retrieval and world-state substrates,
- verifiers and evaluators,
- tool protocols,
- and event-sourced execution traces.
The system’s identity will increasingly reside in its contracts, memory, telemetry, and evaluation corpus, not in a specific model vendor.
1.9.3 Staged Evolution#
| Horizon | Dominant Pattern | Bottleneck | Architectural Response |
|---|---|---|---|
| 0–3 years | LLM-centered agents with retrieval and tools | hallucination, latency, brittle tool use | stronger orchestration, typed tools, verifiers |
| 3–7 years | multi-model agent stacks with persistent memory and specialization | coordination complexity, evaluation debt | protocol standardization, replay CI, event sourcing |
| 7–10 years | substrate-independent cognitive architectures | specification, governance, long-horizon stability | formal contracts, simulation, self-maintenance under policy |
1.9.4 Substrate Independence#
A substrate-independent architecture separates cognition into modular layers: a typed control plane, context compilation, retrieval and world state, memory services, verification, and commit governance, with the model portfolio as a replaceable inference substrate beneath them. The model portfolio becomes hot-swappable by task, latency tier, cost class, or jurisdiction. This has significant implications:
- vendor portability,
- cost optimization,
- regulatory flexibility,
- resilience to model regressions,
- and performance specialization.
1.9.5 Likely Technical Shifts#
1. From monolithic reasoning to model ensembles: separate models for planning, retrieval reformulation, code reasoning, verification, and summarization.
2. From raw prompt history to compiled context artifacts: context assembly becomes a deterministic systems function.
3. From vector-only retrieval to evidence graphs: lineage, authority, freshness, and usage patterns become ranking features.
4. From session chat memory to policy-governed memory hierarchies: durable memory requires validation, provenance, expiry, and deduplication.
5. From human review after failure to continuous evaluation before deployment: failed traces become replay suites and CI gates.
6. From model-centric trust to system-centric trust: assurance migrates from "the model seems good" to measurable behavioral guarantees.
1.9.6 Long-Term Limitation#
The hardest unsolved problem is not next-token quality. It is specification robustness under open-world action: ensuring that increasingly competent systems remain aligned and controllable when objectives are incomplete, shifting, or strategically exploitable.
1.10 Reference Architecture Overview: The Complete Agentic Execution Stack#
1.10.1 Architectural Principle#
The platform must be designed as a typed protocol stack, not prompt glue. All boundaries expose explicit schemas, capability discovery, deadlines, pagination where applicable, typed error classes, and versioned contracts.
1.10.2 Layered Stack#
| Layer | Primary Function | Protocol / Artifact | Key Controls |
|---|---|---|---|
| L1. User/Application Boundary | ingress, request lifecycle, async job control | JSON-RPC | authN/Z, deadlines, idempotency keys, error classes, versioning |
| L2. Task Gateway | request normalization, SLA assignment, tenancy isolation | typed task spec | admission control, rate limits, priority queues |
| L3. Orchestrator | plan/decompose/route/track execution | internal state machine | bounded recursion, leases, rollback, compensation |
| L4. Context Compiler | deterministic prefill assembly | compiled context artifact | token budgets, context hygiene, reproducibility digests |
| L5. Retrieval Engine | evidence gathering and ranking | hybrid indexed evidence packets | provenance, freshness, authority, latency budgets |
| L6. Memory Services | working/session/episodic/semantic/procedural memory | typed memory records | validation, deduplication, TTL, provenance |
| L7. Tool Fabric | execution against external/internal capabilities | MCP, gRPC, JSON-RPC adapters | least privilege, lazy loading, traceability |
| L8. Verification Plane | schema/policy/evidence/test checks | validators, rule engine, test harness | block-on-fail, contradiction detection, calibration |
| L9. Commit Plane | staged writes and irreversible mutations | transactional or saga protocols | approval gates, idempotency, compensation |
| L10. Observability / Eval Plane | logs, metrics, traces, replay CI | event store, trace schema, benchmark corpus | regression gating, drift detection, cost analytics |
The architecture is intentionally layered so that each concern can be validated, replaced, and scaled independently.
1.10.3 Boundary Protocols and Contracts#
JSON-RPC at the User/Application Boundary#
Use JSON-RPC for:
- synchronous request/response,
- asynchronous job creation,
- result polling,
- cancellation,
- trace retrieval,
- and capability discovery.
Required request fields:
- `request_id`
- `idempotency_key`
- `deadline`
- `tenant_id`
- `task_type`
- `schema_version`
- `priority_class`
- `authorization_context`
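A sketch of what such a request could look like on the wire, built and validated in Python. The method name `agent.create_task` and all field values are hypothetical; only the required field names follow the list above.

```python
import json

request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "agent.create_task",  # hypothetical method name
    "params": {
        "request_id": "req-7f3a",
        "idempotency_key": "task-2024-0001",
        "deadline": "2024-06-01T12:00:30Z",
        "tenant_id": "acme-prod",
        "task_type": "incident_triage",
        "schema_version": "1.4.0",
        "priority_class": "interactive",
        "authorization_context": {"principal": "svc-portal", "scopes": ["tasks:write"]},
    },
}

REQUIRED = {"request_id", "idempotency_key", "deadline", "tenant_id",
            "task_type", "schema_version", "priority_class", "authorization_context"}
missing = REQUIRED - request["params"].keys()  # admission control rejects if non-empty
wire = json.dumps(request)                     # what actually crosses the boundary
```

The gateway can reject malformed requests with a typed error before any model inference happens, which keeps schema enforcement at the boundary rather than inside prompts.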
Required response fields:
- terminal status or job handle,
- structured result or typed error,
- provenance handle,
- trace reference,
- budget consumption summary.
gRPC/Protobuf for Internal Execution#
Use gRPC internally for:
- low-latency service-to-service calls,
- strict type contracts,
- streaming tool output,
- backpressure-aware internal services,
- and uniform deadline propagation.
This is the correct substrate for orchestrator-to-retrieval, orchestrator-to-memory, verifier-to-tool, and evaluator-to-trace-store calls.
MCP for Discoverable Tools and Resources#
Use MCP for:
- tool discovery,
- resource surfaces,
- prompt/resource metadata,
- local and remote tool connectors,
- change notifications,
- schema-described input/output affordances.
Tool definitions must be lazily loaded into context to avoid token waste. Tools not relevant to the active plan do not belong in the active window.
1.10.4 Deterministic Context Construction#
Prompting should be treated as a compiled runtime artifact, not handwritten prose.
Context Compilation Inputs#
The compiler assembles:
- role policy,
- task objective and acceptance criteria,
- protocol bindings,
- current execution state,
- tool affordances,
- retrieved evidence packets,
- memory summaries,
- failure-state summaries,
- and response contract.
Token Budget Formalization#
Let the model context window be $W$ tokens. Allocate:

$$W = B_{\text{system}} + B_{\text{task}} + B_{\text{state}} + B_{\text{tools}} + B_{\text{evidence}} + B_{\text{memory}} + B_{\text{gen}}$$

where $B_{\text{gen}}$ is preserved for actual inference and structured output generation. The compiler must never consume the full window on prefilling.
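A minimal allocator sketch for this budget split; the section shares and the 25% generation reserve are illustrative assumptions, not recommended ratios.

```python
def allocate_budget(window, gen_fraction=0.25):
    """Split a context window across prefill sections, reserving a generation share up front."""
    b_gen = int(window * gen_fraction)
    prefill = window - b_gen
    # Illustrative per-section shares of the prefill budget (sum to 1.0).
    shares = {"system": 0.05, "task": 0.05, "state": 0.10, "tools": 0.10,
              "evidence": 0.45, "memory": 0.25}
    alloc = {section: int(prefill * share) for section, share in shares.items()}
    alloc["evidence"] += prefill - sum(alloc.values())  # absorb integer-rounding remainder
    alloc["generation"] = b_gen
    return alloc

alloc = allocate_budget(128_000)
```

Reserving the generation budget first, before any section is filled, is what enforces the rule that prefilling can never consume the full window.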
Compiler Requirements#
- deterministic section ordering,
- duplicate elimination,
- stale-history compression,
- provenance-preserving evidence compression,
- explicit omission of irrelevant history,
- versioned compiler policies,
- stable context digests for traceability and cache reuse.
Pseudo-Algorithm 2 — Context Compilation#
Inputs: task state $s$, model window $W$, latency budget $\ell$
- Load role and policy contract versions.
- Normalize objective into a machine-checkable task schema.
- Reserve generation capacity $B_{\text{gen}}$ from $W$.
- Include only the current execution state summary, not raw full history.
- Discover relevant tool affordances from MCP metadata; omit inactive tools.
- Decompose the task into retrieval intents.
- Fetch evidence packets under latency budget $\ell$.
- Fetch memory summaries by utility, recency, and policy eligibility.
- Deduplicate semantically equivalent constraints and evidence.
- Compress low-signal text; preserve citations, timestamps, and authorities.
- Emit a deterministic preamble with a content digest and schema version.
Hallucination Control Implication#
A clean context window reduces hallucination more effectively than verbose instruction piles. Unsupported claims frequently arise from context clutter, stale history, or missing evidence, not only from model weakness.
1.10.5 Retrieval Engine: Deterministic Evidence, Not Ad Hoc RAG#
Retrieval Is a Control Function#
Retrieval should produce evidence packets with provenance, not anonymous text blobs. Each evidence packet should minimally contain:
- source identifier,
- authority class,
- timestamp/freshness,
- lineage or dependency links,
- chunk boundaries,
- extraction method,
- confidence metadata,
- and access policy label.
Retrieval Inputs#
A production retrieval layer should combine:
- exact lexical match,
- semantic similarity,
- metadata filters,
- lineage/graph context,
- historical usage patterns,
- human annotations,
- code-derived enrichment,
- institutional knowledge bases,
- validated memory,
- and live runtime inspection.
Retrieval Scoring#
A practical ranking function is a weighted linear combination:

$$R = w_1 s_{\text{lex}} + w_2 s_{\text{sem}} + w_3 a + w_4 f + w_5 g + w_6 u - w_7 \ell - w_8 c$$

where:
- $s_{\text{lex}}$: exact match score
- $s_{\text{sem}}$: semantic score
- $a$: authority
- $f$: freshness
- $g$: lineage/graph relevance
- $u$: execution utility
- $\ell$: latency penalty
- $c$: context cost penalty
This is materially better than nearest-neighbor similarity alone.
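Such a weighted score is trivial to implement; the point is which signals it combines. A sketch with illustrative weight names (the specific values are placeholders, not tuned recommendations):

```python
def rank(candidate, weights, latency_ms, ctx_tokens):
    """Retrieval score: reward grounding signals, penalize runtime cost.

    `candidate` carries per-signal scores in [0, 1]; `weights` maps
    signal names to coefficients. Both schemas are assumptions.
    """
    reward = (
        weights["lex"] * candidate["lex"]      # exact lexical match
        + weights["sem"] * candidate["sem"]    # semantic similarity
        + weights["auth"] * candidate["auth"]  # source authority
        + weights["fresh"] * candidate["fresh"]  # freshness
        + weights["graph"] * candidate["graph"]  # lineage/graph relevance
        + weights["util"] * candidate["util"]  # execution utility
    )
    penalty = (weights["lat"] * latency_ms / 1000.0
               + weights["ctx"] * ctx_tokens / 1000.0)
    return reward - penalty
```

Note that a candidate with the highest semantic similarity can still lose to one with stronger authority, freshness, and execution utility, which is exactly the improvement over nearest-neighbor-only retrieval.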
Query Rewriting and Routing#
The platform should not issue one naive query. It should:
- rewrite and expand the user request,
- decompose it into subqueries,
- route each subquery by source, schema, and latency tier,
- merge results with provenance,
- and rank for execution utility, not only semantic relevance.
Chunking Strategy by Document Class#
Chunking must be document-aware:
- Policies / contracts / standards: structural chunking by section and clause
- Code and repositories: symbol-aware or AST-aware chunking with call graph and ownership enrichment
- Manuals and architecture docs: hierarchical chunking
- Tickets / incidents / chats: temporal-semantic chunking
- SOPs / workflows: agentic chunking around action units and precondition/postcondition boundaries
Hallucination Control Rules#
- no final factual claim without provenance,
- no synthesis from anonymous context,
- abstain if evidence is insufficient or conflicting,
- run contradiction checks before commit,
- prefer tool inspection over latent recall when the source of truth is available.
1.10.6 Hard Memory Wall: Working Context vs. Durable Memory#
Memory Layers#
A production memory system must separate:
- Working memory: current execution scratch state; ephemeral and high-churn.
- Session memory: short-lived interaction continuity.
- Episodic memory: validated facts about prior task outcomes and exceptions.
- Semantic memory: canonical domain knowledge and stable constraints.
- Procedural memory: validated operating patterns, repair strategies, routing heuristics.
These layers must not collapse into one undifferentiated vector store.
Promotion Policy#
A candidate memory item $m$ should be promoted only if it passes every gate:

$$\text{promote}(m) \iff \text{verified}(m) \land \text{novelty}(m) \ge \tau_n \land \text{trust}(\text{source}(m)) \ge \tau_s \land \neg\,\text{conflict}(m, M_{\text{canonical}})$$
Additional controls:
- TTL or expiry evaluation,
- privacy and retention policy checks,
- novelty threshold,
- source trust score,
- conflict detection against canonical memory.
Pseudo-Algorithm 3 — Memory Promotion#
Input: candidate observation or correction
- Verify source authenticity and authorization.
- Classify memory type: episodic, semantic, procedural, or reject.
- Check novelty against existing memory via semantic and exact deduplication.
- Require provenance, timestamp, and evidence links.
- Reject volatile, speculative, or chain-of-thought-like material.
- Apply privacy policy and expiry rules.
- Write only after validator approval.
- Record lineage from original event to retained memory object.
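The promotion pipeline above can be condensed into a single gate function. This is a sketch under stated assumptions: the candidate is a plain dict, semantic deduplication is elided (only exact-hash novelty is shown), and the trust threshold is a placeholder.

```python
import hashlib

def promote_candidate(candidate, memory_digests, trusted_sources, min_trust=0.7):
    """Run a memory candidate through the promotion checks.

    Returns (accepted, reason_or_lineage_handle). The in-memory
    `memory_digests` set stands in for a durable, deduplicated store.
    """
    # 1. Source authenticity and authorization.
    if trusted_sources.get(candidate.get("source"), 0.0) < min_trust:
        return False, "untrusted or unknown source"
    # 2. Classify memory type, or reject.
    if candidate.get("type") not in {"episodic", "semantic", "procedural"}:
        return False, "unclassifiable memory type"
    # 3. Require provenance and timestamp.
    if not candidate.get("provenance") or not candidate.get("timestamp"):
        return False, "missing provenance or timestamp"
    # 4. Reject volatile or speculative material.
    if candidate.get("speculative", False):
        return False, "speculative material rejected"
    # 5. Exact-match novelty check (semantic dedup assumed elsewhere).
    digest = hashlib.sha256(candidate["text"].encode()).hexdigest()
    if digest in memory_digests:
        return False, "duplicate of existing memory"
    memory_digests.add(digest)  # write only after all validators pass
    return True, digest         # digest doubles as a lineage handle
```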
Design Rule#
Never store unverified model assertions as durable truth. Durable memory is a governed knowledge substrate, not a cache of previous guesses.
1.10.7 Orchestration and Multi-Agent Execution#
Bounded Control Loop#
Every execution follows a bounded loop:

observe → plan → act → verify → (commit | repair | escalate)

with an explicit step budget. No step may be skipped for high-impact tasks.
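The control loop can be sketched as a small driver that fixes only the topology; every behavior is a caller-supplied callable, and the five-step budget is an illustrative default.

```python
def run_bounded(goal, plan, act, verify, repair, max_steps=5):
    """Plan/act/verify loop with an explicit step budget.

    `verify` returns (ok, issues); on failure the state is repaired
    and retried, never silently re-run. Exhausting the budget forces
    escalation rather than an unverified commit.
    """
    state = plan(goal)
    for _ in range(max_steps):
        result = act(state)
        ok, issues = verify(result)
        if ok:
            return {"status": "committed", "result": result}
        state = repair(state, issues)  # bounded repair step
    return {"status": "escalated", "reason": "step budget exhausted"}
```

The key property is that the loop can terminate only in `committed` or `escalated`; there is no path that emits an unverified result.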
Specialized Agents#
Use specialization when it improves correctness or throughput:
- implementation agent,
- retrieval agent,
- verification agent,
- documentation agent,
- performance analysis agent,
- security or policy review agent.
Specialization is justified only if the coordination overhead is lower than the accuracy or latency gain.
Isolation and Lock Discipline#
Parallel agents require:
- independently claimable work units,
- task leases or locks,
- isolated workspaces,
- merge-safe branches,
- conflict detection,
- bounded recursion,
- deterministic merge rules.
Concurrency is permitted only when overlap risk and merge entropy are mechanically controlled.
Idempotency#
Distributed agent systems should assume at-least-once execution semantics and create the illusion of exactly-once behavior through idempotent mutations. Every effectful operation needs:
- operation-scoped idempotency key,
- deduplication store,
- replay-safe handlers,
- compensating action strategy when atomicity is impossible.
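The idempotency-key pattern can be sketched as follows; the in-memory dict stands in for a durable deduplication store (for example a database table keyed by the operation key), which is an assumption of this sketch.

```python
class IdempotentExecutor:
    """Exactly-once illusion over at-least-once delivery via a dedup store."""

    def __init__(self):
        self._results = {}  # idempotency key -> recorded outcome

    def execute(self, key, mutation):
        if key in self._results:   # replayed delivery: do not re-execute
            return self._results[key]
        outcome = mutation()       # effectful operation runs exactly once
        self._results[key] = outcome  # record before acknowledging
        return outcome
```

A replayed message with the same key returns the recorded outcome instead of mutating the external system a second time.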
1.10.8 Tool Fabric: Typed Infrastructure with Least Privilege#
Tool Exposure Requirements#
Tool servers should expose:
- capability discovery,
- versioned schemas,
- optional structured outputs,
- pagination,
- timeout classes,
- error taxonomy,
- change notifications,
- and auditable invocation traces.
Least-Privilege Access#
Mutation-capable tool paths must be:
- caller-scoped,
- approval-gated when needed,
- human-interruptible,
- policy-bound to explicit effect classes,
- and never backed by broad ambient credentials held by the agent itself.
Environment Legibility#
Agents must be able to inspect:
- logs,
- metrics,
- traces,
- repository metadata,
- test harnesses,
- browser or desktop state where applicable,
- and runtime diagnostic artifacts.
An agent that cannot observe the environment cannot reliably improve or repair it.
1.10.9 Verification, Response Synthesis, and Commit#
Multi-Layer Verification#
Before response synthesis or mutation, enforce:
- schema validation,
- authorization validation,
- provenance sufficiency,
- contradiction detection,
- domain-specific tests or simulations,
- policy checks,
- approval checks for effectful actions.
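These layers compose naturally into an ordered pipeline that short-circuits on the first failure. A minimal sketch; the layer names and check signatures are illustrative assumptions:

```python
def verify_layers(draft, layers):
    """Run ordered verification layers; stop at the first failure.

    `layers` is a list of (name, check) pairs where each check
    returns (ok, detail). Ordering matters: cheap structural checks
    should run before expensive simulations.
    """
    for name, check in layers:
        ok, detail = check(draft)
        if not ok:
            return {"pass": False, "failed_layer": name, "detail": detail}
    return {"pass": True, "failed_layer": None, "detail": None}
```

Recording which layer failed, and why, is what makes the downstream choice between repair, abstention, and escalation auditable.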
Response Synthesis Rule#
Final responses should be synthesized only from verified state and evidence packets. If evidence is incomplete, the allowed outputs are:
- abstention,
- bounded hypothesis clearly marked as uncertain,
- clarification request,
- or escalation.
Commit Semantics#
Commits must be:
- explicit,
- audited,
- idempotent,
- and reversible where possible.
For non-atomic multi-step writes, use saga-style compensation rather than implicit trust in success propagation.
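Saga-style compensation can be sketched as a driver over (action, compensate) pairs; compensations are assumed idempotent, consistent with the idempotency requirements above.

```python
def run_saga(steps):
    """Execute (action, compensate) pairs; on failure, roll back in reverse.

    Returns True if all steps committed, False after rollback. Rollback
    is best-effort here; production systems also persist saga state.
    """
    completed = []
    for action, compensate in steps:
        try:
            action()
        except Exception:
            for comp in reversed(completed):
                comp()            # reverse-order compensation
            return False
        completed.append(compensate)
    return True
```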
1.10.10 Fault Tolerance, Graceful Degradation, and Load Control#
Reliability Controls#
Production-grade agentic platforms require:
- rate limits,
- backpressure,
- circuit breakers,
- retry budgets with jitter,
- queue isolation by workload class,
- cache hierarchies for retrieval and compiled context artifacts,
- and workload prioritization.
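One of these controls, retries under a fixed budget with full jitter, can be sketched as below; the default delays are illustrative, and the injectable `sleep` parameter is an assumption added for testability.

```python
import random
import time

def retry_with_jitter(op, budget=3, base=0.1, cap=2.0, sleep=time.sleep):
    """Full-jitter exponential backoff within a fixed retry budget.

    Retries `op` on any exception, sleeping a random delay in
    [0, min(cap, base * 2**attempt)) between attempts. Full jitter
    spreads retries out to avoid synchronized thundering herds.
    """
    for attempt in range(budget):
        try:
            return op()
        except Exception:
            if attempt == budget - 1:
                raise  # budget exhausted: surface the error, don't loop forever
            sleep(random.uniform(0, min(cap, base * (2 ** attempt))))
```

A circuit breaker would wrap this same call path and stop issuing attempts entirely once the failure rate trips its threshold.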
Graceful Degradation Policy#
Under load or partial outage, degrade in a controlled order:
- shed low-priority background reflection and cleanup jobs,
- reduce retrieval fan-out,
- shrink evidence payload size,
- downgrade to cheaper or faster model tiers,
- disable nonessential critique passes,
- switch effectful actions to draft-only mode,
- escalate or abstain rather than hallucinate.
Pseudo-Algorithm 4 — Load-Aware Degradation#
Input: utilization $u$, error rate $e$, deadline pressure $d$
- If $u$ or $e$ exceeds its warning threshold:
- reduce asynchronous evaluator concurrency,
- enable cache-preferred retrieval,
- tighten context budgets.
- If thresholds continue rising:
- route to lower-latency models for low-risk tasks,
- disable optional reflection passes,
- restrict to read-only operations where feasible.
- If critical thresholds are crossed:
- reject or queue low-priority jobs,
- preserve interactive and safety-critical classes,
- force abstain/escalate behavior for under-verified tasks.
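The tiered policy above reduces to a pure mapping from load signals to an ordered action list. A sketch with placeholder thresholds (the specific numbers are assumptions, not recommended production values):

```python
def degradation_actions(utilization, error_rate,
                        warn=(0.70, 0.02), crit=(0.90, 0.10)):
    """Map load signals to an ordered list of degradation actions.

    Actions accumulate: crossing the critical tier implies the
    warning-tier actions are already in effect.
    """
    actions = []
    if utilization >= warn[0] or error_rate >= warn[1]:
        actions += ["reduce_evaluator_concurrency",
                    "prefer_cached_retrieval",
                    "tighten_context_budget"]
    if utilization >= crit[0] or error_rate >= crit[1]:
        actions += ["queue_low_priority_jobs",
                    "read_only_mode",
                    "force_abstain_underverified"]
    return actions
```

Keeping the policy a pure function of observed signals makes every degradation decision replayable and auditable.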
Operational Principle#
The system should fail safely and legibly, not optimistically and opaquely.
1.10.11 Observability and Continuous Evaluation#
Observability at Every Boundary#
Capture:
- structured logs,
- distributed traces,
- span-level tool invocations,
- compiled context digests,
- token and latency usage,
- retrieval evidence provenance,
- verifier outcomes,
- approval events,
- commit results,
- and compensation flows.
Minimum Operational Metrics#
- task success rate by contract version,
- grounding rate,
- verifier pass/fail distribution,
- human escalation rate,
- p50/p95 latency,
- cost per accepted task,
- tool failure rate,
- memory promotion acceptance rate,
- replay regression pass rate,
- and load-shed frequency.
Feedback-to-Evaluation Pipeline#
Human corrections, failed traces, reviewer comments, and production regressions should be normalized into:
- replay sets,
- benchmark tasks,
- policy tests,
- adversarial evaluation suites,
- routing tests,
- and CI/CD gates.
Capability growth without regression enforcement is not engineering maturity; it is stochastic drift.
Recurring Cleanup Agents#
The platform should include maintenance agents, under policy, to:
- identify duplicated prompt or policy patterns,
- remove context slop,
- retire dead tools,
- detect stale memory,
- and propose eval additions from recurring incidents.
These agents improve the substrate, not only individual task outcomes.
Cross-Cutting Nonfunctional Invariants#
| Concern | Required Mechanisms |
|---|---|
| Hallucination control | provenance-tagged evidence, tool-grounding, contradiction checks, abstention, evidence-only synthesis |
| Fault tolerance | retries with jitter, circuit breakers, persisted failure state, sagas, queue isolation |
| Idempotency | operation keys, dedup stores, replay-safe handlers, staged commits |
| Observability | traces, logs, metrics, context digests, tool audit trails |
| Latency | deadlines, adaptive retrieval fan-out, priority queues, model tiering |
| Token efficiency | deterministic context compilation, lazy tool loading, memory summaries, history compression |
| Cost optimization | cache hierarchies, routing by task class, selective verification depth, retrieval utility scoring |
| Graceful degradation | load shedding, draft-only fallback, reduced reflection, abstain/escalate under uncertainty |
Concluding Synthesis#
The decisive shift from predictive model to autonomous cognitive architecture is the shift from generation to governed closed-loop execution. Agentic systems are not defined by conversational fluency, but by the presence of:
- explicit goals,
- typed interfaces,
- bounded control loops,
- deterministic context construction,
- provenance-first retrieval,
- validated memory hierarchies,
- verifier-mediated tool use,
- governed commit semantics,
- and continuous observational feedback.
The enduring architecture pattern is therefore clear: keep stochastic intelligence inside a deterministic, observable, contract-enforced envelope. That is the foundational design law for production-grade agentic AI.