Agentic Notes Library

Chapter 1: The Agentic Paradigm — From Predictive Models to Autonomous Cognitive Architectures

March 19, 2026

An LLM becomes agentic only when embedded inside a closed-loop execution architecture with explicit goals, bounded planning, tool-mediated actuation, state management, verification, and governed commit semantics. The boundary between a predictive model and an autonomous cognitive architecture is therefore not linguistic fluency; it is control topology.

Scope and Assumptions#

This chapter assumes a production-grade agentic platform operating under the following conditions:

  • Partially observable environments: the true task state is not fully visible to the model at any step.
  • Mixed read/write tasks: some actions are advisory, some mutate external systems.
  • Enterprise constraints: latency, cost, safety, auditability, regulatory compliance, and multi-tenant isolation are first-class concerns.
  • Typed infrastructure: JSON-RPC at user/application boundaries, gRPC/Protobuf for internal execution, MCP for discoverable tool/resource/prompt surfaces.
  • Deterministic orchestration around stochastic models: the model proposes; the system governs.
  • No one-shot correctness assumption: nontrivial tasks must execute through bounded verify/repair loops.

1.1 Definitional Taxonomy: Agents, Assistants, Copilots, Autonomous Systems — Formal Boundaries#

1.1.1 Canonical System Model#

A production agentic system can be formalized as

$$\mathcal{A} = \langle \mathcal{E}, \mathcal{O}, \mathcal{S}, \mathcal{G}, \Pi, \mathcal{T}, \mathcal{M}, \mathcal{V}, \mathcal{K} \rangle$$

where:

  • $\mathcal{E}$: environment
  • $\mathcal{O}$: observation space
  • $\mathcal{S}$: internal state or belief state
  • $\mathcal{G}$: goal set and acceptance criteria
  • $\Pi$: policy/planning function
  • $\mathcal{T}$: tool/action interface set
  • $\mathcal{M}$: memory hierarchy
  • $\mathcal{V}$: verification and critique mechanisms
  • $\mathcal{K}$: commit protocol governing irreversible effects

A system is not agentic merely because it emits text that resembles planning. It is agentic only if it operates as a closed-loop controller over environment-facing actions and state transitions.
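
This tuple maps directly onto a typed structure. A minimal sketch in Python, where every field type is an illustrative stand-in rather than a prescribed interface:

```python
from dataclasses import dataclass
from typing import Any, Callable

# Illustrative aliases; a real platform would use concrete service interfaces.
Observation = dict[str, Any]
Action = dict[str, Any]

@dataclass(frozen=True)
class AgenticSystem:
    environment: Any                              # E: handle to the external world
    observe: Callable[[], Observation]            # O: observation channel
    state: dict[str, Any]                         # S: internal/belief state
    goals: list[dict[str, Any]]                   # G: goals and acceptance criteria
    policy: Callable[[dict[str, Any]], Action]    # Pi: policy/planning function
    tools: dict[str, Callable[..., Any]]          # T: typed tool/action interfaces
    memory: dict[str, list[Any]]                  # M: memory hierarchy by layer
    verifiers: list[Callable[[Action], bool]]     # V: machine-checkable checks
    commit: Callable[[Action, str], Any]          # K: governed commit, keyed for idempotency
```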

1.1.2 Predictive Model vs. Agentic System#

A predictive model is a conditional estimator:

$$p_\theta(y \mid x)$$

It maps an input context $x$ to an output distribution over $y$. It does not inherently define:

  • persistent state transitions,
  • tool execution semantics,
  • recovery behavior,
  • verification contracts,
  • or commit authority.

Agentic behavior emerges only when the model is embedded in a control architecture that interprets output as a candidate action rather than as authoritative truth.

1.1.3 Formal Boundary Conditions#

A system qualifies as an agent only if all of the following hold:

  1. Objective-conditioned execution
    There exists an explicit task objective or acceptance criterion beyond generic dialogue continuation.

  2. Stateful multi-step control
    The system maintains task state across steps and can condition future actions on prior outcomes.

  3. Action selection over a tool/action set
    The system selects among external actions, not merely among phrasings.

  4. Verification-mediated adaptation
    The system can inspect consequences, detect failure, and repair or escalate.

  5. Commit semantics
    The system can stage or perform side effects under governance.

If conditions 1–4 hold but 5 is absent, the system is a bounded planning assistant, not a fully operational agent.

1.1.4 Taxonomy Table#

| Class | Goal Ownership | Planning Horizon | Tool Use | Commit Authority | Memory Scope | Recovery Ownership | Primary Failure Mode |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Predictive model | None explicit | 0 | None intrinsic | None | None intrinsic | External wrapper | Unsupported fluent output |
| Assistant | Human | 1 turn to short session | Optional, user-steered | Human only | Session-local | Human | Advice without execution grounding |
| Copilot | Human, domain-bounded | Short multi-step | Domain-coupled, proactive suggestion | Usually human-approved | Session + narrow task memory | Shared | Over-suggestion, partial context mismatch |
| Agent | Delegated task objective | Multi-step bounded | Autonomous routing among typed tools | Conditional, policy-bound | Working + session + episodic | System loop, human on exception | Tool misuse, hallucinated action plans, failed recovery |
| Autonomous system | Delegated domain objective, may derive subgoals | Long-horizon | Broad, scheduled, event-driven | Policy-governed direct execution | Durable, validated hierarchical memory | System by default | Goal drift, specification gaming, cascading failures |

1.1.5 Assistant vs. Copilot vs. Agent#

The boundaries are operational, not marketing labels:

  • Assistant: language-centric, user-driven, mostly advisory.
  • Copilot: proactive within a sharply bounded domain and UI surface.
  • Agent: task-centric, closed-loop, tool-using, stateful, verifier-mediated.
  • Autonomous system: agent with delegated authority over longer horizons, asynchronous execution, and policy-bounded self-scheduling.

1.1.6 Necessary Negative Definitions#

A system is not meaningfully agentic if:

  • it cannot observe execution outcomes,
  • it cannot distinguish draft output from committed side effects,
  • it cannot recover from tool or retrieval failure,
  • it cannot justify outputs with provenance,
  • or it cannot abstain and escalate under uncertainty.

1.2 The Agent as a Control System: Sense–Plan–Act–Verify–Repair–Commit Loop Formalization#

1.2.1 Control-Theoretic Framing#

The correct abstraction for an agent is a closed-loop controller operating over a partially observable environment. Let:

  • xtx_t: latent environment state
  • oto_t: observation at time tt
  • btb_t: belief state
  • ata_t: action
  • gtg_t: active goal
  • τt\tau_t: execution trace up to time tt

Belief update is ideally:

$$b_t = \eta \, P(o_t \mid x_t) \sum_{x_{t-1}} P(x_t \mid x_{t-1}, a_{t-1}) \, b_{t-1}(x_{t-1})$$

In real agentic platforms, $b_t$ is approximated by structured execution state, retrieved evidence, tool outputs, and memory summaries rather than exact Bayesian inference.

1.2.2 Objective Function with Operational Constraints#

The agent should optimize expected utility subject to safety, latency, and cost constraints:

$$\max_{\pi} \; \mathbb{E}[U(\tau)]$$

subject to

$$\Pr(\text{unsafe}(\tau)) \le \epsilon, \qquad \mathbb{E}[\text{latency}(\tau)] \le L_{\max}, \qquad \mathbb{E}[\text{cost}(\tau)] \le C_{\max}$$

A practical decision rule is:

$$a_t^{*} = \arg\max_{a \in \mathcal{A}(b_t)} \Big( \mathbb{E}[U(a \mid b_t)] - \lambda_{\$} C(a) - \lambda_{\ell} L(a) - \lambda_r R(a) \Big)$$

where:

  • $C(a)$ is monetary or token cost,
  • $L(a)$ is latency impact,
  • $R(a)$ is estimated risk.

The action set must include abstain, escalate, and request clarification. A production agent without these actions will systematically overcommit.
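
A minimal sketch of this decision rule, assuming the utility, cost, latency, and risk estimates arrive from upstream planner and verifier components (all names here are hypothetical):

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    utility: float   # E[U(a | b_t)], estimated upstream
    cost: float      # C(a): monetary or token cost
    latency: float   # L(a): latency impact
    risk: float      # R(a): estimated risk

def select_action(candidates, lam_cost=1.0, lam_latency=0.5, lam_risk=2.0):
    """Penalized argmax over the action set. Abstain and escalate are ordinary
    candidates, so the agent can prefer them when every substantive action
    scores worse -- the overcommit guard described above."""
    def score(a: Candidate) -> float:
        return a.utility - lam_cost * a.cost - lam_latency * a.latency - lam_risk * a.risk
    return max(candidates, key=score)

# Abstain and escalate carry zero side-effect risk but modest utility.
candidates = [
    Candidate("run_migration", utility=5.0, cost=1.0, latency=2.0, risk=2.5),
    Candidate("abstain",       utility=0.5, cost=0.0, latency=0.0, risk=0.0),
    Candidate("escalate",      utility=1.0, cost=0.1, latency=0.5, risk=0.0),
]
print(select_action(candidates).name)  # "escalate" under these penalties
```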

1.2.3 Loop Semantics#

The minimal safe loop is:

  1. Sense
    Collect user inputs, system state, tool results, runtime telemetry, repository state, logs, tests, UI state, and retrieved evidence.

  2. Plan
    Produce a bounded plan with subgoals, tool routing, success criteria, and rollback conditions.

  3. Act
    Execute a typed tool call or emit a structured intermediate artifact.

  4. Verify
    Check schema validity, evidence sufficiency, policy compliance, consistency, and task-specific tests.

  5. Repair
    If verification fails, critique root cause, update plan/context/tool choice, or escalate.

  6. Commit
    Only then write to external systems or emit final user-facing outputs with provenance.

1.2.4 Verification as a First-Class Transition Gate#

A proposed action is valid only if

$$v_t = \phi_{\text{schema}} \wedge \phi_{\text{policy}} \wedge \phi_{\text{evidence}} \wedge \phi_{\text{state}} \wedge \phi_{\text{tests}} = 1$$

where each $\phi$ is a machine-checkable predicate. If $v_t = 0$, the system must not commit. It must either repair, downgrade, or escalate.
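
A sketch of the gate as an explicit conjunction over named predicates, assuming the validators are injected by the orchestrator (predicate names and the action shape are illustrative):

```python
from typing import Any, Callable

Predicate = Callable[[dict[str, Any]], bool]

def verification_gate(action: dict[str, Any],
                      predicates: dict[str, Predicate]) -> tuple[bool, list[str]]:
    """Evaluate every machine-checkable predicate; the gate passes only if all
    hold (the conjunction v_t above). Failed predicate names are returned so
    the repair step can classify the root cause."""
    failures = [name for name, phi in predicates.items() if not phi(action)]
    return (not failures), failures

# Hypothetical predicate set; real checks would call validators and test harnesses.
gates = {
    "schema":   lambda a: "payload" in a,
    "policy":   lambda a: a.get("effect_class") != "forbidden",
    "evidence": lambda a: len(a.get("citations", [])) > 0,
}
ok, failed = verification_gate(
    {"payload": {}, "effect_class": "read", "citations": ["doc:1"]}, gates)
print(ok, failed)  # True, []
```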

1.2.5 Pseudo-Algorithm 1 — Bounded Agent Execution Loop#

Inputs: task specification $\omega$, deadline $d$, budget $\beta$, idempotency key $k$

  1. Initialize execution state $s_0$, trace $\tau_0$, retry budget, and recursion depth counter.
  2. Sense environment:
    • normalize user request,
    • ingest current runtime state,
    • load session state and prior verified artifacts.
  3. Compile a bounded context artifact under token budget.
  4. Produce a task plan with explicit acceptance tests and rollback conditions.
  5. Decompose into claimable work units.
  6. For each work unit:
    • retrieve provenance-tagged evidence,
    • select tool or model action,
    • execute under deadline and capability constraints,
    • verify output against contracts.
  7. If verification fails:
    • classify error as transient, deterministic, policy, or epistemic,
    • persist failure state,
    • repair via replan, reroute, clarification, or escalation,
    • enforce bounded retries with jitter and maximum depth.
  8. If verification succeeds:
    • stage mutation if effectful,
    • require approval if policy requires,
    • commit with idempotency key $k$ and audit trace.
  9. Synthesize final result only from verified state and evidence.
  10. Emit response, trace handle, provenance, and postconditions.
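
A compressed, runnable sketch of this loop. Every helper is a hypothetical orchestrator hook, stubbed here so the control flow itself is visible:

```python
import random
import time

# --- hypothetical orchestrator hooks, trivially stubbed so the sketch runs ---
def plan_task(task):        return {"work_units": [{"id": 1, "contracts": {}}]}
def execute(unit):          return {"unit": unit["id"], "output": "ok"}
def verify(result, _):      return True, []          # (passed, failure classes)
def repair_plan(unit, _):   return unit
def escalate(task, reason): return {"status": "escalated", "reason": reason}
def stage_commit(plan, idempotency_key):
    return {"status": "committed", "key": idempotency_key}

def run_task(task, deadline_s=60.0, max_retries=3, max_repairs=2):
    """Bounded loop: sense/plan happen in plan_task; each work unit is
    executed, verified, and repaired within budgets, then staged for commit."""
    start = time.monotonic()
    plan = plan_task(task)
    for unit in plan["work_units"]:
        repairs = 0
        for attempt in range(max_retries + 1):
            if time.monotonic() - start > deadline_s:
                return escalate(task, "deadline_exceeded")
            ok, failures = verify(execute(unit), unit["contracts"])
            if ok:
                break
            if "transient" in failures and attempt < max_retries:
                time.sleep(min(2 ** attempt, 8) + random.random())  # backoff + jitter
            elif repairs < max_repairs:
                unit, repairs = repair_plan(unit, failures), repairs + 1
            else:
                return escalate(task, "repair_budget_exhausted")
        else:
            return escalate(task, "retries_exhausted")
    return stage_commit(plan, idempotency_key=task["key"])

print(run_task({"key": "task-123"}))
```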

1.2.6 Critical Engineering Implication#

The language model is not the control loop. It is a stochastic component inside the loop. The orchestrator owns:

  • deadlines,
  • retry policy,
  • tool authorization,
  • state transitions,
  • memory promotion,
  • verification gates,
  • and final commit.

That distinction is foundational.


1.3 Levels of Agentic Autonomy: L0 → L5#

1.3.1 Why a Leveling Model Is Necessary#

“Autonomous” is overloaded. A useful maturity model must distinguish systems by delegated authority, planning horizon, mutation rights, recovery competence, and human governance mode.

1.3.2 Autonomy Levels#

| Level | Name | Planning / Control | Tooling | Memory | Human Role | Mutation Rights |
| --- | --- | --- | --- | --- | --- | --- |
| L0 | Tool-Augmented LLM | Single-turn or shallow multi-turn generation | Manual or wrapper-triggered tools | Minimal session | User in the loop for every step | None |
| L1 | Reactive Assistant | Responds with local adaptations | Read-heavy tools, limited execution | Session memory | User directs sequence | Draft only |
| L2 | Task Agent | Multi-step bounded planning with retries | Typed tool routing | Working + session + limited episodic | Human approves high-risk actions | Narrow, approval-gated |
| L3 | Delegated Operator | Asynchronous task handling, monitors and replans | Broad domain tools | Durable episodic memory | Human on exceptions | Policy-bound writes |
| L4 | Domain Autonomous System | Long-horizon domain objectives, event-driven operation | Multi-system orchestration | Episodic + semantic + procedural | Human on the loop | Broad within domain policy |
| L5 | Fully Autonomous Cognitive Agent | Cross-domain self-directed operation with self-maintenance | Dynamic multi-domain tooling | Hierarchical durable memory and self-model | Human only for governance | Broad delegated authority |

1.3.3 Promotion Criteria Between Levels#

A system should not be promoted to the next level based on anecdotal demos. Promotion requires evidence across:

  • verifier pass rate,
  • factual grounding rate,
  • rollback success rate,
  • human override rate,
  • p95 latency,
  • cost per accepted task,
  • failure containment,
  • and policy compliance under adversarial evaluation.

1.3.4 Practical Interpretation#

  • Most current production systems are between L1 and L2.
  • High-value enterprise systems are beginning to operate at constrained L3 in narrow domains.
  • L4 is feasible only where tooling, policy contracts, and environment observability are unusually strong.
  • L5 remains aspirational and is not presently compatible with stringent production safety expectations across open domains.

1.3.5 Architectural Consequence#

Autonomy level is not a model property. It is a system property resulting from the interaction of:

  • planning depth,
  • memory design,
  • effect controls,
  • environment observability,
  • and verification strength.

1.4 Theoretical Foundations: Rational Agency, Bounded Rationality, Satisficing under Uncertainty#

1.4.1 Rational Agency#

Classical rational agency selects actions that maximize expected utility under belief uncertainty. In a POMDP-like framing:

$$\pi^{*} = \arg\max_{\pi \in \Pi} \mathbb{E}\left[\sum_{t=0}^{T} \gamma^t r(x_t, a_t)\right]$$

This framing remains useful, but direct optimization is computationally intractable for real agentic systems.

1.4.2 Bounded Rationality#

Real systems are bounded by:

  • context window size,
  • inference latency,
  • tool round-trip time,
  • retrieval noise,
  • budget ceilings,
  • and verification capacity.

A more realistic objective is resource-rational:

$$\pi^{*} = \arg\max_{\pi \in \Pi} \Big( \mathbb{E}[U(\tau)] - \lambda_c \, \text{Compute}(\pi) - \lambda_m \, \text{Context}(\pi) - \lambda_{\$} \, \text{Spend}(\pi) \Big)$$

This formulation captures a central systems truth: more reasoning is not always better. The value of additional deliberation must exceed its cost.

1.4.3 Metareasoning: Think vs. Act#

The system should choose to deliberate further only when the expected value of deliberation exceeds its cost:

$$\Delta V_{\text{deliberate}} > C_{\text{deliberate}}$$

This yields architecture-level policies such as:

  • skip critique passes for low-risk, deterministic tasks,
  • expand retrieval fan-out only under unresolved uncertainty,
  • escalate rather than continue if evidence quality remains low after bounded attempts.

1.4.4 Satisficing Under Uncertainty#

In many enterprise contexts, perfect optimality is unnecessary or unobtainable. A satisficing rule is:

$$\text{choose } a \text{ such that } \Pr(U(a) \ge \theta \mid b_t) \ge \alpha$$

where:

  • $\theta$ is the minimum acceptable utility threshold,
  • $\alpha$ is the required confidence.

This is operationally superior to naive maximization in settings with incomplete information, long-tail exceptions, or tight latency budgets.
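
A sketch of the satisficing rule, assuming utility uncertainty is represented by Monte Carlo draws under the current belief state (action names and draws are invented for illustration):

```python
def satisfice(candidates, utility_samples, theta=0.7, alpha=0.9):
    """Return the first action whose estimated probability of clearing the
    utility threshold theta is at least alpha; otherwise signal abstention.
    utility_samples maps action name -> utility draws under b_t."""
    for a in candidates:
        samples = utility_samples[a]
        p_acceptable = sum(u >= theta for u in samples) / len(samples)
        if p_acceptable >= alpha:
            return a
    return "abstain"  # no action is acceptably likely to be good enough

draws = {"reply_now":     [0.90, 0.80, 0.75, 0.60],
         "deep_research": [0.95, 0.90, 0.92, 0.88]}
print(satisfice(["reply_now", "deep_research"], draws))  # "deep_research"
```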

1.4.5 Uncertainty Classes#

A production agent must distinguish:

  • Epistemic uncertainty: missing or incomplete knowledge
    Mitigation: retrieval, tool inspection, clarification, human escalation.

  • Aleatoric uncertainty: intrinsic environmental variability
    Mitigation: probabilistic planning, risk bounds, staged commits.

  • Model uncertainty: unreliable inference or calibration
    Mitigation: verifier ensembles, model routing, abstention.

  • Specification uncertainty: ambiguous or underspecified user intent
    Mitigation: clarification, preference inference bounded by policy, human confirmation.

1.4.6 Why This Matters Architecturally#

Retrieval, memory, verification, and governance are not auxiliary features. They are the concrete mechanisms by which bounded rationality becomes operationally viable.


1.5 Cybernetic Feedback Loops and Homeostatic Agent Stability#

1.5.1 Cybernetic Framing#

An agentic system is a cybernetic system with:

  • reference signals: goals, SLAs, policy thresholds,
  • sensors: observations, tool outputs, telemetry, tests,
  • controller: planner/orchestrator,
  • actuators: tools and commits,
  • plant: external environment plus internal execution substrate,
  • feedback: verification, monitoring, user correction, evaluator output.

1.5.2 Homeostatic Variables#

Stable agents maintain internal variables within controlled bands. Typical homeostatic state vector:

$$z_t = [u_t, \ell_t, c_t, e_t, q_t, \rho_t, m_t]$$

where, for example:

  • $u_t$: uncertainty estimate
  • $\ell_t$: deadline slack or latency pressure
  • $c_t$: cumulative task cost
  • $e_t$: error rate
  • $q_t$: queue depth
  • $\rho_t$: service utilization
  • $m_t$: context or memory pressure

The system seeks to regulate $z_t$ toward target setpoints $z^{*}$.

1.5.3 Stability Criterion#

A practical stability objective can be expressed through a Lyapunov-like function:

$$V(z_t) = \|z_t - z^{*}\|_2^2$$

and the controller should enforce

$$\mathbb{E}[V(z_{t+1}) - V(z_t)] \le -\epsilon \, \|z_t - z^{*}\|_2^2 + \eta$$

for some $\epsilon > 0$, with $\eta$ representing bounded disturbance. In operational terms, the system should damp deviations rather than amplify them.

1.5.4 Negative Feedback Mechanisms#

Key stabilizers include:

  • bounded recursion depth,
  • retry budgets with exponential backoff and jitter,
  • tool circuit breakers,
  • context pruning and summarization,
  • rate limiting and queue isolation,
  • confidence-triggered escalation,
  • contradiction detection,
  • cache hierarchies,
  • and budget-aware model routing.

1.5.5 Instability Modes#

Common unstable regimes include:

  • tool thrashing: repeated low-value tool calls,
  • retrieval storms: unbounded fan-out under uncertainty,
  • reflection loops: critique/repair cycles without convergence,
  • memory contamination: erroneous facts promoted into durable memory,
  • goal drift: subgoals begin optimizing for local completion rather than task truth,
  • latency collapse under load: interactive and background workloads contend on shared resources.

1.5.6 Queueing-Theoretic Constraint#

For sustained stable operation, critical queues must satisfy

$$\rho = \frac{\lambda}{\mu} < 1$$

where $\lambda$ is the arrival rate and $\mu$ is the service rate. Since agentic workloads are bursty and heavy-tailed, practical design targets require headroom well below saturation and priority-based isolation (a minimal admission check is sketched after this list) between:

  • interactive inference,
  • batch agent jobs,
  • verification workloads,
  • and offline evaluation.
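
A minimal admission check under these constraints, with per-class arrival and service rates as hypothetical inputs:

```python
def admit(arrival_rate, service_rate, target_utilization=0.7):
    """Per-class admission check: keep rho = lambda/mu well under 1.
    target_utilization encodes the headroom needed for bursty, heavy-tailed load."""
    rho = arrival_rate / service_rate
    return rho <= target_utilization, rho

# Hypothetical per-class rates (tasks/second); each class has its own queue.
for cls, (lam, mu) in {"interactive": (8, 20), "batch": (15, 18)}.items():
    ok, rho = admit(lam, mu)
    print(f"{cls}: rho={rho:.2f} admit_more={ok}")
```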

1.5.7 Homeostasis as a Safety Mechanism#

Homeostatic control is not only about uptime. It reduces hallucination and unsafe behavior by preventing the system from operating under:

  • context overload,
  • stale evidence,
  • degraded tool reliability,
  • and extreme time pressure without fallback.

1.6 The Competence–Alignment–Control Trilemma in Agentic Systems#

1.6.1 Definitions#

Let:

  • $C$: competence — probability of producing a useful, correct task outcome
  • $A$: alignment — fidelity to operator intent, policy, and normative constraints
  • $K$: control — ability to predict, bound, interrupt, audit, and reverse behavior

A deployable system requires all three. In practice, weakness in any one dimension collapses operational value.

1.6.2 Why It Is a Trilemma#

Increasing competence often requires:

  • broader tool access,
  • larger action spaces,
  • longer planning horizons,
  • richer memory,
  • more adaptive behavior.

These raise the space of possible failure modes and therefore reduce control unless counterbalanced mechanically. Similarly, overly strong control through rigid gating can suppress competence and increase operator burden.

1.6.3 Operational Form#

A rough deployment utility can be represented as

$$D \propto C \cdot A \cdot K$$

This multiplicative framing is useful because near-zero performance on any axis makes the system nonviable regardless of strength on the others.

1.6.4 Failure Examples#

  • High competence, weak alignment
    The system solves the wrong problem efficiently; specification gaming emerges.

  • High competence, weak control
    The system acts effectively but opaquely; operators cannot reliably stop, audit, or attribute behavior.

  • High alignment, weak competence
    The system is safe but operationally irrelevant.

  • High control, weak competence
    The system devolves into a human-operated workflow with AI ornamentation.

1.6.5 Architectural Response#

The trilemma is managed by systems design, not by prompt text alone:

  1. Externalize control from the model
    The orchestrator enforces budgets, tool scope, and commit rights.

  2. Use typed capability boundaries
    Different effects require different guards.

  3. Separate planning from committing
    Candidate plans are low-trust artifacts until verified.

  4. Bind competence to evidence
    High competence must mean high performance under grounding and testing, not high rhetorical fluency.

  5. Make risky actions reversible where possible
    Use staged writes, drafts, branch-based changes, and compensation workflows.

  6. Escalate under unresolved ambiguity
    Alignment cannot be inferred robustly from underspecified tasks without a bounded clarification mechanism.

1.6.6 Strategic Implication#

There is no single-model solution to the trilemma. The stable frontier is achieved by combining:

  • model capability,
  • typed orchestration,
  • verification infrastructure,
  • human governance,
  • and operational feedback loops.

1.7 Formal Verification of Agent Behavioral Contracts#

1.7.1 Why Formal Verification Is Necessary#

An agent that can mutate external systems without enforceable behavioral contracts is not deployable at enterprise scale. Since the model is stochastic and the environment is open-ended, verification must target observable behavior, not inaccessible internal reasoning.

1.7.2 Contract Structure#

A behavioral contract can be defined as

$$\Gamma = \langle P, I, Q, \Sigma, \Lambda \rangle$$

where:

  • $P$: preconditions
  • $I$: invariants
  • $Q$: postconditions
  • $\Sigma$: permissible side-effect classes
  • $\Lambda$: liveness and temporal requirements

An execution trace $\tau$ satisfies the contract iff

$$\Gamma(\tau) = P(\tau_0) \wedge \left(\bigwedge_{t=0}^{|\tau|} I(\tau_{\le t})\right) \wedge \Lambda(\tau) \wedge \big(\text{terminal}(\tau) \Rightarrow Q(\tau)\big)$$

1.7.3 Contract Layers#

  1. Schema contracts
    Inputs and outputs must conform to versioned types.

  2. Authority contracts
    Only allowed tools and scopes may be invoked.

  3. Data contracts
    Provenance, privacy class, retention policy, and field-level sensitivity must be honored.

  4. Process contracts
    Required steps must occur in order, e.g. retrieve before claim, verify before commit.

  5. Postcondition contracts
    The artifact or mutation must satisfy task-specific requirements.

  6. Temporal contracts
    Safety and liveness properties over traces must hold.

1.7.4 Temporal Logic Examples#

Examples of enforceable policies:

  • Approval before mutation:
    $G(\text{mutating\_action} \rightarrow \text{approved})$
  • Verification before commit:
    $G(\text{commit} \rightarrow \text{verified})$
  • Eventual safe termination:
    $G(\text{started} \rightarrow F(\text{completed} \vee \text{failed\_safe} \vee \text{escalated}))$
  • Tool calls must be schema-valid:
    $G(\text{tool\_call} \rightarrow \text{schema\_valid})$
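
Such safety properties can be checked mechanically over the event trace. A simplified monitor for the first two policies, assuming the orchestrator emits typed events into an event store (event names are illustrative, and approval/verification are treated as sticky flags for brevity):

```python
def check_trace(trace):
    """Trace-level safety monitor for G(mutating_action -> approved) and
    G(commit -> verified). Events are hypothetical dicts from the event store."""
    approved = verified = False
    violations = []
    for i, ev in enumerate(trace):
        approved = approved or ev["type"] == "approval"
        verified = verified or ev["type"] == "verification_passed"
        if ev["type"] == "mutating_action" and not approved:
            violations.append((i, "mutation_before_approval"))
        if ev["type"] == "commit" and not verified:
            violations.append((i, "commit_before_verification"))
    return violations

trace = [{"type": "tool_call"}, {"type": "verification_passed"},
         {"type": "approval"}, {"type": "mutating_action"}, {"type": "commit"}]
print(check_trace(trace))  # [] -> no violations
```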

1.7.5 Verification Modalities#

A robust stack uses multiple verification modes:

  • Static
    Typed schemas, effect typing, policy compilation, workflow model checking.

  • Dynamic
    Runtime guards, authorization checks, rate limits, timeout classes, output validators.

  • Semantic
    Evidence sufficiency checks, contradiction analysis, regression evaluators, unit/integration tests.

  • Probabilistic
    Calibration curves, confidence thresholds, abstention policies.

1.7.6 Limits of Formal Verification#

Full semantic correctness over open-world natural language tasks is not generally decidable. Therefore the platform should verify:

  • the structure of behavior,
  • the legality of actions,
  • the provenance of claims,
  • and the satisfaction of explicit acceptance tests.

This is sufficient for strong practical assurance when combined with controlled effect surfaces.

1.7.7 Design Rule#

Verify the contracted envelope of behavior, not the model’s hidden cognition. The system boundary is where assurance must live.


1.8 Agentic vs. Workflow Automation: Architectural Decision Boundaries#

1.8.1 Core Distinction#

Workflow automation is appropriate when the action graph is mostly enumerable. Agentic execution is appropriate when the system must reason over open-world ambiguity, long-tail exceptions, or dynamically selected tools and evidence.

1.8.2 Decision Matrix#

| Criterion | Workflow Automation | Agentic Execution | Hybrid Pattern |
| --- | --- | --- | --- |
| Input variability | Low | High | Medium |
| Branching structure | Known | Emergent | Known skeleton, agentic nodes |
| Exception rate | Low | High | Moderate |
| Tool selection | Fixed | Dynamic | Fixed core, dynamic augmentation |
| Correctness basis | Deterministic rules | Evidence + verification | Rules for control, agent for interpretation |
| Latency predictability | High | Lower, variable | High for core path |
| Cost predictability | High | Lower | Controlled |
| Best use case | ETL, BPM, CRUD, compliance steps | Investigations, diagnosis, synthesis, adaptive operations | Enterprise default |

1.8.3 Economic Decision Rule#

Use agentic execution when the expected value of adaptivity exceeds its additional control and failure-handling cost:

$$\mathbb{E}[V_{\text{agent}} - V_{\text{workflow}}] > \Delta C_{\text{controls}} + \Delta R_{\text{residual}}$$

where:

  • $\Delta C_{\text{controls}}$ is the cost of guardrails, observability, and evaluation,
  • $\Delta R_{\text{residual}}$ is the remaining risk after mitigation.

1.8.4 Common Anti-Patterns#

Do not use a free-form agent when:

  • the process is deterministic and stable,
  • the domain is heavily regulated and rules are fully encodable,
  • the cost of a false positive mutation is extreme,
  • or the latency SLO is incompatible with iterative reasoning.

Do not use pure workflows when:

  • the long-tail exception burden dominates engineering cost,
  • the task requires synthesis from heterogeneous evidence,
  • the operator cannot predefine the branching logic economically,
  • or the environment changes faster than workflow maintenance can keep up.

The dominant production pattern is workflow skeleton + agentic interior:

  • workflow handles identity, deadlines, retries, routing, approvals, and commits;
  • agents handle interpretation, retrieval, diagnosis, synthesis, and adaptive subplanning.

This preserves control while exploiting model flexibility where it has comparative advantage.


1.9 The 10-Year Trajectory: From LLM-Centered Agents to Substrate-Independent Cognitive Architectures#

1.9.1 Direction of Travel#

The field is moving away from “prompted model as application” and toward protocol-oriented cognitive systems in which models are replaceable components within a durable execution substrate.

1.9.2 What Will Change#

Over the next decade, the system center of gravity will shift from the frontier model to:

  • typed control planes,
  • persistent structured memory,
  • retrieval and world-state substrates,
  • verifiers and evaluators,
  • tool protocols,
  • and event-sourced execution traces.

The system’s identity will increasingly reside in its contracts, memory, telemetry, and evaluation corpus, not in a specific model vendor.

1.9.3 Staged Evolution#

| Horizon | Dominant Pattern | Bottleneck | Architectural Response |
| --- | --- | --- | --- |
| 0–3 years | LLM-centered agents with retrieval and tools | Hallucination, latency, brittle tool use | Stronger orchestration, typed tools, verifiers |
| 3–7 years | Multi-model agent stacks with persistent memory and specialization | Coordination complexity, evaluation debt | Protocol standardization, replay CI, event sourcing |
| 7–10 years | Substrate-independent cognitive architectures | Specification, governance, long-horizon stability | Formal contracts, simulation, self-maintenance under policy |

1.9.4 Substrate Independence#

A substrate-independent architecture separates cognition into modular layers:

$$\text{Cognitive System} = \text{Control Plane} + \text{Memory Plane} + \text{Retrieval Plane} + \text{Tool Plane} + \text{Model Portfolio}$$

The model portfolio becomes hot-swappable by task, latency tier, cost class, or jurisdiction. This has significant implications:

  • vendor portability,
  • cost optimization,
  • regulatory flexibility,
  • resilience to model regressions,
  • and performance specialization.

1.9.5 Likely Technical Shifts#

  1. From monolithic reasoning to model ensembles
    Separate models for planning, retrieval reformulation, code reasoning, verification, and summarization.

  2. From raw prompt history to compiled context artifacts
    Context assembly becomes a deterministic systems function.

  3. From vector-only retrieval to evidence graphs
    Lineage, authority, freshness, and usage patterns become ranking features.

  4. From session chat memory to policy-governed memory hierarchies
    Durable memory requires validation, provenance, expiry, and deduplication.

  5. From human review after failure to continuous evaluation before deployment
    Failed traces become replay suites and CI gates.

  6. From model-centric trust to system-centric trust
    Assurance migrates from “the model seems good” to measurable behavioral guarantees.

1.9.6 Long-Term Limitation#

The hardest unsolved problem is not next-token quality. It is specification robustness under open-world action: ensuring that increasingly competent systems remain aligned and controllable when objectives are incomplete, shifting, or strategically exploitable.


1.10 Reference Architecture Overview: The Complete Agentic Execution Stack#

1.10.1 Architectural Principle#

The platform must be designed as a typed protocol stack, not prompt glue. All boundaries expose explicit schemas, capability discovery, deadlines, pagination where applicable, typed error classes, and versioned contracts.

1.10.2 Layered Stack#

| Layer | Primary Function | Protocol / Artifact | Key Controls |
| --- | --- | --- | --- |
| L1. User/Application Boundary | Ingress, request lifecycle, async job control | JSON-RPC | AuthN/Z, deadlines, idempotency keys, error classes, versioning |
| L2. Task Gateway | Request normalization, SLA assignment, tenancy isolation | Typed task spec | Admission control, rate limits, priority queues |
| L3. Orchestrator | Plan/decompose/route/track execution | Internal state machine | Bounded recursion, leases, rollback, compensation |
| L4. Context Compiler | Deterministic prefill assembly | Compiled context artifact | Token budgets, context hygiene, reproducibility digests |
| L5. Retrieval Engine | Evidence gathering and ranking | Hybrid indexed evidence packets | Provenance, freshness, authority, latency budgets |
| L6. Memory Services | Working/session/episodic/semantic/procedural memory | Typed memory records | Validation, deduplication, TTL, provenance |
| L7. Tool Fabric | Execution against external/internal capabilities | MCP, gRPC, JSON-RPC adapters | Least privilege, lazy loading, traceability |
| L8. Verification Plane | Schema/policy/evidence/test checks | Validators, rule engine, test harness | Block-on-fail, contradiction detection, calibration |
| L9. Commit Plane | Staged writes and irreversible mutations | Transactional or saga protocols | Approval gates, idempotency, compensation |
| L10. Observability / Eval Plane | Logs, metrics, traces, replay CI | Event store, trace schema, benchmark corpus | Regression gating, drift detection, cost analytics |

The architecture is intentionally layered so that each concern can be validated, replaced, and scaled independently.


1.10.3 Boundary Protocols and Contracts#

JSON-RPC at the User/Application Boundary#

Use JSON-RPC for:

  • synchronous request/response,
  • asynchronous job creation,
  • result polling,
  • cancellation,
  • trace retrieval,
  • and capability discovery.

Required request fields:

  • request_id
  • idempotency_key
  • deadline
  • tenant_id
  • task_type
  • schema_version
  • priority_class
  • authorization_context

Required response fields:

  • terminal status or job handle,
  • structured result or typed error,
  • provenance handle,
  • trace reference,
  • budget consumption summary.
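
These field lists translate directly into payloads. A hypothetical JSON-RPC 2.0 exchange, sketched as Python dicts; the method name and all field values are illustrative, not a fixed API:

```python
# Hypothetical request: submit an asynchronous agent task.
request = {
    "jsonrpc": "2.0",
    "id": "req-7f3a",                      # request_id
    "method": "agent.submit_task",
    "params": {
        "idempotency_key": "idem-2c91",    # dedupes retried submissions
        "deadline": "2026-03-19T23:59:00Z",
        "tenant_id": "acme-prod",
        "task_type": "incident_triage",
        "schema_version": "1.4.0",
        "priority_class": "interactive",
        "authorization_context": {"scopes": ["tickets:read", "tickets:draft"]},
        "input": {"ticket_id": "INC-4821"},
    },
}

# Hypothetical response: a job handle plus trace/provenance references.
response = {
    "jsonrpc": "2.0",
    "id": "req-7f3a",
    "result": {
        "status": "accepted",              # terminal status or job handle
        "job_handle": "job-99c0",
        "trace_ref": "trace://acme-prod/job-99c0",
        "provenance_handle": "prov://job-99c0",
        "budget": {"tokens_used": 0, "cost_usd": 0.0},
    },
}
```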

gRPC/Protobuf for Internal Execution#

Use gRPC internally for:

  • low-latency service-to-service calls,
  • strict type contracts,
  • streaming tool output,
  • backpressure-aware internal services,
  • and uniform deadline propagation.

This is the correct substrate for orchestrator-to-retrieval, orchestrator-to-memory, verifier-to-tool, and evaluator-to-trace-store calls.

MCP for Discoverable Tools and Resources#

Use MCP for:

  • tool discovery,
  • resource surfaces,
  • prompt/resource metadata,
  • local and remote tool connectors,
  • change notifications,
  • schema-described input/output affordances.

Tool definitions must be lazily loaded into context to avoid token waste. Tools not relevant to the active plan do not belong in the active window.


1.10.4 Deterministic Context Construction#

Prompting should be treated as a compiled runtime artifact, not handwritten prose.

Context Compilation Inputs#

The compiler assembles:

  • role policy,
  • task objective and acceptance criteria,
  • protocol bindings,
  • current execution state,
  • tool affordances,
  • retrieved evidence packets,
  • memory summaries,
  • failure-state summaries,
  • and response contract.

Token Budget Formalization#

Let the model context window be $W$. Allocate:

$$B_{\text{policy}} + B_{\text{objective}} + B_{\text{state}} + B_{\text{tools}} + B_{\text{memory}} + B_{\text{evidence}} + B_{\text{reserve}} \le W$$

where $B_{\text{reserve}}$ is preserved for actual inference and structured output generation. The compiler must never consume the full window on prefilling.

Compiler Requirements#

  • deterministic section ordering,
  • duplicate elimination,
  • stale-history compression,
  • provenance-preserving evidence compression,
  • explicit omission of irrelevant history,
  • versioned compiler policies,
  • stable context digests for traceability and cache reuse.

Pseudo-Algorithm 2 — Context Compilation#

Inputs: task state $\sigma$, model window $W$, latency budget $L$

  1. Load role and policy contract versions.
  2. Normalize objective into a machine-checkable task schema.
  3. Reserve generation capacity $B_{\text{reserve}}$.
  4. Include only the current execution state summary, not raw full history.
  5. Discover relevant tool affordances from MCP metadata; omit inactive tools.
  6. Decompose the task into retrieval intents.
  7. Fetch evidence packets under latency budget $L$.
  8. Fetch memory summaries by utility, recency, and policy eligibility.
  9. Deduplicate semantically equivalent constraints and evidence.
  10. Compress low-signal text; preserve citations, timestamps, and authorities.
  11. Emit a deterministic preamble with a content digest and schema version.
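
A runnable sketch of the compiler core, combining deterministic ordering, per-section budgets, the generation reserve from the inequality above, and a stable digest. The section names, budget numbers, and 4-characters-per-token truncation rule are all illustrative:

```python
import hashlib

def compile_context(sections, window=128_000, reserve=16_000):
    """Deterministic context compilation: fixed section order, hard per-section
    token budgets, and a generation reserve that is never prefilled."""
    order = ["policy", "objective", "state", "tools", "memory", "evidence"]
    budgets = {"policy": 2_000, "objective": 2_000, "state": 8_000,
               "tools": 6_000, "memory": 10_000, "evidence": 40_000}
    # The budget inequality: prefill sections plus reserve must fit the window.
    assert sum(budgets.values()) + reserve <= window
    parts = []
    for name in order:                                  # deterministic ordering
        text = sections.get(name, "")
        parts.append(f"## {name}\n{text[: budgets[name] * 4]}")  # ~4 chars/token
    artifact = "\n\n".join(parts)
    digest = hashlib.sha256(artifact.encode()).hexdigest()[:16]  # stable digest
    return artifact, digest

artifact, digest = compile_context({"objective": "triage INC-4821",
                                    "evidence": "packet text ..."})
print(digest)  # same inputs -> same digest, enabling traceability and cache reuse
```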

Hallucination Control Implication#

A clean context window reduces hallucination more effectively than verbose instruction piles. Unsupported claims frequently arise from context clutter, stale history, or missing evidence, not only from model weakness.


1.10.5 Retrieval Engine: Deterministic Evidence, Not Ad Hoc RAG#

Retrieval Is a Control Function#

Retrieval should produce evidence packets with provenance, not anonymous text blobs. Each evidence packet should minimally contain:

  • source identifier,
  • authority class,
  • timestamp/freshness,
  • lineage or dependency links,
  • chunk boundaries,
  • extraction method,
  • confidence metadata,
  • and access policy label.

Retrieval Inputs#

A production retrieval layer should combine:

  • exact lexical match,
  • semantic similarity,
  • metadata filters,
  • lineage/graph context,
  • historical usage patterns,
  • human annotations,
  • code-derived enrichment,
  • institutional knowledge bases,
  • validated memory,
  • and live runtime inspection.

Retrieval Scoring#

A practical ranking function is:

$$\text{score}(d,q) = w_x X(d,q) + w_s S(d,q) + w_a A(d) + w_f F(d) + w_l L(d,q) + w_u U(d,q) - w_\tau T(d) - w_c C(d)$$

where:

  • $X$: exact match score
  • $S$: semantic score
  • $A$: authority
  • $F$: freshness
  • $L$: lineage/graph relevance
  • $U$: execution utility
  • $T$: latency penalty
  • $C$: context cost penalty

This is materially better than nearest-neighbor similarity alone.
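
A direct sketch of this ranking function over evidence packets. Feature values are precomputed stubs and the weights are hand-set for illustration; in practice they would be tuned against execution utility:

```python
def score(doc, w):
    """Weighted evidence ranking implementing the formula above; feature
    extractors are stubbed as precomputed fields on the evidence packet."""
    return (w["x"] * doc["exact"] + w["s"] * doc["semantic"]
            + w["a"] * doc["authority"] + w["f"] * doc["freshness"]
            + w["l"] * doc["lineage"] + w["u"] * doc["utility"]
            - w["t"] * doc["latency_penalty"] - w["c"] * doc["context_cost"])

weights = {"x": 1.0, "s": 0.8, "a": 0.6, "f": 0.4, "l": 0.5, "u": 0.7, "t": 0.3, "c": 0.2}
packets = [
    {"id": "runbook#12", "exact": 0.9, "semantic": 0.7, "authority": 0.9,
     "freshness": 0.8, "lineage": 0.6, "utility": 0.7,
     "latency_penalty": 0.1, "context_cost": 0.2},
    {"id": "wiki#88", "exact": 0.2, "semantic": 0.9, "authority": 0.3,
     "freshness": 0.2, "lineage": 0.1, "utility": 0.3,
     "latency_penalty": 0.1, "context_cost": 0.4},
]
ranked = sorted(packets, key=lambda d: score(d, weights), reverse=True)
print([p["id"] for p in ranked])  # authoritative, fresh runbook outranks wiki hit
```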

Query Rewriting and Routing#

The platform should not issue one naive query. It should:

  1. rewrite and expand the user request,
  2. decompose it into subqueries,
  3. route each subquery by source, schema, and latency tier,
  4. merge results with provenance,
  5. and rank for execution utility, not only semantic relevance.

Chunking Strategy by Document Class#

Chunking must be document-aware:

  • Policies / contracts / standards: structural chunking by section and clause
  • Code and repositories: symbol-aware or AST-aware chunking with call graph and ownership enrichment
  • Manuals and architecture docs: hierarchical chunking
  • Tickets / incidents / chats: temporal-semantic chunking
  • SOPs / workflows: agentic chunking around action units and precondition/postcondition boundaries

Hallucination Control Rules#

  • no final factual claim without provenance,
  • no synthesis from anonymous context,
  • abstain if evidence is insufficient or conflicting,
  • run contradiction checks before commit,
  • prefer tool inspection over latent recall when the source of truth is available.

1.10.6 Hard Memory Wall: Working Context vs. Durable Memory#

Memory Layers#

A production memory system must separate:

  1. Working memory
    Current execution scratch state; ephemeral and high-churn.

  2. Session memory
    Short-lived interaction continuity.

  3. Episodic memory
    Validated facts about prior task outcomes and exceptions.

  4. Semantic memory
    Canonical domain knowledge and stable constraints.

  5. Procedural memory
    Validated operating patterns, repair strategies, routing heuristics.

These layers must not collapse into one undifferentiated vector store.

Promotion Policy#

A candidate memory item $m$ should be promoted only if:

$$\text{promote}(m) = 1 \iff \text{validated}(m) \wedge \neg\text{duplicate}(m) \wedge \text{provenanced}(m) \wedge \text{policy\_allowed}(m) \wedge \text{useful}(m)$$

Additional controls:

  • TTL or expiry evaluation,
  • privacy and retention policy checks,
  • novelty threshold,
  • source trust score,
  • conflict detection against canonical memory.

Pseudo-Algorithm 3 — Memory Promotion#

Input: candidate observation or correction $m$

  1. Verify source authenticity and authorization.
  2. Classify memory type: episodic, semantic, procedural, or reject.
  3. Check novelty against existing memory via semantic and exact deduplication.
  4. Require provenance, timestamp, and evidence links.
  5. Reject volatile, speculative, or chain-of-thought-like material.
  6. Apply privacy policy and expiry rules.
  7. Write only after validator approval.
  8. Record lineage from original event to retained memory object.
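
A sketch of the promotion gate from this algorithm, with each conjunct of the promotion policy as a named check; the validator, thresholds, and field names are illustrative:

```python
def promote(candidate, store, validators):
    """Write a candidate memory only if every promotion predicate holds;
    otherwise return the failed predicate names for auditing."""
    checks = {
        "validated":      validators["validate"](candidate),
        "not_duplicate":  candidate["content"] not in {m["content"] for m in store},
        "provenanced":    bool(candidate.get("provenance")),
        "policy_allowed": candidate.get("privacy_class") != "restricted",
        "useful":         candidate.get("utility_score", 0.0) >= 0.5,
    }
    if all(checks.values()):
        # Record lineage from the originating event to the retained object.
        store.append({**candidate, "lineage": candidate["provenance"]})
        return True, []
    return False, [name for name, ok in checks.items() if not ok]

store = []
ok, failed = promote(
    {"content": "deploys freeze on Fridays", "provenance": "incident:INC-4821",
     "privacy_class": "internal", "utility_score": 0.8},
    store, {"validate": lambda m: True})
print(ok, failed, len(store))  # True [] 1
```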

Design Rule#

Never store unverified model assertions as durable truth. Durable memory is a governed knowledge substrate, not a cache of previous guesses.


1.10.7 Orchestration and Multi-Agent Execution#

Bounded Control Loop#

Every execution follows:

$$\text{plan} \rightarrow \text{decompose} \rightarrow \text{retrieve} \rightarrow \text{act} \rightarrow \text{verify} \rightarrow \text{critique} \rightarrow \text{repair} \rightarrow \text{commit}$$

No step may be skipped for high-impact tasks.

Specialized Agents#

Use specialization when it improves correctness or throughput:

  • implementation agent,
  • retrieval agent,
  • verification agent,
  • documentation agent,
  • performance analysis agent,
  • security or policy review agent.

Specialization is justified only if the coordination overhead is lower than the accuracy or latency gain.

Isolation and Lock Discipline#

Parallel agents require:

  • independently claimable work units,
  • task leases or locks,
  • isolated workspaces,
  • merge-safe branches,
  • conflict detection,
  • bounded recursion,
  • deterministic merge rules.

Concurrency is permitted only when overlap risk and merge entropy are mechanically controlled.

Idempotency#

Distributed agent systems should assume at-least-once execution semantics and create the illusion of exactly-once behavior through idempotent mutations. Every effectful operation needs:

  • operation-scoped idempotency key,
  • deduplication store,
  • replay-safe handlers,
  • compensating action strategy when atomicity is impossible.
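
A minimal sketch of the exactly-once illusion, assuming an in-process dict stands in for what would be a durable, shared deduplication store in production:

```python
class IdempotentCommitter:
    """Exactly-once illusion over at-least-once delivery: the dedup store
    remembers completed operation keys, so replayed deliveries return the
    stored result instead of re-executing the mutation."""
    def __init__(self):
        self._dedup = {}  # idempotency_key -> committed result

    def commit(self, key, mutation):
        if key in self._dedup:       # replayed delivery: no second side effect
            return self._dedup[key]
        result = mutation()          # the actual effectful operation
        self._dedup[key] = result    # record before acknowledging upstream
        return result

committer = IdempotentCommitter()
effects = []
for _ in range(3):  # simulate at-least-once redelivery of the same commit
    committer.commit("op-42", lambda: effects.append("wrote") or "done")
print(effects)  # ['wrote'] -> the mutation ran exactly once
```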

1.10.8 Tool Fabric: Typed Infrastructure with Least Privilege#

Tool Exposure Requirements#

Tool servers should expose:

  • capability discovery,
  • versioned schemas,
  • optional structured outputs,
  • pagination,
  • timeout classes,
  • error taxonomy,
  • change notifications,
  • and auditable invocation traces.

Least-Privilege Access#

Mutation-capable tool paths must be:

  • caller-scoped,
  • approval-gated when needed,
  • human-interruptible,
  • policy-bound to explicit effect classes,
  • and never issued broad ambient credentials owned by the agent.

Environment Legibility#

Agents must be able to inspect:

  • logs,
  • metrics,
  • traces,
  • repository metadata,
  • test harnesses,
  • browser or desktop state where applicable,
  • and runtime diagnostic artifacts.

An agent that cannot observe the environment cannot reliably improve or repair it.


1.10.9 Verification, Response Synthesis, and Commit#

Multi-Layer Verification#

Before response synthesis or mutation, enforce:

  1. schema validation,
  2. authorization validation,
  3. provenance sufficiency,
  4. contradiction detection,
  5. domain-specific tests or simulations,
  6. policy checks,
  7. approval checks for effectful actions.

Response Synthesis Rule#

Final responses should be synthesized only from verified state and evidence packets. If evidence is incomplete, the allowed outputs are:

  • abstention,
  • bounded hypothesis clearly marked as uncertain,
  • clarification request,
  • or escalation.

Commit Semantics#

Commits must be:

  • explicit,
  • audited,
  • idempotent,
  • and reversible where possible.

For non-atomic multi-step writes, use saga-style compensation rather than implicit trust in success propagation.


1.10.10 Fault Tolerance, Graceful Degradation, and Load Control#

Reliability Controls#

Production-grade agentic platforms require:

  • rate limits,
  • backpressure,
  • circuit breakers,
  • retry budgets with jitter,
  • queue isolation by workload class,
  • cache hierarchies for retrieval and compiled context artifacts,
  • and workload prioritization.

Graceful Degradation Policy#

Under load or partial outage, degrade in a controlled order:

  1. shed low-priority background reflection and cleanup jobs,
  2. reduce retrieval fan-out,
  3. shrink evidence payload size,
  4. downgrade to cheaper or faster model tiers,
  5. disable nonessential critique passes,
  6. switch effectful actions to draft-only mode,
  7. escalate or abstain rather than hallucinate.

Pseudo-Algorithm 4 — Load-Aware Degradation#

Input: utilization $\rho$, error rate $e$, deadline pressure $\delta$

  1. If $\rho$ or $e$ exceeds the warning threshold:
    • reduce asynchronous evaluator concurrency,
    • enable cache-preferred retrieval,
    • tighten context budgets.
  2. If thresholds continue rising:
    • route to lower-latency models for low-risk tasks,
    • disable optional reflection passes,
    • restrict to read-only operations where feasible.
  3. If critical thresholds are crossed:
    • reject or queue low-priority jobs,
    • preserve interactive and safety-critical classes,
    • force abstain/escalate behavior for under-verified tasks.
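
A sketch mapping these homeostatic signals to the staged responses above; the thresholds and action names are illustrative policy choices, not fixed constants:

```python
def degradation_actions(rho, error_rate, warn=(0.7, 0.02), critical=(0.9, 0.10)):
    """Staged degradation policy: each threshold band adds its responses,
    so pressure produces progressively stronger (but always legible) actions."""
    actions = []
    if rho >= warn[0] or error_rate >= warn[1]:
        actions += ["reduce_evaluator_concurrency", "prefer_cached_retrieval",
                    "tighten_context_budgets"]
    if rho >= (warn[0] + critical[0]) / 2 or error_rate >= critical[1] / 2:
        actions += ["route_low_risk_to_fast_models", "disable_optional_reflection",
                    "prefer_read_only_operations"]
    if rho >= critical[0] or error_rate >= critical[1]:
        actions += ["shed_low_priority_jobs", "protect_interactive_and_safety_classes",
                    "force_abstain_or_escalate_when_underverified"]
    return actions

print(degradation_actions(rho=0.92, error_rate=0.03))  # all three stages trigger
```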

Operational Principle#

The system should fail safe and legible, not optimistically and opaquely.


1.10.11 Observability and Continuous Evaluation#

Observability at Every Boundary#

Capture:

  • structured logs,
  • distributed traces,
  • span-level tool invocations,
  • compiled context digests,
  • token and latency usage,
  • retrieval evidence provenance,
  • verifier outcomes,
  • approval events,
  • commit results,
  • and compensation flows.

Minimum Operational Metrics#

  • task success rate by contract version,
  • grounding rate,
  • verifier pass/fail distribution,
  • human escalation rate,
  • p50/p95 latency,
  • cost per accepted task,
  • tool failure rate,
  • memory promotion acceptance rate,
  • replay regression pass rate,
  • and load-shed frequency.

Feedback-to-Evaluation Pipeline#

Human corrections, failed traces, reviewer comments, and production regressions should be normalized into:

  • replay sets,
  • benchmark tasks,
  • policy tests,
  • adversarial evaluation suites,
  • routing tests,
  • and CI/CD gates.

Capability growth without regression enforcement is not engineering maturity; it is stochastic drift.

Recurring Cleanup Agents#

The platform should include maintenance agents, under policy, to:

  • identify duplicated prompt or policy patterns,
  • remove context slop,
  • retire dead tools,
  • detect stale memory,
  • and propose eval additions from recurring incidents.

These agents improve the substrate, not only individual task outcomes.


Cross-Cutting Nonfunctional Invariants#

| Concern | Required Mechanisms |
| --- | --- |
| Hallucination control | Provenance-tagged evidence, tool grounding, contradiction checks, abstention, evidence-only synthesis |
| Fault tolerance | Retries with jitter, circuit breakers, persisted failure state, sagas, queue isolation |
| Idempotency | Operation keys, dedup stores, replay-safe handlers, staged commits |
| Observability | Traces, logs, metrics, context digests, tool audit trails |
| Latency | Deadlines, adaptive retrieval fan-out, priority queues, model tiering |
| Token efficiency | Deterministic context compilation, lazy tool loading, memory summaries, history compression |
| Cost optimization | Cache hierarchies, routing by task class, selective verification depth, retrieval utility scoring |
| Graceful degradation | Load shedding, draft-only fallback, reduced reflection, abstain/escalate under uncertainty |

Concluding Synthesis#

The decisive shift from predictive model to autonomous cognitive architecture is the shift from generation to governed closed-loop execution. Agentic systems are not defined by conversational fluency, but by the presence of:

  • explicit goals,
  • typed interfaces,
  • bounded control loops,
  • deterministic context construction,
  • provenance-first retrieval,
  • validated memory hierarchies,
  • verifier-mediated tool use,
  • governed commit semantics,
  • and continuous observational feedback.

The enduring architecture pattern is therefore clear: keep stochastic intelligence inside a deterministic, observable, contract-enforced envelope. That is the foundational design law for production-grade agentic AI.