Agentic Notes Library

Chapter 18: Session Architecture — Lifecycle, Isolation, Persistence, and Resumption

March 20, 2026

Preamble#

In production agentic systems, the session is the fundamental unit of continuity. It binds a user's intent to an agent's execution state across time, space, and failure boundaries. Without a formally defined session primitive, agentic systems degrade to stateless request-response handlers—incapable of multi-turn reasoning, resumable execution, collaborative workflows, or any form of durable interaction. This chapter formalizes the session as a first-class architectural primitive with typed state, versioned schemas, deterministic lifecycle transitions, isolation guarantees, persistence strategies, and security invariants. Every construct is specified with the same rigor applied to database transaction managers, distributed consensus protocols, and operating system process models. A session is not a "conversation history blob." It is a managed execution envelope with defined boundaries, serializable state, migration capability, and measurable operational characteristics.


18.1 Session as a First-Class Architectural Primitive#

18.1.1 Definition and Architectural Position#

A session $\mathcal{S}$ is a bounded, stateful execution envelope that encapsulates all context, memory, tool bindings, agent state, and interaction history required to maintain continuity for a logically coherent unit of work.

Formally:

$$\mathcal{S} = \left( \text{id}, \sigma, \Lambda, \Gamma, \mathcal{M}_s, \mathcal{T}_b, \mathcal{H}, \Pi, \Omega \right)$$

where:

| Symbol | Type | Semantics |
|---|---|---|
| $\text{id}$ | SessionID (UUID v7, temporally sortable) | Globally unique, immutable session identifier |
| $\sigma$ | SessionState (typed, versioned) | Current mutable state of the session |
| $\Lambda$ | LifecyclePhase | Current lifecycle phase (enum) |
| $\Gamma$ | IsolationContext | Isolation boundaries and ownership descriptors |
| $\mathcal{M}_s$ | SessionMemory | Session-scoped memory layers |
| $\mathcal{T}_b$ | ToolBindingSet | Bound tool instances with caller-scoped authorization |
| $\mathcal{H}$ | InteractionHistory | Ordered turn-level interaction log |
| $\Pi$ | PersistencePolicy | Checkpointing, WAL, and expiry configuration |
| $\Omega$ | SessionMetadata | Creation time, owner, TTL, tags, lineage |
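The tuple above can be sketched as a typed record. This is a minimal illustration, not the chapter's implementation: the field types here are simplified stand-ins (dicts, sets, lists) for the typed structures defined later, and `uuid4` substitutes for the UUID v7 the text specifies, since the standard library has no `uuid7`.

```python
from dataclasses import dataclass, field
from enum import Enum
import uuid

class LifecyclePhase(Enum):
    INIT = "INIT"
    ACTIVE = "ACTIVE"
    SUSPENDED = "SUSPENDED"
    RESUMED = "RESUMED"
    COMPLETED = "COMPLETED"
    ARCHIVED = "ARCHIVED"
    TERMINATED = "TERMINATED"

@dataclass
class Session:
    # id: globally unique and immutable (uuid4 as a stand-in for UUID v7)
    id: str = field(default_factory=lambda: str(uuid.uuid4()))
    state: dict = field(default_factory=dict)                # sigma
    phase: LifecyclePhase = LifecyclePhase.INIT              # Lambda
    isolation: dict = field(default_factory=dict)            # Gamma
    memory: dict = field(default_factory=dict)               # M_s
    tool_bindings: set = field(default_factory=set)          # T_b
    history: list = field(default_factory=list)              # H
    persistence_policy: dict = field(default_factory=dict)   # Pi
    metadata: dict = field(default_factory=dict)             # Omega
```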

18.1.2 Why Sessions Must Be First-Class#

Sessions are promoted from implicit infrastructure to explicit architectural primitives for the following reasons:

  1. Continuity Under Failure: Without durable session state, a crash or timeout destroys all accumulated context. First-class sessions enable resumption from the last consistent checkpoint.

  2. Isolation Enforcement: Concurrent users, tasks, or agents must not observe or mutate each other's state. First-class sessions provide the isolation boundary analogous to process isolation in operating systems.

  3. Migration and Scaling: Sessions must be movable across nodes, regions, and agent instances without loss of state. This requires serializable, versioned state schemas—impossible with ad hoc in-memory state.

  4. Observability and Auditing: Every session transition, tool invocation, and state mutation must be traceable. First-class sessions provide the natural unit of observation.

  5. Cost Attribution: Token consumption, tool invocation costs, and compute usage are attributed per session, enabling granular billing, budgeting, and resource governance.

18.1.3 Relationship to Adjacent Concepts#

| Concept | Relationship to Session | Key Distinction |
|---|---|---|
| Conversation | A session may contain one or more conversations | Conversation is interaction-level; session is execution-level |
| Agent Loop Execution | An agent loop executes within a session | The loop is a control structure; the session is its state envelope |
| Transaction | A session may contain multiple transactions | Transactions have ACID properties; sessions have lifecycle and continuity semantics |
| Process | Sessions are analogous to OS processes | Sessions are distributed, serializable, and migratable |
| Context Window | The active context window is a view into session state | Context window is bounded by model limits; session state is unbounded but tiered |

18.1.4 Formal Session Invariants#

Every session in the system must satisfy the following invariants at all times:

$$\mathcal{I}_1: \quad \text{id}(\mathcal{S}) \text{ is globally unique and immutable}$$
$$\mathcal{I}_2: \quad \Lambda(\mathcal{S}) \in \{\texttt{INIT}, \texttt{ACTIVE}, \texttt{SUSPENDED}, \texttt{RESUMED}, \texttt{COMPLETED}, \texttt{ARCHIVED}, \texttt{TERMINATED}\}$$
$$\mathcal{I}_3: \quad \forall t: \text{version}(\sigma(t)) = \text{version}(\sigma(t-1)) + 1 \text{ after every state mutation}$$
$$\mathcal{I}_4: \quad |\mathcal{H}| \leq H_{\max} \quad \text{(bounded history with eviction policy)}$$
$$\mathcal{I}_5: \quad \forall \text{mutation } m: \text{provenance}(m) \neq \varnothing$$
$$\mathcal{I}_6: \quad \text{checksum}(\sigma) = \text{SHA256}(\text{canonical}(\sigma)) \text{ is verified on every deserialization}$$
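Invariant $\mathcal{I}_6$ hinges on canonical serialization: the checksum must be stable regardless of in-memory field ordering. A minimal sketch, assuming JSON as the canonical encoding (the chapter's serializer also supports Protobuf and MsgPack):

```python
import hashlib
import json

def canonical(state: dict) -> bytes:
    # Deterministic encoding: sorted keys, fixed separators, no whitespace drift
    return json.dumps(state, sort_keys=True, separators=(",", ":")).encode()

def checksum(state: dict) -> str:
    # checksum(sigma) = SHA256(canonical(sigma))
    return hashlib.sha256(canonical(state)).hexdigest()

def verify_on_deserialize(state: dict, expected: str) -> None:
    # Invariant I6: integrity is verified on every deserialization
    if checksum(state) != expected:
        raise ValueError("integrity violation: checksum mismatch")
```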

18.2 Session Lifecycle: Init → Active → Suspended → Resumed → Completed → Archived#

18.2.1 Lifecycle as a Finite State Machine#

The session lifecycle is modeled as a deterministic finite state machine:

$$\mathcal{L}_{\text{session}} = (S, s_0, \Sigma, \delta, F)$$

where:

  • $S = \{\texttt{INIT}, \texttt{ACTIVE}, \texttt{SUSPENDED}, \texttt{RESUMED}, \texttt{COMPLETED}, \texttt{ARCHIVED}, \texttt{TERMINATED}\}$
  • $s_0 = \texttt{INIT}$
  • $\Sigma$ = set of lifecycle events
  • $\delta: S \times \Sigma \rightarrow S$ = transition function
  • $F = \{\texttt{COMPLETED}, \texttt{ARCHIVED}, \texttt{TERMINATED}\}$ = terminal states

18.2.2 Transition Function#

The complete transition table:

| Current State | Event | Next State | Guard Condition | Side Effect |
|---|---|---|---|---|
| INIT | activate | ACTIVE | State schema validated, tools bound | Emit session.activated trace |
| ACTIVE | suspend | SUSPENDED | Checkpoint persisted | Flush working memory, release tool locks |
| ACTIVE | complete | COMPLETED | All exit criteria met | Final checkpoint, provenance sealed |
| ACTIVE | terminate | TERMINATED | Operator command or unrecoverable error | Compensating actions, error record |
| ACTIVE | timeout | SUSPENDED | TTL or idle timeout exceeded | Auto-checkpoint, release resources |
| SUSPENDED | resume | RESUMED | Checkpoint available, resources acquired | Rehydrate context, rebind tools |
| SUSPENDED | expire | TERMINATED | Expiry TTL exceeded | Cleanup, archive state |
| RESUMED | reactivate | ACTIVE | State consistency verified | Resume from checkpoint |
| COMPLETED | archive | ARCHIVED | Retention policy evaluated | Move to cold storage |
| TERMINATED | archive | ARCHIVED | Cleanup complete | Move to cold storage |

Formally:

$$\delta(\texttt{ACTIVE}, \texttt{suspend}) = \texttt{SUSPENDED} \quad \text{iff} \quad \text{Checkpoint}(\mathcal{S}) = \texttt{SUCCESS}$$
$$\delta(\texttt{SUSPENDED}, \texttt{resume}) = \texttt{RESUMED} \quad \text{iff} \quad \text{Rehydrate}(\mathcal{S}) = \texttt{SUCCESS}$$
$$\delta(\texttt{ACTIVE}, \texttt{complete}) = \texttt{COMPLETED} \quad \text{iff} \quad \bigwedge_{g \in \mathcal{G}} g(\mathcal{S}) = \texttt{PASS}$$
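The transition function $\delta$ is a partial function: any (state, event) pair outside the table is rejected. A minimal sketch of the table-driven core (guards and side effects, which the full SessionLifecycleManager below handles, are omitted here):

```python
# (state, event) -> next state, mirroring the transition table
TRANSITIONS = {
    ("INIT", "activate"): "ACTIVE",
    ("ACTIVE", "suspend"): "SUSPENDED",
    ("ACTIVE", "complete"): "COMPLETED",
    ("ACTIVE", "terminate"): "TERMINATED",
    ("ACTIVE", "timeout"): "SUSPENDED",
    ("SUSPENDED", "resume"): "RESUMED",
    ("SUSPENDED", "expire"): "TERMINATED",
    ("RESUMED", "reactivate"): "ACTIVE",
    ("COMPLETED", "archive"): "ARCHIVED",
    ("TERMINATED", "archive"): "ARCHIVED",
}
TERMINAL = {"COMPLETED", "ARCHIVED", "TERMINATED"}

def transition(state: str, event: str) -> str:
    # delta is partial: undefined pairs raise rather than silently no-op
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"invalid transition: {state} --{event}-->")
```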

18.2.3 Lifecycle Duration Accounting#

Define the time spent in each phase:

$$T_{\text{total}}(\mathcal{S}) = T_{\text{init}} + T_{\text{active}} + T_{\text{suspended}} + T_{\text{resumed}} + T_{\text{completing}}$$

The active ratio measures session efficiency:

$$\eta_{\text{active}}(\mathcal{S}) = \frac{T_{\text{active}} + T_{\text{resumed}}}{T_{\text{total}}(\mathcal{S})}$$

Low $\eta_{\text{active}}$ indicates excessive suspension—possibly due to resource contention, slow human approval, or infrastructure latency.

The suspension frequency is:

$$f_{\text{suspend}}(\mathcal{S}) = \frac{|\{t : \Lambda(t) = \texttt{SUSPENDED}\}|}{T_{\text{total}}(\mathcal{S})}$$

High $f_{\text{suspend}}$ triggers investigation into timeout tuning, resource provisioning, or task decomposition quality.
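Both metrics reduce to simple arithmetic over per-phase duration accounting. A sketch, assuming durations are tracked as a phase-name-to-seconds map and suspension episodes are counted separately:

```python
def active_ratio(durations: dict) -> float:
    # eta_active = (T_active + T_resumed) / T_total
    total = sum(durations.values())
    return (durations.get("ACTIVE", 0.0) + durations.get("RESUMED", 0.0)) / total

def suspension_frequency(n_suspensions: int, total_seconds: float) -> float:
    # f_suspend = number of suspension episodes / total session lifetime
    return n_suspensions / total_seconds
```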

18.2.4 Pseudo-Algorithm: Session Lifecycle Manager#

ALGORITHM SessionLifecycleManager
  INPUT:
    session_id: SessionID
    event: LifecycleEvent
    context: SystemContext
 
  OUTPUT:
    new_phase: LifecyclePhase
    side_effects: List<SideEffect>
 
  BEGIN:
    session ← LOAD_SESSION(session_id)
    current_phase ← session.lifecycle_phase
    side_effects ← []
 
    // Validate transition
    IF (current_phase, event) NOT IN TRANSITION_TABLE THEN
      RAISE InvalidTransitionError(current_phase, event)
    END IF
 
    target_phase ← TRANSITION_TABLE[(current_phase, event)]
    guard ← GUARD_TABLE[(current_phase, event)]
 
    // Evaluate guard condition
    IF NOT guard.evaluate(session, context) THEN
      RAISE GuardFailedError(current_phase, event, guard.reason)
    END IF
 
    // Execute pre-transition hooks
    FOR EACH hook IN PRE_TRANSITION_HOOKS[(current_phase, target_phase)] DO
      hook.execute(session, context)
    END FOR
 
    // Phase-specific side effects
    MATCH (current_phase, target_phase):
 
      (INIT, ACTIVE):
        VALIDATE_STATE_SCHEMA(session.state)
        BIND_TOOLS(session, context.tool_registry)
        INITIALIZE_MEMORY_LAYERS(session)
        APPEND side_effects, EmitTrace("session.activated", session_id)
 
      (ACTIVE, SUSPENDED):
        cp ← CREATE_CHECKPOINT(session)
        PERSIST_CHECKPOINT(cp)
        FLUSH_WORKING_MEMORY(session)
        RELEASE_TOOL_LOCKS(session)
        APPEND side_effects, EmitTrace("session.suspended", session_id)
        APPEND side_effects, ReleaseResources(session.resource_claims)
 
      (ACTIVE, COMPLETED):
        VERIFY_EXIT_CRITERIA(session)
        cp ← CREATE_FINAL_CHECKPOINT(session)
        PERSIST_CHECKPOINT(cp)
        SEAL_PROVENANCE(session)
        PROMOTE_EPISODIC_MEMORY(session)
        APPEND side_effects, EmitTrace("session.completed", session_id)
 
      (SUSPENDED, RESUMED):
        ACQUIRE_RESOURCES(session.resource_requirements)
        REHYDRATE_CONTEXT(session)
        REBIND_TOOLS(session, context.tool_registry)
        VERIFY_STATE_CONSISTENCY(session)
        APPEND side_effects, EmitTrace("session.resumed", session_id)
 
      (RESUMED, ACTIVE):
        // Verify rehydration completeness
        ASSERT session.context_integrity_check() = PASS
        APPEND side_effects, EmitTrace("session.reactivated", session_id)
 
      (*, TERMINATED):
        EXECUTE_COMPENSATING_ACTIONS(session)
        PERSIST_FAILURE_STATE(session)
        RELEASE_ALL_RESOURCES(session)
        APPEND side_effects, EmitTrace("session.terminated", session_id)
 
      (*, ARCHIVED):
        MOVE_TO_COLD_STORAGE(session)
        DELETE_HOT_STATE(session)
        APPEND side_effects, EmitTrace("session.archived", session_id)
 
    // Update lifecycle phase
    session.lifecycle_phase ← target_phase
    session.state.version ← session.state.version + 1
    session.transition_log.append(TransitionRecord(
      from=current_phase,
      to=target_phase,
      event=event,
      timestamp=NOW(),
      actor=context.actor_id
    ))
 
    PERSIST_SESSION_METADATA(session)
 
    // Execute post-transition hooks
    FOR EACH hook IN POST_TRANSITION_HOOKS[(current_phase, target_phase)] DO
      hook.execute(session, context)
    END FOR
 
    RETURN (target_phase, side_effects)
  END

18.3 Session State Schema: Typed, Versioned, Serializable, and Diff-Capable#

18.3.1 State Schema Formalization#

The session state $\sigma$ is a typed record with a versioned schema:

$$\sigma = \left( v, \text{schema\_version}, \text{fields}: \{f_i : \tau_i\}_{i=1}^{n}, \text{checksum}, \text{last\_modified} \right)$$

where:

  • $v \in \mathbb{N}$ is the monotonically increasing state version number
  • $\text{schema\_version}$ follows semantic versioning $\text{MAJOR}.\text{MINOR}.\text{PATCH}$
  • Each field $f_i$ has an associated type $\tau_i$ from the type system
  • $\text{checksum} = \text{SHA256}(\text{canonical\_serialize}(\text{fields}))$

The type system supports:

| Type Class | Examples | Serialization |
|---|---|---|
| Primitive | `Int64`, `Float64`, `String`, `Bool`, `Bytes` | Direct |
| Temporal | `Timestamp`, `Duration`, `TTL` | ISO 8601 / epoch millis |
| Collection | `List<T>`, `Map<K, V>`, `Set<T>` | Ordered JSON arrays / objects |
| Composite | `AgentState`, `PlanSnapshot`, `MemorySummary` | Nested typed records |
| Reference | `Ref<Checkpoint>`, `Ref<EvidenceBundle>` | URI + content hash |
| Optional | `Option<T>` | Nullable with explicit None |

18.3.2 Schema Versioning and Evolution#

Schema evolution must support backward compatibility for session resumption across software upgrades. The rules follow a strict contract:

$$\text{Compatible}(v_{\text{old}}, v_{\text{new}}) \Leftrightarrow \begin{cases} \text{MAJOR}(v_{\text{old}}) = \text{MAJOR}(v_{\text{new}}) \\ \wedge \; \text{MINOR}(v_{\text{new}}) \geq \text{MINOR}(v_{\text{old}}) \end{cases}$$

| Change Type | Version Impact | Migration Requirement |
|---|---|---|
| Add optional field with default | MINOR bump | None — deserializer uses default |
| Add required field | MAJOR bump | Migration function required |
| Remove field | MAJOR bump | Migration function to drop field |
| Change field type | MAJOR bump | Migration function to convert |
| Rename field | MAJOR bump | Migration function to remap |
| Add enum variant | MINOR bump | Old deserializer ignores unknown |

The migration function for version $v_a \rightarrow v_b$ is:

$$\text{migrate}_{v_a \rightarrow v_b}: \sigma_{v_a} \rightarrow \sigma_{v_b}$$

Migration functions are composable:

$$\text{migrate}_{v_a \rightarrow v_c} = \text{migrate}_{v_b \rightarrow v_c} \circ \text{migrate}_{v_a \rightarrow v_b}$$

and stored in a migration registry indexed by $(v_{\text{source}}, v_{\text{target}})$.
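A registry of composable migrations can be sketched as follows. The schema versions, field names, and two example migrations (`priority` default, `owner` → `owner_id` rename) are hypothetical, chosen only to exercise the MAJOR-bump rules from the table above:

```python
# Registry indexed by (v_source, v_target)
MIGRATIONS = {}

def register(src: str, dst: str):
    def deco(fn):
        MIGRATIONS[(src, dst)] = fn
        return fn
    return deco

@register("1.0.0", "2.0.0")
def add_required_field(state: dict) -> dict:
    # MAJOR bump: new required field needs a migration-supplied value
    return {**state, "priority": "normal"}

@register("2.0.0", "3.0.0")
def rename_field(state: dict) -> dict:
    # MAJOR bump: rename remaps the old field to the new name
    out = dict(state)
    out["owner_id"] = out.pop("owner")
    return out

def migrate(state: dict, path: list) -> dict:
    # Composition: migrate a->c = migrate b->c o migrate a->b
    for src, dst in path:
        state = MIGRATIONS[(src, dst)](state)
    return state
```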

18.3.3 State Diff and Merge Operations#

For multi-session coordination (Section 18.10) and migration (Section 18.8), the system must compute structural diffs and merges on session state.

Structural Diff#

Given two state versions $\sigma_a$ and $\sigma_b$:

$$\Delta(\sigma_a, \sigma_b) = \{(f_i, \text{op}_i, v_i^{\text{old}}, v_i^{\text{new}}) : \sigma_a.f_i \neq \sigma_b.f_i\}$$

where $\text{op}_i \in \{\texttt{ADD}, \texttt{MODIFY}, \texttt{DELETE}\}$.

The diff size determines migration cost:

$$\text{DiffCost}(\sigma_a, \sigma_b) = \sum_{(f, \text{op}, \_, \_) \in \Delta} \text{cost}(\text{op}, \text{type}(f))$$

Three-Way Merge#

For cross-session state sharing, a three-way merge uses a common ancestor $\sigma_{\text{base}}$:

$$\text{Merge}(\sigma_{\text{base}}, \sigma_a, \sigma_b) = \begin{cases} \sigma_a.f_i & \text{if } \sigma_a.f_i \neq \sigma_{\text{base}}.f_i \wedge \sigma_b.f_i = \sigma_{\text{base}}.f_i \\ \sigma_b.f_i & \text{if } \sigma_b.f_i \neq \sigma_{\text{base}}.f_i \wedge \sigma_a.f_i = \sigma_{\text{base}}.f_i \\ \sigma_a.f_i & \text{if } \sigma_a.f_i = \sigma_b.f_i \\ \texttt{CONFLICT}(f_i) & \text{if } \sigma_a.f_i \neq \sigma_b.f_i \wedge \sigma_a.f_i \neq \sigma_{\text{base}}.f_i \wedge \sigma_b.f_i \neq \sigma_{\text{base}}.f_i \end{cases}$$

Conflicts are resolved through conflict resolution policies: last-writer-wins, priority-based, or escalation to human review.
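The merge cases translate directly into a field-by-field comparison against the common ancestor. A minimal sketch over flat dict-shaped states (the real state is a nested typed record), with a sentinel marking conflicts for downstream resolution policies:

```python
CONFLICT = object()  # sentinel for fields needing a resolution policy

def three_way_merge(base: dict, a: dict, b: dict) -> dict:
    merged = {}
    for f in set(base) | set(a) | set(b):
        va, vb, vbase = a.get(f), b.get(f), base.get(f)
        if va == vb:
            merged[f] = va          # both sides agree (or neither changed)
        elif vb == vbase:
            merged[f] = va          # only a diverged from base
        elif va == vbase:
            merged[f] = vb          # only b diverged from base
        else:
            merged[f] = CONFLICT    # both diverged, in different directions
    return merged
```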

18.3.4 Pseudo-Algorithm: Versioned State Serialization and Validation#

ALGORITHM SerializeSessionState
  INPUT:
    state: SessionState
    target_format: SerializationFormat   // PROTOBUF | MSGPACK | JSON
 
  OUTPUT:
    serialized: Bytes
    checksum: Hash
 
  BEGIN:
    // Canonical field ordering (deterministic)
    ordered_fields ← SORT(state.fields, key=λf: f.name)
 
    // Type validation
    FOR EACH (field_name, field_value) IN ordered_fields DO
      expected_type ← state.schema.type_of(field_name)
      IF NOT TYPE_CHECK(field_value, expected_type) THEN
        RAISE SchemaViolationError(field_name, expected_type, ACTUAL_TYPE(field_value))
      END IF
    END FOR
 
    // Serialize with canonical ordering
    canonical ← CANONICAL_ENCODE(ordered_fields, target_format)
 
    // Compute integrity checksum
    checksum ← SHA256(canonical)
 
    // Attach version metadata
    envelope ← StateEnvelope(
      schema_version=state.schema_version,
      state_version=state.version,
      checksum=checksum,
      serialized_at=NOW(),
      payload=canonical
    )
 
    serialized ← ENCODE_ENVELOPE(envelope, target_format)
    RETURN (serialized, checksum)
  END
 
 
ALGORITHM DeserializeSessionState
  INPUT:
    serialized: Bytes
    expected_schema_version: SemanticVersion
 
  OUTPUT:
    state: SessionState
 
  BEGIN:
    envelope ← DECODE_ENVELOPE(serialized)
 
    // Schema version compatibility check
    IF NOT COMPATIBLE(envelope.schema_version, expected_schema_version) THEN
      // Attempt migration
      migration_path ← FIND_MIGRATION_PATH(
        envelope.schema_version, expected_schema_version
      )
      IF migration_path IS NULL THEN
        RAISE IncompatibleSchemaError(envelope.schema_version, expected_schema_version)
      END IF
 
      state ← DECODE_PAYLOAD(envelope.payload, envelope.schema_version)
      FOR EACH migration IN migration_path DO
        state ← migration.apply(state)
      END FOR
    ELSE
      state ← DECODE_PAYLOAD(envelope.payload, expected_schema_version)
    END IF
 
    // Integrity verification
    computed_checksum ← SHA256(CANONICAL_ENCODE(state.fields))
    IF computed_checksum ≠ envelope.checksum THEN
      RAISE IntegrityViolationError(
        expected=envelope.checksum,
        computed=computed_checksum
      )
    END IF
 
    RETURN state
  END

18.4 Session Isolation Models: Per-User, Per-Task, Per-Agent, and Nested Sessions#

18.4.1 Isolation as a Correctness Requirement#

Session isolation prevents unintended state interference between concurrent execution contexts. Formally, two sessions $\mathcal{S}_a$ and $\mathcal{S}_b$ are isolated if:

$$\text{Isolated}(\mathcal{S}_a, \mathcal{S}_b) \Leftrightarrow \nexists \; m \in \text{Mutations}(\mathcal{S}_a) : \text{affects}(m, \sigma_b) \;\wedge\; \nexists \; m \in \text{Mutations}(\mathcal{S}_b) : \text{affects}(m, \sigma_a)$$

Isolation is enforced through namespaced state, scoped tool authorizations, and memory partitioning.

18.4.2 Isolation Models#

Per-User Isolation#

The broadest isolation boundary. Each user $u$ has a set of sessions:

$$\mathcal{S}_u = \{\mathcal{S} : \Gamma(\mathcal{S}).\text{owner} = u\}$$

State visibility: Session $\mathcal{S}_a$ can only access state of $\mathcal{S}_b$ if $\Gamma(\mathcal{S}_a).\text{owner} = \Gamma(\mathcal{S}_b).\text{owner}$ and an explicit sharing policy exists.

Per-Task Isolation#

Within a user's sessions, each task receives its own session:

$$\mathcal{S}_{u,\tau} = \{\mathcal{S} : \Gamma(\mathcal{S}).\text{owner} = u \wedge \Gamma(\mathcal{S}).\text{task\_id} = \tau\}$$

This prevents cross-task state contamination—a code generation task does not pollute the context of a research summarization task.

Per-Agent Isolation#

When multiple agents execute within a single session (e.g., a generator agent and a critic agent), each agent receives an isolated workspace:

$$\mathcal{W}_a = (\sigma_a^{\text{local}}, \mathcal{T}_a^{\text{bound}}, \mathcal{M}_a^{\text{working}})$$

The shared session state $\sigma_{\text{shared}}$ is accessed through a controlled interface with read/write permissions:

$$\text{Access}(a, f) = \begin{cases} \texttt{READ\_WRITE} & \text{if } f \in \text{owned\_fields}(a) \\ \texttt{READ\_ONLY} & \text{if } f \in \text{shared\_readable}(a) \\ \texttt{NONE} & \text{otherwise} \end{cases}$$

Nested Sessions#

Complex tasks spawn child sessions that inherit certain properties from the parent:

$$\mathcal{S}_{\text{child}} = \text{Fork}(\mathcal{S}_{\text{parent}}, \tau_{\text{subtask}}, \text{InheritancePolicy})$$

The inheritance policy specifies:

| Property | Inheritance Rule |
|---|---|
| Memory (semantic) | Copy-on-read, isolated writes |
| Memory (episodic) | Read-only access to parent's episodes |
| Tool bindings | Subset of parent's bindings (least privilege) |
| Token budget | Allocated fraction of parent's remaining budget |
| Isolation context | Child inherits owner but gets unique task scope |
| Lifecycle | Child must complete or terminate before parent completes |

The parent-child relationship forms a session tree:

$$\text{SessionTree}(\mathcal{S}_{\text{root}}) = (\mathcal{S}_{\text{root}}, \{(\mathcal{S}_p, \mathcal{S}_c) : \mathcal{S}_c = \text{Fork}(\mathcal{S}_p, \ldots)\})$$

Invariant: A parent session cannot transition to COMPLETED until all child sessions are in terminal states:

$$\Lambda(\mathcal{S}_{\text{parent}}) = \texttt{COMPLETED} \Rightarrow \forall \mathcal{S}_c \in \text{children}(\mathcal{S}_{\text{parent}}): \Lambda(\mathcal{S}_c) \in F$$

18.4.3 Isolation Enforcement Mechanism#

ALGORITHM EnforceSessionIsolation
  INPUT:
    session: Session
    operation: StateOperation        // READ | WRITE | DELETE
    field: FieldPath
    actor: AgentID | UserID
 
  OUTPUT:
    permitted: Bool
 
  BEGIN:
    isolation_ctx ← session.isolation_context
 
    // Determine access level
    MATCH isolation_ctx.model:
      PER_USER:
        IF actor NOT IN isolation_ctx.authorized_users THEN
          AUDIT_LOG("access_denied", actor, session.id, field, operation)
          RETURN FALSE
        END IF
 
      PER_TASK:
        IF actor.current_task ≠ session.task_scope THEN
          AUDIT_LOG("cross_task_access_denied", actor, session.id)
          RETURN FALSE
        END IF
 
      PER_AGENT:
        access ← isolation_ctx.agent_permissions[actor]
        IF operation = WRITE AND access ≠ READ_WRITE THEN
          AUDIT_LOG("agent_write_denied", actor, session.id, field)
          RETURN FALSE
        END IF
        IF operation = READ AND access = NONE THEN
          AUDIT_LOG("agent_read_denied", actor, session.id, field)
          RETURN FALSE
        END IF
 
    // Field-level access control
    field_policy ← session.schema.field_access_policy(field)
    IF operation NOT IN field_policy.allowed_operations(actor.role) THEN
      AUDIT_LOG("field_access_denied", actor, field, operation)
      RETURN FALSE
    END IF
 
    RETURN TRUE
  END

18.4.4 Isolation Strength Hierarchy#

The isolation models form a hierarchy from strongest to weakest:

$$\text{Per-Agent} \subset \text{Per-Task} \subset \text{Per-User} \subset \text{Global}$$

The tighter the isolation, the stronger the correctness guarantee but the higher the coordination cost for cross-boundary communication. The system selects isolation granularity based on the contention risk and shared-state requirements of the workload.


18.5 Session Persistence Strategies: In-Memory, Write-Ahead Log, Database-Backed, and Distributed#

18.5.1 Persistence Tier Classification#

Session state is tiered across persistence layers with different durability, latency, and cost characteristics:

$$\text{PersistenceTier} = \begin{cases} \texttt{L0: IN\_MEMORY} & \text{— volatile, } \sim\mu\text{s latency, lost on crash} \\ \texttt{L1: WAL} & \text{— append-only log, } \sim\text{ms latency, survives crash} \\ \texttt{L2: DATABASE} & \text{— ACID-compliant store, } \sim 10\,\text{ms, survives node loss} \\ \texttt{L3: DISTRIBUTED} & \text{— replicated across regions, } \sim 50\,\text{ms, survives region failure} \end{cases}$$

18.5.2 Persistence Strategy Selection#

The optimal persistence tier for a session is determined by a cost-durability objective:

$$\text{Tier}^*(\mathcal{S}) = \arg\min_{t \in \{L0, L1, L2, L3\}} \text{Cost}(t) \quad \text{s.t.} \quad \text{Durability}(t) \geq D_{\text{required}}(\mathcal{S})$$

where $D_{\text{required}}$ is a function of session criticality:

| Session Criticality | Required Durability | Recommended Tier |
|---|---|---|
| Ephemeral exploration | None | L0 (in-memory) |
| Standard interactive | Crash-resilient | L1 (WAL) |
| Business-critical workflow | Node-failure resilient | L2 (database) |
| Cross-region, long-running | Region-failure resilient | L3 (distributed) |
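The constrained argmin above can be sketched as a table lookup. The durability ranks and relative cost units here are illustrative assumptions, not figures from the chapter:

```python
# (tier name, durability rank, relative cost) — illustrative values only
TIERS = [
    ("L0", 0, 1),
    ("L1", 1, 3),
    ("L2", 2, 10),
    ("L3", 3, 50),
]

def select_tier(required_durability: int) -> str:
    # argmin cost subject to durability(t) >= D_required
    eligible = [(cost, name) for name, d, cost in TIERS if d >= required_durability]
    return min(eligible)[1]
```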

18.5.3 Write-Ahead Log (WAL) for Session State#

The WAL provides crash-resilient persistence with minimal write amplification. Every state mutation is appended to the WAL before being applied to the in-memory state:

$$\text{WAL Entry} = \left(\text{seq\_no}, \text{session\_id}, \text{operation}, \text{field}, \text{value\_before}, \text{value\_after}, \text{timestamp}, \text{actor}\right)$$

WAL Compaction#

Over time, the WAL grows without bound. Compaction produces a compressed snapshot:

$$\text{Compact}(\text{WAL}[0:k]) = \text{Snapshot}(\sigma_k) \oplus \text{WAL}[k+1:n]$$

The compacted state is:

$$\sigma_k = \text{Replay}(\sigma_0, \text{WAL}[0:k])$$

Compaction is triggered when:

$$|\text{WAL}| > W_{\text{compact\_threshold}} \quad \vee \quad T_{\text{since\_last\_compact}} > T_{\text{compact\_interval}}$$
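The append-replay-compact cycle can be sketched in a few lines. This simplified log records only field-level `value_after` (the full entry also carries session id, `value_before`, timestamp, and actor):

```python
class WAL:
    """Append-only log of field mutations with snapshot compaction."""

    def __init__(self):
        self.snapshot = {}   # sigma_0: state as of the last compaction
        self.entries = []    # (seq_no, field, value_after)
        self.seq = 0

    def append(self, field: str, value) -> None:
        # The mutation is logged before it is considered applied
        self.seq += 1
        self.entries.append((self.seq, field, value))

    def replay(self) -> dict:
        # sigma_k = Replay(sigma_0, WAL[0:k])
        state = dict(self.snapshot)
        for _, field, value in self.entries:
            state[field] = value
        return state

    def compact(self) -> None:
        # Fold the log prefix into a snapshot and truncate the log
        self.snapshot = self.replay()
        self.entries = []
```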

18.5.4 Database-Backed Persistence#

For L2 persistence, session state is serialized into a database with the following schema:

| Column | Type | Index | Semantics |
|---|---|---|---|
| session_id | UUID | Primary key | Unique session identifier |
| schema_version | SemVer | | State schema version |
| state_version | Int64 | | Monotonic state version |
| state_blob | Bytes | | Serialized state (Protobuf) |
| checksum | Bytes(32) | | SHA-256 integrity hash |
| lifecycle_phase | Enum | Index | Current lifecycle phase |
| owner_id | UUID | Index | Session owner |
| created_at | Timestamp | Index | Creation time |
| last_active_at | Timestamp | Index | Last activity time |
| expires_at | Timestamp | Index | Expiry time (TTL-based) |
| metadata | JSONB | GIN index | Tags, labels, lineage |

Optimistic concurrency control is used for updates:

ALGORITHM PersistSessionToDatabase
  INPUT:
    session: Session
    db: DatabaseConnection
 
  OUTPUT:
    success: Bool
 
  BEGIN:
    (serialized, checksum) ← SERIALIZE_SESSION_STATE(session.state)
    expected_version ← session.state.version
 
    result ← db.execute(
      "UPDATE sessions
       SET state_blob = $1,
           checksum = $2,
           state_version = $3,
           lifecycle_phase = $4,
           last_active_at = $5
       WHERE session_id = $6
         AND state_version = $7",      // Optimistic lock
      params=[serialized, checksum, expected_version + 1,
              session.lifecycle_phase, NOW(),
              session.id, expected_version]
    )
 
    IF result.rows_affected = 0 THEN
      // Concurrent modification detected
      RAISE ConcurrentModificationError(session.id, expected_version)
    END IF
 
    session.state.version ← expected_version + 1
    RETURN TRUE
  END

18.5.5 Distributed Persistence#

For L3, session state is replicated across regions using a consensus protocol. The replication factor $R$ and consistency level $C$ are configurable:

$$\text{Quorum Read}: C_R \geq \left\lfloor \frac{R}{2} \right\rfloor + 1$$
$$\text{Quorum Write}: C_W \geq \left\lfloor \frac{R}{2} \right\rfloor + 1$$
$$\text{Linearizable}: C_R + C_W > R$$

The session system defaults to session-consistent reads (read-your-own-writes) within a session, even when using eventual consistency across the distributed store. This is achieved by routing all reads for a session to the same replica or by attaching a causal timestamp $\tau_{\text{causal}}$ to each operation:

$$\text{Read}(\mathcal{S}, f) \text{ returns } v \Leftrightarrow \text{timestamp}(v) \geq \tau_{\text{causal}}(\mathcal{S})$$

18.5.6 Persistence Cost Model#

The cost of persisting session state over its lifetime:

$$C_{\text{persist}}(\mathcal{S}) = C_{\text{write}} \cdot N_{\text{writes}} + C_{\text{read}} \cdot N_{\text{reads}} + C_{\text{storage}} \cdot |\sigma| \cdot T_{\text{retention}}$$

where $N_{\text{writes}}$ and $N_{\text{reads}}$ are the total write and read operations, $|\sigma|$ is the serialized state size, and $T_{\text{retention}}$ is the retention duration. This cost model directly informs tier selection and compaction frequency.
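As a worked instance of the formula (all cost coefficients are made-up unit prices, not values from the chapter):

```python
def persistence_cost(c_write, n_writes, c_read, n_reads,
                     c_storage, state_bytes, retention_s):
    # C_persist = C_write*N_writes + C_read*N_reads
    #           + C_storage*|sigma|*T_retention
    return (c_write * n_writes
            + c_read * n_reads
            + c_storage * state_bytes * retention_s)

# e.g. 10 writes at 1.0, 5 reads at 1.0, 100 bytes stored for 10 s
# at 0.001 per byte-second -> 10 + 5 + 1 = 16 cost units
```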


18.6 Session Checkpointing: Periodic, Event-Triggered, and Pre-Mutation Snapshots#

18.6.1 Checkpoint Definition#

A checkpoint is a point-in-time consistent snapshot of session state:

$$\text{CP}(k) = \left(\text{session\_id}, k, \text{version}(\sigma), \text{snapshot}(\sigma), \text{checksum}, \text{trigger}, \text{timestamp}\right)$$

where $k$ is the monotonically increasing checkpoint sequence number.

18.6.2 Checkpoint Trigger Strategies#

| Strategy | Trigger Condition | Use Case | Trade-off |
|---|---|---|---|
| Periodic | Every $\Delta T$ seconds | Background consistency | Simple but may miss critical mutations |
| Event-Triggered | On lifecycle transitions, tool invocations, phase changes | Agent loop boundaries | Precise but higher write frequency |
| Pre-Mutation | Before any state-changing operation | Safety-critical workflows | Maximum safety, highest write cost |
| Adaptive | Based on state change rate | General purpose | Balances cost and safety dynamically |

Adaptive Checkpoint Interval#

The adaptive strategy adjusts the checkpoint interval based on the rate of state mutations:

$$\Delta T_{\text{cp}}(k) = \max\left(\Delta T_{\min}, \; \frac{\Delta T_{\text{base}}}{\text{mutation\_rate}(k)}\right)$$

where:

$$\text{mutation\_rate}(k) = \frac{|\{m : t_m \in [t_{k-1}, t_k]\}|}{\Delta t}$$

High mutation rates trigger more frequent checkpoints; quiescent periods relax checkpoint frequency.
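A sketch of the interval computation. Note one added assumption not in the formula: the upper cap `dt_max` bounds the interval for fully idle sessions, where $\Delta T_{\text{base}} / \text{mutation\_rate}$ would otherwise diverge:

```python
def adaptive_interval(dt_min: float, dt_base: float,
                      dt_max: float, mutation_rate: float) -> float:
    # dT_cp = max(dT_min, dT_base / mutation_rate), clamped above by dt_max
    if mutation_rate <= 0:
        return dt_max  # assumption: idle sessions checkpoint at the cap
    return min(dt_max, max(dt_min, dt_base / mutation_rate))
```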

18.6.3 Checkpoint Storage Optimization#

Checkpoints can be stored as full snapshots or incremental deltas:

Full Snapshot#

$$|\text{CP}_{\text{full}}(k)| = |\sigma_k|$$

Incremental Delta#

$$|\text{CP}_{\text{delta}}(k)| = |\Delta(\sigma_{k-1}, \sigma_k)|$$

The space-time trade-off: full snapshots enable $O(1)$ restoration but cost $O(|\sigma|)$ per checkpoint. Incremental deltas cost $O(|\Delta|)$ per checkpoint but require replaying $O(k)$ deltas for restoration.

A mixed strategy checkpoints a full snapshot every $F$ checkpoints and incremental deltas in between:

$$\text{Restore}(\sigma, k) = \text{Replay}\left(\text{CP}_{\text{full}}\left(\left\lfloor \frac{k}{F} \right\rfloor \cdot F\right), \; \left\{\text{CP}_{\text{delta}}(j)\right\}_{j=\lfloor k/F \rfloor \cdot F + 1}^{k}\right)$$

Restoration cost:

$$\text{RestoreCost}(k) = |\sigma| + \sum_{j=\lfloor k/F \rfloor \cdot F + 1}^{k} |\Delta_j| \leq |\sigma| + (F - 1) \cdot \max|\Delta|$$
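The mixed restore path is compact enough to sketch directly. The model below is a simplification for illustration: checkpoints live in a dict keyed by sequence number, state is a flat dict, and a delta is a shallow set of field overwrites.

```python
def restore(checkpoints: dict[int, dict], k: int, f: int) -> dict:
    """Restore state at sequence k: load the nearest preceding full
    snapshot (full checkpoints land at multiples of F), then replay
    the incremental deltas after it in ascending order."""
    base_seq = (k // f) * f                              # ⌊k/F⌋ · F
    state = dict(checkpoints[base_seq]["snapshot"])      # copy, don't alias
    for j in range(base_seq + 1, k + 1):
        state.update(checkpoints[j]["delta"])            # replay Δ_j
    return state
```

With $F = 4$, restoring at $k = 2$ reads the full snapshot at sequence 0 and replays deltas 1 and 2; at most $F - 1$ deltas are ever replayed, matching the cost bound above.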

18.6.4 Pseudo-Algorithm: Checkpoint Manager#

ALGORITHM CheckpointManager
  INPUT:
    session: Session
    trigger: CheckpointTrigger
    config: CheckpointConfig
 
  OUTPUT:
    checkpoint_ref: CheckpointRef
 
  BEGIN:
    // Determine checkpoint type
    last_full_seq ← GET_LAST_FULL_CHECKPOINT_SEQ(session.id)
    current_seq ← session.checkpoint_seq + 1
    deltas_since_full ← current_seq - last_full_seq
 
    IF deltas_since_full ≥ config.full_checkpoint_interval
       OR trigger = LIFECYCLE_TRANSITION
       OR trigger = PRE_MUTATION_CRITICAL THEN
      cp_type ← FULL
    ELSE
      cp_type ← INCREMENTAL
    END IF
 
    // Create checkpoint
    MATCH cp_type:
      FULL:
        (serialized, checksum) ← SERIALIZE_SESSION_STATE(session.state)
        checkpoint ← Checkpoint(
          session_id=session.id,
          seq=current_seq,
          type=FULL,
          state_version=session.state.version,
          payload=serialized,
          checksum=checksum,
          trigger=trigger,
          timestamp=NOW()
        )
 
      INCREMENTAL:
        prev_state ← LOAD_PREVIOUS_STATE(session.id, current_seq - 1)
        delta ← COMPUTE_DIFF(prev_state, session.state)
        delta_serialized ← SERIALIZE_DIFF(delta)
        checksum ← SHA256(delta_serialized)
        checkpoint ← Checkpoint(
          session_id=session.id,
          seq=current_seq,
          type=INCREMENTAL,
          state_version=session.state.version,
          payload=delta_serialized,
          base_seq=current_seq - 1,
          checksum=checksum,
          trigger=trigger,
          timestamp=NOW()
        )
 
    // Persist to configured tier
    persistence_tier ← session.persistence_policy.checkpoint_tier
    PERSIST_CHECKPOINT(checkpoint, persistence_tier)
 
    // Update session metadata
    session.checkpoint_seq ← current_seq
    session.last_checkpoint_at ← NOW()
 
    EMIT_METRIC("session.checkpoint", {
      session_id: session.id,
      seq: current_seq,
      type: cp_type,
      size_bytes: SIZE(checkpoint.payload),
      trigger: trigger
    })
 
    RETURN CheckpointRef(session.id, current_seq, checksum)
  END

18.7 Session Resumption: Rehydrating Context, Rebinding Tools, and Restoring Agent State#

18.7.1 Resumption as a Multi-Phase Protocol#

Session resumption is not a single deserialization step. It is a multi-phase protocol that reconstructs the full execution environment from persisted state:

$$\text{Resume}(\mathcal{S}) = \text{Verify}(\text{integrity}) \circ \text{Reconstruct}(\mathcal{M}) \circ \text{Rebind}(\mathcal{T}) \circ \text{Rehydrate}(\text{ctx}) \circ \text{Restore}(\sigma)$$

Composition applies right to left: state restoration runs first, integrity verification last.

Each phase has specific preconditions, postconditions, and failure modes.

18.7.2 Phase 1: State Restoration#

Restore the session state σ\sigma from the most recent checkpoint:

$$\sigma_{\text{restored}} = \text{Replay}\left(\text{CP}_{\text{full}}(k_0), \{\text{CP}_{\text{delta}}(k)\}_{k=k_0+1}^{k_{\text{latest}}}\right)$$

Postcondition: $\text{checksum}(\sigma_{\text{restored}}) = \text{checksum}(\text{CP}(k_{\text{latest}}))$

Failure mode: Checkpoint corruption → fall back to previous full checkpoint.

18.7.3 Phase 2: Context Rehydration#

The active context window at the time of suspension may have contained retrieved evidence, tool outputs, and compressed history that are not part of the serialized state. Rehydration reconstructs this context:

$$\text{ctx}_{\text{rehydrated}} = \text{Compile}\left(\text{role\_policy}, \sigma_{\text{restored}}, \mathcal{M}_s.\text{summary}(), \text{ReRetrieve}(\sigma.\text{retrieval\_queries})\right)$$

Key considerations:

  • Stale evidence: If time has passed since suspension, retrieved evidence may be outdated. The rehydration phase checks freshness scores and optionally re-retrieves:

$$\text{ReRetrieve}(q) = \begin{cases} \text{cached}(q) & \text{if } \text{age}(\text{cached}(q)) < \tau_{\text{fresh}} \\ \text{Retrieve}(q) & \text{otherwise} \end{cases}$$
  • Token budget recalculation: The remaining token budget must be recalculated from the checkpoint:

$$T_{\text{remaining}} = T_{\max} - T_{\text{consumed}}(\text{CP}(k_{\text{latest}}))$$

18.7.4 Phase 3: Tool Rebinding#

Tools may have changed availability, version, or authorization scope since the session was suspended:

$$\mathcal{T}_{\text{rebound}} = \{t \in \mathcal{T}_b : \text{available}(t) \wedge \text{version\_compatible}(t) \wedge \text{authorized}(t, \Gamma)\}$$

For tools that are no longer available, the system applies the tool substitution policy:

$$\text{Substitute}(t) = \begin{cases} t' & \text{if } \exists t' \in \text{ToolRegistry}: \text{compatible}(t, t') \\ \texttt{DEGRADE} & \text{if } t \text{ is optional} \\ \texttt{FAIL\_RESUME} & \text{if } t \text{ is required} \end{cases}$$
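The substitution policy reduces to a small resolver. The `ToolSpec` record and the name-keyed compatibility registry below are illustrative stand-ins for the real tool registry, which would apply a richer `compatible(t, t')` relation:

```python
from dataclasses import dataclass

@dataclass
class ToolSpec:
    name: str
    required: bool

def substitute(spec: ToolSpec, registry: dict[str, str]) -> str:
    """Resolve a missing tool per the substitution policy:
    compatible replacement > DEGRADE (optional) > FAIL_RESUME (required)."""
    replacement = registry.get(spec.name)   # compatible(t, t'), simplified
    if replacement is not None:
        return replacement
    return "FAIL_RESUME" if spec.required else "DEGRADE"
```

For example, a missing required tool with a registered substitute resolves to that substitute; a missing optional tool without one degrades the session rather than failing resumption.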

18.7.5 Phase 4: Memory Reconstruction#

Session memory layers are reconstructed:

| Memory Layer | Resumption Strategy |
| --- | --- |
| Working | Reset (ephemeral by definition) |
| Session | Restored from checkpoint |
| Episodic | Loaded from durable store |
| Semantic | Read from organizational memory (shared, not session-specific) |
| Procedural | Loaded from procedural memory store |

18.7.6 Phase 5: Integrity Verification#

Before the session transitions from RESUMED to ACTIVE, a comprehensive integrity check verifies:

$$\text{IntegrityCheck}(\mathcal{S}) = \bigwedge \begin{cases} \text{checksum}(\sigma_{\text{restored}}) = \text{expected} \\ \text{schema\_version}(\sigma) \text{ is compatible} \\ \mathcal{T}_{\text{rebound}} \supseteq \mathcal{T}_{\text{required}} \\ T_{\text{remaining}} > T_{\text{min\_viable}} \\ \text{plan\_state} \text{ is consistent} \\ \text{no orphaned child sessions} \end{cases}$$

18.7.7 Pseudo-Algorithm: Session Resumption Protocol#

ALGORITHM ResumeSession
  INPUT:
    session_id: SessionID
    checkpoint_store: CheckpointStore
    tool_registry: ToolRegistry
    memory_store: MemoryStore
    config: ResumptionConfig
 
  OUTPUT:
    resumed_session: Session
 
  BEGIN:
    // ─── Phase 1: State Restoration ───
    checkpoints ← checkpoint_store.list(session_id, order=DESC)
    IF checkpoints IS EMPTY THEN
      RAISE NoCheckpointAvailableError(session_id)
    END IF
 
    // Find most recent full checkpoint
    full_cp ← FIND_LATEST(checkpoints, type=FULL)
    IF full_cp IS NULL THEN
      RAISE NoFullCheckpointError(session_id)
    END IF
 
    // Collect incremental deltas after full checkpoint
    deltas ← FILTER(checkpoints,
                     λcp: cp.type = INCREMENTAL AND cp.seq > full_cp.seq)
    SORT deltas BY seq ASC
 
    // Replay, verifying each checkpoint payload against its recorded checksum
    // (a full checkpoint's checksum covers the serialized state; an
    // incremental checkpoint's checksum covers only its own delta payload)
    integrity_ok ← (SHA256(full_cp.payload) = full_cp.checksum)
    IF integrity_ok THEN
      state ← DESERIALIZE_SESSION_STATE(full_cp.payload, config.expected_schema_version)
      FOR EACH delta IN deltas DO
        IF SHA256(delta.payload) ≠ delta.checksum THEN
          integrity_ok ← FALSE
          BREAK
        END IF
        diff ← DESERIALIZE_DIFF(delta.payload)
        state ← APPLY_DIFF(state, diff)
      END FOR
    END IF

    latest_cp ← checkpoints[0]
    IF NOT integrity_ok THEN
      // Attempt fallback to an earlier full checkpoint
      WARN("checkpoint_integrity_failure", session_id, latest_cp.seq)
      IF config.allow_fallback THEN
        RETURN RESUME_FROM_FALLBACK(session_id, full_cp, config)
      ELSE
        RAISE CheckpointIntegrityError(session_id, latest_cp.seq)
      END IF
    END IF
 
    // ─── Phase 2: Context Rehydration ───
    retrieval_queries ← state.pending_retrieval_queries
    evidence ← {}
    FOR EACH query IN retrieval_queries DO
      cached ← RETRIEVAL_CACHE.get(query.hash)
      IF cached IS NOT NULL AND AGE(cached) < config.freshness_threshold THEN
        evidence[query.id] ← cached
      ELSE
        fresh_result ← RETRIEVE(query, deadline=config.retrieval_deadline)
        evidence[query.id] ← fresh_result
        RETRIEVAL_CACHE.put(query.hash, fresh_result, ttl=config.cache_ttl)
      END IF
    END FOR
 
    // Recalculate token budget
    token_budget ← config.T_max - state.tokens_consumed
    IF token_budget < config.T_min_viable THEN
      RAISE InsufficientTokenBudgetError(session_id, token_budget)
    END IF
 
    // ─── Phase 3: Tool Rebinding ───
    required_tools ← state.required_tools
    bound_tools ← {}
    degraded_tools ← []
 
    FOR EACH tool_spec IN required_tools DO
      tool ← tool_registry.resolve(tool_spec.name, tool_spec.version_constraint)
      IF tool IS NOT NULL THEN
        IF AUTHORIZE(tool, state.isolation_context) THEN
          bound_tools[tool_spec.name] ← tool
        ELSE
          IF tool_spec.required THEN
            RAISE ToolAuthorizationFailedError(tool_spec.name)
          ELSE
            APPEND degraded_tools, tool_spec.name
          END IF
        END IF
      ELSE
        substitute ← tool_registry.find_substitute(tool_spec)
        IF substitute IS NOT NULL THEN
          bound_tools[tool_spec.name] ← substitute
          WARN("tool_substituted", tool_spec.name, substitute.name)
        ELSE IF tool_spec.required THEN
          RAISE RequiredToolUnavailableError(tool_spec.name)
        ELSE
          APPEND degraded_tools, tool_spec.name
        END IF
      END IF
    END FOR
 
    // ─── Phase 4: Memory Reconstruction ───
    session_memory ← SessionMemory(
      working=WorkingMemory.fresh(),             // Ephemeral: always reset
      session=memory_store.load_session_memory(session_id),
      episodic=memory_store.load_episodic(session_id),
      semantic=memory_store.load_semantic(state.isolation_context.org_id),
      procedural=memory_store.load_procedural(state.task_type)
    )
 
    // ─── Construct Resumed Session ───
    session ← Session(
      id=session_id,
      state=state,
      lifecycle_phase=RESUMED,
      isolation_context=state.isolation_context,
      memory=session_memory,
      tool_bindings=bound_tools,
      evidence=evidence,
      token_budget=token_budget,
      degraded_capabilities=degraded_tools,
      metadata=state.metadata
    )
 
    // ─── Phase 5: Integrity Verification ───
    integrity ← VERIFY_SESSION_INTEGRITY(session)
    IF NOT integrity.passed THEN
      RAISE SessionIntegrityError(session_id, integrity.failures)
    END IF
 
    EMIT_TRACE("session.resumed", {
      session_id: session_id,
      resumed_from_checkpoint: latest_cp.seq,
      state_version: state.version,
      tools_bound: LEN(bound_tools),
      tools_degraded: LEN(degraded_tools),
      evidence_rehydrated: LEN(evidence),
      token_budget_remaining: token_budget
    })
 
    RETURN session
  END

18.8 Session Migration: Moving Sessions Across Nodes, Regions, and Agent Instances#

18.8.1 Migration Motivation and Scenarios#

Sessions must be migratable across infrastructure boundaries for:

| Scenario | Trigger | Constraint |
| --- | --- | --- |
| Node failover | Node crash or scheduled maintenance | Minimize downtime; resume on healthy node |
| Load balancing | Uneven resource utilization | Minimize migration latency |
| Region transfer | User relocates; data residency requirements | Comply with data sovereignty regulations |
| Agent upgrade | New agent version deployed | Maintain state continuity across versions |
| Horizontal scaling | Workload spike | Distribute sessions across expanded capacity |

18.8.2 Migration Protocol#

Session migration is a three-phase commit protocol:

Phase 1: Prepare

$$\text{Source} \xrightarrow{\text{PREPARE}(\text{session\_id},\ \text{target})} \text{Target}$$
  • Source suspends the session (creates checkpoint).
  • Source serializes complete session state, including metadata, checkpoints, and WAL tail.
  • Source notifies target of incoming migration with state size and resource requirements.
  • Target verifies capacity, schema compatibility, and tool availability.

Phase 2: Transfer

$$\text{Source} \xrightarrow{\text{TRANSFER}(\text{state\_bundle})} \text{Target}$$
  • State bundle is transferred over an encrypted, authenticated channel.
  • Target deserializes, validates checksum, and performs schema migration if necessary.
  • Target binds tools and verifies integrity.

Phase 3: Commit

$$\text{Source} \xrightarrow{\text{COMMIT\_MIGRATION}} \text{Target}$$
  • Target acknowledges successful restoration.
  • Source marks the session as migrated and releases all local resources.
  • Routing table is updated to direct future requests to the target.
  • If acknowledgment times out, the source retains the session (migration aborted).

18.8.3 Migration Latency Model#

The total migration latency is:

$$L_{\text{migrate}} = L_{\text{suspend}} + L_{\text{serialize}} + \frac{|\sigma_{\text{bundle}}|}{B_{\text{network}}} + L_{\text{deserialize}} + L_{\text{rebind}} + L_{\text{verify}}$$

where $B_{\text{network}}$ is the available network bandwidth. For large session states, the transfer term dominates. Optimization strategies include:

  • Incremental migration: Transfer only the delta since the last checkpoint already present on the target.
  • Compression: Apply LZ4 or Zstandard compression to the state bundle:

$$|\sigma_{\text{compressed}}| = |\sigma_{\text{bundle}}| \cdot (1 - \rho), \quad \rho \in [0.3, 0.8]$$
  • Pre-staging: Begin transfer of large, stable state components (e.g., episodic memory) before the migration is committed.
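Plugging the compression model into the latency formula gives a back-of-the-envelope estimator. All names below are hypothetical; the component latencies default to zero, so only the transfer term is modeled unless the others are supplied:

```python
def migration_latency_ms(
    bundle_bytes: int,
    bandwidth_bytes_per_s: float,
    compression_ratio: float = 0.0,   # ρ ∈ [0.3, 0.8] typical
    suspend_ms: float = 0.0,
    serialize_ms: float = 0.0,
    deserialize_ms: float = 0.0,
    rebind_ms: float = 0.0,
    verify_ms: float = 0.0,
) -> float:
    """Estimate L_migrate; compression shrinks the dominant transfer term."""
    wire_bytes = bundle_bytes * (1 - compression_ratio)
    transfer_ms = wire_bytes / bandwidth_bytes_per_s * 1000
    return (suspend_ms + serialize_ms + transfer_ms
            + deserialize_ms + rebind_ms + verify_ms)
```

A 10 MB bundle over a 1 MB/s link takes roughly 10 s uncompressed; at $\rho = 0.6$ the same transfer drops to about 4 s, which is why compression and incremental migration are the first levers to pull.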

18.8.4 Pseudo-Algorithm: Session Migration#

ALGORITHM MigrateSession
  INPUT:
    session_id: SessionID
    source_node: NodeID
    target_node: NodeID
    config: MigrationConfig
 
  OUTPUT:
    migration_result: MigrationResult
 
  BEGIN:
    // ─── Phase 1: Prepare ───
    session ← LOAD_SESSION(session_id, source_node)
 
    // Suspend session (creates checkpoint)
    LIFECYCLE_TRANSITION(session, SUSPEND)
    checkpoint ← CREATE_CHECKPOINT(session, trigger=MIGRATION)
 
    // Serialize state bundle
    state_bundle ← SerializeStateBundle(
      state=session.state,
      checkpoints=GET_RECENT_CHECKPOINTS(session_id, config.checkpoint_window),
      wal_tail=GET_WAL_TAIL(session_id),
      memory=SERIALIZE_SESSION_MEMORY(session.memory),
      metadata=session.metadata,
      tool_specs=session.tool_bindings.specs()
    )
 
    compressed_bundle ← COMPRESS(state_bundle, algorithm=ZSTD, level=3)
 
    // Verify target capacity
    target_capacity ← RPC(target_node, "check_migration_capacity", {
      state_size: SIZE(compressed_bundle),
      schema_version: session.state.schema_version,
      required_tools: session.tool_bindings.specs()
    })
 
    IF NOT target_capacity.accepted THEN
      LIFECYCLE_TRANSITION(session, RESUME)  // Abort: reactivate on source
      RETURN MigrationResult(success=FALSE, reason=target_capacity.rejection_reason)
    END IF
 
    // ─── Phase 2: Transfer ───
    transfer_start ← NOW()
 
    transfer_result ← RPC(target_node, "receive_session_migration", {
      session_id: session_id,
      bundle: compressed_bundle,
      checksum: SHA256(state_bundle),
      source_node: source_node
    }, deadline=config.transfer_deadline)
 
    transfer_latency ← NOW() - transfer_start
 
    IF NOT transfer_result.success THEN
      LIFECYCLE_TRANSITION(session, RESUME)  // Abort: reactivate on source
      RETURN MigrationResult(success=FALSE, reason=transfer_result.error)
    END IF
 
    // ─── Phase 3: Commit ───
    commit_result ← RPC(target_node, "commit_session_migration", {
      session_id: session_id,
      expected_checksum: SHA256(state_bundle)
    }, deadline=config.commit_deadline)
 
    IF commit_result.success THEN
      // Update routing table
      ROUTING_TABLE.update(session_id, target_node)
 
      // Clean up source
      MARK_SESSION_MIGRATED(session_id, source_node, target_node)
      SCHEDULE_CLEANUP(session_id, source_node, delay=config.cleanup_delay)
 
      EMIT_METRIC("session.migrated", {
        session_id: session_id,
        from: source_node,
        to: target_node,
        bundle_size: SIZE(compressed_bundle),
        transfer_latency_ms: transfer_latency,
        total_latency_ms: NOW() - transfer_start
      })
 
      RETURN MigrationResult(success=TRUE)
    ELSE
      // Commit failed: reactivate on source
      LIFECYCLE_TRANSITION(session, RESUME)
      RETURN MigrationResult(success=FALSE, reason="commit_failed")
    END IF
  END

18.9 Session Timeout and Expiry: Configurable TTL, Grace Periods, and Cleanup Hooks#

18.9.1 Timeout and Expiry Model#

Sessions are subject to multiple time-based constraints:

$$\text{TimeoutConfig}(\mathcal{S}) = \left(\tau_{\text{idle}}, \tau_{\text{absolute}}, \tau_{\text{grace}}, \tau_{\text{expiry}}\right)$$

| Parameter | Semantics | Typical Range |
| --- | --- | --- |
| $\tau_{\text{idle}}$ | Max time without any interaction | 5 min – 24 hrs |
| $\tau_{\text{absolute}}$ | Max total session duration | 1 hr – 30 days |
| $\tau_{\text{grace}}$ | Grace period after timeout before state cleanup | 1 hr – 7 days |
| $\tau_{\text{expiry}}$ | Time after which archived state is permanently deleted | 30 – 365 days |

18.9.2 Timeout State Transitions#

$$\delta(\texttt{ACTIVE}, \texttt{idle\_timeout}) = \texttt{SUSPENDED}$$
$$\delta(\texttt{ACTIVE}, \texttt{absolute\_timeout}) = \texttt{SUSPENDED}$$
$$\delta(\texttt{SUSPENDED}, \texttt{grace\_expired}) = \texttt{TERMINATED}$$
$$\delta(\texttt{TERMINATED}, \texttt{expiry\_reached}) = \texttt{ARCHIVED} \rightarrow \texttt{DELETE}$$

18.9.3 Idle Detection#

Idle time is measured from the last meaningful interaction:

$$t_{\text{idle}}(\mathcal{S}) = t_{\text{now}} - \max\left(t_{\text{last\_user\_input}}, t_{\text{last\_agent\_action}}, t_{\text{last\_tool\_invocation}}\right)$$

A session is idle-timed-out when:

$$t_{\text{idle}}(\mathcal{S}) > \tau_{\text{idle}}(\mathcal{S})$$

Important distinction: Background processing (e.g., async tool results, retrieval updates) resets the idle timer only if configured to do so. Pure heartbeat messages do not reset idle time—only semantically meaningful interactions qualify.
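The idle rule is a one-line predicate over the three interaction timestamps; heartbeats are excluded simply by never updating these inputs. A minimal sketch (names illustrative):

```python
def is_idle_timed_out(
    now: float,
    last_user_input: float,      # updated only by meaningful interactions,
    last_agent_action: float,    # never by heartbeats
    last_tool_invocation: float,
    tau_idle: float,
) -> bool:
    """t_idle = now - max(meaningful timestamps); timed out when > τ_idle."""
    last_meaningful = max(last_user_input, last_agent_action, last_tool_invocation)
    return (now - last_meaningful) > tau_idle
```

The sweep loop in the monitor below applies the same predicate via an index on `last_active_at` rather than recomputing the max per request.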

18.9.4 Cleanup Hooks#

Before a session is terminated or archived, cleanup hooks execute in order:

  1. Flush pending state: Persist any uncommitted working memory.
  2. Release tool locks: Free any claimed resources.
  3. Notify dependent sessions: Alert parent or linked sessions of termination.
  4. Emit analytics: Record final session metrics.
  5. Promote valuable memory: Extract non-obvious insights for long-term memory (with validation).
  6. Delete sensitive data: Scrub PII or credentials from session state.

ALGORITHM SessionTimeoutMonitor
  INPUT:
    sessions: SessionIndex          // Indexed by last_active_at
    config: TimeoutConfig
 
  // Runs as a periodic background process
  BEGIN:
    LOOP EVERY config.sweep_interval DO
 
      // Idle timeout check
      idle_candidates ← sessions.query(
        lifecycle_phase IN {ACTIVE},
        last_active_at < NOW() - config.tau_idle
      )
 
      FOR EACH session IN idle_candidates DO
        EMIT_TRACE("session.idle_timeout", session.id)
        LIFECYCLE_TRANSITION(session, SUSPEND, trigger=IDLE_TIMEOUT)
      END FOR
 
      // Absolute timeout check
      absolute_candidates ← sessions.query(
        lifecycle_phase IN {ACTIVE, SUSPENDED, RESUMED},
        created_at < NOW() - config.tau_absolute
      )
 
      FOR EACH session IN absolute_candidates DO
        IF session.lifecycle_phase = ACTIVE THEN
          LIFECYCLE_TRANSITION(session, SUSPEND, trigger=ABSOLUTE_TIMEOUT)
        END IF
        LIFECYCLE_TRANSITION(session, TERMINATE, trigger=ABSOLUTE_TIMEOUT)
      END FOR
 
      // Grace period expiry
      grace_candidates ← sessions.query(
        lifecycle_phase = SUSPENDED,
        suspended_at < NOW() - config.tau_grace
      )
 
      FOR EACH session IN grace_candidates DO
        EXECUTE_CLEANUP_HOOKS(session)
        LIFECYCLE_TRANSITION(session, TERMINATE, trigger=GRACE_EXPIRED)
      END FOR
 
      // Permanent expiry
      expiry_candidates ← sessions.query(
        lifecycle_phase IN {TERMINATED, ARCHIVED},
        terminated_at < NOW() - config.tau_expiry
      )
 
      FOR EACH session IN expiry_candidates DO
        PERMANENT_DELETE(session)
        EMIT_METRIC("session.permanently_deleted", session.id)
      END FOR
 
    END LOOP
  END

18.9.5 TTL Extension and Renewal#

Sessions can request TTL extension through explicit user interaction or programmatic renewal:

$$\tau_{\text{idle}}^{\text{new}} = \min\left(\tau_{\text{idle}} + \Delta\tau, \; \tau_{\text{idle}}^{\max}\right)$$

Renewal is granted only if:

  • The session owner is authenticated.
  • The total session duration remains within $\tau_{\text{absolute}}$.
  • Resource budgets (tokens, cost) have not been exhausted.

18.10 Multi-Session Coordination: Session Graphs, Context Sharing, and Memory Promotion#

18.10.1 Session Relationship Graph#

Related sessions form a directed graph $\mathcal{G}_{\text{session}} = (V, E)$ where:

  • $V$ = set of sessions
  • $E$ = set of typed edges representing relationships

Edge types:

| Edge Type | Semantics | Data Flow |
| --- | --- | --- |
| PARENT → CHILD | Nested session (fork) | Child inherits parent context |
| PREDECESSOR → SUCCESSOR | Sequential task chain | Successor reads predecessor's outputs |
| PEER ↔ PEER | Collaborative sessions | Shared context subset |
| REFERENCE → REFERENT | Cross-session evidence citation | Read-only access to referent's artifacts |

18.10.2 Cross-Session Context Sharing Protocol#

Sharing context across sessions requires explicit declaration and access control:

$$\text{SharePolicy}(\mathcal{S}_a, \mathcal{S}_b) = \left(\text{fields}: \text{Set}\langle\text{FieldPath}\rangle,\ \text{access}: \text{ReadOnly} \mid \text{ReadWrite},\ \text{sync}: \text{Snapshot} \mid \text{Live}\right)$$

Snapshot Sharing#

The shared state is copied at a point in time. Changes in the source session are not reflected in the consumer:

$$\sigma_{\text{shared}} = \text{Project}(\sigma_a, \text{fields}) \text{ at time } t_{\text{share}}$$

Live Sharing#

The consumer session observes changes in real time through a subscription mechanism:

$$\text{Subscribe}(\mathcal{S}_b, \mathcal{S}_a, \text{fields}) \rightarrow \text{ChangeStream}$$

Live sharing introduces consistency challenges: the consumer may observe intermediate states. The system provides causal consistency by attaching version vectors:

$$\text{VV}(\mathcal{S}_a) = \{(f_i, v_i)\}_{i \in \text{fields}}$$

The consumer reads field $f_i$ only when $\text{VV}(\mathcal{S}_a).v_i \geq \text{VV}_{\text{expected}}.v_i$.

18.10.3 Cross-Session Memory Promotion#

When a session discovers a non-obvious insight or correction that would benefit future sessions, it proposes a memory promotion:

$$\text{Propose}(\text{item}, \text{provenance}, \text{confidence}) \rightarrow \text{PromotionQueue}$$

The promotion queue is processed by a validation pipeline that ensures:

  1. Deduplication: The item does not duplicate existing organizational memory.
  2. Factual verification: The item is supported by evidence.
  3. Generalizability: The item applies beyond the current session.
  4. Expiry policy: The item has a defined TTL.

Only items passing all gates are promoted to organizational (semantic) memory.
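The four gates compose into a single validation predicate. The field names, the evidence-count proxy for factual verification, and the confidence threshold below are all illustrative assumptions, not the chapter's canonical schema:

```python
def validate_promotion(item: dict,
                       existing_keys: set[str],
                       min_confidence: float = 0.8) -> bool:
    """Run the promotion gates in order; every gate must pass."""
    gates = (
        item["key"] not in existing_keys,     # 1. deduplication
        item["evidence_count"] >= 1,          # 2. factual verification (proxy)
        item["generalizable"],                # 3. applies beyond this session
        item.get("ttl_days") is not None,     # 4. defined expiry policy
    )
    return all(gates) and item["confidence"] >= min_confidence
```

In a real pipeline each gate would be a separate, auditable stage (the verification gate in particular calls out to evidence checking), but the all-gates-must-pass semantics is the same.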

18.10.4 Pseudo-Algorithm: Multi-Session Coordinator#

ALGORITHM MultiSessionCoordinator
  INPUT:
    session_graph: SessionGraph
    requesting_session: SessionID
    operation: CrossSessionOperation
 
  OUTPUT:
    result: OperationResult
 
  BEGIN:
    MATCH operation:
 
      FORK(parent_id, subtask, inheritance_policy):
        parent ← LOAD_SESSION(parent_id)
 
        // Validate parent is active
        ASSERT parent.lifecycle_phase = ACTIVE
 
        // Allocate child budget
        child_token_budget ← parent.token_budget * inheritance_policy.budget_fraction
        parent.token_budget ← parent.token_budget - child_token_budget
 
        // Create child session
        child ← CREATE_SESSION(
          owner=parent.owner,
          task=subtask,
          isolation=PER_TASK,
          persistence=parent.persistence_policy,
          token_budget=child_token_budget
        )
 
        // Copy inherited state
        FOR EACH field IN inheritance_policy.inherited_fields DO
          child.state.set(field, DEEP_COPY(parent.state.get(field)))
        END FOR
 
        // Bind tools (subset)
        child.tool_bindings ← FILTER(
          parent.tool_bindings,
          λt: t.name IN inheritance_policy.allowed_tools
        )
 
        // Register relationship
        session_graph.add_edge(parent.id, child.id, PARENT_CHILD)
 
        // Register completion dependency
        parent.pending_children.add(child.id)
 
        RETURN OperationResult(success=TRUE, child_session=child)
 
      SHARE_CONTEXT(source_id, target_id, policy):
        source ← LOAD_SESSION(source_id)
        target ← LOAD_SESSION(target_id)
 
        // Verify authorization
        IF NOT AUTHORIZE_CROSS_SESSION(source, target, policy) THEN
          RAISE CrossSessionAccessDenied(source_id, target_id)
        END IF
 
        MATCH policy.sync:
          SNAPSHOT:
            shared_state ← PROJECT(source.state, policy.fields)
            snapshot ← DEEP_COPY(shared_state)
            target.imported_context[source_id] ← ImportedContext(
              snapshot=snapshot,
              timestamp=NOW(),
              source_version=source.state.version,
              access=policy.access
            )
 
          LIVE:
            subscription ← CREATE_SUBSCRIPTION(
              source=source_id,
              target=target_id,
              fields=policy.fields,
              access=policy.access
            )
            REGISTER_SUBSCRIPTION(subscription)
            target.subscriptions.add(subscription)
 
        session_graph.add_edge(source_id, target_id, CONTEXT_SHARE)
        RETURN OperationResult(success=TRUE)
 
      WAIT_FOR_CHILDREN(parent_id):
        parent ← LOAD_SESSION(parent_id)
        pending ← parent.pending_children
 
        FOR EACH child_id IN pending DO
          child ← LOAD_SESSION(child_id)
          IF child.lifecycle_phase IN TERMINAL_STATES THEN
            // Collect child results
            parent.child_results[child_id] ← child.final_output
            parent.pending_children.remove(child_id)
          END IF
        END FOR
 
        IF parent.pending_children IS EMPTY THEN
          RETURN OperationResult(success=TRUE, all_children_complete=TRUE)
        ELSE
          RETURN OperationResult(success=TRUE, all_children_complete=FALSE,
                                 pending=parent.pending_children)
        END IF
  END

18.11 Session Security: Encryption at Rest and in Transit, Access Control, and Session Hijacking Prevention#

18.11.1 Threat Model#

| Threat | Description | Impact |
| --- | --- | --- |
| Session Hijacking | Attacker obtains session ID and impersonates user | Full access to session state and tools |
| State Tampering | Attacker modifies persisted session state | Corrupted execution, policy violations |
| Eavesdropping | Attacker intercepts session data in transit | Confidentiality breach |
| Replay Attack | Attacker replays captured session interactions | Duplicate state mutations |
| Privilege Escalation | Agent or user accesses resources beyond scope | Unauthorized tool invocations |
| Cross-Session Leakage | Isolation failure leaks state between sessions | Confidentiality and integrity breach |

18.11.2 Encryption Architecture#

At Rest#

All persisted session state is encrypted using AES-256-GCM with per-session keys:

$$\text{Ciphertext}(\sigma) = \text{AES-256-GCM}(\sigma, K_{\text{session}}, \text{nonce})$$

where:

$$K_{\text{session}} = \text{KDF}(K_{\text{master}}, \text{session\_id}, \text{context})$$

  • $K_{\text{master}}$ is stored in a Hardware Security Module (HSM) or key management service.
  • $\text{KDF}$ is HKDF-SHA256.
  • The nonce is derived from the checkpoint sequence number to ensure uniqueness.
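For illustration, the per-session key derivation can be written with a stdlib-only HKDF-SHA256 (RFC 5869). This is a sketch: a real deployment keeps $K_{\text{master}}$ inside the HSM/KMS and performs derivation there, and the AES-256-GCM encryption itself (via a vetted crypto library) is out of scope here.

```python
import hashlib
import hmac

def hkdf_sha256(master_key: bytes, salt: bytes, info: bytes,
                length: int = 32) -> bytes:
    """RFC 5869 HKDF-SHA256. `info` binds the derived key to the
    session (e.g. session_id || context), so each session gets a
    distinct key from the same master key."""
    # Extract: PRK = HMAC-SHA256(salt, IKM)
    prk = hmac.new(salt, master_key, hashlib.sha256).digest()
    # Expand: T(i) = HMAC-SHA256(PRK, T(i-1) || info || i)
    okm, block = b"", b""
    for i in range((length + 31) // 32):
        block = hmac.new(prk, block + info + bytes([i + 1]),
                         hashlib.sha256).digest()
        okm += block
    return okm[:length]
```

Derivation is deterministic for a given `(master_key, salt, info)` triple, and changing the session-binding `info` yields an unrelated key, which is exactly the isolation property the per-session key design relies on.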

In Transit#

All session data traversing network boundaries is protected by mutual TLS (mTLS):

  • gRPC channels use TLS 1.3 with certificate pinning.
  • JSON-RPC endpoints require TLS with client certificate authentication.
  • Inter-node migration uses an additional layer of application-level encryption.

18.11.3 Session Token Security#

Session tokens (used for API authentication) must satisfy:

$$\text{SessionToken} = \text{Sign}\left(\text{session\_id} \| \text{user\_id} \| \text{issued\_at} \| \text{expires\_at} \| \text{scope}, \; K_{\text{signing}}\right)$$

Properties:

| Property | Mechanism |
| --- | --- |
| Unpredictability | 256-bit cryptographically random component |
| Binding | Token is bound to session ID, user ID, and IP (optional) |
| Expiry | Short-lived with refresh mechanism ($\tau_{\text{token}} \leq 15$ min) |
| Rotation | Token is rotated on every sensitive operation |
| Revocation | Server-side revocation list checked on every request |

18.11.4 Access Control Model#

Session access is governed by a role-based access control (RBAC) policy augmented with attribute-based constraints:

$$\text{Permit}(s, a, r) = \text{RBAC}(s.\text{role}, a) \wedge \text{ABAC}(s.\text{attributes}, a.\text{attributes}, r.\text{attributes})$$

where $s$ is the subject (user or agent), $a$ is the action, and $r$ is the resource (session state field or tool).

| Role | Permissions |
| --- | --- |
| Owner | Full read/write, lifecycle control, migration, sharing |
| Agent | Scoped read/write per isolation policy, tool invocation |
| Viewer | Read-only access to session outputs and metrics |
| Auditor | Read-only access to audit records and traces |
| Admin | Session termination, migration, and policy override |

18.11.5 Anti-Hijacking Measures#

ALGORITHM ValidateSessionRequest
  INPUT:
    request: SessionRequest
    session_store: SessionStore
    security_config: SecurityConfig
 
  OUTPUT:
    validated: Bool
 
  BEGIN:
    // Token validation
    token ← request.session_token
    IF NOT VERIFY_SIGNATURE(token, security_config.signing_key) THEN
      AUDIT_LOG("invalid_token_signature", request)
      RETURN FALSE
    END IF
 
    claims ← DECODE_CLAIMS(token)
 
    // Expiry check
    IF claims.expires_at < NOW() THEN
      AUDIT_LOG("expired_token", request)
      RETURN FALSE
    END IF
 
    // Revocation check
    IF REVOCATION_LIST.contains(token.id) THEN
      AUDIT_LOG("revoked_token", request)
      RETURN FALSE
    END IF
 
    // Session binding check
    session ← session_store.load(claims.session_id)
    IF session IS NULL OR session.lifecycle_phase IN TERMINAL_STATES THEN
      AUDIT_LOG("invalid_session", request)
      RETURN FALSE
    END IF
 
    // User binding
    IF claims.user_id ≠ session.owner_id THEN
      AUDIT_LOG("user_session_mismatch", request)
      RETURN FALSE
    END IF
 
    // IP binding (optional)
    IF security_config.enforce_ip_binding THEN
      IF request.source_ip ≠ claims.bound_ip THEN
        AUDIT_LOG("ip_mismatch", request)
        RETURN FALSE
      END IF
    END IF
 
    // Rate limiting
    IF NOT RATE_LIMITER.allow(claims.session_id, request.operation) THEN
      AUDIT_LOG("rate_limited", request)
      RETURN FALSE
    END IF
 
    // Anomaly detection: unusual request patterns
    IF ANOMALY_DETECTOR.is_suspicious(request, session.interaction_history) THEN
      AUDIT_LOG("anomalous_request", request)
      IF security_config.block_anomalous THEN
        RETURN FALSE
      ELSE
        FLAG_FOR_REVIEW(request)
      END IF
    END IF
 
    // Token rotation on sensitive operations
    IF request.operation IN SENSITIVE_OPERATIONS THEN
      new_token ← ROTATE_TOKEN(token, security_config)
      request.response_headers["X-New-Session-Token"] ← new_token
      REVOCATION_LIST.add(token.id, ttl=30s)  // Grace period for in-flight requests
    END IF
 
    RETURN TRUE
  END

18.11.6 Security Invariants#

The session security subsystem must maintain at all times:

$$\mathcal{I}_{\text{sec}_1}: \quad \forall \mathcal{S}: \text{state\_at\_rest}(\mathcal{S}) \text{ is encrypted}$$

$$\mathcal{I}_{\text{sec}_2}: \quad \forall \text{transit}(\mathcal{S}): \text{channel is mTLS-protected}$$

$$\mathcal{I}_{\text{sec}_3}: \quad \forall \text{token}: \text{lifetime}(\text{token}) \leq \tau_{\text{token}}^{\max}$$

$$\mathcal{I}_{\text{sec}_4}: \quad \forall (\mathcal{S}_a, \mathcal{S}_b): \Gamma(\mathcal{S}_a).\text{owner} \neq \Gamma(\mathcal{S}_b).\text{owner} \Rightarrow \text{Isolated}(\mathcal{S}_a, \mathcal{S}_b)$$

$$\mathcal{I}_{\text{sec}_5}: \quad \forall \text{mutation}: \text{authenticated} \wedge \text{authorized} \wedge \text{audited}$$

18.12 Session Analytics: Duration, Turn Count, Tool Usage, Error Rate, and User Satisfaction Correlation#

18.12.1 Session Metrics Taxonomy#

Every session emits a structured set of metrics that enable operational monitoring, capacity planning, quality improvement, and cost optimization.

Temporal Metrics#

$$\mathbf{m}_{\text{temporal}}(\mathcal{S}) = \begin{bmatrix} T_{\text{total}} \\ T_{\text{active}} \\ T_{\text{suspended}} \\ T_{\text{time\_to\_first\_response}} \\ T_{\text{mean\_turn\_latency}} \\ T_{\text{p99\_turn\_latency}} \end{bmatrix}$$

where:

$$T_{\text{mean\_turn\_latency}} = \frac{1}{N_{\text{turns}}} \sum_{i=1}^{N_{\text{turns}}} \Delta t_i$$

$$T_{\text{p99\_turn\_latency}} = \text{Percentile}_{99}\left(\{\Delta t_i\}_{i=1}^{N_{\text{turns}}}\right)$$
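These two statistics are straightforward to compute from the per-turn latencies; a minimal sketch using the nearest-rank percentile definition (other percentile conventions, e.g. linear interpolation, give slightly different p99 values):

```python
import math

def turn_latency_metrics(turn_latencies):
    """Mean and p99 turn latency over a non-empty list of latencies."""
    n = len(turn_latencies)
    mean = sum(turn_latencies) / n
    ordered = sorted(turn_latencies)
    rank = math.ceil(0.99 * n)   # nearest-rank definition of the 99th percentile
    p99 = ordered[rank - 1]
    return mean, p99
```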

Interaction Metrics#

$$\mathbf{m}_{\text{interaction}}(\mathcal{S}) = \begin{bmatrix} N_{\text{turns}} \\ N_{\text{user\_messages}} \\ N_{\text{agent\_messages}} \\ N_{\text{tool\_invocations}} \\ N_{\text{retrieval\_queries}} \\ N_{\text{repair\_cycles}} \end{bmatrix}$$

Resource Consumption Metrics#

$$\mathbf{m}_{\text{resource}}(\mathcal{S}) = \begin{bmatrix} T_{\text{tokens\_input}} \\ T_{\text{tokens\_output}} \\ C_{\text{total\_cost\_usd}} \\ N_{\text{api\_calls}} \\ N_{\text{checkpoints}} \\ |\sigma_{\text{peak\_state\_size}}| \end{bmatrix}$$

Quality Metrics#

$$\mathbf{m}_{\text{quality}}(\mathcal{S}) = \begin{bmatrix} r_{\text{task\_completion}} \\ r_{\text{error\_rate}} \\ r_{\text{repair\_success\_rate}} \\ r_{\text{hallucination\_rate}} \\ r_{\text{verification\_pass\_rate}} \end{bmatrix}$$

where:

$$r_{\text{error\_rate}} = \frac{N_{\text{errors}}}{N_{\text{tool\_invocations}} + N_{\text{generations}}}$$

$$r_{\text{repair\_success\_rate}} = \frac{N_{\text{repairs\_successful}}}{N_{\text{repairs\_attempted}}}$$

18.12.2 User Satisfaction Modeling#

User satisfaction $\hat{U}$ is modeled as a function of observable session metrics:

$$\hat{U}(\mathcal{S}) = \sigma\left(\mathbf{w}^T \cdot \mathbf{f}(\mathcal{S}) + b\right)$$

where $\sigma$ is the sigmoid function, $\mathbf{f}(\mathcal{S})$ is the feature vector, and $\mathbf{w}, b$ are learned parameters calibrated against explicit user feedback (ratings, thumbs up/down, task completion signals).

Feature vector:

$$\mathbf{f}(\mathcal{S}) = \begin{bmatrix} \log(T_{\text{time\_to\_first\_response}} + 1) \\ \log(N_{\text{turns}} + 1) \\ r_{\text{task\_completion}} \\ 1 - r_{\text{error\_rate}} \\ r_{\text{verification\_pass\_rate}} \\ -\log(N_{\text{repair\_cycles}} + 1) \\ \mathbb{1}[\text{session\_completed\_normally}] \\ -\log(C_{\text{total\_cost\_usd}} + 1) \end{bmatrix}$$

The model is retrained periodically on accumulated feedback data. The predicted satisfaction score is used for:

  • Proactive intervention: If $\hat{U}$ drops below a threshold during an active session, the system offers assistance or escalates.
  • Quality regression detection: A decline in aggregate $\hat{U}$ across sessions signals systemic issues.
  • A/B testing: Different agent configurations are compared by their effect on $\hat{U}$.
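The prediction step reduces to building $\mathbf{f}(\mathcal{S})$ from raw metrics and applying the sigmoid. A minimal sketch: the metric and feature names are illustrative, and the weights shown in any real deployment would come from fitting against feedback data, not be hand-set.

```python
import math

def predict_satisfaction(metrics: dict, weights: dict, bias: float) -> float:
    """Sigmoid over the feature vector f(S) defined above."""
    # Build f(S) term by term, matching the feature vector in the text.
    f = {
        "ttfr":        math.log(metrics["time_to_first_response"] + 1),
        "turns":       math.log(metrics["n_turns"] + 1),
        "completion":  metrics["task_completion"],
        "reliability": 1 - metrics["error_rate"],
        "verify":      metrics["verification_pass_rate"],
        "repairs":     -math.log(metrics["n_repair_cycles"] + 1),
        "completed":   1.0 if metrics["completed_normally"] else 0.0,
        "cost":        -math.log(metrics["total_cost_usd"] + 1),
    }
    z = sum(weights[k] * v for k, v in f.items()) + bias  # w^T f(S) + b
    return 1 / (1 + math.exp(-z))                          # sigma(z)
```

The log transforms compress heavy-tailed counts (turns, latency, cost) so a single long session does not dominate the linear term.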

18.12.3 Session Analytics Pipeline#

ALGORITHM ComputeSessionAnalytics
  INPUT:
    session: Session (completed or terminated)
 
  OUTPUT:
    analytics: SessionAnalyticsRecord
 
  BEGIN:
    // ─── Temporal Metrics ───
    T_total ← session.completed_at - session.created_at
    T_active ← SUM(duration(interval) FOR interval IN session.active_intervals)
    T_suspended ← T_total - T_active
    T_ttfr ← session.first_response_at - session.created_at
 
    turn_latencies ← [turn.response_at - turn.request_at
                       FOR turn IN session.interaction_history]
    T_mean_turn ← MEAN(turn_latencies)
    T_p99_turn ← PERCENTILE(turn_latencies, 99)
 
    // ─── Interaction Metrics ───
    N_turns ← LEN(session.interaction_history)
    N_user_msgs ← COUNT(turn FOR turn IN session.interaction_history
                         IF turn.actor = USER)
    N_agent_msgs ← COUNT(turn FOR turn IN session.interaction_history
                          IF turn.actor = AGENT)
    N_tool_invocations ← LEN(session.tool_trace)
    N_retrieval_queries ← LEN(session.retrieval_trace)
    N_repair_cycles ← SUM(loop.repair_count FOR loop IN session.agent_loops)
 
    // ─── Resource Metrics ───
    T_tokens_in ← SUM(turn.input_tokens FOR turn IN session.interaction_history)
    T_tokens_out ← SUM(turn.output_tokens FOR turn IN session.interaction_history)
    C_total ← session.cost_accumulator.total()
    N_api_calls ← session.api_call_counter.total()
    N_checkpoints ← LEN(session.checkpoints)
    peak_state_size ← MAX(SIZE(cp.payload) FOR cp IN session.checkpoints)
 
    // ─── Quality Metrics ───
    r_task_completion ← COMPUTE_TASK_COMPLETION_RATE(session)
    N_errors ← COUNT(inv FOR inv IN session.tool_trace IF inv.is_error)
    r_error_rate ← N_errors / MAX(N_tool_invocations + N_agent_msgs, 1)
    r_repair_success ← COMPUTE_REPAIR_SUCCESS_RATE(session.agent_loops)
    r_hallucination ← COMPUTE_HALLUCINATION_RATE(session.verification_results)
    r_verify_pass ← COMPUTE_VERIFICATION_PASS_RATE(session.verification_results)
 
    // ─── Satisfaction Prediction ───
    feature_vector ← CONSTRUCT_FEATURE_VECTOR(
      T_ttfr, N_turns, r_task_completion, r_error_rate,
      r_verify_pass, N_repair_cycles,
      session.lifecycle_phase = COMPLETED,
      C_total
    )
    predicted_satisfaction ← SATISFACTION_MODEL.predict(feature_vector)
 
    // ─── Tool Usage Breakdown ───
    tool_usage ← {}
    FOR EACH invocation IN session.tool_trace DO
      key ← invocation.tool_name
      IF key NOT IN tool_usage THEN
        tool_usage[key] ← ToolUsageRecord(count=0, errors=0,
                                           total_latency=0, total_cost=0)
      END IF
      tool_usage[key].count ← tool_usage[key].count + 1
      IF invocation.is_error THEN
        tool_usage[key].errors ← tool_usage[key].errors + 1
      END IF
      tool_usage[key].total_latency ← tool_usage[key].total_latency + invocation.latency
      tool_usage[key].total_cost ← tool_usage[key].total_cost + invocation.cost
    END FOR
 
    // ─── Assemble Record ───
    analytics ← SessionAnalyticsRecord(
      session_id=session.id,
      owner_id=session.owner_id,
      task_type=session.task_type,
 
      temporal=TemporalMetrics(T_total, T_active, T_suspended, T_ttfr,
                                T_mean_turn, T_p99_turn),
      interaction=InteractionMetrics(N_turns, N_user_msgs, N_agent_msgs,
                                      N_tool_invocations, N_retrieval_queries,
                                      N_repair_cycles),
      resource=ResourceMetrics(T_tokens_in, T_tokens_out, C_total,
                                N_api_calls, N_checkpoints, peak_state_size),
      quality=QualityMetrics(r_task_completion, r_error_rate, r_repair_success,
                              r_hallucination, r_verify_pass),
      tool_usage=tool_usage,
      predicted_satisfaction=predicted_satisfaction,
      computed_at=NOW()
    )
 
    // Persist to analytics store
    ANALYTICS_STORE.write(analytics)
 
    // Emit to monitoring system
    EMIT_METRICS_BATCH(analytics)
 
    RETURN analytics
  END

18.12.4 Aggregate Analytics and Operational Dashboards#

Individual session analytics are aggregated across dimensions for operational insight:

| Aggregation Dimension | Example Metrics | Purpose |
| --- | --- | --- |
| By User | Mean satisfaction, session count, error rate | User experience monitoring |
| By Task Type | Completion rate, mean duration, repair frequency | Task-specific optimization |
| By Agent Version | Quality scores, latency, cost per session | A/B testing, regression detection |
| By Tool | Error rate, mean latency, invocation count | Tool reliability monitoring |
| By Time Window | Throughput, peak concurrency, cost rate | Capacity planning |
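Each aggregation is a group-by over per-session analytics records followed by a roll-up. A minimal in-memory sketch (record field names are illustrative; a production pipeline would run this in the analytics store's query engine):

```python
from collections import defaultdict
from statistics import mean

def aggregate_by(records, dimension):
    """Group per-session analytics records by one dimension and roll up
    example metrics (session count, mean satisfaction, mean error rate)."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec[dimension]].append(rec)
    return {
        key: {
            "sessions": len(recs),
            "mean_satisfaction": mean(r["predicted_satisfaction"] for r in recs),
            "mean_error_rate": mean(r["error_rate"] for r in recs),
        }
        for key, recs in groups.items()
    }
```

The same function serves every row of the table above by changing `dimension` ("owner_id", "task_type", "agent_version", and so on).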

18.12.5 Anomaly Detection on Session Metrics#

Session metrics are monitored for anomalies using a statistical process control approach:

$$\text{Anomaly}(m_t) \Leftrightarrow |m_t - \bar{m}| > k \cdot s_m$$

where $\bar{m}$ is the rolling mean, $s_m$ is the rolling standard deviation, and $k$ is the sensitivity factor (typically $k = 3$ for $3\sigma$ bounds).
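The rule is a direct translation into code. A minimal sketch over a rolling window of prior observations (the function name is ours; the window would come from the metrics store):

```python
from statistics import mean, stdev

def spc_anomaly(window, current, k=3.0):
    """Statistical process control check: flag `current` when it falls
    outside k rolling standard deviations of the rolling mean."""
    m_bar = mean(window)
    s_m = stdev(window)   # sample standard deviation over the window
    return abs(current - m_bar) > k * s_m
```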

For multivariate anomaly detection across the full metric vector:

$$D_{\text{Mahal}}(\mathbf{m}_t) = \sqrt{(\mathbf{m}_t - \boldsymbol{\mu})^T \Sigma^{-1} (\mathbf{m}_t - \boldsymbol{\mu})}$$

where $\boldsymbol{\mu}$ and $\Sigma$ are the rolling mean vector and covariance matrix. An alert fires when $D_{\text{Mahal}} > D_{\text{threshold}}$.
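The distance itself is a few lines with NumPy. A minimal sketch, solving the linear system rather than inverting $\Sigma$ explicitly for numerical stability (this assumes $\Sigma$ is well-conditioned; degenerate covariance would call for a pseudo-inverse or regularization):

```python
import numpy as np

def mahalanobis(m_t, mu, sigma):
    """D_Mahal(m_t) = sqrt((m_t - mu)^T Sigma^{-1} (m_t - mu))."""
    delta = np.asarray(m_t, dtype=float) - np.asarray(mu, dtype=float)
    # solve(Sigma, delta) computes Sigma^{-1} delta without forming the inverse.
    return float(np.sqrt(delta @ np.linalg.solve(np.asarray(sigma, dtype=float), delta)))
```

With the identity covariance this reduces to the Euclidean distance, which is a useful sanity check.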

18.12.6 Feedback Loop: Analytics to Architecture#

Session analytics feed directly back into architectural decisions:

$$\text{Analytics} \xrightarrow{\text{inform}} \text{Decisions}$$
| Observed Signal | Architectural Response |
| --- | --- |
| High $T_{\text{p99\_turn\_latency}}$ | Increase retrieval cache TTL, pre-warm tool connections |
| High $r_{\text{error\_rate}}$ on a specific tool | Enable circuit breaker, add fallback tool |
| Low $r_{\text{task\_completion}}$ for a task type | Improve decomposition heuristics, add retrieval sources |
| High $N_{\text{repair\_cycles}}$ | Strengthen verification, improve planning prompts |
| Low predicted satisfaction | Trigger proactive user assistance, escalation |
| Excessive $|\sigma_{\text{peak\_state\_size}}|$ | Increase checkpoint compaction frequency, prune history |
| High suspension frequency | Tune timeout parameters, improve resource allocation |

This feedback loop ensures the session architecture evolves mechanically in response to empirical evidence, not intuition.


Synthesis: Session Architecture as a Systems Engineering Discipline#

The session architecture presented in this chapter treats the session not as a convenience abstraction but as a formally specified, cryptographically protected, lifecycle-managed, migratable execution envelope. The key architectural contributions are summarized:

| Architectural Principle | Implementation |
| --- | --- |
| Typed, versioned state | Schema evolution with composable migrations and integrity checksums |
| Deterministic lifecycle | Finite state machine with guarded transitions and cleanup hooks |
| Tiered persistence | L0–L3 persistence with cost-durability optimization |
| Isolation enforcement | Per-user, per-task, per-agent, nested; RBAC + ABAC access control |
| Resumable execution | Multi-phase rehydration: state → context → tools → memory → integrity |
| Migratable sessions | Three-phase commit protocol with incremental transfer optimization |
| Security by design | Encryption at rest/in transit, token rotation, anomaly detection |
| Measurable quality | Comprehensive metrics taxonomy with satisfaction prediction |
| Feedback-driven evolution | Analytics pipeline feeds back into timeout tuning, tool selection, and planning |

A session that cannot survive a crash is a toy. A session that cannot migrate is a monolith. A session that cannot be audited is a liability. A session that cannot be measured cannot be improved. The architecture in this chapter ensures none of these failures are possible.


End of Chapter 18.