Agentic Notes Library

Chapter 18: Session Architecture — Lifecycle, Isolation, Persistence, and Resumption

March 20, 2026

Preamble#

In production agentic systems, the session is the fundamental unit of continuity. It binds a user's intent to an agent's execution state across time, space, and failure boundaries. Without a formally defined session primitive, agentic systems degrade to stateless request-response handlers—incapable of multi-turn reasoning, resumable execution, collaborative workflows, or any form of durable interaction. This chapter formalizes the session as a first-class architectural primitive with typed state, versioned schemas, deterministic lifecycle transitions, isolation guarantees, persistence strategies, and security invariants. Every construct is specified with the same rigor applied to database transaction managers, distributed consensus protocols, and operating system process models. A session is not a "conversation history blob." It is a managed execution envelope with defined boundaries, serializable state, migration capability, and measurable operational characteristics.


18.1 Session as a First-Class Architectural Primitive#

18.1.1 Definition and Architectural Position#

A session $\mathcal{S}$ is a bounded, stateful execution envelope that encapsulates all context, memory, tool bindings, agent state, and interaction history required to maintain continuity for a logically coherent unit of work.

Formally:

$$\mathcal{S} = \left( \text{id}, \sigma, \Lambda, \Gamma, \mathcal{M}_s, \mathcal{T}_b, \mathcal{H}, \Pi, \Omega \right)$$

where:

| Symbol | Type | Semantics |
|---|---|---|
| $\text{id}$ | SessionID (UUID v7, temporally sortable) | Globally unique, immutable session identifier |
| $\sigma$ | SessionState (typed, versioned) | Current mutable state of the session |
| $\Lambda$ | LifecyclePhase | Current lifecycle phase (enum) |
| $\Gamma$ | IsolationContext | Isolation boundaries and ownership descriptors |
| $\mathcal{M}_s$ | SessionMemory | Session-scoped memory layers |
| $\mathcal{T}_b$ | ToolBindingSet | Bound tool instances with caller-scoped authorization |
| $\mathcal{H}$ | InteractionHistory | Ordered turn-level interaction log |
| $\Pi$ | PersistencePolicy | Checkpointing, WAL, and expiry configuration |
| $\Omega$ | SessionMetadata | Creation time, owner, TTL, tags, lineage |
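The tuple above can be sketched as a typed record. This is a minimal illustration, not the chapter's implementation: the field types here are simplified stand-ins (dicts, sets, lists) for the typed structures defined later, and `uuid4` substitutes for the UUID v7 the text specifies, since the standard library has no `uuid7`.

```python
from dataclasses import dataclass, field
from enum import Enum
import uuid

class LifecyclePhase(Enum):
    INIT = "INIT"
    ACTIVE = "ACTIVE"
    SUSPENDED = "SUSPENDED"
    RESUMED = "RESUMED"
    COMPLETED = "COMPLETED"
    ARCHIVED = "ARCHIVED"
    TERMINATED = "TERMINATED"

@dataclass
class Session:
    # id: globally unique and immutable (uuid4 as a stand-in for UUID v7)
    id: str = field(default_factory=lambda: str(uuid.uuid4()))
    state: dict = field(default_factory=dict)                # sigma
    phase: LifecyclePhase = LifecyclePhase.INIT              # Lambda
    isolation: dict = field(default_factory=dict)            # Gamma
    memory: dict = field(default_factory=dict)               # M_s
    tool_bindings: set = field(default_factory=set)          # T_b
    history: list = field(default_factory=list)              # H
    persistence_policy: dict = field(default_factory=dict)   # Pi
    metadata: dict = field(default_factory=dict)             # Omega
```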

18.1.2 Why Sessions Must Be First-Class#

Sessions are promoted from implicit infrastructure to explicit architectural primitives for the following reasons:

  1. Continuity Under Failure: Without durable session state, a crash or timeout destroys all accumulated context. First-class sessions enable resumption from the last consistent checkpoint.

  2. Isolation Enforcement: Concurrent users, tasks, or agents must not observe or mutate each other's state. First-class sessions provide the isolation boundary analogous to process isolation in operating systems.

  3. Migration and Scaling: Sessions must be movable across nodes, regions, and agent instances without loss of state. This requires serializable, versioned state schemas—impossible with ad hoc in-memory state.

  4. Observability and Auditing: Every session transition, tool invocation, and state mutation must be traceable. First-class sessions provide the natural unit of observation.

  5. Cost Attribution: Token consumption, tool invocation costs, and compute usage are attributed per session, enabling granular billing, budgeting, and resource governance.

18.1.3 Relationship to Adjacent Concepts#

| Concept | Relationship to Session | Key Distinction |
|---|---|---|
| Conversation | A session may contain one or more conversations | Conversation is interaction-level; session is execution-level |
| Agent Loop Execution | An agent loop executes within a session | The loop is a control structure; the session is its state envelope |
| Transaction | A session may contain multiple transactions | Transactions have ACID properties; sessions have lifecycle and continuity semantics |
| Process | Sessions are analogous to OS processes | Sessions are distributed, serializable, and migratable |
| Context Window | The active context window is a view into session state | Context window is bounded by model limits; session state is unbounded but tiered |

18.1.4 Formal Session Invariants#

Every session in the system must satisfy the following invariants at all times:

$$\mathcal{I}_1: \quad \text{id}(\mathcal{S}) \text{ is globally unique and immutable}$$
$$\mathcal{I}_2: \quad \Lambda(\mathcal{S}) \in \{\texttt{INIT}, \texttt{ACTIVE}, \texttt{SUSPENDED}, \texttt{RESUMED}, \texttt{COMPLETED}, \texttt{ARCHIVED}, \texttt{TERMINATED}\}$$
$$\mathcal{I}_3: \quad \forall t: \text{version}(\sigma(t)) = \text{version}(\sigma(t-1)) + 1 \text{ after every state mutation}$$
$$\mathcal{I}_4: \quad |\mathcal{H}| \leq H_{\max} \quad \text{(bounded history with eviction policy)}$$
$$\mathcal{I}_5: \quad \forall \text{mutation } m: \text{provenance}(m) \neq \varnothing$$
$$\mathcal{I}_6: \quad \text{checksum}(\sigma) = \text{SHA256}(\text{canonical}(\sigma)) \text{ is verified on every deserialization}$$
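Invariant $\mathcal{I}_6$ hinges on canonical serialization: the checksum must be stable regardless of in-memory field ordering. A minimal sketch, assuming JSON as the canonical encoding (the chapter's serializer also supports Protobuf and MsgPack):

```python
import hashlib
import json

def canonical(state: dict) -> bytes:
    # Deterministic encoding: sorted keys, fixed separators, no whitespace drift
    return json.dumps(state, sort_keys=True, separators=(",", ":")).encode()

def checksum(state: dict) -> str:
    # checksum(sigma) = SHA256(canonical(sigma))
    return hashlib.sha256(canonical(state)).hexdigest()

def verify_on_deserialize(state: dict, expected: str) -> None:
    # Invariant I6: integrity is verified on every deserialization
    if checksum(state) != expected:
        raise ValueError("integrity violation: checksum mismatch")
```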

18.2 Session Lifecycle: Init → Active → Suspended → Resumed → Completed → Archived#

18.2.1 Lifecycle as a Finite State Machine#

The session lifecycle is modeled as a deterministic finite state machine:

$$\mathcal{L}_{\text{session}} = (S, s_0, \Sigma, \delta, F)$$

where:

  • $S = \{\texttt{INIT}, \texttt{ACTIVE}, \texttt{SUSPENDED}, \texttt{RESUMED}, \texttt{COMPLETED}, \texttt{ARCHIVED}, \texttt{TERMINATED}\}$
  • $s_0 = \texttt{INIT}$
  • $\Sigma$ = set of lifecycle events
  • $\delta: S \times \Sigma \rightarrow S$ = transition function
  • $F = \{\texttt{COMPLETED}, \texttt{ARCHIVED}, \texttt{TERMINATED}\}$ = terminal states

18.2.2 Transition Function#

The complete transition table:

| Current State | Event | Next State | Guard Condition | Side Effect |
|---|---|---|---|---|
| INIT | activate | ACTIVE | State schema validated, tools bound | Emit session.activated trace |
| ACTIVE | suspend | SUSPENDED | Checkpoint persisted | Flush working memory, release tool locks |
| ACTIVE | complete | COMPLETED | All exit criteria met | Final checkpoint, provenance sealed |
| ACTIVE | terminate | TERMINATED | Operator command or unrecoverable error | Compensating actions, error record |
| ACTIVE | timeout | SUSPENDED | TTL or idle timeout exceeded | Auto-checkpoint, release resources |
| SUSPENDED | resume | RESUMED | Checkpoint available, resources acquired | Rehydrate context, rebind tools |
| SUSPENDED | expire | TERMINATED | Expiry TTL exceeded | Cleanup, archive state |
| RESUMED | reactivate | ACTIVE | State consistency verified | Resume from checkpoint |
| COMPLETED | archive | ARCHIVED | Retention policy evaluated | Move to cold storage |
| TERMINATED | archive | ARCHIVED | Cleanup complete | Move to cold storage |

Formally:

$$\delta(\texttt{ACTIVE}, \texttt{suspend}) = \texttt{SUSPENDED} \quad \text{iff} \quad \text{Checkpoint}(\mathcal{S}) = \texttt{SUCCESS}$$
$$\delta(\texttt{SUSPENDED}, \texttt{resume}) = \texttt{RESUMED} \quad \text{iff} \quad \text{Rehydrate}(\mathcal{S}) = \texttt{SUCCESS}$$
$$\delta(\texttt{ACTIVE}, \texttt{complete}) = \texttt{COMPLETED} \quad \text{iff} \quad \bigwedge_{g \in \mathcal{G}} g(\mathcal{S}) = \texttt{PASS}$$
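The transition function $\delta$ is a partial function: any (state, event) pair outside the table is rejected. A minimal sketch of the table-driven core (guards and side effects, which the full SessionLifecycleManager below handles, are omitted here):

```python
# (state, event) -> next state, mirroring the transition table
TRANSITIONS = {
    ("INIT", "activate"): "ACTIVE",
    ("ACTIVE", "suspend"): "SUSPENDED",
    ("ACTIVE", "complete"): "COMPLETED",
    ("ACTIVE", "terminate"): "TERMINATED",
    ("ACTIVE", "timeout"): "SUSPENDED",
    ("SUSPENDED", "resume"): "RESUMED",
    ("SUSPENDED", "expire"): "TERMINATED",
    ("RESUMED", "reactivate"): "ACTIVE",
    ("COMPLETED", "archive"): "ARCHIVED",
    ("TERMINATED", "archive"): "ARCHIVED",
}
TERMINAL = {"COMPLETED", "ARCHIVED", "TERMINATED"}

def transition(state: str, event: str) -> str:
    # delta is partial: undefined pairs raise rather than silently no-op
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"invalid transition: {state} --{event}-->")
```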

18.2.3 Lifecycle Duration Accounting#

Define the time spent in each phase:

$$T_{\text{total}}(\mathcal{S}) = T_{\text{init}} + T_{\text{active}} + T_{\text{suspended}} + T_{\text{resumed}} + T_{\text{completing}}$$

The active ratio measures session efficiency:

$$\eta_{\text{active}}(\mathcal{S}) = \frac{T_{\text{active}} + T_{\text{resumed}}}{T_{\text{total}}(\mathcal{S})}$$

Low $\eta_{\text{active}}$ indicates excessive suspension—possibly due to resource contention, slow human approval, or infrastructure latency.

The suspension frequency is:

$$f_{\text{suspend}}(\mathcal{S}) = \frac{|\{t : \Lambda(t) = \texttt{SUSPENDED}\}|}{T_{\text{total}}(\mathcal{S})}$$

High $f_{\text{suspend}}$ triggers investigation into timeout tuning, resource provisioning, or task decomposition quality.
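Both metrics reduce to simple arithmetic over per-phase duration accounting. A sketch, assuming durations are tracked as a phase-name-to-seconds map and suspension episodes are counted separately:

```python
def active_ratio(durations: dict) -> float:
    # eta_active = (T_active + T_resumed) / T_total
    total = sum(durations.values())
    return (durations.get("ACTIVE", 0.0) + durations.get("RESUMED", 0.0)) / total

def suspension_frequency(n_suspensions: int, total_seconds: float) -> float:
    # f_suspend = number of suspension episodes / total session lifetime
    return n_suspensions / total_seconds
```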

18.2.4 Pseudo-Algorithm: Session Lifecycle Manager#

ALGORITHM SessionLifecycleManager
  INPUT:
    session_id: SessionID
    event: LifecycleEvent
    context: SystemContext
 
  OUTPUT:
    new_phase: LifecyclePhase
    side_effects: List<SideEffect>
 
  BEGIN:
    session ← LOAD_SESSION(session_id)
    current_phase ← session.lifecycle_phase
    side_effects ← []
 
    // Validate transition
    IF (current_phase, event) NOT IN TRANSITION_TABLE THEN
      RAISE InvalidTransitionError(current_phase, event)
    END IF
 
    target_phase ← TRANSITION_TABLE[(current_phase, event)]
    guard ← GUARD_TABLE[(current_phase, event)]
 
    // Evaluate guard condition
    IF NOT guard.evaluate(session, context) THEN
      RAISE GuardFailedError(current_phase, event, guard.reason)
    END IF
 
    // Execute pre-transition hooks
    FOR EACH hook IN PRE_TRANSITION_HOOKS[(current_phase, target_phase)] DO
      hook.execute(session, context)
    END FOR
 
    // Phase-specific side effects
    MATCH (current_phase, target_phase):
 
      (INIT, ACTIVE):
        VALIDATE_STATE_SCHEMA(session.state)
        BIND_TOOLS(session, context.tool_registry)
        INITIALIZE_MEMORY_LAYERS(session)
        APPEND side_effects, EmitTrace("session.activated", session_id)
 
      (ACTIVE, SUSPENDED):
        cp ← CREATE_CHECKPOINT(session)
        PERSIST_CHECKPOINT(cp)
        FLUSH_WORKING_MEMORY(session)
        RELEASE_TOOL_LOCKS(session)
        APPEND side_effects, EmitTrace("session.suspended", session_id)
        APPEND side_effects, ReleaseResources(session.resource_claims)
 
      (ACTIVE, COMPLETED):
        VERIFY_EXIT_CRITERIA(session)
        cp ← CREATE_FINAL_CHECKPOINT(session)
        PERSIST_CHECKPOINT(cp)
        SEAL_PROVENANCE(session)
        PROMOTE_EPISODIC_MEMORY(session)
        APPEND side_effects, EmitTrace("session.completed", session_id)
 
      (SUSPENDED, RESUMED):
        ACQUIRE_RESOURCES(session.resource_requirements)
        REHYDRATE_CONTEXT(session)
        REBIND_TOOLS(session, context.tool_registry)
        VERIFY_STATE_CONSISTENCY(session)
        APPEND side_effects, EmitTrace("session.resumed", session_id)
 
      (RESUMED, ACTIVE):
        // Verify rehydration completeness
        ASSERT session.context_integrity_check() = PASS
        APPEND side_effects, EmitTrace("session.reactivated", session_id)
 
      (*, TERMINATED):
        EXECUTE_COMPENSATING_ACTIONS(session)
        PERSIST_FAILURE_STATE(session)
        RELEASE_ALL_RESOURCES(session)
        APPEND side_effects, EmitTrace("session.terminated", session_id)
 
      (*, ARCHIVED):
        MOVE_TO_COLD_STORAGE(session)
        DELETE_HOT_STATE(session)
        APPEND side_effects, EmitTrace("session.archived", session_id)
 
    // Update lifecycle phase
    session.lifecycle_phase ← target_phase
    session.state.version ← session.state.version + 1
    session.transition_log.append(TransitionRecord(
      from=current_phase,
      to=target_phase,
      event=event,
      timestamp=NOW(),
      actor=context.actor_id
    ))
 
    PERSIST_SESSION_METADATA(session)
 
    // Execute post-transition hooks
    FOR EACH hook IN POST_TRANSITION_HOOKS[(current_phase, target_phase)] DO
      hook.execute(session, context)
    END FOR
 
    RETURN (target_phase, side_effects)
  END

18.3 Session State Schema: Typed, Versioned, Serializable, and Diff-Capable#

18.3.1 State Schema Formalization#

The session state $\sigma$ is a typed record with a versioned schema:

$$\sigma = \left( v, \text{schema\_version}, \text{fields}: \{f_i : \tau_i\}_{i=1}^{n}, \text{checksum}, \text{last\_modified} \right)$$

where:

  • $v \in \mathbb{N}$ is the monotonically increasing state version number
  • $\text{schema\_version}$ follows semantic versioning $\text{MAJOR}.\text{MINOR}.\text{PATCH}$
  • Each field $f_i$ has an associated type $\tau_i$ from the type system
  • $\text{checksum} = \text{SHA256}(\text{canonical\_serialize}(\text{fields}))$

The type system supports:

| Type Class | Examples | Serialization |
|---|---|---|
| Primitive | `Int64`, `Float64`, `String`, `Bool`, `Bytes` | Direct |
| Temporal | `Timestamp`, `Duration`, `TTL` | ISO 8601 / epoch millis |
| Collection | `List<T>`, `Map<K, V>`, `Set<T>` | Ordered JSON arrays / objects |
| Composite | `AgentState`, `PlanSnapshot`, `MemorySummary` | Nested typed records |
| Reference | `Ref<Checkpoint>`, `Ref<EvidenceBundle>` | URI + content hash |
| Optional | `Option<T>` | Nullable with explicit None |

18.3.2 Schema Versioning and Evolution#

Schema evolution must support backward compatibility for session resumption across software upgrades. The rules follow a strict contract:

$$\text{Compatible}(v_{\text{old}}, v_{\text{new}}) \Leftrightarrow \begin{cases} \text{MAJOR}(v_{\text{old}}) = \text{MAJOR}(v_{\text{new}}) \\ \wedge \; \text{MINOR}(v_{\text{new}}) \geq \text{MINOR}(v_{\text{old}}) \end{cases}$$

| Change Type | Version Impact | Migration Requirement |
|---|---|---|
| Add optional field with default | MINOR bump | None — deserializer uses default |
| Add required field | MAJOR bump | Migration function required |
| Remove field | MAJOR bump | Migration function to drop field |
| Change field type | MAJOR bump | Migration function to convert |
| Rename field | MAJOR bump | Migration function to remap |
| Add enum variant | MINOR bump | Old deserializer ignores unknown |

The migration function for version $v_a \rightarrow v_b$ is:

$$\text{migrate}_{v_a \rightarrow v_b}: \sigma_{v_a} \rightarrow \sigma_{v_b}$$

Migration functions are composable:

$$\text{migrate}_{v_a \rightarrow v_c} = \text{migrate}_{v_b \rightarrow v_c} \circ \text{migrate}_{v_a \rightarrow v_b}$$

and stored in a migration registry indexed by $(v_{\text{source}}, v_{\text{target}})$.
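A registry of composable migrations can be sketched as follows. The schema versions, field names, and two example migrations (`priority` default, `owner` → `owner_id` rename) are hypothetical, chosen only to exercise the MAJOR-bump rules from the table above:

```python
# Registry indexed by (v_source, v_target)
MIGRATIONS = {}

def register(src: str, dst: str):
    def deco(fn):
        MIGRATIONS[(src, dst)] = fn
        return fn
    return deco

@register("1.0.0", "2.0.0")
def add_required_field(state: dict) -> dict:
    # MAJOR bump: new required field needs a migration-supplied value
    return {**state, "priority": "normal"}

@register("2.0.0", "3.0.0")
def rename_field(state: dict) -> dict:
    # MAJOR bump: rename remaps the old field to the new name
    out = dict(state)
    out["owner_id"] = out.pop("owner")
    return out

def migrate(state: dict, path: list) -> dict:
    # Composition: migrate a->c = migrate b->c o migrate a->b
    for src, dst in path:
        state = MIGRATIONS[(src, dst)](state)
    return state
```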

18.3.3 State Diff and Merge Operations#

For multi-session coordination (Section 18.10) and migration (Section 18.8), the system must compute structural diffs and merges on session state.

Structural Diff#

Given two state versions $\sigma_a$ and $\sigma_b$:

$$\Delta(\sigma_a, \sigma_b) = \{(f_i, \text{op}_i, v_i^{\text{old}}, v_i^{\text{new}}) : \sigma_a.f_i \neq \sigma_b.f_i\}$$

where $\text{op}_i \in \{\texttt{ADD}, \texttt{MODIFY}, \texttt{DELETE}\}$.

The diff size determines migration cost:

$$\text{DiffCost}(\sigma_a, \sigma_b) = \sum_{(f, \text{op}, \_, \_) \in \Delta} \text{cost}(\text{op}, \text{type}(f))$$

Three-Way Merge#

For cross-session state sharing, a three-way merge uses a common ancestor $\sigma_{\text{base}}$:

$$\text{Merge}(\sigma_{\text{base}}, \sigma_a, \sigma_b) = \begin{cases} \sigma_a.f_i & \text{if } \sigma_a.f_i \neq \sigma_{\text{base}}.f_i \wedge \sigma_b.f_i = \sigma_{\text{base}}.f_i \\ \sigma_b.f_i & \text{if } \sigma_b.f_i \neq \sigma_{\text{base}}.f_i \wedge \sigma_a.f_i = \sigma_{\text{base}}.f_i \\ \sigma_a.f_i & \text{if } \sigma_a.f_i = \sigma_b.f_i \\ \texttt{CONFLICT}(f_i) & \text{if } \sigma_a.f_i \neq \sigma_b.f_i \wedge \sigma_a.f_i \neq \sigma_{\text{base}}.f_i \wedge \sigma_b.f_i \neq \sigma_{\text{base}}.f_i \end{cases}$$

Conflicts are resolved through conflict resolution policies: last-writer-wins, priority-based, or escalation to human review.
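The merge cases translate directly into a field-by-field comparison against the common ancestor. A minimal sketch over flat dict-shaped states (the real state is a nested typed record), with a sentinel marking conflicts for downstream resolution policies:

```python
CONFLICT = object()  # sentinel for fields needing a resolution policy

def three_way_merge(base: dict, a: dict, b: dict) -> dict:
    merged = {}
    for f in set(base) | set(a) | set(b):
        va, vb, vbase = a.get(f), b.get(f), base.get(f)
        if va == vb:
            merged[f] = va          # both sides agree (or neither changed)
        elif vb == vbase:
            merged[f] = va          # only a diverged from base
        elif va == vbase:
            merged[f] = vb          # only b diverged from base
        else:
            merged[f] = CONFLICT    # both diverged, in different directions
    return merged
```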

18.3.4 Pseudo-Algorithm: Versioned State Serialization and Validation#

ALGORITHM SerializeSessionState
  INPUT:
    state: SessionState
    target_format: SerializationFormat   // PROTOBUF | MSGPACK | JSON
 
  OUTPUT:
    serialized: Bytes
    checksum: Hash
 
  BEGIN:
    // Canonical field ordering (deterministic)
    ordered_fields ← SORT(state.fields, key=λf: f.name)
 
    // Type validation
    FOR EACH (field_name, field_value) IN ordered_fields DO
      expected_type ← state.schema.type_of(field_name)
      IF NOT TYPE_CHECK(field_value, expected_type) THEN
        RAISE SchemaViolationError(field_name, expected_type, ACTUAL_TYPE(field_value))
      END IF
    END FOR
 
    // Serialize with canonical ordering
    canonical ← CANONICAL_ENCODE(ordered_fields, target_format)
 
    // Compute integrity checksum
    checksum ← SHA256(canonical)
 
    // Attach version metadata
    envelope ← StateEnvelope(
      schema_version=state.schema_version,
      state_version=state.version,
      checksum=checksum,
      serialized_at=NOW(),
      payload=canonical
    )
 
    serialized ← ENCODE_ENVELOPE(envelope, target_format)
    RETURN (serialized, checksum)
  END
 
 
ALGORITHM DeserializeSessionState
  INPUT:
    serialized: Bytes
    expected_schema_version: SemanticVersion
 
  OUTPUT:
    state: SessionState
 
  BEGIN:
    envelope ← DECODE_ENVELOPE(serialized)
 
    // Schema version compatibility check
    IF NOT COMPATIBLE(envelope.schema_version, expected_schema_version) THEN
      // Attempt migration
      migration_path ← FIND_MIGRATION_PATH(
        envelope.schema_version, expected_schema_version
      )
      IF migration_path IS NULL THEN
        RAISE IncompatibleSchemaError(envelope.schema_version, expected_schema_version)
      END IF
 
      state ← DECODE_PAYLOAD(envelope.payload, envelope.schema_version)
      FOR EACH migration IN migration_path DO
        state ← migration.apply(state)
      END FOR
    ELSE
      state ← DECODE_PAYLOAD(envelope.payload, expected_schema_version)
    END IF
 
    // Integrity verification
    computed_checksum ← SHA256(CANONICAL_ENCODE(state.fields))
    IF computed_checksum ≠ envelope.checksum THEN
      RAISE IntegrityViolationError(
        expected=envelope.checksum,
        computed=computed_checksum
      )
    END IF
 
    RETURN state
  END

18.4 Session Isolation Models: Per-User, Per-Task, Per-Agent, and Nested Sessions#

18.4.1 Isolation as a Correctness Requirement#

Session isolation prevents unintended state interference between concurrent execution contexts. Formally, two sessions $\mathcal{S}_a$ and $\mathcal{S}_b$ are isolated if:

$$\text{Isolated}(\mathcal{S}_a, \mathcal{S}_b) \Leftrightarrow \nexists \; m \in \text{Mutations}(\mathcal{S}_a) : \text{affects}(m, \sigma_b) \;\wedge\; \nexists \; m \in \text{Mutations}(\mathcal{S}_b) : \text{affects}(m, \sigma_a)$$

Isolation is enforced through namespaced state, scoped tool authorizations, and memory partitioning.

18.4.2 Isolation Models#

Per-User Isolation#

The broadest isolation boundary. Each user $u$ has a set of sessions:

$$\mathcal{S}_u = \{\mathcal{S} : \Gamma(\mathcal{S}).\text{owner} = u\}$$

State visibility: Session $\mathcal{S}_a$ can only access state of $\mathcal{S}_b$ if $\Gamma(\mathcal{S}_a).\text{owner} = \Gamma(\mathcal{S}_b).\text{owner}$ and an explicit sharing policy exists.

Per-Task Isolation#

Within a user's sessions, each task receives its own session:

$$\mathcal{S}_{u,\tau} = \{\mathcal{S} : \Gamma(\mathcal{S}).\text{owner} = u \wedge \Gamma(\mathcal{S}).\text{task\_id} = \tau\}$$

This prevents cross-task state contamination—a code generation task does not pollute the context of a research summarization task.

Per-Agent Isolation#

When multiple agents execute within a single session (e.g., a generator agent and a critic agent), each agent receives an isolated workspace:

$$\mathcal{W}_a = (\sigma_a^{\text{local}}, \mathcal{T}_a^{\text{bound}}, \mathcal{M}_a^{\text{working}})$$

The shared session state $\sigma_{\text{shared}}$ is accessed through a controlled interface with read/write permissions:

$$\text{Access}(a, f) = \begin{cases} \texttt{READ\_WRITE} & \text{if } f \in \text{owned\_fields}(a) \\ \texttt{READ\_ONLY} & \text{if } f \in \text{shared\_readable}(a) \\ \texttt{NONE} & \text{otherwise} \end{cases}$$

Nested Sessions#

Complex tasks spawn child sessions that inherit certain properties from the parent:

$$\mathcal{S}_{\text{child}} = \text{Fork}(\mathcal{S}_{\text{parent}}, \tau_{\text{subtask}}, \text{InheritancePolicy})$$

The inheritance policy specifies:

| Property | Inheritance Rule |
|---|---|
| Memory (semantic) | Copy-on-read, isolated writes |
| Memory (episodic) | Read-only access to parent's episodes |
| Tool bindings | Subset of parent's bindings (least privilege) |
| Token budget | Allocated fraction of parent's remaining budget |
| Isolation context | Child inherits owner but gets unique task scope |
| Lifecycle | Child must complete or terminate before parent completes |

The parent-child relationship forms a session tree:

$$\text{SessionTree}(\mathcal{S}_{\text{root}}) = (\mathcal{S}_{\text{root}}, \{(\mathcal{S}_p, \mathcal{S}_c) : \mathcal{S}_c = \text{Fork}(\mathcal{S}_p, \ldots)\})$$

Invariant: A parent session cannot transition to COMPLETED until all child sessions are in terminal states:

$$\Lambda(\mathcal{S}_{\text{parent}}) = \texttt{COMPLETED} \Rightarrow \forall \mathcal{S}_c \in \text{children}(\mathcal{S}_{\text{parent}}): \Lambda(\mathcal{S}_c) \in F$$

18.4.3 Isolation Enforcement Mechanism#

ALGORITHM EnforceSessionIsolation
  INPUT:
    session: Session
    operation: StateOperation        // READ | WRITE | DELETE
    field: FieldPath
    actor: AgentID | UserID
 
  OUTPUT:
    permitted: Bool
 
  BEGIN:
    isolation_ctx ← session.isolation_context
 
    // Determine access level
    MATCH isolation_ctx.model:
      PER_USER:
        IF actor NOT IN isolation_ctx.authorized_users THEN
          AUDIT_LOG("access_denied", actor, session.id, field, operation)
          RETURN FALSE
        END IF
 
      PER_TASK:
        IF actor.current_task ≠ session.task_scope THEN
          AUDIT_LOG("cross_task_access_denied", actor, session.id)
          RETURN FALSE
        END IF
 
      PER_AGENT:
        access ← isolation_ctx.agent_permissions[actor]
        IF operation = WRITE AND access ≠ READ_WRITE THEN
          AUDIT_LOG("agent_write_denied", actor, session.id, field)
          RETURN FALSE
        END IF
        IF operation = READ AND access = NONE THEN
          AUDIT_LOG("agent_read_denied", actor, session.id, field)
          RETURN FALSE
        END IF
 
    // Field-level access control
    field_policy ← session.schema.field_access_policy(field)
    IF operation NOT IN field_policy.allowed_operations(actor.role) THEN
      AUDIT_LOG("field_access_denied", actor, field, operation)
      RETURN FALSE
    END IF
 
    RETURN TRUE
  END

18.4.4 Isolation Strength Hierarchy#

The isolation models form a hierarchy from strongest to weakest:

$$\text{Per-Agent} \subset \text{Per-Task} \subset \text{Per-User} \subset \text{Global}$$

The tighter the isolation, the stronger the correctness guarantee but the higher the coordination cost for cross-boundary communication. The system selects isolation granularity based on the contention risk and shared-state requirements of the workload.


18.5 Session Persistence Strategies: In-Memory, Write-Ahead Log, Database-Backed, and Distributed#

18.5.1 Persistence Tier Classification#

Session state is tiered across persistence layers with different durability, latency, and cost characteristics:

$$\text{PersistenceTier} = \begin{cases} \texttt{L0: IN\_MEMORY} & \text{— volatile, } \sim\mu\text{s latency, lost on crash} \\ \texttt{L1: WAL} & \text{— append-only log, } \sim\text{ms latency, survives crash} \\ \texttt{L2: DATABASE} & \text{— ACID-compliant store, } \sim 10\,\text{ms, survives node loss} \\ \texttt{L3: DISTRIBUTED} & \text{— replicated across regions, } \sim 50\,\text{ms, survives region failure} \end{cases}$$

18.5.2 Persistence Strategy Selection#

The optimal persistence tier for a session is determined by a cost-durability objective:

$$\text{Tier}^*(\mathcal{S}) = \arg\min_{t \in \{L0, L1, L2, L3\}} \text{Cost}(t) \quad \text{s.t.} \quad \text{Durability}(t) \geq D_{\text{required}}(\mathcal{S})$$

where $D_{\text{required}}$ is a function of session criticality:

| Session Criticality | Required Durability | Recommended Tier |
|---|---|---|
| Ephemeral exploration | None | L0 (in-memory) |
| Standard interactive | Crash-resilient | L1 (WAL) |
| Business-critical workflow | Node-failure resilient | L2 (database) |
| Cross-region, long-running | Region-failure resilient | L3 (distributed) |
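The constrained argmin above can be sketched as a table lookup. The durability ranks and relative cost units here are illustrative assumptions, not figures from the chapter:

```python
# (tier name, durability rank, relative cost) — illustrative values only
TIERS = [
    ("L0", 0, 1),
    ("L1", 1, 3),
    ("L2", 2, 10),
    ("L3", 3, 50),
]

def select_tier(required_durability: int) -> str:
    # argmin cost subject to durability(t) >= D_required
    eligible = [(cost, name) for name, d, cost in TIERS if d >= required_durability]
    return min(eligible)[1]
```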

18.5.3 Write-Ahead Log (WAL) for Session State#

The WAL provides crash-resilient persistence with minimal write amplification. Every state mutation is appended to the WAL before being applied to the in-memory state:

$$\text{WAL Entry} = \left(\text{seq\_no}, \text{session\_id}, \text{operation}, \text{field}, \text{value\_before}, \text{value\_after}, \text{timestamp}, \text{actor}\right)$$

WAL Compaction#

Over time, the WAL grows without bound. Compaction produces a compressed snapshot:

$$\text{Compact}(\text{WAL}[0:k]) = \text{Snapshot}(\sigma_k) \oplus \text{WAL}[k+1:n]$$

The compacted state is:

$$\sigma_k = \text{Replay}(\sigma_0, \text{WAL}[0:k])$$

Compaction is triggered when:

$$|\text{WAL}| > W_{\text{compact\_threshold}} \quad \vee \quad T_{\text{since\_last\_compact}} > T_{\text{compact\_interval}}$$
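The append-replay-compact cycle can be sketched in a few lines. This simplified log records only field-level `value_after` (the full entry also carries session id, `value_before`, timestamp, and actor):

```python
class WAL:
    """Append-only log of field mutations with snapshot compaction."""

    def __init__(self):
        self.snapshot = {}   # sigma_0: state as of the last compaction
        self.entries = []    # (seq_no, field, value_after)
        self.seq = 0

    def append(self, field: str, value) -> None:
        # The mutation is logged before it is considered applied
        self.seq += 1
        self.entries.append((self.seq, field, value))

    def replay(self) -> dict:
        # sigma_k = Replay(sigma_0, WAL[0:k])
        state = dict(self.snapshot)
        for _, field, value in self.entries:
            state[field] = value
        return state

    def compact(self) -> None:
        # Fold the log prefix into a snapshot and truncate the log
        self.snapshot = self.replay()
        self.entries = []
```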

18.5.4 Database-Backed Persistence#

For L2 persistence, session state is serialized into a database with the following schema:

| Column | Type | Index | Semantics |
|---|---|---|---|
| session_id | UUID | Primary key | Unique session identifier |
| schema_version | SemVer | | State schema version |
| state_version | Int64 | | Monotonic state version |
| state_blob | Bytes | | Serialized state (Protobuf) |
| checksum | Bytes(32) | | SHA-256 integrity hash |
| lifecycle_phase | Enum | Index | Current lifecycle phase |
| owner_id | UUID | Index | Session owner |
| created_at | Timestamp | Index | Creation time |
| last_active_at | Timestamp | Index | Last activity time |
| expires_at | Timestamp | Index | Expiry time (TTL-based) |
| metadata | JSONB | GIN index | Tags, labels, lineage |

Optimistic concurrency control is used for updates:

ALGORITHM PersistSessionToDatabase
  INPUT:
    session: Session
    db: DatabaseConnection
 
  OUTPUT:
    success: Bool
 
  BEGIN:
    (serialized, checksum) ← SERIALIZE_SESSION_STATE(session.state)
    expected_version ← session.state.version
 
    result ← db.execute(
      "UPDATE sessions
       SET state_blob = $1,
           checksum = $2,
           state_version = $3,
           lifecycle_phase = $4,
           last_active_at = $5
       WHERE session_id = $6
         AND state_version = $7",      // Optimistic lock
      params=[serialized, checksum, expected_version + 1,
              session.lifecycle_phase, NOW(),
              session.id, expected_version]
    )
 
    IF result.rows_affected = 0 THEN
      // Concurrent modification detected
      RAISE ConcurrentModificationError(session.id, expected_version)
    END IF
 
    session.state.version ← expected_version + 1
    RETURN TRUE
  END

18.5.5 Distributed Persistence#

For L3, session state is replicated across regions using a consensus protocol. The replication factor $R$ and consistency level $C$ are configurable:

$$\text{Quorum Read}: C_R \geq \left\lfloor \frac{R}{2} \right\rfloor + 1$$
$$\text{Quorum Write}: C_W \geq \left\lfloor \frac{R}{2} \right\rfloor + 1$$
$$\text{Linearizable}: C_R + C_W > R$$

The session system defaults to session-consistent reads (read-your-own-writes) within a session, even when using eventual consistency across the distributed store. This is achieved by routing all reads for a session to the same replica or by attaching a causal timestamp $\tau_{\text{causal}}$ to each operation:

$$\text{Read}(\mathcal{S}, f) \text{ returns } v \Leftrightarrow \text{timestamp}(v) \geq \tau_{\text{causal}}(\mathcal{S})$$

18.5.6 Persistence Cost Model#

The cost of persisting session state over its lifetime:

$$C_{\text{persist}}(\mathcal{S}) = C_{\text{write}} \cdot N_{\text{writes}} + C_{\text{read}} \cdot N_{\text{reads}} + C_{\text{storage}} \cdot |\sigma| \cdot T_{\text{retention}}$$

where $N_{\text{writes}}$ and $N_{\text{reads}}$ are the total write and read operations, $|\sigma|$ is the serialized state size, and $T_{\text{retention}}$ is the retention duration. This cost model directly informs tier selection and compaction frequency.
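As a worked instance of the formula (all cost coefficients are made-up unit prices, not values from the chapter):

```python
def persistence_cost(c_write, n_writes, c_read, n_reads,
                     c_storage, state_bytes, retention_s):
    # C_persist = C_write*N_writes + C_read*N_reads
    #           + C_storage*|sigma|*T_retention
    return (c_write * n_writes
            + c_read * n_reads
            + c_storage * state_bytes * retention_s)

# e.g. 10 writes at 1.0, 5 reads at 1.0, 100 bytes stored for 10 s
# at 0.001 per byte-second -> 10 + 5 + 1 = 16 cost units
```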


18.6 Session Checkpointing: Periodic, Event-Triggered, and Pre-Mutation Snapshots#

18.6.1 Checkpoint Definition#

A checkpoint is a point-in-time consistent snapshot of session state:

$$\text{CP}(k) = \left(\text{session\_id}, k, \text{version}(\sigma), \text{snapshot}(\sigma), \text{checksum}, \text{trigger}, \text{timestamp}\right)$$

where $k$ is the monotonically increasing checkpoint sequence number.

18.6.2 Checkpoint Trigger Strategies#

| Strategy | Trigger Condition | Use Case | Trade-off |
|---|---|---|---|
| Periodic | Every $\Delta T$ seconds | Background consistency | Simple but may miss critical mutations |
| Event-Triggered | On lifecycle transitions, tool invocations, phase changes | Agent loop boundaries | Precise but higher write frequency |
| Pre-Mutation | Before any state-changing operation | Safety-critical workflows | Maximum safety, highest write cost |
| Adaptive | Based on state change rate | General purpose | Balances cost and safety dynamically |

Adaptive Checkpoint Interval#

The adaptive strategy adjusts the checkpoint interval based on the rate of state mutations:

$$\Delta T_{\text{cp}}(k) = \max\left(\Delta T_{\min}, \; \frac{\Delta T_{\text{base}}}{\text{mutation\_rate}(k)}\right)$$

where:

$$\text{mutation\_rate}(k) = \frac{|\{m : t_m \in [t_{k-1}, t_k]\}|}{\Delta t}$$

High mutation rates trigger more frequent checkpoints; quiescent periods relax checkpoint frequency.
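A sketch of the interval computation. Note one added assumption not in the formula: the upper cap `dt_max` bounds the interval for fully idle sessions, where $\Delta T_{\text{base}} / \text{mutation\_rate}$ would otherwise diverge:

```python
def adaptive_interval(dt_min: float, dt_base: float,
                      dt_max: float, mutation_rate: float) -> float:
    # dT_cp = max(dT_min, dT_base / mutation_rate), clamped above by dt_max
    if mutation_rate <= 0:
        return dt_max  # assumption: idle sessions checkpoint at the cap
    return min(dt_max, max(dt_min, dt_base / mutation_rate))
```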

18.6.3 Checkpoint Storage Optimization#

Checkpoints can be stored as full snapshots or incremental deltas:

Full Snapshot#

$$|\text{CP}_{\text{full}}(k)| = |\sigma_k|$$

Incremental Delta#

$$|\text{CP}_{\text{delta}}(k)| = |\Delta(\sigma_{k-1}, \sigma_k)|$$

The space-time trade-off: full snapshots enable $O(1)$ restoration but cost $O(|\sigma|)$ per checkpoint. Incremental deltas cost $O(|\Delta|)$ per checkpoint but require replaying $O(k)$ deltas for restoration.

A mixed strategy checkpoints a full snapshot every $F$ checkpoints and incremental deltas in between:

$$\text{Restore}(\sigma, k) = \text{Replay}\left(\text{CP}_{\text{full}}\left(\left\lfloor \frac{k}{F} \right\rfloor \cdot F\right), \; \left\{\text{CP}_{\text{delta}}(j)\right\}_{j=\lfloor k/F \rfloor \cdot F + 1}^{k}\right)$$

Restoration cost:

$$\text{RestoreCost}(k) = |\sigma| + \sum_{j=\lfloor k/F \rfloor \cdot F + 1}^{k} |\Delta_j| \leq |\sigma| + (F - 1) \cdot \max|\Delta|$$
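The mixed restore path is compact enough to sketch directly. The model below is a simplification for illustration: checkpoints live in a dict keyed by sequence number, state is a flat dict, and a delta is a shallow set of field overwrites.

```python
def restore(checkpoints: dict[int, dict], k: int, f: int) -> dict:
    """Restore state at sequence k: load the nearest preceding full
    snapshot (full checkpoints land at multiples of F), then replay
    the incremental deltas after it in ascending order."""
    base_seq = (k // f) * f                              # ⌊k/F⌋ · F
    state = dict(checkpoints[base_seq]["snapshot"])      # copy, don't alias
    for j in range(base_seq + 1, k + 1):
        state.update(checkpoints[j]["delta"])            # replay Δ_j
    return state
```

With $F = 4$, restoring at $k = 2$ reads the full snapshot at sequence 0 and replays deltas 1 and 2; at most $F - 1$ deltas are ever replayed, matching the cost bound above.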

18.6.4 Pseudo-Algorithm: Checkpoint Manager#

ALGORITHM CheckpointManager
  INPUT:
    session: Session
    trigger: CheckpointTrigger
    config: CheckpointConfig
 
  OUTPUT:
    checkpoint_ref: CheckpointRef
 
  BEGIN:
    // Determine checkpoint type
    last_full_seq ← GET_LAST_FULL_CHECKPOINT_SEQ(session.id)
    current_seq ← session.checkpoint_seq + 1
    deltas_since_full ← current_seq - last_full_seq
 
    IF deltas_since_full ≥ config.full_checkpoint_interval
       OR trigger = LIFECYCLE_TRANSITION
       OR trigger = PRE_MUTATION_CRITICAL THEN
      cp_type ← FULL
    ELSE
      cp_type ← INCREMENTAL
    END IF
 
    // Create checkpoint
    MATCH cp_type:
      FULL:
        (serialized, checksum) ← SERIALIZE_SESSION_STATE(session.state)
        checkpoint ← Checkpoint(
          session_id=session.id,
          seq=current_seq,
          type=FULL,
          state_version=session.state.version,
          payload=serialized,
          checksum=checksum,
          trigger=trigger,
          timestamp=NOW()
        )
 
      INCREMENTAL:
        prev_state ← LOAD_PREVIOUS_STATE(session.id, current_seq - 1)
        delta ← COMPUTE_DIFF(prev_state, session.state)
        delta_serialized ← SERIALIZE_DIFF(delta)
        checksum ← SHA256(delta_serialized)
        checkpoint ← Checkpoint(
          session_id=session.id,
          seq=current_seq,
          type=INCREMENTAL,
          state_version=session.state.version,
          payload=delta_serialized,
          base_seq=current_seq - 1,
          checksum=checksum,
          trigger=trigger,
          timestamp=NOW()
        )
 
    // Persist to configured tier
    persistence_tier ← session.persistence_policy.checkpoint_tier
    PERSIST_CHECKPOINT(checkpoint, persistence_tier)
 
    // Update session metadata
    session.checkpoint_seq ← current_seq
    session.last_checkpoint_at ← NOW()
 
    EMIT_METRIC("session.checkpoint", {
      session_id: session.id,
      seq: current_seq,
      type: cp_type,
      size_bytes: SIZE(checkpoint.payload),
      trigger: trigger
    })
 
    RETURN CheckpointRef(session.id, current_seq, checksum)
  END

18.7 Session Resumption: Rehydrating Context, Rebinding Tools, and Restoring Agent State#

18.7.1 Resumption as a Multi-Phase Protocol#

Session resumption is not a single deserialization step. It is a multi-phase protocol that reconstructs the full execution environment from persisted state:

$$\text{Resume}(\mathcal{S}) = \text{Verify}(\text{integrity}) \circ \text{Reconstruct}(\mathcal{M}) \circ \text{Rebind}(\mathcal{T}) \circ \text{Rehydrate}(\text{ctx}) \circ \text{Restore}(\sigma)$$

Composition applies right to left: state restoration runs first, integrity verification last.

Each phase has specific preconditions, postconditions, and failure modes.

18.7.2 Phase 1: State Restoration#

Restore the session state σ\sigma from the most recent checkpoint:

$$\sigma_{\text{restored}} = \text{Replay}\left(\text{CP}_{\text{full}}(k_0), \{\text{CP}_{\text{delta}}(k)\}_{k=k_0+1}^{k_{\text{latest}}}\right)$$

Postcondition: $\text{checksum}(\sigma_{\text{restored}}) = \text{checksum}(\text{CP}(k_{\text{latest}}))$

Failure mode: Checkpoint corruption → fall back to previous full checkpoint.

18.7.3 Phase 2: Context Rehydration#

The active context window at the time of suspension may have contained retrieved evidence, tool outputs, and compressed history that are not part of the serialized state. Rehydration reconstructs this context:

$$\text{ctx}_{\text{rehydrated}} = \text{Compile}\left(\text{role\_policy}, \sigma_{\text{restored}}, \mathcal{M}_s.\text{summary}(), \text{ReRetrieve}(\sigma.\text{retrieval\_queries})\right)$$

Key considerations:

  • Stale evidence: If time has passed since suspension, retrieved evidence may be outdated. The rehydration phase checks freshness scores and optionally re-retrieves:

$$\text{ReRetrieve}(q) = \begin{cases} \text{cached}(q) & \text{if } \text{age}(\text{cached}(q)) < \tau_{\text{fresh}} \\ \text{Retrieve}(q) & \text{otherwise} \end{cases}$$
  • Token budget recalculation: The remaining token budget must be recalculated from the checkpoint:

$$T_{\text{remaining}} = T_{\max} - T_{\text{consumed}}(\text{CP}(k_{\text{latest}}))$$

18.7.4 Phase 3: Tool Rebinding#

Tools may have changed availability, version, or authorization scope since the session was suspended:

$$\mathcal{T}_{\text{rebound}} = \{t \in \mathcal{T}_b : \text{available}(t) \wedge \text{version\_compatible}(t) \wedge \text{authorized}(t, \Gamma)\}$$

For tools that are no longer available, the system applies the tool substitution policy:

$$\text{Substitute}(t) = \begin{cases} t' & \text{if } \exists t' \in \text{ToolRegistry}: \text{compatible}(t, t') \\ \texttt{DEGRADE} & \text{if } t \text{ is optional} \\ \texttt{FAIL\_RESUME} & \text{if } t \text{ is required} \end{cases}$$
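The substitution policy reduces to a small resolver. The `ToolSpec` record and the name-keyed compatibility registry below are illustrative stand-ins for the real tool registry, which would apply a richer `compatible(t, t')` relation:

```python
from dataclasses import dataclass

@dataclass
class ToolSpec:
    name: str
    required: bool

def substitute(spec: ToolSpec, registry: dict[str, str]) -> str:
    """Resolve a missing tool per the substitution policy:
    compatible replacement > DEGRADE (optional) > FAIL_RESUME (required)."""
    replacement = registry.get(spec.name)   # compatible(t, t'), simplified
    if replacement is not None:
        return replacement
    return "FAIL_RESUME" if spec.required else "DEGRADE"
```

For example, a missing required tool with a registered substitute resolves to that substitute; a missing optional tool without one degrades the session rather than failing resumption.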

18.7.5 Phase 4: Memory Reconstruction#

Session memory layers are reconstructed:

| Memory Layer | Resumption Strategy |
| --- | --- |
| Working | Reset (ephemeral by definition) |
| Session | Restored from checkpoint |
| Episodic | Loaded from durable store |
| Semantic | Read from organizational memory (shared, not session-specific) |
| Procedural | Loaded from procedural memory store |

18.7.6 Phase 5: Integrity Verification#

Before the session transitions from RESUMED to ACTIVE, a comprehensive integrity check verifies:

$$\text{IntegrityCheck}(\mathcal{S}) = \bigwedge \begin{cases} \text{checksum}(\sigma_{\text{restored}}) = \text{expected} \\ \text{schema\_version}(\sigma) \text{ is compatible} \\ \mathcal{T}_{\text{rebound}} \supseteq \mathcal{T}_{\text{required}} \\ T_{\text{remaining}} > T_{\text{min\_viable}} \\ \text{plan\_state} \text{ is consistent} \\ \text{no orphaned child sessions} \end{cases}$$

18.7.7 Pseudo-Algorithm: Session Resumption Protocol#

ALGORITHM ResumeSession
  INPUT:
    session_id: SessionID
    checkpoint_store: CheckpointStore
    tool_registry: ToolRegistry
    memory_store: MemoryStore
    config: ResumptionConfig
 
  OUTPUT:
    resumed_session: Session
 
  BEGIN:
    // ─── Phase 1: State Restoration ───
    checkpoints ← checkpoint_store.list(session_id, order=DESC)
    IF checkpoints IS EMPTY THEN
      RAISE NoCheckpointAvailableError(session_id)
    END IF
 
    // Find most recent full checkpoint
    full_cp ← FIND_LATEST(checkpoints, type=FULL)
    IF full_cp IS NULL THEN
      RAISE NoFullCheckpointError(session_id)
    END IF
 
    // Collect incremental deltas after full checkpoint
    deltas ← FILTER(checkpoints,
                     λcp: cp.type = INCREMENTAL AND cp.seq > full_cp.seq)
    SORT deltas BY seq ASC
 
    // Replay, verifying each checkpoint payload against its recorded checksum
    // (a full checkpoint's checksum covers the serialized state; an
    // incremental checkpoint's checksum covers only its own delta payload)
    integrity_ok ← (SHA256(full_cp.payload) = full_cp.checksum)
    IF integrity_ok THEN
      state ← DESERIALIZE_SESSION_STATE(full_cp.payload, config.expected_schema_version)
      FOR EACH delta IN deltas DO
        IF SHA256(delta.payload) ≠ delta.checksum THEN
          integrity_ok ← FALSE
          BREAK
        END IF
        diff ← DESERIALIZE_DIFF(delta.payload)
        state ← APPLY_DIFF(state, diff)
      END FOR
    END IF

    latest_cp ← checkpoints[0]
    IF NOT integrity_ok THEN
      // Attempt fallback to an earlier full checkpoint
      WARN("checkpoint_integrity_failure", session_id, latest_cp.seq)
      IF config.allow_fallback THEN
        RETURN RESUME_FROM_FALLBACK(session_id, full_cp, config)
      ELSE
        RAISE CheckpointIntegrityError(session_id, latest_cp.seq)
      END IF
    END IF
 
    // ─── Phase 2: Context Rehydration ───
    retrieval_queries ← state.pending_retrieval_queries
    evidence ← {}
    FOR EACH query IN retrieval_queries DO
      cached ← RETRIEVAL_CACHE.get(query.hash)
      IF cached IS NOT NULL AND AGE(cached) < config.freshness_threshold THEN
        evidence[query.id] ← cached
      ELSE
        fresh_result ← RETRIEVE(query, deadline=config.retrieval_deadline)
        evidence[query.id] ← fresh_result
        RETRIEVAL_CACHE.put(query.hash, fresh_result, ttl=config.cache_ttl)
      END IF
    END FOR
 
    // Recalculate token budget
    token_budget ← config.T_max - state.tokens_consumed
    IF token_budget < config.T_min_viable THEN
      RAISE InsufficientTokenBudgetError(session_id, token_budget)
    END IF
 
    // ─── Phase 3: Tool Rebinding ───
    required_tools ← state.required_tools
    bound_tools ← {}
    degraded_tools ← []
 
    FOR EACH tool_spec IN required_tools DO
      tool ← tool_registry.resolve(tool_spec.name, tool_spec.version_constraint)
      IF tool IS NOT NULL THEN
        IF AUTHORIZE(tool, state.isolation_context) THEN
          bound_tools[tool_spec.name] ← tool
        ELSE
          IF tool_spec.required THEN
            RAISE ToolAuthorizationFailedError(tool_spec.name)
          ELSE
            APPEND degraded_tools, tool_spec.name
          END IF
        END IF
      ELSE
        substitute ← tool_registry.find_substitute(tool_spec)
        IF substitute IS NOT NULL THEN
          bound_tools[tool_spec.name] ← substitute
          WARN("tool_substituted", tool_spec.name, substitute.name)
        ELSE IF tool_spec.required THEN
          RAISE RequiredToolUnavailableError(tool_spec.name)
        ELSE
          APPEND degraded_tools, tool_spec.name
        END IF
      END IF
    END FOR
 
    // ─── Phase 4: Memory Reconstruction ───
    session_memory ← SessionMemory(
      working=WorkingMemory.fresh(),             // Ephemeral: always reset
      session=memory_store.load_session_memory(session_id),
      episodic=memory_store.load_episodic(session_id),
      semantic=memory_store.load_semantic(state.isolation_context.org_id),
      procedural=memory_store.load_procedural(state.task_type)
    )
 
    // ─── Construct Resumed Session ───
    session ← Session(
      id=session_id,
      state=state,
      lifecycle_phase=RESUMED,
      isolation_context=state.isolation_context,
      memory=session_memory,
      tool_bindings=bound_tools,
      evidence=evidence,
      token_budget=token_budget,
      degraded_capabilities=degraded_tools,
      metadata=state.metadata
    )
 
    // ─── Phase 5: Integrity Verification ───
    integrity ← VERIFY_SESSION_INTEGRITY(session)
    IF NOT integrity.passed THEN
      RAISE SessionIntegrityError(session_id, integrity.failures)
    END IF
 
    EMIT_TRACE("session.resumed", {
      session_id: session_id,
      resumed_from_checkpoint: latest_cp.seq,
      state_version: state.version,
      tools_bound: LEN(bound_tools),
      tools_degraded: LEN(degraded_tools),
      evidence_rehydrated: LEN(evidence),
      token_budget_remaining: token_budget
    })
 
    RETURN session
  END

18.8 Session Migration: Moving Sessions Across Nodes, Regions, and Agent Instances#

18.8.1 Migration Motivation and Scenarios#

Sessions must be migratable across infrastructure boundaries for:

| Scenario | Trigger | Constraint |
| --- | --- | --- |
| Node failover | Node crash or scheduled maintenance | Minimize downtime; resume on healthy node |
| Load balancing | Uneven resource utilization | Minimize migration latency |
| Region transfer | User relocates; data residency requirements | Comply with data sovereignty regulations |
| Agent upgrade | New agent version deployed | Maintain state continuity across versions |
| Horizontal scaling | Workload spike | Distribute sessions across expanded capacity |

18.8.2 Migration Protocol#

Session migration is a three-phase commit protocol:

Phase 1: Prepare

$$\text{Source} \xrightarrow{\text{PREPARE}(\text{session\_id},\ \text{target})} \text{Target}$$
  • Source suspends the session (creates checkpoint).
  • Source serializes complete session state, including metadata, checkpoints, and WAL tail.
  • Source notifies target of incoming migration with state size and resource requirements.
  • Target verifies capacity, schema compatibility, and tool availability.

Phase 2: Transfer

$$\text{Source} \xrightarrow{\text{TRANSFER}(\text{state\_bundle})} \text{Target}$$
  • State bundle is transferred over an encrypted, authenticated channel.
  • Target deserializes, validates checksum, and performs schema migration if necessary.
  • Target binds tools and verifies integrity.

Phase 3: Commit

$$\text{Source} \xrightarrow{\text{COMMIT\_MIGRATION}} \text{Target}$$
  • Target acknowledges successful restoration.
  • Source marks the session as migrated and releases all local resources.
  • Routing table is updated to direct future requests to the target.
  • If acknowledgment times out, the source retains the session (migration aborted).

18.8.3 Migration Latency Model#

The total migration latency is:

$$L_{\text{migrate}} = L_{\text{suspend}} + L_{\text{serialize}} + \frac{|\sigma_{\text{bundle}}|}{B_{\text{network}}} + L_{\text{deserialize}} + L_{\text{rebind}} + L_{\text{verify}}$$

where $B_{\text{network}}$ is the available network bandwidth. For large session states, the transfer term dominates. Optimization strategies include:

  • Incremental migration: Transfer only the delta since the last checkpoint already present on the target.
  • Compression: Apply LZ4 or Zstandard compression to the state bundle:

$$|\sigma_{\text{compressed}}| = |\sigma_{\text{bundle}}| \cdot (1 - \rho), \quad \rho \in [0.3, 0.8]$$
  • Pre-staging: Begin transfer of large, stable state components (e.g., episodic memory) before the migration is committed.
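Plugging the compression model into the latency formula gives a back-of-the-envelope estimator. All names below are hypothetical; the component latencies default to zero, so only the transfer term is modeled unless the others are supplied:

```python
def migration_latency_ms(
    bundle_bytes: int,
    bandwidth_bytes_per_s: float,
    compression_ratio: float = 0.0,   # ρ ∈ [0.3, 0.8] typical
    suspend_ms: float = 0.0,
    serialize_ms: float = 0.0,
    deserialize_ms: float = 0.0,
    rebind_ms: float = 0.0,
    verify_ms: float = 0.0,
) -> float:
    """Estimate L_migrate; compression shrinks the dominant transfer term."""
    wire_bytes = bundle_bytes * (1 - compression_ratio)
    transfer_ms = wire_bytes / bandwidth_bytes_per_s * 1000
    return (suspend_ms + serialize_ms + transfer_ms
            + deserialize_ms + rebind_ms + verify_ms)
```

A 10 MB bundle over a 1 MB/s link takes roughly 10 s uncompressed; at $\rho = 0.6$ the same transfer drops to about 4 s, which is why compression and incremental migration are the first levers to pull.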

18.8.4 Pseudo-Algorithm: Session Migration#

ALGORITHM MigrateSession
  INPUT:
    session_id: SessionID
    source_node: NodeID
    target_node: NodeID
    config: MigrationConfig
 
  OUTPUT:
    migration_result: MigrationResult
 
  BEGIN:
    // ─── Phase 1: Prepare ───
    session ← LOAD_SESSION(session_id, source_node)
 
    // Suspend session (creates checkpoint)
    LIFECYCLE_TRANSITION(session, SUSPEND)
    checkpoint ← CREATE_CHECKPOINT(session, trigger=MIGRATION)
 
    // Serialize state bundle
    state_bundle ← SerializeStateBundle(
      state=session.state,
      checkpoints=GET_RECENT_CHECKPOINTS(session_id, config.checkpoint_window),
      wal_tail=GET_WAL_TAIL(session_id),
      memory=SERIALIZE_SESSION_MEMORY(session.memory),
      metadata=session.metadata,
      tool_specs=session.tool_bindings.specs()
    )
 
    compressed_bundle ← COMPRESS(state_bundle, algorithm=ZSTD, level=3)
 
    // Verify target capacity
    target_capacity ← RPC(target_node, "check_migration_capacity", {
      state_size: SIZE(compressed_bundle),
      schema_version: session.state.schema_version,
      required_tools: session.tool_bindings.specs()
    })
 
    IF NOT target_capacity.accepted THEN
      LIFECYCLE_TRANSITION(session, RESUME)  // Abort: reactivate on source
      RETURN MigrationResult(success=FALSE, reason=target_capacity.rejection_reason)
    END IF
 
    // ─── Phase 2: Transfer ───
    transfer_start ← NOW()
 
    transfer_result ← RPC(target_node, "receive_session_migration", {
      session_id: session_id,
      bundle: compressed_bundle,
      checksum: SHA256(state_bundle),
      source_node: source_node
    }, deadline=config.transfer_deadline)
 
    transfer_latency ← NOW() - transfer_start
 
    IF NOT transfer_result.success THEN
      LIFECYCLE_TRANSITION(session, RESUME)  // Abort: reactivate on source
      RETURN MigrationResult(success=FALSE, reason=transfer_result.error)
    END IF
 
    // ─── Phase 3: Commit ───
    commit_result ← RPC(target_node, "commit_session_migration", {
      session_id: session_id,
      expected_checksum: SHA256(state_bundle)
    }, deadline=config.commit_deadline)
 
    IF commit_result.success THEN
      // Update routing table
      ROUTING_TABLE.update(session_id, target_node)
 
      // Clean up source
      MARK_SESSION_MIGRATED(session_id, source_node, target_node)
      SCHEDULE_CLEANUP(session_id, source_node, delay=config.cleanup_delay)
 
      EMIT_METRIC("session.migrated", {
        session_id: session_id,
        from: source_node,
        to: target_node,
        bundle_size: SIZE(compressed_bundle),
        transfer_latency_ms: transfer_latency,
        total_latency_ms: NOW() - transfer_start
      })
 
      RETURN MigrationResult(success=TRUE)
    ELSE
      // Commit failed: reactivate on source
      LIFECYCLE_TRANSITION(session, RESUME)
      RETURN MigrationResult(success=FALSE, reason="commit_failed")
    END IF
  END

18.9 Session Timeout and Expiry: Configurable TTL, Grace Periods, and Cleanup Hooks#

18.9.1 Timeout and Expiry Model#

Sessions are subject to multiple time-based constraints:

$$\text{TimeoutConfig}(\mathcal{S}) = \left(\tau_{\text{idle}}, \tau_{\text{absolute}}, \tau_{\text{grace}}, \tau_{\text{expiry}}\right)$$

| Parameter | Semantics | Typical Range |
| --- | --- | --- |
| $\tau_{\text{idle}}$ | Max time without any interaction | 5 min – 24 hrs |
| $\tau_{\text{absolute}}$ | Max total session duration | 1 hr – 30 days |
| $\tau_{\text{grace}}$ | Grace period after timeout before state cleanup | 1 hr – 7 days |
| $\tau_{\text{expiry}}$ | Time after which archived state is permanently deleted | 30 – 365 days |

18.9.2 Timeout State Transitions#

$$\delta(\texttt{ACTIVE}, \texttt{idle\_timeout}) = \texttt{SUSPENDED}$$
$$\delta(\texttt{ACTIVE}, \texttt{absolute\_timeout}) = \texttt{SUSPENDED}$$
$$\delta(\texttt{SUSPENDED}, \texttt{grace\_expired}) = \texttt{TERMINATED}$$
$$\delta(\texttt{TERMINATED}, \texttt{expiry\_reached}) = \texttt{ARCHIVED} \rightarrow \texttt{DELETE}$$

18.9.3 Idle Detection#

Idle time is measured from the last meaningful interaction:

$$t_{\text{idle}}(\mathcal{S}) = t_{\text{now}} - \max\left(t_{\text{last\_user\_input}}, t_{\text{last\_agent\_action}}, t_{\text{last\_tool\_invocation}}\right)$$

A session is idle-timed-out when:

$$t_{\text{idle}}(\mathcal{S}) > \tau_{\text{idle}}(\mathcal{S})$$

Important distinction: Background processing (e.g., async tool results, retrieval updates) resets the idle timer only if configured to do so. Pure heartbeat messages do not reset idle time—only semantically meaningful interactions qualify.
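The idle rule is a one-line predicate over the three interaction timestamps; heartbeats are excluded simply by never updating these inputs. A minimal sketch (names illustrative):

```python
def is_idle_timed_out(
    now: float,
    last_user_input: float,      # updated only by meaningful interactions,
    last_agent_action: float,    # never by heartbeats
    last_tool_invocation: float,
    tau_idle: float,
) -> bool:
    """t_idle = now - max(meaningful timestamps); timed out when > τ_idle."""
    last_meaningful = max(last_user_input, last_agent_action, last_tool_invocation)
    return (now - last_meaningful) > tau_idle
```

The sweep loop in the monitor below applies the same predicate via an index on `last_active_at` rather than recomputing the max per request.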

18.9.4 Cleanup Hooks#

Before a session is terminated or archived, cleanup hooks execute in order:

  1. Flush pending state: Persist any uncommitted working memory.
  2. Release tool locks: Free any claimed resources.
  3. Notify dependent sessions: Alert parent or linked sessions of termination.
  4. Emit analytics: Record final session metrics.
  5. Promote valuable memory: Extract non-obvious insights for long-term memory (with validation).
  6. Delete sensitive data: Scrub PII or credentials from session state.

ALGORITHM SessionTimeoutMonitor
  INPUT:
    sessions: SessionIndex          // Indexed by last_active_at
    config: TimeoutConfig
 
  // Runs as a periodic background process
  BEGIN:
    LOOP EVERY config.sweep_interval DO
 
      // Idle timeout check
      idle_candidates ← sessions.query(
        lifecycle_phase IN {ACTIVE},
        last_active_at < NOW() - config.tau_idle
      )
 
      FOR EACH session IN idle_candidates DO
        EMIT_TRACE("session.idle_timeout", session.id)
        LIFECYCLE_TRANSITION(session, SUSPEND, trigger=IDLE_TIMEOUT)
      END FOR
 
      // Absolute timeout check
      absolute_candidates ← sessions.query(
        lifecycle_phase IN {ACTIVE, SUSPENDED, RESUMED},
        created_at < NOW() - config.tau_absolute
      )
 
      FOR EACH session IN absolute_candidates DO
        IF session.lifecycle_phase = ACTIVE THEN
          LIFECYCLE_TRANSITION(session, SUSPEND, trigger=ABSOLUTE_TIMEOUT)
        END IF
        LIFECYCLE_TRANSITION(session, TERMINATE, trigger=ABSOLUTE_TIMEOUT)
      END FOR
 
      // Grace period expiry
      grace_candidates ← sessions.query(
        lifecycle_phase = SUSPENDED,
        suspended_at < NOW() - config.tau_grace
      )
 
      FOR EACH session IN grace_candidates DO
        EXECUTE_CLEANUP_HOOKS(session)
        LIFECYCLE_TRANSITION(session, TERMINATE, trigger=GRACE_EXPIRED)
      END FOR
 
      // Permanent expiry
      expiry_candidates ← sessions.query(
        lifecycle_phase IN {TERMINATED, ARCHIVED},
        terminated_at < NOW() - config.tau_expiry
      )
 
      FOR EACH session IN expiry_candidates DO
        PERMANENT_DELETE(session)
        EMIT_METRIC("session.permanently_deleted", session.id)
      END FOR
 
    END LOOP
  END

18.9.5 TTL Extension and Renewal#

Sessions can request TTL extension through explicit user interaction or programmatic renewal:

$$\tau_{\text{idle}}^{\text{new}} = \min\left(\tau_{\text{idle}} + \Delta\tau, \; \tau_{\text{idle}}^{\max}\right)$$

Renewal is granted only if:

  • The session owner is authenticated.
  • The total session duration remains within $\tau_{\text{absolute}}$.
  • Resource budgets (tokens, cost) have not been exhausted.

18.10 Multi-Session Coordination: Session Graphs, Context Sharing, and Memory Promotion#

18.10.1 Session Relationship Graph#

Related sessions form a directed graph $\mathcal{G}_{\text{session}} = (V, E)$ where:

  • $V$ = set of sessions
  • $E$ = set of typed edges representing relationships

Edge types:

| Edge Type | Semantics | Data Flow |
| --- | --- | --- |
| PARENT → CHILD | Nested session (fork) | Child inherits parent context |
| PREDECESSOR → SUCCESSOR | Sequential task chain | Successor reads predecessor's outputs |
| PEER ↔ PEER | Collaborative sessions | Shared context subset |
| REFERENCE → REFERENT | Cross-session evidence citation | Read-only access to referent's artifacts |

18.10.2 Cross-Session Context Sharing Protocol#

Sharing context across sessions requires explicit declaration and access control:

$$\text{SharePolicy}(\mathcal{S}_a, \mathcal{S}_b) = \left(\text{fields}: \text{Set}\langle\text{FieldPath}\rangle,\ \text{access}: \text{ReadOnly} \mid \text{ReadWrite},\ \text{sync}: \text{Snapshot} \mid \text{Live}\right)$$

Snapshot Sharing#

The shared state is copied at a point in time. Changes in the source session are not reflected in the consumer:

$$\sigma_{\text{shared}} = \text{Project}(\sigma_a, \text{fields}) \text{ at time } t_{\text{share}}$$

Live Sharing#

The consumer session observes changes in real time through a subscription mechanism:

$$\text{Subscribe}(\mathcal{S}_b, \mathcal{S}_a, \text{fields}) \rightarrow \text{ChangeStream}$$

Live sharing introduces consistency challenges: the consumer may observe intermediate states. The system provides causal consistency by attaching version vectors:

$$\text{VV}(\mathcal{S}_a) = \{(f_i, v_i)\}_{i \in \text{fields}}$$

The consumer reads field $f_i$ only when $\text{VV}(\mathcal{S}_a).v_i \geq \text{VV}_{\text{expected}}.v_i$.

18.10.3 Cross-Session Memory Promotion#

When a session discovers a non-obvious insight or correction that would benefit future sessions, it proposes a memory promotion:

$$\text{Propose}(\text{item}, \text{provenance}, \text{confidence}) \rightarrow \text{PromotionQueue}$$

The promotion queue is processed by a validation pipeline that ensures:

  1. Deduplication: The item does not duplicate existing organizational memory.
  2. Factual verification: The item is supported by evidence.
  3. Generalizability: The item applies beyond the current session.
  4. Expiry policy: The item has a defined TTL.

Only items passing all gates are promoted to organizational (semantic) memory.
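The four gates compose into a single validation predicate. The field names, the evidence-count proxy for factual verification, and the confidence threshold below are all illustrative assumptions, not the chapter's canonical schema:

```python
def validate_promotion(item: dict,
                       existing_keys: set[str],
                       min_confidence: float = 0.8) -> bool:
    """Run the promotion gates in order; every gate must pass."""
    gates = (
        item["key"] not in existing_keys,     # 1. deduplication
        item["evidence_count"] >= 1,          # 2. factual verification (proxy)
        item["generalizable"],                # 3. applies beyond this session
        item.get("ttl_days") is not None,     # 4. defined expiry policy
    )
    return all(gates) and item["confidence"] >= min_confidence
```

In a real pipeline each gate would be a separate, auditable stage (the verification gate in particular calls out to evidence checking), but the all-gates-must-pass semantics is the same.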

18.10.4 Pseudo-Algorithm: Multi-Session Coordinator#

ALGORITHM MultiSessionCoordinator
  INPUT:
    session_graph: SessionGraph
    requesting_session: SessionID
    operation: CrossSessionOperation
 
  OUTPUT:
    result: OperationResult
 
  BEGIN:
    MATCH operation:
 
      FORK(parent_id, subtask, inheritance_policy):
        parent ← LOAD_SESSION(parent_id)
 
        // Validate parent is active
        ASSERT parent.lifecycle_phase = ACTIVE
 
        // Allocate child budget
        child_token_budget ← parent.token_budget * inheritance_policy.budget_fraction
        parent.token_budget ← parent.token_budget - child_token_budget
 
        // Create child session
        child ← CREATE_SESSION(
          owner=parent.owner,
          task=subtask,
          isolation=PER_TASK,
          persistence=parent.persistence_policy,
          token_budget=child_token_budget
        )
 
        // Copy inherited state
        FOR EACH field IN inheritance_policy.inherited_fields DO
          child.state.set(field, DEEP_COPY(parent.state.get(field)))
        END FOR
 
        // Bind tools (subset)
        child.tool_bindings ← FILTER(
          parent.tool_bindings,
          λt: t.name IN inheritance_policy.allowed_tools
        )
 
        // Register relationship
        session_graph.add_edge(parent.id, child.id, PARENT_CHILD)
 
        // Register completion dependency
        parent.pending_children.add(child.id)
 
        RETURN OperationResult(success=TRUE, child_session=child)
 
      SHARE_CONTEXT(source_id, target_id, policy):
        source ← LOAD_SESSION(source_id)
        target ← LOAD_SESSION(target_id)
 
        // Verify authorization
        IF NOT AUTHORIZE_CROSS_SESSION(source, target, policy) THEN
          RAISE CrossSessionAccessDenied(source_id, target_id)
        END IF
 
        MATCH policy.sync:
          SNAPSHOT:
            shared_state ← PROJECT(source.state, policy.fields)
            snapshot ← DEEP_COPY(shared_state)
            target.imported_context[source_id] ← ImportedContext(
              snapshot=snapshot,
              timestamp=NOW(),
              source_version=source.state.version,
              access=policy.access
            )
 
          LIVE:
            subscription ← CREATE_SUBSCRIPTION(
              source=source_id,
              target=target_id,
              fields=policy.fields,
              access=policy.access
            )
            REGISTER_SUBSCRIPTION(subscription)
            target.subscriptions.add(subscription)
 
        session_graph.add_edge(source_id, target_id, CONTEXT_SHARE)
        RETURN OperationResult(success=TRUE)
 
      WAIT_FOR_CHILDREN(parent_id):
        parent ← LOAD_SESSION(parent_id)
        pending ← parent.pending_children
 
        FOR EACH child_id IN pending DO
          child ← LOAD_SESSION(child_id)
          IF child.lifecycle_phase IN TERMINAL_STATES THEN
            // Collect child results
            parent.child_results[child_id] ← child.final_output
            parent.pending_children.remove(child_id)
          END IF
        END FOR
 
        IF parent.pending_children IS EMPTY THEN
          RETURN OperationResult(success=TRUE, all_children_complete=TRUE)
        ELSE
          RETURN OperationResult(success=TRUE, all_children_complete=FALSE,
                                 pending=parent.pending_children)
        END IF
  END

18.11 Session Security: Encryption at Rest and in Transit, Access Control, and Session Hijacking Prevention#

18.11.1 Threat Model#

| Threat | Description | Impact |
| --- | --- | --- |
| Session Hijacking | Attacker obtains session ID and impersonates user | Full access to session state and tools |
| State Tampering | Attacker modifies persisted session state | Corrupted execution, policy violations |
| Eavesdropping | Attacker intercepts session data in transit | Confidentiality breach |
| Replay Attack | Attacker replays captured session interactions | Duplicate state mutations |
| Privilege Escalation | Agent or user accesses resources beyond scope | Unauthorized tool invocations |
| Cross-Session Leakage | Isolation failure leaks state between sessions | Confidentiality and integrity breach |

18.11.2 Encryption Architecture#

At Rest#

All persisted session state is encrypted using AES-256-GCM with per-session keys:

$$\text{Ciphertext}(\sigma) = \text{AES-256-GCM}(\sigma, K_{\text{session}}, \text{nonce})$$

where:

$$K_{\text{session}} = \text{KDF}(K_{\text{master}}, \text{session\_id}, \text{context})$$

  • $K_{\text{master}}$ is stored in a Hardware Security Module (HSM) or key management service.
  • $\text{KDF}$ is HKDF-SHA256.
  • The nonce is derived from the checkpoint sequence number to ensure uniqueness.
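For illustration, the per-session key derivation can be written with a stdlib-only HKDF-SHA256 (RFC 5869). This is a sketch: a real deployment keeps $K_{\text{master}}$ inside the HSM/KMS and performs derivation there, and the AES-256-GCM encryption itself (via a vetted crypto library) is out of scope here.

```python
import hashlib
import hmac

def hkdf_sha256(master_key: bytes, salt: bytes, info: bytes,
                length: int = 32) -> bytes:
    """RFC 5869 HKDF-SHA256. `info` binds the derived key to the
    session (e.g. session_id || context), so each session gets a
    distinct key from the same master key."""
    # Extract: PRK = HMAC-SHA256(salt, IKM)
    prk = hmac.new(salt, master_key, hashlib.sha256).digest()
    # Expand: T(i) = HMAC-SHA256(PRK, T(i-1) || info || i)
    okm, block = b"", b""
    for i in range((length + 31) // 32):
        block = hmac.new(prk, block + info + bytes([i + 1]),
                         hashlib.sha256).digest()
        okm += block
    return okm[:length]
```

Derivation is deterministic for a given `(master_key, salt, info)` triple, and changing the session-binding `info` yields an unrelated key, which is exactly the isolation property the per-session key design relies on.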

In Transit#

All session data traversing network boundaries is protected by mutual TLS (mTLS):

  • gRPC channels use TLS 1.3 with certificate pinning.
  • JSON-RPC endpoints require TLS with client certificate authentication.
  • Inter-node migration uses an additional layer of application-level encryption.

18.11.3 Session Token Security#

Session tokens (used for API authentication) must satisfy:

$$\text{SessionToken} = \text{Sign}\left(\text{session\_id} \| \text{user\_id} \| \text{issued\_at} \| \text{expires\_at} \| \text{scope}, \; K_{\text{signing}}\right)$$

Properties:

| Property | Mechanism |
| --- | --- |
| Unpredictability | 256-bit cryptographically random component |
| Binding | Token is bound to session ID, user ID, and IP (optional) |
| Expiry | Short-lived with refresh mechanism ($\tau_{\text{token}} \leq 15$ min) |
| Rotation | Token is rotated on every sensitive operation |
| Revocation | Server-side revocation list checked on every request |

18.11.4 Access Control Model#

Session access is governed by a role-based access control (RBAC) policy augmented with attribute-based constraints:

$$\text{Permit}(s, a, r) = \text{RBAC}(s.\text{role}, a) \wedge \text{ABAC}(s.\text{attributes}, a.\text{attributes}, r.\text{attributes})$$

where $s$ is the subject (user or agent), $a$ is the action, and $r$ is the resource (session state field or tool).

| Role | Permissions |
| --- | --- |
| Owner | Full read/write, lifecycle control, migration, sharing |
| Agent | Scoped read/write per isolation policy, tool invocation |
| Viewer | Read-only access to session outputs and metrics |
| Auditor | Read-only access to audit records and traces |
| Admin | Session termination, migration, and policy override |

18.11.5 Anti-Hijacking Measures#

ALGORITHM ValidateSessionRequest
  INPUT:
    request: SessionRequest
    session_store: SessionStore
    security_config: SecurityConfig
 
  OUTPUT:
    validated: Bool
 
  BEGIN:
    // Token validation
    token ← request.session_token
    IF NOT VERIFY_SIGNATURE(token, security_config.signing_key) THEN
      AUDIT_LOG("invalid_token_signature", request)
      RETURN FALSE
    END IF
 
    claims ← DECODE_CLAIMS(token)
 
    // Expiry check
    IF claims.expires_at < NOW() THEN
      AUDIT_LOG("expired_token", request)
      RETURN FALSE
    END IF
 
    // Revocation check
    IF REVOCATION_LIST.contains(token.id) THEN
      AUDIT_LOG("revoked_token", request)
      RETURN FALSE
    END IF
 
    // Session binding check
    session ← session_store.load(claims.session_id)
    IF session IS NULL OR session.lifecycle_phase IN TERMINAL_STATES THEN
      AUDIT_LOG("invalid_session", request)
      RETURN FALSE
    END IF
 
    // User binding
    IF claims.user_id ≠ session.owner_id THEN
      AUDIT_LOG("user_session_mismatch", request)
      RETURN FALSE
    END IF
 
    // IP binding (optional)
    IF security_config.enforce_ip_binding THEN
      IF request.source_ip ≠ claims.bound_ip THEN
        AUDIT_LOG("ip_mismatch", request)
        RETURN FALSE
      END IF
    END IF
 
    // Rate limiting
    IF NOT RATE_LIMITER.allow(claims.session_id, request.operation) THEN
      AUDIT_LOG("rate_limited", request)
      RETURN FALSE
    END IF
 
    // Anomaly detection: unusual request patterns
    IF ANOMALY_DETECTOR.is_suspicious(request, session.interaction_history) THEN
      AUDIT_LOG("anomalous_request", request)
      IF security_config.block_anomalous THEN
        RETURN FALSE
      ELSE
        FLAG_FOR_REVIEW(request)
      END IF
    END IF
 
    // Token rotation on sensitive operations
    IF request.operation IN SENSITIVE_OPERATIONS THEN
      new_token ← ROTATE_TOKEN(token, security_config)
      request.response_headers["X-New-Session-Token"] ← new_token
      REVOCATION_LIST.add(token.id, ttl=30s)  // Grace period for in-flight requests
    END IF
 
    RETURN TRUE
  END

18.11.6 Security Invariants#

The session security subsystem must maintain at all times:

$$\mathcal{I}_{\text{sec}_1}: \quad \forall \mathcal{S}: \text{state\_at\_rest}(\mathcal{S}) \text{ is encrypted}$$

$$\mathcal{I}_{\text{sec}_2}: \quad \forall \text{transit}(\mathcal{S}): \text{channel is mTLS-protected}$$

$$\mathcal{I}_{\text{sec}_3}: \quad \forall \text{token}: \text{lifetime}(\text{token}) \leq \tau_{\text{token}}^{\max}$$

$$\mathcal{I}_{\text{sec}_4}: \quad \forall (\mathcal{S}_a, \mathcal{S}_b): \Gamma(\mathcal{S}_a).\text{owner} \neq \Gamma(\mathcal{S}_b).\text{owner} \Rightarrow \text{Isolated}(\mathcal{S}_a, \mathcal{S}_b)$$

$$\mathcal{I}_{\text{sec}_5}: \quad \forall \text{mutation}: \text{authenticated} \wedge \text{authorized} \wedge \text{audited}$$

18.12 Session Analytics: Duration, Turn Count, Tool Usage, Error Rate, and User Satisfaction Correlation#

18.12.1 Session Metrics Taxonomy#

Every session emits a structured set of metrics that enable operational monitoring, capacity planning, quality improvement, and cost optimization.

Temporal Metrics#

$$\mathbf{m}_{\text{temporal}}(\mathcal{S}) = \begin{bmatrix} T_{\text{total}} \\ T_{\text{active}} \\ T_{\text{suspended}} \\ T_{\text{time\_to\_first\_response}} \\ T_{\text{mean\_turn\_latency}} \\ T_{\text{p99\_turn\_latency}} \end{bmatrix}$$

where:

$$T_{\text{mean\_turn\_latency}} = \frac{1}{N_{\text{turns}}} \sum_{i=1}^{N_{\text{turns}}} \Delta t_i$$

$$T_{\text{p99\_turn\_latency}} = \text{Percentile}_{99}\left(\{\Delta t_i\}_{i=1}^{N_{\text{turns}}}\right)$$
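These two statistics are straightforward to compute from the per-turn latencies; a minimal sketch using the nearest-rank percentile definition (other percentile conventions, e.g. linear interpolation, give slightly different p99 values):

```python
import math

def turn_latency_metrics(turn_latencies):
    """Mean and p99 turn latency over a non-empty list of latencies."""
    n = len(turn_latencies)
    mean = sum(turn_latencies) / n
    ordered = sorted(turn_latencies)
    rank = math.ceil(0.99 * n)   # nearest-rank definition of the 99th percentile
    p99 = ordered[rank - 1]
    return mean, p99
```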

Interaction Metrics#

$$\mathbf{m}_{\text{interaction}}(\mathcal{S}) = \begin{bmatrix} N_{\text{turns}} \\ N_{\text{user\_messages}} \\ N_{\text{agent\_messages}} \\ N_{\text{tool\_invocations}} \\ N_{\text{retrieval\_queries}} \\ N_{\text{repair\_cycles}} \end{bmatrix}$$

Resource Consumption Metrics#

$$\mathbf{m}_{\text{resource}}(\mathcal{S}) = \begin{bmatrix} T_{\text{tokens\_input}} \\ T_{\text{tokens\_output}} \\ C_{\text{total\_cost\_usd}} \\ N_{\text{api\_calls}} \\ N_{\text{checkpoints}} \\ |\sigma_{\text{peak\_state\_size}}| \end{bmatrix}$$

Quality Metrics#

$$\mathbf{m}_{\text{quality}}(\mathcal{S}) = \begin{bmatrix} r_{\text{task\_completion}} \\ r_{\text{error\_rate}} \\ r_{\text{repair\_success\_rate}} \\ r_{\text{hallucination\_rate}} \\ r_{\text{verification\_pass\_rate}} \end{bmatrix}$$

where:

$$r_{\text{error\_rate}} = \frac{N_{\text{errors}}}{N_{\text{tool\_invocations}} + N_{\text{generations}}}$$

$$r_{\text{repair\_success\_rate}} = \frac{N_{\text{repairs\_successful}}}{N_{\text{repairs\_attempted}}}$$

18.12.2 User Satisfaction Modeling#

User satisfaction $\hat{U}$ is modeled as a function of observable session metrics:

$$\hat{U}(\mathcal{S}) = \sigma\left(\mathbf{w}^T \cdot \mathbf{f}(\mathcal{S}) + b\right)$$

where $\sigma$ is the sigmoid function, $\mathbf{f}(\mathcal{S})$ is the feature vector, and $\mathbf{w}, b$ are learned parameters calibrated against explicit user feedback (ratings, thumbs up/down, task completion signals).

Feature vector:

$$\mathbf{f}(\mathcal{S}) = \begin{bmatrix} \log(T_{\text{time\_to\_first\_response}} + 1) \\ \log(N_{\text{turns}} + 1) \\ r_{\text{task\_completion}} \\ 1 - r_{\text{error\_rate}} \\ r_{\text{verification\_pass\_rate}} \\ -\log(N_{\text{repair\_cycles}} + 1) \\ \mathbb{1}[\text{session\_completed\_normally}] \\ -\log(C_{\text{total\_cost\_usd}} + 1) \end{bmatrix}$$

The model is retrained periodically on accumulated feedback data. The predicted satisfaction score is used for:

  • Proactive intervention: If $\hat{U}$ drops below a threshold during an active session, the system offers assistance or escalates.
  • Quality regression detection: A decline in aggregate $\hat{U}$ across sessions signals systemic issues.
  • A/B testing: Different agent configurations are compared by their effect on $\hat{U}$.
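The prediction step reduces to building $\mathbf{f}(\mathcal{S})$ from raw metrics and applying the sigmoid. A minimal sketch: the metric and feature names are illustrative, and the weights shown in any real deployment would come from fitting against feedback data, not be hand-set.

```python
import math

def predict_satisfaction(metrics: dict, weights: dict, bias: float) -> float:
    """Sigmoid over the feature vector f(S) defined above."""
    # Build f(S) term by term, matching the feature vector in the text.
    f = {
        "ttfr":        math.log(metrics["time_to_first_response"] + 1),
        "turns":       math.log(metrics["n_turns"] + 1),
        "completion":  metrics["task_completion"],
        "reliability": 1 - metrics["error_rate"],
        "verify":      metrics["verification_pass_rate"],
        "repairs":     -math.log(metrics["n_repair_cycles"] + 1),
        "completed":   1.0 if metrics["completed_normally"] else 0.0,
        "cost":        -math.log(metrics["total_cost_usd"] + 1),
    }
    z = sum(weights[k] * v for k, v in f.items()) + bias  # w^T f(S) + b
    return 1 / (1 + math.exp(-z))                          # sigma(z)
```

The log transforms compress heavy-tailed counts (turns, latency, cost) so a single long session does not dominate the linear term.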

18.12.3 Session Analytics Pipeline#

ALGORITHM ComputeSessionAnalytics
  INPUT:
    session: Session (completed or terminated)
 
  OUTPUT:
    analytics: SessionAnalyticsRecord
 
  BEGIN:
    // ─── Temporal Metrics ───
    T_total ← session.completed_at - session.created_at
    T_active ← SUM(duration(interval) FOR interval IN session.active_intervals)
    T_suspended ← T_total - T_active
    T_ttfr ← session.first_response_at - session.created_at
 
    turn_latencies ← [turn.response_at - turn.request_at
                       FOR turn IN session.interaction_history]
    T_mean_turn ← MEAN(turn_latencies)
    T_p99_turn ← PERCENTILE(turn_latencies, 99)
 
    // ─── Interaction Metrics ───
    N_turns ← LEN(session.interaction_history)
    N_user_msgs ← COUNT(turn FOR turn IN session.interaction_history
                         IF turn.actor = USER)
    N_agent_msgs ← COUNT(turn FOR turn IN session.interaction_history
                          IF turn.actor = AGENT)
    N_tool_invocations ← LEN(session.tool_trace)
    N_retrieval_queries ← LEN(session.retrieval_trace)
    N_repair_cycles ← SUM(loop.repair_count FOR loop IN session.agent_loops)
 
    // ─── Resource Metrics ───
    T_tokens_in ← SUM(turn.input_tokens FOR turn IN session.interaction_history)
    T_tokens_out ← SUM(turn.output_tokens FOR turn IN session.interaction_history)
    C_total ← session.cost_accumulator.total()
    N_api_calls ← session.api_call_counter.total()
    N_checkpoints ← LEN(session.checkpoints)
    peak_state_size ← MAX(SIZE(cp.payload) FOR cp IN session.checkpoints)
 
    // ─── Quality Metrics ───
    r_task_completion ← COMPUTE_TASK_COMPLETION_RATE(session)
    N_errors ← COUNT(inv FOR inv IN session.tool_trace IF inv.is_error)
    r_error_rate ← N_errors / MAX(N_tool_invocations + N_agent_msgs, 1)
    r_repair_success ← COMPUTE_REPAIR_SUCCESS_RATE(session.agent_loops)
    r_hallucination ← COMPUTE_HALLUCINATION_RATE(session.verification_results)
    r_verify_pass ← COMPUTE_VERIFICATION_PASS_RATE(session.verification_results)
 
    // ─── Satisfaction Prediction ───
    feature_vector ← CONSTRUCT_FEATURE_VECTOR(
      T_ttfr, N_turns, r_task_completion, r_error_rate,
      r_verify_pass, N_repair_cycles,
      session.lifecycle_phase = COMPLETED,
      C_total
    )
    predicted_satisfaction ← SATISFACTION_MODEL.predict(feature_vector)
 
    // ─── Tool Usage Breakdown ───
    tool_usage ← {}
    FOR EACH invocation IN session.tool_trace DO
      key ← invocation.tool_name
      IF key NOT IN tool_usage THEN
        tool_usage[key] ← ToolUsageRecord(count=0, errors=0,
                                           total_latency=0, total_cost=0)
      END IF
      tool_usage[key].count ← tool_usage[key].count + 1
      IF invocation.is_error THEN
        tool_usage[key].errors ← tool_usage[key].errors + 1
      END IF
      tool_usage[key].total_latency ← tool_usage[key].total_latency + invocation.latency
      tool_usage[key].total_cost ← tool_usage[key].total_cost + invocation.cost
    END FOR
 
    // ─── Assemble Record ───
    analytics ← SessionAnalyticsRecord(
      session_id=session.id,
      owner_id=session.owner_id,
      task_type=session.task_type,
 
      temporal=TemporalMetrics(T_total, T_active, T_suspended, T_ttfr,
                                T_mean_turn, T_p99_turn),
      interaction=InteractionMetrics(N_turns, N_user_msgs, N_agent_msgs,
                                      N_tool_invocations, N_retrieval_queries,
                                      N_repair_cycles),
      resource=ResourceMetrics(T_tokens_in, T_tokens_out, C_total,
                                N_api_calls, N_checkpoints, peak_state_size),
      quality=QualityMetrics(r_task_completion, r_error_rate, r_repair_success,
                              r_hallucination, r_verify_pass),
      tool_usage=tool_usage,
      predicted_satisfaction=predicted_satisfaction,
      computed_at=NOW()
    )
 
    // Persist to analytics store
    ANALYTICS_STORE.write(analytics)
 
    // Emit to monitoring system
    EMIT_METRICS_BATCH(analytics)
 
    RETURN analytics
  END

18.12.4 Aggregate Analytics and Operational Dashboards#

Individual session analytics are aggregated across dimensions for operational insight:

| Aggregation Dimension | Example Metrics | Purpose |
| --- | --- | --- |
| By User | Mean satisfaction, session count, error rate | User experience monitoring |
| By Task Type | Completion rate, mean duration, repair frequency | Task-specific optimization |
| By Agent Version | Quality scores, latency, cost per session | A/B testing, regression detection |
| By Tool | Error rate, mean latency, invocation count | Tool reliability monitoring |
| By Time Window | Throughput, peak concurrency, cost rate | Capacity planning |
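Each aggregation is a group-by over per-session analytics records followed by a roll-up. A minimal in-memory sketch (record field names are illustrative; a production pipeline would run this in the analytics store's query engine):

```python
from collections import defaultdict
from statistics import mean

def aggregate_by(records, dimension):
    """Group per-session analytics records by one dimension and roll up
    example metrics (session count, mean satisfaction, mean error rate)."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec[dimension]].append(rec)
    return {
        key: {
            "sessions": len(recs),
            "mean_satisfaction": mean(r["predicted_satisfaction"] for r in recs),
            "mean_error_rate": mean(r["error_rate"] for r in recs),
        }
        for key, recs in groups.items()
    }
```

The same function serves every row of the table above by changing `dimension` ("owner_id", "task_type", "agent_version", and so on).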

18.12.5 Anomaly Detection on Session Metrics#

Session metrics are monitored for anomalies using a statistical process control approach:

$$\text{Anomaly}(m_t) \Leftrightarrow |m_t - \bar{m}| > k \cdot s_m$$

where $\bar{m}$ is the rolling mean, $s_m$ is the rolling standard deviation, and $k$ is the sensitivity factor (typically $k = 3$ for $3\sigma$ bounds).
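The rule is a direct translation into code. A minimal sketch over a rolling window of prior observations (the function name is ours; the window would come from the metrics store):

```python
from statistics import mean, stdev

def spc_anomaly(window, current, k=3.0):
    """Statistical process control check: flag `current` when it falls
    outside k rolling standard deviations of the rolling mean."""
    m_bar = mean(window)
    s_m = stdev(window)   # sample standard deviation over the window
    return abs(current - m_bar) > k * s_m
```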

For multivariate anomaly detection across the full metric vector:

$$D_{\text{Mahal}}(\mathbf{m}_t) = \sqrt{(\mathbf{m}_t - \boldsymbol{\mu})^T \Sigma^{-1} (\mathbf{m}_t - \boldsymbol{\mu})}$$

where $\boldsymbol{\mu}$ and $\Sigma$ are the rolling mean vector and covariance matrix. An alert fires when $D_{\text{Mahal}} > D_{\text{threshold}}$.
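The distance itself is a few lines with NumPy. A minimal sketch, solving the linear system rather than inverting $\Sigma$ explicitly for numerical stability (this assumes $\Sigma$ is well-conditioned; degenerate covariance would call for a pseudo-inverse or regularization):

```python
import numpy as np

def mahalanobis(m_t, mu, sigma):
    """D_Mahal(m_t) = sqrt((m_t - mu)^T Sigma^{-1} (m_t - mu))."""
    delta = np.asarray(m_t, dtype=float) - np.asarray(mu, dtype=float)
    # solve(Sigma, delta) computes Sigma^{-1} delta without forming the inverse.
    return float(np.sqrt(delta @ np.linalg.solve(np.asarray(sigma, dtype=float), delta)))
```

With the identity covariance this reduces to the Euclidean distance, which is a useful sanity check.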

18.12.6 Feedback Loop: Analytics to Architecture#

Session analytics feed directly back into architectural decisions:

$$\text{Analytics} \xrightarrow{\text{inform}} \text{Decisions}$$
| Observed Signal | Architectural Response |
| --- | --- |
| High $T_{\text{p99\_turn\_latency}}$ | Increase retrieval cache TTL, pre-warm tool connections |
| High $r_{\text{error\_rate}}$ on a specific tool | Enable circuit breaker, add fallback tool |
| Low $r_{\text{task\_completion}}$ for a task type | Improve decomposition heuristics, add retrieval sources |
| High $N_{\text{repair\_cycles}}$ | Strengthen verification, improve planning prompts |
| Low predicted satisfaction | Trigger proactive user assistance, escalation |
| Excessive $|\sigma_{\text{peak\_state\_size}}|$ | Increase checkpoint compaction frequency, prune history |
| High suspension frequency | Tune timeout parameters, improve resource allocation |

This feedback loop ensures the session architecture evolves mechanically in response to empirical evidence, not intuition.


Synthesis: Session Architecture as a Systems Engineering Discipline#

The session architecture presented in this chapter treats the session not as a convenience abstraction but as a formally specified, cryptographically protected, lifecycle-managed, migratable execution envelope. The key architectural contributions are summarized:

| Architectural Principle | Implementation |
| --- | --- |
| Typed, versioned state | Schema evolution with composable migrations and integrity checksums |
| Deterministic lifecycle | Finite state machine with guarded transitions and cleanup hooks |
| Tiered persistence | L0–L3 persistence with cost-durability optimization |
| Isolation enforcement | Per-user, per-task, per-agent, nested; RBAC + ABAC access control |
| Resumable execution | Multi-phase rehydration: state → context → tools → memory → integrity |
| Migratable sessions | Three-phase commit protocol with incremental transfer optimization |
| Security by design | Encryption at rest/in transit, token rotation, anomaly detection |
| Measurable quality | Comprehensive metrics taxonomy with satisfaction prediction |
| Feedback-driven evolution | Analytics pipeline feeds back into timeout tuning, tool selection, and planning |

A session that cannot survive a crash is a toy. A session that cannot migrate is a monolith. A session that cannot be audited is a liability. A session that cannot be measured cannot be improved. The architecture in this chapter ensures none of these failures are possible.


End of Chapter 18.