Preamble#
In production agentic systems, the session is the fundamental unit of continuity. It binds a user's intent to an agent's execution state across time, space, and failure boundaries. Without a formally defined session primitive, agentic systems degrade to stateless request-response handlers—incapable of multi-turn reasoning, resumable execution, collaborative workflows, or any form of durable interaction. This chapter formalizes the session as a first-class architectural primitive with typed state, versioned schemas, deterministic lifecycle transitions, isolation guarantees, persistence strategies, and security invariants. Every construct is specified with the same rigor applied to database transaction managers, distributed consensus protocols, and operating system process models. A session is not a "conversation history blob." It is a managed execution envelope with defined boundaries, serializable state, migration capability, and measurable operational characteristics.
18.1 Session as a First-Class Architectural Primitive#
18.1.1 Definition and Architectural Position#
A session is a bounded, stateful execution envelope that encapsulates all context, memory, tool bindings, agent state, and interaction history required to maintain continuity for a logically coherent unit of work.
Formally:

$$\mathcal{S} = \langle \mathrm{id},\ \Sigma,\ \phi,\ \mathcal{I},\ \mathcal{M},\ \mathcal{T},\ \mathcal{H},\ \mathcal{P},\ \mu \rangle$$

where:
| Symbol | Type | Semantics |
|---|---|---|
| $\mathrm{id}$ | SessionID (UUID v7, temporally sortable) | Globally unique, immutable session identifier |
| $\Sigma$ | SessionState (typed, versioned) | Current mutable state of the session |
| $\phi$ | LifecyclePhase | Current lifecycle phase (enum) |
| $\mathcal{I}$ | IsolationContext | Isolation boundaries and ownership descriptors |
| $\mathcal{M}$ | SessionMemory | Session-scoped memory layers |
| $\mathcal{T}$ | ToolBindingSet | Bound tool instances with caller-scoped authorization |
| $\mathcal{H}$ | InteractionHistory | Ordered turn-level interaction log |
| $\mathcal{P}$ | PersistencePolicy | Checkpointing, WAL, and expiry configuration |
| $\mu$ | SessionMetadata | Creation time, owner, TTL, tags, lineage |
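As a minimal sketch, the session tuple can be modeled as a typed record. Field types here are illustrative stand-ins (plain dicts and strings rather than the full typed schemas defined later in this chapter), and `uuid4` substitutes for the temporally sortable UUID v7 the text specifies:

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any
import uuid

class LifecyclePhase(Enum):
    INIT = "INIT"
    ACTIVE = "ACTIVE"
    SUSPENDED = "SUSPENDED"
    RESUMED = "RESUMED"
    COMPLETED = "COMPLETED"
    TERMINATED = "TERMINATED"
    ARCHIVED = "ARCHIVED"

@dataclass
class Session:
    """Illustrative shape of the session tuple; types are stand-ins."""
    session_id: str                 # SessionID (UUID v7 in the text)
    state: dict[str, Any]           # SessionState (typed, versioned)
    phase: LifecyclePhase           # LifecyclePhase
    isolation: dict[str, Any]       # IsolationContext
    memory: dict[str, Any]          # SessionMemory layers
    tools: dict[str, Any]           # ToolBindingSet
    history: list[dict]             # InteractionHistory
    persistence: dict[str, Any]     # PersistencePolicy
    metadata: dict[str, Any]        # SessionMetadata

def new_session(owner: str) -> Session:
    """Create a session in the INIT phase with an empty, version-0 state."""
    return Session(
        session_id=str(uuid.uuid4()),
        state={"version": 0},
        phase=LifecyclePhase.INIT,
        isolation={"owner": owner},
        memory={}, tools={}, history=[],
        persistence={"tier": "L1"},
        metadata={"owner": owner},
    )
```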
18.1.2 Why Sessions Must Be First-Class#
Sessions are promoted from implicit infrastructure to explicit architectural primitives for the following reasons:
- Continuity Under Failure: Without durable session state, a crash or timeout destroys all accumulated context. First-class sessions enable resumption from the last consistent checkpoint.
- Isolation Enforcement: Concurrent users, tasks, or agents must not observe or mutate each other's state. First-class sessions provide the isolation boundary analogous to process isolation in operating systems.
- Migration and Scaling: Sessions must be movable across nodes, regions, and agent instances without loss of state. This requires serializable, versioned state schemas—impossible with ad hoc in-memory state.
- Observability and Auditing: Every session transition, tool invocation, and state mutation must be traceable. First-class sessions provide the natural unit of observation.
- Cost Attribution: Token consumption, tool invocation costs, and compute usage are attributed per session, enabling granular billing, budgeting, and resource governance.
18.1.3 Session vs. Related Concepts#
| Concept | Relationship to Session | Key Distinction |
|---|---|---|
| Conversation | A session may contain one or more conversations | Conversation is interaction-level; session is execution-level |
| Agent Loop Execution | An agent loop executes within a session | The loop is a control structure; the session is its state envelope |
| Transaction | A session may contain multiple transactions | Transactions have ACID properties; sessions have lifecycle and continuity semantics |
| Process | Sessions are analogous to OS processes | Sessions are distributed, serializable, and migratable |
| Context Window | The active context window is a view into session state | Context window is bounded by model limits; session state is unbounded but tiered |
18.1.4 Formal Session Invariants#
Every session in the system must satisfy the following invariants at all times:

- I1 (Identity): the session identifier is globally unique and immutable after creation.
- I2 (Monotonicity): the state version number is monotonically increasing; no mutation may decrease it.
- I3 (Lifecycle validity): the lifecycle phase is always a state of the transition machine in Section 18.2, reached only through defined transitions.
- I4 (Integrity): every persisted state carries a verifiable checksum over its canonical serialization.
- I5 (Isolation): no state is readable or writable across session boundaries except through an explicit sharing policy (Section 18.4).
18.2 Session Lifecycle: Init → Active → Suspended → Resumed → Completed → Archived#
18.2.1 Lifecycle as a Finite State Machine#
The session lifecycle is modeled as a deterministic finite state machine:

$$\mathcal{L} = (Q, E, \delta, q_0, F)$$

where:
- $Q = \{\text{INIT}, \text{ACTIVE}, \text{SUSPENDED}, \text{RESUMED}, \text{COMPLETED}, \text{TERMINATED}, \text{ARCHIVED}\}$ = set of lifecycle phases
- $E$ = set of lifecycle events
- $\delta : Q \times E \rightharpoonup Q$ = (partial) transition function
- $q_0 = \text{INIT}$ = initial state
- $F = \{\text{COMPLETED}, \text{TERMINATED}, \text{ARCHIVED}\}$ = terminal states (COMPLETED and TERMINATED admit only the archive transition)
18.2.2 Transition Function#
The complete transition table:
| Current State | Event | Next State | Guard Condition | Side Effect |
|---|---|---|---|---|
INIT | activate | ACTIVE | State schema validated, tools bound | Emit session.activated trace |
ACTIVE | suspend | SUSPENDED | Checkpoint persisted | Flush working memory, release tool locks |
ACTIVE | complete | COMPLETED | All exit criteria met | Final checkpoint, provenance sealed |
ACTIVE | terminate | TERMINATED | Operator command or unrecoverable error | Compensating actions, error record |
ACTIVE | timeout | SUSPENDED | TTL or idle timeout exceeded | Auto-checkpoint, release resources |
SUSPENDED | resume | RESUMED | Checkpoint available, resources acquired | Rehydrate context, rebind tools |
SUSPENDED | expire | TERMINATED | Expiry TTL exceeded | Cleanup, archive state |
RESUMED | reactivate | ACTIVE | State consistency verified | Resume from checkpoint |
COMPLETED | archive | ARCHIVED | Retention policy evaluated | Move to cold storage |
TERMINATED | archive | ARCHIVED | Cleanup complete | Move to cold storage |
Formally, a transition fires only when its guard condition holds:

$$\delta(q, e) = q' \iff (q, e, q') \in \text{TransitionTable} \wedge \text{guard}(q, e) = \text{true}$$
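The transition table above can be sketched as a guarded lookup. This is a minimal illustration in Python; guard evaluation and side effects are elided, and the error type name is an assumption matching the pseudocode below:

```python
# Transition table from Section 18.2.2 as a (phase, event) -> phase mapping.
TRANSITIONS = {
    ("INIT", "activate"): "ACTIVE",
    ("ACTIVE", "suspend"): "SUSPENDED",
    ("ACTIVE", "complete"): "COMPLETED",
    ("ACTIVE", "terminate"): "TERMINATED",
    ("ACTIVE", "timeout"): "SUSPENDED",
    ("SUSPENDED", "resume"): "RESUMED",
    ("SUSPENDED", "expire"): "TERMINATED",
    ("RESUMED", "reactivate"): "ACTIVE",
    ("COMPLETED", "archive"): "ARCHIVED",
    ("TERMINATED", "archive"): "ARCHIVED",
}

class InvalidTransitionError(Exception):
    """Raised for a (phase, event) pair outside the transition table."""

def transition(phase: str, event: str) -> str:
    """delta(q, e): return the next phase, or raise on an undefined pair."""
    try:
        return TRANSITIONS[(phase, event)]
    except KeyError:
        raise InvalidTransitionError(
            f"{event!r} is not valid in phase {phase!r}") from None
```

Because the mapping is a plain dict, the set of legal transitions is auditable at a glance and exhaustively testable.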
18.2.3 Lifecycle Duration Accounting#
Define the time spent in each phase $p \in Q$ as the sum of all intervals the session spends in that phase:

$$T_p = \sum_i \left(t^{\text{exit}}_i - t^{\text{enter}}_i\right)$$

The active ratio measures session efficiency:

$$\rho_{\text{active}} = \frac{T_{\text{ACTIVE}}}{\sum_{p \in Q} T_p}$$

Low $\rho_{\text{active}}$ indicates excessive suspension—possibly due to resource contention, slow human approval, or infrastructure latency.

The suspension frequency is:

$$f_{\text{susp}} = \frac{N_{\text{suspensions}}}{T_{\text{total}}}$$

High $f_{\text{susp}}$ triggers investigation into timeout tuning, resource provisioning, or task decomposition quality.
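Both metrics are simple ratios over recorded phase durations. A minimal sketch, assuming durations are tracked per phase in seconds:

```python
def active_ratio(phase_durations: dict[str, float]) -> float:
    """rho_active = T_ACTIVE / total time across all phases."""
    total = sum(phase_durations.values())
    return phase_durations.get("ACTIVE", 0.0) / total if total else 0.0

def suspension_frequency(n_suspensions: int, total_seconds: float) -> float:
    """f_susp = suspensions per hour of total session lifetime."""
    return n_suspensions / (total_seconds / 3600.0) if total_seconds else 0.0
```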
18.2.4 Pseudo-Algorithm: Session Lifecycle Manager#
ALGORITHM SessionLifecycleManager
INPUT:
session_id: SessionID
event: LifecycleEvent
context: SystemContext
OUTPUT:
new_phase: LifecyclePhase
side_effects: List<SideEffect>
BEGIN:
session ← LOAD_SESSION(session_id)
current_phase ← session.lifecycle_phase
side_effects ← []
// Validate transition
IF (current_phase, event) NOT IN TRANSITION_TABLE THEN
RAISE InvalidTransitionError(current_phase, event)
END IF
target_phase ← TRANSITION_TABLE[(current_phase, event)]
guard ← GUARD_TABLE[(current_phase, event)]
// Evaluate guard condition
IF NOT guard.evaluate(session, context) THEN
RAISE GuardFailedError(current_phase, event, guard.reason)
END IF
// Execute pre-transition hooks
FOR EACH hook IN PRE_TRANSITION_HOOKS[(current_phase, target_phase)] DO
hook.execute(session, context)
END FOR
// Phase-specific side effects
MATCH (current_phase, target_phase):
(INIT, ACTIVE):
VALIDATE_STATE_SCHEMA(session.state)
BIND_TOOLS(session, context.tool_registry)
INITIALIZE_MEMORY_LAYERS(session)
APPEND side_effects, EmitTrace("session.activated", session_id)
(ACTIVE, SUSPENDED):
cp ← CREATE_CHECKPOINT(session)
PERSIST_CHECKPOINT(cp)
FLUSH_WORKING_MEMORY(session)
RELEASE_TOOL_LOCKS(session)
APPEND side_effects, EmitTrace("session.suspended", session_id)
APPEND side_effects, ReleaseResources(session.resource_claims)
(ACTIVE, COMPLETED):
VERIFY_EXIT_CRITERIA(session)
cp ← CREATE_FINAL_CHECKPOINT(session)
PERSIST_CHECKPOINT(cp)
SEAL_PROVENANCE(session)
PROMOTE_EPISODIC_MEMORY(session)
APPEND side_effects, EmitTrace("session.completed", session_id)
(SUSPENDED, RESUMED):
ACQUIRE_RESOURCES(session.resource_requirements)
REHYDRATE_CONTEXT(session)
REBIND_TOOLS(session, context.tool_registry)
VERIFY_STATE_CONSISTENCY(session)
APPEND side_effects, EmitTrace("session.resumed", session_id)
(RESUMED, ACTIVE):
// Verify rehydration completeness
ASSERT session.context_integrity_check() = PASS
APPEND side_effects, EmitTrace("session.reactivated", session_id)
(*, TERMINATED):
EXECUTE_COMPENSATING_ACTIONS(session)
PERSIST_FAILURE_STATE(session)
RELEASE_ALL_RESOURCES(session)
APPEND side_effects, EmitTrace("session.terminated", session_id)
(*, ARCHIVED):
MOVE_TO_COLD_STORAGE(session)
DELETE_HOT_STATE(session)
APPEND side_effects, EmitTrace("session.archived", session_id)
// Update lifecycle phase
session.lifecycle_phase ← target_phase
session.state.version ← session.state.version + 1
session.transition_log.append(TransitionRecord(
from=current_phase,
to=target_phase,
event=event,
timestamp=NOW(),
actor=context.actor_id
))
PERSIST_SESSION_METADATA(session)
// Execute post-transition hooks
FOR EACH hook IN POST_TRANSITION_HOOKS[(current_phase, target_phase)] DO
hook.execute(session, context)
END FOR
RETURN (target_phase, side_effects)
END

18.3 Session State Schema: Typed, Versioned, Serializable, and Diff-Capable#
18.3.1 State Schema Formalization#
The session state is a typed record with a versioned schema:

$$\Sigma = \langle v,\ v_{\text{schema}},\ \{(f_1 : \tau_1), (f_2 : \tau_2), \ldots, (f_n : \tau_n)\} \rangle$$

where:
- $v$ is the monotonically increasing state version number
- $v_{\text{schema}}$ follows semantic versioning
- Each field $f_i$ has an associated type $\tau_i$ from the type system
The type system supports:
| Type Class | Examples | Serialization |
|---|---|---|
| Primitive | Int64, Float64, String, Bool, Bytes | Direct |
| Temporal | Timestamp, Duration, TTL | ISO 8601 / epoch millis |
| Collection | List<T>, Map<K, V>, Set<T> | Ordered JSON arrays / objects |
| Composite | AgentState, PlanSnapshot, MemorySummary | Nested typed records |
| Reference | Ref<Checkpoint>, Ref<EvidenceBundle> | URI + content hash |
| Optional | Option<T> | Nullable with explicit None |
18.3.2 Schema Versioning and Evolution#
Schema evolution must support backward compatibility for session resumption across software upgrades. The rules follow a strict contract:
| Change Type | Version Impact | Migration Requirement |
|---|---|---|
| Add optional field with default | MINOR bump | None — deserializer uses default |
| Add required field | MAJOR bump | Migration function required |
| Remove field | MAJOR bump | Migration function to drop field |
| Change field type | MAJOR bump | Migration function to convert |
| Rename field | MAJOR bump | Migration function to remap |
| Add enum variant | MINOR bump | Old deserializer ignores unknown |
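The evolution rules above can be sketched as a registry of migration functions keyed by (from, to) schema version. The field names (`priority`, `owner`) and version numbers are hypothetical, chosen to illustrate one MINOR change (add optional field with default) and one MAJOR change (rename field):

```python
# Registry of migration functions indexed by (from_version, to_version).
MIGRATIONS = {}

def migration(src: str, dst: str):
    """Decorator that registers a migration function for a version step."""
    def register(fn):
        MIGRATIONS[(src, dst)] = fn
        return fn
    return register

@migration("1.0.0", "1.1.0")
def add_priority_default(state: dict) -> dict:
    # MINOR: new optional field with a default; old payloads still load.
    state.setdefault("priority", "normal")
    return state

@migration("1.1.0", "2.0.0")
def rename_owner(state: dict) -> dict:
    # MAJOR: renaming a field requires an explicit remap.
    if "owner" in state:
        state["owner_id"] = state.pop("owner")
    return state

def migrate(state: dict, path: list[tuple[str, str]]) -> dict:
    """Compose migrations along a version path (oldest step first)."""
    for step in path:
        state = MIGRATIONS[step](state)
    return state
```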
The migration function for version is:
Migration functions are composable:
and stored in a migration registry indexed by .
18.3.3 State Diff and Merge Operations#
For multi-session coordination (Section 18.10) and migration (Section 18.8), the system must compute structural diffs and merges on session state.
Structural Diff#
Given two state versions $\Sigma_a$ and $\Sigma_b$:

$$\Delta(\Sigma_a, \Sigma_b) = \{(f, \text{op}, \text{value})\}$$

where $\text{op} \in \{\text{ADD}, \text{MODIFY}, \text{DELETE}\}$ and each entry records one field-level change. The diff size determines migration cost: the cost of transferring or applying a delta grows with $|\Delta|$ rather than with the full state size.
Three-Way Merge#
For cross-session state sharing, a three-way merge uses a common ancestor $\Sigma_{\text{base}}$: changes made on only one side are applied directly, and fields modified divergently on both sides form the conflict set.
Conflicts are resolved through conflict resolution policies: last-writer-wins, priority-based, or escalation to human review.
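A minimal field-wise sketch of the three-way merge with a last-writer-wins policy. It operates on flat dicts for brevity (real session state is a nested typed record), and treats an absent key as a deletion:

```python
def three_way_merge(base: dict, a: dict, b: dict,
                    policy: str = "last_writer_wins"):
    """Field-wise three-way merge against common ancestor `base`.
    Returns (merged, conflicts). Only LWW is implemented here; the text
    also allows priority-based resolution or escalation to human review."""
    merged, conflicts = {}, []
    for key in set(base) | set(a) | set(b):
        va, vb, vbase = a.get(key), b.get(key), base.get(key)
        if va == vb:                       # both sides agree
            if va is not None:
                merged[key] = va
        elif va == vbase:                  # only b changed the field
            if vb is not None:
                merged[key] = vb
        elif vb == vbase:                  # only a changed the field
            if va is not None:
                merged[key] = va
        else:                              # divergent edits: conflict
            conflicts.append(key)
            if policy == "last_writer_wins":
                merged[key] = vb           # b is treated as the later writer
    return merged, conflicts
```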
18.3.4 Pseudo-Algorithm: Versioned State Serialization and Validation#
ALGORITHM SerializeSessionState
INPUT:
state: SessionState
target_format: SerializationFormat // PROTOBUF | MSGPACK | JSON
OUTPUT:
serialized: Bytes
checksum: Hash
BEGIN:
// Canonical field ordering (deterministic)
ordered_fields ← SORT(state.fields, key=λf: f.name)
// Type validation
FOR EACH (field_name, field_value) IN ordered_fields DO
expected_type ← state.schema.type_of(field_name)
IF NOT TYPE_CHECK(field_value, expected_type) THEN
RAISE SchemaViolationError(field_name, expected_type, ACTUAL_TYPE(field_value))
END IF
END FOR
// Serialize with canonical ordering
canonical ← CANONICAL_ENCODE(ordered_fields, target_format)
// Compute integrity checksum
checksum ← SHA256(canonical)
// Attach version metadata
envelope ← StateEnvelope(
schema_version=state.schema_version,
state_version=state.version,
checksum=checksum,
serialized_at=NOW(),
payload=canonical
)
serialized ← ENCODE_ENVELOPE(envelope, target_format)
RETURN (serialized, checksum)
END
ALGORITHM DeserializeSessionState
INPUT:
serialized: Bytes
expected_schema_version: SemanticVersion
OUTPUT:
state: SessionState
BEGIN:
envelope ← DECODE_ENVELOPE(serialized)
// Schema version compatibility check
IF NOT COMPATIBLE(envelope.schema_version, expected_schema_version) THEN
// Attempt migration
migration_path ← FIND_MIGRATION_PATH(
envelope.schema_version, expected_schema_version
)
IF migration_path IS NULL THEN
RAISE IncompatibleSchemaError(envelope.schema_version, expected_schema_version)
END IF
state ← DECODE_PAYLOAD(envelope.payload, envelope.schema_version)
FOR EACH migration IN migration_path DO
state ← migration.apply(state)
END FOR
ELSE
state ← DECODE_PAYLOAD(envelope.payload, expected_schema_version)
END IF
// Integrity verification
computed_checksum ← SHA256(CANONICAL_ENCODE(state.fields))
IF computed_checksum ≠ envelope.checksum THEN
RAISE IntegrityViolationError(
expected=envelope.checksum,
computed=computed_checksum
)
END IF
RETURN state
END

18.4 Session Isolation Models: Per-User, Per-Task, Per-Agent, and Nested Sessions#
18.4.1 Isolation as a Correctness Requirement#
Session isolation prevents unintended state interference between concurrent execution contexts. Formally, two sessions $S_i$ and $S_j$ ($i \neq j$) are isolated if neither can observe the other's writes:

$$\text{writes}(S_i) \cap \text{visible}(S_j) = \emptyset \quad \wedge \quad \text{writes}(S_j) \cap \text{visible}(S_i) = \emptyset$$

unless an explicit sharing policy grants access.
Isolation is enforced through namespaced state, scoped tool authorizations, and memory partitioning.
18.4.2 Isolation Models#
Per-User Isolation#
The broadest isolation boundary. Each user $u$ has a set of sessions:

$$\text{Sessions}(u) = \{S \mid \text{owner}(S) = u\}$$

State visibility: Session $S_i$ can only access state of $S_j$ if $\text{owner}(S_i) = \text{owner}(S_j)$ and an explicit sharing policy exists.
Per-Task Isolation#
Within a user's sessions, each task $t$ receives its own session $S_t$, with no shared mutable state across tasks:

$$t_1 \neq t_2 \Rightarrow \text{state}(S_{t_1}) \cap \text{state}(S_{t_2}) = \emptyset$$
This prevents cross-task state contamination—a code generation task does not pollute the context of a research summarization task.
Per-Agent Isolation#
When multiple agents execute within a single session (e.g., a generator agent and a critic agent), each agent $a$ receives an isolated workspace $W_a \subset \Sigma$, with $W_{a_1} \cap W_{a_2} = \emptyset$ for distinct agents.

The shared session state is accessed through a controlled interface with read/write permissions:

$$\text{perm} : \text{AgentID} \times \text{FieldPath} \to \{\text{NONE}, \text{READ}, \text{READ\_WRITE}\}$$
Nested Sessions#
Complex tasks spawn child sessions that inherit certain properties from the parent:

$$S_{\text{child}} = \text{spawn}(S_{\text{parent}}, \pi)$$

where $\pi$ is the inheritance policy.
The inheritance policy specifies:
| Property | Inheritance Rule |
|---|---|
| Memory (semantic) | Copy-on-read, isolated writes |
| Memory (episodic) | Read-only access to parent's episodes |
| Tool bindings | Subset of parent's bindings (least privilege) |
| Token budget | Allocated fraction of parent's remaining budget |
| Isolation context | Child inherits owner but gets unique task scope |
| Lifecycle | Child must complete or terminate before parent completes |
The parent-child relationship forms a session tree rooted at the top-level session.

Invariant: A parent session cannot transition to COMPLETED until all child sessions are in terminal states:

$$\phi(S_{\text{parent}}) = \text{COMPLETED} \Rightarrow \forall c \in \text{children}(S_{\text{parent}}) : \phi(c) \in \{\text{COMPLETED}, \text{TERMINATED}, \text{ARCHIVED}\}$$
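A guard for this invariant is a one-line check; a minimal sketch with phase names as strings:

```python
TERMINAL = {"COMPLETED", "TERMINATED", "ARCHIVED"}

def can_complete(parent_phase: str, child_phases: list[str]) -> bool:
    """The parent may fire `complete` only from ACTIVE, and only when
    every child session is already in a terminal state."""
    return parent_phase == "ACTIVE" and all(p in TERMINAL for p in child_phases)
```

In practice this check belongs in the guard table of the lifecycle manager (Section 18.2.4), evaluated before the ACTIVE → COMPLETED transition.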
18.4.3 Isolation Enforcement Mechanism#
ALGORITHM EnforceSessionIsolation
INPUT:
session: Session
operation: StateOperation // READ | WRITE | DELETE
field: FieldPath
actor: AgentID | UserID
OUTPUT:
permitted: Bool
BEGIN:
isolation_ctx ← session.isolation_context
// Determine access level
MATCH isolation_ctx.model:
PER_USER:
IF actor NOT IN isolation_ctx.authorized_users THEN
AUDIT_LOG("access_denied", actor, session.id, field, operation)
RETURN FALSE
END IF
PER_TASK:
IF actor.current_task ≠ session.task_scope THEN
AUDIT_LOG("cross_task_access_denied", actor, session.id)
RETURN FALSE
END IF
PER_AGENT:
access ← isolation_ctx.agent_permissions[actor]
IF operation = WRITE AND access ≠ READ_WRITE THEN
AUDIT_LOG("agent_write_denied", actor, session.id, field)
RETURN FALSE
END IF
IF operation = READ AND access = NONE THEN
AUDIT_LOG("agent_read_denied", actor, session.id, field)
RETURN FALSE
END IF
// Field-level access control
field_policy ← session.schema.field_access_policy(field)
IF operation NOT IN field_policy.allowed_operations(actor.role) THEN
AUDIT_LOG("field_access_denied", actor, field, operation)
RETURN FALSE
END IF
RETURN TRUE
END

18.4.4 Isolation Strength Hierarchy#
The isolation models form a strength hierarchy: per-user and per-task isolation provide full state separation, per-agent isolation shares a session under a permissioned interface, and nested sessions deliberately inherit selected parent state, making them the weakest boundary.
The tighter the isolation, the stronger the correctness guarantee but the higher the coordination cost for cross-boundary communication. The system selects isolation granularity based on the contention risk and shared-state requirements of the workload.
18.5 Session Persistence Strategies: In-Memory, Write-Ahead Log, Database-Backed, and Distributed#
18.5.1 Persistence Tier Classification#
Session state is tiered across persistence layers with different durability, latency, and cost characteristics:

| Tier | Backing Store | Durability | Typical Access Latency |
|---|---|---|---|
| L0 | In-memory | None (lost on crash) | Sub-millisecond |
| L1 | Write-ahead log | Crash-resilient (single node) | Low milliseconds |
| L2 | Database-backed | Node-failure resilient | Milliseconds to tens of ms |
| L3 | Distributed / replicated | Region-failure resilient | Tens to hundreds of ms |
18.5.2 Persistence Strategy Selection#
The optimal persistence tier for a session is determined by a cost-durability objective:

$$t^* = \arg\min_{t \in \{L0, \ldots, L3\}} \text{Cost}(t) \quad \text{s.t.} \quad \text{Durability}(t) \geq D_{\text{req}}$$

where $D_{\text{req}}$ is a function of session criticality:
| Session Criticality | Required Durability | Recommended Tier |
|---|---|---|
| Ephemeral exploration | None | L0 (in-memory) |
| Standard interactive | Crash-resilient | L1 (WAL) |
| Business-critical workflow | Node-failure resilient | L2 (database) |
| Cross-region, long-running | Region-failure resilient | L3 (distributed) |
18.5.3 Write-Ahead Log (WAL) for Session State#
The WAL provides crash-resilient persistence with minimal write amplification. Every state mutation $\Delta_i$ is appended to the WAL before being applied to the in-memory state:

$$\text{append}(\text{WAL}, \Delta_i) \prec \text{apply}(\Sigma, \Delta_i)$$
WAL Compaction#
Over time, the WAL grows without bound. Compaction produces a compressed snapshot by folding the logged mutations into the last snapshot:

$$\Sigma_{\text{compacted}} = \Sigma_{\text{snapshot}} \oplus \Delta_1 \oplus \Delta_2 \oplus \cdots \oplus \Delta_n$$

after which the folded entries are truncated from the log. Compaction is triggered when the log length or size exceeds a configured threshold.
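The append-before-apply discipline, crash replay, and compaction can be sketched in a few lines. This is an in-memory illustration only; a real WAL appends to a durable file and fsyncs before acknowledging the mutation:

```python
class SessionWAL:
    """Toy WAL: mutations are logged before they take effect, and
    compaction folds the log into the snapshot."""

    def __init__(self, base_state: dict):
        self.snapshot = dict(base_state)   # last compacted snapshot
        self.log = []                      # uncompacted mutation records

    def apply(self, field: str, value) -> None:
        # 1. Append to the WAL first (durably, in a real system),
        # 2. only then mutate live state — recoverable via replay().
        self.log.append((field, value))

    def replay(self) -> dict:
        """Crash recovery: snapshot plus ordered replay of the log tail."""
        state = dict(self.snapshot)
        for field, value in self.log:
            state[field] = value
        return state

    def compact(self, max_log_len: int = 100) -> None:
        """Fold the log into the snapshot once it exceeds the threshold."""
        if len(self.log) >= max_log_len:
            self.snapshot = self.replay()
            self.log.clear()
```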
18.5.4 Database-Backed Persistence#
For L2 persistence, session state is serialized into a database with the following schema:
| Column | Type | Index | Semantics |
|---|---|---|---|
session_id | UUID | Primary key | Unique session identifier |
schema_version | SemVer | — | State schema version |
state_version | Int64 | — | Monotonic state version |
state_blob | Bytes | — | Serialized state (Protobuf) |
checksum | Bytes(32) | — | SHA-256 integrity hash |
lifecycle_phase | Enum | Index | Current lifecycle phase |
owner_id | UUID | Index | Session owner |
created_at | Timestamp | Index | Creation time |
last_active_at | Timestamp | Index | Last activity time |
expires_at | Timestamp | Index | Expiry time (TTL-based) |
metadata | JSONB | GIN index | Tags, labels, lineage |
Optimistic concurrency control is used for updates:
ALGORITHM PersistSessionToDatabase
INPUT:
session: Session
db: DatabaseConnection
OUTPUT:
success: Bool
BEGIN:
(serialized, checksum) ← SERIALIZE_SESSION_STATE(session.state)
expected_version ← session.state.version
result ← db.execute(
"UPDATE sessions
SET state_blob = $1,
checksum = $2,
state_version = $3,
lifecycle_phase = $4,
last_active_at = $5
WHERE session_id = $6
AND state_version = $7", // Optimistic lock
params=[serialized, checksum, expected_version + 1,
session.lifecycle_phase, NOW(),
session.id, expected_version]
)
IF result.rows_affected = 0 THEN
// Concurrent modification detected
RAISE ConcurrentModificationError(session.id, expected_version)
END IF
session.state.version ← expected_version + 1
RETURN TRUE
END

18.5.5 Distributed Persistence#
For L3, session state is replicated across regions using a consensus protocol. The replication factor $N$ and consistency level are configurable; with write quorum $W$ and read quorum $R$, strongly consistent reads require $R + W > N$.
The session system defaults to session-consistent reads (read-your-own-writes) within a session, even when using eventual consistency across the distributed store. This is achieved by routing all reads for a session to the same replica, or by attaching a causal timestamp to each operation and refusing to serve reads from replicas that have not yet applied it.
18.5.6 Persistence Cost Model#
The cost of persisting session state over its lifetime:

$$\text{Cost} = N_w \cdot c_w(s) + N_r \cdot c_r(s) + s \cdot T_{\text{ret}} \cdot c_{\text{store}}$$

where $N_w$ and $N_r$ are the total write and read operations, $c_w$ and $c_r$ are the per-operation costs, $s$ is the serialized state size, $T_{\text{ret}}$ is the retention duration, and $c_{\text{store}}$ is the storage cost per byte per unit time. This cost model directly informs tier selection and compaction frequency.
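The cost model is a direct sum; a minimal sketch with per-operation and per-byte-day prices as placeholder parameters (any concrete values are deployment-specific):

```python
def persistence_cost(n_writes: int, n_reads: int, state_bytes: int,
                     retention_days: float, c_write: float, c_read: float,
                     c_store_per_byte_day: float) -> float:
    """Lifetime cost = write ops + read ops + storage over retention.
    Price arguments are illustrative placeholders, not real tariffs."""
    return (n_writes * c_write
            + n_reads * c_read
            + state_bytes * retention_days * c_store_per_byte_day)
```

Comparing this value across tiers (each with its own price triple, subject to the durability constraint) yields the tier selection of Section 18.5.2.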
18.6 Session Checkpointing: Periodic, Event-Triggered, and Pre-Mutation Snapshots#
18.6.1 Checkpoint Definition#
A checkpoint is a point-in-time consistent snapshot of session state:

$$\text{CP}_k = \langle \mathrm{id},\ k,\ \Sigma,\ \text{checksum},\ t_k \rangle$$

where $k$ is the monotonically increasing checkpoint sequence number.
18.6.2 Checkpoint Trigger Strategies#
| Strategy | Trigger Condition | Use Case | Trade-off |
|---|---|---|---|
| Periodic | Every $\tau$ seconds | Background consistency | Simple but may miss critical mutations |
| Event-Triggered | On lifecycle transitions, tool invocations, phase changes | Agent loop boundaries | Precise but higher write frequency |
| Pre-Mutation | Before any state-changing operation | Safety-critical workflows | Maximum safety, highest write cost |
| Adaptive | Based on state change rate | General purpose | Balances cost and safety dynamically |
Adaptive Checkpoint Interval#
The adaptive strategy adjusts the checkpoint interval based on the rate of state mutations:

$$\tau(t) = \text{clamp}\!\left(\tau_{\min},\ \tau_{\max},\ \frac{k}{\lambda(t)}\right)$$

where:
- $\lambda(t)$ is the observed state mutation rate
- $\tau_{\min}$ and $\tau_{\max}$ bound the checkpoint interval
- $k$ is a tuning constant

High mutation rates trigger more frequent checkpoints; quiescent periods relax checkpoint frequency.
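A minimal sketch of the adaptive interval, with the clamp bounds and tuning constant as illustrative defaults:

```python
def adaptive_checkpoint_interval(mutations_per_sec: float,
                                 tau_min: float = 5.0,
                                 tau_max: float = 300.0,
                                 k: float = 60.0) -> float:
    """Interval shrinks as the mutation rate grows: tau = clamp(k / rate).
    tau_min, tau_max, and k are placeholder tuning constants."""
    if mutations_per_sec <= 0:
        return tau_max           # quiescent session: relax to the maximum
    return max(tau_min, min(tau_max, k / mutations_per_sec))
```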
18.6.3 Checkpoint Storage Optimization#
Checkpoints can be stored as full snapshots or incremental deltas:
Full Snapshot#

$$\text{CP}^{\text{full}}_k = \text{serialize}(\Sigma_k)$$

Incremental Delta#

$$\text{CP}^{\Delta}_k = \Delta(\Sigma_{k-1}, \Sigma_k)$$

The space-time trade-off: full snapshots enable single-step restoration but cost $O(|\Sigma|)$ space per checkpoint. Incremental deltas cost only $O(|\Delta|)$ per checkpoint but require replaying every delta since the last full snapshot for restoration.

A mixed strategy checkpoints a full snapshot every $k$ checkpoints and incremental deltas in between:

$$\text{CP}_i = \begin{cases} \text{serialize}(\Sigma_i) & \text{if } i \bmod k = 0 \\ \Delta(\Sigma_{i-1}, \Sigma_i) & \text{otherwise} \end{cases}$$

Restoration cost: one full snapshot load plus at most $k - 1$ delta replays.
18.6.4 Pseudo-Algorithm: Checkpoint Manager#
ALGORITHM CheckpointManager
INPUT:
session: Session
trigger: CheckpointTrigger
config: CheckpointConfig
OUTPUT:
checkpoint_ref: CheckpointRef
BEGIN:
// Determine checkpoint type
last_full_seq ← GET_LAST_FULL_CHECKPOINT_SEQ(session.id)
current_seq ← session.checkpoint_seq + 1
deltas_since_full ← current_seq - last_full_seq
IF deltas_since_full ≥ config.full_checkpoint_interval
OR trigger = LIFECYCLE_TRANSITION
OR trigger = PRE_MUTATION_CRITICAL THEN
cp_type ← FULL
ELSE
cp_type ← INCREMENTAL
END IF
// Create checkpoint
MATCH cp_type:
FULL:
(serialized, checksum) ← SERIALIZE_SESSION_STATE(session.state)
checkpoint ← Checkpoint(
session_id=session.id,
seq=current_seq,
type=FULL,
state_version=session.state.version,
payload=serialized,
checksum=checksum,
trigger=trigger,
timestamp=NOW()
)
INCREMENTAL:
prev_state ← LOAD_PREVIOUS_STATE(session.id, current_seq - 1)
delta ← COMPUTE_DIFF(prev_state, session.state)
delta_serialized ← SERIALIZE_DIFF(delta)
checksum ← SHA256(delta_serialized)
checkpoint ← Checkpoint(
session_id=session.id,
seq=current_seq,
type=INCREMENTAL,
state_version=session.state.version,
payload=delta_serialized,
base_seq=current_seq - 1,
checksum=checksum,
trigger=trigger,
timestamp=NOW()
)
// Persist to configured tier
persistence_tier ← session.persistence_policy.checkpoint_tier
PERSIST_CHECKPOINT(checkpoint, persistence_tier)
// Update session metadata
session.checkpoint_seq ← current_seq
session.last_checkpoint_at ← NOW()
EMIT_METRIC("session.checkpoint", {
session_id: session.id,
seq: current_seq,
type: cp_type,
size_bytes: SIZE(checkpoint.payload),
trigger: trigger
})
RETURN CheckpointRef(session.id, current_seq, checksum)
END

18.7 Session Resumption: Rehydrating Context, Rebinding Tools, and Restoring Agent State#
18.7.1 Resumption as a Multi-Phase Protocol#
Session resumption is not a single deserialization step. It is a multi-phase protocol that reconstructs the full execution environment from persisted state:
Each phase has specific preconditions, postconditions, and failure modes.
18.7.2 Phase 1: State Restoration#
Restore the session state from the most recent checkpoint by loading the latest full snapshot and replaying subsequent deltas in order:

$$\Sigma \leftarrow \text{CP}^{\text{full}} \oplus \Delta_1 \oplus \cdots \oplus \Delta_m$$

Postcondition: the restored state's canonical checksum equals the checksum recorded in the latest checkpoint.
Failure mode: Checkpoint corruption → fall back to previous full checkpoint.
18.7.3 Phase 2: Context Rehydration#
The active context window at the time of suspension may have contained retrieved evidence, tool outputs, and compressed history that are not part of the serialized state. Rehydration reconstructs this context:
Key considerations:
- Stale evidence: If time has passed since suspension, retrieved evidence may be outdated. The rehydration phase checks freshness scores and re-retrieves any item older than the configured freshness threshold.
- Token budget recalculation: The remaining token budget must be recalculated from the checkpoint: $T_{\text{remaining}} = T_{\max} - T_{\text{consumed}}$.
18.7.4 Phase 3: Tool Rebinding#
Tools may have changed availability, version, or authorization scope since the session was suspended; each required tool is re-resolved against the registry and re-authorized under the session's isolation context.
For tools that are no longer available, the system applies the tool substitution policy: bind a registered substitute if one exists, fail the resumption if the tool is required, or mark the capability as degraded otherwise.
18.7.5 Phase 4: Memory Reconstruction#
Session memory layers are reconstructed:
| Memory Layer | Resumption Strategy |
|---|---|
| Working | Reset (ephemeral by definition) |
| Session | Restored from checkpoint |
| Episodic | Loaded from durable store |
| Semantic | Read from organizational memory (shared, not session-specific) |
| Procedural | Loaded from procedural memory store |
18.7.6 Phase 5: Integrity Verification#
Before the session transitions from RESUMED to ACTIVE, a comprehensive integrity check verifies:

- The restored state's checksum matches the restoring checkpoint
- The state schema version is compatible with the running software
- All required tools are bound and authorized
- All memory layers are loaded and consistent
- Sufficient token budget remains for the session to make progress
18.7.7 Pseudo-Algorithm: Session Resumption Protocol#
ALGORITHM ResumeSession
INPUT:
session_id: SessionID
checkpoint_store: CheckpointStore
tool_registry: ToolRegistry
memory_store: MemoryStore
config: ResumptionConfig
OUTPUT:
resumed_session: Session
BEGIN:
// ─── Phase 1: State Restoration ───
checkpoints ← checkpoint_store.list(session_id, order=DESC)
IF checkpoints IS EMPTY THEN
RAISE NoCheckpointAvailableError(session_id)
END IF
// Find most recent full checkpoint
full_cp ← FIND_LATEST(checkpoints, type=FULL)
IF full_cp IS NULL THEN
RAISE NoFullCheckpointError(session_id)
END IF
// Collect incremental deltas after full checkpoint
deltas ← FILTER(checkpoints,
λcp: cp.type = INCREMENTAL AND cp.seq > full_cp.seq)
SORT deltas BY seq ASC
// Replay
state ← DESERIALIZE_SESSION_STATE(full_cp.payload, config.expected_schema_version)
FOR EACH delta IN deltas DO
diff ← DESERIALIZE_DIFF(delta.payload)
state ← APPLY_DIFF(state, diff)
END FOR
// Verify integrity
latest_cp ← checkpoints[0]
IF SHA256(CANONICAL_SERIALIZE(state.fields)) ≠ latest_cp.checksum THEN
// Attempt fallback to previous full checkpoint
WARN("checkpoint_integrity_failure", session_id, latest_cp.seq)
IF config.allow_fallback THEN
RETURN RESUME_FROM_FALLBACK(session_id, full_cp, config)
ELSE
RAISE CheckpointIntegrityError(session_id, latest_cp.seq)
END IF
END IF
// ─── Phase 2: Context Rehydration ───
retrieval_queries ← state.pending_retrieval_queries
evidence ← {}
FOR EACH query IN retrieval_queries DO
cached ← RETRIEVAL_CACHE.get(query.hash)
IF cached IS NOT NULL AND AGE(cached) < config.freshness_threshold THEN
evidence[query.id] ← cached
ELSE
fresh_result ← RETRIEVE(query, deadline=config.retrieval_deadline)
evidence[query.id] ← fresh_result
RETRIEVAL_CACHE.put(query.hash, fresh_result, ttl=config.cache_ttl)
END IF
END FOR
// Recalculate token budget
token_budget ← config.T_max - state.tokens_consumed
IF token_budget < config.T_min_viable THEN
RAISE InsufficientTokenBudgetError(session_id, token_budget)
END IF
// ─── Phase 3: Tool Rebinding ───
required_tools ← state.required_tools
bound_tools ← {}
degraded_tools ← []
FOR EACH tool_spec IN required_tools DO
tool ← tool_registry.resolve(tool_spec.name, tool_spec.version_constraint)
IF tool IS NOT NULL THEN
IF AUTHORIZE(tool, state.isolation_context) THEN
bound_tools[tool_spec.name] ← tool
ELSE
IF tool_spec.required THEN
RAISE ToolAuthorizationFailedError(tool_spec.name)
ELSE
APPEND degraded_tools, tool_spec.name
END IF
END IF
ELSE
substitute ← tool_registry.find_substitute(tool_spec)
IF substitute IS NOT NULL THEN
bound_tools[tool_spec.name] ← substitute
WARN("tool_substituted", tool_spec.name, substitute.name)
ELSE IF tool_spec.required THEN
RAISE RequiredToolUnavailableError(tool_spec.name)
ELSE
APPEND degraded_tools, tool_spec.name
END IF
END IF
END FOR
// ─── Phase 4: Memory Reconstruction ───
session_memory ← SessionMemory(
working=WorkingMemory.fresh(), // Ephemeral: always reset
session=memory_store.load_session_memory(session_id),
episodic=memory_store.load_episodic(session_id),
semantic=memory_store.load_semantic(state.isolation_context.org_id),
procedural=memory_store.load_procedural(state.task_type)
)
// ─── Phase 5: Construct Resumed Session ───
session ← Session(
id=session_id,
state=state,
lifecycle_phase=RESUMED,
isolation_context=state.isolation_context,
memory=session_memory,
tool_bindings=bound_tools,
evidence=evidence,
token_budget=token_budget,
degraded_capabilities=degraded_tools,
metadata=state.metadata
)
// ─── Phase 6: Integrity Verification ───
integrity ← VERIFY_SESSION_INTEGRITY(session)
IF NOT integrity.passed THEN
RAISE SessionIntegrityError(session_id, integrity.failures)
END IF
EMIT_TRACE("session.resumed", {
session_id: session_id,
resumed_from_checkpoint: latest_cp.seq,
state_version: state.version,
tools_bound: LEN(bound_tools),
tools_degraded: LEN(degraded_tools),
evidence_rehydrated: LEN(evidence),
token_budget_remaining: token_budget
})
RETURN session
END

18.8 Session Migration: Moving Sessions Across Nodes, Regions, and Agent Instances#
18.8.1 Migration Motivation and Scenarios#
Sessions must be migratable across infrastructure boundaries for:
| Scenario | Trigger | Constraint |
|---|---|---|
| Node failover | Node crash or scheduled maintenance | Minimize downtime; resume on healthy node |
| Load balancing | Uneven resource utilization | Minimize migration latency |
| Region transfer | User relocates; data residency requirements | Comply with data sovereignty regulations |
| Agent upgrade | New agent version deployed | Maintain state continuity across versions |
| Horizontal scaling | Workload spike | Distribute sessions across expanded capacity |
18.8.2 Migration Protocol#
Session migration is a three-phase commit protocol:
Phase 1: Prepare
- Source suspends the session (creates checkpoint).
- Source serializes complete session state, including metadata, checkpoints, and WAL tail.
- Source notifies target of incoming migration with state size and resource requirements.
- Target verifies capacity, schema compatibility, and tool availability.
Phase 2: Transfer
- State bundle is transferred over an encrypted, authenticated channel.
- Target deserializes, validates checksum, and performs schema migration if necessary.
- Target binds tools and verifies integrity.
Phase 3: Commit
- Target acknowledges successful restoration.
- Source marks the session as migrated and releases all local resources.
- Routing table is updated to direct future requests to the target.
- If acknowledgment times out, the source retains the session (migration aborted).
18.8.3 Migration Latency Model#
The total migration latency is:

$$L_{\text{total}} = L_{\text{checkpoint}} + L_{\text{serialize}} + \frac{s}{B} + L_{\text{deserialize}} + L_{\text{verify}}$$

where $s$ is the state bundle size and $B$ is the available network bandwidth. For large session states, the transfer term dominates. Optimization strategies include:
- Incremental migration: Transfer only the delta since the last checkpoint already present on the target.
- Compression: Apply LZ4 or Zstandard compression to the state bundle, reducing the transfer term to $\frac{s}{r \cdot B}$ for compression ratio $r$.
- Pre-staging: Begin transfer of large, stable state components (e.g., episodic memory) before the migration is committed.
18.8.4 Pseudo-Algorithm: Session Migration#
ALGORITHM MigrateSession
INPUT:
session_id: SessionID
source_node: NodeID
target_node: NodeID
config: MigrationConfig
OUTPUT:
migration_result: MigrationResult
BEGIN:
// ─── Phase 1: Prepare ───
session ← LOAD_SESSION(session_id, source_node)
// Suspend session (creates checkpoint)
LIFECYCLE_TRANSITION(session, SUSPEND)
checkpoint ← CREATE_CHECKPOINT(session, trigger=MIGRATION)
// Serialize state bundle
state_bundle ← SerializeStateBundle(
state=session.state,
checkpoints=GET_RECENT_CHECKPOINTS(session_id, config.checkpoint_window),
wal_tail=GET_WAL_TAIL(session_id),
memory=SERIALIZE_SESSION_MEMORY(session.memory),
metadata=session.metadata,
tool_specs=session.tool_bindings.specs()
)
compressed_bundle ← COMPRESS(state_bundle, algorithm=ZSTD, level=3)
// Verify target capacity
target_capacity ← RPC(target_node, "check_migration_capacity", {
state_size: SIZE(compressed_bundle),
schema_version: session.state.schema_version,
required_tools: session.tool_bindings.specs()
})
IF NOT target_capacity.accepted THEN
LIFECYCLE_TRANSITION(session, RESUME) // Abort: reactivate on source
RETURN MigrationResult(success=FALSE, reason=target_capacity.rejection_reason)
END IF
// ─── Phase 2: Transfer ───
transfer_start ← NOW()
transfer_result ← RPC(target_node, "receive_session_migration", {
session_id: session_id,
bundle: compressed_bundle,
checksum: SHA256(state_bundle),
source_node: source_node
}, deadline=config.transfer_deadline)
transfer_latency ← NOW() - transfer_start
IF NOT transfer_result.success THEN
LIFECYCLE_TRANSITION(session, RESUME) // Abort: reactivate on source
RETURN MigrationResult(success=FALSE, reason=transfer_result.error)
END IF
// ─── Phase 3: Commit ───
commit_result ← RPC(target_node, "commit_session_migration", {
session_id: session_id,
expected_checksum: SHA256(state_bundle)
}, deadline=config.commit_deadline)
IF commit_result.success THEN
// Update routing table
ROUTING_TABLE.update(session_id, target_node)
// Clean up source
MARK_SESSION_MIGRATED(session_id, source_node, target_node)
SCHEDULE_CLEANUP(session_id, source_node, delay=config.cleanup_delay)
EMIT_METRIC("session.migrated", {
session_id: session_id,
from: source_node,
to: target_node,
bundle_size: SIZE(compressed_bundle),
transfer_latency_ms: transfer_latency,
total_latency_ms: NOW() - transfer_start
})
RETURN MigrationResult(success=TRUE)
ELSE
// Commit failed: reactivate on source
LIFECYCLE_TRANSITION(session, RESUME)
RETURN MigrationResult(success=FALSE, reason="commit_failed")
END IF
END
18.9 Session Timeout and Expiry: Configurable TTL, Grace Periods, and Cleanup Hooks#
18.9.1 Timeout and Expiry Model#
Sessions are subject to multiple time-based constraints:
| Parameter | Semantics | Typical Range |
|---|---|---|
| $\tau_{\text{idle}}$ | Max time without any interaction | 5 min – 24 hrs |
| $\tau_{\text{absolute}}$ | Max total session duration | 1 hr – 30 days |
| $\tau_{\text{grace}}$ | Grace period after timeout before state cleanup | 1 hr – 7 days |
| $\tau_{\text{expiry}}$ | Time after which archived state is permanently deleted | 30 – 365 days |
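These four parameters can be captured in a small configuration object; a minimal sketch, with names mirroring the `config.tau_*` fields used by the timeout monitor and an ordering check that is an assumption, not a stated invariant:

```python
from dataclasses import dataclass
from datetime import timedelta

@dataclass(frozen=True)
class TimeoutConfig:
    tau_idle: timedelta      # max time without meaningful interaction
    tau_absolute: timedelta  # max total session duration
    tau_grace: timedelta     # grace period after timeout before cleanup
    tau_expiry: timedelta    # retention of terminated/archived state

    def validate(self) -> None:
        # Sanity ordering: a session cannot outlive its absolute limit
        # purely by staying idle.
        if self.tau_idle > self.tau_absolute:
            raise ValueError("tau_idle must not exceed tau_absolute")

cfg = TimeoutConfig(tau_idle=timedelta(minutes=30),
                    tau_absolute=timedelta(days=7),
                    tau_grace=timedelta(days=1),
                    tau_expiry=timedelta(days=90))
cfg.validate()
```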
18.9.2 Timeout State Transitions#
Timeouts drive lifecycle transitions deterministically: an idle timeout suspends an ACTIVE session, an absolute timeout terminates it, expiry of the grace period terminates a suspended session, and expiry of the retention window permanently deletes terminated or archived state.
18.9.3 Idle Detection#
Idle time is measured from the last meaningful interaction:
$$t_{\text{idle}} = t_{\text{now}} - t_{\text{last\_meaningful}}$$
A session is idle-timed-out when $t_{\text{idle}} > \tau_{\text{idle}}$.
Important distinction: background processing (e.g., async tool results, retrieval updates) resets the idle timer only if configured to do so. Pure heartbeat messages never reset idle time; only semantically meaningful interactions qualify.
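The distinction between heartbeats, background events, and meaningful interactions can be sketched as follows; the event-kind strings and the `count_background` flag are illustrative assumptions:

```python
from datetime import datetime, timedelta

# Event kinds that always count as meaningful interactions (assumed set).
MEANINGFUL = {"user_message", "agent_response", "tool_result"}

def is_idle_timed_out(events, now, tau_idle, count_background=False):
    """events: list of (timestamp, kind) pairs.
    Heartbeats never reset the idle clock; background events reset it
    only when the session is configured to count them."""
    def resets(kind):
        if kind == "heartbeat":
            return False
        if kind == "background":
            return count_background
        return kind in MEANINGFUL
    stamps = [t for t, kind in events if resets(kind)]
    if not stamps:
        return True  # no meaningful interaction ever recorded
    return (now - max(stamps)) > tau_idle
```

A session whose only recent activity is a heartbeat still times out once the last user message falls outside the idle window.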
18.9.4 Cleanup Hooks#
Before a session is terminated or archived, cleanup hooks execute in order:
- Flush pending state: Persist any uncommitted working memory.
- Release tool locks: Free any claimed resources.
- Notify dependent sessions: Alert parent or linked sessions of termination.
- Emit analytics: Record final session metrics.
- Promote valuable memory: Extract non-obvious insights for long-term memory (with validation).
- Delete sensitive data: Scrub PII or credentials from session state.
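An ordered hook runner can be sketched as below, under the assumption (not stated in the chapter) that a failing hook is recorded but does not block later hooks, since steps like PII scrubbing must still run:

```python
def run_cleanup_hooks(session, hooks):
    """hooks: ordered list of (name, callable) pairs.
    Execute each hook in declared order; record failures without
    aborting the remaining hooks."""
    failures = []
    for name, hook in hooks:
        try:
            hook(session)
        except Exception as exc:
            failures.append((name, exc))
    return failures

executed = []

def bad_hook(session):
    raise RuntimeError("lock service unreachable")

hooks = [
    ("flush_state", lambda s: executed.append("flush_state")),
    ("release_locks", bad_hook),
    ("emit_analytics", lambda s: executed.append("emit_analytics")),
    ("scrub_pii", lambda s: executed.append("scrub_pii")),
]
failures = run_cleanup_hooks({}, hooks)
```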
ALGORITHM SessionTimeoutMonitor
INPUT:
sessions: SessionIndex // Indexed by last_active_at
config: TimeoutConfig
// Runs as a periodic background process
BEGIN:
LOOP EVERY config.sweep_interval DO
// Idle timeout check
idle_candidates ← sessions.query(
lifecycle_phase IN {ACTIVE},
last_active_at < NOW() - config.tau_idle
)
FOR EACH session IN idle_candidates DO
EMIT_TRACE("session.idle_timeout", session.id)
LIFECYCLE_TRANSITION(session, SUSPEND, trigger=IDLE_TIMEOUT)
END FOR
// Absolute timeout check
absolute_candidates ← sessions.query(
lifecycle_phase IN {ACTIVE, SUSPENDED, RESUMED},
created_at < NOW() - config.tau_absolute
)
FOR EACH session IN absolute_candidates DO
IF session.lifecycle_phase = ACTIVE THEN
LIFECYCLE_TRANSITION(session, SUSPEND, trigger=ABSOLUTE_TIMEOUT)
END IF
LIFECYCLE_TRANSITION(session, TERMINATE, trigger=ABSOLUTE_TIMEOUT)
END FOR
// Grace period expiry
grace_candidates ← sessions.query(
lifecycle_phase = SUSPENDED,
suspended_at < NOW() - config.tau_grace
)
FOR EACH session IN grace_candidates DO
EXECUTE_CLEANUP_HOOKS(session)
LIFECYCLE_TRANSITION(session, TERMINATE, trigger=GRACE_EXPIRED)
END FOR
// Permanent expiry
expiry_candidates ← sessions.query(
lifecycle_phase IN {TERMINATED, ARCHIVED},
terminated_at < NOW() - config.tau_expiry
)
FOR EACH session IN expiry_candidates DO
PERMANENT_DELETE(session)
EMIT_METRIC("session.permanently_deleted", session.id)
END FOR
END LOOP
END
18.9.5 TTL Extension and Renewal#
Sessions can request a TTL extension through explicit user interaction or programmatic renewal.
Renewal is granted only if:
- The session owner is authenticated.
- The total session duration remains within $\tau_{\text{absolute}}$.
- Resource budgets (tokens, cost) have not been exhausted.
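The three renewal conditions compose into a single guard; a minimal sketch, with the budget check reduced to a boolean for brevity:

```python
from datetime import datetime, timedelta

def may_renew(owner_authenticated: bool, created_at: datetime,
              requested_extension: timedelta, now: datetime,
              tau_absolute: timedelta, budget_remaining: bool) -> bool:
    """Grant a TTL extension only if the owner is authenticated, the
    extended duration stays within tau_absolute, and resource budgets
    (tokens, cost) are not exhausted."""
    within_absolute = (now - created_at) + requested_extension <= tau_absolute
    return owner_authenticated and within_absolute and budget_remaining
```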
18.10 Multi-Session Coordination: Linking Related Sessions, Cross-Session Context Sharing#
18.10.1 Session Relationship Graph#
Related sessions form a directed graph $G = (V, E)$ where:
- $V$ = set of sessions
- $E$ = set of typed edges representing relationships
Edge types:
| Edge Type | Semantics | Data Flow |
|---|---|---|
PARENT → CHILD | Nested session (fork) | Child inherits parent context |
PREDECESSOR → SUCCESSOR | Sequential task chain | Successor reads predecessor's outputs |
PEER ↔ PEER | Collaborative sessions | Shared context subset |
REFERENCE → REFERENT | Cross-session evidence citation | Read-only access to referent's artifacts |
18.10.2 Cross-Session Context Sharing Protocol#
Sharing context across sessions requires an explicit sharing declaration and an access control policy.
Snapshot Sharing#
The shared state is copied at a point in time; changes in the source session are not reflected in the consumer.
Live Sharing#
The consumer session observes changes in real time through a subscription mechanism.
Live sharing introduces consistency challenges: the consumer may observe intermediate states. The system provides causal consistency by attaching version vectors to shared fields.
The consumer reads a shared field only when its own version vector dominates (is componentwise greater than or equal to) the version vector attached to the field at publish time.
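The causal read condition reduces to a componentwise dominance check over version vectors; a sketch, where representing a version vector as a node-to-counter dict is an assumption:

```python
def dominates(vv_a: dict, vv_b: dict) -> bool:
    """True iff vv_a >= vv_b componentwise, i.e. vv_a has observed
    every event that vv_b has (missing entries count as zero)."""
    return all(vv_a.get(node, 0) >= count for node, count in vv_b.items())

def causally_readable(consumer_vv: dict, field_publish_vv: dict) -> bool:
    """Under live sharing, a consumer may read a field only when its own
    version vector dominates the vector attached at publish time."""
    return dominates(consumer_vv, field_publish_vv)
```

A consumer that has not yet caught up to the publishing version simply defers the read until its subscription delivers the missing updates.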
18.10.3 Cross-Session Memory Promotion#
When a session discovers a non-obvious insight or correction that would benefit future sessions, it proposes a memory promotion to a validation queue.
The promotion queue is processed by a validation pipeline that ensures:
- Deduplication: The item does not duplicate existing organizational memory.
- Factual verification: The item is supported by evidence.
- Generalizability: The item applies beyond the current session.
- Expiry policy: The item has a defined TTL.
Only items passing all gates are promoted to organizational (semantic) memory.
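The four gates form an all-must-pass pipeline; a sketch in which the gate predicates are simplified stand-ins for the real deduplication, evidence, and generalizability checks:

```python
def validate_promotion(item: dict, existing_keys: set,
                       has_evidence: bool, applies_generally: bool):
    """Run the four promotion gates; an item is promoted only if all pass.
    Returns (promoted, list_of_failed_gate_names)."""
    gates = {
        "deduplication": item["key"] not in existing_keys,
        "factual_verification": has_evidence,
        "generalizability": applies_generally,
        "expiry_policy": item.get("ttl_days") is not None,
    }
    failed = [name for name, ok in gates.items() if not ok]
    return (len(failed) == 0, failed)
```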
18.10.4 Pseudo-Algorithm: Multi-Session Coordinator#
ALGORITHM MultiSessionCoordinator
INPUT:
session_graph: SessionGraph
requesting_session: SessionID
operation: CrossSessionOperation
OUTPUT:
result: OperationResult
BEGIN:
MATCH operation:
FORK(parent_id, subtask, inheritance_policy):
parent ← LOAD_SESSION(parent_id)
// Validate parent is active
ASSERT parent.lifecycle_phase = ACTIVE
// Allocate child budget
child_token_budget ← parent.token_budget * inheritance_policy.budget_fraction
parent.token_budget ← parent.token_budget - child_token_budget
// Create child session
child ← CREATE_SESSION(
owner=parent.owner,
task=subtask,
isolation=PER_TASK,
persistence=parent.persistence_policy,
token_budget=child_token_budget
)
// Copy inherited state
FOR EACH field IN inheritance_policy.inherited_fields DO
child.state.set(field, DEEP_COPY(parent.state.get(field)))
END FOR
// Bind tools (subset)
child.tool_bindings ← FILTER(
parent.tool_bindings,
λt: t.name IN inheritance_policy.allowed_tools
)
// Register relationship
session_graph.add_edge(parent.id, child.id, PARENT_CHILD)
// Register completion dependency
parent.pending_children.add(child.id)
RETURN OperationResult(success=TRUE, child_session=child)
SHARE_CONTEXT(source_id, target_id, policy):
source ← LOAD_SESSION(source_id)
target ← LOAD_SESSION(target_id)
// Verify authorization
IF NOT AUTHORIZE_CROSS_SESSION(source, target, policy) THEN
RAISE CrossSessionAccessDenied(source_id, target_id)
END IF
MATCH policy.sync:
SNAPSHOT:
shared_state ← PROJECT(source.state, policy.fields)
snapshot ← DEEP_COPY(shared_state)
target.imported_context[source_id] ← ImportedContext(
snapshot=snapshot,
timestamp=NOW(),
source_version=source.state.version,
access=policy.access
)
LIVE:
subscription ← CREATE_SUBSCRIPTION(
source=source_id,
target=target_id,
fields=policy.fields,
access=policy.access
)
REGISTER_SUBSCRIPTION(subscription)
target.subscriptions.add(subscription)
session_graph.add_edge(source_id, target_id, CONTEXT_SHARE)
RETURN OperationResult(success=TRUE)
WAIT_FOR_CHILDREN(parent_id):
parent ← LOAD_SESSION(parent_id)
pending ← parent.pending_children
FOR EACH child_id IN pending DO
child ← LOAD_SESSION(child_id)
IF child.lifecycle_phase IN TERMINAL_STATES THEN
// Collect child results
parent.child_results[child_id] ← child.final_output
parent.pending_children.remove(child_id)
END IF
END FOR
IF parent.pending_children IS EMPTY THEN
RETURN OperationResult(success=TRUE, all_children_complete=TRUE)
ELSE
RETURN OperationResult(success=TRUE, all_children_complete=FALSE,
pending=parent.pending_children)
END IF
END
18.11 Session Security: Encryption at Rest and in Transit, Access Control, and Session Hijacking Prevention#
18.11.1 Threat Model#
| Threat | Description | Impact |
|---|---|---|
| Session Hijacking | Attacker obtains session ID and impersonates user | Full access to session state and tools |
| State Tampering | Attacker modifies persisted session state | Corrupted execution, policy violations |
| Eavesdropping | Attacker intercepts session data in transit | Confidentiality breach |
| Replay Attack | Attacker replays captured session interactions | Duplicate state mutations |
| Privilege Escalation | Agent or user accesses resources beyond scope | Unauthorized tool invocations |
| Cross-Session Leakage | Isolation failure leaks state between sessions | Confidentiality and integrity breach |
18.11.2 Encryption Architecture#
At Rest#
All persisted session state is encrypted using AES-256-GCM with per-session keys:
$$K_{\text{session}} = \text{KDF}(K_{\text{master}}, \text{SessionID})$$
where:
- $K_{\text{master}}$ is stored in a Hardware Security Module (HSM) or key management service.
- $\text{KDF}$ is HKDF-SHA256.
- The nonce is derived from the checkpoint sequence number to ensure uniqueness.
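The shape of the derivation can be sketched with stdlib primitives. This is a minimal HKDF-SHA256 (RFC 5869), not a production KMS integration; in practice $K_{\text{master}}$ never leaves the HSM, and the salt label here is an assumption:

```python
import hashlib
import hmac

def hkdf_sha256(ikm: bytes, salt: bytes, info: bytes, length: int = 32) -> bytes:
    """Minimal RFC 5869 HKDF-SHA256: extract, then expand to `length` bytes."""
    prk = hmac.new(salt, ikm, hashlib.sha256).digest()  # extract
    okm, block = b"", b""
    for i in range((length + 31) // 32):                # expand
        block = hmac.new(prk, block + info + bytes([i + 1]),
                         hashlib.sha256).digest()
    # accumulate each block
        okm += block
    return okm[:length]

def derive_session_key(master_key: bytes, session_id: str) -> bytes:
    """Per-session AES-256 key: K_session = HKDF(K_master, SessionID)."""
    return hkdf_sha256(master_key, salt=b"session-key-v1",
                       info=session_id.encode(), length=32)
```

Derivation is deterministic per session, so the same session always rehydrates with the same key, while distinct sessions get cryptographically independent keys.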
In Transit#
All session data traversing network boundaries is protected by mutual TLS (mTLS):
- gRPC channels use TLS 1.3 with certificate pinning.
- JSON-RPC endpoints require TLS with client certificate authentication.
- Inter-node migration uses an additional layer of application-level encryption.
18.11.3 Session Token Security#
Session tokens (used for API authentication) must satisfy the following properties:
| Property | Mechanism |
|---|---|
| Unpredictability | 256-bit cryptographically random component |
| Binding | Token is bound to session ID, user ID, and IP (optional) |
| Expiry | Short-lived with refresh mechanism (TTL on the order of minutes) |
| Rotation | Token is rotated on every sensitive operation |
| Revocation | Server-side revocation list checked on every request |
18.11.4 Access Control Model#
Session access is governed by a role-based access control (RBAC) policy augmented with attribute-based constraints:
$$\text{Permit}(s, a, r) \iff \text{RBAC}(\text{role}(s), a, r) \wedge \text{ABAC}(\text{attrs}(s), \text{attrs}(r))$$
where $s$ is the subject (user or agent), $a$ is the action, and $r$ is the resource (session state field or tool).
| Role | Permissions |
|---|---|
| Owner | Full read/write, lifecycle control, migration, sharing |
| Agent | Scoped read/write per isolation policy, tool invocation |
| Viewer | Read-only access to session outputs and metrics |
| Auditor | Read-only access to audit records and traces |
| Admin | Session termination, migration, and policy override |
18.11.5 Anti-Hijacking Measures#
ALGORITHM ValidateSessionRequest
INPUT:
request: SessionRequest
session_store: SessionStore
security_config: SecurityConfig
OUTPUT:
validated: Bool
BEGIN:
// Token validation
token ← request.session_token
IF NOT VERIFY_SIGNATURE(token, security_config.signing_key) THEN
AUDIT_LOG("invalid_token_signature", request)
RETURN FALSE
END IF
claims ← DECODE_CLAIMS(token)
// Expiry check
IF claims.expires_at < NOW() THEN
AUDIT_LOG("expired_token", request)
RETURN FALSE
END IF
// Revocation check
IF REVOCATION_LIST.contains(token.id) THEN
AUDIT_LOG("revoked_token", request)
RETURN FALSE
END IF
// Session binding check
session ← session_store.load(claims.session_id)
IF session IS NULL OR session.lifecycle_phase IN TERMINAL_STATES THEN
AUDIT_LOG("invalid_session", request)
RETURN FALSE
END IF
// User binding
IF claims.user_id ≠ session.owner_id THEN
AUDIT_LOG("user_session_mismatch", request)
RETURN FALSE
END IF
// IP binding (optional)
IF security_config.enforce_ip_binding THEN
IF request.source_ip ≠ claims.bound_ip THEN
AUDIT_LOG("ip_mismatch", request)
RETURN FALSE
END IF
END IF
// Rate limiting
IF NOT RATE_LIMITER.allow(claims.session_id, request.operation) THEN
AUDIT_LOG("rate_limited", request)
RETURN FALSE
END IF
// Anomaly detection: unusual request patterns
IF ANOMALY_DETECTOR.is_suspicious(request, session.interaction_history) THEN
AUDIT_LOG("anomalous_request", request)
IF security_config.block_anomalous THEN
RETURN FALSE
ELSE
FLAG_FOR_REVIEW(request)
END IF
END IF
// Token rotation on sensitive operations
IF request.operation IN SENSITIVE_OPERATIONS THEN
new_token ← ROTATE_TOKEN(token, security_config)
request.response_headers["X-New-Session-Token"] ← new_token
REVOCATION_LIST.add(token.id, ttl=30s) // Grace period for in-flight requests
END IF
RETURN TRUE
END
18.11.6 Security Invariants#
The session security subsystem must maintain the following invariants at all times: no request is served without a valid, unexpired, unrevoked token bound to the session's owner; persisted state is never readable without a key derived from the HSM-held master key; and no session state crosses an isolation boundary without an explicitly authorized sharing policy.
18.12 Session Analytics: Duration, Turn Count, Tool Usage, Error Rate, and User Satisfaction Correlation#
18.12.1 Session Metrics Taxonomy#
Every session emits a structured set of metrics that enable operational monitoring, capacity planning, quality improvement, and cost optimization.
Temporal Metrics#
$T_{\text{total}}$ (wall-clock session duration), $T_{\text{active}}$ (time spent in the ACTIVE phase), $T_{\text{suspended}}$ (time suspended), $T_{\text{ttfr}}$ (time to first response), and per-turn latency statistics $T_{\text{mean\_turn}}$ and $T_{\text{p99\_turn}}$.
Interaction Metrics#
$N_{\text{turns}}$, $N_{\text{user\_msgs}}$, $N_{\text{agent\_msgs}}$, $N_{\text{tool\_invocations}}$, $N_{\text{retrieval\_queries}}$, and $N_{\text{repair\_cycles}}$.
Resource Consumption Metrics#
Input and output token totals, accumulated cost $C_{\text{total}}$, API call count, checkpoint count, and peak persisted state size.
Quality Metrics#
Task completion rate $r_{\text{completion}}$, error rate $r_{\text{error}}$, repair success rate, hallucination rate, and verification pass rate, where
$$r_{\text{error}} = \frac{N_{\text{errors}}}{\max(N_{\text{tool\_invocations}} + N_{\text{agent\_msgs}},\, 1)}$$
18.12.2 User Satisfaction Modeling#
User satisfaction is modeled as a function of observable session metrics:
$$\hat{s} = \sigma(\mathbf{w}^{\top}\mathbf{x} + b)$$
where $\sigma$ is the sigmoid function, $\mathbf{x}$ is the feature vector, and $\mathbf{w}, b$ are learned parameters calibrated against explicit user feedback (ratings, thumbs up/down, task completion signals).
Feature vector:
$$\mathbf{x} = \big(T_{\text{ttfr}},\; N_{\text{turns}},\; r_{\text{completion}},\; r_{\text{error}},\; r_{\text{verify}},\; N_{\text{repair}},\; \mathbb{1}[\text{completed}],\; C_{\text{total}}\big)$$
The model is retrained periodically on accumulated feedback data. The predicted satisfaction score $\hat{s}$ is used for:
- Proactive intervention: If $\hat{s}$ drops below a threshold during an active session, the system offers assistance or escalates.
- Quality regression detection: A decline in aggregate $\hat{s}$ across sessions signals systemic issues.
- A/B testing: Different agent configurations are compared by their effect on $\hat{s}$.
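The prediction step itself is a plain logistic model. A sketch with illustrative, not learned, parameters:

```python
import math

def predict_satisfaction(features, weights, bias):
    """Logistic prediction: sigmoid of the weighted feature sum plus bias.
    In production, weights and bias are calibrated against user feedback;
    the values passed here are placeholders for illustration."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-z))
```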
18.12.3 Session Analytics Pipeline#
ALGORITHM ComputeSessionAnalytics
INPUT:
session: Session (completed or terminated)
OUTPUT:
analytics: SessionAnalyticsRecord
BEGIN:
// ─── Temporal Metrics ───
T_total ← session.completed_at - session.created_at
T_active ← SUM(duration(interval) FOR interval IN session.active_intervals)
T_suspended ← T_total - T_active
T_ttfr ← session.first_response_at - session.created_at
turn_latencies ← [turn.response_at - turn.request_at
FOR turn IN session.interaction_history]
T_mean_turn ← MEAN(turn_latencies)
T_p99_turn ← PERCENTILE(turn_latencies, 99)
// ─── Interaction Metrics ───
N_turns ← LEN(session.interaction_history)
N_user_msgs ← COUNT(turn FOR turn IN session.interaction_history
IF turn.actor = USER)
N_agent_msgs ← COUNT(turn FOR turn IN session.interaction_history
IF turn.actor = AGENT)
N_tool_invocations ← LEN(session.tool_trace)
N_retrieval_queries ← LEN(session.retrieval_trace)
N_repair_cycles ← SUM(loop.repair_count FOR loop IN session.agent_loops)
// ─── Resource Metrics ───
T_tokens_in ← SUM(turn.input_tokens FOR turn IN session.interaction_history)
T_tokens_out ← SUM(turn.output_tokens FOR turn IN session.interaction_history)
C_total ← session.cost_accumulator.total()
N_api_calls ← session.api_call_counter.total()
N_checkpoints ← LEN(session.checkpoints)
peak_state_size ← MAX(SIZE(cp.payload) FOR cp IN session.checkpoints)
// ─── Quality Metrics ───
r_task_completion ← COMPUTE_TASK_COMPLETION_RATE(session)
N_errors ← COUNT(inv FOR inv IN session.tool_trace IF inv.is_error)
r_error_rate ← N_errors / MAX(N_tool_invocations + N_agent_msgs, 1)
r_repair_success ← COMPUTE_REPAIR_SUCCESS_RATE(session.agent_loops)
r_hallucination ← COMPUTE_HALLUCINATION_RATE(session.verification_results)
r_verify_pass ← COMPUTE_VERIFICATION_PASS_RATE(session.verification_results)
// ─── Satisfaction Prediction ───
feature_vector ← CONSTRUCT_FEATURE_VECTOR(
T_ttfr, N_turns, r_task_completion, r_error_rate,
r_verify_pass, N_repair_cycles,
session.lifecycle_phase = COMPLETED,
C_total
)
predicted_satisfaction ← SATISFACTION_MODEL.predict(feature_vector)
// ─── Tool Usage Breakdown ───
tool_usage ← {}
FOR EACH invocation IN session.tool_trace DO
key ← invocation.tool_name
IF key NOT IN tool_usage THEN
tool_usage[key] ← ToolUsageRecord(count=0, errors=0,
total_latency=0, total_cost=0)
END IF
tool_usage[key].count ← tool_usage[key].count + 1
IF invocation.is_error THEN
tool_usage[key].errors ← tool_usage[key].errors + 1
END IF
tool_usage[key].total_latency ← tool_usage[key].total_latency + invocation.latency
tool_usage[key].total_cost ← tool_usage[key].total_cost + invocation.cost
END FOR
// ─── Assemble Record ───
analytics ← SessionAnalyticsRecord(
session_id=session.id,
owner_id=session.owner_id,
task_type=session.task_type,
temporal=TemporalMetrics(T_total, T_active, T_suspended, T_ttfr,
T_mean_turn, T_p99_turn),
interaction=InteractionMetrics(N_turns, N_user_msgs, N_agent_msgs,
N_tool_invocations, N_retrieval_queries,
N_repair_cycles),
resource=ResourceMetrics(T_tokens_in, T_tokens_out, C_total,
N_api_calls, N_checkpoints, peak_state_size),
quality=QualityMetrics(r_task_completion, r_error_rate, r_repair_success,
r_hallucination, r_verify_pass),
tool_usage=tool_usage,
predicted_satisfaction=predicted_satisfaction,
computed_at=NOW()
)
// Persist to analytics store
ANALYTICS_STORE.write(analytics)
// Emit to monitoring system
EMIT_METRICS_BATCH(analytics)
RETURN analytics
END
18.12.4 Aggregate Analytics and Operational Dashboards#
Individual session analytics are aggregated across dimensions for operational insight:
| Aggregation Dimension | Example Metrics | Purpose |
|---|---|---|
| By User | Mean satisfaction, session count, error rate | User experience monitoring |
| By Task Type | Completion rate, mean duration, repair frequency | Task-specific optimization |
| By Agent Version | Quality scores, latency, cost per session | A/B testing, regression detection |
| By Tool | Error rate, mean latency, invocation count | Tool reliability monitoring |
| By Time Window | Throughput, peak concurrency, cost rate | Capacity planning |
18.12.5 Anomaly Detection on Session Metrics#
Session metrics are monitored for anomalies using a statistical process control approach. A metric sample $m$ is flagged when:
$$|m - \mu| > k\sigma$$
where $\mu$ is the rolling mean, $\sigma$ is the rolling standard deviation, and $k$ is the sensitivity factor (typically $k = 3$ for $3\sigma$ bounds).
For multivariate anomaly detection across the full metric vector $\mathbf{m}$, the Mahalanobis distance is used:
$$D_M(\mathbf{m}) = \sqrt{(\mathbf{m} - \boldsymbol{\mu})^{\top}\,\Sigma^{-1}\,(\mathbf{m} - \boldsymbol{\mu})}$$
where $\boldsymbol{\mu}$ and $\Sigma$ are the rolling mean vector and covariance matrix. An alert fires when $D_M$ exceeds a configured threshold.
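The univariate check is only a few lines; a sketch that takes the rolling window as a plain list rather than a streaming estimator:

```python
from statistics import mean, stdev

def spc_anomalous(history, value, k=3.0):
    """Univariate SPC check: flag `value` when it deviates from the
    rolling mean by more than k standard deviations of `history`."""
    mu, sigma = mean(history), stdev(history)
    return abs(value - mu) > k * sigma
```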
18.12.6 Feedback Loop: Analytics to Architecture#
Session analytics feed directly back into architectural decisions:
| Observed Signal | Architectural Response |
|---|---|
| High $T_{\text{ttfr}}$ | Increase retrieval cache TTL, pre-warm tool connections |
| High $r_{\text{error}}$ on specific tool | Enable circuit breaker, add fallback tool |
| Low $r_{\text{completion}}$ for task type | Improve decomposition heuristics, add retrieval sources |
| High $r_{\text{hallucination}}$ | Strengthen verification, improve planning prompts |
| Low predicted satisfaction $\hat{s}$ | Trigger proactive user assistance, escalation |
| Excessive peak state size | Increase checkpoint compaction frequency, prune history |
| High suspension frequency | Tune timeout parameters, improve resource allocation |
This feedback loop ensures the session architecture evolves in response to empirical evidence rather than intuition.
Synthesis: Session Architecture as a Systems Engineering Discipline#
The session architecture presented in this chapter treats the session not as a convenience abstraction but as a formally specified, cryptographically protected, lifecycle-managed, migratable execution envelope. The key architectural contributions are summarized:
| Architectural Principle | Implementation |
|---|---|
| Typed, versioned state | Schema evolution with composable migrations and integrity checksums |
| Deterministic lifecycle | Finite state machine with guarded transitions and cleanup hooks |
| Tiered persistence | L0–L3 persistence with cost-durability optimization |
| Isolation enforcement | Per-user, per-task, per-agent, nested; RBAC + ABAC access control |
| Resumable execution | Multi-phase rehydration: state → context → tools → memory → integrity |
| Migratable sessions | Three-phase commit protocol with incremental transfer optimization |
| Security by design | Encryption at rest/in transit, token rotation, anomaly detection |
| Measurable quality | Comprehensive metrics taxonomy with satisfaction prediction |
| Feedback-driven evolution | Analytics pipeline feeds back into timeout tuning, tool selection, and planning |
A session that cannot survive a crash is a toy. A session that cannot migrate is a monolith. A session that cannot be audited is a liability. A session that cannot be measured cannot be improved. The architecture in this chapter ensures none of these failures are possible.
End of Chapter 18.