Preamble#
Multi-agent coordination transcends the orchestration of isolated tool-calling loops. When agents operate as a team, the system acquires emergent properties—collective reasoning capacity, fault tolerance through redundancy, specialization-driven throughput—that no single-agent architecture can replicate. However, these gains materialize only when the coordination substrate enforces explicit role contracts, bounded communication protocols, verified handoffs, shared memory discipline, and measurable team-level quality gates. This chapter formalizes agent team dynamics as an engineering discipline: typed organizational structures, mathematically grounded consensus and conflict resolution, provenance-tracked shared memory, adaptive composition under runtime evolution, and production-grade reliability patterns drawn from High-Reliability Organizations (HROs). Every mechanism is specified at protocol level with pseudo-algorithms, formal objective functions, and architecture trade-off analysis suitable for enterprise-scale deployment.
17.1 Agent Teams as Organizational Units: Roles, Responsibilities, and Accountability#
17.1.1 Foundational Abstractions#
An agent team is a bounded set of agents operating under a shared task mandate with explicit role assignments, communication topology, and an accountability ledger:

$$\mathcal{T} = (A, R, \rho, G, C, L)$$

where:
- $A$ — the agent pool
- $R$ — the role taxonomy
- $\rho: A \to 2^R$ — the role assignment function (an agent may hold multiple roles)
- $G$ — the shared goal specification (a typed task DAG or objective tree)
- $C$ — the communication topology (directed graph of permitted message channels)
- $L$ — the accountability ledger (append-only trace of decisions, outputs, and responsibility attributions)
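The six components above can be rendered as a typed structure. The following Python sketch uses illustrative names (`Agent`, `AgentTeam`, `may_send`) that belong to no particular framework:

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Agent:
    agent_id: str
    capabilities: frozenset

@dataclass
class AgentTeam:
    agents: dict                 # A: agent pool, keyed by agent_id
    roles: dict                  # R: role taxonomy, keyed by role name
    assignment: dict             # rho: agent_id -> set of role names
    goal: dict                   # G: typed task DAG / objective tree
    topology: set                # C: permitted (sender_id, receiver_id) channels
    ledger: list = field(default_factory=list)  # L: append-only trace

    def may_send(self, sender: str, receiver: str) -> bool:
        """Messages are only permitted along declared channels."""
        return (sender, receiver) in self.topology
```

Making the topology an explicit set of permitted channels lets the runtime reject out-of-contract messages mechanically instead of relying on agent discipline.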
17.1.2 Role Taxonomy#
Roles are not informal labels. Each role is a typed contract:
| Role Dimension | Specification |
|---|---|
| Capabilities | Set of tools, model variants, retrieval indices, and output modalities the role may invoke |
| Authority | Mutation scope: which state domains, artifacts, and external systems this role may modify |
| Obligations | Mandatory outputs, verification checks, and reporting duties per execution cycle |
| Constraints | Token budgets, latency ceilings, recursion depth bounds, and approval gates |
Canonical role archetypes in production agentic teams:
- Planner / Orchestrator — decomposes goals into sub-tasks, assigns work, manages DAG state
- Implementer / Executor — performs domain-specific generation, transformation, or computation
- Retriever / Analyst — executes hybrid retrieval, ranks evidence, surfaces provenance
- Verifier / Critic — validates outputs against specifications, detects hallucinations, runs test harnesses
- Documenter / Synthesizer — aggregates partial results, produces coherent deliverables
- Monitor / Sentinel — observes system health, enforces rate limits, triggers escalation
17.1.3 Accountability Ledger#
Every action taken within the team is recorded in the accountability ledger as a structured entry.
The ledger is:
- Append-only — no retroactive mutation, ensuring auditability
- Content-addressed — input and output hashes enable deterministic replay
- Causally ordered — Lamport timestamps or vector clocks maintain happens-before relations across concurrent agents
- Queryable — supports provenance tracing: "Which agent produced this artifact, and on what evidence?"
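The four ledger properties can be sketched together. This is a minimal single-process illustration — the class name, field names, and Lamport-clock handling are assumptions, not a reference implementation:

```python
import hashlib
import json

class AccountabilityLedger:
    """Append-only, content-addressed ledger with Lamport timestamps."""
    def __init__(self):
        self._entries = []
        self._clock = 0

    @staticmethod
    def _digest(payload) -> str:
        # Content addressing: identical inputs/outputs hash identically,
        # enabling deterministic replay and deduplication.
        return hashlib.sha256(json.dumps(payload, sort_keys=True).encode()).hexdigest()

    def append(self, agent_id, action, inputs, outputs, observed_clock=0) -> dict:
        # Lamport rule: advance past any timestamp observed from a peer,
        # preserving happens-before across concurrent agents.
        self._clock = max(self._clock, observed_clock) + 1
        entry = {
            "ts": self._clock,
            "agent": agent_id,
            "action": action,
            "input_hash": self._digest(inputs),
            "output_hash": self._digest(outputs),
        }
        self._entries.append(entry)  # append-only: no mutation API is exposed
        return entry

    def trace(self, output_hash: str) -> list:
        """Provenance query: which entries produced this artifact?"""
        return [e for e in self._entries if e["output_hash"] == output_hash]
```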
17.1.4 Responsibility Attribution Model#
When a team output is incorrect, the system must attribute responsibility to enable targeted repair.
Causal contribution is computed via the ledger by tracing the dependency chain from the failure artifact back through all contributing actions, weighted by each agent's decision authority at the relevant branching points. This drives:
- Targeted re-execution — only the causally responsible subtree is re-planned
- Capability scoring — persistent per-agent reliability metrics inform future role assignment
- Escalation triggers — repeated attribution to a single agent triggers model swap, parameter adjustment, or human escalation
17.2 Team Formation Strategies: Static Assignment, Dynamic Assembly, and Capability-Based Matching#
17.2.1 Formation Strategy Taxonomy#
| Strategy | When to Use | Trade-offs |
|---|---|---|
| Static Assignment | Stable, well-understood task domains; compliance-sensitive environments | Low overhead, deterministic behavior; inflexible to novel task types |
| Dynamic Assembly | Heterogeneous, evolving workloads; multi-domain requests | Adaptive, optimal specialization; higher formation latency, coordination cost |
| Capability-Based Matching | Large agent pools with diverse specializations; marketplace architectures | Precise skill-task alignment; requires maintained capability registry, scoring infrastructure |
17.2.2 Static Assignment#
The team structure is defined at design time. A configuration manifest specifies the agent roster, role bindings, communication topology, and per-role constraints.
Static teams are versioned artifacts deployed through CI/CD. Changes require manifest updates, regression testing, and rollout procedures. This is the preferred strategy for regulated domains (medical, financial, legal) where audit determinism is paramount.
17.2.3 Dynamic Assembly#
At task ingestion time, a formation controller selects agents from a pool based on task analysis:
Pseudo-Algorithm 17.1 — Dynamic Team Assembly
PROCEDURE AssembleTeam(task, agent_pool, formation_policy):
// Phase 1: Task Analysis
task_graph ← DecomposeTask(task)
required_capabilities ← ExtractCapabilityRequirements(task_graph)
required_roles ← MapCapabilitiesToRoles(required_capabilities)
// Phase 2: Candidate Selection
candidates ← {}
FOR EACH role IN required_roles:
eligible ← FILTER agent_pool WHERE:
agent.capabilities ⊇ role.capabilities AND
agent.current_load < agent.capacity_limit AND
agent.reliability_score ≥ formation_policy.min_reliability
ranked ← SORT eligible BY CapabilityMatchScore(agent, role) DESC
candidates[role] ← ranked[0 : formation_policy.candidates_per_role]
// Phase 3: Team Optimization
team ← SolveAssignment(candidates, required_roles, formation_policy.constraints)
// Constraints include: budget ceiling, latency target, diversity requirements,
// anti-affinity rules (e.g., verifier ≠ implementer model family)
// Phase 4: Topology Construction
topology ← BuildCommunicationGraph(team, task_graph)
// Phase 5: Initialization
FOR EACH (agent, role) IN team:
InitializeAgentContext(agent, role, task_graph, topology)
RETURN TeamInstance(team, topology, task_graph)
17.2.4 Capability-Based Matching: Formal Model#
Each agent $a$ publishes a capability vector $c_a \in [0,1]^d$ spanning $d$ skill dimensions (e.g., code generation quality, retrieval precision, mathematical reasoning, domain expertise scores). Each role $r$ defines a requirement vector with minimum thresholds $\theta_{r,k}$.
Match score:
$$\text{match}(a, r) = \sum_{k=1}^{d} w_k \cdot c_{a,k}, \quad \text{subject to } c_{a,k} \ge \theta_{r,k} \ \forall k$$
where $w_k$ are importance weights derived from task analysis.
Optimal assignment is modeled as a constrained optimization:
$$\rho^* = \arg\max_{\rho} \sum_{a \in A} \text{match}(a, \rho(a))$$
subject to load, budget, and role-coverage constraints:
$$\text{load}(a) \le \text{capacity}(a) \ \forall a, \qquad \sum_{a} \text{cost}(a) \le B, \qquad \text{every required role is assigned}$$
This is an instance of a generalized assignment problem, solvable via ILP for small teams or greedy heuristics with bounded approximation ratios for large pools.
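A greedy heuristic of the kind mentioned above can be sketched as follows. The function names and the flat-dict encoding of capability and requirement vectors are illustrative assumptions; a production system would solve this with an ILP or matching solver:

```python
def match_score(cap: dict, req: dict, weights: dict) -> float:
    """Weighted match of a capability vector against role requirements.
    Returns 0.0 if any minimum threshold is violated (hard constraint)."""
    if any(cap.get(k, 0.0) < theta for k, theta in req.items()):
        return 0.0
    return sum(weights.get(k, 1.0) * cap.get(k, 0.0) for k in req)

def greedy_assign(agents: dict, roles: dict, weights: dict) -> dict:
    """Greedy heuristic for the assignment problem: take the highest-scoring
    (agent, role) pairs first, each agent and role used at most once."""
    pairs = sorted(
        ((match_score(cap, req, weights), a, r)
         for a, cap in agents.items() for r, req in roles.items()),
        reverse=True)
    assignment, used = {}, set()
    for score, a, r in pairs:
        if score > 0 and a not in used and r not in assignment:
            assignment[r] = a
            used.add(a)
    return assignment
```

Greedy assignment is not optimal in general, but for threshold-gated scores it gives a bounded approximation at a fraction of the solver cost.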
17.2.5 Capability Registry#
The capability registry is a typed, versioned data store of per-agent capability vectors, benchmark provenance, and update history.
Capability vectors are updated through:
- Benchmark evaluation — periodic execution against standardized task suites
- Production telemetry — rolling accuracy, latency, and failure rates from live traces
- Peer assessment — verifier agents' acceptance rates for each implementer's outputs
17.3 Shared Mental Models: Establishing Common Context, Goals, and Constraints Across Agents#
17.3.1 Definition and Motivation#
A shared mental model (SMM) in multi-agent systems is a synchronized, bounded representation of the team's collective understanding of:
- Task state — current progress, pending sub-tasks, completed artifacts, blocking dependencies
- Environment state — relevant external system states, data freshness, resource availability
- Team state — who is doing what, agent health, capacity utilization
- Constraint state — active policies, budget remaining, deadline proximity, quality gates
Without explicit SMM construction, agents operate on divergent assumptions, producing inconsistent outputs, redundant work, or conflicting mutations.
17.3.2 Formal Representation#
Define the shared mental model at time $t$ as:
$$M_t = \left( S^{\text{task}}_t, \; S^{\text{env}}_t, \; S^{\text{team}}_t, \; S^{\text{constraint}}_t \right)$$
Each component is a typed, versioned, causally consistent data structure with monotonically increasing version counters.
17.3.3 SMM Construction Pipeline#
Pseudo-Algorithm 17.2 — Shared Mental Model Construction
PROCEDURE ConstructSMM(team, task_graph, environment_state):
// Phase 1: Goal Alignment
goal_specification ← ExtractGoalTree(task_graph)
success_criteria ← ExtractMeasurableCriteria(goal_specification)
FOR EACH agent IN team:
InjectGoalContext(agent, goal_specification, success_criteria)
// Phase 2: Task State Synchronization
task_state ← InitializeTaskDAG(task_graph)
PublishToSharedState("task_state", task_state, version=0)
// Phase 3: Constraint Broadcasting
constraints ← {
token_budget_remaining: ComputeRemainingBudget(team),
deadline: task.deadline,
quality_gates: LoadQualityGates(task.domain),
authority_matrix: LoadAuthorityMatrix(team),
escalation_policy: LoadEscalationPolicy(team)
}
PublishToSharedState("constraints", constraints, version=0)
// Phase 4: Environment Snapshot
env_snapshot ← CaptureEnvironmentState(environment_state)
PublishToSharedState("environment", env_snapshot, version=0)
// Phase 5: Team Roster
roster ← {}
FOR EACH (agent, role) IN team:
roster[agent.id] ← {
role: role,
capabilities: agent.capabilities,
status: IDLE,
capacity: agent.remaining_capacity
}
PublishToSharedState("team_roster", roster, version=0)
RETURN SMM(task_state, env_snapshot, constraints, roster)
17.3.4 SMM Synchronization Protocol#
Maintaining consistency across agents requires a synchronization discipline:
- State Channel — A shared, versioned key-value store (e.g., a lightweight coordination service) accessible by all team members through typed read/write interfaces
- Optimistic Concurrency — Writes carry version numbers; the state channel rejects stale writes (compare-and-swap semantics)
- Event Propagation — State changes emit typed events; agents subscribe to relevant channels
- Bounded Staleness — Agents may operate on slightly stale state within a tolerance window $\delta$; critical decisions require fresh reads
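The compare-and-swap and event-propagation disciplines above can be sketched in a single in-process state channel. `StateChannel` and its method names are illustrative; a real deployment would use a coordination service such as a consistent KV store:

```python
import fnmatch

class StateChannel:
    """Versioned KV store with compare-and-swap writes and
    pattern-based subscriptions (single-process sketch)."""
    def __init__(self):
        self._data = {}   # key -> (value, version)
        self._subs = []   # (pattern, callback)

    def read(self, key):
        return self._data.get(key, (None, 0))

    def write(self, key, value, expected_version):
        _, current = self._data.get(key, (None, 0))
        if expected_version != current:
            return None                      # stale write rejected (CAS)
        new_version = current + 1
        self._data[key] = (value, new_version)
        for pattern, cb in self._subs:       # event propagation
            if fnmatch.fnmatch(key, pattern):
                cb(key, value, new_version)
        return new_version

    def subscribe(self, pattern, callback):
        self._subs.append((pattern, callback))
```

A writer that loses a CAS race must re-read, reconcile, and retry; this is the mechanism that turns concurrent agent writes into a serializable history.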
17.3.5 Context Injection for Model-Based Agents#
Since LLM-based agents consume context through token windows, the SMM must be compiled into a per-agent context payload $\text{ctx}_i$, where $\text{ctx}_i$ includes only the sub-tasks relevant to agent $a_i$'s current assignment, compressed to stay within token budget $B_i$:
$$|\text{ctx}_i| \le B_i$$
17.4 Handoff Protocols: Clean State Transfer, Context Summarization, and Responsibility Chain#
17.4.1 The Handoff Problem#
A handoff occurs when responsibility for a work unit transfers from agent $a_s$ (sender) to agent $a_r$ (receiver). Failed handoffs are the single largest source of coordination errors in multi-agent systems: dropped context, duplicated work, inconsistent state, and untracked responsibility gaps.
17.4.2 Handoff Packet Specification#
Every handoff transmits a typed handoff packet $H$:
| Field | Description |
|---|---|
| artifacts | Versioned, content-addressed outputs produced by sender |
| context_summary | Compressed representation of sender's working context, decisions made, and rationale |
| open_issues | Unresolved questions, known risks, deferred decisions requiring receiver attention |
| constraints_remaining | Residual budget (tokens, time, cost), quality gates not yet satisfied |
| provenance_chain | Ordered list of all prior agents and actions that contributed to current state |
17.4.3 Context Summarization for Handoffs#
The sender must compress its working context $W$ into a summary $\sigma$ that preserves decision-critical information while fitting within the receiver's token budget. This is a lossy compression problem with a fidelity objective:
$$\sigma^* = \arg\min_{\sigma : |\sigma| \le B} \mathcal{I}(W, \sigma)$$
where $\mathcal{I}$ is an information loss function (approximated by evaluating whether downstream tasks succeed with $\sigma$ versus full context on held-out examples).
Pseudo-Algorithm 17.3 — Handoff Context Summarization
PROCEDURE SummarizeForHandoff(working_context, task_spec, budget):
// Step 1: Identify decision-critical elements
decisions ← ExtractDecisions(working_context)
constraints ← ExtractActiveConstraints(working_context)
unresolved ← ExtractOpenQuestions(working_context)
artifacts ← ExtractOutputArtifacts(working_context)
// Step 2: Rank by downstream utility
elements ← decisions ∪ constraints ∪ unresolved ∪ ArtifactSummaries(artifacts)
FOR EACH element IN elements:
element.priority ← ScoreDownstreamUtility(element, task_spec)
ranked ← SORT elements BY priority DESC
// Step 3: Greedy packing under budget
summary ← []
token_count ← 0
FOR EACH element IN ranked:
element_tokens ← CountTokens(Serialize(element))
IF token_count + element_tokens ≤ budget:
summary.APPEND(element)
token_count ← token_count + element_tokens
// Step 4: Structural formatting
RETURN FormatHandoffSummary(summary, task_spec)
17.4.4 Handoff Protocol State Machine#
The handoff follows a strict three-phase commit:
Phase 1 — PREPARE: Sender constructs $H$, locks the work unit, publishes handoff intent to the coordination service.
Phase 2 — VALIDATE: Receiver inspects $H$, verifies:
- Artifact integrity (hash verification)
- Context sufficiency (receiver can identify its next action)
- Constraint feasibility (remaining budget and deadline are achievable)
Phase 3 — COMMIT or RETRY:
- On ACK: Sender releases the lock; receiver assumes ownership; accountability ledger records transfer
- On NACK (with reason): Sender supplements missing context, re-summarizes, or escalates
The end-to-end handoff success rate is tracked against a system-level SLO.
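The PREPARE/VALIDATE/COMMIT-or-RETRY cycle can be sketched as a small state machine. The class and state names are illustrative, and the `validate` callback stands in for the receiver's three checks:

```python
from enum import Enum, auto

class HandoffState(Enum):
    PREPARED = auto()
    VALIDATING = auto()
    COMMITTED = auto()
    RETRYING = auto()

class Handoff:
    """Three-phase handoff: PREPARE -> VALIDATE -> COMMIT or RETRY.
    `validate` is a receiver-supplied check returning (ok, reason)."""
    def __init__(self, packet: dict, validate):
        self.packet = packet
        self.validate = validate
        self.state = HandoffState.PREPARED   # sender locks the work unit here
        self.attempts = 0

    def run(self, max_retries: int = 2) -> bool:
        while self.attempts <= max_retries:
            self.state = HandoffState.VALIDATING
            ok, reason = self.validate(self.packet)
            if ok:                           # ACK: receiver assumes ownership
                self.state = HandoffState.COMMITTED
                return True
            # NACK: sender supplements missing context and retries
            self.state = HandoffState.RETRYING
            self.packet.setdefault("supplements", []).append(reason)
            self.attempts += 1
        return False                         # retry budget exhausted: escalate
```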
17.4.5 Responsibility Chain#
The responsibility chain is the ordered sequence of agents that have held ownership of a work unit $u$, each with its ownership interval:
$$\text{chain}(u) = \left\langle (a_1, [t_0, t_1]), (a_2, [t_1, t_2]), \dots, (a_k, [t_{k-1}, t_k]) \right\rangle$$
This chain is immutable, stored in the accountability ledger, and serves three purposes:
- Failure forensics — trace output errors to the responsible ownership interval
- Latency attribution — identify bottleneck agents in the chain
- Compliance audit — verify that only authorized agents handled sensitive data
17.5 Consensus Mechanisms: Majority Voting, Weighted Voting, Debate, and Arbitration#
17.5.1 When Consensus Is Required#
Consensus mechanisms activate when:
- Multiple agents produce conflicting outputs for the same task
- A critical decision requires collective judgment (e.g., plan selection, risk assessment)
- Verification results are ambiguous or contradictory
- The task specification admits multiple valid solutions and one must be committed
17.5.2 Majority Voting#
Given $n$ agents producing candidate outputs $\{o_1, \dots, o_n\}$ for a task, majority voting selects the output with the most support:
$$o^* = \arg\max_{o} \left| \{ i : \text{eq}(o_i, o) \} \right|$$
where $\text{eq}$ is a semantic equivalence function (exact match for structured outputs, embedding-space clustering for natural language).
Properties:
- Requires $n \ge 3$ participants (odd $n$ preferred to avoid ties)
- Correct when the majority of agents are individually correct: reliable when each agent's accuracy $p > 0.5$ (Condorcet Jury Theorem)
- Inefficient when agents share failure modes (correlated errors from same model family)
Condorcet amplification — for $n$ independent agents each with accuracy $p > 0.5$:
$$P(\text{majority correct}) = \sum_{k = \lfloor n/2 \rfloor + 1}^{n} \binom{n}{k} \, p^k (1-p)^{n-k}$$
This converges to 1 as $n \to \infty$, but the independence assumption rarely holds in practice when agents share model weights.
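The amplification effect is easy to verify numerically with a short sketch (the function name is illustrative):

```python
from math import comb

def majority_accuracy(n: int, p: float) -> float:
    """P(a strict majority of n independent voters, each with accuracy p,
    is correct) -- the Condorcet Jury Theorem sum."""
    k_min = n // 2 + 1  # smallest strict majority
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(k_min, n + 1))
```

For $p > 0.5$ the team beats any individual and improves with $n$; for $p < 0.5$ the same mechanism amplifies error, which is why voting among agents with shared failure modes can be worse than a single agent.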
17.5.3 Weighted Voting#
Assigns differential credibility to agents based on expertise, past performance, or role authority:
$$o^* = \arg\max_{o} \sum_{i \,:\, \text{eq}(o_i, o)} w_i$$
where weights satisfy $w_i \ge 0$ and $\sum_i w_i = 1$.
Weight computation:
$$w_i = \frac{\text{acc}_i(d)}{\sum_j \text{acc}_j(d)}$$
where $\text{acc}_i(d)$ is the historical accuracy of agent $i$ on domain $d$, computed with an exponential decay factor $\gamma \in (0,1)$ that discounts older performance observations.
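The decay-weighted accuracy computation can be sketched as follows; `domain_weights`, the history encoding, and the default decay value are illustrative assumptions:

```python
def domain_weights(history: dict, domain: str, decay: float = 0.9) -> dict:
    """Normalized voting weights from per-domain accuracy histories.
    `history[agent]` is a list of (domain, correct) pairs, oldest first;
    older observations are exponentially discounted."""
    raw = {}
    for agent, obs in history.items():
        num = den = 0.0
        for age, (d, correct) in enumerate(reversed(obs)):  # age 0 = newest
            if d != domain:
                continue
            w = decay ** age
            num += w * (1.0 if correct else 0.0)
            den += w
        raw[agent] = num / den if den else 0.0
    total = sum(raw.values()) or 1.0
    return {a: v / total for a, v in raw.items()}
```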
17.5.4 Structured Debate#
When outputs are complex or the decision requires justification, debate replaces simple voting:
Pseudo-Algorithm 17.4 — Multi-Agent Structured Debate
PROCEDURE StructuredDebate(proposals, agents, moderator, max_rounds):
// proposals: list of (agent, proposed_output, rationale)
debate_log ← []
FOR round ← 1 TO max_rounds:
// Phase 1: Challenge
FOR EACH (agent_i, proposal_i, rationale_i) IN proposals:
challenges ← []
FOR EACH (agent_j, proposal_j, _) IN proposals WHERE j ≠ i:
challenge ← agent_j.Critique(
target_proposal = proposal_i,
target_rationale = rationale_i,
own_proposal = proposal_j,
debate_history = debate_log
)
challenges.APPEND((agent_j, challenge))
debate_log.APPEND((round, "challenges", agent_i, challenges))
// Phase 2: Defend / Revise
updated_proposals ← []
FOR EACH (agent_i, proposal_i, rationale_i) IN proposals:
relevant_challenges ← GetChallengesFor(agent_i, debate_log, round)
response ← agent_i.DefendOrRevise(
own_proposal = proposal_i,
challenges = relevant_challenges,
debate_history = debate_log
)
updated_proposals.APPEND((agent_i, response.proposal, response.rationale))
debate_log.APPEND((round, "defense", agent_i, response))
proposals ← updated_proposals
// Phase 3: Convergence Check
IF AllProposalsEquivalent(proposals):
RETURN ConsensusResult(proposals[0], debate_log, "converged")
// Phase 4: Early Termination
IF moderator.JudgeConvergenceLikelihood(debate_log) < threshold:
BREAK
// Fallback: Moderator Arbitration
RETURN moderator.Arbitrate(proposals, debate_log)
17.5.5 Arbitration#
When debate does not converge, a designated arbiter agent (or human escalation target) resolves the dispute.
The arbiter has elevated authority and access to the full debate log. Arbitration decisions are flagged in the accountability ledger with decision_method: ARBITRATION for downstream audit.
17.5.6 Consensus Mechanism Selection Matrix#
| Criterion | Majority Vote | Weighted Vote | Debate | Arbitration |
|---|---|---|---|---|
| Decision latency | Low | Low | High | Medium |
| Justification depth | None | None | Deep | Medium |
| Accuracy (uncorrelated errors) | High | Higher | Highest | Varies |
| Accuracy (correlated errors) | Low | Medium | Higher | High |
| Token cost | ||||
| Suitable for structured outputs | Yes | Yes | Less so | Yes |
| Suitable for open-ended reasoning | No | No | Yes | Yes |
where $R$ is the number of debate rounds.
17.6 Conflict Resolution: Priority Hierarchies, Evidence-Based Arbitration, and Escalation#
17.6.1 Conflict Taxonomy#
Conflicts in agent teams arise from:
| Conflict Type | Description | Example |
|---|---|---|
| Output Conflict | Agents produce mutually incompatible outputs for the same task | Two implementers generate contradictory code patches |
| Resource Conflict | Multiple agents claim the same resource simultaneously | Concurrent writes to the same artifact or tool endpoint |
| Priority Conflict | Agents disagree on task ordering or urgency | Planner schedules task A first; verifier demands task B first |
| Authority Conflict | Role boundaries overlap or are ambiguous | Two agents both believe they have write authority over a document |
| Semantic Conflict | Agents hold contradictory beliefs about facts or requirements | Inconsistent interpretations of an ambiguous specification |
17.6.2 Priority Hierarchy#
A total ordering over role authority resolves unambiguous conflicts mechanically.
Typical orderings place safety-oriented roles (Monitor, Verifier) above generative roles (Planner, Implementer, Retriever, Documenter).
When agents of different roles conflict, the higher-priority role's output prevails by default.
17.6.3 Evidence-Based Arbitration#
For substantive disagreements (semantic or output conflicts), resolution is not authority-based but evidence-based:
Pseudo-Algorithm 17.5 — Evidence-Based Conflict Resolution
PROCEDURE ResolveConflict(conflicting_outputs, agents, evidence_store, escalation_policy):
// Step 1: Evidence Collection
evidence_sets ← {}
FOR EACH (agent_i, output_i) IN conflicting_outputs:
evidence_i ← agent_i.ProduceEvidence(output_i, evidence_store)
// Evidence includes: source documents, test results, retrieved facts,
// formal proofs, historical precedents
evidence_sets[agent_i] ← evidence_i
// Step 2: Evidence Quality Scoring
FOR EACH (agent_i, evidence_i) IN evidence_sets:
evidence_i.score ← EvidenceQuality(evidence_i)
// Quality = f(provenance_strength, recency, source_authority,
// internal_consistency, corroboration_count)
// Step 3: Automated Resolution Attempt
best_supported ← argmax over (agent_i, output_i) of evidence_sets[agent_i].score
confidence ← evidence_sets[best_supported.agent].score /
SUM(all evidence scores)
IF confidence ≥ escalation_policy.auto_resolve_threshold:
RETURN Resolution(
selected = best_supported.output,
method = "evidence_based_auto",
confidence = confidence,
evidence_summary = evidence_sets
)
// Step 4: Escalation
IF escalation_policy.allow_human_escalation:
RETURN EscalateToHuman(conflicting_outputs, evidence_sets)
ELSE:
RETURN Resolution(
selected = best_supported.output,
method = "evidence_based_low_confidence",
confidence = confidence,
flag = "requires_review"
)
17.6.4 Evidence Quality Function#
$$Q(E) = w_1 \cdot \text{prov}(E) + w_2 \cdot e^{-\lambda \cdot \text{age}(E)} + w_3 \cdot \text{auth}(E) + w_4 \cdot \text{corr}(E)$$
where:
- $\text{prov}(E)$ — traceability to a verified source
- $e^{-\lambda \cdot \text{age}(E)}$ — exponential decay by source age
- $\text{auth}(E)$ — source tier ranking (official documentation > community wiki > generated text)
- $\text{corr}(E)$ — number of independent sources confirming the same claim
- weights $w_1, \dots, w_4$ are domain-configurable
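A minimal sketch of the quality function follows. The field names, default weights, and the saturating transform for corroboration counts are assumptions layered on the definition above:

```python
from math import exp

def evidence_quality(e: dict, weights=(0.35, 0.2, 0.25, 0.2), decay=0.01) -> float:
    """Weighted mix of provenance strength, recency, source authority,
    and corroboration. `provenance` and `authority` are assumed normalized
    to [0, 1]; `age_days` is days since publication; `corroborations` is
    a count of independent confirming sources."""
    w_prov, w_rec, w_auth, w_corr = weights
    recency = exp(-decay * e["age_days"])              # exponential age decay
    corroboration = 1 - 1 / (1 + e["corroborations"])  # count -> [0, 1)
    return (w_prov * e["provenance"] + w_rec * recency
            + w_auth * e["authority"] + w_corr * corroboration)
```

Because the weights sum to 1 and every term lies in [0, 1], scores are directly comparable across conflicting outputs, which is what the auto-resolve threshold in Pseudo-Algorithm 17.5 relies on.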
17.6.5 Escalation Ladder#
Escalation proceeds through a defined chain when automated resolution fails.
Each escalation level has:
- SLA: maximum response time
- Cost ceiling: budget allocated for the escalation step
- Outcome: binding decision plus ledger entry recording the resolution path
17.7 Team Memory: Shared Session State, Collective Episodic Memory, and Team Knowledge Base#
17.7.1 Memory Architecture Overview#
Team memory is stratified into four tiers with distinct lifecycle, access patterns, and write policies:
| Tier | Scope | Lifetime | Write Policy | Access Pattern |
|---|---|---|---|---|
| Working Memory | Per-agent, per-step | Single execution step | Agent-local, no coordination | Private read/write |
| Session Memory | Per-team, per-task | Task duration | Optimistic concurrency via state channel | Shared read/write |
| Episodic Memory | Per-team, cross-task | Configurable TTL (hours to months) | Validated write-back after task completion | Shared read; gated write |
| Knowledge Base | Organization-wide | Permanent (until explicit revocation) | Human-approved promotion from episodic tier | Read-only for agents; write via promotion pipeline |
17.7.2 Shared Session State#
The shared session state is the team's real-time coordination substrate.
Operations:
- `READ(key) → (value, version)` — returns latest committed value
- `WRITE(key, value, expected_version) → Success(new_version) | ConflictError` — compare-and-swap
- `SUBSCRIBE(key_pattern, callback)` — event-driven notification on matching writes
- `SCAN(prefix, limit) → [(key, value, version)]` — bounded enumeration
Namespacing prevents collision:
session/{team_id}/task_state/{subtask_id}
session/{team_id}/artifacts/{artifact_id}
session/{team_id}/roster/{agent_id}/status
session/{team_id}/consensus/{decision_id}
17.7.3 Collective Episodic Memory#
After task completion, the team's session state is processed into validated episodic memories:
Pseudo-Algorithm 17.6 — Episodic Memory Write-Back
PROCEDURE WriteBackEpisodicMemory(session_state, task_result, team, policy):
// Step 1: Extract candidate memories
candidates ← []
// Non-obvious corrections (e.g., "approach X failed; approach Y succeeded")
corrections ← ExtractCorrections(session_state.accountability_ledger)
candidates.EXTEND(corrections)
// Effective strategies (e.g., "for task type T, agent role R was critical")
strategies ← ExtractSuccessfulStrategies(session_state, task_result)
candidates.EXTEND(strategies)
// Discovered constraints (e.g., "API X has undocumented rate limit of 100/min")
constraints ← ExtractDiscoveredConstraints(session_state)
candidates.EXTEND(constraints)
// Step 2: Filter for novelty and non-obviousness
FOR EACH candidate IN candidates:
IF IsDuplicate(candidate, existing_episodic_memory):
SKIP
IF IsObvious(candidate, knowledge_base):
SKIP // Don't store what's already in canonical knowledge
candidate.utility_score ← EstimateUtility(candidate, policy.utility_model)
// Step 3: Validate and store
validated ← FILTER candidates WHERE utility_score ≥ policy.min_utility
FOR EACH memory IN validated:
memory.provenance ← BuildProvenance(memory, session_state, team)
memory.expiry ← ComputeExpiry(memory, policy.ttl_model)
memory.embedding ← Embed(memory.content)
EpisodicStore.Write(memory)
RETURN validated
17.7.4 Team Knowledge Base#
The knowledge base contains organization-level facts, policies, and validated procedures that transcend individual teams:
Promotion pipeline (episodic → knowledge):
- Frequency threshold — episodic memories referenced at least $k$ times across distinct teams
- Validation — confirmed by human reviewer or automated test suite
- Deduplication — merged with existing knowledge items if overlapping
- Versioning — supersedes prior versions with explicit deprecation markers
17.7.5 Memory Garbage Collection#
Stale memories degrade retrieval precision. A periodic cleanup agent evicts entries whose TTL has expired or whose utility score has decayed below a retention threshold.
Expired items are soft-deleted (moved to archive), not hard-deleted, to preserve audit trails.
17.8 Load Balancing Across Team Members: Work Distribution, Capacity Monitoring, and Rebalancing#
17.8.1 Load Model#
Each agent $a$ has a capacity model $\kappa_a$ and a current load vector $\ell_a(t)$, each spanning resource dimensions (e.g., tokens per minute, concurrent tasks, queue slots). The load ratio is:
$$\rho_a(t) = \left\| \ell_a(t) \oslash \kappa_a \right\|_w = \max_j \; w_j \cdot \frac{\ell_{a,j}(t)}{\kappa_{a,j}}$$
where the weighted maximum norm reflects the most constrained dimension.
17.8.2 Work Distribution Strategies#
Strategy 1: Round-Robin with Capability Filtering
Simple, fair, but ignores heterogeneous agent strengths: tasks rotate in fixed order through the eligible set $E = \{ a : \text{capabilities}(a) \supseteq \text{requirements}(\text{task}) \}$.
Strategy 2: Least-Loaded Assignment
$$a^* = \arg\min_{a \in E} \rho_a(t)$$
Balances load but may under-utilize specialized agents by routing work to generalists.
Strategy 3: Capability-Weighted Least-Loaded
$$a^* = \arg\max_{a \in E} \frac{\text{match}(a, \text{task})}{\epsilon + \rho_a(t)}$$
This favors agents with both low load and high task affinity.
Strategy 4: Predictive Assignment
Uses estimated task completion time to minimize makespan:
$$a^* = \arg\min_{a \in E} \left( \text{finish}_a(t) + \widehat{T}(a, \text{task}) \right)$$
where $\text{finish}_a(t)$ is the projected completion time of agent $a$'s current queue and $\widehat{T}$ is the estimated execution time of the new task.
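The load ratio and the least-loaded strategy can be sketched together. The dict-based agent encoding and function names are illustrative assumptions:

```python
def load_ratio(load: dict, capacity: dict, weights: dict) -> float:
    """Weighted-max load ratio: the most constrained dimension dominates."""
    return max(weights.get(k, 1.0) * load.get(k, 0.0) / capacity[k]
               for k in capacity)

def pick_least_loaded(agents: list, task_requirements: set):
    """Strategy 2: least-loaded assignment among capability-eligible agents."""
    eligible = [a for a in agents if a["capabilities"] >= task_requirements]
    if not eligible:
        return None
    return min(eligible,
               key=lambda a: load_ratio(a["load"], a["capacity"],
                                        a.get("weights", {})))
```

Note how the max norm surfaces hidden saturation: an agent with plenty of token budget but a full concurrency slot is still treated as loaded.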
17.8.3 Capacity Monitoring#
Pseudo-Algorithm 17.7 — Capacity Monitor
PROCEDURE MonitorCapacity(team, interval, alert_thresholds):
LOOP EVERY interval:
FOR EACH agent IN team:
load ← MeasureCurrentLoad(agent)
capacity ← GetCapacity(agent)
rho ← ComputeLoadRatio(load, capacity)
PublishMetric("agent_load_ratio", agent.id, rho)
IF rho ≥ alert_thresholds.overload: // e.g., 0.9
EmitAlert(OVERLOAD, agent, rho)
TriggerRebalance(team, agent)
ELSE IF rho ≥ alert_thresholds.high: // e.g., 0.75
EmitAlert(HIGH_LOAD, agent, rho)
ELSE IF rho ≤ alert_thresholds.idle: // e.g., 0.1
EmitAlert(UNDERUTILIZED, agent, rho)
// Team-level metrics
rho_mean ← MEAN(all rho values)
rho_stddev ← STDDEV(all rho values)
imbalance ← rho_stddev / rho_mean // Coefficient of variation
PublishMetric("team_load_imbalance", team.id, imbalance)
IF imbalance > alert_thresholds.max_imbalance:
TriggerRebalance(team)
17.8.4 Rebalancing Protocol#
When load imbalance exceeds thresholds, the orchestrator redistributes work:
Pseudo-Algorithm 17.8 — Work Rebalancing
PROCEDURE Rebalance(team, overloaded_agents, orchestrator):
// Step 1: Identify movable tasks
movable_tasks ← []
FOR EACH agent IN overloaded_agents:
FOR EACH task IN agent.active_queue:
IF task.status = QUEUED AND NOT task.pinned:
movable_tasks.APPEND((agent, task))
// Step 2: Find target agents
FOR EACH (source, task) IN movable_tasks:
targets ← FILTER team WHERE:
agent.id ≠ source.id AND
agent.capabilities ⊇ task.requirements AND
LoadAfterAssignment(agent, task) < rho_max
IF targets NOT EMPTY:
best_target ← argmin over targets of LoadAfterAssignment(target, task)
ExecuteHandoff(source, best_target, task)
LogRebalance(source, best_target, task)
// Step 3: Scale if rebalancing insufficient
IF StillOverloaded(team):
IF scaling_policy.allow_auto_scale:
new_agent ← ProvisionAgent(required_capabilities, scaling_policy)
team.ADD(new_agent)
Rebalance(team, overloaded_agents, orchestrator) // Recurse once
17.8.5 Backpressure#
When all agents are saturated and scaling is exhausted, the system applies backpressure:
- Queue depth limits — reject new tasks beyond queue capacity with explicit error codes
- Priority-based shedding — drop lowest-priority tasks, notifying callers
- Deadline-aware deferral — tasks with distant deadlines are deferred; urgent tasks are prioritized
- Client-facing latency signals — propagate expected wait times to upstream callers
17.9 Team Performance Metrics: Throughput, Quality, Coordination Overhead, and Team Efficiency#
17.9.1 Metric Taxonomy#
Team performance must be measured at multiple granularities to enable diagnosis:
| Level | Metrics |
|---|---|
| Agent-level | Task accuracy, mean latency, token efficiency, error rate, handoff success rate |
| Team-level | End-to-end task throughput, collective quality score, coordination overhead, makespan |
| System-level | Cost per task, SLO compliance, human escalation rate, knowledge base growth rate |
17.9.2 Core Metric Definitions#
Throughput:
$$\text{Throughput} = \frac{\text{tasks completed}}{\text{unit time}}$$
Quality:
$$Q_{\text{team}} = \frac{1}{|T|} \sum_{\tau \in T} q(\tau)$$
where $q(\cdot)$ is domain-specific: exact match for structured outputs, rubric-based evaluation for generative tasks, test-pass rate for code.
Coordination Overhead:
$$\chi = \frac{\text{time (or tokens) spent on coordination}}{\text{total time (or tokens)}}$$
A well-functioning team targets $\chi < 0.2$ (less than 20% of total time on coordination).
Team Efficiency:
$$\eta = \frac{\text{team output value}}{\sum_i \text{individual output value}}$$
$\eta > 1$ indicates super-additive collaboration; $\eta < 1$ indicates coordination costs exceed collaboration benefits.
Makespan vs. Sum-of-Parts:
$$\text{Speedup} = \frac{\sum_i T_i^{\text{solo}}}{\text{makespan}}$$
Theoretical maximum is $n$ for $n$ agents (linear speedup); practical values are bounded by Amdahl's Law:
$$S(n) = \frac{1}{(1 - f) + f/n}$$
where $f$ is the fraction of work that is parallelizable.
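Two of these metrics are one-liners worth having on hand (function names are illustrative):

```python
def amdahl_speedup(f: float, n: int) -> float:
    """Upper bound on team speedup with parallelizable fraction f and n agents."""
    return 1.0 / ((1.0 - f) + f / n)

def coordination_overhead(coord_time: float, total_time: float) -> float:
    """Fraction of total time (or tokens) spent on coordination."""
    return coord_time / total_time
```

Even at 80% parallelizable work, eight agents yield at most a 3.3x speedup, which is why reducing the serial coordination fraction usually pays more than adding agents.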
17.9.3 Coordination Cost Model#
Total cost of a team task:
$$C_{\text{total}} = \sum_{i} C^{\text{exec}}_i + C^{\text{coord}} + C^{\text{consensus}} + C^{\text{rework}}$$
Optimization objective: minimize $C_{\text{total}}$ subject to the quality gates and latency SLOs being satisfied.
17.9.4 Performance Dashboard Schema#
A production team performance dashboard exposes:
TeamPerformanceDashboard:
real_time:
active_tasks_count: gauge
agent_load_ratios: histogram
queue_depths: per_agent gauge
handoff_success_rate: rolling_window counter
periodic (per task completion):
task_latency: histogram (p50, p90, p99)
task_quality_score: histogram
tokens_consumed: counter
consensus_rounds_required: histogram
conflict_resolution_count: counter
aggregate (hourly/daily):
throughput: rate
coordination_overhead_ratio: gauge
team_efficiency: gauge
cost_per_task: gauge
SLO_compliance_rate: percentage
human_escalation_rate: percentage
17.9.5 Diagnostic Analysis: Coordination Anti-Patterns#
| Anti-Pattern | Symptom | Metric Signal | Remediation |
|---|---|---|---|
| Bottleneck Agent | One agent's queue grows while others are idle | High load variance; one agent's load ratio near saturation | Rebalance, add specialists, decompose tasks |
| Consensus Thrashing | Debate rounds consistently hit max_rounds | High mean round count $R$, low convergence rate | Tighten task specs, increase verifier authority |
| Handoff Ping-Pong | Tasks bounce between agents repeatedly | Handoff count per task > 3 | Clarify role boundaries, improve context summaries |
| Redundant Work | Multiple agents unknowingly solve the same sub-task | Token cost anomaly, duplicate artifact detection | Improve task locking, shared state visibility |
| Escalation Cascade | Most conflicts escalate to human review | Escalation rate > 15% | Improve evidence quality, lower auto-resolve threshold |
17.10 Adaptive Team Composition: Runtime Role Reassignment Based on Task Evolution#
17.10.1 Motivation#
Tasks evolve during execution. A planning-heavy initial phase may give way to implementation-intensive work, then shift to verification-dominant finalization. Static role assignments waste capacity by keeping planning agents idle during implementation and vice versa.
17.10.2 Task Phase Model#
Model the task lifecycle as a sequence of phases $P = \langle p_1, p_2, \dots, p_m \rangle$ with distinct capability demands.
Each phase $p$ has a capability demand profile:
$$d_p : R \to [0, 1]$$
where $d_p(r)$ represents the intensity of demand for role $r$ during phase $p$.
17.10.3 Phase Detection#
The orchestrator continuously monitors task state to detect phase transitions:
Pseudo-Algorithm 17.9 — Phase Detection and Role Reassignment
PROCEDURE AdaptiveComposition(team, task_state, phase_model, reassignment_policy):
current_phase ← DetectCurrentPhase(task_state, phase_model)
// Detection uses: subtask completion ratios, artifact types being produced,
// queue composition, time elapsed vs. estimated timeline
IF current_phase ≠ last_detected_phase:
// Phase transition detected
demand_profile ← phase_model.GetDemand(current_phase)
current_allocation ← GetCurrentRoleAllocation(team)
// Compute reallocation
deficit ← {}
surplus ← {}
FOR EACH role IN role_taxonomy:
delta ← demand_profile[role] - current_allocation[role]
IF delta > 0:
deficit[role] ← delta
ELSE IF delta < 0:
surplus[role] ← |delta|
// Reassign surplus agents to deficit roles
FOR EACH (surplus_role, count) IN surplus:
reassignable ← GetAgentsWithRole(team, surplus_role)
reassignable ← FILTER reassignable WHERE:
agent.capabilities SUPPORTS any deficit_role AND
agent.active_tasks = 0 OR agent.active_tasks are pausable
FOR EACH agent IN reassignable[0:count]:
target_role ← SelectBestDeficitRole(agent, deficit)
IF target_role AND reassignment_policy.AllowReassignment(agent, target_role):
ExecuteRoleReassignment(agent, surplus_role, target_role)
UpdateSMM(team, agent, target_role)
deficit[target_role] ← deficit[target_role] - 1
// Scale if deficit persists
FOR EACH (role, remaining) IN deficit WHERE remaining > 0:
IF reassignment_policy.allow_scaling:
FOR i ← 1 TO remaining:
new_agent ← ProvisionAgent(role)
team.ADD(new_agent)
last_detected_phase ← current_phase

17.10.4 Role Reassignment Cost Function#
Reassignment is not free. The cost includes context reconstruction, warm-up latency, and potential errors during transition:

$$C_{\text{reassign}}(a, r_{\text{from}}, r_{\text{to}}) = C_{\text{context}}(a, r_{\text{to}}) + C_{\text{warmup}}(a, r_{\text{to}}) + \mathbb{E}[C_{\text{transition-error}}]$$

Reassignment is justified only when the expected benefit exceeds this cost:

$$\alpha \cdot \Delta_{\text{throughput}} + \beta \cdot \Delta_{\text{quality}} > C_{\text{reassign}}$$

where $\alpha$ and $\beta$ are value multipliers converting metric improvements to cost equivalents.
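The cost components and value multipliers described above reduce to a few lines. A hedged sketch, with hypothetical cost parameters and the multipliers exposed as arguments:

```python
def reassignment_cost(context_tokens, token_price, warmup_latency_cost,
                      p_transition_error, error_cost):
    """Cost = context reconstruction + warm-up + expected transition errors.
    All parameters are illustrative; real deployments would calibrate them."""
    return (context_tokens * token_price
            + warmup_latency_cost
            + p_transition_error * error_cost)

def reassignment_justified(d_throughput, d_quality, cost, alpha=1.0, beta=1.0):
    """Approve only when alpha*dThroughput + beta*dQuality exceeds the cost."""
    return alpha * d_throughput + beta * d_quality > cost
```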
17.10.5 Agent Polymorphism#
Some agents are polymorphic: they can assume multiple roles with minimal context switching cost (e.g., a large frontier model with broad capabilities). Others are specialized: optimized for one role with high performance but unable to switch (e.g., a fine-tuned code generation model).
The team composition optimizer balances the higher per-task performance of specialists against the reallocation flexibility of polymorphic agents, weighing expected phase-transition frequency against reassignment cost.
A practical heuristic: maintain a core of specialized agents for steady-state roles and a pool of polymorphic agents for adaptive reallocation.
17.11 Human-Agent Team Integration: Blended Teams with Human Experts and AI Agents#
17.11.1 Blended Team Model#
A blended team extends the agent team model to include human participants:

$$\mathcal{T}_{\text{blended}} = \langle \mathcal{A} \cup \mathcal{H}, \mathcal{R}, \rho, G, \mathcal{C}, \mathcal{L} \rangle$$

where $\mathcal{H}$ is the set of human participants, assigned roles through the same assignment function $\rho$ and bound by the same accountability ledger $\mathcal{L}$.
Human participants differ from agent participants in:
| Dimension | Human | Agent |
|---|---|---|
| Latency | Minutes to hours | Milliseconds to seconds |
| Availability | Scheduled, asynchronous | Always-on, synchronous |
| Judgment | Nuanced, contextual, value-laden | Consistent, scalable, policy-bound |
| Authority | Ultimate decision authority | Delegated authority within policy bounds |
| Error profile | Fatigue, attention, bias | Hallucination, specification misinterpretation |
| Communication | Natural language, high bandwidth | Structured protocols, token-bounded |
17.11.2 Human-Agent Interaction Protocols#
Protocol 1: Human-in-the-Loop (HITL) — Approval Gate
Agents produce candidate outputs; humans approve, modify, or reject before commitment.
Use when: mutation risk is high, regulatory requirements mandate human oversight, or agent confidence is below threshold.
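A minimal HITL approval gate might look like the following sketch; `review_fn` (standing in for the human review channel), the `Verdict` enum, and the confidence threshold are all illustrative assumptions:

```python
from enum import Enum

class Verdict(Enum):
    APPROVE = "approve"
    MODIFY = "modify"
    REJECT = "reject"

def hitl_gate(candidate, confidence, review_fn,
              confidence_threshold=0.9, high_risk=False):
    """Route a candidate output through a human approval gate.

    review_fn takes the candidate and returns (Verdict, possibly-edited
    candidate). High-risk mutations are always reviewed, matching the
    'use when' conditions above.
    """
    # Commit directly only when review is not mandated and confidence is high.
    if not high_risk and confidence >= confidence_threshold:
        return candidate
    verdict, reviewed = review_fn(candidate)
    if verdict is Verdict.APPROVE:
        return candidate
    if verdict is Verdict.MODIFY:
        return reviewed
    return None  # rejected: nothing is committed
```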
Protocol 2: Human-on-the-Loop (HOTL) — Supervisory Monitoring
Agents operate autonomously with human monitoring; the human intervenes only on alerts.
Use when: task volume is high, agent reliability is established, and cost of occasional errors is bounded.
Protocol 3: Human-as-Tool
Agents invoke human expertise as a structured tool call.
The tool interface exposes:
- schema — typed question format with required context fields
- timeout — maximum wait time before fallback
- fallback — default action if human does not respond within deadline
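The timeout/fallback contract can be sketched with a blocking queue standing in for the human response channel (chat UI, ticketing system); all names here are hypothetical:

```python
import queue

def ask_human(question_payload, response_queue, timeout_s, fallback):
    """Invoke human expertise as a tool call: block up to timeout_s for an
    answer on response_queue, otherwise return the configured fallback."""
    try:
        return response_queue.get(timeout=timeout_s)
    except queue.Empty:
        return fallback
```

The agent-side call site treats this exactly like any other tool: a typed request in, a typed (or fallback) answer out, with the deadline enforced by the interface rather than by the agent's own loop.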
17.11.3 Asymmetric Communication Design#
Human attention is the scarcest resource. Communication from agents to humans must be:
- Summarized — compress full context into decision-ready briefings
- Actionable — present clear options with trade-off analysis, not raw data
- Prioritized — sort by urgency and impact; batch low-priority items
- Minimal — eliminate unnecessary interruptions; escalate only when policy requires
Pseudo-Algorithm 17.10 — Human Escalation Manager
PROCEDURE ManageHumanEscalation(escalation_request, human_state, policy):
// Step 1: Assess necessity
urgency ← AssessUrgency(escalation_request)
impact ← AssessImpact(escalation_request)
agent_confidence ← escalation_request.confidence
// Step 2: Check if agent can self-resolve with lower threshold
IF agent_confidence > policy.soft_threshold AND impact < policy.impact_ceiling:
RETURN AutoResolve(escalation_request, flag_for_async_review=TRUE)
// Step 3: Batch non-urgent escalations
IF urgency < policy.urgency_threshold:
AddToBatch(escalation_request, human_state.pending_batch)
IF human_state.pending_batch.size ≥ policy.batch_size OR
human_state.pending_batch.age ≥ policy.max_batch_age:
FormatBatchBriefing(human_state.pending_batch)
NotifyHuman(human_state, "batch_review_ready")
RETURN DEFERRED
// Step 4: Urgent escalation
briefing ← FormatUrgentBriefing(
question = escalation_request.question,
context_summary = CompressContext(escalation_request.context, policy.briefing_budget),
options = escalation_request.options,
recommendation = escalation_request.agent_recommendation,
evidence = escalation_request.evidence_summary,
deadline = escalation_request.deadline
)
NotifyHuman(human_state, briefing, priority=HIGH)
// Step 5: Wait with timeout
response ← WaitForHumanResponse(timeout=escalation_request.deadline)
IF response = TIMEOUT:
RETURN FallbackAction(escalation_request, policy.timeout_fallback)
// Step 6: Record and apply
RecordHumanDecision(response, escalation_request, accountability_ledger)
RETURN response

17.11.4 Trust Calibration#
The team's delegation policy must adapt based on observed agent reliability:
A practical trust model uses a Beta distribution over each agent's reliability:

$$p_a \sim \mathrm{Beta}(\alpha_a, \beta_a)$$

where $\alpha_a$ counts successful autonomous completions and $\beta_a$ counts failures requiring human correction. The delegation threshold is:

$$\text{delegate}(a) \iff \frac{\alpha_a}{\alpha_a + \beta_a} > \tau \;\wedge\; \alpha_a + \beta_a \geq n_{\min}$$

This ensures both high estimated reliability and sufficient evidence (sample size).
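The delegation rule is a few lines of code. A sketch, with the reliability threshold and minimum sample size as parameters (the default values are illustrative, not prescriptive):

```python
def delegation_allowed(alpha, beta, tau=0.9, n_min=20):
    """Delegate autonomously only when estimated reliability alpha/(alpha+beta)
    exceeds tau AND the evidence base alpha+beta reaches n_min observations."""
    if alpha + beta == 0:
        return False  # no evidence yet: never delegate blind
    return alpha / (alpha + beta) > tau and (alpha + beta) >= n_min

def update_trust(alpha, beta, outcome_ok):
    """Increment the appropriate count after each autonomous completion."""
    return (alpha + 1, beta) if outcome_ok else (alpha, beta + 1)
```

The two-condition gate is what distinguishes this from a raw success-rate threshold: an agent with 10/10 successes still routes through human review until the sample is large enough to trust.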
17.12 Inspiration from High-Reliability Organizations (HROs): Crew Resource Management for Agent Teams#
17.12.1 HRO Principles Applied to Agent Teams#
High-Reliability Organizations (HROs)—nuclear power plants, aircraft carrier operations, air traffic control, surgical teams—achieve extremely low failure rates in high-complexity, high-consequence environments. Five core HRO principles map directly to agent team design:
| HRO Principle | Definition | Agent Team Application |
|---|---|---|
| Preoccupation with Failure | Treat near-misses as full failures; actively seek failure signals | Monitor all agent outputs for hallucination markers, even when outputs appear correct; log near-misses (low-confidence correct answers) |
| Reluctance to Simplify | Resist reductive interpretations; maintain nuanced situation awareness | Require agents to preserve uncertainty and alternative interpretations in handoff context; forbid premature commitment |
| Sensitivity to Operations | Maintain real-time situational awareness of frontline work | Orchestrator continuously monitors agent-level execution, not just aggregate metrics; agents report anomalies unprompted |
| Commitment to Resilience | Design for graceful degradation, not just failure prevention | Bounded retry, compensating actions, fallback models, degraded-but-functional operation modes |
| Deference to Expertise | Decision authority flows to the most knowledgeable agent, not the highest-ranked | Override priority hierarchies when a specialist agent has domain-specific evidence that contradicts a generalist orchestrator |
17.12.2 Crew Resource Management (CRM) for Agent Teams#
CRM, originally developed for aviation cockpit teams, provides formalized protocols that map directly onto agent coordination.
Briefing and Debriefing:
- Pre-Task Briefing — Before execution, the orchestrator distributes the shared mental model, confirms role understanding, identifies known risks, and establishes communication protocols
- Post-Task Debrief — After completion, the team reviews outcomes, identifies coordination failures, extracts lessons, and writes them to episodic memory
Pseudo-Algorithm 17.11 — CRM-Inspired Pre-Task Briefing
PROCEDURE PreTaskBriefing(team, task, orchestrator):
// Step 1: Situation Assessment
situation ← orchestrator.AssessSituation(task, environment_state)
risks ← orchestrator.IdentifyRisks(task, team.capabilities)
// Step 2: Plan Communication
plan ← orchestrator.CreatePlan(task)
FOR EACH agent IN team:
agent_brief ← {
role: agent.assigned_role,
objectives: ExtractRoleObjectives(plan, agent.assigned_role),
risks: FilterRoleRelevantRisks(risks, agent.assigned_role),
communication_protocols: {
report_to: GetSupervisor(agent, team),
escalation_trigger: GetEscalationCriteria(agent.assigned_role),
status_interval: GetStatusReportInterval(task.urgency)
},
authority_boundaries: GetAuthorityBounds(agent.assigned_role),
challenge_protocol: "If you observe information contradicting the plan, "
+ "you are OBLIGATED to voice concern to orchestrator "
+ "with evidence before proceeding."
}
DeliverBriefing(agent, agent_brief)
// Step 3: Confirmation
FOR EACH agent IN team:
confirmation ← agent.ConfirmBriefing(agent_brief)
IF NOT confirmation.understood:
ClarifyAndRebriefing(agent, confirmation.questions)
// Step 4: Establish Monitoring
SetupMonitoringChannels(team, task)
RETURN BriefingRecord(situation, plan, risks, confirmations)

17.12.3 Challenge-and-Response Protocol#
A critical CRM mechanism is the challenge protocol: any team member who observes an anomaly is obligated to raise it, regardless of role hierarchy. In agent teams, any agent that observes an anomaly in another agent's output issues a formal challenge message to the responsible agent.
The challenged agent must respond with one of:
- Acknowledge and Correct — accept the challenge, modify output
- Acknowledge and Justify — explain why the observation does not invalidate the output
- Escalate — neither agent can resolve; escalate to orchestrator or human
Challenges are logged in the accountability ledger. An agent that ignores a challenge without justification triggers an automatic escalation.
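The three admissible responses, plus automatic escalation for ignored or malformed ones, reduce to a small dispatch. A sketch with assumed message shapes (the `kind` field and return strings are hypothetical):

```python
def resolve_challenge(response, ledger, escalate_fn):
    """Resolve a challenge per the protocol: correct, justify, or escalate.

    response is assumed to be a dict like {"kind": "correct" | "justify"
    | "escalate", ...}. A missing or unrecognized response triggers
    automatic escalation, as the protocol requires for ignored challenges.
    """
    ledger.append(response)  # every challenge outcome is logged
    kind = (response or {}).get("kind")
    if kind == "correct":
        return "output_revised"   # challenge accepted, output modified
    if kind == "justify":
        return "output_stands"    # observation does not invalidate output
    # "escalate", ignored, or malformed responses all go up the chain
    escalate_fn(response)
    return "escalated"
```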
17.12.4 Assertive Communication Hierarchy#
Adapted from aviation's assertiveness ladder:
| Level | Action | Agent Equivalent |
|---|---|---|
| 1. Hint | Subtle suggestion | Append low-confidence note to output |
| 2. Preference | Express opinion | Include alternative approach in rationale |
| 3. Query | Direct question | Formal challenge message to responsible agent |
| 4. Statement | Declarative concern | Flag output as potentially incorrect in shared state |
| 5. Command | Direct override (authority required) | Orchestrator vetoes output; triggers re-execution |
Agent teams should be configured to operate at level 3–4 by default: agents should actively challenge rather than passively suggest. This is implemented by encoding challenge obligations directly in the role contract.
17.12.5 Structured Debriefing Protocol#
Pseudo-Algorithm 17.12 — Post-Task Structured Debrief
PROCEDURE PostTaskDebrief(team, task_result, execution_trace, orchestrator):
// Step 1: Outcome Assessment
success ← EvaluateOutcome(task_result, task.success_criteria)
quality_score ← ComputeQualityScore(task_result)
// Step 2: Timeline Reconstruction
timeline ← ReconstructTimeline(execution_trace)
critical_path ← IdentifyCriticalPath(timeline)
bottlenecks ← IdentifyBottlenecks(timeline)
// Step 3: Anomaly Review
anomalies ← []
FOR EACH event IN execution_trace:
IF event.type IN {CONFLICT, ESCALATION, RETRY, CHALLENGE, ERROR}:
anomalies.APPEND(event)
// Step 4: Causal Analysis (for failures or near-misses)
IF NOT success OR quality_score < threshold:
root_causes ← RootCauseAnalysis(anomalies, timeline, task_result)
corrective_actions ← GenerateCorrectiveActions(root_causes)
ELSE:
// Even for successes, analyze near-misses
near_misses ← FILTER anomalies WHERE resolved_without_failure = TRUE
IF near_misses NOT EMPTY:
preventive_actions ← AnalyzeNearMisses(near_misses)
// Step 5: Lessons Extraction
lessons ← ExtractLessons(
anomalies, bottlenecks,
successful_strategies = IdentifyEffectivePatterns(timeline),
failed_strategies = IdentifyFailedPatterns(timeline)
)
// Step 6: Memory Write-Back
WriteBackEpisodicMemory(lessons, task_result, team, memory_policy)
// Step 7: Metric Update
UpdateAgentReliabilityScores(team, task_result, timeline)
UpdateTeamPerformanceMetrics(team, task_result, timeline)
// Step 8: Policy Refinement (if warranted)
IF corrective_actions CONTAINS policy_changes:
ProposePolicyUpdates(corrective_actions, orchestrator)
RETURN DebriefReport(success, quality_score, anomalies, lessons, corrective_actions)

17.12.6 Swiss Cheese Model for Agent Failure Defense#
Borrowing from James Reason's accident causation model, agent team reliability is achieved through multiple independent defense layers, each imperfect but collectively robust:
| Defense Layer | Mechanism | Failure Mode Addressed |
|---|---|---|
| 1. Input Validation | Schema enforcement, constraint checking | Malformed or adversarial inputs |
| 2. Agent Self-Check | Chain-of-thought verification, confidence scoring | Hallucination, reasoning errors |
| 3. Peer Review | Verifier agent cross-checks implementer output | Systematic model bias |
| 4. Consensus | Multi-agent voting or debate | Individual agent failure |
| 5. Automated Testing | Test harness execution against known cases | Functional correctness |
| 6. Human Oversight | HITL/HOTL review for high-risk decisions | Novel failure modes, value alignment |
| 7. Post-Deployment Monitoring | Production regression detection, anomaly alerting | Drift, environmental changes |
If each layer independently catches 90% of errors passing through it, the probability that an error escapes all seven layers is:

$$P(\text{escape}) = (1 - 0.9)^7 = 10^{-7}$$
In practice, layers are not perfectly independent, but even partial independence provides substantial reliability amplification.
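Under the independence assumption, the layered-defense arithmetic is a one-line product over per-layer catch rates:

```python
def residual_error_rate(catch_rates):
    """Probability an error escapes every defense layer, assuming the
    layers catch errors independently of one another."""
    escape = 1.0
    for rate in catch_rates:
        escape *= (1.0 - rate)
    return escape
```

Heterogeneous layers drop in directly: for example, a weaker post-deployment monitor (say 0.6) alongside six 0.9 layers still yields a residual rate of $0.1^6 \times 0.4 = 4 \times 10^{-7}$.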
17.12.7 Operational Readiness Levels#
Inspired by NASA's Technology Readiness Levels (TRLs), define Team Operational Readiness Levels (TORLs):
| TORL | Description | Criteria |
|---|---|---|
| 1 | Concept | Team roles and protocols defined on paper |
| 2 | Component Testing | Individual agents validated in isolation |
| 3 | Integration Testing | Handoffs, consensus, and conflict resolution tested with synthetic tasks |
| 4 | Simulated Operations | Full team operates on realistic workloads in staging environment |
| 5 | Supervised Production | Team operates on production tasks with mandatory human review (HITL) |
| 6 | Monitored Production | Team operates autonomously with human monitoring (HOTL) and automatic escalation |
| 7 | Full Autonomy | Team operates within well-defined bounds without routine human intervention; human escalation only for edge cases |
Teams advance through TORLs based on measured performance against quality gates at each level, never by fiat or optimism.
Chapter Summary#
Agent team coordination is a systems engineering discipline, not a prompt engineering exercise. This chapter has formalized:
- Organizational structure — teams as typed tuples with explicit roles, authority, obligations, and accountability ledgers
- Formation — static, dynamic, and capability-matched team assembly with formal optimization models
- Shared mental models — synchronized, version-controlled, token-efficient context sharing
- Handoffs — three-phase commit protocols with context summarization and responsibility chains
- Consensus — majority voting, weighted voting, structured debate, and arbitration with selection criteria
- Conflict resolution — evidence-based arbitration with formal quality scoring and escalation ladders
- Team memory — four-tier architecture with validated write-back, garbage collection, and promotion pipelines
- Load balancing — capacity models, assignment strategies, rebalancing protocols, and backpressure mechanisms
- Performance metrics — throughput, quality, coordination overhead, and efficiency with diagnostic anti-pattern detection
- Adaptive composition — runtime phase detection and cost-justified role reassignment
- Human integration — HITL/HOTL/human-as-tool protocols with trust calibration via Beta distributions
- HRO principles — CRM briefing/debriefing, challenge protocols, Swiss Cheese defense layers, and operational readiness levels
The unifying principle: coordination reliability is not emergent; it is engineered. Every communication channel is typed. Every handoff is verified. Every decision is traceable. Every failure is attributable. Every lesson is captured. The team that executes reliably at scale is the team whose coordination substrate was designed with the same rigor as its individual agents.