Preface to the Chapter#
In production agentic systems, the single greatest determinant of downstream generation quality is not the model, not the prompt, and not the orchestration topology—it is the evidence that reaches the context window at inference time. Retrieval is the supply chain of cognition. When that supply chain delivers imprecise, stale, unattributed, or contextually irrelevant material, no amount of prompt engineering or agent-loop sophistication can compensate. This chapter formalizes retrieval as a deterministic, provenance-tagged, multi-tier evidence engine with typed contracts, latency budgets, authority ranking, and measurable quality gates. Every design decision is motivated by the requirements of an agentic runtime that must act on evidence under bounded time, bounded tokens, and bounded trust.
8.1 Retrieval as a Deterministic Evidence Engine, Not Ad Hoc RAG#
8.1.1 The Fundamental Problem with Conventional RAG#
Retrieval-Augmented Generation (RAG), as typically deployed, operates as a loose coupling between a vector store and a language model. A user query is embedded, a nearest-neighbor lookup returns chunks, those chunks are concatenated into the prompt, and the model generates. This pattern is structurally deficient for agentic workloads across every axis that matters:
- Non-determinism: Identical queries may return different chunks across runs due to index staleness, embedding drift, or approximate nearest-neighbor (ANN) non-determinism.
- Absence of provenance: Chunks arrive as anonymous text blobs without source identity, extraction timestamp, confidence score, or chain-of-custody metadata.
- Single-signal ranking: Retrieval relies on a single similarity score (typically cosine distance in embedding space), ignoring authority, freshness, execution utility, and access control.
- No query decomposition: The user query is forwarded verbatim to the retrieval layer, forfeiting the opportunity to rewrite, expand, or decompose it into multiple subqueries targeting heterogeneous sources.
- No verification loop: Retrieved evidence is never validated against ground truth, cross-referenced across sources, or scored for faithfulness before injection into the generation context.
8.1.2 The Evidence Engine Paradigm#
A deterministic evidence engine replaces ad hoc RAG with a structured pipeline that guarantees:
- Reproducibility: Given the same query, corpus version, and configuration, the engine returns identical ranked evidence sets.
- Provenance: Every evidence fragment carries a typed provenance record: source identifier, extraction timestamp, confidence score, authority tier, and chain-of-custody path.
- Multi-signal ranking: Evidence is scored across a composite function of relevance, authority, freshness, execution utility, and diversity.
- Latency-bounded execution: Retrieval operates under a strict deadline budget with tiered fallbacks, early termination, and cache-first policies.
- Token-budget awareness: The engine returns evidence sized and prioritized to fit within the caller's declared token budget, not an arbitrary fixed top-k cutoff.
8.1.3 Formal Contract of the Evidence Engine#
The evidence engine exposes a typed interface:
EvidenceRequest {
query: DecomposedQuery // Rewritten, expanded, decomposed subqueries
source_policy: SourcePolicy // Allowed sources, authority thresholds, ACL scope
token_budget: uint32 // Maximum tokens for returned evidence
latency_deadline_ms: uint32 // Hard deadline for retrieval completion
ranking_weights: RankingWeights // Authority, freshness, relevance, utility weights
diversity_constraint: float32 // MMR lambda or equivalent diversity parameter
provenance_required: bool // If true, reject evidence without provenance
session_context: SessionID // For historical usage pattern retrieval
}
EvidenceResponse {
fragments: List<EvidenceFragment> // Ranked, provenance-tagged evidence
total_candidates: uint32 // Total candidates before ranking/filtering
latency_ms: uint32 // Actual retrieval latency
source_coverage: Map<SourceID, CoverageReport>
truncation_applied: bool // True if token budget forced truncation
cache_hit_ratio: float32 // Fraction of results served from cache
}
EvidenceFragment {
content: string // The evidence text
source_id: SourceID // Canonical source identifier
chunk_id: ChunkID // Unique chunk identifier within source
extraction_timestamp: Timestamp // When the chunk was extracted/indexed
confidence: float32 // Retrieval confidence score in [0, 1]
authority_tier: AuthorityTier // e.g., CANONICAL, CURATED, DERIVED, USER
relevance_score: float32 // Composite relevance score
provenance_chain: List<ProvenanceEntry> // Full chain-of-custody
metadata: Map<string, string> // Arbitrary typed metadata
token_count: uint32 // Pre-computed token count
}

This contract is exposed via gRPC/Protobuf for internal agent-to-retrieval calls (low latency, typed, streaming-capable) and via JSON-RPC at the application boundary for external consumers. MCP tool servers wrap the retrieval engine for agent discovery and interoperability.
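The typed contract above can be mirrored directly in code. Below is a minimal Python sketch of the EvidenceFragment record; field names follow the schema above, while the `ProvenanceEntry` fields and the constructor-time validation are illustrative assumptions, not part of the spec:

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple

@dataclass(frozen=True)
class ProvenanceEntry:
    step: str        # e.g. "extracted", "chunked", "indexed" (assumed names)
    actor: str       # component that performed the step
    timestamp: str   # ISO-8601 time of the step

@dataclass(frozen=True)
class EvidenceFragment:
    content: str
    source_id: str
    chunk_id: str
    extraction_timestamp: str
    confidence: float                       # retrieval confidence in [0, 1]
    authority_tier: str                     # CANONICAL | CURATED | DERIVED | USER
    relevance_score: float
    provenance_chain: Tuple[ProvenanceEntry, ...] = ()
    metadata: Dict[str, str] = field(default_factory=dict)
    token_count: int = 0

    def __post_init__(self):
        # Enforce the contract's range invariant at construction time.
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must lie in [0, 1]")
```

Freezing the dataclass keeps fragments immutable once they leave the retrieval layer, which matches the chapter's determinism and provenance requirements.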
8.1.4 Pseudo-Algorithm: Evidence Engine Top-Level Dispatch#
ALGORITHM: EvidenceEngineDispatch
INPUT: request: EvidenceRequest
OUTPUT: response: EvidenceResponse
1. deadline ← NOW() + request.latency_deadline_ms
2. subqueries ← QueryDecomposer.decompose(request.query)
3. sources ← SourceRegistry.select(
subqueries, request.source_policy, deadline
)
4. // Parallel fan-out with per-source deadline allocation
5. raw_results ← PARALLEL_FOR_EACH source IN sources:
budget_ms ← SourceRegistry.latency_tier(source, deadline)
retriever ← RetrieverFactory.get(source.type) // BM25, dense, graph, SQL, etc.
RETURN retriever.retrieve(subqueries, source, budget_ms)
ON_TIMEOUT: RETURN CachedFallback.get(subqueries, source)
END_PARALLEL
6. merged ← FusionEngine.fuse(raw_results, request.ranking_weights)
7. ranked ← Ranker.rank(merged, request.ranking_weights, request.diversity_constraint)
8. filtered ← ACLFilter.apply(ranked, request.source_policy.acl_scope)
9. IF request.provenance_required THEN
filtered ← REMOVE_IF(filtered, λ f: f.provenance_chain IS EMPTY)
10. truncated ← TokenBudgetAllocator.fit(filtered, request.token_budget)
11. response ← BUILD_RESPONSE(truncated, metadata)
12. ObservabilityEmitter.emit(request, response, latency=NOW()-deadline+request.latency_deadline_ms)
13. RETURN response

8.2 Hybrid Retrieval Pipeline Architecture#
Hybrid retrieval is the composition of multiple retrieval modalities—sparse lexical, dense semantic, structured query, and graph traversal—into a unified pipeline that exploits the complementary strengths of each. No single retrieval modality dominates across all query types, corpus structures, and task demands. The hybrid pipeline is not optional; it is the minimum viable architecture for production agentic retrieval.
8.2.1 Exact Match: Keyword, BM25, TF-IDF, Boolean Filters#
8.2.1.1 TF-IDF Formalization#
Term Frequency–Inverse Document Frequency remains the foundational sparse retrieval signal. For a term $t$ in document $d$ within corpus $D$:

$$\mathrm{tfidf}(t, d, D) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t, D), \qquad \mathrm{idf}(t, D) = \log \frac{|D|}{|\{d' \in D : t \in d'\}|}$$

where $\mathrm{tf}(t, d)$ is the raw count of term $t$ in document $d$.
The document-query relevance score under TF-IDF is:

$$\mathrm{score}(q, d) = \sum_{t \in q} \mathrm{tfidf}(t, d, D)$$
8.2.1.2 BM25 Formalization#
BM25 (Best Matching 25) extends TF-IDF with sublinear term frequency saturation and document length normalization. For query $q$ and document $d$:

$$\mathrm{BM25}(q, d) = \sum_{t \in q} \mathrm{IDF}(t) \cdot \frac{f(t, d) \cdot (k_1 + 1)}{f(t, d) + k_1 \cdot \left(1 - b + b \cdot \frac{|d|}{\mathrm{avgdl}}\right)}$$

where:
- $k_1$ controls term frequency saturation
- $b$ controls document length normalization (typically $b = 0.75$)
- $|d|$ is the length of document $d$ in tokens
- $\mathrm{avgdl}$ is the average document length across the corpus
- $\mathrm{IDF}(t)$ is computed as:

$$\mathrm{IDF}(t) = \log \frac{|D| - n(t) + 0.5}{n(t) + 0.5}$$

where $n(t)$ is the number of documents in $D$ containing term $t$.
Key properties for agentic retrieval:
- Exact term matching: BM25 excels when the query contains domain-specific identifiers, error codes, function names, configuration keys, or precise terminology that must match lexically.
- Predictability: Scoring is fully deterministic, transparent, and explainable—critical for provenance and auditability.
- Low latency: Inverted index lookup operates in $O\left(\sum_{t \in q} |P_t|\right)$, where $P_t$ is the posting list for term $t$; typically sub-millisecond for moderate corpora.
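A compact reference implementation can make the BM25 formula concrete. The sketch below is a from-scratch toy that scores every document against a query (documents as token lists); the `+ 1` inside the IDF logarithm is an added Lucene-style smoothing assumption that keeps scores non-negative and is not part of the formula above:

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score every document (a list of tokens) against the query with Okapi BM25."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    # n(t): number of documents containing each term
    df = Counter(t for d in docs for t in set(d))
    scores = [0.0] * N
    for t in query_terms:
        n_t = df.get(t, 0)
        idf = math.log((N - n_t + 0.5) / (n_t + 0.5) + 1)  # +1 keeps IDF non-negative
        for i, d in enumerate(docs):
            f = d.count(t)  # raw term frequency f(t, d)
            denom = f + k1 * (1 - b + b * len(d) / avgdl)
            scores[i] += idf * (f * (k1 + 1)) / denom if f else 0.0
    return scores
```

A production system would of course score via the inverted index rather than looping over all documents, as the pseudo-algorithm in 8.2.1.4 shows.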
8.2.1.3 Boolean Filters#
Boolean retrieval applies hard inclusion/exclusion constraints before or after scoring:

$$\mathrm{retrieve}(q, D) = \{\, d \in D : \phi(d) \,\}$$

where $\phi$ is a predicate in conjunctive normal form (CNF) over metadata fields:

$$\phi(d) = \bigwedge_{i} \bigvee_{j} p_{ij}(d)$$

Examples: source_type = "canonical" AND language = "en" AND updated_at > "2024-01-01".
Boolean filters are applied as pre-filters (reducing the candidate set before scoring) or post-filters (applied after scoring to enforce hard constraints). Pre-filtering is preferred when the filter is highly selective; post-filtering is preferred when the filter is non-selective and scoring is cheap.
8.2.1.4 Pseudo-Algorithm: BM25 Retrieval with Boolean Pre-Filtering#
ALGORITHM: BM25RetrieveWithFilter
INPUT: query_terms: List<string>, filter: BooleanPredicate, k: int, index: InvertedIndex
OUTPUT: results: List<(DocID, float)>
1. candidate_docs ← index.filter(filter) // Pre-filter by metadata predicate
2. scores ← EMPTY_MAP<DocID, float>
3. FOR EACH term t IN query_terms:
4. idf_t ← log((|D| - n(t) + 0.5) / (n(t) + 0.5))
5. postings ← index.get_postings(t) ∩ candidate_docs
6. FOR EACH (doc_id, freq) IN postings:
7. dl ← index.doc_length(doc_id)
8. tf_component ← (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * dl / avgdl))
9. scores[doc_id] ← scores[doc_id] + idf_t * tf_component
10. results ← TOP_K(scores, k)
11. RETURN results

8.2.2 Semantic Search: Dense Embedding Retrieval, Cross-Encoder Re-Ranking#
8.2.2.1 Dense Embedding Retrieval (Bi-Encoder)#
Dense retrieval encodes queries and documents into a shared embedding space using a bi-encoder architecture. Let $E_q$ and $E_d$ be the query and document encoders respectively (often shared weights):

$$\mathbf{q} = E_q(q) \in \mathbb{R}^{d}, \qquad \mathbf{v}_i = E_d(d_i) \in \mathbb{R}^{d}$$

Relevance is computed as the similarity between the query and document embeddings:

$$s(q, d_i) = \mathrm{sim}(\mathbf{q}, \mathbf{v}_i)$$

Common similarity functions:
- Cosine similarity: $\mathrm{sim}(\mathbf{q}, \mathbf{v}) = \dfrac{\mathbf{q} \cdot \mathbf{v}}{\|\mathbf{q}\| \, \|\mathbf{v}\|}$
- Dot product: $\mathrm{sim}(\mathbf{q}, \mathbf{v}) = \mathbf{q} \cdot \mathbf{v}$
- Euclidean distance (inverted): $\mathrm{sim}(\mathbf{q}, \mathbf{v}) = -\|\mathbf{q} - \mathbf{v}\|_2$
Retrieval is performed via Approximate Nearest Neighbor (ANN) search over the pre-computed document index using algorithms such as HNSW (Hierarchical Navigable Small World), IVF-PQ (Inverted File with Product Quantization), or ScaNN.
Complexity characteristics:
| Property | Bi-Encoder Dense Retrieval |
|---|---|
| Indexing | $O(N)$ encoder forward passes, one-time |
| Query encoding | One encoder forward pass per query |
| ANN search | $O(\log N)$ to $O(\sqrt{N})$ depending on algorithm |
| Storage | $O(N \cdot d)$ for $N$ documents, $d$-dimensional embeddings |
| Semantic coverage | High: captures paraphrase, synonym, conceptual similarity |
| Lexical precision | Low: may miss exact term matches, identifiers, codes |
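The similarity functions above reduce to a few lines of code. The brute-force scan below is only a stand-in for an ANN index (HNSW, IVF-PQ) meant to make the scoring explicit; the function names are illustrative:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def top_k(query_emb, doc_embs, k):
    """Exhaustive nearest-neighbor scan; an ANN index replaces
    this O(N) loop in production."""
    scored = [(i, cosine(query_emb, e)) for i, e in enumerate(doc_embs)]
    return sorted(scored, key=lambda x: x[1], reverse=True)[:k]
```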
8.2.2.2 Cross-Encoder Re-Ranking#
Cross-encoders jointly encode the query-document pair through a single transformer pass, enabling full cross-attention between query and document tokens:

$$s(q, d) = \sigma\left(\mathbf{w}^{\top} \, \mathrm{CLS}\big(\mathrm{Transformer}([q ; \mathrm{SEP} ; d])\big)\right)$$

where $\sigma$ is the sigmoid function, $\mathbf{w}$ is a learned scoring projection, and $\mathrm{CLS}(\cdot)$ extracts the classification token representation.
Cross-encoders are orders of magnitude more accurate than bi-encoders for relevance estimation but are computationally prohibitive for full-corpus search ($O(N)$ transformer forward passes per query). They are therefore deployed exclusively as re-rankers over a pre-filtered candidate set of size $k \ll N$.
Two-stage retrieval pipeline: a bi-encoder ANN search first narrows the corpus to $k_{\text{candidates}}$ documents, which the cross-encoder then re-ranks to produce the final result set.
Typical configurations use $k_{\text{candidates}}$ on the order of tens to hundreds, with a final return set $k_{\text{final}} \ll k_{\text{candidates}}$.
8.2.2.3 Pseudo-Algorithm: Dense Retrieval with Cross-Encoder Re-Ranking#
ALGORITHM: DenseRetrieveAndRerank
INPUT: query: string, k_candidates: int, k_final: int,
bi_encoder: Model, cross_encoder: Model, ann_index: ANNIndex
OUTPUT: results: List<(DocID, float, ProvenanceRecord)>
1. q_emb ← bi_encoder.encode_query(query)
2. candidates ← ann_index.search(q_emb, k_candidates)
// candidates: List<(DocID, embedding_score)>
3. reranked ← EMPTY_LIST
4. FOR EACH (doc_id, _) IN candidates:
5. doc_text ← DocumentStore.get_text(doc_id)
6. cross_score ← cross_encoder.score(query, doc_text)
7. provenance ← DocumentStore.get_provenance(doc_id)
8. APPEND (doc_id, cross_score, provenance) TO reranked
9. SORT reranked BY cross_score DESCENDING
10. RETURN reranked[0 : k_final]

8.2.3 Sparse-Dense Fusion: Reciprocal Rank Fusion (RRF), Linear Interpolation, Learned Merging#
No single retrieval modality is uniformly superior. Sparse retrieval (BM25) captures exact lexical matches, entity names, and code identifiers with high precision. Dense retrieval captures semantic similarity, paraphrase, and conceptual relatedness. Fusion combines these complementary signals into a unified ranking.
8.2.3.1 Reciprocal Rank Fusion (RRF)#
RRF is a rank-based fusion method that does not require score normalization across heterogeneous retrieval systems. Given ranked lists $R_1, \dots, R_m$, the RRF score for document $d$ is:

$$\mathrm{RRF}(d) = \sum_{j=1}^{m} \frac{1}{\kappa + r_j(d)}$$

where:
- $r_j(d)$ is the rank of document $d$ in ranked list $R_j$ (1-indexed; $r_j(d) = \infty$ if absent, contributing zero)
- $\kappa$ is a smoothing constant (typically $\kappa = 60$)
Properties of RRF:
- Score-agnostic: Does not require raw scores from individual retrievers, only ranks.
- Robust to outliers: The reciprocal function dampens the influence of extreme rankings.
- Parameter-free (aside from $\kappa$): No learned weights, no training data required.
- Theoretically grounded: Approximates the Borda count under reciprocal weighting.
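RRF's simplicity is easiest to see in code. A minimal sketch, assuming each retriever returns an ordered list of document IDs (absent documents simply contribute nothing, matching the convention above):

```python
def rrf_fuse(ranked_lists, kappa=60):
    """Reciprocal Rank Fusion over any number of ranked doc-id lists.
    Returns doc ids sorted by fused score, best first."""
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):  # ranks are 1-indexed
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (kappa + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Because only ranks are consumed, the sparse and dense retrievers' raw scores never need to be put on a common scale.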
8.2.3.2 Linear Interpolation Fusion#
When raw scores are available and calibrated (or normalized to $[0, 1]$), linear interpolation provides a weighted combination:

$$s_{\mathrm{fused}}(d) = \sum_{j=1}^{m} w_j \cdot \tilde{s}_j(d), \qquad \sum_{j=1}^{m} w_j = 1$$

where $\tilde{s}_j(d)$ is the normalized score from retriever $j$:

$$\tilde{s}_j(d) = \frac{s_j(d) - \min_{d'} s_j(d')}{\max_{d'} s_j(d') - \min_{d'} s_j(d')}$$
Weights can be:
- Static: Fixed weights set by domain expertise.
- Query-dependent: Predicted by a lightweight classifier based on query features (e.g., query length, presence of identifiers, domain classification).
- Learned: Optimized via gradient descent on a held-out relevance dataset.
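Min-max normalization followed by weighted combination can be sketched as follows; the dict-based score representation and the degenerate-range guard are implementation assumptions:

```python
def min_max_normalize(scored):
    """scored: dict doc_id -> raw score; rescale scores into [0, 1]."""
    lo, hi = min(scored.values()), max(scored.values())
    span = (hi - lo) or 1.0  # guard the degenerate all-equal case
    return {d: (s - lo) / span for d, s in scored.items()}

def linear_fuse(retriever_scores, weights):
    """retriever_scores: one score dict per retriever; weights sum to 1."""
    fused = {}
    for scored, w in zip(retriever_scores, weights):
        for d, s in min_max_normalize(scored).items():
            fused[d] = fused.get(d, 0.0) + w * s
    return fused
```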
8.2.3.3 Learned Merging#
Learned merging replaces hand-tuned fusion with a parameterized model that takes per-retriever scores (and optionally query features, document features, and retriever metadata) as input and produces a unified relevance score:

$$s_{\mathrm{fused}}(d) = g_\theta(\mathbf{s}_d, \mathbf{x}_q, \mathbf{x}_d)$$

where:
- $\mathbf{s}_d = [s_1(d), \dots, s_m(d)]$ is the vector of per-retriever scores
- $\mathbf{x}_q$ is a query feature vector (query type, length, domain, decomposition signals)
- $\mathbf{x}_d$ is a document feature vector (source authority, freshness, chunk type)
- $g_\theta$ is a small MLP or gradient-boosted tree trained on relevance judgments
The training objective is typically a pairwise ranking loss:

$$\mathcal{L}(\theta) = \sum_{(q,\, d^+,\, d^-)} \max\big(0,\; \mu - g_\theta(\mathbf{s}_{d^+}, \mathbf{x}_q, \mathbf{x}_{d^+}) + g_\theta(\mathbf{s}_{d^-}, \mathbf{x}_q, \mathbf{x}_{d^-})\big)$$

where $(q, d^+, d^-)$ are triples of query, relevant document, and non-relevant document, and $\mu$ is a margin parameter.
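The pairwise hinge objective can be illustrated in a few lines of plain Python. Here `score_fn` is a stand-in for the learned merger, and no gradients are computed; a real trainer would backpropagate through this loss:

```python
def pairwise_ranking_loss(triples, score_fn, margin=0.1):
    """Mean hinge loss over (query, relevant_doc, non_relevant_doc) triples.
    score_fn(query, doc) stands in for the learned merger g_theta."""
    total = 0.0
    for q, d_pos, d_neg in triples:
        # Penalize whenever the relevant doc does not beat the
        # non-relevant doc by at least the margin.
        total += max(0.0, margin - score_fn(q, d_pos) + score_fn(q, d_neg))
    return total / len(triples)
```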
8.2.3.4 Pseudo-Algorithm: Hybrid Fusion Pipeline#
ALGORITHM: HybridFusion
INPUT: query: string, sources: List<RetrieverConfig>, fusion_mode: FusionMode,
weights: RankingWeights, k_final: int
OUTPUT: fused_results: List<(DocID, float, ProvenanceRecord)>
1. ranked_lists ← EMPTY_LIST
2. FOR EACH retriever_config IN sources:
3. retriever ← RetrieverFactory.create(retriever_config)
4. results_j ← retriever.retrieve(query, k=retriever_config.k_candidates)
5. APPEND results_j TO ranked_lists
6.
7. SWITCH fusion_mode:
8. CASE RRF:
9. score_map ← EMPTY_MAP<DocID, float>
10. FOR j = 1 TO |ranked_lists|:
11. FOR rank = 1 TO |ranked_lists[j]|:
12. doc_id ← ranked_lists[j][rank].doc_id
13. score_map[doc_id] += 1.0 / (κ + rank)
14. fused ← SORT(score_map, DESCENDING)
15.
16. CASE LINEAR:
17. score_map ← EMPTY_MAP<DocID, float>
18. FOR j = 1 TO |ranked_lists|:
19. normalized ← MinMaxNormalize(ranked_lists[j])
20. FOR EACH (doc_id, norm_score) IN normalized:
21. score_map[doc_id] += weights[j] * norm_score
22. fused ← SORT(score_map, DESCENDING)
23.
24. CASE LEARNED:
25. all_docs ← UNION(ranked_lists)
26. FOR EACH doc_id IN all_docs:
27. s_vec ← [score_from(ranked_lists[j], doc_id) FOR j IN 1..|ranked_lists|]
28. x_q ← QueryFeatureExtractor(query)
29. x_d ← DocFeatureExtractor(doc_id)
30. score_map[doc_id] ← LearnedMerger.predict(s_vec, x_q, x_d)
31. fused ← SORT(score_map, DESCENDING)
32.
33. fused_results ← ATTACH_PROVENANCE(fused[0 : k_final])
34. RETURN fused_results

8.2.3.5 Comparative Analysis of Fusion Methods#
| Criterion | RRF | Linear Interpolation | Learned Merging |
|---|---|---|---|
| Training data required | None | None (manual) or minimal | Relevance judgments |
| Score calibration needed | No | Yes (normalization) | No (features) |
| Adaptability | Low | Medium | High |
| Latency overhead | Negligible | Negligible | Sub-millisecond (inference) |
| Transparency | High | High | Medium (model opacity) |
| Optimality | Near-optimal for rank-only | Sensitive to weight selection | Highest ceiling |
| Recommended use | Default/bootstrap | Known retriever quality ratios | Mature system with feedback |
8.2.4 Structured Query: SQL, GraphQL, SPARQL for Relational and Knowledge Graph Sources#
Not all evidence resides in unstructured text. Agentic systems must retrieve from:
- Relational databases via SQL: configuration tables, user records, transactional logs, feature registries.
- APIs via GraphQL: third-party service state, organizational hierarchies, project metadata.
- Knowledge graphs via SPARQL: ontological relationships, entity-attribute-value triples, causal and dependency links.
8.2.4.1 Query Generation from Natural Language#
The agent's retrieval planner must translate natural language subqueries into structured query language. This is formalized as a semantic parsing task:

$$\mathrm{parse}: (q_{\mathrm{nl}}, \mathcal{S}) \mapsto q_{\mathrm{struct}}$$

where $\mathcal{S}$ is the schema of the target source (table definitions, GraphQL type system, or RDF ontology).
Safety constraints:
- Generated queries must be read-only (SELECT, not UPDATE/DELETE).
- Queries must be parameterized to prevent injection.
- Queries must respect ACL scoping: the agent's caller identity determines which tables/fields/entities are accessible.
- Query execution must be bounded: LIMIT clauses, timeout enforcement, and result-set size caps.
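The read-only and boundedness constraints can be enforced with a thin guard in front of the query executor. The sketch below is deliberately naive string-level screening (it will false-positive on keywords inside string literals) and serves as a defence-in-depth layer only; real enforcement belongs in a read-only database role plus a proper SQL parser:

```python
import re

# Mutating statement keywords that must never appear in generated queries.
FORBIDDEN = re.compile(
    r"\b(insert|update|delete|drop|alter|create|grant|truncate)\b",
    re.IGNORECASE,
)

def guard_query(sql: str, max_rows: int = 100) -> str:
    """Reject non-read-only statements and enforce a row cap."""
    stmt = sql.strip().rstrip(";")
    if not stmt.lower().startswith("select"):
        raise ValueError("only SELECT statements are permitted")
    if FORBIDDEN.search(stmt):
        raise ValueError("mutating keyword detected")
    if not re.search(r"\blimit\b", stmt, re.IGNORECASE):
        stmt += f" LIMIT {max_rows}"  # bound the result set
    return stmt
```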
8.2.4.2 SPARQL for Knowledge Graph Retrieval#
For knowledge graph traversal, SPARQL queries enable multi-hop relational retrieval:
SELECT ?entity ?relation ?target WHERE {
?entity rdf:type :ServiceComponent .
?entity :dependsOn ?target .
?target :ownedBy ?team .
FILTER (?team = :PlatformTeam)
}
LIMIT 100

Graph retrieval is scored by path relevance: shorter paths with higher-authority edges receive higher scores:

$$\mathrm{score}(p) = w_{\min}(p) \cdot e^{-\lambda \cdot |p|}$$

where $w_{\min}(p)$ is the minimum authority weight along path $p$, $|p|$ is the path length, and $\lambda$ controls the decay rate.
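Path scoring of this shape is a one-liner. The sketch below assumes a path is represented by the list of its edge authority weights and that relevance decays exponentially in path length, as described above:

```python
import math

def path_score(edge_weights, decay=0.5):
    """Score a graph path by its weakest edge, discounted by length:
    shorter paths with higher-authority edges score higher."""
    w_min = min(edge_weights)          # weakest-link authority along the path
    return w_min * math.exp(-decay * len(edge_weights))
```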
8.3 Multi-Source Retrieval Federation#
8.3.1 Source Registry: Schema, Authority, Freshness SLA, Latency Tier, Access Policy#
The Source Registry is the canonical catalog of all retrievable sources. Every source is registered with a typed descriptor:
SourceDescriptor {
source_id: SourceID // Globally unique identifier
source_type: enum { // Retrieval modality
INVERTED_INDEX, VECTOR_STORE, RELATIONAL_DB,
KNOWLEDGE_GRAPH, API, LOG_STORE, CODE_INDEX,
HUMAN_ANNOTATION_STORE, MEMORY_LAYER
}
schema: SchemaDefinition // Fields, types, indices, capabilities
authority_tier: AuthorityTier // CANONICAL > CURATED > DERIVED > EPHEMERAL
freshness_sla: Duration // Maximum staleness guarantee (e.g., 5m, 1h, 24h)
latency_tier: LatencyTier // HOT (<10ms), WARM (<100ms), COLD (<1000ms), ARCHIVE (>1s)
access_policy: ACLPolicy // Role-based, attribute-based, or identity-scoped
query_capabilities: List<Cap> // EXACT_MATCH, SEMANTIC, STRUCTURED, GRAPH_TRAVERSAL
cost_per_query: float32 // Monetary cost per query (for optimization)
max_qps: uint32 // Rate limit
version: SemanticVersion // Schema version for compatibility checking
}

8.3.1.1 Source Selection Function#
Given a decomposed query $Q$ and a deadline $\delta$, the source selector identifies the optimal subset of sources $S^* \subseteq \mathcal{S}$ that maximizes expected evidence quality under latency and cost constraints:

$$S^* = \arg\max_{S \subseteq \mathcal{S}} \sum_{s \in S} \mathbb{E}\left[\mathrm{quality}(s, Q)\right]$$

subject to:

$$\max_{s \in S} \mathrm{latency}_{p99}(s) \le \delta, \qquad \sum_{s \in S} \mathrm{cost}(s) \le C_{\max}$$

In practice, this optimization is approximated by a greedy algorithm that selects sources in descending order of expected utility-per-latency-cost ratio.
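The greedy approximation can be sketched as follows; the source dictionaries and the exact utility-per-(latency, cost) ratio are illustrative assumptions about how the registry exposes its metadata:

```python
def greedy_select(sources, deadline_ms, max_cost):
    """Pick sources in descending utility/(latency * cost) order until the
    cost ceiling is hit, skipping any source whose p99 latency exceeds the
    deadline. Each source: dict with source_id, utility, p99_latency_ms, cost."""
    ranked = sorted(
        sources,
        key=lambda s: s["utility"] / (s["p99_latency_ms"] * max(s["cost"], 1e-9)),
        reverse=True,
    )
    chosen, spent = [], 0.0
    for s in ranked:
        if s["p99_latency_ms"] <= deadline_ms and spent + s["cost"] <= max_cost:
            chosen.append(s["source_id"])
            spent += s["cost"]
    return chosen
```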
8.3.2 Parallel Fan-Out with Deadline-Aware Source Selection#
Retrieval across multiple sources is executed in parallel fan-out with per-source deadline allocation:
ALGORITHM: ParallelFanOutRetrieval
INPUT: subqueries: List<SubQuery>, sources: List<SourceDescriptor>,
global_deadline_ms: uint32
OUTPUT: aggregated: List<(SourceID, List<EvidenceFragment>)>
1. global_deadline ← NOW() + global_deadline_ms
2. source_assignments ← EMPTY_MAP<SourceID, List<SubQuery>>
3. FOR EACH sq IN subqueries:
4. compatible ← FILTER(sources, λ s: s.query_capabilities ∩ sq.required_caps ≠ ∅
5. AND ACL_CHECK(s, caller)
6. AND s.latency_tier.p99 ≤ remaining_time(global_deadline))
7. selected ← TOP_BY(compatible, λ s: s.authority_tier * s.freshness_score, max=3)
8. FOR EACH s IN selected:
9. source_assignments[s.source_id].append(sq)
10.
11. aggregated ← PARALLEL_FOR_EACH (source_id, sqs) IN source_assignments:
12. per_source_deadline ← MIN(
13. SourceRegistry.latency_tier(source_id).p99 * 1.5,
14. remaining_time(global_deadline) - SAFETY_MARGIN_MS
15. )
16. TRY:
17. results ← SourceClient(source_id).retrieve(sqs, deadline=per_source_deadline)
18. RETURN (source_id, results)
19. ON_TIMEOUT:
20. cached ← RetrievalCache.get(source_id, sqs)
21. IF cached IS NOT EMPTY:
22. RETURN (source_id, cached WITH staleness_flag=TRUE)
23. ELSE:
24. RETURN (source_id, EMPTY WITH source_failure_flag=TRUE)
25. ON_ERROR(e):
26. CircuitBreaker.record_failure(source_id, e)
27. RETURN (source_id, EMPTY WITH error=e)
28. END_PARALLEL
29.
30. RETURN aggregated8.3.3 Source Conflict Resolution: Authority Ranking, Temporal Precedence, Provenance Chain#
When multiple sources return evidence that is semantically overlapping but factually divergent, the system must resolve conflicts deterministically:
8.3.3.1 Conflict Resolution Priority Stack#
- Authority Tier (highest priority): $\mathrm{CANONICAL} > \mathrm{CURATED} > \mathrm{DERIVED} > \mathrm{EPHEMERAL}$
- Temporal Precedence: Among equal authority tiers, the most recently updated evidence wins: $e^* = \arg\max_{e \in E_{\mathrm{conflict}}} \mathrm{updated\_at}(e)$
- Provenance Chain Length: Shorter provenance chains (fewer derivation steps from ground truth) are preferred: $e^* = \arg\min_{e \in E_{\mathrm{conflict}}} |\mathrm{provenance\_chain}(e)|$
- Human Annotation Override: If any conflicting evidence carries a human annotation or expert correction, that evidence takes precedence regardless of other signals.
8.3.3.2 Formal Conflict Resolution Score#
$$\mathrm{resolve}(e) = \alpha_1 \cdot \mathrm{tier}(e) + \alpha_2 \cdot \mathrm{recency}(e) + \alpha_3 \cdot \frac{1}{|\mathrm{provenance\_chain}(e)|} + \alpha_4 \cdot \mathbb{1}[\mathrm{human\_annotated}(e)]$$

where $\alpha_4 \gg \alpha_1 \gg \alpha_2 \gg \alpha_3$ ensures the priority stack is respected, and $\mathrm{tier}(\cdot)$ maps authority tiers to numeric values.
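Rather than tuning weight magnitudes, the priority stack can be implemented directly as a lexicographic sort key, which guarantees each signal is consulted only to break ties among the signals above it. The fragment dict shape below is an illustrative assumption:

```python
TIER_RANK = {"CANONICAL": 3, "CURATED": 2, "DERIVED": 1, "EPHEMERAL": 0}

def resolve_conflict(fragments):
    """Deterministic winner under the priority stack: human annotation
    override first, then authority tier, then recency, then shortest
    provenance chain. Each fragment: dict with has_human_annotation,
    authority_tier, updated_at (comparable), provenance_chain (list)."""
    return max(
        fragments,
        key=lambda f: (
            f["has_human_annotation"],          # human override beats everything
            TIER_RANK[f["authority_tier"]],     # authority tier
            f["updated_at"],                    # temporal precedence
            -len(f["provenance_chain"]),        # shorter chain preferred
        ),
    )
```

A lexicographic key sidesteps the fragile business of choosing weight magnitudes that satisfy the ordering constraint.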
8.4 Metadata Filtering, Faceted Retrieval, and ACL-Aware Evidence Scoping#
8.4.1 Metadata Filtering#
Every indexed document carries a typed metadata record:
DocumentMetadata {
source_id: SourceID
content_type: enum { CODE, DOCUMENTATION, CONFIG, LOG, ANNOTATION, SCHEMA, RUNBOOK }
language: LanguageCode
created_at: Timestamp
updated_at: Timestamp
author: PrincipalID
classification: SecurityClassification // PUBLIC, INTERNAL, CONFIDENTIAL, RESTRICTED
tags: Set<string>
version: SemanticVersion
deprecation_status: enum { ACTIVE, DEPRECATED, ARCHIVED }
}

Metadata filters are applied as pre-retrieval constraints to reduce the candidate set before expensive scoring operations. The filter predicate is expressed as:

$$\phi(d) = \bigwedge_{i} \big(d.\mathrm{meta}[f_i] \;\mathrm{op}_i\; v_i\big)$$

where $\mathrm{op}_i \in \{=, \ne, <, \le, >, \ge, \in, \not\in\}$.
8.4.2 Faceted Retrieval#
Faceted retrieval returns not only ranked results but also aggregated counts per metadata facet, enabling the agent to understand the distribution of evidence across dimensions:

$$\mathrm{facet}(R, f) = \big\{\, (v, \; |\{d \in R : d.\mathrm{meta}[f] = v\}|) : v \in \mathrm{dom}(f) \,\big\}$$
This is critical for agent planning: if the agent discovers that 90% of results are from deprecated documentation, it can dynamically adjust its query or source selection.
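Facet aggregation itself is a small amount of code. A sketch over dict-shaped results, counting values per requested facet so the agent can inspect the evidence distribution before consuming it:

```python
from collections import Counter

def facet_counts(results, facets):
    """results: list of metadata dicts; facets: list of field names.
    Returns field -> Counter of value frequencies."""
    return {f: Counter(doc[f] for doc in results) for f in facets}
```

For example, an agent seeing that most results carry deprecation_status = DEPRECATED can reformulate its query or tighten its source policy before generation.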
8.4.3 ACL-Aware Evidence Scoping#
Every retrieval request carries a caller identity (principal, roles, groups, attributes). The ACL enforcement layer ensures that every returned fragment is authorized for the caller:

$$\forall e \in \mathrm{response.fragments}: \;\mathrm{authorized}(\mathrm{caller},\; e.\mathrm{source\_id},\; e.\mathrm{classification})$$
ACL evaluation is performed at the earliest possible stage in the pipeline (ideally as a pre-filter on the index) to avoid retrieving, scoring, and then discarding restricted content.
Implementation strategy:
- Attribute-Based Access Control (ABAC): Policies expressed as predicates over caller attributes and document attributes.
- Row-level security in SQL sources: WHERE tenant_id = :caller_tenant injected into all generated queries.
- Vector index partitioning: Separate ANN indices per security classification, with routing based on caller clearance.
8.5 Lineage and Graph Context Retrieval: Traversing Dependency, Ownership, and Causal Graphs#
8.5.1 Motivation#
Many agentic tasks require not just what a document says, but how it relates to other entities: which service depends on this library, who owns this configuration, what upstream change caused this failure. This relational context is captured in lineage and dependency graphs.
8.5.2 Graph Types#
| Graph Type | Nodes | Edges | Use Case |
|---|---|---|---|
| Dependency Graph | Services, libraries, packages | depends_on, imports | Impact analysis, upgrade planning |
| Ownership Graph | Teams, services, configs | owns, maintains | Escalation, approval routing |
| Causal Graph | Events, changes, incidents | caused_by, triggered | Root cause analysis |
| Data Lineage | Tables, columns, pipelines | derived_from, transforms | Data quality, compliance |
| Call Graph | Functions, methods, endpoints | calls, invokes | Performance analysis, refactoring |
8.5.3 Graph Traversal Retrieval#
Given an anchor entity $v_0$ (identified from the user query), the graph retriever performs bounded traversal:

$$\mathrm{context}(v_0) = \{\, v : \mathrm{dist}(v_0, v) \le k_{\max} \;\wedge\; \mathrm{rel}(v_0, v) \ge \tau \,\}$$
8.5.3.1 Relevance-Weighted Graph Traversal#
Not all graph neighbors are equally relevant. Edge weights encode relevance decay along the traversal path:

$$\mathrm{rel}(v_0, v) = \prod_{e \in \mathrm{path}(v_0, v)} w(e)$$

where $w(e) \in (0, 1]$ is the edge weight (strength of dependency, recency of interaction, etc.).
Traversal terminates when the path depth reaches the bound $k_{\max}$ (max_depth) or the accumulated relevance falls below the threshold $\tau$ (min_relevance):
ALGORITHM: RelevanceWeightedGraphTraversal
INPUT: anchor: EntityID, graph: Graph, max_depth: int,
min_relevance: float, edge_types: Set<EdgeType>
OUTPUT: context_entities: List<(EntityID, float, Path)>
1. frontier ← PriorityQueue() // Max-heap by relevance
2. frontier.push((anchor, 1.0, [anchor]))
3. visited ← {anchor}
4. context_entities ← EMPTY_LIST
5.
6. WHILE frontier IS NOT EMPTY:
7. (current, rel, path) ← frontier.pop()
8. IF current ≠ anchor:
9. APPEND (current, rel, path) TO context_entities
10. IF |path| ≥ max_depth:
11. CONTINUE
12. FOR EACH (current, neighbor, edge) IN graph.edges(current):
13. IF edge.type ∉ edge_types OR neighbor ∈ visited:
14. CONTINUE
15. new_rel ← rel * edge.weight
16. IF new_rel < min_relevance:
17. CONTINUE
18. visited.add(neighbor)
19. frontier.push((neighbor, new_rel, path + [neighbor]))
20.
21. SORT context_entities BY relevance DESCENDING
22. RETURN context_entities

8.6 Historical Usage Pattern Retrieval: What Was Previously Useful for Similar Queries#
8.6.1 Rationale#
The most powerful retrieval signal that conventional RAG systems entirely ignore is historical utility: which evidence fragments actually contributed to successful agent task completions for queries similar to the current one. This creates a feedback loop where the retrieval system learns from downstream agent behavior.
8.6.2 Formalization#
Let $H = \{(q_i, d_i, u_i)\}_{i=1}^{n}$ be the history log, where:
- $q_i$ is a past query
- $d_i$ is a document that was retrieved and included in the agent's context
- $u_i \in \{0, 1\}$ or $u_i \in [0, 1]$ is the utility signal: whether the document contributed to a successful task outcome (binary) or a graded utility score
For a new query $q$, the historical utility score of candidate document $d$ is:

$$u_{\mathrm{hist}}(q, d) = \frac{\sum_{i : d_i = d} \mathrm{sim}(q, q_i) \cdot u_i}{\epsilon + \sum_{i : d_i = d} \mathrm{sim}(q, q_i)}$$

where $\mathrm{sim}(q, q_i)$ is the semantic similarity between the current and historical queries, and $\epsilon$ prevents division by zero.
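The historical utility score translates directly into code. A sketch assuming the history is a list of (past_query, used_docs) pairs, where used_docs maps document IDs to their recorded utility and `query_sim` computes the similarity between the current query and a past one:

```python
def historical_utility(query_sim, history, doc_id, eps=1e-6):
    """Similarity-weighted mean utility of doc_id across similar past queries.
    query_sim(past_query) -> similarity of the current query to past_query;
    history: list of (past_query, {doc_id: utility}) pairs."""
    num, den = 0.0, eps  # eps prevents division by zero
    for past_q, used_docs in history:
        if doc_id in used_docs:
            sim = query_sim(past_q)
            num += sim * used_docs[doc_id]
            den += sim
    return num / den
```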
8.6.3 Utility Signal Sources#
| Signal | Measurement | Granularity |
|---|---|---|
| Agent task success | Task completed without error or human override | Binary per task |
| Human approval | Human reviewer accepted the agent's output | Binary per output |
| Citation in output | Agent explicitly referenced/cited the document | Binary per document |
| No-hallucination verification | Output was verified against the retrieved evidence | Binary per output |
| Downstream action success | Tool call triggered by retrieved evidence succeeded | Binary per action |
| User satisfaction | Explicit feedback (thumbs up/down) | Binary or ordinal |
8.6.4 Pseudo-Algorithm: Historical Usage Retrieval#
ALGORITHM: HistoricalUsageRetrieval
INPUT: query: string, candidate_docs: List<DocID>, history: UsageHistory,
sim_threshold: float, k: int
OUTPUT: augmented_scores: Map<DocID, float>
1. q_emb ← Encoder.encode(query)
2. similar_past_queries ← history.search_queries(q_emb, sim_threshold, max=100)
3. augmented_scores ← EMPTY_MAP<DocID, float>
4.
5. FOR EACH doc_id IN candidate_docs:
6. numerator ← 0.0
7. denominator ← ε
8. FOR EACH (past_q, past_q_emb, docs_used) IN similar_past_queries:
9. sim ← cosine_similarity(q_emb, past_q_emb)
10. IF doc_id IN docs_used:
11. utility ← docs_used[doc_id].utility_score
12. numerator += sim * utility
13. denominator += sim
14. augmented_scores[doc_id] ← numerator / denominator
15.
16. RETURN augmented_scores

8.7 Human Annotation Retrieval: Curated Labels, Expert Corrections, Institutional Knowledge#
8.7.1 Annotation Types#
Human annotations represent the highest-authority evidence layer in the retrieval hierarchy. They include:
- Expert corrections: Amendments to automated documentation or generated content, carrying explicit provenance of the correcting expert.
- Curated labels: Classification tags, quality assessments, relevance judgments applied by domain experts.
- Institutional knowledge: Undocumented conventions, tribal knowledge, design rationale, post-mortem findings that exist only in human memory and are explicitly captured into the annotation store.
- Review comments: Code review feedback, architecture decision records (ADRs), design review outcomes.
8.7.2 Annotation Schema#
Annotation {
annotation_id: UUID
target: AnnotationTarget {
target_type: enum { DOCUMENT, CODE_SYMBOL, CONFIG_KEY, SERVICE, INCIDENT }
target_id: string
target_version: SemanticVersion // Pinned to specific version of the target
}
annotation_type: enum { CORRECTION, LABEL, KNOWLEDGE, REVIEW, DEPRECATION_NOTICE }
content: string
author: PrincipalID
author_expertise: ExpertiseLevel // DOMAIN_EXPERT, SENIOR_ENGINEER, TEAM_LEAD, etc.
created_at: Timestamp
expires_at: Timestamp | NULL
confidence: float32 // Author's self-assessed confidence
supersedes: List<AnnotationID> // Previous annotations this one replaces
provenance: ProvenanceRecord
}

8.7.3 Retrieval Strategy#
Annotations are retrieved by target entity (not by text similarity). When the retrieval pipeline identifies a document $d$ as a candidate, it queries the annotation store for all annotations targeting $d$ or entities referenced by $d$:

$$A(d) = \{\, a : a.\mathrm{target} = d \;\lor\; a.\mathrm{target} \in \mathrm{refs}(d) \,\}$$
Annotations are injected into the evidence fragment as attached metadata, not as separate retrieval results, preserving the association between base evidence and human corrections.
8.8 Code-Derived Enrichment: AST Analysis, Symbol Resolution, Dependency Graph Retrieval#
8.8.1 Motivation#
For agentic systems operating on software engineering tasks, code is not merely text—it is a structured artifact with rich semantic structure extractable through static analysis. Code-derived enrichment transforms raw source code into high-signal retrieval evidence.
8.8.2 Enrichment Layers#
| Layer | Analysis | Output |
|---|---|---|
| AST Analysis | Parse source code into Abstract Syntax Trees | Function signatures, class hierarchies, control flow |
| Symbol Resolution | Resolve identifiers to their definitions across files | Cross-file references, import chains |
| Type Analysis | Extract type annotations, inferred types, interfaces | Typed API contracts, compatibility constraints |
| Dependency Graph | Map import/require/use statements | Package-level and module-level dependency DAGs |
| Call Graph | Static or dynamic call graph extraction | Function invocation paths, entry points |
| Diff Analysis | Parse version control diffs | Changed symbols, affected interfaces, migration patterns |
| Documentation Binding | Link docstrings/comments to symbols | Symbol-to-documentation mapping |
8.8.3 Code Chunking Strategy#
Code requires structural chunking rather than naive token-window chunking:
ALGORITHM: StructuralCodeChunking
INPUT: source_file: string, ast: AST, max_chunk_tokens: int
OUTPUT: chunks: List<CodeChunk>
1. chunks ← EMPTY_LIST
2. top_level_nodes ← ast.get_top_level_declarations()
// Classes, functions, modules, constants
3. FOR EACH node IN top_level_nodes:
4. node_text ← source_file.extract(node.span)
5. IF token_count(node_text) ≤ max_chunk_tokens:
6. chunk ← CodeChunk {
7. content: node_text,
8. symbol: node.qualified_name,
9. type: node.type, // FUNCTION, CLASS, METHOD, etc.
10. signature: node.signature,
11. dependencies: node.imports ∪ node.references,
12. file_path: source_file.path,
13. line_range: node.span
14. }
15. APPEND chunk TO chunks
16. ELSE:
17. // Recursively chunk large classes into method-level chunks
18. sub_chunks ← StructuralCodeChunking(node_text, node.sub_ast, max_chunk_tokens)
19. // Attach parent context (class signature) to each sub-chunk
20. FOR EACH sc IN sub_chunks:
21. sc.parent_context ← node.signature
22. EXTEND chunks WITH sub_chunks
23. RETURN chunks
8.8.4 Symbol-Aware Retrieval#
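As a concrete illustration, here is a minimal Python sketch of structural chunking built on the standard `ast` module. The `CodeChunk` fields and the whitespace token counter are simplified stand-ins for the production schema and tokenizer, and only top-level function and class declarations are handled:

```python
import ast
from dataclasses import dataclass

@dataclass
class CodeChunk:
    content: str
    symbol: str
    node_type: str            # "FunctionDef", "ClassDef", ...
    line_range: tuple
    parent_context: str = ""  # enclosing class header, if any

def token_count(text: str) -> int:
    # Stand-in for a real tokenizer: whitespace-separated tokens.
    return len(text.split())

def structural_chunks(source: str, max_chunk_tokens: int = 200) -> list:
    """Chunk a source file along top-level declarations; split oversized
    classes into method-level chunks carrying the class header as context."""
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:
        text = ast.get_source_segment(source, node) or ""
        name = getattr(node, "name", type(node).__name__)
        if token_count(text) <= max_chunk_tokens or not isinstance(node, ast.ClassDef):
            chunks.append(CodeChunk(text, name, type(node).__name__,
                                    (node.lineno, node.end_lineno)))
        else:
            # Oversized class: one chunk per member, with parent context.
            header = text.splitlines()[0]
            for sub in node.body:
                sub_text = ast.get_source_segment(source, sub) or ""
                sub_name = f"{name}.{getattr(sub, 'name', type(sub).__name__)}"
                chunks.append(CodeChunk(sub_text, sub_name, type(sub).__name__,
                                        (sub.lineno, sub.end_lineno),
                                        parent_context=header))
    return chunks
```

The recursion in the pseudocode is flattened here to a single class-to-method split, which covers the common case.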
When the user query or agent task references a specific symbol (function name, class, variable), the retrieval system performs symbol resolution before text retrieval:
Symbol resolution returns the complete evidence envelope for a symbol: its definition, all call sites, associated tests, documentation, and human annotations.
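A minimal in-memory sketch of such a symbol index is shown below; the class, its methods, and the location strings are hypothetical illustrations of the idea, not the production interface:

```python
from collections import defaultdict
from dataclasses import dataclass, field

@dataclass
class SymbolEnvelope:
    symbol: str
    definition: str = ""
    call_sites: list = field(default_factory=list)
    tests: list = field(default_factory=list)
    docs: list = field(default_factory=list)

class SymbolIndex:
    """Toy index mapping qualified symbol names to their evidence envelope."""
    def __init__(self):
        self._defs = {}
        self._refs = defaultdict(list)   # symbol -> call-site locations
        self._tests = defaultdict(list)
        self._docs = defaultdict(list)

    def add_definition(self, symbol, location):
        self._defs[symbol] = location

    def add_call_site(self, symbol, location):
        self._refs[symbol].append(location)

    def add_test(self, symbol, location):
        self._tests[symbol].append(location)

    def resolve(self, symbol) -> SymbolEnvelope:
        return SymbolEnvelope(symbol,
                              definition=self._defs.get(symbol, ""),
                              call_sites=list(self._refs[symbol]),
                              tests=list(self._tests[symbol]))

def symbols_in_query(index: SymbolIndex, query: str) -> list:
    # Naive detection: any indexed name appearing verbatim in the query.
    return [s for s in index._defs if s in query]
```

Detected symbols are resolved first; plain text retrieval then runs only on the residual query.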
8.9 Live Runtime Inspection: Querying Logs, Metrics, Traces, and System State as Evidence#
8.9.1 The Observability-as-Evidence Principle#
An agent that cannot observe the system it operates on cannot reliably reason about or improve it. Live runtime data—logs, metrics, distributed traces, configuration state, deployment status, health checks—must be queryable as first-class evidence sources within the retrieval pipeline.
8.9.2 Runtime Evidence Sources#
| Source | Query Interface | Latency Tier | Freshness |
|---|---|---|---|
| Structured Logs | Full-text search + metadata filters (e.g., Elasticsearch) | WARM (50–200ms) | Near-real-time (seconds) |
| Metrics | PromQL, MQL, or time-series query (e.g., Prometheus, Datadog) | WARM (50–200ms) | Near-real-time (15s–1m) |
| Distributed Traces | Trace ID lookup, span query (e.g., Jaeger, Tempo) | WARM (100–500ms) | Near-real-time |
| Configuration State | Key-value lookup or diff (e.g., etcd, Consul) | HOT (<10ms) | Real-time |
| Deployment State | API query (e.g., Kubernetes API, CI/CD pipeline) | WARM (100–500ms) | Near-real-time |
| Health Checks | HTTP probe or gRPC health service | HOT (<10ms) | Real-time |
| Feature Flags | Flag evaluation API | HOT (<10ms) | Real-time |
8.9.3 Temporal Scoping#
Runtime evidence must be temporally scoped to the relevant time window. The agent or retrieval planner specifies a window $[t_{\min}, t_{\max}]$, where $t_{\min}$ and $t_{\max}$ are derived from the task context (e.g., "the last 30 minutes" for an incident, "the last deployment" for a regression, "the last 24 hours" for a trend analysis).
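A sketch of this derivation in Python; the mapping from task kind to lookback window is an illustrative placeholder, not a recommended configuration:

```python
from datetime import datetime, timedelta

# Hypothetical lookback per task kind; the endpoints (t_min, t_max)
# are what the planner passes to each runtime source query.
LOOKBACK = {
    "incident": timedelta(minutes=30),
    "regression": timedelta(hours=6),   # roughly "since the last deployment"
    "trend": timedelta(hours=24),
}

def time_window(task_kind: str, now: datetime = None):
    """Derive (t_min, t_max) from the task context."""
    now = now or datetime.utcnow()
    delta = LOOKBACK.get(task_kind, timedelta(hours=1))
    return (now - delta, now)
```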
8.9.4 Pseudo-Algorithm: Live Runtime Evidence Retrieval#
ALGORITHM: LiveRuntimeRetrieval
INPUT: query: string, time_window: TimeWindow, runtime_sources: List<RuntimeSource>,
deadline_ms: uint32, max_results_per_source: int
OUTPUT: runtime_evidence: List<RuntimeEvidenceFragment>
1. deadline ← NOW() + deadline_ms
2. query_plan ← RuntimeQueryPlanner.plan(query, time_window, runtime_sources)
// Translates NL query into source-specific queries:
// e.g., PromQL for metrics, Elasticsearch DSL for logs, trace ID for traces
3.
4. runtime_evidence ← PARALLEL_FOR_EACH (source, source_query) IN query_plan:
5. TRY:
6. raw ← source.client.execute(
7. source_query,
8. time_range=time_window,
9. limit=max_results_per_source,
10. timeout=remaining_time(deadline) * 0.8
11. )
12. fragments ← RuntimeEvidenceFormatter.format(raw, source.type)
13. // Attach provenance: source system, query used, time window, freshness
14. FOR EACH f IN fragments:
15. f.provenance ← ProvenanceRecord {
16. source: source.id,
17. query: source_query,
18. retrieved_at: NOW(),
19. data_timestamp: f.timestamp,
20. freshness: NOW() - f.timestamp
21. }
22. RETURN fragments
23. ON_TIMEOUT:
24. RETURN EMPTY WITH timeout_flag=TRUE
25. ON_ERROR(e):
26. LOG_WARNING("Runtime source {source.id} failed: {e}")
27. RETURN EMPTY WITH error_flag=TRUE
28. END_PARALLEL
29.
30. RETURN FLATTEN(runtime_evidence)
8.9.5 Safety and Cost Controls#
- Query cost estimation: Before executing runtime queries, the planner estimates the scan cost (e.g., bytes scanned in a log store, cardinality of a metrics query) and rejects queries exceeding the cost budget.
- Sampling: For high-volume sources (millions of log lines), the retriever applies statistical sampling with confidence intervals rather than exhaustive scan.
- Rate limiting: Runtime queries are subject to per-source rate limits to avoid impacting production observability infrastructure.
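The first two controls can be sketched as small gate functions; the budget defaults below are illustrative placeholders, not recommendations:

```python
def admit_query(estimated_bytes: int, estimated_cardinality: int,
                byte_budget: int = 1_000_000_000,
                cardinality_budget: int = 10_000) -> bool:
    """Cost gate: reject runtime queries whose estimated scan cost
    (bytes scanned, metric cardinality) exceeds the budget."""
    return (estimated_bytes <= byte_budget
            and estimated_cardinality <= cardinality_budget)

def sample_rate(estimated_lines: int, max_lines: int = 100_000) -> float:
    """Statistical sampling rate for high-volume log scans: read at most
    max_lines lines, uniformly sampled."""
    return min(1.0, max_lines / estimated_lines)
```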
8.10 Ranking and Scoring#
8.10.1 Multi-Signal Ranking: Authority × Freshness × Relevance × Execution Utility#
The final ranking of evidence fragments combines multiple orthogonal signals into a unified score. Let $e$ be an evidence fragment, $q$ the query, and $\tau$ the task context. The composite ranking score is:

$S(e, q, \tau) = \sum_{i} w_i \cdot S_i(e, q, \tau)$

where the $S_i$ are normalized signal functions and the $w_i$ are task-dependent weights.
8.10.1.1 Signal Definitions#
Authority Signal $S_A(e)$:

$S_A(e) = \text{tier\_score}(\text{tier}(e)) + \beta \cdot \mathbb{1}[\text{human\_annotated}(e)]$

where $\text{tier\_score}$ maps each authority tier to a value in $[0, 1]$ and $\beta$ is a bonus for human-annotated evidence.
Freshness Signal $S_F(e)$:

$S_F(e) = \exp\left(-\frac{\text{now} - t_e}{\tau_s}\right)$

where $t_e$ is the evidence timestamp and $\tau_s$ is a source-specific decay constant (e.g., 24 hours for logs, 30 days for documentation, 365 days for canonical specifications). This exponential decay ensures that evidence freshness degrades smoothly.
Relevance Signal $S_R(e, q)$:

$S_R(e, q) = \alpha \cdot \text{BM25}^{\text{norm}}(e, q) + (1 - \alpha) \cdot \text{sim}^{\text{norm}}(v_e, v_q)$

where $\alpha \in [0, 1]$ and the norm superscript denotes min-max normalization to $[0, 1]$.
Execution Utility Signal $S_U(e, \tau)$:

$S_U(e, \tau) = \gamma_1 \cdot \text{exec}(e, \tau) + \gamma_2 \cdot \text{hist}(e) - \gamma_3 \cdot \text{generic}(e)$

where:
- $\text{exec}(e, \tau)$ measures whether $e$ contains executable information (code, commands, configurations) relevant to the current task type $\tau$.
- $\text{hist}(e)$ is the historical usage utility from §8.6.
- $\text{generic}(e)$ penalizes overly generic evidence (e.g., high-level overviews when the task requires implementation detail).
8.10.1.2 Full Composite Score#
$S(e, q, \tau) = w_R \cdot S_R(e, q) + w_A \cdot S_A(e) + w_F \cdot S_F(e) + w_U \cdot S_U(e, \tau)$

with the constraint $w_R + w_A + w_F + w_U = 1$, $w_i \geq 0$.
Default weight configuration (tunable per deployment):
| Signal | Weight | Rationale |
|---|---|---|
| Relevance ($w_R$) | 0.45 | Primary driver of evidence quality |
| Authority ($w_A$) | 0.25 | Canonical sources are preferred |
| Freshness ($w_F$) | 0.15 | Stale evidence causes failures |
| Execution Utility ($w_U$) | 0.15 | Agent needs actionable evidence |
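A direct Python sketch of the composite score with these default weights; the fragment field names are illustrative, and freshness is derived from the timestamp via the exponential decay described above:

```python
import math
import time

def composite_score(fragment: dict, weights=None, now=None) -> float:
    """Composite S(e, q, tau). `fragment` carries pre-computed, normalized
    per-signal scores, except freshness, which is derived from its
    timestamp and a source-specific decay constant (in seconds)."""
    w = weights or {"relevance": 0.45, "authority": 0.25,
                    "freshness": 0.15, "utility": 0.15}
    now = now if now is not None else time.time()
    age_s = now - fragment["timestamp"]
    freshness = math.exp(-age_s / fragment["decay_constant_s"])
    return (w["relevance"] * fragment["relevance"]
            + w["authority"] * fragment["authority"]
            + w["freshness"] * freshness
            + w["utility"] * fragment["utility"])
```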
8.10.2 Learned Ranking Models: LTR with Agent Feedback Signals#
8.10.2.1 Learning-to-Rank (LTR) Framework#
When sufficient labeled data exists (from agent feedback loops, human annotations, or implicit utility signals), the ranking function can be replaced with a learned model trained via Learning-to-Rank (LTR).
Feature vector for a (query, document) pair:

$\mathbf{x}(q, d) = \big[S_R(d, q),\ S_A(d),\ S_F(d),\ S_U(d, \tau),\ \text{BM25}(d, q),\ \cos(v_d, v_q),\ \ldots\big]$
Training objectives:
- Pointwise: Predict the absolute relevance grade via regression or classification: $\min_f \sum_{(q,d)} \ell\big(f(\mathbf{x}(q, d)),\ y_{q,d}\big)$
- Pairwise: Learn to correctly order document pairs: $\min_f \sum_{d_i \succ d_j} \log\big(1 + e^{-\sigma(f(\mathbf{x}_i) - f(\mathbf{x}_j))}\big)$
- Listwise (e.g., LambdaMART, ApproxNDCG): Optimize directly for ranking metrics such as NDCG over the full result list.
In practice, LambdaMART (gradient-boosted trees with lambda gradients) provides the best balance of accuracy, interpretability, and inference speed for retrieval re-ranking:
$\lambda_{ij} = \frac{-\sigma}{1 + e^{\sigma(s_i - s_j)}} \cdot |\Delta \text{NDCG}_{ij}|$

where $|\Delta \text{NDCG}_{ij}|$ is the change in NDCG from swapping documents $i$ and $j$ in the ranking.
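The lambda gradient can be computed directly from the DCG definitions of §8.13.2.3; the helper names below are illustrative:

```python
import math

def dcg_contrib(rel: int, rank: int) -> float:
    """Contribution of one document with graded relevance `rel` at `rank`."""
    return (2 ** rel - 1) / math.log2(rank + 1)

def delta_ndcg(rels, i, j, idcg) -> float:
    """|delta NDCG| from swapping the documents at (1-based) ranks i and j."""
    before = dcg_contrib(rels[i - 1], i) + dcg_contrib(rels[j - 1], j)
    after = dcg_contrib(rels[j - 1], i) + dcg_contrib(rels[i - 1], j)
    return abs(after - before) / idcg

def lambda_ij(s_i: float, s_j: float, dndcg: float, sigma: float = 1.0) -> float:
    """Pair gradient for d_i more relevant than d_j, with current model
    scores s_i, s_j: pushes s_i up and s_j down, weighted by |delta NDCG|."""
    return -sigma / (1.0 + math.exp(sigma * (s_i - s_j))) * dndcg
```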
8.10.2.2 Agent Feedback Integration#
The LTR model is continuously retrained on agent feedback signals:
ALGORITHM: LTRFeedbackLoop
INPUT: completed_tasks: Stream<CompletedTask>
OUTPUT: updated_model: LTRModel
1. training_buffer ← EMPTY_LIST
2. FOR EACH task IN completed_tasks:
3. FOR EACH evidence_fragment IN task.retrieved_evidence:
4. features ← FeatureExtractor.extract(task.query, evidence_fragment)
5. utility ← UtilityEstimator.estimate(
6. evidence_fragment,
7. task.outcome, // SUCCESS, FAILURE, PARTIAL
8. task.human_feedback, // APPROVED, REJECTED, CORRECTED
9. task.citation_map // Which fragments were cited in output
10. )
11. APPEND (features, utility) TO training_buffer
12.
13. IF |training_buffer| ≥ RETRAIN_THRESHOLD:
14. new_model ← LambdaMART.train(training_buffer, validation_split=0.2)
15. IF Evaluator.ndcg(new_model) > Evaluator.ndcg(current_model) + MIN_IMPROVEMENT:
16. current_model ← new_model
17. ModelRegistry.deploy(new_model, version=NEXT_VERSION)
18. training_buffer ← EMPTY_LIST
8.10.3 Diversity-Aware Ranking: Maximal Marginal Relevance (MMR)#
8.10.3.1 The Redundancy Problem#
Naïve top-$k$ ranking by composite score often returns near-duplicate evidence fragments from the same or similar sources. This wastes token budget and reduces the information density of the context window. Maximal Marginal Relevance (MMR) addresses this by balancing relevance against diversity.
8.10.3.2 MMR Formulation#
Given a query $q$, a candidate set $C$, and a selected set $S$ (initially empty), MMR iteratively selects the next document:

$d^{*} = \arg\max_{d_i \in C \setminus S} \Big[\lambda \cdot \text{rel}(d_i, q) - (1 - \lambda) \cdot \max_{d_j \in S} \text{sim}(d_i, d_j)\Big]$
where:
- $\text{rel}(d_i, q)$ is the relevance score (the composite from §8.10.1)
- $\text{sim}(d_i, d_j)$ is the inter-document similarity (cosine similarity of embeddings)
- $\lambda \in [0, 1]$ controls the relevance-diversity trade-off:
  - $\lambda = 1$: Pure relevance ranking (no diversity)
  - $\lambda = 0$: Pure diversity (maximum dissimilarity from selected set)
  - Intermediate $\lambda$: Typical operating range for agentic retrieval
8.10.3.3 Pseudo-Algorithm: MMR Selection#
ALGORITHM: MMRSelection
INPUT: query: string, candidates: List<(DocID, float)>, embeddings: Map<DocID, Vector>,
lambda: float, k: int
OUTPUT: selected: List<(DocID, float)>
1. S ← EMPTY_LIST // Selected set
2. C ← SET(candidates) // Remaining candidates
3. q_emb ← Encoder.encode(query)
4.
5. WHILE |S| < k AND C IS NOT EMPTY:
6. best_doc ← NULL
7. best_mmr ← -∞
8. FOR EACH (doc_id, rel_score) IN C:
9. IF S IS EMPTY:
10. max_sim ← 0.0
11. ELSE:
12. max_sim ← MAX(
13. cosine_similarity(embeddings[doc_id], embeddings[s.doc_id])
14. FOR s IN S
15. )
16. mmr_score ← lambda * rel_score - (1 - lambda) * max_sim
17. IF mmr_score > best_mmr:
18. best_mmr ← mmr_score
19. best_doc ← (doc_id, rel_score)
20. APPEND best_doc TO S
21. REMOVE best_doc FROM C
22.
23. RETURN S
8.10.3.4 Computational Optimization#
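The MMR selection loop translates directly into runnable Python (pure stdlib, cosine similarity computed inline):

```python
def cosine(u, v) -> float:
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv) if nu and nv else 0.0

def mmr_select(candidates, embeddings, lam=0.7, k=3):
    """candidates: list of (doc_id, relevance); embeddings: doc_id -> vector.
    Greedily picks k documents maximizing lambda*rel - (1-lambda)*max_sim."""
    selected, remaining = [], list(candidates)
    while remaining and len(selected) < k:
        def mmr(item):
            doc_id, rel = item
            max_sim = max((cosine(embeddings[doc_id], embeddings[sid])
                           for sid, _ in selected), default=0.0)
            return lam * rel - (1 - lam) * max_sim
        best = max(remaining, key=mmr)
        selected.append(best)
        remaining.remove(best)
    return selected
```

With two near-duplicate high-relevance candidates, the second pick jumps to a dissimilar document rather than the duplicate.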
Naïve MMR is $O(k^2 \cdot |C|)$ per query. For large candidate sets, this is optimized via:
- Lazy evaluation: Skip candidates whose upper-bound MMR score (assuming zero similarity to selected set) is below the current best.
- Pre-computed similarity matrix: For candidate sets of bounded size, pre-compute all pairwise similarities ($O(|C|^2)$ space) so the inner maximization becomes a table lookup.
- Approximate similarity: Use locality-sensitive hashing (LSH) for approximate similarity instead of exact cosine computation.
8.11 Provenance Tagging: Every Evidence Fragment Carries Source, Timestamp, Confidence, and Chain-of-Custody#
8.11.1 The Provenance Imperative#
In agentic systems, unattributed evidence is inadmissible. If the agent cannot trace where a fact originated, when it was last verified, how it arrived in the context window, and through which transformations it passed, then:
- The agent cannot assess the evidence's reliability.
- The human operator cannot audit the agent's reasoning.
- Conflicting evidence cannot be resolved.
- Hallucination cannot be distinguished from retrieval error.
- Compliance and regulatory requirements cannot be met.
8.11.2 Provenance Record Schema#
ProvenanceRecord {
origin: OriginDescriptor {
source_id: SourceID // Canonical source system
source_type: SourceType // Database, API, vector store, annotation store, etc.
document_id: string // Original document identifier within source
document_version: string // Version or commit hash at extraction time
extraction_method: string // "full_text", "chunk_semantic", "ast_parse", "api_call"
extraction_timestamp: Timestamp
}
transformations: List<TransformationRecord> {
step: string // "chunk", "embed", "reindex", "summarize", "translate"
tool_id: ToolID // Which tool/pipeline performed the transformation
tool_version: SemanticVersion
timestamp: Timestamp
input_hash: SHA256 // Hash of input to transformation
output_hash: SHA256 // Hash of output from transformation
}
retrieval_context: RetrievalContext {
query_hash: SHA256 // Hash of the query that triggered retrieval
retriever_type: string // "bm25", "dense", "graph", "sql", "runtime"
retrieval_timestamp: Timestamp
rank_position: uint32 // Position in the pre-fusion ranked list
relevance_score: float32
fusion_method: string // "rrf", "linear", "learned"
final_rank: uint32 // Position in post-fusion ranked list
}
trust_metadata: TrustMetadata {
authority_tier: AuthorityTier
confidence: float32 // In [0, 1]
verified_by: List<PrincipalID> // Humans who have verified this content
last_verification: Timestamp
staleness: Duration // NOW() - extraction_timestamp
conflict_flag: bool // True if conflicts with other retrieved evidence
}
}
8.11.3 Chain-of-Custody Integrity#
The chain-of-custody is cryptographically verifiable:
$h_i = \text{SHA256}\big(h_{i-1} \,\|\, \text{input\_hash}_i \,\|\, \text{output\_hash}_i \,\|\, \text{tool\_id}_i\big)$

with $h_0 = \text{SHA256}(\text{origin})$.
This ensures that any tampering with the evidence or its transformation history is detectable. The final evidence fragment presented to the agent carries the full chain, and the agent's prefill compiler can include provenance summaries in the context to enable the model to reason about evidence reliability.
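A hash-chain sketch using `hashlib`; the serialization format (pipe-joined fields) is an illustrative choice, not a wire standard, and the record keys mirror the TransformationRecord schema above:

```python
import hashlib

def _h(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def extend_chain(prev_hash: str, record: dict) -> str:
    """Fold one TransformationRecord into the custody chain."""
    payload = "|".join([prev_hash, record["input_hash"],
                        record["output_hash"], record["tool_id"]])
    return _h(payload.encode())

def chain_head(origin: str, transformations: list) -> str:
    """h_0 anchored to the origin descriptor, then one link per transform."""
    h = _h(origin.encode())
    for rec in transformations:
        h = extend_chain(h, rec)
    return h

def verify(origin: str, transformations: list, claimed_head: str) -> bool:
    """Recompute the chain and compare; any tampering changes the head."""
    return chain_head(origin, transformations) == claimed_head
```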
8.11.4 Provenance in the Context Window#
Provenance is injected into the agent's context in a compressed, structured format to minimize token cost while preserving attribution:
[EVIDENCE source="service-catalog-v3" authority=CANONICAL freshness=2h confidence=0.95
retrieval_method="bm25+dense_rrf" rank=1]
ServiceX depends on LibraryY v2.3.1 and requires PostgreSQL ≥14.
[/EVIDENCE]
[EVIDENCE source="incident-log-2024-Q4" authority=CURATED freshness=36h confidence=0.82
retrieval_method="log_search" rank=3 human_verified=true]
ServiceX experienced 5xx errors after LibraryY upgrade to v2.4.0 on 2024-11-15.
[/EVIDENCE]
This structured tagging enables the model to weigh evidence during generation, cite sources in its output, and flag low-confidence evidence for human review.
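Rendering such tags is a straightforward formatting step; a sketch, with fragment keys chosen to mirror the example attributes (the dict layout is illustrative):

```python
def render_evidence(fragment: dict) -> str:
    """Render one fragment in the compressed [EVIDENCE ...] form, pulling
    attribute values from the fragment's provenance fields."""
    attrs = [
        'source="{}"'.format(fragment["source"]),
        "authority={}".format(fragment["authority"]),
        "freshness={}".format(fragment["freshness"]),
        "confidence={:.2f}".format(fragment["confidence"]),
        "rank={}".format(fragment["rank"]),
    ]
    if fragment.get("human_verified"):
        attrs.append("human_verified=true")
    return "[EVIDENCE {}]\n{}\n[/EVIDENCE]".format(" ".join(attrs),
                                                   fragment["text"])
```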
8.12 Retrieval Latency Budget Management: Tiered Deadlines, Early Termination, Cached Fallbacks#
8.12.1 Latency as a First-Class Constraint#
Retrieval latency directly impacts agent loop iteration time, user-perceived responsiveness, and the number of retrieval-action cycles possible within a task deadline. The retrieval engine must operate under a hard latency budget that is allocated across pipeline stages and sources.
8.12.2 Latency Budget Allocation Model#
Given a global retrieval deadline $D$ (in milliseconds), the budget is allocated across pipeline stages:

$D = T_{\text{decompose}} + T_{\text{select}} + T_{\text{fanout}} + T_{\text{fusion}} + T_{\text{rerank}} + T_{\text{mmr}} + T_{\text{provenance}} + T_{\text{margin}}$

Typical allocation for a $D = 500\,\text{ms}$ budget:
| Stage | Budget | Notes |
|---|---|---|
| Query Decomposition | 20ms | LLM call if needed, or cached pattern |
| Source Selection | 5ms | Registry lookup, ACL check |
| Parallel Fan-Out | 350ms | Bottleneck: slowest source |
| Fusion | 10ms | RRF/linear: sub-ms; learned: 5–10ms |
| Cross-Encoder Re-rank | 80ms | GPU batch inference over top-100 candidates |
| MMR + Filtering | 15ms | |
| Provenance Assembly | 10ms | Metadata joins |
| Safety Margin | 10ms | Buffer for network jitter |
8.12.3 Tiered Deadline Enforcement#
Sources are assigned to latency tiers with differentiated deadline enforcement:

$\text{deadline}(s) = \min\big(\text{remaining}(D),\ L^{p99}_{\text{tier}(s)}\big)$

where $L^{p99}_{\text{tier}(s)}$ is the p99 latency expectation for source $s$'s tier.
8.12.4 Early Termination Strategies#
- Sufficient evidence: If the top-$k$ results from completed sources already exceed a quality threshold $\theta$, cancel outstanding source requests: $\min_{i \leq k} S(e_i, q, \tau) \geq \theta$
- Diminishing returns: If each additional completed source contributes a marginal score improvement below a threshold $\epsilon$, stop waiting for the remaining sources.
- Deadline proximity: At $\text{NOW()} \geq \text{deadline} - T_{\text{post\_retrieval}}$, the system forcibly proceeds to fusion with whatever results are available.
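The first two conditions can be sketched as a single check over the sources completed so far; the thresholds are illustrative defaults:

```python
def early_terminate(results, k=5, theta=0.8, eps=0.02) -> bool:
    """results: per-source lists of (doc_id, score), in completion order.
    Returns True once the pooled top-k scores all clear the quality bar,
    or once the latest source moved the pooled top-k sum by less than eps."""
    pooled = sorted((s for r in results for _, s in r), reverse=True)
    # Sufficient evidence: k-th best pooled score already above theta.
    if len(pooled) >= k and pooled[k - 1] >= theta:
        return True
    # Diminishing returns: compare against the pool before the last source.
    if len(results) >= 2:
        prev = sorted((s for r in results[:-1] for _, s in r), reverse=True)
        if len(prev) >= k and sum(pooled[:k]) - sum(prev[:k]) < eps:
            return True
    return False
```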
8.12.5 Cache Hierarchy#
A multi-layer cache hierarchy reduces latency for repeated or similar queries:
| Layer | Key | TTL | Storage |
|---|---|---|---|
| L1: Request-local | Exact query + source + filter hash | Request lifetime | In-process memory |
| L2: Session-local | Query embedding bucket + source | Session duration | Redis / local SSD |
| L3: Global warm | Query cluster centroid + source | 1–24 hours | Distributed cache |
| L4: Pre-computed | Scheduled queries (common patterns) | Until invalidated | Object store |
Cache invalidation is triggered by:
- Source data mutation (change notifications via MCP subscriptions or CDC streams)
- TTL expiry
- Schema version change in the source registry
- Explicit cache flush on configuration update
8.12.6 Pseudo-Algorithm: Deadline-Managed Retrieval with Fallback#
ALGORITHM: DeadlineManagedRetrieval
INPUT: request: EvidenceRequest
OUTPUT: response: EvidenceResponse
1. t_start ← NOW()
2. deadline ← t_start + request.latency_deadline_ms
3.
4. // Phase 1: Cache check (L1 → L2 → L3)
5. cached ← CacheHierarchy.lookup(request.cache_key)
6. IF cached IS NOT NULL AND cached.freshness ≤ request.max_staleness:
7. RETURN cached WITH cache_hit=TRUE
8.
9. // Phase 2: Query decomposition (budgeted)
10. subqueries ← WITH_TIMEOUT(
11. QueryDecomposer.decompose(request.query),
12. budget=T_DECOMPOSE
13. )
14. IF TIMED_OUT: subqueries ← [request.query] // Fallback: use raw query
15.
16. // Phase 3: Source selection
17. sources ← SourceSelector.select(subqueries, request.source_policy, deadline)
18.
19. // Phase 4: Parallel fan-out with per-source deadlines
20. raw_results ← EMPTY_LIST
21. pending ← LAUNCH_PARALLEL(sources, subqueries, deadline)
22.
23. WHILE pending IS NOT EMPTY AND NOW() < deadline - T_POST_RETRIEVAL:
24. completed ← WAIT_ANY(pending, timeout=10ms)
25. IF completed IS NOT NULL:
26. EXTEND raw_results WITH completed.results
27. REMOVE completed FROM pending
28. // Check early termination condition
29. IF EarlyTerminationCheck(raw_results, request):
30. CANCEL_ALL(pending)
31. BREAK
32.
33. // Phase 5: Force-collect remaining results or use cached fallbacks
34. FOR EACH p IN pending:
35. CANCEL(p)
36. fallback ← CacheHierarchy.get_stale(p.source_id, subqueries)
37. IF fallback IS NOT NULL:
38. EXTEND raw_results WITH MARK_STALE(fallback)
39.
40. // Phase 6: Fusion, re-ranking, filtering (budgeted)
41. fused ← FusionEngine.fuse(raw_results, request.ranking_weights)
42. reranked ← CrossEncoderReranker.rerank(fused, request.query, budget=T_RERANK)
43. diverse ← MMRSelector.select(reranked, lambda=request.diversity_constraint, k=TARGET_K)
44. filtered ← ACLFilter.apply(diverse, request.source_policy.acl_scope)
45. provenance_tagged ← ProvenanceAssembler.attach(filtered)
46. truncated ← TokenBudgetAllocator.fit(provenance_tagged, request.token_budget)
47.
48. // Phase 7: Cache write-through
49. CacheHierarchy.write(request.cache_key, truncated, ttl=ComputeTTL(sources))
50.
51. response ← BUILD_RESPONSE(truncated, latency=NOW()-t_start)
52. ObservabilityEmitter.emit(request, response)
53. RETURN response
8.13 Retrieval Quality Evaluation: Recall@K, Precision@K, NDCG, Faithfulness, and Agent Task Success Correlation#
8.13.1 Evaluation Philosophy#
Retrieval quality evaluation in agentic systems must operate at two levels:
- Intrinsic evaluation: How well does the retrieval system return relevant documents? (Standard IR metrics.)
- Extrinsic evaluation: How well does the retrieved evidence enable the agent to complete its task? (Agent task success correlation.)
Optimizing intrinsic metrics without measuring extrinsic impact is insufficient; a retrieval system that achieves perfect Recall@10 but returns evidence that the agent cannot act upon provides no value. Conversely, extrinsic-only evaluation makes debugging retrieval failures impossible. Both levels must be measured, correlated, and optimized jointly.
8.13.2 Intrinsic Metrics#
8.13.2.1 Precision@K#
Measures the fraction of the top-$K$ retrieved documents that are relevant:

$\text{Precision@}K = \frac{|\{\text{relevant}\} \cap \{\text{top-}K\}|}{K}$

Critical for token efficiency: low precision means the agent's context window is polluted with irrelevant evidence.
8.13.2.2 Recall@K#
Measures the fraction of all relevant documents that appear in the top-$K$:

$\text{Recall@}K = \frac{|\{\text{relevant}\} \cap \{\text{top-}K\}|}{|\{\text{relevant}\}|}$

Critical for completeness: low recall means the agent is missing evidence it needs.
8.13.2.3 Normalized Discounted Cumulative Gain (NDCG@K)#
NDCG measures ranking quality with graded relevance judgments.
Discounted Cumulative Gain:

$\text{DCG@}K = \sum_{i=1}^{K} \frac{2^{rel_i} - 1}{\log_2(i + 1)}$

where $rel_i$ is the graded relevance of the document at rank $i$.

Ideal DCG (documents sorted by true relevance):

$\text{IDCG@}K = \sum_{i=1}^{K} \frac{2^{rel^{*}_i} - 1}{\log_2(i + 1)}$

where $rel^{*}$ is the ideal ordering.

NDCG:

$\text{NDCG@}K = \frac{\text{DCG@}K}{\text{IDCG@}K} \in [0, 1]$, where 1 indicates a perfect ranking.
8.13.2.4 Mean Reciprocal Rank (MRR)#
$\text{MRR} = \frac{1}{|Q|} \sum_{q \in Q} \frac{1}{\text{rank}_q}$

where $\text{rank}_q$ is the rank of the first relevant document for query $q$. MRR is critical for agentic retrieval: often the agent needs at least one high-quality document, and the position of that first relevant document determines generation quality.
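The four intrinsic metrics above translate directly into Python; relevance judgments are passed as sets and graded-relevance dicts:

```python
import math

def precision_at_k(retrieved, relevant, k) -> float:
    """Fraction of the top-k retrieved docs that are relevant."""
    return sum(1 for d in retrieved[:k] if d in relevant) / k

def recall_at_k(retrieved, relevant, k) -> float:
    """Fraction of all relevant docs appearing in the top-k."""
    return sum(1 for d in retrieved[:k] if d in relevant) / len(relevant)

def ndcg_at_k(retrieved, grades, k) -> float:
    """grades: doc_id -> graded relevance; NDCG = DCG / IDCG."""
    def dcg(docs):
        return sum((2 ** grades.get(d, 0) - 1) / math.log2(i + 2)
                   for i, d in enumerate(docs[:k]))
    ideal = sorted(grades, key=grades.get, reverse=True)
    idcg = dcg(ideal)
    return dcg(retrieved) / idcg if idcg else 0.0

def mrr(queries) -> float:
    """queries: list of (retrieved_list, relevant_set)."""
    total = 0.0
    for retrieved, relevant in queries:
        for i, d in enumerate(retrieved, start=1):
            if d in relevant:
                total += 1.0 / i
                break
    return total / len(queries)
```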
8.13.3 Faithfulness and Attribution Metrics#
Beyond relevance, agentic retrieval must measure faithfulness: whether the agent's generated output is grounded in the retrieved evidence.
8.13.3.1 Faithfulness Score#
$\text{Faithfulness}(y, E) = \frac{\big|\{c \in \mathcal{C}(y) : \exists e \in E,\ e \models c\}\big|}{|\mathcal{C}(y)|}$

where:
- $y$ is the agent's output
- $E$ is the set of retrieved evidence fragments
- $\mathcal{C}(y)$ is the set of factual claims in the output (extracted via claim decomposition)
- $e \models c$ denotes that evidence $e$ entails claim $c$ (verified via an NLI or entailment model)
8.13.3.2 Attribution Coverage#
$\text{Attribution}(y, E) = \frac{\big|\{c \in \mathcal{C}(y) : c \text{ carries an explicit citation to some } e \in E\}\big|}{|\mathcal{C}(y)|}$

This measures whether the agent cites its sources, a necessary condition for auditability.
8.13.4 Agent Task Success Correlation#
The most important evaluation is the correlation between retrieval quality and agent task success:
$\rho = \text{corr}\big(\text{RetrievalQuality}(q),\ \text{TaskSuccess}(q)\big)$

where $\text{TaskSuccess}(q) \in \{0, 1\}$ or $[0, 1]$ is the downstream task outcome.
A high correlation validates that retrieval improvements translate to agent improvements. A low correlation indicates that retrieval is not the bottleneck—the problem may lie in prompt construction, tool use, or the model's reasoning.
8.13.4.1 Causal Analysis: Retrieval Ablation#
To isolate the causal effect of retrieval quality on task success, run ablation studies:
- No retrieval: Agent operates with zero evidence. Establishes baseline.
- Random retrieval: Agent receives randomly selected documents. Tests whether any context helps.
- Oracle retrieval: Agent receives the ideal evidence set (manually curated). Establishes ceiling.
- System retrieval: Agent receives the output of the retrieval system under test.
$\text{RetrievalContribution} = \frac{\text{Success}_{\text{system}} - \text{Success}_{\text{none}}}{\text{Success}_{\text{oracle}} - \text{Success}_{\text{none}}}$

This ratio quantifies what fraction of the theoretically achievable improvement is captured by the current retrieval system.
8.13.5 Continuous Evaluation Infrastructure#
ALGORITHM: ContinuousRetrievalEvaluation
INPUT: eval_queries: List<EvalQuery>, retrieval_engine: EvidenceEngine,
agent: Agent, schedule: CronSchedule
OUTPUT: eval_report: EvalReport
1. ON schedule:
2. results ← EMPTY_LIST
3. FOR EACH eq IN eval_queries:
4. // Intrinsic evaluation
5. retrieved ← retrieval_engine.retrieve(eq.query, eq.config)
6. precision_k ← compute_precision_at_k(retrieved, eq.relevant_docs, K)
7. recall_k ← compute_recall_at_k(retrieved, eq.relevant_docs, K)
8. ndcg_k ← compute_ndcg_at_k(retrieved, eq.graded_relevance, K)
9. mrr ← compute_mrr(retrieved, eq.relevant_docs)
10.
11. // Extrinsic evaluation (agent task success)
12. agent_output ← agent.execute(eq.task, evidence=retrieved)
13. task_success ← TaskEvaluator.evaluate(agent_output, eq.expected_outcome)
14. faithfulness ← FaithfulnessChecker.check(agent_output, retrieved)
15. attribution ← AttributionChecker.check(agent_output, retrieved)
16.
17. APPEND {eq.id, precision_k, recall_k, ndcg_k, mrr,
18. task_success, faithfulness, attribution} TO results
19.
20. // Aggregate metrics
21. report ← EvalReport {
22. mean_precision_k: MEAN(results.precision_k),
23. mean_recall_k: MEAN(results.recall_k),
24. mean_ndcg_k: MEAN(results.ndcg_k),
25. mean_mrr: MEAN(results.mrr),
26. mean_task_success: MEAN(results.task_success),
27. mean_faithfulness: MEAN(results.faithfulness),
28. retrieval_task_correlation: PEARSON_CORR(results.ndcg_k, results.task_success),
29. regressions: DETECT_REGRESSIONS(results, historical_results),
30. timestamp: NOW()
31. }
32.
33. // Quality gate enforcement
34. IF report.mean_ndcg_k < NDCG_THRESHOLD
35. OR report.mean_faithfulness < FAITHFULNESS_THRESHOLD
36. OR report.regressions IS NOT EMPTY:
37. AlertSystem.fire(RETRIEVAL_QUALITY_DEGRADATION, report)
38. IF report.regressions.severity == CRITICAL:
39. DeploymentGate.block_release("Retrieval quality regression detected")
40.
41. MetricsStore.persist(report)
42. RETURN report
8.13.6 Evaluation Metric Summary#
| Metric | Formula | Measures | Target (Production) |
|---|---|---|---|
| Precision@K | relevant in top-$K$ / $K$ | Context purity | |
| Recall@K | relevant in top-$K$ / all relevant | Evidence completeness | |
| NDCG@K | DCG@K / IDCG@K | Ranking quality | |
| MRR | mean of $1/\text{rank}_q$ over queries | First-hit quality | |
| Faithfulness | supported claims / total claims | Hallucination control | |
| Attribution Coverage | cited claims / total claims | Auditability | |
| Retrieval Contribution | (system − none) / (oracle − none) | System effectiveness | |
| Latency P99 | Measured | Responsiveness | |
| Cache Hit Ratio | cache hits / lookups | Efficiency | |
Chapter Summary#
The retrieval architecture presented in this chapter replaces ad hoc RAG with a deterministic, provenance-first, multi-tier evidence engine that operates as critical infrastructure within the agentic platform. The key architectural principles are:
- Hybrid retrieval is mandatory: No single modality suffices. Sparse (BM25), dense (bi-encoder + cross-encoder), structured (SQL/SPARQL), graph (lineage traversal), and live runtime sources must be composed through principled fusion.
- Provenance is non-negotiable: Every evidence fragment carries a full chain-of-custody with source identity, extraction timestamp, transformation history, authority tier, and confidence score. Unattributed evidence is inadmissible.
- Multi-source federation with typed contracts: Sources are registered with schemas, authority tiers, freshness SLAs, latency tiers, and access policies. Retrieval is executed as parallel fan-out with deadline-aware source selection and conflict resolution.
- Ranking is multi-dimensional: The composite ranking function captures authority, freshness, relevance, and execution utility. MMR ensures diversity. LTR models with agent feedback signals enable continuous optimization.
- Latency is a hard constraint: Budget allocation, tiered deadlines, early termination, cache hierarchies, and graceful degradation ensure the retrieval engine meets its SLA regardless of source availability.
- Evaluation is continuous and causal: Intrinsic metrics (Precision@K, Recall@K, NDCG) are measured alongside extrinsic metrics (faithfulness, attribution, agent task success). Correlation analysis and ablation studies validate that retrieval improvements causally improve agent performance. Evaluation runs in CI/CD with automated regression detection and deployment gates.
The retrieval engine is the epistemic foundation of the agentic system. Every downstream operation—planning, tool selection, code generation, verification, critique—is only as reliable as the evidence upon which it is conditioned. Engineering this foundation with the rigor, observability, and formal guarantees described in this chapter is a prerequisite for building agentic systems that operate predictably, safely, and at scale.