Preface to the Chapter#
In production agentic systems, the single greatest determinant of downstream generation quality is not the model, not the prompt, and not the orchestration topology—it is the evidence that reaches the context window at inference time. Retrieval is the supply chain of cognition. When that supply chain delivers imprecise, stale, unattributed, or contextually irrelevant material, no amount of prompt engineering or agent-loop sophistication can compensate. This chapter formalizes retrieval as a deterministic, provenance-tagged, multi-tier evidence engine with typed contracts, latency budgets, authority ranking, and measurable quality gates. Every design decision is motivated by the requirements of an agentic runtime that must act on evidence under bounded time, bounded tokens, and bounded trust.
8.1 Retrieval as a Deterministic Evidence Engine, Not Ad Hoc RAG#
8.1.1 The Fundamental Problem with Conventional RAG#
Retrieval-Augmented Generation (RAG), as typically deployed, operates as a loose coupling between a vector store and a language model. A user query is embedded, a nearest-neighbor lookup returns chunks, those chunks are concatenated into the prompt, and the model generates. This pattern is structurally deficient for agentic workloads across every axis that matters:
- Non-determinism: Identical queries may return different chunks across runs due to index staleness, embedding drift, or approximate nearest-neighbor (ANN) non-determinism.
- Absence of provenance: Chunks arrive as anonymous text blobs without source identity, extraction timestamp, confidence score, or chain-of-custody metadata.
- Single-signal ranking: Retrieval relies on a single similarity score (typically cosine distance in embedding space), ignoring authority, freshness, execution utility, and access control.
- No query decomposition: The user query is forwarded verbatim to the retrieval layer, forfeiting the opportunity to rewrite, expand, or decompose it into multiple subqueries targeting heterogeneous sources.
- No verification loop: Retrieved evidence is never validated against ground truth, cross-referenced across sources, or scored for faithfulness before injection into the generation context.
8.1.2 The Evidence Engine Paradigm#
A deterministic evidence engine replaces ad hoc RAG with a structured pipeline that guarantees:
- Reproducibility: Given the same query, corpus version, and configuration, the engine returns identical ranked evidence sets.
- Provenance: Every evidence fragment carries a typed provenance record: source identifier, extraction timestamp, confidence score, authority tier, and chain-of-custody path.
- Multi-signal ranking: Evidence is scored across a composite function of relevance, authority, freshness, execution utility, and diversity.
- Latency-bounded execution: Retrieval operates under a strict deadline budget with tiered fallbacks, early termination, and cache-first policies.
- Token-budget awareness: The engine returns evidence sized and prioritized to fit within the caller's declared token budget, not an arbitrary fixed top-k cutoff.
8.1.3 Formal Contract of the Evidence Engine#
The evidence engine exposes a typed interface:
EvidenceRequest {
query: DecomposedQuery // Rewritten, expanded, decomposed subqueries
source_policy: SourcePolicy // Allowed sources, authority thresholds, ACL scope
token_budget: uint32 // Maximum tokens for returned evidence
latency_deadline_ms: uint32 // Hard deadline for retrieval completion
ranking_weights: RankingWeights // Authority, freshness, relevance, utility weights
diversity_constraint: float32 // MMR lambda or equivalent diversity parameter
provenance_required: bool // If true, reject evidence without provenance
session_context: SessionID // For historical usage pattern retrieval
}
EvidenceResponse {
fragments: List<EvidenceFragment> // Ranked, provenance-tagged evidence
total_candidates: uint32 // Total candidates before ranking/filtering
latency_ms: uint32 // Actual retrieval latency
source_coverage: Map<SourceID, CoverageReport>
truncation_applied: bool // True if token budget forced truncation
cache_hit_ratio: float32 // Fraction of results served from cache
}
EvidenceFragment {
content: string // The evidence text
source_id: SourceID // Canonical source identifier
chunk_id: ChunkID // Unique chunk identifier within source
extraction_timestamp: Timestamp // When the chunk was extracted/indexed
confidence: float32 // Retrieval confidence score in [0, 1]
authority_tier: AuthorityTier // e.g., CANONICAL, CURATED, DERIVED, USER
relevance_score: float32 // Composite relevance score
provenance_chain: List<ProvenanceEntry> // Full chain-of-custody
metadata: Map<string, string> // Arbitrary typed metadata
token_count: uint32 // Pre-computed token count
}

This contract is exposed via gRPC/Protobuf for internal agent-to-retrieval calls (low latency, typed, streaming-capable) and via JSON-RPC at the application boundary for external consumers. MCP tool servers wrap the retrieval engine for agent discovery and interoperability.
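The typed contract above can be mirrored directly in code. Below is a minimal Python sketch of the EvidenceFragment record; field names follow the schema above, while the `ProvenanceEntry` fields and the constructor-time validation are illustrative assumptions, not part of the spec:

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple

@dataclass(frozen=True)
class ProvenanceEntry:
    step: str        # e.g. "extracted", "chunked", "indexed" (assumed names)
    actor: str       # component that performed the step
    timestamp: str   # ISO-8601 time of the step

@dataclass(frozen=True)
class EvidenceFragment:
    content: str
    source_id: str
    chunk_id: str
    extraction_timestamp: str
    confidence: float                       # retrieval confidence in [0, 1]
    authority_tier: str                     # CANONICAL | CURATED | DERIVED | USER
    relevance_score: float
    provenance_chain: Tuple[ProvenanceEntry, ...] = ()
    metadata: Dict[str, str] = field(default_factory=dict)
    token_count: int = 0

    def __post_init__(self):
        # Enforce the contract's range invariant at construction time.
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must lie in [0, 1]")
```

Freezing the dataclass keeps fragments immutable once they leave the retrieval layer, which matches the chapter's determinism and provenance requirements.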
8.1.4 Pseudo-Algorithm: Evidence Engine Top-Level Dispatch#
ALGORITHM: EvidenceEngineDispatch
INPUT: request: EvidenceRequest
OUTPUT: response: EvidenceResponse
1. deadline ← NOW() + request.latency_deadline_ms
2. subqueries ← QueryDecomposer.decompose(request.query)
3. sources ← SourceRegistry.select(
subqueries, request.source_policy, deadline
)
4. // Parallel fan-out with per-source deadline allocation
5. raw_results ← PARALLEL_FOR_EACH source IN sources:
budget_ms ← SourceRegistry.latency_tier(source, deadline)
retriever ← RetrieverFactory.get(source.type) // BM25, dense, graph, SQL, etc.
RETURN retriever.retrieve(subqueries, source, budget_ms)
ON_TIMEOUT: RETURN CachedFallback.get(subqueries, source)
END_PARALLEL
6. merged ← FusionEngine.fuse(raw_results, request.ranking_weights)
7. ranked ← Ranker.rank(merged, request.ranking_weights, request.diversity_constraint)
8. filtered ← ACLFilter.apply(ranked, request.source_policy.acl_scope)
9. IF request.provenance_required THEN
filtered ← REMOVE_IF(filtered, λ f: f.provenance_chain IS EMPTY)
10. truncated ← TokenBudgetAllocator.fit(filtered, request.token_budget)
11. response ← BUILD_RESPONSE(truncated, metadata)
12. ObservabilityEmitter.emit(request, response, latency=NOW()-deadline+request.latency_deadline_ms)
13. RETURN response

8.2 Hybrid Retrieval Pipeline Architecture#
Hybrid retrieval is the composition of multiple retrieval modalities—sparse lexical, dense semantic, structured query, and graph traversal—into a unified pipeline that exploits the complementary strengths of each. No single retrieval modality dominates across all query types, corpus structures, and task demands. The hybrid pipeline is not optional; it is the minimum viable architecture for production agentic retrieval.
8.2.1 Exact Match: Keyword, BM25, TF-IDF, Boolean Filters#
8.2.1.1 TF-IDF Formalization#
Term Frequency–Inverse Document Frequency remains the foundational sparse retrieval signal. For a term $t$ in document $d$ within corpus $D$:

$$\mathrm{tfidf}(t, d, D) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t, D), \qquad \mathrm{idf}(t, D) = \log \frac{|D|}{|\{d' \in D : t \in d'\}|}$$

where $\mathrm{tf}(t, d)$ is the raw count of term $t$ in document $d$.
The document-query relevance score under TF-IDF is:

$$\mathrm{score}(q, d) = \sum_{t \in q} \mathrm{tfidf}(t, d, D)$$
8.2.1.2 BM25 Formalization#
BM25 (Best Matching 25) extends TF-IDF with sublinear term frequency saturation and document length normalization. For query $q$ and document $d$:

$$\mathrm{BM25}(q, d) = \sum_{t \in q} \mathrm{IDF}(t) \cdot \frac{f(t, d) \cdot (k_1 + 1)}{f(t, d) + k_1 \cdot \left(1 - b + b \cdot \frac{|d|}{\mathrm{avgdl}}\right)}$$

where:
- $k_1$ controls term frequency saturation
- $b$ controls document length normalization (typically $b = 0.75$)
- $|d|$ is the length of document $d$ in tokens
- $\mathrm{avgdl}$ is the average document length across the corpus
- $\mathrm{IDF}(t)$ is computed as:

$$\mathrm{IDF}(t) = \log \frac{|D| - n(t) + 0.5}{n(t) + 0.5}$$

where $n(t)$ is the number of documents in $D$ containing term $t$.
Key properties for agentic retrieval:
- Exact term matching: BM25 excels when the query contains domain-specific identifiers, error codes, function names, configuration keys, or precise terminology that must match lexically.
- Predictability: Scoring is fully deterministic, transparent, and explainable—critical for provenance and auditability.
- Low latency: Inverted index lookup operates in $O\left(\sum_{t \in q} |P_t|\right)$, where $P_t$ is the posting list for term $t$; typically sub-millisecond for moderate corpora.
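A compact reference implementation can make the BM25 formula concrete. The sketch below is a from-scratch toy that scores every document against a query (documents as token lists); the `+ 1` inside the IDF logarithm is an added Lucene-style smoothing assumption that keeps scores non-negative and is not part of the formula above:

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score every document (a list of tokens) against the query with Okapi BM25."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    # n(t): number of documents containing each term
    df = Counter(t for d in docs for t in set(d))
    scores = [0.0] * N
    for t in query_terms:
        n_t = df.get(t, 0)
        idf = math.log((N - n_t + 0.5) / (n_t + 0.5) + 1)  # +1 keeps IDF non-negative
        for i, d in enumerate(docs):
            f = d.count(t)  # raw term frequency f(t, d)
            denom = f + k1 * (1 - b + b * len(d) / avgdl)
            scores[i] += idf * (f * (k1 + 1)) / denom if f else 0.0
    return scores
```

A production system would of course score via the inverted index rather than looping over all documents, as the pseudo-algorithm in 8.2.1.4 shows.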
8.2.1.3 Boolean Filters#
Boolean retrieval applies hard inclusion/exclusion constraints before or after scoring:

$$\mathrm{retrieve}(q, D) = \{\, d \in D : \phi(d) \,\}$$

where $\phi$ is a predicate in conjunctive normal form (CNF) over metadata fields:

$$\phi(d) = \bigwedge_{i} \bigvee_{j} p_{ij}(d)$$

Examples: source_type = "canonical" AND language = "en" AND updated_at > "2024-01-01".
Boolean filters are applied as pre-filters (reducing the candidate set before scoring) or post-filters (applied after scoring to enforce hard constraints). Pre-filtering is preferred when the filter is highly selective; post-filtering is preferred when the filter is non-selective and scoring is cheap.
8.2.1.4 Pseudo-Algorithm: BM25 Retrieval with Boolean Pre-Filtering#
ALGORITHM: BM25RetrieveWithFilter
INPUT: query_terms: List<string>, filter: BooleanPredicate, k: int, index: InvertedIndex
OUTPUT: results: List<(DocID, float)>
1. candidate_docs ← index.filter(filter) // Pre-filter by metadata predicate
2. scores ← EMPTY_MAP<DocID, float>
3. FOR EACH term t IN query_terms:
4. idf_t ← log((|D| - n(t) + 0.5) / (n(t) + 0.5))
5. postings ← index.get_postings(t) ∩ candidate_docs
6. FOR EACH (doc_id, freq) IN postings:
7. dl ← index.doc_length(doc_id)
8. tf_component ← (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * dl / avgdl))
9. scores[doc_id] ← scores[doc_id] + idf_t * tf_component
10. results ← TOP_K(scores, k)
11. RETURN results

8.2.2 Semantic Search: Dense Embedding Retrieval, Cross-Encoder Re-Ranking#
8.2.2.1 Dense Embedding Retrieval (Bi-Encoder)#
Dense retrieval encodes queries and documents into a shared embedding space using a bi-encoder architecture. Let $E_q$ and $E_d$ be the query and document encoders respectively (often shared weights):

$$\mathbf{q} = E_q(q) \in \mathbb{R}^{d}, \qquad \mathbf{v}_i = E_d(d_i) \in \mathbb{R}^{d}$$

Relevance is computed as the similarity between the query and document embeddings:

$$s(q, d_i) = \mathrm{sim}(\mathbf{q}, \mathbf{v}_i)$$

Common similarity functions:
- Cosine similarity: $\mathrm{sim}(\mathbf{q}, \mathbf{v}) = \dfrac{\mathbf{q} \cdot \mathbf{v}}{\|\mathbf{q}\| \, \|\mathbf{v}\|}$
- Dot product: $\mathrm{sim}(\mathbf{q}, \mathbf{v}) = \mathbf{q} \cdot \mathbf{v}$
- Euclidean distance (inverted): $\mathrm{sim}(\mathbf{q}, \mathbf{v}) = -\|\mathbf{q} - \mathbf{v}\|_2$
Retrieval is performed via Approximate Nearest Neighbor (ANN) search over the pre-computed document index using algorithms such as HNSW (Hierarchical Navigable Small World), IVF-PQ (Inverted File with Product Quantization), or ScaNN.
Complexity characteristics:
| Property | Bi-Encoder Dense Retrieval |
|---|---|
| Indexing | $O(N)$ encoder forward passes, one-time |
| Query encoding | One encoder forward pass per query |
| ANN search | $O(\log N)$ to $O(\sqrt{N})$ depending on algorithm |
| Storage | $O(N \cdot d)$ for $N$ documents, $d$-dimensional embeddings |
| Semantic coverage | High: captures paraphrase, synonym, conceptual similarity |
| Lexical precision | Low: may miss exact term matches, identifiers, codes |
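The similarity functions above reduce to a few lines of code. The brute-force scan below is only a stand-in for an ANN index (HNSW, IVF-PQ) meant to make the scoring explicit; the function names are illustrative:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def top_k(query_emb, doc_embs, k):
    """Exhaustive nearest-neighbor scan; an ANN index replaces
    this O(N) loop in production."""
    scored = [(i, cosine(query_emb, e)) for i, e in enumerate(doc_embs)]
    return sorted(scored, key=lambda x: x[1], reverse=True)[:k]
```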
8.2.2.2 Cross-Encoder Re-Ranking#
Cross-encoders jointly encode the query-document pair through a single transformer pass, enabling full cross-attention between query and document tokens:

$$s(q, d) = \sigma\left(\mathbf{w}^{\top} \, \mathrm{CLS}\big(\mathrm{Transformer}([q ; \mathrm{SEP} ; d])\big)\right)$$

where $\sigma$ is the sigmoid function, $\mathbf{w}$ is a learned scoring projection, and $\mathrm{CLS}(\cdot)$ extracts the classification token representation.
Cross-encoders are orders of magnitude more accurate than bi-encoders for relevance estimation but are computationally prohibitive for full-corpus search ($O(N)$ transformer forward passes per query). They are therefore deployed exclusively as re-rankers over a pre-filtered candidate set of size $k \ll N$.
Two-stage retrieval pipeline: a bi-encoder ANN search first narrows the corpus to $k_{\text{candidates}}$ documents, which the cross-encoder then re-ranks to produce the final result set.
Typical configurations use $k_{\text{candidates}}$ on the order of tens to hundreds, with a final return set $k_{\text{final}} \ll k_{\text{candidates}}$.
8.2.2.3 Pseudo-Algorithm: Dense Retrieval with Cross-Encoder Re-Ranking#
ALGORITHM: DenseRetrieveAndRerank
INPUT: query: string, k_candidates: int, k_final: int,
bi_encoder: Model, cross_encoder: Model, ann_index: ANNIndex
OUTPUT: results: List<(DocID, float, ProvenanceRecord)>
1. q_emb ← bi_encoder.encode_query(query)
2. candidates ← ann_index.search(q_emb, k_candidates)
// candidates: List<(DocID, embedding_score)>
3. reranked ← EMPTY_LIST
4. FOR EACH (doc_id, _) IN candidates:
5. doc_text ← DocumentStore.get_text(doc_id)
6. cross_score ← cross_encoder.score(query, doc_text)
7. provenance ← DocumentStore.get_provenance(doc_id)
8. APPEND (doc_id, cross_score, provenance) TO reranked
9. SORT reranked BY cross_score DESCENDING
10. RETURN reranked[0 : k_final]

8.2.3 Sparse-Dense Fusion: Reciprocal Rank Fusion (RRF), Linear Interpolation, Learned Merging#
No single retrieval modality is uniformly superior. Sparse retrieval (BM25) captures exact lexical matches, entity names, and code identifiers with high precision. Dense retrieval captures semantic similarity, paraphrase, and conceptual relatedness. Fusion combines these complementary signals into a unified ranking.
8.2.3.1 Reciprocal Rank Fusion (RRF)#
RRF is a rank-based fusion method that does not require score normalization across heterogeneous retrieval systems. Given ranked lists $R_1, \dots, R_m$, the RRF score for document $d$ is:

$$\mathrm{RRF}(d) = \sum_{j=1}^{m} \frac{1}{\kappa + r_j(d)}$$

where:
- $r_j(d)$ is the rank of document $d$ in ranked list $R_j$ (1-indexed; $r_j(d) = \infty$ if absent, contributing zero)
- $\kappa$ is a smoothing constant (typically $\kappa = 60$)
Properties of RRF:
- Score-agnostic: Does not require raw scores from individual retrievers, only ranks.
- Robust to outliers: The reciprocal function dampens the influence of extreme rankings.
- Parameter-free (aside from $\kappa$): No learned weights, no training data required.
- Theoretically grounded: Approximates the Borda count under reciprocal weighting.
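RRF's simplicity is easiest to see in code. A minimal sketch, assuming each retriever returns an ordered list of document IDs (absent documents simply contribute nothing, matching the convention above):

```python
def rrf_fuse(ranked_lists, kappa=60):
    """Reciprocal Rank Fusion over any number of ranked doc-id lists.
    Returns doc ids sorted by fused score, best first."""
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):  # ranks are 1-indexed
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (kappa + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Because only ranks are consumed, the sparse and dense retrievers' raw scores never need to be put on a common scale.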
8.2.3.2 Linear Interpolation Fusion#
When raw scores are available and calibrated (or normalized to $[0, 1]$), linear interpolation provides a weighted combination:

$$s_{\mathrm{fused}}(d) = \sum_{j=1}^{m} w_j \cdot \tilde{s}_j(d), \qquad \sum_{j=1}^{m} w_j = 1$$

where $\tilde{s}_j(d)$ is the normalized score from retriever $j$:

$$\tilde{s}_j(d) = \frac{s_j(d) - \min_{d'} s_j(d')}{\max_{d'} s_j(d') - \min_{d'} s_j(d')}$$
Weights can be:
- Static: Fixed weights set by domain expertise.
- Query-dependent: Predicted by a lightweight classifier based on query features (e.g., query length, presence of identifiers, domain classification).
- Learned: Optimized via gradient descent on a held-out relevance dataset.
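Min-max normalization followed by weighted combination can be sketched as follows; the dict-based score representation and the degenerate-range guard are implementation assumptions:

```python
def min_max_normalize(scored):
    """scored: dict doc_id -> raw score; rescale scores into [0, 1]."""
    lo, hi = min(scored.values()), max(scored.values())
    span = (hi - lo) or 1.0  # guard the degenerate all-equal case
    return {d: (s - lo) / span for d, s in scored.items()}

def linear_fuse(retriever_scores, weights):
    """retriever_scores: one score dict per retriever; weights sum to 1."""
    fused = {}
    for scored, w in zip(retriever_scores, weights):
        for d, s in min_max_normalize(scored).items():
            fused[d] = fused.get(d, 0.0) + w * s
    return fused
```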
8.2.3.3 Learned Merging#
Learned merging replaces hand-tuned fusion with a parameterized model that takes per-retriever scores (and optionally query features, document features, and retriever metadata) as input and produces a unified relevance score:

$$s_{\mathrm{fused}}(d) = g_\theta(\mathbf{s}_d, \mathbf{x}_q, \mathbf{x}_d)$$

where:
- $\mathbf{s}_d = [s_1(d), \dots, s_m(d)]$ is the vector of per-retriever scores
- $\mathbf{x}_q$ is a query feature vector (query type, length, domain, decomposition signals)
- $\mathbf{x}_d$ is a document feature vector (source authority, freshness, chunk type)
- $g_\theta$ is a small MLP or gradient-boosted tree trained on relevance judgments
The training objective is typically a pairwise ranking loss:

$$\mathcal{L}(\theta) = \sum_{(q,\, d^+,\, d^-)} \max\big(0,\; \mu - g_\theta(\mathbf{s}_{d^+}, \mathbf{x}_q, \mathbf{x}_{d^+}) + g_\theta(\mathbf{s}_{d^-}, \mathbf{x}_q, \mathbf{x}_{d^-})\big)$$

where $(q, d^+, d^-)$ are triples of query, relevant document, and non-relevant document, and $\mu$ is a margin parameter.
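The pairwise hinge objective can be illustrated in a few lines of plain Python. Here `score_fn` is a stand-in for the learned merger, and no gradients are computed; a real trainer would backpropagate through this loss:

```python
def pairwise_ranking_loss(triples, score_fn, margin=0.1):
    """Mean hinge loss over (query, relevant_doc, non_relevant_doc) triples.
    score_fn(query, doc) stands in for the learned merger g_theta."""
    total = 0.0
    for q, d_pos, d_neg in triples:
        # Penalize whenever the relevant doc does not beat the
        # non-relevant doc by at least the margin.
        total += max(0.0, margin - score_fn(q, d_pos) + score_fn(q, d_neg))
    return total / len(triples)
```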
8.2.3.4 Pseudo-Algorithm: Hybrid Fusion Pipeline#
ALGORITHM: HybridFusion
INPUT: query: string, sources: List<RetrieverConfig>, fusion_mode: FusionMode,
weights: RankingWeights, k_final: int
OUTPUT: fused_results: List<(DocID, float, ProvenanceRecord)>
1. ranked_lists ← EMPTY_LIST
2. FOR EACH retriever_config IN sources:
3. retriever ← RetrieverFactory.create(retriever_config)
4. results_j ← retriever.retrieve(query, k=retriever_config.k_candidates)
5. APPEND results_j TO ranked_lists
6.
7. SWITCH fusion_mode:
8. CASE RRF:
9. score_map ← EMPTY_MAP<DocID, float>
10. FOR j = 1 TO |ranked_lists|:
11. FOR rank = 1 TO |ranked_lists[j]|:
12. doc_id ← ranked_lists[j][rank].doc_id
13. score_map[doc_id] += 1.0 / (κ + rank)
14. fused ← SORT(score_map, DESCENDING)
15.
16. CASE LINEAR:
17. score_map ← EMPTY_MAP<DocID, float>
18. FOR j = 1 TO |ranked_lists|:
19. normalized ← MinMaxNormalize(ranked_lists[j])
20. FOR EACH (doc_id, norm_score) IN normalized:
21. score_map[doc_id] += weights[j] * norm_score
22. fused ← SORT(score_map, DESCENDING)
23.
24. CASE LEARNED:
25. all_docs ← UNION(ranked_lists)
26. FOR EACH doc_id IN all_docs:
27. s_vec ← [score_from(ranked_lists[j], doc_id) FOR j IN 1..|ranked_lists|]
28. x_q ← QueryFeatureExtractor(query)
29. x_d ← DocFeatureExtractor(doc_id)
30. score_map[doc_id] ← LearnedMerger.predict(s_vec, x_q, x_d)
31. fused ← SORT(score_map, DESCENDING)
32.
33. fused_results ← ATTACH_PROVENANCE(fused[0 : k_final])
34. RETURN fused_results

8.2.3.5 Comparative Analysis of Fusion Methods#
| Criterion | RRF | Linear Interpolation | Learned Merging |
|---|---|---|---|
| Training data required | None | None (manual) or minimal | Relevance judgments |
| Score calibration needed | No | Yes (normalization) | No (features) |
| Adaptability | Low | Medium | High |
| Latency overhead | Negligible | Negligible | Sub-millisecond (inference) |
| Transparency | High | High | Medium (model opacity) |
| Optimality | Near-optimal for rank-only | Sensitive to weight selection | Highest ceiling |
| Recommended use | Default/bootstrap | Known retriever quality ratios | Mature system with feedback |
8.2.4 Structured Query: SQL, GraphQL, SPARQL for Relational and Knowledge Graph Sources#
Not all evidence resides in unstructured text. Agentic systems must retrieve from:
- Relational databases via SQL: configuration tables, user records, transactional logs, feature registries.
- APIs via GraphQL: third-party service state, organizational hierarchies, project metadata.
- Knowledge graphs via SPARQL: ontological relationships, entity-attribute-value triples, causal and dependency links.
8.2.4.1 Query Generation from Natural Language#
The agent's retrieval planner must translate natural language subqueries into structured query language. This is formalized as a semantic parsing task:

$$\mathrm{parse}: (q_{\mathrm{nl}}, \mathcal{S}) \mapsto q_{\mathrm{struct}}$$

where $\mathcal{S}$ is the schema of the target source (table definitions, GraphQL type system, or RDF ontology).
Safety constraints:
- Generated queries must be read-only (SELECT, not UPDATE/DELETE).
- Queries must be parameterized to prevent injection.
- Queries must respect ACL scoping: the agent's caller identity determines which tables/fields/entities are accessible.
- Query execution must be bounded: LIMIT clauses, timeout enforcement, and result-set size caps.
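The read-only and boundedness constraints can be enforced with a thin guard in front of the query executor. The sketch below is deliberately naive string-level screening (it will false-positive on keywords inside string literals) and serves as a defence-in-depth layer only; real enforcement belongs in a read-only database role plus a proper SQL parser:

```python
import re

# Mutating statement keywords that must never appear in generated queries.
FORBIDDEN = re.compile(
    r"\b(insert|update|delete|drop|alter|create|grant|truncate)\b",
    re.IGNORECASE,
)

def guard_query(sql: str, max_rows: int = 100) -> str:
    """Reject non-read-only statements and enforce a row cap."""
    stmt = sql.strip().rstrip(";")
    if not stmt.lower().startswith("select"):
        raise ValueError("only SELECT statements are permitted")
    if FORBIDDEN.search(stmt):
        raise ValueError("mutating keyword detected")
    if not re.search(r"\blimit\b", stmt, re.IGNORECASE):
        stmt += f" LIMIT {max_rows}"  # bound the result set
    return stmt
```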
8.2.4.2 SPARQL for Knowledge Graph Retrieval#
For knowledge graph traversal, SPARQL queries enable multi-hop relational retrieval:
SELECT ?entity ?relation ?target WHERE {
?entity rdf:type :ServiceComponent .
?entity :dependsOn ?target .
?target :ownedBy ?team .
FILTER (?team = :PlatformTeam)
}
LIMIT 100

Graph retrieval is scored by path relevance: shorter paths with higher-authority edges receive higher scores:

$$\mathrm{score}(p) = w_{\min}(p) \cdot e^{-\lambda \cdot |p|}$$

where $w_{\min}(p)$ is the minimum authority weight along path $p$, $|p|$ is the path length, and $\lambda$ controls the decay rate.
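Path scoring of this shape is a one-liner. The sketch below assumes a path is represented by the list of its edge authority weights and that relevance decays exponentially in path length, as described above:

```python
import math

def path_score(edge_weights, decay=0.5):
    """Score a graph path by its weakest edge, discounted by length:
    shorter paths with higher-authority edges score higher."""
    w_min = min(edge_weights)          # weakest-link authority along the path
    return w_min * math.exp(-decay * len(edge_weights))
```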
8.3 Multi-Source Retrieval Federation#
8.3.1 Source Registry: Schema, Authority, Freshness SLA, Latency Tier, Access Policy#
The Source Registry is the canonical catalog of all retrievable sources. Every source is registered with a typed descriptor:
SourceDescriptor {
source_id: SourceID // Globally unique identifier
source_type: enum { // Retrieval modality
INVERTED_INDEX, VECTOR_STORE, RELATIONAL_DB,
KNOWLEDGE_GRAPH, API, LOG_STORE, CODE_INDEX,
HUMAN_ANNOTATION_STORE, MEMORY_LAYER
}
schema: SchemaDefinition // Fields, types, indices, capabilities
authority_tier: AuthorityTier // CANONICAL > CURATED > DERIVED > EPHEMERAL
freshness_sla: Duration // Maximum staleness guarantee (e.g., 5m, 1h, 24h)
latency_tier: LatencyTier // HOT (<10ms), WARM (<100ms), COLD (<1000ms), ARCHIVE (>1s)
access_policy: ACLPolicy // Role-based, attribute-based, or identity-scoped
query_capabilities: List<Cap> // EXACT_MATCH, SEMANTIC, STRUCTURED, GRAPH_TRAVERSAL
cost_per_query: float32 // Monetary cost per query (for optimization)
max_qps: uint32 // Rate limit
version: SemanticVersion // Schema version for compatibility checking
}

8.3.1.1 Source Selection Function#
Given a decomposed query $Q$ and a deadline $\delta$, the source selector identifies the optimal subset of sources $S^* \subseteq \mathcal{S}$ that maximizes expected evidence quality under latency and cost constraints:

$$S^* = \arg\max_{S \subseteq \mathcal{S}} \sum_{s \in S} \mathbb{E}\left[\mathrm{quality}(s, Q)\right]$$

subject to:

$$\max_{s \in S} \mathrm{latency}_{p99}(s) \le \delta, \qquad \sum_{s \in S} \mathrm{cost}(s) \le C_{\max}$$

In practice, this optimization is approximated by a greedy algorithm that selects sources in descending order of expected utility-per-latency-cost ratio.
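The greedy approximation can be sketched as follows; the source dictionaries and the exact utility-per-(latency, cost) ratio are illustrative assumptions about how the registry exposes its metadata:

```python
def greedy_select(sources, deadline_ms, max_cost):
    """Pick sources in descending utility/(latency * cost) order until the
    cost ceiling is hit, skipping any source whose p99 latency exceeds the
    deadline. Each source: dict with source_id, utility, p99_latency_ms, cost."""
    ranked = sorted(
        sources,
        key=lambda s: s["utility"] / (s["p99_latency_ms"] * max(s["cost"], 1e-9)),
        reverse=True,
    )
    chosen, spent = [], 0.0
    for s in ranked:
        if s["p99_latency_ms"] <= deadline_ms and spent + s["cost"] <= max_cost:
            chosen.append(s["source_id"])
            spent += s["cost"]
    return chosen
```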
8.3.2 Parallel Fan-Out with Deadline-Aware Source Selection#
Retrieval across multiple sources is executed in parallel fan-out with per-source deadline allocation:
ALGORITHM: ParallelFanOutRetrieval
INPUT: subqueries: List<SubQuery>, sources: List<SourceDescriptor>,
global_deadline_ms: uint32
OUTPUT: aggregated: List<(SourceID, List<EvidenceFragment>)>
1. global_deadline ← NOW() + global_deadline_ms
2. source_assignments ← EMPTY_MAP<SourceID, List<SubQuery>>
3. FOR EACH sq IN subqueries:
4. compatible ← FILTER(sources, λ s: s.query_capabilities ∩ sq.required_caps ≠ ∅
5. AND ACL_CHECK(s, caller)
6. AND s.latency_tier.p99 ≤ remaining_time(global_deadline))
7. selected ← TOP_BY(compatible, λ s: s.authority_tier * s.freshness_score, max=3)
8. FOR EACH s IN selected:
9. source_assignments[s.source_id].append(sq)
10.
11. aggregated ← PARALLEL_FOR_EACH (source_id, sqs) IN source_assignments:
12. per_source_deadline ← MIN(
13. SourceRegistry.latency_tier(source_id).p99 * 1.5,
14. remaining_time(global_deadline) - SAFETY_MARGIN_MS
15. )
16. TRY:
17. results ← SourceClient(source_id).retrieve(sqs, deadline=per_source_deadline)
18. RETURN (source_id, results)
19. ON_TIMEOUT:
20. cached ← RetrievalCache.get(source_id, sqs)
21. IF cached IS NOT EMPTY:
22. RETURN (source_id, cached WITH staleness_flag=TRUE)
23. ELSE:
24. RETURN (source_id, EMPTY WITH source_failure_flag=TRUE)
25. ON_ERROR(e):
26. CircuitBreaker.record_failure(source_id, e)
27. RETURN (source_id, EMPTY WITH error=e)
28. END_PARALLEL
29.
30. RETURN aggregated8.3.3 Source Conflict Resolution: Authority Ranking, Temporal Precedence, Provenance Chain#
When multiple sources return evidence that is semantically overlapping but factually divergent, the system must resolve conflicts deterministically:
8.3.3.1 Conflict Resolution Priority Stack#
- Authority Tier (highest priority): $\mathrm{CANONICAL} > \mathrm{CURATED} > \mathrm{DERIVED} > \mathrm{EPHEMERAL}$
- Temporal Precedence: Among equal authority tiers, the most recently updated evidence wins: $e^* = \arg\max_{e \in E_{\mathrm{conflict}}} \mathrm{updated\_at}(e)$
- Provenance Chain Length: Shorter provenance chains (fewer derivation steps from ground truth) are preferred: $e^* = \arg\min_{e \in E_{\mathrm{conflict}}} |\mathrm{provenance\_chain}(e)|$
- Human Annotation Override: If any conflicting evidence carries a human annotation or expert correction, that evidence takes precedence regardless of other signals.
8.3.3.2 Formal Conflict Resolution Score#
$$\mathrm{resolve}(e) = \alpha_1 \cdot \mathrm{tier}(e) + \alpha_2 \cdot \mathrm{recency}(e) + \alpha_3 \cdot \frac{1}{|\mathrm{provenance\_chain}(e)|} + \alpha_4 \cdot \mathbb{1}[\mathrm{human\_annotated}(e)]$$

where $\alpha_4 \gg \alpha_1 \gg \alpha_2 \gg \alpha_3$ ensures the priority stack is respected, and $\mathrm{tier}(\cdot)$ maps authority tiers to numeric values.
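Rather than tuning weight magnitudes, the priority stack can be implemented directly as a lexicographic sort key, which guarantees each signal is consulted only to break ties among the signals above it. The fragment dict shape below is an illustrative assumption:

```python
TIER_RANK = {"CANONICAL": 3, "CURATED": 2, "DERIVED": 1, "EPHEMERAL": 0}

def resolve_conflict(fragments):
    """Deterministic winner under the priority stack: human annotation
    override first, then authority tier, then recency, then shortest
    provenance chain. Each fragment: dict with has_human_annotation,
    authority_tier, updated_at (comparable), provenance_chain (list)."""
    return max(
        fragments,
        key=lambda f: (
            f["has_human_annotation"],          # human override beats everything
            TIER_RANK[f["authority_tier"]],     # authority tier
            f["updated_at"],                    # temporal precedence
            -len(f["provenance_chain"]),        # shorter chain preferred
        ),
    )
```

A lexicographic key sidesteps the fragile business of choosing weight magnitudes that satisfy the ordering constraint.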
8.4 Metadata Filtering, Faceted Retrieval, and ACL-Aware Evidence Scoping#
8.4.1 Metadata Filtering#
Every indexed document carries a typed metadata record:
DocumentMetadata {
source_id: SourceID
content_type: enum { CODE, DOCUMENTATION, CONFIG, LOG, ANNOTATION, SCHEMA, RUNBOOK }
language: LanguageCode
created_at: Timestamp
updated_at: Timestamp
author: PrincipalID
classification: SecurityClassification // PUBLIC, INTERNAL, CONFIDENTIAL, RESTRICTED
tags: Set<string>
version: SemanticVersion
deprecation_status: enum { ACTIVE, DEPRECATED, ARCHIVED }
}

Metadata filters are applied as pre-retrieval constraints to reduce the candidate set before expensive scoring operations. The filter predicate is expressed as:

$$\phi(d) = \bigwedge_{i} \big(d.\mathrm{meta}[f_i] \;\mathrm{op}_i\; v_i\big)$$

where $\mathrm{op}_i \in \{=, \ne, <, \le, >, \ge, \in, \not\in\}$.
8.4.2 Faceted Retrieval#
Faceted retrieval returns not only ranked results but also aggregated counts per metadata facet, enabling the agent to understand the distribution of evidence across dimensions:

$$\mathrm{facet}(R, f) = \big\{\, (v, \; |\{d \in R : d.\mathrm{meta}[f] = v\}|) : v \in \mathrm{dom}(f) \,\big\}$$
This is critical for agent planning: if the agent discovers that 90% of results are from deprecated documentation, it can dynamically adjust its query or source selection.
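Facet aggregation itself is a small amount of code. A sketch over dict-shaped results, counting values per requested facet so the agent can inspect the evidence distribution before consuming it:

```python
from collections import Counter

def facet_counts(results, facets):
    """results: list of metadata dicts; facets: list of field names.
    Returns field -> Counter of value frequencies."""
    return {f: Counter(doc[f] for doc in results) for f in facets}
```

For example, an agent seeing that most results carry deprecation_status = DEPRECATED can reformulate its query or tighten its source policy before generation.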
8.4.3 ACL-Aware Evidence Scoping#
Every retrieval request carries a caller identity (principal, roles, groups, attributes). The ACL enforcement layer ensures that every returned fragment is authorized for the caller:

$$\forall e \in \mathrm{response.fragments}: \;\mathrm{authorized}(\mathrm{caller},\; e.\mathrm{source\_id},\; e.\mathrm{classification})$$
ACL evaluation is performed at the earliest possible stage in the pipeline (ideally as a pre-filter on the index) to avoid retrieving, scoring, and then discarding restricted content.
Implementation strategy:
- Attribute-Based Access Control (ABAC): Policies expressed as predicates over caller attributes and document attributes.
- Row-level security in SQL sources: WHERE tenant_id = :caller_tenant injected into all generated queries.
- Vector index partitioning: Separate ANN indices per security classification, with routing based on caller clearance.
8.5 Lineage and Graph Context Retrieval: Traversing Dependency, Ownership, and Causal Graphs#
8.5.1 Motivation#
Many agentic tasks require not just what a document says, but how it relates to other entities: which service depends on this library, who owns this configuration, what upstream change caused this failure. This relational context is captured in lineage and dependency graphs.
8.5.2 Graph Types#
| Graph Type | Nodes | Edges | Use Case |
|---|---|---|---|
| Dependency Graph | Services, libraries, packages | depends_on, imports | Impact analysis, upgrade planning |
| Ownership Graph | Teams, services, configs | owns, maintains | Escalation, approval routing |
| Causal Graph | Events, changes, incidents | caused_by, triggered | Root cause analysis |
| Data Lineage | Tables, columns, pipelines | derived_from, transforms | Data quality, compliance |
| Call Graph | Functions, methods, endpoints | calls, invokes | Performance analysis, refactoring |
8.5.3 Graph Traversal Retrieval#
Given an anchor entity $v_0$ (identified from the user query), the graph retriever performs bounded traversal:

$$\mathrm{context}(v_0) = \{\, v : \mathrm{dist}(v_0, v) \le k_{\max} \;\wedge\; \mathrm{rel}(v_0, v) \ge \tau \,\}$$
8.5.3.1 Relevance-Weighted Graph Traversal#
Not all graph neighbors are equally relevant. Edge weights encode relevance decay along the traversal path:

$$\mathrm{rel}(v_0, v) = \prod_{e \in \mathrm{path}(v_0, v)} w(e)$$

where $w(e) \in (0, 1]$ is the edge weight (strength of dependency, recency of interaction, etc.).
Traversal terminates when the path depth reaches the bound $k_{\max}$ (max_depth) or the accumulated relevance falls below the threshold $\tau$ (min_relevance):
ALGORITHM: RelevanceWeightedGraphTraversal
INPUT: anchor: EntityID, graph: Graph, max_depth: int,
min_relevance: float, edge_types: Set<EdgeType>
OUTPUT: context_entities: List<(EntityID, float, Path)>
1. frontier ← PriorityQueue() // Max-heap by relevance
2. frontier.push((anchor, 1.0, [anchor]))
3. visited ← {anchor}
4. context_entities ← EMPTY_LIST
5.
6. WHILE frontier IS NOT EMPTY:
7. (current, rel, path) ← frontier.pop()
8. IF current ≠ anchor:
9. APPEND (current, rel, path) TO context_entities
10. IF |path| ≥ max_depth:
11. CONTINUE
12. FOR EACH (current, neighbor, edge) IN graph.edges(current):
13. IF edge.type ∉ edge_types OR neighbor ∈ visited:
14. CONTINUE
15. new_rel ← rel * edge.weight
16. IF new_rel < min_relevance:
17. CONTINUE
18. visited.add(neighbor)
19. frontier.push((neighbor, new_rel, path + [neighbor]))
20.
21. SORT context_entities BY relevance DESCENDING
22. RETURN context_entities

8.6 Historical Usage Pattern Retrieval: What Was Previously Useful for Similar Queries#
8.6.1 Rationale#
The most powerful retrieval signal that conventional RAG systems entirely ignore is historical utility: which evidence fragments actually contributed to successful agent task completions for queries similar to the current one. This creates a feedback loop where the retrieval system learns from downstream agent behavior.
8.6.2 Formalization#
Let $H = \{(q_i, d_i, u_i)\}_{i=1}^{n}$ be the history log, where:
- $q_i$ is a past query
- $d_i$ is a document that was retrieved and included in the agent's context
- $u_i \in \{0, 1\}$ or $u_i \in [0, 1]$ is the utility signal: whether the document contributed to a successful task outcome (binary) or a graded utility score
For a new query $q$, the historical utility score of candidate document $d$ is:

$$u_{\mathrm{hist}}(q, d) = \frac{\sum_{i : d_i = d} \mathrm{sim}(q, q_i) \cdot u_i}{\epsilon + \sum_{i : d_i = d} \mathrm{sim}(q, q_i)}$$

where $\mathrm{sim}(q, q_i)$ is the semantic similarity between the current and historical queries, and $\epsilon$ prevents division by zero.
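The historical utility score translates directly into code. A sketch assuming the history is a list of (past_query, used_docs) pairs, where used_docs maps document IDs to their recorded utility and `query_sim` computes the similarity between the current query and a past one:

```python
def historical_utility(query_sim, history, doc_id, eps=1e-6):
    """Similarity-weighted mean utility of doc_id across similar past queries.
    query_sim(past_query) -> similarity of the current query to past_query;
    history: list of (past_query, {doc_id: utility}) pairs."""
    num, den = 0.0, eps  # eps prevents division by zero
    for past_q, used_docs in history:
        if doc_id in used_docs:
            sim = query_sim(past_q)
            num += sim * used_docs[doc_id]
            den += sim
    return num / den
```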
8.6.3 Utility Signal Sources#
| Signal | Measurement | Granularity |
|---|---|---|
| Agent task success | Task completed without error or human override | Binary per task |
| Human approval | Human reviewer accepted the agent's output | Binary per output |
| Citation in output | Agent explicitly referenced/cited the document | Binary per document |
| No-hallucination verification | Output was verified against the retrieved evidence | Binary per output |
| Downstream action success | Tool call triggered by retrieved evidence succeeded | Binary per action |
| User satisfaction | Explicit feedback (thumbs up/down) | Binary or ordinal |
8.6.4 Pseudo-Algorithm: Historical Usage Retrieval#
ALGORITHM: HistoricalUsageRetrieval
INPUT: query: string, candidate_docs: List<DocID>, history: UsageHistory,
sim_threshold: float, k: int
OUTPUT: augmented_scores: Map<DocID, float>
1. q_emb ← Encoder.encode(query)
2. similar_past_queries ← history.search_queries(q_emb, sim_threshold, max=100)
3. augmented_scores ← EMPTY_MAP<DocID, float>
4.
5. FOR EACH doc_id IN candidate_docs:
6. numerator ← 0.0
7. denominator ← ε
8. FOR EACH (past_q, past_q_emb, docs_used) IN similar_past_queries:
9. sim ← cosine_similarity(q_emb, past_q_emb)
10. IF doc_id IN docs_used:
11. utility ← docs_used[doc_id].utility_score
12. numerator += sim * utility
13. denominator += sim
14. augmented_scores[doc_id] ← numerator / denominator
15.
16. RETURN augmented_scores

8.7 Human Annotation Retrieval: Curated Labels, Expert Corrections, Institutional Knowledge#
8.7.1 Annotation Types#
Human annotations represent the highest-authority evidence layer in the retrieval hierarchy. They include:
- Expert corrections: Amendments to automated documentation or generated content, carrying explicit provenance of the correcting expert.
- Curated labels: Classification tags, quality assessments, relevance judgments applied by domain experts.
- Institutional knowledge: Undocumented conventions, tribal knowledge, design rationale, post-mortem findings that exist only in human memory and are explicitly captured into the annotation store.
- Review comments: Code review feedback, architecture decision records (ADRs), design review outcomes.
8.7.2 Annotation Schema#
Annotation {
annotation_id: UUID
target: AnnotationTarget {
target_type: enum { DOCUMENT, CODE_SYMBOL, CONFIG_KEY, SERVICE, INCIDENT }
target_id: string
target_version: SemanticVersion // Pinned to specific version of the target
}
annotation_type: enum { CORRECTION, LABEL, KNOWLEDGE, REVIEW, DEPRECATION_NOTICE }
content: string
author: PrincipalID
author_expertise: ExpertiseLevel // DOMAIN_EXPERT, SENIOR_ENGINEER, TEAM_LEAD, etc.
created_at: Timestamp
expires_at: Timestamp | NULL
confidence: float32 // Author's self-assessed confidence
supersedes: List<AnnotationID> // Previous annotations this one replaces
provenance: ProvenanceRecord
}

8.7.3 Retrieval Strategy#
Annotations are retrieved by target entity (not by text similarity). When the retrieval pipeline identifies a document $d$ as a candidate, it queries the annotation store for all annotations targeting $d$ or entities referenced by $d$:

$$A(d) = \{\, a : a.\mathrm{target} = d \;\lor\; a.\mathrm{target} \in \mathrm{refs}(d) \,\}$$
Annotations are injected into the evidence fragment as attached metadata, not as separate retrieval results, preserving the association between base evidence and human corrections.
8.8 Code-Derived Enrichment: AST Analysis, Symbol Resolution, Dependency Graph Retrieval#
8.8.1 Motivation#
For agentic systems operating on software engineering tasks, code is not merely text—it is a structured artifact with rich semantic structure extractable through static analysis. Code-derived enrichment transforms raw source code into high-signal retrieval evidence.
8.8.2 Enrichment Layers#
| Layer | Analysis | Output |
|---|---|---|
| AST Analysis | Parse source code into Abstract Syntax Trees | Function signatures, class hierarchies, control flow |
| Symbol Resolution | Resolve identifiers to their definitions across files | Cross-file references, import chains |
| Type Analysis | Extract type annotations, inferred types, interfaces | Typed API contracts, compatibility constraints |
| Dependency Graph | Map import/require/use statements | Package-level and module-level dependency DAGs |
| Call Graph | Static or dynamic call graph extraction | Function invocation paths, entry points |
| Diff Analysis | Parse version control diffs | Changed symbols, affected interfaces, migration patterns |
| Documentation Binding | Link docstrings/comments to symbols | Symbol-to-documentation mapping |
8.8.3 Code Chunking Strategy#
Code requires structural chunking rather than naive token-window chunking:
ALGORITHM: StructuralCodeChunking
INPUT: source_file: string, ast: AST, max_chunk_tokens: int
OUTPUT: chunks: List<CodeChunk>
1. chunks ← EMPTY_LIST
2. top_level_nodes ← ast.get_top_level_declarations()
// Classes, functions, modules, constants
3. FOR EACH node IN top_level_nodes:
4. node_text ← source_file.extract(node.span)
5. IF token_count(node_text) ≤ max_chunk_tokens:
6. chunk ← CodeChunk {
7. content: node_text,
8. symbol: node.qualified_name,
9. type: node.type, // FUNCTION, CLASS, METHOD, etc.
10. signature: node.signature,
11. dependencies: node.imports ∪ node.references,
12. file_path: source_file.path,
13. line_range: node.span
14. }
15. APPEND chunk TO chunks
16. ELSE:
17. // Recursively chunk large classes into method-level chunks
18. sub_chunks ← StructuralCodeChunking(node_text, node.sub_ast, max_chunk_tokens)
19. // Attach parent context (class signature) to each sub-chunk
20. FOR EACH sc IN sub_chunks:
21. sc.parent_context ← node.signature
22. EXTEND chunks WITH sub_chunks
23. RETURN chunks
8.8.4 Symbol-Aware Retrieval#
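As a concrete illustration, here is a minimal Python sketch of structural chunking built on the standard `ast` module. The `CodeChunk` fields and the whitespace token counter are simplified stand-ins for the production schema and tokenizer, and only top-level function and class declarations are handled:

```python
import ast
from dataclasses import dataclass

@dataclass
class CodeChunk:
    content: str
    symbol: str
    node_type: str            # "FunctionDef", "ClassDef", ...
    line_range: tuple
    parent_context: str = ""  # enclosing class header, if any

def token_count(text: str) -> int:
    # Stand-in for a real tokenizer: whitespace-separated tokens.
    return len(text.split())

def structural_chunks(source: str, max_chunk_tokens: int = 200) -> list:
    """Chunk a source file along top-level declarations; split oversized
    classes into method-level chunks carrying the class header as context."""
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:
        text = ast.get_source_segment(source, node) or ""
        name = getattr(node, "name", type(node).__name__)
        if token_count(text) <= max_chunk_tokens or not isinstance(node, ast.ClassDef):
            chunks.append(CodeChunk(text, name, type(node).__name__,
                                    (node.lineno, node.end_lineno)))
        else:
            # Oversized class: one chunk per member, with parent context.
            header = text.splitlines()[0]
            for sub in node.body:
                sub_text = ast.get_source_segment(source, sub) or ""
                sub_name = f"{name}.{getattr(sub, 'name', type(sub).__name__)}"
                chunks.append(CodeChunk(sub_text, sub_name, type(sub).__name__,
                                        (sub.lineno, sub.end_lineno),
                                        parent_context=header))
    return chunks
```

The recursion in the pseudocode is flattened here to a single class-to-method split, which covers the common case.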
When the user query or agent task references a specific symbol (function name, class, variable), the retrieval system performs symbol resolution before text retrieval:
Symbol resolution returns the complete evidence envelope for a symbol: its definition, all call sites, associated tests, documentation, and human annotations.
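A minimal in-memory sketch of such a symbol index is shown below; the class, its methods, and the location strings are hypothetical illustrations of the idea, not the production interface:

```python
from collections import defaultdict
from dataclasses import dataclass, field

@dataclass
class SymbolEnvelope:
    symbol: str
    definition: str = ""
    call_sites: list = field(default_factory=list)
    tests: list = field(default_factory=list)
    docs: list = field(default_factory=list)

class SymbolIndex:
    """Toy index mapping qualified symbol names to their evidence envelope."""
    def __init__(self):
        self._defs = {}
        self._refs = defaultdict(list)   # symbol -> call-site locations
        self._tests = defaultdict(list)
        self._docs = defaultdict(list)

    def add_definition(self, symbol, location):
        self._defs[symbol] = location

    def add_call_site(self, symbol, location):
        self._refs[symbol].append(location)

    def add_test(self, symbol, location):
        self._tests[symbol].append(location)

    def resolve(self, symbol) -> SymbolEnvelope:
        return SymbolEnvelope(symbol,
                              definition=self._defs.get(symbol, ""),
                              call_sites=list(self._refs[symbol]),
                              tests=list(self._tests[symbol]))

def symbols_in_query(index: SymbolIndex, query: str) -> list:
    # Naive detection: any indexed name appearing verbatim in the query.
    return [s for s in index._defs if s in query]
```

Detected symbols are resolved first; plain text retrieval then runs only on the residual query.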
8.9 Live Runtime Inspection: Querying Logs, Metrics, Traces, and System State as Evidence#
8.9.1 The Observability-as-Evidence Principle#
An agent that cannot observe the system it operates on cannot reliably reason about or improve it. Live runtime data—logs, metrics, distributed traces, configuration state, deployment status, health checks—must be queryable as first-class evidence sources within the retrieval pipeline.
8.9.2 Runtime Evidence Sources#
| Source | Query Interface | Latency Tier | Freshness |
|---|---|---|---|
| Structured Logs | Full-text search + metadata filters (e.g., Elasticsearch) | WARM (50–200ms) | Near-real-time (seconds) |
| Metrics | PromQL, MQL, or time-series query (e.g., Prometheus, Datadog) | WARM (50–200ms) | Near-real-time (15s–1m) |
| Distributed Traces | Trace ID lookup, span query (e.g., Jaeger, Tempo) | WARM (100–500ms) | Near-real-time |
| Configuration State | Key-value lookup or diff (e.g., etcd, Consul) | HOT (<10ms) | Real-time |
| Deployment State | API query (e.g., Kubernetes API, CI/CD pipeline) | WARM (100–500ms) | Near-real-time |
| Health Checks | HTTP probe or gRPC health service | HOT (<10ms) | Real-time |
| Feature Flags | Flag evaluation API | HOT (<10ms) | Real-time |
8.9.3 Temporal Scoping#
Runtime evidence must be temporally scoped to the relevant time window. The agent or retrieval planner specifies a window $[t_{\min}, t_{\max}]$, where $t_{\min}$ and $t_{\max}$ are derived from the task context (e.g., "the last 30 minutes" for an incident, "the last deployment" for a regression, "the last 24 hours" for a trend analysis).
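A sketch of this derivation in Python; the mapping from task kind to lookback window is an illustrative placeholder, not a recommended configuration:

```python
from datetime import datetime, timedelta

# Hypothetical lookback per task kind; the endpoints (t_min, t_max)
# are what the planner passes to each runtime source query.
LOOKBACK = {
    "incident": timedelta(minutes=30),
    "regression": timedelta(hours=6),   # roughly "since the last deployment"
    "trend": timedelta(hours=24),
}

def time_window(task_kind: str, now: datetime = None):
    """Derive (t_min, t_max) from the task context."""
    now = now or datetime.utcnow()
    delta = LOOKBACK.get(task_kind, timedelta(hours=1))
    return (now - delta, now)
```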
8.9.4 Pseudo-Algorithm: Live Runtime Evidence Retrieval#
ALGORITHM: LiveRuntimeRetrieval
INPUT: query: string, time_window: TimeWindow, runtime_sources: List<RuntimeSource>,
deadline_ms: uint32, max_results_per_source: int
OUTPUT: runtime_evidence: List<RuntimeEvidenceFragment>
1. deadline ← NOW() + deadline_ms
2. query_plan ← RuntimeQueryPlanner.plan(query, time_window, runtime_sources)
// Translates NL query into source-specific queries:
// e.g., PromQL for metrics, Elasticsearch DSL for logs, trace ID for traces
3.
4. runtime_evidence ← PARALLEL_FOR_EACH (source, source_query) IN query_plan:
5. TRY:
6. raw ← source.client.execute(
7. source_query,
8. time_range=time_window,
9. limit=max_results_per_source,
10. timeout=remaining_time(deadline) * 0.8
11. )
12. fragments ← RuntimeEvidenceFormatter.format(raw, source.type)
13. // Attach provenance: source system, query used, time window, freshness
14. FOR EACH f IN fragments:
15. f.provenance ← ProvenanceRecord {
16. source: source.id,
17. query: source_query,
18. retrieved_at: NOW(),
19. data_timestamp: f.timestamp,
20. freshness: NOW() - f.timestamp
21. }
22. RETURN fragments
23. ON_TIMEOUT:
24. RETURN EMPTY WITH timeout_flag=TRUE
25. ON_ERROR(e):
26. LOG_WARNING("Runtime source {source.id} failed: {e}")
27. RETURN EMPTY WITH error_flag=TRUE
28. END_PARALLEL
29.
30. RETURN FLATTEN(runtime_evidence)
8.9.5 Safety and Cost Controls#
- Query cost estimation: Before executing runtime queries, the planner estimates the scan cost (e.g., bytes scanned in a log store, cardinality of a metrics query) and rejects queries exceeding the cost budget.
- Sampling: For high-volume sources (millions of log lines), the retriever applies statistical sampling with confidence intervals rather than exhaustive scan.
- Rate limiting: Runtime queries are subject to per-source rate limits to avoid impacting production observability infrastructure.
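The first two controls can be sketched as small gate functions; the budget defaults below are illustrative placeholders, not recommendations:

```python
def admit_query(estimated_bytes: int, estimated_cardinality: int,
                byte_budget: int = 1_000_000_000,
                cardinality_budget: int = 10_000) -> bool:
    """Cost gate: reject runtime queries whose estimated scan cost
    (bytes scanned, metric cardinality) exceeds the budget."""
    return (estimated_bytes <= byte_budget
            and estimated_cardinality <= cardinality_budget)

def sample_rate(estimated_lines: int, max_lines: int = 100_000) -> float:
    """Statistical sampling rate for high-volume log scans: read at most
    max_lines lines, uniformly sampled."""
    return min(1.0, max_lines / estimated_lines)
```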
8.10 Ranking and Scoring#
8.10.1 Multi-Signal Ranking: Authority × Freshness × Relevance × Execution Utility#
The final ranking of evidence fragments combines multiple orthogonal signals into a unified score. Let $e$ be an evidence fragment, $q$ the query, and $\tau$ the task context. The composite ranking score is:

$S(e, q, \tau) = \sum_{i} w_i \cdot S_i(e, q, \tau)$

where the $S_i$ are normalized signal functions and the $w_i$ are task-dependent weights.
8.10.1.1 Signal Definitions#
Authority Signal $S_A(e)$:

$S_A(e) = \text{tier\_score}(\text{tier}(e)) + \beta \cdot \mathbb{1}[\text{human\_annotated}(e)]$

where $\text{tier\_score}$ maps each authority tier to a value in $[0, 1]$ and $\beta$ is a bonus for human-annotated evidence.
Freshness Signal $S_F(e)$:

$S_F(e) = \exp\left(-\frac{\text{now} - t_e}{\tau_s}\right)$

where $t_e$ is the evidence timestamp and $\tau_s$ is a source-specific decay constant (e.g., 24 hours for logs, 30 days for documentation, 365 days for canonical specifications). This exponential decay ensures that evidence freshness degrades smoothly.
Relevance Signal $S_R(e, q)$:

$S_R(e, q) = \alpha \cdot \text{BM25}^{\text{norm}}(e, q) + (1 - \alpha) \cdot \text{sim}^{\text{norm}}(v_e, v_q)$

where $\alpha \in [0, 1]$ and the norm superscript denotes min-max normalization to $[0, 1]$.
Execution Utility Signal $S_U(e, \tau)$:

$S_U(e, \tau) = \gamma_1 \cdot \text{exec}(e, \tau) + \gamma_2 \cdot \text{hist}(e) - \gamma_3 \cdot \text{generic}(e)$

where:
- $\text{exec}(e, \tau)$ measures whether $e$ contains executable information (code, commands, configurations) relevant to the current task type $\tau$.
- $\text{hist}(e)$ is the historical usage utility from §8.6.
- $\text{generic}(e)$ penalizes overly generic evidence (e.g., high-level overviews when the task requires implementation detail).
8.10.1.2 Full Composite Score#
$S(e, q, \tau) = w_R \cdot S_R(e, q) + w_A \cdot S_A(e) + w_F \cdot S_F(e) + w_U \cdot S_U(e, \tau)$

with the constraint $w_R + w_A + w_F + w_U = 1$, $w_i \geq 0$.
Default weight configuration (tunable per deployment):
| Signal | Weight | Rationale |
|---|---|---|
| Relevance ($w_R$) | 0.45 | Primary driver of evidence quality |
| Authority ($w_A$) | 0.25 | Canonical sources are preferred |
| Freshness ($w_F$) | 0.15 | Stale evidence causes failures |
| Execution Utility ($w_U$) | 0.15 | Agent needs actionable evidence |
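A direct Python sketch of the composite score with these default weights; the fragment field names are illustrative, and freshness is derived from the timestamp via the exponential decay described above:

```python
import math
import time

def composite_score(fragment: dict, weights=None, now=None) -> float:
    """Composite S(e, q, tau). `fragment` carries pre-computed, normalized
    per-signal scores, except freshness, which is derived from its
    timestamp and a source-specific decay constant (in seconds)."""
    w = weights or {"relevance": 0.45, "authority": 0.25,
                    "freshness": 0.15, "utility": 0.15}
    now = now if now is not None else time.time()
    age_s = now - fragment["timestamp"]
    freshness = math.exp(-age_s / fragment["decay_constant_s"])
    return (w["relevance"] * fragment["relevance"]
            + w["authority"] * fragment["authority"]
            + w["freshness"] * freshness
            + w["utility"] * fragment["utility"])
```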
8.10.2 Learned Ranking Models: LTR with Agent Feedback Signals#
8.10.2.1 Learning-to-Rank (LTR) Framework#
When sufficient labeled data exists (from agent feedback loops, human annotations, or implicit utility signals), the ranking function can be replaced with a learned model trained via Learning-to-Rank (LTR).
Feature vector for a (query, document) pair:

$\mathbf{x}(q, d) = \big[S_R(d, q),\ S_A(d),\ S_F(d),\ S_U(d, \tau),\ \text{BM25}(d, q),\ \cos(v_d, v_q),\ \ldots\big]$
Training objectives:
- Pointwise: Predict the absolute relevance grade via regression or classification: $\min_f \sum_{(q,d)} \ell\big(f(\mathbf{x}(q, d)),\ y_{q,d}\big)$
- Pairwise: Learn to correctly order document pairs: $\min_f \sum_{d_i \succ d_j} \log\big(1 + e^{-\sigma(f(\mathbf{x}_i) - f(\mathbf{x}_j))}\big)$
- Listwise (e.g., LambdaMART, ApproxNDCG): Optimize directly for ranking metrics such as NDCG over the full result list.
In practice, LambdaMART (gradient-boosted trees with lambda gradients) provides the best balance of accuracy, interpretability, and inference speed for retrieval re-ranking:
$\lambda_{ij} = \frac{-\sigma}{1 + e^{\sigma(s_i - s_j)}} \cdot |\Delta \text{NDCG}_{ij}|$

where $|\Delta \text{NDCG}_{ij}|$ is the change in NDCG from swapping documents $i$ and $j$ in the ranking.
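The lambda gradient can be computed directly from the DCG definitions of §8.13.2.3; the helper names below are illustrative:

```python
import math

def dcg_contrib(rel: int, rank: int) -> float:
    """Contribution of one document with graded relevance `rel` at `rank`."""
    return (2 ** rel - 1) / math.log2(rank + 1)

def delta_ndcg(rels, i, j, idcg) -> float:
    """|delta NDCG| from swapping the documents at (1-based) ranks i and j."""
    before = dcg_contrib(rels[i - 1], i) + dcg_contrib(rels[j - 1], j)
    after = dcg_contrib(rels[j - 1], i) + dcg_contrib(rels[i - 1], j)
    return abs(after - before) / idcg

def lambda_ij(s_i: float, s_j: float, dndcg: float, sigma: float = 1.0) -> float:
    """Pair gradient for d_i more relevant than d_j, with current model
    scores s_i, s_j: pushes s_i up and s_j down, weighted by |delta NDCG|."""
    return -sigma / (1.0 + math.exp(sigma * (s_i - s_j))) * dndcg
```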
8.10.2.2 Agent Feedback Integration#
The LTR model is continuously retrained on agent feedback signals:
ALGORITHM: LTRFeedbackLoop
INPUT: completed_tasks: Stream<CompletedTask>
OUTPUT: updated_model: LTRModel
1. training_buffer ← EMPTY_LIST
2. FOR EACH task IN completed_tasks:
3. FOR EACH evidence_fragment IN task.retrieved_evidence:
4. features ← FeatureExtractor.extract(task.query, evidence_fragment)
5. utility ← UtilityEstimator.estimate(
6. evidence_fragment,
7. task.outcome, // SUCCESS, FAILURE, PARTIAL
8. task.human_feedback, // APPROVED, REJECTED, CORRECTED
9. task.citation_map // Which fragments were cited in output
10. )
11. APPEND (features, utility) TO training_buffer
12.
13. IF |training_buffer| ≥ RETRAIN_THRESHOLD:
14. new_model ← LambdaMART.train(training_buffer, validation_split=0.2)
15. IF Evaluator.ndcg(new_model) > Evaluator.ndcg(current_model) + MIN_IMPROVEMENT:
16. current_model ← new_model
17. ModelRegistry.deploy(new_model, version=NEXT_VERSION)
18. training_buffer ← EMPTY_LIST
8.10.3 Diversity-Aware Ranking: Maximal Marginal Relevance (MMR)#
8.10.3.1 The Redundancy Problem#
Naïve top-$k$ ranking by composite score often returns near-duplicate evidence fragments from the same or similar sources. This wastes token budget and reduces the information density of the context window. Maximal Marginal Relevance (MMR) addresses this by balancing relevance against diversity.
8.10.3.2 MMR Formulation#
Given a query $q$, a candidate set $C$, and a selected set $S$ (initially empty), MMR iteratively selects the next document:

$d^{*} = \arg\max_{d_i \in C \setminus S} \Big[\lambda \cdot \text{rel}(d_i, q) - (1 - \lambda) \cdot \max_{d_j \in S} \text{sim}(d_i, d_j)\Big]$
where:
- $\text{rel}(d_i, q)$ is the relevance score (the composite from §8.10.1)
- $\text{sim}(d_i, d_j)$ is the inter-document similarity (cosine similarity of embeddings)
- $\lambda \in [0, 1]$ controls the relevance-diversity trade-off:
  - $\lambda = 1$: Pure relevance ranking (no diversity)
  - $\lambda = 0$: Pure diversity (maximum dissimilarity from selected set)
  - Intermediate $\lambda$: Typical operating range for agentic retrieval
8.10.3.3 Pseudo-Algorithm: MMR Selection#
ALGORITHM: MMRSelection
INPUT: query: string, candidates: List<(DocID, float)>, embeddings: Map<DocID, Vector>,
lambda: float, k: int
OUTPUT: selected: List<(DocID, float)>
1. S ← EMPTY_LIST // Selected set
2. C ← SET(candidates) // Remaining candidates
3. q_emb ← Encoder.encode(query)
4.
5. WHILE |S| < k AND C IS NOT EMPTY:
6. best_doc ← NULL
7. best_mmr ← -∞
8. FOR EACH (doc_id, rel_score) IN C:
9. IF S IS EMPTY:
10. max_sim ← 0.0
11. ELSE:
12. max_sim ← MAX(
13. cosine_similarity(embeddings[doc_id], embeddings[s.doc_id])
14. FOR s IN S
15. )
16. mmr_score ← lambda * rel_score - (1 - lambda) * max_sim
17. IF mmr_score > best_mmr:
18. best_mmr ← mmr_score
19. best_doc ← (doc_id, rel_score)
20. APPEND best_doc TO S
21. REMOVE best_doc FROM C
22.
23. RETURN S
8.10.3.4 Computational Optimization#
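The MMR selection loop translates directly into runnable Python (pure stdlib, cosine similarity computed inline):

```python
def cosine(u, v) -> float:
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv) if nu and nv else 0.0

def mmr_select(candidates, embeddings, lam=0.7, k=3):
    """candidates: list of (doc_id, relevance); embeddings: doc_id -> vector.
    Greedily picks k documents maximizing lambda*rel - (1-lambda)*max_sim."""
    selected, remaining = [], list(candidates)
    while remaining and len(selected) < k:
        def mmr(item):
            doc_id, rel = item
            max_sim = max((cosine(embeddings[doc_id], embeddings[sid])
                           for sid, _ in selected), default=0.0)
            return lam * rel - (1 - lam) * max_sim
        best = max(remaining, key=mmr)
        selected.append(best)
        remaining.remove(best)
    return selected
```

With two near-duplicate high-relevance candidates, the second pick jumps to a dissimilar document rather than the duplicate.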
Naïve MMR is $O(k^2 \cdot |C|)$ per query. For large candidate sets, this is optimized via:
- Lazy evaluation: Skip candidates whose upper-bound MMR score (assuming zero similarity to selected set) is below the current best.
- Pre-computed similarity matrix: For candidate sets of bounded size, pre-compute all pairwise similarities ($O(|C|^2)$ space) so the inner maximization becomes a table lookup.
- Approximate similarity: Use locality-sensitive hashing (LSH) for approximate similarity instead of exact cosine computation.
8.11 Provenance Tagging: Every Evidence Fragment Carries Source, Timestamp, Confidence, and Chain-of-Custody#
8.11.1 The Provenance Imperative#
In agentic systems, unattributed evidence is inadmissible. If the agent cannot trace where a fact originated, when it was last verified, how it arrived in the context window, and through which transformations it passed, then:
- The agent cannot assess the evidence's reliability.
- The human operator cannot audit the agent's reasoning.
- Conflicting evidence cannot be resolved.
- Hallucination cannot be distinguished from retrieval error.
- Compliance and regulatory requirements cannot be met.
8.11.2 Provenance Record Schema#
ProvenanceRecord {
origin: OriginDescriptor {
source_id: SourceID // Canonical source system
source_type: SourceType // Database, API, vector store, annotation store, etc.
document_id: string // Original document identifier within source
document_version: string // Version or commit hash at extraction time
extraction_method: string // "full_text", "chunk_semantic", "ast_parse", "api_call"
extraction_timestamp: Timestamp
}
transformations: List<TransformationRecord> {
step: string // "chunk", "embed", "reindex", "summarize", "translate"
tool_id: ToolID // Which tool/pipeline performed the transformation
tool_version: SemanticVersion
timestamp: Timestamp
input_hash: SHA256 // Hash of input to transformation
output_hash: SHA256 // Hash of output from transformation
}
retrieval_context: RetrievalContext {
query_hash: SHA256 // Hash of the query that triggered retrieval
retriever_type: string // "bm25", "dense", "graph", "sql", "runtime"
retrieval_timestamp: Timestamp
rank_position: uint32 // Position in the pre-fusion ranked list
relevance_score: float32
fusion_method: string // "rrf", "linear", "learned"
final_rank: uint32 // Position in post-fusion ranked list
}
trust_metadata: TrustMetadata {
authority_tier: AuthorityTier
confidence: float32 // In [0, 1]
verified_by: List<PrincipalID> // Humans who have verified this content
last_verification: Timestamp
staleness: Duration // NOW() - extraction_timestamp
conflict_flag: bool // True if conflicts with other retrieved evidence
}
}
8.11.3 Chain-of-Custody Integrity#
The chain-of-custody is cryptographically verifiable:
$h_i = \text{SHA256}\big(h_{i-1} \,\|\, \text{input\_hash}_i \,\|\, \text{output\_hash}_i \,\|\, \text{tool\_id}_i\big)$

with $h_0 = \text{SHA256}(\text{origin})$.
This ensures that any tampering with the evidence or its transformation history is detectable. The final evidence fragment presented to the agent carries the full chain, and the agent's prefill compiler can include provenance summaries in the context to enable the model to reason about evidence reliability.
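A hash-chain sketch using `hashlib`; the serialization format (pipe-joined fields) is an illustrative choice, not a wire standard, and the record keys mirror the TransformationRecord schema above:

```python
import hashlib

def _h(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def extend_chain(prev_hash: str, record: dict) -> str:
    """Fold one TransformationRecord into the custody chain."""
    payload = "|".join([prev_hash, record["input_hash"],
                        record["output_hash"], record["tool_id"]])
    return _h(payload.encode())

def chain_head(origin: str, transformations: list) -> str:
    """h_0 anchored to the origin descriptor, then one link per transform."""
    h = _h(origin.encode())
    for rec in transformations:
        h = extend_chain(h, rec)
    return h

def verify(origin: str, transformations: list, claimed_head: str) -> bool:
    """Recompute the chain and compare; any tampering changes the head."""
    return chain_head(origin, transformations) == claimed_head
```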
8.11.4 Provenance in the Context Window#
Provenance is injected into the agent's context in a compressed, structured format to minimize token cost while preserving attribution:
[EVIDENCE source="service-catalog-v3" authority=CANONICAL freshness=2h confidence=0.95
retrieval_method="bm25+dense_rrf" rank=1]
ServiceX depends on LibraryY v2.3.1 and requires PostgreSQL ≥14.
[/EVIDENCE]
[EVIDENCE source="incident-log-2024-Q4" authority=CURATED freshness=36h confidence=0.82
retrieval_method="log_search" rank=3 human_verified=true]
ServiceX experienced 5xx errors after LibraryY upgrade to v2.4.0 on 2024-11-15.
[/EVIDENCE]
This structured tagging enables the model to weigh evidence during generation, cite sources in its output, and flag low-confidence evidence for human review.
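Rendering such tags is a straightforward formatting step; a sketch, with fragment keys chosen to mirror the example attributes (the dict layout is illustrative):

```python
def render_evidence(fragment: dict) -> str:
    """Render one fragment in the compressed [EVIDENCE ...] form, pulling
    attribute values from the fragment's provenance fields."""
    attrs = [
        'source="{}"'.format(fragment["source"]),
        "authority={}".format(fragment["authority"]),
        "freshness={}".format(fragment["freshness"]),
        "confidence={:.2f}".format(fragment["confidence"]),
        "rank={}".format(fragment["rank"]),
    ]
    if fragment.get("human_verified"):
        attrs.append("human_verified=true")
    return "[EVIDENCE {}]\n{}\n[/EVIDENCE]".format(" ".join(attrs),
                                                   fragment["text"])
```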
8.12 Retrieval Latency Budget Management: Tiered Deadlines, Early Termination, Cached Fallbacks#
8.12.1 Latency as a First-Class Constraint#
Retrieval latency directly impacts agent loop iteration time, user-perceived responsiveness, and the number of retrieval-action cycles possible within a task deadline. The retrieval engine must operate under a hard latency budget that is allocated across pipeline stages and sources.
8.12.2 Latency Budget Allocation Model#
Given a global retrieval deadline $D$ (in milliseconds), the budget is allocated across pipeline stages:

$D = T_{\text{decompose}} + T_{\text{select}} + T_{\text{fanout}} + T_{\text{fusion}} + T_{\text{rerank}} + T_{\text{mmr}} + T_{\text{provenance}} + T_{\text{margin}}$

Typical allocation for a $D = 500\,\text{ms}$ budget:
| Stage | Budget | Notes |
|---|---|---|
| Query Decomposition | 20ms | LLM call if needed, or cached pattern |
| Source Selection | 5ms | Registry lookup, ACL check |
| Parallel Fan-Out | 350ms | Bottleneck: slowest source |
| Fusion | 10ms | RRF/linear: sub-ms; learned: 5–10ms |
| Cross-Encoder Re-rank | 80ms | GPU batch inference over top-100 candidates |
| MMR + Filtering | 15ms | |
| Provenance Assembly | 10ms | Metadata joins |
| Safety Margin | 10ms | Buffer for network jitter |
8.12.3 Tiered Deadline Enforcement#
Sources are assigned to latency tiers with differentiated deadline enforcement:

$\text{deadline}(s) = \min\big(\text{remaining}(D),\ L^{p99}_{\text{tier}(s)}\big)$

where $L^{p99}_{\text{tier}(s)}$ is the p99 latency expectation for source $s$'s tier.
8.12.4 Early Termination Strategies#
- Sufficient evidence: If the top-$k$ results from completed sources already exceed a quality threshold $\theta$, cancel outstanding source requests: $\min_{i \leq k} S(e_i, q, \tau) \geq \theta$
- Diminishing returns: If each additional completed source contributes a marginal score improvement below a threshold $\epsilon$, stop waiting for the remaining sources.
- Deadline proximity: At $\text{NOW()} \geq \text{deadline} - T_{\text{post\_retrieval}}$, the system forcibly proceeds to fusion with whatever results are available.
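The first two conditions can be sketched as a single check over the sources completed so far; the thresholds are illustrative defaults:

```python
def early_terminate(results, k=5, theta=0.8, eps=0.02) -> bool:
    """results: per-source lists of (doc_id, score), in completion order.
    Returns True once the pooled top-k scores all clear the quality bar,
    or once the latest source moved the pooled top-k sum by less than eps."""
    pooled = sorted((s for r in results for _, s in r), reverse=True)
    # Sufficient evidence: k-th best pooled score already above theta.
    if len(pooled) >= k and pooled[k - 1] >= theta:
        return True
    # Diminishing returns: compare against the pool before the last source.
    if len(results) >= 2:
        prev = sorted((s for r in results[:-1] for _, s in r), reverse=True)
        if len(prev) >= k and sum(pooled[:k]) - sum(prev[:k]) < eps:
            return True
    return False
```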
8.12.5 Cache Hierarchy#
A multi-layer cache hierarchy reduces latency for repeated or similar queries:
| Layer | Key | TTL | Storage |
|---|---|---|---|
| L1: Request-local | Exact query + source + filter hash | Request lifetime | In-process memory |
| L2: Session-local | Query embedding bucket + source | Session duration | Redis / local SSD |
| L3: Global warm | Query cluster centroid + source | 1–24 hours | Distributed cache |
| L4: Pre-computed | Scheduled queries (common patterns) | Until invalidated | Object store |
Cache invalidation is triggered by:
- Source data mutation (change notifications via MCP subscriptions or CDC streams)
- TTL expiry
- Schema version change in the source registry
- Explicit cache flush on configuration update
8.12.6 Pseudo-Algorithm: Deadline-Managed Retrieval with Fallback#
ALGORITHM: DeadlineManagedRetrieval
INPUT: request: EvidenceRequest
OUTPUT: response: EvidenceResponse
1. t_start ← NOW()
2. deadline ← t_start + request.latency_deadline_ms
3.
4. // Phase 1: Cache check (L1 → L2 → L3)
5. cached ← CacheHierarchy.lookup(request.cache_key)
6. IF cached IS NOT NULL AND cached.freshness ≤ request.max_staleness:
7. RETURN cached WITH cache_hit=TRUE
8.
9. // Phase 2: Query decomposition (budgeted)
10. subqueries ← WITH_TIMEOUT(
11. QueryDecomposer.decompose(request.query),
12. budget=T_DECOMPOSE
13. )
14. IF TIMED_OUT: subqueries ← [request.query] // Fallback: use raw query
15.
16. // Phase 3: Source selection
17. sources ← SourceSelector.select(subqueries, request.source_policy, deadline)
18.
19. // Phase 4: Parallel fan-out with per-source deadlines
20. raw_results ← EMPTY_LIST
21. pending ← LAUNCH_PARALLEL(sources, subqueries, deadline)
22.
23. WHILE pending IS NOT EMPTY AND NOW() < deadline - T_POST_RETRIEVAL:
24. completed ← WAIT_ANY(pending, timeout=10ms)
25. IF completed IS NOT NULL:
26. EXTEND raw_results WITH completed.results
27. REMOVE completed FROM pending
28. // Check early termination condition
29. IF EarlyTerminationCheck(raw_results, request):
30. CANCEL_ALL(pending)
31. BREAK
32.
33. // Phase 5: Force-collect remaining results or use cached fallbacks
34. FOR EACH p IN pending:
35. CANCEL(p)
36. fallback ← CacheHierarchy.get_stale(p.source_id, subqueries)
37. IF fallback IS NOT NULL:
38. EXTEND raw_results WITH MARK_STALE(fallback)
39.
40. // Phase 6: Fusion, re-ranking, filtering (budgeted)
41. fused ← FusionEngine.fuse(raw_results, request.ranking_weights)
42. reranked ← CrossEncoderReranker.rerank(fused, request.query, budget=T_RERANK)
43. diverse ← MMRSelector.select(reranked, lambda=request.diversity_constraint, k=TARGET_K)
44. filtered ← ACLFilter.apply(diverse, request.source_policy.acl_scope)
45. provenance_tagged ← ProvenanceAssembler.attach(filtered)
46. truncated ← TokenBudgetAllocator.fit(provenance_tagged, request.token_budget)
47.
48. // Phase 7: Cache write-through
49. CacheHierarchy.write(request.cache_key, truncated, ttl=ComputeTTL(sources))
50.
51. response ← BUILD_RESPONSE(truncated, latency=NOW()-t_start)
52. ObservabilityEmitter.emit(request, response)
53. RETURN response
8.13 Retrieval Quality Evaluation: Recall@K, Precision@K, NDCG, Faithfulness, and Agent Task Success Correlation#
8.13.1 Evaluation Philosophy#
Retrieval quality evaluation in agentic systems must operate at two levels:
- Intrinsic evaluation: How well does the retrieval system return relevant documents? (Standard IR metrics.)
- Extrinsic evaluation: How well does the retrieved evidence enable the agent to complete its task? (Agent task success correlation.)
Optimizing intrinsic metrics without measuring extrinsic impact is insufficient; a retrieval system that achieves perfect Recall@10 but returns evidence that the agent cannot act upon provides no value. Conversely, extrinsic-only evaluation makes debugging retrieval failures impossible. Both levels must be measured, correlated, and optimized jointly.
8.13.2 Intrinsic Metrics#
8.13.2.1 Precision@K#
Measures the fraction of the top-$K$ retrieved documents that are relevant:

$\text{Precision@}K = \frac{|\{\text{relevant}\} \cap \{\text{top-}K\}|}{K}$

Critical for token efficiency: low precision means the agent's context window is polluted with irrelevant evidence.
8.13.2.2 Recall@K#
Measures the fraction of all relevant documents that appear in the top-$K$:

$\text{Recall@}K = \frac{|\{\text{relevant}\} \cap \{\text{top-}K\}|}{|\{\text{relevant}\}|}$

Critical for completeness: low recall means the agent is missing evidence it needs.
8.13.2.3 Normalized Discounted Cumulative Gain (NDCG@K)#
NDCG measures ranking quality with graded relevance judgments.
Discounted Cumulative Gain:

$\text{DCG@}K = \sum_{i=1}^{K} \frac{2^{rel_i} - 1}{\log_2(i + 1)}$

where $rel_i$ is the graded relevance of the document at rank $i$.

Ideal DCG (documents sorted by true relevance):

$\text{IDCG@}K = \sum_{i=1}^{K} \frac{2^{rel^{*}_i} - 1}{\log_2(i + 1)}$

where $rel^{*}$ is the ideal ordering.

NDCG:

$\text{NDCG@}K = \frac{\text{DCG@}K}{\text{IDCG@}K} \in [0, 1]$, where 1 indicates a perfect ranking.
8.13.2.4 Mean Reciprocal Rank (MRR)#
$\text{MRR} = \frac{1}{|Q|} \sum_{q \in Q} \frac{1}{\text{rank}_q}$

where $\text{rank}_q$ is the rank of the first relevant document for query $q$. MRR is critical for agentic retrieval: often the agent needs at least one high-quality document, and the position of that first relevant document determines generation quality.
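The four intrinsic metrics above translate directly into Python; relevance judgments are passed as sets and graded-relevance dicts:

```python
import math

def precision_at_k(retrieved, relevant, k) -> float:
    """Fraction of the top-k retrieved docs that are relevant."""
    return sum(1 for d in retrieved[:k] if d in relevant) / k

def recall_at_k(retrieved, relevant, k) -> float:
    """Fraction of all relevant docs appearing in the top-k."""
    return sum(1 for d in retrieved[:k] if d in relevant) / len(relevant)

def ndcg_at_k(retrieved, grades, k) -> float:
    """grades: doc_id -> graded relevance; NDCG = DCG / IDCG."""
    def dcg(docs):
        return sum((2 ** grades.get(d, 0) - 1) / math.log2(i + 2)
                   for i, d in enumerate(docs[:k]))
    ideal = sorted(grades, key=grades.get, reverse=True)
    idcg = dcg(ideal)
    return dcg(retrieved) / idcg if idcg else 0.0

def mrr(queries) -> float:
    """queries: list of (retrieved_list, relevant_set)."""
    total = 0.0
    for retrieved, relevant in queries:
        for i, d in enumerate(retrieved, start=1):
            if d in relevant:
                total += 1.0 / i
                break
    return total / len(queries)
```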
8.13.3 Faithfulness and Attribution Metrics#
Beyond relevance, agentic retrieval must measure faithfulness: whether the agent's generated output is grounded in the retrieved evidence.
8.13.3.1 Faithfulness Score#
$\text{Faithfulness}(y, E) = \frac{\big|\{c \in \mathcal{C}(y) : \exists e \in E,\ e \models c\}\big|}{|\mathcal{C}(y)|}$

where:
- $y$ is the agent's output
- $E$ is the set of retrieved evidence fragments
- $\mathcal{C}(y)$ is the set of factual claims in the output (extracted via claim decomposition)
- $e \models c$ denotes that evidence $e$ entails claim $c$ (verified via an NLI or entailment model)
8.13.3.2 Attribution Coverage#
$\text{Attribution}(y, E) = \frac{\big|\{c \in \mathcal{C}(y) : c \text{ carries an explicit citation to some } e \in E\}\big|}{|\mathcal{C}(y)|}$

This measures whether the agent cites its sources, a necessary condition for auditability.
8.13.4 Agent Task Success Correlation#
The most important evaluation is the correlation between retrieval quality and agent task success:
$\rho = \text{corr}\big(\text{RetrievalQuality}(q),\ \text{TaskSuccess}(q)\big)$

where $\text{TaskSuccess}(q) \in \{0, 1\}$ or $[0, 1]$ is the downstream task outcome.
A high correlation validates that retrieval improvements translate to agent improvements. A low correlation indicates that retrieval is not the bottleneck—the problem may lie in prompt construction, tool use, or the model's reasoning.
8.13.4.1 Causal Analysis: Retrieval Ablation#
To isolate the causal effect of retrieval quality on task success, run ablation studies:
- No retrieval: Agent operates with zero evidence. Establishes baseline.
- Random retrieval: Agent receives randomly selected documents. Tests whether any context helps.
- Oracle retrieval: Agent receives the ideal evidence set (manually curated). Establishes ceiling.
- System retrieval: Agent receives the output of the retrieval system under test.
$\text{RetrievalContribution} = \frac{\text{Success}_{\text{system}} - \text{Success}_{\text{none}}}{\text{Success}_{\text{oracle}} - \text{Success}_{\text{none}}}$

This ratio quantifies what fraction of the theoretically achievable improvement is captured by the current retrieval system.
8.13.5 Continuous Evaluation Infrastructure#
ALGORITHM: ContinuousRetrievalEvaluation
INPUT: eval_queries: List<EvalQuery>, retrieval_engine: EvidenceEngine,
agent: Agent, schedule: CronSchedule
OUTPUT: eval_report: EvalReport
1. ON schedule:
2. results ← EMPTY_LIST
3. FOR EACH eq IN eval_queries:
4. // Intrinsic evaluation
5. retrieved ← retrieval_engine.retrieve(eq.query, eq.config)
6. precision_k ← compute_precision_at_k(retrieved, eq.relevant_docs, K)
7. recall_k ← compute_recall_at_k(retrieved, eq.relevant_docs, K)
8. ndcg_k ← compute_ndcg_at_k(retrieved, eq.graded_relevance, K)
9. mrr ← compute_mrr(retrieved, eq.relevant_docs)
10.
11. // Extrinsic evaluation (agent task success)
12. agent_output ← agent.execute(eq.task, evidence=retrieved)
13. task_success ← TaskEvaluator.evaluate(agent_output, eq.expected_outcome)
14. faithfulness ← FaithfulnessChecker.check(agent_output, retrieved)
15. attribution ← AttributionChecker.check(agent_output, retrieved)
16.
17. APPEND {eq.id, precision_k, recall_k, ndcg_k, mrr,
18. task_success, faithfulness, attribution} TO results
19.
20. // Aggregate metrics
21. report ← EvalReport {
22. mean_precision_k: MEAN(results.precision_k),
23. mean_recall_k: MEAN(results.recall_k),
24. mean_ndcg_k: MEAN(results.ndcg_k),
25. mean_mrr: MEAN(results.mrr),
26. mean_task_success: MEAN(results.task_success),
27. mean_faithfulness: MEAN(results.faithfulness),
28. retrieval_task_correlation: PEARSON_CORR(results.ndcg_k, results.task_success),
29. regressions: DETECT_REGRESSIONS(results, historical_results),
30. timestamp: NOW()
31. }
32.
33. // Quality gate enforcement
34. IF report.mean_ndcg_k < NDCG_THRESHOLD
35. OR report.mean_faithfulness < FAITHFULNESS_THRESHOLD
36. OR report.regressions IS NOT EMPTY:
37. AlertSystem.fire(RETRIEVAL_QUALITY_DEGRADATION, report)
38. IF report.regressions.severity == CRITICAL:
39. DeploymentGate.block_release("Retrieval quality regression detected")
40.
41. MetricsStore.persist(report)
42. RETURN report
8.13.6 Evaluation Metric Summary#
| Metric | Formula | Measures | Target (Production) |
|---|---|---|---|
| Precision@K | relevant in top-$K$ / $K$ | Context purity | |
| Recall@K | relevant in top-$K$ / all relevant | Evidence completeness | |
| NDCG@K | DCG@K / IDCG@K | Ranking quality | |
| MRR | mean of $1/\text{rank}_q$ over queries | First-hit quality | |
| Faithfulness | supported claims / total claims | Hallucination control | |
| Attribution Coverage | cited claims / total claims | Auditability | |
| Retrieval Contribution | (system − none) / (oracle − none) | System effectiveness | |
| Latency P99 | Measured | Responsiveness | |
| Cache Hit Ratio | cache hits / lookups | Efficiency | |
Chapter Summary#
The retrieval architecture presented in this chapter replaces ad hoc RAG with a deterministic, provenance-first, multi-tier evidence engine that operates as critical infrastructure within the agentic platform. The key architectural principles are:
- Hybrid retrieval is mandatory: No single modality suffices. Sparse (BM25), dense (bi-encoder + cross-encoder), structured (SQL/SPARQL), graph (lineage traversal), and live runtime sources must be composed through principled fusion.
- Provenance is non-negotiable: Every evidence fragment carries a full chain-of-custody with source identity, extraction timestamp, transformation history, authority tier, and confidence score. Unattributed evidence is inadmissible.
- Multi-source federation with typed contracts: Sources are registered with schemas, authority tiers, freshness SLAs, latency tiers, and access policies. Retrieval is executed as parallel fan-out with deadline-aware source selection and conflict resolution.
- Ranking is multi-dimensional: The composite ranking function captures authority, freshness, relevance, and execution utility. MMR ensures diversity. LTR models with agent feedback signals enable continuous optimization.
- Latency is a hard constraint: Budget allocation, tiered deadlines, early termination, cache hierarchies, and graceful degradation ensure the retrieval engine meets its SLA regardless of source availability.
- Evaluation is continuous and causal: Intrinsic metrics (Precision@K, Recall@K, NDCG) are measured alongside extrinsic metrics (faithfulness, attribution, agent task success). Correlation analysis and ablation studies validate that retrieval improvements causally improve agent performance. Evaluation runs in CI/CD with automated regression detection and deployment gates.
The retrieval engine is the epistemic foundation of the agentic system. Every downstream operation—planning, tool selection, code generation, verification, critique—is only as reliable as the evidence upon which it is conditioned. Engineering this foundation with the rigor, observability, and formal guarantees described in this chapter is a prerequisite for building agentic systems that operate predictably, safely, and at scale.