Preamble#
Chunking is the act of partitioning a source document into discrete retrieval units — the atomic segments that are embedded, indexed, retrieved, and injected into an agent's context window. In production agentic systems, chunking is not a preprocessing convenience; it is a retrieval precision lever that directly governs the signal-to-noise ratio of every context window the agent constructs. A poorly chunked corpus degrades retrieval recall, inflates token cost, introduces incoherent context boundaries, and ultimately amplifies hallucination risk. A well-engineered chunking layer, by contrast, produces retrieval units that are semantically self-contained, provenance-tagged, hierarchically navigable, and optimally sized for downstream synthesis.
This chapter develops chunking as a first-class engineering discipline within the agentic retrieval stack. We formalize each strategy — structural, semantic, hierarchical, agentic, code-aware, tabular, multi-modal, and adaptive — with mathematical definitions, pseudo-algorithms, quality metrics, and explicit trade-off analyses. Every strategy is evaluated against the requirements of retrieval precision, contextual completeness, synthesis utility, token efficiency, and production-grade scalability.
9.1 Chunking as a Retrieval Precision Lever: Why One Strategy Fails All Document Types#
9.1.1 The Fundamental Problem#
A retrieval-augmented agent constructs its context window from a set of retrieved chunks $C_R = \{c_1, \dots, c_k\}$ selected from an indexed corpus $\mathcal{C}$. The quality of the agent's output $y$ is bounded by the quality of $C_R$:

$$\text{Quality}(y \mid q, C_R) \;\le\; \text{Quality}(y \mid q, E^*)$$

where $q$ is the user query and $E^*$ is the ideal complete evidence set. The gap between $C_R$ and $E^*$ is the retrieval information loss, and chunking strategy is the primary determinant of that loss.
9.1.2 Why Uniform Chunking Fails#
A naïve fixed-size chunking strategy partitions a document $D$ of length $T$ tokens into $\lceil T/\tau \rceil$ chunks of size $\tau$:

$$c_i = D[\,i\tau : (i+1)\tau\,], \qquad i = 0, 1, \dots, \lceil T/\tau \rceil - 1$$
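For concreteness, a minimal sketch of the naive splitter, using whitespace tokens as a stand-in for a real tokenizer:

```python
def fixed_size_chunk(text: str, tau: int) -> list[str]:
    """Naive fixed-size chunking: consecutive windows of exactly
    tau tokens; the last chunk may be shorter."""
    tokens = text.split()  # stand-in for the embedding model's tokenizer
    return [" ".join(tokens[i:i + tau]) for i in range(0, len(tokens), tau)]
```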
This approach introduces three classes of failure:
| Failure Class | Description | Consequence |
|---|---|---|
| Boundary Incoherence | Splits occur mid-sentence, mid-paragraph, or mid-argument | Retrieved chunks are semantically incomplete, forcing the model to hallucinate missing context |
| Granularity Mismatch | A single chunk size cannot match the natural information density of heterogeneous documents | Code files, legal contracts, tabular data, and narrative prose have radically different information topologies |
| Metadata Destruction | Fixed splitting discards structural cues — headings, nesting, schema boundaries | Retrieval loses the ability to rank by document position, section relevance, or hierarchical containment |
9.1.3 Document-Class Taxonomy#
Production corpora are heterogeneous. Each document class exhibits a distinct information topology that demands a specialized chunking strategy:
| Document Class | Information Topology | Optimal Chunking Strategy |
|---|---|---|
| Technical documentation | Heading-section hierarchy | Structural |
| Research papers | Argument-claim-evidence chains | Semantic + Hierarchical |
| Legal contracts | Clause-subclause nesting | Structural + Hierarchical |
| Source code | AST-defined scopes | Code (AST-based) |
| Spreadsheets / CSV / databases | Row-column schema | Tabular |
| Conversational transcripts | Turn-topic segmentation | Semantic |
| Multi-modal artifacts | Image-text-audio alignment | Multi-modal |
| Knowledge base articles | Proposition-fact granularity | Agentic |
9.1.4 The Chunking Design Objective#
Formally, the chunking function $\mathcal{K}: D \mapsto \{c_1, \dots, c_n\}$ must simultaneously optimize:

$$\max_{\mathcal{K}} \; \sum_{i=1}^{n} \Big[ \text{Coherence}(c_i) + \lambda \cdot \text{RetrievalUtility}(c_i) \Big]$$

subject to:

$$\text{tokens}(c_i) \le \tau_{\max} \;\; \forall i, \qquad \tau_{\max} \ll B_{\text{context}}, \qquad \text{provenance}(c_i) \neq \varnothing \;\; \forall i$$

where $\tau_{\max}$ is the maximum chunk token size, $B_{\text{context}}$ is the agent's context token budget, and provenance is mandatory for every chunk.
9.1.5 Architectural Position of Chunking in the Agentic Stack#
Chunking sits at the critical junction between ingestion and retrieval:
Document Ingestion → [Classification] → [Strategy Selection] → [Chunking Engine]
→ [Metadata Enrichment] → [Embedding] → [Indexing] → Retrieval Engine

The chunking engine receives a classified document and a strategy selector, produces enriched chunks, and emits them to the embedding and indexing pipeline. This is not a batch-once operation; adaptive chunking (§9.11) permits runtime re-chunking based on query characteristics.
9.2 Structural Chunking: Heading, Section, Paragraph, and Markup-Aware Splitting#
9.2.1 Definition and Scope#
Structural chunking exploits the explicit organizational markers present in a document — headings, subheadings, paragraphs, list items, HTML/XML tags, Markdown formatting, LaTeX section commands, or PDF layout elements — to define chunk boundaries that align with the author's intended information hierarchy.
Let a document be represented as an ordered tree $T = (V, E)$ where each node $v \in V$ corresponds to a structural element (section, subsection, paragraph, list item) and edges represent containment. Structural chunking defines chunk boundaries at a target depth of $d_{\text{target}}$:

$$C = \{\, \text{Serialize}(v) : v \in V, \; \text{depth}(v) = d_{\text{target}} \,\}$$

If $\text{tokens}(\text{Serialize}(v)) > \tau_{\max}$, the node is recursively split at depth $d_{\text{target}} + 1$.
9.2.2 Markup Detection and Parsing#
Structural chunking requires a format-specific parser that recovers the document tree:
| Source Format | Structural Signals | Parser Strategy |
|---|---|---|
| Markdown | #, ##, ###, ---, list markers | Regex + CommonMark AST |
| HTML | <h1>–<h6>, <section>, <article>, <div> | DOM tree traversal |
| LaTeX | \section, \subsection, \paragraph | LaTeX AST parser |
| PDF | Font size transitions, bold headings, indentation | Layout analysis (heuristic or ML-based) |
| DOCX/ODT | Style-tagged heading levels, <w:pStyle> | XML schema extraction |
| reStructuredText | Underline-delimited headings, directives | RST parser |
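To make the Markdown row concrete, a minimal regex-based heading splitter (illustrative only; a production pipeline would walk a CommonMark AST as the table indicates):

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$", re.MULTILINE)

def markdown_sections(doc: str) -> list[dict]:
    """Split a Markdown document at ATX headings, keeping heading level
    and title as structural metadata for breadcrumb construction."""
    matches = list(HEADING.finditer(doc))
    sections = []
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(doc)
        sections.append({
            "level": len(m.group(1)),      # depth in the heading hierarchy
            "title": m.group(2).strip(),   # section title for the breadcrumb
            "content": doc[m.end():end].strip(),
        })
    return sections
```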
9.2.3 Pseudo-Algorithm: Structural Chunking#
ALGORITHM: StructuralChunk(D, format, τ_max, d_target)
───────────────────────────────────────────────────────
INPUT:
D — raw document content
format — detected document format (markdown, html, pdf, ...)
τ_max — maximum chunk token size
d_target — target structural depth for chunking
OUTPUT:
chunks[] — list of structurally-bounded chunks with metadata
PROCEDURE:
1. T ← ParseDocumentTree(D, format)
// Recover the structural tree T = (V, E) from format-specific parser
2. candidates ← CollectNodesAtDepth(T, d_target)
// Gather all nodes at target depth d_target
3. chunks ← []
4. FOR EACH node v IN candidates:
text ← Serialize(v) // Flatten node subtree to text
tokens ← Tokenize(text)
IF |tokens| ≤ τ_max:
breadcrumb ← AncestorPath(v, T)
// e.g., "Chapter 3 > Section 3.2 > Subsection 3.2.1"
chunk ← {
content: text,
token_count: |tokens|,
section_path: breadcrumb,
depth: Depth(v),
position_index: GlobalIndex(v, T),
source_doc: DocumentID(D),
chunk_type: "structural",
parent_id: ParentID(v, T),
children_ids: ChildrenIDs(v, T)
}
APPEND chunk TO chunks
ELSE:
// Node exceeds τ_max: recurse to finer depth
sub_chunks ← StructuralChunk(
Serialize(v), format, τ_max, d_target + 1
)
FOR EACH sc IN sub_chunks:
sc.parent_id ← NodeID(v)
APPEND sc TO chunks
5. // Handle leaf nodes with no structural children that still exceed τ_max
FOR EACH chunk c IN chunks WHERE c.token_count > τ_max:
splits ← ParagraphSplit(c.content, τ_max)
REPLACE c IN chunks WITH splits
// Paragraph-level fallback preserves sentence boundaries
6. RETURN chunks
9.2.4 Paragraph-Level Splitting as Fallback#
When structural depth is exhausted (leaf nodes exceed $\tau_{\max}$), the fallback is paragraph-level splitting. A paragraph boundary is detected by a blank line or indentation shift (e.g., the pattern `\n\s*\n`); whole paragraphs are then packed greedily into chunks of at most $\tau_{\max}$ tokens.
This preserves sentence coherence while respecting token limits.
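A sketch of the fallback, assuming blank-line paragraph boundaries and whitespace token counts:

```python
import re

def paragraph_split(text: str, tau_max: int) -> list[str]:
    """Fallback splitter: break at blank lines, then greedily pack
    whole paragraphs into chunks of at most tau_max tokens."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks, buf, used = [], [], 0
    for p in paragraphs:
        n = len(p.split())  # stand-in for a real tokenizer
        if buf and used + n > tau_max:
            chunks.append("\n\n".join(buf))
            buf, used = [], 0
        buf.append(p)
        used += n
    if buf:
        chunks.append("\n\n".join(buf))
    return chunks
```

Note that a single paragraph longer than $\tau_{\max}$ still requires the sentence-level split of step 5 above.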
9.2.5 Trade-Off Analysis#
| Strength | Limitation |
|---|---|
| Preserves author-intended organization | Requires well-structured source documents |
| Breadcrumb metadata enables hierarchical retrieval | PDF and legacy formats have noisy structural signals |
| Chunk boundaries align with human reading units | Information density varies across sections; some chunks may be too sparse or too dense |
| Low computational cost | Cannot detect topical shifts within a single section |
9.2.6 When to Apply#
Structural chunking is the default first-pass strategy for any document with reliable markup. It should be combined with semantic chunking (§9.3) when sections contain multiple distinct topics, and with hierarchical chunking (§9.4) when parent-child retrieval is required.
9.3 Semantic Chunking: Topic Segmentation, Embedding Similarity Boundaries, Coherence Scoring#
9.3.1 Motivation#
Structural chunking fails when a document lacks reliable markup, or when a single structural section contains multiple distinct topics. Semantic chunking addresses this by detecting topical boundaries within text using distributional similarity measures.
9.3.2 Core Principle: Embedding Breakpoint Detection#
Given a document segmented into an ordered sequence of sentences $(s_1, s_2, \dots, s_N)$, each sentence is embedded into a vector space:

$$e_i = \text{embed}(s_i) \in \mathbb{R}^d$$

The inter-sentence similarity between consecutive sentences is:

$$\text{sim}_i = \cos(e_i, e_{i+1}) = \frac{e_i \cdot e_{i+1}}{\lVert e_i \rVert \, \lVert e_{i+1} \rVert}$$

A breakpoint is detected at position $i$ when the similarity drops below a threshold or exhibits a statistically significant local minimum:

$$\text{sim}_i < \mu_w - \alpha \cdot \sigma_w$$

where $\mu_w$ and $\sigma_w$ are the mean and standard deviation of similarities within a sliding window of size $w$, and $\alpha$ is a sensitivity parameter (typically $\alpha \approx 1.5$).
9.3.3 Windowed Similarity with Smoothing#
To reduce noise from short sentences, compute a windowed embedding by averaging $w$ consecutive sentence embeddings:

$$\bar{e}_{[i,\, i+w)} = \frac{1}{w} \sum_{j=i}^{i+w-1} e_j$$

The smoothed similarity curve is:

$$\widetilde{\text{sim}}_i = \cos\big(\bar{e}_{[i,\, i+w)}, \; \bar{e}_{[i+w,\, i+2w)}\big)$$

Breakpoints are then detected as local minima of $\widetilde{\text{sim}}$ below the adaptive threshold.
9.3.4 Topic Coherence Scoring#
After chunking, each chunk $c$ consisting of sentences $\{s_1, \dots, s_m\}$ is scored for internal coherence:

$$\text{Coherence}(c) = \frac{2}{m(m-1)} \sum_{i < j} \cos(e_i, e_j)$$

A high-quality chunk exhibits $\text{Coherence}(c) \ge \theta$ for a domain-calibrated threshold $\theta$.

The inter-chunk distinctiveness between adjacent chunks $c_k$ and $c_{k+1}$ validates that breakpoints are meaningful:

$$\text{Distinct}(c_k, c_{k+1}) = 1 - \cos\big(\bar{e}(c_k), \; \bar{e}(c_{k+1})\big)$$

where $\bar{e}(c)$ is the mean sentence embedding of chunk $c$. Effective semantic chunking maximizes both intra-chunk coherence and inter-chunk distinctiveness.
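A NumPy sketch of both scores, assuming the rows of each array are L2-normalized sentence embeddings:

```python
import numpy as np

def coherence(sent_embeds: np.ndarray) -> float:
    """Mean pairwise cosine similarity across a chunk's sentences.
    Rows are unit-normalized, so dot products are cosines."""
    m = len(sent_embeds)
    if m < 2:
        return 1.0
    sims = sent_embeds @ sent_embeds.T                   # m x m cosine matrix
    return float(sims[np.triu_indices(m, k=1)].mean())   # pairs i < j only

def distinctiveness(chunk_a: np.ndarray, chunk_b: np.ndarray) -> float:
    """1 - cosine between the mean embeddings of two adjacent chunks."""
    ca, cb = chunk_a.mean(axis=0), chunk_b.mean(axis=0)
    return float(1.0 - ca @ cb / (np.linalg.norm(ca) * np.linalg.norm(cb)))
```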
9.3.5 Formal Objective#
$$\max_{B} \; \sum_{k} \text{Coherence}(c_k) \; + \; \beta \sum_{k} \text{Distinct}(c_k, c_{k+1})$$

subject to:

$$\tau_{\min} \le \text{tokens}(c_k) \le \tau_{\max} \quad \forall k$$

where $B$ is the set of breakpoint positions, $\tau_{\min}$ prevents trivially small chunks, and $\beta$ balances coherence against distinctiveness.
9.3.6 Pseudo-Algorithm: Semantic Chunking via Embedding Breakpoints#
ALGORITHM: SemanticChunk(D, embed_fn, τ_max, τ_min, α, w)
────────────────────────────────────────────────────────────
INPUT:
D — document text
embed_fn — sentence embedding function: string → ℝ^d
τ_max — maximum chunk token size
τ_min — minimum chunk token size
α — breakpoint sensitivity (standard deviations below mean)
w — smoothing window size (number of sentences)
OUTPUT:
chunks[] — list of semantically coherent chunks with metadata
PROCEDURE:
1. sentences[] ← SentenceTokenize(D)
N ← |sentences|
2. embeddings[] ← [embed_fn(s) FOR s IN sentences]
3. // Compute windowed similarity curve
sim_curve[] ← []
FOR i ← 0 TO N - 2w:
left_avg ← Mean(embeddings[i : i + w])
right_avg ← Mean(embeddings[i + w : i + 2w])
sim_curve[i] ← CosineSimilarity(left_avg, right_avg)
4. // Compute adaptive threshold
μ ← Mean(sim_curve)
σ ← StdDev(sim_curve)
threshold ← μ - α · σ
5. // Detect breakpoints as local minima below threshold
breakpoints ← []
FOR i ← 1 TO |sim_curve| - 1:
IF sim_curve[i] < threshold
AND sim_curve[i] < sim_curve[i-1]
AND sim_curve[i] < sim_curve[i+1]:
candidate_pos ← i + w // Center of the window gap
APPEND candidate_pos TO breakpoints
6. // Form chunks from breakpoints
boundaries ← [0] + breakpoints + [N]
raw_chunks ← []
FOR j ← 0 TO |boundaries| - 2:
text ← JoinSentences(sentences[boundaries[j] : boundaries[j+1]])
tokens ← Tokenize(text)
APPEND {content: text, token_count: |tokens|} TO raw_chunks
7. // Enforce size constraints via merge and split
chunks ← []
buffer ← EMPTY
FOR EACH rc IN raw_chunks:
IF buffer.token_count + rc.token_count ≤ τ_max:
buffer ← Merge(buffer, rc)
ELSE:
IF buffer.token_count ≥ τ_min:
APPEND buffer TO chunks
ELSE:
// Merge small buffer into previous chunk if possible
IF chunks IS NOT EMPTY:
chunks[-1] ← Merge(chunks[-1], buffer)
ELSE:
APPEND buffer TO chunks
buffer ← rc
IF buffer IS NOT EMPTY:
APPEND buffer TO chunks
8. // Split any remaining oversized chunks at sentence boundaries
final_chunks ← []
FOR EACH c IN chunks:
IF c.token_count > τ_max:
splits ← SentenceBoundarySplit(c.content, τ_max)
EXTEND final_chunks WITH splits
ELSE:
APPEND c TO final_chunks
9. // Score coherence for each chunk
FOR EACH c IN final_chunks:
sent_embeds ← [embed_fn(s) FOR s IN SentenceTokenize(c.content)]
c.coherence_score ← MeanPairwiseCosine(sent_embeds)
c.chunk_type ← "semantic"
10. RETURN final_chunks
9.3.7 Advanced Variant: TextTiling and BayesSeg#
Beyond cosine-breakpoint detection, two classical algorithms bear mention:
- TextTiling (Hearst, 1997): Computes block similarity using term-frequency vectors across sliding windows, detecting topic boundaries at similarity valleys. It operates in $O(N)$ time with no embedding model dependency.
- BayesSeg (Eisenstein & Barzilay, 2008): Formulates topic segmentation as Bayesian inference over segment boundaries using a generative language model. It optimizes the posterior probability of a segmentation $S$ given observed word distributions $W$:

$$\hat{S} = \arg\max_{S} \; p(S \mid W) \;\propto\; p(W \mid S)\, p(S)$$
These may be used as baselines or fused with embedding-based approaches in an ensemble.
9.3.8 Trade-Off Analysis#
| Strength | Limitation |
|---|---|
| Detects topical shifts independent of formatting | Requires embedding model inference per sentence ($O(N)$ embedding calls) |
| Works on unstructured text, transcripts, raw prose | Sensitive to embedding model quality and domain alignment |
| Adaptive threshold accommodates varying document styles | Smoothing window must be tuned per corpus |
| Produces coherent, self-contained retrieval units | Cannot capture hierarchical structure (flat segmentation) |
9.4 Hierarchical Chunking: Parent-Child Relationships, Summary-Detail Layering, Recursive Decomposition#
9.4.1 Motivation and Architecture#
Many retrieval tasks require chunks at multiple granularity levels simultaneously. A user query about a broad concept should retrieve a summary-level chunk; a follow-up query about a specific detail should retrieve the fine-grained child chunk. Hierarchical chunking constructs a chunk tree where each node is a chunk and edges encode parent-child containment.
9.4.2 Formal Definition#
Let $D$ be a document. A hierarchical chunking of $D$ at $L$ levels produces a tree:

$$T_C = (C, E), \qquad C = \bigcup_{l=1}^{L} C_l$$

where:

- $C$ is the set of chunks across levels
- $C_l$ is the set of chunks at level $l$ (level 1 = coarsest, level $L$ = finest)

Each parent chunk satisfies:

$$c_p = \bigoplus_{c \,\in\, \text{children}(c_p)} c$$

where $\oplus$ denotes ordered concatenation, and an optional summary layer stores:

$$\text{summary}(c_p) = \text{summarize}\Big( \bigoplus_{c \,\in\, \text{children}(c_p)} c \Big)$$
9.4.3 Retrieval with Hierarchical Chunks#
At query time, the retrieval engine can operate in multiple modes:
- Leaf retrieval: Retrieve fine-grained chunks from $C_L$ for precise evidence extraction
- Parent expansion: Upon retrieving a leaf chunk $c$, expand to its parent $\text{parent}(c)$ for surrounding context
- Summary-first routing: Retrieve from summary-level $C_1$, then drill into children for detail
- Multi-level fusion: Retrieve from all levels, deduplicate, and rank by relevance-at-granularity
The parent expansion strategy is formalized as:

$$\text{Inject}(c) = \begin{cases} \text{parent}(c) & \text{if } \text{tokens}(\text{parent}(c)) \le \tau_{\text{budget}} \\ c & \text{otherwise} \end{cases}$$
This provides the model with surrounding context that a leaf chunk alone cannot supply.
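A minimal sketch of parent expansion over a flat store of chunk records (field names follow the metadata schema of §9.10; the token-budget guard is an assumption):

```python
def expand_to_parent(chunk: dict, store: dict, budget: int) -> dict:
    """Inject the parent chunk when it exists and fits the token
    budget; otherwise fall back to the retrieved leaf itself."""
    parent_id = chunk.get("parent_chunk_id")
    if parent_id is not None:
        parent = store[parent_id]
        if parent["token_count"] <= budget:
            return parent
    return chunk
```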
9.4.4 Pseudo-Algorithm: Hierarchical Chunking with Summary-Detail Layering#
ALGORITHM: HierarchicalChunk(D, levels, τ_sizes[], summarize_fn)
──────────────────────────────────────────────────────────────────
INPUT:
D — document text
levels — number of hierarchical levels (e.g., 3)
τ_sizes[] — max chunk token size per level: [τ_1, τ_2, ..., τ_L]
where τ_1 > τ_2 > ... > τ_L
summarize_fn — function to generate summaries (LLM or extractive)
OUTPUT:
chunk_tree — tree of chunks with parent-child edges and summaries
PROCEDURE:
1. // Level 1: Coarsest segmentation
level_1_chunks ← StructuralChunk(D, format, τ_sizes[1], d_target=1)
// Or SemanticChunk if no structure available
2. chunk_tree ← InitializeTree()
FOR EACH c1 IN level_1_chunks:
node_1 ← CreateNode(level=1, content=c1.content, metadata=c1)
AddToTree(chunk_tree, node_1, parent=ROOT)
3. // Levels 2 through L: Recursive refinement
FOR l ← 2 TO levels:
parent_nodes ← GetNodesAtLevel(chunk_tree, l - 1)
FOR EACH parent IN parent_nodes:
IF TokenCount(parent.content) > τ_sizes[l]:
children ← SemanticChunk(
parent.content, embed_fn, τ_sizes[l], τ_min=50,
α=1.5, w=3
)
FOR EACH child IN children:
node_child ← CreateNode(
level=l, content=child.content, metadata=child
)
AddToTree(chunk_tree, node_child, parent=parent)
ELSE:
// Parent is already small enough; promote as leaf
MarkAsLeaf(parent)
4. // Generate summaries for non-leaf nodes (bottom-up)
FOR l ← levels - 1 DOWNTO 1:
nodes_at_l ← GetNodesAtLevel(chunk_tree, l)
FOR EACH node IN nodes_at_l:
children_text ← Concatenate([c.content FOR c IN Children(node)])
node.summary ← summarize_fn(children_text)
node.summary_tokens ← TokenCount(node.summary)
5. // Assign globally unique chunk IDs and parent pointers
AssignIDs(chunk_tree)
FOR EACH node IN AllNodes(chunk_tree):
node.metadata.parent_id ← ParentID(node)
node.metadata.children_ids ← [ChildID(c) FOR c IN Children(node)]
node.metadata.level ← node.level
6. RETURN chunk_tree
9.4.5 Token Budget Considerations#
For a document of $T$ tokens with $L$ hierarchical levels, the total storage cost is:

$$S_{\text{total}} \approx L \cdot T + \sum_{l=1}^{L-1} |C_l| \cdot \bar{s}_l$$

where $\bar{s}_l$ is the average summary length at level $l$. This is a storage multiplication factor of approximately $L + \epsilon$ (where $\epsilon$ accounts for summaries), which is acceptable given that it enables multi-granularity retrieval. The embedding and indexing cost scales linearly with the total number of chunk nodes: $O\big(\sum_{l=1}^{L} |C_l|\big)$.
9.4.6 Trade-Off Analysis#
| Strength | Limitation |
|---|---|
| Enables multi-granularity retrieval with a single index | Storage and embedding cost scale linearly with level count |
| Parent expansion provides surrounding context automatically | Summarization introduces potential information loss |
| Summary layers enable high-recall coarse search | Requires careful calibration of $\tau_l$ per level |
| Natural fit for documents with inherent hierarchy | Flat documents (e.g., chat logs) gain limited benefit |
9.5 Agentic Chunking: LLM-Guided Proposition Extraction, Claim Decomposition, and Fact Isolation#
9.5.1 Definition#
Agentic chunking uses a language model to decompose documents into atomic propositions — self-contained, independently verifiable statements that each express exactly one factual claim. Unlike structural or semantic chunking, which operate on surface features, agentic chunking operates on meaning, producing retrieval units optimized for factual precision and downstream synthesis.
9.5.2 Proposition as the Atomic Retrieval Unit#
A proposition $p$ satisfies three properties:

- Atomicity: $p$ expresses exactly one claim or fact
- Self-containedness: $p$ is interpretable without reference to surrounding text (all pronouns resolved, all abbreviations expanded)
- Verifiability: $p$ can be independently confirmed or refuted against an authoritative source
Example transformation:
Source: "Founded in 2004, the company grew to 500 employees by 2020 and was acquired by Acme Corp for $2B."
Propositions:
- "The company was founded in 2004."
- "The company had 500 employees by 2020."
- "Acme Corp acquired the company."
- "The acquisition price was $2 billion."
9.5.3 Formal Definition#
Given a document $D$, agentic chunking produces a set of propositions:

$$P(D) = \{p_1, p_2, \dots, p_M\}$$

where each $p_i$ is a decontextualized atomic statement with provenance:

$$p_i = (\text{claim}_i, \; \text{source\_doc}, \; \text{source\_passage}, \; \text{entities}_i)$$

The proposition count typically satisfies $M \gg n$, where $n$ is the structural chunk count, since a single paragraph may yield 5–15 propositions.
9.5.4 Pseudo-Algorithm: Agentic Proposition Extraction#
ALGORITHM: AgenticChunk(D, llm_fn, τ_max_input, batch_size)
─────────────────────────────────────────────────────────────
INPUT:
D — document text
llm_fn — language model function for proposition extraction
τ_max_input — max tokens per LLM extraction call
batch_size — number of paragraphs per batch
OUTPUT:
propositions[] — list of atomic, self-contained propositions with provenance
PROCEDURE:
1. // Pre-segment document into manageable passages
passages ← StructuralChunk(D, format, τ_max_input, d_target=MAX_DEPTH)
// Alternatively: ParagraphSplit(D, τ_max_input)
2. propositions ← []
3. FOR EACH batch IN BatchIterator(passages, batch_size):
prompt ← CompileExtractionPrompt(batch)
// Prompt instructs:
// "Decompose the following text into atomic, self-contained
// propositions. Each proposition must:
// - Express exactly one factual claim
// - Resolve all pronouns and references to explicit entities
// - Be independently interpretable without surrounding text
// - Preserve numerical values, dates, and proper nouns exactly
// Return as a JSON array of objects with fields:
// {claim, source_sentence, entities[], confidence}"
response ← llm_fn(prompt)
extracted ← ParseJSON(response)
FOR EACH prop IN extracted:
prop.source_doc ← DocumentID(D)
prop.source_passage_id ← PassageID(batch)
prop.chunk_type ← "agentic_proposition"
prop.token_count ← TokenCount(prop.claim)
// Validate: reject empty, duplicate, or non-atomic propositions
IF prop.claim IS NOT EMPTY
AND NOT IsDuplicate(prop, propositions)
AND TokenCount(prop.claim) ≥ 5:
APPEND prop TO propositions
4. // Deduplication pass: semantic near-duplicate removal
embeddings ← [Embed(p.claim) FOR p IN propositions]
duplicate_pairs ← FindPairsAboveThreshold(embeddings, θ_dedup=0.95)
FOR EACH (i, j) IN duplicate_pairs:
// Keep the proposition with richer entity set
IF |propositions[i].entities| ≥ |propositions[j].entities|:
MarkForRemoval(propositions[j])
ELSE:
MarkForRemoval(propositions[i])
propositions ← RemoveMarked(propositions)
5. // Optional: cluster propositions by topic for grouped retrieval
clusters ← ClusterByEmbedding(propositions, method="HDBSCAN")
FOR EACH cluster IN clusters:
FOR EACH prop IN cluster:
prop.topic_cluster_id ← cluster.id
6. RETURN propositions
9.5.5 Claim Decomposition for Complex Statements#
Complex claims (conjunctions, conditionals, causal chains) require further decomposition. A claim $c$ is non-atomic if it contains a logical connective:

$$c = c_1 \wedge c_2 \quad \text{or} \quad c = c_1 \rightarrow c_2 \quad \text{or} \quad c = c_1 \xrightarrow{\text{causes}} c_2$$

The LLM is instructed to decompose $c$ into atomic components $\{c_1, c_2, \dots\}$ whose conjunction is semantically equivalent to $c$.
9.5.6 Cost Analysis#
Agentic chunking is the most expensive strategy in terms of inference cost. For a document of $T$ tokens processed in batches of $\tau_{\text{batch}}$ tokens:

$$\#\text{LLM calls} = \Big\lceil \frac{T}{\tau_{\text{batch}}} \Big\rceil$$
For cost optimization: (a) use a smaller, fine-tuned model for extraction; (b) batch aggressively; (c) cache extraction results and only re-extract on source change.
9.5.7 Trade-Off Analysis#
| Strength | Limitation |
|---|---|
| Produces maximally precise retrieval units | High inference cost ($O(T/\tau_{\text{batch}})$ LLM calls) |
| Self-contained propositions eliminate context dependency | LLM extraction may hallucinate propositions not in source |
| Ideal for fact-checking, QA, and claim verification tasks | Not suitable for code, tabular data, or structured content |
| Enables fine-grained attribution and provenance | Over-decomposition can fragment coherent arguments |
9.6 Code Chunking: AST-Based, Function-Level, Class-Level, Dependency-Scope Chunking#
9.6.1 Motivation#
Source code possesses a formal grammar and a well-defined syntactic structure (the Abstract Syntax Tree, AST). Naïve text-based chunking of code produces fragments that split functions mid-body, separate signatures from implementations, and destroy import context. Code chunking must respect the syntactic and semantic boundaries defined by the programming language's grammar.
9.6.2 AST-Based Chunking#
The AST of a source file is a tree where nodes correspond to syntactic constructs: module, class, function, method, block, statement, expression. Chunking at a target AST depth produces semantically complete code units.
Let $\mathcal{N}_{\text{scope}}$ denote the set of scope-defining node types:

$$\mathcal{N}_{\text{scope}} = \{\text{module}, \; \text{class\_def}, \; \text{function\_def}, \; \text{method\_def}\}$$

A code chunk is the serialization of a scope-defining node $v$ plus its required context:

$$c_v = \text{imports}(v) \;\oplus\; \text{decorators}(v) \;\oplus\; \text{signature}(v) \;\oplus\; \text{body}(v)$$

where $\text{imports}(v)$ includes only the import statements referenced within $v$'s body.
9.6.3 Pseudo-Algorithm: AST-Based Code Chunking#
ALGORITHM: CodeChunk(F, language, τ_max, granularity)
─────────────────────────────────────────────────────
INPUT:
F — source file content
language — programming language (python, java, typescript, ...)
τ_max — maximum chunk token size
granularity — target scope level: "function" | "class" | "module"
OUTPUT:
chunks[] — list of code chunks with AST metadata
PROCEDURE:
1. AST ← Parse(F, language)
// Use tree-sitter, libcst, or language-specific parser
2. imports ← ExtractImports(AST)
// Global import statements for the file
3. scope_nodes ← CollectNodes(AST, type ∈ N_scope, granularity)
// If granularity = "function": collect all function/method nodes
// If granularity = "class": collect all class nodes
// If granularity = "module": the entire file is one chunk
4. chunks ← []
5. FOR EACH node IN scope_nodes:
signature ← ExtractSignature(node) // e.g., "def foo(x: int) -> str:"
docstring ← ExtractDocstring(node) // First string literal in body
body ← ExtractBody(node) // Full body source
decorators ← ExtractDecorators(node) // @decorator lines
parent_class← GetEnclosingClass(node, AST) // None if top-level
// Resolve used imports
used_symbols ← ExtractReferencedSymbols(body)
relevant_imports ← Filter(imports, symbol ∈ used_symbols)
full_text ← Join([
relevant_imports,
decorators,
signature,
docstring,
body
])
tokens ← Tokenize(full_text)
IF |tokens| ≤ τ_max:
chunk ← {
content: full_text,
token_count: |tokens|,
chunk_type: "code_" + granularity,
language: language,
scope_name: QualifiedName(node, AST),
signature: signature,
enclosing_class: parent_class,
file_path: FilePath(F),
line_start: StartLine(node),
line_end: EndLine(node),
dependencies: used_symbols,
docstring: docstring,
complexity: CyclomaticComplexity(node)
}
APPEND chunk TO chunks
ELSE:
// Function too large: split into sub-blocks
IF granularity = "class":
// Recurse at method level
method_chunks ← CodeChunk(
Serialize(node), language, τ_max, "function"
)
FOR EACH mc IN method_chunks:
mc.enclosing_class ← QualifiedName(node, AST)
APPEND mc TO chunks
ELSE:
// Split large function at logical block boundaries
blocks ← ExtractBlocks(node)
// blocks: loops, conditionals, try-except, sequential blocks
FOR EACH block IN blocks:
block_text ← Join([relevant_imports, signature + "...", Serialize(block)])
block_chunk ← CreateChunk(block_text, metadata={
scope_name: QualifiedName(node) + "::block_" + BlockIndex(block),
...
})
APPEND block_chunk TO chunks
6. // Add file-level context chunk (always)
file_summary ← {
content: Join([imports, ExtractClassSignatures(AST), ExtractFunctionSignatures(AST)]),
chunk_type: "code_file_summary",
file_path: FilePath(F),
token_count: TokenCount(above)
}
APPEND file_summary TO chunks
7. RETURN chunks
9.6.4 Dependency-Scope Chunking#
For queries that require understanding the context of a function's callees or callers, dependency-scope chunking constructs chunks that include transitively referenced code:

$$c_f^{\text{dep}} = \text{serialize}(f) \;\oplus\; \bigoplus_{g \,\in\, \text{deps}(f,\, r)} \text{signature}(g)$$

where $\text{deps}(f, r)$ is the set of functions reachable from $f$ in the call graph within radius $r$. This provides the retrieval engine with enough context to answer questions about a function's behavior without retrieving the entire codebase.
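A single-file sketch using Python's ast module (3.9+ for ast.unparse), realizing the radius-1 case: the target function's source plus signatures of the module-level functions it calls directly. Cross-file resolution and transitive closure are omitted:

```python
import ast

def dependency_scope_chunk(source: str, target: str) -> str:
    """Serialize a function together with one-line signatures of the
    module-level functions it calls (call-graph radius r = 1)."""
    tree = ast.parse(source)
    funcs = {n.name: n for n in tree.body if isinstance(n, ast.FunctionDef)}
    fn = funcs[target]
    # Names invoked via plain calls anywhere inside the target's body
    called = {
        node.func.id
        for node in ast.walk(fn)
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
    }
    dep_sigs = [
        f"def {g.name}({ast.unparse(g.args)}): ..."
        for name, g in funcs.items()
        if name in called and name != target
    ]
    return "\n".join(dep_sigs + [ast.unparse(fn)])
```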
9.6.5 Language-Specific Considerations#
| Language | Parser | Scope Nodes | Special Handling |
|---|---|---|---|
| Python | tree-sitter-python, libcst | function_definition, class_definition | Decorators, type hints, __init__ grouping |
| Java | tree-sitter-java, JavaParser | method_declaration, class_declaration | Interface/abstract methods, annotations |
| TypeScript | tree-sitter-typescript | function_declaration, class_declaration, arrow_function | JSX components, type declarations |
| Go | tree-sitter-go | function_declaration, method_declaration | Struct methods, interface satisfaction |
| Rust | tree-sitter-rust | function_item, impl_item | Trait implementations, lifetime annotations |
9.6.6 Trade-Off Analysis#
| Strength | Limitation |
|---|---|
| Respects language grammar; never splits mid-function | Requires a working parser per language |
| Includes import context for self-containedness | Very large functions or classes may still exceed $\tau_{\max}$ |
| AST metadata enables precise scope-based retrieval | Dynamic languages (Python, JS) may have incomplete ASTs for metaprogramming |
| Cyclomatic complexity metadata aids relevance ranking | Cross-file dependencies require call graph analysis |
9.7 Tabular and Structured Data Chunking: Row-Group, Schema-Preserving, Pivot-Aware Strategies#
9.7.1 The Challenge#
Tabular data (CSV, Excel, SQL tables, JSON arrays) encodes information in a two-dimensional schema. Chunking rows without schema context produces meaningless text. Chunking entire tables may exceed $\tau_{\max}$. The chunking strategy must preserve the schema-value binding so that each chunk is independently interpretable.
9.7.2 Schema-Preserving Row-Group Chunking#
Given a table $T$ with schema $S = (a_1, a_2, \dots, a_m)$ and $R$ rows, partition $T$ into row groups of size $g$:

$$G_j = \{\, r_{jg+1}, \dots, r_{(j+1)g} \,\}, \qquad j = 0, 1, \dots, \lceil R/g \rceil - 1$$

Every chunk begins with the schema header $H = \text{serialize}(S)$, ensuring self-containedness. The group size $g$ is determined by:

$$g = \Big\lfloor \frac{\tau_{\max} - \text{tokens}(H)}{\bar{t}_{\text{row}}} \Big\rfloor$$

where $\bar{t}_{\text{row}}$ is the average row token count.
9.7.3 Pivot-Aware Chunking#
When a table has a key column $k$ (e.g., entity name, date) that groups logically related rows, chunking should respect key boundaries:

$$G_v = \{\, r \in T : r[k] = v \,\} \quad \text{for each distinct value } v$$

This produces one chunk per unique key value (or per key-value group if the key has high cardinality), preserving the semantic unit of "all data about entity $v$."
9.7.4 Serialization Formats#
The way a table chunk is serialized affects embedding quality and retrieval precision:
| Format | Example | Pros | Cons |
|---|---|---|---|
| Markdown table | `\| col1 \| col2 \| ...` | Readable, familiar to LLMs | Verbose for wide tables |
| Row-as-sentence | "The revenue for Q3 2023 was $5.2M with margin 12%." | High embedding quality | Requires NL template per schema |
| JSON records | [{"col1": "v1", "col2": "v2"}] | Structured, parseable | Token-heavy due to key repetition |
| Key-value pairs | col1: v1, col2: v2 | Compact | Ambiguous without schema context |
For optimal retrieval, row-as-sentence serialization with natural-language templates is preferred because it produces text that aligns with the embedding model's training distribution.
9.7.5 Pseudo-Algorithm: Schema-Preserving Tabular Chunking#
ALGORITHM: TabularChunk(T, schema, τ_max, key_col, serialize_mode)
────────────────────────────────────────────────────────────────────
INPUT:
T — table data (list of row dictionaries)
schema — column definitions with types and descriptions
τ_max — maximum chunk token size
key_col — optional grouping key column (may be NULL)
serialize_mode — "markdown" | "row_sentence" | "json"
OUTPUT:
chunks[] — list of schema-preserving table chunks
PROCEDURE:
1. schema_header ← SerializeSchema(schema, serialize_mode)
schema_tokens ← TokenCount(schema_header)
2. IF key_col IS NOT NULL:
// Pivot-aware grouping
groups ← GroupBy(T, key_col)
ELSE:
// Sequential row grouping
avg_row_tokens ← Mean([TokenCount(SerializeRow(r, schema, serialize_mode)) FOR r IN T])
g ← Floor((τ_max - schema_tokens) / avg_row_tokens)
g ← Max(g, 1)
groups ← SequentialBatch(T, batch_size=g)
3. chunks ← []
FOR EACH group IN groups:
serialized_rows ← [SerializeRow(r, schema, serialize_mode) FOR r IN group]
content ← schema_header + "\n" + Join(serialized_rows, "\n")
tokens ← TokenCount(content)
IF tokens ≤ τ_max:
chunk ← {
content: content,
token_count: tokens,
chunk_type: "tabular",
schema: schema,
row_count: |group|,
key_value: group[0][key_col] IF key_col ELSE NULL,
row_range: (FirstRowIndex(group), LastRowIndex(group)),
source_table: TableID(T)
}
APPEND chunk TO chunks
ELSE:
// Group exceeds τ_max: split into sub-groups
sub_groups ← SequentialBatch(group, batch_size=Floor(g/2))
RECURSE on each sub_group
4. // Generate table summary chunk
summary ← {
content: "Table: " + TableName(T) + ". Schema: " + schema_header
+ ". Row count: " + |T|
+ ". Key statistics: " + ComputeColumnStats(T, schema),
chunk_type: "tabular_summary",
token_count: TokenCount(above),
source_table: TableID(T)
}
APPEND summary TO chunks
5. RETURN chunks
9.7.6 Row-as-Sentence Template#
For the row-as-sentence serialization, a template is constructed from the schema:

$$\text{sentence}(r) = \text{Template}(S)\big(r[a_1], r[a_2], \dots, r[a_m]\big)$$

Example: For schema (company, revenue, quarter, margin):

"{company} reported revenue of {revenue} in {quarter}, with a margin of {margin}."

This produces text that embeds well and retrieves accurately against natural-language queries.
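A sketch of the serialization, assuming each schema carries a natural-language template with {column} placeholders:

```python
def row_as_sentence(row: dict, template: str) -> str:
    """Render one table row as a natural-language sentence; the
    template itself carries the schema-value binding."""
    return template.format(**row)

template = ("{company} reported revenue of {revenue} in {quarter}, "
            "with a margin of {margin}.")
row = {"company": "Acme Corp", "revenue": "$5.2M",
       "quarter": "Q3 2023", "margin": "12%"}
print(row_as_sentence(row, template))
# -> Acme Corp reported revenue of $5.2M in Q3 2023, with a margin of 12%.
```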
9.8 Multi-Modal Chunking: Image-Caption Pairing, Video Segment Annotation, Audio Transcript Alignment#
9.8.1 Scope#
Multi-modal documents (PDFs with figures, web pages with images, video tutorials, podcast transcripts) require chunking strategies that preserve the alignment between modalities. A figure without its caption is uninterpretable; a transcript segment without its timestamp is unlocalizable.
9.8.2 Image-Caption Pairing#
For documents containing images (figures, diagrams, charts), each image is chunked as a paired unit:

$$c_{\text{img}} = (\text{image\_reference}, \; \text{caption}, \; \text{alt\_text}, \; \text{surrounding\_text}, \; \text{OCR\_text})$$
where:
- image_reference: URI or embedding of the image
- caption: extracted figure caption (e.g., "Figure 3: Architecture diagram")
- alt_text: accessibility description if available
- surrounding_text: the $k$ sentences before and after the image in the document
- OCR_text: text extracted from the image via OCR (for diagrams, charts, screenshots)
The text representation for embedding and retrieval is:

$$t_{\text{embed}} = \text{caption} \;\oplus\; \text{alt\_text} \;\oplus\; \text{OCR\_text} \;\oplus\; \text{surrounding\_text}$$

Optionally, a vision-language model generates a synthetic description of the image:

$$d_{\text{VLM}} = \text{VLM}(\text{image})$$
This description is appended to the embeddable text for richer retrieval.
9.8.3 Video Segment Chunking#
Video content is chunked by segmenting the timeline into semantically coherent segments:

$$V = \big\{ (t_i^{\text{start}}, \; t_i^{\text{end}}, \; \text{transcript}_i, \; \text{keyframes}_i) \big\}_{i=1}^{K}$$
Segment boundaries are determined by:
- Silence detection: Pauses longer than a threshold $\theta_{\text{pause}}$ seconds
- Speaker diarization: Speaker change points
- Transcript semantic segmentation: Apply semantic chunking (§9.3) to the transcript text
- Visual scene change detection: Keyframe difference exceeds threshold
9.8.4 Audio Transcript Alignment#
For audio-only content (podcasts, meetings, calls), the chunking pipeline is:
Audio → ASR (speech-to-text) → Diarization → Sentence segmentation
→ Semantic chunking on transcript → Timestamp alignment

Each chunk carries:

$$c_{\text{audio}} = (\text{transcript\_text}, \; t_{\text{start}}, \; t_{\text{end}}, \; \text{speaker\_id}, \; \text{source\_recording})$$
9.8.5 Pseudo-Algorithm: Multi-Modal Document Chunking#
ALGORITHM: MultiModalChunk(D_mm, modality_extractors, τ_max)
─────────────────────────────────────────────────────────────
INPUT:
D_mm — multi-modal document (PDF, web page, video, ...)
modality_extractors — dict of modality → extraction function
τ_max — maximum chunk token size
OUTPUT:
chunks[] — list of modality-annotated chunks
PROCEDURE:
1. // Extract modality streams
text_blocks ← modality_extractors["text"](D_mm)
images ← modality_extractors["image"](D_mm)
tables ← modality_extractors["table"](D_mm)
audio_segs ← modality_extractors["audio"](D_mm) // May be empty
video_segs ← modality_extractors["video"](D_mm) // May be empty
2. chunks ← []
3. // Text chunking (structural or semantic)
text_chunks ← StructuralChunk(JoinBlocks(text_blocks), format, τ_max, d_target=2)
FOR EACH tc IN text_chunks:
tc.modality ← "text"
APPEND tc TO chunks
4. // Image-caption pairing
FOR EACH img IN images:
caption ← ExtractCaption(img, D_mm)
alt_text ← ExtractAltText(img, D_mm)
ocr_text ← OCR(img.bytes)
surrounding ← GetSurroundingText(img.position, text_blocks, k=3)
vlm_description ← VLMDescribe(img.bytes) // Optional
embeddable ← Join([caption, alt_text, ocr_text, surrounding, vlm_description])
chunk ← {
content: embeddable,
token_count: TokenCount(embeddable),
chunk_type: "image_caption_pair",
modality: "image",
image_ref: img.uri,
caption: caption,
position_in_doc: img.position,
source_doc: DocumentID(D_mm)
}
APPEND chunk TO chunks
5. // Table chunking
FOR EACH tbl IN tables:
tbl_chunks ← TabularChunk(tbl.data, tbl.schema, τ_max, key_col=NULL, "markdown")
FOR EACH tc IN tbl_chunks:
tc.modality ← "table"
APPEND tc TO chunks
6. // Audio/video segment chunking
FOR EACH seg IN audio_segs + video_segs:
transcript_chunks ← SemanticChunk(seg.transcript, embed_fn, τ_max, τ_min=30, α=1.5, w=2)
FOR EACH tc IN transcript_chunks:
tc.modality ← seg.type // "audio" or "video"
tc.time_start ← AlignTimestamp(tc, seg)
tc.time_end ← AlignTimestamp(tc, seg)
tc.speaker ← seg.speaker_id
tc.keyframe_ref ← ExtractKeyframe(tc.time_start) IF seg.type = "video" ELSE NULL
APPEND tc TO chunks
7. RETURN chunks
9.8.6 Cross-Modal Linking#
After chunking, cross-modal links should be established:
- An image chunk links to the text chunk that references it ("As shown in Figure 3...")
- A table chunk links to the text chunk that discusses its findings
- A video segment chunk links to slides visible during that segment
These links are stored as metadata edges and enable the retrieval engine to expand across modalities when a query touches multiple information types.
9.9 Overlap, Stride, and Context Window Strategies for Boundary Coherence#
9.9.1 The Boundary Problem#
Every chunking strategy introduces artificial boundaries. Information that spans a boundary is split across two chunks, and if only one chunk is retrieved, the agent receives an incomplete evidence fragment. Overlap and stride strategies mitigate this by ensuring that boundary-adjacent content appears in multiple chunks.
9.9.2 Formal Definitions#
Given a document of $T$ tokens, a chunk size $\tau$, and an overlap $o$ (where $0 \le o < \tau$), the stride is:

$$s = \tau - o$$

The $i$-th chunk is:

$$c_i = D[\, is : is + \tau \,]$$

The total number of chunks is:

$$n = \Big\lceil \frac{T - o}{s} \Big\rceil$$

The overlap ratio is:

$$\rho = \frac{o}{\tau}$$

The storage expansion factor (ratio of total stored tokens to document length) is:

$$E = \frac{n \cdot \tau}{T} \approx \frac{\tau}{s} = \frac{1}{1 - \rho}$$

For $\rho = 0.2$ (20% overlap), expansion $E = 1.25$; for $\rho = 0.5$, expansion $E = 2.0$.
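A sketch of stride-based chunking, whitespace tokens standing in for a real tokenizer:

```python
def overlap_chunk(text: str, tau: int, overlap: int) -> list[str]:
    """Sliding-window chunking with stride s = tau - overlap: the last
    `overlap` tokens of each chunk reappear at the start of the next."""
    assert 0 <= overlap < tau, "overlap must be smaller than chunk size"
    tokens = text.split()
    stride = tau - overlap
    return [
        " ".join(tokens[i:i + tau])
        for i in range(0, max(len(tokens) - overlap, 1), stride)
    ]
```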
9.9.3 Optimal Overlap Selection#
The optimal overlap trades off boundary coherence against storage and embedding cost. Define boundary information loss as:

$$\mathcal{L}_{\text{boundary}}(o) = \big|\{\, q \in Q_{\text{eval}} : \text{answer}(q) \text{ spans a chunk boundary and no single chunk contains it} \,\}\big|$$

This counts the number of evaluation queries whose answer falls in the boundary region and is not fully captured by a single chunk. The objective is:

$$o^* = \arg\min_{o} \; \Big[ \mathcal{L}_{\text{boundary}}(o) + \gamma \cdot E(o) \Big]$$

In practice, overlaps of 10%–20% of chunk size are standard for prose documents, while code chunks typically use zero overlap (since AST boundaries are semantically precise).
9.9.4 Sentence-Aligned Overlap#
Rather than token-level overlap (which may split words), sentence-aligned overlap ensures that the overlap region begins and ends at sentence boundaries:

$$o_i = s_{j-k+1} \oplus \cdots \oplus s_j$$

that is, the last $k$ whole sentences of chunk $c_i$ are prepended to chunk $c_{i+1}$.
This prevents partial-sentence artifacts in the overlap region.
9.9.5 Context Window Prefix Strategy#
An alternative to overlap is the context window prefix: instead of duplicating content, each chunk is prefixed with a compressed summary of its preceding context:

$$c_i' = \text{summarize}(c_1 \oplus \cdots \oplus c_{i-1}) \;\oplus\; c_i$$

This provides positional context without full content duplication, at the cost of summary generation. The summary prefix is typically bounded to $\tau_{\text{prefix}}$ tokens:

$$\text{tokens}\big(\text{summarize}(c_1 \oplus \cdots \oplus c_{i-1})\big) \le \tau_{\text{prefix}}$$
9.9.6 Decision Matrix#
| Strategy | Boundary Coherence | Storage Cost | Compute Cost | Best For |
|---|---|---|---|---|
| No overlap ($o = 0$) | Low | $1.0\times$ | Minimal | AST-bounded code, clean structural splits |
| Fixed overlap (10–20%) | Medium | $1.1\times$–$1.25\times$ | Low | General prose, documentation |
| High overlap (50%) | High | $2.0\times$ | Moderate | Dense technical text with cross-reference dependencies |
| Sentence-aligned overlap | Medium-High | $1.1\times$–$1.3\times$ | Low | Narrative text, legal documents |
| Context window prefix | High | $1.0\times$ + summary cost | High (LLM calls) | Long-form analysis, reports |
9.10 Chunk Metadata Enrichment: Section Title, Document Position, Entity Tags, Summary, Parent Pointer#
9.10.1 Principle#
A chunk without metadata is an anonymous text fragment. In production retrieval systems, metadata is the primary lever for non-semantic filtering, ranking, and provenance tracking. Every chunk must carry a rich, typed metadata envelope that enables the retrieval engine to filter, re-rank, and attribute.
9.10.2 Metadata Schema#
The canonical chunk metadata schema for an agentic retrieval system:
| Field | Type | Source | Purpose |
|---|---|---|---|
| chunk_id | UUID | Generated | Unique identifier, idempotent re-chunking |
| source_doc_id | UUID | Ingestion pipeline | Provenance: which document |
| source_doc_title | string | Document metadata | Human-readable provenance |
| section_path | string[] | Structural parser | Breadcrumb: ["Ch3", "Sec3.2", "Sub3.2.1"] |
| position_index | int | Chunking engine | Ordinal position within document (0-indexed) |
| total_chunks | int | Chunking engine | Total chunks from this document |
| chunk_type | enum | Strategy selector | structural, semantic, agentic, code, tabular, multimodal |
| token_count | int | Tokenizer | Exact token count for budget planning |
| entity_tags | string[] | NER pipeline | Named entities mentioned: people, orgs, dates, products |
| summary | string | LLM or extractive | 1–2 sentence summary of chunk content |
| parent_chunk_id | UUID | Hierarchical chunker | Pointer to parent chunk (null if root) |
| children_chunk_ids | UUID[] | Hierarchical chunker | Pointers to child chunks |
| language | string | Detector | Natural language or programming language |
| created_at | timestamp | Ingestion pipeline | Ingestion timestamp |
| source_version | string | Version control | Git commit, document revision, API version |
| freshness_score | float | Decay function | Time-decayed recency weight $e^{-\lambda \cdot \text{age}}$ |
| authority_score | float | Source ranking | Confidence in source reliability |
| modality | enum | Modality extractor | text, image, table, audio, video |
| coherence_score | float | Quality scorer | Intra-chunk semantic coherence (§9.3.4) |
9.10.3 Pseudo-Algorithm: Metadata Enrichment Pipeline#
ALGORITHM: EnrichChunkMetadata(chunks[], ner_fn, summarize_fn, embed_fn)
──────────────────────────────────────────────────────────────────────────
INPUT:
chunks[] — raw chunks from any chunking strategy
ner_fn — named entity recognition function
summarize_fn — summarization function (extractive or LLM)
embed_fn — embedding function
OUTPUT:
enriched[] — chunks with complete metadata
PROCEDURE:
1. FOR EACH chunk c IN chunks:
// Entity extraction
c.entity_tags ← ner_fn(c.content)
// Returns: [{"text": "OpenAI", "type": "ORG"}, ...]
2. // Batch summarization for efficiency
unsummarized ← [c FOR c IN chunks WHERE c.summary IS NULL]
IF |unsummarized| > 0:
summaries ← BatchSummarize(
[c.content FOR c IN unsummarized],
summarize_fn,
max_summary_tokens=50
)
FOR i, c IN Enumerate(unsummarized):
c.summary ← summaries[i]
3. // Coherence scoring
FOR EACH chunk c IN chunks:
sent_embeds ← [embed_fn(s) FOR s IN SentenceTokenize(c.content)]
IF |sent_embeds| ≥ 2:
c.coherence_score ← MeanPairwiseCosine(sent_embeds)
ELSE:
c.coherence_score ← 1.0 // Single-sentence chunk is trivially coherent
4. // Freshness scoring
λ ← 0.01 // Decay rate (configurable per source type)
FOR EACH chunk c IN chunks:
age_days ← (Now() - c.created_at).days
c.freshness_score ← Exp(-λ · age_days)
5. // Position normalization
doc_groups ← GroupBy(chunks, key=source_doc_id)
FOR EACH doc_id, group IN doc_groups:
total ← |group|
FOR i, c IN Enumerate(SortByPosition(group)):
c.position_index ← i
c.total_chunks ← total
c.relative_position ← i / total // 0.0 = start, 1.0 = end
6. // Generate chunk IDs (deterministic for idempotency)
FOR EACH chunk c IN chunks:
c.chunk_id ← DeterministicUUID(
c.source_doc_id, c.position_index, c.chunk_type, c.source_version
)
// Same content + same version → same ID (enables dedup on re-ingestion)
7. RETURN chunks
9.10.4 Metadata-Driven Retrieval Filtering#
At query time, metadata enables pre-retrieval filtering that dramatically reduces the candidate set:

$$\mathcal{C}_{\text{cand}} = \{\, c \in \mathcal{C} : \text{freshness}(c) \ge \phi_{\min} \;\wedge\; \text{language}(c) = \ell_q \;\wedge\; \text{entity\_tags}(c) \cap E_q \neq \varnothing \,\}$$

where $\phi_{\min}$ is the minimum freshness threshold, $\ell_q$ is the query language, and $E_q$ is the set of entities mentioned in the query. This filtering occurs before embedding similarity computation, reducing both latency and false positives.
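A sketch of the pre-retrieval filter over chunk records (field names follow the schema in §9.10.2):

```python
def prefilter(chunks: list[dict], min_freshness: float,
              query_lang: str, query_entities: set[str]) -> list[dict]:
    """Metadata-driven candidate pruning, applied before any
    embedding-similarity computation."""
    return [
        c for c in chunks
        if c["freshness_score"] >= min_freshness
        and c["language"] == query_lang
        and query_entities & set(c["entity_tags"])
    ]
```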
9.10.5 Summary as a Retrieval Proxy#
Chunk summaries serve a dual purpose:
- Retrieval proxy: Embed and index the summary instead of (or in addition to) the full chunk. Summaries are denser and produce more discriminative embeddings.
- Context compression: When injecting chunks into the agent's context window, the summary can replace the full content for lower-priority chunks, conserving token budget.
The dual-embedding strategy indexes both:

$$v_c = \text{embed}(c.\text{content}), \qquad v_s = \text{embed}(c.\text{summary})$$

Retrieval computes similarity against both and takes the maximum:

$$\text{score}(q, c) = \max\big( \cos(v_q, v_c), \; \cos(v_q, v_s) \big)$$
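A NumPy sketch of the max-fusion score, assuming unit-normalized vectors stacked row-wise (one row per chunk):

```python
import numpy as np

def dual_embedding_scores(v_q: np.ndarray,
                          content_vecs: np.ndarray,
                          summary_vecs: np.ndarray) -> np.ndarray:
    """Per-chunk score: max of query-vs-content and query-vs-summary
    cosine similarity (all vectors L2-normalized)."""
    return np.maximum(content_vecs @ v_q, summary_vecs @ v_q)

# Rank chunks by descending fused score:
# order = np.argsort(-dual_embedding_scores(v_q, C, S))
```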
9.11 Adaptive Chunking: Runtime Chunk Size Adjustment Based on Query Complexity and Token Budget#
9.11.1 Motivation#
Static chunking assumes a fixed optimal chunk size at ingestion time. However, the optimal retrieval granularity depends on the query at runtime:
- A broad question ("Summarize the company's Q3 performance") benefits from large, summary-level chunks
- A precise question ("What was the EBITDA margin in Q3 2023?") benefits from small, fact-level chunks
- A complex multi-hop question ("Compare Q3 margins across all subsidiaries") requires multiple medium-granularity chunks
Adaptive chunking adjusts the effective chunk granularity at query time without re-indexing.
9.11.2 Query Complexity Classification#
Define a query complexity classifier:

$$\kappa: q \;\mapsto\; \{\text{factoid}, \; \text{analytical}, \; \text{comparative}, \; \text{exploratory}, \; \text{multi-hop}\}$$
Each complexity class maps to a preferred granularity level:
| Query Class | Preferred Granularity | Chunk Size Range | Strategy |
|---|---|---|---|
| Factoid | Fine (propositions, small chunks) | 50–150 tokens | Retrieve from agentic/leaf-level index |
| Analytical | Medium (paragraphs, sections) | 200–500 tokens | Retrieve from structural/semantic index |
| Comparative | Medium + multi-entity | 200–500 tokens | Retrieve per entity, align by schema |
| Exploratory | Coarse (summaries, full sections) | 500–1500 tokens | Retrieve from summary/parent-level index |
| Multi-hop | Mixed (fine + medium) | Varies | Iterative retrieval with decomposed sub-queries |
9.11.3 Token Budget-Aware Chunk Selection#
Given a context budget $B_{\text{context}}$, the number of retrievable chunks is:

$$k = \Big\lfloor \frac{B_{\text{context}} - B_{\text{system}} - \text{tokens}(q) - B_{\text{reserve}}}{\bar{\tau}_{\text{chunk}}} \Big\rfloor$$

where:

- $B_{\text{system}}$: tokens consumed by system prompt, role policy, tool definitions
- $\text{tokens}(q)$: tokens in the user query
- $B_{\text{reserve}}$: tokens reserved for model generation
- $\bar{\tau}_{\text{chunk}}$: average chunk size for the selected granularity level
This ensures the retrieval engine never overfills the context window.
9.11.4 Runtime Chunk Merging and Splitting#
If the index stores chunks at a single granularity but the query demands a different one, runtime transformation is applied:
Runtime Merging (for coarser granularity):

$$c_{\text{merged}} = c_i \oplus c_{i+1} \oplus \cdots \oplus c_{i+m}$$

where $c_i, \dots, c_{i+m}$ are consecutive chunks from the same document, and $m$ is chosen such that $\text{tokens}(c_{\text{merged}}) \le \tau_{\text{target}}$.

Runtime Splitting (for finer granularity):

$$c \;\mapsto\; \text{SentenceBoundarySplit}(c, \; \tau_{\text{fine}})$$
Alternatively, if a hierarchical index (§9.4) is available, the retrieval engine simply navigates to the appropriate level.
9.11.5 Pseudo-Algorithm: Adaptive Retrieval with Dynamic Granularity#
ALGORITHM: AdaptiveRetrieve(q, index, B_context, B_system, B_reserve)
────────────────────────────────────────────────────────────────────────
INPUT:
q — user query
index — hierarchical chunk index with multiple granularity levels
B_context — total context window budget (tokens)
B_system — system prompt token consumption
B_reserve — tokens reserved for generation
OUTPUT:
context[] — ordered list of chunks for context injection
PROCEDURE:
1. // Classify query complexity
κ ← ClassifyQueryComplexity(q)
// κ ∈ {factoid, analytical, comparative, exploratory, multi_hop}
2. // Determine granularity level and chunk budget
level ← GranularityMap[κ]
avg_size ← index.AverageChunkSize(level)
B_available ← B_context - B_system - TokenCount(q) - B_reserve
k ← Floor(B_available / avg_size)
k ← Clamp(k, min=3, max=20) // Sensible bounds
3. // Retrieve at target granularity
candidates ← index.Retrieve(q, level=level, top_k=k * 3)
// Over-retrieve by 3× for re-ranking headroom
4. // Re-rank with cross-encoder or metadata-weighted scoring
FOR EACH c IN candidates:
c.relevance ← CrossEncoderScore(q, c.content)
c.final_score ← (
w_rel · c.relevance
+ w_fresh · c.freshness_score
+ w_auth · c.authority_score
+ w_coh · c.coherence_score
)
5. candidates ← SortByFinalScore(candidates, descending=True)
6. // Greedy token-budget packing
context ← []
tokens_used ← 0
FOR EACH c IN candidates:
IF tokens_used + c.token_count ≤ B_available:
APPEND c TO context
tokens_used ← tokens_used + c.token_count
ELSE:
// Try summary version if available
IF c.summary IS NOT NULL AND tokens_used + TokenCount(c.summary) ≤ B_available:
APPEND SummaryChunk(c) TO context
tokens_used ← tokens_used + TokenCount(c.summary)
ELSE:
BREAK // Budget exhausted
7. // For multi-hop queries: decompose and iterate
IF κ = "multi_hop":
sub_queries ← DecomposeQuery(q)
FOR EACH sq IN sub_queries:
sub_context ← AdaptiveRetrieve(sq, index, B_available / |sub_queries|, 0, 0)
EXTEND context WITH sub_context
context ← Deduplicate(context)
context ← TruncateToFit(context, B_available)
8. // Order chunks by document position for coherent reading
context ← SortByDocumentPosition(context)
9. RETURN context
9.11.6 Formal Objective: Budget-Optimal Chunk Selection#
The adaptive retrieval problem can be formalized as a constrained optimization:

$$\max_{x \in \{0,1\}^n} \; \sum_{i=1}^{n} x_i \cdot \text{score}(c_i) \qquad \text{s.t.} \qquad \sum_{i=1}^{n} x_i \cdot \tau_i \le B_{\text{available}}$$

This is a variant of the 0-1 knapsack problem with item weight $\tau_i$ (token count) and value $\text{score}(c_i)$. For practical purposes, the greedy approximation yields near-optimal solutions: rank candidates by the score-per-token ratio $\text{score}(c_i)/\tau_i$ in descending order and pack until $B_{\text{available}}$ is exhausted.

Alternatively, for small candidate sets, dynamic programming yields an exact solution in $O(n \cdot B_{\text{available}})$.
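A sketch of the greedy score-per-token packer:

```python
def greedy_pack(candidates: list[dict], budget: int) -> list[dict]:
    """Greedy 0-1 knapsack approximation: sort by score-per-token,
    then pack until the token budget is exhausted."""
    ranked = sorted(candidates,
                    key=lambda c: c["score"] / c["token_count"],
                    reverse=True)
    packed, used = [], 0
    for c in ranked:
        if used + c["token_count"] <= budget:
            packed.append(c)
            used += c["token_count"]
    return packed
```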
9.12 Chunk Quality Metrics: Retrieval Precision Impact, Contextual Completeness, Synthesis Utility#
9.12.1 Motivation#
Chunking quality is not an abstract property — it manifests in measurable downstream effects on retrieval precision, context coherence, and agent output quality. A rigorous chunk quality framework defines metrics at three levels: chunk-level (intrinsic quality), retrieval-level (search effectiveness), and synthesis-level (downstream generation quality).
9.12.2 Chunk-Level Intrinsic Metrics#
9.12.2.1 Coherence Score#
As defined in §9.3.4:

$$\text{Coherence}(c) = \frac{2}{m(m-1)} \sum_{i < j} \cos(e_i, e_j)$$

where $\{s_1, \dots, s_m\}$ is the set of sentences in chunk $c$ and $e_i$ are the sentence embeddings. Range: $[-1, 1]$; target: at or above the domain-calibrated threshold $\theta$ of §9.3.4.
9.12.2.2 Self-Containedness Score#
Measures whether a chunk can be understood independently:

$$\text{SelfContained}(c) = 1 - \frac{|\text{unresolved references in } c|}{|\text{references in } c|}$$

where unresolved references include dangling pronouns ("it," "they," "the above"), undefined acronyms, and incomplete sentences. Range: $[0, 1]$; target: as close to 1 as the corpus allows.
9.12.2.3 Information Density#
$$\text{Density}(c) = \frac{|\text{distinct information units in } c|}{\text{token\_count}(c)}$$

where information units may be operationalized as propositions or named entities. Higher density indicates more information per token, meaning fewer tokens are wasted on filler or repetition.
9.12.2.4 Boundary Quality#
Assesses whether chunk boundaries respect semantic units:

$$\text{BoundaryQuality}(C) = \frac{|\{\text{chunk boundaries coinciding with sentence or paragraph ends}\}|}{|\{\text{chunk boundaries}\}|}$$
9.12.3 Retrieval-Level Metrics#
9.12.3.1 Chunk Retrieval Precision at k#
Given a set of evaluation queries $Q$ with ground-truth relevant chunks $R_q$ for each query $q$:

$$P@k = \frac{1}{|Q|} \sum_{q \in Q} \frac{|\hat{C}_q^{(k)} \cap R_q|}{k}$$

where $\hat{C}_q^{(k)}$ is the set of top-$k$ retrieved chunks for query $q$.
9.12.3.2 Chunk Recall at k#

$$R@k = \frac{1}{|Q|} \sum_{q \in Q} \frac{|\hat{C}_q^{(k)} \cap R_q|}{|R_q|}$$
9.12.3.3 Mean Reciprocal Rank (MRR)#
$$\text{MRR} = \frac{1}{|Q|} \sum_{q \in Q} \frac{1}{\text{rank}_q}$$

where $\text{rank}_q$ is the position of the first relevant chunk in the ranked retrieval list.
9.12.3.4 Normalized Discounted Cumulative Gain (NDCG)#
For graded relevance labels $rel_i$:

$$\text{NDCG}@k = \frac{\text{DCG}@k}{\text{IDCG}@k}, \qquad \text{DCG}@k = \sum_{i=1}^{k} \frac{2^{rel_i} - 1}{\log_2(i + 1)}$$

where $\text{IDCG}@k$ is the ideal DCG (relevance labels sorted in descending order).
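A sketch computing all four retrieval metrics for a single query (binary relevance; corpus-level figures are means over $Q$):

```python
import math

def retrieval_metrics(ranked_ids: list[str], relevant: set[str], k: int) -> dict:
    """P@k, R@k, reciprocal rank, and binary-relevance NDCG@k for one
    query's ranked retrieval list."""
    hits = [cid in relevant for cid in ranked_ids[:k]]
    rr = next((1.0 / r for r, cid in enumerate(ranked_ids, 1)
               if cid in relevant), 0.0)
    dcg = sum(1.0 / math.log2(i + 2) for i, h in enumerate(hits) if h)
    idcg = sum(1.0 / math.log2(i + 2) for i in range(min(len(relevant), k)))
    return {
        "precision@k": sum(hits) / k,
        "recall@k": sum(hits) / max(len(relevant), 1),
        "reciprocal_rank": rr,
        "ndcg@k": dcg / idcg if idcg > 0 else 0.0,
    }
```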
9.12.4 Synthesis-Level Metrics#
These measure the impact of chunk quality on the final agent output:
9.12.4.1 Faithfulness (Groundedness)#
$$\text{Faithfulness}(y) = \frac{|\{\, a \in \text{claims}(y) : C_R \vDash a \,\}|}{|\text{claims}(y)|}$$

The fraction of claims in the agent's output $y$ that are entailed by the retrieved context $C_R$. This is the primary hallucination metric, and chunking directly affects it: poorly chunked context that omits critical evidence forces the model to confabulate.
9.12.4.2 Answer Completeness#
$$\text{Completeness}(y) = \frac{|\text{claims}(y) \cap \text{claims}(y^*)|}{|\text{claims}(y^*)|}$$

where $y^*$ is the ground-truth answer. Incomplete chunking produces incomplete answers.
9.12.4.3 Context Utilization#
$$\text{Utilization} = \frac{|\{\, c \in C_R : c \text{ is used in } y \,\}|}{|C_R|}$$

Measures what fraction of retrieved chunks the model actually used. Low utilization indicates that chunking is producing low-relevance results.
9.12.5 Composite Chunk Quality Score#
A weighted composite metric for chunking strategy evaluation:

$$Q_{\text{chunk}} = w_1 \overline{\text{Coherence}} + w_2 \overline{\text{SelfContained}} + w_3 \cdot P@k + w_4 \cdot \text{NDCG}@k + w_5 \cdot \text{Faithfulness} + w_6 \cdot \text{Completeness}$$

subject to:

$$\sum_{j} w_j = 1, \qquad w_j \ge 0$$

where $\overline{\text{Coherence}}$ and $\overline{\text{SelfContained}}$ are corpus-level averages. This composite score enables A/B evaluation of chunking strategies against a fixed evaluation set.
9.12.6 Pseudo-Algorithm: Chunk Quality Evaluation Pipeline#
ALGORITHM: EvaluateChunkQuality(chunks[], Q_eval, ground_truth, agent_fn, embed_fn)
───────────────────────────────────────────────────────────────────────────────────────
INPUT:
chunks[] — chunked corpus under evaluation
Q_eval — evaluation query set with ground-truth answers and relevant chunks
ground_truth — mapping: query → (answer, relevant_chunk_ids)
agent_fn — agent pipeline function: (query, context) → response
embed_fn — embedding function
OUTPUT:
report — quality report with all metrics
PROCEDURE:
1. // Index chunks
index ← BuildIndex(chunks, embed_fn)
2. // Intrinsic metrics
coherence_scores ← [Coherence(c, embed_fn) FOR c IN chunks]
self_contained_scores ← [SelfContainednessScore(c) FOR c IN chunks]
density_scores ← [InformationDensity(c) FOR c IN chunks]
boundary_scores ← [BoundaryQuality(c) FOR c IN chunks]
3. // Retrieval metrics
precision_at_k ← []
recall_at_k ← []
mrr_scores ← []
ndcg_scores ← []
FOR EACH (q, gt) IN Zip(Q_eval, ground_truth):
retrieved ← index.Retrieve(q, top_k=10)
retrieved_ids ← [c.chunk_id FOR c IN retrieved]
relevant_ids ← gt.relevant_chunk_ids
precision_at_k.APPEND(|Set(retrieved_ids[:k]) ∩ Set(relevant_ids)| / k)
recall_at_k.APPEND(|Set(retrieved_ids[:k]) ∩ Set(relevant_ids)| / |relevant_ids|)
mrr_scores.APPEND(1 / FirstRelevantRank(retrieved_ids, relevant_ids))
ndcg_scores.APPEND(NDCG(retrieved_ids, relevant_ids, k))
4. // Synthesis metrics
faithfulness_scores ← []
completeness_scores ← []
utilization_scores ← []
FOR EACH (q, gt) IN Zip(Q_eval, ground_truth):
retrieved ← index.Retrieve(q, top_k=5)
response ← agent_fn(q, retrieved)
claims_y ← ExtractClaims(response)
claims_gt ← ExtractClaims(gt.answer)
faithfulness_scores.APPEND(
|EntailedClaims(claims_y, retrieved)| / Max(|claims_y|, 1)
)
completeness_scores.APPEND(
|Set(claims_y) ∩ Set(claims_gt)| / Max(|claims_gt|, 1)
)
utilization_scores.APPEND(
|UsedChunks(response, retrieved)| / Max(|retrieved|, 1)
)
5. report ← {
intrinsic: {
mean_coherence: Mean(coherence_scores),
mean_self_contained: Mean(self_contained_scores),
mean_density: Mean(density_scores),
mean_boundary: Mean(boundary_scores)
},
retrieval: {
precision_at_k: Mean(precision_at_k),
recall_at_k: Mean(recall_at_k),
mrr: Mean(mrr_scores),
ndcg_at_k: Mean(ndcg_scores)
},
synthesis: {
faithfulness: Mean(faithfulness_scores),
completeness: Mean(completeness_scores),
utilization: Mean(utilization_scores)
},
composite: ComputeCompositeScore(above, weights)
}
6. RETURN report
9.13 Chunk Storage and Indexing: Vector Stores, Inverted Indexes, Hybrid Index Structures#
9.13.1 Architectural Requirements#
Chunks, once produced and enriched, must be stored in a retrieval infrastructure that supports:
- Semantic search: approximate nearest neighbor (ANN) over dense embeddings
- Keyword search: exact match, BM25, inverted index queries
- Metadata filtering: pre-retrieval filtering on entity tags, document ID, date ranges, chunk type
- Hierarchical navigation: parent-child traversal for chunk tree indexes
- Freshness-aware retrieval: time-weighted scoring
- Scalability: sub-100ms retrieval latency at millions-to-billions of chunks
- Idempotent upsert: re-chunking the same document version produces no duplicates
9.13.2 Vector Store Architecture#
Dense embedding retrieval stores each chunk $c$ as a vector $v_c = \text{embed}(c)$ alongside its metadata. The retrieval operation for query $q$ is:

$$\operatorname{TopK}_{c \,\in\, \mathcal{C}} \; \text{sim}(v_q, v_c)$$

where $\text{sim}$ is typically cosine similarity or inner product.
ANN Index Structures#
| Index Type | Build Time | Query Time | Memory | Recall | Best For |
|---|---|---|---|---|---|
| Flat (brute-force) | $O(1)$ | $O(N \cdot d)$ | $O(N \cdot d)$ | 100% | Small corpora ($\lesssim 10^5$ chunks) |
| IVF (Inverted File) | $O(N)$ | $O\big((N/n_{\text{list}}) \cdot n_{\text{probe}} \cdot d\big)$ | $O(N \cdot d)$ | 95–99% | Medium corpora ($10^5$–$10^7$) |
| HNSW (Hierarchical NSW) | $O(N \log N)$ | $O(\log N)$ | $O(N \cdot d + N \cdot M)$ | 97–99.9% | Production systems requiring high recall |
| PQ (Product Quantization) | $O(N)$ | Compressed linear scan | $\ll O(N \cdot d)$ | 90–95% | Billion-scale with memory constraints |
| ScaNN | $O(N)$ | Sublinear | Compressed | 98–99% | Google-scale retrieval |
For production agentic systems, HNSW is the default recommendation due to its logarithmic query time, high recall, and mature ecosystem support.
The HNSW parameters:

- $M$: maximum number of connections per node (typical: 16–64)
- $ef_{\text{construction}}$: search depth during index build (typical: 100–400)
- $ef_{\text{search}}$: search depth during query (typical: 50–400; tune for the recall-latency trade-off)

The recall-latency trade-off is governed by:

$$\text{recall} \approx 1 - e^{-\beta \cdot ef_{\text{search}}}, \qquad \text{latency} \propto ef_{\text{search}} \cdot \log N$$

where $\beta$ depends on data dimensionality and distribution.
9.13.3 Inverted Index for Keyword Search#
For exact-match and keyword-based retrieval, an inverted index maps terms to chunk IDs:

$$I(t) = \{\, c \in \mathcal{C} : t \in \text{terms}(c) \,\}$$

BM25 scoring for a query $q$ against chunk $c$ (a scoring sketch follows the symbol list):

$$\text{BM25}(q, c) = \sum_{t \in q} \text{IDF}(t) \cdot \frac{f(t, c) \cdot (k_1 + 1)}{f(t, c) + k_1 \cdot \big(1 - b + b \cdot \frac{|c|}{\text{avgdl}}\big)}, \qquad \text{IDF}(t) = \ln \frac{N - n_t + 0.5}{n_t + 0.5}$$

where:

- $f(t, c)$: term frequency of $t$ in chunk $c$
- $N$: total number of chunks
- $n_t$: number of chunks containing term $t$
- $k_1 \in [1.2, 2.0]$, $b = 0.75$: tuning parameters
- $\text{avgdl}$: average chunk length in tokens
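A sketch of per-chunk BM25 scoring given corpus statistics (the +1 inside the log is the Lucene-style variant that keeps IDF non-negative):

```python
import math
from collections import Counter

def bm25_score(query_terms: list[str], chunk_terms: list[str],
               df: dict[str, int], n_chunks: int, avgdl: float,
               k1: float = 1.5, b: float = 0.75) -> float:
    """BM25 score of one chunk for a query, given document
    frequencies df[t] and average chunk length avgdl."""
    tf = Counter(chunk_terms)
    norm = k1 * (1 - b + b * len(chunk_terms) / avgdl)
    score = 0.0
    for t in query_terms:
        if tf[t] == 0:
            continue
        n_t = df.get(t, 0)
        idf = math.log((n_chunks - n_t + 0.5) / (n_t + 0.5) + 1)
        score += idf * tf[t] * (k1 + 1) / (tf[t] + norm)
    return score
```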
9.13.4 Hybrid Index: Dense + Sparse Fusion#
Production systems combine vector search and keyword search through reciprocal rank fusion (RRF) or linear score combination:
Reciprocal Rank Fusion (RRF)#
$$\text{RRF}(c) = \sum_{r \in R} \frac{1}{K + \text{rank}_r(c)}$$

where $R$ is the set of ranking systems (dense + sparse), $\text{rank}_r(c)$ is the rank of chunk $c$ in system $r$, and $K$ is a smoothing constant (typically $K = 60$).
Linear Score Combination#
$$\text{score}(c) = \alpha \cdot \widehat{s}_{\text{dense}}(c) + (1 - \alpha) \cdot \widehat{s}_{\text{sparse}}(c)$$

where $\widehat{s}$ denotes min-max normalized scores and $\alpha \in [0, 1]$ is tuned per domain. For most knowledge-base retrieval tasks, $\alpha \approx 0.6$ (slightly favoring semantic search) performs well.
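A sketch of RRF over any number of ranked ID lists, with $K = 60$ matching the config in §9.13.7:

```python
from collections import defaultdict

def rrf_fuse(rankings: list[list[str]], K: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: each system contributes 1/(K + rank)
    per chunk; chunks are re-ranked by summed contributions."""
    scores: dict[str, float] = defaultdict(float)
    for ranked_ids in rankings:
        for rank, cid in enumerate(ranked_ids, start=1):
            scores[cid] += 1.0 / (K + rank)
    return sorted(scores, key=scores.get, reverse=True)

# fused = rrf_fuse([dense_ids, sparse_ids])
```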
9.13.5 Metadata Index Layer#
A separate metadata index enables pre-retrieval filtering. This is typically implemented as:
- Structured metadata store: PostgreSQL, SQLite, or document store with indexed fields
- Filter pushdown: Metadata filters are applied before ANN search, reducing the candidate set:

$$\mathcal{C}_{\text{filt}} = \{\, c \in \mathcal{C} : \text{pred}(c.\text{metadata}) \,\}$$

where $|\mathcal{C}_{\text{filt}}| \ll |\mathcal{C}|$.
Common filter predicates:
source_doc_id = "doc_xyz"
created_at >= "2024-01-01"
chunk_type IN ["structural", "semantic"]
entity_tags CONTAINS "OpenAI"
language = "python"
freshness_score >= 0.5
9.13.6 Hierarchical Index for Multi-Granularity Retrieval#
For hierarchical chunks (§9.4), the index supports level-aware retrieval:
Level 1 (summaries): Index_L1 → HNSW over summary embeddings
Level 2 (sections): Index_L2 → HNSW over section embeddings
Level 3 (paragraphs): Index_L3 → HNSW over paragraph embeddings
Level 4 (propositions): Index_L4 → HNSW over proposition embeddings

The retrieval router selects the index level based on query complexity (§9.11.2) and then optionally expands to parent or child chunks via the chunk tree structure stored in the metadata layer.
9.13.7 Pseudo-Algorithm: Hybrid Index Build and Query#
```
ALGORITHM: BuildHybridIndex(chunks[], embed_fn, tokenizer)
────────────────────────────────────────────────────────────
INPUT:
chunks[] — enriched chunks with metadata
embed_fn — embedding function: string → ℝ^d
tokenizer — tokenizer for BM25 term extraction
OUTPUT:
hybrid_index — composite index supporting dense, sparse, and metadata queries
PROCEDURE:
1. // Dense index construction
dense_index ← InitHNSW(dim=d, M=32, efConstruction=400)
FOR EACH chunk c IN chunks:
v ← embed_fn(c.content)
dense_index.Add(c.chunk_id, v)
2. // Sparse index construction (BM25)
sparse_index ← InitInvertedIndex()
FOR EACH chunk c IN chunks:
terms ← tokenizer.Tokenize(c.content)
sparse_index.AddDocument(c.chunk_id, terms, metadata={
doc_length: |terms|
})
sparse_index.ComputeIDF()
3. // Metadata index construction
metadata_store ← InitStructuredStore(schema=ChunkMetadataSchema)
FOR EACH chunk c IN chunks:
metadata_store.Upsert(c.chunk_id, c.metadata)
// Upsert: idempotent — same chunk_id overwrites cleanly
4. // Summary embedding index (optional, for dual-embedding)
summary_index ← InitHNSW(dim=d, M=16, efConstruction=200)
FOR EACH chunk c IN chunks WHERE c.summary IS NOT NULL:
v_s ← embed_fn(c.summary)
summary_index.Add(c.chunk_id, v_s)
5. hybrid_index ← {
dense: dense_index,
sparse: sparse_index,
metadata: metadata_store,
summary: summary_index,
config: {α: 0.6, K_rrf: 60}
}
6. RETURN hybrid_index
```
```
ALGORITHM: HybridQuery(hybrid_index, q, k, filters, α, embed_fn, tokenizer)
───────────────────────────────────────────────────────
INPUT:
hybrid_index — composite index
q — query string
k — number of results to return
filters — metadata filter predicates
α — dense/sparse weight (override or use default)
embed_fn — embedding function used at build time (needed to compute v_q)
tokenizer — tokenizer used to build the sparse index
OUTPUT:
results[] — ranked list of (chunk_id, score, chunk) tuples
PROCEDURE:
1. // Apply metadata filters
candidate_ids ← hybrid_index.metadata.Query(filters)
// Returns set of chunk_ids matching filter predicates
2. // Dense retrieval (restricted to candidates)
v_q ← embed_fn(q)
dense_results ← hybrid_index.dense.Search(
v_q, top_k=k * 3, restrict_to=candidate_ids
)
// Returns: [(chunk_id, cosine_score), ...]
3. // Sparse retrieval (restricted to candidates)
q_terms ← tokenizer.Tokenize(q)
sparse_results ← hybrid_index.sparse.Search(
q_terms, top_k=k * 3, restrict_to=candidate_ids
)
// Returns: [(chunk_id, bm25_score), ...]
4. // Score fusion (fusion_mode and K_rrf are read from hybrid_index.config)
IF fusion_mode = "RRF":
// Reciprocal Rank Fusion
scores ← DefaultDict(float)
FOR rank, (cid, _) IN Enumerate(dense_results):
scores[cid] += 1.0 / (K_rrf + rank + 1)
FOR rank, (cid, _) IN Enumerate(sparse_results):
scores[cid] += 1.0 / (K_rrf + rank + 1)
ELSE IF fusion_mode = "linear":
// Normalize scores to [0, 1]
dense_norm ← MinMaxNormalize(dense_results)
sparse_norm ← MinMaxNormalize(sparse_results)
scores ← DefaultDict(float)
FOR (cid, s) IN dense_norm:
scores[cid] += α · s
FOR (cid, s) IN sparse_norm:
scores[cid] += (1 - α) · s
5. // Sort and return top-k
ranked ← SortByScore(scores, descending=True)[:k]
6. // Fetch full chunk content and metadata
results ← []
FOR EACH (cid, score) IN ranked:
chunk ← hybrid_index.metadata.GetChunk(cid)
APPEND (cid, score, chunk) TO results
7. RETURN results
```

9.13.8 Storage Architecture Patterns#
Pattern 1: Integrated Vector Database#
A single system (e.g., Weaviate, Qdrant, Milvus, Pinecone) serves as both vector store and metadata store:
```
[Chunks] → [Embedding] → [Vector DB with metadata filtering]
                             ├── HNSW index (dense)
                             ├── BM25 index (sparse, if supported)
                             └── Metadata index (filterable fields)
```

Pros: Operational simplicity, atomic upserts, consistent filtering. Cons: May lack full-featured BM25 or complex metadata query support.
Pattern 2: Disaggregated Stores#
Separate systems for each concern:
```
[Chunks] → [Embedding] → [HNSW Vector Store (Qdrant, Faiss)]
                       → [Elasticsearch / OpenSearch (BM25 + metadata)]
                       → [PostgreSQL (chunk metadata, hierarchical pointers)]
```

Pros: Best-of-breed per concern, independent scaling. Cons: Operational complexity, consistency management, query fan-out latency.
Pattern 3: Lakehouse with Vector Index Overlay#
Chunks stored in a data lakehouse (Delta Lake, Iceberg) with vector index acceleration:
```
[Parquet files in object store]
    ├── Column: chunk_id, content, metadata (structured)
    ├── Column: embedding (vector)
    └── [Vector index sidecar: HNSW over embedding column]
```

Pros: Cost-effective for very large corpora, SQL analytics over chunks, version control. Cons: Higher query latency, complex infrastructure.
9.13.9 Idempotent Upsert and Versioning#
Re-ingestion of a document must not produce duplicate chunks. The idempotency contract: chunk IDs are deterministic functions of document identity and chunk content, e.g. $\text{chunk\_id} = H(\text{source\_doc\_id} \,\|\, \text{position} \,\|\, \text{content})$ for a collision-resistant hash $H$, so the same document version always yields the same IDs.

On re-ingestion:
- Compute new chunk IDs for the updated document
- Delete chunks whose `source_doc_id` matches but whose IDs are no longer in the new set
- Upsert new/changed chunks
- Index entries are atomically updated
This ensures the index is always consistent with the latest document version without manual deduplication.
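A minimal sketch of this contract, assuming a placeholder store interface whose method names (`ids_for_doc`, `delete`, `upsert`) are hypothetical:

```python
import hashlib

def chunk_id_for(source_doc_id: str, position: int, content: str) -> str:
    """Deterministic chunk ID: identical inputs always produce the same ID."""
    digest = hashlib.sha256(f"{source_doc_id}:{position}:{content}".encode()).hexdigest()
    return f"{source_doc_id}:{digest[:16]}"

def reingest(index, source_doc_id, new_chunks):
    """Diff-based re-ingestion honoring the idempotency contract above."""
    new_ids = {chunk_id_for(source_doc_id, i, c) for i, c in enumerate(new_chunks)}
    stale_ids = index.ids_for_doc(source_doc_id) - new_ids  # removed or changed
    index.delete(stale_ids)
    for i, content in enumerate(new_chunks):
        index.upsert(chunk_id_for(source_doc_id, i, content), content)
```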
9.13.10 Cache Hierarchy for Retrieval Artifacts#
To minimize retrieval latency under repeated or similar queries:
| Cache Layer | Scope | TTL | Content |
|---|---|---|---|
| Query embedding cache | Per query hash | 1 hour | Precomputed query embedding $v_q = \text{embed}(q)$ |
| Result cache | Per (query hash, filter hash) | 5 minutes | Top-k chunk IDs and scores |
| Chunk content cache | Per chunk ID | 24 hours | Full chunk content + metadata |
| Cross-encoder score cache | Per (query, chunk ID) | 1 hour | Re-ranking scores |
Cache invalidation is triggered by chunk upserts: any chunk ID that changes invalidates all result caches containing it.
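A minimal TTL-cache sketch illustrating the layering and the upsert-driven invalidation; class and method names are illustrative, and the TTLs follow the table above:

```python
import time

class TTLCache:
    """Tiny in-process TTL cache; a sketch, not a production cache."""

    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self._store = {}  # key -> (expires_at, value)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None or entry[0] < time.monotonic():
            self._store.pop(key, None)  # expired or absent
            return None
        return entry[1]

    def put(self, key, value):
        self._store[key] = (time.monotonic() + self.ttl, value)

    def invalidate_where(self, predicate):
        """Drop entries whose cached value references an upserted chunk."""
        for key in [k for k, (_, v) in self._store.items() if predicate(v)]:
            del self._store[key]

# Layered per the table: query embeddings for 1 hour, result sets for 5 minutes
embedding_cache = TTLCache(3600)
result_cache = TTLCache(300)
```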
9.13.11 Observability and Operational Metrics#
| Metric | Description | Alert Threshold |
|---|---|---|
| `retrieval_latency_p99` | 99th percentile retrieval latency | > 200 ms |
| `index_size_chunks` | Total chunks in index | Monitor growth rate |
| `embedding_throughput` | Chunks embedded per second | < 100/s triggers scaling |
| `cache_hit_rate` | Fraction of queries served from cache | < 50% indicates cold cache |
| `index_freshness_lag` | Time between document update and index update | > 5 minutes |
| `duplicate_chunk_rate` | Fraction of chunks with duplicate content | > 1% indicates idempotency failure |
| `metadata_completeness` | Fraction of chunks with all metadata fields populated | < 95% triggers enrichment pipeline review |
Chapter Summary#
Chunking is the foundational engineering discipline that determines the quality ceiling of every downstream retrieval and generation operation in an agentic system. This chapter has established:
- No universal strategy exists: Document class determines the optimal chunking approach. The system must classify documents and route to the appropriate strategy.
- Six primary strategies — structural, semantic, hierarchical, agentic, code-aware, and tabular — each formalized with mathematical definitions, pseudo-algorithms, and trade-off analyses.
- Multi-modal and adaptive extensions that handle heterogeneous media and runtime query-driven granularity adjustment.
- Overlap and boundary coherence techniques that mitigate information loss at chunk boundaries, with formal cost-expansion analysis.
- Metadata enrichment as a non-optional requirement: every chunk must carry provenance, position, entity tags, summary, coherence score, and hierarchical pointers.
- Quality metrics at three levels — intrinsic, retrieval, and synthesis — with a composite evaluation framework for A/B comparison of chunking strategies.
- Hybrid indexing infrastructure combining dense (HNSW), sparse (BM25), and metadata indexes with formal fusion algorithms, cache hierarchies, idempotent upsert contracts, and production observability.
The key architectural principle is: chunking is not preprocessing; it is a continuously evaluated, document-class-aware, query-adaptive retrieval precision lever that must be engineered with the same rigor as any other production subsystem in the agentic stack.