Agentic Notes Library

Chapter 9: Chunking Strategies — Document-Class-Specific Segmentation for Retrieval Precision

March 20, 2026

Preamble#

Chunking is the act of partitioning a source document into discrete retrieval units — the atomic segments that are embedded, indexed, retrieved, and injected into an agent's context window. In production agentic systems, chunking is not a preprocessing convenience; it is a retrieval precision lever that directly governs the signal-to-noise ratio of every context window the agent constructs. A poorly chunked corpus degrades retrieval recall, inflates token cost, introduces incoherent context boundaries, and ultimately amplifies hallucination risk. A well-engineered chunking layer, by contrast, produces retrieval units that are semantically self-contained, provenance-tagged, hierarchically navigable, and optimally sized for downstream synthesis.

This chapter develops chunking as a first-class engineering discipline within the agentic retrieval stack. We formalize each strategy — structural, semantic, hierarchical, agentic, code-aware, tabular, multi-modal, and adaptive — with mathematical definitions, pseudo-algorithms, quality metrics, and explicit trade-off analyses. Every strategy is evaluated against the requirements of retrieval precision, contextual completeness, synthesis utility, token efficiency, and production-grade scalability.


9.1 Chunking as a Retrieval Precision Lever: Why One Strategy Fails All Document Types#

9.1.1 The Fundamental Problem#

A retrieval-augmented agent constructs its context window \mathcal{C} from a set of retrieved chunks \{c_1, c_2, \ldots, c_k\} selected from an indexed corpus \mathcal{D}. The quality of the agent's output y is bounded by the quality of \mathcal{C}:

P(y \mid q, \mathcal{C}) \leq P(y \mid q, \mathcal{D}^*)

where q is the user query and \mathcal{D}^* is the ideal complete evidence set. The gap between P(y \mid q, \mathcal{C}) and P(y \mid q, \mathcal{D}^*) is the retrieval information loss, and chunking strategy is the primary determinant of that loss.

9.1.2 Why Uniform Chunking Fails#

A naïve fixed-size chunking strategy partitions a document D of length L tokens into n = \lceil L / s \rceil chunks of size s:

c_i = D[i \cdot s : (i+1) \cdot s], \quad i \in \{0, 1, \ldots, n-1\}

This approach introduces three classes of failure:

| Failure Class | Description | Consequence |
|---|---|---|
| Boundary incoherence | Splits occur mid-sentence, mid-paragraph, or mid-argument | Retrieved chunks are semantically incomplete, forcing the model to hallucinate missing context |
| Granularity mismatch | A single chunk size cannot match the natural information density of heterogeneous documents | Code files, legal contracts, tabular data, and narrative prose have radically different information topologies |
| Metadata destruction | Fixed splitting discards structural cues: headings, nesting, schema boundaries | Retrieval loses the ability to rank by document position, section relevance, or hierarchical containment |
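The boundary-incoherence failure is easy to reproduce. Below is a minimal sketch of the naïve fixed-size splitter from §9.1.2, using characters in place of tokens for brevity; the failure mode is identical with a real tokenizer:

```python
def fixed_size_chunks(doc: str, s: int) -> list[str]:
    """Naive splitter c_i = D[i*s : (i+1)*s]; no awareness of
    sentence, paragraph, or section boundaries."""
    n = -(-len(doc) // s)  # ceil(L / s)
    return [doc[i * s:(i + 1) * s] for i in range(n)]

# The first boundary of this toy document lands mid-word,
# illustrating boundary incoherence:
demo = fixed_size_chunks(
    "Chunking governs retrieval precision. Poor boundaries split sentences.", 40
)
```

The partition is lossless (concatenating the chunks reproduces the document), yet each individual chunk may be uninterpretable on its own.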

9.1.3 Document-Class Taxonomy#

Production corpora are heterogeneous. Each document class exhibits a distinct information topology that demands a specialized chunking strategy:

| Document Class | Information Topology | Optimal Chunking Strategy |
|---|---|---|
| Technical documentation | Heading-section hierarchy | Structural |
| Research papers | Argument-claim-evidence chains | Semantic + Hierarchical |
| Legal contracts | Clause-subclause nesting | Structural + Hierarchical |
| Source code | AST-defined scopes | Code (AST-based) |
| Spreadsheets / CSV / databases | Row-column schema | Tabular |
| Conversational transcripts | Turn-topic segmentation | Semantic |
| Multi-modal artifacts | Image-text-audio alignment | Multi-modal |
| Knowledge base articles | Proposition-fact granularity | Agentic |

9.1.4 The Chunking Design Objective#

Formally, the chunking function \text{CHUNK}: \mathcal{D} \rightarrow \{c_1, c_2, \ldots, c_n\} must simultaneously optimize:

\max_{\text{CHUNK}} \; \underbrace{\text{RetrievalPrecision}(\mathcal{C}, q)}_{\text{signal relevance}} + \lambda_1 \underbrace{\text{ContextualCompleteness}(c_i)}_{\text{self-containedness}} + \lambda_2 \underbrace{\text{SynthesisUtility}(c_i)}_{\text{downstream usability}}

subject to:

|c_i| \leq \tau_{\max} \quad \forall i, \qquad \sum_{i=1}^{k} |c_i| \leq B_{\text{context}}, \qquad \text{Provenance}(c_i) \neq \varnothing \quad \forall i

where \tau_{\max} is the maximum chunk token size, B_{\text{context}} is the agent's context token budget, and provenance is mandatory for every chunk.

9.1.5 Architectural Position of Chunking in the Agentic Stack#

Chunking sits at the critical junction between ingestion and retrieval:

Document Ingestion → [Classification] → [Strategy Selection] → [Chunking Engine]
    → [Metadata Enrichment] → [Embedding] → [Indexing] → Retrieval Engine

The chunking engine receives a classified document and a strategy selector, produces enriched chunks, and emits them to the embedding and indexing pipeline. This is not a batch-once operation; adaptive chunking (§9.11) permits runtime re-chunking based on query characteristics.


9.2 Structural Chunking: Heading, Section, Paragraph, and Markup-Aware Splitting#

9.2.1 Definition and Scope#

Structural chunking exploits the explicit organizational markers present in a document — headings, subheadings, paragraphs, list items, HTML/XML tags, Markdown formatting, LaTeX section commands, or PDF layout elements — to define chunk boundaries that align with the author's intended information hierarchy.

Let a document D be represented as an ordered tree \mathcal{T}(D) = (V, E) where each node v \in V corresponds to a structural element (section, subsection, paragraph, list item) and edges E represent containment. Structural chunking defines chunk boundaries at a target depth d of \mathcal{T}(D):

\text{CHUNK}_{\text{struct}}(D, d) = \{ \text{serialize}(v) \mid v \in V, \; \text{depth}(v) = d \}

If |\text{serialize}(v)| > \tau_{\max}, the node is recursively split at depth d+1.

9.2.2 Markup Detection and Parsing#

Structural chunking requires a format-specific parser that recovers the document tree:

| Source Format | Structural Signals | Parser Strategy |
|---|---|---|
| Markdown | #, ##, ###, ---, list markers | Regex + CommonMark AST |
| HTML | \<h1\>–\<h6\>, \<section\>, \<article\>, \<div\> | DOM tree traversal |
| LaTeX | \section, \subsection, \paragraph | LaTeX AST parser |
| PDF | Font size transitions, bold headings, indentation | Layout analysis (heuristic or ML-based) |
| DOCX/ODT | Style-tagged heading levels, \<w:pStyle\> | XML schema extraction |
| reStructuredText | Underline-delimited headings, directives | RST parser |

9.2.3 Pseudo-Algorithm: Structural Chunking#

ALGORITHM: StructuralChunk(D, format, τ_max, d_target)
───────────────────────────────────────────────────────
INPUT:
  D         — raw document content
  format    — detected document format (markdown, html, pdf, ...)
  τ_max     — maximum chunk token size
  d_target  — target structural depth for chunking
 
OUTPUT:
  chunks[]  — list of structurally-bounded chunks with metadata
 
PROCEDURE:
  1.  T ← ParseDocumentTree(D, format)
      // Recover the structural tree T = (V, E) from format-specific parser
 
  2.  candidates ← CollectNodesAtDepth(T, d_target)
      // Gather all nodes at target depth d_target
 
  3.  chunks ← []
 
  4.  FOR EACH node v IN candidates:
        text ← Serialize(v)           // Flatten node subtree to text
        tokens ← Tokenize(text)
 
        IF |tokens| ≤ τ_max:
            breadcrumb ← AncestorPath(v, T)
            // e.g., "Chapter 3 > Section 3.2 > Subsection 3.2.1"
 
            chunk ← {
                content:        text,
                token_count:    |tokens|,
                section_path:   breadcrumb,
                depth:          Depth(v),
                position_index: GlobalIndex(v, T),
                source_doc:     DocumentID(D),
                chunk_type:     "structural",
                parent_id:      ParentID(v, T),
                children_ids:   ChildrenIDs(v, T)
            }
            APPEND chunk TO chunks
 
        ELSE:
            // Node exceeds τ_max: recurse to finer depth
            sub_chunks ← StructuralChunk(
                Serialize(v), format, τ_max, d_target + 1
            )
            FOR EACH sc IN sub_chunks:
                sc.parent_id ← NodeID(v)
                APPEND sc TO chunks
 
  5.  // Handle leaf nodes with no structural children that still exceed τ_max
      FOR EACH chunk c IN chunks WHERE c.token_count > τ_max:
          splits ← ParagraphSplit(c.content, τ_max)
          REPLACE c IN chunks WITH splits
          // Paragraph-level fallback preserves sentence boundaries
 
  6.  RETURN chunks
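For documents with reliable Markdown markup, steps 1–4 of the algorithm collapse into a short implementation. The sketch below is illustrative only: it treats heading level as tree depth, builds the breadcrumb path as it scans, and omits the τ_max recursion and paragraph fallback of steps 4–5.

```python
import re

def markdown_structural_chunks(doc: str, depth: int = 2) -> list[dict]:
    """Split a Markdown document at headings of level <= depth,
    attaching the ancestor breadcrumb as metadata."""
    chunks, path, buf = [], [], []

    def flush():
        # Emit the buffered section, stamped with the current breadcrumb.
        if buf and "".join(buf).strip():
            chunks.append({
                "content": "".join(buf).strip(),
                "section_path": " > ".join(path),
                "chunk_type": "structural",
            })
        buf.clear()

    for line in doc.splitlines(keepends=True):
        m = re.match(r"^(#{1,6})\s+(.*)", line)
        if m and len(m.group(1)) <= depth:
            flush()
            level = len(m.group(1))
            del path[level - 1:]          # pop deeper-or-equal ancestors
            path.append(m.group(2).strip())
        buf.append(line)
    flush()
    return chunks
```

A document with headings "# A", "## B", "## C" yields three chunks whose section_path values read "A", "A > B", and "A > C".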

9.2.4 Paragraph-Level Splitting as Fallback#

When structural depth is exhausted (leaf nodes exceed \tau_{\max}), the fallback is paragraph-level splitting. A paragraph boundary is detected by:

\text{IsParagraphBreak}(i) = \begin{cases} 1 & \text{if } D[i] = \texttt{\\n\\n} \text{ or } D[i] \in \{\texttt{<p>}, \texttt{<br/>}\} \\ 0 & \text{otherwise} \end{cases}

This preserves sentence coherence while respecting token limits.

9.2.5 Trade-Off Analysis#

| Strength | Limitation |
|---|---|
| Preserves author-intended organization | Requires well-structured source documents |
| Breadcrumb metadata enables hierarchical retrieval | PDF and legacy formats have noisy structural signals |
| Chunk boundaries align with human reading units | Information density varies across sections; some chunks may be too sparse or too dense |
| Low computational cost | Cannot detect topical shifts within a single section |

9.2.6 When to Apply#

Structural chunking is the default first-pass strategy for any document with reliable markup. It should be combined with semantic chunking (§9.3) when sections contain multiple distinct topics, and with hierarchical chunking (§9.4) when parent-child retrieval is required.


9.3 Semantic Chunking: Topic Segmentation, Embedding Similarity Boundaries, Coherence Scoring#

9.3.1 Motivation#

Structural chunking fails when a document lacks reliable markup, or when a single structural section contains multiple distinct topics. Semantic chunking addresses this by detecting topical boundaries within text using distributional similarity measures.

9.3.2 Core Principle: Embedding Breakpoint Detection#

Given a document D segmented into an ordered sequence of sentences \{s_1, s_2, \ldots, s_N\}, each sentence is embedded into a vector space:

\mathbf{e}_i = \text{Embed}(s_i) \in \mathbb{R}^d

The inter-sentence similarity between consecutive sentences is:

\text{sim}(i, i+1) = \frac{\mathbf{e}_i \cdot \mathbf{e}_{i+1}}{\|\mathbf{e}_i\| \; \|\mathbf{e}_{i+1}\|}

A breakpoint is detected at position i when the similarity drops below a threshold \theta or exhibits a statistically significant local minimum:

\text{IsBreakpoint}(i) = \begin{cases} 1 & \text{if } \text{sim}(i, i+1) < \theta \\ 1 & \text{if } \text{sim}(i, i+1) < \mu_w - \alpha \cdot \sigma_w \\ 0 & \text{otherwise} \end{cases}

where \mu_w and \sigma_w are the mean and standard deviation of similarities within a sliding window of size w, and \alpha is a sensitivity parameter (typically \alpha \in [1.0, 2.0]).

9.3.3 Windowed Similarity with Smoothing#

To reduce noise from short sentences, compute a windowed embedding by averaging consecutive sentence embeddings:

\bar{\mathbf{e}}_i^{(w)} = \frac{1}{w} \sum_{j=i}^{i+w-1} \mathbf{e}_j

The smoothed similarity curve is:

\text{sim}_w(i) = \cos\left(\bar{\mathbf{e}}_i^{(w)}, \; \bar{\mathbf{e}}_{i+w}^{(w)}\right)

Breakpoints are then detected as local minima of \text{sim}_w(i) below the adaptive threshold.

9.3.4 Topic Coherence Scoring#

After chunking, each chunk c_j consisting of sentences \{s_{a_j}, \ldots, s_{b_j}\} is scored for internal coherence:

\text{Coherence}(c_j) = \frac{1}{\binom{|c_j|}{2}} \sum_{p < q} \cos(\mathbf{e}_p, \mathbf{e}_q)

A high-quality chunk exhibits \text{Coherence}(c_j) \geq \gamma for a domain-calibrated threshold \gamma.

The inter-chunk distinctiveness between adjacent chunks validates that breakpoints are meaningful:

\text{Distinctiveness}(c_j, c_{j+1}) = 1 - \cos\left(\frac{1}{|c_j|}\sum_{i \in c_j} \mathbf{e}_i, \; \frac{1}{|c_{j+1}|}\sum_{i \in c_{j+1}} \mathbf{e}_i\right)

Effective semantic chunking maximizes both intra-chunk coherence and inter-chunk distinctiveness.

9.3.5 Formal Objective#

\max_{\mathcal{B}} \; \sum_{j=1}^{|\mathcal{B}|+1} \text{Coherence}(c_j) + \beta \sum_{j=1}^{|\mathcal{B}|} \text{Distinctiveness}(c_j, c_{j+1})

subject to:

|c_j| \leq \tau_{\max} \quad \forall j, \qquad |c_j| \geq \tau_{\min} \quad \forall j

where \mathcal{B} is the set of breakpoint positions, \tau_{\min} prevents trivially small chunks, and \beta balances coherence against distinctiveness.

9.3.6 Pseudo-Algorithm: Semantic Chunking via Embedding Breakpoints#

ALGORITHM: SemanticChunk(D, embed_fn, τ_max, τ_min, α, w)
────────────────────────────────────────────────────────────
INPUT:
  D           — document text
  embed_fn    — sentence embedding function: string → ℝ^d
  τ_max       — maximum chunk token size
  τ_min       — minimum chunk token size
  α           — breakpoint sensitivity (standard deviations below mean)
  w           — smoothing window size (number of sentences)
 
OUTPUT:
  chunks[]    — list of semantically coherent chunks with metadata
 
PROCEDURE:
  1.  sentences[] ← SentenceTokenize(D)
      N ← |sentences|
 
  2.  embeddings[] ← [embed_fn(s) FOR s IN sentences]
 
  3.  // Compute windowed similarity curve
      sim_curve[] ← []
      FOR i ← 0 TO N - 2w:
          left_avg  ← Mean(embeddings[i : i + w])
          right_avg ← Mean(embeddings[i + w : i + 2w])
          sim_curve[i] ← CosineSimilarity(left_avg, right_avg)
 
  4.  // Compute adaptive threshold
      μ ← Mean(sim_curve)
      σ ← StdDev(sim_curve)
      threshold ← μ - α · σ
 
  5.  // Detect breakpoints as local minima below threshold
      breakpoints ← []
      FOR i ← 1 TO |sim_curve| - 1:
          IF sim_curve[i] < threshold
             AND sim_curve[i] < sim_curve[i-1]
             AND sim_curve[i] < sim_curve[i+1]:
              candidate_pos ← i + w   // Center of the window gap
              APPEND candidate_pos TO breakpoints
 
  6.  // Form chunks from breakpoints
      boundaries ← [0] + breakpoints + [N]
      raw_chunks ← []
      FOR j ← 0 TO |boundaries| - 2:
          text ← JoinSentences(sentences[boundaries[j] : boundaries[j+1]])
          tokens ← Tokenize(text)
          APPEND {content: text, token_count: |tokens|} TO raw_chunks
 
  7.  // Enforce size constraints via merge and split
      chunks ← []
      buffer ← EMPTY
      FOR EACH rc IN raw_chunks:
          IF buffer.token_count + rc.token_count ≤ τ_max:
              buffer ← Merge(buffer, rc)
          ELSE:
              IF buffer.token_count ≥ τ_min:
                  APPEND buffer TO chunks
              ELSE:
                  // Merge small buffer into previous chunk if possible
                  IF chunks IS NOT EMPTY:
                      chunks[-1] ← Merge(chunks[-1], buffer)
                  ELSE:
                      APPEND buffer TO chunks
              buffer ← rc
      IF buffer IS NOT EMPTY:
          APPEND buffer TO chunks
 
  8.  // Split any remaining oversized chunks at sentence boundaries
      final_chunks ← []
      FOR EACH c IN chunks:
          IF c.token_count > τ_max:
              splits ← SentenceBoundarySplit(c.content, τ_max)
              EXTEND final_chunks WITH splits
          ELSE:
              APPEND c TO final_chunks
 
  9.  // Score coherence for each chunk
      FOR EACH c IN final_chunks:
          sent_embeds ← [embed_fn(s) FOR s IN SentenceTokenize(c.content)]
          c.coherence_score ← MeanPairwiseCosine(sent_embeds)
          c.chunk_type ← "semantic"
 
  10. RETURN final_chunks
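Steps 3–5 (windowed similarity curve, adaptive threshold, local-minimum breakpoints) can be sketched compactly. The example below is a pure-Python sketch that takes precomputed sentence embeddings; any embed_fn from the algorithm above would supply them in practice.

```python
from math import sqrt

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def mean_vec(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def detect_breakpoints(embeddings, w=2, alpha=1.0):
    """Windowed similarity curve + adaptive threshold + local minima
    (steps 3-5). Returns sentence indices where new chunks begin."""
    n = len(embeddings)
    curve = [
        cosine(mean_vec(embeddings[i:i + w]), mean_vec(embeddings[i + w:i + 2 * w]))
        for i in range(n - 2 * w + 1)
    ]
    mu = sum(curve) / len(curve)
    sigma = sqrt(sum((x - mu) ** 2 for x in curve) / len(curve))
    threshold = mu - alpha * sigma
    return [
        i + w
        for i in range(1, len(curve) - 1)
        if curve[i] < threshold and curve[i] < curve[i - 1] and curve[i] < curve[i + 1]
    ]
```

Six toy embeddings with a topic shift after the third sentence produce exactly one breakpoint at index 3, matching the intuition that the similarity valley sits at the window gap straddling the shift.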

9.3.7 Advanced Variant: TextTiling and BayesSeg#

Beyond cosine-breakpoint detection, two classical algorithms bear mention:

  • TextTiling (Hearst, 1997): Computes block similarity using term-frequency vectors across sliding windows, detecting topic boundaries at similarity valleys. It operates in O(N \cdot w) with no embedding model dependency.

  • BayesSeg (Eisenstein & Barzilay, 2008): Formulates topic segmentation as Bayesian inference over segment boundaries using a generative language model. It optimizes a posterior probability of segmentation given observed word distributions:

P(\mathcal{B} \mid D) \propto P(D \mid \mathcal{B}) \cdot P(\mathcal{B})

These may be used as baselines or fused with embedding-based approaches in an ensemble.

9.3.8 Trade-Off Analysis#

| Strength | Limitation |
|---|---|
| Detects topical shifts independent of formatting | Requires embedding model inference per sentence (O(N) calls) |
| Works on unstructured text, transcripts, raw prose | Sensitive to embedding model quality and domain alignment |
| Adaptive threshold accommodates varying document styles | Smoothing window w must be tuned per corpus |
| Produces coherent, self-contained retrieval units | Cannot capture hierarchical structure (flat segmentation) |

9.4 Hierarchical Chunking: Parent-Child Relationships, Summary-Detail Layering, Recursive Decomposition#

9.4.1 Motivation and Architecture#

Many retrieval tasks require chunks at multiple granularity levels simultaneously. A user query about a broad concept should retrieve a summary-level chunk; a follow-up query about a specific detail should retrieve the fine-grained child chunk. Hierarchical chunking constructs a chunk tree \mathcal{T}_c where each node is a chunk and edges encode parent-child containment.

9.4.2 Formal Definition#

Let D be a document. A hierarchical chunking of D at L levels produces a tree:

\mathcal{T}_c = (C, E_c)

where:

  • C = C^{(1)} \cup C^{(2)} \cup \ldots \cup C^{(L)} is the set of chunks across levels
  • C^{(l)} is the set of chunks at level l (level 1 = coarsest, level L = finest)
  • E_c = \{(c_p, c_d) \mid c_p \in C^{(l)}, c_d \in C^{(l+1)}, \text{content}(c_d) \subseteq \text{content}(c_p)\}

Each parent chunk c_p^{(l)} satisfies:

\text{content}(c_p^{(l)}) = \bigoplus_{c_d \in \text{children}(c_p)} \text{content}(c_d)

where \bigoplus denotes ordered concatenation, and an optional summary layer stores:

\text{summary}(c_p^{(l)}) = \text{Summarize}\left(\text{content}(c_p^{(l)})\right)

9.4.3 Retrieval with Hierarchical Chunks#

At query time, the retrieval engine can operate in multiple modes:

  1. Leaf retrieval: Retrieve fine-grained chunks from C^{(L)} for precise evidence extraction
  2. Parent expansion: Upon retrieving a leaf chunk c_d, expand to its parent c_p for surrounding context
  3. Summary-first routing: Retrieve from summary-level C^{(1)}, then drill into children for detail
  4. Multi-level fusion: Retrieve from all levels, deduplicate, and rank by relevance-at-granularity

The parent expansion strategy is formalized as:

\mathcal{C}_{\text{expanded}}(q) = \left\{ \text{parent}(c) \mid c \in \text{TopK}\left(C^{(L)}, q, k\right) \right\}

This provides the model with surrounding context that a leaf chunk alone cannot supply.
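A minimal sketch of the parent-expansion mode follows; the score_fn argument stands in for the vector index, and all field names here are illustrative rather than prescribed by the chapter.

```python
def parent_expansion(query_vec, leaves, parents, score_fn, k=2):
    """Rank leaf chunks by score_fn, take the top k, and return their
    parent chunks (deduplicated, first-retrieval order preserved)."""
    ranked = sorted(leaves, key=lambda c: score_fn(query_vec, c["vec"]), reverse=True)
    seen, expanded = set(), []
    for leaf in ranked[:k]:
        pid = leaf["parent_id"]
        if pid not in seen:
            seen.add(pid)
            expanded.append(parents[pid])
    return expanded
```

Note the deduplication: when two top-k leaves share a parent, the parent is injected once, which conserves the context budget B_context.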

9.4.4 Pseudo-Algorithm: Hierarchical Chunking with Summary-Detail Layering#

ALGORITHM: HierarchicalChunk(D, levels, τ_sizes[], summarize_fn)
──────────────────────────────────────────────────────────────────
INPUT:
  D             — document text
  levels        — number of hierarchical levels (e.g., 3)
  τ_sizes[]     — max chunk token size per level: [τ_1, τ_2, ..., τ_L]
                   where τ_1 > τ_2 > ... > τ_L
  summarize_fn  — function to generate summaries (LLM or extractive)
 
OUTPUT:
  chunk_tree    — tree of chunks with parent-child edges and summaries
 
PROCEDURE:
  1.  // Level 1: Coarsest segmentation
      level_1_chunks ← StructuralChunk(D, format, τ_sizes[1], d_target=1)
      // Or SemanticChunk if no structure available
 
  2.  chunk_tree ← InitializeTree()
      FOR EACH c1 IN level_1_chunks:
          node_1 ← CreateNode(level=1, content=c1.content, metadata=c1)
          AddToTree(chunk_tree, node_1, parent=ROOT)
 
  3.  // Levels 2 through L: Recursive refinement
      FOR l ← 2 TO levels:
          parent_nodes ← GetNodesAtLevel(chunk_tree, l - 1)
          FOR EACH parent IN parent_nodes:
              IF TokenCount(parent.content) > τ_sizes[l]:
                  children ← SemanticChunk(
                      parent.content, embed_fn, τ_sizes[l], τ_min=50,
                      α=1.5, w=3
                  )
                  FOR EACH child IN children:
                      node_child ← CreateNode(
                          level=l, content=child.content, metadata=child
                      )
                      AddToTree(chunk_tree, node_child, parent=parent)
              ELSE:
                  // Parent is already small enough; promote as leaf
                  MarkAsLeaf(parent)
 
  4.  // Generate summaries for non-leaf nodes (bottom-up)
      FOR l ← levels - 1 DOWNTO 1:
          nodes_at_l ← GetNodesAtLevel(chunk_tree, l)
          FOR EACH node IN nodes_at_l:
              children_text ← Concatenate([c.content FOR c IN Children(node)])
              node.summary ← summarize_fn(children_text)
              node.summary_tokens ← TokenCount(node.summary)
 
  5.  // Assign globally unique chunk IDs and parent pointers
      AssignIDs(chunk_tree)
      FOR EACH node IN AllNodes(chunk_tree):
          node.metadata.parent_id ← ParentID(node)
          node.metadata.children_ids ← [ChildID(c) FOR c IN Children(node)]
          node.metadata.level ← node.level
 
  6.  RETURN chunk_tree

9.4.5 Token Budget Considerations#

For a document of L_{\text{doc}} tokens with L hierarchical levels, the total storage cost is:

\text{TotalTokens} = \sum_{l=1}^{L} \sum_{c \in C^{(l)}} |c| \approx L_{\text{doc}} \cdot L + \sum_{l=1}^{L-1} |C^{(l)}| \cdot \bar{s}_l

where \bar{s}_l is the average summary length at level l. This is a storage multiplication factor of approximately L + \epsilon (where \epsilon accounts for summaries), which is acceptable given that it enables multi-granularity retrieval. The embedding and indexing cost scales linearly with the total number of chunk nodes: |C| = \sum_{l=1}^{L} |C^{(l)}|.
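As a worked instance of the storage formula (the chunk counts and summary length below are illustrative assumptions, not benchmarks):

```python
def hierarchy_storage_tokens(doc_tokens, chunk_counts, avg_summary_len):
    """TotalTokens ~= L_doc * levels + summary tokens over non-leaf
    levels; chunk_counts is listed coarsest level first."""
    levels = len(chunk_counts)
    content = doc_tokens * levels                      # each level re-stores the text
    summaries = sum(chunk_counts[l] * avg_summary_len  # summaries for levels 1..L-1
                    for l in range(levels - 1))
    return content + summaries

# A 100k-token document, three levels (10 / 80 / 640 chunks),
# 120-token summaries per non-leaf node:
total = hierarchy_storage_tokens(100_000, [10, 80, 640], 120)  # 310_800 tokens
```

The content term dominates, confirming the factor-of-L characterization above.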

9.4.6 Trade-Off Analysis#

| Strength | Limitation |
|---|---|
| Enables multi-granularity retrieval with a single index | Storage and embedding cost scale linearly with level count |
| Parent expansion provides surrounding context automatically | Summarization introduces potential information loss |
| Summary layers enable high-recall coarse search | Requires careful calibration of \tau_{\text{sizes}} per level |
| Natural fit for documents with inherent hierarchy | Flat documents (e.g., chat logs) gain limited benefit |

9.5 Agentic Chunking: LLM-Guided Proposition Extraction, Claim Decomposition, and Fact Isolation#

9.5.1 Definition#

Agentic chunking uses a language model to decompose documents into atomic propositions — self-contained, independently verifiable statements that each express exactly one factual claim. Unlike structural or semantic chunking, which operate on surface features, agentic chunking operates on meaning, producing retrieval units optimized for factual precision and downstream synthesis.

9.5.2 Proposition as the Atomic Retrieval Unit#

A proposition p satisfies three properties:

  1. Atomicity: p expresses exactly one claim or fact
  2. Self-containedness: p is interpretable without reference to surrounding text (all pronouns resolved, all abbreviations expanded)
  3. Verifiability: p can be independently confirmed or refuted against an authoritative source

Example transformation:

Source: "Founded in 2004, the company grew to 500 employees by 2020 and was acquired by Acme Corp for $2B."

Propositions:

  • "The company was founded in 2004."
  • "The company had 500 employees by 2020."
  • "Acme Corp acquired the company."
  • "The acquisition price was $2 billion."

9.5.3 Formal Definition#

Given a document D, agentic chunking produces a set of propositions:

\mathcal{P}(D) = \text{LLM}_{\text{extract}}(D) = \{p_1, p_2, \ldots, p_m\}

where each p_i is a decontextualized atomic statement with provenance:

p_i = (\text{claim}_i, \text{source\_span}_i, \text{doc\_id}, \text{section\_path}_i, \text{entities}_i)

The proposition count m typically satisfies m \gg n where n is the structural chunk count, since a single paragraph may yield 5–15 propositions.

9.5.4 Pseudo-Algorithm: Agentic Proposition Extraction#

ALGORITHM: AgenticChunk(D, llm_fn, τ_max_input, batch_size)
─────────────────────────────────────────────────────────────
INPUT:
  D             — document text
  llm_fn        — language model function for proposition extraction
  τ_max_input   — max tokens per LLM extraction call
  batch_size    — number of paragraphs per batch
 
OUTPUT:
  propositions[] — list of atomic, self-contained propositions with provenance
 
PROCEDURE:
  1.  // Pre-segment document into manageable passages
      passages ← StructuralChunk(D, format, τ_max_input, d_target=MAX_DEPTH)
      // Alternatively: ParagraphSplit(D, τ_max_input)
 
  2.  propositions ← []
 
  3.  FOR EACH batch IN BatchIterator(passages, batch_size):
 
        prompt ← CompileExtractionPrompt(batch)
        // Prompt instructs:
        //   "Decompose the following text into atomic, self-contained
        //    propositions. Each proposition must:
        //    - Express exactly one factual claim
        //    - Resolve all pronouns and references to explicit entities
        //    - Be independently interpretable without surrounding text
        //    - Preserve numerical values, dates, and proper nouns exactly
        //    Return as a JSON array of objects with fields:
        //    {claim, source_sentence, entities[], confidence}"
 
        response ← llm_fn(prompt)
        extracted ← ParseJSON(response)
 
        FOR EACH prop IN extracted:
            prop.source_doc ← DocumentID(D)
            prop.source_passage_id ← PassageID(batch)
            prop.chunk_type ← "agentic_proposition"
            prop.token_count ← TokenCount(prop.claim)
 
            // Validate: reject empty, duplicate, or non-atomic propositions
            IF prop.claim IS NOT EMPTY
               AND NOT IsDuplicate(prop, propositions)
               AND TokenCount(prop.claim) ≥ 5:
                APPEND prop TO propositions
 
  4.  // Deduplication pass: semantic near-duplicate removal
      embeddings ← [Embed(p.claim) FOR p IN propositions]
      duplicate_pairs ← FindPairsAboveThreshold(embeddings, θ_dedup=0.95)
      FOR EACH (i, j) IN duplicate_pairs:
          // Keep the proposition with richer entity set
          IF |propositions[i].entities| ≥ |propositions[j].entities|:
              MarkForRemoval(propositions[j])
          ELSE:
              MarkForRemoval(propositions[i])
      propositions ← RemoveMarked(propositions)
 
  5.  // Optional: cluster propositions by topic for grouped retrieval
      clusters ← ClusterByEmbedding(propositions, method="HDBSCAN")
      FOR EACH cluster IN clusters:
          FOR EACH prop IN cluster:
              prop.topic_cluster_id ← cluster.id
 
  6.  RETURN propositions
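Step 4, the semantic near-duplicate pass, can be sketched as follows. The tie-breaking rule (keep the proposition with the richer entity set) matches the algorithm above; the O(m²) pair scan is for clarity only, and a production system would use an approximate-nearest-neighbor index instead.

```python
from math import sqrt

def _cos(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def dedup_propositions(props, embed_fn, threshold=0.95):
    """Drop the near-duplicate with the smaller entity set whenever
    two claims embed above the similarity threshold."""
    vecs = [embed_fn(p["claim"]) for p in props]
    removed = set()
    for i in range(len(props)):
        for j in range(i + 1, len(props)):
            if i in removed or j in removed:
                continue
            if _cos(vecs[i], vecs[j]) >= threshold:
                loser = j if len(props[i]["entities"]) >= len(props[j]["entities"]) else i
                removed.add(loser)
    return [p for idx, p in enumerate(props) if idx not in removed]
```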

9.5.5 Claim Decomposition for Complex Statements#

Complex claims (conjunctions, conditionals, causal chains) require further decomposition. A claim p is non-atomic if:

\text{ClauseCount}(p) > 1 \quad \text{or} \quad \text{contains}(p, \{\text{"and"}, \text{"but"}, \text{"because"}, \text{"if"}, \text{"which"}\})

The LLM is instructed to decompose:

p_{\text{complex}} \rightarrow \{p_1, p_2, \ldots, p_k\} \quad \text{where each } p_i \text{ has ClauseCount} = 1
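A crude lexical pre-filter can run before the LLM call, routing only flagged claims to the decomposition prompt. The connective list mirrors the condition above; the comma heuristic is an illustrative assumption, not a substitute for real clause counting.

```python
CONNECTIVES = {"and", "but", "because", "if", "which"}

def looks_non_atomic(claim: str) -> bool:
    """Flag a claim for LLM decomposition when it contains a clause
    connective or more than one comma-separated segment."""
    words = {w.lower().strip(".,;") for w in claim.split()}
    return bool(words & CONNECTIVES) or claim.count(",") > 1
```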

9.5.6 Cost Analysis#

Agentic chunking is the most expensive strategy in terms of inference cost. For a document of L tokens processed in batches of b tokens:

\text{LLM Calls} = \left\lceil \frac{L}{b} \right\rceil, \qquad \text{Total Input Tokens} \approx L + \left\lceil \frac{L}{b} \right\rceil \cdot |\text{prompt\_template}|

\text{Total Output Tokens} \approx m \cdot \bar{p} \quad \text{where } \bar{p} \text{ is mean proposition length}

For cost optimization: (a) use a smaller, fine-tuned model for extraction; (b) batch aggressively; (c) cache extraction results and only re-extract on source change.
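Plugging illustrative numbers into these formulas (the proposition-yield and proposition-length defaults below are assumptions, not measured rates):

```python
from math import ceil

def extraction_cost(doc_tokens, batch_tokens, prompt_tokens,
                    props_per_kilotoken=60, mean_prop_len=20):
    """Instantiate the call/token formulas for agentic extraction."""
    calls = ceil(doc_tokens / batch_tokens)
    input_tokens = doc_tokens + calls * prompt_tokens
    m = doc_tokens * props_per_kilotoken // 1000   # expected proposition count
    output_tokens = m * mean_prop_len
    return {"calls": calls, "input_tokens": input_tokens,
            "output_tokens": output_tokens}

# A 50k-token document in 4k-token batches with a 300-token prompt:
cost = extraction_cost(50_000, 4_000, 300)
```

For this example the prompt overhead is small relative to the document, but the output tokens (one claim per extracted proposition) dominate the bill, which is why caching extraction results on unchanged sources matters.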

9.5.7 Trade-Off Analysis#

| Strength | Limitation |
|---|---|
| Produces maximally precise retrieval units | High inference cost (O(L) LLM calls) |
| Self-contained propositions eliminate context dependency | LLM extraction may hallucinate propositions not in source |
| Ideal for fact-checking, QA, and claim verification tasks | Not suitable for code, tabular data, or structured content |
| Enables fine-grained attribution and provenance | Over-decomposition can fragment coherent arguments |

9.6 Code Chunking: AST-Based, Function-Level, Class-Level, Dependency-Scope Chunking#

9.6.1 Motivation#

Source code possesses a formal grammar and a well-defined syntactic structure (the Abstract Syntax Tree, AST). Naïve text-based chunking of code produces fragments that split functions mid-body, separate signatures from implementations, and destroy import context. Code chunking must respect the syntactic and semantic boundaries defined by the programming language's grammar.

9.6.2 AST-Based Chunking#

The AST of a source file F is a tree \mathcal{A}(F) = (V_a, E_a) where nodes correspond to syntactic constructs: module, class, function, method, block, statement, expression. Chunking at a target AST depth produces semantically complete code units.

Let \mathcal{N}_{\text{scope}} denote the set of scope-defining node types:

\mathcal{N}_{\text{scope}} = \{\texttt{module}, \texttt{class}, \texttt{function}, \texttt{method}, \texttt{lambda}, \texttt{block}\}

A code chunk is the serialization of a scope-defining node plus its required context:

c_v = \text{Imports}(v) \oplus \text{Signature}(v) \oplus \text{Docstring}(v) \oplus \text{Body}(v)

where \text{Imports}(v) includes only the import statements referenced within v's body.
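For Python sources, the standard-library ast module recovers the scope tree directly. The sketch below emits function-level chunks with name, line span, docstring, and exact source text, omitting the import-resolution step of the full algorithm for brevity:

```python
import ast

def function_chunks(source: str) -> list[dict]:
    """Emit one chunk per function definition via the stdlib AST."""
    tree = ast.parse(source)
    chunks = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            chunks.append({
                "scope_name": node.name,
                "line_start": node.lineno,
                "line_end": node.end_lineno,
                "docstring": ast.get_docstring(node),
                "content": ast.get_source_segment(source, node),
                "chunk_type": "code_function",
            })
    return chunks
```

For polyglot corpora, tree-sitter grammars play the role that ast plays here for Python.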

9.6.3 Pseudo-Algorithm: AST-Based Code Chunking#

ALGORITHM: CodeChunk(F, language, τ_max, granularity)
─────────────────────────────────────────────────────
INPUT:
  F           — source file content
  language    — programming language (python, java, typescript, ...)
  τ_max       — maximum chunk token size
  granularity — target scope level: "function" | "class" | "module"
 
OUTPUT:
  chunks[]    — list of code chunks with AST metadata
 
PROCEDURE:
  1.  AST ← Parse(F, language)
      // Use tree-sitter, libcst, or language-specific parser
 
  2.  imports ← ExtractImports(AST)
      // Global import statements for the file
 
  3.  scope_nodes ← CollectNodes(AST, type ∈ N_scope, granularity)
      // If granularity = "function": collect all function/method nodes
      // If granularity = "class": collect all class nodes
      // If granularity = "module": the entire file is one chunk
 
  4.  chunks ← []
 
  5.  FOR EACH node IN scope_nodes:
        signature   ← ExtractSignature(node)      // e.g., "def foo(x: int) -> str:"
        docstring   ← ExtractDocstring(node)       // First string literal in body
        body        ← ExtractBody(node)            // Full body source
        decorators  ← ExtractDecorators(node)      // @decorator lines
        parent_class← GetEnclosingClass(node, AST) // None if top-level
        
        // Resolve used imports
        used_symbols ← ExtractReferencedSymbols(body)
        relevant_imports ← Filter(imports, symbol ∈ used_symbols)
 
        full_text ← Join([
            relevant_imports,
            decorators,
            signature,
            docstring,
            body
        ])
 
        tokens ← Tokenize(full_text)
 
        IF |tokens| ≤ τ_max:
            chunk ← {
                content:         full_text,
                token_count:     |tokens|,
                chunk_type:      "code_" + granularity,
                language:        language,
                scope_name:      QualifiedName(node, AST),
                signature:       signature,
                enclosing_class: parent_class,
                file_path:       FilePath(F),
                line_start:      StartLine(node),
                line_end:        EndLine(node),
                dependencies:    used_symbols,
                docstring:       docstring,
                complexity:      CyclomaticComplexity(node)
            }
            APPEND chunk TO chunks
 
        ELSE:
            // Function too large: split into sub-blocks
            IF granularity = "class":
                // Recurse at method level
                method_chunks ← CodeChunk(
                    Serialize(node), language, τ_max, "function"
                )
                FOR EACH mc IN method_chunks:
                    mc.enclosing_class ← QualifiedName(node, AST)
                    APPEND mc TO chunks
            ELSE:
                // Split large function at logical block boundaries
                blocks ← ExtractBlocks(node)
                // blocks: loops, conditionals, try-except, sequential blocks
                FOR EACH block IN blocks:
                    block_text ← Join([relevant_imports, signature + "...", Serialize(block)])
                    block_chunk ← CreateChunk(block_text, metadata={
                        scope_name: QualifiedName(node) + "::block_" + BlockIndex(block),
                        ...
                    })
                    APPEND block_chunk TO chunks
 
  6.  // Add file-level context chunk (always)
      file_summary ← {
          content:     Join([imports, ExtractClassSignatures(AST), ExtractFunctionSignatures(AST)]),
          chunk_type:  "code_file_summary",
          file_path:   FilePath(F),
          token_count: TokenCount(above)
      }
      APPEND file_summary TO chunks
 
  7.  RETURN chunks

9.6.4 Dependency-Scope Chunking#

For queries that require understanding the context of a function's callees or callers, dependency-scope chunking constructs chunks that include transitively referenced code:

c_v^{\text{dep}} = c_v \oplus \bigoplus_{u \in \text{Callees}(v, \text{depth}=1)} \text{Signature}(u)

This provides the retrieval engine with enough context to answer questions about a function's behavior without retrieving the entire codebase.
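A depth-1 version of this augmentation can be sketched with the same stdlib `ast` approach (the helper name `callee_signatures` and the sample source are illustrative; only direct calls to same-file functions are resolved):

```python
import ast

def callee_signatures(source: str, func_name: str) -> list[str]:
    """Signatures of functions directly called by func_name (depth 1)."""
    tree = ast.parse(source)
    funcs = {n.name: n for n in tree.body if isinstance(n, ast.FunctionDef)}
    target = funcs[func_name]
    # Collect names called directly inside the target function's body
    called = {node.func.id for node in ast.walk(target)
              if isinstance(node, ast.Call) and isinstance(node.func, ast.Name)}
    sigs = []
    for name in sorted(called & funcs.keys()):
        # First line of the callee's source = its signature only
        sigs.append(ast.get_source_segment(source, funcs[name]).splitlines()[0])
    return sigs

src = '''def helper(x):
    return x + 1

def main(y):
    return helper(y) * 2
'''
sigs = callee_signatures(src, "main")
```

Appending these signature lines to the caller's chunk gives the retriever the callee interfaces without pulling in their full bodies.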

9.6.5 Language-Specific Considerations#

| Language | Parser | Scope Nodes | Special Handling |
|---|---|---|---|
| Python | tree-sitter-python, libcst | `function_definition`, `class_definition` | Decorators, type hints, `__init__` grouping |
| Java | tree-sitter-java, JavaParser | `method_declaration`, `class_declaration` | Interface/abstract methods, annotations |
| TypeScript | tree-sitter-typescript | `function_declaration`, `class_declaration`, `arrow_function` | JSX components, type declarations |
| Go | tree-sitter-go | `function_declaration`, `method_declaration` | Struct methods, interface satisfaction |
| Rust | tree-sitter-rust | `function_item`, `impl_item` | Trait implementations, lifetime annotations |

9.6.6 Trade-Off Analysis#

| Strength | Limitation |
|---|---|
| Respects language grammar; never splits mid-function | Requires a working parser per language |
| Includes import context for self-containedness | Very large functions or classes may still exceed τ_max |
| AST metadata enables precise scope-based retrieval | Dynamic languages (Python, JS) may have incomplete ASTs for metaprogramming |
| Cyclomatic complexity metadata aids relevance ranking | Cross-file dependencies require call-graph analysis |

9.7 Tabular and Structured Data Chunking: Row-Group, Schema-Preserving, Pivot-Aware Strategies#

9.7.1 The Challenge#

Tabular data (CSV, Excel, SQL tables, JSON arrays) encodes information in a two-dimensional schema. Chunking rows without schema context produces meaningless text; chunking entire tables may exceed τ_max. The chunking strategy must preserve the schema-value binding so that each chunk is independently interpretable.

9.7.2 Schema-Preserving Row-Group Chunking#

Given a table T with schema S = (col_1, col_2, …, col_k) and N rows, partition into row groups of size g:

c_j = S \oplus \text{Rows}[j \cdot g : (j+1) \cdot g], \quad j \in \{0, \ldots, \lceil N/g \rceil - 1\}

Every chunk begins with the schema header S, ensuring self-containedness. The group size g is determined by:

g = \left\lfloor \frac{\tau_{\max} - |S|_{\text{tokens}}}{\bar{r}} \right\rfloor

where r̄ is the average row token count.
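A minimal sketch of this computation, approximating token counts by whitespace word counts (a real system would use the embedding model's tokenizer; the function name and sample rows are illustrative):

```python
def row_group_chunks(schema: list[str], rows: list[list[str]], tau_max: int) -> list[str]:
    """Schema-preserving row-group chunking: every chunk starts with the header."""
    header = " | ".join(schema)
    schema_tokens = len(header.split())          # whitespace-token proxy
    avg_row = sum(len(" | ".join(r).split()) for r in rows) / len(rows)
    g = max(1, int((tau_max - schema_tokens) // avg_row))  # rows per chunk
    chunks = []
    for j in range(0, len(rows), g):
        body = "\n".join(" | ".join(r) for r in rows[j:j + g])
        chunks.append(header + "\n" + body)      # schema ⊕ row group
    return chunks

rows = [["acme", "5.2M", "Q3"], ["beta", "1.1M", "Q3"], ["gamma", "0.4M", "Q2"]]
chunks = row_group_chunks(["company", "revenue", "quarter"], rows, tau_max=12)
```

With a tight τ_max the group size collapses to g = 1, but every chunk still carries the full schema header and remains independently interpretable.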

9.7.3 Pivot-Aware Chunking#

When a table has a key column (e.g., entity name, date) that groups logically related rows, chunking should respect key boundaries:

c_j = S \oplus \{r \in T \mid r[\text{key\_col}] = k_j\}

This produces one chunk per unique key value (or per key-value group if the key has high cardinality), preserving the semantic unit of "all data about entity k_j."

9.7.4 Serialization Formats#

The way a table chunk is serialized affects embedding quality and retrieval precision:

| Format | Example | Pros | Cons |
|---|---|---|---|
| Markdown table | `\| col1 \| col2 \| ... \|` | Readable, familiar to LLMs | Verbose for wide tables |
| Row-as-sentence | "The revenue for Q3 2023 was $5.2M with margin 12%." | High embedding quality | Requires NL template per schema |
| JSON records | `[{"col1": "v1", "col2": "v2"}]` | Structured, parseable | Token-heavy due to key repetition |
| Key-value pairs | `col1: v1, col2: v2` | Compact | Ambiguous without schema context |

For optimal retrieval, row-as-sentence serialization with natural-language templates is preferred because it produces text that aligns with the embedding model's training distribution.

9.7.5 Pseudo-Algorithm: Schema-Preserving Tabular Chunking#

ALGORITHM: TabularChunk(T, schema, τ_max, key_col, serialize_mode)
────────────────────────────────────────────────────────────────────
INPUT:
  T               — table data (list of row dictionaries)
  schema          — column definitions with types and descriptions
  τ_max           — maximum chunk token size
  key_col         — optional grouping key column (may be NULL)
  serialize_mode  — "markdown" | "row_sentence" | "json"
 
OUTPUT:
  chunks[]        — list of schema-preserving table chunks
 
PROCEDURE:
  1.  schema_header ← SerializeSchema(schema, serialize_mode)
      schema_tokens ← TokenCount(schema_header)
 
  2.  IF key_col IS NOT NULL:
          // Pivot-aware grouping
          groups ← GroupBy(T, key_col)
      ELSE:
          // Sequential row grouping
          avg_row_tokens ← Mean([TokenCount(SerializeRow(r, schema, serialize_mode)) FOR r IN T])
          g ← Floor((τ_max - schema_tokens) / avg_row_tokens)
          g ← Max(g, 1)
          groups ← SequentialBatch(T, batch_size=g)
 
  3.  chunks ← []
      FOR EACH group IN groups:
          serialized_rows ← [SerializeRow(r, schema, serialize_mode) FOR r IN group]
          content ← schema_header + "\n" + Join(serialized_rows, "\n")
          tokens ← TokenCount(content)
 
          IF tokens ≤ τ_max:
              chunk ← {
                  content:       content,
                  token_count:   tokens,
                  chunk_type:    "tabular",
                  schema:        schema,
                  row_count:     |group|,
                  key_value:     group[0][key_col] IF key_col ELSE NULL,
                  row_range:     (FirstRowIndex(group), LastRowIndex(group)),
                  source_table:  TableID(T)
              }
              APPEND chunk TO chunks
          ELSE:
              // Group exceeds τ_max: split into sub-groups
              sub_groups ← SequentialBatch(group, batch_size=Floor(g/2))
              RECURSE on each sub_group
 
  4.  // Generate table summary chunk
      summary ← {
          content:     "Table: " + TableName(T) + ". Schema: " + schema_header
                       + ". Row count: " + |T|
                       + ". Key statistics: " + ComputeColumnStats(T, schema),
          chunk_type:  "tabular_summary",
          token_count: TokenCount(above),
          source_table: TableID(T)
      }
      APPEND summary TO chunks
 
  5.  RETURN chunks

9.7.6 Row-as-Sentence Template#

For the row-as-sentence serialization, a template is constructed from the schema:

\text{Template}(S) = \text{"The [entity\_col] has [col2] of [val2], [col3] of [val3], ..."}

Example: For schema (company, revenue, quarter, margin):

\text{SerializeRow}(r) = \text{"The revenue of [company] in [quarter] was [revenue] with a margin of [margin]."}

This produces text that embeds well and retrieves accurately against natural-language queries.
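In code, the template instantiation is plain string formatting (the template text mirrors the example above; the helper name is illustrative):

```python
def serialize_row(row: dict, template: str) -> str:
    """Render one table row as a natural-language sentence."""
    return template.format(**row)

# Per-schema template for (company, revenue, quarter, margin)
template = ("The revenue of {company} in {quarter} was {revenue} "
            "with a margin of {margin}.")
sentence = serialize_row(
    {"company": "Acme", "revenue": "$5.2M", "quarter": "Q3 2023", "margin": "12%"},
    template,
)
```

The resulting sentence is what gets embedded and indexed, rather than the raw delimited row.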


9.8 Multi-Modal Chunking: Image-Caption Pairing, Video Segment Annotation, Audio Transcript Alignment#

9.8.1 Scope#

Multi-modal documents (PDFs with figures, web pages with images, video tutorials, podcast transcripts) require chunking strategies that preserve the alignment between modalities. A figure without its caption is uninterpretable; a transcript segment without its timestamp is unlocalizable.

9.8.2 Image-Caption Pairing#

For documents containing images (figures, diagrams, charts), each image is chunked as a paired unit:

c_{\text{img}} = (\text{image\_reference}, \text{caption}, \text{alt\_text}, \text{surrounding\_text}, \text{OCR\_text})

where:

  • image_reference: URI or embedding of the image
  • caption: extracted figure caption (e.g., "Figure 3: Architecture diagram")
  • alt_text: accessibility description if available
  • surrounding_text: the k sentences before and after the image in the document
  • OCR_text: text extracted from the image via OCR (for diagrams, charts, screenshots)

The text representation for embedding and retrieval is:

\text{EmbeddableText}(c_{\text{img}}) = \text{caption} \oplus \text{alt\_text} \oplus \text{OCR\_text} \oplus \text{surrounding\_text}

Optionally, a vision-language model generates a synthetic description of the image:

\text{description} = \text{VLM}(\text{image\_bytes})

This description is appended to the embeddable text for richer retrieval.

9.8.3 Video Segment Chunking#

Video content is chunked by segmenting the timeline into semantically coherent segments:

c_{\text{video}, j} = (\text{transcript}[t_j : t_{j+1}], \; \text{keyframes}[t_j : t_{j+1}], \; t_j, \; t_{j+1}, \; \text{speaker})

Segment boundaries {t_1, t_2, …} are determined by:

  1. Silence detection: Pauses longer than δ seconds
  2. Speaker diarization: Speaker change points
  3. Transcript semantic segmentation: Apply semantic chunking (§9.3) to the transcript text
  4. Visual scene change detection: Keyframe difference exceeds threshold

9.8.4 Audio Transcript Alignment#

For audio-only content (podcasts, meetings, calls), the chunking pipeline is:

Audio → ASR (speech-to-text) → Diarization → Sentence segmentation
    → Semantic chunking on transcript → Timestamp alignment

Each chunk carries:

c_{\text{audio}, j} = (\text{transcript\_segment}, \; t_{\text{start}}, \; t_{\text{end}}, \; \text{speaker\_id}, \; \text{confidence})

9.8.5 Pseudo-Algorithm: Multi-Modal Document Chunking#

ALGORITHM: MultiModalChunk(D_mm, modality_extractors, τ_max)
─────────────────────────────────────────────────────────────
INPUT:
  D_mm                  — multi-modal document (PDF, web page, video, ...)
  modality_extractors   — dict of modality → extraction function
  τ_max                 — maximum chunk token size
 
OUTPUT:
  chunks[]              — list of modality-annotated chunks
 
PROCEDURE:
  1.  // Extract modality streams
      text_blocks   ← modality_extractors["text"](D_mm)
      images        ← modality_extractors["image"](D_mm)
      tables        ← modality_extractors["table"](D_mm)
      audio_segs    ← modality_extractors["audio"](D_mm)   // May be empty
      video_segs    ← modality_extractors["video"](D_mm)   // May be empty
 
  2.  chunks ← []
 
  3.  // Text chunking (structural or semantic)
      text_chunks ← StructuralChunk(JoinBlocks(text_blocks), format, τ_max, d_target=2)
      FOR EACH tc IN text_chunks:
          tc.modality ← "text"
          APPEND tc TO chunks
 
  4.  // Image-caption pairing
      FOR EACH img IN images:
          caption         ← ExtractCaption(img, D_mm)
          alt_text        ← ExtractAltText(img, D_mm)
          ocr_text        ← OCR(img.bytes)
          surrounding     ← GetSurroundingText(img.position, text_blocks, k=3)
          vlm_description ← VLMDescribe(img.bytes)   // Optional
 
          embeddable ← Join([caption, alt_text, ocr_text, surrounding, vlm_description])
 
          chunk ← {
              content:         embeddable,
              token_count:     TokenCount(embeddable),
              chunk_type:      "image_caption_pair",
              modality:        "image",
              image_ref:       img.uri,
              caption:         caption,
              position_in_doc: img.position,
              source_doc:      DocumentID(D_mm)
          }
          APPEND chunk TO chunks
 
  5.  // Table chunking
      FOR EACH tbl IN tables:
          tbl_chunks ← TabularChunk(tbl.data, tbl.schema, τ_max, key_col=NULL, "markdown")
          FOR EACH tc IN tbl_chunks:
              tc.modality ← "table"
              APPEND tc TO chunks
 
  6.  // Audio/video segment chunking
      FOR EACH seg IN audio_segs + video_segs:
          transcript_chunks ← SemanticChunk(seg.transcript, embed_fn, τ_max, τ_min=30, α=1.5, w=2)
          FOR EACH tc IN transcript_chunks:
              tc.modality     ← seg.type   // "audio" or "video"
            tc.time_start   ← AlignTimestamp(tc, seg, boundary="start")
            tc.time_end     ← AlignTimestamp(tc, seg, boundary="end")
              tc.speaker      ← seg.speaker_id
              tc.keyframe_ref ← ExtractKeyframe(tc.time_start) IF seg.type = "video" ELSE NULL
              APPEND tc TO chunks
 
  7.  RETURN chunks

9.8.6 Cross-Modal Linking#

After chunking, cross-modal links should be established:

  • An image chunk links to the text chunk that references it ("As shown in Figure 3...")
  • A table chunk links to the text chunk that discusses its findings
  • A video segment chunk links to slides visible during that segment

These links are stored as metadata edges and enable the retrieval engine to expand across modalities when a query touches multiple information types.


9.9 Overlap, Stride, and Context Window Strategies for Boundary Coherence#

9.9.1 The Boundary Problem#

Every chunking strategy introduces artificial boundaries. Information that spans a boundary is split across two chunks, and if only one chunk is retrieved, the agent receives an incomplete evidence fragment. Overlap and stride strategies mitigate this by ensuring that boundary-adjacent content appears in multiple chunks.

9.9.2 Formal Definitions#

Given a document D of L tokens, a chunk size s, and an overlap o (where 0 ≤ o < s), the stride is:

\text{stride} = s - o

The i-th chunk is:

c_i = D[i \cdot \text{stride} : i \cdot \text{stride} + s], \quad i \in \left\{0, 1, \ldots, \left\lfloor \frac{L - s}{\text{stride}} \right\rfloor\right\}

The total number of chunks is:

n = \left\lfloor \frac{L - s}{\text{stride}} \right\rfloor + 1 = \left\lfloor \frac{L - s}{s - o} \right\rfloor + 1

The overlap ratio is:

r_o = \frac{o}{s}

The storage expansion factor (ratio of total stored tokens to document length) is:

\text{Expansion} = \frac{n \cdot s}{L} \approx \frac{s}{s - o} = \frac{1}{1 - r_o}

For r_o = 0.2 (20% overlap), the expansion factor is ≈ 1.25; for r_o = 0.5, it is 2.0.
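The definitions above can be exercised directly over a token list (a minimal sketch; the function name is illustrative and a trailing partial window is emitted only when the last full chunk does not reach the document end):

```python
def overlap_chunks(tokens: list[str], s: int, o: int) -> list[list[str]]:
    """Fixed-size chunking with overlap o (stride = s - o)."""
    assert 0 <= o < s
    stride = s - o
    chunks, i = [], 0
    while i < len(tokens):
        chunks.append(tokens[i:i + s])
        if i + s >= len(tokens):   # last chunk reaches the end: stop
            break
        i += stride
    return chunks

tokens = [f"t{i}" for i in range(100)]
chunks = overlap_chunks(tokens, s=20, o=4)          # stride = 16
expansion = sum(len(c) for c in chunks) / len(tokens)
```

For L = 100, s = 20, o = 4 this yields n = ⌊80/16⌋ + 1 = 6 chunks, and the measured expansion (1.2) sits just under the 1/(1 − r_o) = 1.25 bound because the last chunk is not followed by another overlapping window.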

9.9.3 Optimal Overlap Selection#

The optimal overlap trades off between boundary coherence and storage/embedding cost. Define boundary information loss as:

\mathcal{L}_{\text{boundary}}(o) = \sum_{i=0}^{n-2} \mathbb{1}\left[\exists q \in Q_{\text{eval}} : \text{answer}(q) \in D[i \cdot \text{stride} + s - o : i \cdot \text{stride} + s + o] \setminus c_i \right]

This counts the number of evaluation queries whose answer falls in the boundary region and is not fully captured by a single chunk. The objective is:

\min_o \; \mathcal{L}_{\text{boundary}}(o) + \lambda_{\text{cost}} \cdot \frac{1}{1 - o/s}

In practice, overlaps of 10–20% of chunk size are standard for prose documents, while code chunks typically use 0% overlap (since AST boundaries are semantically precise).

9.9.4 Sentence-Aligned Overlap#

Rather than token-level overlap (which may split words), sentence-aligned overlap ensures that the overlap region begins and ends at sentence boundaries:

o_{\text{aligned}} = \sum_{j=1}^{k} |s_j| \quad \text{where } \sum_{j=1}^{k} |s_j| \leq o \text{ and } s_j \text{ are trailing sentences of } c_i

This prevents partial-sentence artifacts in the overlap region.

9.9.5 Context Window Prefix Strategy#

An alternative to overlap is the context window prefix: instead of duplicating content, each chunk is prefixed with a compressed summary of its preceding context:

c_i^{\text{prefixed}} = \text{Summary}(D[0 : i \cdot \text{stride}]) \oplus c_i

This provides positional context without full content duplication, at the cost of summary generation. The summary prefix is typically bounded to p tokens:

|c_i^{\text{prefixed}}| = p + s \leq \tau_{\max}

9.9.6 Decision Matrix#

| Strategy | Boundary Coherence | Storage Cost | Compute Cost | Best For |
|---|---|---|---|---|
| No overlap (o = 0) | Low | 1.0× | Minimal | AST-bounded code, clean structural splits |
| Fixed overlap (10–20%) | Medium | 1.1–1.25× | Low | General prose, documentation |
| High overlap (50%) | High | 2.0× | Moderate | Dense technical text with cross-reference dependencies |
| Sentence-aligned overlap | Medium-High | 1.1–1.3× | Low | Narrative text, legal documents |
| Context window prefix | High | 1.0× + summary cost | High (LLM calls) | Long-form analysis, reports |

9.10 Chunk Metadata Enrichment: Section Title, Document Position, Entity Tags, Summary, Parent Pointer#

9.10.1 Principle#

A chunk without metadata is an anonymous text fragment. In production retrieval systems, metadata is the primary lever for non-semantic filtering, ranking, and provenance tracking. Every chunk must carry a rich, typed metadata envelope that enables the retrieval engine to filter, re-rank, and attribute.

9.10.2 Metadata Schema#

The canonical chunk metadata schema for an agentic retrieval system:

| Field | Type | Source | Purpose |
|---|---|---|---|
| `chunk_id` | UUID | Generated | Unique identifier, idempotent re-chunking |
| `source_doc_id` | UUID | Ingestion pipeline | Provenance: which document |
| `source_doc_title` | string | Document metadata | Human-readable provenance |
| `section_path` | string[] | Structural parser | Breadcrumb: `["Ch3", "Sec3.2", "Sub3.2.1"]` |
| `position_index` | int | Chunking engine | Ordinal position within document (0-indexed) |
| `total_chunks` | int | Chunking engine | Total chunks from this document |
| `chunk_type` | enum | Strategy selector | structural, semantic, agentic, code, tabular, multimodal |
| `token_count` | int | Tokenizer | Exact token count for budget planning |
| `entity_tags` | string[] | NER pipeline | Named entities mentioned: people, orgs, dates, products |
| `summary` | string | LLM or extractive | 1–2 sentence summary of chunk content |
| `parent_chunk_id` | UUID | Hierarchical chunker | Pointer to parent chunk (null if root) |
| `children_chunk_ids` | UUID[] | Hierarchical chunker | Pointers to child chunks |
| `language` | string | Detector | Natural language or programming language |
| `created_at` | timestamp | Ingestion pipeline | Ingestion timestamp |
| `source_version` | string | Version control | Git commit, document revision, API version |
| `freshness_score` | float | Decay function | f(t) = exp(−λ(t_now − t_created)) |
| `authority_score` | float | Source ranking | Confidence in source reliability |
| `modality` | enum | Modality extractor | text, image, table, audio, video |
| `coherence_score` | float | Quality scorer | Intra-chunk semantic coherence (§9.3.4) |

9.10.3 Pseudo-Algorithm: Metadata Enrichment Pipeline#

ALGORITHM: EnrichChunkMetadata(chunks[], ner_fn, summarize_fn, embed_fn)
──────────────────────────────────────────────────────────────────────────
INPUT:
  chunks[]       — raw chunks from any chunking strategy
  ner_fn         — named entity recognition function
  summarize_fn   — summarization function (extractive or LLM)
  embed_fn       — embedding function
 
OUTPUT:
  enriched[]     — chunks with complete metadata
 
PROCEDURE:
  1.  FOR EACH chunk c IN chunks:
 
        // Entity extraction
        c.entity_tags ← ner_fn(c.content)
        // Returns: [{"text": "OpenAI", "type": "ORG"}, ...]
 
  2.  // Batch summarization for efficiency
      unsummarized ← [c FOR c IN chunks WHERE c.summary IS NULL]
      IF |unsummarized| > 0:
          summaries ← BatchSummarize(
              [c.content FOR c IN unsummarized],
              summarize_fn,
              max_summary_tokens=50
          )
          FOR i, c IN Enumerate(unsummarized):
              c.summary ← summaries[i]
 
  3.  // Coherence scoring
      FOR EACH chunk c IN chunks:
          sent_embeds ← [embed_fn(s) FOR s IN SentenceTokenize(c.content)]
          IF |sent_embeds| ≥ 2:
              c.coherence_score ← MeanPairwiseCosine(sent_embeds)
          ELSE:
              c.coherence_score ← 1.0   // Single-sentence chunk is trivially coherent
 
  4.  // Freshness scoring
      λ ← 0.01   // Decay rate (configurable per source type)
      FOR EACH chunk c IN chunks:
          age_days ← (Now() - c.created_at).days
          c.freshness_score ← Exp(-λ · age_days)
 
  5.  // Position normalization
      doc_groups ← GroupBy(chunks, key=source_doc_id)
      FOR EACH doc_id, group IN doc_groups:
          total ← |group|
          FOR i, c IN Enumerate(SortByPosition(group)):
              c.position_index ← i
              c.total_chunks ← total
              c.relative_position ← i / total   // 0.0 = start, 1.0 = end
 
  6.  // Generate chunk IDs (deterministic for idempotency)
      FOR EACH chunk c IN chunks:
          c.chunk_id ← DeterministicUUID(
              c.source_doc_id, c.position_index, c.chunk_type, c.source_version
          )
          // Same content + same version → same ID (enables dedup on re-ingestion)
 
  7.  RETURN chunks
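The DeterministicUUID step in the pipeline above can be realized with UUIDv5, which hashes a name under a fixed namespace so that re-ingesting the same document version reproduces the same IDs (the namespace string and argument layout here are illustrative assumptions):

```python
import uuid

# Assumed project-level namespace; any stable UUID works.
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "agentic-notes/chunks")

def deterministic_chunk_id(source_doc_id: str, position_index: int,
                           chunk_type: str, source_version: str) -> str:
    """Same (doc, position, type, version) tuple → same chunk_id on every run."""
    name = f"{source_doc_id}:{position_index}:{chunk_type}:{source_version}"
    return str(uuid.uuid5(NAMESPACE, name))

a = deterministic_chunk_id("doc-42", 3, "semantic", "v1.0")
b = deterministic_chunk_id("doc-42", 3, "semantic", "v1.0")  # identical to a
c = deterministic_chunk_id("doc-42", 3, "semantic", "v1.1")  # new version → new ID
```

Because the ID is a pure function of provenance fields, re-ingestion deduplicates for free: unchanged chunks collide with their existing IDs, while a version bump produces fresh ones.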

9.10.4 Metadata-Driven Retrieval Filtering#

At query time, metadata enables pre-retrieval filtering that dramatically reduces the candidate set:

\mathcal{C}_{\text{filtered}} = \{c \in \mathcal{C} \mid \text{freshness}(c) \geq f_{\min} \land \text{language}(c) = l_q \land \text{entity\_tags}(c) \cap E_q \neq \varnothing\}

where f_min is the minimum freshness threshold, l_q is the query language, and E_q is the set of entities mentioned in the query. This filtering occurs before embedding similarity computation, reducing both latency and false positives.

9.10.5 Summary as a Retrieval Proxy#

Chunk summaries serve a dual purpose:

  1. Retrieval proxy: Embed and index the summary instead of (or in addition to) the full chunk. Summaries are denser and produce more discriminative embeddings.
  2. Context compression: When injecting chunks into the agent's context window, the summary can replace the full content for lower-priority chunks, conserving token budget.

The dual-embedding strategy indexes both:

\mathbf{v}_c^{\text{full}} = \text{Embed}(c.\text{content}), \qquad \mathbf{v}_c^{\text{summary}} = \text{Embed}(c.\text{summary})

Retrieval computes similarity against both and takes the maximum:

\text{score}(c, q) = \max\left(\cos(\mathbf{v}_c^{\text{full}}, \mathbf{v}_q), \; \cos(\mathbf{v}_c^{\text{summary}}, \mathbf{v}_q)\right)
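A minimal sketch of the dual-embedding max score, using toy 3-dimensional vectors in place of real embeddings (function names are illustrative):

```python
import math

def cos_sim(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def dual_score(v_full: list[float], v_summary: list[float], v_query: list[float]) -> float:
    """Score against both indexed embeddings; keep the better match."""
    return max(cos_sim(v_full, v_query), cos_sim(v_summary, v_query))

v_query = [1.0, 0.0, 0.0]
# Full-content embedding misses the query, but the denser summary embedding hits it.
score = dual_score([0.0, 1.0, 0.0], [0.9, 0.1, 0.0], v_query)
```

The max lets a chunk surface whenever either representation matches, which is exactly the retrieval-proxy benefit described above.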

9.11 Adaptive Chunking: Runtime Chunk Size Adjustment Based on Query Complexity and Token Budget#

9.11.1 Motivation#

Static chunking assumes a fixed optimal chunk size at ingestion time. However, the optimal retrieval granularity depends on the query at runtime:

  • A broad question ("Summarize the company's Q3 performance") benefits from large, summary-level chunks
  • A precise question ("What was the EBITDA margin in Q3 2023?") benefits from small, fact-level chunks
  • A complex multi-hop question ("Compare Q3 margins across all subsidiaries") requires multiple medium-granularity chunks

Adaptive chunking adjusts the effective chunk granularity at query time without re-indexing.

9.11.2 Query Complexity Classification#

Define a query complexity classifier:

\kappa(q) = \text{Classify}(q) \in \{\text{factoid}, \text{analytical}, \text{comparative}, \text{exploratory}, \text{multi-hop}\}

Each complexity class maps to a preferred granularity level:

| Query Class | Preferred Granularity | Chunk Size Range | Strategy |
|---|---|---|---|
| Factoid | Fine (propositions, small chunks) | 50–150 tokens | Retrieve from agentic/leaf-level index |
| Analytical | Medium (paragraphs, sections) | 200–500 tokens | Retrieve from structural/semantic index |
| Comparative | Medium + multi-entity | 200–500 tokens | Retrieve per entity, align by schema |
| Exploratory | Coarse (summaries, full sections) | 500–1500 tokens | Retrieve from summary/parent-level index |
| Multi-hop | Mixed (fine + medium) | Varies | Iterative retrieval with decomposed sub-queries |

9.11.3 Token Budget-Aware Chunk Selection#

Given a context budget B_context, the number of retrievable chunks k is:

k = \left\lfloor \frac{B_{\text{context}} - B_{\text{system}} - B_{\text{query}} - B_{\text{reserve}}}{\bar{s}_{\kappa}} \right\rfloor

where:

  • B_system: tokens consumed by the system prompt, role policy, and tool definitions
  • B_query: tokens in the user query
  • B_reserve: tokens reserved for model generation
  • s̄_κ: average chunk size for the selected granularity level κ

This ensures the retrieval engine never overfills the context window.
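The budget arithmetic is worth making concrete (a minimal sketch; all numbers are illustrative and the function name is an assumption):

```python
def chunk_budget(b_context: int, b_system: int, b_query: int,
                 b_reserve: int, avg_chunk_size: float) -> int:
    """Number of chunks of avg_chunk_size that fit in the remaining budget."""
    available = b_context - b_system - b_query - b_reserve
    return max(0, int(available // avg_chunk_size))   # never negative

# 8K-token window, 1.2K system prompt, 80-token query, 1K generation reserve,
# medium-granularity chunks averaging 350 tokens.
k = chunk_budget(b_context=8192, b_system=1200, b_query=80,
                 b_reserve=1024, avg_chunk_size=350)
```

Clamping at zero covers the degenerate case where the fixed costs alone exhaust the window, in which case no chunks should be retrieved at this granularity.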

9.11.4 Runtime Chunk Merging and Splitting#

If the index stores chunks at a single granularity but the query demands a different one, runtime transformation is applied:

Runtime Merging (for coarser granularity):

c_{\text{merged}} = c_i \oplus c_{i+1} \oplus \ldots \oplus c_{i+m-1}

where c_i, …, c_{i+m−1} are consecutive chunks from the same document, and m is chosen such that |c_merged| ≤ τ_target.

Runtime Splitting (for finer granularity):

\{c_{i,1}, c_{i,2}, \ldots\} = \text{SentenceSplit}(c_i, \tau_{\text{target}})

Alternatively, if a hierarchical index (§9.4) is available, the retrieval engine simply navigates to the appropriate level.

9.11.5 Pseudo-Algorithm: Adaptive Retrieval with Dynamic Granularity#

ALGORITHM: AdaptiveRetrieve(q, index, B_context, B_system, B_reserve)
────────────────────────────────────────────────────────────────────────
INPUT:
  q           — user query
  index       — hierarchical chunk index with multiple granularity levels
  B_context   — total context window budget (tokens)
  B_system    — system prompt token consumption
  B_reserve   — tokens reserved for generation
 
OUTPUT:
  context[]   — ordered list of chunks for context injection
 
PROCEDURE:
  1.  // Classify query complexity
      κ ← ClassifyQueryComplexity(q)
      // κ ∈ {factoid, analytical, comparative, exploratory, multi_hop}
 
  2.  // Determine granularity level and chunk budget
      level ← GranularityMap[κ]
      avg_size ← index.AverageChunkSize(level)
      B_available ← B_context - B_system - TokenCount(q) - B_reserve
      k ← Floor(B_available / avg_size)
      k ← Clamp(k, min=3, max=20)   // Sensible bounds
 
  3.  // Retrieve at target granularity
      candidates ← index.Retrieve(q, level=level, top_k=k * 3)
      // Over-retrieve by 3× for re-ranking headroom
 
  4.  // Re-rank with cross-encoder or metadata-weighted scoring
      FOR EACH c IN candidates:
          c.relevance ← CrossEncoderScore(q, c.content)
          c.final_score ← (
              w_rel · c.relevance
            + w_fresh · c.freshness_score
            + w_auth · c.authority_score
            + w_coh · c.coherence_score
          )
 
  5.  candidates ← SortByFinalScore(candidates, descending=True)
 
  6.  // Greedy token-budget packing
      context ← []
      tokens_used ← 0
      FOR EACH c IN candidates:
          IF tokens_used + c.token_count ≤ B_available:
              APPEND c TO context
              tokens_used ← tokens_used + c.token_count
          ELSE:
              // Try summary version if available
              IF c.summary IS NOT NULL AND tokens_used + TokenCount(c.summary) ≤ B_available:
                  APPEND SummaryChunk(c) TO context
                  tokens_used ← tokens_used + TokenCount(c.summary)
              ELSE:
                  BREAK   // Budget exhausted
 
  7.  // For multi-hop queries: decompose and iterate
      IF κ = "multi_hop":
          sub_queries ← DecomposeQuery(q)
          FOR EACH sq IN sub_queries:
              sub_context ← AdaptiveRetrieve(sq, index, B_available / |sub_queries|, 0, 0)
              EXTEND context WITH sub_context
          context ← Deduplicate(context)
          context ← TruncateToFit(context, B_available)
 
  8.  // Order chunks by document position for coherent reading
      context ← SortByDocumentPosition(context)
 
  9.  RETURN context

9.11.6 Formal Objective: Budget-Optimal Chunk Selection#

The adaptive retrieval problem can be formalized as a constrained optimization:

\max_{\mathcal{S} \subseteq \mathcal{C}_{\text{candidates}}} \sum_{c \in \mathcal{S}} \text{score}(c, q) \quad \text{subject to} \quad \sum_{c \in \mathcal{S}} |c| \leq B_{\text{available}}

This is a variant of the 0-1 knapsack problem with item weight |c| and value score(c, q). For practical purposes, the greedy approximation (sort by the score-per-token ratio score(c, q)/|c| and pack greedily) yields near-optimal solutions:

\text{ratio}(c) = \frac{\text{score}(c, q)}{|c|}

Alternatively, for small candidate sets (at most ~50 candidates), dynamic programming yields an exact solution in O(|C| · B_available) time.
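The greedy approximation is a sort plus a packing loop; a minimal sketch, with illustrative names, assuming candidates arrive as (score, token_count, payload) tuples:

```python
def greedy_pack(candidates, budget):
    """Greedily pack chunks by score-per-token ratio under a token budget.

    candidates: list of (score, token_count, payload) tuples
    budget: B_available, the context token budget
    Returns (selected payloads, total score achieved).
    """
    # Sort by value density: score per token, highest first
    ranked = sorted(candidates, key=lambda c: c[0] / c[1], reverse=True)
    selected, total_score, tokens_used = [], 0.0, 0
    for score, tokens, payload in ranked:
        # Skip items that do not fit; smaller chunks later in the
        # ranking may still fit within the remaining budget
        if tokens_used + tokens <= budget:
            selected.append(payload)
            total_score += score
            tokens_used += tokens
    return selected, total_score
```

Skipping oversized items rather than breaking out of the loop (as the budget-packing pseudocode in the previous subsection does) tightens the approximation slightly; either policy is defensible depending on whether document order must be preserved during packing.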


9.12 Chunk Quality Metrics: Retrieval Precision Impact, Contextual Completeness, Synthesis Utility#

9.12.1 Motivation#

Chunking quality is not an abstract property — it manifests in measurable downstream effects on retrieval precision, context coherence, and agent output quality. A rigorous chunk quality framework defines metrics at three levels: chunk-level (intrinsic quality), retrieval-level (search effectiveness), and synthesis-level (downstream generation quality).

9.12.2 Chunk-Level Intrinsic Metrics#

9.12.2.1 Coherence Score#

As defined in §9.3.4:

\text{Coherence}(c) = \frac{1}{\binom{|S_c|}{2}} \sum_{i < j} \cos(\mathbf{e}_i, \mathbf{e}_j)

where S_c is the set of sentences in chunk c and e_i are their sentence embeddings. Cosine similarity lies in [−1, 1], but typical sentence embeddings yield non-negative pairwise similarities, so in practice the score falls in [0, 1]; target ≥ 0.7.
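A minimal sketch of this computation, assuming sentence embeddings have already been produced by some upstream model (here they are plain Python lists):

```python
from itertools import combinations
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def coherence(sentence_embeddings):
    """Mean pairwise cosine similarity over all sentence pairs in a chunk."""
    pairs = list(combinations(sentence_embeddings, 2))
    if not pairs:
        return 1.0  # a single-sentence chunk is trivially coherent
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)
```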

9.12.2.2 Self-Containedness Score#

Measures whether a chunk can be understood independently:

\text{SelfContained}(c) = 1 - \frac{|\text{UnresolvedReferences}(c)|}{|\text{TotalReferences}(c)|}

where unresolved references include dangling pronouns ("it," "they," "the above"), undefined acronyms, and incomplete sentences. Range: [0, 1]; target ≥ 0.85.

9.12.2.3 Information Density#

\text{Density}(c) = \frac{|\text{UniqueEntities}(c)| + |\text{UniqueFacts}(c)|}{|c|_{\text{tokens}}}

Higher density indicates more information per token, meaning fewer tokens are wasted on filler or repetition.

9.12.2.4 Boundary Quality#

Assesses whether chunk boundaries respect semantic units:

\text{BoundaryQuality}(c) = \begin{cases} 1 & \text{if } c \text{ starts and ends at sentence boundaries} \\ 0.5 & \text{if only one boundary is sentence-aligned} \\ 0 & \text{if both boundaries are mid-sentence} \end{cases}

9.12.3 Retrieval-Level Metrics#

9.12.3.1 Chunk Retrieval Precision at k#

Given a set of evaluation queries Q_eval with ground-truth relevant chunks R_q for each query q:

\text{Precision@}k = \frac{1}{|Q_{\text{eval}}|} \sum_{q \in Q_{\text{eval}}} \frac{|\text{TopK}(q, k) \cap R_q|}{k}

9.12.3.2 Chunk Recall at k#

\text{Recall@}k = \frac{1}{|Q_{\text{eval}}|} \sum_{q \in Q_{\text{eval}}} \frac{|\text{TopK}(q, k) \cap R_q|}{|R_q|}

9.12.3.3 Mean Reciprocal Rank (MRR)#

\text{MRR} = \frac{1}{|Q_{\text{eval}}|} \sum_{q \in Q_{\text{eval}}} \frac{1}{\text{rank}_q}

where rank_q is the position of the first relevant chunk in the ranked retrieval list.

9.12.3.4 Normalized Discounted Cumulative Gain (NDCG)#

For graded relevance labels rel(c, q) ∈ {0, 1, 2, 3}:

\text{DCG@}k = \sum_{i=1}^{k} \frac{2^{\text{rel}(c_i, q)} - 1}{\log_2(i + 1)}

\text{NDCG@}k = \frac{\text{DCG@}k}{\text{IDCG@}k}

where IDCG@k is the ideal DCG (relevance labels sorted in descending order).
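All four retrieval metrics reduce to a few lines per query; a sketch for a single query (corpus-level figures are means over Q_eval; the graded-relevance map in `ndcg_at_k` is an assumption of this example):

```python
from math import log2

def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved chunk IDs that are relevant."""
    return len(set(retrieved[:k]) & set(relevant)) / k

def recall_at_k(retrieved, relevant, k):
    """Fraction of relevant chunk IDs found in the top-k."""
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)

def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant chunk; 0 if none retrieved."""
    for i, cid in enumerate(retrieved, start=1):
        if cid in relevant:
            return 1.0 / i
    return 0.0

def ndcg_at_k(retrieved, rels, k):
    """rels maps chunk_id -> graded relevance in {0, 1, 2, 3}."""
    dcg = sum((2 ** rels.get(cid, 0) - 1) / log2(i + 1)
              for i, cid in enumerate(retrieved[:k], start=1))
    ideal = sorted(rels.values(), reverse=True)[:k]
    idcg = sum((2 ** r - 1) / log2(i + 1) for i, r in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0
```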

9.12.4 Synthesis-Level Metrics#

These measure the impact of chunk quality on the final agent output:

9.12.4.1 Faithfulness (Groundedness)#

\text{Faithfulness}(y, \mathcal{C}) = \frac{|\text{Claims}(y) \cap \text{Entailed}(\mathcal{C})|}{|\text{Claims}(y)|}

The fraction of claims in the agent's output y that are entailed by the retrieved context C. This is the primary hallucination metric, and chunking directly affects it: poorly chunked context that omits critical evidence forces the model to confabulate.

9.12.4.2 Answer Completeness#

\text{Completeness}(y, y^*) = \frac{|\text{Claims}(y) \cap \text{Claims}(y^*)|}{|\text{Claims}(y^*)|}

where y* is the ground-truth answer. Incomplete chunking produces incomplete answers.

9.12.4.3 Context Utilization#

\text{Utilization}(\mathcal{C}, y) = \frac{|\{c \in \mathcal{C} \mid \exists \text{claim} \in y \text{ attributed to } c\}|}{|\mathcal{C}|}

Measures what fraction of retrieved chunks the model actually used. Low utilization indicates that chunking is producing low-relevance results.

9.12.5 Composite Chunk Quality Score#

A weighted composite metric for chunking strategy evaluation:

Q_{\text{chunk}} = w_1 \cdot \text{NDCG@}k + w_2 \cdot \text{Faithfulness} + w_3 \cdot \text{Completeness} + w_4 \cdot \overline{\text{Coherence}} + w_5 \cdot \overline{\text{SelfContained}}

subject to:

\sum_{i=1}^{5} w_i = 1, \quad w_i \geq 0

where the overlines denote corpus-level averages of the coherence and self-containedness scores. This composite score enables A/B evaluation of chunking strategies against a fixed evaluation set.
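Evaluating the composite is a constrained weighted sum; a sketch with hypothetical weights (the weighting itself must be chosen per deployment, not taken from this example):

```python
def composite_chunk_quality(metrics, weights):
    """Weighted composite Q_chunk.

    metrics and weights share the keys:
    ndcg, faithfulness, completeness, coherence, self_contained.
    Enforces the simplex constraint: weights sum to 1 and are non-negative.
    """
    assert abs(sum(weights.values()) - 1.0) < 1e-9, "weights must sum to 1"
    assert all(w >= 0 for w in weights.values()), "weights must be non-negative"
    return sum(weights[k] * metrics[k] for k in weights)
```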

9.12.6 Pseudo-Algorithm: Chunk Quality Evaluation Pipeline#

ALGORITHM: EvaluateChunkQuality(chunks[], Q_eval, ground_truth, agent_fn, embed_fn)
───────────────────────────────────────────────────────────────────────────────────────
INPUT:
  chunks[]       — chunked corpus under evaluation
  Q_eval         — evaluation query set with ground-truth answers and relevant chunks
  ground_truth   — mapping: query → (answer, relevant_chunk_ids)
  agent_fn       — agent pipeline function: (query, context) → response
  embed_fn       — embedding function
 
OUTPUT:
  report         — quality report with all metrics
 
PROCEDURE:
  1.  // Index chunks
      index ← BuildIndex(chunks, embed_fn)
 
  2.  // Intrinsic metrics
      coherence_scores ← [Coherence(c, embed_fn) FOR c IN chunks]
      self_contained_scores ← [SelfContainednessScore(c) FOR c IN chunks]
      density_scores ← [InformationDensity(c) FOR c IN chunks]
      boundary_scores ← [BoundaryQuality(c) FOR c IN chunks]
 
  3.  // Retrieval metrics
      precision_at_k ← []
      recall_at_k ← []
      mrr_scores ← []
      ndcg_scores ← []
 
      FOR EACH (q, gt) IN Zip(Q_eval, ground_truth):
          k ← 10
          retrieved ← index.Retrieve(q, top_k=k)
          retrieved_ids ← [c.chunk_id FOR c IN retrieved]
          relevant_ids ← gt.relevant_chunk_ids
 
          precision_at_k.APPEND(|Set(retrieved_ids[:k]) ∩ Set(relevant_ids)| / k)
          recall_at_k.APPEND(|Set(retrieved_ids[:k]) ∩ Set(relevant_ids)| / |relevant_ids|)
          mrr_scores.APPEND(1 / FirstRelevantRank(retrieved_ids, relevant_ids))
          ndcg_scores.APPEND(NDCG(retrieved_ids, relevant_ids, k))
 
  4.  // Synthesis metrics
      faithfulness_scores ← []
      completeness_scores ← []
      utilization_scores ← []
 
      FOR EACH (q, gt) IN Zip(Q_eval, ground_truth):
          retrieved ← index.Retrieve(q, top_k=5)
          response ← agent_fn(q, retrieved)
 
          claims_y ← ExtractClaims(response)
          claims_gt ← ExtractClaims(gt.answer)
 
          faithfulness_scores.APPEND(
              |EntailedClaims(claims_y, retrieved)| / Max(|claims_y|, 1)
          )
          completeness_scores.APPEND(
              |Set(claims_y) ∩ Set(claims_gt)| / Max(|claims_gt|, 1)
          )
          utilization_scores.APPEND(
              |UsedChunks(response, retrieved)| / Max(|retrieved|, 1)
          )
 
  5.  report ← {
          intrinsic: {
              mean_coherence:       Mean(coherence_scores),
              mean_self_contained:  Mean(self_contained_scores),
              mean_density:         Mean(density_scores),
              mean_boundary:        Mean(boundary_scores)
          },
          retrieval: {
              precision_at_k:       Mean(precision_at_k),
              recall_at_k:          Mean(recall_at_k),
              mrr:                  Mean(mrr_scores),
              ndcg_at_k:            Mean(ndcg_scores)
          },
          synthesis: {
              faithfulness:         Mean(faithfulness_scores),
              completeness:         Mean(completeness_scores),
              utilization:          Mean(utilization_scores)
          },
          composite: ComputeCompositeScore(above, weights)
      }
 
  6.  RETURN report

9.13 Chunk Storage and Indexing: Vector Stores, Inverted Indexes, Hybrid Index Structures#

9.13.1 Architectural Requirements#

Chunks, once produced and enriched, must be stored in a retrieval infrastructure that supports:

  1. Semantic search: approximate nearest neighbor (ANN) over dense embeddings
  2. Keyword search: exact match, BM25, inverted index queries
  3. Metadata filtering: pre-retrieval filtering on entity tags, document ID, date ranges, chunk type
  4. Hierarchical navigation: parent-child traversal for chunk tree indexes
  5. Freshness-aware retrieval: time-weighted scoring
  6. Scalability: sub-100ms retrieval latency at millions-to-billions of chunks
  7. Idempotent upsert: re-chunking the same document version produces no duplicates

9.13.2 Vector Store Architecture#

Dense embedding retrieval stores each chunk as a vector v_c ∈ ℝ^d alongside its metadata. The retrieval operation for query q is:

\text{TopK}(q, k) = \underset{c \in \mathcal{C}}{\text{argmax-}k} \; \text{sim}(\mathbf{v}_q, \mathbf{v}_c)

where sim is typically cosine similarity or inner product.

ANN Index Structures#

| Index Type | Build Time | Query Time | Memory | Recall | Best For |
|---|---|---|---|---|---|
| Flat (brute-force) | O(n) | O(n · d) | O(n · d) | 100% | Small corpora (< 100K chunks) |
| IVF (Inverted File) | O(n · d) | O(√n · d) | O(n · d) | 95–99% | Medium corpora (100K–10M) |
| HNSW (Hierarchical NSW) | O(n · log n) | O(log n · d) | O(n · d · M) | 97–99.9% | Production systems requiring high recall |
| PQ (Product Quantization) | O(n · d) | O(n · d / m) | O(n · m) | 90–95% | Billion-scale with memory constraints |
| ScaNN | O(n · d) | O(√n) | O(n · d) | 98–99% | Google-scale retrieval |

For production agentic systems, HNSW is the default recommendation due to its logarithmic query time, high recall, and mature ecosystem support.

The HNSW parameters:

  • M: maximum number of connections per node (typical: 16–64)
  • efConstruction: search depth during index build (typical: 200–500)
  • efSearch: search depth during query (typical: 100–300; tune for the recall-latency trade-off)

The recall-latency trade-off is governed by:

\text{Recall@}k \approx 1 - e^{-\gamma \cdot \text{efSearch}}

where γ depends on data dimensionality and distribution.

9.13.3 Sparse Index: Inverted Index and BM25#

For exact-match and keyword-based retrieval, an inverted index maps terms to chunk IDs:

\mathcal{I}: \text{term} \rightarrow \{(c_1, \text{tf}_1), (c_2, \text{tf}_2), \ldots\}

BM25 scoring for a query q = {t_1, t_2, …, t_m} against chunk c:

\text{BM25}(q, c) = \sum_{i=1}^{m} \text{IDF}(t_i) \cdot \frac{f(t_i, c) \cdot (k_1 + 1)}{f(t_i, c) + k_1 \cdot \left(1 - b + b \cdot \frac{|c|}{\text{avgdl}}\right)}

where:

  • f(t_i, c): term frequency of t_i in chunk c
  • IDF(t_i) = ln((N − n(t_i) + 0.5) / (n(t_i) + 0.5) + 1)
  • N: total number of chunks
  • n(t_i): number of chunks containing term t_i
  • k_1 ≈ 1.2, b ≈ 0.75: tuning parameters
  • avgdl: average chunk length in tokens
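A toy scorer that follows the formula term by term (no stemming or stop-word handling; tokenization is assumed to have happened upstream):

```python
from math import log

def bm25_score(query_terms, chunk_terms, corpus, k1=1.2, b=0.75):
    """Score one chunk against a query.

    query_terms: list of query tokens
    chunk_terms: token list of the chunk being scored
    corpus: list of token lists, one per chunk (used for N, n(t), avgdl)
    """
    N = len(corpus)
    avgdl = sum(len(c) for c in corpus) / N
    score = 0.0
    for t in query_terms:
        n_t = sum(1 for c in corpus if t in c)        # chunks containing t
        idf = log((N - n_t + 0.5) / (n_t + 0.5) + 1)  # IDF as defined above
        f = chunk_terms.count(t)                      # term frequency in chunk
        denom = f + k1 * (1 - b + b * len(chunk_terms) / avgdl)
        score += idf * f * (k1 + 1) / denom
    return score
```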

9.13.4 Hybrid Index: Dense + Sparse Fusion#

Production systems combine vector search and keyword search through reciprocal rank fusion (RRF) or linear score combination:

Reciprocal Rank Fusion (RRF)#

\text{RRF}(c) = \sum_{r \in \mathcal{R}} \frac{1}{K + \text{rank}_r(c)}

where R is the set of ranking systems (dense + sparse), rank_r(c) is the rank of chunk c in system r, and K is a smoothing constant (typically K = 60).
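Because RRF consumes only ranks, it needs no score normalization across systems; a sketch of the fusion step (function name illustrative):

```python
def rrf_fuse(rankings, K=60):
    """Fuse multiple ranked lists of chunk IDs via reciprocal rank fusion.

    rankings: list of ranked ID lists (best first), e.g. [dense_ids, sparse_ids]
    Returns chunk IDs sorted by fused score, best first.
    """
    scores = {}
    for ranked in rankings:
        for rank, cid in enumerate(ranked, start=1):
            # Each system contributes 1 / (K + rank) for every ID it ranks
            scores[cid] = scores.get(cid, 0.0) + 1.0 / (K + rank)
    return sorted(scores, key=scores.get, reverse=True)
```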

Linear Score Combination#

\text{HybridScore}(c, q) = \alpha \cdot \text{DenseScore}(c, q) + (1 - \alpha) \cdot \text{SparseScore}(c, q)

where α ∈ [0, 1] is tuned per domain. For most knowledge-base retrieval tasks, α ∈ [0.5, 0.7] (slightly favoring semantic search) performs well.

9.13.5 Metadata Index Layer#

A separate metadata index enables pre-retrieval filtering. This is typically implemented as:

  • Structured metadata store: PostgreSQL, SQLite, or document store with indexed fields
  • Filter pushdown: Metadata filters are applied before ANN search, reducing the candidate set:
\text{TopK}_{\text{filtered}}(q, k, \text{filters}) = \underset{c \in \mathcal{C}_{\text{filtered}}}{\text{argmax-}k} \; \text{sim}(\mathbf{v}_q, \mathbf{v}_c)

where C_filtered = {c ∈ C | filters(c.metadata) = true}.

Common filter predicates:

source_doc_id = "doc_xyz"
created_at >= "2024-01-01"
chunk_type IN ["structural", "semantic"]
entity_tags CONTAINS "OpenAI"
language = "python"
freshness_score >= 0.5

9.13.6 Hierarchical Index for Multi-Granularity Retrieval#

For hierarchical chunks (§9.4), the index supports level-aware retrieval:

Level 1 (summaries):     Index_L1 → HNSW over summary embeddings
Level 2 (sections):      Index_L2 → HNSW over section embeddings  
Level 3 (paragraphs):    Index_L3 → HNSW over paragraph embeddings
Level 4 (propositions):  Index_L4 → HNSW over proposition embeddings

The retrieval router selects the index level based on query complexity (§9.11.2) and then optionally expands to parent or child chunks via the chunk tree structure stored in the metadata layer.

9.13.7 Pseudo-Algorithm: Hybrid Index Build and Query#

ALGORITHM: BuildHybridIndex(chunks[], embed_fn, tokenizer)
────────────────────────────────────────────────────────────
INPUT:
  chunks[]    — enriched chunks with metadata
  embed_fn    — embedding function: string → ℝ^d
  tokenizer   — tokenizer for BM25 term extraction
 
OUTPUT:
  hybrid_index — composite index supporting dense, sparse, and metadata queries
 
PROCEDURE:
  1.  // Dense index construction
      dense_index ← InitHNSW(dim=d, M=32, efConstruction=400)
      FOR EACH chunk c IN chunks:
          v ← embed_fn(c.content)
          dense_index.Add(c.chunk_id, v)
 
  2.  // Sparse index construction (BM25)
      sparse_index ← InitInvertedIndex()
      FOR EACH chunk c IN chunks:
          terms ← tokenizer.Tokenize(c.content)
          sparse_index.AddDocument(c.chunk_id, terms, metadata={
              doc_length: |terms|
          })
      sparse_index.ComputeIDF()
 
  3.  // Metadata index construction
      metadata_store ← InitStructuredStore(schema=ChunkMetadataSchema)
      FOR EACH chunk c IN chunks:
          metadata_store.Upsert(c.chunk_id, c.metadata)
          // Upsert: idempotent — same chunk_id overwrites cleanly
 
  4.  // Summary embedding index (optional, for dual-embedding)
      summary_index ← InitHNSW(dim=d, M=16, efConstruction=200)
      FOR EACH chunk c IN chunks WHERE c.summary IS NOT NULL:
          v_s ← embed_fn(c.summary)
          summary_index.Add(c.chunk_id, v_s)
 
  5.  hybrid_index ← {
          dense: dense_index,
          sparse: sparse_index,
          metadata: metadata_store,
          summary: summary_index,
          config: {α: 0.6, K_rrf: 60}
      }
 
  6.  RETURN hybrid_index
 
 
ALGORITHM: HybridQuery(hybrid_index, q, k, filters, α, fusion_mode, embed_fn, tokenizer)
───────────────────────────────────────────────────────
INPUT:
  hybrid_index  — composite index
  q             — query string
  k             — number of results to return
  filters       — metadata filter predicates
  α             — dense/sparse weight (override or use default)
  fusion_mode   — "RRF" or "linear"
  embed_fn      — embedding function (same one used at index build)
  tokenizer     — tokenizer (same one used at index build)
 
OUTPUT:
  results[]     — ranked list of (chunk_id, score, chunk) tuples
 
PROCEDURE:
  1.  // Apply metadata filters
      candidate_ids ← hybrid_index.metadata.Query(filters)
      // Returns set of chunk_ids matching filter predicates
 
  2.  // Dense retrieval (restricted to candidates)
      v_q ← embed_fn(q)
      dense_results ← hybrid_index.dense.Search(
          v_q, top_k=k * 3, restrict_to=candidate_ids
      )
      // Returns: [(chunk_id, cosine_score), ...]
 
  3.  // Sparse retrieval (restricted to candidates)
      q_terms ← tokenizer.Tokenize(q)
      sparse_results ← hybrid_index.sparse.Search(
          q_terms, top_k=k * 3, restrict_to=candidate_ids
      )
      // Returns: [(chunk_id, bm25_score), ...]
 
  4.  // Score fusion
      IF fusion_mode = "RRF":
          // Reciprocal Rank Fusion
          scores ← DefaultDict(float)
          FOR rank, (cid, _) IN Enumerate(dense_results):
              scores[cid] += 1.0 / (K_rrf + rank + 1)
          FOR rank, (cid, _) IN Enumerate(sparse_results):
              scores[cid] += 1.0 / (K_rrf + rank + 1)
 
      ELSE IF fusion_mode = "linear":
          // Normalize scores to [0, 1]
          dense_norm ← MinMaxNormalize(dense_results)
          sparse_norm ← MinMaxNormalize(sparse_results)
          scores ← DefaultDict(float)
          FOR (cid, s) IN dense_norm:
              scores[cid] += α · s
          FOR (cid, s) IN sparse_norm:
              scores[cid] += (1 - α) · s
 
  5.  // Sort and return top-k
      ranked ← SortByScore(scores, descending=True)[:k]
 
  6.  // Fetch full chunk content and metadata
      results ← []
      FOR EACH (cid, score) IN ranked:
          chunk ← hybrid_index.metadata.GetChunk(cid)
          APPEND (cid, score, chunk) TO results
 
  7.  RETURN results

9.13.8 Storage Architecture Patterns#

Pattern 1: Integrated Vector Database#

A single system (e.g., Weaviate, Qdrant, Milvus, Pinecone) serves as both vector store and metadata store:

[Chunks] → [Embedding] → [Vector DB with metadata filtering]
                              ├── HNSW index (dense)
                              ├── BM25 index (sparse, if supported)
                              └── Metadata index (filterable fields)

Pros: Operational simplicity, atomic upserts, consistent filtering. Cons: May lack full-featured BM25 or complex metadata query support.

Pattern 2: Disaggregated Stores#

Separate systems for each concern:

[Chunks] → [Embedding] → [HNSW Vector Store (Qdrant, Faiss)]
                       → [Elasticsearch / OpenSearch (BM25 + metadata)]
                       → [PostgreSQL (chunk metadata, hierarchical pointers)]

Pros: Best-of-breed per concern, independent scaling. Cons: Operational complexity, consistency management, query fan-out latency.

Pattern 3: Lakehouse with Vector Index Overlay#

Chunks stored in a data lakehouse (Delta Lake, Iceberg) with vector index acceleration:

[Parquet files in object store]
    ├── Column: chunk_id, content, metadata (structured)
    ├── Column: embedding (vector)
    └── [Vector index sidecar: HNSW over embedding column]

Pros: Cost-effective for very large corpora, SQL analytics over chunks, version control. Cons: Higher query latency, complex infrastructure.

9.13.9 Idempotent Upsert and Versioning#

Re-ingestion of a document must not produce duplicate chunks. The idempotency contract:

\text{ChunkID}(c) = \text{Hash}(\text{source\_doc\_id} \| \text{source\_version} \| \text{position\_index} \| \text{chunk\_type})

On re-ingestion:

  1. Compute new chunk IDs for the updated document
  2. Delete chunks whose source_doc_id matches but whose IDs are no longer in the new set
  3. Upsert new/changed chunks
  4. Index entries are atomically updated

This ensures the index is always consistent with the latest document version without manual deduplication.
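The contract above can be sketched with a SHA-256 content address; the field separator and ID truncation here are illustrative choices, not requirements:

```python
import hashlib

def chunk_id(source_doc_id, source_version, position_index, chunk_type):
    """Deterministic chunk ID from the idempotency key fields.

    Uses an ASCII unit separator to join fields so that field
    concatenation cannot collide across boundaries.
    """
    key = "\x1f".join([source_doc_id, source_version,
                       str(position_index), chunk_type])
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]

def reingest_plan(old_ids, new_ids):
    """Compute the delete/upsert sets for re-ingesting one document.

    Chunks whose IDs vanished from the new set are deleted;
    every chunk in the new set is upserted (idempotent overwrite).
    """
    old_set, new_set = set(old_ids), set(new_ids)
    return sorted(old_set - new_set), sorted(new_set)
```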

9.13.10 Cache Hierarchy for Retrieval Artifacts#

To minimize retrieval latency under repeated or similar queries:

| Cache Layer | Scope | TTL | Content |
|---|---|---|---|
| Query embedding cache | Per query hash | 1 hour | Precomputed query vector v_q |
| Result cache | Per (query hash, filter hash) | 5 minutes | Top-k chunk IDs and scores |
| Chunk content cache | Per chunk ID | 24 hours | Full chunk content + metadata |
| Cross-encoder score cache | Per (query, chunk ID) | 1 hour | Re-ranking scores |

Cache invalidation is triggered by chunk upserts: any chunk ID that changes invalidates all result caches containing it.
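The upsert-triggered invalidation can be implemented with a reverse index from chunk ID to the result-cache keys that contain it; a minimal sketch (class and method names are illustrative, and TTL handling is omitted):

```python
class ResultCache:
    """Top-k result cache with upsert-triggered invalidation."""

    def __init__(self):
        self._cache = {}     # (query_hash, filter_hash) -> list of chunk IDs
        self._by_chunk = {}  # chunk_id -> set of cache keys containing it

    def put(self, key, chunk_ids):
        self._cache[key] = list(chunk_ids)
        for cid in chunk_ids:
            self._by_chunk.setdefault(cid, set()).add(key)

    def get(self, key):
        return self._cache.get(key)

    def invalidate_chunk(self, chunk_id):
        """Called on upsert: drop every cached result containing this chunk."""
        for key in self._by_chunk.pop(chunk_id, set()):
            self._cache.pop(key, None)
```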

9.13.11 Observability and Operational Metrics#

| Metric | Description | Alert Threshold |
|---|---|---|
| retrieval_latency_p99 | 99th percentile retrieval latency | > 200ms |
| index_size_chunks | Total chunks in index | Monitor growth rate |
| embedding_throughput | Chunks embedded per second | < 100/s triggers scaling |
| cache_hit_rate | Fraction of queries served from cache | < 50% indicates cold cache |
| index_freshness_lag | Time between document update and index update | > 5 minutes |
| duplicate_chunk_rate | Fraction of chunks with duplicate content | > 1% indicates idempotency failure |
| metadata_completeness | Fraction of chunks with all metadata fields populated | < 95% triggers enrichment pipeline review |

Chapter Summary#

Chunking is the foundational engineering discipline that determines the quality ceiling of every downstream retrieval and generation operation in an agentic system. This chapter has established:

  1. No universal strategy exists: Document class determines the optimal chunking approach. The system must classify documents and route to the appropriate strategy.

  2. Six primary strategies — structural, semantic, hierarchical, agentic, code-aware, and tabular — each formalized with mathematical definitions, pseudo-algorithms, and trade-off analyses.

  3. Multi-modal and adaptive extensions that handle heterogeneous media and runtime query-driven granularity adjustment.

  4. Overlap and boundary coherence techniques that mitigate information loss at chunk boundaries, with formal cost-expansion analysis.

  5. Metadata enrichment as a non-optional requirement: every chunk must carry provenance, position, entity tags, summary, coherence score, and hierarchical pointers.

  6. Quality metrics at three levels — intrinsic, retrieval, and synthesis — with a composite evaluation framework for A/B comparison of chunking strategies.

  7. Hybrid indexing infrastructure combining dense (HNSW), sparse (BM25), and metadata indexes with formal fusion algorithms, cache hierarchies, idempotent upsert contracts, and production observability.

The key architectural principle is: chunking is not preprocessing; it is a continuously evaluated, document-class-aware, query-adaptive retrieval precision lever that must be engineered with the same rigor as any other production subsystem in the agentic stack.