Agentic Notes Library

Chapter 9: Chunking Strategies — Document-Class-Specific Segmentation for Retrieval Precision

March 20, 2026

Preamble#

Chunking is the act of partitioning a source document into discrete retrieval units — the atomic segments that are embedded, indexed, retrieved, and injected into an agent's context window. In production agentic systems, chunking is not a preprocessing convenience; it is a retrieval precision lever that directly governs the signal-to-noise ratio of every context window the agent constructs. A poorly chunked corpus degrades retrieval recall, inflates token cost, introduces incoherent context boundaries, and ultimately amplifies hallucination risk. A well-engineered chunking layer, by contrast, produces retrieval units that are semantically self-contained, provenance-tagged, hierarchically navigable, and optimally sized for downstream synthesis.

This chapter develops chunking as a first-class engineering discipline within the agentic retrieval stack. We formalize each strategy — structural, semantic, hierarchical, agentic, code-aware, tabular, multi-modal, and adaptive — with mathematical definitions, pseudo-algorithms, quality metrics, and explicit trade-off analyses. Every strategy is evaluated against the requirements of retrieval precision, contextual completeness, synthesis utility, token efficiency, and production-grade scalability.


9.1 Chunking as a Retrieval Precision Lever: Why One Strategy Fails All Document Types#

9.1.1 The Fundamental Problem#

A retrieval-augmented agent constructs its context window \mathcal{C} from a set of retrieved chunks \{c_1, c_2, \ldots, c_k\} selected from an indexed corpus \mathcal{D}. The quality of the agent's output y is bounded by the quality of \mathcal{C}:

P(y \mid q, \mathcal{C}) \leq P(y \mid q, \mathcal{D}^*)

where q is the user query and \mathcal{D}^* is the ideal complete evidence set. The gap between P(y \mid q, \mathcal{C}) and P(y \mid q, \mathcal{D}^*) is the retrieval information loss, and chunking strategy is the primary determinant of that loss.

9.1.2 Why Uniform Chunking Fails#

A naïve fixed-size chunking strategy partitions a document D of length L tokens into n = \lceil L / s \rceil chunks of size s:

c_i = D[i \cdot s : (i+1) \cdot s], \quad i \in \{0, 1, \ldots, n-1\}

This approach introduces three classes of failure:

| Failure Class | Description | Consequence |
|---|---|---|
| Boundary incoherence | Splits occur mid-sentence, mid-paragraph, or mid-argument | Retrieved chunks are semantically incomplete, forcing the model to hallucinate missing context |
| Granularity mismatch | A single chunk size cannot match the natural information density of heterogeneous documents | Code files, legal contracts, tabular data, and narrative prose have radically different information topologies |
| Metadata destruction | Fixed splitting discards structural cues: headings, nesting, schema boundaries | Retrieval loses the ability to rank by document position, section relevance, or hierarchical containment |
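The boundary-incoherence failure is easy to reproduce. Below is a minimal sketch of the naïve fixed-size splitter from §9.1.2, using characters in place of tokens for brevity; the failure mode is identical with a real tokenizer:

```python
def fixed_size_chunks(doc: str, s: int) -> list[str]:
    """Naive splitter c_i = D[i*s : (i+1)*s]; no awareness of
    sentence, paragraph, or section boundaries."""
    n = -(-len(doc) // s)  # ceil(L / s)
    return [doc[i * s:(i + 1) * s] for i in range(n)]

# The first boundary of this toy document lands mid-word,
# illustrating boundary incoherence:
demo = fixed_size_chunks(
    "Chunking governs retrieval precision. Poor boundaries split sentences.", 40
)
```

The partition is lossless (concatenating the chunks reproduces the document), yet each individual chunk may be uninterpretable on its own.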

9.1.3 Document-Class Taxonomy#

Production corpora are heterogeneous. Each document class exhibits a distinct information topology that demands a specialized chunking strategy:

| Document Class | Information Topology | Optimal Chunking Strategy |
|---|---|---|
| Technical documentation | Heading-section hierarchy | Structural |
| Research papers | Argument-claim-evidence chains | Semantic + Hierarchical |
| Legal contracts | Clause-subclause nesting | Structural + Hierarchical |
| Source code | AST-defined scopes | Code (AST-based) |
| Spreadsheets / CSV / databases | Row-column schema | Tabular |
| Conversational transcripts | Turn-topic segmentation | Semantic |
| Multi-modal artifacts | Image-text-audio alignment | Multi-modal |
| Knowledge base articles | Proposition-fact granularity | Agentic |

9.1.4 The Chunking Design Objective#

Formally, the chunking function \text{CHUNK}: \mathcal{D} \rightarrow \{c_1, c_2, \ldots, c_n\} must simultaneously optimize:

\max_{\text{CHUNK}} \; \underbrace{\text{RetrievalPrecision}(\mathcal{C}, q)}_{\text{signal relevance}} + \lambda_1 \underbrace{\text{ContextualCompleteness}(c_i)}_{\text{self-containedness}} + \lambda_2 \underbrace{\text{SynthesisUtility}(c_i)}_{\text{downstream usability}}

subject to:

|c_i| \leq \tau_{\max} \quad \forall i, \qquad \sum_{i=1}^{k} |c_i| \leq B_{\text{context}}, \qquad \text{Provenance}(c_i) \neq \varnothing \quad \forall i

where \tau_{\max} is the maximum chunk token size, B_{\text{context}} is the agent's context token budget, and provenance is mandatory for every chunk.

9.1.5 Architectural Position of Chunking in the Agentic Stack#

Chunking sits at the critical junction between ingestion and retrieval:

Document Ingestion → [Classification] → [Strategy Selection] → [Chunking Engine]
    → [Metadata Enrichment] → [Embedding] → [Indexing] → Retrieval Engine

The chunking engine receives a classified document and a strategy selector, produces enriched chunks, and emits them to the embedding and indexing pipeline. This is not a batch-once operation; adaptive chunking (§9.11) permits runtime re-chunking based on query characteristics.


9.2 Structural Chunking: Heading, Section, Paragraph, and Markup-Aware Splitting#

9.2.1 Definition and Scope#

Structural chunking exploits the explicit organizational markers present in a document — headings, subheadings, paragraphs, list items, HTML/XML tags, Markdown formatting, LaTeX section commands, or PDF layout elements — to define chunk boundaries that align with the author's intended information hierarchy.

Let a document D be represented as an ordered tree \mathcal{T}(D) = (V, E) where each node v \in V corresponds to a structural element (section, subsection, paragraph, list item) and edges E represent containment. Structural chunking defines chunk boundaries at a target depth d of \mathcal{T}(D):

\text{CHUNK}_{\text{struct}}(D, d) = \{ \text{serialize}(v) \mid v \in V, \; \text{depth}(v) = d \}

If |\text{serialize}(v)| > \tau_{\max}, the node is recursively split at depth d+1.

9.2.2 Markup Detection and Parsing#

Structural chunking requires a format-specific parser that recovers the document tree:

| Source Format | Structural Signals | Parser Strategy |
|---|---|---|
| Markdown | #, ##, ###, ---, list markers | Regex + CommonMark AST |
| HTML | \<h1\>–\<h6\>, \<section\>, \<article\>, \<div\> | DOM tree traversal |
| LaTeX | \section, \subsection, \paragraph | LaTeX AST parser |
| PDF | Font size transitions, bold headings, indentation | Layout analysis (heuristic or ML-based) |
| DOCX/ODT | Style-tagged heading levels, \<w:pStyle\> | XML schema extraction |
| reStructuredText | Underline-delimited headings, directives | RST parser |

9.2.3 Pseudo-Algorithm: Structural Chunking#

ALGORITHM: StructuralChunk(D, format, τ_max, d_target)
───────────────────────────────────────────────────────
INPUT:
  D         — raw document content
  format    — detected document format (markdown, html, pdf, ...)
  τ_max     — maximum chunk token size
  d_target  — target structural depth for chunking
 
OUTPUT:
  chunks[]  — list of structurally-bounded chunks with metadata
 
PROCEDURE:
  1.  T ← ParseDocumentTree(D, format)
      // Recover the structural tree T = (V, E) from format-specific parser
 
  2.  candidates ← CollectNodesAtDepth(T, d_target)
      // Gather all nodes at target depth d_target
 
  3.  chunks ← []
 
  4.  FOR EACH node v IN candidates:
        text ← Serialize(v)           // Flatten node subtree to text
        tokens ← Tokenize(text)
 
        IF |tokens| ≤ τ_max:
            breadcrumb ← AncestorPath(v, T)
            // e.g., "Chapter 3 > Section 3.2 > Subsection 3.2.1"
 
            chunk ← {
                content:        text,
                token_count:    |tokens|,
                section_path:   breadcrumb,
                depth:          Depth(v),
                position_index: GlobalIndex(v, T),
                source_doc:     DocumentID(D),
                chunk_type:     "structural",
                parent_id:      ParentID(v, T),
                children_ids:   ChildrenIDs(v, T)
            }
            APPEND chunk TO chunks
 
        ELSE:
            // Node exceeds τ_max: recurse to finer depth
            sub_chunks ← StructuralChunk(
                Serialize(v), format, τ_max, d_target + 1
            )
            FOR EACH sc IN sub_chunks:
                sc.parent_id ← NodeID(v)
                APPEND sc TO chunks
 
  5.  // Handle leaf nodes with no structural children that still exceed τ_max
      FOR EACH chunk c IN chunks WHERE c.token_count > τ_max:
          splits ← ParagraphSplit(c.content, τ_max)
          REPLACE c IN chunks WITH splits
          // Paragraph-level fallback preserves sentence boundaries
 
  6.  RETURN chunks
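For documents with reliable Markdown markup, steps 1–4 of the algorithm collapse into a short implementation. The sketch below is illustrative only: it treats heading level as tree depth, builds the breadcrumb path as it scans, and omits the τ_max recursion and paragraph fallback of steps 4–5.

```python
import re

def markdown_structural_chunks(doc: str, depth: int = 2) -> list[dict]:
    """Split a Markdown document at headings of level <= depth,
    attaching the ancestor breadcrumb as metadata."""
    chunks, path, buf = [], [], []

    def flush():
        # Emit the buffered section, stamped with the current breadcrumb.
        if buf and "".join(buf).strip():
            chunks.append({
                "content": "".join(buf).strip(),
                "section_path": " > ".join(path),
                "chunk_type": "structural",
            })
        buf.clear()

    for line in doc.splitlines(keepends=True):
        m = re.match(r"^(#{1,6})\s+(.*)", line)
        if m and len(m.group(1)) <= depth:
            flush()
            level = len(m.group(1))
            del path[level - 1:]          # pop deeper-or-equal ancestors
            path.append(m.group(2).strip())
        buf.append(line)
    flush()
    return chunks
```

A document with headings "# A", "## B", "## C" yields three chunks whose section_path values read "A", "A > B", and "A > C".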

9.2.4 Paragraph-Level Splitting as Fallback#

When structural depth is exhausted (leaf nodes exceed \tau_{\max}), the fallback is paragraph-level splitting. A paragraph boundary is detected by:

\text{IsParagraphBreak}(i) = \begin{cases} 1 & \text{if } D[i] = \texttt{\\n\\n} \text{ or } D[i] \in \{\texttt{<p>}, \texttt{<br/>}\} \\ 0 & \text{otherwise} \end{cases}

This preserves sentence coherence while respecting token limits.

9.2.5 Trade-Off Analysis#

| Strength | Limitation |
|---|---|
| Preserves author-intended organization | Requires well-structured source documents |
| Breadcrumb metadata enables hierarchical retrieval | PDF and legacy formats have noisy structural signals |
| Chunk boundaries align with human reading units | Information density varies across sections; some chunks may be too sparse or too dense |
| Low computational cost | Cannot detect topical shifts within a single section |

9.2.6 When to Apply#

Structural chunking is the default first-pass strategy for any document with reliable markup. It should be combined with semantic chunking (§9.3) when sections contain multiple distinct topics, and with hierarchical chunking (§9.4) when parent-child retrieval is required.


9.3 Semantic Chunking: Topic Segmentation, Embedding Similarity Boundaries, Coherence Scoring#

9.3.1 Motivation#

Structural chunking fails when a document lacks reliable markup, or when a single structural section contains multiple distinct topics. Semantic chunking addresses this by detecting topical boundaries within text using distributional similarity measures.

9.3.2 Core Principle: Embedding Breakpoint Detection#

Given a document D segmented into an ordered sequence of sentences \{s_1, s_2, \ldots, s_N\}, each sentence is embedded into a vector space:

\mathbf{e}_i = \text{Embed}(s_i) \in \mathbb{R}^d

The inter-sentence similarity between consecutive sentences is:

\text{sim}(i, i+1) = \frac{\mathbf{e}_i \cdot \mathbf{e}_{i+1}}{\|\mathbf{e}_i\| \; \|\mathbf{e}_{i+1}\|}

A breakpoint is detected at position i when the similarity drops below a threshold \theta or exhibits a statistically significant local minimum:

\text{IsBreakpoint}(i) = \begin{cases} 1 & \text{if } \text{sim}(i, i+1) < \theta \\ 1 & \text{if } \text{sim}(i, i+1) < \mu_w - \alpha \cdot \sigma_w \\ 0 & \text{otherwise} \end{cases}

where \mu_w and \sigma_w are the mean and standard deviation of similarities within a sliding window of size w, and \alpha is a sensitivity parameter (typically \alpha \in [1.0, 2.0]).

9.3.3 Windowed Similarity with Smoothing#

To reduce noise from short sentences, compute a windowed embedding by averaging consecutive sentence embeddings:

\bar{\mathbf{e}}_i^{(w)} = \frac{1}{w} \sum_{j=i}^{i+w-1} \mathbf{e}_j

The smoothed similarity curve is:

\text{sim}_w(i) = \cos\left(\bar{\mathbf{e}}_i^{(w)}, \; \bar{\mathbf{e}}_{i+w}^{(w)}\right)

Breakpoints are then detected as local minima of \text{sim}_w(i) below the adaptive threshold.

9.3.4 Topic Coherence Scoring#

After chunking, each chunk c_j consisting of sentences \{s_{a_j}, \ldots, s_{b_j}\} is scored for internal coherence:

\text{Coherence}(c_j) = \frac{1}{\binom{|c_j|}{2}} \sum_{p < q} \cos(\mathbf{e}_p, \mathbf{e}_q)

A high-quality chunk exhibits \text{Coherence}(c_j) \geq \gamma for a domain-calibrated threshold \gamma.

The inter-chunk distinctiveness between adjacent chunks validates that breakpoints are meaningful:

\text{Distinctiveness}(c_j, c_{j+1}) = 1 - \cos\left(\frac{1}{|c_j|}\sum_{i \in c_j} \mathbf{e}_i, \; \frac{1}{|c_{j+1}|}\sum_{i \in c_{j+1}} \mathbf{e}_i\right)

Effective semantic chunking maximizes both intra-chunk coherence and inter-chunk distinctiveness.

9.3.5 Formal Objective#

\max_{\mathcal{B}} \; \sum_{j=1}^{|\mathcal{B}|+1} \text{Coherence}(c_j) + \beta \sum_{j=1}^{|\mathcal{B}|} \text{Distinctiveness}(c_j, c_{j+1})

subject to:

|c_j| \leq \tau_{\max} \quad \forall j, \qquad |c_j| \geq \tau_{\min} \quad \forall j

where \mathcal{B} is the set of breakpoint positions, \tau_{\min} prevents trivially small chunks, and \beta balances coherence against distinctiveness.

9.3.6 Pseudo-Algorithm: Semantic Chunking via Embedding Breakpoints#

ALGORITHM: SemanticChunk(D, embed_fn, τ_max, τ_min, α, w)
────────────────────────────────────────────────────────────
INPUT:
  D           — document text
  embed_fn    — sentence embedding function: string → ℝ^d
  τ_max       — maximum chunk token size
  τ_min       — minimum chunk token size
  α           — breakpoint sensitivity (standard deviations below mean)
  w           — smoothing window size (number of sentences)
 
OUTPUT:
  chunks[]    — list of semantically coherent chunks with metadata
 
PROCEDURE:
  1.  sentences[] ← SentenceTokenize(D)
      N ← |sentences|
 
  2.  embeddings[] ← [embed_fn(s) FOR s IN sentences]
 
  3.  // Compute windowed similarity curve
      sim_curve[] ← []
      FOR i ← 0 TO N - 2w:
          left_avg  ← Mean(embeddings[i : i + w])
          right_avg ← Mean(embeddings[i + w : i + 2w])
          sim_curve[i] ← CosineSimilarity(left_avg, right_avg)
 
  4.  // Compute adaptive threshold
      μ ← Mean(sim_curve)
      σ ← StdDev(sim_curve)
      threshold ← μ - α · σ
 
  5.  // Detect breakpoints as local minima below threshold
      breakpoints ← []
      FOR i ← 1 TO |sim_curve| - 1:
          IF sim_curve[i] < threshold
             AND sim_curve[i] < sim_curve[i-1]
             AND sim_curve[i] < sim_curve[i+1]:
              candidate_pos ← i + w   // Center of the window gap
              APPEND candidate_pos TO breakpoints
 
  6.  // Form chunks from breakpoints
      boundaries ← [0] + breakpoints + [N]
      raw_chunks ← []
      FOR j ← 0 TO |boundaries| - 2:
          text ← JoinSentences(sentences[boundaries[j] : boundaries[j+1]])
          tokens ← Tokenize(text)
          APPEND {content: text, token_count: |tokens|} TO raw_chunks
 
  7.  // Enforce size constraints via merge and split
      chunks ← []
      buffer ← EMPTY
      FOR EACH rc IN raw_chunks:
          IF buffer.token_count + rc.token_count ≤ τ_max:
              buffer ← Merge(buffer, rc)
          ELSE:
              IF buffer.token_count ≥ τ_min:
                  APPEND buffer TO chunks
              ELSE:
                  // Merge small buffer into previous chunk if possible
                  IF chunks IS NOT EMPTY:
                      chunks[-1] ← Merge(chunks[-1], buffer)
                  ELSE:
                      APPEND buffer TO chunks
              buffer ← rc
      IF buffer IS NOT EMPTY:
          APPEND buffer TO chunks
 
  8.  // Split any remaining oversized chunks at sentence boundaries
      final_chunks ← []
      FOR EACH c IN chunks:
          IF c.token_count > τ_max:
              splits ← SentenceBoundarySplit(c.content, τ_max)
              EXTEND final_chunks WITH splits
          ELSE:
              APPEND c TO final_chunks
 
  9.  // Score coherence for each chunk
      FOR EACH c IN final_chunks:
          sent_embeds ← [embed_fn(s) FOR s IN SentenceTokenize(c.content)]
          c.coherence_score ← MeanPairwiseCosine(sent_embeds)
          c.chunk_type ← "semantic"
 
  10. RETURN final_chunks
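Steps 3–5 (windowed similarity curve, adaptive threshold, local-minimum breakpoints) can be sketched compactly. The example below is a pure-Python sketch that takes precomputed sentence embeddings; any embed_fn from the algorithm above would supply them in practice.

```python
from math import sqrt

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def mean_vec(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def detect_breakpoints(embeddings, w=2, alpha=1.0):
    """Windowed similarity curve + adaptive threshold + local minima
    (steps 3-5). Returns sentence indices where new chunks begin."""
    n = len(embeddings)
    curve = [
        cosine(mean_vec(embeddings[i:i + w]), mean_vec(embeddings[i + w:i + 2 * w]))
        for i in range(n - 2 * w + 1)
    ]
    mu = sum(curve) / len(curve)
    sigma = sqrt(sum((x - mu) ** 2 for x in curve) / len(curve))
    threshold = mu - alpha * sigma
    return [
        i + w
        for i in range(1, len(curve) - 1)
        if curve[i] < threshold and curve[i] < curve[i - 1] and curve[i] < curve[i + 1]
    ]
```

Six toy embeddings with a topic shift after the third sentence produce exactly one breakpoint at index 3, matching the intuition that the similarity valley sits at the window gap straddling the shift.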

9.3.7 Advanced Variant: TextTiling and BayesSeg#

Beyond cosine-breakpoint detection, two classical algorithms bear mention:

  • TextTiling (Hearst, 1997): Computes block similarity using term-frequency vectors across sliding windows, detecting topic boundaries at similarity valleys. It operates in O(N \cdot w) with no embedding model dependency.

  • BayesSeg (Eisenstein & Barzilay, 2008): Formulates topic segmentation as Bayesian inference over segment boundaries using a generative language model. It optimizes a posterior probability of segmentation given observed word distributions:

P(\mathcal{B} \mid D) \propto P(D \mid \mathcal{B}) \cdot P(\mathcal{B})

These may be used as baselines or fused with embedding-based approaches in an ensemble.

9.3.8 Trade-Off Analysis#

| Strength | Limitation |
|---|---|
| Detects topical shifts independent of formatting | Requires embedding model inference per sentence (O(N) calls) |
| Works on unstructured text, transcripts, raw prose | Sensitive to embedding model quality and domain alignment |
| Adaptive threshold accommodates varying document styles | Smoothing window w must be tuned per corpus |
| Produces coherent, self-contained retrieval units | Cannot capture hierarchical structure (flat segmentation) |

9.4 Hierarchical Chunking: Parent-Child Relationships, Summary-Detail Layering, Recursive Decomposition#

9.4.1 Motivation and Architecture#

Many retrieval tasks require chunks at multiple granularity levels simultaneously. A user query about a broad concept should retrieve a summary-level chunk; a follow-up query about a specific detail should retrieve the fine-grained child chunk. Hierarchical chunking constructs a chunk tree \mathcal{T}_c where each node is a chunk and edges encode parent-child containment.

9.4.2 Formal Definition#

Let D be a document. A hierarchical chunking of D at L levels produces a tree:

\mathcal{T}_c = (C, E_c)

where:

  • C = C^{(1)} \cup C^{(2)} \cup \ldots \cup C^{(L)} is the set of chunks across levels
  • C^{(l)} is the set of chunks at level l (level 1 = coarsest, level L = finest)
  • E_c = \{(c_p, c_d) \mid c_p \in C^{(l)}, c_d \in C^{(l+1)}, \text{content}(c_d) \subseteq \text{content}(c_p)\}

Each parent chunk c_p^{(l)} satisfies:

\text{content}(c_p^{(l)}) = \bigoplus_{c_d \in \text{children}(c_p)} \text{content}(c_d)

where \bigoplus denotes ordered concatenation, and an optional summary layer stores:

\text{summary}(c_p^{(l)}) = \text{Summarize}\left(\text{content}(c_p^{(l)})\right)

9.4.3 Retrieval with Hierarchical Chunks#

At query time, the retrieval engine can operate in multiple modes:

  1. Leaf retrieval: Retrieve fine-grained chunks from C^{(L)} for precise evidence extraction
  2. Parent expansion: Upon retrieving a leaf chunk c_d, expand to its parent c_p for surrounding context
  3. Summary-first routing: Retrieve from summary-level C^{(1)}, then drill into children for detail
  4. Multi-level fusion: Retrieve from all levels, deduplicate, and rank by relevance-at-granularity

The parent expansion strategy is formalized as:

\mathcal{C}_{\text{expanded}}(q) = \left\{ \text{parent}(c) \mid c \in \text{TopK}\left(C^{(L)}, q, k\right) \right\}

This provides the model with surrounding context that a leaf chunk alone cannot supply.
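A minimal sketch of the parent-expansion mode follows; the score_fn argument stands in for the vector index, and all field names here are illustrative rather than prescribed by the chapter.

```python
def parent_expansion(query_vec, leaves, parents, score_fn, k=2):
    """Rank leaf chunks by score_fn, take the top k, and return their
    parent chunks (deduplicated, first-retrieval order preserved)."""
    ranked = sorted(leaves, key=lambda c: score_fn(query_vec, c["vec"]), reverse=True)
    seen, expanded = set(), []
    for leaf in ranked[:k]:
        pid = leaf["parent_id"]
        if pid not in seen:
            seen.add(pid)
            expanded.append(parents[pid])
    return expanded
```

Note the deduplication: when two top-k leaves share a parent, the parent is injected once, which conserves the context budget B_context.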

9.4.4 Pseudo-Algorithm: Hierarchical Chunking with Summary-Detail Layering#

ALGORITHM: HierarchicalChunk(D, levels, τ_sizes[], summarize_fn)
──────────────────────────────────────────────────────────────────
INPUT:
  D             — document text
  levels        — number of hierarchical levels (e.g., 3)
  τ_sizes[]     — max chunk token size per level: [τ_1, τ_2, ..., τ_L]
                   where τ_1 > τ_2 > ... > τ_L
  summarize_fn  — function to generate summaries (LLM or extractive)
 
OUTPUT:
  chunk_tree    — tree of chunks with parent-child edges and summaries
 
PROCEDURE:
  1.  // Level 1: Coarsest segmentation
      level_1_chunks ← StructuralChunk(D, format, τ_sizes[1], d_target=1)
      // Or SemanticChunk if no structure available
 
  2.  chunk_tree ← InitializeTree()
      FOR EACH c1 IN level_1_chunks:
          node_1 ← CreateNode(level=1, content=c1.content, metadata=c1)
          AddToTree(chunk_tree, node_1, parent=ROOT)
 
  3.  // Levels 2 through L: Recursive refinement
      FOR l ← 2 TO levels:
          parent_nodes ← GetNodesAtLevel(chunk_tree, l - 1)
          FOR EACH parent IN parent_nodes:
              IF TokenCount(parent.content) > τ_sizes[l]:
                  children ← SemanticChunk(
                      parent.content, embed_fn, τ_sizes[l], τ_min=50,
                      α=1.5, w=3
                  )
                  FOR EACH child IN children:
                      node_child ← CreateNode(
                          level=l, content=child.content, metadata=child
                      )
                      AddToTree(chunk_tree, node_child, parent=parent)
              ELSE:
                  // Parent is already small enough; promote as leaf
                  MarkAsLeaf(parent)
 
  4.  // Generate summaries for non-leaf nodes (bottom-up)
      FOR l ← levels - 1 DOWNTO 1:
          nodes_at_l ← GetNodesAtLevel(chunk_tree, l)
          FOR EACH node IN nodes_at_l:
              children_text ← Concatenate([c.content FOR c IN Children(node)])
              node.summary ← summarize_fn(children_text)
              node.summary_tokens ← TokenCount(node.summary)
 
  5.  // Assign globally unique chunk IDs and parent pointers
      AssignIDs(chunk_tree)
      FOR EACH node IN AllNodes(chunk_tree):
          node.metadata.parent_id ← ParentID(node)
          node.metadata.children_ids ← [ChildID(c) FOR c IN Children(node)]
          node.metadata.level ← node.level
 
  6.  RETURN chunk_tree

9.4.5 Token Budget Considerations#

For a document of L_{\text{doc}} tokens with L hierarchical levels, the total storage cost is:

\text{TotalTokens} = \sum_{l=1}^{L} \sum_{c \in C^{(l)}} |c| \approx L_{\text{doc}} \cdot L + \sum_{l=1}^{L-1} |C^{(l)}| \cdot \bar{s}_l

where \bar{s}_l is the average summary length at level l. This is a storage multiplication factor of approximately L + \epsilon (where \epsilon accounts for summaries), which is acceptable given that it enables multi-granularity retrieval. The embedding and indexing cost scales linearly with the total number of chunk nodes: |C| = \sum_{l=1}^{L} |C^{(l)}|.
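As a worked instance of the storage formula (the chunk counts and summary length below are illustrative assumptions, not benchmarks):

```python
def hierarchy_storage_tokens(doc_tokens, chunk_counts, avg_summary_len):
    """TotalTokens ~= L_doc * levels + summary tokens over non-leaf
    levels; chunk_counts is listed coarsest level first."""
    levels = len(chunk_counts)
    content = doc_tokens * levels                      # each level re-stores the text
    summaries = sum(chunk_counts[l] * avg_summary_len  # summaries for levels 1..L-1
                    for l in range(levels - 1))
    return content + summaries

# A 100k-token document, three levels (10 / 80 / 640 chunks),
# 120-token summaries per non-leaf node:
total = hierarchy_storage_tokens(100_000, [10, 80, 640], 120)  # 310_800 tokens
```

The content term dominates, confirming the factor-of-L characterization above.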

9.4.6 Trade-Off Analysis#

| Strength | Limitation |
|---|---|
| Enables multi-granularity retrieval with a single index | Storage and embedding cost scale linearly with level count |
| Parent expansion provides surrounding context automatically | Summarization introduces potential information loss |
| Summary layers enable high-recall coarse search | Requires careful calibration of \tau_{\text{sizes}} per level |
| Natural fit for documents with inherent hierarchy | Flat documents (e.g., chat logs) gain limited benefit |

9.5 Agentic Chunking: LLM-Guided Proposition Extraction, Claim Decomposition, and Fact Isolation#

9.5.1 Definition#

Agentic chunking uses a language model to decompose documents into atomic propositions — self-contained, independently verifiable statements that each express exactly one factual claim. Unlike structural or semantic chunking, which operate on surface features, agentic chunking operates on meaning, producing retrieval units optimized for factual precision and downstream synthesis.

9.5.2 Proposition as the Atomic Retrieval Unit#

A proposition p satisfies three properties:

  1. Atomicity: p expresses exactly one claim or fact
  2. Self-containedness: p is interpretable without reference to surrounding text (all pronouns resolved, all abbreviations expanded)
  3. Verifiability: p can be independently confirmed or refuted against an authoritative source

Example transformation:

Source: "Founded in 2004, the company grew to 500 employees by 2020 and was acquired by Acme Corp for $2B."

Propositions:

  • "The company was founded in 2004."
  • "The company had 500 employees by 2020."
  • "Acme Corp acquired the company."
  • "The acquisition price was $2 billion."

9.5.3 Formal Definition#

Given a document D, agentic chunking produces a set of propositions:

\mathcal{P}(D) = \text{LLM}_{\text{extract}}(D) = \{p_1, p_2, \ldots, p_m\}

where each p_i is a decontextualized atomic statement with provenance:

p_i = (\text{claim}_i, \text{source\_span}_i, \text{doc\_id}, \text{section\_path}_i, \text{entities}_i)

The proposition count m typically satisfies m \gg n where n is the structural chunk count, since a single paragraph may yield 5–15 propositions.

9.5.4 Pseudo-Algorithm: Agentic Proposition Extraction#

ALGORITHM: AgenticChunk(D, llm_fn, τ_max_input, batch_size)
─────────────────────────────────────────────────────────────
INPUT:
  D             — document text
  llm_fn        — language model function for proposition extraction
  τ_max_input   — max tokens per LLM extraction call
  batch_size    — number of paragraphs per batch
 
OUTPUT:
  propositions[] — list of atomic, self-contained propositions with provenance
 
PROCEDURE:
  1.  // Pre-segment document into manageable passages
      passages ← StructuralChunk(D, format, τ_max_input, d_target=MAX_DEPTH)
      // Alternatively: ParagraphSplit(D, τ_max_input)
 
  2.  propositions ← []
 
  3.  FOR EACH batch IN BatchIterator(passages, batch_size):
 
        prompt ← CompileExtractionPrompt(batch)
        // Prompt instructs:
        //   "Decompose the following text into atomic, self-contained
        //    propositions. Each proposition must:
        //    - Express exactly one factual claim
        //    - Resolve all pronouns and references to explicit entities
        //    - Be independently interpretable without surrounding text
        //    - Preserve numerical values, dates, and proper nouns exactly
        //    Return as a JSON array of objects with fields:
        //    {claim, source_sentence, entities[], confidence}"
 
        response ← llm_fn(prompt)
        extracted ← ParseJSON(response)
 
        FOR EACH prop IN extracted:
            prop.source_doc ← DocumentID(D)
            prop.source_passage_id ← PassageID(batch)
            prop.chunk_type ← "agentic_proposition"
            prop.token_count ← TokenCount(prop.claim)
 
            // Validate: reject empty, duplicate, or non-atomic propositions
            IF prop.claim IS NOT EMPTY
               AND NOT IsDuplicate(prop, propositions)
               AND TokenCount(prop.claim) ≥ 5:
                APPEND prop TO propositions
 
  4.  // Deduplication pass: semantic near-duplicate removal
      embeddings ← [Embed(p.claim) FOR p IN propositions]
      duplicate_pairs ← FindPairsAboveThreshold(embeddings, θ_dedup=0.95)
      FOR EACH (i, j) IN duplicate_pairs:
          // Keep the proposition with richer entity set
          IF |propositions[i].entities| ≥ |propositions[j].entities|:
              MarkForRemoval(propositions[j])
          ELSE:
              MarkForRemoval(propositions[i])
      propositions ← RemoveMarked(propositions)
 
  5.  // Optional: cluster propositions by topic for grouped retrieval
      clusters ← ClusterByEmbedding(propositions, method="HDBSCAN")
      FOR EACH cluster IN clusters:
          FOR EACH prop IN cluster:
              prop.topic_cluster_id ← cluster.id
 
  6.  RETURN propositions
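Step 4, the semantic near-duplicate pass, can be sketched as follows. The tie-breaking rule (keep the proposition with the richer entity set) matches the algorithm above; the O(m²) pair scan is for clarity only, and a production system would use an approximate-nearest-neighbor index instead.

```python
from math import sqrt

def _cos(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def dedup_propositions(props, embed_fn, threshold=0.95):
    """Drop the near-duplicate with the smaller entity set whenever
    two claims embed above the similarity threshold."""
    vecs = [embed_fn(p["claim"]) for p in props]
    removed = set()
    for i in range(len(props)):
        for j in range(i + 1, len(props)):
            if i in removed or j in removed:
                continue
            if _cos(vecs[i], vecs[j]) >= threshold:
                loser = j if len(props[i]["entities"]) >= len(props[j]["entities"]) else i
                removed.add(loser)
    return [p for idx, p in enumerate(props) if idx not in removed]
```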

9.5.5 Claim Decomposition for Complex Statements#

Complex claims (conjunctions, conditionals, causal chains) require further decomposition. A claim p is non-atomic if:

\text{ClauseCount}(p) > 1 \quad \text{or} \quad \text{contains}(p, \{\text{"and"}, \text{"but"}, \text{"because"}, \text{"if"}, \text{"which"}\})

The LLM is instructed to decompose:

p_{\text{complex}} \rightarrow \{p_1, p_2, \ldots, p_k\} \quad \text{where each } p_i \text{ has ClauseCount} = 1
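A crude lexical pre-filter can run before the LLM call, routing only flagged claims to the decomposition prompt. The connective list mirrors the condition above; the comma heuristic is an illustrative assumption, not a substitute for real clause counting.

```python
CONNECTIVES = {"and", "but", "because", "if", "which"}

def looks_non_atomic(claim: str) -> bool:
    """Flag a claim for LLM decomposition when it contains a clause
    connective or more than one comma-separated segment."""
    words = {w.lower().strip(".,;") for w in claim.split()}
    return bool(words & CONNECTIVES) or claim.count(",") > 1
```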

9.5.6 Cost Analysis#

Agentic chunking is the most expensive strategy in terms of inference cost. For a document of L tokens processed in batches of b tokens:

\text{LLM Calls} = \left\lceil \frac{L}{b} \right\rceil, \qquad \text{Total Input Tokens} \approx L + \left\lceil \frac{L}{b} \right\rceil \cdot |\text{prompt\_template}|

\text{Total Output Tokens} \approx m \cdot \bar{p} \quad \text{where } \bar{p} \text{ is mean proposition length}

For cost optimization: (a) use a smaller, fine-tuned model for extraction; (b) batch aggressively; (c) cache extraction results and only re-extract on source change.
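Plugging illustrative numbers into these formulas (the proposition-yield and proposition-length defaults below are assumptions, not measured rates):

```python
from math import ceil

def extraction_cost(doc_tokens, batch_tokens, prompt_tokens,
                    props_per_kilotoken=60, mean_prop_len=20):
    """Instantiate the call/token formulas for agentic extraction."""
    calls = ceil(doc_tokens / batch_tokens)
    input_tokens = doc_tokens + calls * prompt_tokens
    m = doc_tokens * props_per_kilotoken // 1000   # expected proposition count
    output_tokens = m * mean_prop_len
    return {"calls": calls, "input_tokens": input_tokens,
            "output_tokens": output_tokens}

# A 50k-token document in 4k-token batches with a 300-token prompt:
cost = extraction_cost(50_000, 4_000, 300)
```

For this example the prompt overhead is small relative to the document, but the output tokens (one claim per extracted proposition) dominate the bill, which is why caching extraction results on unchanged sources matters.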

9.5.7 Trade-Off Analysis#

| Strength | Limitation |
|---|---|
| Produces maximally precise retrieval units | High inference cost (O(L) LLM calls) |
| Self-contained propositions eliminate context dependency | LLM extraction may hallucinate propositions not in source |
| Ideal for fact-checking, QA, and claim verification tasks | Not suitable for code, tabular data, or structured content |
| Enables fine-grained attribution and provenance | Over-decomposition can fragment coherent arguments |

9.6 Code Chunking: AST-Based, Function-Level, Class-Level, Dependency-Scope Chunking#

9.6.1 Motivation#

Source code possesses a formal grammar and a well-defined syntactic structure (the Abstract Syntax Tree, AST). Naïve text-based chunking of code produces fragments that split functions mid-body, separate signatures from implementations, and destroy import context. Code chunking must respect the syntactic and semantic boundaries defined by the programming language's grammar.

9.6.2 AST-Based Chunking#

The AST of a source file F is a tree \mathcal{A}(F) = (V_a, E_a) where nodes correspond to syntactic constructs: module, class, function, method, block, statement, expression. Chunking at a target AST depth produces semantically complete code units.

Let \mathcal{N}_{\text{scope}} denote the set of scope-defining node types:

\mathcal{N}_{\text{scope}} = \{\texttt{module}, \texttt{class}, \texttt{function}, \texttt{method}, \texttt{lambda}, \texttt{block}\}

A code chunk is the serialization of a scope-defining node plus its required context:

c_v = \text{Imports}(v) \oplus \text{Signature}(v) \oplus \text{Docstring}(v) \oplus \text{Body}(v)

where \text{Imports}(v) includes only the import statements referenced within v's body.
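For Python sources, the standard-library ast module recovers the scope tree directly. The sketch below emits function-level chunks with name, line span, docstring, and exact source text, omitting the import-resolution step of the full algorithm for brevity:

```python
import ast

def function_chunks(source: str) -> list[dict]:
    """Emit one chunk per function definition via the stdlib AST."""
    tree = ast.parse(source)
    chunks = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            chunks.append({
                "scope_name": node.name,
                "line_start": node.lineno,
                "line_end": node.end_lineno,
                "docstring": ast.get_docstring(node),
                "content": ast.get_source_segment(source, node),
                "chunk_type": "code_function",
            })
    return chunks
```

For polyglot corpora, tree-sitter grammars play the role that ast plays here for Python.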

9.6.3 Pseudo-Algorithm: AST-Based Code Chunking#

ALGORITHM: CodeChunk(F, language, τ_max, granularity)
─────────────────────────────────────────────────────
INPUT:
  F           — source file content
  language    — programming language (python, java, typescript, ...)
  τ_max       — maximum chunk token size
  granularity — target scope level: "function" | "class" | "module"
 
OUTPUT:
  chunks[]    — list of code chunks with AST metadata
 
PROCEDURE:
  1.  AST ← Parse(F, language)
      // Use tree-sitter, libcst, or language-specific parser
 
  2.  imports ← ExtractImports(AST)
      // Global import statements for the file
 
  3.  scope_nodes ← CollectNodes(AST, type ∈ N_scope, granularity)
      // If granularity = "function": collect all function/method nodes
      // If granularity = "class": collect all class nodes
      // If granularity = "module": the entire file is one chunk
 
  4.  chunks ← []
 
  5.  FOR EACH node IN scope_nodes:
        signature   ← ExtractSignature(node)      // e.g., "def foo(x: int) -> str:"
        docstring   ← ExtractDocstring(node)       // First string literal in body
        body        ← ExtractBody(node)            // Full body source
        decorators  ← ExtractDecorators(node)      // @decorator lines
        parent_class← GetEnclosingClass(node, AST) // None if top-level
        
        // Resolve used imports
        used_symbols ← ExtractReferencedSymbols(body)
        relevant_imports ← Filter(imports, symbol ∈ used_symbols)
 
        full_text ← Join([
            relevant_imports,
            decorators,
            signature,
            docstring,
            body
        ])
 
        tokens ← Tokenize(full_text)
 
        IF |tokens| ≤ τ_max:
            chunk ← {
                content:         full_text,
                token_count:     |tokens|,
                chunk_type:      "code_" + granularity,
                language:        language,
                scope_name:      QualifiedName(node, AST),
                signature:       signature,
                enclosing_class: parent_class,
                file_path:       FilePath(F),
                line_start:      StartLine(node),
                line_end:        EndLine(node),
                dependencies:    used_symbols,
                docstring:       docstring,
                complexity:      CyclomaticComplexity(node)
            }
            APPEND chunk TO chunks
 
        ELSE:
            // Function too large: split into sub-blocks
            IF granularity = "class":
                // Recurse at method level
                method_chunks ← CodeChunk(
                    Serialize(node), language, τ_max, "function"
                )
                FOR EACH mc IN method_chunks:
                    mc.enclosing_class ← QualifiedName(node, AST)
                    APPEND mc TO chunks
            ELSE:
                // Split large function at logical block boundaries
                blocks ← ExtractBlocks(node)
                // blocks: loops, conditionals, try-except, sequential blocks
                FOR EACH block IN blocks:
                    block_text ← Join([relevant_imports, signature + "...", Serialize(block)])
                    block_chunk ← CreateChunk(block_text, metadata={
                        scope_name: QualifiedName(node) + "::block_" + BlockIndex(block),
                        ...
                    })
                    APPEND block_chunk TO chunks
 
  6.  // Add file-level context chunk (always)
      file_summary ← {
          content:     Join([imports, ExtractClassSignatures(AST), ExtractFunctionSignatures(AST)]),
          chunk_type:  "code_file_summary",
          file_path:   FilePath(F),
          token_count: TokenCount(above)
      }
      APPEND file_summary TO chunks
 
  7.  RETURN chunks

9.6.4 Dependency-Scope Chunking#

For queries that require understanding the context of a function's callees or callers, dependency-scope chunking constructs chunks that include transitively referenced code:

c_v^{\text{dep}} = c_v \oplus \bigoplus_{u \in \text{Callees}(v, \text{depth}=1)} \text{Signature}(u)

This provides the retrieval engine with enough context to answer questions about a function's behavior without retrieving the entire codebase.
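A depth-1 version of this augmentation can be sketched with the same stdlib `ast` approach (the helper name `callee_signatures` and the sample source are illustrative; only direct calls to same-file functions are resolved):

```python
import ast

def callee_signatures(source: str, func_name: str) -> list[str]:
    """Signatures of functions directly called by func_name (depth 1)."""
    tree = ast.parse(source)
    funcs = {n.name: n for n in tree.body if isinstance(n, ast.FunctionDef)}
    target = funcs[func_name]
    # Collect names called directly inside the target function's body
    called = {node.func.id for node in ast.walk(target)
              if isinstance(node, ast.Call) and isinstance(node.func, ast.Name)}
    sigs = []
    for name in sorted(called & funcs.keys()):
        # First line of the callee's source = its signature only
        sigs.append(ast.get_source_segment(source, funcs[name]).splitlines()[0])
    return sigs

src = '''def helper(x):
    return x + 1

def main(y):
    return helper(y) * 2
'''
sigs = callee_signatures(src, "main")
```

Appending these signature lines to the caller's chunk gives the retriever the callee interfaces without pulling in their full bodies.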

9.6.5 Language-Specific Considerations#

| Language | Parser | Scope Nodes | Special Handling |
|---|---|---|---|
| Python | tree-sitter-python, libcst | `function_definition`, `class_definition` | Decorators, type hints, `__init__` grouping |
| Java | tree-sitter-java, JavaParser | `method_declaration`, `class_declaration` | Interface/abstract methods, annotations |
| TypeScript | tree-sitter-typescript | `function_declaration`, `class_declaration`, `arrow_function` | JSX components, type declarations |
| Go | tree-sitter-go | `function_declaration`, `method_declaration` | Struct methods, interface satisfaction |
| Rust | tree-sitter-rust | `function_item`, `impl_item` | Trait implementations, lifetime annotations |

9.6.6 Trade-Off Analysis#

| Strength | Limitation |
|---|---|
| Respects language grammar; never splits mid-function | Requires a working parser per language |
| Includes import context for self-containedness | Very large functions or classes may still exceed τ_max |
| AST metadata enables precise scope-based retrieval | Dynamic languages (Python, JS) may have incomplete ASTs for metaprogramming |
| Cyclomatic complexity metadata aids relevance ranking | Cross-file dependencies require call-graph analysis |

9.7 Tabular and Structured Data Chunking: Row-Group, Schema-Preserving, Pivot-Aware Strategies#

9.7.1 The Challenge#

Tabular data (CSV, Excel, SQL tables, JSON arrays) encodes information in a two-dimensional schema. Chunking rows without schema context produces meaningless text; chunking entire tables may exceed τ_max. The chunking strategy must preserve the schema-value binding so that each chunk is independently interpretable.

9.7.2 Schema-Preserving Row-Group Chunking#

Given a table T with schema S = (col_1, col_2, …, col_k) and N rows, partition into row groups of size g:

c_j = S \oplus \text{Rows}[j \cdot g : (j+1) \cdot g], \quad j \in \{0, \ldots, \lceil N/g \rceil - 1\}

Every chunk begins with the schema header S, ensuring self-containedness. The group size g is determined by:

g = \left\lfloor \frac{\tau_{\max} - |S|_{\text{tokens}}}{\bar{r}} \right\rfloor

where r̄ is the average row token count.
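A minimal sketch of this computation, approximating token counts by whitespace word counts (a real system would use the embedding model's tokenizer; the function name and sample rows are illustrative):

```python
def row_group_chunks(schema: list[str], rows: list[list[str]], tau_max: int) -> list[str]:
    """Schema-preserving row-group chunking: every chunk starts with the header."""
    header = " | ".join(schema)
    schema_tokens = len(header.split())          # whitespace-token proxy
    avg_row = sum(len(" | ".join(r).split()) for r in rows) / len(rows)
    g = max(1, int((tau_max - schema_tokens) // avg_row))  # rows per chunk
    chunks = []
    for j in range(0, len(rows), g):
        body = "\n".join(" | ".join(r) for r in rows[j:j + g])
        chunks.append(header + "\n" + body)      # schema ⊕ row group
    return chunks

rows = [["acme", "5.2M", "Q3"], ["beta", "1.1M", "Q3"], ["gamma", "0.4M", "Q2"]]
chunks = row_group_chunks(["company", "revenue", "quarter"], rows, tau_max=12)
```

With a tight τ_max the group size collapses to g = 1, but every chunk still carries the full schema header and remains independently interpretable.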

9.7.3 Pivot-Aware Chunking#

When a table has a key column (e.g., entity name, date) that groups logically related rows, chunking should respect key boundaries:

c_j = S \oplus \{r \in T \mid r[\text{key\_col}] = k_j\}

This produces one chunk per unique key value (or per key-value group if the key has high cardinality), preserving the semantic unit of "all data about entity k_j."

9.7.4 Serialization Formats#

The way a table chunk is serialized affects embedding quality and retrieval precision:

| Format | Example | Pros | Cons |
|---|---|---|---|
| Markdown table | `\| col1 \| col2 \| ... \|` | Readable, familiar to LLMs | Verbose for wide tables |
| Row-as-sentence | "The revenue for Q3 2023 was $5.2M with margin 12%." | High embedding quality | Requires NL template per schema |
| JSON records | `[{"col1": "v1", "col2": "v2"}]` | Structured, parseable | Token-heavy due to key repetition |
| Key-value pairs | `col1: v1, col2: v2` | Compact | Ambiguous without schema context |

For optimal retrieval, row-as-sentence serialization with natural-language templates is preferred because it produces text that aligns with the embedding model's training distribution.

9.7.5 Pseudo-Algorithm: Schema-Preserving Tabular Chunking#

ALGORITHM: TabularChunk(T, schema, τ_max, key_col, serialize_mode)
────────────────────────────────────────────────────────────────────
INPUT:
  T               — table data (list of row dictionaries)
  schema          — column definitions with types and descriptions
  τ_max           — maximum chunk token size
  key_col         — optional grouping key column (may be NULL)
  serialize_mode  — "markdown" | "row_sentence" | "json"
 
OUTPUT:
  chunks[]        — list of schema-preserving table chunks
 
PROCEDURE:
  1.  schema_header ← SerializeSchema(schema, serialize_mode)
      schema_tokens ← TokenCount(schema_header)
 
  2.  IF key_col IS NOT NULL:
          // Pivot-aware grouping
          groups ← GroupBy(T, key_col)
      ELSE:
          // Sequential row grouping
          avg_row_tokens ← Mean([TokenCount(SerializeRow(r, schema, serialize_mode)) FOR r IN T])
          g ← Floor((τ_max - schema_tokens) / avg_row_tokens)
          g ← Max(g, 1)
          groups ← SequentialBatch(T, batch_size=g)
 
  3.  chunks ← []
      FOR EACH group IN groups:
          serialized_rows ← [SerializeRow(r, schema, serialize_mode) FOR r IN group]
          content ← schema_header + "\n" + Join(serialized_rows, "\n")
          tokens ← TokenCount(content)
 
          IF tokens ≤ τ_max:
              chunk ← {
                  content:       content,
                  token_count:   tokens,
                  chunk_type:    "tabular",
                  schema:        schema,
                  row_count:     |group|,
                  key_value:     group[0][key_col] IF key_col ELSE NULL,
                  row_range:     (FirstRowIndex(group), LastRowIndex(group)),
                  source_table:  TableID(T)
              }
              APPEND chunk TO chunks
          ELSE:
              // Group exceeds τ_max: split into sub-groups
              sub_groups ← SequentialBatch(group, batch_size=Floor(g/2))
              RECURSE on each sub_group
 
  4.  // Generate table summary chunk
      summary ← {
          content:     "Table: " + TableName(T) + ". Schema: " + schema_header
                       + ". Row count: " + |T|
                       + ". Key statistics: " + ComputeColumnStats(T, schema),
          chunk_type:  "tabular_summary",
          token_count: TokenCount(above),
          source_table: TableID(T)
      }
      APPEND summary TO chunks
 
  5.  RETURN chunks

9.7.6 Row-as-Sentence Template#

For the row-as-sentence serialization, a template is constructed from the schema:

\text{Template}(S) = \text{"The [entity\_col] has [col2] of [val2], [col3] of [val3], ..."}

Example: For schema (company, revenue, quarter, margin):

\text{SerializeRow}(r) = \text{"The revenue of [company] in [quarter] was [revenue] with a margin of [margin]."}

This produces text that embeds well and retrieves accurately against natural-language queries.
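In code, the template instantiation is plain string formatting (the template text mirrors the example above; the helper name is illustrative):

```python
def serialize_row(row: dict, template: str) -> str:
    """Render one table row as a natural-language sentence."""
    return template.format(**row)

# Per-schema template for (company, revenue, quarter, margin)
template = ("The revenue of {company} in {quarter} was {revenue} "
            "with a margin of {margin}.")
sentence = serialize_row(
    {"company": "Acme", "revenue": "$5.2M", "quarter": "Q3 2023", "margin": "12%"},
    template,
)
```

The resulting sentence is what gets embedded and indexed, rather than the raw delimited row.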


9.8 Multi-Modal Chunking: Image-Caption Pairing, Video Segment Annotation, Audio Transcript Alignment#

9.8.1 Scope#

Multi-modal documents (PDFs with figures, web pages with images, video tutorials, podcast transcripts) require chunking strategies that preserve the alignment between modalities. A figure without its caption is uninterpretable; a transcript segment without its timestamp is unlocalizable.

9.8.2 Image-Caption Pairing#

For documents containing images (figures, diagrams, charts), each image is chunked as a paired unit:

c_{\text{img}} = (\text{image\_reference}, \text{caption}, \text{alt\_text}, \text{surrounding\_text}, \text{OCR\_text})

where:

  • image_reference: URI or embedding of the image
  • caption: extracted figure caption (e.g., "Figure 3: Architecture diagram")
  • alt_text: accessibility description if available
  • surrounding_text: the k sentences before and after the image in the document
  • OCR_text: text extracted from the image via OCR (for diagrams, charts, screenshots)

The text representation for embedding and retrieval is:

\text{EmbeddableText}(c_{\text{img}}) = \text{caption} \oplus \text{alt\_text} \oplus \text{OCR\_text} \oplus \text{surrounding\_text}

Optionally, a vision-language model generates a synthetic description of the image:

\text{description} = \text{VLM}(\text{image\_bytes})

This description is appended to the embeddable text for richer retrieval.

9.8.3 Video Segment Chunking#

Video content is chunked by segmenting the timeline into semantically coherent segments:

c_{\text{video}, j} = (\text{transcript}[t_j : t_{j+1}], \; \text{keyframes}[t_j : t_{j+1}], \; t_j, \; t_{j+1}, \; \text{speaker})

Segment boundaries {t_1, t_2, …} are determined by:

  1. Silence detection: Pauses longer than δ seconds
  2. Speaker diarization: Speaker change points
  3. Transcript semantic segmentation: Apply semantic chunking (§9.3) to the transcript text
  4. Visual scene change detection: Keyframe difference exceeds threshold

9.8.4 Audio Transcript Alignment#

For audio-only content (podcasts, meetings, calls), the chunking pipeline is:

Audio → ASR (speech-to-text) → Diarization → Sentence segmentation
    → Semantic chunking on transcript → Timestamp alignment

Each chunk carries:

c_{\text{audio}, j} = (\text{transcript\_segment}, \; t_{\text{start}}, \; t_{\text{end}}, \; \text{speaker\_id}, \; \text{confidence})

9.8.5 Pseudo-Algorithm: Multi-Modal Document Chunking#

ALGORITHM: MultiModalChunk(D_mm, modality_extractors, τ_max)
─────────────────────────────────────────────────────────────
INPUT:
  D_mm                  — multi-modal document (PDF, web page, video, ...)
  modality_extractors   — dict of modality → extraction function
  τ_max                 — maximum chunk token size
 
OUTPUT:
  chunks[]              — list of modality-annotated chunks
 
PROCEDURE:
  1.  // Extract modality streams
      text_blocks   ← modality_extractors["text"](D_mm)
      images        ← modality_extractors["image"](D_mm)
      tables        ← modality_extractors["table"](D_mm)
      audio_segs    ← modality_extractors["audio"](D_mm)   // May be empty
      video_segs    ← modality_extractors["video"](D_mm)   // May be empty
 
  2.  chunks ← []
 
  3.  // Text chunking (structural or semantic)
      text_chunks ← StructuralChunk(JoinBlocks(text_blocks), format, τ_max, d_target=2)
      FOR EACH tc IN text_chunks:
          tc.modality ← "text"
          APPEND tc TO chunks
 
  4.  // Image-caption pairing
      FOR EACH img IN images:
          caption         ← ExtractCaption(img, D_mm)
          alt_text        ← ExtractAltText(img, D_mm)
          ocr_text        ← OCR(img.bytes)
          surrounding     ← GetSurroundingText(img.position, text_blocks, k=3)
          vlm_description ← VLMDescribe(img.bytes)   // Optional
 
          embeddable ← Join([caption, alt_text, ocr_text, surrounding, vlm_description])
 
          chunk ← {
              content:         embeddable,
              token_count:     TokenCount(embeddable),
              chunk_type:      "image_caption_pair",
              modality:        "image",
              image_ref:       img.uri,
              caption:         caption,
              position_in_doc: img.position,
              source_doc:      DocumentID(D_mm)
          }
          APPEND chunk TO chunks
 
  5.  // Table chunking
      FOR EACH tbl IN tables:
          tbl_chunks ← TabularChunk(tbl.data, tbl.schema, τ_max, key_col=NULL, "markdown")
          FOR EACH tc IN tbl_chunks:
              tc.modality ← "table"
              APPEND tc TO chunks
 
  6.  // Audio/video segment chunking
      FOR EACH seg IN audio_segs + video_segs:
          transcript_chunks ← SemanticChunk(seg.transcript, embed_fn, τ_max, τ_min=30, α=1.5, w=2)
          FOR EACH tc IN transcript_chunks:
              tc.modality     ← seg.type   // "audio" or "video"
            tc.time_start   ← AlignTimestamp(tc, seg, boundary="start")
            tc.time_end     ← AlignTimestamp(tc, seg, boundary="end")
              tc.speaker      ← seg.speaker_id
              tc.keyframe_ref ← ExtractKeyframe(tc.time_start) IF seg.type = "video" ELSE NULL
              APPEND tc TO chunks
 
  7.  RETURN chunks

9.8.6 Cross-Modal Linking#

After chunking, cross-modal links should be established:

  • An image chunk links to the text chunk that references it ("As shown in Figure 3...")
  • A table chunk links to the text chunk that discusses its findings
  • A video segment chunk links to slides visible during that segment

These links are stored as metadata edges and enable the retrieval engine to expand across modalities when a query touches multiple information types.


9.9 Overlap, Stride, and Context Window Strategies for Boundary Coherence#

9.9.1 The Boundary Problem#

Every chunking strategy introduces artificial boundaries. Information that spans a boundary is split across two chunks, and if only one chunk is retrieved, the agent receives an incomplete evidence fragment. Overlap and stride strategies mitigate this by ensuring that boundary-adjacent content appears in multiple chunks.

9.9.2 Formal Definitions#

Given a document D of L tokens, a chunk size s, and an overlap o (where 0 ≤ o < s), the stride is:

\text{stride} = s - o

The i-th chunk is:

c_i = D[i \cdot \text{stride} : i \cdot \text{stride} + s], \quad i \in \left\{0, 1, \ldots, \left\lfloor \frac{L - s}{\text{stride}} \right\rfloor\right\}

The total number of chunks is:

n = \left\lfloor \frac{L - s}{\text{stride}} \right\rfloor + 1 = \left\lfloor \frac{L - s}{s - o} \right\rfloor + 1

The overlap ratio is:

r_o = \frac{o}{s}

The storage expansion factor (ratio of total stored tokens to document length) is:

\text{Expansion} = \frac{n \cdot s}{L} \approx \frac{s}{s - o} = \frac{1}{1 - r_o}

For r_o = 0.2 (20% overlap), the expansion factor is ≈ 1.25; for r_o = 0.5, it is 2.0.
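The definitions above can be exercised directly over a token list (a minimal sketch; the function name is illustrative and a trailing partial window is emitted only when the last full chunk does not reach the document end):

```python
def overlap_chunks(tokens: list[str], s: int, o: int) -> list[list[str]]:
    """Fixed-size chunking with overlap o (stride = s - o)."""
    assert 0 <= o < s
    stride = s - o
    chunks, i = [], 0
    while i < len(tokens):
        chunks.append(tokens[i:i + s])
        if i + s >= len(tokens):   # last chunk reaches the end: stop
            break
        i += stride
    return chunks

tokens = [f"t{i}" for i in range(100)]
chunks = overlap_chunks(tokens, s=20, o=4)          # stride = 16
expansion = sum(len(c) for c in chunks) / len(tokens)
```

For L = 100, s = 20, o = 4 this yields n = ⌊80/16⌋ + 1 = 6 chunks, and the measured expansion (1.2) sits just under the 1/(1 − r_o) = 1.25 bound because the last chunk is not followed by another overlapping window.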

9.9.3 Optimal Overlap Selection#

The optimal overlap trades off between boundary coherence and storage/embedding cost. Define boundary information loss as:

\mathcal{L}_{\text{boundary}}(o) = \sum_{i=0}^{n-2} \mathbb{1}\left[\exists q \in Q_{\text{eval}} : \text{answer}(q) \in D[i \cdot \text{stride} + s - o : i \cdot \text{stride} + s + o] \setminus c_i \right]

This counts the number of evaluation queries whose answer falls in the boundary region and is not fully captured by a single chunk. The objective is:

\min_o \; \mathcal{L}_{\text{boundary}}(o) + \lambda_{\text{cost}} \cdot \frac{1}{1 - o/s}

In practice, overlaps of 10–20% of chunk size are standard for prose documents, while code chunks typically use 0% overlap (since AST boundaries are semantically precise).

9.9.4 Sentence-Aligned Overlap#

Rather than token-level overlap (which may split words), sentence-aligned overlap ensures that the overlap region begins and ends at sentence boundaries:

o_{\text{aligned}} = \sum_{j=1}^{k} |s_j| \quad \text{where } \sum_{j=1}^{k} |s_j| \leq o \text{ and } s_j \text{ are trailing sentences of } c_i

This prevents partial-sentence artifacts in the overlap region.

9.9.5 Context Window Prefix Strategy#

An alternative to overlap is the context window prefix: instead of duplicating content, each chunk is prefixed with a compressed summary of its preceding context:

c_i^{\text{prefixed}} = \text{Summary}(D[0 : i \cdot \text{stride}]) \oplus c_i

This provides positional context without full content duplication, at the cost of summary generation. The summary prefix is typically bounded to p tokens:

|c_i^{\text{prefixed}}| = p + s \leq \tau_{\max}

9.9.6 Decision Matrix#

| Strategy | Boundary Coherence | Storage Cost | Compute Cost | Best For |
|---|---|---|---|---|
| No overlap (o = 0) | Low | 1.0× | Minimal | AST-bounded code, clean structural splits |
| Fixed overlap (10–20%) | Medium | 1.1–1.25× | Low | General prose, documentation |
| High overlap (50%) | High | 2.0× | Moderate | Dense technical text with cross-reference dependencies |
| Sentence-aligned overlap | Medium-High | 1.1–1.3× | Low | Narrative text, legal documents |
| Context window prefix | High | 1.0× + summary cost | High (LLM calls) | Long-form analysis, reports |

9.10 Chunk Metadata Enrichment: Section Title, Document Position, Entity Tags, Summary, Parent Pointer#

9.10.1 Principle#

A chunk without metadata is an anonymous text fragment. In production retrieval systems, metadata is the primary lever for non-semantic filtering, ranking, and provenance tracking. Every chunk must carry a rich, typed metadata envelope that enables the retrieval engine to filter, re-rank, and attribute.

9.10.2 Metadata Schema#

The canonical chunk metadata schema for an agentic retrieval system:

| Field | Type | Source | Purpose |
|---|---|---|---|
| `chunk_id` | UUID | Generated | Unique identifier, idempotent re-chunking |
| `source_doc_id` | UUID | Ingestion pipeline | Provenance: which document |
| `source_doc_title` | string | Document metadata | Human-readable provenance |
| `section_path` | string[] | Structural parser | Breadcrumb: `["Ch3", "Sec3.2", "Sub3.2.1"]` |
| `position_index` | int | Chunking engine | Ordinal position within document (0-indexed) |
| `total_chunks` | int | Chunking engine | Total chunks from this document |
| `chunk_type` | enum | Strategy selector | structural, semantic, agentic, code, tabular, multimodal |
| `token_count` | int | Tokenizer | Exact token count for budget planning |
| `entity_tags` | string[] | NER pipeline | Named entities mentioned: people, orgs, dates, products |
| `summary` | string | LLM or extractive | 1–2 sentence summary of chunk content |
| `parent_chunk_id` | UUID | Hierarchical chunker | Pointer to parent chunk (null if root) |
| `children_chunk_ids` | UUID[] | Hierarchical chunker | Pointers to child chunks |
| `language` | string | Detector | Natural language or programming language |
| `created_at` | timestamp | Ingestion pipeline | Ingestion timestamp |
| `source_version` | string | Version control | Git commit, document revision, API version |
| `freshness_score` | float | Decay function | f(t) = exp(−λ(t_now − t_created)) |
| `authority_score` | float | Source ranking | Confidence in source reliability |
| `modality` | enum | Modality extractor | text, image, table, audio, video |
| `coherence_score` | float | Quality scorer | Intra-chunk semantic coherence (§9.3.4) |

9.10.3 Pseudo-Algorithm: Metadata Enrichment Pipeline#

ALGORITHM: EnrichChunkMetadata(chunks[], ner_fn, summarize_fn, embed_fn)
──────────────────────────────────────────────────────────────────────────
INPUT:
  chunks[]       — raw chunks from any chunking strategy
  ner_fn         — named entity recognition function
  summarize_fn   — summarization function (extractive or LLM)
  embed_fn       — embedding function
 
OUTPUT:
  enriched[]     — chunks with complete metadata
 
PROCEDURE:
  1.  FOR EACH chunk c IN chunks:
 
        // Entity extraction
        c.entity_tags ← ner_fn(c.content)
        // Returns: [{"text": "OpenAI", "type": "ORG"}, ...]
 
  2.  // Batch summarization for efficiency
      unsummarized ← [c FOR c IN chunks WHERE c.summary IS NULL]
      IF |unsummarized| > 0:
          summaries ← BatchSummarize(
              [c.content FOR c IN unsummarized],
              summarize_fn,
              max_summary_tokens=50
          )
          FOR i, c IN Enumerate(unsummarized):
              c.summary ← summaries[i]
 
  3.  // Coherence scoring
      FOR EACH chunk c IN chunks:
          sent_embeds ← [embed_fn(s) FOR s IN SentenceTokenize(c.content)]
          IF |sent_embeds| ≥ 2:
              c.coherence_score ← MeanPairwiseCosine(sent_embeds)
          ELSE:
              c.coherence_score ← 1.0   // Single-sentence chunk is trivially coherent
 
  4.  // Freshness scoring
      λ ← 0.01   // Decay rate (configurable per source type)
      FOR EACH chunk c IN chunks:
          age_days ← (Now() - c.created_at).days
          c.freshness_score ← Exp(-λ · age_days)
 
  5.  // Position normalization
      doc_groups ← GroupBy(chunks, key=source_doc_id)
      FOR EACH doc_id, group IN doc_groups:
          total ← |group|
          FOR i, c IN Enumerate(SortByPosition(group)):
              c.position_index ← i
              c.total_chunks ← total
              c.relative_position ← i / total   // 0.0 = start, 1.0 = end
 
  6.  // Generate chunk IDs (deterministic for idempotency)
      FOR EACH chunk c IN chunks:
          c.chunk_id ← DeterministicUUID(
              c.source_doc_id, c.position_index, c.chunk_type, c.source_version
          )
          // Same content + same version → same ID (enables dedup on re-ingestion)
 
  7.  RETURN chunks
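The DeterministicUUID step in the pipeline above can be realized with UUIDv5, which hashes a name under a fixed namespace so that re-ingesting the same document version reproduces the same IDs (the namespace string and argument layout here are illustrative assumptions):

```python
import uuid

# Assumed project-level namespace; any stable UUID works.
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "agentic-notes/chunks")

def deterministic_chunk_id(source_doc_id: str, position_index: int,
                           chunk_type: str, source_version: str) -> str:
    """Same (doc, position, type, version) tuple → same chunk_id on every run."""
    name = f"{source_doc_id}:{position_index}:{chunk_type}:{source_version}"
    return str(uuid.uuid5(NAMESPACE, name))

a = deterministic_chunk_id("doc-42", 3, "semantic", "v1.0")
b = deterministic_chunk_id("doc-42", 3, "semantic", "v1.0")  # identical to a
c = deterministic_chunk_id("doc-42", 3, "semantic", "v1.1")  # new version → new ID
```

Because the ID is a pure function of provenance fields, re-ingestion deduplicates for free: unchanged chunks collide with their existing IDs, while a version bump produces fresh ones.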

9.10.4 Metadata-Driven Retrieval Filtering#

At query time, metadata enables pre-retrieval filtering that dramatically reduces the candidate set:

\mathcal{C}_{\text{filtered}} = \{c \in \mathcal{C} \mid \text{freshness}(c) \geq f_{\min} \land \text{language}(c) = l_q \land \text{entity\_tags}(c) \cap E_q \neq \varnothing\}

where f_min is the minimum freshness threshold, l_q is the query language, and E_q is the set of entities mentioned in the query. This filtering occurs before embedding similarity computation, reducing both latency and false positives.

9.10.5 Summary as a Retrieval Proxy#

Chunk summaries serve a dual purpose:

  1. Retrieval proxy: Embed and index the summary instead of (or in addition to) the full chunk. Summaries are denser and produce more discriminative embeddings.
  2. Context compression: When injecting chunks into the agent's context window, the summary can replace the full content for lower-priority chunks, conserving token budget.

The dual-embedding strategy indexes both:

\mathbf{v}_c^{\text{full}} = \text{Embed}(c.\text{content}), \qquad \mathbf{v}_c^{\text{summary}} = \text{Embed}(c.\text{summary})

Retrieval computes similarity against both and takes the maximum:

\text{score}(c, q) = \max\left(\cos(\mathbf{v}_c^{\text{full}}, \mathbf{v}_q), \; \cos(\mathbf{v}_c^{\text{summary}}, \mathbf{v}_q)\right)
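A minimal sketch of the dual-embedding max score, using toy 3-dimensional vectors in place of real embeddings (function names are illustrative):

```python
import math

def cos_sim(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def dual_score(v_full: list[float], v_summary: list[float], v_query: list[float]) -> float:
    """Score against both indexed embeddings; keep the better match."""
    return max(cos_sim(v_full, v_query), cos_sim(v_summary, v_query))

v_query = [1.0, 0.0, 0.0]
# Full-content embedding misses the query, but the denser summary embedding hits it.
score = dual_score([0.0, 1.0, 0.0], [0.9, 0.1, 0.0], v_query)
```

The max lets a chunk surface whenever either representation matches, which is exactly the retrieval-proxy benefit described above.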

9.11 Adaptive Chunking: Runtime Chunk Size Adjustment Based on Query Complexity and Token Budget#

9.11.1 Motivation#

Static chunking assumes a fixed optimal chunk size at ingestion time. However, the optimal retrieval granularity depends on the query at runtime:

  • A broad question ("Summarize the company's Q3 performance") benefits from large, summary-level chunks
  • A precise question ("What was the EBITDA margin in Q3 2023?") benefits from small, fact-level chunks
  • A complex multi-hop question ("Compare Q3 margins across all subsidiaries") requires multiple medium-granularity chunks

Adaptive chunking adjusts the effective chunk granularity at query time without re-indexing.

9.11.2 Query Complexity Classification#

Define a query complexity classifier:

\kappa(q) = \text{Classify}(q) \in \{\text{factoid}, \text{analytical}, \text{comparative}, \text{exploratory}, \text{multi-hop}\}

Each complexity class maps to a preferred granularity level:

| Query Class | Preferred Granularity | Chunk Size Range | Strategy |
|---|---|---|---|
| Factoid | Fine (propositions, small chunks) | 50–150 tokens | Retrieve from agentic/leaf-level index |
| Analytical | Medium (paragraphs, sections) | 200–500 tokens | Retrieve from structural/semantic index |
| Comparative | Medium + multi-entity | 200–500 tokens | Retrieve per entity, align by schema |
| Exploratory | Coarse (summaries, full sections) | 500–1500 tokens | Retrieve from summary/parent-level index |
| Multi-hop | Mixed (fine + medium) | Varies | Iterative retrieval with decomposed sub-queries |

9.11.3 Token Budget-Aware Chunk Selection#

Given a context budget B_context, the number of retrievable chunks k is:

k = \left\lfloor \frac{B_{\text{context}} - B_{\text{system}} - B_{\text{query}} - B_{\text{reserve}}}{\bar{s}_{\kappa}} \right\rfloor

where:

  • B_system: tokens consumed by the system prompt, role policy, and tool definitions
  • B_query: tokens in the user query
  • B_reserve: tokens reserved for model generation
  • s̄_κ: average chunk size for the selected granularity level κ

This ensures the retrieval engine never overfills the context window.
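The budget arithmetic is worth making concrete (a minimal sketch; all numbers are illustrative and the function name is an assumption):

```python
def chunk_budget(b_context: int, b_system: int, b_query: int,
                 b_reserve: int, avg_chunk_size: float) -> int:
    """Number of chunks of avg_chunk_size that fit in the remaining budget."""
    available = b_context - b_system - b_query - b_reserve
    return max(0, int(available // avg_chunk_size))   # never negative

# 8K-token window, 1.2K system prompt, 80-token query, 1K generation reserve,
# medium-granularity chunks averaging 350 tokens.
k = chunk_budget(b_context=8192, b_system=1200, b_query=80,
                 b_reserve=1024, avg_chunk_size=350)
```

Clamping at zero covers the degenerate case where the fixed costs alone exhaust the window, in which case no chunks should be retrieved at this granularity.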

9.11.4 Runtime Chunk Merging and Splitting#

If the index stores chunks at a single granularity but the query demands a different one, runtime transformation is applied:

Runtime Merging (for coarser granularity):

c_{\text{merged}} = c_i \oplus c_{i+1} \oplus \ldots \oplus c_{i+m-1}

where c_i, …, c_{i+m−1} are consecutive chunks from the same document, and m is chosen such that |c_merged| ≤ τ_target.

Runtime Splitting (for finer granularity):

\{c_{i,1}, c_{i,2}, \ldots\} = \text{SentenceSplit}(c_i, \tau_{\text{target}})

Alternatively, if a hierarchical index (§9.4) is available, the retrieval engine simply navigates to the appropriate level.

9.11.5 Pseudo-Algorithm: Adaptive Retrieval with Dynamic Granularity#

ALGORITHM: AdaptiveRetrieve(q, index, B_context, B_system, B_reserve)
────────────────────────────────────────────────────────────────────────
INPUT:
  q           — user query
  index       — hierarchical chunk index with multiple granularity levels
  B_context   — total context window budget (tokens)
  B_system    — system prompt token consumption
  B_reserve   — tokens reserved for generation
 
OUTPUT:
  context[]   — ordered list of chunks for context injection
 
PROCEDURE:
  1.  // Classify query complexity
      κ ← ClassifyQueryComplexity(q)
      // κ ∈ {factoid, analytical, comparative, exploratory, multi_hop}
 
  2.  // Determine granularity level and chunk budget
      level ← GranularityMap[κ]
      avg_size ← index.AverageChunkSize(level)
      B_available ← B_context - B_system - TokenCount(q) - B_reserve
      k ← Floor(B_available / avg_size)
      k ← Clamp(k, min=3, max=20)   // Sensible bounds
 
  3.  // Retrieve at target granularity
      candidates ← index.Retrieve(q, level=level, top_k=k * 3)
      // Over-retrieve by 3× for re-ranking headroom
 
  4.  // Re-rank with cross-encoder or metadata-weighted scoring
      FOR EACH c IN candidates:
          c.relevance ← CrossEncoderScore(q, c.content)
          c.final_score ← (
              w_rel · c.relevance
            + w_fresh · c.freshness_score
            + w_auth · c.authority_score
            + w_coh · c.coherence_score
          )
 
  5.  candidates ← SortByFinalScore(candidates, descending=True)
 
  6.  // Greedy token-budget packing
      context ← []
      tokens_used ← 0
      FOR EACH c IN candidates:
          IF tokens_used + c.token_count ≤ B_available:
              APPEND c TO context
              tokens_used ← tokens_used + c.token_count
          ELSE:
              // Try summary version if available
              IF c.summary IS NOT NULL AND tokens_used + TokenCount(c.summary) ≤ B_available:
                  APPEND SummaryChunk(c) TO context
                  tokens_used ← tokens_used + TokenCount(c.summary)
              ELSE:
                  BREAK   // Budget exhausted
 
  7.  // For multi-hop queries: decompose and iterate
      IF κ = "multi_hop":
          sub_queries ← DecomposeQuery(q)
          FOR EACH sq IN sub_queries:
              sub_context ← AdaptiveRetrieve(sq, index, B_available / |sub_queries|, 0, 0)
              EXTEND context WITH sub_context
          context ← Deduplicate(context)
          context ← TruncateToFit(context, B_available)
 
  8.  // Order chunks by document position for coherent reading
      context ← SortByDocumentPosition(context)
 
  9.  RETURN context

9.11.6 Formal Objective: Budget-Optimal Chunk Selection#

The adaptive retrieval problem can be formalized as a constrained optimization:

\max_{\mathcal{S} \subseteq \mathcal{C}_{\text{candidates}}} \sum_{c \in \mathcal{S}} \text{score}(c, q) \quad \text{subject to} \quad \sum_{c \in \mathcal{S}} |c| \leq B_{\text{available}}

This is a variant of the 0-1 knapsack problem with item weight |c| and value score(c, q). For practical purposes, the greedy approximation (sort by the score-per-token ratio score(c, q)/|c| and pack greedily) yields near-optimal solutions:

\text{ratio}(c) = \frac{\text{score}(c, q)}{|c|}

Alternatively, for small candidate sets (at most ~50 candidates), dynamic programming yields an exact solution in O(|C| · B_available) time.
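The greedy approximation is a sort plus a packing loop; a minimal sketch, with illustrative names, assuming candidates arrive as (score, token_count, payload) tuples:

```python
def greedy_pack(candidates, budget):
    """Greedily pack chunks by score-per-token ratio under a token budget.

    candidates: list of (score, token_count, payload) tuples
    budget: B_available, the context token budget
    Returns (selected payloads, total score achieved).
    """
    # Sort by value density: score per token, highest first
    ranked = sorted(candidates, key=lambda c: c[0] / c[1], reverse=True)
    selected, total_score, tokens_used = [], 0.0, 0
    for score, tokens, payload in ranked:
        # Skip items that do not fit; smaller chunks later in the
        # ranking may still fit within the remaining budget
        if tokens_used + tokens <= budget:
            selected.append(payload)
            total_score += score
            tokens_used += tokens
    return selected, total_score
```

Skipping oversized items rather than breaking out of the loop (as the budget-packing pseudocode in the previous subsection does) tightens the approximation slightly; either policy is defensible depending on whether document order must be preserved during packing.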


9.12 Chunk Quality Metrics: Retrieval Precision Impact, Contextual Completeness, Synthesis Utility#

9.12.1 Motivation#

Chunking quality is not an abstract property — it manifests in measurable downstream effects on retrieval precision, context coherence, and agent output quality. A rigorous chunk quality framework defines metrics at three levels: chunk-level (intrinsic quality), retrieval-level (search effectiveness), and synthesis-level (downstream generation quality).

9.12.2 Chunk-Level Intrinsic Metrics#

9.12.2.1 Coherence Score#

As defined in §9.3.4:

\text{Coherence}(c) = \frac{1}{\binom{|S_c|}{2}} \sum_{i < j} \cos(\mathbf{e}_i, \mathbf{e}_j)

where S_c is the set of sentences in chunk c and e_i are their sentence embeddings. Cosine similarity lies in [−1, 1], but typical sentence embeddings yield non-negative pairwise similarities, so in practice the score falls in [0, 1]; target ≥ 0.7.
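A minimal sketch of this computation, assuming sentence embeddings have already been produced by some upstream model (here they are plain Python lists):

```python
from itertools import combinations
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def coherence(sentence_embeddings):
    """Mean pairwise cosine similarity over all sentence pairs in a chunk."""
    pairs = list(combinations(sentence_embeddings, 2))
    if not pairs:
        return 1.0  # a single-sentence chunk is trivially coherent
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)
```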

9.12.2.2 Self-Containedness Score#

Measures whether a chunk can be understood independently:

\text{SelfContained}(c) = 1 - \frac{|\text{UnresolvedReferences}(c)|}{|\text{TotalReferences}(c)|}

where unresolved references include dangling pronouns ("it," "they," "the above"), undefined acronyms, and incomplete sentences. Range: [0, 1]; target ≥ 0.85.

9.12.2.3 Information Density#

\text{Density}(c) = \frac{|\text{UniqueEntities}(c)| + |\text{UniqueFacts}(c)|}{|c|_{\text{tokens}}}

Higher density indicates more information per token, meaning fewer tokens are wasted on filler or repetition.

9.12.2.4 Boundary Quality#

Assesses whether chunk boundaries respect semantic units:

\text{BoundaryQuality}(c) = \begin{cases} 1 & \text{if } c \text{ starts and ends at sentence boundaries} \\ 0.5 & \text{if only one boundary is sentence-aligned} \\ 0 & \text{if both boundaries are mid-sentence} \end{cases}

9.12.3 Retrieval-Level Metrics#

9.12.3.1 Chunk Retrieval Precision at k#

Given a set of evaluation queries Q_eval with ground-truth relevant chunks R_q for each query q:

\text{Precision@}k = \frac{1}{|Q_{\text{eval}}|} \sum_{q \in Q_{\text{eval}}} \frac{|\text{TopK}(q, k) \cap R_q|}{k}

9.12.3.2 Chunk Recall at k#

\text{Recall@}k = \frac{1}{|Q_{\text{eval}}|} \sum_{q \in Q_{\text{eval}}} \frac{|\text{TopK}(q, k) \cap R_q|}{|R_q|}

9.12.3.3 Mean Reciprocal Rank (MRR)#

\text{MRR} = \frac{1}{|Q_{\text{eval}}|} \sum_{q \in Q_{\text{eval}}} \frac{1}{\text{rank}_q}

where rank_q is the position of the first relevant chunk in the ranked retrieval list.

9.12.3.4 Normalized Discounted Cumulative Gain (NDCG)#

For graded relevance labels rel(c, q) ∈ {0, 1, 2, 3}:

\text{DCG@}k = \sum_{i=1}^{k} \frac{2^{\text{rel}(c_i, q)} - 1}{\log_2(i + 1)}

\text{NDCG@}k = \frac{\text{DCG@}k}{\text{IDCG@}k}

where IDCG@k is the ideal DCG (relevance labels sorted in descending order).
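All four retrieval metrics reduce to a few lines per query; a sketch for a single query (corpus-level figures are means over Q_eval; the graded-relevance map in `ndcg_at_k` is an assumption of this example):

```python
from math import log2

def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved chunk IDs that are relevant."""
    return len(set(retrieved[:k]) & set(relevant)) / k

def recall_at_k(retrieved, relevant, k):
    """Fraction of relevant chunk IDs found in the top-k."""
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)

def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant chunk; 0 if none retrieved."""
    for i, cid in enumerate(retrieved, start=1):
        if cid in relevant:
            return 1.0 / i
    return 0.0

def ndcg_at_k(retrieved, rels, k):
    """rels maps chunk_id -> graded relevance in {0, 1, 2, 3}."""
    dcg = sum((2 ** rels.get(cid, 0) - 1) / log2(i + 1)
              for i, cid in enumerate(retrieved[:k], start=1))
    ideal = sorted(rels.values(), reverse=True)[:k]
    idcg = sum((2 ** r - 1) / log2(i + 1) for i, r in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0
```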

9.12.4 Synthesis-Level Metrics#

These measure the impact of chunk quality on the final agent output:

9.12.4.1 Faithfulness (Groundedness)#

\text{Faithfulness}(y, \mathcal{C}) = \frac{|\text{Claims}(y) \cap \text{Entailed}(\mathcal{C})|}{|\text{Claims}(y)|}

The fraction of claims in the agent's output y that are entailed by the retrieved context C. This is the primary hallucination metric, and chunking directly affects it: poorly chunked context that omits critical evidence forces the model to confabulate.

9.12.4.2 Answer Completeness#

\text{Completeness}(y, y^*) = \frac{|\text{Claims}(y) \cap \text{Claims}(y^*)|}{|\text{Claims}(y^*)|}

where y* is the ground-truth answer. Incomplete chunking produces incomplete answers.

9.12.4.3 Context Utilization#

\text{Utilization}(\mathcal{C}, y) = \frac{|\{c \in \mathcal{C} \mid \exists \text{claim} \in y \text{ attributed to } c\}|}{|\mathcal{C}|}

Measures what fraction of retrieved chunks the model actually used. Low utilization indicates that chunking is producing low-relevance results.

9.12.5 Composite Chunk Quality Score#

A weighted composite metric for chunking strategy evaluation:

Q_{\text{chunk}} = w_1 \cdot \text{NDCG@}k + w_2 \cdot \text{Faithfulness} + w_3 \cdot \text{Completeness} + w_4 \cdot \overline{\text{Coherence}} + w_5 \cdot \overline{\text{SelfContained}}

subject to:

\sum_{i=1}^{5} w_i = 1, \quad w_i \geq 0

where the overlines denote corpus-level averages of the coherence and self-containedness scores. This composite score enables A/B evaluation of chunking strategies against a fixed evaluation set.
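Evaluating the composite is a constrained weighted sum; a sketch with hypothetical weights (the weighting itself must be chosen per deployment, not taken from this example):

```python
def composite_chunk_quality(metrics, weights):
    """Weighted composite Q_chunk.

    metrics and weights share the keys:
    ndcg, faithfulness, completeness, coherence, self_contained.
    Enforces the simplex constraint: weights sum to 1 and are non-negative.
    """
    assert abs(sum(weights.values()) - 1.0) < 1e-9, "weights must sum to 1"
    assert all(w >= 0 for w in weights.values()), "weights must be non-negative"
    return sum(weights[k] * metrics[k] for k in weights)
```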

9.12.6 Pseudo-Algorithm: Chunk Quality Evaluation Pipeline#

ALGORITHM: EvaluateChunkQuality(chunks[], Q_eval, ground_truth, agent_fn, embed_fn)
───────────────────────────────────────────────────────────────────────────────────────
INPUT:
  chunks[]       — chunked corpus under evaluation
  Q_eval         — evaluation query set with ground-truth answers and relevant chunks
  ground_truth   — mapping: query → (answer, relevant_chunk_ids)
  agent_fn       — agent pipeline function: (query, context) → response
  embed_fn       — embedding function
 
OUTPUT:
  report         — quality report with all metrics
 
PROCEDURE:
  1.  // Index chunks
      index ← BuildIndex(chunks, embed_fn)
 
  2.  // Intrinsic metrics
      coherence_scores ← [Coherence(c, embed_fn) FOR c IN chunks]
      self_contained_scores ← [SelfContainednessScore(c) FOR c IN chunks]
      density_scores ← [InformationDensity(c) FOR c IN chunks]
      boundary_scores ← [BoundaryQuality(c) FOR c IN chunks]
 
  3.  // Retrieval metrics
      precision_at_k ← []
      recall_at_k ← []
      mrr_scores ← []
      ndcg_scores ← []
 
      FOR EACH (q, gt) IN Zip(Q_eval, ground_truth):
          k ← 10
          retrieved ← index.Retrieve(q, top_k=k)
          retrieved_ids ← [c.chunk_id FOR c IN retrieved]
          relevant_ids ← gt.relevant_chunk_ids
 
          precision_at_k.APPEND(|Set(retrieved_ids[:k]) ∩ Set(relevant_ids)| / k)
          recall_at_k.APPEND(|Set(retrieved_ids[:k]) ∩ Set(relevant_ids)| / |relevant_ids|)
          mrr_scores.APPEND(1 / FirstRelevantRank(retrieved_ids, relevant_ids))
          ndcg_scores.APPEND(NDCG(retrieved_ids, relevant_ids, k))
 
  4.  // Synthesis metrics
      faithfulness_scores ← []
      completeness_scores ← []
      utilization_scores ← []
 
      FOR EACH (q, gt) IN Zip(Q_eval, ground_truth):
          retrieved ← index.Retrieve(q, top_k=5)
          response ← agent_fn(q, retrieved)
 
          claims_y ← ExtractClaims(response)
          claims_gt ← ExtractClaims(gt.answer)
 
          faithfulness_scores.APPEND(
              |EntailedClaims(claims_y, retrieved)| / Max(|claims_y|, 1)
          )
          completeness_scores.APPEND(
              |Set(claims_y) ∩ Set(claims_gt)| / Max(|claims_gt|, 1)
          )
          utilization_scores.APPEND(
              |UsedChunks(response, retrieved)| / Max(|retrieved|, 1)
          )
 
  5.  report ← {
          intrinsic: {
              mean_coherence:       Mean(coherence_scores),
              mean_self_contained:  Mean(self_contained_scores),
              mean_density:         Mean(density_scores),
              mean_boundary:        Mean(boundary_scores)
          },
          retrieval: {
              precision_at_k:       Mean(precision_at_k),
              recall_at_k:          Mean(recall_at_k),
              mrr:                  Mean(mrr_scores),
              ndcg_at_k:            Mean(ndcg_scores)
          },
          synthesis: {
              faithfulness:         Mean(faithfulness_scores),
              completeness:         Mean(completeness_scores),
              utilization:          Mean(utilization_scores)
          },
          composite: ComputeCompositeScore(above, weights)
      }
 
  6.  RETURN report

9.13 Chunk Storage and Indexing: Vector Stores, Inverted Indexes, Hybrid Index Structures#

9.13.1 Architectural Requirements#

Chunks, once produced and enriched, must be stored in a retrieval infrastructure that supports:

  1. Semantic search: approximate nearest neighbor (ANN) over dense embeddings
  2. Keyword search: exact match, BM25, inverted index queries
  3. Metadata filtering: pre-retrieval filtering on entity tags, document ID, date ranges, chunk type
  4. Hierarchical navigation: parent-child traversal for chunk tree indexes
  5. Freshness-aware retrieval: time-weighted scoring
  6. Scalability: sub-100ms retrieval latency at millions-to-billions of chunks
  7. Idempotent upsert: re-chunking the same document version produces no duplicates

9.13.2 Vector Store Architecture#

Dense embedding retrieval stores each chunk as a vector v_c ∈ ℝ^d alongside its metadata. The retrieval operation for query q is:

\text{TopK}(q, k) = \underset{c \in \mathcal{C}}{\text{argmax-}k} \; \text{sim}(\mathbf{v}_q, \mathbf{v}_c)

where sim is typically cosine similarity or inner product.

ANN Index Structures#

| Index Type | Build Time | Query Time | Memory | Recall | Best For |
|---|---|---|---|---|---|
| Flat (brute-force) | O(n) | O(n · d) | O(n · d) | 100% | Small corpora (< 100K chunks) |
| IVF (Inverted File) | O(n · d) | O(√n · d) | O(n · d) | 95–99% | Medium corpora (100K–10M) |
| HNSW (Hierarchical NSW) | O(n · log n) | O(log n · d) | O(n · d · M) | 97–99.9% | Production systems requiring high recall |
| PQ (Product Quantization) | O(n · d) | O(n · d / m) | O(n · m) | 90–95% | Billion-scale with memory constraints |
| ScaNN | O(n · d) | O(√n) | O(n · d) | 98–99% | Google-scale retrieval |

For production agentic systems, HNSW is the default recommendation due to its logarithmic query time, high recall, and mature ecosystem support.

The HNSW parameters:

  • M: maximum number of connections per node (typical: 16–64)
  • efConstruction: search depth during index build (typical: 200–500)
  • efSearch: search depth during query (typical: 100–300; tune for the recall-latency trade-off)

The recall-latency trade-off is governed by:

\text{Recall@}k \approx 1 - e^{-\gamma \cdot \text{efSearch}}

where γ depends on data dimensionality and distribution.

9.13.3 Sparse Index: Inverted Index and BM25#

For exact-match and keyword-based retrieval, an inverted index maps terms to chunk IDs:

\mathcal{I}: \text{term} \rightarrow \{(c_1, \text{tf}_1), (c_2, \text{tf}_2), \ldots\}

BM25 scoring for a query q = {t_1, t_2, …, t_m} against chunk c:

\text{BM25}(q, c) = \sum_{i=1}^{m} \text{IDF}(t_i) \cdot \frac{f(t_i, c) \cdot (k_1 + 1)}{f(t_i, c) + k_1 \cdot \left(1 - b + b \cdot \frac{|c|}{\text{avgdl}}\right)}

where:

  • f(t_i, c): term frequency of t_i in chunk c
  • IDF(t_i) = ln((N − n(t_i) + 0.5) / (n(t_i) + 0.5) + 1)
  • N: total number of chunks
  • n(t_i): number of chunks containing term t_i
  • k_1 ≈ 1.2, b ≈ 0.75: tuning parameters
  • avgdl: average chunk length in tokens
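A toy scorer that follows the formula term by term (no stemming or stop-word handling; tokenization is assumed to have happened upstream):

```python
from math import log

def bm25_score(query_terms, chunk_terms, corpus, k1=1.2, b=0.75):
    """Score one chunk against a query.

    query_terms: list of query tokens
    chunk_terms: token list of the chunk being scored
    corpus: list of token lists, one per chunk (used for N, n(t), avgdl)
    """
    N = len(corpus)
    avgdl = sum(len(c) for c in corpus) / N
    score = 0.0
    for t in query_terms:
        n_t = sum(1 for c in corpus if t in c)        # chunks containing t
        idf = log((N - n_t + 0.5) / (n_t + 0.5) + 1)  # IDF as defined above
        f = chunk_terms.count(t)                      # term frequency in chunk
        denom = f + k1 * (1 - b + b * len(chunk_terms) / avgdl)
        score += idf * f * (k1 + 1) / denom
    return score
```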

9.13.4 Hybrid Index: Dense + Sparse Fusion#

Production systems combine vector search and keyword search through reciprocal rank fusion (RRF) or linear score combination:

Reciprocal Rank Fusion (RRF)#

\text{RRF}(c) = \sum_{r \in \mathcal{R}} \frac{1}{K + \text{rank}_r(c)}

where R is the set of ranking systems (dense + sparse), rank_r(c) is the rank of chunk c in system r, and K is a smoothing constant (typically K = 60).
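Because RRF consumes only ranks, it needs no score normalization across systems; a sketch of the fusion step (function name illustrative):

```python
def rrf_fuse(rankings, K=60):
    """Fuse multiple ranked lists of chunk IDs via reciprocal rank fusion.

    rankings: list of ranked ID lists (best first), e.g. [dense_ids, sparse_ids]
    Returns chunk IDs sorted by fused score, best first.
    """
    scores = {}
    for ranked in rankings:
        for rank, cid in enumerate(ranked, start=1):
            # Each system contributes 1 / (K + rank) for every ID it ranks
            scores[cid] = scores.get(cid, 0.0) + 1.0 / (K + rank)
    return sorted(scores, key=scores.get, reverse=True)
```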

Linear Score Combination#

\text{HybridScore}(c, q) = \alpha \cdot \text{DenseScore}(c, q) + (1 - \alpha) \cdot \text{SparseScore}(c, q)

where α ∈ [0, 1] is tuned per domain. For most knowledge-base retrieval tasks, α ∈ [0.5, 0.7] (slightly favoring semantic search) performs well.

9.13.5 Metadata Index Layer#

A separate metadata index enables pre-retrieval filtering. This is typically implemented as:

  • Structured metadata store: PostgreSQL, SQLite, or document store with indexed fields
  • Filter pushdown: Metadata filters are applied before ANN search, reducing the candidate set:
\text{TopK}_{\text{filtered}}(q, k, \text{filters}) = \underset{c \in \mathcal{C}_{\text{filtered}}}{\text{argmax-}k} \; \text{sim}(\mathbf{v}_q, \mathbf{v}_c)

where C_filtered = {c ∈ C | filters(c.metadata) = true}.

Common filter predicates:

source_doc_id = "doc_xyz"
created_at >= "2024-01-01"
chunk_type IN ["structural", "semantic"]
entity_tags CONTAINS "OpenAI"
language = "python"
freshness_score >= 0.5

9.13.6 Hierarchical Index for Multi-Granularity Retrieval#

For hierarchical chunks (§9.4), the index supports level-aware retrieval:

Level 1 (summaries):     Index_L1 → HNSW over summary embeddings
Level 2 (sections):      Index_L2 → HNSW over section embeddings  
Level 3 (paragraphs):    Index_L3 → HNSW over paragraph embeddings
Level 4 (propositions):  Index_L4 → HNSW over proposition embeddings

The retrieval router selects the index level based on query complexity (§9.11.2) and then optionally expands to parent or child chunks via the chunk tree structure stored in the metadata layer.

9.13.7 Pseudo-Algorithm: Hybrid Index Build and Query#

ALGORITHM: BuildHybridIndex(chunks[], embed_fn, tokenizer)
────────────────────────────────────────────────────────────
INPUT:
  chunks[]    — enriched chunks with metadata
  embed_fn    — embedding function: string → ℝ^d
  tokenizer   — tokenizer for BM25 term extraction
 
OUTPUT:
  hybrid_index — composite index supporting dense, sparse, and metadata queries
 
PROCEDURE:
  1.  // Dense index construction
      dense_index ← InitHNSW(dim=d, M=32, efConstruction=400)
      FOR EACH chunk c IN chunks:
          v ← embed_fn(c.content)
          dense_index.Add(c.chunk_id, v)
 
  2.  // Sparse index construction (BM25)
      sparse_index ← InitInvertedIndex()
      FOR EACH chunk c IN chunks:
          terms ← tokenizer.Tokenize(c.content)
          sparse_index.AddDocument(c.chunk_id, terms, metadata={
              doc_length: |terms|
          })
      sparse_index.ComputeIDF()
 
  3.  // Metadata index construction
      metadata_store ← InitStructuredStore(schema=ChunkMetadataSchema)
      FOR EACH chunk c IN chunks:
          metadata_store.Upsert(c.chunk_id, c.metadata)
          // Upsert: idempotent — same chunk_id overwrites cleanly
 
  4.  // Summary embedding index (optional, for dual-embedding)
      summary_index ← InitHNSW(dim=d, M=16, efConstruction=200)
      FOR EACH chunk c IN chunks WHERE c.summary IS NOT NULL:
          v_s ← embed_fn(c.summary)
          summary_index.Add(c.chunk_id, v_s)
 
  5.  hybrid_index ← {
          dense: dense_index,
          sparse: sparse_index,
          metadata: metadata_store,
          summary: summary_index,
          config: {α: 0.6, K_rrf: 60}
      }
 
  6.  RETURN hybrid_index
 
 
ALGORITHM: HybridQuery(hybrid_index, q, k, filters, α, fusion_mode, embed_fn, tokenizer)
───────────────────────────────────────────────────────
INPUT:
  hybrid_index  — composite index
  q             — query string
  k             — number of results to return
  filters       — metadata filter predicates
  α             — dense/sparse weight (override or use default)
  fusion_mode   — "RRF" or "linear"
  embed_fn      — embedding function (same one used at index build)
  tokenizer     — tokenizer (same one used at index build)
 
OUTPUT:
  results[]     — ranked list of (chunk_id, score, chunk) tuples
 
PROCEDURE:
  1.  // Apply metadata filters
      candidate_ids ← hybrid_index.metadata.Query(filters)
      // Returns set of chunk_ids matching filter predicates
 
  2.  // Dense retrieval (restricted to candidates)
      v_q ← embed_fn(q)
      dense_results ← hybrid_index.dense.Search(
          v_q, top_k=k * 3, restrict_to=candidate_ids
      )
      // Returns: [(chunk_id, cosine_score), ...]
 
  3.  // Sparse retrieval (restricted to candidates)
      q_terms ← tokenizer.Tokenize(q)
      sparse_results ← hybrid_index.sparse.Search(
          q_terms, top_k=k * 3, restrict_to=candidate_ids
      )
      // Returns: [(chunk_id, bm25_score), ...]
 
  4.  // Score fusion
      IF fusion_mode = "RRF":
          // Reciprocal Rank Fusion
          scores ← DefaultDict(float)
          FOR rank, (cid, _) IN Enumerate(dense_results):
              scores[cid] += 1.0 / (K_rrf + rank + 1)
          FOR rank, (cid, _) IN Enumerate(sparse_results):
              scores[cid] += 1.0 / (K_rrf + rank + 1)
 
      ELSE IF fusion_mode = "linear":
          // Normalize scores to [0, 1]
          dense_norm ← MinMaxNormalize(dense_results)
          sparse_norm ← MinMaxNormalize(sparse_results)
          scores ← DefaultDict(float)
          FOR (cid, s) IN dense_norm:
              scores[cid] += α · s
          FOR (cid, s) IN sparse_norm:
              scores[cid] += (1 - α) · s
 
  5.  // Sort and return top-k
      ranked ← SortByScore(scores, descending=True)[:k]
 
  6.  // Fetch full chunk content and metadata
      results ← []
      FOR EACH (cid, score) IN ranked:
          chunk ← hybrid_index.metadata.GetChunk(cid)
          APPEND (cid, score, chunk) TO results
 
  7.  RETURN results

9.13.8 Storage Architecture Patterns#

Pattern 1: Integrated Vector Database#

A single system (e.g., Weaviate, Qdrant, Milvus, Pinecone) serves as both vector store and metadata store:

[Chunks] → [Embedding] → [Vector DB with metadata filtering]
                              ├── HNSW index (dense)
                              ├── BM25 index (sparse, if supported)
                              └── Metadata index (filterable fields)

Pros: Operational simplicity, atomic upserts, consistent filtering. Cons: May lack full-featured BM25 or complex metadata query support.

Pattern 2: Disaggregated Stores#

Separate systems for each concern:

[Chunks] → [Embedding] → [HNSW Vector Store (Qdrant, Faiss)]
                       → [Elasticsearch / OpenSearch (BM25 + metadata)]
                       → [PostgreSQL (chunk metadata, hierarchical pointers)]

Pros: Best-of-breed per concern, independent scaling. Cons: Operational complexity, consistency management, query fan-out latency.

Pattern 3: Lakehouse with Vector Index Overlay#

Chunks stored in a data lakehouse (Delta Lake, Iceberg) with vector index acceleration:

[Parquet files in object store]
    ├── Column: chunk_id, content, metadata (structured)
    ├── Column: embedding (vector)
    └── [Vector index sidecar: HNSW over embedding column]

Pros: Cost-effective for very large corpora, SQL analytics over chunks, version control. Cons: Higher query latency, complex infrastructure.

9.13.9 Idempotent Upsert and Versioning#

Re-ingestion of a document must not produce duplicate chunks. The idempotency contract:

\text{ChunkID}(c) = \text{Hash}(\text{source\_doc\_id} \| \text{source\_version} \| \text{position\_index} \| \text{chunk\_type})

On re-ingestion:

  1. Compute new chunk IDs for the updated document
  2. Delete chunks whose source_doc_id matches but whose IDs are no longer in the new set
  3. Upsert new/changed chunks
  4. Index entries are atomically updated

This ensures the index is always consistent with the latest document version without manual deduplication.
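The contract above can be sketched with a SHA-256 content address; the field separator and ID truncation here are illustrative choices, not requirements:

```python
import hashlib

def chunk_id(source_doc_id, source_version, position_index, chunk_type):
    """Deterministic chunk ID from the idempotency key fields.

    Uses an ASCII unit separator to join fields so that field
    concatenation cannot collide across boundaries.
    """
    key = "\x1f".join([source_doc_id, source_version,
                       str(position_index), chunk_type])
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]

def reingest_plan(old_ids, new_ids):
    """Compute the delete/upsert sets for re-ingesting one document.

    Chunks whose IDs vanished from the new set are deleted;
    every chunk in the new set is upserted (idempotent overwrite).
    """
    old_set, new_set = set(old_ids), set(new_ids)
    return sorted(old_set - new_set), sorted(new_set)
```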

9.13.10 Cache Hierarchy for Retrieval Artifacts#

To minimize retrieval latency under repeated or similar queries:

| Cache Layer | Scope | TTL | Content |
|---|---|---|---|
| Query embedding cache | Per query hash | 1 hour | Precomputed query vector v_q |
| Result cache | Per (query hash, filter hash) | 5 minutes | Top-k chunk IDs and scores |
| Chunk content cache | Per chunk ID | 24 hours | Full chunk content + metadata |
| Cross-encoder score cache | Per (query, chunk ID) | 1 hour | Re-ranking scores |

Cache invalidation is triggered by chunk upserts: any chunk ID that changes invalidates all result caches containing it.
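The upsert-triggered invalidation can be implemented with a reverse index from chunk ID to the result-cache keys that contain it; a minimal sketch (class and method names are illustrative, and TTL handling is omitted):

```python
class ResultCache:
    """Top-k result cache with upsert-triggered invalidation."""

    def __init__(self):
        self._cache = {}     # (query_hash, filter_hash) -> list of chunk IDs
        self._by_chunk = {}  # chunk_id -> set of cache keys containing it

    def put(self, key, chunk_ids):
        self._cache[key] = list(chunk_ids)
        for cid in chunk_ids:
            self._by_chunk.setdefault(cid, set()).add(key)

    def get(self, key):
        return self._cache.get(key)

    def invalidate_chunk(self, chunk_id):
        """Called on upsert: drop every cached result containing this chunk."""
        for key in self._by_chunk.pop(chunk_id, set()):
            self._cache.pop(key, None)
```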

9.13.11 Observability and Operational Metrics#

| Metric | Description | Alert Threshold |
|---|---|---|
| retrieval_latency_p99 | 99th percentile retrieval latency | > 200ms |
| index_size_chunks | Total chunks in index | Monitor growth rate |
| embedding_throughput | Chunks embedded per second | < 100/s triggers scaling |
| cache_hit_rate | Fraction of queries served from cache | < 50% indicates cold cache |
| index_freshness_lag | Time between document update and index update | > 5 minutes |
| duplicate_chunk_rate | Fraction of chunks with duplicate content | > 1% indicates idempotency failure |
| metadata_completeness | Fraction of chunks with all metadata fields populated | < 95% triggers enrichment pipeline review |

Chapter Summary#

Chunking is the foundational engineering discipline that determines the quality ceiling of every downstream retrieval and generation operation in an agentic system. This chapter has established:

  1. No universal strategy exists: Document class determines the optimal chunking approach. The system must classify documents and route to the appropriate strategy.

  2. Six primary strategies — structural, semantic, hierarchical, agentic, code-aware, and tabular — each formalized with mathematical definitions, pseudo-algorithms, and trade-off analyses.

  3. Multi-modal and adaptive extensions that handle heterogeneous media and runtime query-driven granularity adjustment.

  4. Overlap and boundary coherence techniques that mitigate information loss at chunk boundaries, with formal cost-expansion analysis.

  5. Metadata enrichment as a non-optional requirement: every chunk must carry provenance, position, entity tags, summary, coherence score, and hierarchical pointers.

  6. Quality metrics at three levels — intrinsic, retrieval, and synthesis — with a composite evaluation framework for A/B comparison of chunking strategies.

  7. Hybrid indexing infrastructure combining dense (HNSW), sparse (BM25), and metadata indexes with formal fusion algorithms, cache hierarchies, idempotent upsert contracts, and production observability.

The key architectural principle is: chunking is not preprocessing; it is a continuously evaluated, document-class-aware, query-adaptive retrieval precision lever that must be engineered with the same rigor as any other production subsystem in the agentic stack.