Preamble#
Query understanding constitutes the single most consequential stage in any agentic retrieval pipeline. Every downstream operation—retrieval precision, tool selection, memory access, response synthesis, and verification—is bounded by the fidelity with which the system interprets the user's communicative intent, decomposes it into actionable sub-problems, enriches it with contextual and ontological knowledge, and routes constituent sub-queries to the correct execution tier. A system that retrieves against raw query strings treats language as a lookup key. A system that understands queries treats language as a compressed program requiring decompilation before execution.
This chapter formalizes query understanding as a multi-stage cognitive pipeline: a typed, inspectable, and measurable transformation from an ambiguous natural-language surface form into a structured, provenance-tagged, schema-routed execution plan. We develop the mathematical foundations, taxonomic models, decomposition strategies, enrichment protocols, and quality metrics that elevate query processing from heuristic pattern matching to a principled engineering discipline.
7.1 Query Understanding as a Cognitive Pipeline, Not String Matching#
7.1.1 The Fundamental Inadequacy of String-Level Processing#
Traditional information retrieval systems operate on a reductive assumption: the query string is a sufficient representation of the user's information need. This assumption fails categorically in agentic contexts for three structural reasons:
- Compression Loss. Natural language is a lossy compression of intent. The query "How does the new policy affect our deployment?" encodes domain context (which policy?), temporal reference (new relative to what?), scope (which deployment?), and an implicit causal reasoning requirement—none of which are recoverable from lexical or even embedding-level similarity alone.
- Contextual Dependency. In multi-turn interactions, queries carry anaphoric references, elliptical constructions, and presuppositions that require resolution against conversational history, user state, and session memory.
- Heterogeneous Execution Requirements. A single surface query may require simultaneous retrieval from structured databases, unstructured document stores, code repositories, live APIs, and episodic memory—each with distinct schemas, latency profiles, and authority levels.
7.1.2 The Pipeline Abstraction#
We define query understanding as a typed transformation pipeline that maps a raw input tuple to an executable query plan:
QU : (q, H, M, S) → P
where:
- q is the raw query string (or multi-modal input).
- H is the conversational history (query-response-timestamp triples).
- M is the accessible memory state (working, session, episodic, semantic, procedural layers).
- S is the system state (available tools, active schemas, user profile, authorization scope).
- P is a structured query execution plan containing resolved intents, decomposed sub-queries, enrichment annotations, routing directives, and verification predicates.
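The input tuple and plan described above can be sketched as typed records. This is a minimal illustration, not a prescribed schema; all field names (`Turn`, `QueryInput`, `SubQuery`, `QueryPlan`, `understand`) are hypothetical:

```python
from dataclasses import dataclass, field

@dataclass
class Turn:
    """One (query, response, timestamp) triple from the history H."""
    query: str
    response: str
    timestamp: float

@dataclass
class QueryInput:
    """The raw input tuple (q, H, M, S)."""
    q: str               # raw query string
    history: list        # conversational history H, a list of Turn
    memory: dict         # memory state M (layered stores)
    system: dict         # system state S (tools, schemas, authorization)

@dataclass
class SubQuery:
    text: str
    intent: str
    route: str           # routing directive (source identifier)

@dataclass
class QueryPlan:
    """The structured execution plan P."""
    intents: list
    sub_queries: list
    provenance: dict = field(default_factory=dict)

def understand(inp: QueryInput) -> QueryPlan:
    # Placeholder: a real pipeline would run the stages of §7.1.3.
    return QueryPlan(intents=["lookup"],
                     sub_queries=[SubQuery(inp.q, "lookup", "default")])
```

The point of the sketch is the type signature: every stage consumes and produces structured objects, never bare strings.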
7.1.3 Pipeline Stages#
The pipeline comprises the following ordered stages, each producing a typed intermediate representation:
| Stage | Input | Output | Section |
|---|---|---|---|
| Intent Classification | q, H | Intent vector I with confidence | §7.2 |
| Pragmatic Analysis | q, I, H, M | Pragmatic frame F_p | §7.3 |
| Cognitive Load Estimation | q, F_p, I | Complexity score κ, reasoning mode ρ | §7.4 |
| Query Rewriting & Expansion | q, F_p, κ | Rewritten query set Q′ | §7.5 |
| Decomposition | q, I, F_p, κ, ρ | Sub-query DAG G_sq | §7.6 |
| Schema-Aware Routing | G_sq, S | Routed execution plan | §7.7 |
| Clarification Detection | All prior outputs | Clarification request or proceed signal | §7.9 |
7.1.4 Formal Invariant#
The pipeline must satisfy the semantic preservation invariant: the union of information needs addressed by the decomposed, routed sub-queries must be a superset of the original intent:
⋃_{v ∈ V(G_sq)} Need(v) ⊇ Need(q)
Any pipeline stage that reduces coverage below this bound constitutes a comprehension fault and must trigger either clarification (§7.9) or repair (via the bounded agent loop's critique-repair cycle).
7.1.5 Design Principles#
- Typed Intermediates. Every stage produces a structured, schema-validated output—never free-text annotations.
- Provenance Tracking. Every enrichment, expansion, or rewrite carries a provenance tag identifying its source (ontology, memory, model inference, user history).
- Token Budget Awareness. The pipeline operates under a global token budget B_tok; enrichment is bounded by retrieval utility, not unbounded expansion.
- Fail-Safe Semantics. If any stage fails or times out, the pipeline degrades gracefully to the best available intermediate representation rather than propagating the raw query q.
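The fail-safe principle can be sketched as a stage runner that keeps the last validated representation when a stage fails, instead of falling back to the raw string. The stage names and the simulated timeout are illustrative:

```python
def run_pipeline(q, stages):
    """Fail-safe stage runner: each stage maps the current representation to
    a richer one; on failure or timeout, degrade to the best intermediate
    produced so far rather than propagating the raw query."""
    rep = {"query": q, "stage": "raw"}
    for name, fn in stages:
        try:
            rep = {**fn(rep), "stage": name}
        except Exception:
            break  # graceful degradation: keep the last good representation
    return rep

def failing_pragmatics(rep):
    # Simulated stage failure (e.g., a timeout in pragmatic analysis).
    raise TimeoutError("pragmatic analysis timed out")

stages = [
    ("intent", lambda r: {**r, "intent": "lookup"}),
    ("pragmatics", failing_pragmatics),
    ("routing", lambda r: {**r, "route": "default"}),
]
result = run_pipeline("why did the migration fail?", stages)
# Execution stops after "intent"; no routing directive is attached.
```

Downstream consumers then inspect `result["stage"]` to know which pipeline depth was actually reached.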
7.2 Intent Classification: Taxonomic, Hierarchical, and Open-Domain Intent Models#
7.2.1 Intent as a Structured Object#
Intent classification transforms the surface query into a structured intent representation. We define an intent as a typed record:
I = (t, c, k, π, δ)
where:
- t is the intent type drawn from a taxonomy 𝒯.
- c ∈ [0, 1] is the classification confidence.
- k is the intent class (the fine-grained label within the taxonomy).
- π is a vector of extracted parameters (entities, constraints, temporal references, scope qualifiers).
- δ is the derivation mode indicating how the intent was determined (e.g., explicit vs. inferred).
7.2.2 Taxonomic Intent Models#
A flat taxonomy defines 𝒯 as a finite set of mutually exclusive intent labels. This is sufficient for narrow-domain systems but fails when intents are compositional.
Classification via softmax over the taxonomy:
P(τ_i | q) = softmax(W · E(q))_i
where E is an encoder producing a d-dimensional query embedding E(q) ∈ ℝ^d and W ∈ ℝ^{|𝒯|×d} is a learned projection.
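The softmax classification above reduces to a matrix product and a normalization. A minimal sketch, where the three-label taxonomy, the random projection `W`, and the query vector stand in for a trained encoder and head:

```python
import numpy as np

def softmax(z):
    z = z - z.max()          # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def classify_flat(v_q, W, taxonomy):
    """Flat taxonomic classification: softmax over the logits W @ v_q,
    returning the top label and its probability."""
    p = softmax(W @ v_q)
    i = int(np.argmax(p))
    return taxonomy[i], float(p[i])

taxonomy = ["lookup", "compare", "recommend"]
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))      # |T| x d projection (illustrative weights)
v_q = rng.normal(size=4)         # d-dimensional query embedding
label, conf = classify_flat(v_q, W, taxonomy)
```

The returned probability is exactly the confidence c of the intent record, and is what temperature scaling (§7.2.7) later calibrates.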
Limitation: Flat taxonomies cannot represent composite intents such as "compare X and Y, then recommend based on our deployment constraints," which simultaneously invokes comparison, recommendation, and constraint-satisfaction intents.
7.2.3 Hierarchical Intent Models#
A hierarchical taxonomy organizes intents as a tree or DAG where parent nodes represent coarse intent categories and leaf nodes represent fine-grained intents.
Hierarchical classification proceeds top-down:
P(τ_L | q) = ∏_{l=1}^{L} P(τ_l | τ_{l−1}, q)
where τ_0 is the root and τ_L is the selected leaf intent.
Advantages:
- Enables coarse-to-fine resolution under latency constraints (early exit at intermediate levels).
- Permits structural sharing of parameters across related intents.
- Supports graceful degradation: if fine-grained classification is uncertain, the system operates at the parent level.
7.2.4 Open-Domain Intent Models#
In agentic systems interacting with arbitrary domains, a closed taxonomy is insufficient. Open-domain intent models must handle previously unseen intent types.
Approach 1: Intent as Natural-Language Description. Rather than classifying into a fixed set, the system generates a structured natural-language intent description:
I_open = GenerateOpenIntent(q, H)
matching the open-domain fallback in Algorithm 7.1.
Approach 2: Intent Embedding in Continuous Space. Map intents to a continuous embedding space where similarity corresponds to functional equivalence:
f : I → ℝ^k, with ‖f(I_a) − f(I_b)‖ small when I_a and I_b are functionally equivalent.
New intents are recognized by their distance from known intent cluster centroids. If min_j ‖f(I) − μ_j‖ > θ_novel, the intent is flagged as novel, triggering clarification or dynamic taxonomy extension.
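The centroid-distance novelty test is straightforward to implement. A sketch with toy two-dimensional centroids and an illustrative threshold:

```python
import numpy as np

def detect_novel_intent(v_intent, centroids, theta_novel):
    """Flag an intent embedding as novel when its distance to every known
    cluster centroid exceeds theta_novel. Returns (is_novel, nearest
    centroid index, distance to it)."""
    dists = np.linalg.norm(centroids - v_intent, axis=1)
    j = int(np.argmin(dists))
    return bool(dists[j] > theta_novel), j, float(dists[j])

# Two known intent clusters (illustrative centroids in a 2-d space).
centroids = np.array([[1.0, 0.0],
                      [0.0, 1.0]])
novel, nearest, d = detect_novel_intent(np.array([0.9, 0.1]),
                                        centroids, theta_novel=0.5)
# Close to the first centroid: recognized as a known intent, not novel.
```

A point far from both centroids (e.g., `[3.0, 3.0]`) would trip the threshold and trigger clarification or taxonomy extension.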
7.2.5 Multi-Intent Detection#
Agentic queries frequently encode multiple simultaneous intents. We model multi-intent detection as a multi-label classification problem:
ŷ = σ(W · E(q)), ŷ ∈ [0, 1]^{|𝒯|}
where σ is the element-wise sigmoid function. An intent τ_i is active if ŷ_i > θ_i, where θ_i is a per-intent calibrated threshold.
Multi-intent queries require decomposition (§7.6) to ensure each intent is served by an appropriate retrieval pathway.
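The sigmoid-and-threshold step can be sketched directly; the toy taxonomy, logits, and uniform thresholds below are illustrative:

```python
import numpy as np

def detect_intents(logits, thresholds, taxonomy):
    """Multi-label intent detection: element-wise sigmoid over the logits,
    then per-intent calibrated thresholds. Returns the active intents with
    their probabilities."""
    probs = 1.0 / (1.0 + np.exp(-np.asarray(logits, dtype=float)))
    return [(name, float(p))
            for name, p, th in zip(taxonomy, probs, thresholds)
            if p > th]

taxonomy = ["compare", "recommend", "constrain"]
active = detect_intents(logits=[2.0, 0.1, -3.0],
                        thresholds=[0.5, 0.5, 0.5],
                        taxonomy=taxonomy)
# "compare" (σ(2.0) ≈ 0.88) and "recommend" (σ(0.1) ≈ 0.52) fire;
# "constrain" (σ(−3.0) ≈ 0.05) does not.
```

Unlike the softmax case, the probabilities need not sum to one, which is exactly what lets several intents fire at once.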
7.2.6 Pseudo-Algorithm: Hierarchical Multi-Intent Classification#
ALGORITHM 7.1: HierarchicalMultiIntentClassify(q, H, T_H)
───────────────────────────────────────────────────────────
Input:
q — raw query string
H — conversational history
T_H — hierarchical intent taxonomy (V, E, depth L)
Output:
I_set — set of (intent, confidence, derivation_mode) triples
1. q_ctx ← ContextualizeQuery(q, H) // resolve anaphora, ellipsis
2. v_q ← Encode(q_ctx) // dense embedding ∈ ℝ^d
3. I_set ← ∅
4. frontier ← {root(T_H)}
5. WHILE frontier ≠ ∅ DO
6. node ← Pop(frontier)
7. children ← Children(node, T_H)
8. IF children = ∅ THEN // leaf node
9. score ← IntentScore(v_q, node)
10. IF score > θ_leaf THEN
11. derivation ← DetermineDerivation(q, node)
12. I_set ← I_set ∪ {(node.intent, score, derivation)}
13. END IF
14. ELSE
15. FOR EACH child IN children DO
16. score ← IntentScore(v_q, child)
17. IF score > θ_branch THEN
18. frontier ← frontier ∪ {child}
19. END IF
20. END FOR
21. END IF
22. END WHILE
23. IF |I_set| = 0 THEN // open-domain fallback
24. I_open ← GenerateOpenIntent(q_ctx, H)
25. I_set ← {(I_open, confidence(I_open), "inferred")}
26. END IF
27. I_set ← DeduplicateAndMerge(I_set)
28. RETURN I_set
7.2.7 Intent Confidence Calibration#
Raw model scores are typically poorly calibrated. We apply temperature scaling post-hoc:
P̂(τ_i | q) = softmax(z / T)_i
where z is the logit vector and T > 0 is a scalar temperature learned on a held-out calibration set by minimizing negative log-likelihood. A well-calibrated intent classifier satisfies:
P(prediction correct | confidence = c) = c for all c ∈ [0, 1]
This property is critical for downstream routing: if the system cannot reliably distinguish between intents, it must request clarification rather than committing to a low-confidence route.
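Temperature scaling fits a single scalar, so even a coarse grid search suffices. A sketch under illustrative assumptions: the four-example calibration set is synthetic, with peaked logits and one example the model gets wrong, so the fitted temperature comes out above 1 (softening the scores):

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def calibrate_temperature(logits, labels, grid=None):
    """Post-hoc temperature scaling: choose the scalar T minimizing
    negative log-likelihood on a held-out calibration set. Grid search
    stands in for gradient-based fitting."""
    grid = grid if grid is not None else np.arange(0.5, 5.01, 0.1)

    def nll(T):
        return -sum(float(np.log(softmax(z / T)[y] + 1e-12))
                    for z, y in zip(logits, labels))

    return float(min(grid, key=nll))

# Overconfident toy classifier: identical peaked logits on four calibration
# examples, but the model's argmax is wrong on the last one.
logits = [np.array([4.0, 0.0, 0.0])] * 4
labels = [0, 0, 0, 1]
T = calibrate_temperature(logits, labels)
```

Because the scores are too peaked relative to accuracy, the fitted T exceeds 1; dividing the logits by T flattens the distribution toward the classifier's true hit rate.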
7.3 Psycholinguistic Analysis: Pragmatic Inference, Gricean Maxims, Presupposition Resolution#
7.3.1 Beyond Semantics: The Pragmatic Layer#
Semantic analysis extracts what the words mean. Pragmatic analysis extracts what the speaker means by using those words in this context. This distinction is not philosophical decoration—it is the difference between retrieving a definition of "scalability" and understanding that the user is asking how to resolve a performance bottleneck they are currently experiencing.
We formalize the pragmatic frame as:
F_p = (a, P, G, E)
where:
- a is the identified speech act (request, assertion, question, directive, commissive).
- P is the set of presuppositions the query takes for granted.
- G is the set of conversational implicatures derivable via pragmatic reasoning.
- E is the inferred response expectation (format, depth, evidence requirements).
7.3.2 Gricean Maxim Analysis#
H.P. Grice's cooperative principle posits that speakers implicitly follow four maxims. When a query appears to violate a maxim, the system should infer an implicature:
| Maxim | Definition | Violation Signal | Implicature Example |
|---|---|---|---|
| Quantity | Be as informative as required, no more | Over-specification or under-specification | Under-specified query → user assumes shared context |
| Quality | Do not say what you believe to be false | Hedging, uncertainty markers | "I think the API changed?" → request for confirmation |
| Relation | Be relevant | Apparent topic shift | Topic shift → implicit connection the system must discover |
| Manner | Be clear, orderly | Vague, ambiguous phrasing | Deliberate vagueness → user is exploring, not retrieving |
Formal implicature extraction:
Given a query q and a maxim violation detector V_m for each maxim m:
G = {DeriveImplicature(q, m, H) : m ∈ {Quantity, Quality, Relation, Manner}, V_m(q, H) = true}
7.3.3 Presupposition Resolution#
Presuppositions are propositions that the query takes as given. Unresolved presuppositions cause retrieval against false premises.
Example: "Why did the migration fail?" presupposes (a) a migration occurred, (b) it failed.
Presupposition extraction and validation:
P_raw = ExtractPresuppositions(q, H) = {p_1, …, p_n}
Each p_i must be verified against available evidence:
status(p_i) = Verify(p_i, RetrieveEvidence(p_i, M)) ∈ {supported, contradicted, unverifiable}
Contradicted presuppositions must trigger either:
- A corrective clarification to the user ("The migration actually succeeded; are you asking about the subsequent deployment failure?").
- A presupposition repair in the query plan (replace failed presupposition with corrected premise before retrieval).
7.3.4 Speech Act Classification#
We classify speech acts using an adaptation of Searle's taxonomy:
a ∈ {assertive, directive, commissive, expressive, declarative}
The speech act determines the functional shape of the expected response:
| Speech Act | System Response Shape |
|---|---|
| Assertive question | Evidence-backed factual answer |
| Directive | Action execution + confirmation |
| Commissive | Commitment tracking + follow-up scheduling |
| Expressive | Acknowledgment + contextual assistance |
| Declarative | State change + verification |
7.3.5 Pseudo-Algorithm: Pragmatic Frame Construction#
ALGORITHM 7.2: ConstructPragmaticFrame(q, I_set, H, M)
───────────────────────────────────────────────────────
Input:
q — raw query string
I_set — classified intents from Algorithm 7.1
H — conversational history
M — memory state (session + episodic)
Output:
F_p — pragmatic frame (speech_act, presuppositions, implicatures, expectations)
1. // Speech act classification
2. a_speech ← ClassifySpeechAct(q, H)
3. // Presupposition extraction and validation
4. P_raw ← ExtractPresuppositions(q, H)
5. P_validated ← ∅
6. FOR EACH p IN P_raw DO
7. evidence ← RetrieveEvidence(p, M, timeout=50ms)
8. status ← Verify(p, evidence)
9. P_validated ← P_validated ∪ {(p, status, evidence.provenance)}
10. END FOR
11. // Gricean maxim violation detection
12. G_imp ← ∅
13. FOR EACH maxim IN {Quantity, Quality, Relation, Manner} DO
14. IF DetectViolation(q, maxim, H) THEN
15. g ← DeriveImplicature(q, maxim, H, I_set)
16. G_imp ← G_imp ∪ {(g, maxim, confidence(g))}
17. END IF
18. END FOR
19. // Response expectation inference
20. E_exp ← InferExpectation(a_speech, I_set, H, UserProfile(M))
21. // E_exp includes: format, depth, evidence_required, action_required
22. // Check for contradicted presuppositions
23. contradictions ← {(p, s, e) ∈ P_validated : s = "contradicted"}
24. IF |contradictions| > 0 THEN
25. F_p.requires_clarification ← TRUE
26. F_p.contradiction_details ← contradictions
27. END IF
28. F_p ← (a_speech, P_validated, G_imp, E_exp)
29. RETURN F_p
7.3.6 Pragmatic Inference Under Uncertainty#
When pragmatic inference is uncertain (e.g., the system cannot determine whether "Can you show me the logs?" is a capability question or a directive), we maintain a pragmatic distribution:
P(F_p^{(i)} | q, H), i = 1, …, n
If the entropy of this distribution exceeds a threshold:
H = −Σ_i P(F_p^{(i)} | q, H) log P(F_p^{(i)} | q, H) > θ_H
the system must either (a) hedge its response to cover the top-k interpretations or (b) trigger active clarification (§7.9).
7.4 Cognitive Load Modeling: Estimating Task Complexity, Ambiguity, and Required Reasoning Depth#
7.4.1 Motivation#
Not all queries require the same computational investment. A simple factual lookup should not trigger multi-step decomposition and parallel retrieval fan-out. Conversely, a complex analytical query should not be answered with a single retrieval pass. Cognitive load modeling estimates the computational and reasoning resources required to adequately serve a query, enabling adaptive pipeline configuration.
7.4.2 Complexity Dimensions#
We model cognitive load as a multi-dimensional score:
κ = (κ_lex, κ_sem, κ_struct, κ_reason, κ_scope)
| Dimension | Measures | Range |
|---|---|---|
| κ_lex | Lexical complexity (vocabulary rarity, technical density) | [0, 1] |
| κ_sem | Semantic ambiguity (polysemy, vagueness) | [0, 1] |
| κ_struct | Structural complexity (syntactic depth, embedded clauses) | [0, 1] |
| κ_reason | Reasoning depth (number of inference steps required) | [0, 1] |
| κ_scope | Information scope (number of distinct knowledge domains) | [0, 1] |
7.4.3 Aggregate Complexity Score#
The aggregate cognitive load is computed as a weighted combination:
κ_agg = w_lex·κ_lex + w_sem·κ_sem + w_struct·κ_struct + w_reason·κ_reason + w_scope·κ_scope
subject to Σ_i w_i = 1 and w_i ≥ 0.
The weights w_i are calibrated empirically against human difficulty ratings on a held-out query set.
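The weighted aggregation and the threshold-based mode selection of the next subsection fit in a few lines. The weights, scores, and thresholds below are illustrative, not calibrated values:

```python
def aggregate_load(kappa, weights):
    """Aggregate cognitive load: κ_agg = Σ w_i · κ_i with Σ w_i = 1."""
    assert abs(sum(weights.values()) - 1.0) < 1e-9, "weights must sum to 1"
    return sum(weights[d] * kappa[d] for d in kappa)

def select_mode(k_agg, thresholds=(0.25, 0.5, 0.75)):
    """Map the aggregate score to a reasoning mode by counting how many
    thresholds it reaches (equivalent to the IF/ELIF chain of Alg. 7.3)."""
    modes = ["direct-retrieval", "single-step-reasoning",
             "multi-step-decomposition", "deliberative-analysis"]
    return modes[sum(k_agg >= t for t in thresholds)]

kappa = {"lex": 0.2, "sem": 0.6, "struct": 0.3, "reason": 0.8, "scope": 0.4}
weights = {"lex": 0.1, "sem": 0.25, "struct": 0.15, "reason": 0.3, "scope": 0.2}
k_agg = aggregate_load(kappa, weights)   # 0.535 for these toy values
mode = select_mode(k_agg)                # falls in the third band
```

A query scoring high on reasoning depth but low elsewhere still lands in a decomposition mode, which is exactly the adaptive behavior §7.4.1 motivates.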
7.4.4 Reasoning Mode Selection#
Based on the cognitive load profile, the system selects an appropriate reasoning mode ρ:
ρ ∈ {direct-retrieval, single-step-reasoning, multi-step-decomposition, deliberative-analysis}
Each mode configures the downstream pipeline differently:
| Mode | Decomposition | Retrieval | Verification | Token Budget |
|---|---|---|---|---|
| Direct-retrieval | None | Single-pass | Confidence check | Minimal |
| Single-step-reasoning | None | Multi-source | Evidence match | Moderate |
| Multi-step-decomposition | DAG (§7.6) | Per-sub-query | Per-sub-result | Standard |
| Deliberative-analysis | Full DAG + critique | Iterative | Multi-round verification | Extended |
7.4.5 Ambiguity Quantification#
Semantic ambiguity is estimated via the entropy of the embedding neighborhood:
κ_sem = −(1 / log k) Σ_{i=1}^{k} p_i log p_i, with p_i = s_i / Σ_j s_j
where s_i are the cosine similarities between q's embedding and its k-nearest neighbors in the retrieval index, and p_i are the normalized similarities. High entropy indicates the query is equidistant from many semantically distinct documents: a signal of ambiguity.
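The normalized neighborhood entropy is a one-liner over the similarity vector. A sketch with synthetic similarity profiles standing in for real k-NN results:

```python
import numpy as np

def ambiguity_entropy(sims):
    """κ_sem: entropy of the normalized k-NN similarities, divided by
    log(k) so the score lies in [0, 1]. A uniform profile (query
    equidistant from all neighbors) yields the maximum, ≈ 1."""
    s = np.asarray(sims, dtype=float)
    p = s / s.sum()
    return float(-(p * np.log(p + 1e-12)).sum() / np.log(len(s)))

# Ambiguous query: equally similar to 20 distinct documents.
k_sem_uniform = ambiguity_entropy([0.8] * 20)
# Focused query: one dominant neighbor, the rest marginal.
k_sem_peaked = ambiguity_entropy([0.95] + [0.05] * 19)
```

Comparing the two profiles shows the intended behavior: the uniform neighborhood scores near 1 and would push the pipeline toward multi-HyDE (§7.5.1.3) or clarification, while the peaked one scores substantially lower.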
7.4.6 Reasoning Depth Estimation#
We estimate κ_reason by analyzing the logical structure of the query:
κ_reason = (α·n_E + β·n_R + γ·n_C + δ·n_X) / D_max
where n_E, n_R, n_C, n_X are counts of distinct entities, required relational inferences, conditional branches, and comparative evaluations extracted from the query, and α, β, γ, δ are empirically calibrated coefficients.
7.4.7 Pseudo-Algorithm: Cognitive Load Estimation#
ALGORITHM 7.3: EstimateCognitiveLoad(q, F_p, I_set, index)
───────────────────────────────────────────────────────────
Input:
q — contextualized query
F_p — pragmatic frame
I_set — classified intents
index — retrieval index for ambiguity estimation
Output:
κ — (κ_lex, κ_sem, κ_struct, κ_reason, κ_scope)
κ_agg — aggregate score
ρ — selected reasoning mode
1. // Lexical complexity
2. tokens ← Tokenize(q)
3. κ_lex ← Mean({Rarity(t) : t ∈ tokens}) + TechnicalDensity(tokens)
4. κ_lex ← Clamp(κ_lex, 0, 1)
5. // Semantic ambiguity via embedding neighborhood entropy
6. v_q ← Encode(q)
7. neighbors ← KNN(index, v_q, k=20)
8. sims ← CosineSimilarities(v_q, neighbors)
9. sims_norm ← Normalize(sims)
10. κ_sem ← -Sum(sims_norm * Log(sims_norm)) / Log(k)
11. // Structural complexity via parse depth
12. tree ← SyntacticParse(q)
13. κ_struct ← Depth(tree) / D_max_struct + EmbeddedClauseCount(tree) / C_max
14. // Reasoning depth
15. entities ← ExtractEntities(q, F_p)
16. relations ← ExtractRelations(q, F_p)
17. conditionals ← CountConditionals(q)
18. comparisons ← CountComparisons(q)
19. κ_reason ← (|entities|·α + |relations|·β + conditionals·γ + comparisons·δ) / D_max
20. // Information scope
21. domains ← IdentifyDomains(q, I_set)
22. κ_scope ← |domains| / max_domains
23. // Aggregate
24. κ_agg ← w_lex·κ_lex + w_sem·κ_sem + w_struct·κ_struct + w_reason·κ_reason + w_scope·κ_scope
25. // Mode selection
26. IF κ_agg < θ₁ THEN ρ ← "direct-retrieval"
27. ELIF κ_agg < θ₂ THEN ρ ← "single-step-reasoning"
28. ELIF κ_agg < θ₃ THEN ρ ← "multi-step-decomposition"
29. ELSE ρ ← "deliberative-analysis"
30. RETURN (κ_lex, κ_sem, κ_struct, κ_reason, κ_scope), κ_agg, ρ
7.5 Query Rewriting and Expansion#
Query rewriting transforms the raw query into one or more semantically enriched variants that improve retrieval recall and precision. This is not cosmetic reformulation—it is a critical recall amplification mechanism that bridges the vocabulary and conceptual gap between user language and corpus language.
7.5.1 Hypothetical Document Embedding (HyDE) Generation#
7.5.1.1 Conceptual Foundation#
HyDE inverts the retrieval problem: instead of matching the query against documents, the system generates a hypothetical document that would ideally answer the query, then retrieves real documents similar to this hypothetical answer.
Formal definition:
d_hyp = LLM(PromptHyDE(q)), v_hyde = Encode(d_hyp), D_retrieved = kNN(index, v_hyde, k)
7.5.1.2 Why HyDE Works#
The embedding of a well-formed answer paragraph occupies a different (often more precise) region of embedding space than a short question. HyDE exploits this observation:
sim(Encode(d_hyp), Encode(d_rel)) > sim(Encode(q), Encode(d_rel)) on average over relevant documents d_rel
under the assumption that the hypothetical answer shares more lexical and structural features with real answers than the original question does.
7.5.1.3 Multi-HyDE for Ambiguous Queries#
For queries with high semantic ambiguity (κ_sem > θ_ambig), generate multiple hypothetical documents to cover distinct interpretations:
{d_hyp^(1), …, d_hyp^(m)} = LLM(PromptHyDE(q); temperature = T_diverse)
where T_diverse is an elevated temperature parameter encouraging diversity. The final retrieval set is the union of results from all hypothetical embeddings:
D_retrieved = ⋃_{j=1}^{m} kNN(index, Encode(d_hyp^(j)), k)
7.5.1.4 HyDE Quality Control#
A hypothetical document may hallucinate facts. This is acceptable because HyDE uses the document only as an embedding probe, not as an answer. However, a wildly off-topic hypothesis degrades retrieval. We apply a relevance gate:
accept d_hyp iff CosineSim(Encode(q), Encode(d_hyp)) > θ_HyDE_min
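The gate plus the raw-query fallback of Algorithm 7.4 can be sketched as follows; the 2-d vectors stand in for a real encoder's embeddings, and `theta_min` is an illustrative threshold:

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def hyde_probes(q_vec, hypothesis_vecs, theta_min=0.3):
    """Relevance gate for HyDE: keep a hypothetical-document embedding only
    if it stays close enough to the query embedding; if every hypothesis is
    rejected, fall back to the raw query embedding (provenance-tagged)."""
    kept = [(v, "HyDE") for v in hypothesis_vecs
            if cosine(q_vec, v) > theta_min]
    return kept if kept else [(q_vec, "fallback_raw")]

q_vec = np.array([1.0, 0.0])
probes = hyde_probes(q_vec, [np.array([0.9, 0.4]),    # on-topic: kept
                             np.array([-1.0, 0.1])])  # off-topic: rejected
```

Because each probe carries a provenance tag, downstream stages can weight or audit HyDE-derived results separately from raw-query results.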
7.5.1.5 Pseudo-Algorithm: HyDE Generation#
ALGORITHM 7.4: HyDEGenerate(q, F_p, κ_sem)
────────────────────────────────────────────
Input:
q — contextualized query
F_p — pragmatic frame
κ_sem — semantic ambiguity score
Output:
V_hyde — set of hypothetical document embeddings with provenance
1. num_hypotheses ← IF κ_sem > θ_ambig THEN m_max ELSE 1
2. temperature ← IF num_hypotheses > 1 THEN T_diverse ELSE T_standard
3. V_hyde ← ∅
4. FOR j = 1 TO num_hypotheses DO
5. prompt ← CompileHyDEPrompt(q, F_p, j)
6. // "Write a detailed passage that answers the following question: {q}"
7. d_hyp_j ← LLM.Generate(prompt, temperature=temperature, max_tokens=256)
8. v_hyp_j ← Encode(d_hyp_j)
9.
10. // Relevance gate
11. IF CosineSim(Encode(q), v_hyp_j) > θ_HyDE_min THEN
12. V_hyde ← V_hyde ∪ {(v_hyp_j, provenance="HyDE", source_query=q)}
13. ELSE
14. Log("HyDE hypothesis rejected: low relevance", j)
15. END IF
16. END FOR
17. IF |V_hyde| = 0 THEN
18. V_hyde ← {(Encode(q), provenance="fallback_raw", source_query=q)}
19. END IF
20. RETURN V_hyde
7.5.2 Synonym Expansion, Ontological Enrichment, and Domain Terminology Mapping#
7.5.2.1 Synonym Expansion#
Synonym expansion augments the query with lexical variants to improve recall against documents using different terminology for the same concept.
Controlled expansion via ontology lookup:
Expand(q) = q ∪ {t′ : t ∈ Terms(q), (t, t′) ∈ O_dom}
where O_dom is a domain-scoped synonym ontology. The domain constraint prevents incorrect expansions (e.g., "Java" → "coffee" in a software engineering context).
7.5.2.2 Ontological Enrichment#
Beyond synonyms, ontological enrichment adds hypernyms (generalization), hyponyms (specialization), meronyms (part-of), and related concepts from a domain knowledge graph KG:
Enrich(q) = {(t′, r, w_r) : t ∈ Terms(q), (t, r, t′) ∈ KG}
where r is the relation type and w_r is a weight reflecting the utility of the relation for retrieval enrichment.
Enrichment budget: To prevent query explosion, limit the total enrichment to B_enrich additional terms:
|Enrich(q)| ≤ B_enrich
selecting the top-B_enrich enrichments by weight.
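Budgeted selection is a sort-and-truncate over weighted candidates. The terms, relation types, and weights below are illustrative:

```python
def budgeted_enrichment(candidates, budget):
    """Select the top-B enrichment terms by relation weight so expansion
    stays within the enrichment budget. Each candidate is a
    (term, relation, weight) triple."""
    ranked = sorted(candidates, key=lambda c: c[2], reverse=True)
    return ranked[:budget]

# Hypothetical knowledge-graph neighbors of a query term, with utility weights.
candidates = [
    ("container",     "hypernym", 0.6),
    ("pod",           "hyponym",  0.9),
    ("scheduler",     "related",  0.4),
    ("control-plane", "meronym",  0.7),
]
selected = budgeted_enrichment(candidates, budget=2)
# Keeps the two highest-weight enrichments: "pod", then "control-plane".
```

The discarded low-weight terms are exactly the ones most likely to dilute retrieval scoring if expansion were unbounded.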
7.5.2.3 Domain Terminology Mapping#
In enterprise contexts, users and documents may use different terminologies for identical concepts. A terminology mapping layer maintains bidirectional mappings:
M_term : T_user ↔ T_corpus
mapping user vocabulary to corpus vocabulary and back.
These mappings are populated from:
- Organizational glossaries.
- Automatically mined term co-occurrence patterns.
- Human-curated correction memories (§7.11, Theory of Mind).
7.5.2.4 Weighted Query Formulation#
After expansion and enrichment, the rewritten query is a weighted bag of terms:
q′ = {(t, w_orig) : t ∈ Terms(q)} ∪ {(t′, w_syn) : t′ ∈ Expand(q)} ∪ {(t″, w_onto) : t″ ∈ Enrich(q)}
where original terms receive weight w_orig, synonyms receive w_syn, and ontological enrichments receive w_onto, with w_orig > w_syn > w_onto calibrated to prevent enrichment terms from dominating retrieval scoring.
7.5.3 Ellipsis Resolution and Anaphora Tracking in Multi-Turn Queries#
7.5.3.1 The Multi-Turn Problem#
In multi-turn interactions, queries are frequently incomplete:
- Anaphora: "What about its latency?" — "its" refers to a system mentioned three turns ago.
- Ellipsis: "And for production?" — omits the entire predicate; the full query is "What is the deployment configuration for production?"
- Deictic reference: "Show me that error" — "that" refers to an error displayed in the UI or mentioned in a prior response.
7.5.3.2 Formal Resolution#
We define resolution as a function that produces a self-contained query from a context-dependent one:
Resolve : (q_raw, H, M_session) → q_resolved
Resolution via co-reference chain construction:
Let C = {c_1, …, c_m} be the set of co-reference chains extracted from H. For each unresolved reference r in q_raw:
antecedent(r) = argmax_{a ∈ A(r)} [λ_rec·recency(a) + λ_sal·salience(a) + λ_sem·SemanticFit(a, q_raw, r)]
where A(r) is the set of candidate antecedents for reference r, and the scoring function balances recency (more recent mentions preferred), salience (topically central entities preferred), and semantic fit (the antecedent must be semantically compatible with the query's predicate structure).
7.5.3.3 Pseudo-Algorithm: Multi-Turn Query Resolution#
ALGORITHM 7.5: ResolveMultiTurnQuery(q_raw, H, M_session)
──────────────────────────────────────────────────────────
Input:
q_raw — current turn query (potentially incomplete)
H — conversational history [(q₁,r₁,t₁), ..., (qₙ,rₙ,tₙ)]
M_session — session memory (entities, topics, focal objects)
Output:
q_resolved — fully self-contained query string
references — list of resolved (reference, antecedent, confidence) triples
1. // Detect unresolved references
2. refs ← DetectAnaphoraAndEllipsis(q_raw)
3. IF refs = ∅ AND NOT IsElliptical(q_raw) THEN
4. RETURN q_raw, []
5. END IF
6. // Build entity salience model from history
7. entity_scores ← {}
8. FOR i = |H| DOWNTO max(1, |H| - window_size) DO
9. entities_i ← ExtractEntities(H[i].query ⊕ H[i].response)
10. FOR EACH e IN entities_i DO
11. recency ← DecayFunction(|H| - i)
12. salience ← MentionCount(e, H) * TopicalCentrality(e, H)
13. entity_scores[e] ← λ_rec · recency + λ_sal · salience
14. END FOR
15. END FOR
16. // Resolve each reference
17. references ← []
18. q_resolved ← q_raw
19. FOR EACH ref IN refs DO
20. candidates ← FilterByType(entity_scores, ref.expected_type)
21. FOR EACH (e, score) IN candidates DO
22. score ← score + λ_sem · SemanticFit(e, q_raw, ref.position)
23. END FOR
24. best ← ArgMax(candidates, by=score)
25. confidence ← Softmax(candidates.scores)[best]
26. IF confidence > θ_resolve THEN
27. q_resolved ← Substitute(q_resolved, ref, best.entity)
28. references ← references ∪ {(ref, best.entity, confidence)}
29. ELSE
30. // Cannot resolve confidently; mark for clarification
31. q_resolved ← MarkAmbiguous(q_resolved, ref)
32. references ← references ∪ {(ref, NULL, confidence)}
33. END IF
34. END FOR
35. // Handle ellipsis: reconstruct omitted predicate
36. IF IsElliptical(q_raw) THEN
37. template ← FindMostRecentPredicateTemplate(H)
38. q_resolved ← MergeEllipticalQuery(q_resolved, template)
39. END IF
40. RETURN q_resolved, references
7.6 Query Decomposition Strategies#
7.6.1 Decomposition as Structural Transformation#
Complex queries encode multiple information needs with varying dependency structures. Decomposition transforms a monolithic query into a sub-query graph G_sq = (V, E) where:
- Each vertex v_i ∈ V is a self-contained sub-query with its own intent, scope, and routing target.
- Each edge (v_i, v_j) ∈ E represents a data dependency (the result of v_i is required as input to v_j).
The topology of G_sq determines the decomposition strategy.
7.6.2 Parallel-Decomposition: Independent Sub-Queries for Fan-Out Retrieval#
7.6.2.1 Definition#
Parallel decomposition applies when the original query comprises multiple independent information needs:
q ≡ {q_1, q_2, …, q_n} with E = ∅ (no data dependencies between sub-queries)
Example: "What are the current CPU metrics for the staging cluster, and what does the latest RFC say about the new auth protocol?"
This decomposes into two independent sub-queries:
- Retrieve CPU metrics → routed to metrics API.
- Retrieve RFC content → routed to document store.
7.6.2.2 Execution Model#
All sub-queries execute concurrently with independent deadlines:
T_parallel = max_i T(q_i)
This is optimal when sub-queries target different sources with independent latency profiles.
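Fan-out with per-sub-query deadlines maps naturally onto a thread pool. A sketch in which the handlers and sleep times are toy stand-ins for real source calls:

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutTimeout
import time

def fan_out(sub_queries, deadline_s):
    """Parallel decomposition: execute independent sub-queries concurrently.
    Total latency tracks the slowest sub-query (max_i T(q_i)); a sub-query
    that misses its deadline yields None instead of blocking the others."""
    def run(sq):
        time.sleep(sq["latency"])              # simulated source latency
        return sq["name"], f"result:{sq['name']}"

    with ThreadPoolExecutor(max_workers=len(sub_queries)) as ex:
        futures = {ex.submit(run, sq): sq for sq in sub_queries}
        out = {}
        for fut, sq in futures.items():
            try:
                name, res = fut.result(timeout=deadline_s)
                out[name] = res
            except FutTimeout:
                out[sq["name"]] = None         # deadline miss, partial result
        return out

subs = [{"name": "metrics", "latency": 0.01},   # e.g., metrics API
        {"name": "rfc",     "latency": 0.02}]   # e.g., document store
results = fan_out(subs, deadline_s=1.0)
```

Both sub-queries complete well inside the deadline here; shrinking `deadline_s` below a sub-query's latency demonstrates the partial-result path.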
7.6.2.3 Independence Verification#
Before committing to parallel execution, verify independence:
Entities(q_i) ∩ Entities(q_j) = ∅ and Predicates(q_i) ∩ Predicates(q_j) = ∅ for all i ≠ j
If entities or predicates overlap, the sub-queries may have hidden dependencies requiring sequential or conditional treatment.
7.6.3 Sequential-Decomposition: Dependency-Ordered Sub-Query Chains#
7.6.3.1 Definition#
Sequential decomposition applies when sub-queries form a chain of dependencies:
q_1 → q_2 → … → q_n, where q_{i+1} requires result(q_i)
Example: "Find the service that had the highest error rate last week, then retrieve its deployment history, and identify which config change caused the regression."
This decomposes into:
- q₁: Identify the highest-error-rate service → metrics query.
- q₂: Retrieve deployment history for result(q₁) → ops database.
- q₃: Identify the causal config change from result(q₂) → analytical reasoning.
7.6.3.2 Execution Model#
Each sub-query receives the result of its predecessor as context:
result_i = Execute(q_i, context = result_{i−1}), T_sequential = Σ_{i=1}^{n} T(q_i)
7.6.3.3 Failure Propagation#
In sequential chains, failure at step i prevents all downstream steps. The system must implement:
- Partial result reporting: Return results from q_1, …, q_{i−1} with an explicit failure annotation at q_i.
- Retry with relaxed constraints: Attempt q_i with broader retrieval parameters.
- User escalation: If q_i fails after retry, request human guidance before proceeding.
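The partial-result path can be sketched as a chain runner that preserves completed work instead of discarding it on failure. The three step functions mirror the example chain above and are, of course, toy stand-ins:

```python
def run_chain(steps, ctx=None):
    """Sequential decomposition: each step consumes its predecessor's
    result. On failure, return the partial results plus an explicit failure
    annotation rather than discarding completed work."""
    results, prev = [], ctx
    for name, fn in steps:
        try:
            prev = fn(prev)
            results.append((name, prev))
        except Exception as exc:
            return {"partial": results, "failed_at": name, "error": str(exc)}
    return {"partial": results, "failed_at": None, "error": None}

def find_service(_):
    return "checkout-svc"                       # q1: metrics query

def fetch_history(svc):
    return [f"{svc}@v41", f"{svc}@v42"]         # q2: ops database

def diagnose(history):
    raise RuntimeError("insufficient evidence") # q3: analysis fails

report = run_chain([("q1", find_service),
                    ("q2", fetch_history),
                    ("q3", diagnose)])
```

The annotated report is what enables the retry and user-escalation policies: the caller knows exactly which step failed and what evidence was already gathered.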
7.6.4 Conditional-Decomposition: Branch-on-Evidence Sub-Query Trees#
7.6.4.1 Definition#
Conditional decomposition applies when the next sub-query depends on the content of a prior result:
G_sq = (V, E, Π)
where Π assigns a Boolean predicate π_e to each edge. An edge e = (v_i, v_j) is traversed only if π_e(result(v_i)) = true.
Example: "Check if the service is using the legacy auth module. If so, find the migration guide. If not, verify it's using the new auth SDK and check for known issues."
This decomposes into a tree:
v₁: Check auth module type
├── [legacy=true] → v₂: Find migration guide
└── [legacy=false] → v₃: Verify new auth SDK
        └── v₄: Check known issues
7.6.4.2 Execution Model#
Conditional decomposition requires an evaluation step between sub-queries:
execute v_i → evaluate π_e(result(v_i)) → execute the selected child
The total latency is:
T_conditional = Σ_{v on executed path} [T(v) + T_eval]
where T_eval is the time to evaluate the branching predicate.
7.6.4.3 Speculative Execution#
To reduce latency, the system may speculatively execute both branches of a conditional node in parallel and discard the unused branch's results:
T_speculative = T(v_i) + max(T(v_left), T(v_right))
This trades compute cost for latency reduction. The decision to speculate is governed by:
Speculate(v_i) ⟺ c_extra ≤ B_cost_slack ∧ H(π_e) > θ_spec
The second condition ensures speculation is worthwhile only when the branch outcome is genuinely uncertain.
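The two-condition decision rule can be sketched with binary entropy as the uncertainty measure. The threshold value and cost figures are illustrative:

```python
import math

def should_speculate(p_branch, extra_cost, cost_slack, theta_h=0.9):
    """Speculatively execute both branches only when (a) the extra compute
    fits the remaining cost budget and (b) the branch outcome is genuinely
    uncertain, measured by the binary entropy of the branch probability
    (which already lies in [0, 1] when using log base 2)."""
    h = 0.0
    for p in (p_branch, 1.0 - p_branch):
        if p > 0:
            h -= p * math.log2(p)
    return extra_cost <= cost_slack and h > theta_h

# Branch outcome is nearly certain: entropy ≈ 0.19, so skip speculation.
near_certain = should_speculate(0.97, extra_cost=1, cost_slack=10)
# Branch outcome is a near coin flip: entropy ≈ 0.99, so speculate.
coin_flip = should_speculate(0.55, extra_cost=1, cost_slack=10)
```

When the predicate's outcome is predictable, executing only the likely branch wins on cost with little expected latency penalty; when it is a coin flip, speculation buys the full `T_eval + T(v_chosen)` savings.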
7.6.5 Unified Decomposition Framework#
ALGORITHM 7.6: DecomposeQuery(q, I_set, F_p, κ, ρ, S)
──────────────────────────────────────────────────────
Input:
q — resolved, enriched query
I_set — classified intents
F_p — pragmatic frame
κ — cognitive load vector
ρ — reasoning mode
S — system state (available tools, schemas)
Output:
G_sq — sub-query DAG (vertices, edges, predicates, routing hints)
1. IF ρ = "direct-retrieval" THEN
2. // No decomposition needed
3. G_sq ← SingleNode(q, I_set[0], routing=DefaultRoute(q, S))
4. RETURN G_sq
5. END IF
6. // Extract decomposition candidates
7. intents ← I_set
8. entities ← ExtractEntities(q)
9. relations ← ExtractRelations(q, entities)
10. conditions ← ExtractConditionals(q)
11. // Determine decomposition topology
12. IF |intents| > 1 AND AllIndependent(intents, entities) THEN
13. strategy ← "parallel"
14. ELIF |conditions| > 0 THEN
15. strategy ← "conditional"
16. ELIF HasChainedDependencies(relations) THEN
17. strategy ← "sequential"
18. ELSE
19. strategy ← "parallel" // default: independent sub-queries
20. END IF
21. // Generate sub-queries
22. SWITCH strategy:
23. CASE "parallel":
24. sub_queries ← GenerateParallelSubQueries(q, intents, entities)
25. G_sq ← DAG(nodes=sub_queries, edges=∅)
26.
27. CASE "sequential":
28. chain ← OrderByDependency(intents, relations)
29. sub_queries ← GenerateSequentialSubQueries(q, chain)
30. edges ← {(sq_i, sq_{i+1}) : i = 1..|sub_queries|-1}
31. G_sq ← DAG(nodes=sub_queries, edges=edges)
32.
33. CASE "conditional":
34. tree ← BuildConditionalTree(q, intents, conditions, entities)
35. sub_queries ← tree.nodes
36. edges ← tree.edges // includes predicate functions
37. G_sq ← DAG(nodes=sub_queries, edges=edges, predicates=tree.predicates)
38. // Annotate each sub-query with routing hints
39. FOR EACH sq IN G_sq.nodes DO
40. sq.routing ← InferRoutingHints(sq, S)
41. sq.deadline ← AssignDeadline(sq, ρ, strategy)
42. sq.verification ← SelectVerificationStrategy(sq, κ)
43. END FOR
44. // Validate: semantic preservation invariant
45. coverage ← ComputeCoverage(G_sq, q, I_set)
46. IF coverage < θ_coverage THEN
47. missing ← IdentifyMissingIntents(G_sq, I_set)
48. G_sq ← AddSubQueries(G_sq, missing)
49. END IF
50. RETURN G_sq
7.6.6 Decomposition Depth Bounding#
Unbounded decomposition leads to exponential sub-query proliferation. We enforce:
|V(G_sq)| ≤ N_max and depth(G_sq) ≤ D_depth
where N_max is the maximum number of sub-queries (typically 8–12 for production systems) and D_depth is the maximum sequential depth (typically 4–5). These bounds are informed by the token budget:
N_max · t̄_sq ≤ B_tok − B_synth
where t̄_sq is the expected token cost per sub-query and B_synth is the token budget reserved for final answer synthesis.
7.7 Schema-Aware Query Routing: Matching Sub-Queries to Source Type, Latency Tier, and Authority Level#
7.7.1 The Routing Problem#
After decomposition, each sub-query must be directed to the most appropriate data source. This is a multi-objective assignment problem that must optimize for:
- Schema compatibility: Can the source answer this type of question?
- Authority level: How trustworthy is this source for this domain?
- Latency tier: Can the source respond within the sub-query's deadline?
- Cost: What is the computational/monetary cost of querying this source?
- Freshness: Does the source have sufficiently recent data?
7.7.2 Source Registry#
The system maintains a typed source registry $R_S = \{(s, \sigma, \alpha, \lambda, c, \phi)\}$:
| Symbol | Meaning | Type |
|---|---|---|
| $s$ | Source identifier | String |
| $\sigma$ | Schema descriptor (query types supported, entity domains) | Structured |
| $\alpha$ | Authority score | $[0, 1]$ |
| $\lambda$ | Latency profile (P50, P90, P99) | Distribution |
| $c$ | Cost per query | $\mathbb{R}_{\ge 0}$ |
| $\phi$ | Freshness (data staleness bound) | Duration |
7.7.3 Routing Score Function#
For each sub-query $sq$ and candidate source $s \in R_S$, compute a routing score:
$$\text{Score}(sq, s) = w_\sigma\, \text{SchemaMatch}(\sigma, sq) + w_\alpha\, \alpha + w_\lambda\, \Pr[\lambda \le \text{deadline}(sq)] - w_c\, c - w_\phi\, \text{Staleness}(\phi, sq)$$
where the non-negative weights $w_\sigma, w_\alpha, w_\lambda, w_c, w_\phi$ encode the relative importance of schema compatibility, authority, latency compliance, cost, and freshness for the deployment.
7.7.4 Optimal Routing Assignment#
The routing problem is formulated as a constrained assignment over indicator variables $x_{sq,s} \in \{0, 1\}$:
$$\max_{x} \sum_{sq \in V(G_{sq})} \sum_{s \in R_S} \text{Score}(sq, s)\, x_{sq,s}$$
subject to:
$$\sum_{s} x_{sq,s} \ge 1 \;\; \forall sq, \qquad \sum_{sq} \sum_{s} c_s\, x_{sq,s} \le B_{cost}, \qquad x_{sq,s} = 0 \text{ whenever } \Pr[\lambda_s \le \text{deadline}(sq)] < \theta_\lambda$$
For small numbers of sub-queries and sources, as is typical in practice, this is solvable by greedy assignment or linear relaxation within microsecond latency budgets.
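The greedy variant can be sketched compactly. The following Python is a hedged illustration of the selection loop in Algorithm 7.7; the dict-based source representation, the `redundancy` key, and the `score_fn` callback are assumptions of this sketch, not a prescribed interface:

```python
def greedy_route(subqueries, sources, score_fn, cost_budget):
    """Assign each sub-query its best affordable sources, greedily by score.

    `sources` is a list of dicts with 'name' and 'cost' keys; `score_fn(sq, src)`
    returns a routing score (higher is better) and a non-positive value when
    the schema does not match.
    """
    routing, remaining = {}, cost_budget
    for sq in subqueries:
        candidates = sorted(
            (s for s in sources if score_fn(sq, s) > 0),
            key=lambda s: score_fn(sq, s), reverse=True)
        chosen = []
        for s in candidates[: sq.get("redundancy", 1)]:
            if s["cost"] <= remaining:  # skip sources the budget cannot cover
                chosen.append(s["name"])
                remaining -= s["cost"]
        routing[sq["id"]] = chosen or ["FALLBACK"]  # degraded execution path
    return routing, remaining

sources = [{"name": "kb", "cost": 1.0}, {"name": "web", "cost": 3.0}]
score = lambda sq, s: {"kb": 0.9, "web": 0.5}[s["name"]]
routing, rem = greedy_route(
    [{"id": "sq1"}, {"id": "sq2", "redundancy": 2}], sources, score, 4.0)
```

Note that, as in the pseudocode, a sub-query with no affordable source falls through to an explicit fallback rather than failing silently.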
7.7.5 Multi-Source Redundancy#
For high-stakes sub-queries (where retrieval failure is costly), the system may route to multiple sources and reconcile results:
$$\text{route}(sq) = \text{TopK}\big(\{\, s \in R_S : \text{Score}(sq, s) > 0 \,\},\; k = \text{RedundancyFactor}(sq)\big)$$
7.7.6 Pseudo-Algorithm: Schema-Aware Routing#
ALGORITHM 7.7: RouteSubQueries(G_sq, R_S, B_cost)
──────────────────────────────────────────────────
Input:
G_sq — sub-query DAG from decomposition
R_S — source registry
B_cost — total cost budget
Output:
routing — map from sub-query → list of (source, priority, deadline)
1. routing ← {}
2. cost_remaining ← B_cost
3. FOR EACH sq IN TopologicalSort(G_sq) DO
4. candidates ← []
5. FOR EACH (s, σ, α, λ, c, φ) IN R_S DO
6. IF SchemaMatch(σ, sq) > 0 THEN
7. score ← ComputeRoutingScore(sq, s, σ, α, λ, c, φ)
8. candidates ← candidates ∪ {(s, score, c)}
9. END IF
10. END FOR
11.
12. // Sort by score descending
13. candidates ← SortDescending(candidates, by=score)
14.
15. // Select top-k based on redundancy factor
16. k ← RedundancyFactor(sq)
17. selected ← []
18. FOR i = 1 TO min(k, |candidates|) DO
19. IF cost_remaining ≥ candidates[i].cost THEN
20. selected ← selected ∪ {candidates[i]}
21. cost_remaining ← cost_remaining - candidates[i].cost
22. END IF
23. END FOR
24.
25. IF |selected| = 0 THEN
26. // No affordable source; flag for degraded execution
27. selected ← [{source=FALLBACK, priority=LOW, deadline=sq.deadline}]
28. Log("WARNING: Sub-query routed to fallback", sq)
29. END IF
30.
31. routing[sq] ← selected
32. END FOR
33. RETURN routing
7.8 Multi-Modal Query Understanding: Interpreting Mixed Text, Image, Code, and Data Table Inputs#
7.8.1 Multi-Modal Query Representation#
Modern agentic systems receive inputs spanning multiple modalities. A multi-modal query is represented as:
$$q_{mm} = \{(m_i, \tau_i, x_i)\}_{i=1}^{n}$$
where $\tau_i \in \{\text{text}, \text{image}, \text{code}, \text{table}\}$ is the modality type and $x_i$ is the raw content.
7.8.2 Modality-Specific Encoding#
Each modality requires a specialized encoder to produce a unified semantic representation:
- Text: Transformer-based language model encoder.
- Image: Vision transformer (ViT) or CLIP visual encoder.
- Code: Code-specialized encoder (e.g., CodeBERT, StarCoder embeddings) with AST-aware tokenization.
- Table: Structural encoder preserving row-column relationships; schema-aware embedding.
7.8.3 Cross-Modal Fusion#
The multi-modal query embedding is computed via cross-modal attention:
$$v_{fused} = \text{CrossAttention}\big(Q = v_{text},\; K = V = \{v_m : m \ne \text{text}\}\big)$$
where text serves as the query modality and non-text inputs serve as context modalities. This preserves the primacy of the textual query intent while grounding it in visual, structural, or code evidence.
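A minimal single-head sketch of this fusion step, assuming plain NumPy vectors; learned projections, multi-head structure, and the residual-style blend at the end are simplifying assumptions of this illustration:

```python
import numpy as np

def cross_modal_fuse(v_text: np.ndarray, context: list[np.ndarray]) -> np.ndarray:
    """Fuse a text embedding with non-text embeddings via single-head attention.

    Text is the query; image/code/table embeddings act as keys and values,
    so the result stays anchored to the textual intent.
    """
    if not context:
        return v_text
    K = np.stack(context)                        # (n_ctx, d)
    scores = K @ v_text / np.sqrt(v_text.size)   # scaled dot-product
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                     # softmax over context items
    attended = weights @ K                       # weighted context summary
    return (v_text + attended) / 2.0             # blend text with grounded context

fused = cross_modal_fuse(np.ones(4), [np.zeros(4), np.ones(4)])
```

The fused vector stays closer to the text embedding than a plain mean-pool would, which is the property the prose above asks for.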
7.8.4 Modality-Specific Intent Extraction#
Different modalities contribute different types of information to intent resolution:
| Modality | Contribution to Intent |
|---|---|
| Text | Explicit intent statement, constraints, questions |
| Image | Visual evidence (error screenshots, architecture diagrams, UI states) |
| Code | Implementation context, error locations, API signatures |
| Table | Data context, value ranges, anomaly patterns |
The system must extract modality-specific intent sets and reconcile them into a single set:
$$I_{mm} = \text{Reconcile}\big(\{\, I_\tau : \tau \in \text{modalities}(q_{mm}) \,\}\big)$$
Conflict resolution: When modalities suggest conflicting intents (e.g., text says "this works fine" but the screenshot shows an error), the system flags the conflict and prioritizes the higher-evidence-weight modality or requests clarification.
7.8.5 Pseudo-Algorithm: Multi-Modal Query Understanding#
ALGORITHM 7.8: MultiModalQueryUnderstand(q_mm, H, M, S)
───────────────────────────────────────────────────────
Input:
q_mm — multi-modal query: list of (modality, type, content)
H — conversational history
M — memory state
S — system state
Output:
q_unified — unified query representation
I_mm — multi-modal intent set
1. embeddings ← {}
2. modality_intents ← {}
3. FOR EACH (m, type, content) IN q_mm DO
4. // Modality-specific encoding
5. v_i ← Encode(content, encoder=SelectEncoder(type))
6. embeddings[type] ← embeddings[type] ∪ {v_i}
7.
8. // Modality-specific intent extraction
9. SWITCH type:
10. CASE "text":
11. intents_i ← TextIntentClassify(content, H)
12. CASE "image":
13. description ← ImageCaption(content)
14. entities_visual ← VisualEntityExtract(content)
15. intents_i ← InferIntentFromVisual(description, entities_visual)
16. CASE "code":
17. ast ← ParseAST(content)
18. errors ← DetectCodeIssues(content, ast)
19. intents_i ← InferIntentFromCode(content, ast, errors)
20. CASE "table":
21. schema ← InferTableSchema(content)
22. anomalies ← DetectAnomalies(content, schema)
23. intents_i ← InferIntentFromTable(schema, anomalies)
24. modality_intents[type] ← intents_i
25. END FOR
26. // Cross-modal fusion
27. IF "text" IN embeddings THEN
28. v_fused ← CrossAttention(embeddings["text"], AllOtherEmbeddings(embeddings))
29. ELSE
30. v_fused ← MeanPool(AllEmbeddings(embeddings))
31. END IF
32. // Reconcile intents across modalities
33. I_mm ← ReconcileIntents(modality_intents)
34. conflicts ← DetectConflicts(modality_intents)
35. IF |conflicts| > 0 THEN
36. I_mm ← AnnotateConflicts(I_mm, conflicts)
37. END IF
38. q_unified ← (v_fused, I_mm, embeddings, modality_intents)
39. RETURN q_unified, I_mm
7.9 Clarification Detection and Active Query Refinement Protocols#
7.9.1 When to Ask, Not Answer#
A system that always attempts to answer is not robust—it is reckless. Clarification detection determines when the system's confidence in its interpretation is insufficient to justify execution, and triggers a structured interaction to resolve the ambiguity.
7.9.2 Clarification Triggers#
We define a clarification trigger function as a disjunction of signals drawn from every pipeline stage:
$$\text{Trigger}(q) = \big[H(I) > \theta_H\big] \lor \big[\exists p : \text{contradicted}(p)\big] \lor \big[\text{Amb}(q) > \theta_A\big] \lor \big[\text{unresolved}(q) \ne \emptyset\big] \lor \big[|I| > N_I\big] \lor \big[\text{coverage}(G_{sq}) < \theta_{coverage}\big] \lor \big[\exists sq : \text{route}(sq) = \emptyset\big]$$
Each trigger corresponds to a specific pipeline failure mode:
| Trigger | Pipeline Stage | Failure Mode |
|---|---|---|
| High intent entropy | §7.2 | Cannot determine what the user wants |
| Contradicted presupposition | §7.3 | Query premises are false |
| High semantic ambiguity | §7.4 | Query maps to too many interpretations |
| Unresolved reference | §7.5.3 | Cannot determine referent in multi-turn |
| Too many intents | §7.2 | Query is too complex to serve atomically |
| Low coverage | §7.6 | Decomposition lost information |
| No viable route | §7.7 | No source can serve a sub-query |
7.9.3 Clarification Quality: Minimal, Discriminative, Actionable#
A good clarification request satisfies three properties:
- Minimal: Asks the fewest questions necessary to resolve the ambiguity.
- Discriminative: Each possible answer leads to a distinct execution path.
- Actionable: The expected answers map directly to pipeline parameters.
Formal minimality constraint:
$$n^{*} = \min\{\, n : H(I \mid a_1, \ldots, a_n) < \theta_{resolved} \,\}$$
This is the minimum number of clarifying questions $n^{*}$ such that the residual intent entropy, conditioned on the expected answers $a_1, \ldots, a_n$, drops below the resolution threshold.
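The minimality constraint suggests a greedy loop: keep asking the question that most reduces residual entropy until the threshold is met. A simplified Python sketch, which assumes the per-question residual entropies are already estimated (a real system would derive them from the intent distribution and candidate answers):

```python
import math

def intent_entropy(probs):
    """Shannon entropy (bits) of an intent probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def select_questions(questions, intent_probs, threshold):
    """Greedily pick clarifying questions until residual entropy < threshold.

    `questions` maps question -> estimated residual entropy after asking it.
    """
    asked = []
    residual = intent_entropy(intent_probs)
    # Ask highest-impact (lowest residual entropy) questions first.
    for q, h_after in sorted(questions.items(), key=lambda kv: kv[1]):
        if residual < threshold:
            break
        asked.append(q)
        residual = min(residual, h_after)  # questions never increase entropy here
    return asked, residual

asked, residual = select_questions(
    {"Which environment?": 0.4, "Which service?": 0.9},
    intent_probs=[0.5, 0.25, 0.25], threshold=0.5)
```

Here one question suffices: the starting entropy of 1.5 bits drops below the 0.5-bit threshold after the first answer, so the second question is never asked, satisfying minimality.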
7.9.4 Clarification Generation Strategy#
ALGORITHM 7.9: GenerateClarification(q, I_set, F_p, G_sq, triggers)
───────────────────────────────────────────────────────────────────
Input:
q — processed query
I_set — classified intents
F_p — pragmatic frame
G_sq — sub-query DAG
triggers — set of active clarification triggers
Output:
clarification — structured clarification request, or NULL if none needed
1. IF |triggers| = 0 THEN RETURN NULL
2. questions ← []
3. // Prioritize triggers by impact
4. sorted_triggers ← SortByImpact(triggers)
5. FOR EACH trigger IN sorted_triggers DO
6. SWITCH trigger.type:
7. CASE "high_intent_entropy":
8. // Generate discriminative question
9. top_intents ← TopK(I_set, k=3, by=confidence)
10. q_clar ← FormatDisambiguation(top_intents)
11. // e.g., "Are you asking about X, Y, or Z?"
12. questions ← questions ∪ {q_clar}
13.
14. CASE "contradicted_presupposition":
15. (p, status, evidence) ← trigger.details
16. q_clar ← FormatPresuppositionCorrection(p, evidence)
17. // e.g., "The migration actually succeeded. Did you mean..."
18. questions ← questions ∪ {q_clar}
19.
20. CASE "unresolved_reference":
21. (ref, candidates) ← trigger.details
22. q_clar ← FormatReferenceDisambiguation(ref, candidates)
23. questions ← questions ∪ {q_clar}
24.
25. CASE "no_viable_route":
26. sq ← trigger.sub_query
27. q_clar ← FormatScopeNarrowing(sq)
28. // e.g., "I don't have access to X. Can you provide..."
29. questions ← questions ∪ {q_clar}
30.
31. // Enforce minimality: stop if remaining entropy is low
32. estimated_residual ← EstimateResidualEntropy(I_set, questions)
33. IF estimated_residual < θ_resolved THEN BREAK
34. END FOR
35. clarification ← {
36. questions: questions,
37. context: SummarizeUnderstanding(q, I_set, F_p),
38. options: GenerateOptions(questions), // structured choices where possible
39. fallback_action: DescribeDefaultBehavior(I_set, G_sq)
40. }
41. RETURN clarification
7.9.5 Graceful Degradation When Clarification is Not Possible#
In asynchronous or batch contexts where interactive clarification is unavailable, the system must degrade gracefully:
- Conservative interpretation: Select the highest-confidence intent and explicitly state the assumption.
- Multi-interpretation hedging: Execute the top-$k$ interpretations and present results organized by interpretation.
- Partial execution: Execute only the sub-queries with confident routing, and annotate gaps.
7.10 Cognitive Reasoning Integration: Deductive, Inductive, Abductive, and Analogical Inference Modes#
7.10.1 Reasoning as a First-Class Pipeline Component#
Query understanding is not purely a classification and retrieval problem. Many queries require reasoning over the query itself before retrieval can begin. We formalize four reasoning modes that the pipeline may invoke during query understanding.
7.10.2 Deductive Reasoning#
Definition: From general premises and the specific query, derive necessary conclusions.
Application in query understanding: If organizational policy states "All production deployments require security review" and the user asks "Deploy service X to production," the system deductively infers that a security review check is a prerequisite, even though the user did not mention it.
Formal representation: given premises $\Gamma$ (policies, domain rules) and the query facts, derive every necessary consequence. For the example above:
$$\{\, \forall d.\ \text{prod}(d) \Rightarrow \text{review}(d),\ \text{prod}(X) \,\} \vdash \text{review}(X)$$
7.10.3 Inductive Reasoning#
Definition: From observed patterns, infer general principles that inform query interpretation.
Application: If the user has asked about service latency five times in the past week, each time followed by a deployment change, the system inductively infers that the current latency question is likely a pre-deployment check.
7.10.4 Abductive Reasoning#
Definition: From observed effects, infer the most likely explanatory cause.
Application: The user says "The dashboard is showing weird numbers." Abductive reasoning generates hypotheses about what might be wrong (data pipeline failure, metric misconfiguration, timezone mismatch) and structures retrieval to investigate each hypothesis.
This is structurally equivalent to maximum a posteriori (MAP) inference over a hypothesis space.
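A toy illustration of this MAP ranking in Python; the hypothesis names, priors, and per-observation likelihood tables are invented for the dashboard example and stand in for model-derived estimates:

```python
def rank_hypotheses(hypotheses, observations):
    """Rank explanatory hypotheses by unnormalized posterior: prior x likelihood.

    Each hypothesis carries a prior and a likelihood table; observations
    absent from the table get a small default likelihood.
    """
    def posterior(h):
        score = h["prior"]
        for obs in observations:
            score *= h["likelihood"].get(obs, 0.01)
        return score
    return sorted(hypotheses, key=posterior, reverse=True)

hyps = [
    {"name": "pipeline_failure", "prior": 0.2,
     "likelihood": {"stale_metrics": 0.9, "gaps_in_series": 0.8}},
    {"name": "timezone_mismatch", "prior": 0.3,
     "likelihood": {"stale_metrics": 0.1, "gaps_in_series": 0.2}},
]
ranked = rank_hypotheses(hyps, ["stale_metrics", "gaps_in_series"])
```

The top-ranked hypothesis then drives retrieval: the system investigates the most probable cause first, with the remainder as fallback investigation paths.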
7.10.5 Analogical Reasoning#
Definition: Map the current query to a structurally similar past query and transfer the solution strategy.
Application: If the system previously solved "Why is service A slow?" by checking database connection pools, and now receives "Why is service B slow?", analogical reasoning suggests starting with the same diagnostic pathway.
Structural mapping formalism:
$$\text{sim}(q, q_{past}) = \beta\, s_{struct}(q, q_{past}) + (1 - \beta)\, s_{out}(q, q_{past})$$
where $s_{struct}$ measures structural (relational) similarity and $s_{out}$ measures similarity of expected outcomes.
7.10.6 Reasoning Mode Selection#
The cognitive load estimator (§7.4) and the intent classifier (§7.2) jointly determine which reasoning modes to activate:
$$\mathcal{R}(q) = \{\, r \in \mathcal{M} : \text{relevance}_r(q, \kappa, I) > \theta_r \,\}$$
where $\mathcal{M} = \{\text{deductive}, \text{inductive}, \text{abductive}, \text{analogical}\}$.
| Query Pattern | Primary Mode | Secondary Mode |
|---|---|---|
| Rule-governed task | Deductive | — |
| Recurring diagnostic | Inductive | Analogical |
| Unexplained phenomenon | Abductive | Inductive |
| Novel problem resembling past case | Analogical | Abductive |
| Complex multi-constraint | Deductive | Abductive |
7.10.7 Pseudo-Algorithm: Reasoning Integration#
ALGORITHM 7.10: IntegrateReasoningModes(q, I_set, F_p, κ, M)
─────────────────────────────────────────────────────────────
Input:
q — resolved query
I_set — intent set
F_p — pragmatic frame
κ — cognitive load
M — full memory state (episodic, semantic, procedural)
Output:
q_augmented — query augmented with reasoning-derived context
hypotheses — ranked list of reasoning-generated hypotheses
1. active_modes ← SelectReasoningModes(q, κ, I_set)
2. reasoning_outputs ← []
3. IF "deductive" IN active_modes THEN
4. rules ← RetrieveApplicableRules(q, M.procedural)
5. deductions ← ApplyRules(q, rules, M.semantic)
6. reasoning_outputs ← reasoning_outputs ∪ {("deductive", deductions)}
7. END IF
8. IF "inductive" IN active_modes THEN
9. patterns ← FindPatterns(q, M.episodic, window=30_days)
10. generalizations ← Generalize(patterns)
11. reasoning_outputs ← reasoning_outputs ∪ {("inductive", generalizations)}
12. END IF
13. IF "abductive" IN active_modes THEN
14. observations ← ExtractObservations(q, F_p)
15. hypotheses_abd ← GenerateHypotheses(observations, M.semantic, k=5)
16. ranked ← RankByPosterior(hypotheses_abd, observations)
17. reasoning_outputs ← reasoning_outputs ∪ {("abductive", ranked)}
18. END IF
19. IF "analogical" IN active_modes THEN
20. similar_cases ← FindAnalogousCases(q, M.episodic, threshold=θ_analogy)
21. strategies ← TransferStrategies(similar_cases, q)
22. reasoning_outputs ← reasoning_outputs ∪ {("analogical", strategies)}
23. END IF
24. // Merge reasoning outputs into augmented query
25. q_augmented ← AugmentQuery(q, reasoning_outputs)
26. hypotheses ← ConsolidateHypotheses(reasoning_outputs)
27. RETURN q_augmented, hypotheses
7.11 Theory of Mind Modeling: Inferring User Knowledge State, Expertise Level, and Unstated Goals#
7.11.1 The Necessity of User Modeling#
Two users asking the identical query may need fundamentally different responses. A junior developer asking "What is a connection pool?" needs an explanation. A senior engineer asking the same question in the context of a production incident needs the connection pool configuration for the specific service that is failing. Theory of Mind (ToM) modeling enables the system to infer what the user knows, what they need, and what they have left unstated.
7.11.2 User State Model#
We define the user state as a structured representation:
$$U = (\xi,\; K_u,\; G_{stated},\; G_{unstated},\; P_{pref},\; C_{ctx})$$
| Component | Definition | Source |
|---|---|---|
| $\xi$ | Expertise vector across domains | Profile, interaction history |
| $K_u$ | Estimated knowledge set (concepts the user knows) | Prior interactions, role |
| $G_{stated}$ | Explicitly stated goals | Current query |
| $G_{unstated}$ | Inferred unstated goals | Pragmatic analysis, patterns |
| $P_{pref}$ | Response preferences (format, depth, verbosity) | History, explicit settings |
| $C_{ctx}$ | Current operational context (task, deadline, stress) | Session signals |
7.11.3 Expertise Estimation#
We estimate expertise per domain $d$ via a Bayesian update:
$$P(\xi_d \mid q_{1:t}) \propto P(q_t \mid \xi_d)\, P(\xi_d \mid q_{1:t-1})$$
The likelihood $P(q_t \mid \xi_d)$ is modeled by observing:
- Vocabulary sophistication: Use of domain-specific terminology increases the estimated expertise $\xi_d$.
- Query specificity: Specific, well-scoped queries indicate higher expertise.
- Follow-up patterns: Users who ask progressive, building questions are more expert than those who repeat basics.
- Tool usage: Expert users invoke advanced tools; novices use basic ones.
Bayesian update rule (simplified):
$$\xi_d \leftarrow \xi_d + \eta\, \big(g_d(q) - \xi_d\big)$$
where $\eta$ is a learning rate and $g_d(q)$ extracts a per-domain expertise indicator from the current query.
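The simplified update rule maps directly to code. A sketch mirroring the ξ[d] ← ξ[d] + η · (signal − ξ[d]) step of Algorithm 7.11; the 0.5 uniform prior for unseen domains and η = 0.2 are illustrative defaults:

```python
def update_expertise(xi, domain_signals, eta=0.2):
    """Exponential moving-average expertise update, one step per observed query.

    `xi` maps domain -> current estimate in [0, 1]; `domain_signals` maps
    domain -> expertise indicator extracted from the query (g_d(q)).
    """
    for d, signal in domain_signals.items():
        prior = xi.get(d, 0.5)           # uniform prior for unseen domains
        xi[d] = prior + eta * (signal - prior)
    return xi

xi = update_expertise({"devops": 0.5}, {"devops": 1.0, "ml": 0.0})
```

Because the update is an exponential moving average, a single expert-sounding query nudges the estimate rather than overwriting it, which keeps the model robust to one-off vocabulary borrowing.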
7.11.4 Unstated Goal Inference#
Unstated goals are inferred from the gap between what the user asked and what a user with their expertise and context would likely need:
$$G_{unstated} = \text{TypicalGoals}(\xi, C_{ctx}, I) \setminus G_{stated}$$
Example: A user with high DevOps expertise asking about deployment configuration in a production context implicitly needs rollback procedures, even if they did not ask for them.
7.11.5 Response Calibration#
The user state model calibrates the response along multiple axes: technical depth (driven by $\xi$), prerequisite coverage (driven by $K_u$ and the detected gaps), and format, depth, and verbosity (driven by $P_{pref}$).
7.11.6 Knowledge Gap Detection#
The system identifies concepts the query depends on that the user may not know:
$$\text{Gaps}(q, U) = \text{Prereq}(q) \setminus K_u$$
If $\text{Gaps}(q, U) \ne \emptyset$, the system may:
- For novice users: Proactively explain prerequisite concepts.
- For expert users: Skip prerequisites and provide direct answers.
- For uncertain expertise: Provide layered responses (summary + details).
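The branching above can be sketched as follows; the expertise thresholds (0.3 and 0.7) and the strategy labels are illustrative assumptions of this sketch:

```python
def knowledge_gaps(prerequisites: set[str], known: set[str]) -> set[str]:
    """Concepts the answer depends on that the user likely does not know."""
    return prerequisites - known

def response_plan(gaps: set[str], expertise: float) -> str:
    """Pick a response strategy from the gap set and the expertise estimate."""
    if not gaps:
        return "direct"
    if expertise < 0.3:
        return "explain_prerequisites"   # novice: teach the missing concepts
    if expertise > 0.7:
        return "direct"                  # expert: gaps are likely stale estimates
    return "layered"                     # uncertain: summary plus expandable detail

plan = response_plan(
    knowledge_gaps({"connection_pool", "timeout"}, {"timeout"}), expertise=0.5)
```

The "layered" branch is the safe default under uncertainty: experts can skip the summary, novices can expand the detail, and neither is poorly served.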
7.11.7 Pseudo-Algorithm: Theory of Mind Construction#
ALGORITHM 7.11: BuildUserModel(q, H, M, UserProfile)
────────────────────────────────────────────────────
Input:
q — current query
H — conversation history
M — memory (session, episodic)
UserProfile — stored user profile (if available)
Output:
U — user state model
1. // Initialize from profile or defaults
2. IF UserProfile EXISTS THEN
3. ξ ← UserProfile.expertise_vector
4. P_pref ← UserProfile.preferences
5. ELSE
6. ξ ← UniformPrior(|D|, value=0.5)
7. P_pref ← DefaultPreferences()
8. END IF
9. // Update expertise from current session
10. FOR EACH (q_i, r_i) IN H DO
11. FOR EACH domain d IN IdentifyDomains(q_i) DO
12. signal ← ComputeExpertiseSignal(q_i, d)
13. ξ[d] ← ξ[d] + η · (signal - ξ[d])
14. END FOR
15. END FOR
16. // Current query expertise signal
17. FOR EACH domain d IN IdentifyDomains(q) DO
18. signal ← ComputeExpertiseSignal(q, d)
19. ξ[d] ← ξ[d] + η · (signal - ξ[d])
20. END FOR
21. // Estimate knowledge set
22. K_u ← EstimateKnowledgeSet(ξ, H, M.episodic)
23. // Extract stated goals
24. G_stated ← ExtractExplicitGoals(q, F_p)
25. // Infer unstated goals
26. G_typical ← TypicalGoalSet(ξ, C_ctx, I_set)
27. G_unstated ← G_typical \ G_stated
28. // Detect operational context
29. C_ctx ← InferContext(q, H, M.session)
30. // Context includes: task_type, deadline_pressure, incident_mode
31. // Detect knowledge gaps
32. prerequisites ← ComputePrerequisites(q, I_set)
33. gaps ← prerequisites \ K_u
34. U ← (ξ, K_u, G_stated, G_unstated, P_pref, C_ctx, gaps)
35. RETURN U
7.12 Query Understanding Quality Metrics: Precision of Decomposition, Routing Accuracy, Enrichment Lift#
7.12.1 Measurement Imperative#
A query understanding pipeline that cannot be measured cannot be improved. We define a comprehensive metrics framework covering every pipeline stage, enabling CI/CD-integrated evaluation, regression detection, and continuous optimization.
7.12.2 Intent Classification Metrics#
Precision, Recall, F1 per intent class $c$:
$$P_c = \frac{TP_c}{TP_c + FP_c}, \qquad R_c = \frac{TP_c}{TP_c + FN_c}, \qquad F1_c = \frac{2\, P_c R_c}{P_c + R_c}$$
Multi-intent accuracy (subset accuracy):
$$\text{Acc}_{subset} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\big[\hat{I}_i = I_i\big]$$
where $\hat{I}_i$ and $I_i$ are the predicted and ground-truth intent sets for query $i$. This is a strict metric: partial matches score zero.
Hamming loss (relaxed multi-label metric):
$$\text{HL} = \frac{1}{N\, |\mathcal{C}|} \sum_{i=1}^{N} \big|\hat{I}_i \,\triangle\, I_i\big|$$
where $\mathcal{C}$ is the intent label set and $\triangle$ denotes symmetric difference.
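Both multi-label metrics are short to implement. A sketch with intents represented as Python sets; the example intent labels are invented:

```python
def subset_accuracy(pred: list[set], gold: list[set]) -> float:
    """Strict multi-intent accuracy: the whole predicted set must match."""
    return sum(p == g for p, g in zip(pred, gold)) / len(gold)

def hamming_loss(pred: list[set], gold: list[set], n_labels: int) -> float:
    """Relaxed metric: average symmetric difference per available label slot."""
    total = sum(len(p ^ g) for p, g in zip(pred, gold))
    return total / (len(gold) * n_labels)

pred = [{"diagnose"}, {"deploy", "verify"}]
gold = [{"diagnose"}, {"deploy"}]
# One exact match out of two; one spurious label out of eight label slots.
```

On this example subset accuracy is 0.5 while Hamming loss is only 0.125, illustrating why the strict metric should be read alongside the relaxed one.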
7.12.3 Decomposition Quality Metrics#
Semantic Preservation Score (SPS):
$$\text{SPS}(G_{sq}, q) = \frac{\big|\big(\bigcup_{sq \in V(G_{sq})} \text{intents}(sq)\big) \cap I(q)\big|}{|I(q)|}$$
A perfect decomposition achieves $\text{SPS} = 1$. Any value below 1.0 indicates information loss.
Decomposition Granularity Index (DGI):
$$\text{DGI} = \frac{|V(G_{sq})|}{|V(G^{*}_{sq})|}$$
$\text{DGI} < 1$ indicates under-decomposition (sub-queries too coarse). $\text{DGI} > 1$ indicates over-decomposition (unnecessary fragmentation). Optimal: $\text{DGI} \approx 1$.
Dependency Correctness:
$$\text{DepCorr} = F_1\big(E(G_{sq}),\, E(G^{*}_{sq})\big)$$
measuring whether the identified dependencies (edges) between sub-queries match the ground-truth dependency structure $G^{*}_{sq}$.
7.12.4 Routing Accuracy Metrics#
Source Match Rate (SMR):
$$\text{SMR} = \frac{1}{|V(G_{sq})|} \sum_{sq} \mathbb{1}\big[\text{route}(sq) = \text{route}^{*}(sq)\big]$$
Latency Compliance Rate (LCR):
$$\text{LCR} = \frac{1}{|V(G_{sq})|} \sum_{sq} \mathbb{1}\big[\text{latency}(sq) \le \text{deadline}(sq)\big]$$
Cost Efficiency:
$$\text{CE} = \frac{\sum_{sq} \text{quality}(sq)}{\sum_{sq} \text{cost}(sq)}$$
measuring quality-per-unit-cost of the routing decisions.
7.12.5 Enrichment Lift Metrics#
Retrieval Recall Lift from Enrichment:
$$\text{Lift} = \frac{\text{Recall}@k(q_{enriched})}{\text{Recall}@k(q_{raw})} - 1$$
Positive values indicate that enrichment improved retrieval coverage.
Precision Preservation:
$$\text{PP} = \frac{\text{Precision}@k(q_{enriched})}{\text{Precision}@k(q_{raw})}$$
Enrichment must improve recall without substantially degrading precision. Target: $\text{PP} \ge 1 - \epsilon$ for a small tolerance $\epsilon$.
HyDE Effectiveness:
$$\Delta\text{nDCG}@k = \text{nDCG}@k(q_{HyDE}) - \text{nDCG}@k(q_{raw})$$
measuring the normalized discounted cumulative gain improvement from HyDE-based retrieval versus raw query retrieval.
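The lift computation mirrors line 25 of Algorithm 7.12. A sketch assuming simple recall@k over lists of document IDs; the IDs and result lists are invented for illustration:

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of relevant documents found in the top-k results."""
    hits = sum(1 for doc in retrieved[:k] if doc in relevant)
    return hits / len(relevant) if relevant else 0.0

def enrichment_lift(raw_results, enriched_results, relevant, k=10):
    """Relative recall improvement of the enriched query over the raw query."""
    r_raw = recall_at_k(raw_results, relevant, k)
    r_enr = recall_at_k(enriched_results, relevant, k)
    if r_raw == 0.0:
        # Raw retrieval found nothing; any enriched hit is an unbounded lift.
        return float("inf") if r_enr > 0 else 0.0
    return r_enr / r_raw - 1.0

# Raw query finds 1 of 3 relevant docs; enriched query finds all 3.
lift = enrichment_lift(["d1", "d4"], ["d1", "d2", "d3"], {"d1", "d2", "d3"}, k=3)
```

Guarding the zero-recall case matters in practice: hard queries where raw retrieval fails entirely are exactly where enrichment earns its cost, and a naive ratio would divide by zero there.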
7.12.6 End-to-End Query Understanding Quality#
Query Understanding Score (QUS):
We define an aggregate metric that captures the end-to-end quality of the pipeline:
$$\text{QUS} = w_1\, F1_{intent} + w_2\, \text{SPS} + w_3\, \text{SMR} + w_4\, \widetilde{\text{Lift}} - w_5\, \text{FPC}$$
where:
- $w_1, \ldots, w_5$ are non-negative stage weights,
- $\widetilde{\text{Lift}}$ is the enrichment lift normalized to $[0, 1]$, and
- $\text{FPC}$ is the fraction of clarifications that were not actually needed (false-positive clarifications).
7.12.7 Operational Metrics#
Beyond quality, production systems must track operational health:
| Metric | Definition | Target |
|---|---|---|
| Pipeline Latency P50/P95/P99 | End-to-end query understanding time | P95 < 200ms |
| Token Consumption | Tokens used by rewriting + HyDE + reasoning | < 15% of total budget |
| Clarification Rate | Fraction of queries requiring clarification | < 10% |
| Fallback Rate | Fraction of queries degrading to raw retrieval | < 5% |
| Decomposition Rate | Fraction of queries decomposed into sub-queries | Monitored (not targeted) |
7.12.8 Continuous Evaluation Pipeline#
ALGORITHM 7.12: QueryUnderstandingEvalPipeline(eval_set, pipeline)
─────────────────────────────────────────────────────────────────
Input:
eval_set — list of (query, ground_truth) pairs
ground_truth = {intents, decomposition, routing, enrichment_targets}
pipeline — the query understanding pipeline under evaluation
Output:
report — comprehensive quality report with per-stage metrics
1. metrics ← InitializeMetricsAccumulator()
2. FOR EACH (q, gt) IN eval_set DO
3. // Run pipeline
4. start ← Now()
5. result ← pipeline.Process(q)
6. latency ← Now() - start
7.
8. // Intent metrics
9. metrics.intent_predictions ← Append(result.I_set)
10. metrics.intent_ground_truth ← Append(gt.intents)
11.
12. // Decomposition metrics
13. sps ← ComputeSPS(q, result.G_sq, gt.decomposition)
14. dgi ← ComputeDGI(result.G_sq, gt.decomposition)
15. dep_corr ← ComputeDepCorrectness(result.G_sq, gt.decomposition)
16. metrics.decomposition ← Append(sps, dgi, dep_corr)
17.
18. // Routing metrics
19. smr ← ComputeSMR(result.routing, gt.routing)
20. metrics.routing ← Append(smr)
21.
22. // Enrichment lift (requires retrieval execution)
23. recall_raw ← MeasureRecall(q, gt.enrichment_targets)
24. recall_enriched ← MeasureRecall(result.q_enriched, gt.enrichment_targets)
25. lift ← (recall_enriched / recall_raw) - 1
26. metrics.enrichment_lift ← Append(lift)
27.
28. // Operational
29. metrics.latencies ← Append(latency)
30. metrics.token_usage ← Append(result.token_count)
31. metrics.clarification_triggered ← Append(result.clarification ≠ NULL)
32. END FOR
33. // Compute aggregate metrics
34. report ← {
35. intent_f1: ComputeF1(metrics.intent_predictions, metrics.intent_ground_truth),
36. intent_subset_acc: ComputeSubsetAccuracy(...),
37. avg_sps: Mean(metrics.decomposition.sps),
38. avg_dgi: Mean(metrics.decomposition.dgi),
39. avg_smr: Mean(metrics.routing),
40. avg_enrichment_lift: Mean(metrics.enrichment_lift),
41. latency_p50: Percentile(metrics.latencies, 50),
42. latency_p95: Percentile(metrics.latencies, 95),
43. latency_p99: Percentile(metrics.latencies, 99),
44. clarification_rate: Mean(metrics.clarification_triggered)
45. }
46. report.qus ← ComputeQUS(report)
47. // Regression detection
48. previous ← LoadPreviousReport()
49. IF previous ≠ NULL THEN
50. regressions ← DetectRegressions(report, previous, thresholds)
51. IF |regressions| > 0 THEN
52. report.regressions ← regressions
53. report.status ← "REGRESSION_DETECTED"
54. END IF
55. END IF
56. PersistReport(report)
57. RETURN report
7.12.9 Evaluation-Driven Improvement Loop#
The metrics framework is not passive. It drives an improvement loop:
- Failed traces (low SPS, incorrect routing, unnecessary clarifications) are captured and normalized into regression test cases.
- Pattern analysis identifies systematic failure modes (e.g., "sequential decomposition consistently misses the third dependency").
- Policy updates are derived from failure patterns and integrated into the pipeline's procedural memory.
- A/B testing of pipeline variants (e.g., new HyDE prompts, updated ontologies) is evaluated against the benchmark set.
- Quality gates in CI/CD enforce that no pipeline change may degrade QUS below the established baseline.
The closed loop from evaluation to improvement is what transforms a query understanding pipeline from a static component into an evolving, self-improving system.
Chapter Summary#
This chapter formalized query understanding as a typed cognitive pipeline that transforms raw, ambiguous, context-dependent user inputs into structured, provenance-tagged, schema-routed execution plans. The key architectural contributions are:
| Contribution | Section | Core Idea |
|---|---|---|
| Pipeline formalization | §7.1 | Query understanding as a typed, multi-stage transformation from ambiguous surface form to structured execution plan |
| Hierarchical multi-intent detection | §7.2 | Open-domain, calibrated, multi-label intent classification |
| Pragmatic frame construction | §7.3 | Gricean analysis, presupposition validation, speech act classification |
| Cognitive load estimation | §7.4 | Multi-dimensional complexity scoring for adaptive pipeline configuration |
| HyDE and enrichment | §7.5 | Hypothesis-driven retrieval, ontological expansion, multi-turn resolution |
| Decomposition DAGs | §7.6 | Parallel, sequential, and conditional sub-query graphs |
| Schema-aware routing | §7.7 | Multi-objective source assignment under latency and cost constraints |
| Multi-modal understanding | §7.8 | Cross-modal fusion and modality-specific intent extraction |
| Clarification protocols | §7.9 | Minimal, discriminative, actionable clarification with graceful degradation |
| Cognitive reasoning | §7.10 | Deductive, inductive, abductive, and analogical inference integration |
| Theory of mind | §7.11 | User expertise estimation, unstated goal inference, knowledge gap detection |
| Quality metrics | §7.12 | SPS, DGI, SMR, enrichment lift, QUS, and CI-integrated evaluation |
The unifying principle is that query understanding is the highest-leverage intervention point in any agentic system. Every downstream operation—retrieval, tool use, reasoning, synthesis, verification—is bounded by the quality of query understanding. A system that invests in this stage proportionally outperforms one that invests the same compute budget elsewhere in the pipeline. The formal metrics framework (§7.12) ensures this investment is measurable, reproducible, and continuously improving.