Skip to content

feat: opt-in query rewriting for multi-turn RAG conversations#5188

Closed
Alminc91 wants to merge 5 commits intoMintplex-Labs:masterfrom
Alminc91:feat/query-rewriting-for-contextual-rag
Closed

feat: opt-in query rewriting for multi-turn RAG conversations#5188
Alminc91 wants to merge 5 commits intoMintplex-Labs:masterfrom
Alminc91:feat/query-rewriting-for-contextual-rag

Conversation

@Alminc91
Copy link

Problem

In multi-turn conversations, follow-up queries like "Yes, the B1 course please" or "Tell me more about that" fail at the RAG retrieval stage. Vector search on the literal text "Yes, B1 please" returns 0 relevant results. Reranking cannot fix this — it reorders results after retrieval, and reordering 0 relevant results still yields 0 relevant results.

This is the single biggest quality gap in multi-turn RAG, and the reason every major RAG framework has adopted query rewriting as a pre-retrieval step.

Solution

A 75-line module (queryRewriter.js) that rewrites ambiguous follow-up queries into standalone search queries before vector search. Integrates via a 2-line import+call in each chat handler.

Opt-in per workspace. Disabled by default. Enable via a toggle in Chat Settings → "Query Rewriting". When disabled, zero code paths change.

How it works

  1. Gate: Only runs when chat history exists, query is ≤12 words, and the workspace has query rewriting enabled
  2. Rewrite: Sends a short prompt (~550 tokens) with the last 2 turn-pairs to the workspace LLM
  3. Use: Rewritten query goes to performSimilaritySearch. Original message is preserved for chat completion and history
  4. Fallback: On any error, the original query is used. Zero risk of degrading existing behavior

Benchmark Results

Tested on Mistral Small 24B (FP8), a small locally-hosted model, across 250 queries (6 prompt variants):

Prompt Variant Follow-ups (25) Self-contained (25) Hallucinations Overall
Final version 25/25 (100%) 21/25 (84%) 0 92%
UNCHANGED signal 17/25 (68%) 25/25 (100%) 0 84%
Reference extraction 23/25 (92%) 18/25 (72%) 0 82%
LangChain prompt ~20/25 (80%) 3/25 (12%) 5+ ~46%

The 4 self-contained "errors" in the final version are harmless paraphrases (same search results). Zero hallucinations, zero meta-text leakage across all 50 test queries.

Latency: ~260ms self-contained, ~300ms rewrites. Tested on a small on-device model.

Safety Features

Feature Behavior
Opt-in per workspace Disabled by default. Enable in Chat Settings
Env override ENABLE_QUERY_REWRITING=true sets default for all workspaces
Word count threshold Queries above 12 words skip rewriting (configurable via QUERY_REWRITE_WORD_THRESHOLD)
Verbatim detection If the LLM returns the query unchanged, original is used
Error fallback Any exception returns the original query — zero disruption
History gate First message always skips (no history to reference)

Worst case: Original query used unchanged. Cannot produce worse results than current behavior.

Changes

File Change
server/utils/helpers/chat/queryRewriter.js New — rewrite logic + prompt (~75 lines)
server/utils/chats/embed.js Import + call before vector search (+2 lines)
server/utils/chats/stream.js Import + call before vector search (+2 lines)
server/utils/chats/apiChatHandler.js Import + call before vector search (+2 lines, 2 handlers)
server/utils/chats/openaiCompatible.js Import + call before vector search (+2 lines, 2 handlers)
server/models/workspace.js Add queryRewriteMode to writable fields + validation
server/prisma/schema.prisma Add queryRewriteMode column (nullable, default "off")
frontend/.../ChatSettings/QueryRewriteMode/index.jsx New — UI toggle component (~45 lines)
frontend/.../ChatSettings/index.jsx Import + render toggle

No new dependencies. No breaking changes.

Industry Precedent

This is not experimental — it is the standard approach for multi-turn RAG:

  • LangChaincreate_history_aware_retriever: LLM-based query contextualization
  • Open WebUI — Enabled by default; generates search queries from history
  • Vercel AI SDKgenerateObject for query transformation before RAG
  • Amazon Bedrock Knowledge Bases — Built-in query reformulation
  • Google Vertex AI RAG — Context-aware query rewriting

Related Issues


🤖 Generated with Claude Code

Alminc91 and others added 5 commits March 8, 2026 22:28
…ersations

Before vector search, rewrite short follow-up queries using chat history
so the search query captures the full conversational intent.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… validation

Instead of always rewriting (like LangChain/LlamaIndex), the LLM now responds
with UNCHANGED for self-contained queries — reducing latency by ~40% (1 token
vs reproducing the full query).

Output validation ensures the function only ever returns the original query or
a valid rewrite, never meta-text like "no rewrite needed":

- Layer 1: Explicit UNCHANGED signal from LLM (fast path)
- Layer 2: Verbatim copy detection (fallback for less capable models)
- Layer 3: Content word overlap check — a valid rewrite must share topic
  words with the conversation context. Meta-responses do not. This works
  across all languages and models without keyword lists.

Tested with Mistral Small 24B:
- UNCHANGED queries: ~125-240ms (was ~240-435ms without signal)
- Rewritten queries: ~275-290ms
- All test cases pass: follow-ups rewritten, self-contained queries unchanged

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Layer 3 word-level matching fails for languages without spaces between
words (Chinese, Japanese, Korean, Thai). These scripts use meaningful
individual characters, so we fall back to character-level overlap
checking for any non-ASCII character (charCode > 127).

This makes the output validation truly language-agnostic:
- Space-separated languages (Latin, Arabic, Cyrillic, Hebrew): word-level
- Non-space languages (CJK, Thai, Lao, Myanmar): character-level fallback
- Meta-responses in any language: still correctly rejected

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace UNCHANGED signal + 3-layer validation with simpler "return
EXACTLY as written" prompt and verbatim check. Tested across 6 prompt
variants on Mistral Small 24B FP8: 92% accuracy, 0 hallucinations.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add per-workspace queryRewriteMode setting (default: off)
- Add UI toggle in Chat Settings for easy enable/disable
- Replace UNCHANGED signal with simpler "return EXACTLY as written"
  prompt based on 250-query benchmark (92% accuracy, 0 hallucinations)
- Env var ENABLE_QUERY_REWRITING=true sets default for all workspaces

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@Alminc91 Alminc91 closed this Mar 10, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant