Open
Labels
bug (Something isn't working), priority:critical (Critical priority - blocks production)
Description
Problem
For the question "What was IBM revenue in 2022?", retrieval ranks the correct answer chunk at position #14, outside the default top_k=5, so users receive incorrect or incomplete answers.
Evidence
- All 8 embedding models tested rank the revenue chunk identically at #14
- Similarity score: 0.7069
- Query: "What was IBM revenue in 2022?"
- Correct chunk: "For the year, IBM generated $60.5 billion in revenue..."
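To make the failure mode concrete, here is a minimal sketch of how a chunk's rank relates to the top_k cutoff. The helper names (`rank_of`, `in_top_k`) are hypothetical, and every score except 0.7069 is synthetic, chosen only so that 13 chunks score higher than the revenue chunk as reported above.

```python
def rank_of(scores, target_index):
    """Return the 1-based rank of scores[target_index], highest score first."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order.index(target_index) + 1


def in_top_k(scores, target_index, k=5):
    """True if the target chunk would be returned with the given top_k."""
    return rank_of(scores, target_index) <= k


# Synthetic scores: 13 generic chunks above the revenue chunk, one below it.
scores = [0.80 - 0.005 * i for i in range(13)] + [0.7069, 0.60]
revenue_index = 13

print(rank_of(scores, revenue_index))
print(in_top_k(scores, revenue_index, k=5))
print(in_top_k(scores, revenue_index, k=20))
```

With these synthetic scores the revenue chunk is cut off at top_k=5 and only surfaces at top_k=20, which matches the behaviour described in this issue.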
Test Results
| Model | Rank | Score | Answer |
|---|---|---|---|
| slate-125m-english-rtrvr | #14 | 0.7069 | ✅ (with top_k=20) |
| slate-125m-english-rtrvr-v2 | #14 | 0.7069 | ✅ (with top_k=20) |
| granite-107m-multilingual | #14 | 0.7069 | ❌ |
| (all 8 models) | #14 | 0.7069 | 7/8 correct |
Root Cause
The embedding models match on generic financial keywords rather than on specific factual content.
Chunks ranked #1-13 contain generic terms that semantically match the query but don't contain the answer:
- "consolidated financial results"
- "annual report"
- "stockholders"
- "financial statements"
The revenue chunk (#14) uses different phrasing:
- "generated" instead of "revenue"
- "For the year" instead of "in 2022"
Impact
- Critical UX issue: Default top_k=5 misses the correct answer
- Users get wrong/incomplete information
- System appears unreliable for factual questions
- Workaround requires top_k=20 (expensive, slower)
Solution Options
Option A: Fix LLM Reranker (QUICK WIN - RECOMMENDED)
- Effort: 30 min
- Impact: 70-80% improvement
- Action: Fix reranker template=None bug
- LLM can read all 20 chunks and identify chunk #14 as most relevant
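The issue does not show the reranker code, so the following is only a sketch of what a guard against template=None could look like. The names `rerank`, `score_fn`, and `DEFAULT_TEMPLATE` are hypothetical, and `score_fn` stands in for whatever LLM call the project actually uses; the assumption here is that the bug is a missing fallback when no template is passed.

```python
# Hypothetical fallback prompt; the project's real template is not shown in this issue.
DEFAULT_TEMPLATE = (
    "Rate from 0 to 10 how well the passage answers the question.\n"
    "Question: {query}\nPassage: {chunk}\nScore:"
)


def rerank(query, chunks, score_fn, template=None, top_k=5):
    """Score every retrieved chunk with score_fn and keep the top_k best."""
    # Assumed fix: fall back to a default instead of formatting None.
    if template is None:
        template = DEFAULT_TEMPLATE
    scored = []
    for chunk in chunks:
        prompt = template.format(query=query, chunk=chunk)
        scored.append((score_fn(prompt), chunk))
    # Sort on the score only, so ties keep the original retrieval order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:top_k]]


def fake_score_fn(prompt):
    """Stand-in for the LLM: favours passages that state a dollar figure."""
    return 10.0 if "$60.5 billion" in prompt else 1.0


retrieved = ["annual report"] * 13 + [
    "For the year, IBM generated $60.5 billion in revenue..."
]
best = rerank("What was IBM revenue in 2022?", retrieved, fake_score_fn)
print(best[0])
```

The point of the sketch is the flow: retrieve with top_k=20, rerank, then truncate to 5, so the chunk at position #14 can still reach the user.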
Option B: Implement Hybrid Search
- Effort: 3-4 hours
- Impact: 50-60% improvement
- Combine vector similarity (70%) + BM25 keyword matching (30%)
- Boosts chunks with exact "revenue" and "2022" keywords
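A rough sketch of the 70/30 blend follows, using a small hand-written BM25 scorer instead of any particular library. All function names are hypothetical, the min-max normalization is one assumed way to put the two score types on the same scale, and every vector score except 0.7069 is synthetic.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each doc against the query with a basic BM25 formula."""
    tokenized = [tokenize(d) for d in docs]
    avg_len = sum(len(t) for t in tokenized) / len(tokenized)
    n = len(docs)
    # Document frequency: in how many docs each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    query_terms = set(tokenize(query))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def min_max(values):
    """Rescale to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def hybrid_scores(vector_scores, keyword_scores, vector_weight=0.7):
    """Blend normalized vector and keyword scores (70/30 by default)."""
    v = min_max(vector_scores)
    k = min_max(keyword_scores)
    return [vector_weight * a + (1 - vector_weight) * c for a, c in zip(v, k)]


docs = [
    "consolidated financial results in the annual report",
    "For the year, IBM generated $60.5 billion in revenue",
    "stockholders and financial statements",
    "board committee charter",
]
vector_scores = [0.72, 0.7069, 0.71, 0.40]  # synthetic except 0.7069
keyword = bm25_scores("What was IBM revenue in 2022?", docs)
combined = hybrid_scores(vector_scores, keyword)
print(combined.index(max(combined)))
```

Note that the result depends on the normalization: with min-max scaling, a chunk that has the lowest vector score in the candidate set gets a vector component of zero, so the weights would need tuning against real retrieval results.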
Option C: Improve Query Rewriting
- Effort: 1-2 hours
- Impact: 20-30% improvement
- Remove generic expansion: "AND (relevant OR important OR key)"
- Add entity extraction and synonym expansion
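The two steps above could be sketched as follows. The function name `rewrite_query`, the regex-based entity extraction, and the `SYNONYMS` table are all illustrative assumptions; the project's actual rewriter is not shown in this issue.

```python
import re

# The generic expansion quoted in this issue, matched literally.
GENERIC_EXPANSION = re.compile(
    r"\s*AND\s*\(relevant OR important OR key\)", re.IGNORECASE
)

# Hypothetical synonym table; a real one would be domain-specific.
SYNONYMS = {"revenue": ["sales", "generated"]}


def rewrite_query(query):
    """Strip the generic expansion, then pull out entities and synonyms."""
    cleaned = GENERIC_EXPANSION.sub("", query).strip()
    # Naive entity extraction: four-digit years and all-caps acronyms.
    years = re.findall(r"\b(?:19|20)\d{2}\b", cleaned)
    acronyms = re.findall(r"\b[A-Z]{2,}\b", cleaned)
    expansions = []
    for token in re.findall(r"[a-z]+", cleaned.lower()):
        expansions.extend(SYNONYMS.get(token, []))
    return {
        "query": cleaned,
        "entities": acronyms + years,
        "expansions": expansions,
    }


result = rewrite_query(
    "What was IBM revenue in 2022? AND (relevant OR important OR key)"
)
print(result)
```

The entity extraction runs after the generic expansion is removed, so the uppercase AND/OR operators are not mistaken for acronyms.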
Option D: Reduce Chunk Size
- Effort: 2-3 hours (re-ingestion required)
- Impact: 30-40% improvement
- Test 400 chars vs current 750 chars
- Reduces signal dilution
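For the 400 vs 750 character comparison, a simple word-boundary chunker is enough to see how the chunk count changes. This is a sketch under the assumption that chunks are sized in characters with no overlap; `chunk_text` is a hypothetical helper, not the project's ingestion code, and the sample text is synthetic.

```python
def chunk_text(text, max_chars):
    """Split text into chunks of at most max_chars, breaking on whitespace.

    A single word longer than max_chars is kept whole in its own chunk.
    """
    chunks, current = [], ""
    for word in text.split():
        candidate = f"{current} {word}".strip()
        if len(candidate) > max_chars and current:
            chunks.append(current)
            current = word
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


sample = "word " * 300  # synthetic text, 300 short words
large = chunk_text(sample, 750)
small = chunk_text(sample, 400)
print(len(large), len(small))
```

Smaller chunks mean each embedding covers less unrelated text, but they also increase the number of vectors to store and search, so the re-ingestion test should compare rank as well as latency.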
Related
- Reranker template validation bug (BLOCKS this fix)
- Issue #461: CoT reasoning leak (separate, affects response quality)