Most RAG systems don't fail because of the model. They fail because of the pipeline.
After building and benchmarking production RAG APIs across 6 frameworks, 14 LLMs, and 70 adversarial tests, here's what actually breaks in production — and what works.
Curated list · CC0 · PR-friendly · every entry links to a reproducible artifact
RAG is supposed to reduce hallucinations.
In reality, most pipelines:
- Retrieve context...
- Then blindly trust the model
Result:
- Missing key facts
- Conflicting sources ignored
- Confident but wrong answers
The model is not the problem. The system is.
Top-K similarity doesn't mean the right chunks are selected. A chunk about "warranty terms" may score lower than a general overview — but it's the one that matters.
Numbers, dates, policy limits, entity names — these often rank low in vector similarity because they're short and specific. But they're exactly what the user asked about.
Source says "coverage period: 60 days." Model outputs "coverage period: 30 days." The right chunk was in context. The model just... changed it. This is generation drift, not a retrieval problem.
Document A says "unlimited storage." Document B says "50GB limit." Most pipelines pick whichever scores higher. No reconciliation. No flag.
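A conflict flag doesn't need a model. One minimal sketch (the regex and function names here are illustrative, not from any library) compares the quantified claims each retrieved chunk makes and raises a flag when they disagree:

```python
import re

def extract_limits(text: str) -> set[str]:
    # Pull "number + unit" or "unlimited" claims out of a chunk (toy pattern).
    claims = re.findall(r"\b(?:\d+\s?(?:GB|MB|days?|months?)|unlimited)\b", text, re.I)
    return {c.lower().replace(" ", "") for c in claims}

def conflicting(chunks: list[str]) -> bool:
    # Flag retrieval sets where two chunks make different quantified claims.
    seen = [s for s in (extract_limits(c) for c in chunks) if s]
    return any(a != b for a in seen for b in seen)

docs = ["Plan includes unlimited storage.", "Storage has a 50GB limit."]
```

A flagged conflict can then be surfaced to the user or routed to a reconciliation step, instead of silently shipping whichever chunk scored higher.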
A paragraph split mid-sentence loses context. A table split across chunks becomes garbage. Hierarchical chunking (small for retrieval, large for context) helps — flat chunking doesn't.
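One way to sketch that hierarchy, assuming paragraph-level children grouped into fixed-size parent windows (the grouping policy is an assumption, not a prescription):

```python
def hierarchical_chunks(paragraphs: list[str], parent_size: int = 3) -> list[dict]:
    # Index the small "child" for retrieval; hand the larger "parent"
    # window to the model, so no chunk arrives stripped of its context.
    chunks = []
    for i, para in enumerate(paragraphs):
        start = (i // parent_size) * parent_size
        parent = " ".join(paragraphs[start:start + parent_size])
        chunks.append({"child": para, "parent": parent})
    return chunks
```

Retrieval scores the short child; generation receives the full parent, so a sentence split away from its paragraph or table still travels with it.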
Reranking improves the average case but can kill edge cases. We saw a header-penalty reranker drop retrieval by 2 points and increase variance 5x. Calibrated ≠ tweakable.
"Retrieve more, then filter" sounds smart. In practice, parent chunk expansion reduced our benchmark by 3 points. More recall ≠ more accuracy.
The pipeline generates an answer and returns it. Nobody checks if the answer terms actually appear in the sources. This is where most hallucinations survive.
"Only answer from the provided context" is an instruction, not a guarantee. LLMs follow instructions probabilistically. Prompts reduce hallucinations — they don't eliminate them.
The system always answers — even when it shouldn't. A well-designed pipeline should say "insufficient evidence" instead of guessing. The refusal IS the feature.
Hybrid search (BM25 + vector) catches what either alone misses. BM25 finds exact terms. Vectors find semantics. Together they cover more ground. But never blindly trust the result.
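A common way to merge the two result lists is Reciprocal Rank Fusion; a minimal sketch (the document IDs are made up for illustration):

```python
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    # Reciprocal Rank Fusion: a doc ranked highly by either BM25 or the
    # vector index floats to the top of the merged list.
    scores: dict[str, float] = {}
    for ranked in rankings:
        for rank, doc_id in enumerate(ranked):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits = ["warranty_terms", "overview", "pricing"]
vector_hits = ["warranty_terms", "faq", "overview"]
merged = rrf([bm25_hits, vector_hits])
```

RRF only needs ranks, not raw scores, which sidesteps the problem of BM25 and cosine similarity living on incomparable scales.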
Don't rely on ranking alone. Detect the entities and attributes the query asks about. Force-include chunks that match — regardless of their similarity score. This is constraint-based, not score-based.
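A sketch of that constraint (the entity detector here is a toy regex for numbers and capitalized words; a real pipeline would use proper NER):

```python
import re

def force_include(query: str, ranked_chunks: list[str], top_k: int = 3) -> list[str]:
    # Toy entity detector: numbers and capitalized words from the query.
    entities = set(re.findall(r"\b(?:\d[\w.%]*|[A-Z][a-z]+)\b", query))
    selected = ranked_chunks[:top_k]
    for chunk in ranked_chunks[top_k:]:
        # A matching chunk gets in regardless of its similarity rank.
        if any(e.lower() in chunk.lower() for e in entities):
            selected.append(chunk)
    return selected
```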
Pull numbers, dates, percentages, and named values from retrieved chunks. Inject them as a structured "must include" section in the prompt. The model can't forget what's explicitly listed.
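A sketch of that extraction step (the regex and the section label are illustrative; production extraction would cover far more value types):

```python
import re

FACT_PATTERN = re.compile(r"\d[\d,.]*\s?(?:%|days?|months?|GB|USD)?")

def must_include_block(chunks: list[str]) -> str:
    # Render every extracted value as an explicit checklist for the prompt.
    facts = [m.strip(" .") for c in chunks for m in FACT_PATTERN.findall(c)]
    deduped = list(dict.fromkeys(facts))  # dedupe, preserve order
    return "MUST PRESERVE VERBATIM:\n" + "\n".join(f"- {f}" for f in deduped)

section = must_include_block(["Coverage period: 60 days.", "Refund rate: 2.5%."])
```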
Check the answer against the sources:
- Do answer terms appear in the source text?
- Do numbers match?
- Are there negation conflicts ("never" vs "12 months")?
- Do citations trace back to real sources?
If the answer isn't grounded → reject it. Return "insufficient evidence."
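The numeric part of that check is cheap to automate. A minimal sketch, assuming numbers must appear verbatim in the sources (a real grounding check also covers entities, negations, and citations):

```python
import re

def grounded(answer: str, sources: list[str]) -> bool:
    # Every number in the answer must appear verbatim in some source.
    source_text = " ".join(sources).lower()
    return all(n in source_text for n in re.findall(r"\d[\d,.]*", answer))

def answer_or_refuse(answer: str, sources: list[str]) -> str:
    # Refusal is the safe default: an ungrounded answer never ships.
    return answer if grounded(answer, sources) else "insufficient evidence"
```

This is exactly the check that catches generation drift: a "30 days" answer over a "60 days" source fails verification instead of reaching the user.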
A correct "I don't know" is infinitely better than a confident wrong answer. Design for refusal. The 17% accuracy gap in our benchmarks = the system correctly not guessing.
| Metric | Result |
|---|---|
| Hallucination rate | 0% across 61 evaluation tasks |
| Accuracy | 83% (remaining 17% = correct refusals) |
| LLMs tested | 14 models, 3 runs each |
| Avg latency | ~1.2s |
Key insight:
You don't need perfect accuracy to eliminate hallucination. You need verification.
Another surprise: the cheapest model (Qwen 3.5 Flash, $0.065/M tokens) performed the same as GPT-4.1 on this pipeline. The pipeline matters more than the model.
| Attempt | Result | Lesson |
|---|---|---|
| Multi-step retrieval | Retrieval dropped 3 points | More recall ≠ more accuracy |
| Header penalties in reranker | Retrieval dropped 2 points, variance 5x | Don't touch calibrated scoring |
| Parent chunk expansion | Degraded benchmark | Added noise, not context |
| Switching to GPT-4.1 | Same cross-doc score as Qwen | Pipeline > model |
RAG is a balanced system. Small changes can break it.
| Tool | What it does | Link |
|---|---|---|
| LettuceDetect | Token-level hallucination detection | GitHub |
| LongTracer | Claim verification against sources | GitHub |
| MiniCheck | Fact-checking, GPT-4 level at 400x lower cost | GitHub |
| RAGAS | Evaluation framework | GitHub |
| DeepEval | LLM evaluation with hallucination metrics | GitHub |
| Paper | Key finding |
|---|---|
| Lost in the Middle | LLMs ignore context in the middle of the prompt |
| MiniCheck | 770M model matches GPT-4 for fact-checking |
| LettuceDetect | 79% F1 on RAGTruth with encoder-only model |
| FACTS Grounding | Even Gemini fails ~16% of factual claims |
| Stanford Legal RAG Study | Specialized legal RAG tools hallucinate in 17-33% of responses |
| Benchmark | What it measures | Link |
|---|---|---|
| RAGTruth | Hallucination detection accuracy | Paper |
| FACTS Grounding | LLM factual accuracy | DeepMind |
| TruthfulQA | Language model truthfulness | GitHub |
| HaluEval | Hallucination evaluation | GitHub |
We built this entire pipeline into an API:
wauldo.com — Upload docs, ask questions, get verified answers.
- Native PDF/DOCX upload with quality scoring
- Built-in fact-check endpoint (3 modes)
- Citation verification
- OpenAI-compatible
- Free tier on RapidAPI
Full technical breakdown: How We Achieved 0% Hallucination Rate in Our RAG API
PRs welcome:
- Add failure modes you've encountered
- Share real-world cases with specifics
- Improve techniques with benchmarks
- Add tools or papers
Every entry must link to a reproducible artifact (paper, repo, dataset, or benchmark). See contribution guidelines.
- wauldo.com — platform
- wauldo.com/leaderboard — live RAG framework bench (6 frameworks, daily refresh)
- wauldo.com/guard — the verification layer referenced throughout this list
- github.com/wauldo/wauldo-leaderboard — reproducible adversarial bench runner
- github.com/wauldo/ragrs — standalone Rust RAG CLI with a `--verify` flag
CC0 1.0 Universal — see LICENSE.
Built by the Wauldo team. Every PR that ships a reproducible failure mode gets merged on sight.
