## Bug Summary
When evaluating a financial‑doc RAG pipeline, `ContextualPrecisionMetric` treats near‑duplicate overlapping chunks as distinct low‑quality retrievals. This under‑reports retrieval quality for pipelines that use overlap to preserve context around table/section boundaries — a standard practice for dense financial documents.
The core issue: increasing chunk overlap from 0% to 10–20% (which improves answer quality) makes the eval score worse — not because retrieval degraded, but because the metric doesn’t consolidate overlapping chunks that cover the same relevant span.
Redundancy ≠ Irrelevance.
## Pipeline Setup
- Domain: Public‑company filings (10‑K, 10‑Q, earnings transcripts)
- Chunking: Fixed‑size ~512–1024 tokens, 10–20% overlap (to avoid cutting tables/footnotes at boundaries), via LangExtract
- Embeddings: Ollama local embeddings (`nomic-embed-text`)
- Vector store: Supabase + pgvector (top‑k retrieval, no dedup, k=5–10)
- Eval LLM: Ollama (local — `llama3` / `deepseek-r1`)
- Orchestration: n8n → Supabase → dbt (analytics layer)
- Evaluation: DeepEval RAG metrics comparing `input`, `retrieval_context`, `actual_output`, `expected_output`
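The fixed‑size/overlap chunking scheme above can be sketched as follows (illustrative only — the actual pipeline uses LangExtract; `chunk_with_overlap` is a hypothetical helper, not a LangExtract API):

```python
def chunk_with_overlap(tokens, size=512, overlap=0.1):
    """Fixed-size windows where consecutive chunks share `overlap` * `size` tokens."""
    step = max(1, int(size * (1 - overlap)))
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break
    return chunks

# With 20% overlap, consecutive 512-token windows share ~103 tokens, so a
# table that straddles a window boundary survives intact in at least one chunk.
chunks = chunk_with_overlap(list(range(1000)), size=512, overlap=0.2)
```

The shared span is exactly what later gets retrieved twice and, per this report, penalized twice.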
## Reproduction
Query: “What was total revenue for FY 2023, and how did it change YoY?”
Retrieved chunks (k=5):
- Chunk 1: Revenue table from Consolidated Statements of Operations
- Chunk 2: Overlapping — repeats revenue rows + a few lines above
- Chunk 3: Overlapping — same rows + footnotes below
- Chunks 4–5: MD&A narrative explaining YoY revenue change
Result: Answer is factually correct, fully grounded, uses only the correct filing. But `context_precision` reports ~0.3–0.5 instead of near 1.0.
Why:
- Each overlapping chunk that doesn’t add unique coverage gets scored as a separate “miss”
- Multiple windows over the same relevant span are penalized as if they were off‑topic retrievals
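The arithmetic behind the penalty: DeepEval's docs describe contextual precision as weighted cumulative precision over the ranked chunks. A minimal sketch of that formulation (my reimplementation for illustration, not the library's internals) shows how overlapping windows judged non‑relevant drag the score:

```python
def weighted_cumulative_precision(verdicts):
    """verdicts[k] = 1 if the judge marked the chunk at rank k+1 relevant, else 0."""
    relevant = sum(verdicts)
    if relevant == 0:
        return 0.0
    score, seen = 0.0, 0
    for k, v in enumerate(verdicts, start=1):
        seen += v
        if v:
            score += seen / k  # precision@k, counted only at relevant ranks
    return score / relevant

weighted_cumulative_precision([1, 1, 1, 1, 1])  # all 5 chunks relevant -> 1.0
# Judge marks the two overlapping windows (chunks 2-3) non-relevant -> ~0.7
weighted_cumulative_precision([1, 0, 0, 1, 1])
```

The score drops further if a duplicate happens to rank above the unique chunk it repeats, which matches the ~0.3–0.5 range observed below.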
```python
from deepeval import evaluate
from deepeval.metrics import ContextualPrecisionMetric
from deepeval.test_case import LLMTestCase
from deepeval.models import DeepEvalBaseLLM
from langchain_openai import ChatOpenAI

class OllamaLocal(DeepEvalBaseLLM):
    """Custom DeepEval model wrapper pointing at a local Ollama server."""

    def __init__(self, model_name="llama3", base_url="http://localhost:11434/v1/"):
        self.model = ChatOpenAI(
            model_name=model_name,
            openai_api_key="ollama",  # Ollama ignores the key, but the client requires one
            base_url=base_url,
            temperature=0,
        )

    def load_model(self):  # required abstract method on DeepEvalBaseLLM
        return self.model

    def generate(self, prompt: str) -> str:
        return self.model.invoke(prompt).content

    async def a_generate(self, prompt: str) -> str:
        return self.generate(prompt)

    def get_model_name(self):
        return "ollama-local"

local_model = OllamaLocal()

test_case = LLMTestCase(
    input="What was total revenue for FY 2023, and how did it change year-over-year?",
    actual_output="Total revenue was $12,483.2M, up 13.4% year-over-year.",
    expected_output="In FY 2023, total revenue was $12,483.2M, up 13.4% compared to prior year.",
    retrieval_context=[
        "Chunk 1: Net revenues: $12,483.2M (FY2023), $11,009.7M (FY2022), $9,204.1M (FY2021). Product revenue: $7,841.0M. Service revenue: $4,642.2M.",
        "Chunk 2: Total net revenues $12,483.2M, $11,009.7M, $9,204.1M. Cost of revenues: $5,214.3M. Gross profit: $7,268.9M. [Footnote 3: Revenue recognized under ASC 606]",
        "Chunk 3: [Footnote 3] Revenue is recognized when performance obligations are satisfied. Variable consideration is estimated using the expected value method.",
        "Chunk 4: Revenue increased 13.4% year-over-year, driven by a 15.2% increase in product revenue and 10.8% growth in services. North America contributed 61% of total revenue.",
        "Chunk 5: Management attributes YoY growth to enterprise expansion and new product launches in Q3 FY2023. FX headwinds reduced reported revenue by approximately $340M.",
    ],
)

metric = ContextualPrecisionMetric(threshold=0.7, model=local_model)
result = evaluate([test_case], [metric])
# Score: ~0.3–0.5 despite all chunks being from the correct section
```
***
## Expected Behavior
When all retrieved chunks come from the correct document section and the answer is fully grounded, overlapping chunks should not drag precision down the same way genuinely irrelevant chunks do.
## Possible Approaches
1. **Chunk similarity consolidation** — optionally cluster near‑duplicate chunks (by text similarity or shared span coverage) and treat them as a single retrieval unit for precision calculation.
2. **Configurable redundancy tolerance** — a parameter like `deduplicate_context=True` or `max_redundant_hits=N` so users can tell the metric not to penalize overlapping windows from the same section.
3. **Doc‑aware precision** — if all top‑k chunks are from the same document + section containing the answer span, weight precision accordingly rather than strict per‑chunk scoring.
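As a user‑side stopgap for approach 2 in the meantime, the context can be deduplicated before building the test case. A sketch using stdlib string similarity — the `dedup_context` helper and the 0.8 threshold are my assumptions, not DeepEval parameters:

```python
from difflib import SequenceMatcher

def dedup_context(chunks, threshold=0.8):
    """Keep a chunk only if it is sufficiently dissimilar from every kept chunk."""
    kept = []
    for chunk in chunks:
        if all(SequenceMatcher(None, chunk, k).ratio() < threshold for k in kept):
            kept.append(chunk)
    return kept

chunks = [
    "Net revenues: $12,483.2M (FY2023), $11,009.7M (FY2022).",
    "Net revenues: $12,483.2M (FY2023), $11,009.7M (FY2022). [Footnote 3]",
    "Revenue increased 13.4% year-over-year, driven by product growth.",
]
deduped = dedup_context(chunks)  # collapses the two overlapping windows
```

This recovers the expected score but loses information the duplicates legitimately carry (e.g. the footnote tail), which is why a metric‑level option seems preferable.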
If the current per‑chunk behavior is intentional, guidance in the docs on recommended chunking strategies for DeepEval’s metrics (especially for dense domains like financial filings) would help a lot.
Happy to provide a full minimal repro with synthetic 10‑K text and controlled overlap levels.
***
## Environment
- **deepeval:** 3.9.5
- **Python:** 3.14.3
- **OS:** macOS
- **Vector store:** Supabase + pgvector
- **Eval LLM:** Ollama (llama3 / deepseek‑r1 via `http://localhost:11434/v1/`)