
Contextual Precision over-penalizes overlapping chunks in financial-document RAG #2594

@Ruthwik-Data

## Bug Summary

When evaluating a financial‑doc RAG pipeline, ContextualPrecisionMetric treats near‑duplicate overlapping chunks as distinct low‑quality retrievals. This under‑reports retrieval quality for pipelines that use overlap to preserve context around table/section boundaries — a standard practice for dense financial documents.

The core issue: increasing chunk overlap from 0% to 10–20% (which improves answer quality) makes the eval score worse — not because retrieval degraded, but because the metric doesn’t consolidate overlapping chunks that cover the same relevant span.

Redundancy ≠ Irrelevance.

## Pipeline Setup

  • Domain: Public‑company filings (10‑K, 10‑Q, earnings transcripts)
  • Chunking: Fixed‑size ~512–1024 tokens, 10–20% overlap (to avoid cutting tables/footnotes at boundaries), via LangExtract
  • Embeddings: Ollama local embeddings (nomic-embed-text)
  • Vector store: Supabase + pgvector (top‑k retrieval, no dedup, k=5–10)
  • Eval LLM: Ollama (local — llama3 / deepseek-r1)
  • Orchestration: n8n → Supabase → dbt (analytics layer)
  • Evaluation: DeepEval RAG metrics comparing input, retrieval_context, actual_output, expected_output
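The overlap setting above is what produces the near-duplicate windows at issue. A minimal sketch of fixed-size chunking with fractional overlap (illustrative only — `chunk_with_overlap` is a hypothetical helper, not the LangExtract API):

```python
def chunk_with_overlap(tokens, size=512, overlap=0.15):
    # Stride is the window size minus the shared margin; with 15% overlap
    # each new chunk repeats the last ~77 tokens of the previous one.
    stride = max(1, int(size * (1 - overlap)))
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break
    return chunks

doc = list(range(1200))  # stand-in for a tokenized filing section
chunks = chunk_with_overlap(doc)
print(len(chunks), chunks[1][0])  # 3 chunks; the second starts at token 435, repeating 77 tokens
```

Any span that straddles a window boundary (a revenue table, a footnote) is fully covered by two adjacent chunks — exactly the redundancy the metric then penalizes.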

## Reproduction

Query: “What was total revenue for FY 2023, and how did it change YoY?”

Retrieved chunks (k=5):

  • Chunk 1: Revenue table from Consolidated Statements of Operations
  • Chunk 2: Overlapping — repeats revenue rows + a few lines above
  • Chunk 3: Overlapping — same rows + footnotes below
  • Chunks 4–5: MD&A narrative explaining YoY revenue change

Result: Answer is factually correct, fully grounded, uses only the correct filing. But context_precision reports ~0.3–0.5 instead of near 1.0.

Why:

  • Each overlapping chunk that doesn’t add unique coverage gets scored as a separate “miss”
  • Multiple windows over the same relevant span are penalized as if they were off‑topic retrievals

Repro script:

```python
from deepeval import evaluate
from deepeval.metrics import ContextualPrecisionMetric
from deepeval.test_case import LLMTestCase
from deepeval.models import DeepEvalBaseLLM
from langchain_openai import ChatOpenAI

class OllamaLocal(DeepEvalBaseLLM):
    """Custom eval model: Ollama served via its OpenAI-compatible endpoint."""

    def __init__(self, model_name="llama3", base_url="http://localhost:11434/v1/"):
        self.model = ChatOpenAI(
            model_name=model_name,
            openai_api_key="ollama",  # Ollama ignores the key, but the client requires one
            base_url=base_url,
            temperature=0,
        )

    def load_model(self):
        # Required by the DeepEvalBaseLLM interface
        return self.model

    def generate(self, prompt: str) -> str:
        return self.model.invoke(prompt).content

    async def a_generate(self, prompt: str) -> str:
        return self.generate(prompt)

    def get_model_name(self):
        return "ollama-local"

local_model = OllamaLocal()

test_case = LLMTestCase(
    input="What was total revenue for FY 2023, and how did it change year-over-year?",
    actual_output="Total revenue was $12,483.2M, up 13.4% year-over-year.",
    expected_output="In FY 2023, total revenue was $12,483.2M, up 13.4% compared to prior year.",
    retrieval_context=[
        "Chunk 1: Net revenues: $12,483.2M (FY2023), $11,009.7M (FY2022), $9,204.1M (FY2021). Product revenue: $7,841.0M. Service revenue: $4,642.2M.",
        "Chunk 2: Total net revenues $12,483.2M, $11,009.7M, $9,204.1M. Cost of revenues: $5,214.3M. Gross profit: $7,268.9M. [Footnote 3: Revenue recognized under ASC 606]",
        "Chunk 3: [Footnote 3] Revenue is recognized when performance obligations are satisfied. Variable consideration is estimated using the expected value method.",
        "Chunk 4: Revenue increased 13.4% year-over-year, driven by a 15.2% increase in product revenue and 10.8% growth in services. North America contributed 61% of total revenue.",
        "Chunk 5: Management attributes YoY growth to enterprise expansion and new product launches in Q3 FY2023. FX headwinds reduced reported revenue by approximately $340M.",
    ],
)

metric = ContextualPrecisionMetric(threshold=0.7, model=local_model)
result = evaluate([test_case], [metric])
# Score: ~0.3–0.5 despite all chunks being from the correct section
```
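A workaround that recovers the expected score today is to consolidate near-duplicate chunks before building the test case. A hedged sketch using token-set Jaccard similarity (the 0.6 threshold and the greedy first-wins strategy are my assumptions, not anything DeepEval prescribes):

```python
def jaccard(a: str, b: str) -> float:
    """Token-set Jaccard similarity between two chunks."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def dedup_context(chunks, threshold=0.6):
    """Greedily keep a chunk only if it is not a near-duplicate of one
    already kept, preserving the original retrieval order."""
    kept = []
    for chunk in chunks:
        if all(jaccard(chunk, k) < threshold for k in kept):
            kept.append(chunk)
    return kept

context = ["a b c d e", "a b c d f", "x y z"]
print(dedup_context(context))  # middle chunk dropped as a near-duplicate of the first
```

Passing `dedup_context(retrieval_context)` into `LLMTestCase` removes the duplicate penalty, at the cost of no longer evaluating the retriever's raw output — which is why a metric-level option would be preferable.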

***

## Expected Behavior  
When all retrieved chunks come from the correct document section and the answer is fully grounded, overlapping chunks should not drag precision down the same way genuinely irrelevant chunks do.  


## Possible Approaches  
1. **Chunk similarity consolidation**: optionally cluster near-duplicate chunks (by text similarity or shared span coverage) and treat them as a single retrieval unit for precision calculation.  
2. **Configurable redundancy tolerance**: a parameter like `deduplicate_context=True` or `max_redundant_hits=N` so users can tell the metric not to penalize overlapping windows from the same section.  
3. **Doc-aware precision**: if all top-k chunks are from the same document + section containing the answer span, weight precision accordingly rather than strict per-chunk scoring.  
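To make approach 1 concrete: contextual precision is documented as a rank-weighted mean of precision@k over the relevant ranks. A toy reimplementation of that formula (not DeepEval's code) showing how judging two overlapping copies irrelevant drags the score, versus scoring the merged unit once:

```python
def contextual_precision(relevant: list[bool]) -> float:
    """Mean of precision@k at each rank k where the chunk is judged
    relevant (the formula DeepEval's docs describe, reimplemented
    here purely for illustration)."""
    if not any(relevant):
        return 0.0
    score, hits = 0.0, 0
    for k, rel in enumerate(relevant, start=1):
        if rel:
            hits += 1
            score += hits / k
    return score / hits

# Five chunks, with the judge marking the two overlapping copies irrelevant:
raw = [True, False, False, True, True]
# After consolidating the three overlapping windows into a single unit:
merged = [True, True, True]
print(contextual_precision(raw), contextual_precision(merged))  # ~0.7 vs 1.0
```

The exact numbers depend on how the judge labels each duplicate, but the direction is stable: duplicates judged irrelevant at early ranks depress every subsequent precision@k, while the consolidated ranking scores 1.0.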

If the current per-chunk behavior is intentional, guidance in the docs on recommended chunking strategies for DeepEval's metrics (especially for dense domains like financial filings) would help a lot.  

Happy to provide a full minimal repro with synthetic 10-K text and controlled overlap levels.  

***

## Environment  
- **deepeval:** 3.9.5  
- **Python:** 3.14.3  
- **OS:** macOS  
- **Vector store:** Supabase + pgvector  
- **Eval LLM:** Ollama (llama3 / deepseek-r1 via `http://localhost:11434/v1/`)  
***