
[Docs] Add WFGY 16-mode RAG failure checklist as a BEIR-aligned debugging resource #210

@onestardao

Description


Hi BEIR team,

First of all, thank you for releasing BEIR.
I’m one of those people whose RAG pipelines only became debuggable because BEIR-style retrieval evaluation exists.

I’m the author of WFGY and the 16-mode RAG Failure ProblemMap.

Recently, parts of this work were picked up by:

  • Harvard MIMS Lab – ToolUniverse (LLM tools benchmark; WFGY listed in the robustness / RAG debugging section)
  • Univ. of Innsbruck – Rankify (academic RAG toolkit; merged WFGY as a reference for RAG / re-ranking troubleshooting)
  • QCRI LLM Lab – Multimodal RAG Survey (lists WFGY ProblemMap as a semantic failure-mode taxonomy)

Why I’m reaching out

Most BEIR users I see in the wild are now building RAG pipelines on top of BEIR-style retrievers.
Where things break in practice, the question is often not:

“Is my nDCG@10 on dataset X high enough?”

but rather:

  • No.1 hallucination & chunk drift – retrieved context is slightly off, model confidently fabricates
  • No.2 interpretation collapse – retrieval is fine, but the question is semantically mis-parsed
  • No.5 semantic ≠ embedding – cosine match ≠ true meaning, especially for long / symbolic queries
  • No.8 debugging is a black box – no clear mapping from failure to “what to inspect next”

The 16-mode ProblemMap is basically a RAG failure map built from real production incidents.
A lot of people now pair BEIR-style scores with this checklist to debug where the pipeline is actually leaking.
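To make that “pairing” concrete, here is a minimal sketch of what the workflow can look like. The BEIR and pytrec_eval calls follow the standard quickstart; the 0.2 cutoff and the triage step are purely illustrative assumptions on my side, not something BEIR (or WFGY) prescribes:

```python
# Minimal sketch: standard BEIR evaluation plus a per-query triage pass.
# The BEIR / pytrec_eval calls follow the usual quickstart; the threshold
# and the failure-mode notes are illustrative, not part of BEIR.
import pytrec_eval
from beir import util
from beir.datasets.data_loader import GenericDataLoader
from beir.retrieval import models
from beir.retrieval.evaluation import EvaluateRetrieval
from beir.retrieval.search.dense import DenseRetrievalExactSearch as DRES

url = "https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/scifact.zip"
data_path = util.download_and_unzip(url, "datasets")
corpus, queries, qrels = GenericDataLoader(data_path).load(split="test")

retriever = EvaluateRetrieval(
    DRES(models.SentenceBERT("msmarco-distilbert-base-tas-b"), batch_size=16),
    score_function="dot",
)
results = retriever.retrieve(corpus, queries)

# Aggregate BEIR-style numbers (the part people already report).
ndcg, _map, recall, precision = retriever.evaluate(qrels, results, retriever.k_values)

# Per-query scores, so weak queries can be read against the checklist
# (e.g. No.1 chunk drift vs. No.5 semantic != embedding) instead of only
# looking at the dataset-level average.
per_query = pytrec_eval.RelevanceEvaluator(qrels, {"ndcg_cut.10"}).evaluate(results)
suspects = [qid for qid, m in per_query.items() if m["ndcg_cut_10"] < 0.2]
print(f"{len(suspects)} queries below nDCG@10 = 0.2 -> candidates for ProblemMap triage")
```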

Proposal (very small, docs-only)

Would you be open to a small docs addition in BEIR, something like:

RAG debugging & failure modes
For teams building RAG systems on top of BEIR retrievers, you may find the WFGY 16-mode RAG ProblemMap useful as a practical failure-mode checklist (hallucination & chunk drift, semantic ≠ embedding, retrieval traceability, multi-agent chaos, etc.).
See: https://github.com/onestardao/WFGY/blob/main/ProblemMap/README.md

Optionally, I could also contribute a short mapping table in the docs, e.g.:

  • which BEIR datasets tend to surface No.1 vs No.5 vs No.8,
  • and a 1-page “how to combine BEIR metrics + ProblemMap checklist when debugging a RAG stack”.

This would stay strictly on the documentation side, so there is no impact on core code or benchmarks.

Why this might be useful for BEIR users

From what I see in the community, there is a big gap between:

  • benchmarking retrieval quality, and
  • operationalizing those results into an actual RAG failure budget.

Putting a small, research-backed failure taxonomy next to BEIR in the docs could help teams:

  • reason about which modes of failure BEIR covers well,
  • and where they still need extra monitoring or tests.

If this sounds aligned with your roadmap, I’m happy to open a very small PR with the wording and table, and adjust it based on your feedback.

If not, feel free to close – I mainly wanted to share the ProblemMap in case it is useful for BEIR users.
