Hi BEIR team,
first of all, thank you for releasing BEIR.
I’m one of those people whose RAG pipelines only became debuggable because BEIR-style retrieval evaluation exists.
I’m the author of WFGY and the 16-mode RAG Failure ProblemMap:
- WFGY main repo: https://github.com/onestardao/WFGY
- 16-mode ProblemMap (RAG failure checklist): https://github.com/onestardao/WFGY/blob/main/ProblemMap/README.md
Recently, parts of this work were picked up by:
- Harvard MIMS Lab – ToolUniverse (LLM tools benchmark; WFGY listed in the robustness / RAG debugging section)
- Univ. of Innsbruck – Rankify (academic RAG toolkit; merged WFGY as a reference for RAG / re-ranking troubleshooting)
- QCRI LLM Lab – Multimodal RAG Survey (lists WFGY ProblemMap as a semantic failure-mode taxonomy)
Why I’m reaching out
Most BEIR users I see in the wild are now building RAG pipelines on top of BEIR-style retrievers.
Where things break in practice is often not:
“Is my nDCG@10 on dataset X high enough?”
but rather:
- No.1 hallucination & chunk drift – the retrieved context is slightly off, and the model confidently fabricates
- No.2 interpretation collapse – retrieval is fine, but the question is semantically mis-parsed
- No.5 semantic ≠ embedding – a high cosine score does not guarantee true semantic match, especially for long or symbolic queries
- No.8 debugging is a black box – there is no clear mapping from a failure to "what to inspect next"
The 16-mode ProblemMap is basically a RAG failure map built from real production incidents.
A lot of people now pair BEIR-style scores with this checklist to debug where the pipeline is actually leaking.
Proposal (very small, docs-only)
Would you be open to a small docs addition in BEIR, something like:
RAG debugging & failure modes
For teams building RAG systems on top of BEIR retrievers, you may find the WFGY 16-mode RAG ProblemMap useful as a practical failure-mode checklist (hallucination & chunk drift, semantic ≠ embedding, retrieval traceability, multi-agent chaos, etc.).
See: https://github.com/onestardao/WFGY/blob/main/ProblemMap/README.md
Optionally I could also contribute a short mapping table in the docs, e.g.:
- which BEIR datasets tend to surface No.1 vs No.5 vs No.8,
- and a 1-page “how to combine BEIR metrics + ProblemMap checklist when debugging a RAG stack”.
This would stay strictly on the documentation side, so there would be no impact on core code or benchmarks.
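To make the "BEIR metrics + checklist" idea concrete, here is a minimal sketch of what such a docs page could show. It is pure Python (it does not use the `beir` package), and the triage labels and the 0.7 threshold are purely illustrative assumptions of mine, not part of BEIR or the ProblemMap: the point is only that a per-query nDCG@10 can route you toward "inspect retrieval" vs. "inspect generation/parsing".

```python
import math

def ndcg_at_k(qrels, ranked_ids, k=10):
    """nDCG@k for a single query with graded relevance.

    qrels: {doc_id: relevance_grade} for this query
    ranked_ids: doc ids in the order the retriever returned them
    """
    # DCG over the top-k retrieved documents (log2(rank + 2): rank is 0-based)
    dcg = sum(
        qrels.get(doc_id, 0) / math.log2(rank + 2)
        for rank, doc_id in enumerate(ranked_ids[:k])
    )
    # Ideal DCG: the same formula over the best possible ordering
    ideal = sorted(qrels.values(), reverse=True)[:k]
    idcg = sum(rel / math.log2(rank + 2) for rank, rel in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

# Hypothetical per-query triage: a strong retrieval score with a wrong final
# answer points away from retrieval (e.g. No.2 interpretation collapse),
# while a weak score points at retrieval itself (e.g. No.5 semantic != embedding).
qrels = {"d1": 1, "d3": 1}          # two relevant documents
ranked = ["d1", "d2", "d3"]         # retriever output order
score = ndcg_at_k(qrels, ranked, k=10)
next_step = (
    "inspect generation/parsing" if score >= 0.7  # threshold is illustrative
    else "inspect retrieval/embeddings"
)
```

The design point the docs could make here is that triage like this needs per-query scores: a corpus-level nDCG average can look healthy while individual queries leak badly.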
Why this might be useful for BEIR users
From what I see in the community, there is a big gap between:
- benchmarking retrieval quality, and
- operationalizing those results into an actual RAG failure budget.
Putting a small, research-backed failure taxonomy next to BEIR in the docs could help teams:
- reason about which modes of failure BEIR covers well,
- and where they still need extra monitoring or tests.
If this sounds aligned with your roadmap, I’m happy to open a very small PR with the wording and table, and adjust it to your feedback.
If not, feel free to close – I mainly wanted to share the ProblemMap in case it is useful for BEIR users.