Hi BEIR team,
first of all, thank you for releasing BEIR.
I’m one of those people whose RAG pipelines only became debuggable because BEIR-style retrieval evaluation exists.
I’m the author of WFGY and the 16-mode RAG Failure ProblemMap:
- WFGY main repo: https://github.com/onestardao/WFGY
- 16-mode ProblemMap (RAG failure checklist): https://github.com/onestardao/WFGY/blob/main/ProblemMap/README.md
Recently, parts of this work were picked up by:
- Harvard MIMS Lab – ToolUniverse (LLM tools benchmark; WFGY listed in the robustness / RAG debugging section)
- Univ. of Innsbruck – Rankify (academic RAG toolkit; merged WFGY as a reference for RAG / re-ranking troubleshooting)
- QCRI LLM Lab – Multimodal RAG Survey (lists WFGY ProblemMap as a semantic failure-mode taxonomy)
Why I’m reaching out
Most BEIR users I see in the wild are now building RAG pipelines on top of BEIR-style retrievers.
Where things break in practice is often not:
“Is my nDCG@10 on dataset X high enough?”
but rather:
- No.1 hallucination & chunk drift – the retrieved context is slightly off, and the model confidently fabricates
- No.2 interpretation collapse – retrieval is fine, but the question is semantically mis-parsed
- No.5 semantic ≠ embedding – a high cosine score does not guarantee true semantic match, especially for long or symbolic queries
- No.8 debugging is a black box – there is no clear mapping from a failure to "what to inspect next"
The 16-mode ProblemMap is basically a RAG failure map built from real production incidents.
A lot of people now pair BEIR-style scores with this checklist to debug where the pipeline is actually leaking.
Proposal (very small, docs-only)
Would you be open to a small docs addition in BEIR, something like:
RAG debugging & failure modes
For teams building RAG systems on top of BEIR retrievers, you may find the WFGY 16-mode RAG ProblemMap useful as a practical failure-mode checklist (hallucination & chunk drift, semantic ≠ embedding, retrieval traceability, multi-agent chaos, etc.).
See: https://github.com/onestardao/WFGY/blob/main/ProblemMap/README.md
Optionally I could also contribute a short mapping table in the docs, e.g.:
- which BEIR datasets tend to surface No.1 vs No.5 vs No.8,
- and a 1-page “how to combine BEIR metrics + ProblemMap checklist when debugging a RAG stack”.
This would stay strictly on the documentation side, so there would be no impact on core code or benchmarks.
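To make the "BEIR metrics + checklist" idea concrete, here is a minimal sketch of what such a docs page could show. It is pure Python (it does not use the `beir` package), and the triage labels and the 0.7 threshold are purely illustrative assumptions of mine, not part of BEIR or the ProblemMap: the point is only that a per-query nDCG@10 can route you toward "inspect retrieval" vs. "inspect generation/parsing".

```python
import math

def ndcg_at_k(qrels, ranked_ids, k=10):
    """nDCG@k for a single query with graded relevance.

    qrels: {doc_id: relevance_grade} for this query
    ranked_ids: doc ids in the order the retriever returned them
    """
    # DCG over the top-k retrieved documents (log2(rank + 2): rank is 0-based)
    dcg = sum(
        qrels.get(doc_id, 0) / math.log2(rank + 2)
        for rank, doc_id in enumerate(ranked_ids[:k])
    )
    # Ideal DCG: the same formula over the best possible ordering
    ideal = sorted(qrels.values(), reverse=True)[:k]
    idcg = sum(rel / math.log2(rank + 2) for rank, rel in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

# Hypothetical per-query triage: a strong retrieval score with a wrong final
# answer points away from retrieval (e.g. No.2 interpretation collapse),
# while a weak score points at retrieval itself (e.g. No.5 semantic != embedding).
qrels = {"d1": 1, "d3": 1}          # two relevant documents
ranked = ["d1", "d2", "d3"]         # retriever output order
score = ndcg_at_k(qrels, ranked, k=10)
next_step = (
    "inspect generation/parsing" if score >= 0.7  # threshold is illustrative
    else "inspect retrieval/embeddings"
)
```

The design point the docs could make here is that triage like this needs per-query scores: a corpus-level nDCG average can look healthy while individual queries leak badly.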
Why this might be useful for BEIR users
From what I see in the community, there is a big gap between:
- benchmarking retrieval quality, and
- operationalizing those results into an actual RAG failure budget.
Putting a small, research-backed failure taxonomy next to BEIR in the docs could help teams:
- reason about which modes of failure BEIR covers well,
- and where they still need extra monitoring or tests.
If this sounds aligned with your roadmap, I’m happy to open a very small PR with the wording and table, and adjust it to your feedback.
If not, feel free to close – I mainly wanted to share the ProblemMap in case it is useful for BEIR users.