docs: add RAG / LLM failure-mode checklist tutorial #6931

Open
onestardao wants to merge 1 commit into flyteorg:master from onestardao:master

@onestardao

Tracking issue

Related to #6930

Why are the changes needed?

RAG / LLM workloads on Flyte are becoming more common, but when a pipeline misbehaves it is still hard for users to reason about where the failure comes from (ingestion, embeddings, retriever, LLM, evaluation, etc.).

This PR adds a pointer to a concrete, MIT-licensed failure-mode checklist that:

  • organizes 16 common failure modes across the main stages of a RAG / LLM pipeline
  • is already referenced by external projects (e.g. ToolUniverse from Harvard MIMS Lab, QCRI LLM Lab’s Multimodal RAG Survey, and Rankify from the University of Innsbruck) as a debugging / evaluation aid
  • can serve as a lightweight starting point for Flyte users who are building similar workflows and need a structured way to debug them

What changes were proposed in this pull request?

  • Add a “RAG / LLM pipeline failure mode checklist” entry to the Tutorials section of README.md.
  • The link points to the WFGY ProblemMap page, which documents a 16-problem checklist for RAG / LLM pipelines (covering data ingestion and chunking, embeddings and vector stores, retrievers and ranking, LLM routing / tools, and evaluation / guardrails).

This is intentionally a minimal docs-only change.
As a follow-up, I am interested in contributing one or two Flyte 2.0 examples that use this checklist to structure RAG / LLM workflows, based on the suggestions in #6930.

How was this patch tested?

  • Rendered the README using GitHub’s preview to verify:
    • the new tutorial entry renders consistently with existing links
    • the link target is correct and reachable
  • No code or configuration changes were made, so no unit or integration tests were added.

Labels

  • added: new documentation link / tutorial entry

Setup process

Not applicable – docs-only change.

Screenshots

Not applicable – README text-only change.

Check all the applicable boxes

  • I updated the documentation accordingly.
  • All new and existing tests passed.
  • All commits are signed-off.

Related PRs

None.

Docs link

Not applicable – change is in the main README.md only.

Signed-off-by: PSBigBig × MiniPS <psbigbig@onestardao.com>
@codecov

codecov bot commented Feb 25, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 56.95%. Comparing base (b5f898c) to head (14e98fb).

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6931      +/-   ##
==========================================
- Coverage   56.96%   56.95%   -0.01%     
==========================================
  Files         929      929              
  Lines       58156    58156              
==========================================
- Hits        33130    33125       -5     
- Misses      21984    21989       +5     
  Partials     3042     3042              
Flag Coverage Δ
unittests-datacatalog 53.51% <ø> (ø)
unittests-flyteadmin 53.10% <ø> (-0.04%) ⬇️
unittests-flytecopilot 43.06% <ø> (ø)
unittests-flytectl 64.02% <ø> (ø)
unittests-flyteidl 75.71% <ø> (ø)
unittests-flyteplugins 60.14% <ø> (ø)
unittests-flytepropeller 53.64% <ø> (ø)
unittests-flytestdlib 63.29% <ø> (+0.02%) ⬆️

@onestardao
Author

This PR is docs-only (just adding a link to the README). Codecov reports a -0.01% project coverage change coming from indirect changes in flyteadmin (workflow_execution_event_writer.go), but all modified lines in this PR are fully covered.

If needed, I’m happy to rebase so the checks can be re-run, but this looks like a coverage noise issue rather than a change introduced here.
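
For reference, the reported delta can be reproduced from the Hits and Lines figures in the Codecov table above. The sketch below is only an illustration: the helper name `coverage_pct` is hypothetical, and it assumes the displayed percentages are truncated (rounded down) to two decimals, which is consistent with the numbers in the report but is not confirmed by it.

```python
import math


def coverage_pct(hits: int, lines: int, precision: int = 2) -> float:
    """Coverage percentage, truncated (not rounded) to `precision` decimals.

    Truncation is an assumption made to match the figures shown in the report.
    """
    scale = 10 ** precision
    return math.floor(hits / lines * 100 * scale) / scale


# Figures copied from the coverage diff in the Codecov comment.
LINES = 58156
BASE_HITS = 33130  # base (b5f898c)
HEAD_HITS = 33125  # head (14e98fb)

base = coverage_pct(BASE_HITS, LINES)
head = coverage_pct(HEAD_HITS, LINES)

# Unrounded change in percentage points caused by the 5 lost hits.
raw_delta = (HEAD_HITS - BASE_HITS) / LINES * 100

print(f"base={base:.2f}%  head={head:.2f}%  raw delta={raw_delta:+.4f} pp")
```

Under that assumption the script yields 56.96% for base and 56.95% for head. The underlying change is five hits out of 58156 lines, or roughly -0.0086 percentage points, which only shows up as -0.01% because of where the truncation boundary falls.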
