feat: integrate structured extraction and multimodal role-based pipeline by MrGidea · Pull Request #2756 · HKUDS/LightRAG

MrGidea · 2026-03-08T08:17:51Z

Description

This pull request supersedes the previous JSON structured extraction PR and combines that work with the newer multimodal pipeline and role-based model routing changes, while explicitly excluding the entity disambiguation experiment.

On the extraction side, it replaces delimiter-based entity extraction with JSON structured output to improve robustness and compatibility across providers and smaller models. On top of that, it integrates the RAG-Anything parse -> analyze -> process flow into LightRAG's document pipeline, adds role-specific model routing for extract, keyword, query, and VLM, improves DOCX/interchange ingestion and multimodal sidecar writeback, and hardens relation VDB operations with timeout and logging improvements.
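
As a rough illustration of why JSON output is easier to consume than delimiter-separated text, here is a minimal sketch of a parser for such a payload. The key names (`entities`, `relationships`, `name`, `description`) and both helper functions are assumptions made for this example; they are not taken from the PR's code.

```python
import json


def looks_like_json(raw: str) -> bool:
    """Cheap format check: True if the payload parses as a JSON object."""
    try:
        return isinstance(json.loads(raw), dict)
    except (json.JSONDecodeError, TypeError):
        return False


def parse_extraction(raw: str) -> dict:
    """Parse a JSON extraction payload (hypothetical schema).

    Entities without a name and relationships without a non-empty
    description are dropped, mirroring the "skip empty descriptions" rule.
    """
    data = json.loads(raw)
    entities = [e for e in data.get("entities", []) if e.get("name")]
    relationships = [
        r for r in data.get("relationships", [])
        if (r.get("description") or "").strip()
    ]
    return {"entities": entities, "relationships": relationships}
```

A check like `looks_like_json` is one possible way to choose between the JSON and delimiter parsers when rebuilding from cached responses; the PR's actual detection logic may differ.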

Related Issues

Supersedes closed PR feat: Entity extraction uses JSON structured output instead of delimiter-based text #2684
No separate issue is referenced for this combined update

Changes Made

Preserve and extend the JSON structured extraction changes from the previous PR:
- replace delimiter-based entity extraction with JSON structured output
- support native JSON mode for OpenAI-compatible APIs (response_format: json_object), Ollama (format="json"), and Gemini (response_mime_type="application/json")
- add provider fallback logic when native JSON response formatting is unsupported
- keep backward compatibility through ENTITY_EXTRACTION_USE_JSON
- auto-detect JSON vs delimiter format during cache rebuild
- skip relationships with empty descriptions to avoid merge errors
Add multimodal document pipeline integration:
- integrate the RAG-Anything parse -> analyze -> process flow into the LightRAG document pipeline
- add structured interchange ingestion for DOCX and related parsing helpers
- write multimodal LightRAG document artifacts and sidecars for drawings, tables, and equations
- preserve heading context and correct table dimensions in generated sidecar data
- normalize multimodal analysis outputs, such as the boolean grounded value
Add role-based model routing:
- add per-role LLM/VLM configuration for extract, keyword, query, and VLM
- route role-specific calls through server configuration and query/generation paths
- expose role configuration in health reporting
Improve relation merge robustness:
- add defensive timeout handling for relation VDB upserts
- add finer-grained logging around relation/entity upsert stages
- improve observability for edge-processing waits and pending tasks
Add/extend tests:
- extend extraction tests for JSON structured extraction behavior
- add regression tests for multimodal sidecar writeback and timeout behavior
Add reviewer context for companion parser-side work:
- add docs/RAGAnythingParserAlignment.md to document the related parser alignment changes already made on the RAG-Anything side
- clarify how heading preservation, table normalization, and parser output consistency relate to the LightRAG-side changes in this PR
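
The defensive timeout around relation VDB upserts listed above could look roughly like the following sketch. It assumes the upsert is an async callable; `upsert_with_timeout`, the logger name, and the 30-second default are hypothetical and chosen only for illustration.

```python
import asyncio
import logging

logger = logging.getLogger("relation_upsert_sketch")  # hypothetical logger name


async def upsert_with_timeout(upsert, payload, timeout_s: float = 30.0) -> bool:
    """Await `upsert(payload)`; log a warning and return False on timeout."""
    try:
        await asyncio.wait_for(upsert(payload), timeout=timeout_s)
        return True
    except asyncio.TimeoutError:
        # wait_for cancels the pending upsert before raising.
        logger.warning(
            "relation upsert timed out after %.1fs (%d items)",
            timeout_s,
            len(payload),
        )
        return False
```

Returning a flag instead of re-raising lets the caller decide whether a slow upsert should abort the merge or only be logged.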

Checklist

Changes tested locally
Code reviewed
Documentation updated (if necessary)
Unit tests added (if applicable)

Additional Notes

The earlier JSON structured extraction PR was closed in favor of this combined PR.
docs/RAGAnythingParserAlignment.md is included as reviewer context for companion parser-side changes made in RAG-Anything; it is documentation only and does not add external repository code into this PR.
The LightRAG-side changes in this PR can run independently, but the best end-to-end multimodal structure quality depends on aligned parser output from the RAG-Anything side.
Local verification run:
- pytest tests/test_extract_entities.py tests/test_pipeline_release_closure.py -q
- result: 16 passed

…ter-based text

- Add EntityExtractionResult Pydantic model for structured JSON output
- Add JSON-mode prompt templates for entity/relationship extraction
- Add _process_json_extraction_result() JSON parser in extraction pipeline
- Add entity_extraction_use_json config option, default True
- Add extraction_max_tokens config to prevent output truncation
- OpenAI: use response_format json_object with auto-fallback retry
- Ollama/Gemini: use native JSON mode for entity extraction
- Other providers: pop entity_extraction kwarg for compatibility
- Cache rebuild auto-detects JSON vs delimiter format
- Skip relationships with empty descriptions to prevent merge errors
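
The "auto-fallback retry" idea can be sketched as follows. This is an assumption-laden illustration, not the PR's implementation: `call` stands in for any provider wrapper, and `JsonModeUnsupported` is a hypothetical exception that such a wrapper would raise when the provider rejects `response_format`.

```python
from typing import Any, Callable


class JsonModeUnsupported(Exception):
    """Hypothetical error raised when a provider rejects native JSON mode."""


def complete_with_json_fallback(
    call: Callable[..., str], prompt: str, **kwargs: Any
) -> str:
    """Try native JSON mode first; retry once without it if rejected."""
    try:
        return call(prompt, response_format={"type": "json_object"}, **kwargs)
    except JsonModeUnsupported:
        # Fall back to a plain completion; the prompt itself must then
        # instruct the model to answer in JSON.
        return call(prompt, **kwargs)
```

A real client would map its own "unsupported parameter" error onto this path instead of the placeholder exception used here.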


Bring the RAG-Anything parsing/analyze flow into LightRAG's document pipeline and let extract, keyword, query, and VLM roles run with independent model settings. This keeps structured extraction, DOCX interchange ingestion, and relation merge hardening upstreamable without including the entity disambiguation experiment. Made-with: Cursor
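
One way to picture independent per-role model settings is a lookup with a shared default. The sketch below is illustrative only: `ModelConfig`, `resolve_model`, and the model names in the usage note are hypothetical, and only the four role names come from this PR.

```python
from dataclasses import dataclass
from typing import Mapping, Optional

ROLES = ("extract", "keyword", "query", "vlm")


@dataclass(frozen=True)
class ModelConfig:
    """Hypothetical per-role model settings."""
    model: str
    base_url: Optional[str] = None


def resolve_model(
    role: str, overrides: Mapping[str, ModelConfig], default: ModelConfig
) -> ModelConfig:
    """Return the role-specific config, or the default if none is set."""
    if role not in ROLES:
        raise ValueError(f"unknown role: {role!r}")
    return overrides.get(role, default)
```

For example, `resolve_model("extract", {"extract": ModelConfig("small-model")}, ModelConfig("big-model"))` returns the extract-specific config, while `"query"` falls back to the default.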

Document the companion parser-side changes made in RAG-Anything so reviewers can understand how heading and table normalization align with this LightRAG multimodal pipeline PR without including external repository code here. Made-with: Cursor

Add safe example settings for the new structured extraction, parser integration, staged pipeline, and role-specific LLM/VLM routing options while keeping env.example ready for cp env.example .env deployment without exposing private models or secrets. Made-with: Cursor

Align the combined structured extraction and multimodal pipeline branch with the repository pre-commit rules so the upstream PR can pass formatting and lint checks cleanly. Made-with: Cursor

danielaskdd · 2026-03-09T01:22:20Z

Please resolve all conflicts in env.example to ensure the PR is ready for review. The repository now features an interactive setup wizard to streamline .env file generation, and env.example has been updated to serve as its configuration template. Please adhere to the following requirements:

All configurable environment variables must be included in this file (either active or commented out).
Lines starting with # # denote repeated placeholders; these must remain as-is and will not be substituted with actual values by the setup wizard.
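
For contributors checking their env.example changes against these two rules, a small line classifier along these lines may help. It is only a sketch: the function name and the assumption that variable names are upper-case with underscores are mine, not part of the setup wizard.

```python
import re

# Assumes variables look like UPPER_CASE_NAME=value.
_ASSIGN = re.compile(r"^[A-Z][A-Z0-9_]*=")


def classify_env_line(line: str) -> str:
    """Classify a line as 'placeholder', 'active', 'commented', or 'other'."""
    s = line.strip()
    if s.startswith("# #"):
        return "placeholder"  # repeated placeholder, must stay as-is
    if _ASSIGN.match(s):
        return "active"
    if s.startswith("#") and _ASSIGN.match(s.lstrip("#").strip()):
        return "commented"  # variable present but commented out
    return "other"
```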

For more information on the interactive setup tool, please refer to: https://github.com/HKUDS/LightRAG/blob/main/docs/InteractiveSetup.md

MrGidea added 7 commits March 8, 2026 15:59

fix: resolve CI linting - add extraction_max_tokens definition, apply…

c115342

… ruff formatting

refactor: remove extraction_max_tokens (not essential, json_repair ha…

a6da346

…ndles truncation)

fix: apply lint cleanup for multimodal PR

7c940af


danielaskdd added the tracked (Issue is tracked by project) and enhancement (New feature or request) labels on Mar 9, 2026

MrGidea wants to merge 7 commits into HKUDS:main from MrGidea:feat/upstream-combined-no-disambiguation
