feat: integrate structured extraction and multimodal role-based pipeline#2756
Open
MrGidea wants to merge 7 commits into HKUDS:main from …
…ter-based text
- Add EntityExtractionResult Pydantic model for structured JSON output
- Add JSON-mode prompt templates for entity/relationship extraction
- Add _process_json_extraction_result() JSON parser in extraction pipeline
- Add entity_extraction_use_json config option, default True
- Add extraction_max_tokens config to prevent output truncation
- OpenAI: use response_format json_object with auto-fallback retry
- Ollama/Gemini: use native JSON mode for entity extraction
- Other providers: pop entity_extraction kwarg for compatibility
- Cache rebuild auto-detects JSON vs delimiter format
- Skip relationships with empty descriptions to prevent merge errors
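To make the parsing items above concrete for reviewers, here is a minimal sketch of how a JSON extraction result might be detected and parsed. It is not the code from this PR: the field names (`name`, `type`, `description`, `source`, `target`), the `Entity`/`Relationship` dataclasses, and both helper functions are hypothetical, and plain dataclasses stand in for the Pydantic model to keep the example dependency-free.

```python
import json
from dataclasses import dataclass


@dataclass
class Entity:
    name: str
    type: str
    description: str


@dataclass
class Relationship:
    source: str
    target: str
    description: str


def _text(record: dict, key: str) -> str:
    """Return a stripped string field, or "" if it is missing or not a string."""
    value = record.get(key)
    return value.strip() if isinstance(value, str) else ""


def looks_like_json(raw: str) -> bool:
    """Heuristic format check: True if the text parses as a JSON object."""
    try:
        return isinstance(json.loads(raw.strip()), dict)
    except json.JSONDecodeError:
        return False


def parse_json_extraction(raw: str) -> tuple[list[Entity], list[Relationship]]:
    """Parse a JSON extraction result (hypothetical schema), dropping unusable records."""
    data = json.loads(raw)
    entities = []
    for item in data.get("entities") or []:
        if not isinstance(item, dict) or not _text(item, "name"):
            continue  # an entity without a name cannot be merged
        entities.append(
            Entity(_text(item, "name"), _text(item, "type") or "unknown", _text(item, "description"))
        )
    relationships = []
    for item in data.get("relationships") or []:
        if not isinstance(item, dict):
            continue
        source, target = _text(item, "source"), _text(item, "target")
        description = _text(item, "description")
        if not (source and target and description):
            continue  # skip relationships with empty descriptions or endpoints
        relationships.append(Relationship(source, target, description))
    return entities, relationships


raw = json.dumps({
    "entities": [{"name": "LightRAG", "type": "software", "description": "RAG framework"}],
    "relationships": [{"source": "LightRAG", "target": "RAG-Anything", "description": ""}],
})
entities, relationships = parse_json_extraction(raw)
```

In this sketch a cached response would be sent to the JSON parser when `looks_like_json` returns True and to the delimiter parser otherwise; the relationship in the sample input is dropped because its description is empty.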
…ndles truncation)
Bring the RAG-Anything parsing/analyze flow into LightRAG's document pipeline and let extract, keyword, query, and VLM roles run with independent model settings. This keeps structured extraction, DOCX interchange ingestion, and relation merge hardening upstreamable without including the entity disambiguation experiment. Made-with: Cursor
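As an illustration of what independent per-role model settings could look like, the sketch below resolves settings for the four roles from a mapping of configuration values and falls back to a shared default. The `ModelSettings` class, the `resolve_role_settings` function, and the `<ROLE>_MODEL` / `<ROLE>_MAX_TOKENS` keys are assumptions made for this example, not the option names introduced by this PR.

```python
from dataclasses import dataclass
from typing import Mapping, Optional

ROLES = ("extract", "keyword", "query", "vlm")


@dataclass(frozen=True)
class ModelSettings:
    model: str
    max_tokens: Optional[int] = None


def resolve_role_settings(
    env: Mapping[str, str], default: ModelSettings
) -> dict[str, ModelSettings]:
    """Build per-role settings, falling back to the default for unset values."""
    settings = {}
    for role in ROLES:
        prefix = role.upper()  # hypothetical key prefix, e.g. EXTRACT_MODEL
        model = env.get(f"{prefix}_MODEL") or default.model
        raw_max = env.get(f"{prefix}_MAX_TOKENS")
        max_tokens = int(raw_max) if raw_max else default.max_tokens
        settings[role] = ModelSettings(model, max_tokens)
    return settings


default = ModelSettings("base-model", 4096)
routes = resolve_role_settings(
    {
        "EXTRACT_MODEL": "small-json-model",
        "EXTRACT_MAX_TOKENS": "8192",
        "VLM_MODEL": "vision-model",
    },
    default,
)
```

Roles without an override (here `keyword` and `query`) keep the default settings, so a deployment that configures only one model behaves as before.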
Document the companion parser-side changes made in RAG-Anything so reviewers can understand how heading and table normalization align with this LightRAG multimodal pipeline PR without including external repository code here. Made-with: Cursor
Add safe example settings for the new structured extraction, parser integration, staged pipeline, and role-specific LLM/VLM routing options while keeping env.example ready for cp env.example .env deployment without exposing private models or secrets. Made-with: Cursor
Align the combined structured extraction and multimodal pipeline branch with the repository pre-commit rules so the upstream PR can pass formatting and lint checks cleanly. Made-with: Cursor
Collaborator
Please resolve all conflicts in env.example to ensure the PR is ready for review. The repository now features an interactive setup wizard to streamline .env file generation, and env.example has been updated to serve as its configuration template. Please adhere to the following requirements:
For more information on the interactive setup tool, please refer to: https://github.com/HKUDS/LightRAG/blob/main/docs/InteractiveSetup.md
Description
This pull request supersedes the previous JSON structured extraction PR and combines that work with the newer multimodal pipeline and role-based model routing changes, while explicitly excluding the entity disambiguation experiment.
On the extraction side, it replaces delimiter-based entity extraction with JSON structured output to improve robustness and compatibility across providers and smaller models. On top of that, it integrates the RAG-Anything parse -> analyze -> process flow into LightRAG's document pipeline, adds role-specific model routing for extract, keyword, query, and VLM, improves DOCX/interchange ingestion and multimodal sidecar writeback, and hardens relation VDB operations with timeout and logging improvements.

Related Issues
Changes Made
- Preserve and extend the JSON structured extraction changes from the previous PR:
  - … (response_format: json_object), Ollama (format="json"), and Gemini (response_mime_type="application/json")
  - … ENTITY_EXTRACTION_USE_JSON
- Add multimodal document pipeline integration:
  - … parse -> analyze -> process flow into the LightRAG document pipeline
  - … grounded
- Add role-based model routing:
  - … extract, keyword, query, and VLM
- Improve relation merge robustness:
- Add/extend tests:
- Add reviewer context for companion parser-side work:
  - … docs/RAGAnythingParserAlignment.md to document the related parser alignment changes already made on the RAG-Anything side

Checklist
Additional Notes
- docs/RAGAnythingParserAlignment.md is included as reviewer context for companion parser-side changes made in RAG-Anything; it is documentation only and does not add external repository code into this PR.
- … RAG-Anything side.
- … pytest tests/test_extract_entities.py tests/test_pipeline_release_closure.py -q (16 passed)
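For reference, the provider-specific JSON modes named under Changes Made could be selected roughly as follows. This is a hedged sketch rather than the PR's implementation: `json_mode_kwargs` and `prepare_call_kwargs` are hypothetical helpers, the nesting of `response_mime_type` under `generation_config` depends on which Gemini SDK is used, and treating `entity_extraction` as a boolean flag is an assumption. Only the three option values themselves come from the description above.

```python
def json_mode_kwargs(provider: str) -> dict:
    """Return request options that ask a provider for JSON output."""
    if provider == "openai":
        return {"response_format": {"type": "json_object"}}
    if provider == "ollama":
        return {"format": "json"}
    if provider == "gemini":
        # Assumed nesting; some SDK versions take this under a different key.
        return {"generation_config": {"response_mime_type": "application/json"}}
    return {}  # other providers: no native JSON mode assumed


def prepare_call_kwargs(provider: str, kwargs: dict) -> dict:
    """Strip the internal flag and add JSON-mode options when requested."""
    call_kwargs = dict(kwargs)  # copy so the caller's dict is not mutated
    wants_json = call_kwargs.pop("entity_extraction", False)
    if wants_json:
        call_kwargs.update(json_mode_kwargs(provider))
    return call_kwargs


openai_kwargs = prepare_call_kwargs("openai", {"entity_extraction": True, "temperature": 0})
other_kwargs = prepare_call_kwargs("some-other-provider", {"entity_extraction": True})
```

Popping the flag before the request is built keeps providers without a JSON mode from receiving a keyword argument they do not understand.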