
feat: integrate structured extraction and multimodal role-based pipeline #2756

Open
MrGidea wants to merge 7 commits into HKUDS:main from MrGidea:feat/upstream-combined-no-disambiguation


Contributor

@MrGidea MrGidea commented Mar 8, 2026

Description

This pull request supersedes the previous JSON structured extraction PR and combines that work with the newer multimodal pipeline and role-based model routing changes, while explicitly excluding the entity disambiguation experiment.

On the extraction side, it replaces delimiter-based entity extraction with JSON structured output to improve robustness and compatibility across providers and smaller models. On top of that, it integrates the RAG-Anything parse -> analyze -> process flow into LightRAG's document pipeline, adds role-specific model routing for extract, keyword, query, and VLM, improves DOCX/interchange ingestion and multimodal sidecar writeback, and hardens relation VDB operations with timeout and logging improvements.
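The timeout hardening mentioned for relation VDB operations could look roughly like the sketch below. It is not the code from this PR: `upsert_with_timeout`, its parameters, and the logger name are hypothetical, and the only real API used is `asyncio.wait_for`.

```python
import asyncio
import logging
from typing import Any, Awaitable, Callable, Dict

logger = logging.getLogger("relation_merge_sketch")  # hypothetical logger name


async def upsert_with_timeout(
    upsert: Callable[[Dict[str, Any]], Awaitable[None]],
    payload: Dict[str, Any],
    timeout_s: float = 30.0,
) -> bool:
    """Await upsert(payload); return False instead of hanging if it exceeds timeout_s."""
    logger.debug("relation upsert started (%d items)", len(payload))
    try:
        # wait_for cancels the inner coroutine when the timeout elapses.
        await asyncio.wait_for(upsert(payload), timeout=timeout_s)
    except asyncio.TimeoutError:
        logger.warning(
            "relation upsert timed out after %.1fs (%d items)", timeout_s, len(payload)
        )
        return False
    logger.debug("relation upsert finished (%d items)", len(payload))
    return True
```

Returning a boolean rather than re-raising lets the caller decide whether a timed-out upsert should fail the whole merge or only be logged; which of those this PR chooses is not stated here.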

Related Issues

Changes Made

  • Preserve and extend the JSON structured extraction changes from the previous PR:

    • replace delimiter-based entity extraction with JSON structured output
    • support native JSON mode for OpenAI-compatible APIs (response_format: json_object), Ollama (format="json"), and Gemini (response_mime_type="application/json")
    • add provider fallback logic when native JSON response formatting is unsupported
    • keep backward compatibility through ENTITY_EXTRACTION_USE_JSON
    • auto-detect JSON vs delimiter format during cache rebuild
    • skip relationships with empty descriptions to avoid merge errors
  • Add multimodal document pipeline integration:

    • integrate the RAG-Anything parse -> analyze -> process flow into the LightRAG document pipeline
    • add structured interchange ingestion for DOCX and related parsing helpers
    • write multimodal LightRAG document artifacts and sidecars for drawings, tables, and equations
    • preserve heading context and correct table dimensions in generated sidecar data
    • normalize multimodal analysis outputs, such as the boolean grounded flag
  • Add role-based model routing:

    • add per-role LLM/VLM configuration for extract, keyword, query, and VLM
    • route role-specific calls through server configuration and query/generation paths
    • expose role configuration in health reporting
  • Improve relation merge robustness:

    • add defensive timeout handling for relation VDB upserts
    • add finer-grained logging around relation/entity upsert stages
    • improve observability for edge-processing waits and pending tasks
  • Add/extend tests:

    • extend extraction tests for JSON structured extraction behavior
    • add regression tests for multimodal sidecar writeback and timeout behavior
  • Add reviewer context for companion parser-side work:

    • add docs/RAGAnythingParserAlignment.md to document the related parser alignment changes already made on the RAG-Anything side
    • clarify how heading preservation, table normalization, and parser output consistency relate to the LightRAG-side changes in this PR
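As a rough illustration of the native JSON modes and the provider fallback listed under the structured extraction changes above, here is a minimal sketch. The three option values are the ones named in this description; the function names, the provider keys, and the way options are passed to the client callable are assumptions, not this PR's code.

```python
from typing import Any, Callable, Dict

# Option values as named in the PR description. How each client actually
# receives them (top-level kwarg, generation config, ...) is an assumption.
JSON_MODE_OPTIONS: Dict[str, Dict[str, Any]] = {
    "openai": {"response_format": {"type": "json_object"}},
    "ollama": {"format": "json"},
    "gemini": {"response_mime_type": "application/json"},
}


def json_mode_options(provider: str) -> Dict[str, Any]:
    """Return the JSON-mode options for a provider, or {} if none are known."""
    return dict(JSON_MODE_OPTIONS.get(provider, {}))


def call_with_json_fallback(
    call: Callable[..., str], prompt: str, provider: str
) -> str:
    """Try the provider's native JSON mode first, then retry without it."""
    options = json_mode_options(provider)
    if not options:
        return call(prompt)
    try:
        return call(prompt, **options)
    except Exception:
        # Assumption: any failure is treated as "JSON mode unsupported".
        # Real code would likely narrow this to the provider's error type.
        return call(prompt)
```

With the fallback path, the prompt itself still has to ask for JSON, since the retry no longer constrains the response format.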

Checklist

  • Changes tested locally
  • Code reviewed
  • Documentation updated (if necessary)
  • Unit tests added (if applicable)

Additional Notes

  • The earlier JSON structured extraction PR was closed in favor of this combined PR.
  • docs/RAGAnythingParserAlignment.md is included as reviewer context for companion parser-side changes made in RAG-Anything; it is documentation only and does not add external repository code into this PR.
  • The LightRAG-side changes in this PR can run independently, but the best end-to-end multimodal structure quality depends on aligned parser output from the RAG-Anything side.
  • Local verification run:
    • pytest tests/test_extract_entities.py tests/test_pipeline_release_closure.py -q
    • result: 16 passed

MrGidea added 7 commits March 8, 2026 15:59
…ter-based text

- Add EntityExtractionResult Pydantic model for structured JSON output
- Add JSON-mode prompt templates for entity/relationship extraction
- Add _process_json_extraction_result() JSON parser in extraction pipeline
- Add entity_extraction_use_json config option, default True
- Add extraction_max_tokens config to prevent output truncation
- OpenAI: use response_format json_object with auto-fallback retry
- Ollama/Gemini: use native JSON mode for entity extraction
- Other providers: pop entity_extraction kwarg for compatibility
- Cache rebuild auto-detects JSON vs delimiter format
- Skip relationships with empty descriptions to prevent merge errors
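The format auto-detection and the empty-description filter from the list above might be sketched as follows. The field names (`entities`, `relationships`, `name`, `description`) and both function names are hypothetical; the actual schema lives in the PR's `EntityExtractionResult` model, which is not shown here.

```python
import json
from typing import Any, Dict, List


def detect_format(raw: str) -> str:
    """Return "json" if raw parses as a JSON object, otherwise "delimiter"."""
    try:
        return "json" if isinstance(json.loads(raw), dict) else "delimiter"
    except (json.JSONDecodeError, TypeError):
        return "delimiter"


def parse_json_extraction(raw: str) -> Dict[str, List[Dict[str, Any]]]:
    """Parse a JSON extraction result, dropping relationships without a description."""
    data = json.loads(raw)
    # Hypothetical schema: each entity has a "name", each relationship a "description".
    entities = [e for e in data.get("entities", []) if e.get("name")]
    relationships = [
        r
        for r in data.get("relationships", [])
        if (r.get("description") or "").strip()
    ]
    return {"entities": entities, "relationships": relationships}
```

A cache rebuild could call `detect_format` on each cached response and route it to either the JSON parser or the existing delimiter parser, so caches written before the switch stay readable.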
Bring the RAG-Anything parsing/analyze flow into LightRAG's document pipeline and let extract, keyword, query, and VLM roles run with independent model settings. This keeps structured extraction, DOCX interchange ingestion, and relation merge hardening upstreamable without including the entity disambiguation experiment.

Made-with: Cursor
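One way to picture the independent per-role model settings described in the commit above is a small lookup with a shared default. Only the four role names come from this PR; `ModelConfig`, `resolve_model`, and `health_report` are hypothetical names used for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Optional

ROLES = ("extract", "keyword", "query", "vlm")  # role names from the PR description


@dataclass(frozen=True)
class ModelConfig:
    """Hypothetical per-role settings; the real configuration fields may differ."""

    model: str
    base_url: Optional[str] = None


def resolve_model(
    role: str, overrides: Dict[str, ModelConfig], default: ModelConfig
) -> ModelConfig:
    """Return the role-specific config if one is set, else the shared default."""
    if role not in ROLES:
        raise ValueError(f"unknown role: {role}")
    return overrides.get(role, default)


def health_report(
    overrides: Dict[str, ModelConfig], default: ModelConfig
) -> Dict[str, str]:
    """Summarize which model each role resolves to, e.g. for a health endpoint."""
    return {role: resolve_model(role, overrides, default).model for role in ROLES}
```

Falling back to a shared default keeps existing single-model deployments working when no role-specific settings are provided.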
Document the companion parser-side changes made in RAG-Anything so reviewers can understand how heading and table normalization align with this LightRAG multimodal pipeline PR without including external repository code here.

Made-with: Cursor
Add safe example settings for the new structured extraction, parser integration, staged pipeline, and role-specific LLM/VLM routing options while keeping env.example ready for cp env.example .env deployment without exposing private models or secrets.

Made-with: Cursor
Align the combined structured extraction and multimodal pipeline branch with the repository pre-commit rules so the upstream PR can pass formatting and lint checks cleanly.

Made-with: Cursor
@danielaskdd
Collaborator

Please resolve all conflicts in env.example to ensure the PR is ready for review. The repository now features an interactive setup wizard to streamline .env file generation, and env.example has been updated to serve as its configuration template. Please adhere to the following requirements:

  • All configurable environment variables must be included in this file (either active or commented out).
  • Lines starting with # # denote repeated placeholders; these must remain as-is and will not be substituted with actual values by the setup wizard.

For more information on the interactive setup tool, please refer to: https://github.com/HKUDS/LightRAG/blob/main/docs/InteractiveSetup.md

@danielaskdd added the tracked (Issue is tracked by project) and enhancement (New feature or request) labels Mar 9, 2026
