Fix: /file2document/convert blocks event loop on large folders causing 504 timeout #13784
When the same model name exists for multiple types (common with
OpenAI-API-Compatible providers), calling get_api_key without a
model_type filter could return the wrong tenant_llm_id, causing the
tenant_{key} columns to reference an incorrect model type.
This is the same class of bug fixed in PR infiniflow#13569 for
get_model_config_by_type_and_name, now applied consistently to
ensure_tenant_model_id_for_params in tenant_utils.py.
Fixes infiniflow#13775
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
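The ambiguity described in this commit message can be illustrated with a small, self-contained sketch. Everything here is hypothetical: the `TENANT_MODELS` rows, the two-member `LLMType` enum, and `find_tenant_model_id` are stand-ins for illustration and are not the real `TenantLLMService.get_api_key` implementation. The sketch only shows why a name-only lookup can pick the wrong row when one model name is registered under several types.

```python
from enum import Enum


class LLMType(str, Enum):
    # Hypothetical two-member subset of the model types named in the PR.
    CHAT = "chat"
    EMBEDDING = "embedding"


# Hypothetical tenant model rows: one model name registered under two types.
TENANT_MODELS = [
    {"id": 1, "llm_name": "my-model", "model_type": LLMType.EMBEDDING},
    {"id": 2, "llm_name": "my-model", "model_type": LLMType.CHAT},
]

# Maps a parameter key to the model type it should resolve to.
KEY_TO_MODEL_TYPE = {"llm_id": LLMType.CHAT, "embd_id": LLMType.EMBEDDING}


def find_tenant_model_id(name, model_type=None):
    """Return the id of the first matching row, or None if nothing matches.

    Without a model_type filter the first row with a matching name wins,
    whatever its type.
    """
    for row in TENANT_MODELS:
        if row["llm_name"] != name:
            continue
        if model_type is not None and row["model_type"] != model_type:
            continue
        return row["id"]
    return None


unfiltered = find_tenant_model_id("my-model")
filtered = find_tenant_model_id("my-model", KEY_TO_MODEL_TYPE["llm_id"])
print(unfiltered, filtered)
```

In this toy data the unfiltered lookup returns the embedding row even though the caller wanted a chat model, while the filtered lookup returns the chat row.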
Added `inline-block max-w-[120px] truncate align-middle` classes to prevent long usernames from wrapping to multiple lines in the UI.
Fixes infiniflow#13748
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…t 504 timeout
The convert endpoint executed all file lookups, document removals, and insertions synchronously in the request cycle. For large folders this caused 504 Gateway Timeout errors.
Fix: validate inputs and expand folder file IDs upfront, then dispatch the blocking DB work to a thread pool via get_running_loop().run_in_executor so the HTTP response is returned immediately without waiting for completion.
Fixes infiniflow#13781
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Pull request overview
Fixes /file2document/convert request timeouts by moving the heavy synchronous DB work off the async request path so the event loop isn’t blocked during large folder conversions.
Changes:
- Dispatches convert’s delete/insert DB loop to a thread via `asyncio.get_running_loop().run_in_executor(...)` and returns immediately.
- Adds tenant-model lookup disambiguation by passing a specific `LLMType` to `TenantLLMService.get_api_key(...)`.
- Truncates `SharedBadge` UI content to prevent overflow.
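The executor dispatch in the first change can be sketched as follows. This is a minimal illustration under stated assumptions, not the PR's code: `convert_files_blocking`, `convert_handler`, the response dictionaries, and the error code `102` are all hypothetical, and `time.sleep` stands in for a synchronous DB call. Only `asyncio.get_running_loop()` and `loop.run_in_executor(...)` are the real standard-library calls named in the review.

```python
import asyncio
import time


def convert_files_blocking(file_ids, kb_ids):
    """Hypothetical stand-in for the blocking delete/insert DB loop."""
    processed = []
    for file_id in file_ids:
        for kb_id in kb_ids:
            time.sleep(0.01)  # simulates one synchronous DB call
            processed.append((file_id, kb_id))
    return processed


async def convert_handler(file_ids, kb_ids):
    """Validate cheaply, schedule the slow work, and respond right away."""
    if not file_ids or not kb_ids:
        # Hypothetical error payload; the real endpoint has its own helpers.
        return {"code": 102, "data": False}, None
    loop = asyncio.get_running_loop()
    # Passing None selects the event loop's default ThreadPoolExecutor.
    future = loop.run_in_executor(None, convert_files_blocking, file_ids, kb_ids)
    return {"code": 0, "data": True}, future


async def main():
    response, future = await convert_handler(["f1", "f2"], ["kb1", "kb2"])
    # Awaited here only so the demo can inspect the background result;
    # a real handler would have already sent the response.
    processed = await future
    return response, processed


response, processed = asyncio.run(main())
print(response, len(processed))
```

The handler builds its response before the background loop has finished, which is the property that avoids the gateway timeout; the trade-off, raised in the review comments, is that failures in the background job are no longer visible in the response.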
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| api/apps/file2document_app.py | Makes /convert schedule conversion work in an executor instead of doing it inline. |
| api/utils/tenant_utils.py | Adds key→LLMType mapping and passes model_type into tenant model lookup. |
| web/src/components/shared-badge.tsx | Adds truncation/width styling to prevent badge overflow. |
    for kb_id in kb_ids:
        e, kb = KnowledgebaseService.get_by_id(kb_id)
        if not e:
            continue
        e, file = FileService.get_by_id(id)
        if not e:
            continue
_convert_files calls FileService.get_by_id(id) inside the for kb_id in kb_ids loop, so the same file is re-fetched once per KB. Fetch the file once per file_id (outside the KB loop) to avoid redundant DB calls, which can be significant for large folders and many datasets.
    - for kb_id in kb_ids:
    -     e, kb = KnowledgebaseService.get_by_id(kb_id)
    -     if not e:
    -         continue
    -     e, file = FileService.get_by_id(id)
    -     if not e:
    -         continue
    + # Fetch the file once per file_id to avoid redundant DB calls in the kb_ids loop.
    + e, file = FileService.get_by_id(id)
    + if not e:
    +     continue
    + for kb_id in kb_ids:
    +     e, kb = KnowledgebaseService.get_by_id(kb_id)
    +     if not e:
    +         continue
        return get_data_error_result(
            message="Database error (Document removal)!")
    File2DocumentService.delete_by_file_id(id)
The new async behavior returns success after validating only file_ids. Previously this endpoint validated kb_ids exist (and would fail fast with "Can't find this dataset!"). With the current code, invalid kb_ids will still return 200/code==0 while the background worker silently skips inserts. Please validate all kb_ids up front before scheduling the executor job.
    # Validate all knowledge bases exist before scheduling background work
    for kb_id in kb_ids:
        e, kb = KnowledgebaseService.get_by_id(kb_id)
        if not e or kb is None:
            return get_data_error_result(message="Can't find this dataset!")
api/apps/file2document_app.py
Outdated
    asyncio.get_running_loop().run_in_executor(
        None, _convert_files, all_file_ids, kb_ids, user_id
    )
run_in_executor(...) is fire-and-forget here: the returned Future isn’t stored or awaited, so exceptions raised in _convert_files can become "Future exception was never retrieved" warnings and failures are effectively invisible to callers. Add a done-callback (or store the Future in a task registry) to log/handle exceptions, and consider limiting concurrency with a dedicated executor if this endpoint can be called frequently.
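One way to follow this suggestion is sketched below. It is an illustration under assumptions, not the code that was merged: `failing_convert`, `log_background_failure`, the `logged_errors` list, and the pool size of 2 are hypothetical. The standard-library pieces are `ThreadPoolExecutor`, `loop.run_in_executor`, and `Future.add_done_callback`; calling `future.exception()` inside the callback is what marks the exception as retrieved.

```python
import asyncio
import logging
from concurrent.futures import ThreadPoolExecutor

logger = logging.getLogger("convert_demo")
logged_errors = []  # collected so the demo can show what was reported

# A dedicated, bounded pool (hypothetical size) caps concurrent conversions.
convert_executor = ThreadPoolExecutor(max_workers=2, thread_name_prefix="convert")


def failing_convert(file_ids):
    """Hypothetical background job that fails part-way through."""
    raise RuntimeError(f"conversion failed for {len(file_ids)} files")


def log_background_failure(future):
    """Done-callback: retrieve the exception so it is logged, not lost."""
    if future.cancelled():
        return
    exc = future.exception()
    if exc is not None:
        logged_errors.append(str(exc))
        logger.error("background convert failed: %s", exc)


async def main():
    loop = asyncio.get_running_loop()
    future = loop.run_in_executor(convert_executor, failing_convert, ["f1", "f2"])
    future.add_done_callback(log_background_failure)
    # asyncio.wait does not re-raise; it is used only so the demo
    # stays alive until the background job has finished.
    await asyncio.wait([future])


asyncio.run(main())
convert_executor.shutdown(wait=True)
print(logged_errors)
```

The callback runs on the event loop thread once the job finishes, so it is a reasonable place for logging or for updating a task registry. A bounded executor keeps a burst of requests from saturating the loop's default thread pool.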
    asyncio.get_running_loop().run_in_executor(
        None, _convert_files, all_file_ids, kb_ids, user_id
    )
    return get_json_result(data=True)
This endpoint’s contract changed substantially (now returns data=True immediately and no longer surfaces per-document/per-dataset errors). There are existing unit tests that assert the old synchronous behavior and response payload (e.g., test/testcases/test_web_api/test_file_app/test_file2document_routes_unit.py::test_convert_branch_matrix_unit), which will fail and should be updated to reflect the new async semantics and response shape.
    from common.constants import LLMType
    from api.db.services.tenant_llm_service import TenantLLMService

    _KEY_TO_MODEL_TYPE = {
        "llm_id": LLMType.CHAT,
        "embd_id": LLMType.EMBEDDING,
        "asr_id": LLMType.SPEECH2TEXT,
        "img2txt_id": LLMType.IMAGE2TEXT,
        "rerank_id": LLMType.RERANK,
        "tts_id": LLMType.TTS,
    }
This file introduces model-type mapping logic for tenant model IDs, but the PR title/description focus on /file2document/convert timeout behavior. If this change is intentional, please mention it in the PR description (or split it into a separate PR) to keep scope and review risk clear.
    tenant_id = DocumentService.get_tenant_id(doc_id)
    if tenant_id:
        DocumentService.remove_document(doc, tenant_id)
    File2DocumentService.delete_by_file_id(id)
In _convert_files, DocumentService.get_tenant_id(doc_id) can return None (e.g., when the document’s knowledgebase is not VALID). The current code skips remove_document in that case but still deletes the File2Document rows, which can leave orphan Document rows and incorrect KB counters/chunk-store state. Consider handling the None tenant case explicitly (e.g., log and still delete the document + update KB counts, or abort without deleting mappings).
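The "abort without deleting mappings" option mentioned above could look roughly like the sketch below. All names are hypothetical: `remove_linked_documents` and its callable parameters stand in for the service methods named in the comment and do not reflect their real signatures. The idea is to resolve every tenant first and touch nothing if any lookup returns `None`.

```python
import logging

logger = logging.getLogger("convert_demo")


def remove_linked_documents(file_id, doc_ids, get_tenant_id, remove_document, delete_mappings):
    """Remove a file's documents and mappings, or do nothing at all.

    The callables are hypothetical stand-ins for the service methods.
    Returns True when everything was removed, False when the operation
    was aborted because a tenant could not be resolved.
    """
    tenants = {doc_id: get_tenant_id(doc_id) for doc_id in doc_ids}
    missing = [doc_id for doc_id, tenant_id in tenants.items() if tenant_id is None]
    if missing:
        # Abort before any deletion so documents and mappings stay consistent.
        logger.warning("no tenant for documents %s; keeping mappings for file %s", missing, file_id)
        return False
    for doc_id, tenant_id in tenants.items():
        remove_document(doc_id, tenant_id)
    delete_mappings(file_id)
    return True


# Demo with in-memory fakes in place of the real services.
removed, deleted = [], []
known_tenants = {"d1": "t1", "d2": None}

aborted_ok = remove_linked_documents(
    "f1", ["d1", "d2"], known_tenants.get, lambda d, t: removed.append(d), deleted.append
)
clean_ok = remove_linked_documents(
    "f2", ["d1"], known_tenants.get, lambda d, t: removed.append(d), deleted.append
)
print(aborted_ok, clean_ok, removed, deleted)
```

Checking all tenants before removing anything avoids the half-finished state the comment warns about, at the cost of leaving the old link in place until the underlying problem is resolved.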
- Move FileService.get_by_id() outside kb loop to avoid redundant DB calls
- Validate kb_ids upfront before scheduling background work
- Log warning when tenant_id is None instead of silently skipping
- Add done-callback to log exceptions from fire-and-forget executor future
Codecov Report
✅ All modified and coverable lines are covered by tests.
| Coverage Diff | main | #13784 | +/- |
|---|---|---|---|
| Coverage | 96.72% | 96.72% | |
| Files | 10 | 10 | |
| Lines | 702 | 702 | |
| Branches | 112 | 112 | |
| Hits | 679 | 679 | |
| Misses | 5 | 5 | |
| Partials | 18 | 18 | |
…nature
- Use files_set.get() with falsy check to catch both missing and invalid files
- Update test_convert_branch_matrix_unit to reflect new async behavior: file and kb validation still synchronous, background errors no longer surfaced
- Add model_type=None to get_api_key mock to match real signature
CI fails:
Problem
The /file2document/convert endpoint ran all file lookups, document deletions, and insertions synchronously inside the
request cycle. Linking a large folder (~1.7GB with many files) caused 504 Gateway Timeout because the blocking DB loop
held the HTTP connection open for too long.
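Fix
Validate inputs and expand folder file IDs upfront, then dispatch the blocking DB work to a thread pool via get_running_loop().run_in_executor so the HTTP response is returned immediately.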
Fixes #13781