Skip to content

Fix: /file2document/convert blocks event loop on large folders causing 504 timeout#13784

Merged
yingfeng merged 8 commits intoinfiniflow:mainfrom
SyedShahmeerAli12:fix/file2document-convert-504-timeout
Mar 26, 2026
Merged

Fix: /file2document/convert blocks event loop on large folders causing 504 timeout#13784
yingfeng merged 8 commits intoinfiniflow:mainfrom
SyedShahmeerAli12:fix/file2document-convert-504-timeout

Conversation

@SyedShahmeerAli12
Copy link
Copy Markdown
Contributor

Problem

The /file2document/convert endpoint ran all file lookups, document deletions, and insertions synchronously inside the
request cycle. Linking a large folder (~1.7GB with many files) caused 504 Gateway Timeout because the blocking DB loop
held the HTTP connection open for too long.

Fix

  • Extracted the heavy DB work into a plain sync function _convert_files
  • Inputs are validated and folder file IDs expanded upfront (fast path)
  • The blocking work is dispatched to a thread pool via get_running_loop().run_in_executor() and the endpoint returns 200
    immediately
  • Frontend only checks data.code === 0 so the response change (file2documents list → True) has no impact

Fixes #13781

SyedShahmeerAli12 and others added 3 commits March 25, 2026 12:45
When the same model name exists for multiple types (common with
OpenAI-API-Compatible providers), calling get_api_key without a
model_type filter could return the wrong tenant_llm_id, causing the
tenant_{key} columns to reference an incorrect model type.

This is the same class of bug fixed in PR infiniflow#13569 for
get_model_config_by_type_and_name, now applied consistently to
ensure_tenant_model_id_for_params in tenant_utils.py.

Fixes infiniflow#13775

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Added `inline-block max-w-[120px] truncate align-middle` classes to
prevent long usernames from wrapping to multiple lines in the UI.

Fixes infiniflow#13748

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…t 504 timeout

The convert endpoint executed all file lookups, document removals, and
insertions synchronously in the request cycle. For large folders this caused
504 Gateway Timeout errors.

Fix: validate inputs and expand folder file IDs upfront, then dispatch the
blocking DB work to a thread pool via get_running_loop().run_in_executor so
the HTTP response is returned immediately without waiting for completion.

Fixes infiniflow#13781

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@dosubot dosubot bot added size:L This PR changes 100-499 lines, ignoring generated files. 🐖api The modified files are located under directory 'api/apps/sdk' 🐞 bug Something isn't working, pull request that fix bug. labels Mar 25, 2026
@yingfeng yingfeng added the ci Continue Integration label Mar 25, 2026
@yingfeng yingfeng marked this pull request as draft March 25, 2026 15:49
@yingfeng yingfeng marked this pull request as ready for review March 25, 2026 15:49
@yingfeng yingfeng requested a review from Copilot March 25, 2026 15:49
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes /file2document/convert request timeouts by moving the heavy synchronous DB work off the async request path so the event loop isn’t blocked during large folder conversions.

Changes:

  • Dispatches convert’s delete/insert DB loop to a thread via asyncio.get_running_loop().run_in_executor(...) and returns immediately.
  • Adds tenant-model lookup disambiguation by passing a specific LLMType to TenantLLMService.get_api_key(...).
  • Truncates SharedBadge UI content to prevent overflow.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 6 comments.

File Description
api/apps/file2document_app.py Makes /convert schedule conversion work in an executor instead of doing it inline.
api/utils/tenant_utils.py Adds key→LLMType mapping and passes model_type into tenant model lookup.
web/src/components/shared-badge.tsx Adds truncation/width styling to prevent badge overflow.

Comment on lines +46 to +52
for kb_id in kb_ids:
e, kb = KnowledgebaseService.get_by_id(kb_id)
if not e:
continue
e, file = FileService.get_by_id(id)
if not e:
continue
Copy link

Copilot AI Mar 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_convert_files calls FileService.get_by_id(id) inside the for kb_id in kb_ids loop, so the same file is re-fetched once per KB. Fetch the file once per file_id (outside the KB loop) to avoid redundant DB calls, which can be significant for large folders and many datasets.

Suggested change
for kb_id in kb_ids:
e, kb = KnowledgebaseService.get_by_id(kb_id)
if not e:
continue
e, file = FileService.get_by_id(id)
if not e:
continue
# Fetch the file once per file_id to avoid redundant DB calls in the kb_ids loop.
e, file = FileService.get_by_id(id)
if not e:
continue
for kb_id in kb_ids:
e, kb = KnowledgebaseService.get_by_id(kb_id)
if not e:
continue

Copilot uses AI. Check for mistakes.
return get_data_error_result(
message="Database error (Document removal)!")
File2DocumentService.delete_by_file_id(id)

Copy link

Copilot AI Mar 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new async behavior returns success after validating only file_ids. Previously this endpoint validated kb_ids exist (and would fail fast with "Can't find this dataset!"). With the current code, invalid kb_ids will still return 200/code==0 while the background worker silently skips inserts. Please validate all kb_ids up front before scheduling the executor job.

Suggested change
# Validate all knowledge bases exist before scheduling background work
for kb_id in kb_ids:
e, kb = KnowledgebaseService.get_by_id(kb_id)
if not e or kb is None:
return get_data_error_result(message="Can't find this dataset!")

Copilot uses AI. Check for mistakes.
Comment on lines +103 to +105
asyncio.get_running_loop().run_in_executor(
None, _convert_files, all_file_ids, kb_ids, user_id
)
Copy link

Copilot AI Mar 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

run_in_executor(...) is fire-and-forget here: the returned Future isn’t stored or awaited, so exceptions raised in _convert_files can become "Future exception was never retrieved" warnings and failures are effectively invisible to callers. Add a done-callback (or store the Future in a task registry) to log/handle exceptions, and consider limiting concurrency with a dedicated executor if this endpoint can be called frequently.

Copilot uses AI. Check for mistakes.
asyncio.get_running_loop().run_in_executor(
None, _convert_files, all_file_ids, kb_ids, user_id
)
return get_json_result(data=True)
Copy link

Copilot AI Mar 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This endpoint’s contract changed substantially (now returns data=True immediately and no longer surfaces per-document/per-dataset errors). There are existing unit tests that assert the old synchronous behavior and response payload (e.g., test/testcases/test_web_api/test_file_app/test_file2document_routes_unit.py::test_convert_branch_matrix_unit), which will fail and should be updated to reflect the new async semantics and response shape.

Copilot generated this review using guidance from repository custom instructions.
Comment on lines +16 to +26
from common.constants import LLMType
from api.db.services.tenant_llm_service import TenantLLMService

_KEY_TO_MODEL_TYPE = {
"llm_id": LLMType.CHAT,
"embd_id": LLMType.EMBEDDING,
"asr_id": LLMType.SPEECH2TEXT,
"img2txt_id": LLMType.IMAGE2TEXT,
"rerank_id": LLMType.RERANK,
"tts_id": LLMType.TTS,
}
Copy link

Copilot AI Mar 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file introduces model-type mapping logic for tenant model IDs, but the PR title/description focus on /file2document/convert timeout behavior. If this change is intentional, please mention it in the PR description (or split it into a separate PR) to keep scope and review risk clear.

Copilot uses AI. Check for mistakes.
Comment on lines +41 to +45
tenant_id = DocumentService.get_tenant_id(doc_id)
if tenant_id:
DocumentService.remove_document(doc, tenant_id)
File2DocumentService.delete_by_file_id(id)

Copy link

Copilot AI Mar 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In _convert_files, DocumentService.get_tenant_id(doc_id) can return None (e.g., when the document’s knowledgebase is not VALID). The current code skips remove_document in that case but still deletes the File2Document rows, which can leave orphan Document rows and incorrect KB counters/chunk-store state. Consider handling the None tenant case explicitly (e.g., log and still delete the document + update KB counts, or abort without deleting mappings).

Copilot uses AI. Check for mistakes.
- Move FileService.get_by_id() outside kb loop to avoid redundant DB calls
- Validate kb_ids upfront before scheduling background work
- Log warning when tenant_id is None instead of silently skipping
- Add done-callback to log exceptions from fire-and-forget executor future
@codecov
Copy link
Copy Markdown

codecov bot commented Mar 25, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 96.72%. Comparing base (e705ac6) to head (16b76aa).
⚠️ Report is 1 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main   #13784   +/-   ##
=======================================
  Coverage   96.72%   96.72%           
=======================================
  Files          10       10           
  Lines         702      702           
  Branches      112      112           
=======================================
  Hits          679      679           
  Misses          5        5           
  Partials       18       18           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

…nature

- Use files_set.get() with falsy check to catch both missing and invalid files
- Update test_convert_branch_matrix_unit to reflect new async behavior:
  file and kb validation still synchronous, background errors no longer surfaced
- Add model_type=None to get_api_key mock to match real signature
@yingfeng
Copy link
Copy Markdown
Member

CI fails:

=================================== FAILURES ===================================
__________ test_tenant_info_and_set_tenant_info_exception_matrix_unit __________
test/testcases/test_web_api/test_user_app/test_user_app_unit.py:1104: in test_tenant_info_and_set_tenant_info_exception_matrix_unit
    assert "tenant update boom" in res["message"], res
E   AssertionError: {'code': 100, 'message': "TypeError('_load_dialog_module.<locals>._TenantLLMService.get_api_key() takes 2 positional arguments but 3 were given')"}
E   assert 'tenant update boom' in "TypeError('_load_dialog_module.<locals>._TenantLLMService.get_api_key() takes 2 positional arguments but 3 were given')"
=========================== short test summary info ============================
FAILED test/testcases/test_web_api/test_user_app/test_user_app_unit.py::test_tenant_info_and_set_tenant_info_exception_matrix_unit - AssertionError: {'code': 100, 'message': "TypeError('_load_dialog_module.<locals>._TenantLLMService.get_api_key() takes 2 positional arguments but 3 were given')"}
assert 'tenant update boom' in "TypeError('_load_dialog_module.<locals>._TenantLLMService.get_api_key() takes 2 positional arguments but 3 were given')"
==== 1 failed, 718 passed, 26 skipped, 203 deselected in 188.27s (0:03:08) =====
Error: Process completed with exit code 1.

@yingfeng yingfeng merged commit ff92b55 into infiniflow:main Mar 26, 2026
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

🐖api The modified files are located under directory 'api/apps/sdk' 🐞 bug Something isn't working, pull request that fix bug. ci Continue Integration size:L This PR changes 100-499 lines, ignoring generated files.

Projects

None yet

3 participants