[Optimization] Qwen2.5-VL support multi-batch prefill
#5269
base: develop
Thanks for your contribution!
Pull request overview
This PR adds multi-batch prefill support for Qwen2.5-VL models to improve throughput and efficiency. The implementation introduces conditional logic controlled by environment variables FD_ENABLE_MAX_PREFILL and FD_ENABLE_E2W_TENSOR_CONVERT.
- Enables multi-batch prefill scheduling for multimodal requests when `FD_ENABLE_MAX_PREFILL` is set
- Adds tensor conversion in the zmq→scheduler pipeline when `FD_ENABLE_E2W_TENSOR_CONVERT` is enabled
- Updates vision feature extraction to handle batched inputs for Qwen models
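
To illustrate the flag-gated scheduling described above, here is a minimal sketch, not the PR's actual code. It assumes the flag is read from the process environment with `"1"` meaning enabled, and that without the flag at most one multimodal request is admitted per prefill step. The helpers `flag_enabled` and `pick_prefill_batch` and the `is_mm` field are hypothetical names introduced only for this example.

```python
import os
from collections import deque


def flag_enabled(name, environ=os.environ):
    """Assumed convention: "1" means on; anything else (or unset) means off."""
    return environ.get(name, "0") == "1"


def pick_prefill_batch(waiting, max_batch, allow_multi_mm):
    """Pop requests from the front of `waiting` for one prefill step.

    Each request is a dict with a boolean "is_mm" (multimodal) field.
    When allow_multi_mm is False, at most one multimodal request is
    admitted per step; when True, only max_batch limits the batch.
    """
    batch = []
    mm_seen = False
    while waiting and len(batch) < max_batch:
        req = waiting[0]
        if req["is_mm"] and mm_seen and not allow_multi_mm:
            break  # preserve FIFO order: stop at the second multimodal request
        mm_seen = mm_seen or req["is_mm"]
        batch.append(waiting.popleft())
    return batch


# Hypothetical usage: three queued multimodal requests.
queue = deque({"id": i, "is_mm": True} for i in range(3))
batch = pick_prefill_batch(
    queue,
    max_batch=8,
    allow_multi_mm=flag_enabled("FD_ENABLE_MAX_PREFILL"),
)
print([req["id"] for req in batch])
```

With the flag unset, this sketch admits only the first request; with `FD_ENABLE_MAX_PREFILL=1` it admits all three in a single step.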
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| `fastdeploy/worker/gpu_model_runner.py` | Modified `_apply_mm_inputs` to conditionally batch vision inputs and updated `extract_vision_features_qwen` to handle batched image tensors |
| `fastdeploy/entrypoints/engine_client.py` | Added conditional tensor conversion in `_send_task` before sending tasks via ZMQ |
| `fastdeploy/engine/sched/resource_manager_v1.py` | Updated scheduling logic to allow multiple multimodal prefill requests when max prefill is enabled |
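
The batched vision-feature extraction mentioned in the table can be pictured with the following sketch. It is an illustration under stated assumptions rather than the code in `gpu_model_runner.py`: the encoder is a stand-in that averages each patch, patches are plain Python lists instead of tensors, and `encode_patches` and `extract_features_batched` are hypothetical names. The point is the pattern: flatten all requests' patches, call the encoder once, then split the result back per request.

```python
from itertools import accumulate


def encode_patches(patches):
    """Stand-in for a vision encoder: maps each patch (a list of floats) to its mean."""
    return [sum(p) / len(p) for p in patches]


def extract_features_batched(per_request_patches):
    """Run the encoder once over every request's patches, then split per request."""
    lengths = [len(patches) for patches in per_request_patches]
    flat = [patch for patches in per_request_patches for patch in patches]
    features = encode_patches(flat)  # single encoder call for the whole batch
    ends = list(accumulate(lengths))  # exclusive end offset of each request
    starts = [0] + ends[:-1]          # start offset of each request
    return [features[s:e] for s, e in zip(starts, ends)]


# Hypothetical usage: request 0 has one patch, request 1 has two.
result = extract_features_batched([[[1.0, 3.0]], [[2.0, 2.0], [0.0, 4.0]]])
print(result)
```

Tracking per-request lengths is what lets a single encoder call serve several requests while each request still receives only its own features.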
Co-authored-by: Copilot <[email protected]>
Codecov Report
❌ Patch coverage is …
Additional details and impacted files
@@ Coverage Diff @@
## develop #5269 +/- ##
==========================================
Coverage ? 59.06%
==========================================
Files ? 324
Lines ? 40065
Branches ? 6056
==========================================
Hits ? 23666
Misses ? 14529
Partials ? 1870
Motivation
none
Modifications
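- Add tensor conversion in `zmq->local/global scheduler` when `FD_ENABLE_E2W_TENSOR_CONVERT` is enabled.
- Allow multi-batch prefill scheduling for multimodal requests when `FD_ENABLE_MAX_PREFILL` is enabled.
- Update `extract_vision_features_qwen` to handle batched image tensors.
Usage or Command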
Accuracy Tests
none
Checklist
- Tag list: [`[FDConfig]`, `[APIServer]`, `[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
- Run `pre-commit` before commit.
- If the PR targets the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.