[Feature] Add patch to accelerate SGLang weight loading #324
…zy/antcode/optimize-sglang-load
/gemini review
Code Review
This pull request introduces an effective optimization for SGLang weight loading by applying a custom patch. The changes are well-structured, adding new configuration options in cli_args.py and the patching logic in launcher.py. The performance improvement from ~60s to ~30s for weight updates is significant.
My review focuses on the integration of the patch. I've identified a critical issue in the patching logic for editable installations that could cause it to fail, along with a couple of medium-severity suggestions to improve robustness and logging. Once these points are addressed, this PR will be a great addition to accelerate model loading.
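The review does not spell out how the patching logic fails for editable installations. One plausible cause, stated here only as an assumption, is that the code expects the package to sit under site-packages, whereas an editable install points back to a source checkout. A minimal sketch of resolving the package directory through the import system instead (the helper name find_package_dir is hypothetical and is not taken from launcher.py):

```python
import importlib.util
from pathlib import Path


def find_package_dir(package_name):
    """Return the directory the import system resolves for a package, or None.

    Hypothetical helper: it asks the import machinery where the package
    lives rather than assuming a site-packages layout.
    """
    spec = importlib.util.find_spec(package_name)
    if spec is None:
        return None  # package is not installed
    if spec.submodule_search_locations:
        # Regular packages expose their directory here.
        return Path(next(iter(spec.submodule_search_locations))).resolve()
    if spec.origin and spec.origin not in ("built-in", "frozen"):
        # Single-file modules: fall back to the file's parent directory.
        return Path(spec.origin).resolve().parent
    return None
```

A patch step could then be pointed at the returned directory and skipped, with a logged warning, when the result is None.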
What does this PR do?
This PR adds options that apply a patch to SGLang and accelerate its weight loading.
Option 1, enable_multithread_load: This is a native SGLang weight-loading acceleration that the original SGLang code does not apply when updating weights from disk. The patch in this PR fixes that. This option is available for all models.
Option 2, enable_fast_load: This enables an optimized, custom weight-loading implementation that the patch in this PR adds to SGLang. It is faster than enable_multithread_load, but is only available for Qwen3 and Qwen3MoE models.
Why do we need this PR?
Disk weight loading is simpler and more flexible than NCCL weight loading. This makes it well suited to complex future scenarios, such as RL with elastic inference servers or heterogeneous hardware.
Example Usage
Add one of the options in the YAML config or on the command line:
sglang.enable_multithread_load=true
or
sglang.enable_fast_load=true
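As a rough illustration of how these two flags and the Qwen3/Qwen3MoE restriction fit together, here is a minimal sketch. The class SGLangLoadOptions, the helpers parse_overrides and check_options, and the model-type labels are all hypothetical; the actual definitions in cli_args.py may look quite different.

```python
from dataclasses import dataclass


@dataclass
class SGLangLoadOptions:
    """Hypothetical container for the two flags described in this PR."""
    enable_multithread_load: bool = False
    enable_fast_load: bool = False


# Assumed labels for the model families the fast path supports.
FAST_LOAD_MODEL_TYPES = {"qwen3", "qwen3_moe"}


def parse_overrides(args, prefix="sglang."):
    """Build options from `sglang.<flag>=<bool>` style overrides."""
    opts = SGLangLoadOptions()
    for arg in args:
        key, sep, value = arg.partition("=")
        if not sep or not key.startswith(prefix):
            continue  # not an override this sketch handles
        name = key[len(prefix):]
        if name in ("enable_multithread_load", "enable_fast_load"):
            setattr(opts, name, value.strip().lower() == "true")
    return opts


def check_options(opts, model_type):
    """Reject the fast path for model families it does not cover."""
    if opts.enable_fast_load and model_type not in FAST_LOAD_MODEL_TYPES:
        raise ValueError(
            "enable_fast_load is only available for Qwen3 and Qwen3MoE, "
            f"got {model_type!r}"
        )
    return opts
```

Failing early on an unsupported model keeps a misconfigured run from reaching the weight-update step before the problem surfaces.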
Performance and Correctness
On Qwen3-30B-A3B with allocation mode sglang:d4p1t4+megatron:(attn:d1p4t2c2|ffn:d1p4t2e2), this PR accelerates weight updates from ~60s to ~30s while maintaining correctness.
Performance under other conditions has not been tested yet.
Update
The patch has been upgraded for and tested on SGLang v0.5.2, and its performance matches the previous results on v0.4.9.post2.