@nuzant nuzant commented Sep 11, 2025

What does this PR do?

This PR adds options that apply a patch to SGLang to accelerate its weight loading.

Option 1, enable_multithread_load: SGLang already ships a multithreaded weight-loading path, but the original SGLang code does not use it when updating weights from disk. The patch in this PR applies it to that path. This option is available for all models.
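The general idea behind multithreaded loading can be illustrated with a minimal sketch. It is not the SGLang code or the patch in this PR: the JSON shard format, `load_shard`, `load_shards_multithreaded`, and `write_demo_shards` are hypothetical stand-ins. It only shows reading several checkpoint shards from disk concurrently with a thread pool instead of one after another.

```python
import json
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def load_shard(path: Path) -> dict:
    """Read one shard from disk (hypothetical format: a JSON dict of name -> values)."""
    with path.open("r", encoding="utf-8") as f:
        return json.load(f)


def load_shards_multithreaded(paths: list[Path], max_workers: int = 8) -> dict:
    """Load all shards concurrently and merge them into one name -> values dict."""
    merged: dict = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in input order, so the merge is deterministic
        # even if two shards ever shared a key.
        for shard in pool.map(load_shard, paths):
            merged.update(shard)
    return merged


def write_demo_shards(directory: Path, num_shards: int = 4) -> list[Path]:
    """Create small fake shards so the sketch runs without a real checkpoint."""
    paths = []
    for i in range(num_shards):
        path = directory / f"shard-{i}.json"
        path.write_text(json.dumps({f"layer{i}.weight": [float(i)] * 3}), encoding="utf-8")
        paths.append(path)
    return paths


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        shard_paths = write_demo_shards(Path(tmp))
        weights = load_shards_multithreaded(shard_paths)
        print(sorted(weights))
```

Threads are a reasonable fit in a sketch like this because file reads are I/O-bound. How much a real checkpoint load gains depends on the storage and the tensor format.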

Option 2, enable_fast_load: This enables an optimized, custom weight-loading implementation that the patch in this PR adds to SGLang. It is faster than enable_multithread_load but is only available for Qwen3 and Qwen3MoE models.

Why do we need this PR?

Disk weight loading is simpler and more flexible than NCCL weight loading. This makes it better suited to complex scenarios we may need to support in the future, such as RL with elastic inference servers or heterogeneous hardware.

Example Usage

Set either option in the YAML config or on the command line: sglang.enable_multithread_load=true or sglang.enable_fast_load=true.
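For illustration, the dotted overrides above can be read as setting boolean fields on an `sglang` config section. The sketch below is an assumption about that shape, not the code in cli_args.py: `SGLangLoadOptions` and `apply_override` are hypothetical, and both options are assumed to default to off.

```python
from dataclasses import dataclass


@dataclass
class SGLangLoadOptions:
    # Field names come from the PR description; the False defaults are an assumption.
    enable_multithread_load: bool = False
    enable_fast_load: bool = False


def apply_override(options: SGLangLoadOptions, override: str) -> None:
    """Apply one 'sglang.<field>=<true|false>' override in place (hypothetical parser)."""
    key, _, raw_value = override.partition("=")
    section, _, field_name = key.partition(".")
    known_fields = ("enable_multithread_load", "enable_fast_load")
    if section != "sglang" or field_name not in known_fields:
        raise ValueError(f"unknown option: {key!r}")
    value = raw_value.strip().lower()
    if value not in ("true", "false"):
        raise ValueError(f"expected true or false, got {raw_value!r}")
    setattr(options, field_name, value == "true")


if __name__ == "__main__":
    opts = SGLangLoadOptions()
    apply_override(opts, "sglang.enable_fast_load=true")
    print(opts)
```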

Performance and Correctness

On Qwen3-30B-A3B with allocation mode sglang:d4p1t4+megatron:(attn:d1p4t2c2|ffn:d1p4t2e2), this PR reduces weight-update time from ~60s to ~30s while maintaining correctness.
Performance under other configurations has not been tested yet.

Update

The patch has been updated for and tested on SGLang v0.5.2, and its performance matches the earlier results on v0.4.9.post2.

@garrett4wade
/gemini review

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces an effective optimization for SGLang weight loading by applying a custom patch. The changes are well-structured, adding new configuration options in cli_args.py and the patching logic in launcher.py. The performance improvement from ~60s to ~30s for weight updates is significant.

My review focuses on the integration of the patch. I've identified a critical issue in the patching logic for editable installations that could cause it to fail, along with a couple of medium-severity suggestions to improve robustness and logging. Once these points are addressed, this PR will be a great addition to accelerate model loading.
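The review summary does not spell out the failure, so the following is only a sketch of one common way to find where a package's source lives before patching it. `find_package_dir` is a hypothetical helper, not the logic in launcher.py. It resolves the location through the import system, which points at the source checkout for an editable install instead of assuming a path under site-packages.

```python
import importlib.util
from pathlib import Path


def find_package_dir(package_name: str) -> Path:
    """Return the directory that holds a package's source files."""
    spec = importlib.util.find_spec(package_name)
    if spec is None:
        raise ModuleNotFoundError(f"package {package_name!r} is not installed")
    if spec.submodule_search_locations:
        # Packages expose their directory here; for an editable install this
        # is the source checkout rather than a copy in site-packages.
        return Path(next(iter(spec.submodule_search_locations))).resolve()
    if spec.origin is None or spec.origin in ("built-in", "frozen"):
        raise ValueError(f"{package_name!r} has no source directory on disk")
    # Single-file modules: fall back to the directory that contains the file.
    return Path(spec.origin).resolve().parent


if __name__ == "__main__":
    # 'json' is a stdlib package used as a stand-in so the sketch runs anywhere.
    print(find_package_dir("json"))
```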

@garrett4wade garrett4wade merged commit 0ff615d into main Oct 13, 2025
1 of 4 checks passed
@garrett4wade garrett4wade deleted the mzy/optimize-sglang-load branch October 13, 2025 06:37