vLLM Sleep mode blog #106
@youkaichao @hmellor PTAL
Signed-off-by: PinSiang <[email protected]>
```bash
# Terminal 1: Start Phi-3-vision
export VLLM_SERVER_DEV_MODE=1
```
Is dev mode necessary? If yes, why?
Yes. The dev mode flag is needed because the sleep mode API endpoints are only exposed in development mode. They are not exposed in the production inference stack because users could break the deployment by resetting the weights and cache. The sleep mode endpoints are meant to be used in a closed, secure environment such as training or backend applications.
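For readers following along, here is a minimal sketch of how a backend script might drive those endpoints once dev mode is on. It assumes two vLLM servers started with `VLLM_SERVER_DEV_MODE=1` and sleep mode enabled. The `/sleep` and `/wake_up` paths and the `level` query parameter reflect my understanding of the dev-mode API and should be checked against the docs for your vLLM version. The base URLs and the helper names (`endpoint`, `post`, `switch`) are hypothetical.

```python
import urllib.parse
import urllib.request


def endpoint(base_url: str, path: str, **params) -> str:
    """Build a URL such as http://localhost:8001/sleep?level=1."""
    query = urllib.parse.urlencode(params)
    url = base_url.rstrip("/") + path
    return f"{url}?{query}" if query else url


def post(url: str, timeout: float = 60.0) -> int:
    """Send an empty POST request and return the HTTP status code."""
    req = urllib.request.Request(url, data=b"", method="POST")
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.status


def switch(active_base: str, sleeping_base: str, level: int = 1) -> None:
    """Put the active server to sleep, then wake the other one.

    Assumption: both servers expose the dev-mode sleep endpoints.
    """
    post(endpoint(active_base, "/sleep", level=level))
    post(endpoint(sleeping_base, "/wake_up"))
```

A call such as `switch("http://localhost:8001", "http://localhost:8002")` would then hand the GPU from one model to the other. Because anyone who can reach these endpoints can offload the weights, the script belongs on the same trusted network as the servers.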
<div style="margin: 2rem 0;">
  <script src="https://cdn.plot.ly/plotly-2.32.0.min.js"></script>
  <div id="plotly-sleep-mode" style="width: 100%; height: 250px;"></div>
  <div style="text-align:center; color:#666; font-size:0.85rem; margin-top:0.75rem;">
    <strong>Model A:</strong> Qwen3-235B-A22B-Instruct-2507-FP8 (TP=4) | <strong>Model B:</strong> Qwen3-Coder-30B-A3B-Instruct (TP=1)<br>
    GPU: A100 | vLLM 0.11.0 | Sleep Level: 1 | Compilation: <code style="font-size:0.8rem;">cudagraph_mode: FULL_AND_PIECEWISE</code><br>
  </div>
  <script src="/assets/figures/2025-vllm-sleep-mode/plotly-sleep-mode.js"></script>
</div>
In this plot, and the others like it, could you explain what we are comparing here?
It looks like we're comparing the time taken to do the following:
- Prompt A
- Prompt B
- Prompt A
- Prompt B
- Prompt A
- Prompt B
It'd be nice to make this really clear to readers
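One way to make the comparison concrete is to spell out the measurement loop. The sketch below is only an illustration of the A, B, A, B, A, B pattern listed above, not the benchmark code used for the blog. The function name `time_switching` and its callback parameters are hypothetical, and measuring "switch" as the time to activate a model is my assumption.

```python
import time
from typing import Callable, List, Tuple


def time_switching(
    models: List[str],
    rounds: int,
    activate: Callable[[str], None],
    infer: Callable[[str], None],
    deactivate: Callable[[str], None],
    clock: Callable[[], float] = time.perf_counter,
) -> List[Tuple[str, float, float]]:
    """Return (model, switch_seconds, inference_seconds) for each step.

    With models=["A", "B"] and rounds=3 the steps are A, B, A, B, A, B.
    """
    results = []
    for _ in range(rounds):
        for name in models:
            t0 = clock()
            activate(name)    # e.g. wake the model up, or load it from scratch
            t1 = clock()
            infer(name)       # run the prompt for this model
            t2 = clock()
            deactivate(name)  # e.g. put it to sleep, or shut the server down
            results.append((name, t1 - t0, t2 - t1))
    return results
```

Running the same loop twice, once with sleep/wake callbacks and once with full reload callbacks, yields the two series that a plot like this would compare.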
hmellor left a comment:
This is a really nice blog (the interactive graphs are great!).
Just a couple of comments about minor things that it might be nice to address before publishing.
Please change the filename to 2025-10-26-sleep-mode.md; the resulting URL will be better.
Glad that you really like this image 😆
youkaichao left a comment:
the interactive plots look great, so amazing!
left 2 nit comments.
- Rename blog post file to shorter name (sleep-mode.md)
- Clarify security warning about dev mode requirement
- Improve plot description to explain A→B→A→B switching pattern
- Update sleepmode.png image

Signed-off-by: PinSiang <[email protected]>

vLLM Sleep Mode feature blog.
Coverage:
- Benchmarks: 0.6B-235B params, A4000 and A100 GPUs, TP=1-4
- Ablation studies: warmup impact, FP8 quantization
- Interactive Plotly charts with full methodology
Solves the multi-model serving problem: the models are too large to fit on the GPU simultaneously, but traditional reloading is too slow for production. Sleep Mode makes multi-model switching fast.