@tanpinsiang
Contributor

Blog post introducing the vLLM Sleep Mode feature.

Coverage:
- Benchmarks: 0.6B to 235B parameters, A4000 and A100 GPUs, TP=1-4
- Ablation studies: warmup impact, FP8 quantization
- Interactive Plotly charts with full methodology

Addresses the multi-model serving problem: the models are too large to fit on the GPU simultaneously, yet reloading them from scratch is too slow for production. Sleep Mode makes switching between models fast.

@tjtanaa
Contributor

tjtanaa commented Oct 27, 2025

@youkaichao @hmellor PTAL

Signed-off-by: PinSiang <[email protected]>

```bash
# Terminal 1: Start Phi-3-vision
export VLLM_SERVER_DEV_MODE=1
```
Member

Is dev mode necessary? If yes, why?

Contributor

Yes. The dev mode flag is needed because the sleep mode APIs are only exposed in a development environment. They are not exposed in the production inference stack, since users could break the deployment by resetting the weights and cache. The sleep mode endpoints are intended for use in a closed, secure environment, such as training or backend applications.

Comment on lines +109 to +118
<div style="margin: 2rem 0;">
<script src="https://cdn.plot.ly/plotly-2.32.0.min.js"></script>
<div id="plotly-sleep-mode" style="width: 100%; height: 250px;"></div>
<div style="text-align:center; color:#666; font-size:0.85rem; margin-top:0.75rem;">
<strong>Model A:</strong> Qwen3-235B-A22B-Instruct-2507-FP8 (TP=4) | <strong>Model B:</strong> Qwen3-Coder-30B-A3B-Instruct (TP=1)<br>
GPU: A100 | vLLM 0.11.0 | Sleep Level: 1 | Compilation: <code style="font-size:0.8rem;">cudagraph_mode: FULL_AND_PIECEWISE</code><br>

</div>
<script src="/assets/figures/2025-vllm-sleep-mode/plotly-sleep-mode.js"></script>
</div>
Member

In this plot, and the others like it, could you explain what we are comparing here?

It looks like we're comparing the time taken to do the following:

  • Prompt A
  • Prompt B
  • Prompt A
  • Prompt B
  • Prompt A
  • Prompt B

It'd be nice to make this really clear to readers.

Member

@hmellor left a comment

This is a really nice blog (the interactive graphs are great!).

Just a couple of comments about minor things that it might be nice to address before publishing.

Member

Please change the filename to 2025-10-26-sleep-mode.md; the resulting URL will be better.

Member

This previous figure looks better. Can we use a similar image?

Image

Contributor

Glad that you really like this image 😆

Member

@youkaichao left a comment

The interactive plots look great, so amazing!

Left 2 nit comments.

- Rename blog post file to shorter name (sleep-mode.md)
- Clarify security warning about dev mode requirement
- Improve plot description to explain A→B→A→B switching pattern
- Update sleepmode.png image

Signed-off-by: PinSiang <[email protected]>
@youkaichao merged commit 20acc21 into vllm-project:main on Oct 28, 2025
4 checks passed

4 participants