Commit a7adbc6

Authored by iAmir97 (Amir Balwel) and gemini-code-assist[bot]

[Doc] Sleep mode documentation (vllm-project#28357)

Signed-off-by: Amir Balwel <[email protected]>
Signed-off-by: iAmir97 <[email protected]>
Co-authored-by: Amir Balwel <[email protected]>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
1 parent e605e8e commit a7adbc6

1 file changed (+39, −0 lines changed)

docs/features/sleep_mode.md

Lines changed: 39 additions & 0 deletions
@@ -13,6 +13,9 @@ Key benefits:
 !!! note
     This feature is only supported on the CUDA platform.

+!!! note
+    For more information, see this [Blog Post](https://blog.vllm.ai/2025/10/26/sleep-mode.html).
+
 ## Sleep levels

 Level 1 sleep offloads the model weights to CPU memory and discards the KV cache, whose contents are forgotten. It suits putting the engine to sleep and waking it up to run the same model again; because the weights are backed up in CPU memory, make sure there is enough CPU memory to store them. Level 2 sleep discards both the model weights and the KV cache (the model's buffers, such as rope scaling tensors, are kept in CPU memory), so the contents of both are forgotten. It suits waking the engine up to run a different model or to update the model, where the previous weights are not needed, e.g. an RLHF weight update.
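Since level 1 keeps a full copy of the weights in CPU RAM, it helps to estimate the required headroom up front. The sketch below is illustrative only; the 2-bytes-per-parameter figure assumes fp16/bf16 weights and is not taken from this commit.

```python
# Rough CPU RAM needed to back up model weights during level 1 sleep.
# Assumes fp16/bf16 weights (2 bytes per parameter); adjust for your dtype.
def level1_cpu_ram_gib(num_params: float, bytes_per_param: int = 2) -> float:
    return num_params * bytes_per_param / 1024**3

# Qwen/Qwen3-0.6B has roughly 0.6e9 parameters, i.e. about 1.1 GiB of CPU RAM.
print(f"{level1_cpu_ram_gib(0.6e9):.1f} GiB")
```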
@@ -31,13 +34,29 @@ llm = LLM("Qwen/Qwen3-0.6B", enable_sleep_mode=True)
 #### Python API

 ```python
+# Sleep level 1
 # Put the engine to sleep (level=1: offload weights to CPU RAM, discard KV cache)
 llm.sleep(level=1)

 # Wake up the engine (restore weights)
 llm.wake_up()
 ```

+```python
+# Sleep level 2
+# Put the engine to sleep (level=2: discard both weights and KV cache)
+llm.sleep(level=2)
+
+# Reallocate weights memory only
+llm.wake_up(tags=["weights"])
+
+# Load weights in-place
+llm.collective_rpc("reload_weights")
+
+# Reallocate KV cache
+llm.wake_up(tags=["kv_cache"])
+```
+
 #### RLHF weight updates

 During RLHF training, vLLM allows you to selectively wake up only the model weights or the KV cache using the `tags` argument of `wake_up()`. This fine-grained control is especially useful when updating model weights: by waking up just the weights (e.g., `llm.wake_up(tags=["weights"])`), you avoid allocating memory for the KV cache until the weight update is complete. Minimizing peak memory usage this way during weight synchronization helps prevent GPU out-of-memory (OOM) errors, particularly with large models.
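A minimal sketch of that ordering follows, using the API shown above; `train_one_step` and `push_weights_to_vllm` are hypothetical stand-ins for a training framework's logic, not vLLM APIs.

```python
# Hedged sketch of one RLHF weight-sync iteration using sleep mode.
def rlhf_weight_sync(llm, trainer):
    llm.sleep(level=2)                  # free GPU weights and KV cache
    trainer.train_one_step()            # hypothetical: compute updated weights
    llm.wake_up(tags=["weights"])       # reallocate weight memory only
    trainer.push_weights_to_vllm(llm)   # hypothetical: write new weights in-place
    llm.wake_up(tags=["kv_cache"])      # restore the KV cache after the update
```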
@@ -69,10 +88,30 @@ VLLM_SERVER_DEV_MODE=1 vllm serve Qwen/Qwen3-0.6B \
   --port 8000
 ```

+Below is an example of how to put a model to sleep and wake it up at level 1.
+
+```bash
+curl -X POST 'http://localhost:8000/sleep?level=1'
+curl -X POST 'http://localhost:8000/wake_up'
+```
+
+And this is an example of how to put a model to sleep and wake it up at level 2.
+
+```bash
+curl -X POST 'http://localhost:8000/sleep?level=2'
+# Reallocate weights memory only
+curl -X POST 'http://localhost:8000/wake_up?tags=weights'
+# Load weights in-place
+curl -X POST 'http://localhost:8000/collective_rpc' -H 'Content-Type: application/json' -d '{"method":"reload_weights"}'
+# Reallocate KV cache
+curl -X POST 'http://localhost:8000/wake_up?tags=kv_cache'
+```
+
 #### HTTP endpoints

 - `POST /sleep?level=1` — Put the model to sleep (`level=1`).
 - `POST /wake_up` — Wake up the model. Supports optional `tags` query parameters for partial wake-up (e.g., `?tags=weights`).
+- `POST /collective_rpc` — Perform a collective remote procedure call (RPC).
 - `GET /is_sleeping` — Check if the model is sleeping (see the sketch after this list).
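As a quick illustration of the last endpoint, here is a stdlib-only Python sketch, assuming the dev-mode server started above on localhost:8000; the `is_sleeping` field name in the JSON response is an assumption about the payload shape.

```python
# Check whether the server is sleeping, and wake it up if so.
# Assumes the dev-mode server from the examples above (localhost:8000);
# the "is_sleeping" JSON field is an assumption about the response shape.
import json
import urllib.request

with urllib.request.urlopen("http://localhost:8000/is_sleeping") as resp:
    sleeping = json.load(resp).get("is_sleeping", False)

if sleeping:
    req = urllib.request.Request("http://localhost:8000/wake_up", method="POST")
    urllib.request.urlopen(req)
```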
 !!! note

Comments (0)