Commit a746f82

[DOC] Qwen3 PD disaggregation user guide (#2751)
### What this PR does / why we need it?

This PR adds a deployment guide for prefiller & decoder (PD) disaggregation. The scenario of the guide is:

- 3 nodes in total, with 2 NPUs on each node
- Qwen3-30B-A3B
- 1P2D
- Expert Parallel

The deployment can be used to verify the PD disaggregation and Expert Parallel features with relatively few resources.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

No.

- vLLM version: v0.10.1.1
- vLLM main: vllm-project/vllm@e599e2c

---------

Signed-off-by: paulyu12 <[email protected]>
1 parent b2f77d3 commit a746f82

File tree

4 files changed: +296 −30 lines changed


docs/source/tutorials/index.md

Lines changed: 1 addition & 0 deletions

```diff
@@ -15,4 +15,5 @@ multi_npu_quantization
 single_node_300i
 multi_node
 multi_node_kimi
+multi_node_pd_disaggregation
 :::
```
Lines changed: 244 additions & 0 deletions
# Prefill-Decode Disaggregation Verification (Qwen)

## Getting Started

vLLM-Ascend now supports prefill-decode (PD) disaggregation with EP (Expert Parallel) options. This guide walks through the steps to verify these features with constrained resources.

Taking the Qwen3-30B-A3B model as an example, we use vllm-ascend v0.10.1rc1 (with vLLM v0.10.1.1) on 3 Atlas 800T A2 servers to deploy the "1P2D" architecture. Assume the IP address of the prefiller server is 192.0.0.1, and those of the decoder servers are 192.0.0.2 (decoder 1) and 192.0.0.3 (decoder 2). On each server, 2 NPUs are used to deploy one service instance.

## Verify Multi-Node Communication Environment

### Physical Layer Requirements

- The physical machines must be located on the same LAN, with network connectivity between them.
- All NPUs must be interconnected. Intra-node connectivity is via HCCS, and inter-node connectivity is via RDMA.

### Verification Process

1. Single-Node Verification:

Execute the following commands on each node in sequence. The results must all be `success` and the status must be `UP`:

```bash
# Check the remote switch ports
for i in {0..7}; do hccn_tool -i $i -lldp -g | grep Ifname; done
# Get the link status of the Ethernet ports (UP or DOWN)
for i in {0..7}; do hccn_tool -i $i -link -g; done
# Check the network health status
for i in {0..7}; do hccn_tool -i $i -net_health -g; done
# View the network-detected IP configuration
for i in {0..7}; do hccn_tool -i $i -netdetect -g; done
# View the gateway configuration
for i in {0..7}; do hccn_tool -i $i -gateway -g; done
# View the NPU network configuration
cat /etc/hccn.conf
```

2. Get NPU IP Addresses:

```bash
for i in {0..7}; do hccn_tool -i $i -ip -g; done
```

3. Cross-Node PING Test:

```bash
# Execute on the target node (replace 'x.x.x.x' with the actual NPU IP address)
for i in {0..7}; do hccn_tool -i $i -ping -g address x.x.x.x; done
```

## Generate Ranktable

The rank table is a JSON file that specifies the mapping of Ascend NPU ranks to nodes. For more details, please refer to the [vllm-ascend examples](https://github.com/vllm-project/vllm-ascend/blob/main/examples/disaggregated_prefill_v1/README.md). Execute the following commands for reference.

```shell
cd vllm-ascend/examples/disaggregated_prefill_v1/
bash gen_ranktable.sh --ips <prefiller_node1_local_ip> <prefiller_node2_local_ip> <decoder_node1_local_ip> <decoder_node2_local_ip> \
    --npus-per-node <npu_chips> --network-card-name <nic_name> --prefill-device-cnt <prefiller_npu_chips> --decode-device-cnt <decode_npu_chips> \
    [--local-device-ids <id_1>,<id_2>,<id_3>...]
```

Assume that we use devices 0 and 1 on the prefiller server node and devices 6 and 7 on both of the decoder server nodes. Take the following commands as an example. (`--local-device-ids` is required if you use only certain NPU devices on the local server.)

```shell
# On the prefiller node
cd vllm-ascend/examples/disaggregated_prefill_v1/
bash gen_ranktable.sh --ips 192.0.0.1 192.0.0.2 192.0.0.3 \
    --npus-per-node 2 --network-card-name eth0 --prefill-device-cnt 2 --decode-device-cnt 4 --local-device-ids 0,1

# On decoder 1
cd vllm-ascend/examples/disaggregated_prefill_v1/
bash gen_ranktable.sh --ips 192.0.0.1 192.0.0.2 192.0.0.3 \
    --npus-per-node 2 --network-card-name eth0 --prefill-device-cnt 2 --decode-device-cnt 4 --local-device-ids 6,7

# On decoder 2
cd vllm-ascend/examples/disaggregated_prefill_v1/
bash gen_ranktable.sh --ips 192.0.0.1 192.0.0.2 192.0.0.3 \
    --npus-per-node 2 --network-card-name eth0 --prefill-device-cnt 2 --decode-device-cnt 4 --local-device-ids 6,7
```

The rank table will be generated at `/vllm-workspace/vllm-ascend/examples/disaggregated_prefill_v1/ranktable.json`.

| Parameter | Meaning |
| --- | --- |
| `--ips` | Each node's local IP (prefiller nodes must be listed before decoder nodes) |
| `--npus-per-node` | The number of NPU chips on each node |
| `--network-card-name` | The physical machine's NIC name |
| `--prefill-device-cnt` | The number of NPU chips used for prefill |
| `--decode-device-cnt` | The number of NPU chips used for decode |
| `--local-device-ids` | Optional. Not needed if all devices on the local node are used. |
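For orientation, each device entry in the generated rank table carries the fields assembled by `gen_ranktable.py`: `server_id`, `device_id`, and `device_ip` (plus super-pod fields on A3 hardware). The sketch below mirrors that assembly with illustrative NPU IPs; the top-level layout of `ranktable.json` is defined by the script and may differ between versions.

```python
# Sketch of how per-device rank-table entries are assembled.
# Field names mirror the device_info dict in gen_ranktable.py;
# the IP addresses here are illustrative placeholders.

def build_device_entries(server_id, device_ids, device_ips):
    """One entry per NPU: the host it lives on, its device id, and its NPU IP."""
    return [
        {
            "server_id": server_id,
            "device_id": str(dev_id),
            "device_ip": dev_ip,
        }
        for dev_id, dev_ip in zip(device_ids, device_ips)
    ]

# Prefiller node 192.0.0.1 uses devices 0 and 1 in this guide's scenario.
prefill_entries = build_device_entries("192.0.0.1", [0, 1],
                                       ["10.0.0.10", "10.0.0.11"])
print(prefill_entries)
```
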
## Prefiller / Decoder Deployment

Run the following scripts to launch a server on the prefiller and decoder nodes respectively.
:::::{tab-set}

::::{tab-item} Prefiller node

```shell
export HCCL_IF_IP=192.0.0.1  # node ip
export GLOO_SOCKET_IFNAME="eth0"  # network card name
export TP_SOCKET_IFNAME="eth0"
export HCCL_SOCKET_IFNAME="eth0"
export DISAGGREGATED_PREFILL_RANK_TABLE_PATH="/path/to/your/generated/ranktable.json"
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export VLLM_USE_V1=1

vllm serve /model/Qwen3-30B-A3B \
    --host 0.0.0.0 \
    --port 13700 \
    --tensor-parallel-size 2 \
    --no-enable-prefix-caching \
    --seed 1024 \
    --served-model-name qwen3-moe \
    --max-model-len 6144 \
    --max-num-batched-tokens 6144 \
    --trust-remote-code \
    --gpu-memory-utilization 0.9 \
    --enable-expert-parallel \
    --kv-transfer-config \
    '{"kv_connector": "LLMDataDistCMgrConnector",
      "kv_buffer_device": "npu",
      "kv_role": "kv_producer",
      "kv_parallel_size": 1,
      "kv_port": "20001",
      "engine_id": "0",
      "kv_connector_module_path": "vllm_ascend.distributed.llmdatadist_c_mgr_connector"
     }' \
    --additional-config \
    '{"torchair_graph_config": {"enabled":false, "enable_multistream_shared_expert":false}, "ascend_scheduler_config":{"enabled":true, "enable_chunked_prefill":false}}' \
    --enforce-eager
```

::::

::::{tab-item} Decoder node 1

```shell
export HCCL_IF_IP=192.0.0.2  # node ip
export GLOO_SOCKET_IFNAME="eth0"  # network card name
export TP_SOCKET_IFNAME="eth0"
export HCCL_SOCKET_IFNAME="eth0"
export DISAGGREGATED_PREFILL_RANK_TABLE_PATH="/path/to/your/generated/ranktable.json"
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export VLLM_USE_V1=1

vllm serve /model/Qwen3-30B-A3B \
    --host 0.0.0.0 \
    --port 13700 \
    --no-enable-prefix-caching \
    --tensor-parallel-size 2 \
    --seed 1024 \
    --served-model-name qwen3-moe \
    --max-model-len 6144 \
    --max-num-batched-tokens 6144 \
    --trust-remote-code \
    --gpu-memory-utilization 0.9 \
    --enable-expert-parallel \
    --kv-transfer-config \
    '{"kv_connector": "LLMDataDistCMgrConnector",
      "kv_buffer_device": "npu",
      "kv_role": "kv_consumer",
      "kv_parallel_size": 1,
      "kv_port": "20001",
      "engine_id": "0",
      "kv_connector_module_path": "vllm_ascend.distributed.llmdatadist_c_mgr_connector"
     }' \
    --additional-config \
    '{"torchair_graph_config": {"enabled":false, "enable_multistream_shared_expert":false}, "ascend_scheduler_config":{"enabled":true, "enable_chunked_prefill":false}}'
```

::::

::::{tab-item} Decoder node 2

```shell
export HCCL_IF_IP=192.0.0.3  # node ip
export GLOO_SOCKET_IFNAME="eth0"  # network card name
export TP_SOCKET_IFNAME="eth0"
export HCCL_SOCKET_IFNAME="eth0"
export DISAGGREGATED_PREFILL_RANK_TABLE_PATH="/path/to/your/generated/ranktable.json"
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export VLLM_USE_V1=1

vllm serve /model/Qwen3-30B-A3B \
    --host 0.0.0.0 \
    --port 13700 \
    --no-enable-prefix-caching \
    --tensor-parallel-size 2 \
    --seed 1024 \
    --served-model-name qwen3-moe \
    --max-model-len 6144 \
    --max-num-batched-tokens 6144 \
    --trust-remote-code \
    --gpu-memory-utilization 0.9 \
    --enable-expert-parallel \
    --kv-transfer-config \
    '{"kv_connector": "LLMDataDistCMgrConnector",
      "kv_buffer_device": "npu",
      "kv_role": "kv_consumer",
      "kv_parallel_size": 1,
      "kv_port": "20001",
      "engine_id": "0",
      "kv_connector_module_path": "vllm_ascend.distributed.llmdatadist_c_mgr_connector"
     }' \
    --additional-config \
    '{"torchair_graph_config": {"enabled":false, "enable_multistream_shared_expert":false}, "ascend_scheduler_config":{"enabled":true, "enable_chunked_prefill":false}}'
```

::::

:::::
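The inline JSON strings passed to `--kv-transfer-config` and `--additional-config` are easy to break when editing by hand. A quick way to sanity-check them before launching (the string below is copied from the prefiller command above):

```python
import json

# kv-transfer-config exactly as passed to the prefiller's vllm serve command.
kv_transfer_config = '''{"kv_connector": "LLMDataDistCMgrConnector",
"kv_buffer_device": "npu",
"kv_role": "kv_producer",
"kv_parallel_size": 1,
"kv_port": "20001",
"engine_id": "0",
"kv_connector_module_path": "vllm_ascend.distributed.llmdatadist_c_mgr_connector"
}'''

cfg = json.loads(kv_transfer_config)  # raises json.JSONDecodeError if malformed
# kv_role is the only field that differs between the nodes:
# "kv_producer" on the prefiller, "kv_consumer" on both decoders.
print(cfg["kv_role"])
```
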

## Example Proxy for Deployment

Run a proxy server on the same node as the prefiller service instance. You can get the proxy program from the repository's examples: [load_balance_proxy_server_example.py](https://github.com/vllm-project/vllm-ascend/blob/main/examples/disaggregated_prefill_v1/load_balance_proxy_server_example.py)

```shell
python load_balance_proxy_server_example.py \
    --host 192.0.0.1 \
    --port 8080 \
    --prefiller-hosts 192.0.0.1 \
    --prefiller-port 13700 \
    --decoder-hosts 192.0.0.2 192.0.0.3 \
    --decoder-ports 13700 13700
```

## Verification

Check the service health using the proxy server endpoint.

```shell
curl http://192.0.0.1:8080/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "qwen3-moe",
        "prompt": "Who are you?",
        "max_tokens": 100,
        "temperature": 0
    }'
```
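The same check can be scripted. Below is an illustrative sketch using only the Python standard library; the endpoint, port, and model name mirror the deployment above, and `build_completion_request` is a hypothetical helper, not part of the repository.

```python
import json
import urllib.request

def build_completion_request(base_url, model, prompt,
                             max_tokens=100, temperature=0.0):
    """Build an OpenAI-compatible /v1/completions request for the proxy."""
    payload = {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_completion_request("http://192.0.0.1:8080", "qwen3-moe", "Who are you?")
# When the proxy is reachable, send it with: urllib.request.urlopen(req)
print(req.full_url)
```
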

examples/disaggregated_prefill_v1/gen_ranktable.py

Lines changed: 41 additions & 29 deletions

```diff
@@ -17,6 +17,10 @@
                     type=int,
                     required=True,
                     help="number of decode devices")
+parser.add_argument("--local-device-ids",
+                    type=str,
+                    required=False,
+                    help="local device ids")
 args = parser.parse_args()
 local_host = args.local_host
 prefill_device_cnt = args.prefill_device_cnt
@@ -54,39 +58,47 @@ def get_cmd_stdout(cmd):
         "\n")[0].split(":")[1].strip()
     chips_per_card = int(chips_per_card)
 
+if args.local_device_ids:
+    local_device_ids = args.local_device_ids.split(',')
+else:
+    local_device_ids = []
+    for card_id in range(num_cards):
+        for chip_id in range(chips_per_card):
+            device_id = card_id * chips_per_card + chip_id
+            local_device_ids.append(device_id)
+
 # generate local device list for local rank 0, and gather it to all ranks
 local_device_list: list[dict[str, str]] = list()
 if local_rank == "0":
     super_pod_id = "0"
-    for card_id in range(num_cards):
-        for chip_id in range(chips_per_card):
-            device_id = card_id * chips_per_card + chip_id
-            if soc_info == AscendSocVersion.A3:
-                device_ip = get_cmd_stdout(
-                    f"{hccn_tool_path} -i {device_id} -vnic -g | grep ipaddr"
-                ).split(":")[1].strip()
-                super_device_id = get_cmd_stdout(
-                    f"npu-smi info -t spod-info -i {card_id} -c {chip_id} | grep SDID"
-                ).split(":")[1].strip()
-                super_pod_id = get_cmd_stdout(
-                    f"npu-smi info -t spod-info -i {card_id} -c {chip_id} | grep \"Super Pod ID\""
-                ).split(":")[1].strip()
-            else:
-                device_ip = get_cmd_stdout(
-                    f"{hccn_tool_path} -i {device_id} -ip -g | grep ipaddr"
-                ).split(":")[1].strip()
-
-            device_info = {
-                "server_id": local_host,
-                "device_id": str(device_id),
-                "device_ip": str(device_ip),
-            }
-            if soc_info == AscendSocVersion.A3:
-                device_info.update({
-                    "super_pod_id": str(super_pod_id),
-                    "super_device_id": str(super_device_id)
-                })
-            local_device_list.append(device_info)
+    for idx in range(len(local_device_ids)):
+        device_id = local_device_ids[idx]
+        if soc_info == AscendSocVersion.A3:
+            device_ip = get_cmd_stdout(
+                f"{hccn_tool_path} -i {device_id} -vnic -g | grep ipaddr"
+            ).split(":")[1].strip()
+            super_device_id = get_cmd_stdout(
+                f"npu-smi info -t spod-info -i {card_id} -c {chip_id} | grep SDID"
+            ).split(":")[1].strip()
+            super_pod_id = get_cmd_stdout(
+                f"npu-smi info -t spod-info -i {card_id} -c {chip_id} | grep \"Super Pod ID\""
+            ).split(":")[1].strip()
+        else:
+            device_ip = get_cmd_stdout(
+                f"{hccn_tool_path} -i {device_id} -ip -g | grep ipaddr"
+            ).split(":")[1].strip()
+
+        device_info = {
+            "server_id": local_host,
+            "device_id": str(device_id),
+            "device_ip": str(device_ip),
+        }
+        if soc_info == AscendSocVersion.A3:
+            device_info.update({
+                "super_pod_id": str(super_pod_id),
+                "super_device_id": str(super_device_id)
+            })
+        local_device_list.append(device_info)
 
 dist.init_process_group(backend=dist.Backend.GLOO)
 global_device_list = [None] * dist.get_world_size()
```
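The new `--local-device-ids` handling can be exercised in isolation. The sketch below reproduces the selection logic from the diff, with `num_cards` and `chips_per_card` as stand-in values for what the script detects via `npu-smi`:

```python
def select_local_device_ids(local_device_ids_arg, num_cards, chips_per_card):
    """Mirror the diff: use the explicit id list when given, else enumerate all chips."""
    if local_device_ids_arg:
        return local_device_ids_arg.split(',')
    ids = []
    for card_id in range(num_cards):
        for chip_id in range(chips_per_card):
            ids.append(card_id * chips_per_card + chip_id)
    return ids

print(select_local_device_ids("6,7", 4, 2))  # explicit ids win
print(select_local_device_ids(None, 2, 2))   # fall back to all devices
```

Note that, as in the diff itself, explicit ids come back as strings while the enumerated fallback yields ints; both are later normalized via `str(device_id)` when `device_info` is built.
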

examples/disaggregated_prefill_v1/gen_ranktable.sh

Lines changed: 10 additions & 1 deletion

```diff
@@ -33,6 +33,11 @@ while [[ $# -gt 0 ]]; do
         DECODE_DEVICE_CNT="$1"
         shift
         ;;
+    --local-device-ids)
+        shift
+        LOCAL_DEVICE_IDS="$1"
+        shift
+        ;;
     esac
 done
 LOCAL_HOSTS=($(hostname -I))
@@ -68,12 +73,16 @@ echo "NNODES": $NNODES
 echo "NODE_RANK": $NODE_RANK
 echo "==============="
 
+if [ -n "$LOCAL_DEVICE_IDS" ]; then
+    OPTIONAL_SECTION=" --local-device-ids $LOCAL_DEVICE_IDS"
+fi
+
 if [[ -n "${GEN_RANKTABLE}" || ! -e ${PWD}/ranktable.json ]]; then
     GLOO_SOCKET_IFNAME=$NETWORK_CARD_NAME torchrun \
         --nproc_per_node 1 \
         --nnodes ${NNODES} \
         --node_rank ${NODE_RANK} \
         --master_addr ${MASTER_ADDR} \
         --master_port ${MASTER_PORT} \
-        gen_ranktable.py --local-host $LOCAL_HOST --prefill-device-cnt $PREFILL_DEVICE_CNT --decode-device-cnt $DECODE_DEVICE_CNT
+        gen_ranktable.py --local-host $LOCAL_HOST --prefill-device-cnt $PREFILL_DEVICE_CNT --decode-device-cnt $DECODE_DEVICE_CNT $OPTIONAL_SECTION
 fi
```
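The `OPTIONAL_SECTION` pattern in the shell diff, appending a flag only when its variable is set, can be mirrored when driving the generator programmatically. The helper below is an illustrative sketch (not part of the repository); the flag names follow `gen_ranktable.py`, and the `torchrun` options are omitted for brevity:

```python
def build_gen_ranktable_argv(local_host, prefill_cnt, decode_cnt,
                             local_device_ids=None):
    """Assemble the gen_ranktable.py argument list, adding --local-device-ids only when given."""
    argv = [
        "gen_ranktable.py",
        "--local-host", local_host,
        "--prefill-device-cnt", str(prefill_cnt),
        "--decode-device-cnt", str(decode_cnt),
    ]
    if local_device_ids:  # mirrors: if [ -n "$LOCAL_DEVICE_IDS" ]
        argv += ["--local-device-ids", local_device_ids]
    return argv

print(build_gen_ranktable_argv("192.0.0.1", 2, 4, "0,1"))
```
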
