Commit 356b7c1

Merge branch 'main' into feature_allreduce
2 parents b18b5e2 + df0ec55

28 files changed: +2833 −145 lines

.github/workflows/vllm_ascend_test_pd.yaml

Lines changed: 3 additions & 4 deletions
```diff
@@ -42,8 +42,7 @@ jobs:
     strategy:
       matrix:
         vllm_verison: [
-          # revert me when V1 disaggregation prefill is merged in main
-          # main,
+          main,
           v0.9.1
         ]
     name: vLLM Ascend prefilling decoding disaggregation test
@@ -107,6 +106,6 @@ jobs:
           pip install -r requirements-dev.txt
           pip install -v -e .

-      - name: Run vllm-project/vllm-ascend PD Disaggregation test
+      - name: Run vllm-project/vllm-ascend PD Disaggregation edge test
         run: |
-          pytest -sv tests/e2e/pd_disaggreate/test_pd_e2e.py
+          bash tests/e2e/pd_disaggreate/run_edge_case_test.sh
```
Lines changed: 230 additions & 0 deletions
# Disaggregated Prefill-Decode Deployment Guide

## Overview
This document provides instructions for running a disaggregated vLLM-ascend service with separate prefill and decode stages across 4 nodes, using 16 Ascend NPUs for two prefill nodes (P1/P2) and 16 Ascend NPUs for two decode nodes (D1/D2).

## Prerequisites
- Ascend NPU environment with vLLM 0.9.1 installed
- Network interfaces configured for distributed communication (e.g. eth0)
- Model weights located at `/data01/deepseek_r1_w8a8_zhw`

## Rank table generation
The rank table is a JSON file that specifies the mapping of Ascend NPU ranks to nodes. Run the following command on every node to generate a rank table covering 16 prefill cards and 16 decode cards:

```shell
cd vllm-ascend/examples/disaggregate_prefill_v1/
bash gen_ranktable.sh --ips 172.19.32.175 172.19.241.49 172.19.123.51 172.19.190.36 \
  --npus-per-node 8 --network-card-name enp189s0f0 --prefill-device-cnt 16 --decode-device-cnt 16
```

The rank table will be generated at `/vllm-workspace/vllm-ascend/examples/disaggregate_prefill_v1/ranktable.json`.
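For reference, the generated file follows the schema that the `gen_ranktable.py` script (added later in this commit) writes. The sketch below is illustrative only — the `device_ip` values and list lengths are made up — but the keys mirror the generator's output:

```python
# Illustrative ranktable.json contents (NOT real generated output).
# Schema taken from gen_ranktable.py in this commit; device IPs are invented.
sample_ranktable = {
    "version": "1.2",
    "server_count": "4",
    "prefill_device_list": [
        {"server_id": "172.19.32.175", "device_id": "0",
         "device_ip": "10.0.0.1", "cluster_id": "1"},
    ],
    "decode_device_list": [
        {"server_id": "172.19.123.51", "device_id": "0",
         "device_ip": "10.0.1.1", "cluster_id": "17"},
    ],
    "status": "completed",
}


def check_ranktable(rt: dict) -> None:
    """Minimal sanity checks before pointing the servers at the file."""
    assert rt["status"] == "completed", "generation did not finish"
    for entry in rt["prefill_device_list"] + rt["decode_device_list"]:
        # every device entry must carry these keys
        assert {"server_id", "device_id", "device_ip"} <= entry.keys()


check_ranktable(sample_ranktable)
```

A quick `check_ranktable` pass like this can catch a truncated or partially written file before the serve commands below consume it.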
## Start disaggregated vLLM-ascend service
Execution sequence:
- The 4 configured node IPs are: 172.19.32.175, 172.19.241.49, 172.19.123.51, 172.19.190.36
- Start prefill server P1 on node 1
- Start prefill server P2 on node 2
- Start decode server D1 on node 3
- Start decode server D2 on node 4
- Start the proxy server on node 1

* Run prefill server P1 on the first node
```shell
export HCCL_IF_IP=172.19.32.175  # node IP
export GLOO_SOCKET_IFNAME="eth0"  # network card name
export TP_SOCKET_IFNAME="eth0"
export HCCL_SOCKET_IFNAME="eth0"
export DISAGGREGATED_PREFILL_RANK_TABLE_PATH=/vllm-workspace/vllm-ascend/examples/disaggregate_prefill_v1/ranktable.json
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=100
export VLLM_USE_V1=1
vllm serve /data01/deepseek_r1_w8a8_zhw \
  --host 0.0.0.0 \
  --port 20002 \
  --data-parallel-size 2 \
  --data-parallel-size-local 1 \
  --api-server-count 2 \
  --data-parallel-address 172.19.32.175 \
  --data-parallel-rpc-port 13356 \
  --tensor-parallel-size 8 \
  --no-enable-prefix-caching \
  --seed 1024 \
  --served-model-name deepseek \
  --max-model-len 6144 \
  --max-num-batched-tokens 6144 \
  --trust-remote-code \
  --enforce-eager \
  --gpu-memory-utilization 0.9 \
  --kv-transfer-config \
  '{"kv_connector": "LLMDataDistCMgrConnector",
    "kv_buffer_device": "npu",
    "kv_role": "kv_producer",
    "kv_parallel_size": 1,
    "kv_port": "20001",
    "engine_id": "0",
    "kv_connector_module_path": "vllm_ascend.distributed.llmdatadist_c_mgr_connector"
  }' \
  --additional-config \
  '{"torchair_graph_config": {"enabled": false, "enable_multistream_shared_expert": false}, "ascend_scheduler_config": {"enabled": false}}'
```
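The `--kv-transfer-config` block is identical across all four serve commands except for `kv_role`: prefill nodes are `kv_producer`, decode nodes are `kv_consumer`. A quick sketch makes the difference explicit:

```python
# kv-transfer-config as passed to the prefill (producer) servers in this guide.
producer_cfg = {
    "kv_connector": "LLMDataDistCMgrConnector",
    "kv_buffer_device": "npu",
    "kv_role": "kv_producer",
    "kv_parallel_size": 1,
    "kv_port": "20001",
    "engine_id": "0",
    "kv_connector_module_path":
    "vllm_ascend.distributed.llmdatadist_c_mgr_connector",
}

# The decode (consumer) servers reuse the same block with kv_role flipped.
consumer_cfg = dict(producer_cfg, kv_role="kv_consumer")

# Compute which keys differ between the two roles.
changed = {k for k in producer_cfg if producer_cfg[k] != consumer_cfg[k]}
print(changed)  # {'kv_role'}
```

Keeping the two configs generated from one dict, as above, avoids the copy-paste drift that multi-node launch scripts are prone to.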
* Run prefill server P2 on the second node
```shell
export HCCL_IF_IP=172.19.241.49
export GLOO_SOCKET_IFNAME="eth0"
export TP_SOCKET_IFNAME="eth0"
export HCCL_SOCKET_IFNAME="eth0"
export DISAGGREGATED_PREFILL_RANK_TABLE_PATH=/vllm-workspace/vllm-ascend/examples/disaggregate_prefill_v1/ranktable.json
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=100
export VLLM_USE_V1=1
vllm serve /data01/deepseek_r1_w8a8_zhw \
  --host 0.0.0.0 \
  --port 20002 \
  --headless \
  --data-parallel-size 2 \
  --data-parallel-start-rank 1 \
  --data-parallel-size-local 1 \
  --data-parallel-address 172.19.32.175 \
  --data-parallel-rpc-port 13356 \
  --tensor-parallel-size 8 \
  --no-enable-prefix-caching \
  --seed 1024 \
  --served-model-name deepseek \
  --max-model-len 6144 \
  --max-num-batched-tokens 6144 \
  --trust-remote-code \
  --enforce-eager \
  --gpu-memory-utilization 0.9 \
  --kv-transfer-config \
  '{"kv_connector": "LLMDataDistCMgrConnector",
    "kv_buffer_device": "npu",
    "kv_role": "kv_producer",
    "kv_parallel_size": 1,
    "kv_port": "20001",
    "engine_id": "0",
    "kv_connector_module_path": "vllm_ascend.distributed.llmdatadist_c_mgr_connector"
  }' \
  --additional-config \
  '{"torchair_graph_config": {"enabled": false, "enable_multistream_shared_expert": false}, "ascend_scheduler_config": {"enabled": false}}'
```
* Run decode server D1 on the third node
```shell
export HCCL_IF_IP=172.19.123.51
export GLOO_SOCKET_IFNAME="eth0"
export TP_SOCKET_IFNAME="eth0"
export HCCL_SOCKET_IFNAME="eth0"
export DISAGGREGATED_PREFILL_RANK_TABLE_PATH=/vllm-workspace/vllm-ascend/examples/disaggregate_prefill_v1/ranktable.json
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=100
export VLLM_USE_V1=1
vllm serve /data01/deepseek_r1_w8a8_zhw \
  --host 0.0.0.0 \
  --port 20002 \
  --data-parallel-size 2 \
  --data-parallel-size-local 1 \
  --api-server-count 2 \
  --data-parallel-address 172.19.123.51 \
  --data-parallel-rpc-port 13356 \
  --tensor-parallel-size 8 \
  --no-enable-prefix-caching \
  --seed 1024 \
  --served-model-name deepseek \
  --max-model-len 6144 \
  --max-num-batched-tokens 6144 \
  --trust-remote-code \
  --enforce-eager \
  --gpu-memory-utilization 0.9 \
  --kv-transfer-config \
  '{"kv_connector": "LLMDataDistCMgrConnector",
    "kv_buffer_device": "npu",
    "kv_role": "kv_consumer",
    "kv_parallel_size": 1,
    "kv_port": "20001",
    "engine_id": "0",
    "kv_connector_module_path": "vllm_ascend.distributed.llmdatadist_c_mgr_connector"
  }' \
  --additional-config \
  '{"torchair_graph_config": {"enabled": false, "enable_multistream_shared_expert": false}, "ascend_scheduler_config": {"enabled": false}}'
```
* Run decode server D2 on the last node
```shell
export HCCL_IF_IP=172.19.190.36
export GLOO_SOCKET_IFNAME="eth0"
export TP_SOCKET_IFNAME="eth0"
export HCCL_SOCKET_IFNAME="eth0"
export DISAGGREGATED_PREFILL_RANK_TABLE_PATH=/vllm-workspace/vllm-ascend/examples/disaggregate_prefill_v1/ranktable.json
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=100
export VLLM_USE_V1=1
vllm serve /data01/deepseek_r1_w8a8_zhw \
  --host 0.0.0.0 \
  --port 20002 \
  --headless \
  --data-parallel-size 2 \
  --data-parallel-start-rank 1 \
  --data-parallel-size-local 1 \
  --data-parallel-address 172.19.123.51 \
  --data-parallel-rpc-port 13356 \
  --tensor-parallel-size 8 \
  --no-enable-prefix-caching \
  --seed 1024 \
  --served-model-name deepseek \
  --max-model-len 6144 \
  --max-num-batched-tokens 6144 \
  --trust-remote-code \
  --enforce-eager \
  --gpu-memory-utilization 0.9 \
  --kv-transfer-config \
  '{"kv_connector": "LLMDataDistCMgrConnector",
    "kv_buffer_device": "npu",
    "kv_role": "kv_consumer",
    "kv_parallel_size": 1,
    "kv_port": "20001",
    "engine_id": "0",
    "kv_connector_module_path": "vllm_ascend.distributed.llmdatadist_c_mgr_connector"
  }' \
  --additional-config \
  '{"torchair_graph_config": {"enabled": false, "enable_multistream_shared_expert": false}, "ascend_scheduler_config": {"enabled": false}}'
```
* Run the proxy server on the first node
```shell
cd /vllm-workspace/vllm-ascend/examples/disaggregate_prefill_v1
python toy_proxy_server.py --host 172.19.32.175 --port 1025 --prefiller-hosts 172.19.241.49 --prefiller-port 20002 --decoder-hosts 172.19.123.51 --decoder-ports 20002
```
* Verification
Check service health using the proxy server endpoint:
```shell
curl http://localhost:1025/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek",
    "prompt": "Who are you?",
    "max_tokens": 100,
    "temperature": 0
  }'
```
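The same check can be issued from Python using only the standard library. This is a sketch, not part of the commit: `complete()` assumes the proxy from this guide is reachable at `localhost:1025`, so only call it once the service is up.

```python
import json
from urllib import request

# Proxy endpoint as configured by toy_proxy_server.py in this guide.
PROXY_URL = "http://localhost:1025/v1/completions"


def build_completion_request(prompt: str, max_tokens: int = 100) -> dict:
    """Build the same JSON body as the curl check above."""
    return {
        "model": "deepseek",  # --served-model-name from the serve commands
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": 0,
    }


def complete(prompt: str) -> str:
    """POST a completion request; requires the running service."""
    payload = json.dumps(build_completion_request(prompt)).encode()
    req = request.Request(PROXY_URL, data=payload,
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["text"]

# Once the four servers and the proxy are up:
#   print(complete("Who are you?"))
```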
* Performance
Test performance with the vLLM benchmark script, pointing it at the proxy endpoint:
```shell
cd /vllm-workspace/vllm/benchmarks
python3 benchmark_serving.py \
  --backend vllm \
  --dataset-name random \
  --random-input-len 4096 \
  --random-output-len 1536 \
  --num-prompts 256 \
  --ignore-eos \
  --model deepseek \
  --tokenizer /data01/deepseek_r1_w8a8_zhw \
  --host localhost \
  --port 1025 \
  --endpoint /v1/completions \
  --max-concurrency 4 \
  --request-rate 4
```
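Since `--ignore-eos` forces every request to run to the full output length, the token volume these flags generate is simple arithmetic:

```python
# Token volume implied by the benchmark flags above.
num_prompts = 256   # --num-prompts
input_len = 4096    # --random-input-len
output_len = 1536   # --random-output-len

prefill_tokens = num_prompts * input_len   # tokens the prefill nodes process
decode_tokens = num_prompts * output_len   # tokens the decode nodes generate

print(prefill_tokens, decode_tokens)  # 1048576 393216
```

So a single run pushes roughly 1M tokens through the prefill pool and 0.4M through the decode pool, which is worth keeping in mind when comparing throughput numbers across configurations.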
Lines changed: 120 additions & 0 deletions
```python
import argparse
import json
import os

import torch.distributed as dist

from vllm_ascend.soc_info import NPUSocInfo

parser = argparse.ArgumentParser(
    description="Arguments of rank table generator", )
parser.add_argument("--local-host", type=str, required=True, help="local ip")
parser.add_argument("--prefill-device-cnt",
                    type=int,
                    required=True,
                    help="number of prefill devices")
parser.add_argument("--decode-device-cnt",
                    type=int,
                    required=True,
                    help="number of decode devices")
args = parser.parse_args()
local_host = args.local_host
prefill_device_cnt = args.prefill_device_cnt
decode_device_cnt = args.decode_device_cnt

print("enter py")

hccn_tool_path = os.environ.get("HCCN_TOOL_PATH",
                                "/usr/local/Ascend/driver/tools/hccn_tool")
master_addr = os.environ.get("MASTER_ADDR")
master_port = os.environ.get("MASTER_PORT")
rank = os.environ.get("RANK")
local_rank = os.environ.get("LOCAL_RANK")
# This variable is set by torchrun,
# and is different from WORLD_SIZE in gen_rank_table.sh.
world_size = os.environ.get("WORLD_SIZE")
soc_info = NPUSocInfo()


def get_cmd_stdout(cmd):
    import subprocess
    return subprocess.run(cmd, capture_output=True,
                          shell=True).stdout.decode("utf-8").strip()


print(f"local_host: {local_host}")
print("gen ranktable.json")

num_cards = get_cmd_stdout("npu-smi info -l | grep \"Total Count\"").split(
    ":")[1].strip()
num_cards = int(num_cards)
chips_per_card = get_cmd_stdout("npu-smi info -l | grep \"Chip Count\"").split(
    "\n")[0].split(":")[1].strip()
chips_per_card = int(chips_per_card)

# generate local device list for local rank 0, and gather it to all ranks
local_device_list: list[dict[str, str]] = list()
if local_rank == "0":
    super_pod_id = "0"
    for card_id in range(num_cards):
        for chip_id in range(chips_per_card):
            device_id = card_id * chips_per_card + chip_id
            if soc_info.is_a3:
                device_ip = get_cmd_stdout(
                    f"{hccn_tool_path} -i {device_id} -vnic -g | grep ipaddr"
                ).split(":")[1].strip()
                super_device_id = get_cmd_stdout(
                    f"npu-smi info -t spod-info -i {card_id} -c {chip_id} | grep SDID"
                ).split(":")[1].strip()
                super_pod_id = get_cmd_stdout(
                    f"npu-smi info -t spod-info -i {card_id} -c {chip_id} | grep \"Super Pod ID\""
                ).split(":")[1].strip()
            else:
                device_ip = get_cmd_stdout(
                    f"{hccn_tool_path} -i {device_id} -ip -g | grep ipaddr"
                ).split(":")[1].strip()

            device_info = {
                "server_id": local_host,
                "device_id": str(device_id),
                "device_ip": str(device_ip),
            }
            if soc_info.is_a3:
                device_info.update({
                    "super_pod_id": str(super_pod_id),
                    "super_device_id": str(super_device_id)
                })
            local_device_list.append(device_info)

dist.init_process_group(backend=dist.Backend.GLOO)
global_device_list = [None] * dist.get_world_size()
dist.all_gather_object(global_device_list, local_device_list)
global_device_list = [
    device_info for device_list in global_device_list
    for device_info in device_list  # type: ignore[attr-defined]
]
cnt = 1
for device_info in global_device_list:  # type: ignore[assignment]
    device_info["cluster_id"] = str(cnt)
    cnt += 1
assert (prefill_device_cnt + decode_device_cnt) <= len(global_device_list), \
    "prefill_device_cnt + decode_device_cnt must be less than or equal to number of all devices in cluster"
ranktable = {
    "version": "1.2",
    "server_count": str(world_size),
    "prefill_device_list": global_device_list[:prefill_device_cnt],
    "decode_device_list":
    global_device_list[prefill_device_cnt:prefill_device_cnt +
                       decode_device_cnt],
    "status": "completed"
}

if local_rank == '0':
    with open("ranktable.json", "w") as f:
        json.dump(ranktable, f, indent=4)

print("gen ranktable.json done")
```
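The script's `get_cmd_stdout` helper simply captures shell stdout, and every value is parsed by splitting the `npu-smi`/`hccn_tool` output on `":"`. That parsing can be exercised without Ascend tooling by substituting `echo`:

```python
import subprocess


def get_cmd_stdout(cmd):
    # Standalone copy of the helper from the script above.
    return subprocess.run(cmd, capture_output=True,
                          shell=True).stdout.decode("utf-8").strip()


# Simulate an "npu-smi info -l | grep ..." line with echo and parse it the
# same way the script does.
out = get_cmd_stdout("echo 'Total Count : 2'")
count = int(out.split(":")[1].strip())
print(count)  # 2
```

Note that this split-on-colon parsing assumes exactly one `":"` in the matched line; lines with extra colons would need `split(":", 1)` or similar, so treat this as a sketch of the existing behavior rather than hardened parsing.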
