[MoE][Dist] Fix Qwen MoE accuracy bug in DP scenario #1856
Conversation
Codecov Report: All modified and coverable lines are covered by tests ✅

Additional details and impacted files:

```
@@            Coverage Diff             @@
##             main    #1856      +/-   ##
==========================================
- Coverage   54.22%   54.18%   -0.05%
==========================================
  Files          75       74       -1
  Lines        9244     9235       -9
==========================================
- Hits         5013     5004       -9
  Misses       4231     4231
==========================================
```

Flags with carried forward coverage won't be shown. View full report in Codecov by Sentry.
Signed-off-by: MengqingCao <[email protected]>
What is confusing is that this patch does fix the accuracy problem in the online scenario, but it breaks the offline scenario entirely.
online: run dp2 on a single node

```sh
#!/bin/sh

# obtained via ifconfig:
# nic_name is the network interface name corresponding to local_ip
nic_name="enp67s0f5"
local_ip="192.168.0.183"
export HCCL_IF_IP=$local_ip
export GLOO_SOCKET_IFNAME=$nic_name
export TP_SOCKET_IFNAME=$nic_name
export HCCL_SOCKET_IFNAME=$nic_name
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=100
export VLLM_USE_V1=1
export HCCL_BUFFSIZE=1024
vllm serve /root/.cache/Qwen3-30B-A3B \
--host 0.0.0.0 \
--port 8004 \
--data-parallel-size 2 \
--data-parallel-size-local 2 \
--data-parallel-address $local_ip \
--data-parallel-rpc-port 13389 \
--seed 1024 \
--served-model-name qwen \
--enable-expert-parallel \
--max-num-seqs 16 \
--max-model-len 32768 \
--max-num-batched-tokens 4096 \
--trust-remote-code \
--no-enable-prefix-caching \
--gpu-memory-utilization 0.9 \
--additional-config '{"ascend_scheduler_config":{"enabled":true},"torchair_graph_config":{"enabled":false}}' result: INFO: 127.0.0.1:38768 - "POST /v1/completions HTTP/1.1" 200 OK client curl http://127.0.0.1:8004/v1/completions \mpletions \
-H "Content-Type: application/json" \
-d '{
"model": "qwen",
"prompt": "The future of AI is",
"max_tokens": 50,
"temperature": 0
}'
{"id":"cmpl-5ac0743caa7f4c67aca6582781d07769","object":"text_completion","created":1752805401,"model":"qwen","choices":[{"index":0,"text":" not just about the technology itself, but about how it is used to solve real-world problems. As AI continues to evolve, it will become more integrated into our daily lives, from healthcare and education to transportation and entertainment. The key to unlocking the full","logprobs":null,"finish_reason":"length","stop_reason":null,"prompt_logprobs":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":5,"total_tokens":55,"completion_tokens":50,"prompt_tokens_details":null},"kv_transfer_params":null} offline mode:run offline_data_parallel_script.py python examples/offline_data_parallel.py \
--model="/root/.cache/Qwen3-30B-A3B" \
--dp-size=2 \
--tp-size=2 \
--enable-expert-parallel
```

result:

```
(EngineCore_0 pid=3232) File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 596, in run_engine_core
(EngineCore_0 pid=3232) raise e
(EngineCore_0 pid=3232) File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 585, in run_engine_core
(EngineCore_0 pid=3232) engine_core.run_busy_loop()
(EngineCore_0 pid=3232) File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 944, in run_busy_loop
(EngineCore_0 pid=3232) executed = self._process_engine_step()
(EngineCore_0 pid=3232) File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 637, in _process_engine_step
(EngineCore_0 pid=3232) outputs, model_executed = self.step_fn()
(EngineCore_0 pid=3232) File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 241, in step
(EngineCore_0 pid=3232) model_output = self.execute_model(scheduler_output)
(EngineCore_0 pid=3232) File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 227, in execute_model
(EngineCore_0 pid=3232) raise err
(EngineCore_0 pid=3232) File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 218, in execute_model
(EngineCore_0 pid=3232) return self.model_executor.execute_model(scheduler_output)
(EngineCore_0 pid=3232) File "/vllm-workspace/vllm/vllm/v1/executor/multiproc_executor.py", line 172, in execute_model
(EngineCore_0 pid=3232) (output, ) = self.collective_rpc(
(EngineCore_0 pid=3232) File "/vllm-workspace/vllm/vllm/v1/executor/multiproc_executor.py", line 247, in collective_rpc
(EngineCore_0 pid=3232) raise TimeoutError(f"RPC call to {method} timed out.") from e
(EngineCore_0 pid=3232) TimeoutError: RPC call to execute_model timed out.
```
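For context on where this hangs: the offline path runs roughly one engine process per DP rank. Below is a condensed sketch modeled on vLLM's offline data-parallel example; the environment-variable and argument names follow upstream vLLM and may differ across versions, and the model path and port are illustrative.

```python
# Condensed sketch of an offline data-parallel run (modeled on vLLM's
# examples/offline_inference/data_parallel.py; names may vary by version).
import os
from multiprocessing import Process


def dp_worker(dp_rank: int, dp_size: int, master_ip: str, master_port: int):
    # Each DP rank is its own process and builds its own engine; the ranks
    # coordinate through these environment variables.
    os.environ["VLLM_DP_RANK"] = str(dp_rank)
    os.environ["VLLM_DP_SIZE"] = str(dp_size)
    os.environ["VLLM_DP_MASTER_IP"] = master_ip
    os.environ["VLLM_DP_MASTER_PORT"] = str(master_port)

    # Import after setting the env vars so the engine picks them up.
    from vllm import LLM, SamplingParams

    llm = LLM(
        model="/root/.cache/Qwen3-30B-A3B",  # illustrative path
        tensor_parallel_size=2,
        enable_expert_parallel=True,
    )
    outputs = llm.generate(["The future of AI is"],
                           SamplingParams(max_tokens=50, temperature=0))
    print(f"DP rank {dp_rank}: {outputs[0].outputs[0].text!r}")


if __name__ == "__main__":
    procs = [
        Process(target=dp_worker, args=(rank, 2, "127.0.0.1", 13345))
        for rank in range(2)
    ]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```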
What this PR does / why we need it?
Fix the Qwen MoE accuracy bug in the DP scenario.

The current implementation of `FusedMoE` in vLLM uses an `All2AllManager` to manage the different all2all algorithm branches. The default branch uses `Multicast` in the `dispatch` phase and `all_reduce` in the `combine` phase, neither of which is implemented in vLLM-Ascend. Execution therefore falls back to the default implementation in `base_communicator`, whose `dispatch` and `combine` operations are empty, which causes the accuracy issue.

This PR is a temporary workaround; refactoring all2all in vLLM-Ascend would be a better long-term fix.
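To make the failure mode concrete, here is a minimal sketch (class and method names are illustrative, not vLLM's exact API) of the fallback path described above: the base communicator returns its inputs unchanged, so tokens are never exchanged across ranks in `dispatch` and partial expert outputs are never reduced in `combine`, and the model produces wrong results instead of failing loudly.

```python
# Illustrative sketch only; not vLLM's exact class hierarchy or signatures.
import torch


class BaseCommunicator:
    """Fallback used when a backend provides no all2all implementation."""

    def dispatch(self, hidden_states: torch.Tensor,
                 router_logits: torch.Tensor):
        # No-op: tokens are NOT routed to the ranks that own their experts.
        return hidden_states, router_logits

    def combine(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # No-op: partial expert outputs are NOT reduced across ranks.
        return hidden_states


class FusedMoESketch:
    """Skeleton of the dispatch -> expert compute -> combine flow."""

    def __init__(self, comm: BaseCommunicator):
        self.comm = comm

    def forward(self, hidden_states: torch.Tensor,
                router_logits: torch.Tensor) -> torch.Tensor:
        hidden_states, router_logits = self.comm.dispatch(
            hidden_states, router_logits)
        expert_out = hidden_states  # stand-in for the local expert MLPs
        # With the no-op fallback, this returns per-rank partial results
        # that were never exchanged or reduced, i.e. silently wrong output.
        return self.comm.combine(expert_out)
```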
Does this PR introduce any user-facing change?
How was this patch tested?