[MoE][Dist] Fix Qwen MoE accuracy bug in DP scenario #1856


Draft · wants to merge 1 commit into main

Conversation

@MengqingCao (Collaborator) commented on Jul 17, 2025

What this PR does / why we need it?

Fix the Qwen MoE accuracy bug in the DP scenario.

The FusedMoE implementation in vLLM now uses an All2AllManager to manage the different all2all algorithm branches. The default branch uses multicast in the dispatch phase and all_reduce in the combine phase, neither of which is implemented in vLLM-Ascend. Execution therefore falls back to the default implementation in the base communicator, whose dispatch and combine operations are empty, which causes the accuracy issue.

This PR is a temporary workaround; refactoring all2all in vLLM-Ascend would be a better long-term fix.
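
For illustration, here is a minimal sketch of the failure mode described above. The class and function names are hypothetical, not the actual vLLM or vLLM-Ascend interfaces: when the backend does not override dispatch/combine, the no-op base implementations run and the MoE layer silently produces wrong results instead of raising an error.

import torch


class BaseAll2All:
    # Hypothetical stand-in for the base communicator: the default
    # dispatch/combine do nothing, so falling back here is silent.
    def dispatch(self, hidden_states: torch.Tensor) -> torch.Tensor:
        return hidden_states

    def combine(self, hidden_states: torch.Tensor) -> torch.Tensor:
        return hidden_states


class AscendAll2All(BaseAll2All):
    # Multicast-based dispatch and all_reduce-based combine are not
    # overridden here, so the no-op base methods above are what actually run.
    pass


def fused_moe_forward(comm: BaseAll2All, hidden_states: torch.Tensor) -> torch.Tensor:
    # Tokens should be routed to the ranks that own their experts ...
    routed = comm.dispatch(hidden_states)
    # ... stand-in for the local expert computation ...
    expert_out = routed * 2.0
    # ... and the partial expert outputs should be reduced back across ranks.
    # With the no-op communicator none of that exchange happens, so tokens
    # never reach the ranks holding their experts and the output is wrong.
    return comm.combine(expert_out)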

Does this PR introduce any user-facing change?

How was this patch tested?

codecov bot commented Jul 17, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 54.18%. Comparing base (ef99fe1) to head (755bd0f).
Report is 5 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1856      +/-   ##
==========================================
- Coverage   54.22%   54.18%   -0.05%     
==========================================
  Files          75       74       -1     
  Lines        9244     9235       -9     
==========================================
- Hits         5013     5004       -9     
  Misses       4231     4231              
Flag        Coverage Δ
unittests   54.18% <ø> (-0.05%) ⬇️

Flags with carried forward coverage won't be shown.


@Potabk (Contributor) commented on Jul 18, 2025

What is confusing is that this patch does fix the accuracy problem in the online (serving) scenario, but it breaks functionality in the offline scenario.

@Potabk (Contributor) commented on Jul 18, 2025

online:

run DP2 on a single node

#!/bin/sh

# this obtained through ifconfig
# nic_name is the network interface name corresponding to local_ip
nic_name="enp67s0f5"
local_ip="192.168.0.183"

export HCCL_IF_IP=$local_ip
export GLOO_SOCKET_IFNAME=$nic_name
export TP_SOCKET_IFNAME=$nic_name
export HCCL_SOCKET_IFNAME=$nic_name
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=100
export VLLM_USE_V1=1
export HCCL_BUFFSIZE=1024

vllm serve /root/.cache/Qwen3-30B-A3B \
--host 0.0.0.0 \
--port 8004 \
--data-parallel-size 2 \
--data-parallel-size-local 2 \
--data-parallel-address $local_ip \
--data-parallel-rpc-port 13389 \
--seed 1024 \
--served-model-name qwen \
--enable-expert-parallel \
--max-num-seqs 16 \
--max-model-len 32768 \
--max-num-batched-tokens 4096 \
--trust-remote-code \
--no-enable-prefix-caching \
--gpu-memory-utilization 0.9 \
--additional-config '{"ascend_scheduler_config":{"enabled":true},"torchair_graph_config":{"enabled":false}}'

result:
server:

INFO:     127.0.0.1:38768 - "POST /v1/completions HTTP/1.1" 200 OK

client

curl http://127.0.0.1:8004/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "qwen",
        "prompt": "The future of AI is",
        "max_tokens": 50,
        "temperature": 0
    }'
{"id":"cmpl-5ac0743caa7f4c67aca6582781d07769","object":"text_completion","created":1752805401,"model":"qwen","choices":[{"index":0,"text":" not just about the technology itself, but about how it is used to solve real-world problems. As AI continues to evolve, it will become more integrated into our daily lives, from healthcare and education to transportation and entertainment. The key to unlocking the full","logprobs":null,"finish_reason":"length","stop_reason":null,"prompt_logprobs":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":5,"total_tokens":55,"completion_tokens":50,"prompt_tokens_details":null},"kv_transfer_params":null}

offline mode:

run examples/offline_data_parallel.py

python examples/offline_data_parallel.py \
                --model="/root/.cache/Qwen3-30B-A3B" \
                --dp-size=2 \
                --tp-size=2 \
                --enable-expert-parallel
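
For reference, the sketch below shows roughly what one data-parallel rank of such an offline run looks like. The environment-variable names and the enable_expert_parallel keyword are assumptions based on the upstream vLLM offline data-parallel example and may differ across versions; it is not the exact script that was run.

# Per-rank sketch (assumed env-var based DP setup; one process per DP rank).
import os

from vllm import LLM, SamplingParams

# Assumed data-parallel coordination variables; the real example script
# spawns one process per rank and sets these before creating the engine.
os.environ.setdefault("VLLM_DP_RANK", "0")
os.environ.setdefault("VLLM_DP_SIZE", "2")
os.environ.setdefault("VLLM_DP_MASTER_IP", "127.0.0.1")
os.environ.setdefault("VLLM_DP_MASTER_PORT", "13345")

llm = LLM(
    model="/root/.cache/Qwen3-30B-A3B",
    tensor_parallel_size=2,
    enable_expert_parallel=True,  # assumed kwarg mirroring --enable-expert-parallel
    trust_remote_code=True,
)

outputs = llm.generate(
    ["The future of AI is"],
    SamplingParams(temperature=0, max_tokens=50),
)
print(outputs[0].outputs[0].text)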

result:
functional failure: the execute_model RPC times out

(EngineCore_0 pid=3232)   File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 596, in run_engine_core
(EngineCore_0 pid=3232)     raise e
(EngineCore_0 pid=3232)   File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 585, in run_engine_core
(EngineCore_0 pid=3232)     engine_core.run_busy_loop()
(EngineCore_0 pid=3232)   File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 944, in run_busy_loop
(EngineCore_0 pid=3232)     executed = self._process_engine_step()
(EngineCore_0 pid=3232)   File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 637, in _process_engine_step
(EngineCore_0 pid=3232)     outputs, model_executed = self.step_fn()
(EngineCore_0 pid=3232)   File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 241, in step
(EngineCore_0 pid=3232)     model_output = self.execute_model(scheduler_output)
(EngineCore_0 pid=3232)   File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 227, in execute_model
(EngineCore_0 pid=3232)     raise err
(EngineCore_0 pid=3232)   File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 218, in execute_model
(EngineCore_0 pid=3232)     return self.model_executor.execute_model(scheduler_output)
(EngineCore_0 pid=3232)   File "/vllm-workspace/vllm/vllm/v1/executor/multiproc_executor.py", line 172, in execute_model
(EngineCore_0 pid=3232)     (output, ) = self.collective_rpc(
(EngineCore_0 pid=3232)   File "/vllm-workspace/vllm/vllm/v1/executor/multiproc_executor.py", line 247, in collective_rpc
(EngineCore_0 pid=3232)     raise TimeoutError(f"RPC call to {method} timed out.") from e
(EngineCore_0 pid=3232) TimeoutError: RPC call to execute_model timed out.
