[MoE][Dist] Fix Qwen MoE accuracy bug in DP scenario #1856
Conversation
Codecov Report: All modified and coverable lines are covered by tests ✅

Additional details and impacted files:

```
@@            Coverage Diff             @@
##             main    #1856      +/-   ##
==========================================
- Coverage   54.22%   54.18%   -0.05%
==========================================
  Files          75       74       -1
  Lines        9244     9235       -9
==========================================
- Hits         5013     5004       -9
  Misses       4231     4231
==========================================
```

Flags with carried forward coverage won't be shown. View full report in Codecov by Sentry.
Signed-off-by: MengqingCao <[email protected]>
What is confusing is that this patch does fix the accuracy problem in the online scenario, but it breaks the offline scenario entirely.
online: run dp2 on a single node

```sh
#!/bin/sh

# obtained via ifconfig:
# nic_name is the network interface name corresponding to local_ip
nic_name="enp67s0f5"
local_ip="192.168.0.183"
export HCCL_IF_IP=$local_ip
export GLOO_SOCKET_IFNAME=$nic_name
export TP_SOCKET_IFNAME=$nic_name
export HCCL_SOCKET_IFNAME=$nic_name
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=100
export VLLM_USE_V1=1
export HCCL_BUFFSIZE=1024
vllm serve /root/.cache/Qwen3-30B-A3B \
--host 0.0.0.0 \
--port 8004 \
--data-parallel-size 2 \
--data-parallel-size-local 2 \
--data-parallel-address $local_ip \
--data-parallel-rpc-port 13389 \
--seed 1024 \
--served-model-name qwen \
--enable-expert-parallel \
--max-num-seqs 16 \
--max-model-len 32768 \
--max-num-batched-tokens 4096 \
--trust-remote-code \
--no-enable-prefix-caching \
--gpu-memory-utilization 0.9 \
--additional-config '{"ascend_scheduler_config":{"enabled":true},"torchair_graph_config":{"enabled":false}}' result: INFO: 127.0.0.1:38768 - "POST /v1/completions HTTP/1.1" 200 OK client curl http://127.0.0.1:8004/v1/completions \mpletions \
-H "Content-Type: application/json" \
-d '{
"model": "qwen",
"prompt": "The future of AI is",
"max_tokens": 50,
"temperature": 0
}'
{"id":"cmpl-5ac0743caa7f4c67aca6582781d07769","object":"text_completion","created":1752805401,"model":"qwen","choices":[{"index":0,"text":" not just about the technology itself, but about how it is used to solve real-world problems. As AI continues to evolve, it will become more integrated into our daily lives, from healthcare and education to transportation and entertainment. The key to unlocking the full","logprobs":null,"finish_reason":"length","stop_reason":null,"prompt_logprobs":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":5,"total_tokens":55,"completion_tokens":50,"prompt_tokens_details":null},"kv_transfer_params":null} offline mode:run offline_data_parallel_script.py python examples/offline_data_parallel.py \
--model="/root/.cache/Qwen3-30B-A3B" \
--dp-size=2 \
--tp-size=2 \
--enable-expert-parallel
```

result:

```
(EngineCore_0 pid=3232) File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 596, in run_engine_core
(EngineCore_0 pid=3232) raise e
(EngineCore_0 pid=3232) File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 585, in run_engine_core
(EngineCore_0 pid=3232) engine_core.run_busy_loop()
(EngineCore_0 pid=3232) File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 944, in run_busy_loop
(EngineCore_0 pid=3232) executed = self._process_engine_step()
(EngineCore_0 pid=3232) File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 637, in _process_engine_step
(EngineCore_0 pid=3232) outputs, model_executed = self.step_fn()
(EngineCore_0 pid=3232) File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 241, in step
(EngineCore_0 pid=3232) model_output = self.execute_model(scheduler_output)
(EngineCore_0 pid=3232) File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 227, in execute_model
(EngineCore_0 pid=3232) raise err
(EngineCore_0 pid=3232) File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 218, in execute_model
(EngineCore_0 pid=3232) return self.model_executor.execute_model(scheduler_output)
(EngineCore_0 pid=3232) File "/vllm-workspace/vllm/vllm/v1/executor/multiproc_executor.py", line 172, in execute_model
(EngineCore_0 pid=3232) (output, ) = self.collective_rpc(
(EngineCore_0 pid=3232) File "/vllm-workspace/vllm/vllm/v1/executor/multiproc_executor.py", line 247, in collective_rpc
(EngineCore_0 pid=3232) raise TimeoutError(f"RPC call to {method} timed out.") from e
(EngineCore_0 pid=3232) TimeoutError: RPC call to execute_model timed out.
```
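For context on where this hangs: the offline path runs roughly one engine process per DP rank. Below is a condensed sketch modeled on vLLM's offline data-parallel example; the environment-variable and argument names follow upstream vLLM and may differ across versions, and the model path and port are illustrative.

```python
# Condensed sketch of an offline data-parallel run (modeled on vLLM's
# examples/offline_inference/data_parallel.py; names may vary by version).
import os
from multiprocessing import Process


def dp_worker(dp_rank: int, dp_size: int, master_ip: str, master_port: int):
    # Each DP rank is its own process and builds its own engine; the ranks
    # coordinate through these environment variables.
    os.environ["VLLM_DP_RANK"] = str(dp_rank)
    os.environ["VLLM_DP_SIZE"] = str(dp_size)
    os.environ["VLLM_DP_MASTER_IP"] = master_ip
    os.environ["VLLM_DP_MASTER_PORT"] = str(master_port)

    # Import after setting the env vars so the engine picks them up.
    from vllm import LLM, SamplingParams

    llm = LLM(
        model="/root/.cache/Qwen3-30B-A3B",  # illustrative path
        tensor_parallel_size=2,
        enable_expert_parallel=True,
    )
    outputs = llm.generate(["The future of AI is"],
                           SamplingParams(max_tokens=50, temperature=0))
    print(f"DP rank {dp_rank}: {outputs[0].outputs[0].text!r}")


if __name__ == "__main__":
    procs = [
        Process(target=dp_worker, args=(rank, 2, "127.0.0.1", 13345))
        for rank in range(2)
    ]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```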
What this PR does / why we need it?
Fix the Qwen MoE accuracy bug in the DP scenario.

The current implementation of `FusedMoE` in vLLM uses an `All2AllManager` to manage the different all2all algorithm branches. The default branch uses `Multicast` in the `dispatch` phase and `all_reduce` in the `combine` phase, neither of which is implemented in vLLM-Ascend. Execution therefore falls back to the default implementation in `base_communicator`, whose `dispatch` and `combine` operations are empty, which causes the accuracy issue.

This PR is a temporary workaround; refactoring all2all in vLLM-Ascend would be a better long-term fix.
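To make the failure mode concrete, here is a minimal sketch (class and method names are illustrative, not vLLM's exact API) of the fallback path described above: the base communicator returns its inputs unchanged, so tokens are never exchanged across ranks in `dispatch` and partial expert outputs are never reduced in `combine`, and the model produces wrong results instead of failing loudly.

```python
# Illustrative sketch only; not vLLM's exact class hierarchy or signatures.
import torch


class BaseCommunicator:
    """Fallback used when a backend provides no all2all implementation."""

    def dispatch(self, hidden_states: torch.Tensor,
                 router_logits: torch.Tensor):
        # No-op: tokens are NOT routed to the ranks that own their experts.
        return hidden_states, router_logits

    def combine(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # No-op: partial expert outputs are NOT reduced across ranks.
        return hidden_states


class FusedMoESketch:
    """Skeleton of the dispatch -> expert compute -> combine flow."""

    def __init__(self, comm: BaseCommunicator):
        self.comm = comm

    def forward(self, hidden_states: torch.Tensor,
                router_logits: torch.Tensor) -> torch.Tensor:
        hidden_states, router_logits = self.comm.dispatch(
            hidden_states, router_logits)
        expert_out = hidden_states  # stand-in for the local expert MLPs
        # With the no-op fallback, this returns per-rank partial results
        # that were never exchanged or reduced, i.e. silently wrong output.
        return self.comm.combine(expert_out)
```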
Does this PR introduce any user-facing change?
How was this patch tested?