[BugFix]fixed rm_router_logits_allgather_ep bug #1817

Open
wants to merge 2 commits into main

Conversation

ttanzhiqiang
Contributor

@ttanzhiqiang commented Jul 15, 2025

The previous logic covered four situations:

  1. Prefill and decode both use AllGather or NaiveMulticast: the logic is correct, and this optimization is applied.
  2. Prefill and decode both use All2All or MC2: the logic is also correct, and this optimization is not applied.
  3. Prefill uses AllGatherEP (enabled via the VLLM_ENABLE_FUSED_EXPERTS_ALLGATHER_EP switch) while decode uses MC2: the results are affected. This is the bug.
  4. In the prefill/decode (PD) disaggregation scenario, P and D use separate strategies, so there is no impact.

With this fix, the rm_router_logits optimization scheme works with all of AllGather, NaiveMulticast, All2All, and MC2:

  1. Prefill and decode both use AllGather or NaiveMulticast: the logic is correct, and the optimization is applied.
  2. Prefill and decode both use All2All or MC2: the logic is also correct, and the optimization is applied.
  3. Prefill uses AllGatherEP (enabled via the VLLM_ENABLE_FUSED_EXPERTS_ALLGATHER_EP switch) while decode uses MC2: the optimization is applied (see the sketch after this list).
  4. In the PD disaggregation scenario, P and D use separate strategies, and the optimization is applied.
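
A minimal sketch of the idea (not the actual patch in vllm_ascend/ops/fused_moe.py; the class, function, and member names below are assumptions made for illustration): the decision to drop the separate router-logits gather has to be derived per forward step from the dispatch method actually in use, rather than fixed once at startup, so that a deployment whose prefill steps run AllGatherEP while its decode steps run MC2 takes the correct branch on every step.

from enum import Enum, auto

class FusedMoEState(Enum):
    # Dispatch methods discussed in this PR; the enum itself is illustrative.
    AllGather = auto()
    NaiveMulticast = auto()
    All2All = auto()
    MC2 = auto()
    AllGatherEP = auto()

def select_state(is_prefill: bool, allgather_ep_enabled: bool) -> FusedMoEState:
    # Hypothetical per-step selection: prefill may take AllGatherEP (the
    # VLLM_ENABLE_FUSED_EXPERTS_ALLGATHER_EP switch) while decode takes MC2.
    if is_prefill and allgather_ep_enabled:
        return FusedMoEState.AllGatherEP
    return FusedMoEState.MC2

def rm_router_logits(state: FusedMoEState) -> bool:
    # Whether the separate router-logits gather can be skipped for this step,
    # i.e. the logits are recomputed from the gathered hidden states instead.
    # The point is that this is resolved from the current step's state rather
    # than from a single global flag, which is what broke case 3 above. The
    # exact membership of this set follows the kernels and is assumed here.
    return state in (FusedMoEState.AllGather,
                     FusedMoEState.NaiveMulticast,
                     FusedMoEState.AllGatherEP)

With a per-step check like this, an AllGatherEP prefill step and an MC2 decode step each resolve rm_router_logits independently, which is the behavior the list above describes.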

How was this patch tested?

Test method for case 1

export VLLM_USE_V1=1
export TASK_QUEUE_ENABLE=1
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
export ASCEND_LAUNCH_BLOCKING=0
export VLLM_VERSION=0.9.1
nohup python -m vllm.entrypoints.openai.api_server --model=/mnt/deepseek/DeepSeek-R1-W8A8-VLLM \
--served-model-name auto \
--quantization ascend \
--trust-remote-code \
--distributed-executor-backend=mp \
--port 8006 \
-tp=4 \
-dp=4 \
--max-num-seqs 24 \
--max-model-len 32768 \
--max-num-batched-tokens 32768 \
--block-size 128 \
--no-enable-prefix-caching \
--additional-config '{"torchair_graph_config":{"enabled":true,"use_cached_graph":true,"graph_batch_sizes":[24]},"ascend_scheduler_config":{"enabled":true},"expert_tensor_parallel_size":16}' \
--gpu-memory-utilization 0.96 &> run.log &
disown

Test method for case 2

export VLLM_USE_V1=1
export TASK_QUEUE_ENABLE=1
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
export ASCEND_LAUNCH_BLOCKING=0
export VLLM_VERSION=0.9.1
nohup python -m vllm.entrypoints.openai.api_server --model=/mnt/deepseek/DeepSeek-R1-W8A8-VLLM \
--served-model-name auto \
--quantization ascend \
--trust-remote-code \
--distributed-executor-backend=mp \
--port 8006 \
-tp=4 \
-dp=4 \
--max-num-seqs 24 \
--max-model-len 32768 \
--max-num-batched-tokens 32768 \
--block-size 128 \
--no-enable-prefix-caching \
--enable_expert_parallel \
--additional-config '{"torchair_graph_config":{"enabled":true,"use_cached_graph":true,"graph_batch_sizes":[24]},"ascend_scheduler_config":{"enabled":true},"expert_tensor_parallel_size":16}' \
--gpu-memory-utilization 0.96 &> run.log &
disown

Test method for case 3

export VLLM_USE_V1=1
export TASK_QUEUE_ENABLE=1
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
export ASCEND_LAUNCH_BLOCKING=0
export VLLM_VERSION=0.9.1
export VLLM_ENABLE_FUSED_EXPERTS_ALLGATHER_EP=1
nohup python -m vllm.entrypoints.openai.api_server --model=/mnt/deepseek/DeepSeek-R1-W8A8-VLLM \
--served-model-name auto \
--quantization ascend \
--trust-remote-code \
--distributed-executor-backend=mp \
--port 8006 \
-tp=4 \
-dp=4 \
--max-num-seqs 24 \
--max-model-len 32768 \
--max-num-batched-tokens 32768 \
--block-size 128 \
--no-enable-prefix-caching \
--enable_expert_parallel \
--additional-config '{"torchair_graph_config":{"enabled":true,"use_cached_graph":true,"graph_batch_sizes":[24]},"ascend_scheduler_config":{"enabled":true},"expert_tensor_parallel_size":16}' \
--gpu-memory-utilization 0.96 &> run.log &
disown
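
After any of the three launches above, a quick completion request (not part of the original test commands; it uses the standard OpenAI-compatible endpoint exposed by vllm.entrypoints.openai.api_server) can confirm the server answers on port 8006 with the served model name "auto":

import requests

# Assumes the server launched above is reachable on localhost:8006 and was
# started with --served-model-name auto.
resp = requests.post(
    "http://localhost:8006/v1/completions",
    json={"model": "auto", "prompt": "Hello", "max_tokens": 16},
    timeout=60,
)
print(resp.status_code, resp.json())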

@ttanzhiqiang changed the title from "fixed rm_router_logits_allgather_ep bug" to "[BugFix]fixed rm_router_logits_allgather_ep bug" on Jul 15, 2025

codecov bot commented Jul 15, 2025

Codecov Report

Attention: Patch coverage is 0% with 7 lines in your changes missing coverage. Please review.

Project coverage is 53.49%. Comparing base (f96100f) to head (d2d9ee4).

Files with missing lines        Patch %   Lines
vllm_ascend/ops/fused_moe.py    0.00%     7 Missing ⚠️

❌ Your patch check has failed because the patch coverage (0.00%) is below the target coverage (100.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1817      +/-   ##
==========================================
- Coverage   53.51%   53.49%   -0.02%     
==========================================
  Files          77       77              
  Lines        9435     9438       +3     
==========================================
  Hits         5049     5049              
- Misses       4386     4389       +3     
Flag        Coverage Δ
unittests   53.49% <0.00%> (-0.02%) ⬇️

Flags with carried forward coverage won't be shown.
