[BugFix]fixed rm_router_logits_allgather_ep bug #1817
Conversation
Signed-off-by: ttanzhiqiang <[email protected]>
Codecov Report
Attention: Patch coverage is 0.00%.
❌ Your patch check has failed because the patch coverage (0.00%) is below the target coverage (100.00%). You can increase the patch coverage or adjust the target coverage.
Additional details and impacted files:
@@            Coverage Diff             @@
##             main    #1817      +/-   ##
==========================================
- Coverage   53.51%   53.49%   -0.02%
==========================================
  Files          77       77
  Lines        9435     9438       +3
==========================================
  Hits         5049     5049
- Misses       4386     4389       +3
This pull request has conflicts; please resolve them before we can evaluate the pull request.
Please rebase to fix the merge conflict if this PR is still needed.
hidden_states = chunk_hidden_states[tp_rank]
router_logits = chunk_router_logits[tp_rank]
if not self.rm_router_logits:
if num_tokens < tp_size:
This if statement's condition check can be merged directly with the padding check for num_tokens above.
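A minimal sketch of one possible reading of that suggestion, assuming the surrounding structure (the function name, F.pad padding, and tensor_split over the TP group are illustrative, not the actual vllm-ascend code):

import torch
import torch.nn.functional as F

def pad_and_split(hidden_states, router_logits, num_tokens, tp_size, tp_rank,
                  rm_router_logits):
    # Hypothetical sketch of the review suggestion: combine the
    # rm_router_logits check with the num_tokens padding check.
    if num_tokens < tp_size:
        # pad the token dimension so it splits evenly across the TP group
        hidden_states = F.pad(hidden_states, (0, 0, 0, tp_size - num_tokens))
    if not rm_router_logits and num_tokens < tp_size:
        # merged condition: router_logits only need padding when they are kept
        # (i.e. not recomputed after the gather) and padding is actually needed
        router_logits = F.pad(router_logits, (0, 0, 0, tp_size - num_tokens))
    hidden_states = torch.tensor_split(hidden_states, tp_size, dim=0)[tp_rank]
    if not rm_router_logits:
        router_logits = torch.tensor_split(router_logits, tp_size, dim=0)[tp_rank]
    return hidden_states, router_logits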
There are four situations in the previous logic: the rm_router_logits optimization scheme is used on all of the AllGather, NaiveMulticast, All2All, and MC2 communication paths (a rough sketch of the idea follows below).
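As an illustration only (the names comm, gate, and gather are assumptions, not the vllm-ascend source), the rm_router_logits path skips communicating router_logits and recomputes them from the gathered hidden_states, and the same branch applies whichever collective carries the tokens:

def gather_moe_inputs(hidden_states, router_logits, gate, comm, rm_router_logits):
    # comm stands in for whichever collective path is active
    # (AllGather / NaiveMulticast / All2All / MC2).
    gathered_hidden = comm.gather(hidden_states)
    if rm_router_logits:
        # optimization: do not communicate router_logits; recompute them
        # locally from the gathered hidden states
        gathered_logits, _ = gate(gathered_hidden)
    else:
        # original path: router_logits travel through the same collective
        gathered_logits = comm.gather(router_logits)
    return gathered_hidden, gathered_logits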
How was this patch tested?
Test method for case 1
export VLLM_USE_V1=1
export TASK_QUEUE_ENABLE=1
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
export ASCEND_LAUNCH_BLOCKING=0
export VLLM_VERSION=0.9.1
nohup python -m vllm.entrypoints.openai.api_server --model=/mnt/deepseek/DeepSeek-R1-W8A8-VLLM \
  --served-model-name auto \
  --quantization ascend \
  --trust-remote-code \
  --distributed-executor-backend=mp \
  --port 8006 \
  -tp=4 \
  -dp=4 \
  --max-num-seqs 24 \
  --max-model-len 32768 \
  --max-num-batched-tokens 32768 \
  --block-size 128 \
  --no-enable-prefix-caching \
  --additional-config '{"torchair_graph_config":{"enabled":true,"use_cached_graph":true,"graph_batch_sizes":[24]},"ascend_scheduler_config":{"enabled":true},"expert_tensor_parallel_size":16}' \
  --gpu-memory-utilization 0.96 &> run.log &
disown
Test method for case 2
export VLLM_USE_V1=1
export TASK_QUEUE_ENABLE=1
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
export ASCEND_LAUNCH_BLOCKING=0
export VLLM_VERSION=0.9.1
nohup python -m vllm.entrypoints.openai.api_server --model=/mnt/deepseek/DeepSeek-R1-W8A8-VLLM \
  --served-model-name auto \
  --quantization ascend \
  --trust-remote-code \
  --distributed-executor-backend=mp \
  --port 8006 \
  -tp=4 \
  -dp=4 \
  --max-num-seqs 24 \
  --max-model-len 32768 \
  --max-num-batched-tokens 32768 \
  --block-size 128 \
  --no-enable-prefix-caching \
  --enable_expert_parallel \
  --additional-config '{"torchair_graph_config":{"enabled":true,"use_cached_graph":true,"graph_batch_sizes":[24]},"ascend_scheduler_config":{"enabled":true},"expert_tensor_parallel_size":16}' \
  --gpu-memory-utilization 0.96 &> run.log &
disown
Test method for case 3
export VLLM_USE_V1=1
export TASK_QUEUE_ENABLE=1
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
export ASCEND_LAUNCH_BLOCKING=0
export VLLM_VERSION=0.9.1
export VLLM_ENABLE_FUSED_EXPERTS_ALLGATHER_EP=1
nohup python -m vllm.entrypoints.openai.api_server --model=/mnt/deepseek/DeepSeek-R1-W8A8-VLLM \
  --served-model-name auto \
  --quantization ascend \
  --trust-remote-code \
  --distributed-executor-backend=mp \
  --port 8006 \
  -tp=4 \
  -dp=4 \
  --max-num-seqs 24 \
  --max-model-len 32768 \
  --max-num-batched-tokens 32768 \
  --block-size 128 \
  --no-enable-prefix-caching \
  --enable_expert_parallel \
  --additional-config '{"torchair_graph_config":{"enabled":true,"use_cached_graph":true,"graph_batch_sizes":[24]},"ascend_scheduler_config":{"enabled":true},"expert_tensor_parallel_size":16}' \
  --gpu-memory-utilization 0.96 &> run.log &
disown
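For any of the three launches above, a simple smoke test is to send one request to the OpenAI-compatible endpoint once the server is up. This is a hypothetical check, not part of the patch; the model name matches --served-model-name and the port matches --port:

import requests

resp = requests.post(
    "http://localhost:8006/v1/completions",
    json={"model": "auto", "prompt": "The quick brown fox", "max_tokens": 32},
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["text"])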