Conversation

@Isotr0py (Member) commented Oct 20, 2025

Purpose

Discussion: https://vllm-dev.slack.com/archives/C07QCGVDNUF/p1760976569264999

cc @tjtanaa @ProExpertProg

Test Plan

Test Result


@mergify bot added the rocm (Related to AMD ROCm) label Oct 20, 2025
@gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request re-enables the mrope Triton kernel for CUDA/ROCm platforms. The logic is mostly correct, but I've found a critical issue where the availability of Triton is not checked before enabling the kernel, which could lead to runtime crashes. I've provided a comment with a suggested fix that also improves the code's readability.

Comment on lines 256 to 261
enabled = super().enabled()
compilation_config = get_cached_compilation_config()
custom_ops = compilation_config.custom_ops
disabled = hasattr(cls, "name") and f"-{cls.name}" in custom_ops
use_triton = current_platform.is_cuda_alike()
return (use_triton or enabled) and not disabled
critical

This implementation has two issues:

  1. Missing Triton check (Critical Bug): The mrope Triton kernel is enabled on CUDA/ROCm platforms, but there's no check to ensure Triton is actually available. If Triton is not installed or not configured correctly, this will lead to a runtime crash when forward_cuda is called.
  2. Complex logic (Maintainability): The boolean logic (use_triton or enabled) and not disabled is a bit convoluted and hard to reason about.

I've provided a suggestion that fixes the bug and refactors the logic to be more explicit and readable, separating the logic for CUDA-alike platforms from others.

        from vllm.triton_utils import HAS_TRITON
        if not HAS_TRITON:
            return False

        # On CUDA/ROCm, the Triton kernel is enabled by default unless
        # explicitly disabled.
        if current_platform.is_cuda_alike():
            compilation_config = get_cached_compilation_config()
            custom_ops = compilation_config.custom_ops
            disabled = hasattr(cls, "name") and f"-{cls.name}" in custom_ops
            return not disabled

        # On other platforms, fall back to the default behavior.
        return super().enabled()
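
As a usage note (an illustrative assumption rather than part of the suggestion above): with either variant, a user could still opt out of the kernel through the compilation config's custom_ops list, assuming the op is registered under the name "mrope" and that compilation_config accepts a dict as in recent vLLM releases:

from vllm import LLM

# Hypothetical opt-out sketch; the op name "mrope" and passing
# compilation_config as a dict here are assumptions for illustration.
llm = LLM(
    model="Qwen/Qwen3-VL-4B-Instruct",
    compilation_config={"custom_ops": ["-mrope"]},
)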

@ProExpertProg (Collaborator) left a comment

The custom op enablement mechanism is complex as it is; please let's not add more complexity here. Could you instead add logic to VllmConfig.__post_init__ that enables mrope by default on CUDA-alike platforms? You can add the CustomOp.register decorator to mrope, or you can conditionally enable rope if the model uses mrope.
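
For illustration, a rough sketch of how that __post_init__ approach might look (an assumption based on the custom_ops "+/-" convention above, not code from this PR; the op name "mrope" is a placeholder):

# Hypothetical sketch of the suggestion, not the PR's implementation: enable
# the mrope custom op by default on CUDA-alike platforms from
# VllmConfig.__post_init__, unless the user opted in or out explicitly.
def __post_init__(self):
    ...
    from vllm.platforms import current_platform

    custom_ops = self.compilation_config.custom_ops
    if (
        current_platform.is_cuda_alike()
        and "+mrope" not in custom_ops
        and "-mrope" not in custom_ops
    ):
        custom_ops.append("+mrope")

With something like this in place, the MRotaryEmbedding op could presumably keep the default CustomOp.enabled() logic instead of overriding it.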

Signed-off-by: Isotr0py <[email protected]>
@Isotr0py (Member, Author) commented

Benchmark results on RTX 3090

vllm serve Qwen/Qwen3-VL-4B-Instruct/ --limit-mm-per-prompt.video 0
vllm bench serve --backend openai-chat --endpoint /v1/chat/completions --model Qwen/Qwen3-VL-4B-Instruct/ --dataset-name hf --dataset-path "lmarena-ai/VisionArena-Chat" --hf-split train --num-prompts 200 --max-concurrency 64

Main branch

============ Serving Benchmark Result ============
Successful requests:                     200       
Failed requests:                         0         
Maximum request concurrency:             64        
Benchmark duration (s):                  32.80     
Total input tokens:                      15317     
Total generated tokens:                  24209     
Request throughput (req/s):              6.10      
Output token throughput (tok/s):         738.01    
Peak output token throughput (tok/s):    2441.00   
Peak concurrent requests:                86.00     
Total Token throughput (tok/s):          1204.95   
---------------Time to First Token----------------
Mean TTFT (ms):                          2187.02   
Median TTFT (ms):                        1817.40   
P99 TTFT (ms):                           6392.69   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          76.44     
Median TPOT (ms):                        67.19     
P99 TPOT (ms):                           372.61    
---------------Inter-token Latency----------------
Mean ITL (ms):                           67.52     
Median ITL (ms):                         27.50     
P99 ITL (ms):                            424.95    
==================================================

PR

============ Serving Benchmark Result ============
Successful requests:                     200       
Failed requests:                         0         
Maximum request concurrency:             64        
Benchmark duration (s):                  33.34     
Total input tokens:                      15317     
Total generated tokens:                  24140     
Request throughput (req/s):              6.00      
Output token throughput (tok/s):         723.96    
Peak output token throughput (tok/s):    2559.00   
Peak concurrent requests:                79.00     
Total Token throughput (tok/s):          1183.32   
---------------Time to First Token----------------
Mean TTFT (ms):                          1984.13   
Median TTFT (ms):                        1241.46   
P99 TTFT (ms):                           6767.76   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          81.76     
Median TPOT (ms):                        69.51     
P99 TPOT (ms):                           373.03    
---------------Inter-token Latency----------------
Mean ITL (ms):                           70.64     
Median ITL (ms):                         27.44     
P99 ITL (ms):                            423.14    
==================================================

Hmmm, it seems using the Triton MRoPE kernel gives a lower TTFT. Perhaps @tjtanaa can help benchmark on the ROCm platform as well?

@tjtanaa (Contributor) commented Oct 21, 2025

@Isotr0py, we observed the same trend on MI300X.

Server command

VLLM_USE_V1=1 \
VLLM_ROCM_USE_AITER=1 \
vllm serve Qwen/Qwen3-VL-4B-Instruct \
--tensor-parallel-size 1 \
--limit-mm-per-prompt.video 0 \
--port 8090 \
> logs/server.log 2>&1

Bench

vllm bench serve  \
--backend openai-chat \
--endpoint /v1/chat/completions \
--model Qwen/Qwen3-VL-4B-Instruct \
--endpoint /v1/chat/completions  \
--dataset-name hf \
--dataset-path "lmarena-ai/VisionArena-Chat" \
--hf-split train \
--num-prompts 2000 \
--port 8090 \
--max-concurrency 64 \
> logs/before.log 2>&1

Before PR

============ Serving Benchmark Result ============
Successful requests:                     2000     
Failed requests:                         0        
Maximum request concurrency:             64       
Benchmark duration (s):                  255.60   
Total input tokens:                      190515   
Total generated tokens:                  244872   
Request throughput (req/s):              7.82     
Output token throughput (tok/s):         958.04   
Peak output token throughput (tok/s):    641.00   
Peak concurrent requests:                84.00    
Total Token throughput (tok/s):          1703.41  
---------------Time to First Token----------------
Mean TTFT (ms):                          6377.35  
Median TTFT (ms):                        6553.74  
P99 TTFT (ms):                           7343.13  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          13.63    
Median TPOT (ms):                        12.90    
P99 TPOT (ms):                           26.11    
---------------Inter-token Latency----------------
Mean ITL (ms):                           131.18   
Median ITL (ms):                         118.18   
P99 ITL (ms):                            284.10   
==================================================

After PR

============ Serving Benchmark Result ============
Successful requests:                     2000     
Failed requests:                         0        
Maximum request concurrency:             64       
Benchmark duration (s):                  257.02   
Total input tokens:                      190515   
Total generated tokens:                  244993   
Request throughput (req/s):              7.78     
Output token throughput (tok/s):         953.19   
Peak output token throughput (tok/s):    1403.00  
Peak concurrent requests:                110.00   
Total Token throughput (tok/s):          1694.43  
---------------Time to First Token----------------
Mean TTFT (ms):                          6360.02  
Median TTFT (ms):                        6531.22  
P99 TTFT (ms):                           8692.30  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          14.26    
Median TPOT (ms):                        12.92    
P99 TPOT (ms):                           40.83    
---------------Inter-token Latency----------------
Mean ITL (ms):                           118.71   
Median ITL (ms):                         22.33    
P99 ITL (ms):                            289.05   
==================================================

@tjtanaa (Contributor) commented Oct 21, 2025

Do we still want to re-enable the mrope Triton kernel for CUDA/ROCm platforms by default?
