Conversation

@Isotr0py (Member) commented Oct 20, 2025

Purpose

Discussion: https://vllm-dev.slack.com/archives/C07QCGVDNUF/p1760976569264999

cc @tjtanaa @ProExpertProg

Test Plan

Test Result


@mergify bot added the rocm (Related to AMD ROCm) label Oct 20, 2025
@gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request re-enables the mrope Triton kernel for CUDA/ROCm platforms. The logic is mostly correct, but I've found a critical issue where the availability of Triton is not checked before enabling the kernel, which could lead to runtime crashes. I've provided a comment with a suggested fix that also improves the code's readability.

Comment on lines 256 to 261
enabled = super().enabled()
compilation_config = get_cached_compilation_config()
custom_ops = compilation_config.custom_ops
disabled = hasattr(cls, "name") and f"-{cls.name}" in custom_ops
use_triton = current_platform.is_cuda_alike()
return (use_triton or enabled) and not disabled
critical

This implementation has two issues:

  1. Missing Triton check (Critical Bug): The mrope Triton kernel is enabled on CUDA/ROCm platforms, but there's no check to ensure Triton is actually available. If Triton is not installed or not configured correctly, this will lead to a runtime crash when forward_cuda is called.
  2. Complex logic (Maintainability): The boolean logic (use_triton or enabled) and not disabled is a bit convoluted and hard to reason about.

I've provided a suggestion that fixes the bug and refactors the logic to be more explicit and readable, separating the logic for CUDA-alike platforms from others.

        from vllm.triton_utils import HAS_TRITON
        if not HAS_TRITON:
            return False

        # On CUDA/ROCm, the Triton kernel is enabled by default unless
        # explicitly disabled.
        if current_platform.is_cuda_alike():
            compilation_config = get_cached_compilation_config()
            custom_ops = compilation_config.custom_ops
            disabled = hasattr(cls, "name") and f"-{cls.name}" in custom_ops
            return not disabled

        # On other platforms, fall back to the default behavior.
        return super().enabled()
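
As a usage note (an illustrative assumption rather than part of the suggestion above): with either variant, a user could still opt out of the kernel through the compilation config's custom_ops list, assuming the op is registered under the name "mrope" and that compilation_config accepts a dict as in recent vLLM releases:

from vllm import LLM

# Hypothetical opt-out sketch; the op name "mrope" and passing
# compilation_config as a dict here are assumptions for illustration.
llm = LLM(
    model="Qwen/Qwen3-VL-4B-Instruct",
    compilation_config={"custom_ops": ["-mrope"]},
)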

@ProExpertProg (Collaborator) left a comment

The custom op enablement mechanism is complex as it is; please let's not add more complexity here. Could you instead add logic to VllmConfig.__post_init__ that enables mrope by default on CUDA-alike platforms? You can add the CustomOp.register decorator to mrope, or you can conditionally enable rope if the model uses mrope.
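
For illustration, a rough sketch of how that __post_init__ approach might look (an assumption based on the custom_ops "+/-" convention above, not code from this PR; the op name "mrope" is a placeholder):

# Hypothetical sketch of the suggestion, not the PR's implementation: enable
# the mrope custom op by default on CUDA-alike platforms from
# VllmConfig.__post_init__, unless the user opted in or out explicitly.
def __post_init__(self):
    ...
    from vllm.platforms import current_platform

    custom_ops = self.compilation_config.custom_ops
    if (
        current_platform.is_cuda_alike()
        and "+mrope" not in custom_ops
        and "-mrope" not in custom_ops
    ):
        custom_ops.append("+mrope")

With something like this in place, the MRotaryEmbedding op could presumably keep the default CustomOp.enabled() logic instead of overriding it.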

Signed-off-by: Isotr0py <[email protected]>
@Isotr0py (Member, Author) commented

Benchmark results on RTX 3090

vllm serve Qwen/Qwen3-VL-4B-Instruct/ --limit-mm-per-prompt.video 0
vllm bench serve --backend openai-chat --endpoint /v1/chat/completions --model Qwen/Qwen3-VL-4B-Instruct/ --dataset-name hf --dataset-path "lmarena-ai/VisionArena-Chat" --hf-split train --num-prompts 200 --max-concurrency 64

Main branch

============ Serving Benchmark Result ============
Successful requests:                     200       
Failed requests:                         0         
Maximum request concurrency:             64        
Benchmark duration (s):                  32.80     
Total input tokens:                      15317     
Total generated tokens:                  24209     
Request throughput (req/s):              6.10      
Output token throughput (tok/s):         738.01    
Peak output token throughput (tok/s):    2441.00   
Peak concurrent requests:                86.00     
Total Token throughput (tok/s):          1204.95   
---------------Time to First Token----------------
Mean TTFT (ms):                          2187.02   
Median TTFT (ms):                        1817.40   
P99 TTFT (ms):                           6392.69   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          76.44     
Median TPOT (ms):                        67.19     
P99 TPOT (ms):                           372.61    
---------------Inter-token Latency----------------
Mean ITL (ms):                           67.52     
Median ITL (ms):                         27.50     
P99 ITL (ms):                            424.95    
==================================================

PR

============ Serving Benchmark Result ============
Successful requests:                     200       
Failed requests:                         0         
Maximum request concurrency:             64        
Benchmark duration (s):                  33.34     
Total input tokens:                      15317     
Total generated tokens:                  24140     
Request throughput (req/s):              6.00      
Output token throughput (tok/s):         723.96    
Peak output token throughput (tok/s):    2559.00   
Peak concurrent requests:                79.00     
Total Token throughput (tok/s):          1183.32   
---------------Time to First Token----------------
Mean TTFT (ms):                          1984.13   
Median TTFT (ms):                        1241.46   
P99 TTFT (ms):                           6767.76   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          81.76     
Median TPOT (ms):                        69.51     
P99 TPOT (ms):                           373.03    
---------------Inter-token Latency----------------
Mean ITL (ms):                           70.64     
Median ITL (ms):                         27.44     
P99 ITL (ms):                            423.14    
==================================================

Hmmm, it seems using the Triton MRoPE kernel gives a lower TTFT. Perhaps @tjtanaa can help benchmark on the ROCm platform as well?

@tjtanaa (Contributor) commented Oct 21, 2025

@Isotr0py, we observed the same trend on MI300X.

Server command

VLLM_USE_V1=1 \
VLLM_ROCM_USE_AITER=1 \
vllm serve Qwen/Qwen3-VL-4B-Instruct \
--tensor-parallel-size 1 \
--limit-mm-per-prompt.video 0 \
--port 8090 \
> logs/server.log 2>&1

Bench

vllm bench serve  \
--backend openai-chat \
--endpoint /v1/chat/completions \
--model Qwen/Qwen3-VL-4B-Instruct \
--endpoint /v1/chat/completions  \
--dataset-name hf \
--dataset-path "lmarena-ai/VisionArena-Chat" \
--hf-split train \
--num-prompts 2000 \
--port 8090 \
--max-concurrency 64 \
> logs/before.log 2>&1

Before PR

============ Serving Benchmark Result ============
Successful requests:                     2000     
Failed requests:                         0        
Maximum request concurrency:             64       
Benchmark duration (s):                  255.60   
Total input tokens:                      190515   
Total generated tokens:                  244872   
Request throughput (req/s):              7.82     
Output token throughput (tok/s):         958.04   
Peak output token throughput (tok/s):    641.00   
Peak concurrent requests:                84.00    
Total Token throughput (tok/s):          1703.41  
---------------Time to First Token----------------
Mean TTFT (ms):                          6377.35  
Median TTFT (ms):                        6553.74  
P99 TTFT (ms):                           7343.13  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          13.63    
Median TPOT (ms):                        12.90    
P99 TPOT (ms):                           26.11    
---------------Inter-token Latency----------------
Mean ITL (ms):                           131.18   
Median ITL (ms):                         118.18   
P99 ITL (ms):                            284.10   
==================================================

After PR

============ Serving Benchmark Result ============
Successful requests:                     2000     
Failed requests:                         0        
Maximum request concurrency:             64       
Benchmark duration (s):                  257.02   
Total input tokens:                      190515   
Total generated tokens:                  244993   
Request throughput (req/s):              7.78     
Output token throughput (tok/s):         953.19   
Peak output token throughput (tok/s):    1403.00  
Peak concurrent requests:                110.00   
Total Token throughput (tok/s):          1694.43  
---------------Time to First Token----------------
Mean TTFT (ms):                          6360.02  
Median TTFT (ms):                        6531.22  
P99 TTFT (ms):                           8692.30  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          14.26    
Median TPOT (ms):                        12.92    
P99 TPOT (ms):                           40.83    
---------------Inter-token Latency----------------
Mean ITL (ms):                           118.71   
Median ITL (ms):                         22.33    
P99 ITL (ms):                            289.05   
==================================================

@tjtanaa (Contributor) commented Oct 21, 2025

Do we still want to re-enable the mrope Triton kernel for CUDA/ROCm platforms by default?
