[Kernel] Re-enable mrope triton kernel for CUDA/ROCM platform by default #27216
Conversation
Signed-off-by: Isotr0py <[email protected]>
Code Review
This pull request re-enables the mrope Triton kernel for CUDA/ROCm platforms. The logic is mostly correct, but I've found a critical issue where the availability of Triton is not checked before enabling the kernel, which could lead to runtime crashes. I've provided a comment with a suggested fix that also improves the code's readability.
enabled = super().enabled()
compilation_config = get_cached_compilation_config()
custom_ops = compilation_config.custom_ops
disabled = hasattr(cls, "name") and f"-{cls.name}" in custom_ops
use_triton = current_platform.is_cuda_alike()
return (use_triton or enabled) and not disabled
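For context, this hunk reads as the body of the custom op's enabled() classmethod. A minimal sketch of that surrounding context follows; the class name, the "mrope" registration name, and the import paths are assumptions and are not confirmed by the diff:

```python
# Sketch only (not the PR's actual file): class name, CustomOp.register name
# "mrope", and import paths are assumptions for illustration.
from vllm.config import get_cached_compilation_config
from vllm.model_executor.custom_op import CustomOp
from vllm.platforms import current_platform


@CustomOp.register("mrope")  # assumed registration name
class MRotaryEmbedding(CustomOp):
    @classmethod
    def enabled(cls) -> bool:
        # Default-on for CUDA/ROCm unless "-mrope" appears in
        # compilation_config.custom_ops; other platforms keep the
        # regular CustomOp enablement behavior.
        enabled = super().enabled()
        compilation_config = get_cached_compilation_config()
        custom_ops = compilation_config.custom_ops
        disabled = hasattr(cls, "name") and f"-{cls.name}" in custom_ops
        use_triton = current_platform.is_cuda_alike()
        return (use_triton or enabled) and not disabled
```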
This implementation has two issues:
- Missing Triton check (Critical Bug): The mrope Triton kernel is enabled on CUDA/ROCm platforms, but there is no check that Triton is actually available. If Triton is not installed or not configured correctly, this will lead to a runtime crash when forward_cuda is called.
- Complex logic (Maintainability): The boolean logic (use_triton or enabled) and not disabled is a bit convoluted and hard to reason about.
I've provided a suggestion that fixes the bug and refactors the logic to be more explicit and readable, separating the logic for CUDA-alike platforms from others.
from vllm.triton_utils import HAS_TRITON
if not HAS_TRITON:
    return False
# On CUDA/ROCm, the Triton kernel is enabled by default unless
# explicitly disabled.
if current_platform.is_cuda_alike():
    compilation_config = get_cached_compilation_config()
    custom_ops = compilation_config.custom_ops
    disabled = hasattr(cls, "name") and f"-{cls.name}" in custom_ops
    return not disabled
# On other platforms, fall back to the default behavior.
return super().enabled()
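As a usage note: with either variant, a user could still force the native path explicitly by disabling the op. A hypothetical example, assuming the op is registered under the name "mrope" and that compilation_config accepts a dict here (both assumptions, not taken from the PR):

```python
# Hypothetical opt-out: pass "-<op-name>" via custom_ops to skip the Triton
# MRoPE kernel and fall back to the native implementation.
from vllm import LLM

llm = LLM(
    model="Qwen/Qwen2-VL-7B-Instruct",  # illustrative mrope-based model
    compilation_config={"custom_ops": ["-mrope"]},
)
```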
The custom op enablement mechanism is complex enough as it is; please let's not add complexity here. Could you instead add logic to VllmConfig.__post_init__ that enables mrope by default on CUDA-alike platforms? You can add the CustomOp.register decorator to mrope, or you can conditionally enable rope if the model uses mrope.
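A minimal sketch of what that alternative could look like, assuming the op is registered as "mrope" and that custom_ops accepts "+name"/"-name" entries; the helper name and call site are hypothetical, not the reviewer's code:

```python
# Hypothetical helper that VllmConfig.__post_init__ could call; all names
# besides current_platform.is_cuda_alike() are assumptions for illustration.
from vllm.platforms import current_platform


def _maybe_enable_mrope_by_default(compilation_config, model_uses_mrope: bool) -> None:
    custom_ops = compilation_config.custom_ops
    # Respect anything the user configured explicitly ("all", "none",
    # "+mrope", or "-mrope"); otherwise opt in on CUDA-alike platforms.
    explicitly_configured = any(
        op in ("all", "none", "+mrope", "-mrope") for op in custom_ops
    )
    if model_uses_mrope and current_platform.is_cuda_alike() and not explicitly_configured:
        custom_ops.append("+mrope")
```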
Signed-off-by: Isotr0py <[email protected]>
Benchmark results on RTX 3090
Main branch
PR
Hmmm, it seems using the Triton MRoPE kernel can achieve a lower TTFT. Perhaps @tjtanaa can help benchmark on the ROCm platform as well?
@Isotr0py, we observed the same trend on MI300X.
Server command
Bench
Before PR
After PR
Do we still want to re-enable the mrope Triton kernel for CUDA/ROCm platforms by default?
Purpose
Discussion: https://vllm-dev.slack.com/archives/C07QCGVDNUF/p1760976569264999
cc @tjtanaa @ProExpertProg
Test Plan
Test Result
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.