Conversation

@sakogan (Contributor) commented on Jul 29, 2025

This PR enhances the work started in #18768 and #20766 by introducing Marlin-based kernels for calibration-free RTN (round-to-nearest) quantization.

These kernels substantially improve the performance of dense models quantized with RTN: in the measurements below, they are up to ~4.8x faster than the previous RTN kernels and up to ~1.5x faster than the unquantized BF16 baseline.
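For context, a minimal sketch of selecting RTN quantization through vLLM's offline `LLM` API; the model name is illustrative, and the `RTN_NUM_BITS` semantics are inferred from the benchmark command below (set to 4 for Int4 weights, unset corresponds to Int8):

    import os

    # Select 4-bit RTN weights, as in the benchmark command below;
    # leaving the variable unset corresponds to the Int8 variant.
    os.environ["RTN_NUM_BITS"] = "4"

    from vllm import LLM, SamplingParams

    # Model name is illustrative; any dense model supported by RTN works.
    llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", quantization="rtn")
    outputs = llm.generate(["Hello!"], SamplingParams(max_tokens=32))
    print(outputs[0].outputs[0].text)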

To measure the effect of the new kernels, we ran `benchmark_latency.py` with several Llama models on a machine equipped with H100 GPUs. The exact command was:

    [RTN_NUM_BITS=4] python benchmark_latency.py --model <model> --n 1 --num-iters-warmup 3 --num-iters 10 --input-len 256 --output-len 32 -tp <#GPUs> --batch-size <batch> [--quantization rtn]

Each data point is the average of 5 runs; units are seconds of generation latency (lower is better).

Here are the results for Llama3.1-8B (run on 1 GPU) for various batch sizes:

| Variant (data type) | batch 1 | batch 4 | batch 8 | batch 16 |
|---|---:|---:|---:|---:|
| Baseline (BF16) | 0.236 | 0.260 | 0.284 | 0.336 |
| old RTN (Int8) | 0.469 | 0.500 | 0.526 | 0.581 |
| new RTN (Int8) | 0.186 | 0.231 | 0.248 | 0.300 |
| old RTN (Int4) | 0.716 | 0.756 | 0.788 | 0.842 |
| new RTN (Int4) | 0.154 | 0.194 | 0.216 | 0.267 |
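The implied speedups can be derived directly from the table; a short Python check using the Int4 rows above (at batch size 1 this prints roughly 4.65x over old RTN and 1.53x over BF16):

    batch_sizes = [1, 4, 8, 16]
    baseline_bf16 = [0.236, 0.260, 0.284, 0.336]
    old_rtn_int4 = [0.716, 0.756, 0.788, 0.842]
    new_rtn_int4 = [0.154, 0.194, 0.216, 0.267]

    for b, base, old, new in zip(batch_sizes, baseline_bf16, old_rtn_int4, new_rtn_int4):
        # Latency ratios: old Int4 kernel vs. new, and BF16 baseline vs. new.
        print(f"batch {b}: {old / new:.2f}x over old RTN, {base / new:.2f}x over BF16")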

Here are the results for Llama3.3-70B (run on 4 GPUs) for various batch sizes:

| Variant (data type) | batch 1 | batch 4 | batch 8 | batch 16 |
|---|---:|---:|---:|---:|
| Baseline (BF16) | 0.558 | 0.629 | 0.700 | 0.855 |
| old RTN (Int8) | 1.131 | 1.216 | 1.287 | 1.436 |
| new RTN (Int8) | 0.440 | 0.563 | 0.616 | 0.764 |
| old RTN (Int4) | 1.732 | 1.850 | 1.920 | 2.068 |
| new RTN (Int4) | 0.358 | 0.466 | 0.531 | 0.681 |

@mergify bot added the ci/build label on Jul 29, 2025
@github-actions (bot) commented:

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, they only run the fastcheck CI, which covers a small, essential subset of tests to catch errors quickly. You can run additional CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run full CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@gemini-code-assist bot (Contributor) left a comment:

Code Review

This pull request introduces significant performance improvements for RTN quantization by integrating Marlin-based GEMM kernels. The changes are well-structured, touching the build system, C++ ops, Python bindings, and the quantization logic. The performance gains demonstrated in the PR description are impressive.

My review focuses on ensuring the robustness and correctness of the new implementation, particularly in multi-GPU scenarios and with respect to build configurations. I've identified one critical issue related to workspace management and two high-severity issues concerning compile-time checks in the CUDA kernels.

Once these points are addressed, this will be a fantastic addition to the project.

Comment on lines 225 to 226

Severity: critical

The workspace is shared as a class attribute across all instances of RTNLinearMethod. The current logic for re-allocating the workspace checks for size but not for the device.

In a multi-GPU setup (e.g., when loading multiple models on different devices in the same process), this can lead to a critical runtime error. If a layer on cuda:1 is processed after a layer on cuda:0, and the cuda:1 layer requires a smaller workspace, the if condition will be false. This will result in using the workspace allocated on cuda:0 for a GEMM operation on cuda:1, causing a cross-device error.

To prevent this, you should also check if the workspace is on the correct device before using it.

        if RTNLinearMethod.workspace is None or \
            RTNLinearMethod.workspace.device != device or \
            RTNLinearMethod.workspace.numel() < layer.output_size_per_partition:
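For illustration, a minimal Python sketch of the guarded (re)allocation this suggestion implies; the `_ensure_workspace` helper and the int32 dtype are assumptions for the example, not the PR's actual code:

    from typing import Optional

    import torch


    class RTNLinearMethod:
        # Workspace shared as a class attribute across all instances.
        workspace: Optional[torch.Tensor] = None

        @staticmethod
        def _ensure_workspace(device: torch.device, min_numel: int) -> torch.Tensor:
            ws = RTNLinearMethod.workspace
            # Reallocate when the workspace is missing, lives on a different
            # device, or is too small for the current layer.
            if ws is None or ws.device != device or ws.numel() < min_numel:
                ws = torch.zeros(min_numel, dtype=torch.int32, device=device)
                RTNLinearMethod.workspace = ws
            return ws

Checking the device alongside the size ensures a layer on cuda:1 never reuses a buffer allocated on cuda:0.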

Comment on lines 332 to 334

Severity: high

Using assert(0) for unsupported architectures leads to runtime errors that are difficult to debug, and the assert is compiled out entirely in release builds (when NDEBUG is defined). It's better to enforce this constraint at compile time.

A common technique to cause a compile-time error for unsupported preprocessor branches is to reference an incomplete type. This ensures that any attempt to compile this code for an unsupported architecture will fail with a clear message.

#else
                // Cause a compile-time error for unsupported architectures.
                (void)sizeof(struct cuda_arch_not_supported);
#endif

Comment on lines 347 to 349

Severity: high

Similar to the previous comment, using assert(0) here is not ideal. A compile-time check is more robust and provides clearer feedback to developers if they attempt to build for an unsupported architecture.

#else
                // Cause a compile-time error for unsupported architectures.
                (void)sizeof(struct cuda_arch_not_supported);
#endif

@sakogan closed this on Jul 29, 2025
@sakogan force-pushed the rtn-marlin-kernels branch from 3b0400f to 0ae970e on Jul 29, 2025 at 20:04
@sakogan deleted the rtn-marlin-kernels branch on Aug 18, 2025