[Performance] Introduce Marlin-based GEMM kernels for the calibration-free RTN-based quantization #21865

sakogan · 2025-07-29T20:14:38Z

This PR enhances the work started in #18768 and #20766 by introducing Marlin-based kernels for the calibration-free RTN-based quantization.

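These kernels substantially improve the performance of dense models quantized with RTN. In the benchmarks below, the new kernels are roughly 1.9x to 4.8x faster than the previous RTN path and are also faster than the BF16 baseline at every tested batch size.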
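For readers unfamiliar with RTN (round-to-nearest), here is a minimal sketch of the idea in plain Python. It assumes symmetric quantization with a single scale per weight group; the function names `rtn_quantize` and `rtn_dequantize` are hypothetical, and this is neither the PR's kernel code nor necessarily the exact scheme it uses. The point is only that the scale comes from the weights themselves, so no calibration data is needed.

```python
def rtn_quantize(weights, num_bits=4):
    """Round-to-nearest quantization of one weight group (illustrative only)."""
    qmax = 2 ** (num_bits - 1) - 1          # 7 for Int4, 127 for Int8
    qmin = -qmax - 1
    max_abs = max(abs(w) for w in weights)
    # One scale per group, derived from the weights alone (no calibration data).
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Python's round() resolves exact ties to the nearest even integer.
    codes = [min(qmax, max(qmin, round(w / scale))) for w in weights]
    return codes, scale


def rtn_dequantize(codes, scale):
    """Map integer codes back to approximate floating-point weights."""
    return [c * scale for c in codes]


if __name__ == "__main__":
    group = [0.31, -1.20, 0.07, 0.88]
    codes, scale = rtn_quantize(group, num_bits=4)
    restored = rtn_dequantize(codes, scale)
    for w, c, r in zip(group, codes, restored):
        print(f"{w:+.2f} -> code {c:+d} -> {r:+.4f}")
```

In this sketch each restored weight differs from the original by at most half a quantization step (`scale / 2`), which is the usual error bound for round-to-nearest.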

We ran benchmark_latency with several Llama models on a machine equipped with H100 GPUs. The exact command was:
[RTN_NUM_BITS=4] python benchmark_latency.py --model <model> --n 1 --num-iters-warmup 3 --num-iters 10 --input-len 256 --output-len 32 -tp <#GPUs> --batch-size <batch> [--quantization rtn]
The bracketed parts are used only for the RTN runs: --quantization rtn enables RTN, and RTN_NUM_BITS=4 selects the Int4 variant. Each data point is the average of 5 runs. All values are generation latency in seconds (lower is better).

Here are the results for Llama3.1-8B (run on 1 GPU) at batch sizes 1, 4, 8, and 16:

Variant (data type)	BS=1	BS=4	BS=8	BS=16
Baseline (BF16)	0.236	0.260	0.284	0.336
old RTN (Int8)	0.469	0.500	0.526	0.581
new RTN (Int8)	0.186	0.231	0.248	0.300
old RTN (Int4)	0.716	0.756	0.788	0.842
new RTN (Int4)	0.154	0.194	0.216	0.267

Here are the results for Llama3.3-70B (run on 4 GPUs) at batch sizes 1, 4, 8, and 16:

Variant (data type)	BS=1	BS=4	BS=8	BS=16
Baseline (BF16)	0.558	0.629	0.700	0.855
old RTN (Int8)	1.131	1.216	1.287	1.436
new RTN (Int8)	0.440	0.563	0.616	0.764
old RTN (Int4)	1.732	1.850	1.920	2.068
new RTN (Int4)	0.358	0.466	0.531	0.681
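To make the comparison easier to read, the following sketch computes speedup ratios from the latencies in the two tables above. The numbers are copied verbatim from those tables; the `RESULTS` layout and the `speedup` helper are hypothetical names chosen for illustration, not part of the PR or of benchmark_latency.

```python
# Latencies in seconds, copied from the two tables above (batch sizes 1, 4, 8, 16).
BATCH_SIZES = [1, 4, 8, 16]
RESULTS = {
    "Llama3.1-8B": {
        "Baseline (BF16)": [0.236, 0.260, 0.284, 0.336],
        "old RTN (Int8)": [0.469, 0.500, 0.526, 0.581],
        "new RTN (Int8)": [0.186, 0.231, 0.248, 0.300],
        "old RTN (Int4)": [0.716, 0.756, 0.788, 0.842],
        "new RTN (Int4)": [0.154, 0.194, 0.216, 0.267],
    },
    "Llama3.3-70B": {
        "Baseline (BF16)": [0.558, 0.629, 0.700, 0.855],
        "old RTN (Int8)": [1.131, 1.216, 1.287, 1.436],
        "new RTN (Int8)": [0.440, 0.563, 0.616, 0.764],
        "old RTN (Int4)": [1.732, 1.850, 1.920, 2.068],
        "new RTN (Int4)": [0.358, 0.466, 0.531, 0.681],
    },
}


def speedup(reference, candidate):
    """Per-batch-size ratio reference/candidate; values above 1 mean candidate is faster."""
    return [r / c for r, c in zip(reference, candidate)]


if __name__ == "__main__":
    for model, rows in RESULTS.items():
        for dtype in ("Int8", "Int4"):
            new = rows[f"new RTN ({dtype})"]
            vs_bf16 = speedup(rows["Baseline (BF16)"], new)
            vs_old = speedup(rows[f"old RTN ({dtype})"], new)
            for bs, b, o in zip(BATCH_SIZES, vs_bf16, vs_old):
                print(f"{model} {dtype} batch={bs}: {b:.2f}x vs BF16, {o:.2f}x vs old RTN")
```

Because latency is lower-is-better, the ratio divides the reference latency by the new kernel's latency, so a value above 1 means the new kernel is faster.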

github-actions · 2025-07-29T20:14:52Z

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small, essential subset of CI tests to catch errors quickly. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

gemini-code-assist

Code Review

This pull request introduces high-performance Marlin-based kernels for RTN quantization, significantly improving latency. The changes include a new CUDA kernel, modifications to build files, and updates to the Python-level quantization logic to use the new kernel. My review focuses on API correctness, code robustness, and maintainability within the new CUDA kernel. I've identified a few areas for improvement, such as using const references for read-only tensors, replacing device-side assert(false) with static_assert for better error reporting, and ensuring consistent logic for determining quantization bit-width.

csrc/ops.h

gemini-code-assist · 2025-07-29T20:16:21Z

csrc/quantization/rtn_marlin/rtn_marlin_cuda_kernel.cu

Using assert(false) in CUDA device code is not ideal. If triggered, it can lead to unhelpful error messages. It's better to use static_assert to provide a clear compile-time error if an unsupported architecture is used.

#else
static_assert(false, "Unsupported CUDA architecture for this Marlin kernel path.");
#endif

gemini-code-assist · 2025-07-29T20:16:22Z

csrc/quantization/rtn_marlin/rtn_marlin_cuda_kernel.cu

Using assert(false) in CUDA device code is not ideal. If triggered, it can lead to unhelpful error messages. It's better to use static_assert to provide a clear compile-time error if an unsupported architecture is used.

#else
static_assert(false, "Unsupported CUDA architecture for this Marlin kernel path.");
#endif

csrc/quantization/rtn_marlin/rtn_marlin_cuda_kernel.cu

mergify bot added the ci/build label Jul 29, 2025

gemini-code-assist bot reviewed Jul 29, 2025

sakogan closed this Jul 29, 2025

sakogan force-pushed the rtn-marlin-kernels branch from 8c97d26 to 0ae970e July 29, 2025 21:46

sakogan deleted the rtn-marlin-kernels branch August 18, 2025 13:12
