[Performance] Introduce Marlin-based GEMM kernels for the calibration-free RTN-based quantization #23197
Conversation
Signed-off-by: Alex Kogan <[email protected]>
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
Could you explain the need for adding Marlin kernels specifically for this case? It seems you could get the same results by providing dummy scales to the existing Marlin impl, is that right?
Yes, I think you are right. In theory, there are a number of Marlin kernels in the vLLM code base that could be used for RTN, e.g., https://github.com/vllm-project/vllm/blob/main/csrc/quantization/marlin/dense/marlin_cuda_kernel.cu. But I have not seen a version that could be used as-is; the one linked above, for example, seems to support only FP16. So I thought it would be better to create a separate version for RTN, which could also be tuned in the future without impacting other quantization schemes. I did try to reuse as many helper functions as possible, hence all the includes.
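For context, here is a minimal NumPy sketch (not the PR's CUDA implementation) of calibration-free RTN quantization, along with the "dummy scales" equivalence raised above: a scale-aware GEMM kernel fed the per-row RTN scales produces the same output as dequantizing the weights first. All names below are illustrative.

```python
import numpy as np

def rtn_quantize(w, num_bits=4):
    """Calibration-free round-to-nearest (RTN) quantization sketch:
    one scale per output row, weights rounded onto a signed integer
    grid. No calibration data is involved."""
    qmax = 2 ** (num_bits - 1) - 1                       # 7 for 4-bit signed
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax  # per-row scale
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q.astype(np.int8), scale.astype(np.float32)

rng = np.random.default_rng(0)
w = rng.standard_normal((8, 64)).astype(np.float32)      # weight matrix
x = rng.standard_normal((64, 4)).astype(np.float32)      # activations

q, scale = rtn_quantize(w)

# The "dummy scales" idea: a scale-aware kernel computes (q * scale) @ x.
# Since (diag(s) Q) x == diag(s) (Q x), applying the RTN scales inside the
# GEMM matches dequantizing the weights up front.
y_dequant_first = (q.astype(np.float32) * scale) @ x
y_scale_inside  = scale * (q.astype(np.float32) @ x)
```

This is why, in principle, an existing Marlin kernel could serve RTN by treating the RTN per-row scales as its group scales; the PR opts for a dedicated version mainly for dtype coverage and independent tunability.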
This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: Alex Kogan <[email protected]>
This PR enhances the work started in #18768 and #20766 by introducing Marlin-based kernels for the calibration-free RTN-based quantization.
These kernels substantially improve the performance of dense models quantized with RTN.
We ran `benchmark_latency` with several Llama models on a machine equipped with H100 GPUs. The exact command was:

```
[RTN_NUM_BITS=4] python benchmark_latency.py --model <model> --n 1 --num-iters-warmup 3 --num-iters 10 --input-len 256 --output-len 32 -tp <#GPUs> --batch-size <batch> [--quantization rtn]
```
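As one concrete illustration of the command above, a single-GPU run might look as follows. The model name and batch size here are placeholders chosen for the example, not values taken from the PR:

```shell
# Hypothetical invocation: 4-bit RTN-quantized 8B model on 1 GPU, batch size 8.
RTN_NUM_BITS=4 python benchmark_latency.py \
  --model meta-llama/Llama-3.1-8B \
  --n 1 --num-iters-warmup 3 --num-iters 10 \
  --input-len 256 --output-len 32 \
  -tp 1 --batch-size 8 \
  --quantization rtn
```

Omitting `--quantization rtn` gives the unquantized baseline measured against the same settings.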
Each data point is an average of 5 runs; the units are seconds of generation latency (lower is better).
Here are the results for Llama3.1-8B (ran on 1 GPU), for various batch sizes:
Here are the results for Llama3.3-70B (ran on 4 GPUs), for various batch sizes: