Conversation

@sakogan (Contributor) commented Oct 1, 2025

This PR enhances the work started in #18768 and #20766 by enabling Marlin kernels for calibration-free, RTN-based quantization.

These kernels substantially improve the performance of dense/MoE models quantized with RTN.
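
For context, RTN ("round-to-nearest") quantizes weights directly from their stored values, with no calibration data or activation statistics. Below is a minimal PyTorch sketch of the idea; the group size, symmetric scaling, and function names are illustrative assumptions, not vLLM's actual implementation:

```python
# Illustrative sketch of round-to-nearest (RTN) weight quantization.
# NOT vLLM's implementation: group size, symmetric scaling, and all
# names here are assumptions chosen for clarity.
import torch

def rtn_quantize(w: torch.Tensor, num_bits: int = 4, group_size: int = 128):
    """Quantize a 2-D weight matrix per group of input channels."""
    out_features, in_features = w.shape
    wg = w.reshape(out_features, in_features // group_size, group_size)
    qmax = 2 ** (num_bits - 1) - 1
    # Symmetric scale: map each group's max |value| onto qmax.
    scales = wg.abs().amax(dim=-1, keepdim=True) / qmax
    # Round to nearest -- no calibration pass, no search.
    q = torch.clamp(torch.round(wg / scales), -qmax - 1, qmax).to(torch.int8)
    return q, scales

def rtn_dequantize(q: torch.Tensor, scales: torch.Tensor) -> torch.Tensor:
    # Marlin-style kernels fuse this step into the GEMM instead of
    # materializing the full-precision weights in memory.
    return (q.float() * scales).reshape(q.shape[0], -1)

w = torch.randn(16, 256)
q, s = rtn_quantize(w)
print((w - rtn_dequantize(q, s)).abs().max())  # max quantization error
```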

We ran the built-in latency benchmark with several Llama models on a machine equipped with H100 GPUs. The exact command was:

```
[RTN_NUM_BITS=4] vllm bench latency --model <model> --n 1 --num-iters-warmup 3 --num-iters 10 --input-len 256 --output-len 32 -tp <#GPUs> --batch-size <batch> -q rtn
```

Each data point is the average of 5 runs; units are seconds of generation latency (lower is better).

Here are the results for Llama3.1-8B (run on 1 GPU) at various batch sizes ("old"/"new" refer to the pre-PR/post-PR implementations):

| Variant (data type) | batch 1 | batch 4 | batch 8 | batch 16 |
|---------------------|---------|---------|---------|----------|
| old RTN (Int8)      | 0.458   | 0.490   | 0.516   | 0.570    |
| new RTN (Int8)      | 0.174   | 0.222   | 0.285   | 0.402    |
| old RTN (Int4)      | 0.720   | 0.761   | 0.794   | 0.847    |
| new RTN (Int4)      | 0.139   | 0.180   | 0.230   | 0.331    |

Here are the results for Llama3.3-70B (run on 4 GPUs) at various batch sizes:

| Variant (data type) | batch 1 | batch 4 | batch 8 | batch 16 |
|---------------------|---------|---------|---------|----------|
| old RTN (Int8)      | 1.104   | 1.190   | 1.262   | 1.411    |
| new RTN (Int8)      | 0.416   | 0.545   | 0.707   | 1.025    |
| old RTN (Int4)      | 1.736   | 1.859   | 1.933   | 2.082    |
| new RTN (Int4)      | 0.328   | 0.442   | 0.573   | 0.846    |
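
For anyone who wants to try the RTN path end to end from Python rather than via `vllm bench`, here is a sketch of the equivalent offline setup. The model id and prompt are placeholders, not the exact benchmark configuration; `RTN_NUM_BITS` is the same knob used in the command above:

```python
import os

# Same knob as in the benchmark command above; presumably it must be
# set before vLLM loads and quantizes the model.
os.environ["RTN_NUM_BITS"] = "4"

from vllm import LLM, SamplingParams

# Placeholder model id -- RTN quantization happens at load time, with
# no calibration pass, so any supported dense/MoE checkpoint should work.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", quantization="rtn")

outputs = llm.generate(
    ["The capital of France is"],
    SamplingParams(max_tokens=32),
)
print(outputs[0].outputs[0].text)
```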

@sakogan sakogan marked this pull request as ready for review October 1, 2025 21:55
mergify bot commented Oct 7, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @sakogan.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Oct 7, 2025
@mgoin (Member) left a comment

Looks reasonable to me, thanks for using the existing kernel! Please fix the merge conflict and I will enable the full CI @sakogan

@mergify mergify bot removed the needs-rebase label Oct 8, 2025
@sakogan (Contributor, Author) commented Oct 8, 2025

> Looks reasonable to me, thanks for using the existing kernel! Please fix the merge conflict and I will enable the full CI @sakogan

@mgoin Done (and thanks for the review!)

@mgoin mgoin added the quantization and ready labels Oct 13, 2025
@mgoin mgoin enabled auto-merge (squash) October 13, 2025 16:44
@mgoin mgoin merged commit 89342ce into vllm-project:main Oct 13, 2025
55 checks passed
1994 pushed a commit to 1994/vllm that referenced this pull request Oct 14, 2025
…ration-free RTN-based quantization (vllm-project#26051)

Signed-off-by: Alex Kogan <[email protected]>
Signed-off-by: Alex Kogan <[email protected]>
Signed-off-by: 1994 <[email protected]>
Dhruvilbhatt pushed a commit to Dhruvilbhatt/vllm that referenced this pull request Oct 14, 2025
…ration-free RTN-based quantization (vllm-project#26051)

Signed-off-by: Alex Kogan <[email protected]>
Signed-off-by: Alex Kogan <[email protected]>
Signed-off-by: Dhruvil Bhatt <[email protected]>
bbartels pushed a commit to bbartels/vllm that referenced this pull request Oct 16, 2025
…ration-free RTN-based quantization (vllm-project#26051)

Signed-off-by: Alex Kogan <[email protected]>
Signed-off-by: Alex Kogan <[email protected]>
Signed-off-by: bbartels <[email protected]>
lywa1998 pushed a commit to lywa1998/vllm that referenced this pull request Oct 20, 2025
…ration-free RTN-based quantization (vllm-project#26051)

Signed-off-by: Alex Kogan <[email protected]>
Signed-off-by: Alex Kogan <[email protected]>
alhridoy pushed a commit to alhridoy/vllm that referenced this pull request Oct 24, 2025
…ration-free RTN-based quantization (vllm-project#26051)

Signed-off-by: Alex Kogan <[email protected]>
Signed-off-by: Alex Kogan <[email protected]>
xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Oct 24, 2025
…ration-free RTN-based quantization (vllm-project#26051)

Signed-off-by: Alex Kogan <[email protected]>
Signed-off-by: Alex Kogan <[email protected]>
Signed-off-by: xuebwang-amd <[email protected]>
0xrushi pushed a commit to 0xrushi/vllm that referenced this pull request Oct 26, 2025
…ration-free RTN-based quantization (vllm-project#26051)

Signed-off-by: Alex Kogan <[email protected]>
Signed-off-by: Alex Kogan <[email protected]>
Signed-off-by: 0xrushi <[email protected]>