Conversation

@sakogan (Contributor) commented Oct 1, 2025

This PR enhances the work started in #18768 and #20766 by enabling Marlin kernels for calibration-free, RTN-based quantization.

These kernels substantially improve the performance of dense/MoE models quantized with RTN.
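
For context, RTN ("round-to-nearest") quantizes weights directly from their stored values, with no calibration data or activation statistics. Below is a minimal PyTorch sketch of the idea; the group size, symmetric scaling, and function names are illustrative assumptions, not vLLM's actual implementation:

```python
# Illustrative sketch of round-to-nearest (RTN) weight quantization.
# NOT vLLM's implementation: group size, symmetric scaling, and all
# names here are assumptions chosen for clarity.
import torch

def rtn_quantize(w: torch.Tensor, num_bits: int = 4, group_size: int = 128):
    """Quantize a 2-D weight matrix per group of input channels."""
    out_features, in_features = w.shape
    wg = w.reshape(out_features, in_features // group_size, group_size)
    qmax = 2 ** (num_bits - 1) - 1
    # Symmetric scale: map each group's max |value| onto qmax.
    scales = wg.abs().amax(dim=-1, keepdim=True) / qmax
    # Round to nearest -- no calibration pass, no search.
    q = torch.clamp(torch.round(wg / scales), -qmax - 1, qmax).to(torch.int8)
    return q, scales

def rtn_dequantize(q: torch.Tensor, scales: torch.Tensor) -> torch.Tensor:
    # Marlin-style kernels fuse this step into the GEMM instead of
    # materializing the full-precision weights in memory.
    return (q.float() * scales).reshape(q.shape[0], -1)

w = torch.randn(16, 256)
q, s = rtn_quantize(w)
print((w - rtn_dequantize(q, s)).abs().max())  # max quantization error
```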

We ran the built-in latency benchmark with several Llama models on a machine equipped with H100 GPUs. The exact command was:

```
[RTN_NUM_BITS=4] vllm bench latency --model <model> --n 1 --num-iters-warmup 3 --num-iters 10 --input-len 256 --output-len 32 -tp <#GPUs> --batch-size <batch> -q rtn
```

Each data point is the average of 5 runs; units are seconds of generation latency (lower is better).

Here are the results for Llama3.1-8B (run on 1 GPU) at various batch sizes ("old"/"new" refer to the pre-PR/post-PR implementations):

| Variant (data type) | batch 1 | batch 4 | batch 8 | batch 16 |
|---------------------|---------|---------|---------|----------|
| old RTN (Int8)      | 0.458   | 0.490   | 0.516   | 0.570    |
| new RTN (Int8)      | 0.174   | 0.222   | 0.285   | 0.402    |
| old RTN (Int4)      | 0.720   | 0.761   | 0.794   | 0.847    |
| new RTN (Int4)      | 0.139   | 0.180   | 0.230   | 0.331    |

Here are the results for Llama3.3-70B (run on 4 GPUs) at various batch sizes:

| Variant (data type) | batch 1 | batch 4 | batch 8 | batch 16 |
|---------------------|---------|---------|---------|----------|
| old RTN (Int8)      | 1.104   | 1.190   | 1.262   | 1.411    |
| new RTN (Int8)      | 0.416   | 0.545   | 0.707   | 1.025    |
| old RTN (Int4)      | 1.736   | 1.859   | 1.933   | 2.082    |
| new RTN (Int4)      | 0.328   | 0.442   | 0.573   | 0.846    |
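
For anyone who wants to try the RTN path end to end from Python rather than via `vllm bench`, here is a sketch of the equivalent offline setup. The model id and prompt are placeholders, not the exact benchmark configuration; `RTN_NUM_BITS` is the same knob used in the command above:

```python
import os

# Same knob as in the benchmark command above; presumably it must be
# set before vLLM loads and quantizes the model.
os.environ["RTN_NUM_BITS"] = "4"

from vllm import LLM, SamplingParams

# Placeholder model id -- RTN quantization happens at load time, with
# no calibration pass, so any supported dense/MoE checkpoint should work.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", quantization="rtn")

outputs = llm.generate(
    ["The capital of France is"],
    SamplingParams(max_tokens=32),
)
print(outputs[0].outputs[0].text)
```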

@sakogan sakogan marked this pull request as ready for review October 1, 2025 21:55
mergify bot commented Oct 7, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @sakogan.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Oct 7, 2025
@mgoin (Member) left a comment

Looks reasonable to me, thanks for using the existing kernel! Please fix the merge conflict and I will enable the full CI @sakogan

@mergify mergify bot removed the needs-rebase label Oct 8, 2025
@sakogan (Contributor, Author) commented Oct 8, 2025

> Looks reasonable to me, thanks for using the existing kernel! Please fix the merge conflict and I will enable the full CI @sakogan

@mgoin Done (and thanks for the review!)

@mgoin mgoin added the quantization and ready labels Oct 13, 2025
@mgoin mgoin enabled auto-merge (squash) October 13, 2025 16:44
@mgoin mgoin merged commit 89342ce into vllm-project:main Oct 13, 2025
55 checks passed
1994 pushed a commit to 1994/vllm that referenced this pull request Oct 14, 2025
…ration-free RTN-based quantization (vllm-project#26051)

Signed-off-by: Alex Kogan <[email protected]>
Signed-off-by: Alex Kogan <[email protected]>
Signed-off-by: 1994 <[email protected]>
Dhruvilbhatt pushed a commit to Dhruvilbhatt/vllm that referenced this pull request Oct 14, 2025
…ration-free RTN-based quantization (vllm-project#26051)

Signed-off-by: Alex Kogan <[email protected]>
Signed-off-by: Alex Kogan <[email protected]>
Signed-off-by: Dhruvil Bhatt <[email protected]>
bbartels pushed a commit to bbartels/vllm that referenced this pull request Oct 16, 2025
…ration-free RTN-based quantization (vllm-project#26051)

Signed-off-by: Alex Kogan <[email protected]>
Signed-off-by: Alex Kogan <[email protected]>
Signed-off-by: bbartels <[email protected]>
lywa1998 pushed a commit to lywa1998/vllm that referenced this pull request Oct 20, 2025
…ration-free RTN-based quantization (vllm-project#26051)

Signed-off-by: Alex Kogan <[email protected]>
Signed-off-by: Alex Kogan <[email protected]>
alhridoy pushed a commit to alhridoy/vllm that referenced this pull request Oct 24, 2025
…ration-free RTN-based quantization (vllm-project#26051)

Signed-off-by: Alex Kogan <[email protected]>
Signed-off-by: Alex Kogan <[email protected]>
xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Oct 24, 2025
…ration-free RTN-based quantization (vllm-project#26051)

Signed-off-by: Alex Kogan <[email protected]>
Signed-off-by: Alex Kogan <[email protected]>
Signed-off-by: xuebwang-amd <[email protected]>
0xrushi pushed a commit to 0xrushi/vllm that referenced this pull request Oct 26, 2025
…ration-free RTN-based quantization (vllm-project#26051)

Signed-off-by: Alex Kogan <[email protected]>
Signed-off-by: Alex Kogan <[email protected]>
Signed-off-by: 0xrushi <[email protected]>