
Conversation


@sakogan sakogan commented Aug 19, 2025

This PR enhances the work started in #18768 and #20766 by introducing Marlin-based kernels for the calibration-free RTN-based quantization.

These kernels substantially improve the performance of dense models quantized with RTN.

We ran `benchmark_latency.py` with several Llama models on a machine equipped with H100 GPUs. The exact command was
`[RTN_NUM_BITS=4] python benchmark_latency.py --model <model> --n 1 --num-iters-warmup 3 --num-iters 10 --input-len 256 --output-len 32 -tp <#GPUs> --batch-size <batch> [--quantization rtn]`
Each data point is the average of 5 runs; units are seconds of generation latency (lower is better).

Here are the results for Llama3.1-8B (run on 1 GPU), for various batch sizes:

| Variant (data type) | 1 | 4 | 8 | 16 |
|---------------------|-------|-------|-------|-------|
| Baseline (BF16) | 0.236 | 0.260 | 0.284 | 0.336 |
| old RTN (Int8) | 0.469 | 0.500 | 0.526 | 0.581 |
| new RTN (Int8) | 0.186 | 0.231 | 0.248 | 0.300 |
| old RTN (Int4) | 0.716 | 0.756 | 0.788 | 0.842 |
| new RTN (Int4) | 0.154 | 0.194 | 0.216 | 0.267 |
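As a quick sanity check (not part of the PR itself), the Int4 rows above work out to roughly a 1.25–1.5x speedup over BF16 and a 3–4.6x speedup over the old RTN path. A small script with the values copied from the table:

```python
# Speedups implied by the Llama3.1-8B table above (batch sizes 1, 4, 8, 16).
baseline_bf16 = [0.236, 0.260, 0.284, 0.336]
old_rtn_int4 = [0.716, 0.756, 0.788, 0.842]
new_rtn_int4 = [0.154, 0.194, 0.216, 0.267]

# Latency ratios: how much faster the new Int4 kernels are per batch size.
speedup_vs_bf16 = [b / n for b, n in zip(baseline_bf16, new_rtn_int4)]
speedup_vs_old = [o / n for o, n in zip(old_rtn_int4, new_rtn_int4)]
```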

Here are the results for Llama3.3-70B (run on 4 GPUs), for various batch sizes:

| Variant (data type) | 1 | 4 | 8 | 16 |
|---------------------|-------|-------|-------|-------|
| Baseline (BF16) | 0.558 | 0.629 | 0.700 | 0.855 |
| old RTN (Int8) | 1.131 | 1.216 | 1.287 | 1.436 |
| new RTN (Int8) | 0.440 | 0.563 | 0.616 | 0.764 |
| old RTN (Int4) | 1.732 | 1.850 | 1.920 | 2.068 |
| new RTN (Int4) | 0.358 | 0.466 | 0.531 | 0.681 |
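One hedged observation on the 70B numbers (not claimed in the PR): the advantage of the new Int4 kernels narrows as the batch grows, which is the pattern one would expect from weight-only quantization as the GEMMs shift from memory-bound toward compute-bound. A quick check with values copied from the table:

```python
# Speedup of new RTN Int4 over BF16 for Llama3.3-70B at batch sizes 1..16.
batch_sizes = [1, 4, 8, 16]
bf16 = [0.558, 0.629, 0.700, 0.855]
int4 = [0.358, 0.466, 0.531, 0.681]

speedups = [b / q for b, q in zip(bf16, int4)]
# The speedup shrinks monotonically as the batch grows, consistent with
# weight-only quantization helping most in the memory-bound regime.
```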


👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small, essential subset of tests to catch errors quickly. You can run additional CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@mergify mergify bot added the ci/build label Aug 19, 2025
@sakogan sakogan marked this pull request as ready for review August 19, 2025 19:36

mgoin commented Aug 19, 2025

Could you explain the need for adding Marlin kernels specifically for this case? It seems you could get the same results by providing dummy scales to the existing Marlin impl, is that right?

@sakogan sakogan changed the title introduce Marlin-based GEMM kernels for RTN [Performance] Introduce Marlin-based GEMM kernels for the calibration-free RTN-based quantization Aug 19, 2025
@sakogan sakogan closed this Aug 19, 2025
@sakogan sakogan reopened this Aug 19, 2025

sakogan commented Aug 19, 2025

> Could you explain the need for adding Marlin kernels specifically for this case? It seems you could get the same results by providing dummy scales to the existing Marlin impl, is that right?

Yes, I think you are right. In theory, there are several Marlin kernels in the vLLM code base that could be used for RTN, e.g., https://github.com/vllm-project/vllm/blob/main/csrc/quantization/marlin/dense/marlin_cuda_kernel.cu. However, I have not seen a version that could be used as-is; the one linked above, for example, appears to support only FP16. So I thought it would be better to create a separate version for RTN, one that could also be tuned in the future without impacting other quantization schemes. I did try to reuse as many helper functions as possible, hence all the includes of the gptq_marlin headers.
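For readers unfamiliar with the scheme: calibration-free RTN simply rounds each weight to the nearest grid point under a scale derived from the weights themselves, with no calibration data. A minimal pure-Python sketch (illustrative only; the function name and symmetric int4 layout are assumptions, not vLLM's actual implementation):

```python
def rtn_quantize(weights, num_bits=4):
    # Round-to-nearest, calibration-free: one scale per weight group,
    # derived from the group's max absolute value (no calibration data).
    qmax = 2 ** (num_bits - 1) - 1          # 7 for int4
    scale = max(abs(w) for w in weights) / qmax
    q = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    return q, scale

group = [0.31, -0.88, 0.05, 1.20, -0.47]    # one small group of weights
q, scale = rtn_quantize(group)
dequant = [v * scale for v in q]            # reconstruction, error <= scale/2
```

The per-group `scale` produced this way is exactly the kind of value a Marlin-style GEMM consumes at dequantization time.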


mergify bot commented Aug 26, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @sakogan.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Aug 26, 2025
@mergify mergify bot removed the needs-rebase label Aug 26, 2025