Conversation

@sakogan (Contributor) commented Jul 29, 2025

This PR builds on the work started in #18768 and #20766 by introducing Marlin-based kernels for calibration-free RTN quantization.

These kernels substantially improve the performance of dense models quantized with RTN.
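For context, calibration-free RTN simply scales each group of weights by its max magnitude and rounds to the nearest integer level, with no calibration data involved. The following is a minimal standalone C++ sketch of symmetric per-group RTN (illustrative only; the PR's Marlin kernels operate on packed GPU layouts, and none of these names come from the PR):

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

// Minimal sketch of calibration-free RTN (round-to-nearest) quantization:
// each group of weights shares one scale derived from its max magnitude,
// and every weight is rounded to the nearest integer level.
struct RtnGroup {
  std::vector<int8_t> q;  // quantized values in [-2^(bits-1), 2^(bits-1)-1]
  float scale;            // dequantize as w ≈ q * scale
};

RtnGroup rtn_quantize(const std::vector<float>& w, int bits) {
  const float qmax = static_cast<float>((1 << (bits - 1)) - 1);  // 7 for int4
  float absmax = 0.0f;
  for (float x : w) absmax = std::max(absmax, std::fabs(x));
  const float scale = absmax > 0.0f ? absmax / qmax : 1.0f;
  RtnGroup out{{}, scale};
  out.q.reserve(w.size());
  for (float x : w) {
    float q = std::nearbyint(x / scale);    // round to nearest
    q = std::clamp(q, -qmax - 1.0f, qmax);  // clip to the integer range
    out.q.push_back(static_cast<int8_t>(q));
  }
  return out;
}
```

The weight with the largest magnitude always maps to the extreme integer level, so dequantizing it recovers the original value exactly; everything else absorbs a rounding error of at most half a step.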

We ran `benchmark_latency` with several Llama models on a machine equipped with H100 GPUs. The exact command was:

```
[RTN_NUM_BITS=4] python benchmark_latency.py --model <model> --n 1 --num-iters-warmup 3 --num-iters 10 --input-len 256 --output-len 32 -tp <#GPUs> --batch-size <batch> [--quantization rtn]
```

Each data point is the average of 5 runs; units are seconds of generation latency (lower is better).

Here are the results for Llama3.1-8B (run on 1 GPU) for various batch sizes:

| Variant (data type) | batch 1 | batch 4 | batch 8 | batch 16 |
|---|---|---|---|---|
| Baseline (BF16) | 0.236 | 0.260 | 0.284 | 0.336 |
| old RTN (Int8) | 0.469 | 0.500 | 0.526 | 0.581 |
| new RTN (Int8) | 0.186 | 0.231 | 0.248 | 0.300 |
| old RTN (Int4) | 0.716 | 0.756 | 0.788 | 0.842 |
| new RTN (Int4) | 0.154 | 0.194 | 0.216 | 0.267 |

Here are the results for Llama3.3-70B (run on 4 GPUs) for various batch sizes:

| Variant (data type) | batch 1 | batch 4 | batch 8 | batch 16 |
|---|---|---|---|---|
| Baseline (BF16) | 0.558 | 0.629 | 0.700 | 0.855 |
| old RTN (Int8) | 1.131 | 1.216 | 1.287 | 1.436 |
| new RTN (Int8) | 0.440 | 0.563 | 0.616 | 0.764 |
| old RTN (Int4) | 1.732 | 1.850 | 1.920 | 2.068 |
| new RTN (Int4) | 0.358 | 0.466 | 0.531 | 0.681 |


👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small, essential subset of tests to catch errors quickly. You can run additional CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run full CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@mergify mergify bot added the ci/build label Jul 29, 2025
@gemini-code-assist bot (Contributor) left a comment


Code Review

This pull request introduces high-performance Marlin-based kernels for RTN quantization, significantly improving latency. The changes include a new CUDA kernel, modifications to build files, and updates to the Python-level quantization logic to use the new kernel. My review focuses on API correctness, code robustness, and maintainability within the new CUDA kernel. I've identified a few areas for improvement, such as using const references for read-only tensors, replacing device-side assert(false) with static_assert for better error reporting, and ensuring consistent logic for determining quantization bit-width.

Comment on lines 369 to 371:

```cpp
#else
  assert(false);
#endif
```

Severity: high

Using assert(false) in CUDA device code is not ideal. If triggered, it can lead to unhelpful error messages. It's better to use static_assert to provide a clear compile-time error if an unsupported architecture is used.

```cpp
#else
  static_assert(false, "Unsupported CUDA architecture for this Marlin kernel path.");
#endif
```
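One caveat worth noting: in a preprocessor `#else` branch like this one, `static_assert(false, …)` only fires when that branch is actually compiled for an unsupported architecture, which is the intent. Inside an *uninstantiated template branch*, however, compilers before C++23 reject a bare `static_assert(false)` outright, and the common workaround is a dependent-false condition. A small sketch of that idiom (illustrative only; these names are not from the PR):

```cpp
#include <cassert>
#include <type_traits>

// Dependent-false idiom: tie the static_assert condition to a template
// parameter so it is only evaluated for branches that actually get
// instantiated. Pre-C++23, a bare static_assert(false, ...) in a template
// is ill-formed even if the branch is never taken.
template <typename>
inline constexpr bool always_false_v = false;

template <typename scalar_t>
int pick_tile_size() {
  if constexpr (std::is_same_v<scalar_t, float>) {
    return 32;  // hypothetical supported path for float
  } else {
    static_assert(always_false_v<scalar_t>,
                  "Unsupported type for this kernel path.");
    return -1;  // unreachable; silences missing-return warnings
  }
}
```

Instantiating `pick_tile_size<double>()` would then produce the clear compile-time message, while `pick_tile_size<float>()` compiles normally.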

Comment on lines 383 to 385:

```cpp
#else
  assert(false);
#endif
```

Severity: high

Using assert(false) in CUDA device code is not ideal. If triggered, it can lead to unhelpful error messages. It's better to use static_assert to provide a clear compile-time error if an unsupported architecture is used.

```cpp
#else
  static_assert(false, "Unsupported CUDA architecture for this Marlin kernel path.");
#endif
```

@sakogan sakogan closed this Jul 29, 2025
@sakogan sakogan force-pushed the rtn-marlin-kernels branch from 8c97d26 to 0ae970e on July 29, 2025 21:46
@sakogan sakogan deleted the rtn-marlin-kernels branch August 18, 2025 13:12