Conversation

@sakogan (Contributor) commented on Jul 29, 2025

This PR enhances the work started in #18768 and #20766 by introducing Marlin-based kernels for calibration-free RTN (round-to-nearest) quantization.

These kernels substantially improve the performance of dense models quantized with RTN: in the measurements below, they are up to ~4.8x faster than the previous RTN kernels and up to ~1.5x faster than the unquantized BF16 baseline.
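For context, a minimal sketch of selecting RTN quantization through vLLM's offline `LLM` API; the model name is illustrative, and the `RTN_NUM_BITS` semantics are inferred from the benchmark command below (set to 4 for Int4 weights, unset corresponds to Int8):

    import os

    # Select 4-bit RTN weights, as in the benchmark command below;
    # leaving the variable unset corresponds to the Int8 variant.
    os.environ["RTN_NUM_BITS"] = "4"

    from vllm import LLM, SamplingParams

    # Model name is illustrative; any dense model supported by RTN works.
    llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", quantization="rtn")
    outputs = llm.generate(["Hello!"], SamplingParams(max_tokens=32))
    print(outputs[0].outputs[0].text)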

To measure the effect of the new kernels, we ran `benchmark_latency.py` with several Llama models on a machine equipped with H100 GPUs. The exact command was:

    [RTN_NUM_BITS=4] python benchmark_latency.py --model <model> --n 1 --num-iters-warmup 3 --num-iters 10 --input-len 256 --output-len 32 -tp <#GPUs> --batch-size <batch> [--quantization rtn]

Each data point is the average of 5 runs; units are seconds of generation latency (lower is better).

Here are the results for Llama3.1-8B (run on 1 GPU) for various batch sizes:

| Variant (data type) | batch 1 | batch 4 | batch 8 | batch 16 |
|---|---:|---:|---:|---:|
| Baseline (BF16) | 0.236 | 0.260 | 0.284 | 0.336 |
| old RTN (Int8) | 0.469 | 0.500 | 0.526 | 0.581 |
| new RTN (Int8) | 0.186 | 0.231 | 0.248 | 0.300 |
| old RTN (Int4) | 0.716 | 0.756 | 0.788 | 0.842 |
| new RTN (Int4) | 0.154 | 0.194 | 0.216 | 0.267 |
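The implied speedups can be derived directly from the table; a short Python check using the Int4 rows above (at batch size 1 this prints roughly 4.65x over old RTN and 1.53x over BF16):

    batch_sizes = [1, 4, 8, 16]
    baseline_bf16 = [0.236, 0.260, 0.284, 0.336]
    old_rtn_int4 = [0.716, 0.756, 0.788, 0.842]
    new_rtn_int4 = [0.154, 0.194, 0.216, 0.267]

    for b, base, old, new in zip(batch_sizes, baseline_bf16, old_rtn_int4, new_rtn_int4):
        # Latency ratios: old Int4 kernel vs. new, and BF16 baseline vs. new.
        print(f"batch {b}: {old / new:.2f}x over old RTN, {base / new:.2f}x over BF16")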

Here are the results for Llama3.3-70B (run on 4 GPUs) for various batch sizes:

| Variant (data type) | batch 1 | batch 4 | batch 8 | batch 16 |
|---|---:|---:|---:|---:|
| Baseline (BF16) | 0.558 | 0.629 | 0.700 | 0.855 |
| old RTN (Int8) | 1.131 | 1.216 | 1.287 | 1.436 |
| new RTN (Int8) | 0.440 | 0.563 | 0.616 | 0.764 |
| old RTN (Int4) | 1.732 | 1.850 | 1.920 | 2.068 |
| new RTN (Int4) | 0.358 | 0.466 | 0.531 | 0.681 |

@mergify bot added the ci/build label on Jul 29, 2025
@github-actions (bot) commented:

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, they only run the fastcheck CI, which covers a small, essential subset of tests to catch errors quickly. You can run additional CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run full CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@gemini-code-assist bot (Contributor) left a comment:

Code Review

This pull request introduces significant performance improvements for RTN quantization by integrating Marlin-based GEMM kernels. The changes are well-structured, touching the build system, C++ ops, Python bindings, and the quantization logic. The performance gains demonstrated in the PR description are impressive.

My review focuses on ensuring the robustness and correctness of the new implementation, particularly in multi-GPU scenarios and with respect to build configurations. I've identified one critical issue related to workspace management and two high-severity issues concerning compile-time checks in the CUDA kernels.

Once these points are addressed, this will be a fantastic addition to the project.

Comment on lines 225 to 226

Severity: critical

The workspace is shared as a class attribute across all instances of RTNLinearMethod. The current logic for re-allocating the workspace checks for size but not for the device.

In a multi-GPU setup (e.g., when loading multiple models on different devices in the same process), this can lead to a critical runtime error. If a layer on cuda:1 is processed after a layer on cuda:0, and the cuda:1 layer requires a smaller workspace, the if condition will be false. This will result in using the workspace allocated on cuda:0 for a GEMM operation on cuda:1, causing a cross-device error.

To prevent this, you should also check if the workspace is on the correct device before using it.

        if RTNLinearMethod.workspace is None or \
            RTNLinearMethod.workspace.device != device or \
            RTNLinearMethod.workspace.numel() < layer.output_size_per_partition:
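For illustration, a minimal Python sketch of the guarded (re)allocation this suggestion implies; the `_ensure_workspace` helper and the int32 dtype are assumptions for the example, not the PR's actual code:

    from typing import Optional

    import torch


    class RTNLinearMethod:
        # Workspace shared as a class attribute across all instances.
        workspace: Optional[torch.Tensor] = None

        @staticmethod
        def _ensure_workspace(device: torch.device, min_numel: int) -> torch.Tensor:
            ws = RTNLinearMethod.workspace
            # Reallocate when the workspace is missing, lives on a different
            # device, or is too small for the current layer.
            if ws is None or ws.device != device or ws.numel() < min_numel:
                ws = torch.zeros(min_numel, dtype=torch.int32, device=device)
                RTNLinearMethod.workspace = ws
            return ws

Checking the device alongside the size ensures a layer on cuda:1 never reuses a buffer allocated on cuda:0.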

Comment on lines 332 to 334

Severity: high

Using assert(0) for unsupported architectures leads to runtime errors that are difficult to debug, and the assert is compiled out entirely in release builds (when NDEBUG is defined). It's better to enforce this constraint at compile time.

A common technique to cause a compile-time error for unsupported preprocessor branches is to reference an incomplete type. This ensures that any attempt to compile this code for an unsupported architecture will fail with a clear message.

#else
                // Cause a compile-time error for unsupported architectures.
                (void)sizeof(struct cuda_arch_not_supported);
#endif

Comment on lines 347 to 349

Severity: high

Similar to the previous comment, using assert(0) here is not ideal. A compile-time check is more robust and provides clearer feedback to developers if they attempt to build for an unsupported architecture.

#else
                // Cause a compile-time error for unsupported architectures.
                (void)sizeof(struct cuda_arch_not_supported);
#endif

@sakogan closed this on Jul 29, 2025
@sakogan force-pushed the rtn-marlin-kernels branch from 3b0400f to 0ae970e on Jul 29, 2025 at 20:04
@sakogan deleted the rtn-marlin-kernels branch on Aug 18, 2025