[Performance] Introduce Marlin-based GEMM kernels for the calibration-free RTN-based quantization #21836
Conversation
Code Review
This pull request introduces significant performance improvements for RTN quantization by integrating Marlin-based GEMM kernels. The changes are well-structured, touching the build system, C++ ops, Python bindings, and the quantization logic. The performance gains demonstrated in the PR description are impressive.
My review focuses on ensuring the robustness and correctness of the new implementation, particularly in multi-GPU scenarios and with respect to build configurations. I've identified one critical issue related to workspace management and a high-severity issue concerning compile-time checks in the CUDA kernel.
Once these points are addressed, this will be a fantastic addition to the project.
The workspace is shared as a class attribute across all instances of RTNLinearMethod, and the current re-allocation logic checks the workspace size but not its device.
In a multi-GPU setup (e.g., when loading multiple models on different devices in the same process), this can lead to a critical runtime error. If a layer on cuda:1 is processed after a layer on cuda:0 and the cuda:1 layer requires a smaller workspace, the if condition will be false, so the GEMM on cuda:1 will use the workspace allocated on cuda:0 and fail with a cross-device error.
To prevent this, also check that the workspace is on the correct device before using it:
if RTNLinearMethod.workspace is None or \
RTNLinearMethod.workspace.device != device or \
RTNLinearMethod.workspace.numel() < layer.output_size_per_partition:
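For context, here is a minimal sketch of what the guarded (re)allocation could look like with the device check in place. The method signature, workspace size, and dtype below are assumptions for illustration, not taken from the PR:

```python
from typing import Optional

import torch


class RTNLinearMethod:
    # Workspace shared by all instances of this class (as in the PR).
    workspace: Optional[torch.Tensor] = None

    def apply(self, layer, x):
        # Assumption: the layer's weights already live on the target device.
        device = layer.weight.device
        if RTNLinearMethod.workspace is None or \
                RTNLinearMethod.workspace.device != device or \
                RTNLinearMethod.workspace.numel() < layer.output_size_per_partition:
            # (Re)allocate on the device of the layer being processed, so a
            # layer on cuda:1 never reuses a buffer that lives on cuda:0.
            RTNLinearMethod.workspace = torch.zeros(
                layer.output_size_per_partition,  # size: assumption
                dtype=torch.int32,                # dtype: assumption
                device=device)
        # ... the Marlin-based GEMM would consume the workspace here ...
```

Tying the buffer to the device of the layer being processed keeps the shared class attribute safe even when models are loaded on several GPUs in one process.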
Using assert(0) for unsupported architectures can lead to runtime errors that are difficult to debug, and it might be compiled out in release builds. It's better to enforce this constraint at compile time.
A common technique to cause a compile-time error for unsupported preprocessor branches is to reference an incomplete type. This ensures that any attempt to compile this code for an unsupported architecture will fail with a clear message.
#else
// Cause a compile-time error for unsupported architectures.
(void)sizeof(struct cuda_arch_not_supported);
#endif
Similar to the previous comment, using assert(0) here is not ideal. A compile-time check is more robust and provides clearer feedback to developers if they attempt to build for an unsupported architecture.
#else
// Cause a compile-time error for unsupported architectures.
(void)sizeof(struct cuda_arch_not_supported);
#endif
This PR enhances the work started in #18768 and #20766 by introducing Marlin-based kernels for the calibration-free RTN-based quantization.
These kernels substantially improve the performance of dense models quantized with RTN.
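For reference, a minimal offline-inference sketch of how RTN quantization is selected in vLLM; the model name is illustrative, and the RTN_NUM_BITS=4 setting mirrors the benchmark command below:

```python
import os

from vllm import LLM, SamplingParams

# RTN_NUM_BITS=4 selects 4-bit RTN, matching the benchmark command below
# (assumption: set before engine construction to take effect).
os.environ["RTN_NUM_BITS"] = "4"

# quantization="rtn" enables the calibration-free RTN path that the new
# Marlin-based GEMM kernels accelerate.
llm = LLM(model="meta-llama/Llama-3.1-8B", quantization="rtn")

outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```

For multi-GPU runs, `tensor_parallel_size` would be set to match the `-tp <#GPUs>` flag used in the benchmarks.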
To measure the effect of the new kernels, we ran benchmark_latency with several Llama models on a machine equipped with H100 GPUs. The exact command was:

[RTN_NUM_BITS=4] python benchmark_latency.py --model <model> --n 1 --num-iters-warmup 3 --num-iters 10 --input-len 256 --output-len 32 -tp <#GPUs> --batch-size <batch> [--quantization rtn]

Each data point is an average of 5 runs; units are seconds of generation latency (lower is better).
Here are the results for Llama3.1-8B (run on 1 GPU), for various batch sizes:
Here are the results for Llama3.3-70B (run on 4 GPUs), for various batch sizes: