[PERF] PyTorch Symmetric Memory All-Reduce #20759
Conversation
Summary of Changes
Hello @ilmarkov, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request integrates a new, performance-optimized allreduce implementation into vLLM, utilizing PyTorch's symmetric memory features for CUDA. The primary goal is to enhance the efficiency of tensor parallel allreduce operations, particularly for mid-sized inputs, by providing a faster alternative that can be conditionally enabled via an environment variable. The changes also involve adapting existing allreduce strategies to work synergistically with this new method.
Highlights
- New Allreduce Implementation: Introduced a SymmMemCommunicator class that leverages PyTorch's symmetric memory capabilities for efficient allreduce operations on CUDA devices, providing an alternative to the existing CustomAllreduce and NCCL paths (a setup sketch follows this list).
- Conditional Activation and Strategy Selection: The new symmetric memory allreduce can be enabled via the VLLM_ALLREDUCE_USE_SYMM_MEM environment variable. The implementation selects between a 'two-shot' strategy (world sizes 2 and 4) and a 'multimem' strategy (world sizes 6 and 8), and is prioritized for suitable input tensor sizes.
- Performance Optimization and Benchmarking: This change aims to improve the performance of tensor parallel allreduce for mid-sized inputs, with benchmarks showing a 7-10% improvement in Mean TTFT and up to a 5-7% improvement in Mean TPOT for Llama-3.1-70B-Instruct with TP=4 and TP=8.
- Adaptive Custom Allreduce Bounds: The maximum supported input sizes for the existing CustomAllreduce implementation are adjusted when symmetric memory is enabled, ensuring the most performant allreduce method is used for each input range.
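As a rough illustration of what the setup for such a communicator can look like, here is a minimal sketch; the class name, buffer size, and the exact torch symmetric-memory calls are assumptions on my part and may differ from the PR's code and across torch versions:

```python
import os

import torch
import torch.distributed as dist
import torch.distributed._symmetric_memory as symm_mem


class SymmMemCommunicatorSketch:
    """Illustrative setup only; buffer size and env-var handling are assumed."""

    # World sizes that the PR routes to the multimem kernel.
    _WORLD_SIZES_MULTIMEM = (6, 8)

    def __init__(self, group: dist.ProcessGroup, device: torch.device,
                 max_bytes: int = 8 * 1024 * 1024):  # placeholder buffer size
        self.disabled = True
        if os.environ.get("VLLM_ALLREDUCE_USE_SYMM_MEM", "0") != "1":
            return
        self.group = group
        self.world_size = dist.get_world_size(group)
        self.dtype = torch.bfloat16
        # Allocate a buffer through torch's symmetric-memory allocator and
        # exchange handles with the other ranks in the group.
        self.buffer = symm_mem.empty(max_bytes // self.dtype.itemsize,
                                     dtype=self.dtype, device=device)
        symm_mem.rendezvous(self.buffer, group.group_name)
        self.disabled = False
```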
Code Review
This pull request introduces a new all-reduce implementation using PyTorch's symmetric memory, improving performance for medium-sized tensors. The code is well-structured, but I suggest increasing the flexibility of SymmMemCommunicator by allowing it to work with different dtypes, and improving the accuracy of comments in custom_all_reduce.py.
elif isinstance(device, str):
    device = torch.device(device)
torch.cuda.set_device(device)
self.dtype = torch.bfloat16
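One hedged way to act on the dtype suggestion would be to take the dtype as a constructor argument instead of hard-coding bfloat16; the parameter name and default below are illustrative, not the PR's code:

```python
import torch


class SymmMemCommunicatorSketch:
    def __init__(self, device: torch.device,
                 dtype: torch.dtype = torch.bfloat16):
        if isinstance(device, str):
            device = torch.device(device)
        torch.cuda.set_device(device)
        # Accept the dtype from the caller so fp16 (or fp32) workloads can
        # reuse the communicator; bfloat16 remains the default.
        self.dtype = dtype
```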
This pull request has merge conflicts that must be resolved before it can be merged.
@ilmarkov hey, do you know any reason why multicast_ptr == 0 on Hopper? Do I need a specific version of torch?
This pull request has merge conflicts that must be resolved before it can be merged.
This pull request has merge conflicts that must be resolved before it can be merged.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we delay it? I'm in discussion with the NCCL team about this.
Oh sorry, I read it wrong. I was thinking about the NCCL register-window stuff. Good to have this functionality; please add a follow-up on how to select the appropriate flags.
Thanks for integrating PyTorch Symmetric Memory into vLLM!
LGTM!
self.buffer[:inp.numel()].copy_(inp.view(-1))
if self.world_size in self._WORLD_SIZES_MULTIMEM[self.device_capability]:
    torch.ops.symm_mem.multimem_all_reduce_(self.buffer[:inp.numel()],
                                            "sum",
                                            self.group.group_name)
else:
    torch.ops.symm_mem.two_shot_all_reduce_(self.buffer[:inp.numel()],
                                            "sum",
                                            self.group.group_name)
out.copy_(self.buffer[:inp.numel()].view(out.shape))
nit: this is okay. We can talk more about how to optimize away the copy-in and copy-out :)
e.g. today some of our ops have an _out version that can do the copy for you.
We can also check here whether inp is a symmetric tensor; if it is, feed it to the ops directly rather than copying into self.buffer.
"sum", | ||
self.group.group_name) | ||
else: | ||
torch.ops.symm_mem.two_shot_all_reduce_(self.buffer[:inp.numel()], |
Any reason you are not using two_shot_all_reduce_out, which would produce the output directly in the non-symmetric out buffer?
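A sketch of what that could look like, assuming two_shot_all_reduce_out takes the symmetric input slice, the reduce op, the group name, and a regular output tensor; the exact signature is an assumption, not confirmed here:

```python
import torch


def all_reduce_into_out(buffer: torch.Tensor, inp: torch.Tensor,
                        out: torch.Tensor, group_name: str) -> torch.Tensor:
    # Copy the input into the symmetric buffer as before...
    buffer[:inp.numel()].copy_(inp.view(-1))
    # ...but let the op write the reduced result straight into the regular
    # `out` tensor, avoiding the explicit copy-out step.
    torch.ops.symm_mem.two_shot_all_reduce_out(buffer[:inp.numel()], "sum",
                                               group_name, out.view(-1))
    return out
```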
self.buffer[:inp.numel()].copy_(inp.view(-1))
if self.world_size in self._WORLD_SIZES_MULTIMEM[self.device_capability]:
    torch.ops.symm_mem.multimem_all_reduce_(self.buffer[:inp.numel()],
Did you guys check that this is deterministic? NCCL claims multimem allreduce is deterministic on newer drivers (NVIDIA/nccl#1497), but it's not in the docs.
Thanks for the additional review! Given the questions are not blockers, I think we should merge the functionality now and follow up on future work and default usage in discussions/PRs. This PR has been up for a while.
A few questions:
Add an alternative to custom_allreduce and NCCL on CUDA: PyTorch symmetric memory.
Enabled by the environment variable VLLM_ALLREDUCE_USE_SYMM_MEM=1. Improves the performance of TP allreduce for middle-sized inputs.
Bounds the input sizes handled by custom allreduce in the range where the two-shot custom allreduce appears to perform worse than NCCL or the PyTorch symmetric-memory allreduce.
Max sizes for the various world sizes, for both custom allreduce and symmetric memory, were chosen based on empirical results.
For world sizes 2 and 4 the PyTorch two-shot allreduce is used; for world sizes 6 and 8, PyTorch multimem_all_reduce is used.
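Roughly, the resulting dispatch can be pictured like this; the byte thresholds below are placeholders, not the empirically tuned values from the PR:

```python
# Placeholder thresholds only; the PR tunes these per world size from benchmarks.
CUSTOM_AR_MAX_BYTES = {2: 2 * 1024 * 1024, 4: 2 * 1024 * 1024,
                       6: 1024 * 1024, 8: 1024 * 1024}
SYMM_MEM_MAX_BYTES = 8 * 1024 * 1024


def pick_allreduce_backend(nbytes: int, world_size: int,
                           symm_mem_enabled: bool) -> str:
    """Small inputs go to the custom allreduce, mid-sized inputs to the
    symmetric-memory path (when enabled), everything else falls back to NCCL."""
    if nbytes <= CUSTOM_AR_MAX_BYTES.get(world_size, 0):
        return "custom_allreduce"
    if symm_mem_enabled and nbytes <= SYMM_MEM_MAX_BYTES:
        return "symm_mem"
    return "nccl"
```

For example, with these placeholder numbers a 4 MiB all-reduce at TP=4 would take the symmetric-memory path when the environment variable is set, and NCCL otherwise.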
Benchmark results:
Server:
VLLM_USE_V1=1 vllm serve meta-llama/Llama-3.1-70B-Instruct --disable-log-requests --no-enable-prefix-caching -tp $tp
Client:
On Blackwell, B200
TP=4:
Baseline:
PR:
TP=8
Baseline:
PR:
Up to 8% TTFT speedup for TP=4.
From 7% to 10% TTFT improvement, and up to 5-7% TPOT improvement, for TP=8.
On Hopper, H100
TP=8
Baseline:
PR:
Up to 10% TTFT improvement, and minor TPOT speedups.
Validation:
model="meta-llama/Llama-3.1-70B-Instruct"
client:
lm_eval --model local-completions --model_args model=${model},base_url=http://localhost:8000/v1/completions --batch_size auto --trust_remote_code --cache_requests true --tasks gsm8k --num_fewshot 5 --batch_size 200
server:
VLLM_ALLREDUCE_USE_SYMM_MEM=1 vllm serve $model --disable-log-requests --no-enable-prefix-caching -tp 4