[Refactor] Remove moe_align_block_size_triton
#21335
Conversation
Signed-off-by: yewentao256 <[email protected]>
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs would not trigger a full CI run by default. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
Code Review
This pull request removes the Triton implementation of moe_align_block_size as the vLLM custom operation is now faster. The changes remove the Triton kernels and update the benchmark script. The code removal is clean and the rationale is supported by the provided performance data.
-    line_vals=["vllm", "triton"],  # "triton"
-    line_names=["VLLM", "Triton"],  # "Triton"
+    line_vals=["vllm"],
+    line_names=["vLLM"],
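For context, after this change the benchmark's `triton.testing.perf_report` configuration would keep only the vLLM line, roughly like the sketch below. The shape values mirror the results table in the PR description; everything else (variable names, `ylabel`, and so on) is an assumption rather than a copy of the actual script.

```python
from itertools import product

import triton

# Sketch of the remaining perf_report configuration (assumed, not copied
# from the actual benchmark script; shapes mirror the results table below).
configs = [
    triton.testing.Benchmark(
        x_names=["num_tokens", "num_experts", "topk"],
        x_vals=list(product([1, 16, 256, 4096],
                            [16, 64, 224, 256, 280, 512],
                            [1, 2, 8])),
        line_arg="provider",
        line_vals=["vllm"],      # only the CUDA custom op is benchmarked now
        line_names=["vLLM"],
        plot_name="moe-align-block-size-performance",
        args={},
        ylabel="us",             # assumed unit label
    )
]
```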
Should we move the kernel into the benchmark script? It doesn't seem that useful to have a benchmark with just one kernel.
I left it here because I think someone else may reuse this script for comparisons, e.g. new vLLM vs. old vLLM.
I don't think moving the kernel into the benchmark helps, because we already beat the Triton kernel for every shape, so we don't need to maintain it. In the future, when a new kernel comes along, comparing it against the current version should be enough. What are your thoughts?
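For what it's worth, if a new implementation does show up later, the comparison could be done by keeping a second provider branch in the benchmark function, along these lines. This is only a sketch: `moe_align_block_size_new` is a hypothetical future candidate, and the import path for the current op is an assumption that may need adjusting.

```python
import torch
from triton.testing import do_bench

# Assumed import path for the current op; adjust to the actual vLLM module.
from vllm.model_executor.layers.fused_moe.moe_align_block_size import (
    moe_align_block_size)


def bench(provider: str, num_tokens: int, num_experts: int, topk: int,
          block_size: int = 128) -> float:
    """Return the mean latency (ms) of one provider for a given shape."""
    topk_ids = torch.randint(0, num_experts, (num_tokens, topk),
                             dtype=torch.int32, device="cuda")
    if provider == "vllm":
        fn = lambda: moe_align_block_size(topk_ids, block_size, num_experts)
    elif provider == "vllm_new":
        # Hypothetical future kernel to compare against the current op.
        fn = lambda: moe_align_block_size_new(topk_ids, block_size, num_experts)
    else:
        raise ValueError(f"unknown provider: {provider}")
    return do_bench(fn, warmup=25, rep=100)
```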
CC @mgoin
Okay, that is fair enough. As long as we have a naive implementation to unit test against, it should be okay.
Yes, someone else developed a torch version for unit testing. But it is quite slow, so I chose not to include it in the benchmark.
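For readers landing here later, a naive torch reference for unit testing could look roughly like this. It is a minimal sketch written from the op's documented contract (group token-expert pairs by expert, pad each group to a multiple of `block_size`), not a copy of the helper used in vLLM's tests; names and the exact output layout are assumptions.

```python
import torch


def moe_align_block_size_ref(topk_ids: torch.Tensor, block_size: int,
                             num_experts: int):
    """Naive reference: group token-expert pairs by expert and pad each
    expert's group to a multiple of block_size (slow, for tests only)."""
    flat = topk_ids.flatten()
    pad_value = flat.numel()  # out-of-range sentinel marking padding slots

    sorted_chunks, expert_blocks = [], []
    for expert in range(num_experts):
        # positions in the flattened topk_ids that route to this expert
        idx = (flat == expert).nonzero(as_tuple=False).flatten()
        padded = -(-idx.numel() // block_size) * block_size  # round up
        pad = idx.new_full((padded - idx.numel(),), pad_value)
        sorted_chunks.append(torch.cat([idx, pad]))
        expert_blocks += [expert] * (padded // block_size)

    sorted_token_ids = torch.cat(sorted_chunks).to(torch.int32)
    expert_ids = torch.tensor(expert_blocks, dtype=torch.int32,
                              device=topk_ids.device)
    num_tokens_post_padded = torch.tensor([sorted_token_ids.numel()],
                                          dtype=torch.int32,
                                          device=topk_ids.device)
    return sorted_token_ids, expert_ids, num_tokens_post_padded
```

A unit test could then compare these three outputs against the custom op for random `topk_ids`, treating the order of tokens within a single expert's group as unspecified.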
Purpose
`moe_align_block_size_triton` was taken from SGL and was used to benchmark the `moe_align_block_size` kernel. Now that the vLLM custom op beats the Triton version for every shape, we can safely remove it; there are no other references to it in the code.
moe-align-block-size-performance:
    num_tokens  num_experts  topk       VLLM      Triton
 0         1.0         16.0   1.0  23.584001   52.127998
 1         1.0         16.0   2.0  23.584001   50.655998
 2         1.0         16.0   8.0  23.584001   51.936001
 3         1.0         64.0   1.0  31.776000   52.096002
 4         1.0         64.0   2.0  31.776000   51.376000
 5         1.0         64.0   8.0  31.776000   50.880000
 6         1.0        224.0   1.0  27.680000   64.544000
 7         1.0        224.0   2.0  27.664000   64.544000
 8         1.0        224.0   8.0  27.680000   64.544000
 9         1.0        256.0   1.0  29.727999   66.592000
10         1.0        256.0   2.0  29.727999   66.592000
11         1.0        256.0   8.0  29.727999   66.592000
12         1.0        280.0   1.0  29.696001   68.672001
13         1.0        280.0   2.0  29.727999   68.672001
14         1.0        280.0   8.0  29.727999   68.640001
15         1.0        512.0   1.0  33.792000   97.344004
16         1.0        512.0   2.0  33.792000   97.344004
17         1.0        512.0   8.0  33.792000   97.344004
18        16.0         16.0   1.0  23.584001   52.448001
19        16.0         16.0   2.0  23.584001   51.199999
20        16.0         16.0   8.0  25.632000   52.207999
21        16.0         64.0   1.0  31.776000   52.064002
22        16.0         64.0   2.0  31.776000   52.032001
23        16.0         64.0   8.0  31.776000   51.456001
24        16.0        224.0   1.0  27.680000   61.632000
25        16.0        224.0   2.0  27.680000   61.439998
26        16.0        224.0   8.0  27.744001   61.407998
27        16.0        256.0   1.0  29.696001   66.592000
28        16.0        256.0   2.0  29.696001   66.592000
29        16.0        256.0   8.0  29.696001   66.624001
30        16.0        280.0   1.0  29.696001   70.720002
31        16.0        280.0   2.0  29.696001   70.688002
32        16.0        280.0   8.0  29.696001   71.584001
33        16.0        512.0   1.0  33.792000   97.375996
34        16.0        512.0   2.0  33.792000   97.375996
35        16.0        512.0   8.0  33.792000   95.296003
36       256.0         16.0   1.0  27.712001   52.832000
37       256.0         16.0   2.0  33.824001   54.912001
38       256.0         16.0   8.0  25.599999   64.479999
39       256.0         64.0   1.0  33.824001   52.800000
40       256.0         64.0   2.0  35.872001   52.320000
41       256.0         64.0   8.0  27.280000   56.352001
42       256.0        224.0   1.0  29.600000   62.527999
43       256.0        224.0   2.0  28.640000   62.527999
44       256.0        224.0   8.0  29.727999   74.784003
45       256.0        256.0   1.0  29.727999   68.640001
46       256.0        256.0   2.0  29.696001   64.576000
47       256.0        256.0   8.0  29.696001   70.720002
48       256.0        280.0   1.0  29.727999   68.736002
49       256.0        280.0   2.0  29.727999   70.752002
50       256.0        280.0   8.0  29.696001   74.752003
51       256.0        512.0   1.0  33.792000   93.216002
52       256.0        512.0   2.0  33.792000   92.032000
53       256.0        512.0   8.0  33.792000   91.168001
54      4096.0         16.0   1.0  27.648000   93.216002
55      4096.0         16.0   2.0  29.727999  136.224002
56      4096.0         16.0   8.0  46.080001  392.255992
57      4096.0         64.0   1.0  27.648000   64.544000
58      4096.0         64.0   2.0  31.744000   72.800003
59      4096.0         64.0   8.0  50.175998  140.351996
60      4096.0        224.0   1.0  29.727999   76.863997
61      4096.0        224.0   2.0  33.792000   80.959998
62      4096.0        224.0   8.0  52.352000  105.535999
63      4096.0        256.0   1.0  31.679999   78.879997
64      4096.0        256.0   2.0  35.808001   87.072000
65      4096.0        256.0   8.0  52.255999  109.792002
66      4096.0        280.0   1.0  31.744000   80.895998
67      4096.0        280.0   2.0  35.840001   88.992000
68      4096.0        280.0   8.0  50.175998  109.664001
69      4096.0        512.0   1.0  35.872001   95.200002
70      4096.0        512.0   2.0  37.888002  101.471998
71      4096.0        512.0   8.0  54.272000  128.064007