-
-
Notifications
You must be signed in to change notification settings - Fork 9.7k
[V1] implement tree sampler for draft token acceptance #22752
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either: Add 🚀 |
This pull request has merge conflicts that must be resolved before it can be |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request introduces a tree-based sampler for speculative decoding, which is a significant feature for improving performance. The changes are well-structured, with new components like TreeDrafterParams
and TreeRejectionSampler
to handle the tree-based logic. The Eagle proposer is also updated to support tree drafting and to output draft probabilities. My review has identified a few critical issues regarding the handling of irregular token trees and the correctness of probability calculations, which should be addressed to ensure the feature works as expected.
9c59df6
to
3da3a66
Compare
3da3a66
to
66d7154
Compare
This pull request has merge conflicts that must be resolved before it can be |
544ec15
to
aa5dd93
Compare
ac2523d
to
f04916b
Compare
Signed-off-by: Giancarlo Delfin <[email protected]>
f04916b
to
5cd6a5e
Compare
Purpose
Continuing with my work in this PR: #20401 , where I added support for drafting a tree of speculative tokens, that are then validated by the target model. In this PR, I add the class that performs rejection sampling for those draft tokens, so that they conform to the target model's output distribution. This class is based off of RejectionSampler, but with some key differences necessary to support a tree structure for drafted tokens. I added some tests for this new class to verify it's correctness.
In addition, I also made some refactors to the tree attention parameters to improve readability and performance. I created a new class called
TreeDrafterParams
which is created during theSpeculativeConfig
initialization, and precomputes several properties from the spec token tree so that other tree-attention systems can use it (without re-computing themselves). Examples: attention mask, children per level, etc.Finally, I reenabled the
test_eagle_correctness
test, which is currently skipping the tree attention backend due to flakiness. After my changes in this PR and from my testing on an H100, I don't observe flakiness with the test anymore. However, I decided to reduce the accuracy threshold from 66% -> 50% for tree attention only, so as to not cause noise in test signals for others. This is also because I've heard that triton attention kernels (which tree attention uses under the hood), tend to have more floating point non-determinism than flash attention 3.Test Plan
Automated Testing
New tree rejection sampler tests:
Eagle tree proposer test:
Spec decode e2e test:
Manual Testing
Server
Client
Response
Benchmark
Server
Client
Results
Next Steps