
Conversation

@lio1226 (Contributor) commented Oct 24, 2025

What this PR does / why we need it?

We optimized the sample_recovered_tokens_pytorch method in the rejection sampler and improved the performance of eagle-3.

Does this PR introduce any user-facing change?

How was this patch tested?

None

Co-authored-by: QilaiZhang ([email protected])

vLLM version: v0.11.0rc3
vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

@github-actions

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message and fill in the PR description to help reviewers and future developers understand the change.

If CI fails, you can run linting and testing checks locally according to Contributing and Testing.

@gemini-code-assist bot (Contributor) left a comment

Code Review

The pull request optimizes the sample_recovered_tokens_pytorch method in rejection_sampler.py to improve the performance of eagle-3. The optimization replaces the nested Python loops with vectorized torch operations, which should reduce execution time. I have identified a potential issue related to the indexing of q_values.


recovered_id = torch.argmax(prob / q_values).item()
output_token_ids[token_idx] = recovered_id
q_values[:vocab_size] = q_value_new[token_positions, :vocab_size]

Severity: critical

The indexing q_values[:vocab_size] might lead to incorrect behavior. Both q_values and q_value_new are initialized with the shape (num_tokens, vocab_size). Therefore, assigning q_value_new[token_positions, :vocab_size] to q_values[:vocab_size] will update only the first vocab_size rows of q_values, while the remaining rows stay at -inf. This is likely not the intended behavior, as it will skew the probability distribution for tokens beyond the first vocab_size positions. Consider assigning q_value_new to q_values directly.

To fix this, you should assign the entire q_value_new to q_values without slicing. This ensures that all token positions have the correct q-values for the subsequent argmax operation.


Suggested change
q_values[:vocab_size] = q_value_new[token_positions, :vocab_size]
q_values = q_value_new
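For context, the loop-to-vectorized rewrite discussed above can be sketched as follows. This is a minimal illustration with hypothetical tensor names, not the PR's actual implementation: picking each recovered token as argmax(prob / q) across all token positions with one batched torch.argmax, instead of a Python loop over rows.

```python
import torch

def sample_recovered_tokens_vectorized(probs: torch.Tensor,
                                       q_values: torch.Tensor) -> torch.Tensor:
    """Pick the recovered token per position as argmax(prob / q).

    probs, q_values: (num_tokens, vocab_size) tensors. A single
    batched argmax over the last dimension replaces a per-token
    Python loop, which is where the speedup comes from.
    """
    return torch.argmax(probs / q_values, dim=-1)
```

Because the row-wise version is a single elementwise divide plus one reduction, it matches the per-token loop exactly while letting the backend fuse the work across all positions.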

@lio1226 lio1226 force-pushed the rejection_sample_optimize_v1 branch from c3edfda to f128fd5 Compare October 24, 2025 12:04
