[BugFix] Add block_size validation for mamba cache align mode #34445
+9
−0
Purpose
In Mamba cache align mode, prefill requests must be scheduled with a block-aligned number of tokens per scheduling step. If `max_num_batch_tokens` is smaller than `block_size` while the request length exceeds `block_size`, the `_mamba_block_aligned_split()` function returns a `num_new_tokens` of `0` because of these alignment constraints. The request can then never be scheduled, and the engine eventually hangs.
This PR adds a validation check to ensure that `block_size` is not larger than `max_num_batch_tokens` when Mamba cache align mode is enabled.
Test Plan
Test Result
Before fix: Engine hang.
After: Proper validation error.
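
The failure mode and the check described under Purpose can be illustrated with a small, self-contained sketch. This is a toy model, not the code in this PR: `aligned_num_new_tokens` and `validate_block_size` are hypothetical helpers, and the rounding rule is a simplified assumption about how a block-aligned split behaves.

```python
def aligned_num_new_tokens(remaining_tokens: int, token_budget: int, block_size: int) -> int:
    """Toy model of a block-aligned prefill split (hypothetical, simplified)."""
    num_new_tokens = min(remaining_tokens, token_budget)
    if num_new_tokens < remaining_tokens:
        # Assumption: a partial prefill chunk must end on a block boundary,
        # so the chunk is rounded down to a multiple of block_size.
        num_new_tokens = (num_new_tokens // block_size) * block_size
    return num_new_tokens


def validate_block_size(block_size: int, max_num_batch_tokens: int, align_mode_enabled: bool) -> None:
    """Hypothetical up-front check mirroring the validation this PR describes."""
    if align_mode_enabled and block_size > max_num_batch_tokens:
        raise ValueError(
            f"block_size ({block_size}) must not be larger than "
            f"max_num_batch_tokens ({max_num_batch_tokens}) in Mamba cache align mode."
        )


# Budget smaller than one block: the aligned chunk is always 0 tokens,
# so the request would never make progress.
stuck = aligned_num_new_tokens(remaining_tokens=100, token_budget=8, block_size=16)

# Budget of at least one block: the request advances by whole blocks.
progress = aligned_num_new_tokens(remaining_tokens=100, token_budget=40, block_size=16)
```

In this model, rejecting the configuration at startup turns a silent hang into an immediate, explainable error.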