Conversation

Contributor

@peakcrosser7 peakcrosser7 commented Feb 12, 2026

Purpose

In Mamba cache align mode, prefill requests must be scheduled with a block-aligned number of tokens per scheduling step. If max_num_batched_tokens is smaller than block_size while the request length exceeds block_size, the _mamba_block_aligned_split() function returns a num_new_tokens of 0 because of this alignment constraint. The request can therefore never be scheduled, and the engine eventually hangs.
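
The failure mode can be illustrated with a small standalone sketch. The function below is hypothetical and only mimics the rounding behaviour; it is not vLLM's actual _mamba_block_aligned_split() implementation:

def block_aligned_split(num_new_tokens: int, token_budget: int, block_size: int) -> int:
    """Round the per-step schedulable token count down to a multiple of block_size."""
    schedulable = min(num_new_tokens, token_budget)
    return (schedulable // block_size) * block_size

# With block_size=544 and a per-step budget of max_num_batched_tokens=512,
# the aligned count is always 0, so the prefill request never makes progress.
print(block_aligned_split(num_new_tokens=2000, token_budget=512, block_size=544))  # -> 0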

This PR adds a validation check to ensure that block_size is not larger than max_num_batched_tokens when Mamba cache align mode is enabled.
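
A minimal sketch of the kind of startup-time check this adds (the function name and signature are illustrative, not the exact vLLM config code; the assertion message mirrors the one in the test result below):

def validate_mamba_align_block_size(
    mamba_cache_mode: str, block_size: int, max_num_batched_tokens: int
) -> None:
    if mamba_cache_mode == "align":
        assert block_size <= max_num_batched_tokens, (
            f"In Mamba cache align mode, block_size ({block_size}) must be "
            f"<= max_num_batched_tokens ({max_num_batched_tokens})."
        )

# With the configuration from the test plan below, this raises immediately:
# validate_mamba_align_block_size("align", block_size=544, max_num_batched_tokens=512)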

Test Plan

import time

from vllm import LLM, SamplingParams
from vllm.distributed import cleanup_dist_env_and_memory


def main():
    MODEL = "Qwen/Qwen3-Next-80B-A3B-Instruct"  # gdn
    PROMPT_MULTIPLE = 310
    sampling_params = SamplingParams(temperature=0.0, max_tokens=128)
    prefix = ( # examples/offline_inference/prefix_caching.py
        "You are an expert school principal, skilled in effectively managing "
        "faculty and staff. Draft 10-15 questions for a potential first grade "
        "Head Teacher for my K-12, all-girls', independent school that emphasizes "
        "community, joyful discovery, and life-long learning. The candidate is "
        "coming in for a first-round panel interview for a 8th grade Math "
        "teaching role. They have 5 years of previous teaching experience "
        "as an assistant teacher at a co-ed, public school with experience "
        "in middle school math teaching. ")
    prefix2 = ("Based on these information, fulfill "
               "the following paragraph: ")
    prompt = PROMPT_MULTIPLE * prefix + prefix2 + "Hello, my name is"
    print('Prompt length:', len(prompt))
    for APC in [
        True
    ]:
        engine = LLM(
            model=MODEL,
            enable_prefix_caching=APC,
            max_num_batched_tokens=512,  # smaller than block_size=544
            tensor_parallel_size=2,
            gpu_memory_utilization=0.9,
            disable_log_stats=False,
            mamba_cache_mode="align",  # Mamba cache align mode: the configuration under test
        )
        for i in range(3):
            if i == 0:
                print('Warm-up')
            if i == 1:
                print('Measuring')
                start_time = time.time()
            outputs = engine.generate(prompt, sampling_params)
            print('APC:', APC, i, f"Generated text: {outputs[0].outputs[0].text!r}")
            for m in engine.llm_engine.get_metrics():
                if 'vllm:prefix_cache_hits' in m.name:
                    print(m.name, m.value)
        print('APC:', APC, "loop took --- %s seconds ---" % (time.time() - start_time))
        del engine
        cleanup_dist_env_and_memory()


if __name__ == "__main__":
    main()

Test Result

Before fix: the engine hangs and the prompt is never processed.

Warm-up

Adding requests:   0%|          | 0/1 [00:00<?, ?it/s]
Adding requests: 100%|██████████| 1/1 [00:00<00:00, 400.95it/s]

Processed prompts:   0%|          | 0/1 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]

After fix: a clear validation error is raised when the engine is constructed.

Traceback (most recent call last):
  File "/root/huanghy/vllm_opsrc/my_tests/test_lpc_offline.py", line 56, in <module>
    main()
  File "/root/huanghy/vllm_opsrc/my_tests/test_lpc_offline.py", line 31, in main
    engine = LLM(
             ^^^^
  File "/root/huanghy/vllm_opsrc/vllm/entrypoints/llm.py", line 346, in __init__
    self.llm_engine = LLMEngine.from_engine_args(
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/huanghy/vllm_opsrc/vllm/v1/engine/llm_engine.py", line 166, in from_engine_args
    vllm_config = engine_args.create_engine_config(usage_context)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/huanghy/vllm_opsrc/vllm/engine/arg_utils.py", line 1809, in create_engine_config
    config = VllmConfig(
             ^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/pydantic/_internal/_dataclasses.py", line 121, in __init__
    s.__pydantic_validator__.validate_python(ArgsKwargs(args, kwargs), self_instance=s)
pydantic_core._pydantic_core.ValidationError: 1 validation error for VllmConfig
  Assertion failed, In Mamba cache align mode, block_size (544) must be <= max_num_batched_tokens (512). [type=assertion_error, input_value=ArgsKwargs((), {'model_co...transfer_config': None}), input_type=ArgsKwargs]
    For further information visit https://errors.pydantic.dev/2.12/v/assertion_error

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a crucial validation check to prevent the engine from hanging when using Mamba's cache align mode with an invalid configuration where block_size exceeds max_num_batched_tokens. The fix is correct and well-placed. I've suggested a minor improvement to the assertion's error message to make it more informative for users encountering this configuration error.

Signed-off-by: huanghaoyan.hhy <[email protected]>
Collaborator

@heheda12345 heheda12345 left a comment

LGTM!

@heheda12345 heheda12345 enabled auto-merge (squash) February 12, 2026 19:31
@github-actions github-actions bot added the ready label (ONLY add when PR is ready to merge/full CI is needed) on Feb 12, 2026

Labels

bug (Something isn't working) · ready (ONLY add when PR is ready to merge/full CI is needed)
