chat: forward chat_template_kwargs on simple-engine paths#218

Open
krystophny wants to merge 4 commits into waybarrios:main from computor-org:fix/chat-template-kwargs-forwarding

Conversation

Contributor

@krystophny krystophny commented Mar 24, 2026

Summary

Honor chat_template_kwargs on the simple-engine paths that still ignored it and run the regression coverage in Apple Silicon CI.

Why

Before this branch, chat_template_kwargs was only reliably honored on the batched path and the plain LLM chat path. Simple-engine multimodal chat, multimodal stream chat, and the text-only MTP route still dropped the field.

What changed

  • forward chat_template_kwargs through simple-engine multimodal chat()
  • forward chat_template_kwargs through simple-engine multimodal stream_chat()
  • forward chat_template_kwargs into _stream_generate_text() for the text-only MTP route
  • include tests/test_chat_template_kwargs.py in Apple Silicon CI

Status

  • refreshed onto current upstream main (b4fa030) on 2026-04-09
  • no logic changes beyond the base refresh

Files to review

  • vllm_mlx/engine/simple.py
  • .github/workflows/ci.yml
  • tests/test_chat_template_kwargs.py

Validation

  • python -m pytest tests/test_chat_template_kwargs.py -q -> 6 passed
  • note: the older tests/test_simple_engine.py validation command now depends on the separate async-harness refresh in #226 on current upstream, so validation is scoped to this PR's dedicated regression file

@krystophny krystophny changed the title Forward chat template kwargs in batched chat chat: forward chat_template_kwargs in batched path Mar 24, 2026
@krystophny krystophny changed the title chat: forward chat_template_kwargs in batched path chat: forward chat_template_kwargs on simple-engine paths Mar 24, 2026
Collaborator

@Thump604 Thump604 left a comment


Implementation is solid and addresses real coverage gaps. The forwarding is consistent across all simple-engine paths:

What works:

  • API model field properly declared with optional dict[str, Any]
  • SimpleEngine MLLM multimodal chat/stream_chat forward kwargs to model
  • SimpleEngine text-only MTP route in _stream_generate_text applies kwargs
  • LLMLanguageModel.chat applies kwargs with graceful TypeError fallback
  • BatchedEngine properly merges kwargs and propagates to prefix boundary computation
  • TypeError handling updated to remove arbitrary kwargs, not just tools

Pattern is defensive: chat_template_kwargs = dict(kwargs.pop("chat_template_kwargs", {}) or {}) safely handles None and creates a fresh dict. The line 372 guard in BatchedEngine prevents tools from being inserted twice when merging kwargs.
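That pop pattern can be exercised in isolation (a sketch; extract_template_kwargs is a hypothetical helper name, not from the PR):

```python
def extract_template_kwargs(kwargs: dict) -> dict:
    # `or {}` tolerates an explicit None value; dict(...) always returns a
    # fresh copy, so later mutation cannot alias the caller's dict.
    return dict(kwargs.pop("chat_template_kwargs", {}) or {})


# An explicit None collapses to an empty dict instead of raising.
assert extract_template_kwargs({"chat_template_kwargs": None}) == {}

shared = {"enable_thinking": False}
call_kwargs = {"chat_template_kwargs": shared, "temperature": 0.2}
extracted = extract_template_kwargs(call_kwargs)
assert extracted == shared and extracted is not shared  # fresh copy
assert "chat_template_kwargs" not in call_kwargs        # popped off
```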

Test coverage is comprehensive: every forwarded path has a mock-based test asserting that the kwargs reach the template. Adding the file to Apple Silicon CI ensures regression detection.

One implementation detail: line 607 (SimpleEngine text path) and line 380 (BatchedEngine) both retry on TypeError by removing all user-provided template kwargs. This is correct but slightly more aggressive than the original "tools only" approach. The exception is rare enough that this won't be a problem, and if a template silently ignores an unknown kwarg instead of raising TypeError, those kwargs pass through on the first try. This is an acceptable trade-off for simplicity.

Ready to merge from the implementation side.

Collaborator

Thump604 commented Apr 7, 2026

@waybarrios, @krystophny: independent technical review of this PR.

Verification of the fix

Confirmed against current upstream main (b4fa030). The diff plumbs chat_template_kwargs through every place it was previously dropped:

  1. vllm_mlx/api/models.py:172 adds the field to ChatCompletionRequest
  2. vllm_mlx/server.py:1422 forwards it from the request into chat_kwargs
  3. vllm_mlx/engine/simple.py forwards it through SimpleEngine chat() (LLM and MLLM branches), stream_chat() (MLLM and run_stream branches), and _stream_generate_text() (MTP path)
  4. vllm_mlx/engine/batched.py forwards it through BatchedEngine chat(), stream_chat(), and _compute_prefix_boundary() so per-template-kwargs prefix caching works correctly
  5. vllm_mlx/models/llm.py adds the parameter to MLXLanguageModel.chat() so the LLM path honors it

All template-apply call sites also gain a graceful fallback: if a tokenizer raises TypeError because it does not support a given kwarg, the failed kwargs are popped and the call retries.
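The fallback described above amounts to a try/retry wrapper; a minimal sketch (apply_with_fallback and strict_template are hypothetical names used only for illustration):

```python
def apply_with_fallback(apply_fn, messages, chat_template_kwargs):
    """Try the template with user kwargs; on TypeError, retry without them."""
    try:
        return apply_fn(messages, **chat_template_kwargs)
    except TypeError:
        # The tokenizer rejected an unknown kwarg: drop them all and retry.
        return apply_fn(messages)


def strict_template(messages, **kwargs):
    # Simulates a tokenizer whose template accepts no extra kwargs.
    if kwargs:
        raise TypeError(f"unexpected kwargs: {sorted(kwargs)}")
    return "plain-prompt"


result = apply_with_fallback(strict_template, [], {"enable_thinking": False})
print(result)  # plain-prompt
```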

Test coverage

tests/test_chat_template_kwargs.py adds 7 tests covering Pydantic field preservation, BatchedEngine _apply_chat_template, the HTTP endpoint via FakeEngine + TestClient, LLM chat applying kwargs before generate, SimpleEngine MLLM chat forwarding, and SimpleEngine _stream_generate_text applying kwargs. The CI workflow is updated to run the new test in the Apple Silicon job.

Why this matters

Per the PR description, before this branch chat_template_kwargs was honored on the batched path and the plain LLM chat path but silently dropped on simple-engine multimodal chat(), simple-engine multimodal stream_chat(), and the text-only MTP _stream_generate_text route. That means enable_thinking=false in chat_template_kwargs was being silently ignored on those three paths, which can cause Qwen 3.5 thinking-tag leakage in multimodal and MTP responses.
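A request exercising one of the previously broken paths would carry the field like this (a sketch; the model name is a placeholder, not a real deployment):

```python
import json

payload = {
    "model": "qwen-example",  # placeholder model name
    "messages": [{"role": "user", "content": "Describe this image."}],
    "chat_template_kwargs": {"enable_thinking": False},
}
body = json.dumps(payload)

# The fix ensures this field survives the trip from request parsing down
# to apply_chat_template instead of being silently dropped.
assert json.loads(body)["chat_template_kwargs"] == {"enable_thinking": False}
```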

Recommendation

Merge candidate. Real fix to a real silently-ignored API field, comprehensive plumbing across all relevant call sites, good test coverage, and the CI workflow update means the regression cannot return without someone disabling the test job.

Thump604 added a commit to Thump604/vllm-mlx that referenced this pull request Apr 9, 2026
…l generate+stream_generate

Pre-existing regression from an earlier rebase that dropped bdf7dcc's
llm.py additions. The server.py request handlers still pass top_k,
min_p, presence_penalty, repetition_penalty through to SimpleEngine,
which forwards them via **kwargs to MLXLanguageModel.chat() (which
accepts **kwargs) which then calls self.generate(..., **kwargs). But
MLXLanguageModel.generate() and stream_generate() had been left with
only (temperature, top_p, repetition_penalty) in their signatures, so
any non-MLLM SimpleEngine request crashed with:

    TypeError: MLXLanguageModel.stream_generate() got an unexpected
    keyword argument 'top_k'

Observed as 0/6 on simple-base, simple-mtp, and simple-spec profiles in
the feature matrix regression sweep after the Session 87 cherry-picks
of PRs waybarrios#248, waybarrios#229, waybarrios#218, waybarrios#222 landed. The cherry-picks did not cause
this regression — they exposed it by finally running the LLM-path
tests that no one had exercised since the rebase happened. Confirmed
via stderr.log:

  TypeError: MLXLanguageModel.generate() got an unexpected keyword argument 'top_k'
  TypeError: MLXLanguageModel.stream_generate() got an unexpected keyword argument 'top_k'

Fix: restore the signatures and bodies of _create_sampler,
_create_logits_processors, generate, and stream_generate to match
bdf7dcc's original intent. Preserves PR waybarrios#248's prompt_cache parameter
and non-str prompt support on stream_generate. Adds **kwargs to both
generate and stream_generate so future param additions degrade
gracefully instead of crashing.

This is a runtime-local fix. The equivalent upstream fix lives in
bdf7dcc which was never upstreamed (confirmed via
git merge-base --is-ancestor bdf7dcc upstream/main). A follow-up PR
to upstream could carry this forward.

Verification:
  bin/verify-patches: 33/33 clean
  Full feature matrix regression sweep pending re-run after this commit.

Related: runtime PR waybarrios#265 (waybarrios#265) fixed the
CompletionRequest schema side of the same bdf7dcc drop; this commit
fixes the engine-model side.
@krystophny krystophny force-pushed the fix/chat-template-kwargs-forwarding branch from 1e17fb1 to be2ba60 Compare April 9, 2026 06:35
@krystophny
Contributor Author

Force-pushed a refresh onto current upstream main (b4fa030). No logic change beyond the base refresh. Validation: python -m pytest tests/test_chat_template_kwargs.py -q -> 6 passed. The older tests/test_simple_engine.py validation command now depends on the separate async-harness refresh in #226 on current upstream, so I kept validation scoped to this PR's dedicated regression file.

Collaborator

Thump604 commented Apr 9, 2026

Refresh confirmed on head 3c33f72 against upstream main b4fa030. The only delta on top of the previously approved be2ba60 is the "style: format chat template kwargs tests" commit, which is a no-op on the forwarding logic. The SimpleEngine multimodal chat, stream_chat, _stream_generate_text, BatchedEngine chat, models/llm.py, and server.py wiring all match the previously reviewed shape.

CI green on lint, type-check, test-matrix 3.10-3.12, test-apple-silicon, tests. tests/test_chat_template_kwargs.py -> 6 passed on head. Prior APPROVED review at be2ba60 applies to the refreshed head.
