chat: forward chat_template_kwargs on simple-engine paths#218

Open
krystophny wants to merge 4 commits into waybarrios:main from computor-org:fix/chat-template-kwargs-forwarding

Conversation

Contributor

@krystophny krystophny commented Mar 24, 2026

Summary

Honor chat_template_kwargs on the simple-engine paths that still ignored it and run the regression coverage in Apple Silicon CI.

Why

Before this branch, chat_template_kwargs was only reliably honored on the batched path and the plain LLM chat path. Simple-engine multimodal chat, multimodal stream chat, and the text-only MTP route still dropped the field.

What changed

  • forward chat_template_kwargs through simple-engine multimodal chat()
  • forward chat_template_kwargs through simple-engine multimodal stream_chat()
  • forward chat_template_kwargs into _stream_generate_text() for the text-only MTP route
  • include tests/test_chat_template_kwargs.py in Apple Silicon CI

Status

  • refreshed onto current upstream main (b4fa030) on 2026-04-09
  • no logic changes beyond the base refresh

Files to review

  • vllm_mlx/engine/simple.py
  • .github/workflows/ci.yml
  • tests/test_chat_template_kwargs.py

Validation

  • python -m pytest tests/test_chat_template_kwargs.py -q -> 6 passed
  • note: the older tests/test_simple_engine.py validation command now depends on the separate async-harness refresh in #226 on current upstream, so validation is scoped to this PR's dedicated regression file

@krystophny krystophny changed the title Forward chat template kwargs in batched chat chat: forward chat_template_kwargs in batched path Mar 24, 2026
@krystophny krystophny changed the title chat: forward chat_template_kwargs in batched path chat: forward chat_template_kwargs on simple-engine paths Mar 24, 2026
Collaborator

@Thump604 Thump604 left a comment


Implementation is solid and addresses real coverage gaps. The forwarding is consistent across all simple-engine paths:

What works:

  • API model field properly declared with optional dict[str, Any]
  • SimpleEngine MLLM multimodal chat/stream_chat forward kwargs to model
  • SimpleEngine text-only MTP route in _stream_generate_text applies kwargs
  • LLMLanguageModel.chat applies kwargs with graceful TypeError fallback
  • BatchedEngine properly merges kwargs and propagates to prefix boundary computation
  • TypeError handling updated to remove arbitrary kwargs, not just tools

Pattern is defensive: chat_template_kwargs = dict(kwargs.pop("chat_template_kwargs", {}) or {}) safely handles None and creates a fresh dict. The line 372 guard in BatchedEngine prevents tools from being inserted twice when merging kwargs.
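That pop pattern can be exercised in isolation (a sketch; extract_template_kwargs is a hypothetical helper name, not from the PR):

```python
def extract_template_kwargs(kwargs: dict) -> dict:
    # `or {}` tolerates an explicit None value; dict(...) always returns a
    # fresh copy, so later mutation cannot alias the caller's dict.
    return dict(kwargs.pop("chat_template_kwargs", {}) or {})


# An explicit None collapses to an empty dict instead of raising.
assert extract_template_kwargs({"chat_template_kwargs": None}) == {}

shared = {"enable_thinking": False}
call_kwargs = {"chat_template_kwargs": shared, "temperature": 0.2}
extracted = extract_template_kwargs(call_kwargs)
assert extracted == shared and extracted is not shared  # fresh copy
assert "chat_template_kwargs" not in call_kwargs        # popped off
```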

Test coverage is comprehensive: every forwarded path has a mock-based test asserting that the kwargs reach the template. Adding the file to Apple Silicon CI ensures regression detection.

One implementation detail: line 607 (SimpleEngine text path) and line 380 (BatchedEngine) both retry on TypeError by removing all user-provided template kwargs. This is correct but slightly more aggressive than the original "tools only" approach. The exception is rare enough that this won't be a problem, and if a template silently ignores an unknown kwarg instead of raising TypeError, those kwargs pass through on the first try. This is an acceptable trade-off for simplicity.

Ready to merge from the implementation side.

Collaborator

Thump604 commented Apr 7, 2026

@waybarrios, @krystophny: independent technical review of this PR.

Verification of the fix

Confirmed against current upstream main (b4fa030). The diff plumbs chat_template_kwargs through every place it was previously dropped:

  1. vllm_mlx/api/models.py:172 adds the field to ChatCompletionRequest
  2. vllm_mlx/server.py:1422 forwards it from the request into chat_kwargs
  3. vllm_mlx/engine/simple.py forwards it through SimpleEngine chat() (LLM and MLLM branches), stream_chat() (MLLM and run_stream branches), and _stream_generate_text() (MTP path)
  4. vllm_mlx/engine/batched.py forwards it through BatchedEngine chat(), stream_chat(), and _compute_prefix_boundary() so per-template-kwargs prefix caching works correctly
  5. vllm_mlx/models/llm.py adds the parameter to MLXLanguageModel.chat() so the LLM path honors it

All template-apply call sites also gain a graceful fallback: if a tokenizer raises TypeError because it does not support a given kwarg, the failed kwargs are popped and the call retries.
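The fallback described above amounts to a try/retry wrapper; a minimal sketch (apply_with_fallback and strict_template are hypothetical names used only for illustration):

```python
def apply_with_fallback(apply_fn, messages, chat_template_kwargs):
    """Try the template with user kwargs; on TypeError, retry without them."""
    try:
        return apply_fn(messages, **chat_template_kwargs)
    except TypeError:
        # The tokenizer rejected an unknown kwarg: drop them all and retry.
        return apply_fn(messages)


def strict_template(messages, **kwargs):
    # Simulates a tokenizer whose template accepts no extra kwargs.
    if kwargs:
        raise TypeError(f"unexpected kwargs: {sorted(kwargs)}")
    return "plain-prompt"


result = apply_with_fallback(strict_template, [], {"enable_thinking": False})
print(result)  # plain-prompt
```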

Test coverage

tests/test_chat_template_kwargs.py adds 7 tests covering Pydantic field preservation, BatchedEngine _apply_chat_template, the HTTP endpoint via FakeEngine + TestClient, LLM chat applying kwargs before generate, SimpleEngine MLLM chat forwarding, and SimpleEngine _stream_generate_text applying kwargs. The CI workflow is updated to run the new test in the Apple Silicon job.

Why this matters

Per the PR description, before this branch chat_template_kwargs was honored on the batched path and the plain LLM chat path but silently dropped on simple-engine multimodal chat(), simple-engine multimodal stream_chat(), and the text-only MTP _stream_generate_text route. That means enable_thinking=false in chat_template_kwargs was being silently ignored on those three paths, which can cause Qwen 3.5 thinking-tag leakage in multimodal and MTP responses.
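A request exercising one of the previously broken paths would carry the field like this (a sketch; the model name is a placeholder, not a real deployment):

```python
import json

payload = {
    "model": "qwen-example",  # placeholder model name
    "messages": [{"role": "user", "content": "Describe this image."}],
    "chat_template_kwargs": {"enable_thinking": False},
}
body = json.dumps(payload)

# The fix ensures this field survives the trip from request parsing down
# to apply_chat_template instead of being silently dropped.
assert json.loads(body)["chat_template_kwargs"] == {"enable_thinking": False}
```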

Recommendation

Merge candidate. Real fix to a real silently-ignored API field, comprehensive plumbing across all relevant call sites, good test coverage, and the CI workflow update means the regression cannot return without someone disabling the test job.

Thump604 added a commit to Thump604/vllm-mlx that referenced this pull request Apr 9, 2026
…l generate+stream_generate

Pre-existing regression from an earlier rebase that dropped bdf7dcc's
llm.py additions. The server.py request handlers still pass top_k,
min_p, presence_penalty, repetition_penalty through to SimpleEngine,
which forwards them via **kwargs to MLXLanguageModel.chat() (which
accepts **kwargs) which then calls self.generate(..., **kwargs). But
MLXLanguageModel.generate() and stream_generate() had been left with
only (temperature, top_p, repetition_penalty) in their signatures, so
any non-MLLM SimpleEngine request crashed with:

    TypeError: MLXLanguageModel.stream_generate() got an unexpected
    keyword argument 'top_k'

Observed as 0/6 on simple-base, simple-mtp, and simple-spec profiles in
the feature matrix regression sweep after the Session 87 cherry-picks
of PRs waybarrios#248, waybarrios#229, waybarrios#218, waybarrios#222 landed. The cherry-picks did not cause
this regression — they exposed it by finally running the LLM-path
tests that no one had exercised since the rebase happened. Confirmed
via stderr.log:

  TypeError: MLXLanguageModel.generate() got an unexpected keyword argument 'top_k'
  TypeError: MLXLanguageModel.stream_generate() got an unexpected keyword argument 'top_k'

Fix: restore the signatures and bodies of _create_sampler,
_create_logits_processors, generate, and stream_generate to match
bdf7dcc's original intent. Preserves PR waybarrios#248's prompt_cache parameter
and non-str prompt support on stream_generate. Adds **kwargs to both
generate and stream_generate so future param additions degrade
gracefully instead of crashing.

This is a runtime-local fix. The equivalent upstream fix lives in
bdf7dcc which was never upstreamed (confirmed via
git merge-base --is-ancestor bdf7dcc upstream/main). A follow-up PR
to upstream could carry this forward.

Verification:
  bin/verify-patches: 33/33 clean
  Full feature matrix regression sweep pending re-run after this commit.

Related: runtime PR waybarrios#265 (waybarrios#265) fixed the
CompletionRequest schema side of the same bdf7dcc drop; this commit
fixes the engine-model side.
@krystophny krystophny force-pushed the fix/chat-template-kwargs-forwarding branch from 1e17fb1 to be2ba60 Compare April 9, 2026 06:35
@krystophny
Contributor Author

Force-pushed a refresh onto current upstream main (b4fa030). No logic change beyond the base refresh. Validation: python -m pytest tests/test_chat_template_kwargs.py -q -> 6 passed. The older tests/test_simple_engine.py validation command now depends on the separate async-harness refresh in #226 on current upstream, so I kept validation scoped to this PR's dedicated regression file.

Collaborator

Thump604 commented Apr 9, 2026

Refresh confirmed on head 3c33f72 against upstream main b4fa030. The only delta on top of the previously approved be2ba60 is the "style: format chat template kwargs tests" commit, which is a no-op on the forwarding logic. The SimpleEngine multimodal chat, stream_chat, _stream_generate_text, BatchedEngine chat, models/llm.py, and server.py wiring all match the previously reviewed shape.

CI green on lint, type-check, test-matrix 3.10-3.12, test-apple-silicon, tests. tests/test_chat_template_kwargs.py -> 6 passed on head. Prior APPROVED review at be2ba60 applies to the refreshed head.
