chat: forward chat_template_kwargs on simple-engine paths#218
krystophny wants to merge 4 commits into waybarrios:main
Conversation
Thump604
left a comment
Implementation is solid and addresses real coverage gaps. The forwarding is consistent across all simple-engine paths:
What works:
- API model field properly declared with optional dict[str, Any]
- SimpleEngine MLLM multimodal chat/stream_chat forward kwargs to model
- SimpleEngine text-only MTP route in _stream_generate_text applies kwargs
- LLMLanguageModel.chat applies kwargs with graceful TypeError fallback
- BatchedEngine properly merges kwargs and propagates to prefix boundary computation
- TypeError handling updated to remove arbitrary kwargs, not just tools
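The optional field declaration called out in the first bullet might look like the following minimal sketch. This is a hypothetical stand-in using a stdlib dataclass; the real project's request model (and whether it uses pydantic) is an assumption, and only the field under review is shown:

```python
from dataclasses import dataclass
from typing import Any, Optional

@dataclass
class ChatCompletionRequest:
    # Hypothetical request-model sketch, not the actual vllm_mlx class.
    # Defaulting to None keeps the field optional: clients that omit it
    # are unaffected, and engines can distinguish "absent" from "{}".
    chat_template_kwargs: Optional[dict[str, Any]] = None
```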
The pattern is defensive: `chat_template_kwargs = dict(kwargs.pop("chat_template_kwargs", {}) or {})` safely handles `None` and creates a fresh dict. The line 372 guard in BatchedEngine prevents `tools` from being inserted twice when merging kwargs.
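The defensive pattern above can be isolated into a small sketch (the helper name is illustrative, not a function in the PR):

```python
def pop_template_kwargs(**kwargs):
    # pop() with a {} default covers a missing key, `or {}` covers an
    # explicit None value, and dict() copies so later mutation cannot
    # leak back into the caller's dict.
    chat_template_kwargs = dict(kwargs.pop("chat_template_kwargs", {}) or {})
    return chat_template_kwargs, kwargs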
Test coverage is comprehensive — all paths have mocks covering the assertion. Adding to Apple Silicon CI ensures regression detection.
One implementation detail: line 607 (SimpleEngine text path) and line 380 (BatchedEngine) both retry on TypeError by removing all user-provided template kwargs. This is correct but slightly more aggressive than the original "tools only" approach. The exception is rare enough that this won't be a problem, but note that if a template silently ignores an unknown kwarg instead of raising TypeError, those kwargs pass through on the first try. This is an acceptable trade-off for simplicity.
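The retry-on-TypeError fallback discussed above can be sketched like this; the wrapper name and template callables are hypothetical stand-ins, not the actual vllm_mlx code:

```python
def render_with_fallback(apply_template, messages, chat_template_kwargs):
    try:
        # first attempt: pass every user-provided template kwarg through
        return apply_template(messages, **chat_template_kwargs)
    except TypeError:
        # older or stricter templates may not accept arbitrary kwargs;
        # retry with none of them, dropping all user-provided template
        # kwargs rather than just `tools`
        return apply_template(messages)
```

A template that accepts `**kwargs` never hits the fallback; only one that rejects an unknown keyword does, which matches the behavior the review describes.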
Ready to merge from the implementation side.
@waybarrios, @krystophny: independent technical review of this PR.

Verification of the fix: Confirmed against current upstream main (b4fa030). The diff plumbs `chat_template_kwargs` through the simple-engine paths that previously dropped it. All template-apply call sites also gain a graceful fallback: if a tokenizer raises `TypeError`, the call is retried without the extra kwargs.

Test coverage: comprehensive; the forwarded paths are exercised by tests/test_chat_template_kwargs.py, which also runs in Apple Silicon CI.

Why this matters: Per the PR description, before this branch `chat_template_kwargs` was only reliably honored on the batched path and the plain LLM chat path; simple-engine multimodal chat, multimodal stream chat, and the text-only MTP route silently dropped it.

Recommendation: Merge candidate. Real fix to a real silently-ignored API field, comprehensive plumbing across all relevant call sites, good test coverage, and the CI workflow update means the regression cannot return without someone disabling the test job.
…l generate+stream_generate

Pre-existing regression from an earlier rebase that dropped bdf7dcc's llm.py additions. The server.py request handlers still pass top_k, min_p, presence_penalty, and repetition_penalty through to SimpleEngine, which forwards them via **kwargs to MLXLanguageModel.chat() (which accepts **kwargs), which then calls self.generate(..., **kwargs). But MLXLanguageModel.generate() and stream_generate() had been left with only (temperature, top_p, repetition_penalty) in their signatures, so any non-MLLM SimpleEngine request crashed with:

    TypeError: MLXLanguageModel.stream_generate() got an unexpected keyword argument 'top_k'

Observed as 0/6 on simple-base, simple-mtp, and simple-spec profiles in the feature matrix regression sweep after the Session 87 cherry-picks of PRs waybarrios#248, waybarrios#229, waybarrios#218, and waybarrios#222 landed. The cherry-picks did not cause this regression; they exposed it by finally running the LLM-path tests that no one had exercised since the rebase. Confirmed via stderr.log:

    TypeError: MLXLanguageModel.generate() got an unexpected keyword argument 'top_k'
    TypeError: MLXLanguageModel.stream_generate() got an unexpected keyword argument 'top_k'

Fix: restore the signatures and bodies of _create_sampler, _create_logits_processors, generate, and stream_generate to match bdf7dcc's original intent. Preserves PR waybarrios#248's prompt_cache parameter and non-str prompt support on stream_generate. Adds **kwargs to both generate and stream_generate so future param additions degrade gracefully instead of crashing.

This is a runtime-local fix. The equivalent upstream fix lives in bdf7dcc, which was never upstreamed (confirmed via git merge-base --is-ancestor bdf7dcc upstream/main). A follow-up PR to upstream could carry this forward.

Verification: bin/verify-patches: 33/33 clean. Full feature matrix regression sweep pending re-run after this commit.
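The shape of that fix can be sketched as follows. This is an illustrative stand-in, not the real MLXLanguageModel signature; the point is that a trailing `**kwargs` absorbs sampler params the method does not yet recognize instead of raising TypeError:

```python
def stream_generate(prompt, temperature=0.0, top_p=1.0,
                    repetition_penalty=None, **kwargs):
    # Unrecognized sampler params (top_k, min_p, presence_penalty, ...)
    # land in kwargs rather than crashing the request; a real
    # implementation would feed the recognized ones to its sampler and
    # ignore or log the rest.
    yield f"token-for:{prompt}"
```

With this shape, a caller passing `top_k=40` degrades gracefully, which is exactly the failure mode the commit message describes for the pre-fix signatures.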
Related: runtime PR waybarrios#265 (waybarrios#265) fixed the CompletionRequest schema side of the same bdf7dcc drop; this commit fixes the engine-model side.
Force-pushed a refresh onto current upstream (1e17fb1 → be2ba60).
Refresh confirmed on head 3c33f72 against upstream main b4fa030. The only delta on top of the previously approved be2ba60 is the "style: format chat template kwargs tests" commit, which is a no-op on the forwarding logic. The SimpleEngine multimodal chat, stream_chat, _stream_generate_text, BatchedEngine chat, models/llm.py, and server.py wiring all match the previously reviewed shape. CI green on lint, type-check, test-matrix 3.10-3.12, test-apple-silicon, tests. tests/test_chat_template_kwargs.py -> 6 passed on head. The prior APPROVED review at be2ba60 applies to the refreshed head.
Summary

Honor `chat_template_kwargs` on the simple-engine paths that still ignored it and run the regression coverage in Apple Silicon CI.

Why

Before this branch, `chat_template_kwargs` was only reliably honored on the batched path and the plain LLM chat path. Simple-engine multimodal chat, multimodal stream chat, and the text-only MTP route still dropped the field.

What changed

- Forward `chat_template_kwargs` through simple-engine multimodal `chat()`
- Forward `chat_template_kwargs` through simple-engine multimodal `stream_chat()`
- Forward `chat_template_kwargs` into `_stream_generate_text()` for the text-only MTP route
- Run `tests/test_chat_template_kwargs.py` in Apple Silicon CI

Status

Rebased on `main` (b4fa030) on 2026-04-09

Files to review

- vllm_mlx/engine/simple.py
- .github/workflows/ci.yml
- tests/test_chat_template_kwargs.py

Validation

- `python -m pytest tests/test_chat_template_kwargs.py -q` -> 6 passed
- The `tests/test_simple_engine.py` validation command now depends on the separate async-harness refresh in #226 on current upstream, so I kept validation scoped to this PR's dedicated regression file here
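For context, a client request exercising the now-forwarded field might look like the payload below. The `chat_template_kwargs` field name comes from this PR; the specific template kwarg (`enable_thinking`) and model name are assumptions for illustration:

```python
import json

# Hypothetical chat-completions request body; before this branch, the
# chat_template_kwargs field was silently dropped on the simple-engine
# multimodal and text-only MTP paths.
payload = {
    "model": "some-model",
    "messages": [{"role": "user", "content": "hello"}],
    "chat_template_kwargs": {"enable_thinking": False},
}
body = json.dumps(payload)
```

After this PR, that dict reaches the tokenizer's template application on every simple-engine path, not just the batched and plain LLM chat paths.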