Add per-request enable_thinking API parameter#262
janhilgard wants to merge 1 commit into waybarrios:main from
Conversation
@waybarrios, @janhilgard: coordination note with my #213. #213 includes a (currently bundled) change to the `VLLM_MLX_ENABLE_THINKING` env var; this PR (#262) adds the per-request `enable_thinking` field. The two are complementary: the server-level env var sets the default, and the per-request field overrides it. The natural precedence order if both land is: per-request field, then env var, then the built-in default.
If you want, I can rebase #213 to land its env var change after this PR so the precedence is built in cleanly. Or this PR can land first and I will adjust the env var fallback to honor the per-request field. No file-level conflict between the two PRs at the moment. Both mergeable.
Allows controlling thinking/reasoning mode per-request via the `enable_thinking` field in `extra_body`.

Three-level priority:
1. Per-request: `extra_body.enable_thinking` (true/false)
2. Environment: `VLLM_MLX_ENABLE_THINKING` (true/false/1/0/yes/no)
3. Default: true

All code paths (SimpleEngine, BatchedEngine, MLLM) now consistently use the `VLLM_MLX_ENABLE_THINKING` env var as fallback, replacing the previous "coder" model-name heuristic.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
janhilgard force-pushed from 470a44b to cbca049 (compare)
Thanks for the heads-up @Thump604! I've just updated this PR to unify the fallback across all code paths: SimpleEngine, BatchedEngine, and MLLM now all read `VLLM_MLX_ENABLE_THINKING` as the fallback. The three-level priority is now consistent everywhere: per-request `enable_thinking`, then the env var, then the default of true.
Both PRs can land in any order without conflict — if #213 lands first, it establishes the env var; this PR already respects it. If this one lands first, the env var support is already built in.
Thump604 left a comment
LGTM: the per-request thinking control and the fallback chain (request → env → default True) are the right shape. The plumbing through every simple-engine path is consistent with the #218 `chat_template_kwargs` work, and the runtime test I just ran against a runtime carrying the #218 cherry-pick confirms the `chat_template_kwargs`-based thinking override still works, so landing #262 just gives users a more discoverable top-level field on top of that.
Two non-blocking notes:
- If #218 lands first, `_apply_chat_template` in `batched.py` will need a tiny rebase: #218's version sets `template_kwargs` via the `_merge_template_kwargs` helper and adds `chat_template_kwargs` keys that might collide with `enable_thinking`. Precedence question: if a caller sends both `enable_thinking: false` (top level) and `chat_template_kwargs: {enable_thinking: true}`, which wins? My read is that the dedicated top-level field should be the higher-precedence one since it's more specific, but I don't see that resolution codified. Happy with either order as long as it's documented.
- The `os.environ` fallback inside `_apply_chat_template` reads the env var on every call, which is fine for runtimes that set it once at startup. If anything ever mutates it per-request, the current shape is racy. Non-issue for the current codebase.
Approving.
Summary
- Adds an `enable_thinking` field to `ChatCompletionRequest` for per-request control of thinking/reasoning mode
- `false`: chat template rendered without `<think>` injection, reasoning parser bypassed
- `true` or omitted: existing behavior preserved (thinking enabled by default)
- Supported across all engine paths: SimpleEngine, BatchedEngine, MLXMultimodalLM
Motivation
Models like Qwen3/3.5 support `enable_thinking=False` in their chat templates to skip the thinking phase. This is useful when:

- avoiding `<think>` overhead

Without this change, `enable_thinking` is hardcoded to `True` with no way to override from the API.

Usage
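As an illustration of the per-request field (model name and prompt are placeholders): with the OpenAI Python client, `extra_body` keys are merged into the top-level request JSON, so the server sees `enable_thinking` as a top-level key.

```python
import json

# With the OpenAI client you would pass
#   extra_body={"enable_thinking": False}
# to client.chat.completions.create(...); on the wire that becomes a
# top-level key in the request body:
payload = {
    "model": "Qwen/Qwen3-4B",  # placeholder model name
    "messages": [{"role": "user", "content": "What is 2 + 2?"}],
    "enable_thinking": False,  # per-request override; omit for env/default
}
print(json.dumps(payload, indent=2))
```

Omitting the key entirely (rather than sending `null`) is what triggers the env-var/default fallback.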
Changes
- `api/models.py`: new `enable_thinking: bool | None = None` field
- `server.py`: handling for `False`
- `engine/simple.py`: `stream_chat` and `_stream_generate_text`
- `engine/batched.py`: `_apply_chat_template`, pass from `chat`/`stream_chat`
- `models/mllm.py`: `get_chat_template` calls
- `enable_thinking=true` produces reasoning content (Qwen3.5)
- `enable_thinking=false` produces direct answer without reasoning

🤖 Generated with Claude Code