
Add per-request enable_thinking API parameter#262

Closed
janhilgard wants to merge 1 commit into waybarrios:main from janhilgard:feat/enable-thinking-api

Conversation

@janhilgard
Collaborator

Summary

  • Adds enable_thinking field to ChatCompletionRequest for per-request control of thinking/reasoning mode
  • When false: chat template rendered without <think> injection, reasoning parser bypassed
  • When true or omitted: existing behavior preserved (thinking enabled by default)

Supported across all engine paths: SimpleEngine, BatchedEngine, MLXMultimodalLM.

Motivation

Models like Qwen3/3.5 support enable_thinking=False in their chat templates to skip the thinking phase. This is useful when:

  • Low latency is more important than reasoning quality
  • Clients want direct answers without <think> overhead
  • Applications need to toggle thinking per-request (e.g., simple vs complex questions)

Without this change, enable_thinking is hardcoded to True with no way to override from the API.

Usage

# Disable thinking
curl /v1/chat/completions -d '{"enable_thinking": false, "messages": [...]}'

# OpenAI Python client
client.chat.completions.create(
    extra_body={"enable_thinking": False},
    messages=[...]
)

Changes

File               Change
api/models.py      Add enable_thinking: bool | None = None field
server.py          Pass to engine kwargs; bypass reasoning parser when False
engine/simple.py   Read from kwargs in stream_chat and _stream_generate_text
engine/batched.py  Add to _apply_chat_template; pass from chat/stream_chat
models/mllm.py     Pop from kwargs; pass to get_chat_template calls

Test plan

  • enable_thinking=true produces reasoning content (Qwen3.5)
  • enable_thinking=false produces direct answer without reasoning
  • Default (omitted) preserves existing behavior
  • Streaming mode works correctly for both values
  • Tested on SimpleEngine (port 1237) and BatchedEngine (port 1238)
  • Tested on MLLM path (port 1240)

🤖 Generated with Claude Code

@Thump604
Collaborator

Thump604 commented Apr 7, 2026

@waybarrios, @janhilgard: coordination note with my #213.

#213 includes a (currently bundled) change to the enable_thinking heuristic in vllm_mlx/engine/simple.py that switches from a hardcoded "coder" in model_name string check to reading the VLLM_MLX_ENABLE_THINKING env var (server-level default).

This PR (#262) adds enable_thinking as a per-request field on ChatCompletionRequest (per-request override).

The two are complementary: the server-level env var sets the default, and the per-request field overrides it. The natural precedence order if both land is:

  1. Per-request enable_thinking from this PR (highest priority)
  2. Server-level VLLM_MLX_ENABLE_THINKING env var from #213
  3. Default True (current behavior)

If you want, I can rebase #213 to land its env var change after this PR so the precedence is built in cleanly. Or this PR can land first and I will adjust the env var fallback to honor the per-request field.

No file-level conflict between the two PRs at the moment. Both mergeable.

Allows controlling thinking/reasoning mode per-request via the
enable_thinking field in extra_body. Three-level priority:

1. Per-request: extra_body.enable_thinking (true/false)
2. Environment: VLLM_MLX_ENABLE_THINKING (true/false/1/0/yes/no)
3. Default: true

All code paths (SimpleEngine, BatchedEngine, MLLM) now consistently
use the VLLM_MLX_ENABLE_THINKING env var as fallback, replacing the
previous "coder" model name heuristic.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
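The three-level resolution described in this commit message can be sketched roughly as follows (the helper name and argument are illustrative, not the PR's actual code):

```python
import os


def resolve_enable_thinking(request_value=None):
    """Resolve thinking mode: per-request value, then env var, then default.

    `request_value` stands in for the request's `enable_thinking` field;
    None means the client did not specify it.
    """
    # 1. An explicit per-request value always wins.
    if request_value is not None:
        return request_value
    # 2. Fall back to the server-level env var (true/false/1/0/yes/no).
    raw = os.environ.get("VLLM_MLX_ENABLE_THINKING")
    if raw is not None:
        return raw.strip().lower() in ("true", "1", "yes")
    # 3. Default: thinking enabled.
    return True
```

Keeping the env-var read behind the `request_value is not None` check is what makes the per-request field an override rather than just another default.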
@janhilgard janhilgard force-pushed the feat/enable-thinking-api branch from 470a44b to cbca049 Compare April 8, 2026 07:52
@janhilgard
Collaborator Author

Thanks for the heads-up @Thump604!

I've just updated this PR to unify the fallback across all code paths — SimpleEngine.stream_chat, BatchedEngine._apply_chat_template, and both MLLM.chat/MLLM.stream_chat now consistently fall back to VLLM_MLX_ENABLE_THINKING env var (default true), replacing the old "coder" in model_name heuristic.

The three-level priority is now consistent everywhere:

  1. Per-request: extra_body.enable_thinking (this PR)
  2. Environment: VLLM_MLX_ENABLE_THINKING=true|false|1|0|yes|no (aligns with your #213)
  3. Default: true

Both PRs can land in any order without conflict — if #213 lands first, it establishes the env var; this PR already respects it. If this one lands first, the env var support is already built in.

@janhilgard janhilgard requested a review from Thump604 April 8, 2026 07:53
Collaborator

@Thump604 left a comment


LGTM: the per-request thinking control and the fallback chain (request → env → default True) are the right shape. The plumbing through every simple-engine path is consistent with the #218 chat_template_kwargs work, and the runtime test I just ran against a runtime carrying the #218 cherry-pick confirms the chat_template_kwargs-based thinking override still works. Landing #262 on top of that just gives users a more discoverable top-level field.

Two non-blocking notes:

  1. If #218 lands first, _apply_chat_template in batched.py will need a tiny rebase: #218's version sets template_kwargs via the _merge_template_kwargs helper and adds chat_template_kwargs keys that might collide with enable_thinking. Precedence question: if a caller sends both enable_thinking: false (top level) and chat_template_kwargs: {enable_thinking: true}, which wins? My read is that the dedicated top-level field should be the higher-precedence one since it's more specific, but I don't see that resolution codified. Happy with either order as long as it's documented.

  2. The os.environ fallback inside _apply_chat_template reads the env var on every call, which is fine for runtimes that set it once at startup. If anything ever mutates it per-request, the current shape is racy. Non-issue for the current codebase.
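One way the precedence question in note 1 could be codified (a sketch only; `build_template_kwargs` is a hypothetical helper, and only `_merge_template_kwargs` comes from #218, whose actual behavior isn't shown here):

```python
def build_template_kwargs(enable_thinking=None, chat_template_kwargs=None):
    """Merge template kwargs so the dedicated top-level field wins.

    Sketch of the precedence proposed in the review, not code from
    either PR: top-level enable_thinking > chat_template_kwargs value
    > default True.
    """
    merged = dict(chat_template_kwargs or {})
    # The more specific top-level field overrides any value that arrived
    # via chat_template_kwargs.
    if enable_thinking is not None:
        merged["enable_thinking"] = enable_thinking
    # If neither path supplied a value, keep the documented default.
    merged.setdefault("enable_thinking", True)
    return merged
```

Writing the resolution down like this (and testing both collision directions) would settle the documentation question either way.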

Approving.

@janhilgard
Collaborator Author

@Thump604 Superseded — per-request enable_thinking parameter is already in main via #278. Closing.

@janhilgard janhilgard closed this Apr 11, 2026
