
Add per-request enable_thinking API parameter#262

Closed
janhilgard wants to merge 1 commit into waybarrios:main from janhilgard:feat/enable-thinking-api

Conversation

@janhilgard
Collaborator

Summary

  • Adds enable_thinking field to ChatCompletionRequest for per-request control of thinking/reasoning mode
  • When false: chat template rendered without <think> injection, reasoning parser bypassed
  • When true or omitted: existing behavior preserved (thinking enabled by default)

Supported across all engine paths: SimpleEngine, BatchedEngine, MLXMultimodalLM.

Motivation

Models like Qwen3/3.5 support enable_thinking=False in their chat templates to skip the thinking phase. This is useful when:

  • Low latency is more important than reasoning quality
  • Clients want direct answers without <think> overhead
  • Applications need to toggle thinking per-request (e.g., simple vs complex questions)

Without this change, enable_thinking is hardcoded to True with no way to override from the API.

Usage

# Disable thinking
curl /v1/chat/completions -d '{"enable_thinking": false, "messages": [...]}'

# OpenAI Python client
client.chat.completions.create(
    extra_body={"enable_thinking": False},
    messages=[...]
)

Changes

File               Change
api/models.py      Add enable_thinking: bool | None = None field
server.py          Pass to engine kwargs; bypass reasoning parser when False
engine/simple.py   Read from kwargs in stream_chat and _stream_generate_text
engine/batched.py  Add to _apply_chat_template; pass from chat/stream_chat
models/mllm.py     Pop from kwargs; pass to get_chat_template calls

Test plan

  • enable_thinking=true produces reasoning content (Qwen3.5)
  • enable_thinking=false produces direct answer without reasoning
  • Default (omitted) preserves existing behavior
  • Streaming mode works correctly for both values
  • Tested on SimpleEngine (port 1237) and BatchedEngine (port 1238)
  • Tested on MLLM path (port 1240)

🤖 Generated with Claude Code

@Thump604
Collaborator

Thump604 commented Apr 7, 2026

@waybarrios, @janhilgard: coordination note with my #213.

#213 includes a (currently bundled) change to the enable_thinking heuristic in vllm_mlx/engine/simple.py that switches from a hardcoded "coder" in model_name string check to reading the VLLM_MLX_ENABLE_THINKING env var (server-level default).

This PR (#262) adds enable_thinking as a per-request field on ChatCompletionRequest (per-request override).

The two are complementary: the server-level env var sets the default, and the per-request field overrides it. The natural precedence order if both land is:

  1. Per-request enable_thinking from this PR (highest priority)
  2. Server-level VLLM_MLX_ENABLE_THINKING env var from #213
  3. Default True (current behavior)

If you want, I can rebase #213 to land its env var change after this PR so the precedence is built in cleanly. Or this PR can land first and I will adjust the env var fallback to honor the per-request field.

No file-level conflict between the two PRs at the moment. Both mergeable.

Allows controlling thinking/reasoning mode per-request via the
enable_thinking field in extra_body. Three-level priority:

1. Per-request: extra_body.enable_thinking (true/false)
2. Environment: VLLM_MLX_ENABLE_THINKING (true/false/1/0/yes/no)
3. Default: true

All code paths (SimpleEngine, BatchedEngine, MLLM) now consistently
use the VLLM_MLX_ENABLE_THINKING env var as fallback, replacing the
previous "coder" model name heuristic.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
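The three-level resolution described in this commit message can be sketched roughly as follows (the helper name and argument are illustrative, not the PR's actual code):

```python
import os


def resolve_enable_thinking(request_value=None):
    """Resolve thinking mode: per-request value, then env var, then default.

    `request_value` stands in for the request's `enable_thinking` field;
    None means the client did not specify it.
    """
    # 1. An explicit per-request value always wins.
    if request_value is not None:
        return request_value
    # 2. Fall back to the server-level env var (true/false/1/0/yes/no).
    raw = os.environ.get("VLLM_MLX_ENABLE_THINKING")
    if raw is not None:
        return raw.strip().lower() in ("true", "1", "yes")
    # 3. Default: thinking enabled.
    return True
```

Keeping the env-var read behind the `request_value is not None` check is what makes the per-request field an override rather than just another default.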
@janhilgard janhilgard force-pushed the feat/enable-thinking-api branch from 470a44b to cbca049 Compare April 8, 2026 07:52
@janhilgard
Collaborator Author

Thanks for the heads-up @Thump604!

I've just updated this PR to unify the fallback across all code paths — SimpleEngine.stream_chat, BatchedEngine._apply_chat_template, and both MLLM.chat/MLLM.stream_chat now consistently fall back to VLLM_MLX_ENABLE_THINKING env var (default true), replacing the old "coder" in model_name heuristic.

The three-level priority is now consistent everywhere:

  1. Per-request: extra_body.enable_thinking (this PR)
  2. Environment: VLLM_MLX_ENABLE_THINKING=true|false|1|0|yes|no (aligns with your #213)
  3. Default: true

Both PRs can land in any order without conflict — if #213 lands first, it establishes the env var; this PR already respects it. If this one lands first, the env var support is already built in.

@janhilgard janhilgard requested a review from Thump604 April 8, 2026 07:53
Collaborator

@Thump604 left a comment


LGTM: the per-request thinking control and the fallback chain (request → env → default True) are the right shape. The plumbing through every simple-engine path is consistent with the #218 chat_template_kwargs work, and the runtime test I just ran against a runtime carrying the #218 cherry-pick confirms the chat_template_kwargs-based thinking override still works. Landing #262 on top of that just gives users a more discoverable top-level field.

Two non-blocking notes:

  1. If #218 lands first, _apply_chat_template in batched.py will need a tiny rebase: #218's version sets template_kwargs via the _merge_template_kwargs helper and adds chat_template_kwargs keys that might collide with enable_thinking. Precedence question: if a caller sends both enable_thinking: false (top level) and chat_template_kwargs: {enable_thinking: true}, which wins? My read is that the dedicated top-level field should be the higher-precedence one since it's more specific, but I don't see that resolution codified. Happy with either order as long as it's documented.

  2. The os.environ fallback inside _apply_chat_template reads the env var on every call, which is fine for runtimes that set it once at startup. If anything ever mutates it per-request, the current shape is racy. Non-issue for the current codebase.
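One way the precedence question in note 1 could be codified (a sketch only; `build_template_kwargs` is a hypothetical helper, and only `_merge_template_kwargs` comes from #218, whose actual behavior isn't shown here):

```python
def build_template_kwargs(enable_thinking=None, chat_template_kwargs=None):
    """Merge template kwargs so the dedicated top-level field wins.

    Sketch of the precedence proposed in the review, not code from
    either PR: top-level enable_thinking > chat_template_kwargs value
    > default True.
    """
    merged = dict(chat_template_kwargs or {})
    # The more specific top-level field overrides any value that arrived
    # via chat_template_kwargs.
    if enable_thinking is not None:
        merged["enable_thinking"] = enable_thinking
    # If neither path supplied a value, keep the documented default.
    merged.setdefault("enable_thinking", True)
    return merged
```

Writing the resolution down like this (and testing both collision directions) would settle the documentation question either way.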

Approving.

@janhilgard
Collaborator Author

@Thump604 Superseded — per-request enable_thinking parameter is already in main via #278. Closing.

@janhilgard janhilgard closed this Apr 11, 2026
