Summary
Reasoning (<think>) output from mlx-community/Qwen3.5-27B-8bit appears broken when the server runs in concurrent mode (--concurrent N) with the WebUI enabled (-w).

Environment

- Model: mlx-community/Qwen3.5-27B-8bit
- Flags: --concurrent <N> + -w (WebUI)
- Version: v0.9.10-628c2bb (release nightly-20260408-628c2bb)

Repro
Start the server with --concurrent <N> and -w, then submit a prompt via the browser WebUI.
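For the direct-HTTP variant of this repro (bypassing the WebUI), the raw SSE stream can be inspected to see whether reasoning deltas arrive at all. This is a hypothetical sketch assuming the OpenAI-compatible chat.completion.chunk shape; the function name collect_deltas is illustrative, not part of the server:

```python
import json

def collect_deltas(sse_lines):
    """Accumulate reasoning vs. content text from OpenAI-style SSE lines.

    Assumes each event is `data: <json>` with choices[0].delta optionally
    carrying `reasoning_content` and `content` fields (hypothetical shape).
    """
    reasoning, content = [], []
    for line in sse_lines:
        if not line.startswith("data: ") or line.strip() == "data: [DONE]":
            continue
        delta = json.loads(line[len("data: "):])["choices"][0].get("delta", {})
        reasoning.append(delta.get("reasoning_content") or "")
        content.append(delta.get("content") or "")
    return "".join(reasoning), "".join(content)

# Canned events: if reasoning_content deltas show up here but not in the
# live stream, the breakage is server-side rather than in the WebUI.
events = [
    'data: {"choices":[{"delta":{"reasoning_content":"plan"}}]}',
    'data: {"choices":[{"delta":{"content":"Hi"}}]}',
    'data: [DONE]',
]
print(collect_deltas(events))  # -> ('plan', 'Hi')
```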
Expected
Model emits <think>...</think> content, which gets extracted into reasoning_content (streaming and non-streaming), with natural-language content following.

Observed
Reasoning output appears broken. (Exact symptoms to be filled in with a reproducer — e.g., missing <think> tags, reasoning_content empty when it should not be, malformed structure, or content/reasoning interleaving incorrectly.)

Suspected area
Concurrent + WebUI is a code path with overlapping state:
- BatchScheduler per-slot token streaming — think-tag extraction is applied downstream of StreamChunk emission in MLXChatCompletionsController. Any per-slot boundary handling (think-buffer carry between chunks) would need to be independent per request.
- WebUI frontend — the llama.cpp webui may strip or mishandle reasoning_content / <think> tags if they arrive via non-standard SSE fields.
- Chat template — Qwen3.5-27B-8bit's chat_template.jinja <think> handling may interact with concurrent mode's prompt processing differently than serial mode.
- Per-request state — effectiveResponseFormat and per-request think-extract state are initialized per slot; verify the think buffer isn't shared across slots.
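The per-slot carry concern in the first bullet can be made concrete. Below is a hypothetical sketch (not the server's actual code) of a streaming extractor whose buffer holds a partial tag across chunk boundaries; each concurrent request must own its own instance, since a shared buffer would interleave reasoning from different slots:

```python
class ThinkStreamExtractor:
    """Streaming <think> splitter (illustrative sketch, not server code).

    `buffer` carries a partial tag such as "<thi" across chunk boundaries;
    this is exactly the state that must be request-local in concurrent mode.
    """

    OPEN, CLOSE = "<think>", "</think>"

    def __init__(self):
        self.buffer = ""
        self.in_think = False

    def feed(self, chunk):
        """Return (reasoning_delta, content_delta) for one streamed chunk."""
        self.buffer += chunk
        reasoning, content = [], []
        while True:
            tag = self.CLOSE if self.in_think else self.OPEN
            idx = self.buffer.find(tag)
            if idx >= 0:  # complete tag: emit text before it, flip state
                (reasoning if self.in_think else content).append(self.buffer[:idx])
                self.buffer = self.buffer[idx + len(tag):]
                self.in_think = not self.in_think
                continue
            # no complete tag: hold back any suffix that could start one
            keep = 0
            for k in range(min(len(tag) - 1, len(self.buffer)), 0, -1):
                if tag.startswith(self.buffer[-k:]):
                    keep = k
                    break
            cut = len(self.buffer) - keep
            (reasoning if self.in_think else content).append(self.buffer[:cut])
            self.buffer = self.buffer[cut:]
            return "".join(reasoning), "".join(content)
```

For example, feeding "<thi" returns ("", "") because the partial tag is held back, and a following "nk>plan</think>Hi" returns ("plan", "Hi").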
Validation questions

- Does this reproduce in serial mode (without --concurrent)?
- Does this reproduce without -w (direct HTTP client)?
- Does the raw SSE stream contain reasoning_content deltas, or is the issue purely in the WebUI rendering layer?
- Does this reproduce with other models whose chat template uses <think> (e.g., other Qwen3.5 variants)?

Related
Known concurrent-mode regression to fix first
A separate TopPSampler 1D crash in concurrent mode (fixed locally, not yet published) may be masking this issue during reproduction — any request with top_p < 1 hits a [squeeze] fatal error before reasoning output is evaluated. The WebUI default is top_p=0.95, which triggers that crash. Once the TopPSampler fix is published in the next nightly, re-test this issue to confirm the reasoning breakage is independent.
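For context on why the WebUI default trips the sampler bug: any top_p < 1 takes the nucleus-filtering path, while top_p = 1 can skip it entirely. A pure-Python sketch of that path (illustrative only, not the server's TopPSampler; it handles a single 1-D logits list, whereas the batched concurrent path must apply the same logic per row rather than squeezing):

```python
import math

def top_p_indices(logits, top_p):
    """Indices of tokens kept by nucleus (top-p) filtering.

    Illustrative sketch: softmax the logits, sort descending, and keep
    tokens until cumulative probability first exceeds top_p.
    """
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    order = sorted(range(len(logits)), key=lambda i: -probs[i])
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum > top_p:  # nucleus reached; the top token is always kept
            break
    return kept
```

With top_p = 1.0 every token survives, so a request at the WebUI default of 0.95 exercises strictly more sampler code than one at 1.0, consistent with the crash appearing only for top_p < 1.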