Reasoning appears broken with --concurrent + WebUI on Qwen3.5-27B-8bit #99

@scouzi1966

Description

Summary

Reasoning (<think>) output from mlx-community/Qwen3.5-27B-8bit appears broken when the server runs in concurrent mode (--concurrent N) with the WebUI enabled (-w).

Environment

  • Model: mlx-community/Qwen3.5-27B-8bit
  • Flags: --concurrent <N> + -w (WebUI)
  • Nightly: v0.9.10-628c2bb (release nightly-20260408-628c2bb)
  • Platform: Apple Silicon, macOS 26+

Repro

MACAFM_MLX_MODEL_CACHE=/Volumes/edata/models/vesta-test-cache \
  afm mlx -m mlx-community/Qwen3.5-27B-8bit --concurrent 15 -w

Then submit a prompt via the browser WebUI.

Expected

Model emits <think>...</think> content which gets extracted into reasoning_content (streaming and non-streaming), with natural-language content following.
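
For reference, the expected non-streaming message shape (field names as described above, assuming an OpenAI-compatible response; values are illustrative only):

```json
{
  "message": {
    "role": "assistant",
    "reasoning_content": "Let me work through the request...",
    "content": "Here is the answer."
  }
}
```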

Observed

Reasoning output appears broken. (Exact symptoms to be filled in with a reproducer — e.g., missing <think> tags, reasoning_content empty when it should not be, malformed structure, or content/reasoning interleaving incorrectly.)

Suspected area

Concurrent + WebUI is a code path with overlapping state:

  1. BatchScheduler per-slot token streaming — think-tag extraction is applied downstream of StreamChunk emission in MLXChatCompletionsController. Any per-slot boundary handling (think-buffer carry between chunks) would need to be independent per request.
  2. WebUI frontend — the llama.cpp webui may strip or mishandle reasoning_content / <think> tags if they arrive via non-standard SSE fields.
  3. Chat template — Qwen3.5-27B-8bit's chat_template.jinja <think> handling may interact with concurrent mode's prompt processing differently than serial mode.
  4. Relation to the #97 fix ("--guided-json CLI flag is silently ignored, model produces unconstrained output") — the batch completions controller now uses effectiveResponseFormat, and per-request think-extract state is initialized per slot; verify the think buffer isn't shared across slots.
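
To make the per-slot state concern in points 1 and 4 concrete, here is a minimal sketch of a streaming think-tag extractor. The names (ThinkExtractor, feed) are hypothetical and not the actual MLXChatCompletionsController code; the point is that the carry buffer and the in/out-of-think flag must be owned by each slot, because tags can split across token chunks, and sharing either across slots interleaves requests.

```python
class ThinkExtractor:
    """Splits a token stream into (reasoning_content, content) deltas.

    Each concurrent slot must own its OWN instance: `buf` carries a
    partial tag (e.g. "<thi") between chunks, and `in_think` tracks
    whether we are inside <think>...</think>. Sharing either across
    slots mixes reasoning from different requests together.
    """

    OPEN, CLOSE = "<think>", "</think>"

    def __init__(self):
        self.buf = ""         # partial-tag carry between chunks
        self.in_think = False

    def feed(self, chunk: str):
        """Return (reasoning_delta, content_delta) for this chunk."""
        self.buf += chunk
        reasoning, content = [], []
        while self.buf:
            tag = self.CLOSE if self.in_think else self.OPEN
            i = self.buf.find(tag)
            if i >= 0:
                # Emit text up to the tag, consume the tag, flip state.
                (reasoning if self.in_think else content).append(self.buf[:i])
                self.buf = self.buf[i + len(tag):]
                self.in_think = not self.in_think
                continue
            # No complete tag: hold back a possible tag prefix at the end
            # so a tag split across chunks is not emitted as plain text.
            keep = 0
            for k in range(1, len(tag)):
                if self.buf.endswith(tag[:k]):
                    keep = k
            emit = self.buf[:len(self.buf) - keep]
            (reasoning if self.in_think else content).append(emit)
            self.buf = self.buf[len(self.buf) - keep:] if keep else ""
            break
        return "".join(reasoning), "".join(content)
```

With tags split across chunks, `feed("<thi")` emits nothing, `feed("nk>plan</th")` emits `("plan", "")`, and `feed("ink>answer")` emits `("", "answer")` — a shared instance fed tokens from two slots would instead stitch their reasoning together.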

Validation questions

  • Does the same model + same prompt work correctly in serial mode (no --concurrent)?
  • Does it work correctly in concurrent mode without -w (direct HTTP client)?
  • Does the raw SSE stream contain reasoning_content deltas, or is the issue purely in the WebUI rendering layer?
  • Does it reproduce across models that share the same <think> template (e.g., other Qwen3.5 variants)?
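
To help answer the third question offline, a small checker for a captured raw stream (e.g. saved with curl -N to a file), assuming OpenAI-style chat.completion.chunk SSE payloads with the reasoning_content/content delta fields this issue describes; the function name is illustrative:

```python
import json

def classify_sse(raw: str):
    """Count reasoning_content vs content deltas in a captured SSE body."""
    counts = {"reasoning_content": 0, "content": 0}
    for line in raw.splitlines():
        # SSE frames look like: data: {...}; the stream ends with data: [DONE]
        if not line.startswith("data: ") or line == "data: [DONE]":
            continue
        delta = json.loads(line[len("data: "):])["choices"][0].get("delta", {})
        for field in counts:
            if delta.get(field):
                counts[field] += 1
    return counts
```

If reasoning_content deltas are absent from the raw stream while serial mode produces them, the bug is server-side; if the raw stream looks correct, suspect the WebUI rendering layer.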

Related

Known concurrent-mode regression to fix first

A separate TopPSampler 1D crash in concurrent mode (fixed locally, not yet published) may be masking this issue during reproduction: any request with top_p < 1 hits a [squeeze] fatal error before reasoning output is evaluated. The WebUI defaults to top_p=0.95, which triggers that crash. Once the TopPSampler fix ships in the next nightly, re-test this issue to confirm the reasoning breakage is independent.
