Reasoning appears broken with --concurrent + WebUI on Qwen3.5-27B-8bit #99

@scouzi1966

Description

Summary

Reasoning (<think>) output from mlx-community/Qwen3.5-27B-8bit appears broken when the server runs in concurrent mode (--concurrent N) with the WebUI enabled (-w).

Environment

  • Model: mlx-community/Qwen3.5-27B-8bit
  • Flags: --concurrent <N> + -w (WebUI)
  • Nightly: v0.9.10-628c2bb (release nightly-20260408-628c2bb)
  • Platform: Apple Silicon, macOS 26+

Repro

MACAFM_MLX_MODEL_CACHE=/Volumes/edata/models/vesta-test-cache \
  afm mlx -m mlx-community/Qwen3.5-27B-8bit --concurrent 15 -w

Then submit a prompt via the browser WebUI.

Expected

Model emits <think>...</think> content which gets extracted into reasoning_content (streaming and non-streaming), with natural-language content following.
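
For reference, the expected non-streaming message shape (field names as described above, assuming an OpenAI-compatible response; values are illustrative only):

```json
{
  "message": {
    "role": "assistant",
    "reasoning_content": "Let me work through the request...",
    "content": "Here is the answer."
  }
}
```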

Observed

Reasoning output appears broken. (Exact symptoms to be filled in with a reproducer — e.g., missing <think> tags, reasoning_content empty when it should not be, malformed structure, or content/reasoning interleaving incorrectly.)

Suspected area

Concurrent + WebUI is a code path with overlapping state:

  1. BatchScheduler per-slot token streaming — think-tag extraction is applied downstream of StreamChunk emission in MLXChatCompletionsController. Any per-slot boundary handling (think-buffer carry between chunks) would need to be independent per request.
  2. WebUI frontend — the llama.cpp webui may strip or mishandle reasoning_content / <think> tags if they arrive via non-standard SSE fields.
  3. Chat template — Qwen3.5-27B-8bit's chat_template.jinja <think> handling may interact with concurrent mode's prompt processing differently than serial mode.
  4. Relation to the #97 fix ("--guided-json CLI flag is silently ignored, model produces unconstrained output") — the batch completions controller now uses effectiveResponseFormat, and per-request think-extract state is initialized per slot; verify the think buffer isn't shared across slots.
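
To make the per-slot state concern in points 1 and 4 concrete, here is a minimal sketch of a streaming think-tag extractor. The names (ThinkExtractor, feed) are hypothetical and not the actual MLXChatCompletionsController code; the point is that the carry buffer and the in/out-of-think flag must be owned by each slot, because tags can split across token chunks, and sharing either across slots interleaves requests.

```python
class ThinkExtractor:
    """Splits a token stream into (reasoning_content, content) deltas.

    Each concurrent slot must own its OWN instance: `buf` carries a
    partial tag (e.g. "<thi") between chunks, and `in_think` tracks
    whether we are inside <think>...</think>. Sharing either across
    slots mixes reasoning from different requests together.
    """

    OPEN, CLOSE = "<think>", "</think>"

    def __init__(self):
        self.buf = ""         # partial-tag carry between chunks
        self.in_think = False

    def feed(self, chunk: str):
        """Return (reasoning_delta, content_delta) for this chunk."""
        self.buf += chunk
        reasoning, content = [], []
        while self.buf:
            tag = self.CLOSE if self.in_think else self.OPEN
            i = self.buf.find(tag)
            if i >= 0:
                # Emit text up to the tag, consume the tag, flip state.
                (reasoning if self.in_think else content).append(self.buf[:i])
                self.buf = self.buf[i + len(tag):]
                self.in_think = not self.in_think
                continue
            # No complete tag: hold back a possible tag prefix at the end
            # so a tag split across chunks is not emitted as plain text.
            keep = 0
            for k in range(1, len(tag)):
                if self.buf.endswith(tag[:k]):
                    keep = k
            emit = self.buf[:len(self.buf) - keep]
            (reasoning if self.in_think else content).append(emit)
            self.buf = self.buf[len(self.buf) - keep:] if keep else ""
            break
        return "".join(reasoning), "".join(content)
```

With tags split across chunks, `feed("<thi")` emits nothing, `feed("nk>plan</th")` emits `("plan", "")`, and `feed("ink>answer")` emits `("", "answer")` — a shared instance fed tokens from two slots would instead stitch their reasoning together.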

Validation questions

  • Does the same model + same prompt work correctly in serial mode (no --concurrent)?
  • Does it work correctly in concurrent mode without -w (direct HTTP client)?
  • Does the raw SSE stream contain reasoning_content deltas, or is the issue purely in the WebUI rendering layer?
  • Does it reproduce across models that share the same <think> template (e.g., other Qwen3.5 variants)?
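
To help answer the third question offline, a small checker for a captured raw stream (e.g. saved with curl -N to a file), assuming OpenAI-style chat.completion.chunk SSE payloads with the reasoning_content/content delta fields this issue describes; the function name is illustrative:

```python
import json

def classify_sse(raw: str):
    """Count reasoning_content vs content deltas in a captured SSE body."""
    counts = {"reasoning_content": 0, "content": 0}
    for line in raw.splitlines():
        # SSE frames look like: data: {...}; the stream ends with data: [DONE]
        if not line.startswith("data: ") or line == "data: [DONE]":
            continue
        delta = json.loads(line[len("data: "):])["choices"][0].get("delta", {})
        for field in counts:
            if delta.get(field):
                counts[field] += 1
    return counts
```

If reasoning_content deltas are absent from the raw stream while serial mode produces them, the bug is server-side; if the raw stream looks correct, suspect the WebUI rendering layer.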

Related

Known concurrent-mode regression to fix first

A separate TopPSampler 1D crash in concurrent mode (fixed locally, not yet published) may be masking this issue during reproduction: any request with top_p < 1 hits a [squeeze] fatal error before reasoning output is evaluated. The WebUI defaults to top_p=0.95, which triggers that crash. Once the TopPSampler fix ships in the next nightly, re-test this issue to confirm the reasoning breakage is independent.
