Concurrent x8 prefix cache + grammar returns empty responses #86

@scouzi1966

Description

Summary

When 8 or more simultaneous requests share a common prompt prefix and use response_format=json_schema with strict: true, all requests return empty responses. All tested models are affected (Qwen3.5, Qwen3-Coder, Gemma 4).

Reproduction

afm mlx -m mlx-community/Qwen3.5-35B-A3B-4bit --concurrent 15 --enable-prefix-caching --enable-grammar-constraints --port 9998

Assertion test results, Section 6 ("Concurrent x8 shared-prefix"):

❌ Concurrent x8 shared-prefix: uncached suffix remains on every branch
   Expected: 0 < cached_tokens < prompt_tokens for all 8
   Actual:   FAIL: 2:0/59, 3:0/59, 4:0/59, 5:0/59, 6:0/59, 7:0/59, 8:0/59

❌ Concurrent x8 shared-prefix: divergent suffix responses stay isolated
   Expected: each of 8 responses keeps only its own marker
   Actual:   FAIL: all 8 return empty (Expecting value: line 1 column 1)

Root Cause Analysis

Two separate issues compound:

1. Prefix cache timing (cached_tokens=0)

When 8 requests arrive simultaneously, they all enter prefillOne at roughly the same time. Request 1 has not yet completed its prefill (and so has not saved anything to the radix cache) when requests 2-8 perform their radix lookup, so all of them miss. No request benefits from prefix sharing.

This is a timing issue in BatchScheduler.prefillOne(): the radix cache save happens in finishSlot() (after decode completes) rather than immediately after prefill, so simultaneous arrivals can never hit each other's prefix cache.
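The ordering problem can be reproduced with a self-contained toy (ToyRadixCache and the token values here are illustrative stand-ins, not the project's real types): when every request looks up before any request saves, all lookups miss; saving right after each "prefill" lets later arrivals hit the shared prefix.

```swift
// Toy model of the race (hypothetical names, not the real BatchScheduler).
final class ToyRadixCache {
    private var stored: [[Int]] = []
    // Length of the longest common prefix with any stored token sequence.
    func longestPrefixHit(_ tokens: [Int]) -> Int {
        var best = 0
        for entry in stored {
            let n = zip(entry, tokens).prefix { $0.0 == $0.1 }.count
            best = max(best, n)
        }
        return best
    }
    func insert(_ tokens: [Int]) { stored.append(tokens) }
}

let sharedPrefix = Array(0..<50)
let requests = (0..<8).map { sharedPrefix + [1000 + $0] }  // divergent suffixes

// Current behavior: every lookup runs before any save, because the save is
// deferred to finishSlot() after decode. All 8 requests miss.
let cacheA = ToyRadixCache()
let hitsA = requests.map { cacheA.longestPrefixHit($0) }
requests.forEach { cacheA.insert($0) }  // too late to help anyone

// Fixed ordering: each request saves immediately after its prefill, so the
// next arrival's (serialized) lookup can hit the 50-token shared prefix.
let cacheB = ToyRadixCache()
var hitsB: [Int] = []
for r in requests {
    hitsB.append(cacheB.longestPrefixHit(r))
    cacheB.insert(r)
}

print(hitsA)  // [0, 0, 0, 0, 0, 0, 0, 0]
print(hitsB)  // [0, 50, 50, 50, 50, 50, 50, 50]
```

This matches the observed `2:0/59, 3:0/59, …` output: every branch reports cached_tokens=0 even though 50+ prompt tokens are shared.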

2. Empty responses with grammar constraints

All 8 requests return empty responses (Expecting value: line 1 column 1). The requests use response_format=json_schema with strict: true, which creates an xgrammar matcher per request. With 8 concurrent grammar-constrained requests, either:

  • The xgrammar engine has a thread-safety issue under concurrent access
  • The grammar matcher state gets corrupted when multiple requests share the generation loop
  • The BatchScheduler's grammar constraint handling doesn't isolate per-slot state correctly

Without grammar constraints (strict: false or no --enable-grammar-constraints), the 8 concurrent requests succeed (content is generated). The issue is specifically the interaction between concurrent batch decode + grammar enforcement.
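The third hypothesis (per-slot state not isolated) can be illustrated with a toy matcher. ToyMatcher below is hypothetical and is not xgrammar's real API; it only shows why a stateful grammar automaton shared across slots rejects every stream, while per-slot matchers succeed.

```swift
// Hypothetical sketch: a grammar matcher is a stateful automaton, so it
// must be owned per slot. This ToyMatcher accepts the fixed token
// sequence [1, 2, 3] (a stand-in for a JSON-schema grammar).
final class ToyMatcher {
    private let expected = [1, 2, 3]
    private var pos = 0
    func accept(_ token: Int) -> Bool {
        guard pos < expected.count, expected[pos] == token else { return false }
        pos += 1
        return true
    }
}

// 8 slots, each generating the same valid sequence [1, 2, 3].
let streams = Array(repeating: [1, 2, 3], count: 8)

// Shared matcher: round-robin batch decode feeds it an interleaving of all
// 8 streams (1,1,1,...). After slot 0's first token advances the state,
// every other slot's token is rejected.
let shared = ToyMatcher()
var sharedOK = true
for step in 0..<3 {
    for s in streams { sharedOK = shared.accept(s[step]) && sharedOK }
}

// Per-slot matchers: each slot advances its own automaton independently,
// and every stream is accepted despite the interleaved decode order.
let perSlot = streams.map { _ in ToyMatcher() }
var isolatedOK = true
for step in 0..<3 {
    for (i, s) in streams.enumerated() {
        isolatedOK = perSlot[i].accept(s[step]) && isolatedOK
    }
}

print(sharedOK, isolatedOK)  // false true
```

If the real bug follows this shape, the fix is to ensure BatchScheduler creates (and rolls back or resets) one xgrammar matcher per slot, never sharing matcher state across the batched generation loop.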

Affected Models (confirmed)

Model                                       Pass rate (overall)   This bug?
mlx-community/gemma-4-e4b-it-4bit           92%                   Yes
mlx-community/gemma-4-26B-A4B-it-mlx-4bit   89%                   Yes
mlx-community/Qwen3.5-35B-A3B-4bit          97%                   Yes
mlx-community/Qwen3-Coder-Next-4bit         95%                   Yes

Not related to

  • Tool calling — this is response_format=json_schema, not tool calls
  • BatchRotatingKVCache — happens with both KVCacheSimple and RotatingKVCache models
  • Type coercion — separate issue
  • Prefix cache correctness — sequential requests cache correctly; only simultaneous arrivals miss

Scope

  • Sources/MacLocalAPI/Models/BatchScheduler.swift — prefix cache save timing, grammar constraint per-slot isolation
  • Sources/MacLocalAPI/Services/XGrammarService.swift — thread safety under concurrent access
  • Test: Scripts/test-assertions.sh Section 6, "Concurrent x8 shared-prefix" tests
