## Summary

When 8+ simultaneous requests share a common prompt prefix and use `response_format=json_schema` with `strict: true`, all requests return empty responses. Affects all models tested (Qwen3.5, Qwen3-Coder, Gemma 4).
## Reproduction

```
afm mlx -m mlx-community/Qwen3.5-35B-A3B-4bit --concurrent 15 --enable-prefix-caching --enable-grammar-constraints --port 9998
```
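The failing workload can be sketched as a minimal client-side illustration. This assumes an OpenAI-compatible `/v1/chat/completions` endpoint on port 9998; the schema, prompt text, and `build_request` helper are hypothetical, not part of the test suite. It builds 8 request bodies that share a long prompt prefix and all set `response_format=json_schema` with `strict: true`:

```python
import json

# Hypothetical shared prefix: long enough that prefix caching should kick in.
SHARED_PREFIX = "You are a strict JSON generator. " * 20

# Hypothetical schema; any strict json_schema response_format triggers the bug.
SCHEMA = {
    "name": "marker_reply",
    "strict": True,
    "schema": {
        "type": "object",
        "properties": {"marker": {"type": "string"}},
        "required": ["marker"],
        "additionalProperties": False,
    },
}

def build_request(i: int) -> dict:
    """One of the 8 concurrent requests: shared prefix + divergent suffix."""
    return {
        "model": "mlx-community/Qwen3.5-35B-A3B-4bit",
        "messages": [
            {"role": "user", "content": f"{SHARED_PREFIX}Echo marker REQ-{i}."}
        ],
        "response_format": {"type": "json_schema", "json_schema": SCHEMA},
    }

requests = [build_request(i) for i in range(1, 9)]
# In the real reproduction these 8 bodies are POSTed concurrently to
# http://localhost:9998/v1/chat/completions; all 8 come back with empty content.
print(json.dumps(requests[0], indent=2)[:200])
```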
Assertion test Section 6, "Concurrent x8 shared-prefix":

```
❌ Concurrent x8 shared-prefix: uncached suffix remains on every branch
   Expected: 0 < cached_tokens < prompt_tokens for all 8
   Actual:   FAIL: 2:0/59, 3:0/59, 4:0/59, 5:0/59, 6:0/59, 7:0/59, 8:0/59
❌ Concurrent x8 shared-prefix: divergent suffix responses stay isolated
   Expected: each of 8 responses keeps only its own marker
   Actual:   FAIL: all 8 return empty (Expecting value: line 1 column 1)
```
## Root Cause Analysis

Two separate issues compound:
### 1. Prefix cache timing (`cached_tokens=0`)

When 8 requests arrive simultaneously, they all enter `prefillOne` at roughly the same time. Request 1 has not yet completed its prefill (and therefore has not saved to the radix cache) when requests 2-8 perform their radix lookups, so all of them miss and no request benefits from prefix sharing.

This is a timing issue in `BatchScheduler.prefillOne()`: the radix cache save happens in `finishSlot()` (after decode completes), not immediately after prefill. As a result, simultaneous arrivals can never hit each other's prefix cache.
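The timing hazard can be sketched with a toy simulation (Python purely for illustration; `ToyRadixCache` and `simulate` are hypothetical stand-ins, not the Swift implementation). When the cache insert is deferred until after decode, all concurrent arrivals look up before anyone has inserted, so every lookup misses; saving right after prefill lets later arrivals in the batch hit the shared prefix:

```python
# Toy model of the prefix-cache timing hazard (hypothetical names, not the
# actual Swift implementation). The "cache" maps stored prompts to KV state.
class ToyRadixCache:
    def __init__(self):
        self.store = []

    def lookup(self, prompt: str) -> int:
        """Return length of the longest common prefix with any stored prompt."""
        def lcp(a: str, b: str) -> int:
            n = 0
            while n < min(len(a), len(b)) and a[n] == b[n]:
                n += 1
            return n
        return max((lcp(prompt, p) for p in self.store), default=0)

    def insert(self, prompt: str):
        self.store.append(prompt)

def simulate(save_after_prefill: bool) -> list[int]:
    """Run 8 'simultaneous' requests; return cached_tokens-like hit lengths."""
    cache = ToyRadixCache()
    prompts = ["SHARED-PREFIX " + f"suffix-{i}" for i in range(8)]
    hits = []
    # Phase 1: every request does its radix lookup before any decode finishes.
    for p in prompts:
        hits.append(cache.lookup(p))
        if save_after_prefill:
            cache.insert(p)  # save right after prefill: later batch members hit
    if not save_after_prefill:
        for p in prompts:
            cache.insert(p)  # current behavior: save in finishSlot(), after
                             # every concurrent lookup has already happened
    return hits

print(simulate(save_after_prefill=False))  # all zeros: every request misses
print(simulate(save_after_prefill=True))   # requests 2-8 hit the shared prefix
```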
### 2. Empty responses with grammar constraints

All 8 requests return empty responses (`Expecting value: line 1 column 1`). The requests use `response_format=json_schema` with `strict: true`, which creates an xgrammar matcher per request. With 8 concurrent grammar-constrained requests, one of the following is likely:

- The xgrammar engine has a thread-safety issue under concurrent access
- The grammar matcher state gets corrupted when multiple requests share the generation loop
- The BatchScheduler's grammar constraint handling doesn't isolate per-slot state correctly

Without grammar constraints (`strict: false` or no `--enable-grammar-constraints`), the same 8 concurrent requests succeed and content is generated. The failure is specific to the interaction between concurrent batch decode and grammar enforcement.
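The per-slot isolation hypothesis can be illustrated with a toy sketch (Python for illustration; `ToyMatcher` is a hypothetical stand-in for an xgrammar matcher, not its real API). When every slot shares one matcher, the second slot's tokens are rejected against the first slot's grammar state; with one matcher per slot, each constraint advances independently:

```python
# Toy stand-in for a grammar matcher: accepts characters of one fixed target
# string in order, rejecting everything else (hypothetical, not xgrammar).
class ToyMatcher:
    def __init__(self, target: str):
        self.target = target
        self.pos = 0

    def accept(self, ch: str) -> bool:
        """Advance only if `ch` is the next expected character."""
        if self.pos < len(self.target) and self.target[self.pos] == ch:
            self.pos += 1
            return True
        return False

    def finished(self) -> bool:
        return self.pos == len(self.target)

targets = ['{"a":1}', '{"b":2}']

# Buggy variant: both slots share one matcher, as if per-slot state leaked.
shared = ToyMatcher(targets[0])
shared_ok = []
for step in range(len(targets[0])):
    for t in targets:                      # interleaved batch decode
        shared_ok.append(shared.accept(t[step]))
# Every token from slot 2 is rejected against slot 1's grammar state,
# so slot 2 can never emit valid output.

# Correct variant: one matcher per slot, advanced only by its own tokens.
per_slot = [ToyMatcher(t) for t in targets]
for step in range(len(targets[0])):
    for m, t in zip(per_slot, targets):
        m.accept(t[step])

print(all(m.finished() for m in per_slot))  # per-slot isolation: both complete
```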
## Affected Models (confirmed)

| Model | Pass rate (overall) | This bug? |
|---|---|---|
| mlx-community/gemma-4-e4b-it-4bit | 92% | Yes |
| mlx-community/gemma-4-26B-A4B-it-mlx-4bit | 89% | Yes |
| mlx-community/Qwen3.5-35B-A3B-4bit | 97% | Yes |
| mlx-community/Qwen3-Coder-Next-4bit | 95% | Yes |
## Not related to

- Tool calling — this is `response_format=json_schema`, not tool calls
- BatchRotatingKVCache — the bug occurs with both KVCacheSimple and RotatingKVCache models
- Type coercion — separate issue
- Prefix cache correctness — sequential requests cache correctly; only simultaneous arrivals miss
## Scope

- `Sources/MacLocalAPI/Models/BatchScheduler.swift` — prefix cache save timing, grammar constraint per-slot isolation
- `Sources/MacLocalAPI/Services/XGrammarService.swift` — thread safety under concurrent access
- Test: `Scripts/test-assertions.sh` Section 6, "Concurrent x8 shared-prefix" tests