Concurrent x8 prefix cache + grammar returns empty responses #86

@scouzi1966

Description

Summary

When 8 or more simultaneous requests share a common prompt prefix and use response_format=json_schema with strict: true, all requests return empty responses. All tested models are affected (Qwen3.5, Qwen3-Coder, Gemma 4).

Reproduction

afm mlx -m mlx-community/Qwen3.5-35B-A3B-4bit --concurrent 15 --enable-prefix-caching --enable-grammar-constraints --port 9998

Assertion test results, Section 6 ("Concurrent x8 shared-prefix"):

❌ Concurrent x8 shared-prefix: uncached suffix remains on every branch
   Expected: 0 < cached_tokens < prompt_tokens for all 8
   Actual:   FAIL: 2:0/59, 3:0/59, 4:0/59, 5:0/59, 6:0/59, 7:0/59, 8:0/59

❌ Concurrent x8 shared-prefix: divergent suffix responses stay isolated
   Expected: each of 8 responses keeps only its own marker
   Actual:   FAIL: all 8 return empty (Expecting value: line 1 column 1)

Root Cause Analysis

Two separate issues compound:

1. Prefix cache timing (cached_tokens=0)

When 8 requests arrive simultaneously, they all enter prefillOne at roughly the same time. Request 1 has not yet completed its prefill (and so has not saved anything to the radix cache) when requests 2-8 perform their radix lookup, so all of them miss. No request benefits from prefix sharing.

This is a timing issue in BatchScheduler.prefillOne(): the radix cache save happens in finishSlot() (after decode completes) rather than immediately after prefill, so simultaneous arrivals can never hit each other's prefix cache.
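The ordering problem can be reproduced with a self-contained toy (ToyRadixCache and the token values here are illustrative stand-ins, not the project's real types): when every request looks up before any request saves, all lookups miss; saving right after each "prefill" lets later arrivals hit the shared prefix.

```swift
// Toy model of the race (hypothetical names, not the real BatchScheduler).
final class ToyRadixCache {
    private var stored: [[Int]] = []
    // Length of the longest common prefix with any stored token sequence.
    func longestPrefixHit(_ tokens: [Int]) -> Int {
        var best = 0
        for entry in stored {
            let n = zip(entry, tokens).prefix { $0.0 == $0.1 }.count
            best = max(best, n)
        }
        return best
    }
    func insert(_ tokens: [Int]) { stored.append(tokens) }
}

let sharedPrefix = Array(0..<50)
let requests = (0..<8).map { sharedPrefix + [1000 + $0] }  // divergent suffixes

// Current behavior: every lookup runs before any save, because the save is
// deferred to finishSlot() after decode. All 8 requests miss.
let cacheA = ToyRadixCache()
let hitsA = requests.map { cacheA.longestPrefixHit($0) }
requests.forEach { cacheA.insert($0) }  // too late to help anyone

// Fixed ordering: each request saves immediately after its prefill, so the
// next arrival's (serialized) lookup can hit the 50-token shared prefix.
let cacheB = ToyRadixCache()
var hitsB: [Int] = []
for r in requests {
    hitsB.append(cacheB.longestPrefixHit(r))
    cacheB.insert(r)
}

print(hitsA)  // [0, 0, 0, 0, 0, 0, 0, 0]
print(hitsB)  // [0, 50, 50, 50, 50, 50, 50, 50]
```

This matches the observed `2:0/59, 3:0/59, …` output: every branch reports cached_tokens=0 even though 50+ prompt tokens are shared.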

2. Empty responses with grammar constraints

All 8 requests return empty responses (Expecting value: line 1 column 1). The requests use response_format=json_schema with strict: true, which creates an xgrammar matcher per request. With 8 concurrent grammar-constrained requests, either:

  • The xgrammar engine has a thread-safety issue under concurrent access
  • The grammar matcher state gets corrupted when multiple requests share the generation loop
  • The BatchScheduler's grammar constraint handling doesn't isolate per-slot state correctly

Without grammar constraints (strict: false or no --enable-grammar-constraints), the 8 concurrent requests succeed (content is generated). The issue is specifically the interaction between concurrent batch decode + grammar enforcement.
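The third hypothesis (per-slot state not isolated) can be illustrated with a toy matcher. ToyMatcher below is hypothetical and is not xgrammar's real API; it only shows why a stateful grammar automaton shared across slots rejects every stream, while per-slot matchers succeed.

```swift
// Hypothetical sketch: a grammar matcher is a stateful automaton, so it
// must be owned per slot. This ToyMatcher accepts the fixed token
// sequence [1, 2, 3] (a stand-in for a JSON-schema grammar).
final class ToyMatcher {
    private let expected = [1, 2, 3]
    private var pos = 0
    func accept(_ token: Int) -> Bool {
        guard pos < expected.count, expected[pos] == token else { return false }
        pos += 1
        return true
    }
}

// 8 slots, each generating the same valid sequence [1, 2, 3].
let streams = Array(repeating: [1, 2, 3], count: 8)

// Shared matcher: round-robin batch decode feeds it an interleaving of all
// 8 streams (1,1,1,...). After slot 0's first token advances the state,
// every other slot's token is rejected.
let shared = ToyMatcher()
var sharedOK = true
for step in 0..<3 {
    for s in streams { sharedOK = shared.accept(s[step]) && sharedOK }
}

// Per-slot matchers: each slot advances its own automaton independently,
// and every stream is accepted despite the interleaved decode order.
let perSlot = streams.map { _ in ToyMatcher() }
var isolatedOK = true
for step in 0..<3 {
    for (i, s) in streams.enumerated() {
        isolatedOK = perSlot[i].accept(s[step]) && isolatedOK
    }
}

print(sharedOK, isolatedOK)  // false true
```

If the real bug follows this shape, the fix is to ensure BatchScheduler creates (and rolls back or resets) one xgrammar matcher per slot, never sharing matcher state across the batched generation loop.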

Affected Models (confirmed)

Model                                       Pass rate (overall)   This bug?
mlx-community/gemma-4-e4b-it-4bit           92%                   Yes
mlx-community/gemma-4-26B-A4B-it-mlx-4bit   89%                   Yes
mlx-community/Qwen3.5-35B-A3B-4bit          97%                   Yes
mlx-community/Qwen3-Coder-Next-4bit         95%                   Yes

Not related to

  • Tool calling — this is response_format=json_schema, not tool calls
  • BatchRotatingKVCache — happens with both KVCacheSimple and RotatingKVCache models
  • Type coercion — separate issue
  • Prefix cache correctness — sequential requests cache correctly; only simultaneous arrivals miss

Scope

  • Sources/MacLocalAPI/Models/BatchScheduler.swift — prefix cache save timing, grammar constraint per-slot isolation
  • Sources/MacLocalAPI/Services/XGrammarService.swift — thread safety under concurrent access
  • Test: Scripts/test-assertions.sh Section 6, "Concurrent x8 shared-prefix" tests
