Skip to content

KV cache quantization (--kv-bits 4) crashes server with Gemma 4 #96

@scouzi1966

Description

@scouzi1966

Summary

afm mlx -m mlx-community/gemma-4-31b-it-8bit --kv-bits 4 causes the server to fail with connection errors. The server either crashes on startup or refuses requests.

Reproduction

afm mlx -m mlx-community/gemma-4-31b-it-8bit --port 9877 --kv-bits 4
# Then any chat completion request returns "Connection error"

Detected during the comprehensive smart-analysis test suite (Scripts/test-llm-comprehensive.txt [@ kv-quantized] variant).

Affected Variants

  • mlx-community/gemma-4-31b-it-8bit @ kv-quantized--kv-bits 4

The same --kv-bits 4 flag works on Qwen3.5-35B-A3B-4bit (passed in the same test suite).

Likely Cause

Gemma 4 uses mixed cache types: 50 RotatingKVCache (sliding attention) + 10 KVCacheSimple (full attention). The QuantizedKVCache implementation in mlx-swift-lm may not support RotatingKVCache, or there's an interaction with our patched KV cache code.

Impact

Low — KV quantization is an optional optimization. Workaround is to omit --kv-bits for Gemma 4 models.

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't working

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions