Summary
afm mlx -m mlx-community/gemma-4-31b-it-8bit --kv-bits 4 causes the server to fail with connection errors. The server either crashes on startup or refuses requests.
Reproduction
afm mlx -m mlx-community/gemma-4-31b-it-8bit --port 9877 --kv-bits 4
# Then any chat completion request returns "Connection error"
Detected during the comprehensive smart-analysis test suite (Scripts/test-llm-comprehensive.txt [@ kv-quantized] variant).
Affected Variants
mlx-community/gemma-4-31b-it-8bit @ kv-quantized — --kv-bits 4
The same --kv-bits 4 flag works on Qwen3.5-35B-A3B-4bit (passed in the same test suite).
Likely Cause
Gemma 4 uses mixed cache types: 50 RotatingKVCache (sliding attention) + 10 KVCacheSimple (full attention). The QuantizedKVCache implementation in mlx-swift-lm may not support RotatingKVCache, or there's an interaction with our patched KV cache code.
Impact
Low — KV quantization is an optional optimization. Workaround is to omit --kv-bits for Gemma 4 models.
Summary
afm mlx -m mlx-community/gemma-4-31b-it-8bit --kv-bits 4causes the server to fail with connection errors. The server either crashes on startup or refuses requests.Reproduction
afm mlx -m mlx-community/gemma-4-31b-it-8bit --port 9877 --kv-bits 4 # Then any chat completion request returns "Connection error"Detected during the comprehensive smart-analysis test suite (
Scripts/test-llm-comprehensive.txt[@ kv-quantized]variant).Affected Variants
mlx-community/gemma-4-31b-it-8bit @ kv-quantized—--kv-bits 4The same
--kv-bits 4flag works on Qwen3.5-35B-A3B-4bit (passed in the same test suite).Likely Cause
Gemma 4 uses mixed cache types: 50 RotatingKVCache (sliding attention) + 10 KVCacheSimple (full attention). The QuantizedKVCache implementation in mlx-swift-lm may not support RotatingKVCache, or there's an interaction with our patched KV cache code.
Impact
Low — KV quantization is an optional optimization. Workaround is to omit
--kv-bitsfor Gemma 4 models.