KV cache quantization (--kv-bits 4) crashes server with Gemma 4 #96

## Summary

Running `afm mlx -m mlx-community/gemma-4-31b-it-8bit --kv-bits 4` leaves the server unreachable: every client request fails with a connection error. It is not yet confirmed whether the server crashes on startup or stays up and refuses requests.

## Reproduction

```bash
afm mlx -m mlx-community/gemma-4-31b-it-8bit --port 9877 --kv-bits 4
# Then any chat completion request returns "Connection error"
```
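
To tell a startup crash apart from a refused or dropped request, a small probe can help. The following is only a sketch and not part of afm: it assumes the server exposes an OpenAI-style `/v1/chat/completions` endpoint on the port used above, and both `probe` and `classify_curl_exit` are hypothetical helper names.

```bash
#!/usr/bin/env bash
# Hypothetical probe. The endpoint path and payload shape are assumptions,
# not taken from afm documentation.
PORT="${PORT:-9877}"
MODEL="${MODEL:-mlx-community/gemma-4-31b-it-8bit}"

# Map a curl exit code to a coarse diagnosis.
classify_curl_exit() {
  case "$1" in
    0)  echo "responded" ;;
    7)  echo "could not connect" ;;   # nothing listening: server likely exited
    28) echo "timed out" ;;           # server up but not answering
    52) echo "empty reply" ;;         # connection accepted, then closed
    56) echo "receive failure" ;;     # connection dropped mid-response
    *)  echo "curl exit $1" ;;
  esac
}

# Send one chat completion request and print the diagnosis.
probe() {
  local rc=0
  curl -sS --max-time 60 \
    -H 'Content-Type: application/json' \
    -d "{\"model\": \"$MODEL\", \"messages\": [{\"role\": \"user\", \"content\": \"hi\"}]}" \
    "http://127.0.0.1:$PORT/v1/chat/completions" || rc=$?
  echo
  classify_curl_exit "$rc"
}

# Usage, with the server from the command above already started:
#   probe
```

A "could not connect" result would point to a crash on startup, while "empty reply" or "receive failure" would suggest the process dies while handling the first request.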

Detected during the comprehensive smart-analysis test suite (`Scripts/test-llm-comprehensive.txt` `[@ kv-quantized]` variant).

## Affected Variants

- `mlx-community/gemma-4-31b-it-8bit @ kv-quantized` — `--kv-bits 4`

The same `--kv-bits 4` flag works on Qwen3.5-35B-A3B-4bit (passed in the same test suite).

## Likely Cause

Gemma 4 uses mixed cache types: 50 `RotatingKVCache` instances (sliding-window attention) and 10 `KVCacheSimple` instances (full attention). The `QuantizedKVCache` implementation in mlx-swift-lm may not support `RotatingKVCache`, or the flag may interact badly with our patched KV cache code. Neither hypothesis has been verified yet.

## Impact

Low: KV quantization is an optional optimization. The workaround is to omit `--kv-bits` for Gemma 4 models.
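
Until the root cause is fixed, the workaround could be automated in launch scripts. The sketch below assumes the flag is always passed as two separate arguments (`--kv-bits 4`) and that affected models can be recognized by `gemma-4` in their name. `filter_kv_bits` is a hypothetical helper, not something afm provides.

```bash
#!/usr/bin/env bash
# Hypothetical helper: prints the given arguments one per line, dropping
# "--kv-bits <n>" when the model name contains "gemma-4".
# Assumption: the flag and its value are separate arguments.
filter_kv_bits() {
  local model="$1"; shift
  local skip_next=0 arg
  for arg in "$@"; do
    if [ "$skip_next" -eq 1 ]; then
      skip_next=0          # this is the value that followed --kv-bits
      continue
    fi
    case "$model" in
      *gemma-4*)
        if [ "$arg" = "--kv-bits" ]; then
          skip_next=1      # drop the flag and remember to drop its value
          continue
        fi
        ;;
    esac
    printf '%s\n' "$arg"
  done
}

# Usage sketch:
#   MODEL=mlx-community/gemma-4-31b-it-8bit
#   ARGS=()
#   while IFS= read -r a; do ARGS+=("$a"); done \
#     < <(filter_kv_bits "$MODEL" --port 9877 --kv-bits 4)
#   afm mlx -m "$MODEL" "${ARGS[@]}"
```

Matching on the model name is a blunt heuristic. If the cause turns out to be the `RotatingKVCache` interaction, other models with sliding-window layers might need the same treatment.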
