@Alcpz Alcpz commented Oct 23, 2025

This PR improves the q4_K_q8_K GEMM and GEMV kernels on arm64 using the i8mm and vecdot instructions.

Tested on an Apple M4 with Liquid LFM2-1.2B model:

./bin/llama-bench -p 256 -n 128 -pg 0,0 -t 8 -m models/LFM2-1.2B-Q4_K_M.gguf,models/LFM2-1.2B-Q4_K_pure.gguf
| model | backend | test | t/s (master) | t/s (this PR) | speedup |
| --- | --- | --- | --- | --- | --- |
| lfm2 1.2B Q4_K - Medium | CPU | pp256 | 436.57 ± 0.40 | 673.30 ± 2.56 | 1.54 |
| lfm2 1.2B Q4_K - Medium | CPU | tg128 | 217.84 ± 8.17 | 229.91 ± 1.22 | 1.06 |
| lfm2 1.2B Q4_K - Medium (pure Q4_K) | CPU | pp256 | 462.25 ± 0.67 | 800.99 ± 3.61 | 1.73 |
| lfm2 1.2B Q4_K - Medium (pure Q4_K) | CPU | tg128 | 241.74 ± 1.47 | 254.42 ± 2.42 | 1.05 |
| llama 8B Q4_K - Medium | CPU | pp256 | 62.43 ± 1.19 | 99.52 ± 0.11 | 1.54 |
| llama 8B Q4_K - Medium | CPU | tg128 | 36.70 ± 0.70 | 42.47 ± 0.32 | 1.15 |

Master build: 8cf6b42 (6824)
This PR: c4f1358

Perplexity remains unchanged (tested current build vs master):

Llama3.1: 7.8861 +/- 0.11849 
LFM2 1.2B: 16.9954 +/- 0.97671

As for test-backend-ops, I've checked the output of the layer tensors manually, comparing REPACK vs master, since #16182 is still ongoing.

Any suggestions on how to better test this PR are welcome.

@github-actions github-actions bot added the ggml changes relating to the ggml tensor library for machine learning label Oct 23, 2025