@Alcpz Alcpz commented Oct 23, 2025

This PR improves the q4_K_q8_K GEMM and GEMV kernels on arm64 using the i8mm and vecdot instructions.

Tested on an Apple M4 with Liquid LFM2-1.2B model:

./bin/llama-bench -p 256 -n 128 -pg 0,0 -t 8 -m models/LFM2-1.2B-Q4_K_M.gguf,models/LFM2-1.2B-Q4_K_pure.gguf
| model | backend | test | t/s (master) | t/s (this PR) | speedup |
| --- | --- | --- | --- | --- | --- |
| lfm2 1.2B Q4_K - Medium | CPU | pp256 | 436.57 ± 0.40 | 673.30 ± 2.56 | 1.54 |
| lfm2 1.2B Q4_K - Medium | CPU | tg128 | 217.84 ± 8.17 | 229.91 ± 1.22 | 1.06 |
| lfm2 1.2B Q4_K - Medium (pure Q4_K) | CPU | pp256 | 462.25 ± 0.67 | 800.99 ± 3.61 | 1.73 |
| lfm2 1.2B Q4_K - Medium (pure Q4_K) | CPU | tg128 | 241.74 ± 1.47 | 254.42 ± 2.42 | 1.05 |
| llama 8B Q4_K - Medium | CPU | pp256 | 62.43 ± 1.19 | 99.52 ± 0.11 | 1.54 |
| llama 8B Q4_K - Medium | CPU | tg128 | 36.70 ± 0.70 | 42.47 ± 0.32 | 1.15 |

Master build: 8cf6b42 (6824)
This PR: c4f1358

Perplexity remains unchanged (tested current build vs master):

Llama3.1: 7.8861 +/- 0.11849 
LFM2 1.2B: 16.9954 +/- 0.97671

As for test-backend-ops, I've checked the output of the layer tensors manually, comparing REPACK vs master, since #16182 is still ongoing.

Any suggestions on how to better test this PR are welcome.

@github-actions github-actions bot added the ggml changes relating to the ggml tensor library for machine learning label Oct 23, 2025