


@ggerganov ggerganov commented Sep 7, 2025

This PR introduces a new, dynamic way to compile and use Metal kernels. On master, we had to pre-compile all kernels during backend initialization. Now we can compile just the needed kernels on the fly, during inference.
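
To illustrate the general idea (this is a minimal Swift sketch, not the PR's actual Objective-C code in `ggml-metal.m`; the class and error domain names are made up), a kernel is compiled from MSL source the first time it is requested and the resulting pipeline is cached:

```swift
import Foundation
import Metal

// Minimal sketch: build a compute pipeline from MSL source at runtime and
// cache it by name, instead of pre-compiling every kernel at init time.
final class KernelCache {
    private let device: MTLDevice
    private var pipelines: [String: MTLComputePipelineState] = [:]

    init(device: MTLDevice) {
        self.device = device
    }

    // Compile (or fetch from cache) the pipeline for a given kernel name.
    func pipeline(name: String, source: String) throws -> MTLComputePipelineState {
        if let cached = pipelines[name] {
            return cached
        }
        let library = try device.makeLibrary(source: source, options: nil)
        guard let fn = library.makeFunction(name: name) else {
            throw NSError(domain: "KernelCache", code: 1)
        }
        let pipeline = try device.makeComputePipelineState(function: fn)
        pipelines[name] = pipeline
        return pipeline
    }
}
```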

This change makes it possible to compile significantly more optimized kernels, specialized for the specific shapes of the current computation. This is achieved through the MTLFunctionConstant mechanism: with shapes known at compile time, loops can be unrolled much more aggressively, which improves overall performance.
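
As a hedged sketch of how function constants enable this (the kernel name `flash_attn_example` and constant index 0 are illustrative only, not the PR's actual kernels or indices):

```swift
import Metal

// The head size is baked into the kernel at pipeline-creation time via a
// function constant, so the Metal compiler can fully unroll the inner loop.
let source = """
#include <metal_stdlib>
using namespace metal;

constant uint HEAD_DIM [[function_constant(0)]];

kernel void flash_attn_example(device const float * q   [[buffer(0)]],
                               device const float * k   [[buffer(1)]],
                               device       float * out [[buffer(2)]],
                               uint tid [[thread_position_in_grid]]) {
    float acc = 0.0f;
    // HEAD_DIM is a compile-time constant here -> this loop can be unrolled.
    for (uint i = 0; i < HEAD_DIM; ++i) {
        acc += q[tid*HEAD_DIM + i] * k[tid*HEAD_DIM + i];
    }
    out[tid] = acc;
}
"""

func makeSpecializedPipeline(device: MTLDevice, headDim: UInt32) throws -> MTLComputePipelineState {
    let library = try device.makeLibrary(source: source, options: nil)

    var dim = headDim
    let constants = MTLFunctionConstantValues()
    constants.setConstantValue(&dim, type: .uint, index: 0)

    // makeFunction(name:constantValues:) specializes the kernel for this shape.
    let fn = try library.makeFunction(name: "flash_attn_example", constantValues: constants)
    return try device.makeComputePipelineState(function: fn)
}
```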

The new kernel loading mechanism is currently applied to all Flash Attention (FA) kernels. In follow-up PRs, we will progressively migrate the remaining kernels and use function constants to further improve performance.

A secondary change is improved matrix-vector (mul_mv) kernels for Q8_0.
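
For reference, Q8_0 stores weights in blocks of 32 int8 quants with one scale per block. Below is a scalar Swift sketch of the dot product these kernels compute, just to show the layout; it is not the shader code from this PR, which is vectorized per SIMD-group:

```swift
// Reference-only sketch of the Q8_0 layout and dequantized dot product.
struct BlockQ8_0 {
    var d: Float16   // per-block scale
    var qs: [Int8]   // 32 quantized weights
}

// Computes sum_j x[j] * dequant(w[j]) over a row stored as Q8_0 blocks.
func dotQ8_0(_ blocks: [BlockQ8_0], _ x: [Float]) -> Float {
    var sum: Float = 0
    for (b, block) in blocks.enumerated() {
        var acc: Float = 0
        for j in 0..<32 {
            acc += Float(block.qs[j]) * x[b * 32 + j]
        }
        sum += Float(block.d) * acc   // apply the block scale once per block
    }
    return sum
}
```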

The result is significantly improved performance across the board: the bigger the context, the bigger the speedup.

| Model | Test | t/s master | t/s gg/metal-refactor | Speedup |
| --- | --- | ---: | ---: | ---: |
| gemma3 12B Q8_0 | pp512 | 807.32 | 838.85 | 1.04 |
| gemma3 12B Q8_0 | pp512@d1024 | 725.71 | 797.56 | 1.10 |
| gemma3 12B Q8_0 | pp512@d2048 | 705.75 | 784.87 | 1.11 |
| gemma3 12B Q8_0 | pp512@d8192 | 617.82 | 730.08 | 1.18 |
| gemma3 12B Q8_0 | pp512@d32768 | 407.97 | 573.20 | 1.41 |
| gemma3 12B Q8_0 | tg32 | 42.12 | 42.97 | 1.02 |
| gemma3 12B Q8_0 | tg32@d1024 | 40.80 | 41.80 | 1.02 |
| gemma3 12B Q8_0 | tg32@d32768 | 35.55 | 36.26 | 1.02 |
| gemma3 4B Q8_0 | pp512 | 2473.10 | 2641.20 | 1.07 |
| gemma3 4B Q8_0 | pp512@d1024 | 2164.10 | 2478.27 | 1.15 |
| gemma3 4B Q8_0 | pp512@d2048 | 2095.75 | 2414.46 | 1.15 |
| gemma3 4B Q8_0 | pp512@d8192 | 1790.75 | 2219.47 | 1.24 |
| gemma3 4B Q8_0 | pp512@d32768 | 1138.71 | 1664.12 | 1.46 |
| gemma3 4B Q8_0 | tg32 | 98.78 | 98.75 | 1.00 |
| gemma3 4B Q8_0 | tg32@d1024 | 93.88 | 95.53 | 1.02 |
| gemma3 4B Q8_0 | tg32@d32768 | 81.96 | 83.43 | 1.02 |
| gpt-oss 120B MXFP4 MoE | pp512 | 1208.74 | 1222.50 | 1.01 |
| gpt-oss 120B MXFP4 MoE | pp512@d1024 | 1171.21 | 1196.71 | 1.02 |
| gpt-oss 120B MXFP4 MoE | pp512@d2048 | 1112.20 | 1149.21 | 1.03 |
| gpt-oss 120B MXFP4 MoE | pp512@d8192 | 899.23 | 980.58 | 1.09 |
| gpt-oss 120B MXFP4 MoE | pp512@d32768 | 507.91 | 607.60 | 1.20 |
| gpt-oss 120B MXFP4 MoE | tg32 | 80.43 | 83.52 | 1.04 |
| gpt-oss 120B MXFP4 MoE | tg32@d1024 | 79.57 | 81.16 | 1.02 |
| gpt-oss 120B MXFP4 MoE | tg32@d32768 | 57.21 | 61.25 | 1.07 |
| gpt-oss 20B MXFP4 MoE | pp512 | 2352.55 | 2404.28 | 1.02 |
| gpt-oss 20B MXFP4 MoE | pp512@d1024 | 2198.80 | 2283.64 | 1.04 |
| gpt-oss 20B MXFP4 MoE | pp512@d2048 | 2084.15 | 2188.54 | 1.05 |
| gpt-oss 20B MXFP4 MoE | pp512@d8192 | 1591.06 | 1764.50 | 1.11 |
| gpt-oss 20B MXFP4 MoE | pp512@d32768 | 822.74 | 1000.95 | 1.22 |
| gpt-oss 20B MXFP4 MoE | tg32 | 116.02 | 122.63 | 1.06 |
| gpt-oss 20B MXFP4 MoE | tg32@d1024 | 114.13 | 119.72 | 1.05 |
| gpt-oss 20B MXFP4 MoE | tg32@d32768 | 83.72 | 88.65 | 1.06 |
| llama 8B Q8_0 | pp512 | 1308.80 | 1320.41 | 1.01 |
| llama 8B Q8_0 | pp512@d1024 | 1221.28 | 1253.95 | 1.03 |
| llama 8B Q8_0 | pp512@d2048 | 1043.15 | 1191.10 | 1.14 |
| llama 8B Q8_0 | pp512@d8192 | 829.73 | 927.69 | 1.12 |
| llama 8B Q8_0 | pp512@d32768 | 378.31 | 478.48 | 1.26 |
| llama 8B Q8_0 | tg32 | 67.56 | 71.86 | 1.06 |
| llama 8B Q8_0 | tg32@d1024 | 64.71 | 70.27 | 1.09 |
| llama 8B Q8_0 | tg32@d32768 | 41.17 | 45.17 | 1.10 |
| qwen2 1.5B Q8_0 | pp512 | 6106.18 | 6210.50 | 1.02 |
| qwen2 1.5B Q8_0 | pp512@d1024 | 5373.06 | 5613.40 | 1.04 |
| qwen2 1.5B Q8_0 | pp512@d2048 | 4784.09 | 5078.52 | 1.06 |
| qwen2 1.5B Q8_0 | pp512@d8192 | 2596.45 | 3246.83 | 1.25 |
| qwen2 1.5B Q8_0 | pp512@d32768 | 940.35 | 1332.19 | 1.42 |
| qwen2 1.5B Q8_0 | tg32 | 184.88 | 187.62 | 1.01 |
| qwen2 1.5B Q8_0 | tg32@d1024 | 164.85 | 180.98 | 1.10 |
| qwen2 1.5B Q8_0 | tg32@d32768 | 114.57 | 121.77 | 1.06 |
| qwen2 3B Q8_0 | pp512 | 2961.57 | 2992.26 | 1.01 |
| qwen2 3B Q8_0 | pp512@d1024 | 2624.24 | 2742.73 | 1.05 |
| qwen2 3B Q8_0 | pp512@d2048 | 2265.07 | 2531.85 | 1.12 |
| qwen2 3B Q8_0 | pp512@d8192 | 1398.54 | 1739.14 | 1.24 |
| qwen2 3B Q8_0 | pp512@d32768 | 630.45 | 766.17 | 1.22 |
| qwen2 3B Q8_0 | tg32 | 117.70 | 124.62 | 1.06 |
| qwen2 3B Q8_0 | tg32@d1024 | 107.16 | 121.23 | 1.13 |
| qwen2 3B Q8_0 | tg32@d32768 | 66.23 | 78.30 | 1.18 |
| qwen2 7B Q8_0 | pp512 | 1419.15 | 1427.45 | 1.01 |
| qwen2 7B Q8_0 | pp512@d1024 | 1336.82 | 1363.87 | 1.02 |
| qwen2 7B Q8_0 | pp512@d2048 | 1263.52 | 1303.77 | 1.03 |
| qwen2 7B Q8_0 | pp512@d8192 | 941.38 | 1031.58 | 1.10 |
| qwen2 7B Q8_0 | pp512@d32768 | 460.00 | 545.48 | 1.19 |
| qwen2 7B Q8_0 | tg32 | 71.87 | 76.35 | 1.06 |
| qwen2 7B Q8_0 | tg32@d1024 | 68.38 | 74.68 | 1.09 |
| qwen2 7B Q8_0 | tg32@d32768 | 47.13 | 53.64 | 1.14 |
| qwen3 4B Q8_0 | pp512 | 2274.78 | 2321.97 | 1.02 |
| qwen3 4B Q8_0 | pp512@d1024 | 1989.08 | 2105.77 | 1.06 |
| qwen3 4B Q8_0 | pp512@d2048 | 1773.93 | 1923.09 | 1.08 |
| qwen3 4B Q8_0 | pp512@d8192 | 1075.20 | 1270.23 | 1.18 |
| qwen3 4B Q8_0 | pp512@d32768 | 355.96 | 519.51 | 1.46 |
| qwen3 4B Q8_0 | tg32 | 100.36 | 105.36 | 1.05 |
| qwen3 4B Q8_0 | tg32@d1024 | 92.35 | 101.01 | 1.09 |
| qwen3 4B Q8_0 | tg32@d32768 | 47.05 | 53.52 | 1.14 |
| qwen3moe 30B.A3B Q8_0 | pp512 | 2057.43 | 2110.77 | 1.03 |
| qwen3moe 30B.A3B Q8_0 | pp512@d1024 | 1777.20 | 1889.09 | 1.06 |
| qwen3moe 30B.A3B Q8_0 | pp512@d2048 | 1549.48 | 1694.82 | 1.09 |
| qwen3moe 30B.A3B Q8_0 | pp512@d8192 | 882.27 | 1050.35 | 1.19 |
| qwen3moe 30B.A3B Q8_0 | pp512@d32768 | 298.33 | 404.83 | 1.36 |
| qwen3moe 30B.A3B Q8_0 | tg32 | 76.90 | 77.77 | 1.01 |
| qwen3moe 30B.A3B Q8_0 | tg32@d1024 | 70.76 | 75.61 | 1.07 |
| qwen3moe 30B.A3B Q8_0 | tg32@d32768 | 38.77 | 43.84 | 1.13 |

TODO before merge

  • Add comments

Next PRs

@github-actions github-actions bot added the testing, ggml, and Apple Metal labels on Sep 7, 2025
@ggerganov ggerganov merged commit f28d4f4 into master Sep 8, 2025
53 of 55 checks passed
@ggerganov ggerganov deleted the gg/metal-refactor branch September 8, 2025 10:35
njsyw1997 pushed a commit to aizip/llama.cpp that referenced this pull request Sep 10, 2025
* metal : refactor

ggml-ci

* cont : refactor FA-vec kernel

* cont : print metal library load time

* minor : warn to debug + better kernel names

ggml-ci

* metal : optimize mul_mv q8_0

ggml-ci

* metal : simplify FA pipeline creation functions

ggml-ci

* metal : improve naming consistency

* metal : safer function constants offsets

ggml-ci

* metal : comments

ggml-ci