Releases · ggml-org/llama.cpp
b6686
b6685
server : context checkpointing for hybrid and recurrent models (#16382)
* generalize `swa_checkpoint` to `ctx_checkpoint`: this extends `llama-server`'s SWA checkpointing logic to include hybrid/recurrent models such as Jamba and Granite
* keep backwards compat with `--swa-checkpoints`
* disable debug prints
* update prompt re-processing message
* fix off-by-one error per GG
* keep `seq_rm` log per GG
* server : fix checkpoint logic to support recurrent caches
* server : cleanup and fixes
Co-authored-by: Georgi Gerganov <[email protected]>
b6684
metal : fix loop bound in ggml_mem_ranges (#16412)
b6683
llama : fix shapes for bert/mpt q/k norm (#16409)
b6682
ggml : fix graph reallocation with multiple chunks (#16396) reallocation is needed if a single chunk grows in size, even if total allocation size stays the same or is lower
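The note above describes a per-chunk condition rather than a total-size one. As a rough illustration only (this is not ggml's actual allocator API; the function name and the chunk representation here are assumptions), the check might look like:

```cpp
#include <cstddef>
#include <vector>

// Decide whether a multi-chunk allocation has to be redone.
// Comparing only the totals misses the case where one chunk grows
// while another shrinks, so each chunk is checked individually.
bool needs_realloc(const std::vector<size_t> & old_chunk_sizes,
                   const std::vector<size_t> & new_chunk_sizes) {
    if (new_chunk_sizes.size() > old_chunk_sizes.size()) {
        return true; // more chunks than were previously reserved
    }
    for (size_t i = 0; i < new_chunk_sizes.size(); ++i) {
        if (new_chunk_sizes[i] > old_chunk_sizes[i]) {
            return true; // this chunk outgrew its reservation, even if the total did not
        }
    }
    return false;
}
```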
b6680
vulkan: Replace uses of maxMemoryAllocationSize and VK_WHOLE_SIZE (#1…
b6679
vulkan: Fix FA coopmat1 invalid array indexing (#16365) When computing sinks, the cm1 shader was looping r from 0 to Br rather than to rows_per_thread. I must have copied this from the scalar path (where it is correct), and somehow it wasn't causing failures on current drivers.
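For readers unfamiliar with the bug class: the loop bound decides how many per-row accumulators a thread touches. The following is a plain C++ sketch of the pattern, not the actual coopmat1 GLSL shader; `Br`, `rows_per_thread`, and the sink accumulation come from the commit message, everything else is assumed.

```cpp
#include <vector>

// Each thread owns rows_per_thread partial results, not all Br rows of the block.
void add_sinks(std::vector<float> & acc, float sink, int Br, int rows_per_thread) {
    (void) Br; // the buggy version looped r < Br and indexed past the end of acc
    for (int r = 0; r < rows_per_thread; ++r) {
        acc[r] += sink; // corrected bound: stay within this thread's rows
    }
}
```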
b6678
ci : change macos-13 to macos-15-intel (#16401) This commit updates the macos-13 runners to macos-15-intel. The motivation for this change is that the macos-13 runners are scheduled to be retired on 2025-12-04. Refs: https://github.blog/changelog/2025-09-19-github-actions-macos-13-runner-image-is-closing-down/
b6676
vulkan: in flash attention, bounds check against nem1 (don't rely on …
b6673
test-barrier : do not use more threads than physically available (#16…
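The intent of the last fix, as its title states, is to cap the test's thread count at what the machine actually provides. A minimal, hedged sketch of that clamping (the helper name is made up; the real test may query core counts differently):

```cpp
#include <algorithm>
#include <thread>

// Clamp a requested worker count to the number of hardware threads,
// falling back to the request if the hardware count is unknown (0).
int clamp_n_threads(int requested) {
    const unsigned hw = std::thread::hardware_concurrency();
    return hw == 0 ? requested : std::min<int>(requested, (int) hw);
}
```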