Releases · ggml-org/llama.cpp
b5380
server : passthrough the /models endpoint during loading (#13535)
* server : passthrough the /models endpoint during loading
* server : update readme + return json for "meta" field
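For context, a minimal polling sketch of what this passthrough enables: a client can query /models while the model is still loading instead of waiting for startup to finish. The host, port, retry loop, and use of libcurl are illustrative assumptions; only the /models path and the "meta" field come from the entry above.

```cpp
// Hypothetical client: poll llama-server's /models endpoint while the model is
// still loading. Assumes a server at localhost:8080; everything besides the
// endpoint path is an illustrative assumption.
#include <curl/curl.h>
#include <chrono>
#include <cstdio>
#include <string>
#include <thread>

static size_t collect(char * data, size_t size, size_t nmemb, void * userp) {
    static_cast<std::string *>(userp)->append(data, size * nmemb);
    return size * nmemb;
}

int main() {
    curl_global_init(CURL_GLOBAL_DEFAULT);
    CURL * curl = curl_easy_init();
    if (!curl) {
        return 1;
    }
    for (int attempt = 0; attempt < 10; ++attempt) {
        std::string body;
        curl_easy_setopt(curl, CURLOPT_URL, "http://localhost:8080/models");
        curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, collect);
        curl_easy_setopt(curl, CURLOPT_WRITEDATA, &body);
        if (curl_easy_perform(curl) == CURLE_OK) {
            long status = 0;
            curl_easy_getinfo(curl, CURLINFO_RESPONSE_CODE, &status);
            // With the passthrough, this route should answer (including model
            // metadata in a "meta" field) even before loading completes.
            std::printf("HTTP %ld: %s\n", status, body.c_str());
        }
        std::this_thread::sleep_for(std::chrono::seconds(1));
    }
    curl_easy_cleanup(curl);
    curl_global_cleanup();
    return 0;
}
```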
b5379
server : fix cache_tokens bug with no cache_prompt (#13533)
b5378
cmake: simplify vulkan shader test logic (#13263)
b5377
vulkan: KHR_coopmat flash attention (#13506)
This shader uses coopmat1 to do the Q*K^T multiply. The P*V multiply is more difficult for various reasons, so I haven't done it. Performance for this shader is around 2.5x better than for the scalar shader when doing prompt processing. Some of the benefit may be from other optimizations like staging through shared memory, or splitting by rows.
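For readers less familiar with the flash-attention pipeline, here is a plain scalar C++ reference of the two products the entry mentions, softmax(Q·K^T·scale)·V: the Q*K^T product is what the new shader maps onto KHR_coopmat, while the P*V product stays on the existing scalar path. This is a sketch of the math only, not the shader's actual layout or code.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

// Scalar reference for one attention head, row-major buffers:
//   q: n_q x d, k: n_kv x d, v: n_kv x d, out: n_q x d
void attention_ref(const std::vector<float> & q, const std::vector<float> & k,
                   const std::vector<float> & v, std::vector<float> & out,
                   int n_q, int n_kv, int d) {
    const float scale = 1.0f / std::sqrt((float) d);
    std::vector<float> p(n_kv);
    for (int i = 0; i < n_q; ++i) {
        // S = Q*K^T (scaled): the product the new shader maps onto coopmat
        float s_max = -INFINITY;
        for (int j = 0; j < n_kv; ++j) {
            float s = 0.0f;
            for (int c = 0; c < d; ++c) {
                s += q[i*d + c] * k[j*d + c];
            }
            p[j]  = s * scale;
            s_max = std::max(s_max, p[j]);
        }
        // P = row-wise softmax(S)
        float sum = 0.0f;
        for (int j = 0; j < n_kv; ++j) {
            p[j] = std::exp(p[j] - s_max);
            sum += p[j];
        }
        // O = P*V: the product the entry says still runs on the scalar path
        for (int c = 0; c < d; ++c) {
            float o = 0.0f;
            for (int j = 0; j < n_kv; ++j) {
                o += p[j] * v[j*d + c];
            }
            out[i*d + c] = o / sum;
        }
    }
}

int main() {
    const int n_q = 2, n_kv = 3, d = 4;
    std::vector<float> q(n_q*d, 0.1f), k(n_kv*d, 0.2f), v(n_kv*d, 0.3f), out(n_q*d);
    attention_ref(q, k, v, out, n_q, n_kv, d);
    std::printf("out[0] = %f\n", out[0]);
    return 0;
}
```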
b5372
vulkan: workaround FA compile failures on macos (#13517)
b5371
quantize : improve tensor-type pattern matching (#13033)
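A sketch of the idea behind tensor-type pattern matching: per-tensor quantization overrides are chosen by matching tensor names (llama.cpp tensors are named like blk.12.attn_v.weight) against user-supplied patterns. The override table and regex syntax below are hypothetical illustrations, not the quantize tool's actual option format or matcher.

```cpp
// Illustrative matcher: map tensor names onto quantization-type overrides.
// Only the tensor naming scheme comes from llama.cpp; the patterns and the
// target types here are made up for the example.
#include <cstdio>
#include <regex>
#include <string>
#include <utility>
#include <vector>

int main() {
    // hypothetical overrides: regex over the tensor name -> target type
    const std::vector<std::pair<std::regex, std::string>> overrides = {
        { std::regex(R"(blk\.\d+\.attn_v\.weight)"),        "q8_0" },
        { std::regex(R"(blk\.(2[0-9])\.ffn_down\.weight)"), "q6_K" },
    };
    const std::vector<std::string> tensors = {
        "blk.0.attn_q.weight", "blk.3.attn_v.weight", "blk.25.ffn_down.weight",
    };
    for (const auto & name : tensors) {
        std::string type = "default";
        for (const auto & [re, t] : overrides) {
            if (std::regex_match(name, re)) { type = t; break; }
        }
        std::printf("%-28s -> %s\n", name.c_str(), type.c_str());
    }
    return 0;
}
```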
b5370
clip : clip.h becomes a private API (⚠️ breaking change) (#13510)
b5369
metal : use FA-vec kernel up to batch size 20 (#13496)
* batched-bench : fix pp batch contents
* metal : optimize multi-sequence FA vec kernel
* metal : use FA-vec kernel up to batch size 20
b5368
metal : optimize multi-sequence FA vec kernel (#13493)
* batched-bench : fix pp batch contents
* metal : optimize multi-sequence FA vec kernel
b5367
ggml-cpu: Update KleidiAI to v1.6 and fix include directives (#13509)
Signed-off-by: Dan Johansson <[email protected]>