I'm on Linux Mint 22 with an RTX 3070 Mobile and 64 GB of RAM, and I launch llama-server (as a backend for SillyTavern) with:
```
GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 ./llama-server -m ../text-generation-webui-main/models/GLM-4.5-Air-UD-Q3_K_XL-00001-of-00002.gguf --n-gpu-layers 99 --threads 7 --prio 2 -fa --batch-size 2048 --ubatch-size 2048 --ctx-size 12000 --jinja -ot "blk.([0-1]).ffn.*=CUDA0,.ffn_.*_exps=CPU"
```
With recent updates (compiled from the llama.cpp source), generation speed for GLM 4.5 Air drops to around 0.5 tokens per second, much slower than my usual 6 tokens per second. If I revert to an older version such as llama.cpp-9515c6131aecaccc955fdedcfe16c3e030aaefcb (last updated 5th of August), I'm back to my normal speed. Am I the only one experiencing this?
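Since the fast and slow commits are both known, `git bisect` can pinpoint the exact commit that introduced the slowdown. Below is a minimal sketch: the `tps` helper parses the "tokens per second" figure that llama.cpp prints in its timing summary, and the commented bisect commands show how it could drive an automated run. The 3 t/s pass/fail threshold and the build/run details in the comments are assumptions for illustration, not from the original post.

```shell
# Pull the generation speed out of a llama.cpp timing line, e.g.:
#   "llama_print_timings: eval time = ... ms / ... runs (... ms per token, 6.00 tokens per second)"
tps() { grep -o '[0-9.]* tokens per second' | awk '{print $1}'; }

# Example with the "normal" speed reported above:
echo 'llama_print_timings: eval time = 1000.00 ms / 6 runs (166.67 ms per token, 6.00 tokens per second)' | tps
# → 6.00

# With that helper, bisecting inside the llama.cpp checkout could look like
# (threshold and build command are assumptions; exit 125 skips unbuildable commits):
#   git bisect start
#   git bisect bad HEAD
#   git bisect good 9515c6131aecaccc955fdedcfe16c3e030aaefcb
#   git bisect run sh -c 'cmake --build build -j || exit 125; \
#       SPEED=$(./build/bin/llama-cli ... 2>&1 | tps); \
#       awk -v s="$SPEED" "BEGIN { exit (s < 3) }"'
```

`git bisect run` treats a zero exit as "good" and nonzero as "bad", so the final `awk` call turns the measured speed into that verdict; a handful of builds should isolate the offending commit between August 5th and HEAD.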