I'm on Linux Mint 22 with an RTX 3070 Mobile and 64 GB of RAM, and I launch llama-server (as a backend for SillyTavern) with:
```
GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 ./llama-server -m ../text-generation-webui-main/models/GLM-4.5-Air-UD-Q3_K_XL-00001-of-00002.gguf --n-gpu-layers 99 --threads 7 --prio 2 -fa --batch-size 2048 --ubatch-size 2048 --ctx-size 12000 --jinja -ot "blk.([0-1]).ffn.*=CUDA0,.ffn_.*_exps=CPU"
```
With recent updates (compiled from the llama.cpp source), generation speed for GLM 4.5 Air drops to around 0.5 tokens per second, much slower than my usual 6 tokens per second. If I revert to an older version such as llama.cpp-9515c6131aecaccc955fdedcfe16c3e030aaefcb (last updated 5th of August), I'm back to my normal speed. Am I the only one experiencing this?
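Since the fast and slow commits are both known, `git bisect` can pinpoint the exact commit that introduced the slowdown. Below is a minimal sketch: the `tps` helper parses the "tokens per second" figure that llama.cpp prints in its timing summary, and the commented bisect commands show how it could drive an automated run. The 3 t/s pass/fail threshold and the build/run details in the comments are assumptions for illustration, not from the original post.

```shell
# Pull the generation speed out of a llama.cpp timing line, e.g.:
#   "llama_print_timings: eval time = ... ms / ... runs (... ms per token, 6.00 tokens per second)"
tps() { grep -o '[0-9.]* tokens per second' | awk '{print $1}'; }

# Example with the "normal" speed reported above:
echo 'llama_print_timings: eval time = 1000.00 ms / 6 runs (166.67 ms per token, 6.00 tokens per second)' | tps
# → 6.00

# With that helper, bisecting inside the llama.cpp checkout could look like
# (threshold and build command are assumptions; exit 125 skips unbuildable commits):
#   git bisect start
#   git bisect bad HEAD
#   git bisect good 9515c6131aecaccc955fdedcfe16c3e030aaefcb
#   git bisect run sh -c 'cmake --build build -j || exit 125; \
#       SPEED=$(./build/bin/llama-cli ... 2>&1 | tps); \
#       awk -v s="$SPEED" "BEGIN { exit (s < 3) }"'
```

`git bisect run` treats a zero exit as "good" and nonzero as "bad", so the final `awk` call turns the measured speed into that verdict; a handful of builds should isolate the offending commit between August 5th and HEAD.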