Name and Version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
Device 1: Tesla P40, compute capability 6.1, VMM: no
version: 5944 (36c1532)
built with MSVC 19.44.35208.0 for x64
Operating systems
Windows
GGML backends
CUDA
Hardware
Ryzen 5800x
3090 + P40
Models
magnum-v4-22b-Q6_K.gguf
TheSkullery_L3.3-Unnamed-Exp-70B-v0.8-IQ4_XS.gguf
Problem description & steps to reproduce
Running with the environment variable LLAMA_SET_ROWS=0 results in normal output; setting it to 1 results in gibberish.
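For reference, a minimal sketch of how the variable is toggled before launching (PowerShell syntax; the launch command is the same one listed in the log section at the end):
$env:LLAMA_SET_ROWS = "0"   # normal output
$env:LLAMA_SET_ROWS = "1"   # gibberish on this dual-GPU setup
.\llama-server.exe -m D:\text-generation-webui\models\magnum-v4-22b-Q6_K.gguf -ngl 99 -ts 40/43 -fa -c 32768
An example chat transcript of the gibberish follows.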
Helpful AI
21 July 2025 7:42 AM
How can I help?
12t
User
21 July 2025 7:43 AM
Can you recite for me the intro to sesame street
6.4s
85t
Helpful AI
21 July 2025 8:25 PM
Sunnynynynyy dayyy day,,
day!…
AItSunIt''''ssIts
unnme of
time for for to to
talk you play
play S S
S
ThisSThis is is is is the is a the way street best w song sest
way of way you a
to explore to be come in a
to play learn
play
However, if I restrict inference to just a single GPU (either the P40 or the 3090), LLAMA_SET_ROWS=1 works with no issues.
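One way to reproduce that single-GPU case is to hide one device before launching (a sketch only; CUDA_VISIBLE_DEVICES is the standard CUDA way to expose a single GPU, not necessarily how the runs above were restricted):
$env:CUDA_VISIBLE_DEVICES = "0"   # expose only device 0 (the 3090)
$env:LLAMA_SET_ROWS = "1"
.\llama-server.exe -m D:\text-generation-webui\models\magnum-v4-22b-Q6_K.gguf -ngl 99 -fa -c 16368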
First Bad Commit
Haven't tested earlier versions.
Relevant log output
With LLAMA_SET_ROWS=1
Gibberish:
.\llama-server.exe -m D:\text-generation-webui\models\TheSkullery_L3.3-Unnamed-Exp-70B-v0.8-IQ4_XS.gguf -ngl 99 -ts 40/43 -fa -c 32768
.\llama-server.exe -m D:\text-generation-webui\models\magnum-v4-22b-Q6_K.gguf -ngl 99 -ts 40/43 -fa -c 32768
Sane (I can't fit 32k ctx on one GPU):
.\llama-server.exe -m D:\text-generation-webui\models\magnum-v4-22b-Q6_K.gguf -ngl 99 -fa -c 16368
.\llama-server.exe -m D:\text-generation-webui\models\magnum-v4-22b-Q6_K.gguf -ngl 53 -fa -c 32768 (4 layers offloaded to the CPU)