
Eval bug: LLAMA_SET_ROWS=1 gibberish output with Dual GPU offload #14795

@askmyteapot

Description


Name and Version

ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
Device 1: Tesla P40, compute capability 6.1, VMM: no
version: 5944 (36c1532)
built with MSVC 19.44.35208.0 for x64

Operating systems

Windows

GGML backends

CUDA

Hardware

Ryzen 5800x
3090 + P40

Models

magnum-v4-22b-Q6_K.gguf
TheSkullery_L3.3-Unnamed-Exp-70B-v0.8-IQ4_XS.gguf

Problem description & steps to reproduce

Running with the environment variable 'LLAMA_SET_ROWS=0' results in normal output; setting it to 1 results in gibberish.
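For reference, the toggle is just an environment variable read at server startup. A minimal sketch of the failing run (bash syntax for illustration; the actual runs below used PowerShell, and the model path/flags are taken from the log output section):

```shell
# Gibberish case: set-rows path enabled, model split across both GPUs
export LLAMA_SET_ROWS=1
# ./llama-server.exe -m magnum-v4-22b-Q6_K.gguf -ngl 99 -ts 40/43 -fa -c 32768

# Normal output: identical command with the path disabled
# export LLAMA_SET_ROWS=0
```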


Helpful AI
21 July 2025 7:42 AM

How can I help?
User
21 July 2025 7:43 AM

Can you recite for me the intro to sesame street
Helpful AI
21 July 2025 8:25 PM

Sunnynynynyy dayyy day,,
day!…
AItSunIt''''ssIts
unnme of
time for for to to
talk you play
play S S
S
ThisSThis is is is is the is a the way street best w song sest
way of way you a
to explore to be come in a
to play learn
play

However, if I restrict inference to a single GPU (either the P40 or the 3090), LLAMA_SET_ROWS=1 works with no issues.
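The report doesn't say how the single-GPU restriction was done; one common way (an assumption on my part, not confirmed by the logs) is hiding the second device with CUDA_VISIBLE_DEVICES before launching:

```shell
# Hypothetical single-GPU repro: expose only device 0 (the RTX 3090)
# so llama.cpp cannot split the model across both cards
export CUDA_VISIBLE_DEVICES=0
export LLAMA_SET_ROWS=1
# ./llama-server.exe -m magnum-v4-22b-Q6_K.gguf -ngl 99 -fa -c 16368
```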

First Bad Commit

Haven't tested earlier versions.

Relevant log output

With LLAMA_SET_ROWS=1
Gibberish:
.\llama-server.exe -m D:\text-generation-webui\models\TheSkullery_L3.3-Unnamed-Exp-70B-v0.8-IQ4_XS.gguf -ngl 99 -ts 40/43 -fa -c 32768
.\llama-server.exe -m D:\text-generation-webui\models\magnum-v4-22b-Q6_K.gguf -ngl 99 -ts 40/43 -fa -c 32768

Sane (I can't fit 32k ctx on one GPU):
.\llama-server.exe -m D:\text-generation-webui\models\magnum-v4-22b-Q6_K.gguf -ngl 99 -fa -c 16368
.\llama-server.exe -m D:\text-generation-webui\models\magnum-v4-22b-Q6_K.gguf -ngl 53 -fa -c 32768 (4 layers offloaded to CPU)
