Worse speed and GPU load than pure llama-cpp #1831

Mushoz · 2024-11-14T13:52:18Z

Mushoz
Nov 14, 2024

I am trying to use llama-cpp-python, but I am getting 22 tokens per second instead of the 25 tokens per second that I usually get under regular llama-cpp. Furthermore, looking at the GPU load, it's only hitting about 80% ish GPU load versus 100% load with pure llama-cpp. How can I get llama-cpp-python to perform the same?

I am running both in docker with the same base image, so I should be getting identical speeds in both. Here is the Dockerfile for llama-cpp with good performance:

FROM rocm/pytorch

WORKDIR /root/
ARG ROCM_TARGET_LST=/root/gfx
RUN echo "gfx1100" > /root/gfx
RUN rocm_agent_enumerator

RUN git clone https://github.com/ggerganov/llama.cpp.git
WORKDIR /root/llama.cpp

RUN make GGML_HIPBLAS=1 -j$(nproc)

ENTRYPOINT ["bash"]

Performance is evaluated through the following command: ./llama-cli -m /models/Qwen2.5-Coder-32B-Instruct-Q4_K_S.gguf -ngl 999 -p "Please write a minesweeper game using html, css and js. Please only output the codeblocks. Do NOT give any explanations." -n 1024

The Dockerfile for llama-cpp-python looks as follows:

FROM rocm/pytorch

WORKDIR /root/
ARG ROCM_TARGET_LST=/root/gfx
RUN echo "gfx1100" > /root/gfx
RUN rocm_agent_enumerator

RUN git clone --recurse-submodules https://github.com/abetlen/llama-cpp-python.git
WORKDIR /root/llama-cpp-python

ARG CMAKE_ARGS="-DGGML_HIPBLAS=ON -DAMDGPU_TARGETS=gfx1100 -DCMAKE_C_COMPILER=clang -DCMAKE_CXX_COMPILER=clang++"
RUN pip install --upgrade pip
RUN pip install -e .
RUN pip install -e .[server]

ENTRYPOINT ["python3", "-m", "llama_cpp.server", "--config_file", "/root/models.conf"]

The models.conf looks as follows:

{
    "host": "0.0.0.0",
    "port": 8000,
    "models": [
        {
            "model": "/models/Qwen2.5-Coder-32B-Instruct-Q4_K_S.gguf",
            "model_alias": "Qwen2.5-Coder-32B",
            "n_gpu_layers": -1,
            "offload_kqv": true
        }
    ]
}

Performance under llama-cpp-python is evaluated through checking the logs of the server, while querying the server through openweb ui connected through the OpenAI compatible API.

I do see something in the logs when using llama-cpp-python that I do not see when using llama-cpp which might explain the difference in GPU load and performance, as some tensors are running on the CPU for some reason. But I do not know why and how I can fix this. Does anyone have any idea?

llm_load_tensors: tensor 'token_embd.weight' (q4_K) (and 0 others) cannot be used with preferred buffer type ROCm_Host, using CPU instead
llm_load_tensors: offloading 64 repeating layers to GPU
llm_load_tensors: offloading output layer to GPU
llm_load_tensors: offloaded 65/65 layers to GPU
llm_load_tensors: CPU_Mapped model buffer size =   417.66 MiB
llm_load_tensors:      ROCm0 model buffer size = 17490.85 MiB
warning: failed to mlock 1082613760-byte buffer (after previously locking 0 bytes): Cannot allocate memory
Try increasing RLIMIT_MEMLOCK ('ulimit -l' as root).

Answered by Mushoz

Nov 14, 2024

Managed to find the answer myself. For some reason the logits_all parameter defaults to true and tanks performance. Setting it to false brings the performance on par with pure llama-cpp. Not sure if that's a sensible default, but at least I managed to solve the problem. GPU load is also back to 100% again.

View full answer

Mushoz · 2024-11-14T20:28:11Z

Mushoz
Nov 14, 2024
Author

Managed to find the answer myself. For some reason the logits_all parameter defaults to true and tanks performance. Setting it to false brings the performance on par with pure llama-cpp. Not sure if that's a sensible default, but at least I managed to solve the problem. GPU load is also back to 100% again.

2 replies

ExtReMLapin Nov 15, 2024

Thanks for posting out the answer here

gl2007 Dec 31, 2024

Managed to find the answer myself. For some reason the logits_all parameter defaults to true and tanks performance. Setting it to false brings the performance on par with pure llama-cpp. Not sure if that's a sensible default, but at least I managed to solve the problem. GPU load is also back to 100% again.

Can you please specify where to make the change for the parameter? Will be useful for newbies :)

exilebuild5423 · 2025-08-08T05:30:35Z

exilebuild5423
Aug 8, 2025

Yeah, I noticed something similar when I tried switching setups. On paper it shouldn’t be slower, but my GPU load was all over the place compared to running plain llama cpp directly. Not sure if it’s a config thing or just overhead from whatever wrapper I’m using.

Has anyone actually profiled both side by side? I’m curious if the drop is from memory handling or something else entirely.

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Worse speed and GPU load than pure llama-cpp #1831

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{editor}}'s edit

{{editor}}'s edit

Uh oh!

Replies: 2 comments 2 replies

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Select a reply

Uh oh!

Worse speed and GPU load than pure llama-cpp #1831

Uh oh!

Uh oh!

Mushoz Nov 14, 2024

Replies: 2 comments · 2 replies

Uh oh!

Mushoz Nov 14, 2024 Author

Uh oh!

ExtReMLapin Nov 15, 2024

Uh oh!

gl2007 Dec 31, 2024

Uh oh!

exilebuild5423 Aug 8, 2025

Mushoz
Nov 14, 2024

Replies: 2 comments 2 replies

Mushoz
Nov 14, 2024
Author

exilebuild5423
Aug 8, 2025