I am trying to use llama-cpp-python, but I am getting 22 tokens per second instead of the 25 tokens per second I usually get with plain llama-cpp. Looking at the GPU load, it only hits about 80% with llama-cpp-python versus 100% with pure llama-cpp. How can I get llama-cpp-python to perform the same? I am running both in Docker with the same base image, so I would expect identical speeds in both. Here is the Dockerfile for llama-cpp with good performance:
Performance is evaluated through the following command:

The Dockerfile for llama-cpp-python looks as follows:
The models.conf looks as follows:
Performance under llama-cpp-python is evaluated by checking the server logs while querying the server through Open WebUI, connected via the OpenAI-compatible API. I do see something in the llama-cpp-python logs that I do not see with llama-cpp, which might explain the difference in GPU load and performance: some tensors are running on the CPU for some reason. I do not know why, or how to fix it. Does anyone have any idea?
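For reference, here is roughly how I collect the llama-cpp-python number outside of Open WebUI. This is only a minimal sketch: the port (8000) and the model alias are placeholders for whatever your models.conf actually defines, and it assumes the server exposes the standard OpenAI-compatible /v1/completions endpoint.

```python
# Rough throughput check against the OpenAI-compatible endpoint.
# Assumes the llama-cpp-python server listens on localhost:8000 and that
# "my-model" matches an alias defined in models.conf (both placeholders).
import time
import requests

URL = "http://localhost:8000/v1/completions"
payload = {
    "model": "my-model",
    "prompt": "Write a short story about a lighthouse keeper.",
    "max_tokens": 256,
    "temperature": 0.0,
}

start = time.time()
resp = requests.post(URL, json=payload, timeout=600)
resp.raise_for_status()
elapsed = time.time() - start

completion_tokens = resp.json()["usage"]["completion_tokens"]
print(f"{completion_tokens} tokens in {elapsed:.1f}s "
      f"-> {completion_tokens / elapsed:.1f} tok/s")
```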
Yeah, I noticed something similar when I tried switching setups. On paper it shouldn't be slower, but my GPU load was all over the place compared to running plain llama-cpp directly. Not sure if it's a config thing or just overhead from whatever wrapper I'm using. Has anyone actually profiled both side by side? I'm curious if the drop is from memory handling or something else entirely.
Managed to find the answer myself. For some reason the `logits_all` parameter defaults to `true` and tanks performance. Setting it to `false` brings the performance on par with pure llama-cpp. Not sure if that's a sensible default, but at least I managed to solve the problem. GPU load is also back to 100% again.
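For anyone hitting the same issue, here is a minimal sketch of what the fix looks like through the Python API (the model path and prompt below are placeholders; if you serve through llama_cpp.server, the same option should be settable per model in the config file as well):

```python
from llama_cpp import Llama

# Explicitly disable logits_all so logits are only kept for the last token
# of each evaluated batch instead of for every position.
llm = Llama(
    model_path="/models/your-model.gguf",  # placeholder path
    n_gpu_layers=-1,                       # offload all layers to the GPU
    logits_all=False,
)

out = llm("Q: Why is the sky blue? A:", max_tokens=64)
print(out["choices"][0]["text"])
```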