Describe the bug
When closing Newelle or switching between models, the backend process (llama-server) is not terminated correctly. It keeps running in the background as an orphaned process, holding onto the VRAM.
When trying to load a new model or restart the app, the initialization fails with a 503 error because the GPU VRAM is still occupied by the previous instance.
To Reproduce
- Open Newelle (Flatpak version / Newelle 1.2.0).
- Load a model.
- Close Newelle via the window close button OR try to switch to another model.
- Check the process list (`ps aux | grep llama`) or GPU usage (`nvidia-smi`).
- The `llama-server` process is still running and VRAM is full.
- Restarting Newelle fails with a "Loading model" timeout (Error 503) due to OOM.
Expected behavior
The `llama-server` child process should be killed immediately when the main GUI closes or when a model is unloaded.
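I don't know how Newelle actually spawns the backend, but as a sketch of the expected behavior (assuming a `subprocess.Popen`-style launcher; `spawn_server` and the command list are hypothetical names, not Newelle's real API), the server could be started in its own process group and a cleanup handler registered so the whole group is killed when the app exits or the model is unloaded:

```python
import atexit
import os
import signal
import subprocess


def spawn_server(cmd):
    """Start the backend in its own process group and return (proc, cleanup).

    start_new_session=True makes the child a session/group leader, so
    killing its process group also reaches any workers it forks.
    """
    proc = subprocess.Popen(cmd, start_new_session=True)

    def cleanup():
        if proc.poll() is not None:
            return  # already exited, nothing to do
        try:
            # Ask the whole group to terminate (frees the VRAM).
            os.killpg(os.getpgid(proc.pid), signal.SIGTERM)
            proc.wait(timeout=5)
        except (ProcessLookupError, subprocess.TimeoutExpired):
            # Escalate if it ignores SIGTERM or is already gone.
            try:
                os.killpg(os.getpgid(proc.pid), signal.SIGKILL)
            except ProcessLookupError:
                pass

    # Runs on normal interpreter exit; model switching could call
    # cleanup() explicitly before starting the next server.
    atexit.register(cleanup)
    return proc, cleanup
```

Inside the Flatpak sandbox the child runs under the same `bwrap` supervisor, so signaling the process group from the app itself (rather than relying on the sandbox to reap it) should still work.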
Logs
Error in GUI:

```
Error code: 503 ... 'type': 'unavailable_error'
```

Terminal output showing OOM:

```
ggml_backend_cuda_buffer_type_alloc_buffer: allocating ... MiB on device 0: cudaMalloc failed: out of memory
```

`nvidia-smi` output (after closing the app): the `llama-server` process is still holding the VRAM.
System Info
- OS: Arch Linux
- Installation: Flatpak (io.github.qwersyk.Newelle)
- GPU: RTX 3060 (12GB)
Additional Context
Since this is the Flatpak version, this might be related to how the sandbox handles child process termination signals, but I'm not sure about that.

Workaround: manually running `killall -9 llama-server` (or `pkill -9 llama-server` if `killall` is not available on your system) releases the VRAM and allows the app to start again.