llama-server process not killed on exit/model switch (Zombie process leads to OOM) #339

@Jannis161

Description

Describe the bug
When closing Newelle or switching between models, the backend process (llama-server) is not terminated correctly. It keeps running in the background as an orphaned process (listed as a leftover "zombie" in the title), holding onto the VRAM.
When a new model is then loaded, or the app is restarted, initialization fails with a 503 error because the GPU VRAM is still occupied by the previous instance.

To Reproduce

  1. Open Newelle (Flatpak version / Newelle 1.2.0).
  2. Load a model.
  3. Close Newelle via the window close button OR try to switch to another model.
  4. Check process list (ps aux | grep llama) or GPU usage (nvidia-smi).
  5. The llama-server process is still running and VRAM is full.
  6. Restarting Newelle fails with "Loading model" timeout (Error 503) due to OOM.

Expected behavior
The llama-server child process should be killed immediately when the main GUI closes or when a model is unloaded.
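For reference, a minimal sketch of the expected behavior: spawning llama-server in its own process group and guaranteeing termination on exit or model switch. The class and command names below are illustrative, not Newelle's actual internals.

```python
import atexit
import os
import signal
import subprocess


class ManagedServer:
    """Sketch: spawn a backend process and make sure it is terminated
    on model switch or application exit. Names are hypothetical."""

    def __init__(self, cmd):
        # start_new_session puts the child in its own process group,
        # so the whole group can be signalled (the server may fork workers).
        self.proc = subprocess.Popen(cmd, start_new_session=True)
        atexit.register(self.stop)  # runs on normal interpreter exit

    def stop(self):
        if self.proc.poll() is not None:
            return  # already exited
        pgid = os.getpgid(self.proc.pid)
        os.killpg(pgid, signal.SIGTERM)  # polite shutdown first
        try:
            self.proc.wait(timeout=5)
        except subprocess.TimeoutExpired:
            # escalate: force-kill so the VRAM is actually released
            os.killpg(pgid, signal.SIGKILL)
            self.proc.wait()
```

Calling stop() before loading the next model (and relying on the atexit hook when the GUI closes) would cover both failure paths described above. Note that atexit does not fire if the main process itself is killed with SIGKILL, so a signal handler for SIGTERM/SIGINT may also be needed.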

Logs
Error in GUI:
Error code: 503 ... 'type': 'unavailable_error'

Terminal output showing OOM:
ggml_backend_cuda_buffer_type_alloc_buffer: allocating ... MiB on device 0: cudaMalloc failed: out of memory

nvidia-smi output (after closing App):
Process llama-server still holding the VRAM.

System Info

  • OS: Arch Linux
  • Installation: Flatpak (io.github.qwersyk.Newelle)
  • GPU: RTX 3060 (12GB)

Additional Context
Since this is the Flatpak version, this might be related to how the sandbox handles child-process termination signals, but I'm not sure about that. Workaround: manually running killall -9 llama-server releases the VRAM and allows the app to start again.
(Or pkill -9 llama-server if killall is not available on your system.)
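The manual workaround can be wrapped in a small script to run before starting Newelle; this is just a sketch, and it tries a graceful SIGTERM before escalating to SIGKILL:

```shell
#!/bin/sh
# Workaround sketch: free VRAM held by a leftover llama-server
# before starting Newelle again.
if pgrep -x llama-server >/dev/null; then
    echo "Found leftover llama-server, terminating it..."
    pkill -TERM -x llama-server
    sleep 2
    if pgrep -x llama-server >/dev/null; then
        # still alive after SIGTERM: force-kill to release the VRAM
        pkill -KILL -x llama-server
    fi
fi
```

When Newelle runs inside the Flatpak sandbox, the leftover llama-server is still visible to the host, so running this on the host works; whether it also needs to be run via flatpak-spawn from inside the sandbox is something I haven't verified.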
