
Conversation

ericcurtin
Collaborator

llama-run works fine, but falls well behind llama-server functionality. Integrate llama-server with llama-run.

ericcurtin requested a review from Copilot on September 5, 2025 at 14:06
@ericcurtin
Collaborator Author

@ggerganov @ngxson WDYT?


Copilot AI left a comment


Pull Request Overview

This PR rewrites the llama-run tool to use llama-server as its backend instead of implementing its own inference logic. The change significantly reduces code complexity while maintaining the interactive chat functionality by delegating the model loading and inference tasks to the existing llama-server infrastructure.

Key Changes

  • Complete rewrite of the inference engine to use HTTP client communication with llama-server
  • Replacement of direct llama.cpp API calls with REST API requests to a spawned server process
  • Simplification of command-line argument handling by passing most options through to llama-server

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.

  • tools/run/run.cpp: complete rewrite from ~1,280 lines to ~400 lines, replacing direct model inference with an HTTP client that communicates with a spawned llama-server process
  • tools/run/README.md: updated the example command to reflect the new usage pattern

ericcurtin force-pushed the rewrite-llama-run-to-be-llama-server-based branch 8 times, most recently from 219906c to e629135, on September 5, 2025 at 16:04
llama-run works fine, but falls well behind llama-server functionality.
Integrate llama-server with llama-run.

Signed-off-by: Eric Curtin <[email protected]>
ericcurtin force-pushed the rewrite-llama-run-to-be-llama-server-based branch from e629135 to 7b717fb on September 5, 2025 at 16:22
@ngxson
Collaborator

ngxson commented Sep 5, 2025

I feel like writing a shim layer for httplib::Server would be a cleaner approach.

Simply override the Get, Post, etc. functions to store the handler function somewhere and call it later. bind_to_port can be overridden to be a no-op, so the app does not use any sockets.

@ericcurtin
Collaborator Author

@ngxson how would you feel about getting this in first? I think the shim layer would be neat, but it's going to be a bigger undertaking than this.

@ngxson
Copy link
Collaborator

ngxson commented Sep 5, 2025

IMO, the goal of examples / tools should be to demonstrate how to use llama.cpp in a downstream project. Honestly, the current proposal is a bit hacky. While it still somewhat fulfills the goal, most people will just go with an easier approach than libcurl. Ultimately, it's also not reusable for projects where binding a port is not permitted.

A shim layer should be similar to, or even less complicated than, your HttpClient class:

#include <functional>
#include <string>
#include <unordered_map>

// Request and Response are minimal stand-ins, sketched further down.
class Server {
public:
  using Handler = std::function<void(const Request &, Response &)>;

  // Handlers are keyed by "METHOD pattern" and invoked in-process later,
  // instead of being bound to a listening socket.
  std::unordered_map<std::string, Handler> handlers;

  Server &Get(const std::string &pattern, Handler handler) {
    handlers["GET " + pattern] = handler;
    return *this;
  }
  Server &Post(const std::string &pattern, Handler handler) {
    handlers["POST " + pattern] = handler;
    return *this;
  }
  // ...
};

Of course, you also need to provide your own versions of the Request and Response classes. But you get the idea.
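
For illustration, here is a minimal sketch of what those stand-in Request and Response types and an in-process dispatch could look like. The type names mirror httplib's, but the fields, the dispatch helper, and the "METHOD pattern" key format are assumptions made for this sketch, not httplib's actual API:

#include <functional>
#include <string>
#include <unordered_map>

// Hypothetical stand-ins for httplib::Request / httplib::Response;
// only the members a socket-free dispatcher would need.
struct Request {
    std::string method;   // "GET", "POST", ... (uppercase, to match the handler keys)
    std::string path;
    std::string body;
};

struct Response {
    int         status = 200;
    std::string body;

    void set_content(const std::string & content, const std::string & /*mime_type*/) {
        body = content;
    }
};

using Handler = std::function<void(const Request &, Response &)>;

// In-process dispatch: look up the stored handler and call it directly,
// no port binding involved.
void dispatch(const std::unordered_map<std::string, Handler> & handlers,
              const Request & req, Response & res) {
    auto it = handlers.find(req.method + " " + req.path);
    if (it == handlers.end()) {
        res.status = 404;
        return;
    }
    it->second(req, res);
}

// Usage with the Server sketch above (server.handlers is its route table):
//   Request  req{"POST", "/completion", R"({"prompt":"hi"})"};
//   Response res;
//   dispatch(server.handlers, req, res);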

Nevertheless, I don't know why my opinion is important here, given that many of my inputs about llama-run haven't been taken seriously in the past (for example, being ollama-compat or re-using components from libcommon). The current proposal is obviously something I have already brought up a long time ago.

@ericcurtin
Collaborator Author

ericcurtin commented Sep 5, 2025

I'm not against doing both, FWIW. And 1,000 fewer lines of code to maintain isn't a terrible thing (although if someone complained, it wouldn't be a big deal to bring the old version back).

The thing is, most projects can bind to a port, and in most cases it's preferable, because talking to inference servers via HTTP is the de facto standard.

I also wonder whether the shim solution is more suited to something like:

llama-server --cli

or else we make server.cpp just a couple of lines of code, a main function, with run.cpp and server.cpp sharing 99% of their code in some common file. I think this would be cool, it's just going to take a while.

With two processes talking via HTTP, it's easy just to pass all the args to llama-server.
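
As an illustration of that pass-through, here is a hedged POSIX sketch of spawning llama-server with the caller's options forwarded verbatim. The helper name is hypothetical and this is not necessarily the PR's exact spawning code; --port and /health are standard llama-server features:

#include <sys/types.h>
#include <unistd.h>
#include <cstdio>
#include <cstdlib>
#include <string>
#include <vector>

// Hypothetical helper: fork and exec llama-server, forwarding every CLI
// option unchanged and pinning the port the client will talk to.
static pid_t spawn_llama_server(int argc, char ** argv, const std::string & port) {
    pid_t pid = fork();
    if (pid == 0) {
        std::vector<char *> args;
        args.push_back(const_cast<char *>("llama-server"));
        for (int i = 1; i < argc; ++i) {
            args.push_back(argv[i]);                 // pass-through of all user options
        }
        args.push_back(const_cast<char *>("--port"));
        args.push_back(const_cast<char *>(port.c_str()));
        args.push_back(nullptr);
        execvp("llama-server", args.data());
        perror("execvp llama-server");               // only reached if exec fails
        _exit(EXIT_FAILURE);
    }
    return pid;  // parent: poll the server's /health endpoint before sending requests
}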

In C++, I don't agree that most people will go for an easier approach than libcurl; libcurl is the most widely deployed C/C++ HTTP client library in the world, although there are plenty of options for an HTTP client in C++.

There are easier options in other languages; you can do many of these things with golang/python3/etc., but meh, I don't actually find libcurl too bad to code with. And we would be introducing a new language to "tools" and would have to mess around with the build/install system a bit, so I'm not sure it's worth it.

The other thing about libcurl is that it's packaged everywhere, so you can rely on package maintainers to patch CVEs, etc., while retaining backwards compatibility better than other HTTP client libraries.
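
To make the libcurl side concrete, here is a minimal hedged sketch of posting a completion request to a locally spawned llama-server and printing the raw JSON reply. The port, prompt, and n_predict values are illustrative, and the PR's actual client code may use different endpoints or streaming:

#include <curl/curl.h>
#include <cstdio>
#include <string>

// Accumulate the response body into a std::string.
static size_t collect(char * data, size_t size, size_t nmemb, void * userp) {
    static_cast<std::string *>(userp)->append(data, size * nmemb);
    return size * nmemb;
}

int main() {
    curl_global_init(CURL_GLOBAL_DEFAULT);
    CURL * curl = curl_easy_init();
    if (!curl) { return 1; }

    std::string  response;
    const char * body = "{\"prompt\": \"Hello\", \"n_predict\": 32}";

    struct curl_slist * headers = curl_slist_append(nullptr, "Content-Type: application/json");
    curl_easy_setopt(curl, CURLOPT_URL, "http://127.0.0.1:8080/completion");
    curl_easy_setopt(curl, CURLOPT_HTTPHEADER, headers);
    curl_easy_setopt(curl, CURLOPT_POSTFIELDS, body);
    curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, collect);
    curl_easy_setopt(curl, CURLOPT_WRITEDATA, &response);

    CURLcode rc = curl_easy_perform(curl);
    if (rc == CURLE_OK) {
        printf("%s\n", response.c_str());   // raw JSON from the server
    }

    curl_slist_free_all(headers);
    curl_easy_cleanup(curl);
    curl_global_cleanup();
    return rc == CURLE_OK ? 0 : 1;
}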

@ericcurtin
Collaborator Author

ericcurtin commented Sep 6, 2025

And for the record, I'm not crazy picky about people rewriting bits of the code I have written (including the shim you have mentioned) or evolving it in future. linenoise.cpp is another example. Sometimes it's tempting to write an MIT-licensed C++ command-prompt library from scratch to replace linenoise.cpp with something that also supports Windows. Nothing like that seems to exist, because Windows is so drastically different from Unix/Linux as regards terminals. But that's not a small undertaking either, and Windows users tend not to use the command line anyway like Unix/Linux people do.
