
Conversation

ericcurtin
Collaborator

llama-run works fine, but falls well behind llama-server functionality. Integrate llama-server with llama-run.

ericcurtin requested a review from Copilot on September 5, 2025 at 14:06
@ericcurtin
Collaborator Author

@ggerganov @ngxson WDYT?


Copilot AI left a comment


Pull Request Overview

This PR rewrites the llama-run tool to use llama-server as its backend instead of implementing its own inference logic. The change significantly reduces code complexity while maintaining the interactive chat functionality by delegating the model loading and inference tasks to the existing llama-server infrastructure.

Key Changes

  • Complete rewrite of the inference engine to use HTTP client communication with llama-server
  • Replacement of direct llama.cpp API calls with REST API requests to a spawned server process
  • Simplification of command-line argument handling by passing most options through to llama-server

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.

  • tools/run/run.cpp: complete rewrite from ~1,280 lines to ~400 lines, replacing direct model inference with an HTTP client that communicates with a spawned llama-server process
  • tools/run/README.md: updated the example command to reflect the new usage pattern

ericcurtin force-pushed the rewrite-llama-run-to-be-llama-server-based branch 8 times, most recently from 219906c to e629135, on September 5, 2025 at 16:04
llama-run works fine, but falls well behind llama-server functionality.
Integrate llama-server with llama-run.

Signed-off-by: Eric Curtin <[email protected]>
ericcurtin force-pushed the rewrite-llama-run-to-be-llama-server-based branch from e629135 to 7b717fb on September 5, 2025 at 16:22
@ngxson
Collaborator

ngxson commented Sep 5, 2025

I feel like writing a shim layer for httplib::Server would be a cleaner approach.

Simply override the Get, Post, etc. functions to store the handler function somewhere and call it later. bind_to_port can be overridden to be a no-op, so the app does not use any sockets.

@ericcurtin
Collaborator Author

@ngxson how would you feel about getting this in first? I think the shim layer would be neat, but it's going to be a bigger undertaking than this.

@ngxson
Copy link
Collaborator

ngxson commented Sep 5, 2025

IMO, the goal of examples / tools should be to demonstrate how to use llama.cpp in a downstream project. Honestly, the current proposal is a bit hacky. While it still somewhat fulfills the goal, most people will just go with an easier approach than libcurl. Ultimately, it's also not reusable for projects where binding a port is not permitted.

A shim layer should be similar to, or even less complicated than, your HttpClient class:

#include <functional>
#include <string>
#include <unordered_map>

// Request and Response are minimal stand-ins, sketched further down.
class Server {
public:
  using Handler = std::function<void(const Request &, Response &)>;

  // Handlers are keyed by "METHOD pattern" and invoked in-process later,
  // instead of being bound to a listening socket.
  std::unordered_map<std::string, Handler> handlers;

  Server &Get(const std::string &pattern, Handler handler) {
    handlers["GET " + pattern] = handler;
    return *this;
  }
  Server &Post(const std::string &pattern, Handler handler) {
    handlers["POST " + pattern] = handler;
    return *this;
  }
  // ...
};

Of course, you also need to provide your own versions of the Request and Response classes. But you get the idea.
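
For illustration, here is a minimal sketch of what those stand-in Request and Response types and an in-process dispatch could look like. The type names mirror httplib's, but the fields, the dispatch helper, and the "METHOD pattern" key format are assumptions made for this sketch, not httplib's actual API:

#include <functional>
#include <string>
#include <unordered_map>

// Hypothetical stand-ins for httplib::Request / httplib::Response;
// only the members a socket-free dispatcher would need.
struct Request {
    std::string method;   // "GET", "POST", ... (uppercase, to match the handler keys)
    std::string path;
    std::string body;
};

struct Response {
    int         status = 200;
    std::string body;

    void set_content(const std::string & content, const std::string & /*mime_type*/) {
        body = content;
    }
};

using Handler = std::function<void(const Request &, Response &)>;

// In-process dispatch: look up the stored handler and call it directly,
// no port binding involved.
void dispatch(const std::unordered_map<std::string, Handler> & handlers,
              const Request & req, Response & res) {
    auto it = handlers.find(req.method + " " + req.path);
    if (it == handlers.end()) {
        res.status = 404;
        return;
    }
    it->second(req, res);
}

// Usage with the Server sketch above (server.handlers is its route table):
//   Request  req{"POST", "/completion", R"({"prompt":"hi"})"};
//   Response res;
//   dispatch(server.handlers, req, res);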

Nevertheless, I don't know why my opinion is important here, given that many of my inputs about llama-run haven't been taken seriously in the past (for example, being ollama-compat or re-using components from libcommon). The current proposal is obviously something I have already brought up a long time ago.

@ericcurtin
Collaborator Author

ericcurtin commented Sep 5, 2025

I'm not against doing both, FWIW. And 1,000 fewer lines of code to maintain isn't a terrible thing (although if someone complained, it wouldn't be a big deal to bring the old version back).

The thing is, most projects can bind to a port, and in most cases it's preferable, because talking to inference servers via HTTP is the de facto standard.

I also wonder whether the shim solution is more suited to something like:

llama-server --cli

or else we make server.cpp just a couple of lines of code, a main function, with run.cpp and server.cpp sharing 99% of their code in some common file. I think this would be cool, it's just going to take a while.

With two processes talking via HTTP, it's easy just to pass all the args to llama-server.
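
As an illustration of that pass-through, here is a hedged POSIX sketch of spawning llama-server with the caller's options forwarded verbatim. The helper name is hypothetical and this is not necessarily the PR's exact spawning code; --port and /health are standard llama-server features:

#include <sys/types.h>
#include <unistd.h>
#include <cstdio>
#include <cstdlib>
#include <string>
#include <vector>

// Hypothetical helper: fork and exec llama-server, forwarding every CLI
// option unchanged and pinning the port the client will talk to.
static pid_t spawn_llama_server(int argc, char ** argv, const std::string & port) {
    pid_t pid = fork();
    if (pid == 0) {
        std::vector<char *> args;
        args.push_back(const_cast<char *>("llama-server"));
        for (int i = 1; i < argc; ++i) {
            args.push_back(argv[i]);                 // pass-through of all user options
        }
        args.push_back(const_cast<char *>("--port"));
        args.push_back(const_cast<char *>(port.c_str()));
        args.push_back(nullptr);
        execvp("llama-server", args.data());
        perror("execvp llama-server");               // only reached if exec fails
        _exit(EXIT_FAILURE);
    }
    return pid;  // parent: poll the server's /health endpoint before sending requests
}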

In C++, I don't agree that most people will go for an easier approach than libcurl; libcurl is the most widely deployed C/C++ HTTP client library in the world, although there are plenty of options for an HTTP client in C++.

There are easier options in other languages; you can do many of these things with golang/python3/etc., but meh, I don't actually find libcurl too bad to code with. And we would be introducing a new language to "tools" and would have to mess around with the build/install system a bit, so I'm not sure it's worth it.

The other thing about libcurl is that it's packaged everywhere, so you can rely on package maintainers to patch CVEs, etc., while retaining backwards compatibility better than other HTTP client libraries.
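
To make the libcurl side concrete, here is a minimal hedged sketch of posting a completion request to a locally spawned llama-server and printing the raw JSON reply. The port, prompt, and n_predict values are illustrative, and the PR's actual client code may use different endpoints or streaming:

#include <curl/curl.h>
#include <cstdio>
#include <string>

// Accumulate the response body into a std::string.
static size_t collect(char * data, size_t size, size_t nmemb, void * userp) {
    static_cast<std::string *>(userp)->append(data, size * nmemb);
    return size * nmemb;
}

int main() {
    curl_global_init(CURL_GLOBAL_DEFAULT);
    CURL * curl = curl_easy_init();
    if (!curl) { return 1; }

    std::string  response;
    const char * body = "{\"prompt\": \"Hello\", \"n_predict\": 32}";

    struct curl_slist * headers = curl_slist_append(nullptr, "Content-Type: application/json");
    curl_easy_setopt(curl, CURLOPT_URL, "http://127.0.0.1:8080/completion");
    curl_easy_setopt(curl, CURLOPT_HTTPHEADER, headers);
    curl_easy_setopt(curl, CURLOPT_POSTFIELDS, body);
    curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, collect);
    curl_easy_setopt(curl, CURLOPT_WRITEDATA, &response);

    CURLcode rc = curl_easy_perform(curl);
    if (rc == CURLE_OK) {
        printf("%s\n", response.c_str());   // raw JSON from the server
    }

    curl_slist_free_all(headers);
    curl_easy_cleanup(curl);
    curl_global_cleanup();
    return rc == CURLE_OK ? 0 : 1;
}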

@ericcurtin
Collaborator Author

ericcurtin commented Sep 6, 2025

And for the record, I'm not crazy picky about people rewriting bits of the code I have written (including the shim you have mentioned) or evolving it in future. linenoise.cpp is another example. Sometimes it's tempting to write an MIT-licensed C++ command-prompt library from scratch to replace linenoise.cpp with something that also supports Windows. Nothing like that seems to exist, because Windows is so drastically different from Unix/Linux as regards terminals. But that's not a small undertaking either, and Windows users tend not to use the command line anyway like Unix/Linux people do.
