Rewrite llama-run to use llama-server #15818
Conversation
@ggerganov @ngxson WDYT?
Pull Request Overview
This PR rewrites the llama-run tool to use llama-server as its backend instead of implementing its own inference logic. The change significantly reduces code complexity while maintaining the interactive chat functionality by delegating the model loading and inference tasks to the existing llama-server infrastructure.
Key Changes
- Complete rewrite of the inference engine to use HTTP client communication with llama-server
- Replacement of direct llama.cpp API calls with REST API requests to a spawned server process
- Simplification of command-line argument handling by passing most options through to llama-server
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| tools/run/run.cpp | Complete rewrite from ~1280 lines to ~400 lines, replacing direct model inference with an HTTP client that communicates with a spawned llama-server process |
| tools/run/README.md | Updated example command to reflect the new usage pattern |
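
For context, here is a minimal sketch of the kind of HTTP client flow described above: llama-run talking to a locally spawned llama-server via libcurl (the library discussed later in this thread). The port, payload, and overall structure are assumptions for illustration, not the PR's actual code; the request targets llama-server's OpenAI-compatible /v1/chat/completions endpoint.

```cpp
// Hypothetical sketch: POST a chat request to a locally spawned llama-server.
// Assumes the server listens on 127.0.0.1:8080; error handling is kept minimal.
#include <curl/curl.h>
#include <iostream>
#include <string>

static size_t collect(char *ptr, size_t size, size_t nmemb, void *userdata) {
    static_cast<std::string *>(userdata)->append(ptr, size * nmemb);
    return size * nmemb;
}

int main() {
    curl_global_init(CURL_GLOBAL_DEFAULT);
    CURL *curl = curl_easy_init();
    if (!curl) return 1;

    const std::string body =
        R"({"messages":[{"role":"user","content":"Hello"}],"stream":false})";
    std::string response;

    curl_slist *headers = curl_slist_append(nullptr, "Content-Type: application/json");
    curl_easy_setopt(curl, CURLOPT_URL, "http://127.0.0.1:8080/v1/chat/completions");
    curl_easy_setopt(curl, CURLOPT_HTTPHEADER, headers);
    curl_easy_setopt(curl, CURLOPT_POSTFIELDS, body.c_str());
    curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, collect);
    curl_easy_setopt(curl, CURLOPT_WRITEDATA, &response);

    CURLcode res = curl_easy_perform(curl);
    if (res == CURLE_OK) {
        std::cout << response << std::endl;  // raw JSON reply from the server
    }

    curl_slist_free_all(headers);
    curl_easy_cleanup(curl);
    curl_global_cleanup();
    return res == CURLE_OK ? 0 : 1;
}
```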
llama-run works fine, but falls well behind llama-server functionality. Integrate llama-server with llama-run. Signed-off-by: Eric Curtin <[email protected]>
I feel like writing a shim layer for httplib::Server would be a cleaner way. Simply override the handler-registration methods.
@ngxson how would you feel about getting this in first? I think the shim layer would be neat, but it's going to be a bigger undertaking than this.
IMO, the goal of ... A shim layer should be similar or even less complicated than your approach:

```cpp
// Sketch: a drop-in replacement for httplib::Server that stores route handlers
// in a map keyed by "METHOD pattern" instead of serving them over a socket.
#include <functional>
#include <string>
#include <unordered_map>

struct Request;   // your own lightweight type, not httplib's
struct Response;  // your own lightweight type, not httplib's

class Server {
public:
    using Handler = std::function<void(const Request &, Response &)>;

    std::unordered_map<std::string, Handler> handlers;

    Server &Get(const std::string &pattern, Handler handler) {
        handlers["GET " + pattern] = handler;
        return *this;
    }
    Server &Post(const std::string &pattern, Handler handler) {
        handlers["POST " + pattern] = handler;
        return *this;
    }
    // ...
};
```

Of course you also need to provide your own versions of Request and Response. Nevertheless, I don't know why my opinion is important here, given that many of my inputs about ...
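For illustration only (an assumption about how the shim would be used, not code from this thread), in-process dispatch through such a shim might look like the following; the stand-in Request/Response types and the /health route are placeholders:

```cpp
// Hypothetical usage of the shim sketched above: register a route the same way
// llama-server registers routes on httplib::Server, then dispatch in-process.
#include <iostream>
#include <string>

struct Request  { std::string body; };                     // stand-in type
struct Response { std::string body; int status = 200; };   // stand-in type

int main() {
    Server svr;  // the shim class from the sketch above

    // Route registration looks identical to httplib::Server usage.
    svr.Get("/health", [](const Request &, Response &res) {
        res.body = "{\"status\":\"ok\"}";
    });

    // In-process dispatch: look up the handler and call it directly,
    // no socket or HTTP parsing involved.
    Request req;
    Response res;
    svr.handlers.at("GET /health")(req, res);
    std::cout << res.status << " " << res.body << std::endl;  // 200 {"status":"ok"}
    return 0;
}
```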
I'm not against doing both FWIW. And 1,000 fewer lines of code to maintain isn't a terrible thing (although if someone complained, it wouldn't be a big deal to bring the old version back). The thing is, most projects can bind to a port, and in most cases that's preferable, because talking to inference servers via HTTP is the de facto standard. I also wonder whether the shim solution is more suited to something like ...

or else we make server.cpp just a couple of lines of code, a main function, with run.cpp and server.cpp sharing 99% of their code in some common file. I think this would be cool, it's just going to take a while. With two processes talking via HTTP, it's easy to just pass all the args to llama-server.

In C++, I don't agree that most people will go for an easier approach than libcurl; libcurl is the most deployed C/C++ HTTP client library in the world, although there are plenty of other options for an HTTP client in C++. There are easier options in other languages, and you can do many of these things with golang/python3/etc., but meh, I don't actually find libcurl too bad to code with. And we would be introducing a new language to "tools" and would have to mess around with the build/install system a bit; not sure it's worth it. The other thing about libcurl is that it's packaged everywhere, so you can rely on package maintainers to patch CVEs, etc., while retaining backwards compatibility better than other HTTP client libraries.
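As a rough sketch of the "pass all the args to llama-server" idea (assumed details, not the PR's actual implementation), forwarding llama-run's arguments to a spawned llama-server on a POSIX system could look like this:

```cpp
// Hypothetical sketch: forward llama-run's CLI arguments to a spawned
// llama-server child process (POSIX only; the real PR may differ).
#include <sys/wait.h>
#include <unistd.h>
#include <vector>

int main(int argc, char **argv) {
    std::vector<char *> args;
    args.push_back(const_cast<char *>("llama-server"));  // assumes it is on PATH
    for (int i = 1; i < argc; i++) {
        args.push_back(argv[i]);  // pass every llama-run option straight through
    }
    args.push_back(nullptr);

    pid_t pid = fork();
    if (pid == 0) {
        execvp(args[0], args.data());  // child: become llama-server
        _exit(127);                    // exec failed
    }

    // Parent: here llama-run would wait for the server's HTTP endpoint to come
    // up and then drive the chat loop over HTTP (see the libcurl sketch above).
    int status = 0;
    waitpid(pid, &status, 0);
    return WIFEXITED(status) ? WEXITSTATUS(status) : 1;
}
```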
And for the record, I'm not crazy picky about people rewriting bits of the code I have written (including with the shim you have mentioned) or evolving it in future. linenoise.cpp is another example. Sometimes it's tempting to write an MIT-licensed C++ command prompt library from scratch to replace linenoise.cpp with something that also supports Windows. Nothing like that seems to exist, because Windows is so drastically different from Unix/Linux when it comes to terminals. But that's not a small undertaking either, and Windows users tend not to use the command line the way Unix/Linux people do.