Replies: 2 comments
I discovered llama-server also offers a web interface :P The output seems OK there. Are there any Python libs to talk to this API? My question is basically "How to use llama.cpp from another program?"
OK, so this is an "OpenAI compatible API".

```python
from openai import OpenAI
client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="kanker")
response = client.chat.completions.create(
[...]
```

That works fine. For anyone reading, something like this helps (more useful than the llama.cpp docs): https://blog.steelph0enix.dev/posts/llama-cpp-guide/ Thanks for coming to my TED talk!
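For anyone who wants the elided part fleshed out, a minimal self-contained sketch of the same call might look like this. It assumes llama-server is running on its default 127.0.0.1:8080; the model name, messages, and sampling values are placeholders, since llama-server answers with whatever model it was started with.

```python
from openai import OpenAI

# llama-server exposes an OpenAI-compatible API under /v1.
# The api_key is a dummy value; a local llama-server typically does not check it
# unless it was started with an API key configured.
client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="no-key-needed")

response = client.chat.completions.create(
    model="local",  # placeholder name; the server uses the model it was launched with
    messages=[
        {"role": "system", "content": "You are a concise summarizer. Reply only with bullet points."},
        {"role": "user", "content": "Alice: brunch at the new cafe on Saturday? Bob: sure, 9:45 works."},
    ],
    temperature=0.7,
    top_p=0.9,
)

# The assistant's reply text:
print(response.choices[0].message.content)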
Sorry, complete noob here.

`llama-cli` works great: it summarizes a conversation for me from an input file (`conversation.txt`). `llama-server` has wildly different output and does not seem to follow my prompt exactly. I'd like to use llama.cpp from a program, and I don't want to call the `llama-cli` binary each time. The correct output should just be some bullet points, which works in `llama-cli` but not in `llama-server`.

CLI example
```bash
./build/bin/llama-cli -m qwen3-06.gguf \
  --prompt "You are a summarizer, summarize the following text conversation into points, use minimum of 4 points, and a maximum of 7 points. Output the key takeaways of the conversation:\n\n$(cat conversation.txt)" \
  --temp 0.7 \
  --top_p 0.9 \
  --repeat_penalty 1.1
```

input (`conversation.txt`, truncated)

output
This is OK.
Server example
running
input
output
{ "choices": [ { "text": "_for_now\n\nOkay, let's break down this conversation into key points. The main topic seems to be a group planning a brunch meetup at a new café with outdoor seating, fairy lights, and various food options. Here are the key takeaways:\n\n1. Alice and Bob agree on meeting at Clara's café.\n2. Clara drives the car as she can fit four people.\n3. They plan to meet at 9:45 AM Saturday.\n4. They discuss food preferences like pancakes, waffles, quiches, and pastries.\n\nThese points capture the main aspects of their conversation and the planning details discussed. Each point is concise and fits within the minimum and maximum limits specified (4-7 points).\nAnswer:\n```\n1. Alice and Bob agreed to meet at Clara's café. \n2. Clara drives as she can accommodate four people. \n3. They plan to meet at 9:45 AM on Saturday. \n4. The discussion covers food preferences like pancakes, waffles, quiches, and pastries.\n```\n```json\n[\n {\n \"key_takeaways\": [\n \"Alice and Bob agreed to meet at Clara's café.\",\n \"Clara drives as she can accommodate four people.\",\n \"They plan to meet at 9:45 AM on Saturday.\",\n \"The discussion covers food preferences like pancakes, waffles, quiches, and pastries.\"\n ]\n }\n]\n```json\n``` \n", "index": 0, "logprobs": null, "finish_reason": "length" } ], "created": 1760720931, "model": "gpt-3.5-turbo", "system_fingerprint": "b6782-1bb4f433", "object": "text_completion", "usage": { "completion_tokens": 300, "prompt_tokens": 859, "total_tokens": 1159 }, "id": "chatcmpl-mRKVbS29sxQDY5hYgmyjDws1i41UjAXm", "timings": { "cache_n": 0, "prompt_n": 859, "prompt_ms": 1667.245, "prompt_per_token_ms": 1.9409138533178112, "prompt_per_second": 515.2212182372717, "predicted_n": 300, "predicted_ms": 8118.911, "predicted_per_token_ms": 27.063036666666665, "predicted_per_second": 36.950768397387286 } }The output starts with:
And much more text. What makes it do that? How to get only the bullet points as with the CLI?
(In addition, I am also open to including `llama.h` directly into my C++ program if that is easier.)
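For what it's worth, the JSON above has `"object": "text_completion"`, which suggests the request went to the plain completion endpoint rather than the OpenAI-compatible chat endpoint mentioned in the replies above. A rough sketch of a chat-style request is below; the endpoint path, port, and parameters are assumptions (the server's defaults), the prompt is copied from the CLI example, and whether the model still prepends extra commentary depends on the model and its chat template.

```python
import json
import urllib.request

# Assumption: llama-server is running on its default address and exposes the
# OpenAI-compatible chat endpoint.
URL = "http://127.0.0.1:8080/v1/chat/completions"

with open("conversation.txt", "r", encoding="utf-8") as f:
    conversation = f.read()

payload = {
    # Placeholder name; llama-server answers with whatever model it was started with.
    "model": "local",
    "messages": [
        {
            "role": "system",
            "content": (
                "You are a summarizer, summarize the following text conversation "
                "into points, use minimum of 4 points, and a maximum of 7 points. "
                "Output the key takeaways of the conversation."
            ),
        },
        {"role": "user", "content": conversation},
    ],
    "temperature": 0.7,
    "top_p": 0.9,
}

req = urllib.request.Request(
    URL,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

with urllib.request.urlopen(req) as resp:
    body = json.loads(resp.read())

# With a chat-style request the reply is under message.content; ideally this is
# just the bullet points, though reasoning-heavy models may still add extra text.
print(body["choices"][0]["message"]["content"])
```

The same request through the `openai` client (as in the reply above) is equivalent; the point of the chat route is that the server can apply the model's chat template instead of treating the prompt as raw text, which is one plausible reason the CLI and server outputs differ here.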