Replies: 2 comments
I discovered llama-server also offers a web interface :P The output seems OK there. Are there any Python libs to talk to this API? My question is basically "How to use llama.cpp from another program?"
OK, so this is an "OpenAI compatible API".

```python
from openai import OpenAI
client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="kanker")
response = client.chat.completions.create(
[...]
```

That works fine. For anyone reading, something like this helps (more useful than the llama.cpp docs): https://blog.steelph0enix.dev/posts/llama-cpp-guide/ Thanks for coming to my TED talk!
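For anyone who wants the elided part fleshed out, a minimal self-contained sketch of the same call might look like this. It assumes llama-server is running on its default 127.0.0.1:8080; the model name, messages, and sampling values are placeholders, since llama-server answers with whatever model it was started with.

```python
from openai import OpenAI

# llama-server exposes an OpenAI-compatible API under /v1.
# The api_key is a dummy value; a local llama-server typically does not check it
# unless it was started with an API key configured.
client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="no-key-needed")

response = client.chat.completions.create(
    model="local",  # placeholder name; the server uses the model it was launched with
    messages=[
        {"role": "system", "content": "You are a concise summarizer. Reply only with bullet points."},
        {"role": "user", "content": "Alice: brunch at the new cafe on Saturday? Bob: sure, 9:45 works."},
    ],
    temperature=0.7,
    top_p=0.9,
)

# The assistant's reply text:
print(response.choices[0].message.content)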
Sorry, complete noob here.

`llama-cli` works great: it summarizes a conversation for me from an input file (`conversation.txt`). `llama-server` has wildly different output and does not seem to follow my prompt exactly. I'd like to use llama.cpp from a program, and I don't want to call the `llama-cli` binary each time. The correct output should just be some bullet points, which works in `llama-cli` but not in `llama-server`.

CLI example
```bash
./build/bin/llama-cli -m qwen3-06.gguf \
  --prompt "You are a summarizer, summarize the following text conversation into points, use minimum of 4 points, and a maximum of 7 points. Output the key takeaways of the conversation:\n\n$(cat conversation.txt)" \
  --temp 0.7 \
  --top_p 0.9 \
  --repeat_penalty 1.1
```

input (`conversation.txt`, truncated)

output
This is OK.
Server example
running
input
output
{ "choices": [ { "text": "_for_now\n\nOkay, let's break down this conversation into key points. The main topic seems to be a group planning a brunch meetup at a new café with outdoor seating, fairy lights, and various food options. Here are the key takeaways:\n\n1. Alice and Bob agree on meeting at Clara's café.\n2. Clara drives the car as she can fit four people.\n3. They plan to meet at 9:45 AM Saturday.\n4. They discuss food preferences like pancakes, waffles, quiches, and pastries.\n\nThese points capture the main aspects of their conversation and the planning details discussed. Each point is concise and fits within the minimum and maximum limits specified (4-7 points).\nAnswer:\n```\n1. Alice and Bob agreed to meet at Clara's café. \n2. Clara drives as she can accommodate four people. \n3. They plan to meet at 9:45 AM on Saturday. \n4. The discussion covers food preferences like pancakes, waffles, quiches, and pastries.\n```\n```json\n[\n {\n \"key_takeaways\": [\n \"Alice and Bob agreed to meet at Clara's café.\",\n \"Clara drives as she can accommodate four people.\",\n \"They plan to meet at 9:45 AM on Saturday.\",\n \"The discussion covers food preferences like pancakes, waffles, quiches, and pastries.\"\n ]\n }\n]\n```json\n``` \n", "index": 0, "logprobs": null, "finish_reason": "length" } ], "created": 1760720931, "model": "gpt-3.5-turbo", "system_fingerprint": "b6782-1bb4f433", "object": "text_completion", "usage": { "completion_tokens": 300, "prompt_tokens": 859, "total_tokens": 1159 }, "id": "chatcmpl-mRKVbS29sxQDY5hYgmyjDws1i41UjAXm", "timings": { "cache_n": 0, "prompt_n": 859, "prompt_ms": 1667.245, "prompt_per_token_ms": 1.9409138533178112, "prompt_per_second": 515.2212182372717, "predicted_n": 300, "predicted_ms": 8118.911, "predicted_per_token_ms": 27.063036666666665, "predicted_per_second": 36.950768397387286 } }The output starts with:
And much more text. What makes it do that? How to get only the bullet points as with the CLI?
(In addition, I am also open to including `llama.h` directly into my C++ program if that is easier.)
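For what it's worth, the JSON above has `"object": "text_completion"`, which suggests the request went to the plain completion endpoint rather than the OpenAI-compatible chat endpoint mentioned in the replies above. A rough sketch of a chat-style request is below; the endpoint path, port, and parameters are assumptions (the server's defaults), the prompt is copied from the CLI example, and whether the model still prepends extra commentary depends on the model and its chat template.

```python
import json
import urllib.request

# Assumption: llama-server is running on its default address and exposes the
# OpenAI-compatible chat endpoint.
URL = "http://127.0.0.1:8080/v1/chat/completions"

with open("conversation.txt", "r", encoding="utf-8") as f:
    conversation = f.read()

payload = {
    # Placeholder name; llama-server answers with whatever model it was started with.
    "model": "local",
    "messages": [
        {
            "role": "system",
            "content": (
                "You are a summarizer, summarize the following text conversation "
                "into points, use minimum of 4 points, and a maximum of 7 points. "
                "Output the key takeaways of the conversation."
            ),
        },
        {"role": "user", "content": conversation},
    ],
    "temperature": 0.7,
    "top_p": 0.9,
}

req = urllib.request.Request(
    URL,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

with urllib.request.urlopen(req) as resp:
    body = json.loads(resp.read())

# With a chat-style request the reply is under message.content; ideally this is
# just the bullet points, though reasoning-heavy models may still add extra text.
print(body["choices"][0]["message"]["content"])
```

The same request through the `openai` client (as in the reply above) is equivalent; the point of the chat route is that the server can apply the model's chat template instead of treating the prompt as raw text, which is one plausible reason the CLI and server outputs differ here.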