How to prefill assistant response with gpt-oss and llama-server? #15714

dstoc · 2025-09-01T12:24:19Z

dstoc
Sep 1, 2025

Is this a bug, or am I doing something wrong?

> podman run \
  -p 8000:8000 \
  --device nvidia.com/gpu=all \
  ghcr.io/ggml-org/llama.cpp:server-cuda \
  -hf ggml-org/gpt-oss-20b-GGUF \
  --port 8000 \
  --host 0.0.0.0 \
  --jinja \
  -v

> curl -v -s http://127.0.0.1:8000/v1/chat/completions -s --json '{
    "messages": [
        {
            "role": "user",
            "content": "Hello!"
        },
        {
            "role": "assistant",
            "content": "I have n"
        }
    ]
}' | jq

*   Trying 127.0.0.1:8000...
* Connected to 127.0.0.1 (127.0.0.1) port 8000
* using HTTP/1.x
> POST /v1/chat/completions HTTP/1.1
> Host: 127.0.0.1:8000
> User-Agent: curl/8.12.1
> Content-Type: application/json
> Accept: application/json
> Content-Length: 195
> 
} [195 bytes data]
* upload completely sent off: 195 bytes
< HTTP/1.1 200 OK
< Keep-Alive: timeout=5, max=100
< Content-Type: application/json; charset=utf-8
< Server: llama.cpp
< Content-Length: 5101
< Access-Control-Allow-Origin: 
< 
{ [5101 bytes data]
* Connection #0 to host 127.0.0.1 left intact

{
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "message": {
        "role": "assistant",
        "reasoning_content": "We need to respond politely asking for clarification.",
        "content": "Hi there! 👋\n\nIt looks like your message got a bit garbled with all those ellipses. Could you please restate the problem or the question you’re working on? A quick rundown of:\n\n1. **What you’re trying to solve** (e.g., a math problem, a programming task, etc.)  \n2. **The exact statement or formula** (e.g., “I have \\(n = 1, 2, \\dots, N\\)” or “sum of…”)  \n3. **Any constraints or special conditions** you’re aware of.\n\nOnce we have a clearer picture, I’ll be happy to help you out!"
      }
    }
  ],
  "created": 1756728870,
  "model": "gpt-3.5-turbo",
  "system_fingerprint": "b6327-4d74393b",
  "object": "chat.completion",
  "usage": {
    "completion_tokens": 492,
    "prompt_tokens": 72,
    "total_tokens": 564
  },
  "id": "chatcmpl-Bu9wkCe2pcxwdRzzOKjNieBWZZbCcKMw",
  "__verbose": {
    "index": 0,
    "content": " = 1 … … …………… … ... ……...? The problem …... ...……? ...??……\n\nWe need to interpret. User says: \"Hello! I have n = 1 … … ………… … … ...…?? The problem …... ...…? ...??……\". The user is unclear. Might be a math problem but truncated. They likely want help with a problem about n = 1 ... something. The context might be about a sequence or something like that. The user says \"I have n=1 … … ………… … … ...…?? The problem …... ...…? ...??……\". They likely want to ask about something like \"I have n=1,..., maybe 100, what's sum of something?\" But the message is garbled.\n\nWe must ask clarifying questions. Also try to guess what they want: maybe they have a problem about summation, like \"I have n=1,...,N, find sum of something\"? The user might be using ellipses to indicate a range. The phrase \"I have n = 1 … … ………… … … ...…??\" might be \"I have n=1,...,N? The problem ...?\" It's ambiguous.\n\nWe can respond: \"Hello! Could you please clarify the problem? The statement is not clear.\" Provide a request for clarification. Also maybe ask about the nature: math, programming, etc. Also mention that ellipses may indicate missing details.\n\nLet's produce a helpful answer: ask clarifying questions.<|end|><|start|>assistant<|channel|>analysis<|message|>We need to respond politely asking for clarification.<|end|><|start|>assistant<|channel|>final<|message|>Hi there! 👋\n\nIt looks like your message got a bit garbled with all those ellipses. Could you please restate the problem or the question you’re working on? A quick rundown of:\n\n1. **What you’re trying to solve** (e.g., a math problem, a programming task, etc.)  \n2. **The exact statement or formula** (e.g., “I have \\(n = 1, 2, \\dots, N\\)” or “sum of…”)  \n3. **Any constraints or special conditions** you’re aware of.\n\nOnce we have a clearer picture, I’ll be happy to help you out!",
    "tokens": [],
    "id_slot": 0,
    "stop": true,
    "model": "gpt-3.5-turbo",
    "tokens_predicted": 492,
    "tokens_evaluated": 72,
    "generation_settings": {
      "n_predict": -1,
      "seed": 4294967295,
      "temperature": 0.800000011920929,
      "dynatemp_range": 0.0,
      "dynatemp_exponent": 1.0,
      "top_k": 40,
      "top_p": 0.949999988079071,
      "min_p": 0.05000000074505806,
      "top_n_sigma": -1.0,
      "xtc_probability": 0.0,
      "xtc_threshold": 0.10000000149011612,
      "typical_p": 1.0,
      "repeat_last_n": 64,
      "repeat_penalty": 1.0,
      "presence_penalty": 0.0,
      "frequency_penalty": 0.0,
      "dry_multiplier": 0.0,
      "dry_base": 1.75,
      "dry_allowed_length": 2,
      "dry_penalty_last_n": 131072,
      "dry_sequence_breakers": [
        "\n",
        ":",
        "\"",
        "*"
      ],
      "mirostat": 0,
      "mirostat_tau": 5.0,
      "mirostat_eta": 0.10000000149011612,
      "stop": [],
      "max_tokens": -1,
      "n_keep": 0,
      "n_discard": 0,
      "ignore_eos": false,
      "stream": false,
      "logit_bias": [],
      "n_probs": 0,
      "min_keep": 0,
      "grammar": "",
      "grammar_lazy": false,
      "grammar_triggers": [],
      "preserved_tokens": [
        200003,
        200005,
        200006,
        200007,
        200008
      ],
      "chat_format": "GPT-OSS",
      "reasoning_format": "auto",
      "reasoning_in_content": false,
      "thinking_forced_open": false,
      "samplers": [
        "penalties",
        "dry",
        "top_n_sigma",
        "top_k",
        "typ_p",
        "top_p",
        "min_p",
        "xtc",
        "temperature"
      ],
      "speculative.n_max": 16,
      "speculative.n_min": 0,
      "speculative.p_min": 0.75,
      "timings_per_token": false,
      "post_sampling_probs": false,
      "lora": []
    },
    "prompt": "<|start|>system<|message|>You are ChatGPT, a large language model trained by OpenAI.\nKnowledge cutoff: 2024-06\nCurrent date: 2025-09-01\n\nReasoning: medium\n\n# Valid channels: analysis, commentary, final. Channel must be included for every message.<|end|><|start|>user<|message|>Hello!<|end|><|start|>assistantI have n",
    "has_new_line": true,
    "truncated": false,
    "stop_type": "eos",
    "stopping_word": "",
    "tokens_cached": 563,
    "timings": {
      "prompt_n": 1,
      "prompt_ms": 23.163,
      "prompt_per_token_ms": 23.163,
      "prompt_per_second": 43.172300651901736,
      "predicted_n": 492,
      "predicted_ms": 2910.808,
      "predicted_per_token_ms": 5.916276422764228,
      "predicted_per_second": 169.02523285630656
    }
  },
  "timings": {
    "prompt_n": 1,
    "prompt_ms": 23.163,
    "prompt_per_token_ms": 23.163,
    "prompt_per_second": 43.172300651901736,
    "predicted_n": 492,
    "predicted_ms": 2910.808,
    "predicted_per_token_ms": 5.916276422764228,
    "predicted_per_second": 169.02523285630656
  }
}

Answered by aldehir

Sep 1, 2025

Try

> curl -v -s http://127.0.0.1:8000/v1/chat/completions -s --json '{
    "messages": [
        {
            "role": "user",
            "content": "Hello!"
        },
        {
            "role": "assistant",
            "content": "<|channel|>analysis<|message|><|end|><|start|>assistant<|channel|>final<|message|>I have n"
        }
    ]
}' | jq

response

{
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "message": {
        "role": "assistant",
        "content": " patterns matching 1..~8  which give them even and !- etc. In this because *applied RPP and after graph.\n\nThe consistent the # the pattern to for up to any problem of your... Not. I'm order o…

View full answer

aldehir · 2025-09-01T20:31:01Z

aldehir
Sep 1, 2025
Collaborator

Try

> curl -v -s http://127.0.0.1:8000/v1/chat/completions -s --json '{
    "messages": [
        {
            "role": "user",
            "content": "Hello!"
        },
        {
            "role": "assistant",
            "content": "<|channel|>analysis<|message|><|end|><|start|>assistant<|channel|>final<|message|>I have n"
        }
    ]
}' | jq

response

{
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "message": {
        "role": "assistant",
        "content": " patterns matching 1..~8  which give them even and !- etc. In this because *applied RPP and after graph.\n\nThe consistent the # the pattern to for up to any problem of your... Not. I'm order of these should we to them your. We are I to and by using perfect left you. And I'm\n\nGiven a M&copy; and but and problem the.\n\nI think you have to be aware. The answer might be a few '0's? \n\nAP for any helpful 2 patterns...\n\nI hope you find this helpful. If you have any other questions, let me know."
      }
    }
  ],
  "created": 1756758383,
  "model": "gpt-3.5-turbo",
  "system_fingerprint": "b6328-8edb5c46",
  "object": "chat.completion",
  "usage": {
    "completion_tokens": 128,
    "prompt_tokens": 81,
    "total_tokens": 209
  },
  "id": "chatcmpl-btFyTsD1po8zRQ2TxYv1xqtTKVLR9mRg",
  "__verbose": {
    "index": 0,
    "content": " patterns matching 1..~8  which give them even and !- etc. In this because *applied RPP and after graph.\n\nThe consistent the # the pattern to for up to any problem of your... Not. I'm order of these should we to them your. We are I to and by using perfect left you. And I'm\n\nGiven a M&copy; and but and problem the.\n\nI think you have to be aware. The answer might be a few '0's? \n\nAP for any helpful 2 patterns...\n\nI hope you find this helpful. If you have any other questions, let me know.",
    "tokens": [],
    "id_slot": 2,
    "stop": true,
    "model": "gpt-3.5-turbo",
    "tokens_predicted": 128,
    "tokens_evaluated": 81,
    "generation_settings": {
      "n_predict": -1,
      "seed": 4294967295,
      "temperature": 1.0,
      "dynatemp_range": 0.0,
      "dynatemp_exponent": 1.0,
      "top_k": 1000,
      "top_p": 1.0,
      "min_p": 0.0,
      "top_n_sigma": -1.0,
      "xtc_probability": 0.0,
      "xtc_threshold": 0.10000000149011612,
      "typical_p": 1.0,
      "repeat_last_n": 64,
      "repeat_penalty": 1.0,
      "presence_penalty": 0.0,
      "frequency_penalty": 0.0,
      "dry_multiplier": 0.0,
      "dry_base": 1.75,
      "dry_allowed_length": 2,
      "dry_penalty_last_n": 267264,
      "dry_sequence_breakers": [
        "\n",
        ":",
        "\"",
        "*"
      ],
      "mirostat": 0,
      "mirostat_tau": 5.0,
      "mirostat_eta": 0.10000000149011612,
      "stop": [],
      "max_tokens": -1,
      "n_keep": 0,
      "n_discard": 0,
      "ignore_eos": false,
      "stream": false,
      "logit_bias": [],
      "n_probs": 0,
      "min_keep": 0,
      "grammar": "",
      "grammar_lazy": false,
      "grammar_triggers": [],
      "preserved_tokens": [
        200003,
        200005,
        200006,
        200007,
        200008
      ],
      "chat_format": "GPT-OSS",
      "reasoning_format": "auto",
      "reasoning_in_content": false,
      "thinking_forced_open": false,
      "samplers": [
        "penalties",
        "dry",
        "top_n_sigma",
        "top_k",
        "typ_p",
        "top_p",
        "min_p",
        "xtc",
        "temperature"
      ],
      "speculative.n_max": 16,
      "speculative.n_min": 0,
      "speculative.p_min": 0.75,
      "timings_per_token": false,
      "post_sampling_probs": false,
      "lora": []
    },
    "prompt": "<|start|>system<|message|>You are ChatGPT, a large language model trained by OpenAI.\nKnowledge cutoff: 2024-06\nCurrent date: 2025-09-01\n\nReasoning: high\n\n# Valid channels: analysis, commentary, final. Channel must be included for every message.<|end|><|start|>user<|message|>Hello!<|end|><|start|>assistant<|channel|>analysis<|message|><|end|><|start|>assistant<|channel|>final<|message|>I have n",
    "has_new_line": true,
    "truncated": false,
    "stop_type": "eos",
    "stopping_word": "",
    "tokens_cached": 208,
    "timings": {
      "prompt_n": 81,
      "prompt_ms": 69.476,
      "prompt_per_token_ms": 0.8577283950617284,
      "prompt_per_second": 1165.8702285681386,
      "predicted_n": 128,
      "predicted_ms": 1642.158,
      "predicted_per_token_ms": 12.829359375,
      "predicted_per_second": 77.94621467605432
    }
  },
  "timings": {
    "prompt_n": 81,
    "prompt_ms": 69.476,
    "prompt_per_token_ms": 0.8577283950617284,
    "prompt_per_second": 1165.8702285681386,
    "predicted_n": 128,
    "predicted_ms": 1642.158,
    "predicted_per_token_ms": 12.829359375,
    "predicted_per_second": 77.94621467605432
  }
}

The prefill needs to use the harmony response format and since it's a generation, should contain an analysis channel. In this example I left it empty, i.e. no "reasoning".

With this, the parsing breaks because it begins parsing from the end of the prefill. Your best bet might be to exclude --jinja and parse the contents out yourself.

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

How to prefill assistant response with gpt-oss and llama-server? #15714

Uh oh!

{{title}}

Uh oh!

Replies: 1 comment

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{editor}}'s edit

{{editor}}'s edit

Uh oh!

Select a reply

Uh oh!

How to prefill assistant response with gpt-oss and llama-server? #15714

Uh oh!

dstoc Sep 1, 2025

Replies: 1 comment

Uh oh!

Uh oh!

aldehir Sep 1, 2025 Collaborator

dstoc
Sep 1, 2025

aldehir
Sep 1, 2025
Collaborator