gpt-oss and grammar #15341
-
After writing this post I realised that a few hours earlier a commit had been made which answers the first of the "Some related questions" below. You can now do exactly that, and it works like magic: I have my grammar, and llama.cpp correctly splits out the thinking without my needing to do it. However I still need to model the harmony tags in the grammar! So the advice above is still very much relevant. Perhaps the model card could be updated? Likewise for the llama-server docs.
-
This might be of interest to you: #15276. I've been meaning to submit a PR for the `response_format`, just haven't gotten around to it. If `response_format` meets your needs then I recommend that (once support is added). I don't know if I can easily process the user-provided grammar and combine it with the internal grammar for harmony. If I can, then it's possible to support this without the user needing to write the grammar for the harmony format. I'll look into it.
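(For context, the OpenAI-style `response_format` that such support would expose looks roughly like this; the `json_schema` variant and the schema itself are just an illustration, not anything llama.cpp-specific:)

```json
{
  "messages": [{"role": "user", "content": "You say just hello or goodbye"}],
  "response_format": {
    "type": "json_schema",
    "json_schema": {
      "name": "greeting",
      "schema": {
        "type": "object",
        "properties": {"reply": {"type": "string", "enum": ["hello", "goodbye"]}},
        "required": ["reply"]
      }
    }
  }
}
```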
-
I've just figured out how to use grammars with gpt-oss-20b via the OpenAI compatibility API, and thought I'd share it (and perhaps get some comments on my assumptions/conclusions). My goal is to use a grammar constraint for the final message but not for the thinking, and to be able to retrieve the thinking separately.
Using version 6170 (4227c9b) and running the latest ggml gpt-oss-20b with the recommended args:

`llama-server -m models/gpt-oss-20b-mxfp4.gguf -c 0 -fa --jinja --reasoning-format none --verbose`
(I'm adding `--verbose` so I can see in the terminal exactly which tokens the model generates before llama.cpp does any processing.)

(It looks like extracting `reasoning_content` isn't ready yet, hence the recommended `--reasoning-format none`. So I want the thinking to appear in the output, and I'll be parsing it out myself.)
My test prompt is
You say just hello or goodbye
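For reference, here's roughly how I'm sending requests (a sketch, assuming llama-server's default port 8080; `grammar` is the llama.cpp-specific field on the OpenAI-compatible chat completions body, and the placeholder grammar here is replaced by the harmony-aware one developed below):

```python
import json
import urllib.request

# Placeholder grammar; the harmony-aware version is developed below.
GRAMMAR = r'root ::= ( "hello" | "goodbye" )'

payload = {
    "messages": [
        {"role": "user", "content": "You say just hello or goodbye"},
    ],
    # llama.cpp-specific extension field; omit it to see the
    # unconstrained output described in the next section.
    "grammar": GRAMMAR,
}

req = urllib.request.Request(
    "http://localhost:8080/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    body = json.loads(resp.read())

print(body["choices"][0]["message"]["content"])
```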
Raw model output
Without grammar, and before llama.cpp processes the output, the model emits (wrapped for clarity):
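(Reconstructed for illustration; the actual thinking text varies:)

```
<|channel|>analysis<|message|> ...the model's thinking... <|end|><|start|>assistant<|channel|>final<|message|>hello
```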
(And then token `200002`, which is `<|return|>`: a stop word, but it's not included in the output or fed through any grammar.) This is the harmony standard. (Note that sometimes I see `final json` instead of `final` when I've prompted the output to be JSON.)

Filtered output
When I run this through llama.cpp, in the API output the message content is:
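(Again reconstructed; this is the shape I see:)

```
<|channel|>analysis<|message|> ...the model's thinking... <|end|>hello
```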
I can see that llama.cpp is stripping some of the harmony tags, but not all, presumably because we're using `--reasoning-format none`, as recommended. This is similar to allowing `<think>...</think>` tags through for deepseek.

Earlier behaviour
Note that, in the first release of llama.cpp that supported gpt-oss, instead I would see
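(Reconstructed shape:)

```
<|channel|>analysis<|message|> ...the model's thinking... <|start|>assistant<|channel|>final<|message|>hello
```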
So strangely it has swapped behaviour: from losing `<|end|>` but keeping `<|start|>assistant<|channel|>final<|message|>` (previously), to keeping `<|end|>` but losing `<|start|>assistant<|channel|>final<|message|>` (currently). I get it; this is all new, and probably it'll evolve in future.
But I digress...
Grammar processing order
The main lightbulb moment for me was that the grammar is applied before llama.cpp strips some of the harmony tags. For someone writing a grammar without a deep understanding of LLM output tokens and llama.cpp parsing, this is a key point that isn't obvious.
So let's create a super-simple grammar for our test. I'm naively preventing `<` in the thinking text, which is too restrictive in practice, but good for this example.
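Reconstructed from the description that follows, it is essentially:

```
root ::= "<|channel|>analysis<|message|>" [^<]* "<|end|><|start|>assistant<|channel|>final<|message|>" ( "hello" | "goodbye" )
```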
In English this means:

- `<|channel|>analysis<|message|>`
- thinking text that does not contain `<`
- `<|end|><|start|>assistant<|channel|>final<|message|>` (of which, all but `<|end|>` is currently being stripped post-grammar by llama.cpp)
- `hello` or `goodbye` literal text.

This works perfectly. The model is constrained to precisely:
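(example run, reconstructed:)

```
<|channel|>analysis<|message|>i am some thinking<|end|><|start|>assistant<|channel|>final<|message|>hello
```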
And then llama.cpp filters the result and yields
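(reconstructed to match the current stripping behaviour:)

```
<|channel|>analysis<|message|>i am some thinking<|end|>hello
```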
Which I can then process downstream to extract `i am some thinking` and `hello` using simple pattern matching.

(Side note: I don't think you can use grammar with function calling; I've seen evidence in the code that function calling internally uses its own grammar, depending on the model.)
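For what it's worth, the downstream extraction can be a single regular expression over the filtered content; a sketch against the current output shape (the group names are mine):

```python
import re

# Content as returned by the current llama.cpp build (see the example above).
content = "<|channel|>analysis<|message|>i am some thinking<|end|>hello"

# Split the filtered content into the thinking text and the final message.
match = re.fullmatch(
    r"<\|channel\|>analysis<\|message\|>(?P<thinking>.*?)<\|end\|>(?P<final>.*)",
    content,
    re.DOTALL,
)
if match:
    print(match.group("thinking"))  # i am some thinking
    print(match.group("final"))     # hello
```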
Previous gotcha
The gotcha I was running into previously was writing a grammar based on what I was seeing in the output from the API call when running unconstrained, i.e. just `<|end|>` terminating the thinking. Here are some examples of incorrect grammars:

- Using the initial gpt-oss release of llama.cpp, I saw `<|channel|>analysis<|message|>thinking<|start|>assistant<|channel|>final<|message|>hello` and accordingly wrote the grammar `root ::= "<|channel|>analysis<|message|>" [^<]* "<|start|>assistant<|channel|>final<|message|>" ( "hello" | "goodbye" )`. This would hang because the model was trying to generate `<|end|>` but wasn't being allowed to.
- Using the current release of llama.cpp, I saw `<|channel|>analysis<|message|>thinking<|end|>hello` and wrote the grammar `root ::= "<|channel|>analysis<|message|>" [^<]* "<|end|>" ( "hello" | "goodbye" )`. This would halt generation after `<|end|>` and I'd get no content after the thinking.

Some related questions
- There appears to be logic related to harmony and extracting thinking in the llama.cpp code. So why recommend `--reasoning-format none`? Is it possible to have llama.cpp extract the `reasoning_content` property in the API response? It seems wrong to pass `--reasoning-format deepseek` for this model, but there's no "harmony" option, and I don't want other deepseek logic to bleed in.
- I wanted to play with `grammar_lazy` and `grammar_triggers` but could not find a way to get them to take effect via the OpenAI-compatible chat completions route. They would be a nice alternative to explore, instead of writing the thinking tags into the grammar, but they appear not to be supported on that route. Although perhaps they're too primitive to handle the harmony case.