gpt-oss and grammar #15341
-
After writing this post I realised that a few hours earlier a commit had been made which answers the first of the "Some related questions" below. You can now do exactly that, and it works like magic: I have my grammar, and llama.cpp correctly splits out the thinking without my needing to do it. However I still need to model the harmony tags in the grammar! So the advice above is still very much relevant. Perhaps the model card could be updated? Likewise for the llama-server docs.
-
This might be of interest to you: #15276. I've been meaning to submit a PR for the `response_format`, just haven't gotten around to it. If `response_format` meets your needs then I recommend that (once support is added). I don't know if I can easily process the user-provided grammar and combine it with the internal grammar for harmony. If I can, then it's possible to support this without the user needing to write the grammar for the harmony format. I'll look into it.
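(For context, the OpenAI-style `response_format` that such support would expose looks roughly like this; the `json_schema` variant and the schema itself are just an illustration, not anything llama.cpp-specific:)

```json
{
  "messages": [{"role": "user", "content": "You say just hello or goodbye"}],
  "response_format": {
    "type": "json_schema",
    "json_schema": {
      "name": "greeting",
      "schema": {
        "type": "object",
        "properties": {"reply": {"type": "string", "enum": ["hello", "goodbye"]}},
        "required": ["reply"]
      }
    }
  }
}
```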
-
I've just figured out how to use grammars with gpt-oss-20b via the OpenAI compatibility API, and thought I'd share it (and perhaps get some comments on my assumptions/conclusions). My goal is to use a grammar constraint for the final message but not for the thinking, and to be able to retrieve the thinking separately.
Using version 6170 (4227c9b) and running the latest ggml gpt-oss-20b with the recommended args:

`llama-server -m models/gpt-oss-20b-mxfp4.gguf -c 0 -fa --jinja --reasoning-format none --verbose`
(I'm adding `--verbose` so I can see in the terminal exactly which tokens the model generates before llama.cpp does any processing.)

(It looks like extracting `reasoning_content` isn't ready yet, hence the recommended `--reasoning-format none`. So I want the thinking to appear in the output, and I'll be parsing it out myself.)
My test prompt is
You say just hello or goodbye
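For reference, here's roughly how I'm sending requests (a sketch, assuming llama-server's default port 8080; `grammar` is the llama.cpp-specific field on the OpenAI-compatible chat completions body, and the placeholder grammar here is replaced by the harmony-aware one developed below):

```python
import json
import urllib.request

# Placeholder grammar; the harmony-aware version is developed below.
GRAMMAR = r'root ::= ( "hello" | "goodbye" )'

payload = {
    "messages": [
        {"role": "user", "content": "You say just hello or goodbye"},
    ],
    # llama.cpp-specific extension field; omit it to see the
    # unconstrained output described in the next section.
    "grammar": GRAMMAR,
}

req = urllib.request.Request(
    "http://localhost:8080/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    body = json.loads(resp.read())

print(body["choices"][0]["message"]["content"])
```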
Raw model output
Without grammar, and before llama.cpp processes the output, the model emits (wrapped for clarity):
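(Reconstructed for illustration; the actual thinking text varies:)

```
<|channel|>analysis<|message|> ...the model's thinking... <|end|><|start|>assistant<|channel|>final<|message|>hello
```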
(And then token `200002`, which is `<|return|>`: a stop word, but it's not included in the output or fed through any grammar.) This is the harmony standard. (Note that sometimes I see `final json` instead of `final` when I've prompted the output to be JSON.)

Filtered output
When I run this through llama.cpp, in the API output the message content is:
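(Again reconstructed; this is the shape I see:)

```
<|channel|>analysis<|message|> ...the model's thinking... <|end|>hello
```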
I can see that llama.cpp is stripping some of the harmony tags, but not all, presumably because we're using `--reasoning-format none`, as recommended. This is similar to allowing `<think>...</think>` tags through for deepseek.

Earlier behaviour
Note that, in the first release of llama.cpp that supported gpt-oss, instead I would see
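(Reconstructed shape:)

```
<|channel|>analysis<|message|> ...the model's thinking... <|start|>assistant<|channel|>final<|message|>hello
```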
So strangely it has swapped behaviour: from losing `<|end|>` but keeping `<|start|>assistant<|channel|>final<|message|>` (previously), to keeping `<|end|>` but losing `<|start|>assistant<|channel|>final<|message|>` (currently). I get it; this is all new, and probably it'll evolve in future.
But I digress...
Grammar processing order
The main lightbulb moment for me was that the grammar is applied before llama.cpp strips some of the harmony tags. For someone writing a grammar without a deep understanding of LLM output tokens and llama.cpp parsing, this is a key point that isn't obvious.
So let's create a super-simple grammar for our test. I'm naively preventing `<` in the thinking text, which is too restrictive in practice, but good for this example.
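Reconstructed from the description that follows, it is essentially:

```
root ::= "<|channel|>analysis<|message|>" [^<]* "<|end|><|start|>assistant<|channel|>final<|message|>" ( "hello" | "goodbye" )
```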
In English this means:

- `<|channel|>analysis<|message|>`
- thinking text that does not contain `<`
- `<|end|><|start|>assistant<|channel|>final<|message|>` (of which, all but `<|end|>` is currently being stripped post-grammar by llama.cpp)
- `hello` or `goodbye` literal text.

This works perfectly. The model is constrained to precisely:
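(example run, reconstructed:)

```
<|channel|>analysis<|message|>i am some thinking<|end|><|start|>assistant<|channel|>final<|message|>hello
```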
And then llama.cpp filters the result and yields
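(reconstructed to match the current stripping behaviour:)

```
<|channel|>analysis<|message|>i am some thinking<|end|>hello
```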
Which I can then process downstream to extract `i am some thinking` and `hello` using simple pattern matching.

(Side note: I don't think you can use grammar with function calling; I've seen evidence in the code that function calling internally uses its own grammar, depending on the model.)
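For what it's worth, the downstream extraction can be a single regular expression over the filtered content; a sketch against the current output shape (the group names are mine):

```python
import re

# Content as returned by the current llama.cpp build (see the example above).
content = "<|channel|>analysis<|message|>i am some thinking<|end|>hello"

# Split the filtered content into the thinking text and the final message.
match = re.fullmatch(
    r"<\|channel\|>analysis<\|message\|>(?P<thinking>.*?)<\|end\|>(?P<final>.*)",
    content,
    re.DOTALL,
)
if match:
    print(match.group("thinking"))  # i am some thinking
    print(match.group("final"))     # hello
```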
Previous gotcha
The gotcha I was running into previously was writing a grammar based on what I was seeing in the output from the API call when running unconstrained, i.e. just `<|end|>` terminating the thinking. Here are some examples of incorrect grammars:

- Using the initial gpt-oss release of llama.cpp, I saw `<|channel|>analysis<|message|>thinking<|start|>assistant<|channel|>final<|message|>hello` and accordingly wrote the grammar `root ::= "<|channel|>analysis<|message|>" [^<]* "<|start|>assistant<|channel|>final<|message|>" ( "hello" | "goodbye" )`. This would hang because the model was trying to generate `<|end|>` but wasn't being allowed to.
- Using the current release of llama.cpp, I saw `<|channel|>analysis<|message|>thinking<|end|>hello` and wrote the grammar `root ::= "<|channel|>analysis<|message|>" [^<]* "<|end|>" ( "hello" | "goodbye" )`. This would halt generation after `<|end|>` and I'd get no content after the thinking.

Some related questions
- There appears to be logic related to harmony and extracting thinking in the llama.cpp code. So why recommend `--reasoning-format none`? Is it possible to have llama.cpp extract the `reasoning_content` property in the API response? It seems wrong to pass `--reasoning-format deepseek` for this model, but there's no "harmony" option, and I don't want other deepseek logic to bleed in.
- I wanted to play with `grammar_lazy` and `grammar_triggers` but could not find a way to get them to take effect via the OpenAI-compatible chat completions route. They would be a nice alternative to explore, instead of writing the thinking tags into the grammar, but they appear not to be supported on that route. Although perhaps they're too primitive to handle the harmony case.