Conversation


@ngxson ngxson commented Oct 24, 2025

It seems the "OCR model race" has started. This model is one of the few "low-hanging fruits" that we can easily support in llama.cpp.

The model features:

  • Qwen3 as language model
  • Mistral3 as vision encoder (the difference being that LightOnOCR does not use the [IMG_BREAK] token)

Original model: https://huggingface.co/lightonai/LightOnOCR-1B-1025

GGUF model: https://huggingface.co/ggml-org/LightOnOCR-1B-1025-GGUF

To try it:

llama-server -hf ggml-org/LightOnOCR-1B-1025-GGUF -c 8192

# open http://localhost:8080 and try uploading an image

Important note: this model requires a specific input structure; see the chat template.

The structure seems to be:

  • Starts with an empty system message
  • Then a user message. All images must be contained in this message; no instructions are needed

Example:

{
  "messages": [{
    "role": "system",
    "content": ""
  }, {
    "role": "user",
    "content": [{
      "type": "image_url",
      "image_url": {"url": "data:image/png;base64,......"}
    }]
  }]
}
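The structure above can be assembled programmatically. Here is a minimal sketch, assuming llama-server's OpenAI-compatible `/v1/chat/completions` endpoint; the `build_ocr_request` helper is hypothetical, not part of llama.cpp:

```python
import base64
import json

def build_ocr_request(image_bytes: bytes, mime: str = "image/png") -> dict:
    """Build a chat payload matching the structure LightOnOCR expects:
    an empty system message, then a single user message whose content
    contains only the image(s), with no text instructions."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "messages": [
            {"role": "system", "content": ""},
            {
                "role": "user",
                "content": [
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:{mime};base64,{b64}"},
                    }
                ],
            },
        ],
    }

# Example: build a payload from raw bytes. A real call would POST this
# (JSON-encoded) to http://localhost:8080/v1/chat/completions.
payload = build_ocr_request(b"\x89PNG fake bytes")
print(json.dumps(payload, indent=2))
```

To send it with only the standard library, `urllib.request.urlopen` with a `Content-Type: application/json` header would suffice; any OpenAI-compatible client works as well.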

@ngxson ngxson requested a review from CISC as a code owner October 24, 2025 23:15
@github-actions github-actions bot added examples python python script changes labels Oct 24, 2025

@ggerganov ggerganov left a comment


Very cool!

[screenshot: OCR output in the llama-server web UI]

The command in OP should be llama-server instead of llama-cli.
