KhmerOCR

A high-performance Khmer Optical Character Recognition (OCR) engine tailored for documents. The model was trained on 3 million text lines rendered in more than 800 Khmer fonts, so that recognition stays robust across font styles and weights.

Important

Update: The library now supports full document processing, layout detection, and multi-format exports (PDF, DOCX, HTML, Markdown).

Features

Fast: Runs inference on ONNX Runtime with a model optimized for Khmer script
Native C++ Engine: High-performance C/C++ implementation with a C API for FFI bindings
Font Detection: Automatically identifies and preserves Moul vs. Regular font styles
Multi-format Export: Convert images or PDFs into editable .docx, .md, .html, or .txt files
PDF Support: High-resolution PDF rendering and processing via PyMuPDF
Cross-Platform: Supports macOS, Linux, Windows, iOS, and Android

Installation

Python

pip install git+https://github.com/seanghay/KhmerOCR

C++ Library

See cpp/README.md for build instructions.

cd cpp
mkdir build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release
make -j$(nproc)

Usage

Python CLI

For single images or documents, run:

khmerocr document.jpg --format docx

Python CLI Options

| Option | Shortcut | Description | Default |
| --- | --- | --- | --- |
| `--output` | `-o` | Custom output path | `input_filename.{format}` |
| `--format` | `-f` | Output format: `txt`, `html`, `docx`, `md` | `txt` |
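
The options above can also be assembled from a script. The following is a minimal sketch, assuming the `khmerocr` command is on `PATH` after installation. The helper names `build_command` and `default_output_path` are hypothetical and not part of the library, and placing the default output next to the input file is an interpretation of the `input_filename.{format}` entry, not confirmed behavior.

```python
from pathlib import Path

# Values listed for --format in the options table.
SUPPORTED_FORMATS = ("txt", "html", "docx", "md")


def default_output_path(input_path, fmt="txt"):
    """Guess the default output name: the input name with its suffix
    replaced by the format (assumption based on the options table)."""
    return Path(input_path).with_suffix("." + fmt)


def build_command(input_path, fmt="txt", output=None):
    """Build the argument list for the khmerocr CLI without running it."""
    if fmt not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported format: {fmt!r}")
    cmd = ["khmerocr", str(input_path), "--format", fmt]
    if output is not None:
        cmd += ["--output", str(output)]
    return cmd


cmd = build_command("document.jpg", fmt="docx")
print(" ".join(cmd))
```

Passing the resulting list to `subprocess.run(cmd, check=True)` would execute the conversion. Building the list first keeps the format check separate from the actual OCR run.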

C++ CLI

The C++ CLI is a lightweight inference tool focused on text extraction. For document formatting (DOCX, HTML, etc.), use the Python CLI.

# Full OCR (detect + recognize)
./cpp/build/khmerocr image.png

# JSON output
./cpp/build/khmerocr -j image.png

# Detection only
./cpp/build/khmerocr -d image.png

# Recognition only (for pre-cropped text images)
./cpp/build/khmerocr -r cropped_text.png

# Verbose output with confidence scores
./cpp/build/khmerocr -v image.png

| Option | Shortcut | Description |
| --- | --- | --- |
| `--json` | `-j` | Output results in JSON format |
| `--detect-only` | `-d` | Only detect text regions, skip recognition |
| `--recognize-only` | `-r` | Only recognize (skip detection) |
| `--verbose` | `-v` | Show confidence scores |
| `--model-dir` | `-m` | Custom model directory path |
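
When the C++ binary is called from another program, the flags in this table can be mapped onto function arguments. The sketch below is only an illustration: `build_cpp_command` is a hypothetical helper, the binary path is the one used in the examples above, and two points are assumptions rather than documented behavior: that `--detect-only` and `--recognize-only` are alternatives to each other, and that `--model-dir` takes its path as the next argument.

```python
# Binary location taken from the CLI examples; adjust if the build directory differs.
DEFAULT_BINARY = "./cpp/build/khmerocr"

# One mode at a time (assumption: the two "only" flags are not meant to be combined).
MODE_FLAGS = {
    "full": [],
    "detect": ["--detect-only"],
    "recognize": ["--recognize-only"],
}


def build_cpp_command(image, mode="full", json_output=False, verbose=False,
                      model_dir=None, binary=DEFAULT_BINARY):
    """Return the argument list for the C++ CLI, options first, image last."""
    if mode not in MODE_FLAGS:
        raise ValueError(f"unknown mode: {mode!r}")
    cmd = [binary]
    if json_output:
        cmd.append("--json")
    cmd += MODE_FLAGS[mode]
    if verbose:
        cmd.append("--verbose")
    if model_dir is not None:
        # Assumed form: --model-dir <path>
        cmd += ["--model-dir", str(model_dir)]
    cmd.append(str(image))
    return cmd


print(build_cpp_command("image.png", json_output=True))
```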

Example Output

For each recognized line, the engine returns the text, the detected font style, and a confidence score for each:

{
  "text": "លទ្ធផលនៃការធ្វើកំណែទប្រង់លើទូរគមនាគមន៍កម្ពុជា",
  "text_confidence": 0.9804,
  "font": "Moul",
  "font_confidence": 0.9999
}
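
The confidence fields make it possible to decide which lines need a manual check. The following sketch parses the record shown above and applies two thresholds. The function name `review_line` and the threshold values are hypothetical choices for illustration; only the four field names come from the example output.

```python
import json

# The record shown above, embedded as a string so the sketch runs on its own.
SAMPLE = """
{
  "text": "លទ្ធផលនៃការធ្វើកំណែទប្រង់លើទូរគមនាគមន៍កម្ពុជា",
  "text_confidence": 0.9804,
  "font": "Moul",
  "font_confidence": 0.9999
}
"""


def review_line(record, min_text_conf=0.9, min_font_conf=0.9):
    """Keep the text, drop the font label if its confidence is below
    min_font_conf, and flag the line when text confidence is below
    min_text_conf. Both thresholds are illustrative defaults."""
    font_ok = record["font_confidence"] >= min_font_conf
    return {
        "text": record["text"],
        "font": record["font"] if font_ok else None,
        "needs_review": record["text_confidence"] < min_text_conf,
    }


result = review_line(json.loads(SAMPLE))
print(result["font"], result["needs_review"])  # prints: Moul False
```

Raising `min_text_conf` to `0.99` would flag this same line for review, since its text confidence is 0.9804.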

Examples

| Input | Detected Text | Font Style |
| --- | --- | --- |
| [Line 1] | យេម៉ែនលង់ក្នុងសង្គ្រាម... | Bold |
| [Line 2] | ក្រសួងមហាផ្ទៃឱ្យត្រៀម... | Bold |
| [Line 3] | លទ្ធផលនៃការធ្វើកំណែ... | Moul |

Milestones

Basic Font Style Detection
Multi-line Document Support
Export to DOCX/HTML/Markdown
Add English & Symbol support
Add ONNX Runtime for faster inference
Add C/C++ Inference Engine

License

Distributed under the MIT License. See the LICENSE file for more information.

Contact

Seanghay Yath

Email: seanghay.dev@gmail.com
Telegram: @seanghay_yath
