A high-performance Khmer Optical Character Recognition (OCR) engine tailored for documents. The model was trained on 3 million text lines rendered in more than 800 Khmer fonts, so that recognition stays robust across styles and weights.
> [!IMPORTANT]
> Update: The library now supports full document processing, layout detection, and multi-format exports (PDF, DOCX, HTML, Markdown).
- Fast: Optimized for Khmer script, with inference running on ONNX Runtime
- Native C++ Engine: High-performance C/C++ implementation with C API for FFI bindings
- Font Detection: Automatically identifies and preserves Moul vs. Regular font styles
- Multi-format Export: Convert images or PDFs into editable `.docx`, `.md`, `.html`, or `.txt` files
- PDF Support: High-resolution PDF rendering and processing via PyMuPDF
- Cross-Platform: Supports macOS, Linux, Windows, iOS, and Android
pip install git+https://github.com/seanghay/KhmerOCR

See cpp/README.md for build instructions.
cd cpp
mkdir build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release
make -j$(nproc)

For single images or documents, run:
khmerocr document.jpg --format docx
| Option | Shortcut | Description | Default |
|---|---|---|---|
| `--output` | `-o` | Custom output path | `input_filename.{format}` |
| `--format` | `-f` | Output format: `txt`, `html`, `docx`, `md` | `txt` |
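The options above can also be driven from a script. The following is a minimal sketch, not part of the library: it shells out to the `khmerocr` command with the documented `--format` and `--output` flags. The helper names (`build_command`, `default_output_path`, `convert`) are hypothetical, and the default path logic assumes that `input_filename.{format}` means the input path with its extension replaced.

```python
import subprocess
from pathlib import Path

# Output formats listed in the options table.
FORMATS = ("txt", "html", "docx", "md")


def default_output_path(input_path, fmt="txt"):
    """Guess the default output path.

    Assumption: "input_filename.{format}" means the input path with its
    extension swapped for the chosen format.
    """
    return Path(input_path).with_suffix(f".{fmt}")


def build_command(input_path, fmt="txt", output=None):
    """Build the argument list for the Python CLI (hypothetical helper)."""
    if fmt not in FORMATS:
        raise ValueError(f"unsupported format: {fmt!r}")
    cmd = ["khmerocr", str(input_path), "--format", fmt]
    if output is not None:
        cmd += ["--output", str(output)]
    return cmd


def convert(input_path, fmt="txt", output=None):
    """Run the CLI and return the path where the result is expected."""
    subprocess.run(build_command(input_path, fmt, output), check=True)
    if output is not None:
        return Path(output)
    return default_output_path(input_path, fmt)
```

Calling `convert("document.jpg", fmt="docx")` would run the same command as the example above, provided `khmerocr` is installed and on `PATH`.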
The C++ CLI is a lightweight inference tool focused on text extraction. For document formatting (DOCX, HTML, etc.), use the Python CLI.
# Full OCR (detect + recognize)
./cpp/build/khmerocr image.png
# JSON output
./cpp/build/khmerocr -j image.png
# Detection only
./cpp/build/khmerocr -d image.png
# Recognition only (for pre-cropped text images)
./cpp/build/khmerocr -r cropped_text.png
# Verbose output with confidence scores
./cpp/build/khmerocr -v image.png

| Option | Shortcut | Description |
|---|---|---|
| `--json` | `-j` | Output results in JSON format |
| `--detect-only` | `-d` | Only detect text regions, skip recognition |
| `--recognize-only` | `-r` | Only recognize (skip detection) |
| `--verbose` | `-v` | Show confidence scores |
| `--model-dir` | `-m` | Custom model directory path |
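For scripted use, the C++ CLI can be wrapped in a few lines of Python. This is a sketch under several assumptions that the documentation above does not confirm: that `--json` can be combined with `--detect-only` or `--recognize-only`, that options may precede the image path (as in the `-j image.png` example), and that standard output contains only the JSON document. The function names and the `mode` parameter are hypothetical, and the JSON is returned as parsed, since its structure is not specified here.

```python
import json
import subprocess

# Binary location used in the examples above.
BINARY = "./cpp/build/khmerocr"

# Hypothetical mode names mapped to the documented flags.
MODE_FLAGS = {
    "full": None,
    "detect": "--detect-only",
    "recognize": "--recognize-only",
}


def build_args(image, mode="full", model_dir=None, binary=BINARY):
    """Build the argument list for the C++ CLI with JSON output enabled."""
    if mode not in MODE_FLAGS:
        raise ValueError(f"unknown mode: {mode!r}")
    args = [binary, "--json"]
    if MODE_FLAGS[mode] is not None:
        args.append(MODE_FLAGS[mode])
    if model_dir is not None:
        args += ["--model-dir", str(model_dir)]
    args.append(str(image))
    return args


def run_ocr(image, mode="full", model_dir=None):
    """Run the CLI and parse its standard output as JSON.

    Assumption: stdout holds nothing but the JSON document.
    """
    proc = subprocess.run(
        build_args(image, mode, model_dir),
        capture_output=True,
        text=True,
        check=True,
    )
    return json.loads(proc.stdout)
```

Keeping `build_args` separate from `run_ocr` makes the command line easy to inspect or log before anything is executed.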
When processing a line, the engine provides rich metadata:
{
"text": "លទ្ធផលនៃការធ្វើកំណែទប្រង់លើទូរគមនាគមន៍កម្ពុជា",
"text_confidence": 0.9804,
"font": "Moul",
"font_confidence": 0.9999
}
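One way to consume this metadata is to load it into a small typed record and flag low-confidence lines for manual review. The sketch below uses only the four fields shown in the sample; the `LineResult` class, the `needs_review` helper, and the 0.95 threshold are illustrative assumptions, not part of the engine.

```python
import json
from dataclasses import dataclass


@dataclass
class LineResult:
    text: str
    text_confidence: float
    font: str
    font_confidence: float


def parse_line(raw):
    """Parse one line's metadata from its JSON representation."""
    data = json.loads(raw)
    return LineResult(
        text=data["text"],
        text_confidence=float(data["text_confidence"]),
        font=data["font"],
        font_confidence=float(data["font_confidence"]),
    )


def needs_review(line, threshold=0.95):
    """Flag a line when either confidence is below the threshold.

    The threshold is an arbitrary illustrative value.
    """
    return line.text_confidence < threshold or line.font_confidence < threshold


# The sample record shown above.
sample = """
{
  "text": "លទ្ធផលនៃការធ្វើកំណែទប្រង់លើទូរគមនាគមន៍កម្ពុជា",
  "text_confidence": 0.9804,
  "font": "Moul",
  "font_confidence": 0.9999
}
"""

line = parse_line(sample)
print(line.font, needs_review(line))
```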
| Input | Detected Text | Font Style |
|---|---|---|
| [Line 1] | យេម៉ែនលង់ក្នុងសង្គ្រាម... | Bold |
| [Line 2] | ក្រសួងមហាផ្ទៃឱ្យត្រៀម... | Bold |
| [Line 3] | លទ្ធផលនៃការធ្វើកំណែ... | Moul |
- Basic Font Style Detection
- Multi-line Document Support
- Export to DOCX/HTML/Markdown
- Add English & Symbol support
- Add ONNXRuntime for faster inference
- Add C/C++ Inference Engine
Distributed under the MIT License. See the LICENSE file for more information.
Seanghay Yath
- Email: seanghay.dev@gmail.com
- Telegram: @seanghay_yath