seanghay/KhmerOCR

KhmerOCR

A high-performance Khmer Optical Character Recognition (OCR) engine tailored for documents. The model was trained on 3 million text lines using more than 800 Khmer fonts, for robust recognition across font styles and weights.

Important

Update: The library now supports full document processing, layout detection, and multi-format exports (PDF, DOCX, HTML, Markdown).


Features

  • Fast: Optimized for Khmer script, with inference running on ONNX Runtime
  • Native C++ Engine: High-performance C/C++ implementation with C API for FFI bindings
  • Font Detection: Automatically identifies and preserves Moul vs. Regular font styles
  • Multi-format Export: Convert images or PDFs into editable .docx, .md, .html, or .txt files
  • PDF Support: High-resolution PDF rendering and processing via PyMuPDF
  • Cross-Platform: Supports macOS, Linux, Windows, iOS, and Android

Installation

Python

pip install git+https://github.com/seanghay/KhmerOCR

C++ Library

See cpp/README.md for build instructions.

cd cpp
mkdir build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release
# nproc is not available by default on macOS; use $(sysctl -n hw.ncpu) there
make -j$(nproc)

Usage

Python CLI

For single images or documents, run:

khmerocr document.jpg --format docx

Python CLI Options

| Option | Shortcut | Description | Default |
| --- | --- | --- | --- |
| --output | -o | Custom output path | input_filename.{format} |
| --format | -f | Output format: txt, html, docx, md | txt |
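
The options above can also be driven from a script. The following is a minimal sketch, assuming the `khmerocr` command is on `PATH` after installation; the helper names `build_khmerocr_command` and `convert` are hypothetical and not part of the library. It only uses the `--format` and `--output` flags listed in the table.

```python
import subprocess

# Output formats listed in the options table.
SUPPORTED_FORMATS = {"txt", "html", "docx", "md"}


def build_khmerocr_command(input_path, fmt="txt", output_path=None):
    """Build the argument list for the Python CLI (hypothetical helper)."""
    if fmt not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported format: {fmt!r}")
    cmd = ["khmerocr", str(input_path), "--format", fmt]
    if output_path is not None:
        # Without --output, the CLI picks its own default output path.
        cmd += ["--output", str(output_path)]
    return cmd


def convert(input_path, fmt="txt", output_path=None):
    """Run the CLI and raise CalledProcessError if it exits with an error."""
    cmd = build_khmerocr_command(input_path, fmt, output_path)
    subprocess.run(cmd, check=True)
```

Calling `convert("document.jpg", fmt="docx")` would then run the same command as the example in the Python CLI section.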

C++ CLI

The C++ CLI is a lightweight inference tool focused on text extraction. For document formatting (DOCX, HTML, etc.), use the Python CLI.

# Full OCR (detect + recognize)
./cpp/build/khmerocr image.png

# JSON output
./cpp/build/khmerocr -j image.png

# Detection only
./cpp/build/khmerocr -d image.png

# Recognition only (for pre-cropped text images)
./cpp/build/khmerocr -r cropped_text.png

# Verbose output with confidence scores
./cpp/build/khmerocr -v image.png

| Option | Shortcut | Description |
| --- | --- | --- |
| --json | -j | Output results in JSON format |
| --detect-only | -d | Only detect text regions, skip recognition |
| --recognize-only | -r | Only recognize (skip detection) |
| --verbose | -v | Show confidence scores |
| --model-dir | -m | Custom model directory path |
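
For scripted use, the C++ binary can be wrapped in the same way. This sketch assumes the binary was built at `./cpp/build/khmerocr` as in the commands above, that `--model-dir` takes the directory as its next argument, and that combining `--detect-only` with `--recognize-only` is not meaningful (the wrapper rejects it; the CLI's own behavior in that case is not documented here). `build_cpp_command` and `run_cpp_ocr` are hypothetical names.

```python
import subprocess


def build_cpp_command(image_path, binary="./cpp/build/khmerocr",
                      json_output=False, detect_only=False,
                      recognize_only=False, verbose=False, model_dir=None):
    """Build the argument list for the C++ CLI (hypothetical helper)."""
    if detect_only and recognize_only:
        # Assumption: the two modes are alternatives, so reject both at once.
        raise ValueError("choose at most one of detect_only and recognize_only")
    cmd = [binary]
    if json_output:
        cmd.append("--json")
    if detect_only:
        cmd.append("--detect-only")
    if recognize_only:
        cmd.append("--recognize-only")
    if verbose:
        cmd.append("--verbose")
    if model_dir is not None:
        # Assumption: the directory is passed as a separate argument.
        cmd += ["--model-dir", str(model_dir)]
    cmd.append(str(image_path))
    return cmd


def run_cpp_ocr(image_path, **options):
    """Run the binary and return whatever it writes to stdout, unparsed."""
    result = subprocess.run(build_cpp_command(image_path, **options),
                            check=True, capture_output=True, text=True)
    return result.stdout
```

The output is returned as raw text because the exact structure of the JSON emitted for a full image is not described in this README.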

Example Output

For each recognized line, the engine returns the text, the detected font style, and a confidence score for each:

{
  "text": "លទ្ធផលនៃការធ្វើកំណែទប្រង់លើទូរគមនាគមន៍កម្ពុជា",
  "text_confidence": 0.9804,
  "font": "Moul",
  "font_confidence": 0.9999
}
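
A record like the one above can be checked against confidence thresholds before it is accepted. The sketch below parses the example record and flags it for review when either score is low. The field names come from the example output; the 0.9 thresholds and the helper names `is_reliable` and `describe` are illustrative assumptions, not library defaults.

```python
import json

# The example record shown above, as a JSON string.
EXAMPLE_LINE = """
{
  "text": "លទ្ធផលនៃការធ្វើកំណែទប្រង់លើទូរគមនាគមន៍កម្ពុជា",
  "text_confidence": 0.9804,
  "font": "Moul",
  "font_confidence": 0.9999
}
"""


def is_reliable(record, min_text_conf=0.9, min_font_conf=0.9):
    """True when both confidence scores reach the (assumed) thresholds."""
    return (record["text_confidence"] >= min_text_conf
            and record["font_confidence"] >= min_font_conf)


def describe(record):
    """One-line summary: font, font confidence, review flag, then the text."""
    flag = "ok" if is_reliable(record) else "review"
    return f'{record["font"]} ({record["font_confidence"]:.2f}) [{flag}]: {record["text"]}'


record = json.loads(EXAMPLE_LINE)
print(describe(record))
```

Suitable thresholds depend on the documents being processed, so they are best tuned on a sample of real scans.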

Examples

| Input | Detected Text | Font Style |
| --- | --- | --- |
| [Line 1] | យេម៉ែនលង់ក្នុងសង្គ្រាម... | Bold |
| [Line 2] | ក្រសួងមហាផ្ទៃឱ្យត្រៀម... | Bold |
| [Line 3] | លទ្ធផលនៃការធ្វើកំណែ... | Moul |

Milestones

  • Basic Font Style Detection
  • Multi-line Document Support
  • Export to DOCX/HTML/Markdown
  • Add English & Symbol support
  • Add ONNXRuntime for faster inference
  • Add C/C++ Inference Engine

License

Distributed under the MIT License. See the LICENSE file for more information.


Contact

Seanghay Yath


Sponsored by KhmerScan

(Convert images to Khmer text)
