BoCorpusQC: Tibetan Corpus Quality Control

A tool for filtering Tibetan text files based on language model perplexity, separating high-quality documents from low-quality ones. Two tokenization strategies are supported:

  • SentencePiece (default) — uses a sub-word tokenizer (openpecha/BoSentencePiece) downloaded from Hugging Face Hub.
  • Syllable — splits text on the Tibetan tsek character (་) and uses a KenLM model (openpecha/BoKenlm-syl) downloaded from Hugging Face Hub.
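
For illustration, the syllable split can be sketched in a few lines of Python. This is a simplified sketch, not BoCorpusQC's actual tokenizer (which likely also handles punctuation such as the shad ། and whitespace); the helper name is ours:

```python
# Illustrative sketch: split Tibetan text into syllables on the
# tsek separator (U+0F0B). Not the exact BoCorpusQC implementation.
TSEK = "\u0f0b"  # ་

def syllable_tokenize(text: str) -> list[str]:
    """Split Tibetan text into syllables on the tsek, dropping empties."""
    return [syl for syl in text.split(TSEK) if syl]
```

For example, `syllable_tokenize("བཀྲ་ཤིས་བདེ་ལེགས")` yields the four syllables བཀྲ, ཤིས, བདེ, and ལེགས.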

Installation

  1. Clone the repository:

    git clone https://github.com/OpenPecha/BoCorpusQC.git
    cd BoCorpusQC
  2. Install the required Python packages:

    pip install -r requirements.txt

Command-Line Usage

Use cli.py to filter a directory of .txt files. Each file is scored by perplexity and sorted into good_quality or bad_quality sub-directories.

Command-Line Arguments

  • --input_dir: (Required) The path to the directory containing the .txt files you want to filter.
  • --output_dir: (Required) The path to the directory where the sorted files will be saved.
  • --tokenizer_type: (Optional) sentencepiece (default) or syllable.
  • --num_workers: (Optional) The number of parallel processes to use for scoring the files. Defaults to the total number of CPU cores on your machine.

Example — SentencePiece (default)

python -m BoCorpusQC.cli \
    --input_dir /path/to/your/text_files \
    --output_dir /path/to/your/output_folder \
    --num_workers 4

This command will:

  1. Process all .txt files in /path/to/your/text_files using 4 CPU cores.
  2. Create two new folders inside /path/to/your/output_folder:
    • good_quality: Contains the top 33.3% of files with the lowest perplexity scores.
    • bad_quality: Contains the remaining 66.7% of files.

Example — Syllable tokenization

python -m BoCorpusQC.cli \
    --input_dir /path/to/your/text_files \
    --output_dir /path/to/your/output_folder \
    --tokenizer_type syllable \
    --num_workers 4

Programmatic Usage

Calculate Perplexity for a Text String (SentencePiece)

from BoCorpusQC import PerplexityCalculator

# Downloads KenLM + SentencePiece models from Hugging Face Hub automatically
calculator = PerplexityCalculator.from_sentencepiece()

text = "བཀྲ་ཤིས་བདེ་ལེགས། ཁམས་བཟང་ངམ།"
ppl = calculator.calculate_perplexity(text)
print(f"Perplexity: {ppl:.4f}")

Calculate Perplexity for a Text String (Syllable)

from BoCorpusQC import PerplexityCalculator

# Downloads syllable-level KenLM model from Hugging Face Hub automatically
calculator = PerplexityCalculator.from_syllable()

text = "བཀྲ་ཤིས་བདེ་ལེགས། ཁམས་བཟང་ངམ།"
ppl = calculator.calculate_perplexity(text)
print(f"Perplexity: {ppl:.4f}")

Filter Documents Programmatically

from BoCorpusQC import DocumentFilter

# Using the default SentencePiece tokenizer
doc_filter = DocumentFilter(tokenizer_type="sentencepiece", num_workers=4)
doc_filter.filter_documents("/path/to/input", "/path/to/output")

# Or using syllable-level tokenization
doc_filter = DocumentFilter(
    tokenizer_type="syllable",
    num_workers=4,
)
doc_filter.filter_documents("/path/to/input", "/path/to/output")

A lower perplexity score indicates that the text is more fluent and predictable according to the language model, suggesting higher quality.
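
For intuition, perplexity is the inverse geometric mean of the per-token probabilities. KenLM reports base-10 log probabilities, so the computation can be sketched as follows (the function name is illustrative, not part of BoCorpusQC's API):

```python
def perplexity_from_log10(log10_probs: list[float]) -> float:
    """Perplexity = 10 ** (-(mean per-token log10 probability))."""
    n = len(log10_probs)
    return 10 ** (-sum(log10_probs) / n)
```

A uniform per-token log10 probability of -1.0 (i.e. each token has probability 0.1) gives a perplexity of exactly 10; more probable tokens drive the score lower.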

Implementation

This tool evaluates the quality of Tibetan text files using a pre-trained KenLM language model.

  1. Tokenization: Two strategies are available. SentencePiece downloads a sub-word tokenizer (openpecha/BoSentencePiece) from Hugging Face Hub. Syllable splits text on the Tibetan tsek character (་).
  2. Model Loading: The matching KenLM model is downloaded automatically from Hugging Face Hub — openpecha/BoKenlm for SentencePiece mode, openpecha/BoKenlm-syl for syllable mode.
  3. Perplexity Calculation: Each .txt file in the input directory is treated as a single document and scored. A lower score means more fluent, higher-quality text.
  4. Dynamic Thresholding: A quality threshold is computed at the 33rd percentile of all perplexity scores, keeping the top one-third of documents as "good quality".
  5. Parallel Processing: Multiprocessing is used to score many files in parallel.
  6. Output: Each file is copied into either the good_quality or bad_quality subdirectory in the specified output folder.
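
Steps 3–4 above amount to sorting files by score and keeping the lowest-scoring third. A rough sketch of that logic (illustrative only — the real implementation computes a percentile threshold and copies files on disk):

```python
def split_by_perplexity(scores: dict[str, float], keep_fraction: float = 1 / 3):
    """Return (good, bad) file lists; the lowest-perplexity third is 'good'."""
    ordered = sorted(scores, key=scores.get)      # ascending perplexity
    cutoff = round(len(ordered) * keep_fraction)  # ~33rd percentile cut
    return ordered[:cutoff], ordered[cutoff:]
```

With three files scored 12.0, 45.0, and 80.0, only the 12.0 file lands in good_quality; the other two go to bad_quality.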

Contributing

If you'd like to help out, check out our contributing guidelines.

How to get help

  • File an issue on our GitHub repository.
  • Email us at openpecha[at]gmail.com.
  • Join our Discord.

License

This project is licensed under the MIT License.
