A tool for filtering Tibetan text files based on language model perplexity, separating high-quality documents from low-quality ones. Two tokenization strategies are supported:

- SentencePiece (default) — uses a sub-word model downloaded from Hugging Face Hub.
- Syllable — splits text on the Tibetan tsek character (་) and uses a KenLM model (`openpecha/BoKenlm-syl`) downloaded from Hugging Face Hub.
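For intuition, syllable-level tokenization amounts to splitting on the tsek mark. The sketch below is an illustrative stand-in, not the project's actual tokenizer:

```python
# Tibetan syllables are delimited by the tsek character (U+0F0B).
# Hypothetical helper for illustration only.
TSEK = "\u0f0b"  # ་

def split_syllables(text: str) -> list[str]:
    """Split Tibetan text into syllables on the tsek mark."""
    # Replace the shad sentence mark (U+0F0D) with whitespace, then split.
    cleaned = text.replace("\u0f0d", " ")
    return [syl for chunk in cleaned.split() for syl in chunk.split(TSEK) if syl]

print(split_syllables("བཀྲ་ཤིས་བདེ་ལེགས།"))  # ['བཀྲ', 'ཤིས', 'བདེ', 'ལེགས']
```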
Clone the repository:

```bash
git clone https://github.com/your-repo/BoCorpusQC.git
cd BoCorpusQC
```

Install the required Python packages:

```bash
pip install -r requirements.txt
```
Use `cli.py` to filter a directory of `.txt` files. Each file is scored by perplexity and sorted into `good_quality` or `bad_quality` sub-directories.

- `--input_dir`: (Required) Path to the directory containing the `.txt` files you want to filter.
- `--output_dir`: (Required) Path to the directory where the sorted files will be saved.
- `--tokenizer_type`: (Optional) `sentencepiece` (default) or `syllable`.
- `--num_workers`: (Optional) Number of parallel processes to use for scoring the files. Defaults to the total number of CPU cores on your machine.
```bash
python -m BoCorpusQC.cli \
    --input_dir /path/to/your/text_files \
    --output_dir /path/to/your/output_folder \
    --num_workers 4
```

This command will:

- Process all `.txt` files in `/path/to/your/text_files` using 4 CPU cores.
- Create two new folders inside `/path/to/your/output_folder`:
  - `good_quality`: Contains the top 33.3% of files with the lowest perplexity scores.
  - `bad_quality`: Contains the remaining 66.7% of files.
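The 33.3% / 66.7% split amounts to thresholding at the 33rd percentile of the per-file perplexity scores. A minimal sketch of that decision (illustrative only, not the project's code):

```python
# Illustrative percentile-based split: the lowest-perplexity third of files
# is kept as "good quality", the rest as "bad quality".
def split_by_threshold(scores: dict[str, float], percentile: float = 33.0):
    """Return (good, bad) filename lists; 'good' = lowest-perplexity files."""
    ordered = sorted(scores, key=scores.get)          # best (lowest ppl) first
    cutoff = round(len(ordered) * percentile / 100)   # index of the threshold
    return ordered[:cutoff], ordered[cutoff:]

scores = {"a.txt": 120.5, "b.txt": 980.2, "c.txt": 45.1,
          "d.txt": 300.0, "e.txt": 75.9, "f.txt": 510.3}
good, bad = split_by_threshold(scores)
print(good)  # ['c.txt', 'e.txt']
```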
```bash
python -m BoCorpusQC.cli \
    --input_dir /path/to/your/text_files \
    --output_dir /path/to/your/output_folder \
    --tokenizer_type syllable \
    --num_workers 4
```

```python
from BoCorpusQC import PerplexityCalculator

# Downloads KenLM + SentencePiece models from Hugging Face Hub automatically
calculator = PerplexityCalculator.from_sentencepiece()

text = "བཀྲ་ཤིས་བདེ་ལེགས། ཁམས་བཟང་ངམ།"
ppl = calculator.calculate_perplexity(text)
print(f"Perplexity: {ppl:.4f}")
```

```python
from BoCorpusQC import PerplexityCalculator

# Downloads syllable-level KenLM model from Hugging Face Hub automatically
calculator = PerplexityCalculator.from_syllable()

text = "བཀྲ་ཤིས་བདེ་ལེགས། ཁམས་བཟང་ངམ།"
ppl = calculator.calculate_perplexity(text)
print(f"Perplexity: {ppl:.4f}")
```

```python
from BoCorpusQC import DocumentFilter

# Using the default SentencePiece tokenizer
doc_filter = DocumentFilter(tokenizer_type="sentencepiece", num_workers=4)
doc_filter.filter_documents("/path/to/input", "/path/to/output")

# Or using syllable-level tokenization
doc_filter = DocumentFilter(
    tokenizer_type="syllable",
    num_workers=4,
)
doc_filter.filter_documents("/path/to/input", "/path/to/output")
```

A lower perplexity score indicates that the text is more fluent and predictable according to the language model, suggesting higher quality.
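For intuition on why lower is better: perplexity is derived from the model's average per-token log probability. A small worked example (generic arithmetic, not BoCorpusQC's internals; KenLM reports log10 scores):

```python
# Perplexity from per-token log10 probabilities:
#   ppl = 10 ** (-(sum of log10 probs) / token_count)
def perplexity(log10_probs: list[float]) -> float:
    return 10 ** (-sum(log10_probs) / len(log10_probs))

# Fluent text: tokens the model finds likely -> low perplexity.
fluent = [-1.0, -0.8, -1.2, -1.0]    # average -1.0 -> ppl = 10.0
garbled = [-3.0, -2.5, -3.5, -3.0]   # average -3.0 -> ppl = 1000.0
assert perplexity(fluent) < perplexity(garbled)
print(round(perplexity(fluent), 1))  # 10.0
```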
This tool evaluates the quality of Tibetan text files using a pre-trained KenLM language model.

- Tokenization: Two strategies are available. SentencePiece downloads a sub-word tokenizer (`openpecha/BoSentencePiece`) from Hugging Face Hub; Syllable splits text on the Tibetan tsek character (་).
- Model Loading: The KenLM model is automatically downloaded from Hugging Face Hub — `openpecha/BoKenlm` for SentencePiece mode, `openpecha/BoKenlm-syl` for syllable mode.
- Perplexity Calculation: Each `.txt` file in the input directory is treated as a single document and scored. A lower score means more fluent, higher-quality text.
- Dynamic Thresholding: A quality threshold is computed at the 33rd percentile of all perplexity scores, keeping the top one-third of documents as "good quality".
- Parallel Processing: Multiprocessing is used to score many files in parallel.
- Output: Each file is copied into either the `good_quality` or `bad_quality` subdirectory of the specified output folder.
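The steps above can be sketched end-to-end. This is a simplified, single-process illustration with a hypothetical `score_file` standing in for the real KenLM scoring:

```python
import shutil
from pathlib import Path

def score_file(path: Path) -> float:
    """Hypothetical stand-in for KenLM perplexity; here, just file length."""
    return float(len(path.read_text(encoding="utf-8")))

def filter_documents(input_dir: str, output_dir: str) -> None:
    """Score every .txt file, threshold at the 33rd percentile, copy files."""
    files = sorted(Path(input_dir).glob("*.txt"))
    scores = {f: score_file(f) for f in files}
    ordered = sorted(files, key=scores.get)          # lowest "perplexity" first
    cutoff = round(len(ordered) * 0.33)              # 33rd-percentile index
    for sub in ("good_quality", "bad_quality"):
        (Path(output_dir) / sub).mkdir(parents=True, exist_ok=True)
    for i, f in enumerate(ordered):
        sub = "good_quality" if i < cutoff else "bad_quality"
        shutil.copy(f, Path(output_dir) / sub / f.name)
```

The real tool additionally fans the scoring step out over `num_workers` processes; the threshold and copy logic are the same shape.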
If you'd like to help out, check out our contributing guidelines.
- File an issue on our GitHub repository.
- Email us at openpecha[at]gmail.com.
- Join our Discord.
This project is licensed under the MIT License.