mbzuai-nlp/do_diacritics_matter

Do Diacritics Matter? Evaluating the Impact of Arabic Diacritics on Tokenization and LLM Benchmarks

Go Inoue, Bashar Alhafni, Nizar Habash, Timothy Baldwin

Abstract

Diacritics are orthographic marks added to letters to specify pronunciation, disambiguate lexical meanings, or indicate grammatical distinctions. Diacritics can significantly influence language processing tasks, especially in languages like Arabic, where diacritic usage varies widely across domains and contexts. While diacritics provide valuable linguistic information, their presence can increase subword fragmentation during tokenization, potentially degrading the performance of NLP models. In this paper, we systematically analyze the impact of diacritics on tokenization and benchmark task performance across major Large Language Models (LLMs). Our results demonstrate that while modern LLMs show robustness to the limited diacritics naturally found in texts, full diacritization leads to substantially increased token fragmentation and degraded performance, highlighting the need for careful handling of diacritics in the future development of Arabic LLMs.
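The fragmentation effect described here is commonly measured as tokenizer fertility, the average number of subword tokens per word. A minimal sketch of the metric, where `tokenize` is a placeholder for any real subword tokenizer (the 3-character toy tokenizer below is purely illustrative):

```python
def fertility(tokenize, words):
    """Average number of subword tokens produced per word."""
    return sum(len(tokenize(w)) for w in words) / len(words)

# toy tokenizer: splits every 3 characters, standing in for a BPE model
toy = lambda w: [w[i:i + 3] for i in range(0, len(w), 3)]

print(fertility(toy, ["tokenization", "matters"]))  # -> 3.5
```

A higher fertility on fully diacritized text than on its undiacritized counterpart indicates the extra fragmentation the paper analyzes.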

Getting Started

# create a virtual environment
conda create -n ddm python=3.10
conda activate ddm

# install required libraries
pip install -r requirements.txt

# download CAMeLBERT disambiguator
camel_data -i disambig-bert-unfactored-msa

Diacritized and Undiacritized Benchmark Datasets

We provide preprocessed benchmark datasets that are readily available and the code to generate these datasets. For automatic diacritization, we use the system of Elgamal et al., 2024, an extended version of the CAMeL Tools disambiguator. For removing diacritics, we use the dediac_ar() function implemented in CAMeL Tools.
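Conceptually, dediacritization just strips the Arabic combining-mark code points while leaving base letters intact. A minimal stdlib sketch of the idea, not the actual CAMeL Tools `dediac_ar()` implementation:

```python
import re

# Arabic diacritic marks: fathatan through sukun (U+064B-U+0652)
# plus the superscript (dagger) alef (U+0670)
_DIAC_RE = re.compile(r"[\u064B-\u0652\u0670]")

def strip_diacritics(text: str) -> str:
    """Remove Arabic diacritic marks, keeping base letters."""
    return _DIAC_RE.sub("", text)

print(strip_diacritics("كَتَبَ"))  # -> "كتب"
```

For the experiments themselves, use `dediac_ar()` from CAMeL Tools, which handles the full set of diacritic characters consistently with the rest of the toolkit.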

Where can I find processed benchmark datasets?

Processed benchmark datasets are hosted on Hugging Face:

| Raw/Wild | Undiacritized | Fully diacritized |
| --- | --- | --- |
| ArabicMMLU | ArabicMMLU_undiac | ArabicMMLU_full |
| ArabCulture | ArabCulture_undiac | ArabCulture_full |
| AraTrust | AraTrust_undiac | AraTrust_full |

How do I diacritize and undiacritize benchmark datasets myself?

Get SAMA3.1 database from LDC

Get the original database LDC2010L01.tgz from the Linguistic Data Consortium and reconstruct the extended database (Elgamal et al., 2024) with muddler.

cd database

# reconstruct the database
muddler unmuddle -s /path/to/LDC2010L01.tgz -m ./calima-msa-s31-extended.db.muddled ./calima-msa-s31-extended.db

Download benchmark datasets locally

cd benchmark

# download "wild" datasets
hf download go-inoue/ArabCulture --repo-type dataset --local-dir ArabCulture
hf download go-inoue/AraTrust --repo-type dataset --local-dir AraTrust
hf download MBZUAI/ArabicMMLU --repo-type dataset --local-dir ArabicMMLU

Diacritize and undiacritize benchmark datasets

cd scripts

# diacritize benchmark datasets
sh diacritize_benchmark.sh ArabicMMLU
sh diacritize_benchmark.sh ArabCulture
sh diacritize_benchmark.sh AraTrust

# undiacritize benchmark datasets
sh undiacritize_benchmark.sh ArabicMMLU
sh undiacritize_benchmark.sh ArabCulture
sh undiacritize_benchmark.sh AraTrust

Benchmarking with lm-evaluation-harness

We use lm-evaluation-harness to run our experiments. The task definition files required by lm-evaluation-harness are available in the tasks/ directory.

Where can I find model output files used in the paper?

All model output files are available in experiments/.

How do I run evaluation?

Here is example code for running inceptionai/jais-family-13b-chat on the fully diacritized version of ArabicMMLU (ArabicMMLU_full).

cd experiments

export MODEL_PATH=inceptionai/jais-family-13b-chat
export TASK_PATH=../tasks/arabicmmlu_full
export OUTPUT_PATH=./arabicmmlu_full

lm_eval --model hf \
    --model_args pretrained=$MODEL_PATH,trust_remote_code=True \
    --tasks $TASK_PATH \
    --include_path $TASK_PATH \
    --device cuda:0 \
    --output_path $OUTPUT_PATH \
    --log_samples \
    --batch_size 1 \
    --gen_kwargs temperature=0.0,do_sample=False
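After a run, lm-evaluation-harness writes its scores to a JSON file under the output path. A small sketch for pulling one metric out of such a file, assuming the harness's layout with a top-level `"results"` dict keyed by task name (the `"acc,none"` key is the harness's convention for unfiltered accuracy):

```python
import json

def load_metric(results_path, task, metric="acc,none"):
    """Read one metric for one task from an lm-evaluation-harness
    results JSON, assuming {"results": {task: {metric: value}}}."""
    with open(results_path, encoding="utf-8") as f:
        data = json.load(f)
    return data["results"][task][metric]
```

This is convenient when comparing the wild, undiacritized, and fully diacritized runs of the same benchmark side by side.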

We used the following open-weight models:

  • FreedomIntelligence/AceGPT-v2-8B-Chat
  • humain-ai/ALLaM-7B-Instruct-preview
  • QCRI/Fanar-1-9B-Instruct
  • inceptionai/jais-family-13b-chat
  • inceptionai/Jais-2-8B-Chat
  • CohereLabs/aya-expanse-8b
  • meta-llama/Llama-3.1-8B-Instruct
  • Qwen/Qwen3-8B

Note: To run inceptionai/Jais-2-8B-Chat, you will need to install transformers (5.0.0rc2).

Getting Tokenization Statistics

Scripts used to compute tokenization statistics are available at scripts/.

Surface-level statistics

  • get_stats_wild2max.py: Compute tokenization statistics on the Wild2Max dataset.
  • get_stats_benchmark.py: Compute tokenization statistics on benchmark datasets.
  • get_stats_token_overlap.py: Compute token overlap and Jaccard similarity on the Wild2Max dataset.
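Token overlap of this kind can be read as set-level Jaccard similarity between the token inventories of two versions of the same text. A minimal sketch of the measure (not necessarily the script's exact definition; the example tokens are illustrative):

```python
def jaccard(tokens_a, tokens_b):
    """Jaccard similarity between two token sets: |A & B| / |A | B|."""
    a, b = set(tokens_a), set(tokens_b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0

# e.g. token inventories of an undiacritized vs. a diacritized tokenization
print(jaccard(["كتب", "##ها"], ["كت", "##ب", "##ها"]))  # -> 0.25
```

A Jaccard score near 1.0 means diacritization barely changes the token inventory; a low score signals heavy re-segmentation.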

Internal representation

  • compare_embdding.py: Compute layer-wise cosine similarities of word-level representations for undiac-wild, undiac-full, and wild-full pairs on the Wild2Max dataset.
  • compare_embedding_inter_word.py: Compute layer-wise inter-word cosine similarities of word-level representations for undiac-wild, undiac-full, and wild-full pairs on the Wild2Max dataset.
  • compare_sentence_embedding.py: Compute cosine similarities of sentence-level representations from the first layer for undiac-wild, undiac-full, and wild-full pairings on three benchmark datasets.
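At their core, these comparisons reduce to cosine similarity between word-level vectors, typically obtained by mean-pooling the subword vectors of each word. A minimal stdlib sketch of both operations (the two-dimensional vectors are toy values):

```python
import math

def mean_pool(vectors):
    """Average a list of subword vectors into one word-level vector."""
    n = len(vectors)
    return [sum(dim) / n for dim in zip(*vectors)]

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm

word_vec = mean_pool([[1.0, 0.0], [1.0, 2.0]])  # -> [1.0, 1.0]
print(cosine(word_vec, [2.0, 2.0]))  # close to 1.0
```

Running this per layer, for each of the three text-version pairings, yields the layer-wise similarity curves reported by the scripts above.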
