mbzuai-nlp/do_diacritics_matter

Do Diacritics Matter? Evaluating the Impact of Arabic Diacritics on Tokenization and LLM Benchmarks

Go Inoue, Bashar Alhafni, Nizar Habash, Timothy Baldwin

Abstract

Diacritics are orthographic marks added to letters to specify pronunciation, disambiguate lexical meanings, or indicate grammatical distinctions. Diacritics can significantly influence language processing tasks, especially in languages like Arabic, where diacritic usage varies widely across domains and contexts. While diacritics provide valuable linguistic information, their presence can increase subword fragmentation during tokenization, potentially degrading the performance of NLP models. In this paper, we systematically analyze the impact of diacritics on tokenization and benchmark task performance across major Large Language Models (LLMs). Our results demonstrate that while modern LLMs show robustness to the limited diacritics naturally found in texts, full diacritization leads to substantially increased token fragmentation and degraded performance, highlighting the need for careful handling of diacritics in the future development of Arabic LLMs.
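The fragmentation effect described here is commonly measured as tokenizer fertility, the average number of subword tokens per word. A minimal sketch of the metric, where `tokenize` is a placeholder for any real subword tokenizer (the 3-character toy tokenizer below is purely illustrative):

```python
def fertility(tokenize, words):
    """Average number of subword tokens produced per word."""
    return sum(len(tokenize(w)) for w in words) / len(words)

# toy tokenizer: splits every 3 characters, standing in for a BPE model
toy = lambda w: [w[i:i + 3] for i in range(0, len(w), 3)]

print(fertility(toy, ["tokenization", "matters"]))  # -> 3.5
```

A higher fertility on fully diacritized text than on its undiacritized counterpart indicates the extra fragmentation the paper analyzes.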

Getting Started

# create a virtual environment
conda create -n ddm python=3.10
conda activate ddm

# install required libraries
pip install -r requirements.txt

# download CAMeLBERT disambiguator
camel_data -i disambig-bert-unfactored-msa

Diacritized and Undiacritized Benchmark Datasets

We provide preprocessed benchmark datasets that are readily available and the code to generate these datasets. For automatic diacritization, we use the system of Elgamal et al., 2024, an extended version of the CAMeL Tools disambiguator. For removing diacritics, we use the dediac_ar() function implemented in CAMeL Tools.
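Conceptually, dediacritization just strips the Arabic combining-mark code points while leaving base letters intact. A minimal stdlib sketch of the idea, not the actual CAMeL Tools `dediac_ar()` implementation:

```python
import re

# Arabic diacritic marks: fathatan through sukun (U+064B-U+0652)
# plus the superscript (dagger) alef (U+0670)
_DIAC_RE = re.compile(r"[\u064B-\u0652\u0670]")

def strip_diacritics(text: str) -> str:
    """Remove Arabic diacritic marks, keeping base letters."""
    return _DIAC_RE.sub("", text)

print(strip_diacritics("كَتَبَ"))  # -> "كتب"
```

For the experiments themselves, use `dediac_ar()` from CAMeL Tools, which handles the full set of diacritic characters consistently with the rest of the toolkit.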

Where can I find processed benchmark datasets?

Processed benchmark datasets are hosted on Hugging Face:

| Raw/Wild | Undiacritized | Fully diacritized |
| --- | --- | --- |
| ArabicMMLU | ArabicMMLU_undiac | ArabicMMLU_full |
| ArabCulture | ArabCulture_undiac | ArabCulture_full |
| AraTrust | AraTrust_undiac | AraTrust_full |

How do I diacritize and undiacritize benchmark datasets myself?

Get SAMA3.1 database from LDC

Get the original database LDC2010L01.tgz from the Linguistic Data Consortium and reconstruct the extended database (Elgamal et al., 2024) with muddler.

cd database

# reconstruct the database
muddler unmuddle -s /path/to/LDC2010L01.tgz -m ./calima-msa-s31-extended.db.muddled ./calima-msa-s31-extended.db

Download benchmark datasets locally

cd benchmark

# download "wild" datasets
hf download go-inoue/ArabCulture --repo-type dataset --local-dir ArabCulture
hf download go-inoue/AraTrust --repo-type dataset --local-dir AraTrust
hf download MBZUAI/ArabicMMLU --repo-type dataset --local-dir ArabicMMLU

Diacritize and undiacritize benchmark datasets

cd scripts

# diacritize benchmark datasets
sh diacritize_benchmark.sh ArabicMMLU
sh diacritize_benchmark.sh ArabCulture
sh diacritize_benchmark.sh AraTrust

# undiacritize benchmark datasets
sh undiacritize_benchmark.sh ArabicMMLU
sh undiacritize_benchmark.sh ArabCulture
sh undiacritize_benchmark.sh AraTrust

Benchmarking with lm-evaluation-harness

We use lm-evaluation-harness to run our experiments. The task definition files required by lm-evaluation-harness are available in the tasks/ directory.

Where can I find model output files used in the paper?

All model output files are available in experiments/.

How do I run evaluation?

Here is example code for running inceptionai/jais-family-13b-chat on the fully diacritized version of ArabicMMLU (ArabicMMLU_full).

cd experiments

export MODEL_PATH=inceptionai/jais-family-13b-chat
export TASK_PATH=../tasks/arabicmmlu_full
export OUTPUT_PATH=./arabicmmlu_full

lm_eval --model hf \
    --model_args pretrained=$MODEL_PATH,trust_remote_code=True \
    --tasks $TASK_PATH \
    --include_path $TASK_PATH \
    --device cuda:0 \
    --output_path $OUTPUT_PATH \
    --log_samples \
    --batch_size 1 \
    --gen_kwargs temperature=0.0,do_sample=False
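After a run, lm-evaluation-harness writes its scores to a JSON file under the output path. A small sketch for pulling one metric out of such a file, assuming the harness's layout with a top-level `"results"` dict keyed by task name (the `"acc,none"` key is the harness's convention for unfiltered accuracy):

```python
import json

def load_metric(results_path, task, metric="acc,none"):
    """Read one metric for one task from an lm-evaluation-harness
    results JSON, assuming {"results": {task: {metric: value}}}."""
    with open(results_path, encoding="utf-8") as f:
        data = json.load(f)
    return data["results"][task][metric]
```

This is convenient when comparing the wild, undiacritized, and fully diacritized runs of the same benchmark side by side.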

We used the following open-weight models:

  • FreedomIntelligence/AceGPT-v2-8B-Chat
  • humain-ai/ALLaM-7B-Instruct-preview
  • QCRI/Fanar-1-9B-Instruct
  • inceptionai/jais-family-13b-chat
  • inceptionai/Jais-2-8B-Chat
  • CohereLabs/aya-expanse-8b
  • meta-llama/Llama-3.1-8B-Instruct
  • Qwen/Qwen3-8B

Note: To run inceptionai/Jais-2-8B-Chat, you will need to install transformers (5.0.0rc2).

Getting Tokenization Statistics

Scripts used to compute tokenization statistics are available at scripts/.

Surface-level statistics

  • get_stats_wild2max.py: Compute tokenization statistics on the Wild2Max dataset.
  • get_stats_benchmark.py: Compute tokenization statistics on benchmark datasets.
  • get_stats_token_overlap.py: Compute token overlap and Jaccard similarity on the Wild2Max dataset.
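Token overlap of this kind can be read as set-level Jaccard similarity between the token inventories of two versions of the same text. A minimal sketch of the measure (not necessarily the script's exact definition; the example tokens are illustrative):

```python
def jaccard(tokens_a, tokens_b):
    """Jaccard similarity between two token sets: |A & B| / |A | B|."""
    a, b = set(tokens_a), set(tokens_b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0

# e.g. token inventories of an undiacritized vs. a diacritized tokenization
print(jaccard(["كتب", "##ها"], ["كت", "##ب", "##ها"]))  # -> 0.25
```

A Jaccard score near 1.0 means diacritization barely changes the token inventory; a low score signals heavy re-segmentation.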

Internal representation

  • compare_embdding.py: Compute layer-wise cosine similarities of word-level representations for undiac-wild, undiac-full, and wild-full pairs on the Wild2Max dataset.
  • compare_embedding_inter_word.py: Compute layer-wise inter-word cosine similarities of word-level representations for undiac-wild, undiac-full, and wild-full pairs on the Wild2Max dataset.
  • compare_sentence_embedding.py: Compute cosine similarities of sentence-level representations from the first layer for undiac-wild, undiac-full, and wild-full pairings on three benchmark datasets.
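At their core, these comparisons reduce to cosine similarity between word-level vectors, typically obtained by mean-pooling the subword vectors of each word. A minimal stdlib sketch of both operations (the two-dimensional vectors are toy values):

```python
import math

def mean_pool(vectors):
    """Average a list of subword vectors into one word-level vector."""
    n = len(vectors)
    return [sum(dim) / n for dim in zip(*vectors)]

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm

word_vec = mean_pool([[1.0, 0.0], [1.0, 2.0]])  # -> [1.0, 1.0]
print(cosine(word_vec, [2.0, 2.0]))  # close to 1.0
```

Running this per layer, for each of the three text-version pairings, yields the layer-wise similarity curves reported by the scripts above.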
