Repository of the ACL 2026 Findings paper "Evade: LLM-Based Explanation Generation and Validation for Error Detection in NLI".
pip install -r requirements.txt

CUDA_VISIBLE_DEVICES=0 python generation/generate_explanation_qwen.py \
--model_name \
[--json_path] \
[--output_dir]

CUDA_VISIBLE_DEVICES=0 python generation/generate_explanation_llama.py \
--model_name \
[--json_path] \
[--output_dir]

- --model_name: Huggingface model name.
- --json_path: Original VariErr dataset JSON file (optional).
- --output_dir: Output directory. Auto-generated from model name if not specified (optional).
Saved output:
generation/<model>_generation_raw/<sample_id>/
The generation scripts write one file per target label inside each sample folder:
- E_0.txt: entailment / true explanations
- N_0.txt: neutral / undetermined explanations
- C_0.txt: contradiction / false explanations
After manual inspection, keep the cleaned files in the same sample directory and name them as:
- generation/<model>_generation_raw/<sample_id>/E
- generation/<model>_generation_raw/<sample_id>/N
- generation/<model>_generation_raw/<sample_id>/C
Make sure every _0.txt file has a corresponding cleaned file (E, N, or C). These cleaned files are what the preprocessing script actually reads.
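The check described above can be automated before preprocessing. The following is a minimal sketch, not part of the repository: the function name `find_missing_cleaned` is hypothetical, and it assumes the cleaned files are named exactly E, N, and C with no extension, as in the paths listed above.

```python
from pathlib import Path

LABELS = ("E", "N", "C")


def find_missing_cleaned(raw_root):
    """Return (sample_id, label) pairs that have a raw file but no cleaned file."""
    missing = []
    for sample_dir in sorted(Path(raw_root).iterdir()):
        if not sample_dir.is_dir():
            continue
        for label in LABELS:
            raw_file = sample_dir / f"{label}_0.txt"
            # Assumption: the cleaned file is named after the label only.
            cleaned_file = sample_dir / label
            if raw_file.exists() and not cleaned_file.exists():
                missing.append((sample_dir.name, label))
    return missing
```

Calling `find_missing_cleaned("generation/<model>_generation_raw")` with the real model directory substituted should return an empty list once every raw file has been cleaned.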
python processing/processing.py \
[--generation_dir] \
[--input_json] \
[--processing_dir] \
[--all_dir]

- --generation_dir: Directory containing <model>_generation_raw folders (optional).
- --input_json: Original VariErr dataset JSON file (optional).
- --processing_dir: Directory to save per-model JSONL files (optional).
- --all_dir: Final merged output filename (optional).
Saved output:
- One merged file per model:
  processing/<model>_generation_raw.jsonl
- One merged file across all models:
  processing/generation_all.jsonl
Validate one explanation per prompt:
CUDA_VISIBLE_DEVICES=0 python validation/one_expl.py \
--model_name_or_path \
--model_type \
[--input_path] \
[--output_dir]

- --model_name_or_path: Huggingface model name.
- --model_type: llama or qwen
- --input_path: Path to input JSONL file (optional, default: ../processing/<model_name>_generation_raw.jsonl).
- --output_dir: Output directory (optional).
Saved output:
validation/validation_results/one_expl/<model>/scores.json
Validate all explanations from one source LLM in one prompt:
CUDA_VISIBLE_DEVICES=0 python validation/one_llm.py \
--model_name_or_path \
--model_type \
[--input_path] \
[--output_dir]

- --model_name_or_path: Huggingface model name.
- --model_type: llama or qwen
- --input_path: Path to input JSONL file. Auto-generated from model name if not specified (optional, default: ../processing/<model_name>_generation_raw.jsonl).
- --output_dir: Output directory (optional).
Saved output:
validation/validation_results/one_llm/<model>/scores.json
Validate explanations from multiple source LLMs together:
CUDA_VISIBLE_DEVICES=0 python validation/all_llm.py \
--model_name_or_path \
--model_type \
[--input_path] \
[--output_dir]

- --model_name_or_path: Huggingface model name.
- --model_type: llama or qwen
- --input_path: Path to input JSONL file. Auto-generated from model name if not specified (optional, default: ../processing/<model_name>_generation_raw.jsonl).
- --output_dir: Output directory (optional).
Saved output:
validation/validation_results/all_llm/<model>/scores_all.json
cd validation
bash run_llama_all.sh
bash run_qwen_all.sh

You can also use these scripts to run the full validation workflow for the Llama and Qwen models used in the paper.
All validation scripts save a JSON file mapping explanation IDs to probabilities.
For one_expl and one_llm, the key format is:
{
"<sample_id>_<label_code>-<index>": 0.87
}

For all_llm, the source model is also included:
{
"<source_model>_<sample_id>_<label_code>-<index>": 0.87
}

Run this to split the merged scores_all.json into separate files based on model prefixes:
python split_score.py --input_json scores_all.json

The self-validation file (e.g., scores generated by Qwen2.5-72B for Qwen2.5-72B outputs) should be manually renamed to scores.json.
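To illustrate what splitting by model prefix means for the all_llm key format, here is a minimal sketch. It is not the implementation of split_score.py: the function `split_by_source_model` is hypothetical, it assumes the caller already knows the source model names, and stripping the prefix from each key is an assumption about the desired output.

```python
def split_by_source_model(scores, model_names):
    """Group all_llm scores by their "<source_model>_" key prefix."""
    buckets = {name: {} for name in model_names}
    # Try longer names first so one model name cannot shadow another
    # that merely starts with the same characters.
    ordered = sorted(model_names, key=len, reverse=True)
    for key, prob in scores.items():
        for name in ordered:
            prefix = name + "_"
            if key.startswith(prefix):
                # Assumption: the remainder is "<sample_id>_<label_code>-<index>".
                buckets[name][key[len(prefix):]] = prob
                break
        # Keys that match no known model name are skipped.
    return buckets
```

Matching on known model names avoids guessing where the model name ends, since both model names and sample IDs may themselves contain underscores.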
Apply validation tags to generated explanations across a range of thresholds (0.0–1.0) to support further analysis.
cd evaluation
bash run_val_threshold.sh

Saved output:
evaluation/<mode>/<model>/threshold/with_validation_<threshold>.jsonl
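The thresholding step can be pictured with a small sketch. This is an illustration under assumptions, not the logic of run_val_threshold.sh: the helper names are hypothetical, the step size of 0.1 is assumed, and so is the choice of `>=` (rather than `>`) for marking an explanation as validated.

```python
def threshold_grid(step=0.1):
    """Return evenly spaced thresholds from 0.0 to 1.0 inclusive."""
    n = round(1 / step)
    # Round to avoid floating-point noise such as 0.30000000000000004.
    return [round(i * step, 10) for i in range(n + 1)]


def tag_with_threshold(scores, threshold):
    """Map each explanation ID to True if its probability reaches the threshold."""
    return {expl_id: prob >= threshold for expl_id, prob in scores.items()}
```

For example, `{t: tag_with_threshold(scores, t) for t in threshold_grid()}` yields one set of validation tags per threshold, which mirrors the one-file-per-threshold layout of the saved output.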
Compare how well the label distribution aligns with the ChaosNLI and VariErr distributions before and after validation, across different thresholds.
bash run_kld_jsd.sh <mode> <model>

- <model>: model directory name used in previous experiments (e.g., Llama-3.1-8B)
- <mode>: one_expl, one_llm or all_llm
Saved output:
evaluation/<mode>/<model>/kld_jsd/
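For readers unfamiliar with the two measures, the sketch below shows one common way to compute KL divergence and Jensen-Shannon divergence between two label distributions. It assumes that "kld_jsd" refers to these two measures and uses the natural logarithm; the repository scripts may use a different base, smoothing scheme, or the Jensen-Shannon distance instead.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats; terms with p_i == 0 contribute nothing."""
    return sum(
        pi * math.log(pi / max(qi, eps))  # eps guards against division by zero
        for pi, qi in zip(p, q)
        if pi > 0
    )


def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric and bounded by ln(2)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
```

Lower values indicate closer alignment, so comparing the divergence before and after validation shows whether validation moves the label distribution toward the reference.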
Evaluate the model predictions with precision and recall by comparing LLM-validated labels against VariErr-validated labels across different thresholds.
cd evaluation
bash run_pre_re.sh

Saved output:
evaluation/<mode>/<model>/validated_overlap.csv
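As a reference for how the two metrics relate, here is a minimal sketch that treats the LLM-validated labels and the VariErr-validated labels as two sets of identifiers. The function name and the set-based representation are assumptions for illustration; run_pre_re.sh may organize the comparison differently.

```python
def precision_recall(predicted, gold):
    """Compute precision and recall of a predicted set against a gold set."""
    predicted, gold = set(predicted), set(gold)
    overlap = len(predicted & gold)
    # Define both metrics as 0.0 when their denominator is empty.
    precision = overlap / len(predicted) if predicted else 0.0
    recall = overlap / len(gold) if gold else 0.0
    return precision, recall
```

Raising the validation threshold typically shrinks the predicted set, which tends to trade recall for precision.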
We analyze the linguistic similarity between human and LLM-generated explanations before and after validation from three perspectives: lexical, syntactic and semantic.
This script measures the diversity of human-written explanations in the VariErr dataset. For each instance and each label, we compute pairwise similarity between explanations along the three dimensions.
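The pairwise computation can be sketched as follows. This example covers only a lexical measure and uses token-level Jaccard overlap as a stand-in; the measures actually used in similarity_within_human.py are not specified here, and both function names are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Token-level Jaccard overlap between two strings (a stand-in lexical measure)."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def mean_pairwise_similarity(explanations, sim=jaccard):
    """Average similarity over all unordered pairs; None if fewer than two items."""
    pairs = list(combinations(explanations, 2))
    if not pairs:
        return None
    return sum(sim(a, b) for a, b in pairs) / len(pairs)
```

Passing a different `sim` function (for example, one based on parse structure or sentence embeddings) would reuse the same pairwise loop for the other two dimensions.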
python -m spacy download en_core_web_md
python similarity_within_human.py

This script measures the diversity of explanations generated by LLMs within the same instance and label. Default thresholds are set the same as the ones reported in the paper.
bash run_similarity_within_llm.sh

This script compares LLM-generated explanations with human-written explanations from the VariErr dataset. Default thresholds are set the same as the ones reported in the paper.
bash run_similarity_llm_human.sh

Report average precision (AP), as well as precision and recall at the top 100 predictions (P@100 and R@100). Metrics are computed using the average score for each label of each instance.
bash evaluation/run_aed.sh

The output is printed to the terminal.
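The ranking metrics named above can be sketched in a few lines. This is an illustration under assumptions rather than the code in run_aed.sh: the function names are hypothetical, and the sketch assumes each item carries an error score where a higher value means "more likely an error". How the averaged validation scores map to such an error score is not specified here.

```python
def rank_by_score(items):
    """Sort (error_score, is_error) pairs by descending score; return the flags."""
    return [flag for _, flag in sorted(items, key=lambda x: x[0], reverse=True)]


def average_precision(ranked_flags):
    """Mean of precision values taken at the rank of each true error."""
    hits, total = 0, 0.0
    for rank, is_error in enumerate(ranked_flags, start=1):
        if is_error:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def precision_recall_at_k(ranked_flags, k):
    """Precision and recall computed over the top-k ranked items."""
    top = ranked_flags[:k]
    hits = sum(top)
    positives = sum(ranked_flags)
    precision = hits / len(top) if top else 0.0
    recall = hits / positives if positives else 0.0
    return precision, recall
```

With `k=100`, `precision_recall_at_k` corresponds to P@100 and R@100.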
We clean the input data for fine-tuning by converting label sets into soft label distributions. Each file is in JSONL format, and each line (instance) has the following structure:
{
"uid": "123",
"premise": "A man is running",
"hypothesis": "Someone is moving",
"label": [0.5, 0.5, 0.0]
}
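A conversion from a label set to a soft label distribution could look like the sketch below. It assumes the vector order is entailment, neutral, contradiction and that probability mass is spread uniformly over the labels in the set, which is consistent with the example above but not confirmed by it. The function name and the short label codes are hypothetical.

```python
LABEL_ORDER = ("e", "n", "c")  # assumed order: entailment, neutral, contradiction


def to_soft_label(label_set):
    """Spread probability mass uniformly over the labels present in the set."""
    labels = set(label_set)
    if not labels:
        raise ValueError("label set must not be empty")
    unknown = labels - set(LABEL_ORDER)
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")
    weight = 1.0 / len(labels)
    return [weight if name in labels else 0.0 for name in LABEL_ORDER]
```

Under these assumptions, an instance labeled with both entailment and neutral maps to the vector shown in the example structure.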
cd fine_tuning
python baseline_r1_r2.py

After running the script, two files are generated in the LLM_AED/fine_tuning/processed_data/baselines/ directory.
bash run_llm_fine_tuning.sh

After running the script, two files are saved under LLM_AED/fine_tuning/processed_data/llm_cleaned/<mode>/<model>/ for all the modes and models. Default thresholds are set the same as the ones reported in the paper.
bash run_remove_llm_error.sh

Processed data will be saved under LLM_AED/fine_tuning/processed_data/without_llm_error/<mode>/<model>/
Place the processed training data into the designated folder, then update the corresponding directory paths in the scripts before running them.
First, run the following commands so that the model obtains a basic understanding of the NLI task:
cd fine_tuning
bash mnli_train.sh
bash mnli_train_roberta.sh

After that, start fine-tuning with:
bash varierr_tune_bert.sh
bash varierr_tune_roberta.sh

The development and test sets are located at LLM_AED/dataset/dev_test.