Annotation, model evaluation, and LLM-as-Judge evaluation framework for FinCriticalED.
The `Annotation` folder contains post-processing and annotation quality assessments.
- `Annotation/Highlight_Annotation.ipynb`: processes expert annotations by wrapping financially critical entities with `<"Number">` and `<"Time">` labels to produce the gold standard of the FinCriticalED dataset.
- `Annotation/calculate_agreement.py` and `Annotation/run_agreement.sh`: calculate overall and pairwise annotator agreement scores to ensure annotation quality.
- Before running models, configure `model_eval/agent.py` with your `OPEN_AI_API_KEY` and `TOGETHER_API_KEY`.
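One common way to supply these keys is via environment variables; the following is a minimal sketch (the variable names come from the setup note above, but how `agent.py` actually reads them is an assumption):

```python
import os

def load_api_keys():
    """Read the two API keys named in the setup note from the environment.

    Sketch only: the real key-loading logic lives in model_eval/agent.py.
    """
    keys = {name: os.environ.get(name)
            for name in ("OPEN_AI_API_KEY", "TOGETHER_API_KEY")}
    missing = [name for name, value in keys.items() if not value]
    if missing:
        raise RuntimeError("Missing API keys: " + ", ".join(missing))
    return keys
```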
- Run `model_eval/main.py` to generate model OCR output. To change the model, or to run it on only a small sample, update:

```python
def evaluate(
    model_name="gpt-4o",
    experiment_tag="zero-shot",
    language="en",
    local_version=True,
    local_dir="./FinCriticalED",
    sample=None
):
```
- DeepSeek-OCR and MinerU2.5 can be run separately via `model_eval/deepseekocr/batch_process_deepseek.py` and `model_eval/miner/batch_process_miner.py`, respectively.
- After running `main.py`, run `model_eval/evaluation.py` to compute ROUGE-1, ROUGE-L, and Edit Distance. To control input/output paths, or to change models, CSV names, etc., update:

```python
def main():
    models = [
        ...
        "Qwen/Qwen2.5-VL-72B-Instruct",
        "google/gemma-3n-E4B-it",
        "gpt-5",
    ]
    languages = [
        "smallocr",
        ...
    ]
```
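To make the reported metrics concrete, here is a self-contained sketch of ROUGE-1 (unigram-overlap F1, whitespace tokenisation) and Levenshtein edit distance; the actual `evaluation.py` presumably uses library implementations, so treat this only as an illustration of what the scores measure:

```python
from collections import Counter

def rouge1_f(reference, candidate):
    """ROUGE-1 F1: unigram overlap between reference and candidate (sketch)."""
    ref, cand = Counter(reference.split()), Counter(candidate.split())
    overlap = sum((ref & cand).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

def edit_distance(a, b):
    """Levenshtein distance via a one-row dynamic programme."""
    dp = list(range(len(b) + 1))          # distances of "" vs prefixes of b
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            # deletion, insertion, substitution (free if characters match)
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (ca != cb))
    return dp[-1]

print(edit_distance("kitten", "sitting"))  # → 3
```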
- In `llm-as-a-judge.ipynb`, a large language model (GPT-4o) serves as the evaluator responsible for extracting financial facts from the ground-truth HTML and verifying their presence in the model-generated HTML. The LLM Judge processes both inputs under a structured evaluation prompt, enabling it to perform normalization, contextual matching, and fine-grained fact checking.
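The normalization step the judge performs can be pictured with a plain-Python sketch (the real normalization and matching happen inside the LLM under the evaluation prompt; the rules below are illustrative assumptions, not the notebook's implementation):

```python
import re

def normalize_fact(text):
    """Canonicalise a numeric/temporal fact before matching (illustrative)."""
    t = text.lower().strip()
    t = t.replace(",", "")          # "1,234.56" -> "1234.56"
    t = re.sub(r"\$\s*", "$", t)    # "$ 5" -> "$5"
    t = re.sub(r"\s+", " ", t)      # collapse whitespace
    return t

def fact_present(fact, generated_html):
    """Check whether a ground-truth fact survives in the model output."""
    return normalize_fact(fact) in normalize_fact(generated_html)

print(fact_present("1,234.56", "<td>Revenue: 1234.56</td>"))  # → True
```

A string-containment check like this misses contextual matches (e.g. "2.5M" vs "2,500,000"), which is precisely why an LLM judge is used instead.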
- The dataset is available on Hugging Face: `TheFinAI/FinCriticalED`
- The paper is available on arXiv: 2511.14998