Releases: RISE-UNIBAS/humanities_data_benchmark

Version v0.4.0

09 Dec 12:42

Changelog

All notable changes to this project will be documented in this file.

The format is based on Keep a Changelog,
and this project adheres to Semantic Versioning.

v0.4.0 - 2025-12-09

Added

  • 4 new models: gpt-5.1 (OpenAI), gemini-3-pro-preview (GenAI), magistral-medium-2509 (Mistral), mistral-small-2506 (Mistral)
  • 42 new benchmark test configurations (T0403-T0444) across all benchmarks for new models
  • Pricing data for 2025-11-24 with updated model prices and source URLs
  • Cohere provider support with 5 models: command-r-08-2024, command-r-plus-08-2024, command-r7b-12-2024, command-a-03-2025, command-a-vision-07-2025
  • book_advert_xml benchmark for correcting malformed XML from 18th century book advertisements
  • 43 new benchmark test configurations (T0445-T0487) for book_advert_xml across all providers
  • Pricing data for Cohere models (2025-12-09)

Changed

Removed

  • All tests with claude-3-5-sonnet-20241022 (now legacy)
  • All renders and related docs (now handled by dedicated frontend)

Full Changelog: v0.3.1...v0.4.0

Version v0.3.1

29 Oct 13:07

Added

  • OpenRouter provider support with fallback for models not supporting structured outputs
  • sciCORE provider support (LiteLLM-based OpenAI-compatible API)
  • blacklist benchmark on extracting structured company information from historical index cards
  • company_lists benchmark on extracting structured company information (company name and location) from historical trade indexes
  • medieval_manuscripts benchmark for 15th century page segmentation and handwritten text extraction with CER and fuzzy matching
  • 6 new models: qwen/qwen3-vl-8b-thinking, qwen/qwen3-vl-30b-a3b-instruct, qwen/qwen3-vl-8b-instruct, meta-llama/llama-4-maverick, x-ai/grok-4, GLM-4.5V-FP8
  • 170 new benchmark test configurations (T0233-T0402) across blacklist, medieval_manuscripts, and company_lists benchmarks
  • Tests on 2025-10-03: T0164
  • Tests on 2025-10-17: T0233-T0234, T0237-T0252
  • Tests on 2025-10-20: T0253-T0270
  • Tests on 2025-10-24: T0271-T0336
  • Tests on 2025-10-24: T0336-T0402
  • Pricing data updated to 2025-10-28
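
The medieval_manuscripts entry above mentions CER (character error rate). As a rough illustration only, CER is commonly defined as the Levenshtein edit distance between the reference and the predicted text, divided by the reference length. The function names below are hypothetical, and any normalization the benchmark applies before comparison (whitespace, case, Unicode form) is not stated in this changelog, so none is assumed here.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions, and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or match when cost is 0
            ))
        previous = current
    return previous[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not reference:
        # Assumption: an empty reference scores 0 only for an empty hypothesis.
        return 0.0 if not hypothesis else 1.0
    return levenshtein(reference, hypothesis) / len(reference)
```

Note that CER can exceed 1.0 when the hypothesis is much longer than the reference.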

Fixed

  • metadata_extraction scoring now correctly counts failed requests as complete failures (0 TP, all FN) instead of excluding them
  • Fuzzy score matching now handles type mismatches between strings and integers (e.g., "1965" vs 1965 for year fields)
  • Pricing lookup now searches through all available dates to find provider/model pricing instead of only checking the most recent date
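
Two of the fixes above can be sketched as follows. This is a minimal illustration, not the project's code: `fuzzy_score` and `find_price` are hypothetical names, the use of `difflib.SequenceMatcher` is an assumption about how fuzzy matching might be done, and the pricing layout (ISO date, then provider, then model) and the newest-first search order are likewise assumed.

```python
from difflib import SequenceMatcher


def fuzzy_score(expected, predicted) -> float:
    """Similarity in [0, 1]; values are compared as strings, so 1965 matches "1965"."""
    if expected is None or predicted is None:
        return 1.0 if expected is predicted else 0.0
    left, right = str(expected).strip(), str(predicted).strip()
    return SequenceMatcher(None, left, right).ratio()


def find_price(pricing: dict, provider: str, model: str):
    """Search every pricing date, newest first, for a provider/model entry.

    `pricing` is assumed to map ISO dates ("YYYY-MM-DD") to
    {provider: {model: entry}}. ISO dates sort correctly as strings.
    """
    for date in sorted(pricing, reverse=True):
        entry = pricing[date].get(provider, {}).get(model)
        if entry is not None:
            return entry
    return None  # no date lists this provider/model
```

With placeholder data such as `{"2025-10-28": {"p1": {"m1": {...}}}, "2025-09-30": {"p2": {"m2": {...}}}}`, a lookup for `p2`/`m2` still succeeds even though the most recent date does not list it.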

Full Changelog: v0.3.0...v0.3.1

Version v0.3.0

03 Oct 07:32

Added

  • Cost tracking system with pricing database and automatic cost calculation
    • Token usage extraction for all providers (OpenAI, GenAI, Anthropic, Mistral)
    • Automatic cost calculation based on token usage and date-based pricing data
    • Cost summary in benchmark scoring files with detailed token and cost breakdowns
    • Tables showing test execution costs
    • Cost per Point metric in global leaderboard showing normalized cost efficiency ($/performance point)
  • 7 new models: pixtral-12b, mistral-large-latest, gemini-2.5-flash, gemini-2.5-flash-lite, gemini-2.5-flash-lite-preview-09-2025, gemini-2.5-flash-preview-09-2025, claude-sonnet-4-5-20250929
  • 50 new benchmark test configurations (T0181-T0230) for new models across all benchmark variants
  • Test Time tracking and metrics
    • Test Time (s) column in benchmark tables showing total execution time per test
    • Time per Point column in benchmark tables showing time efficiency (seconds/point per item)
    • Time/Point metric in global leaderboard showing normalized time efficiency
    • Multi-level normalized calculation: per-test (average time per item / score), per-benchmark (average of test ratios), global (average of benchmark ratios)
    • Analogous to cost calculation methodology for consistency
  • Structured outputs for Google Gen AI, Anthropic (native tool calling), and Mistral models
  • Automatic regeneration of empty or invalid JSON result files
  • T0099 on 2025-09-24
  • T0107 on 2025-09-24
  • T0117 on 2025-09-24
  • T0120 on 2025-09-24
  • T0125 on 2025-09-24
  • T0130 on 2025-09-24
  • T0132 on 2025-09-24
  • T0134 on 2025-09-24
  • T0145 on 2025-09-24
  • T0151 on 2025-09-24
  • T0162 on 2025-09-24
  • T0023 on 2025-09-25
  • T0035 on 2025-09-25
  • T0095 on 2025-09-25
  • T0159 on 2025-09-25
  • T0169 on 2025-09-26
  • T0170 on 2025-09-26
  • T0171 on 2025-09-26
  • T0172 on 2025-09-26
  • T0173 on 2025-09-26
  • T0174 on 2025-09-26
  • T0175 on 2025-09-26
  • T0176 on 2025-09-26
  • T0177 on 2025-09-26
  • T0178 on 2025-09-26
  • T0179 on 2025-09-26
  • T0180 on 2025-09-26
  • T0107 on 2025-09-30
  • T0169 on 2025-09-30
  • T0001 on 2025-09-30
  • T0002 on 2025-09-30
  • T0003 on 2025-09-30
  • T0004 on 2025-09-30
  • T0005 on 2025-09-30
  • T0006 on 2025-09-30
  • T0007 on 2025-09-30
  • T0008 on 2025-09-30
  • T0009 on 2025-09-30
  • T0010 on 2025-09-30
  • T0012 on 2025-09-30
  • T0013 on 2025-09-30
  • T0017 on 2025-09-30
  • T0018 on 2025-09-30
  • T0020 on 2025-09-30
  • T0023 on 2025-09-30
  • T0024 on 2025-09-30
  • T0025 on 2025-09-30
  • T0027 on 2025-09-30
  • T0031 on 2025-09-30
  • T0033 on 2025-09-30
  • T0035 on 2025-09-30
  • T0036 on 2025-09-30
  • T0038 on 2025-09-30
  • T0039 on 2025-09-30
  • T0042 on 2025-09-30
  • T0043 on 2025-09-30
  • T0044 on 2025-09-30
  • T0045 on 2025-09-30
  • T0052 on 2025-09-30
  • T0053 on 2025-09-30
  • T0056 on 2025-09-30
  • T0057 on 2025-09-30
  • T0060 on 2025-09-30
  • T0061 on 2025-09-30
  • T0062 on 2025-09-30
  • T0063 on 2025-09-30
  • T0066 on 2025-09-30
  • T0067 on 2025-09-30
  • T0068 on 2025-09-30
  • T0069 on 2025-09-30
  • T0070 on 2025-09-30
  • T0071 on 2025-09-30
  • T0072 on 2025-09-30
  • T0073 on 2025-09-30
  • T0074 on 2025-09-30
  • T0075 on 2025-09-30
  • T0076 on 2025-09-30
  • T0077 on 2025-09-30
  • T0078 on 2025-09-30
  • T0079 on 2025-09-30
  • T0082 on 2025-09-30
  • T0083 on 2025-09-30
  • T0084 on 2025-09-30
  • T0085 on 2025-09-30
  • T0086 on 2025-09-30
  • T0090 on 2025-09-30
  • T0092 on 2025-09-30
  • T0093 on 2025-09-30
  • T0094 on 2025-09-30
  • T0095 on 2025-09-30
  • T0098 on 2025-09-30
  • T0099 on 2025-09-30
  • T0100 on 2025-09-30
  • T0101 on 2025-09-30
  • T0102 on 2025-09-30
  • T0103 on 2025-09-30
  • T0104 on 2025-09-30
  • T0105 on 2025-09-30
  • T0106 on 2025-09-30
  • T0108 on 2025-09-30
  • T0109 on 2025-09-30
  • T0110 on 2025-09-30
  • T0111 on 2025-09-30
  • T0112 on 2025-09-30
  • T0113 on 2025-09-30
  • T0114 on 2025-09-30
  • T0115 on 2025-09-30
  • T0116 on 2025-09-30
  • T0117 on 2025-09-30
  • T0118 on 2025-09-30
  • T0119 on 2025-09-30
  • T0120 on 2025-09-30
  • T0121 on 2025-09-30
  • T0122 on 2025-09-30
  • T0123 on 2025-09-30
  • T0124 on 2025-09-30
  • T0125 on 2025-09-30
  • T0126 on 2025-09-30
  • T0127 on 2025-09-30
  • T0128 on 2025-09-30
  • T0129 on 2025-09-30
  • T0130 on 2025-09-30
  • T0131 on 2025-09-30
  • T0132 on 2025-09-30
  • T0133 on 2025-09-30
  • T0160 on 2025-09-30
  • T0193 on 2025-09-30
  • T0194 on 2025-09-30
  • T0195 on 2025-09-30
  • T0196 on 2025-09-30
  • T0197 on 2025-09-30
  • T0198 on 2025-09-30
  • T0199 on 2025-09-30
  • T0200 on 2025-09-30
  • T0230 on 2025-09-30
  • T0129 on 2025-10-01
  • T0130 on 2025-10-01
  • T0131 on 2025-10-01
  • T0132 on 2025-10-01
  • T0133 on 2025-10-01
  • T0134 on 2025-10-01
  • T0135 on 2025-10-01
  • T0136 on 2025-10-01
  • T0137 on 2025-10-01
  • T0138 on 2025-10-01
  • T0139 on 2025-10-01
  • T0140 on 2025-10-01
  • T0141 on 2025-10-01
  • T0143 on 2025-10-01
  • T0144 on 2025-10-01
  • T0145 on 2025-10-01
  • T0146 on 2025-10-01
  • T0147 on 2025-10-01
  • T0148 on 2025-10-01
  • T0151 on 2025-10-01
  • T0152 on 2025-10-01
  • T0155 on 2025-10-01
  • T0159 on 2025-10-01
  • T0160 on 2025-10-01
  • T0161 on 2025-10-01
  • T0165 on 2025-10-01
  • T0169 on 2025-10-01
  • T0170 on 2025-10-01
  • T0171 on 2025-10-01
  • T0172 on 2025-10-01
  • T0173 on 2025-10-01
  • T0174 on 2025-10-01
  • T0175 on 2025-10-01
  • T0176 on 2025-10-01
  • T0177 on 2025-10-01
  • T0178 on 2025-10-01
  • T0179 on 2025-10-01
  • T0180 on 2025-10-01
  • T0181 on 2025-10-01
  • T0182 on 2025-10-01
  • T0183 on 2025-10-01
  • T0184 on 2025-10-01
  • T0185 on 2025-10-01
  • T0201 on 2025-10-01
  • T0202 on 2025-10-01
  • T0203 on 2025-10-01
  • T0204 on 2025-10-01
  • T0205 on 2025-10-01
  • T0206 on 2025-10-01
  • T0207 on 2025-10-01
  • T0208 on 2025-10-01
  • T0209 on 2025-10-01
  • T0210 on 2025-10-01
  • T0211 on 2025-10-01
  • T0212 on 2025-10-01
  • T0213 on 2025-10-01
  • T0214 on 2025-10-01
  • T0215 on 2025-10-01
  • T0216 on 2025-10-01
  • T0217 on 2025-10-01
  • T0218 on 2025-10-01
  • T0219 on 2025-10-01
  • T0220 on 2025-10-01
  • T0221 on 2025-10-01
  • T0222 on 2025-10-01
  • T0223 on 2025-10-01
  • T0224 on 2025-10-01
  • T0225 on 2025-10-01
  • T0226 on 2025-10-01
  • T0227 on 2025-10-01
  • T0228 on 2025-10-01
  • T0229 on 2025-10-01
  • T0186 on 2025-10-01
  • T0187 on 2025-10-01
  • T0188 on 2025-10-01
  • T0189 on 2025-10-01
  • T0190 on 2025-10-01
  • T0191 on 2025-10-01
  • T0192 on 2025-10-01
  • T0168 on 2025-10-01
  • T0166 on 2025-10-01
  • T0167 on 2025-10-01
  • T0161 on 2025-10-02
  • T0162 on 2025-10-02
  • T0164 on 2025-10-02
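
The multi-level Time per Point calculation listed under "Test Time tracking and metrics" in this release can be sketched roughly as below. The three levels follow the changelog wording (per-test: average time per item divided by score; per-benchmark: average of test ratios; global: average of benchmark ratios). The function names, the tuple layout, and the choice to skip tests with a zero score or no items are assumptions made for illustration.

```python
from statistics import mean


def time_per_point(total_seconds: float, n_items: int, score: float):
    """Per-test ratio: average seconds per item divided by the test score."""
    if n_items <= 0 or score <= 0:
        return None  # ratio undefined; skipping such tests is an assumption
    return (total_seconds / n_items) / score


def benchmark_ratio(tests):
    """Per-benchmark ratio: mean of the defined per-test ratios.

    `tests` is an iterable of (total_seconds, n_items, score) tuples.
    """
    ratios = [r for r in (time_per_point(*t) for t in tests) if r is not None]
    return mean(ratios) if ratios else None


def global_ratio(benchmarks: dict):
    """Global ratio: mean of the defined per-benchmark ratios."""
    ratios = [r for r in (benchmark_ratio(t) for t in benchmarks.values())
              if r is not None]
    return mean(ratios) if ratios else None
```

Averaging ratios level by level, rather than pooling all tests, keeps a benchmark with many tests from dominating the global figure. The changelog describes the cost calculation as analogous, so the same structure would presumably apply with cost in place of seconds.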

Changed

  • All provider response objects now converted to JSON-serializable format before storage
  • Schema default values automatically removed for GenAI API compatibility
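
Removing default values from a schema, as described in the last item, might look like the sketch below. It assumes the schema is a plain JSON-schema-like dict; `strip_defaults` is a hypothetical name, and the actual project may handle this differently.

```python
def strip_defaults(schema):
    """Return a copy of a JSON-schema-like structure without "default" keywords."""
    if isinstance(schema, list):
        return [strip_defaults(item) for item in schema]
    if not isinstance(schema, dict):
        return schema  # strings, numbers, booleans, None
    cleaned = {}
    for key, value in schema.items():
        if key == "default":
            continue  # drop the keyword itself
        if key in ("properties", "$defs", "definitions") and isinstance(value, dict):
            # Keys in these maps are field names, not keywords,
            # so a field that happens to be called "default" is kept.
            cleaned[key] = {name: strip_defaults(sub) for name, sub in value.items()}
        else:
            cleaned[key] = strip_defaults(value)
    return cleaned
```

The function builds a new structure instead of mutating its input, so the original schema stays available for providers that accept defaults.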

Fixed

  • Empty log files (FileHandler now properly configured)
  • JSON serialization errors for OpenAI, GenAI, Anthropic, and Mistral response objects
  • Pydantic validation errors now handled gracefully with fallback to raw tool input
  • Letter dataclass normalization for list-formatted fields (letter_title, send_date)
  • bibliographic_data attributions
  • Pydantic dataclass models

Full Changelog: v0.2.2...v0.3.0

Version v0.2.2

19 Sep 06:46

Added

  • bibliographic_data README.md sections
  • CONTRIBUTING.md
  • CONTRIBUTORS.md

Full Changelog: v0.2.1...v0.2.2

Version v0.2.1

10 Sep 10:08

Added

  • Radar chart of the top 10 models.
  • Zettelkatalog benchmark.
  • T0066 on 2025-09-02
  • T0143 on 2025-09-02
  • T0144 on 2025-09-02
  • T0145 on 2025-09-02
  • T0146 on 2025-09-02
  • T0147 on 2025-09-02
  • T0148 on 2025-09-02
  • T0159 on 2025-09-02
  • T0160 on 2025-09-02
  • T0161 on 2025-09-02
  • T0162 on 2025-09-02
  • T0164 on 2025-09-02
  • T0165 on 2025-09-02
  • T0166 on 2025-09-02
  • T0151 on 2025-09-02
  • T0152 on 2025-09-02
  • T0155 on 2025-09-02
  • T0167 on 2025-09-02
  • T0168 on 2025-09-02

Full Changelog: v0.2.0...v0.2.1

Version v0.2.0

31 Aug 05:44
ae34be0

Added

  • Global model performance leaderboard to docs.
  • T0136 on 2025-08-27
  • T0137 on 2025-08-27
  • T0138 on 2025-08-27
  • T0139 on 2025-08-27
  • T0140 on 2025-08-27
  • T0141 on 2025-08-27
  • T0106 on 2025-08-27

Fixed

  • Broken link patterns in docs.

Changed

  • Test IDs standardized to a 4-digit zero-padded format (e.g., T0001).
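
For illustration, converting an older, unpadded ID to the 4-digit format could be done as follows. `normalize_test_id` is a hypothetical helper, and the assumption that legacy IDs were a "T" followed by one to four digits is not confirmed by this changelog.

```python
import re


def normalize_test_id(raw: str) -> str:
    """Convert e.g. "T1" or "t66" to the zero-padded form "T0001" / "T0066"."""
    match = re.fullmatch(r"[Tt](\d{1,4})", raw.strip())
    if match is None:
        raise ValueError(f"not a test ID: {raw!r}")
    return f"T{int(match.group(1)):04d}"
```

Zero-padding keeps IDs in numeric order under plain string sorting.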

Full Changelog: v0.1.0...v0.2.0

Version v0.1.0

25 Aug 12:17

Initial release.