09 Dec 12:42

MHindermann

c3c5d4d

Version v0.4.0 Latest

Changelog

All notable changes to this project will be documented in this file.

The format is based on Keep a Changelog,
and this project adheres to Semantic Versioning.

v0.4.0 - 2025-12-09

Added

4 new models: gpt-5.1 (OpenAI), gemini-3-pro-preview (GenAI), magistral-medium-2509 (Mistral), mistral-small-2506 (Mistral)
42 new benchmark test configurations (T0403-T0444) across all benchmarks for new models
Pricing data for 2025-11-24 with updated model prices and source URLs
Cohere provider support with 5 models: command-r-08-2024, command-r-plus-08-2024, command-r7b-12-2024, command-a-03-2025, command-a-vision-07-2025
book_advert_xml benchmark for correcting malformed XML from 18th century book advertisements
43 new benchmark test configurations (T0445-T0487) for book_advert_xml across all providers
Pricing data for Cohere models (2025-12-09)
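A corrected advertisement from the book_advert_xml benchmark has to parse as XML before its content can be compared. The snippet below is only a sketch of such a well-formedness check using Python's standard library; the function name is hypothetical and the benchmark's actual scoring is not described in this changelog.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """Return True if xml_text parses as a single well-formed XML document."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True
```

A check like this only confirms that tags are balanced and properly nested; it says nothing about whether the corrected content matches the ground truth.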

Changed

All requests are now handled by https://pypi.org/project/generic-llm-api-client/
Suite name is now "RISE Humanities Data Benchmark"
Model names with a "latest" suffix are now remapped to the actual model used

Removed

All tests with claude-3-5-sonnet-20241022 (now legacy)
All renders and related docs (now handled by dedicated frontend)

Full Changelog: v0.3.1...v0.4.0

29 Oct 13:07

MHindermann

v0.3.1

d3237d5

Version v0.3.1

Added

OpenRouter provider support with fallback for models not supporting structured outputs
sciCORE provider support (LiteLLM-based OpenAI-compatible API)
blacklist benchmark on extracting structured company information from historical index cards
company_lists benchmark on extracting structured company information (company name and location) from historical trade indexes
medieval_manuscripts benchmark for 15th century page segmentation and handwritten text extraction with CER and fuzzy matching
6 new models: qwen/qwen3-vl-8b-thinking, qwen/qwen3-vl-30b-a3b-instruct, qwen/qwen3-vl-8b-instruct, meta-llama/llama-4-maverick, x-ai/grok-4, GLM-4.5V-FP8
170 new benchmark test configurations (T0233-T0402) across blacklist, medieval_manuscripts, and company_lists benchmarks
Tests on 2025-10-03: T0164
Tests on 2025-10-17: T0233-T0234, T0237-T0252
Tests on 2025-10-20: T0253-T0270
Tests on 2025-10-24: T0271-T0336
Tests on 2025-10-24: T0336-T0402
Pricing data updated to 2025-10-28
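The medieval_manuscripts entry above mentions scoring with CER (character error rate). As a rough illustration, assuming the common definition of CER as the Levenshtein edit distance divided by the length of the reference text, it could be computed as follows. The function names are hypothetical and the benchmark's own implementation may differ.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) between a and b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance normalized by reference length (assumed CER definition)."""
    if not reference:
        # Assumption: an empty reference scores 0.0 only for an empty hypothesis.
        return 0.0 if not hypothesis else 1.0
    return levenshtein(reference, hypothesis) / len(reference)
```

Lower values are better; a CER of 0.0 means the extracted text matches the reference exactly.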

Fixed

metadata_extraction scoring now correctly counts failed requests as complete failures (0 TP, all FN) instead of excluding them
Fuzzy score matching now handles type mismatches between strings and integers (e.g., "1965" vs 1965 for year fields)
Pricing lookup now searches through all available dates to find provider/model pricing instead of only checking the most recent date
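Two of the scoring fixes above can be illustrated with a short sketch. It assumes that fuzzy matching compares the string form of both values, here via the standard library's `difflib.SequenceMatcher`, and that a failed request contributes one false negative per expected field. Both function names and the exact counting rule are assumptions, not the project's actual code.

```python
from difflib import SequenceMatcher


def fuzzy_score(expected, actual) -> float:
    """Similarity in [0, 1], comparing both values by their string form."""
    if expected is None or actual is None:
        return 1.0 if expected is None and actual is None else 0.0
    # Converting to str lets "1965" and 1965 compare as equal.
    return SequenceMatcher(None, str(expected).strip(), str(actual).strip()).ratio()


def failed_request_counts(expected_fields: dict) -> dict:
    """Counts for a failed request: no true positives, every expected field missed."""
    return {"tp": 0, "fp": 0, "fn": len(expected_fields)}
```

With this approach a year returned as an integer is no longer penalized against a ground truth stored as a string, and a failed request lowers recall instead of silently dropping out of the score.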

Full Changelog: v0.3.0...v0.3.1

03 Oct 07:32

MHindermann

v0.3.0

3d24547

Version v0.3.0

Added

Cost tracking system with pricing database and automatic cost calculation
- Token usage extraction for all providers (OpenAI, GenAI, Anthropic, Mistral)
- Automatic cost calculation based on token usage and date-based pricing data
- Cost summary in benchmark scoring files with detailed token and cost breakdowns
- Tables showing test execution costs
- Cost per Point metric in global leaderboard showing normalized cost efficiency ($/performance point)
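The cost tracking described above can be sketched roughly as follows. The pricing table layout (USD per one million tokens, keyed by date, provider, and model), the rule of using the newest entry on or before the test date, and all names are assumptions made for illustration; the project's pricing database may be organized differently.

```python
from datetime import date

# Hypothetical pricing table: date -> provider -> model -> USD per 1M tokens.
PRICING = {
    date(2025, 9, 1): {"example_provider": {"example-model": {"input": 2.0, "output": 8.0}}},
    date(2025, 10, 1): {"other_provider": {"other-model": {"input": 1.0, "output": 4.0}}},
}


def find_price(provider, model, test_date):
    """Search all pricing dates, newest first, for an entry on or before test_date."""
    for pricing_date in sorted(PRICING, reverse=True):
        if pricing_date > test_date:
            continue
        entry = PRICING[pricing_date].get(provider, {}).get(model)
        if entry is not None:
            return entry
    return None


def request_cost(provider, model, test_date, input_tokens, output_tokens):
    """Cost in USD for one request, or None if no price is known."""
    price = find_price(provider, model, test_date)
    if price is None:
        return None
    return (input_tokens * price["input"] + output_tokens * price["output"]) / 1_000_000


def cost_per_point(cost_usd, score):
    """Normalized cost efficiency in USD per performance point."""
    return cost_usd / score if score else None
```

For example, one million input tokens and half a million output tokens at the hypothetical prices above cost 6.0 USD, which at a score of 0.5 gives 12.0 USD per point.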
7 new models: pixtral-12b, mistral-large-latest, gemini-2.5-flash, gemini-2.5-flash-lite, gemini-2.5-flash-lite-preview-09-2025, gemini-2.5-flash-preview-09-2025, claude-sonnet-4-5-20250929
50 new benchmark test configurations (T0181-T0230) for new models across all benchmark variants
Test Time tracking and metrics
- Test Time (s) column in benchmark tables showing total execution time per test
- Time per Point column in benchmark tables showing time efficiency (seconds/point per item)
- Time/Point metric in global leaderboard showing normalized time efficiency
- Multi-level normalized calculation: per-test (average time per item / score), per-benchmark (average of test ratios), global (average of benchmark ratios)
- Analogous to cost calculation methodology for consistency
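The multi-level normalization described above (per test, then per benchmark, then global) can be written out as a small sketch. The function names are hypothetical, and skipping tests with a zero score or no items is an assumption, since the changelog does not say how undefined ratios are handled.

```python
from statistics import mean


def per_test_ratio(total_seconds, item_count, score):
    """Average time per item divided by the test score, or None if undefined."""
    if item_count == 0 or score == 0:
        return None
    return (total_seconds / item_count) / score


def average_defined(ratios):
    """Mean of the ratios that are defined, or None if there are none."""
    defined = [ratio for ratio in ratios if ratio is not None]
    return mean(defined) if defined else None


# Per-benchmark value: average of that benchmark's test ratios.
benchmark_a = average_defined([
    per_test_ratio(120.0, 10, 0.5),   # 12 s per item / 0.5 = 24.0
    per_test_ratio(60.0, 10, 0.75),   # 6 s per item / 0.75 = 8.0
])

# Global value: average of the per-benchmark values.
global_time_per_point = average_defined([benchmark_a, 4.0])
```

Averaging ratios at each level, instead of pooling all items, keeps a benchmark with many tests from dominating the global figure.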
Structured outputs for Google Gen AI, Anthropic (native tool calling), and Mistral models
Automatic regeneration of empty or invalid JSON result files
T0099 on 2025-09-24
T0107 on 2025-09-24
T0117 on 2025-09-24
T0120 on 2025-09-24
T0125 on 2025-09-24
T0130 on 2025-09-24
T0132 on 2025-09-24
T0134 on 2025-09-24
T0145 on 2025-09-24
T0151 on 2025-09-24
T0162 on 2025-09-24
T0023 on 2025-09-25
T0035 on 2025-09-25
T0095 on 2025-09-25
T0159 on 2025-09-25
T0169 on 2025-09-26
T0170 on 2025-09-26
T0171 on 2025-09-26
T0172 on 2025-09-26
T0173 on 2025-09-26
T0174 on 2025-09-26
T0175 on 2025-09-26
T0176 on 2025-09-26
T0177 on 2025-09-26
T0178 on 2025-09-26
T0179 on 2025-09-26
T0180 on 2025-09-26
T0107 on 2025-09-30
T0169 on 2025-09-30
T0001 on 2025-09-30
T0002 on 2025-09-30
T0003 on 2025-09-30
T0004 on 2025-09-30
T0005 on 2025-09-30
T0006 on 2025-09-30
T0007 on 2025-09-30
T0008 on 2025-09-30
T0009 on 2025-09-30
T0010 on 2025-09-30
T0012 on 2025-09-30
T0013 on 2025-09-30
T0017 on 2025-09-30
T0018 on 2025-09-30
T0020 on 2025-09-30
T0023 on 2025-09-30
T0024 on 2025-09-30
T0025 on 2025-09-30
T0027 on 2025-09-30
T0031 on 2025-09-30
T0033 on 2025-09-30
T0035 on 2025-09-30
T0036 on 2025-09-30
T0038 on 2025-09-30
T0039 on 2025-09-30
T0042 on 2025-09-30
T0043 on 2025-09-30
T0044 on 2025-09-30
T0045 on 2025-09-30
T0052 on 2025-09-30
T0053 on 2025-09-30
T0056 on 2025-09-30
T0057 on 2025-09-30
T0060 on 2025-09-30
T0061 on 2025-09-30
T0062 on 2025-09-30
T0063 on 2025-09-30
T0066 on 2025-09-30
T0067 on 2025-09-30
T0068 on 2025-09-30
T0069 on 2025-09-30
T0070 on 2025-09-30
T0071 on 2025-09-30
T0072 on 2025-09-30
T0073 on 2025-09-30
T0074 on 2025-09-30
T0075 on 2025-09-30
T0076 on 2025-09-30
T0077 on 2025-09-30
T0078 on 2025-09-30
T0079 on 2025-09-30
T0082 on 2025-09-30
T0083 on 2025-09-30
T0084 on 2025-09-30
T0085 on 2025-09-30
T0086 on 2025-09-30
T0090 on 2025-09-30
T0092 on 2025-09-30
T0093 on 2025-09-30
T0094 on 2025-09-30
T0095 on 2025-09-30
T0098 on 2025-09-30
T0099 on 2025-09-30
T0100 on 2025-09-30
T0101 on 2025-09-30
T0102 on 2025-09-30
T0103 on 2025-09-30
T0104 on 2025-09-30
T0105 on 2025-09-30
T0106 on 2025-09-30
T0108 on 2025-09-30
T0109 on 2025-09-30
T0110 on 2025-09-30
T0111 on 2025-09-30
T0112 on 2025-09-30
T0113 on 2025-09-30
T0114 on 2025-09-30
T0115 on 2025-09-30
T0116 on 2025-09-30
T0117 on 2025-09-30
T0118 on 2025-09-30
T0119 on 2025-09-30
T0120 on 2025-09-30
T0121 on 2025-09-30
T0122 on 2025-09-30
T0123 on 2025-09-30
T0124 on 2025-09-30
T0125 on 2025-09-30
T0126 on 2025-09-30
T0127 on 2025-09-30
T0128 on 2025-09-30
T0129 on 2025-09-30
T0130 on 2025-09-30
T0131 on 2025-09-30
T0132 on 2025-09-30
T0133 on 2025-09-30
T0160 on 2025-09-30
T0193 on 2025-09-30
T0194 on 2025-09-30
T0195 on 2025-09-30
T0196 on 2025-09-30
T0197 on 2025-09-30
T0198 on 2025-09-30
T0199 on 2025-09-30
T0200 on 2025-09-30
T0230 on 2025-09-30
T0129 on 2025-10-01
T0130 on 2025-10-01
T0131 on 2025-10-01
T0132 on 2025-10-01
T0133 on 2025-10-01
T0134 on 2025-10-01
T0135 on 2025-10-01
T0136 on 2025-10-01
T0137 on 2025-10-01
T0138 on 2025-10-01
T0139 on 2025-10-01
T0140 on 2025-10-01
T0141 on 2025-10-01
T0143 on 2025-10-01
T0144 on 2025-10-01
T0145 on 2025-10-01
T0146 on 2025-10-01
T0147 on 2025-10-01
T0148 on 2025-10-01
T0151 on 2025-10-01
T0152 on 2025-10-01
T0155 on 2025-10-01
T0159 on 2025-10-01
T0160 on 2025-10-01
T0161 on 2025-10-01
T0165 on 2025-10-01
T0169 on 2025-10-01
T0170 on 2025-10-01
T0171 on 2025-10-01
T0172 on 2025-10-01
T0173 on 2025-10-01
T0174 on 2025-10-01
T0175 on 2025-10-01
T0176 on 2025-10-01
T0177 on 2025-10-01
T0178 on 2025-10-01
T0179 on 2025-10-01
T0180 on 2025-10-01
T0181 on 2025-10-01
T0182 on 2025-10-01
T0183 on 2025-10-01
T0184 on 2025-10-01
T0185 on 2025-10-01
T0201 on 2025-10-01
T0202 on 2025-10-01
T0203 on 2025-10-01
T0204 on 2025-10-01
T0205 on 2025-10-01
T0206 on 2025-10-01
T0207 on 2025-10-01
T0208 on 2025-10-01
T0209 on 2025-10-01
T0210 on 2025-10-01
T0211 on 2025-10-01
T0212 on 2025-10-01
T0213 on 2025-10-01
T0214 on 2025-10-01
T0215 on 2025-10-01
T0216 on 2025-10-01
T0217 on 2025-10-01
T0218 on 2025-10-01
T0219 on 2025-10-01
T0220 on 2025-10-01
T0221 on 2025-10-01
T0222 on 2025-10-01
T0223 on 2025-10-01
T0224 on 2025-10-01
T0225 on 2025-10-01
T0226 on 2025-10-01
T0227 on 2025-10-01
T0228 on 2025-10-01
T0229 on 2025-10-01
T0186 on 2025-10-01
T0187 on 2025-10-01
T0188 on 2025-10-01
T0189 on 2025-10-01
T0190 on 2025-10-01
T0191 on 2025-10-01
T0192 on 2025-10-01
T0168 on 2025-10-01
T0166 on 2025-10-01
T0167 on 2025-10-01
T0161 on 2025-10-02
T0162 on 2025-10-02
T0164 on 2025-10-02

Changed

All provider response objects now converted to JSON-serializable format before storage
Schema default values automatically removed for GenAI API compatibility
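Removing default values from a schema can be done with a small recursive pass over the schema dictionary. The following is a minimal sketch, assuming the schema is a plain JSON Schema style dict; the function name is hypothetical and this is not necessarily how the project implements it.

```python
def strip_defaults(node, inside_properties=False):
    """Return a copy of a schema with every "default" keyword removed.

    Keys directly under "properties" are field names, not schema keywords,
    so a field that happens to be called "default" is kept.
    """
    if isinstance(node, dict):
        cleaned = {}
        for key, value in node.items():
            if key == "default" and not inside_properties:
                continue
            child_is_property_map = key == "properties" and not inside_properties
            cleaned[key] = strip_defaults(value, child_is_property_map)
        return cleaned
    if isinstance(node, list):
        return [strip_defaults(item) for item in node]
    return node
```

The function returns a new structure and leaves the input untouched, so the original schema can still be used with providers that accept defaults.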

Fixed

Empty log files (FileHandler now properly configured)
JSON serialization errors for OpenAI, GenAI, Anthropic, and Mistral response objects
Pydantic validation errors now handled gracefully with fallback to raw tool input
Letter dataclass normalization for list-formatted fields (letter_title, send_date)
bibliographic_data attributions
Pydantic dataclass models

Full Changelog: v0.2.2...v0.3.0

19 Sep 06:46

MHindermann

v0.2.2

33c4f18

Version v0.2.2

Added

bibliographic_data README.md sections
CONTRIBUTING.md
CONTRIBUTORS.md

Full Changelog: v0.2.1...v0.2.2

10 Sep 10:08

MHindermann

v0.2.1

e078668

Version v0.2.1

Added

Radar chart of the top 10 models.
Zettelkatalog benchmark.
T0066 on 2025-09-02
T0143 on 2025-09-02
T0144 on 2025-09-02
T0145 on 2025-09-02
T0146 on 2025-09-02
T0147 on 2025-09-02
T0148 on 2025-09-02
T0159 on 2025-09-02
T0160 on 2025-09-02
T0161 on 2025-09-02
T0162 on 2025-09-02
T0164 on 2025-09-02
T0165 on 2025-09-02
T0166 on 2025-09-02
T0151 on 2025-09-02
T0152 on 2025-09-02
T0155 on 2025-09-02
T0167 on 2025-09-02
T0168 on 2025-09-02

Full Changelog: v0.2.0...v0.2.1

31 Aug 05:44

MHindermann

v0.2.0

ae34be0

Version v0.2.0

Added

Global model performance leaderboard to docs.
T0136 on 2025-08-27
T0137 on 2025-08-27
T0138 on 2025-08-27
T0139 on 2025-08-27
T0140 on 2025-08-27
T0141 on 2025-08-27
T0106 on 2025-08-27

Fixed

Broken link patterns in docs.

Changed

Standardize test-IDs to 4-digit zero-padded format (T0001).
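The 4-digit zero-padded format can be produced with ordinary string formatting. The sketch below assumes that older IDs consisted of a "T" followed by an unpadded number, which this changelog does not state; the function name is hypothetical.

```python
import re


def standardize_test_id(test_id: str) -> str:
    """Convert an ID such as "T1" or "T42" to the zero-padded form "T0001"."""
    match = re.fullmatch(r"[Tt](\d{1,4})", test_id.strip())
    if match is None:
        raise ValueError(f"Unrecognized test ID: {test_id!r}")
    return f"T{int(match.group(1)):04d}"
```

IDs that are already padded, such as "T0164", pass through unchanged.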

Full Changelog: v0.1.0...v0.2.0

25 Aug 12:17

MHindermann

v0.1.0

8a23c0b

Version v0.1.0

Initial release.

Releases: RISE-UNIBAS/humanities_data_benchmark
