Releases: RISE-UNIBAS/humanities_data_benchmark
Version v0.4.0
Changelog
All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog,
and this project adheres to Semantic Versioning.
v0.4.0 - 2025-12-09
Added
- 4 new models: gpt-5.1 (OpenAI), gemini-3-pro-preview (GenAI), magistral-medium-2509 (Mistral), mistral-small-2506 (Mistral)
- 42 new benchmark test configurations (T0403-T0444) across all benchmarks for new models
- Pricing data for 2025-11-24 with updated model prices and source URLs
- Cohere provider support with 5 models: command-r-08-2024, command-r-plus-08-2024, command-r7b-12-2024, command-a-03-2025, command-a-vision-07-2025
- book_advert_xml benchmark for correcting malformed XML from 18th century book advertisements
- 43 new benchmark test configurations (T0445-T0487) for book_advert_xml across all providers
- Pricing data for Cohere models (2025-12-09)
Changed
- All requests are now handled by https://pypi.org/project/generic-llm-api-client/
- Suite name is now "RISE Humanities Data Benchmark"
- Model names with a "latest" suffix are now remapped to the actual model used
Removed
- All tests with claude-3-5-sonnet-20241022 (now legacy)
- All renders and related docs (now handled by dedicated frontend)
Full Changelog: v0.3.1...v0.4.0
Version v0.3.1
Added
- OpenRouter provider support with fallback for models not supporting structured outputs
- sciCORE provider support (LiteLLM-based OpenAI-compatible API)
- blacklist benchmark on extracting structured company information from historical index cards
- company_lists benchmark on extracting structured company information (company name and location) from historical trade indexes
- medieval_manuscripts benchmark for 15th century page segmentation and handwritten text extraction with CER and fuzzy matching
- 6 new models: qwen/qwen3-vl-8b-thinking, qwen/qwen3-vl-30b-a3b-instruct, qwen/qwen3-vl-8b-instruct, meta-llama/llama-4-maverick, x-ai/grok-4, GLM-4.5V-FP8
- 170 new benchmark test configurations (T0233-T0402) across blacklist, medieval_manuscripts, and company_lists benchmarks
- Tests on 2025-10-03: T0164
- Tests on 2025-10-17: T0233-T0234, T0237-T0252
- Tests on 2025-10-20: T0253-T0270
- Tests on 2025-10-24: T0271-T0336
- Tests on 2025-10-24: T0336-T0402
- Pricing data updated to 2025-10-28
Fixed
- metadata_extraction scoring now correctly counts failed requests as complete failures (0 TP, all FN) instead of excluding them
- Fuzzy score matching now handles type mismatches between strings and integers (e.g., "1965" vs 1965 for year fields)
- Pricing lookup now searches through all available dates to find provider/model pricing instead of only checking the most recent date
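
The three fixes above can be illustrated with a minimal sketch. It is not the project's actual code: the function names (`fuzzy_score`, `count_item`, `lookup_price`), the 0.9 threshold, the use of `difflib` for fuzzy matching, and the nested `date -> provider -> model` layout of the pricing data are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def fuzzy_score(expected, predicted):
    """Similarity in [0, 1]. Both values are cast to str first,
    so "1965" and 1965 compare as equal (hypothetical helper)."""
    if expected is None or predicted is None:
        return 1.0 if expected is predicted else 0.0
    a, b = str(expected).strip(), str(predicted).strip()
    return SequenceMatcher(None, a, b).ratio()


def count_item(truth, response, threshold=0.9):
    """Return (tp, fn) for one item. A failed request (response is None)
    counts as a complete failure: 0 TP and every field a FN."""
    if response is None:
        return 0, len(truth)
    tp = sum(
        1
        for key, value in truth.items()
        if fuzzy_score(value, response.get(key)) >= threshold
    )
    return tp, len(truth) - tp


def lookup_price(pricing, provider, model):
    """Search every pricing date, newest first, and return the first
    (date, entry) found for provider/model, or None if there is none.
    Assumes ISO dates as keys, which sort correctly as strings."""
    for date in sorted(pricing, reverse=True):
        entry = pricing[date].get(provider, {}).get(model)
        if entry is not None:
            return date, entry
    return None
```

With this sketch, `count_item({"year": 1965, "title": "A"}, None)` yields `(0, 2)` instead of dropping the item from the totals, and `lookup_price` still finds a model that is only listed under an older pricing date.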
Full Changelog: v0.3.0...v0.3.1
Version v0.3.0
Added
- Cost tracking system with pricing database and automatic cost calculation
- Token usage extraction for all providers (OpenAI, GenAI, Anthropic, Mistral)
- Automatic cost calculation based on token usage and date-based pricing data
- Cost summary in benchmark scoring files with detailed token and cost breakdowns
- Tables showing test execution costs
- Cost per Point metric in global leaderboard showing normalized cost efficiency ($/performance point)
- 7 new models: pixtral-12b, mistral-large-latest, gemini-2.5-flash, gemini-2.5-flash-lite, gemini-2.5-flash-lite-preview-09-2025, gemini-2.5-flash-preview-09-2025, claude-sonnet-4-5-20250929
- 50 new benchmark test configurations (T0181-T0230) for new models across all benchmark variants
- Test Time tracking and metrics
- Test Time (s) column in benchmark tables showing total execution time per test
- Time per Point column in benchmark tables showing time efficiency (seconds/point per item)
- Time/Point metric in global leaderboard showing normalized time efficiency
- Multi-level normalized calculation: per-test (average time per item / score), per-benchmark (average of test ratios), global (average of benchmark ratios)
- Analogous to cost calculation methodology for consistency
- Structured outputs for Google Gen AI, Anthropic (native tool calling), and Mistral models
- Automatic regeneration of empty or invalid JSON result files
- T0099 on 2025-09-24
- T0107 on 2025-09-24
- T0117 on 2025-09-24
- T0120 on 2025-09-24
- T0125 on 2025-09-24
- T0130 on 2025-09-24
- T0132 on 2025-09-24
- T0134 on 2025-09-24
- T0145 on 2025-09-24
- T0151 on 2025-09-24
- T0162 on 2025-09-24
- T0023 on 2025-09-25
- T0035 on 2025-09-25
- T0095 on 2025-09-25
- T0159 on 2025-09-25
- T0169 on 2025-09-26
- T0170 on 2025-09-26
- T0171 on 2025-09-26
- T0172 on 2025-09-26
- T0173 on 2025-09-26
- T0174 on 2025-09-26
- T0175 on 2025-09-26
- T0176 on 2025-09-26
- T0177 on 2025-09-26
- T0178 on 2025-09-26
- T0179 on 2025-09-26
- T0180 on 2025-09-26
- T0107 on 2025-09-30
- T0169 on 2025-09-30
- T0001 on 2025-09-30
- T0002 on 2025-09-30
- T0003 on 2025-09-30
- T0004 on 2025-09-30
- T0005 on 2025-09-30
- T0006 on 2025-09-30
- T0007 on 2025-09-30
- T0008 on 2025-09-30
- T0009 on 2025-09-30
- T0010 on 2025-09-30
- T0012 on 2025-09-30
- T0013 on 2025-09-30
- T0017 on 2025-09-30
- T0018 on 2025-09-30
- T0020 on 2025-09-30
- T0023 on 2025-09-30
- T0024 on 2025-09-30
- T0025 on 2025-09-30
- T0027 on 2025-09-30
- T0031 on 2025-09-30
- T0033 on 2025-09-30
- T0035 on 2025-09-30
- T0036 on 2025-09-30
- T0038 on 2025-09-30
- T0039 on 2025-09-30
- T0042 on 2025-09-30
- T0043 on 2025-09-30
- T0044 on 2025-09-30
- T0045 on 2025-09-30
- T0052 on 2025-09-30
- T0053 on 2025-09-30
- T0056 on 2025-09-30
- T0057 on 2025-09-30
- T0060 on 2025-09-30
- T0061 on 2025-09-30
- T0062 on 2025-09-30
- T0063 on 2025-09-30
- T0066 on 2025-09-30
- T0067 on 2025-09-30
- T0068 on 2025-09-30
- T0069 on 2025-09-30
- T0070 on 2025-09-30
- T0071 on 2025-09-30
- T0072 on 2025-09-30
- T0073 on 2025-09-30
- T0074 on 2025-09-30
- T0075 on 2025-09-30
- T0076 on 2025-09-30
- T0077 on 2025-09-30
- T0078 on 2025-09-30
- T0079 on 2025-09-30
- T0082 on 2025-09-30
- T0083 on 2025-09-30
- T0084 on 2025-09-30
- T0085 on 2025-09-30
- T0086 on 2025-09-30
- T0090 on 2025-09-30
- T0092 on 2025-09-30
- T0093 on 2025-09-30
- T0094 on 2025-09-30
- T0095 on 2025-09-30
- T0098 on 2025-09-30
- T0099 on 2025-09-30
- T0100 on 2025-09-30
- T0101 on 2025-09-30
- T0102 on 2025-09-30
- T0103 on 2025-09-30
- T0104 on 2025-09-30
- T0105 on 2025-09-30
- T0106 on 2025-09-30
- T0108 on 2025-09-30
- T0109 on 2025-09-30
- T0110 on 2025-09-30
- T0111 on 2025-09-30
- T0112 on 2025-09-30
- T0113 on 2025-09-30
- T0114 on 2025-09-30
- T0115 on 2025-09-30
- T0116 on 2025-09-30
- T0117 on 2025-09-30
- T0118 on 2025-09-30
- T0119 on 2025-09-30
- T0120 on 2025-09-30
- T0121 on 2025-09-30
- T0122 on 2025-09-30
- T0123 on 2025-09-30
- T0124 on 2025-09-30
- T0125 on 2025-09-30
- T0126 on 2025-09-30
- T0127 on 2025-09-30
- T0128 on 2025-09-30
- T0129 on 2025-09-30
- T0130 on 2025-09-30
- T0131 on 2025-09-30
- T0132 on 2025-09-30
- T0133 on 2025-09-30
- T0160 on 2025-09-30
- T0193 on 2025-09-30
- T0194 on 2025-09-30
- T0195 on 2025-09-30
- T0196 on 2025-09-30
- T0197 on 2025-09-30
- T0198 on 2025-09-30
- T0199 on 2025-09-30
- T0200 on 2025-09-30
- T0230 on 2025-09-30
- T0129 on 2025-10-01
- T0130 on 2025-10-01
- T0131 on 2025-10-01
- T0132 on 2025-10-01
- T0133 on 2025-10-01
- T0134 on 2025-10-01
- T0135 on 2025-10-01
- T0136 on 2025-10-01
- T0137 on 2025-10-01
- T0138 on 2025-10-01
- T0139 on 2025-10-01
- T0140 on 2025-10-01
- T0141 on 2025-10-01
- T0143 on 2025-10-01
- T0144 on 2025-10-01
- T0145 on 2025-10-01
- T0146 on 2025-10-01
- T0147 on 2025-10-01
- T0148 on 2025-10-01
- T0151 on 2025-10-01
- T0152 on 2025-10-01
- T0155 on 2025-10-01
- T0159 on 2025-10-01
- T0160 on 2025-10-01
- T0161 on 2025-10-01
- T0165 on 2025-10-01
- T0169 on 2025-10-01
- T0170 on 2025-10-01
- T0171 on 2025-10-01
- T0172 on 2025-10-01
- T0173 on 2025-10-01
- T0174 on 2025-10-01
- T0175 on 2025-10-01
- T0176 on 2025-10-01
- T0177 on 2025-10-01
- T0178 on 2025-10-01
- T0179 on 2025-10-01
- T0180 on 2025-10-01
- T0181 on 2025-10-01
- T0182 on 2025-10-01
- T0183 on 2025-10-01
- T0184 on 2025-10-01
- T0185 on 2025-10-01
- T0201 on 2025-10-01
- T0202 on 2025-10-01
- T0203 on 2025-10-01
- T0204 on 2025-10-01
- T0205 on 2025-10-01
- T0206 on 2025-10-01
- T0207 on 2025-10-01
- T0208 on 2025-10-01
- T0209 on 2025-10-01
- T0210 on 2025-10-01
- T0211 on 2025-10-01
- T0212 on 2025-10-01
- T0213 on 2025-10-01
- T0214 on 2025-10-01
- T0215 on 2025-10-01
- T0216 on 2025-10-01
- T0217 on 2025-10-01
- T0218 on 2025-10-01
- T0219 on 2025-10-01
- T0220 on 2025-10-01
- T0221 on 2025-10-01
- T0222 on 2025-10-01
- T0223 on 2025-10-01
- T0224 on 2025-10-01
- T0225 on 2025-10-01
- T0226 on 2025-10-01
- T0227 on 2025-10-01
- T0228 on 2025-10-01
- T0229 on 2025-10-01
- T0186 on 2025-10-01
- T0187 on 2025-10-01
- T0188 on 2025-10-01
- T0189 on 2025-10-01
- T0190 on 2025-10-01
- T0191 on 2025-10-01
- T0192 on 2025-10-01
- T0168 on 2025-10-01
- T0166 on 2025-10-01
- T0167 on 2025-10-01
- T0161 on 2025-10-02
- T0162 on 2025-10-02
- T0164 on 2025-10-02
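
The multi-level Time/Point normalization listed above (per-test, per-benchmark, global) can be sketched as follows. This is a hedged illustration, not the project's implementation: the function names and input shapes are hypothetical, and skipping tests with a zero score is an assumption the changelog does not confirm.

```python
from statistics import mean


def time_per_point(total_seconds, n_items, score):
    """Per-test ratio: average time per item divided by the score.
    Returns None when the ratio is undefined (assumed handling)."""
    if n_items <= 0 or score <= 0:
        return None
    return (total_seconds / n_items) / score


def benchmark_ratio(tests):
    """Per-benchmark value: average of the per-test ratios.
    `tests` is a list of (total_seconds, n_items, score) tuples."""
    ratios = [r for r in (time_per_point(*t) for t in tests) if r is not None]
    return mean(ratios) if ratios else None


def global_ratio(benchmarks):
    """Global value: average of the per-benchmark ratios.
    `benchmarks` maps a benchmark name to its list of test tuples."""
    ratios = [
        r
        for r in (benchmark_ratio(tests) for tests in benchmarks.values())
        if r is not None
    ]
    return mean(ratios) if ratios else None
```

Averaging ratios at each level, rather than pooling all tests, keeps a benchmark with many tests from dominating the global figure.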
Changed
- All provider response objects now converted to JSON-serializable format before storage
- Schema default values automatically removed for GenAI API compatibility
Fixed
- Empty log files (FileHandler now properly configured)
- JSON serialization errors for OpenAI, GenAI, Anthropic, and Mistral response objects
- Pydantic validation errors now handled gracefully with fallback to raw tool input
- Letter dataclass normalization for list-formatted fields (letter_title, send_date)
- bibliographic_data attributions
- Pydantic dataclass models
Full Changelog: v0.2.2...v0.3.0
Version v0.2.2
Added
- bibliographic_data README.md sections
- CONTRIBUTING.md
- CONTRIBUTORS.md
Full Changelog: v0.2.1...v0.2.2
Version v0.2.1
Added
- Radar chart top 10 models.
- Zettelkatalog benchmark.
- T0066 on 2025-09-02
- T0143 on 2025-09-02
- T0144 on 2025-09-02
- T0145 on 2025-09-02
- T0146 on 2025-09-02
- T0147 on 2025-09-02
- T0148 on 2025-09-02
- T0159 on 2025-09-02
- T0160 on 2025-09-02
- T0161 on 2025-09-02
- T0162 on 2025-09-02
- T0164 on 2025-09-02
- T0165 on 2025-09-02
- T0166 on 2025-09-02
- T0151 on 2025-09-02
- T0152 on 2025-09-02
- T0155 on 2025-09-02
- T0167 on 2025-09-02
- T0168 on 2025-09-02
Full Changelog: v0.2.0...v0.2.1
Version v0.2.0
Added
- Global model performance leaderboard to docs.
- T0136 on 2025-08-27
- T0137 on 2025-08-27
- T0138 on 2025-08-27
- T0139 on 2025-08-27
- T0140 on 2025-08-27
- T0141 on 2025-08-27
- T0106 on 2025-08-27
Fixed
- Broken link patterns in docs.
Changed
- Standardize test-IDs to 4-digit zero-padded format (T0001).
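
As an illustration of the 4-digit zero-padded format, a minimal sketch follows. The helper names are hypothetical, and the accepted input forms (such as `T1` or `t12`) are assumptions, since the changelog does not describe the earlier ID format.

```python
import re


def format_test_id(number):
    """Render a test number in the zero-padded form, e.g. 1 -> "T0001"."""
    return f"T{int(number):04d}"


def normalize_test_id(raw):
    """Convert an ID such as "T1", "t12", or "T0143" to the padded form."""
    match = re.fullmatch(r"[Tt]0*(\d+)", raw.strip())
    if match is None:
        raise ValueError(f"not a test ID: {raw!r}")
    return format_test_id(match.group(1))
```

IDs that are already padded pass through unchanged, so the normalization can be applied repeatedly without side effects.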
Full Changelog: v0.1.0...v0.2.0
Version v0.1.0
Initial release.