Description
What specific task do you want to benchmark?
Catalog data extraction from Briefzettelkatalog scans. If successful, the extracted data will be ingested into ALMA and replace the manual cataloging of the Briefzettelkatalog.
What dataset will you use, and do you have permission to share it?
The Briefzettelkatalog consists of about 50,000 cards, all of which are in the process of being digitized. The scans will be released under a CC0 license.
How will you create ground truth (who will annotate, and how)?
About 10,000 cards have been re-cataloged by domain experts and serve as ground truth. These records are available in ALMA as MARC21. A randomized sample of around 500 records will be transformed into the expected output JSON format (see below).
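The sampling step could be made reproducible along these lines. This is a minimal sketch, not the project's actual procedure: `sample_record_ids`, the record ID format, and the fixed seed are all hypothetical, and the MARC21-to-JSON transformation itself is not shown because the field mapping is not specified here.

```python
import random


def sample_record_ids(record_ids, k=500, seed=0):
    """Draw a reproducible random sample of record IDs (hypothetical helper).

    Sorting the input first makes the result independent of the order in
    which the IDs were exported; the seed value is an arbitrary assumption.
    """
    ids = sorted(record_ids)
    rng = random.Random(seed)
    return sorted(rng.sample(ids, min(k, len(ids))))


# Hypothetical IDs standing in for the roughly 10,000 re-cataloged records.
all_ids = [f"rec{i:05d}" for i in range(10000)]
ground_truth_ids = sample_record_ids(all_ids)
```

Fixing the seed means the same 500 records can be re-drawn later, for example if the conversion to JSON has to be repeated.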
What does successful model output look like?
{
  "Metadata": {
    "Author": "Andrait, Jacques",
    "Reference": "G² II 14, fol. 125",
    "Recipient": "Arragonis, Euchelmius",
    "Date": "1608-05-19",
    "Place": "St. Michel de Lanes",
    "Language": "französisch",
    "Note": "Apogr. französ.",
    "Bibliography": "[Bibliogr.: Fr Ertl. I 15,3]"
  },
  "Description": {
    "Text": "R: Der Herr A.S. Ihan wird nach Rasel kommen ins A. gut zu bedienen. - Und hat sich wegen des Reitzes von H. alle Mühe gegeben, kein Roger A. hat sich immer gegen den Willen des A. gestellt. Es soll nun besser werden. - Grüess."
  }
}
How will you score model performance?
Compute F1 per field, counting a predicted value as correct when its string similarity to the ground truth reaches a threshold, and micro-averaged F1 per card.
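One way this scoring could be implemented is sketched below. Several choices are assumptions rather than part of the proposal: the similarity measure (`difflib.SequenceMatcher` ratio), the threshold of 0.9, the treatment of empty values as missing, and all function names.

```python
from difflib import SequenceMatcher


def flatten(record, prefix=""):
    """Flatten nested dicts to {"Metadata.Author": "..."}; empty values are skipped."""
    flat = {}
    for key, value in record.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, path + "."))
        elif value not in (None, ""):
            flat[path] = str(value)
    return flat


def is_match(pred, gold, threshold=0.9):
    """Assumed criterion: similarity ratio at or above the threshold."""
    return SequenceMatcher(None, pred, gold).ratio() >= threshold


def count_card(pred, gold, threshold=0.9):
    """Return {field: (tp, fp, fn)} for one predicted/ground-truth card pair."""
    p, g = flatten(pred), flatten(gold)
    counts = {}
    for field in p.keys() | g.keys():
        if field in p and field in g:
            hit = is_match(p[field], g[field], threshold)
            # A wrong value is both a false positive and a false negative.
            counts[field] = (1, 0, 0) if hit else (0, 1, 1)
        elif field in p:
            counts[field] = (0, 1, 0)  # predicted, but absent from ground truth
        else:
            counts[field] = (0, 0, 1)  # in ground truth, but not predicted
    return counts


def f1(tp, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def card_micro_f1(pred, gold, threshold=0.9):
    """Micro F1 for one card: pool the counts of all its fields."""
    counts = count_card(pred, gold, threshold).values()
    tp, fp, fn = (sum(c[i] for c in counts) for i in range(3))
    return f1(tp, fp, fn)


def field_f1(pairs, threshold=0.9):
    """F1 per field: pool the counts of each field over all cards."""
    totals = {}
    for pred, gold in pairs:
        for field, c in count_card(pred, gold, threshold).items():
            t = totals.setdefault(field, [0, 0, 0])
            for i in range(3):
                t[i] += c[i]
    return {field: f1(*t) for field, t in totals.items()}
```

A single fuzzy threshold may be too lenient for short structured fields: two ISO dates that differ in one digit still score 0.9 with this measure, so exact matching for `Date` might be worth considering.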