Script to transition metadata to new author representation #5471

mbollmann · 2025-07-18T13:23:28Z

This PR adds a script to implement the transition logic to the new author representation system. I'm adding some implementation notes below as a basis for discussing if the script should do anything differently from what I implemented so far.

Notes:

The commit history is noisy because the PR depends on v0.5.3 of the library, which isn't merged yet. As soon as Create acl-anthology release v0.5.3 #5405 is merged, I can update/force-rebase this PR.
This PR contains some minor fixes to name_variants.yaml (0914c18) and removes name variants that are never used (40dd8bf) to avoid errors.
The script currently assumes that no ORCIDs are recorded in the XML. This is true at the time of this writing, but we have started to backfill ORCIDs on a separate branch. We'll need to decide whether to merge them in before running this (and modify this script accordingly) or after.
~~The script currently doesn't set disable_name_matching: true anywhere, but it probably needs to for some authors to preserve the current logic; need to work that out.~~

Before actually running this script, we should:

Update Python library for new author representation system #5472
Merge any open ingestions.
Merge all open & approved author metadata requests.
Check that no further issues (à la 0914c18) have been introduced.

Implementation notes

The implementation logic here is a bit trickier than I expected. I'm writing down some concrete examples here to illustrate what we need to consider.

The straight-forward case

The name_variants.yaml has:

- canonical: {first: Felicia, last: Körner}
  id: felicia-koerner
  variants:
  - {first: Felicia, last: Koerner}

There are two instances of these names in the XML:

2020.wmt.xml
1458:      <author><first>Felicia</first><last>Koerner</last></author>

2025.vardial.xml
157:      <author><first>Felicia</first><last>Körner</last></author>

The Python library returns these two names and papers:

>>> anthology.people.get("felicia-koerner")
Person(
    id='felicia-koerner',
    names=[Name(first='Felicia', last='Körner'), Name(first='Felicia', last='Koerner')],
    item_ids=<list of 2 AnthologyIDTuple objects>,
    comment=None,
    is_explicit=True
)

This is the straight-forward case. The script generates an entry for the new people.yaml like this:

felicia-koerner:
  names:
  - {first: Felicia, last: Körner}
  - {first: Felicia, last: Koerner}

and modifies the XML like this:

<author id="felicia-koerner"><first>Felicia</first><last>Koerner</last></author>
<author id="felicia-koerner"><first>Felicia</first><last>Körner</last></author>

Name variants not explicitly listed in the YAML

The name_variants.yaml has:

- canonical: {first: Aarno, last: Lehtola}
  id: aarno-lehtola

There are two kinds of instances (13 papers in total) in the XML:

2003.eamt.xml
75:      <author><first>Aarno</first><last>Lehtola</last></author>

C86.xml
748:      <author id="aarno-lehtola"><first>A.</first><last>Lehtola</last></author>

Here, the canonical name is used without the ID in the XML, but a name variant is used with the ID without it being listed in name_variants.yaml. This is allowed in our current system.

The Python library returns both names and all 13 papers, as expected:

>>> anthology.people.get("aarno-lehtola")
Person(
    id='aarno-lehtola',
    names=[Name(first='Aarno', last='Lehtola'), Name(first='A.', last='Lehtola')],
    item_ids=<list of 13 AnthologyIDTuple objects>,
    comment=None,
    is_explicit=True
)

Generating the entry for people.yaml is therefore also straight-forward, and should produce:

aarno-lehtola:
  names:
  - {first: Aarno, last: Lehtola}
  - {first: A., last: Lehtola}

While the XML should be modified to include the ID also for the canonical name:

<author id="aarno-lehtola"><first>Aarno</first><last>Lehtola</last></author>
<author id="aarno-lehtola"><first>A.</first><last>Lehtola</last></author>

Mixture of explicit and implicitly-matched name variants

This is where it gets tricky. The name_variants.yaml has:

- canonical: {first: Pedro, last: Ortiz Suarez}
  variants:
  - {first: Pedro Javier, last: Ortiz Suárez}

There are two names listed here, and no explicit ID, so no further names can be added to this person via an explicit ID in the XML. The generated ID, pedro-ortiz-suarez, does not exist in any XML file.

However, the Python library returns three names:

>>> anthology.people.get("pedro-ortiz-suarez")
Person(
    id='pedro-ortiz-suarez',
    names=[Name(first='Pedro', last='Ortiz Suarez'), Name(first='Pedro Javier', last='Ortiz Suárez'), Name(first='Pedro Ortiz', last='Suarez')],
    item_ids=<list of 19 AnthologyIDTuple objects>,
    comment=None,
    is_explicit=True
)

This is because the third name, {first: "Pedro Ortiz", last: "Suarez"}, was attached to this person by our implicit name matching mechanism that assumes names refer to the same person if they slugify to the same string (for reference, here is where this happens).

In my understanding of our new system, this name variant should not be written to the people.yaml since it was not manually defined by us (and is, in fact, just the result of an incorrect first/last split). In other words, the script should generate an entry for the new people.yaml like this:

pedro-ortiz-suarez:
  names:
  - {first: Pedro, last: Ortiz Suarez}
  - {first: Pedro Javier, last: Ortiz Suárez}

and modify the XML like this:

<author id="pedro-ortiz-suarez"><first>Pedro</first><last>Ortiz Suarez</last></author>
<author id="pedro-ortiz-suarez"><first>Pedro Javier</first><last>Ortiz Suárez</last></author>
<author><first>Pedro Ortiz</first><last>Suarez</last></author>  <!-- no explicit ID here! -->

IDs that "may refer to several people"

In a few instances, our name_variants.yaml contains entries simply because they are currently required for technical reasons:

- canonical: {first: Fei, last: Liu}
  comment: May refer to several people
  id: fei-liu

This is because there are several other "Fei Liu"s with their own respective IDs. Under the new system, no entry in people.yaml should be generated for this "catch-all" person, and existing references in the XML to this ID should be removed.

codecov · 2025-07-18T13:26:46Z

Codecov Report

Attention: Patch coverage is 92.30769% with 9 lines in your changes missing coverage. Please review.

Project coverage is 92.03%. Comparing base (2058665) to head (34bf474).
Report is 6 commits behind head on master.

Files with missing lines	Patch %	Lines
python/acl_anthology/collections/paper.py	64.28%	5 Missing ⚠️
python/acl_anthology/utils/xml.py	96.49%	2 Missing ⚠️
python/acl_anthology/collections/event.py	92.85%	1 Missing ⚠️
python/acl_anthology/utils/git.py	0.00%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #5471      +/-   ##
==========================================
- Coverage   93.67%   92.03%   -1.64%     
==========================================
  Files          35       35              
  Lines        2781     2850      +69     
==========================================
+ Hits         2605     2623      +18     
- Misses        176      227      +51

Files with missing lines	Coverage Δ
python/acl_anthology/collections/__init__.py	`100.00% <100.00%> (ø)`
python/acl_anthology/collections/collection.py	`96.82% <100.00%> (+0.10%)`	⬆️
python/acl_anthology/collections/eventindex.py	`92.95% <100.00%> (-0.20%)`	⬇️
python/acl_anthology/collections/types.py	`100.00% <100.00%> (ø)`
python/acl_anthology/people/name.py	`97.81% <100.00%> (-0.73%)`	⬇️
python/acl_anthology/sigs.py	`97.02% <100.00%> (ø)`
python/acl_anthology/venues.py	`88.40% <100.00%> (ø)`
python/acl_anthology/collections/event.py	`99.20% <92.85%> (-0.80%)`	⬇️
python/acl_anthology/utils/git.py	`20.33% <0.00%> (-62.72%)`	⬇️
python/acl_anthology/utils/xml.py	`97.39% <96.49%> (-1.19%)`	⬇️
... and 1 more

... and 2 files with indirect coverage changes

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

github-actions · 2025-07-18T13:57:44Z

Build successful. Some useful links:

Complete site preview: https://preview.aclanthology.org/transition-to-people-yaml
Potential volumes of interest: 1952.earlymt-1, 1956.earlymt-1, 1957.earlymt-1, 1960.earlymt-nsmt, 1960.earlymt-fsmtw, 1961.earlymt-1, 1962.earlymt-1, 1976.earlymt-1, 1978.tc-1, 1980.tc-1, 1981.tc-1, 1984.bcs-1, 1984.tc-1, 1985.tc-1, 1985.tmi-1, 1986.tc-1, 1987.mtsummit-1, 1987.tc-1, 1988.tc-1, 1988.tmi-1, 1989.mtsummit-1, 1989.tc-1, 1990.tc-1, 1991.iwpt-1, 1991.mtsummit-papers, 1991.mtsummit-panels, 1991.tc-1, 1992.tc-1, 1992.tmi-1, 1993.eamt-1, 1993.iwpt-1, 1993.mtsummit-1, 1993.tc-1, 1993.tmi-1, 1994.amta-1, 1994.bcs-1, 1994.eamt-1, 1994.tc-1, 1994.vlc-1, 1995.iwpt-1, 1995.mtsummit-1, 1995.tc-1, 1995.tmi-1, 1996.amta-1, 1996.eamt-1, 1996.tc-1, 1997.eamt-1, 1997.iwpt-1, 1997.mtsummit-tutorials, 1997.mtsummit-plenaries, (plus 3102 more...)

This preview will be removed when the branch is merged.

nschneid · 2025-07-18T14:06:46Z

This is because the third name, {first: "Pedro Ortiz", last: "Suarez"}, was attached to this person by our implicit name matching mechanism that assumes names refer to the same person if they slugify to the same string (for reference, here is where this happens).

In my understanding of our new system, this name variant should not be written to the people.yaml since it was not manually defined by us (and is, in fact, just the result of an incorrect first/last split).

How many such first/last split inconsistencies involving a name_variants entry are there? Can we flag them for manual review?

Are there other cases that are not listed explicitly but slugify to the same string as an explicit name variant due to differences in capitalization/diacritics?

mbollmann · 2025-07-18T14:16:31Z

How many such first/last split inconsistencies involving a name_variants entry are there? Can we flag them for manual review?

Are there other cases that are not listed explicitly but slugify to the same string as an explicit name variant due to differences in capitalization/diacritics?

There are 387 instances (222 of them unique) of such cases currently, and yes, many of them result from differences in capitalization/diacritics as well.

mbollmann · 2025-07-18T14:29:32Z

Quick overview: https://gist.github.com/mbollmann/9b58727a8cd84520d2d0a2c05beb546a

nschneid · 2025-07-18T14:32:44Z

OK. Functionally, until these have explicit author IDs in the XML, they will show up as unverified right? And the URL of the page will be based on the slug, and thus the grouping of unverified entries is agnostic as to first/last splits?

mbollmann · 2025-07-18T14:39:17Z

OK. Functionally, until these have explicit author IDs in the XML, they will show up as unverified right? And the URL of the page will be based on the slug, and thus the grouping of unverified entries is agnostic as to first/last splits?

Yes, based on these parts (if we don't change this logic):

What I haven't actually thought about yet is when/if/how to prepopulate disable_name_matching: true based on our current metadata. There may be instances where name_variants.yaml is currently used for that purpose but we clearly don't want to set it by default for everyone.

nschneid · 2025-07-18T14:49:13Z

ORCID in the ingestion material, matches an ORCID in our people.yaml. We assign the paper to the ID with the matching ORCID, adding it explicitly to the XML with an id="..." attribute.

If the name on the paper differs from the name variants in people.yaml, will it be added?

Will it trigger an error if the name variants differ in first/last split? (Same for merges.) I suppose it should fail because it reflects a paper-level metadata error.

mbollmann · 2025-07-18T15:05:03Z

ORCID in the ingestion material, matches an ORCID in our people.yaml. We assign the paper to the ID with the matching ORCID, adding it explicitly to the XML with an id="..." attribute.

If the name on the paper differs from the name variants in people.yaml, will it be added?

Added to people.yaml, you mean? I would say yes.

Will it trigger an error if the name variants differ in first/last split? (Same for merges.) I suppose it should fail because it reflects a paper-level metadata error.

How do you distinguish legitimate (but previously unseen) name variants from metadata errors? Yes, you could create a special rule when the only difference is the first/last split, but what if the name differs by diacritics? Or by diacritics plus first/last split? I think this might become unwieldy to handle manually at ingestion time.

nschneid · 2025-07-18T15:13:55Z

Will it trigger an error if the name variants differ in first/last split? (Same for merges.) I suppose it should fail because it reflects a paper-level metadata error.

How do you distinguish legitimate (but previously unseen) name variants from metadata errors? Yes, you could create a special rule when the only difference is the first/last split, but what if the name differs by diacritics? Or by diacritics plus first/last split? I think this might become unwieldy to handle manually at ingestion time.

Name variants that differ in capitalization or diacritics can be legitimate because they can be reflected in the PDF. Whereas first/last splits are implicit in the PDF, but we presume there is only one correct way to do it.

I will have to think about the precise way to identify a first/last split mismatch between two name variants (abbreviations or omitted name parts create complications) but I would start by replacing hyphens and periods with spaces and then slugifying each space-separated part of the name:

John Jacob | Jingleheimer-schmidt becomes john jacob | jingleheimer schmidt
John Jacob Jingleheimer | Schmidt becomes john jacob jingleheimer | schmidt
John | Jingleheimer Schmidt becomes john | jingleheimer schmidt
John J.J. | Schmidt becomes john j j | schmidt

Clearly the first two differ only in split point, but there will need to be some fuzzy matching for the others.

mbollmann · 2025-07-18T15:30:05Z

So far that looks functionally identical to just slugifying first/last parts separately (without any preprocessing, as discarding hyphens and periods is what slugifying already does anyway).

nschneid · 2025-07-18T16:00:27Z

Yeah it occurs to me that one way to go is to add the split point in the actual URL if there is more than 1 hyphen in the current slug: people/john-jacob-jingleheimer-schmidt => people/john-jacob|jingleheimer-schmidt. Then different split points would cause the unverified name variants to have different pages, and in order to merge them the paper metadata should be corrected. Not sure how people like the aesthetics of those URLs though.

mbollmann · 2025-07-18T19:08:36Z

That would remove what so far has been an intentional feature. I think I'd rather have a separate process/script for detecting these kinds of "logical" errors and allowing/prompting us to deal with them, rather than having the Python library itself try to catch them or making it a necessity during ingestion.

nschneid · 2025-07-18T20:29:37Z

How about the separate script and also a warning during ingestion. If it's a small venue being ingested there may be capacity to fix it right away. Otherwise the warnings can be dealt with later.

mbollmann · 2025-08-03T11:02:37Z

What I haven't actually thought about yet is when/if/how to prepopulate disable_name_matching: true based on our current metadata. There may be instances where name_variants.yaml is currently used for that purpose but we clearly don't want to set it by default for everyone.

Update: I have tried to add this now. My thinking is that if there was a "catch-all" ID ("May refer to several people" etc.) and there is only one other person defined in name_variants.yaml with the same name, that person needs to get disable_name_matching: true under the new system.

…ogic

mjpost · 2025-09-19T19:12:21Z

Okay, I have both ORCID iD and unverified examples:

https://preview.aclanthology.org/transition-to-people-yaml/people/yang-zhang-ustc/
https://preview.aclanthology.org/transition-to-people-yaml/people/unverified/shihao-cai/
(That was actually kind of painful. All the font awesome docs say to use fa-brands fa-orcid, but you actually need fab fa-orcid; I only found his via ChatGPT).

nschneid · 2025-09-19T19:28:32Z

Not working for me:

mjpost · 2025-09-19T19:42:27Z

can you force reload? That hash was missing a string-final $.

nschneid · 2025-09-19T20:00:02Z

Still getting the error

mjpost · 2025-09-19T20:48:27Z

Hmm, your error message suggests the checksum should not have the $. Can you try again? It works both ways for me.

(The change here is I had to update font awesome to 5.11.0)

nschneid · 2025-09-19T20:59:41Z

Works now! (I'm using Firefox)

mbollmann · 2025-09-19T21:04:28Z

(That was actually kind of painful. All the font awesome docs say to use fa-brands fa-orcid, but you actually need fab fa-orcid; I only found his via ChatGPT).

Pretty sure this just depends on the version :)

hugo/layouts/people/single.html

mbollmann · 2025-09-19T22:32:23Z

I think I'd prefer something a lot more subtle for the unverified case, tbh.

mjpost · 2025-09-19T23:13:19Z

Okay, that corroborates a thought I'd been having, too: if we release this new version, and every bit of unverified data is glaring out, it's going to look bad and trigger a ton of work.

nschneid · 2025-09-20T01:52:13Z

I'm in favor of gradually making the lack of verification more prominent once we have well-established processes with the new model.

For now: opacity: .2 on the unverified icon on the people pages?

mjpost · 2025-09-20T12:04:01Z

Looks good to me.

nschneid · 2025-09-20T14:53:09Z

I'm getting the error again...remove the $? (UPDATE: Success)

mbollmann · 2025-09-20T15:04:02Z

I think I should prepare a branch with all the data transition done, so we can have a PR that only shows the layout and/or logical changes, without all the data files included in "Files changed", as that tab is pretty much unusable for me right now...

mjpost · 2025-09-21T00:06:24Z

This is a good idea. I'm behind on it. There have been a lot of other behind-the-scenes issues that have consumed my time. Maybe I should prioritize this, going year-by-year with the main conferences back to 2020, and then we can hold that static. Give me till the 27th?

mbollmann · 2025-09-21T12:15:21Z

@mjpost In my comment, "data transition" referred to transitioning to the new author system (which, among other things, writes lots of IDs to the XML files), while you seem to talk about backfilling ORCIDs (which is related, but orthogonal to that)?

In any case I'd imagine making a master-transitioned branch that has these data changes, and where we can push updates to the data or change the transition logic if need be, while making PRs against that branch that implement layout changes or anything else that needs to be based on the transitioned data.

mbollmann added 4 commits July 18, 2025 13:01

Fix problems in name_variants.yaml

0914c18

Remove unused name variants

40dd8bf

Merge branch 'python-dev' into transition-to-people-yaml

ecf12cc

Add script to transition metadata to new author representation

407823f

mbollmann mentioned this pull request Jul 18, 2025

Author name URLs #623

Open

Fix small (potential) mistake when IDs to delete are editors

34bf474

Make logger use stderr (#5474)

10a9e58

mbollmann added 2 commits August 2, 2025 10:29

Merge branch 'master' into transition-to-people-yaml

fddd2f0

Check when disable_name_matching: true is needed and add it

798597e

mbollmann added this to Author representation Aug 8, 2025

mbollmann added 5 commits August 9, 2025 12:34

Merge branch 'master' into python-dev

1479c56

Add ORCID field to Person

ee78650

Add new fields to NameSpec and Person, add check for verified IDs

309ad65

Removed outdated special case when slugifying

5b9ca84

Switch from name_variants.yaml to people.yaml & new name resolution l…

aa2411a

…ogic

mjpost added 9 commits September 19, 2025 12:12

black

9450198

Ensmallen; use fontawesome for both

ae80f4f

Bump fontawesome version 5.7.2 -> 5.11.0

4265398

Icon sizing

0211a1f

Update checksum

56a9262

Use fontawesome kit

8b0f159

black

7c6abf7

Contra all docs everywhere, you need "fab"

242ee65

Switch back to stylesheet

8c048ef

fa-solid -> fas

34a0aeb

mbollmann commented Sep 19, 2025

View reviewed changes

hugo/layouts/people/single.html Outdated Show resolved Hide resolved

Relativize link and lighten question mark

3bdf8f4

Set opacity for unverified people

fd766f7

reference syntax

06b8903

fontawesome hash

d6d2838

Script to transition metadata to new author representation #5471

Are you sure you want to change the base?

Script to transition metadata to new author representation #5471

Uh oh!

Conversation

mbollmann commented Jul 18, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Implementation notes

The straight-forward case

Name variants not explicitly listed in the YAML

Mixture of explicit and implicitly-matched name variants

IDs that "may refer to several people"

Uh oh!

codecov bot commented Jul 18, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

github-actions bot commented Jul 18, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

nschneid commented Jul 18, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

mbollmann commented Jul 18, 2025

Uh oh!

mbollmann commented Jul 18, 2025

Uh oh!

nschneid commented Jul 18, 2025

Uh oh!

mbollmann commented Jul 18, 2025

Uh oh!

nschneid commented Jul 18, 2025

Uh oh!

mbollmann commented Jul 18, 2025

Uh oh!

nschneid commented Jul 18, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

mbollmann commented Jul 18, 2025

Uh oh!

nschneid commented Jul 18, 2025

Uh oh!

mbollmann commented Jul 18, 2025

Uh oh!

nschneid commented Jul 18, 2025

Uh oh!

mbollmann commented Aug 3, 2025

Uh oh!

mjpost commented Sep 19, 2025

Uh oh!

nschneid commented Sep 19, 2025

Uh oh!

mjpost commented Sep 19, 2025

Uh oh!

nschneid commented Sep 19, 2025

Uh oh!

mjpost commented Sep 19, 2025

Uh oh!

nschneid commented Sep 19, 2025

Uh oh!

mbollmann commented Sep 19, 2025

Uh oh!

Uh oh!

mbollmann commented Sep 19, 2025

Uh oh!

mjpost commented Sep 19, 2025

Uh oh!

nschneid commented Sep 20, 2025

Uh oh!

mjpost commented Sep 20, 2025

Uh oh!

nschneid commented Sep 20, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

mbollmann commented Sep 20, 2025

Uh oh!

mjpost commented Sep 21, 2025

Uh oh!

mbollmann commented Sep 21, 2025

Uh oh!

mbollmann commented Jul 18, 2025 •

edited

Loading

codecov bot commented Jul 18, 2025 •

edited

Loading

github-actions bot commented Jul 18, 2025 •

edited

Loading

nschneid commented Jul 18, 2025 •

edited

Loading

nschneid commented Jul 18, 2025 •

edited

Loading

nschneid commented Sep 20, 2025 •

edited

Loading