Skip to content

Conversation

mbollmann
Copy link
Member

@mbollmann mbollmann commented Jul 18, 2025

This PR adds a script to implement the transition logic to the new author representation system. I'm adding some implementation notes below as a basis for discussing if the script should do anything differently from what I implemented so far.

Notes:

  • The commit history is noisy because the PR depends on v0.5.3 of the library, which isn't merged yet. As soon as Create acl-anthology release v0.5.3 #5405 is merged, I can update/force-rebase this PR.
  • This PR contains some minor fixes to name_variants.yaml (0914c18) and removes name variants that are never used (40dd8bf) to avoid errors.
  • The script currently assumes that no ORCIDs are recorded in the XML. This is true at the time of this writing, but we have started to backfill ORCIDs on a separate branch. We'll need to decide whether to merge them in before running this (and modify this script accordingly) or after.
  • The script currently doesn't set disable_name_matching: true anywhere, but it probably needs to for some authors to preserve the current logic; need to work that out.

Before actually running this script, we should:

Implementation notes

The implementation logic here is a bit trickier than I expected. I'm writing down some concrete examples here to illustrate what we need to consider.

The straight-forward case

The name_variants.yaml has:

- canonical: {first: Felicia, last: Körner}
  id: felicia-koerner
  variants:
  - {first: Felicia, last: Koerner}

There are two instances of these names in the XML:

2020.wmt.xml
1458:      <author><first>Felicia</first><last>Koerner</last></author>

2025.vardial.xml
157:      <author><first>Felicia</first><last>Körner</last></author>

The Python library returns these two names and papers:

>>> anthology.people.get("felicia-koerner")
Person(
    id='felicia-koerner',
    names=[Name(first='Felicia', last='Körner'), Name(first='Felicia', last='Koerner')],
    item_ids=<list of 2 AnthologyIDTuple objects>,
    comment=None,
    is_explicit=True
)

This is the straight-forward case. The script generates an entry for the new people.yaml like this:

felicia-koerner:
  names:
  - {first: Felicia, last: Körner}
  - {first: Felicia, last: Koerner}

and modifies the XML like this:

<author id="felicia-koerner"><first>Felicia</first><last>Koerner</last></author>
<author id="felicia-koerner"><first>Felicia</first><last>Körner</last></author>

Name variants not explicitly listed in the YAML

The name_variants.yaml has:

- canonical: {first: Aarno, last: Lehtola}
  id: aarno-lehtola

There are two kinds of instances (13 papers in total) in the XML:

2003.eamt.xml
75:      <author><first>Aarno</first><last>Lehtola</last></author>

C86.xml
748:      <author id="aarno-lehtola"><first>A.</first><last>Lehtola</last></author>

Here, the canonical name is used without the ID in the XML, but a name variant is used with the ID without it being listed in name_variants.yaml. This is allowed in our current system.

The Python library returns both names and all 13 papers, as expected:

>>> anthology.people.get("aarno-lehtola")
Person(
    id='aarno-lehtola',
    names=[Name(first='Aarno', last='Lehtola'), Name(first='A.', last='Lehtola')],
    item_ids=<list of 13 AnthologyIDTuple objects>,
    comment=None,
    is_explicit=True
)

Generating the entry for people.yaml is therefore also straight-forward, and should produce:

aarno-lehtola:
  names:
  - {first: Aarno, last: Lehtola}
  - {first: A., last: Lehtola}

While the XML should be modified to include the ID also for the canonical name:

<author id="aarno-lehtola"><first>Aarno</first><last>Lehtola</last></author>
<author id="aarno-lehtola"><first>A.</first><last>Lehtola</last></author>

Mixture of explicit and implicitly-matched name variants

This is where it gets tricky. The name_variants.yaml has:

- canonical: {first: Pedro, last: Ortiz Suarez}
  variants:
  - {first: Pedro Javier, last: Ortiz Suárez}

There are two names listed here, and no explicit ID, so no further names can be added to this person via an explicit ID in the XML. The generated ID, pedro-ortiz-suarez, does not exist in any XML file.

However, the Python library returns three names:

>>> anthology.people.get("pedro-ortiz-suarez")
Person(
    id='pedro-ortiz-suarez',
    names=[Name(first='Pedro', last='Ortiz Suarez'), Name(first='Pedro Javier', last='Ortiz Suárez'), Name(first='Pedro Ortiz', last='Suarez')],
    item_ids=<list of 19 AnthologyIDTuple objects>,
    comment=None,
    is_explicit=True
)

This is because the third name, {first: "Pedro Ortiz", last: "Suarez"}, was attached to this person by our implicit name matching mechanism that assumes names refer to the same person if they slugify to the same string (for reference, here is where this happens).

In my understanding of our new system, this name variant should not be written to the people.yaml since it was not manually defined by us (and is, in fact, just the result of an incorrect first/last split). In other words, the script should generate an entry for the new people.yaml like this:

pedro-ortiz-suarez:
  names:
  - {first: Pedro, last: Ortiz Suarez}
  - {first: Pedro Javier, last: Ortiz Suárez}

and modify the XML like this:

<author id="pedro-ortiz-suarez"><first>Pedro</first><last>Ortiz Suarez</last></author>
<author id="pedro-ortiz-suarez"><first>Pedro Javier</first><last>Ortiz Suárez</last></author>
<author><first>Pedro Ortiz</first><last>Suarez</last></author>  <!-- no explicit ID here! -->

IDs that "may refer to several people"

In a few instances, our name_variants.yaml contains entries simply because they are currently required for technical reasons:

- canonical: {first: Fei, last: Liu}
  comment: May refer to several people
  id: fei-liu

This is because there are several other "Fei Liu"s with their own respective IDs. Under the new system, no entry in people.yaml should be generated for this "catch-all" person, and existing references in the XML to this ID should be removed.

@mbollmann mbollmann mentioned this pull request Jul 18, 2025
Copy link

codecov bot commented Jul 18, 2025

Codecov Report

Attention: Patch coverage is 92.30769% with 9 lines in your changes missing coverage. Please review.

Project coverage is 92.03%. Comparing base (2058665) to head (34bf474).
Report is 6 commits behind head on master.

Files with missing lines Patch % Lines
python/acl_anthology/collections/paper.py 64.28% 5 Missing ⚠️
python/acl_anthology/utils/xml.py 96.49% 2 Missing ⚠️
python/acl_anthology/collections/event.py 92.85% 1 Missing ⚠️
python/acl_anthology/utils/git.py 0.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #5471      +/-   ##
==========================================
- Coverage   93.67%   92.03%   -1.64%     
==========================================
  Files          35       35              
  Lines        2781     2850      +69     
==========================================
+ Hits         2605     2623      +18     
- Misses        176      227      +51     
Files with missing lines Coverage Δ
python/acl_anthology/collections/__init__.py 100.00% <100.00%> (ø)
python/acl_anthology/collections/collection.py 96.82% <100.00%> (+0.10%) ⬆️
python/acl_anthology/collections/eventindex.py 92.95% <100.00%> (-0.20%) ⬇️
python/acl_anthology/collections/types.py 100.00% <100.00%> (ø)
python/acl_anthology/people/name.py 97.81% <100.00%> (-0.73%) ⬇️
python/acl_anthology/sigs.py 97.02% <100.00%> (ø)
python/acl_anthology/venues.py 88.40% <100.00%> (ø)
python/acl_anthology/collections/event.py 99.20% <92.85%> (-0.80%) ⬇️
python/acl_anthology/utils/git.py 20.33% <0.00%> (-62.72%) ⬇️
python/acl_anthology/utils/xml.py 97.39% <96.49%> (-1.19%) ⬇️
... and 1 more

... and 2 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@nschneid
Copy link
Contributor

nschneid commented Jul 18, 2025

This is because the third name, {first: "Pedro Ortiz", last: "Suarez"}, was attached to this person by our implicit name matching mechanism that assumes names refer to the same person if they slugify to the same string (for reference, here is where this happens).

In my understanding of our new system, this name variant should not be written to the people.yaml since it was not manually defined by us (and is, in fact, just the result of an incorrect first/last split).

How many such first/last split inconsistencies involving a name_variants entry are there? Can we flag them for manual review?

Are there other cases that are not listed explicitly but slugify to the same string as an explicit name variant due to differences in capitalization/diacritics?

@mbollmann
Copy link
Member Author

How many such first/last split inconsistencies involving a name_variants entry are there? Can we flag them for manual review?

Are there other cases that are not listed explicitly but slugify to the same string as an explicit name variant due to differences in capitalization/diacritics?

There are 387 instances (222 of them unique) of such cases currently, and yes, many of them result from differences in capitalization/diacritics as well.

@mbollmann
Copy link
Member Author

@nschneid
Copy link
Contributor

OK. Functionally, until these have explicit author IDs in the XML, they will show up as unverified right? And the URL of the page will be based on the slug, and thus the grouping of unverified entries is agnostic as to first/last splits?

@mbollmann
Copy link
Member Author

OK. Functionally, until these have explicit author IDs in the XML, they will show up as unverified right? And the URL of the page will be based on the slug, and thus the grouping of unverified entries is agnostic as to first/last splits?

Yes, based on these parts (if we don't change this logic):
image

What I haven't actually thought about yet is when/if/how to prepopulate disable_name_matching: true based on our current metadata. There may be instances where name_variants.yaml is currently used for that purpose but we clearly don't want to set it by default for everyone.

@nschneid
Copy link
Contributor

ORCID in the ingestion material, matches an ORCID in our people.yaml. We assign the paper to the ID with the matching ORCID, adding it explicitly to the XML with an id="..." attribute.

If the name on the paper differs from the name variants in people.yaml, will it be added?

Will it trigger an error if the name variants differ in first/last split? (Same for merges.) I suppose it should fail because it reflects a paper-level metadata error.

@mbollmann
Copy link
Member Author

ORCID in the ingestion material, matches an ORCID in our people.yaml. We assign the paper to the ID with the matching ORCID, adding it explicitly to the XML with an id="..." attribute.

If the name on the paper differs from the name variants in people.yaml, will it be added?

Added to people.yaml, you mean? I would say yes.

Will it trigger an error if the name variants differ in first/last split? (Same for merges.) I suppose it should fail because it reflects a paper-level metadata error.

How do you distinguish legitimate (but previously unseen) name variants from metadata errors? Yes, you could create a special rule when the only difference is the first/last split, but what if the name differs by diacritics? Or by diacritics plus first/last split? I think this might become unwieldy to handle manually at ingestion time.

@nschneid
Copy link
Contributor

nschneid commented Jul 18, 2025

Will it trigger an error if the name variants differ in first/last split? (Same for merges.) I suppose it should fail because it reflects a paper-level metadata error.

How do you distinguish legitimate (but previously unseen) name variants from metadata errors? Yes, you could create a special rule when the only difference is the first/last split, but what if the name differs by diacritics? Or by diacritics plus first/last split? I think this might become unwieldy to handle manually at ingestion time.

Name variants that differ in capitalization or diacritics can be legitimate because they can be reflected in the PDF. Whereas first/last splits are implicit in the PDF, but we presume there is only one correct way to do it.

I will have to think about the precise way to identify a first/last split mismatch between two name variants (abbreviations or omitted name parts create complications) but I would start by replacing hyphens and periods with spaces and then slugifying each space-separated part of the name:

  • John Jacob | Jingleheimer-schmidt becomes john jacob | jingleheimer schmidt
  • John Jacob Jingleheimer | Schmidt becomes john jacob jingleheimer | schmidt
  • John | Jingleheimer Schmidt becomes john | jingleheimer schmidt
  • John J.J. | Schmidt becomes john j j | schmidt

Clearly the first two differ only in split point, but there will need to be some fuzzy matching for the others.

@mbollmann
Copy link
Member Author

So far that looks functionally identical to just slugifying first/last parts separately (without any preprocessing, as discarding hyphens and periods is what slugifying already does anyway).

@nschneid
Copy link
Contributor

Yeah it occurs to me that one way to go is to add the split point in the actual URL if there is more than 1 hyphen in the current slug: people/john-jacob-jingleheimer-schmidt => people/john-jacob|jingleheimer-schmidt. Then different split points would cause the unverified name variants to have different pages, and in order to merge them the paper metadata should be corrected. Not sure how people like the aesthetics of those URLs though.

@mbollmann
Copy link
Member Author

That would remove what so far has been an intentional feature. I think I'd rather have a separate process/script for detecting these kinds of "logical" errors and allowing/prompting us to deal with them, rather than having the Python library itself try to catch them or making it a necessity during ingestion.

@nschneid
Copy link
Contributor

How about the separate script and also a warning during ingestion. If it's a small venue being ingested there may be capacity to fix it right away. Otherwise the warnings can be dealt with later.

@mbollmann
Copy link
Member Author

What I haven't actually thought about yet is when/if/how to prepopulate disable_name_matching: true based on our current metadata. There may be instances where name_variants.yaml is currently used for that purpose but we clearly don't want to set it by default for everyone.

Update: I have tried to add this now. My thinking is that if there was a "catch-all" ID ("May refer to several people" etc.) and there is only one other person defined in name_variants.yaml with the same name, that person needs to get disable_name_matching: true under the new system.

@mjpost
Copy link
Member

mjpost commented Sep 19, 2025

Okay, I have both ORCID iD and unverified examples:

@nschneid
Copy link
Contributor

Not working for me:
image

@mjpost
Copy link
Member

mjpost commented Sep 19, 2025

can you force reload? That hash was missing a string-final $.

@nschneid
Copy link
Contributor

Still getting the error

@mjpost
Copy link
Member

mjpost commented Sep 19, 2025

Hmm, your error message suggests the checksum should not have the $. Can you try again? It works both ways for me.

(The change here is I had to update font awesome to 5.11.0)

@nschneid
Copy link
Contributor

Works now! (I'm using Firefox)

@mbollmann
Copy link
Member Author

(That was actually kind of painful. All the font awesome docs say to use fa-brands fa-orcid, but you actually need fab fa-orcid; I only found his via ChatGPT).

Pretty sure this just depends on the version :)

@mbollmann
Copy link
Member Author

I think I'd prefer something a lot more subtle for the unverified case, tbh.

@mjpost
Copy link
Member

mjpost commented Sep 19, 2025

Okay, that corroborates a thought I'd been having, too: if we release this new version, and every bit of unverified data is glaring out, it's going to look bad and trigger a ton of work.

@nschneid
Copy link
Contributor

I'm in favor of gradually making the lack of verification more prominent once we have well-established processes with the new model.

For now: opacity: .2 on the unverified icon on the people pages?

@mjpost
Copy link
Member

mjpost commented Sep 20, 2025

Looks good to me.
image

@nschneid
Copy link
Contributor

nschneid commented Sep 20, 2025

I'm getting the error again...remove the $? (UPDATE: Success)

@mbollmann
Copy link
Member Author

I think I should prepare a branch with all the data transition done, so we can have a PR that only shows the layout and/or logical changes, without all the data files included in "Files changed", as that tab is pretty much unusable for me right now...

@mjpost
Copy link
Member

mjpost commented Sep 21, 2025

This is a good idea. I'm behind on it. There have been a lot of other behind-the-scenes issues that have consumed my time. Maybe I should prioritize this, going year-by-year with the main conferences back to 2020, and then we can hold that static. Give me till the 27th?

@mbollmann
Copy link
Member Author

@mjpost In my comment, "data transition" referred to transitioning to the new author system (which, among other things, writes lots of IDs to the XML files), while you seem to talk about backfilling ORCIDs (which is related, but orthogonal to that)?

In any case I'd imagine making a master-transitioned branch that has these data changes, and where we can push updates to the data or change the transition logic if need be, while making PRs against that branch that implement layout changes or anything else that needs to be based on the transitioned data.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: No status

Development

Successfully merging this pull request may close these issues.

3 participants