Script to transition metadata to new author representation #5471
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## master #5471 +/- ##
==========================================
- Coverage 93.67% 92.03% -1.64%
==========================================
Files 35 35
Lines 2781 2850 +69
==========================================
+ Hits 2605 2623 +18
- Misses 176 227 +51
How many such first/last split inconsistencies involving a name_variants entry are there? Can we flag them for manual review? Are there other cases that are not listed explicitly but slugify to the same string as an explicit name variant due to differences in capitalization/diacritics?
There are currently 387 instances of such cases (222 of them unique), and yes, many of them result from differences in capitalization/diacritics as well.
OK. Functionally, until these have explicit author IDs in the XML, they will show up as unverified, right? And the URL of the page will be based on the slug, so the grouping of unverified entries is agnostic as to first/last splits?
If the name on the paper differs from the name variants in people.yaml, will it be added? Will it trigger an error if the name variants differ in first/last split? (Same for merges.) I suppose it should fail because it reflects a paper-level metadata error.
Added to people.yaml, you mean? I would say yes.
How do you distinguish legitimate (but previously unseen) name variants from metadata errors? Yes, you could create a special rule when the only difference is the first/last split, but what if the name differs by diacritics? Or by diacritics plus first/last split? I think this might become unwieldy to handle manually at ingestion time.
Name variants that differ in capitalization or diacritics can be legitimate because they can be reflected in the PDF. Whereas first/last splits are implicit in the PDF, but we presume there is only one correct way to do it. I will have to think about the precise way to identify a first/last split mismatch between two name variants (abbreviations or omitted name parts create complications) but I would start by replacing hyphens and periods with spaces and then slugifying each space-separated part of the name:
Clearly the first two differ only in split point, but there will need to be some fuzzy matching for the others.
So far that looks functionally identical to just slugifying first/last parts separately (without any preprocessing, as discarding hyphens and periods is what slugifying already does anyway).
Yeah, it occurs to me that one way to go is to add the split point in the actual URL if there is more than one hyphen in the current slug:
That would remove what so far has been an intentional feature. I think I'd rather have a separate process/script for detecting these kinds of "logical" errors and allowing/prompting us to deal with them, rather than having the Python library itself try to catch them or making it a necessity during ingestion.
How about the separate script and also a warning during ingestion? If it's a small venue being ingested, there may be capacity to fix it right away. Otherwise the warnings can be dealt with later.
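The check discussed in this thread could be sketched roughly as follows. This is only an illustration, not the library's implementation: `simple_slug` is a stand-in slugifier written for this example (the real slug function may behave differently), `find_split_mismatches` is a hypothetical helper, and the name list is illustrative data. It groups names by the slug of the full name and flags groups whose separately slugified first/last parts disagree, so diacritic or capitalization differences alone do not trigger a warning.

```python
import re
import unicodedata
from collections import defaultdict


def simple_slug(text: str) -> str:
    """Stand-in slugifier (assumption): strip diacritics, lowercase, hyphenate."""
    decomposed = unicodedata.normalize("NFKD", text)
    no_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return re.sub(r"[^a-z0-9]+", "-", no_marks.lower()).strip("-")


def find_split_mismatches(names):
    """Group (first, last) pairs by full-name slug and return the groups
    whose per-part slugs differ, i.e. names that disagree on the split point."""
    by_full_slug = defaultdict(set)
    for first, last in names:
        full = simple_slug(f"{first} {last}")
        by_full_slug[full].add((simple_slug(first), simple_slug(last)))
    return {
        slug: sorted(splits)
        for slug, splits in by_full_slug.items()
        if len(splits) > 1
    }


# Illustrative data only; not taken from the actual XML files.
names = [
    ("Pedro", "Ortiz Suarez"),
    ("Pedro Ortiz", "Suarez"),   # same slug, different split point
    ("Pedro", "Ortiz Suárez"),   # differs only by a diacritic
    ("Fei", "Liu"),
]

mismatches = find_split_mismatches(names)
for slug, splits in mismatches.items():
    print(f"WARNING: {slug} has inconsistent first/last splits: {splits}")
```

Run as a standalone script, this would produce a list for manual review; the same function could also back a non-fatal warning during ingestion.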
Update: I have tried to add this now. My thinking is that if there was a "catch-all" ID ("May refer to several people" etc.) and there is only one other person defined in
Okay, I have both ORCID iD and unverified examples:
Can you force reload? That hash was missing a string-final
Still getting the error
Hmm, your error message suggests the checksum should not have the (The change here is I had to update Font Awesome to 5.11.0)
Works now! (I'm using Firefox)
Pretty sure this just depends on the version :)
I think I'd prefer something a lot more subtle for the unverified case, tbh.
Okay, that corroborates a thought I'd been having, too: if we release this new version, and every bit of unverified data is glaring out, it's going to look bad and trigger a ton of work.
I'm in favor of gradually making the lack of verification more prominent once we have well-established processes with the new model. For now:
I'm getting the error again...remove the
I think I should prepare a branch with all the data transition done, so we can have a PR that only shows the layout and/or logical changes, without all the data files included in "Files changed", as that tab is pretty much unusable for me right now...
This is a good idea. I'm behind on it. There have been a lot of other behind-the-scenes issues that have consumed my time. Maybe I should prioritize this, going year-by-year with the main conferences back to 2020, and then we can hold that static. Give me till the 27th?
@mjpost In my comment, "data transition" referred to transitioning to the new author system (which, among other things, writes lots of IDs to the XML files), while you seem to be talking about backfilling ORCIDs (which is related, but orthogonal to that)? In any case I'd imagine making a
This PR adds a script to implement the transition logic to the new author representation system. I'm adding some implementation notes below as a basis for discussing if the script should do anything differently from what I implemented so far.
Notes:
The commit history is noisy because the PR depends on v0.5.3 of the library, which isn't merged yet. As soon as Create acl-anthology release v0.5.3 #5405 is merged, I can update/force-rebase this PR.
name_variants.yaml (0914c18) and removes name variants that are never used (40dd8bf) to avoid errors.
The script currently doesn't set disable_name_matching: true anywhere, but it probably needs to for some authors to preserve the current logic; need to work that out.
Before actually running this script, we should:
Implementation notes
The implementation logic here is a bit trickier than I expected. I'm writing down some concrete examples here to illustrate what we need to consider.
The straightforward case
The name_variants.yaml has:
There are two instances of these names in the XML:
The Python library returns these two names and papers:
This is the straightforward case. The script generates an entry for the new people.yaml like this:
and modifies the XML like this:
Name variants not explicitly listed in the YAML
The name_variants.yaml has:
There are two kinds of instances (13 papers in total) in the XML:
Here, the canonical name is used without the ID in the XML, but a name variant is used with the ID without it being listed in name_variants.yaml. This is allowed in our current system.
The Python library returns both names and all 13 papers, as expected:
Generating the entry for people.yaml is therefore also straightforward, and should produce:
While the XML should be modified to include the ID also for the canonical name:
Mixture of explicit and implicitly-matched name variants
This is where it gets tricky. The name_variants.yaml has:
There are two names listed here, and no explicit ID, so no further names can be added to this person via an explicit ID in the XML. The generated ID, pedro-ortiz-suarez, does not exist in any XML file.
However, the Python library returns three names:
This is because the third name, {first: "Pedro Ortiz", last: "Suarez"}, was attached to this person by our implicit name matching mechanism, which assumes names refer to the same person if they slugify to the same string (for reference, here is where this happens).
In my understanding of our new system, this name variant should not be written to the people.yaml since it was not manually defined by us (and is, in fact, just the result of an incorrect first/last split). In other words, the script should generate an entry for the new people.yaml like this:
and modify the XML like this:
IDs that "may refer to several people"
In a few instances, our name_variants.yaml contains entries simply because they are currently required for technical reasons:
This is because there are several other "Fei Liu"s with their own respective IDs. Under the new system, no entry in people.yaml should be generated for this "catch-all" person, and existing references in the XML to this ID should be removed.
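To make the rules from these notes concrete, here is a rough sketch of the decision logic, under several assumptions: the dictionaries below only approximate the shape of name_variants.yaml and people.yaml entries, `build_people_entry` and `implicit_only` are hypothetical helpers (not functions of the script or the library), the second name variant is made up for illustration, and catch-all entries are recognized purely by a comment starting with "May refer to several people".

```python
CATCH_ALL_PREFIX = "May refer to several people"  # assumed marker for catch-all IDs


def build_people_entry(entry, generated_id):
    """Return (person_id, record) for a name_variants-style entry, or None
    if the entry is a catch-all that should not become a person."""
    if entry.get("comment", "").startswith(CATCH_ALL_PREFIX):
        return None
    # Only names that were explicitly defined are carried over.
    names = [entry["canonical"], *entry.get("variants", [])]
    return entry.get("id", generated_id), {"names": names}


def implicit_only(returned_names, entry):
    """Names attached by slug-based matching but never explicitly listed."""
    explicit = [entry["canonical"], *entry.get("variants", [])]
    return [name for name in returned_names if name not in explicit]


# Illustrative entry without an explicit ID (the variant is hypothetical).
entry = {
    "canonical": {"first": "Pedro", "last": "Ortiz Suarez"},
    "variants": [{"first": "Pedro Javier", "last": "Ortiz Suárez"}],
}
# Names as the library might return them, including one implicit match.
returned = [
    entry["canonical"],
    entry["variants"][0],
    {"first": "Pedro Ortiz", "last": "Suarez"},
]

person = build_people_entry(entry, generated_id="pedro-ortiz-suarez")
skipped = implicit_only(returned, entry)

catch_all = {
    "canonical": {"first": "Fei", "last": "Liu"},
    "id": "fei-liu",
    "comment": "May refer to several people",
}
catch_all_person = build_people_entry(catch_all, generated_id="fei-liu")

print(person)
print(skipped)
print(catch_all_person)
```

In this sketch the implicitly matched name is reported separately rather than written to the new entry, and the catch-all entry yields no person at all; removing the corresponding ID references from the XML would be a separate step.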