Draft
77 commits
0914c18
Fix problems in name_variants.yaml
mbollmann Jul 18, 2025
40dd8bf
Remove unused name variants
mbollmann Jul 18, 2025
ecf12cc
Merge branch 'python-dev' into transition-to-people-yaml
mbollmann Jul 18, 2025
407823f
Add script to transition metadata to new author representation
mbollmann Jul 18, 2025
34bf474
Fix small (potential) mistake when IDs to delete are editors
mbollmann Jul 18, 2025
10a9e58
Make logger use stderr (#5474)
mbollmann Jul 18, 2025
fddd2f0
Merge branch 'master' into transition-to-people-yaml
mbollmann Aug 2, 2025
798597e
Check when disable_name_matching: true is needed and add it
mbollmann Aug 3, 2025
1479c56
Merge branch 'master' into python-dev
mbollmann Aug 9, 2025
ee78650
Add ORCID field to Person
mbollmann Jul 18, 2025
309ad65
Add new fields to NameSpec and Person, add check for verified IDs
mbollmann Aug 3, 2025
5b9ca84
Removed outdated special case when slugifying
mbollmann Aug 3, 2025
aa2411a
Switch from name_variants.yaml to people.yaml & new name resolution l…
mbollmann Aug 3, 2025
248009d
Transition test data & fix tests outside of personindex_test.py
mbollmann Aug 3, 2025
3471119
Remove tests for get_or_create_person, fix remaining ones
mbollmann Aug 3, 2025
7ee6ac8
Refactor get_or_create_person to resolve_namespec, refactor exceptions
mbollmann Aug 3, 2025
983bbc8
Refactor exceptions (again), add checks for ORCID on NameSpecification
mbollmann Aug 3, 2025
44ae702
Add ORCID validation (incl. checksum)
mbollmann Aug 3, 2025
f287f16
Add integration test for PersonIndex, currently expected to fail
mbollmann Aug 3, 2025
2c11c80
Bump Codecov action to v5
mbollmann Aug 8, 2025
a582685
Add by_orcid, rename name_to_ids to by_name
mbollmann Aug 8, 2025
e5511f4
Disallow person IDs starting with numbers
mbollmann Aug 8, 2025
5fe6470
Add tests for name resolution logic
mbollmann Aug 9, 2025
bfc8bbc
Increase test coverage, fix small bug (checked for wrong exception)
mbollmann Aug 9, 2025
4d7b39b
Add function & tests for ingestion logic
mbollmann Aug 9, 2025
59ef07b
Update CHANGELOG
mbollmann Aug 9, 2025
5f20fc2
Refactor Person.names to store if NameLink was EXPLICIT or INFERRED
mbollmann Aug 9, 2025
fb64a67
Add save functionality for people.yaml
mbollmann Aug 9, 2025
e65733f
Let changes to Person automatically update PersonIndex
mbollmann Aug 10, 2025
6f6bbe2
Add Person.make_explicit + more people.yaml saving tests
mbollmann Aug 10, 2025
e5330c8
Refactor PersonIndex tests & add check for duplicate ORCIDs
mbollmann Aug 10, 2025
cbecb5a
Move PersonIndex fields behind getters that auto-load data
mbollmann Aug 10, 2025
157e313
Add Person.update_id
mbollmann Aug 10, 2025
43575a4
Automatically call ingest_namespec() on create_ functions
mbollmann Aug 10, 2025
61cedf7
Add PersonIndex.create_person()
mbollmann Aug 10, 2025
7ac5e12
Update documentation (WIP)
mbollmann Aug 10, 2025
d5c3aa8
Merge branch 'master' into transition-to-people-yaml
mbollmann Aug 11, 2025
91b5551
Fix disallowed uppercase letter in person ID
mbollmann Aug 11, 2025
10a317f
Change slugs_to_verified_ids to contain sets (fixes bug with IDs bein…
mbollmann Aug 11, 2025
7a95834
Implement caching for PersonIndex
mbollmann Aug 13, 2025
e83b5f5
Revert "Implement caching for PersonIndex"
mbollmann Aug 13, 2025
3682487
Improve test coverage & make Person.set_canonical_name private
mbollmann Aug 18, 2025
0bf8332
Update documentation
mbollmann Aug 18, 2025
d4c7d10
Merge pull request #5472 from acl-org/python-author-refactor
mbollmann Aug 19, 2025
0463601
Merge branch 'master' into python-dev
mbollmann Aug 21, 2025
130ebcc
Merge branch 'python-dev' into transition-to-people-yaml
mbollmann Aug 21, 2025
fddcc1b
Hotfix for empty MarkupText serialization
mbollmann Aug 21, 2025
6a9e08a
Merge branch 'python-dev' into transition-to-people-yaml
mbollmann Aug 21, 2025
37bb1aa
Run transition_to_people_yaml.py
mbollmann Aug 21, 2025
a3a132c
Enable people.yaml integration test
mbollmann Aug 21, 2025
5391e67
Fix logging (ensure all Rich output uses the same console)
mbollmann Aug 21, 2025
873267a
Fix Hugo template to work with unverified/ IDs
mbollmann Aug 21, 2025
47ce884
Add person-id -> unverified/perosn-id redirect to .htaccess
mbollmann Aug 21, 2025
568e3e6
Fix(?) .htaccess rule
mbollmann Aug 21, 2025
99cdfbb
Revert "Run transition_to_people_yaml.py"
mbollmann Aug 28, 2025
dc5163c
Merge branch 'master' into transition-to-people-yaml
mbollmann Aug 28, 2025
0765588
Run transition_to_people_yaml.py
mbollmann Aug 28, 2025
6eb4381
Add ORCID link on author page
mjpost Sep 19, 2025
fcce825
Export orcid in hugo data
mjpost Sep 19, 2025
7009334
Move ORCID icon
mjpost Sep 19, 2025
01c4803
Remove ORCID itself
mjpost Sep 19, 2025
02022c1
Add question mark for unverified authors
mjpost Sep 19, 2025
fb90000
Add verification page stub
mjpost Sep 19, 2025
9450198
black
mjpost Sep 19, 2025
ae80f4f
Ensmallen; use fontawesome for both
mjpost Sep 19, 2025
4265398
Bump fontawesome version 5.7.2 -> 5.11.0
mjpost Sep 19, 2025
0211a1f
Icon sizing
mjpost Sep 19, 2025
56a9262
Update checksum
mjpost Sep 19, 2025
8b0f159
Use fontawesome kit
mjpost Sep 19, 2025
7c6abf7
black
mjpost Sep 19, 2025
242ee65
Contra all docs everywhere, you need "fab"
mjpost Sep 19, 2025
8c048ef
Switch back to stylesheet
mjpost Sep 19, 2025
34a0aeb
fa-solid -> fas
mjpost Sep 19, 2025
3bdf8f4
Relativize link and lighten question mark
mjpost Sep 20, 2025
fd766f7
Set opacity for unverified people
mjpost Sep 20, 2025
06b8903
reference syntax
mjpost Sep 20, 2025
d6d2838
fontawesome hash
nschneid Sep 20, 2025
1 change: 1 addition & 0 deletions .ackrc
@@ -3,4 +3,5 @@
--ignore-directory=is:.mypy_cache
--ignore-directory=is:.pytest_cache
--ignore-directory=is:.ruff_cache
--ignore-directory=is:.venv
--ignore-directory=is:site
2 changes: 1 addition & 1 deletion .github/workflows/code-quality.yml
@@ -76,6 +76,6 @@ jobs:

       # Coverage report
       - name: Upload coverage reports to Codecov
-        uses: codecov/codecov-action@v4
+        uses: codecov/codecov-action@v5
         with:
           token: ${{ secrets.CODECOV_TOKEN }}
6 changes: 5 additions & 1 deletion bin/create_extra_bib.py
@@ -36,6 +36,7 @@
 import msgspec
 from pathlib import Path
 import re
+from rich.console import Console
 from rich.progress import track
 import shutil
 import subprocess
@@ -48,6 +49,7 @@

 BIB2XML = None
 XML2END = None
+CONSOLE = Console(stderr=True)

 # Max shard size in MiB
 MAX_SHARD_MB = 49
@@ -89,6 +91,7 @@ def create_bibtex(builddir, clean=False) -> None:
                 reverse=True,
             ),
             description="Create anthology.bib.gz... ",
+            console=CONSOLE,
         ):
             with open(volume_file, "r") as f:
                 bibtex = f.read()
@@ -124,6 +127,7 @@ def create_bibtex(builddir, clean=False) -> None:
                 reverse=True,
             ),
             description=" +abstracts.bib.gz... ",
+            console=CONSOLE,
         ):
             with open(collection_file, "rb") as f:
                 data = msgspec.json.decode(f.read())
@@ -351,7 +355,7 @@ def batch_convert_to_mods_and_endf(bibtex, context):
     )

     log_level = log.DEBUG if args["--debug"] else log.INFO
-    tracker = setup_rich_logging(level=log_level)
+    tracker = setup_rich_logging(console=CONSOLE, level=log_level)

     max_workers = int(args["--max-workers"]) if args["--max-workers"] else None
     if (BIB2XML := shutil.which("bib2xml")) is None:
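Note on the change above: routing both the progress bar and the logger through one stderr console keeps stdout free for actual program output. The sketch below illustrates that separation using only the standard library `logging` module as a stand-in; it does not use Rich or `setup_rich_logging`, and the function and logger names are hypothetical.

```python
import contextlib
import io
import logging
import sys


def run_with_stderr_logging():
    """Log a status message to stderr and print a result to stdout."""
    # Hypothetical logger name; stdlib logging stands in for the Rich console.
    logger = logging.getLogger("stderr_demo")
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(sys.stderr)  # diagnostics go to stderr
    logger.addHandler(handler)
    try:
        logger.info("Create anthology.bib.gz...")
        print("result-data")  # real output goes to stdout
    finally:
        logger.removeHandler(handler)


# Capture both streams to show that they stay separate.
out, err = io.StringIO(), io.StringIO()
with contextlib.redirect_stdout(out), contextlib.redirect_stderr(err):
    run_with_stderr_logging()
```

With this split, piping stdout into another tool would presumably not pick up progress or log lines.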
8 changes: 6 additions & 2 deletions bin/create_hugo_data.py
@@ -39,6 +39,7 @@
 import msgspec
 from omegaconf import OmegaConf
 import os
+from rich.console import Console
 from rich.progress import (
     Progress,
     TextColumn,
@@ -60,6 +61,7 @@


 BIBLIMIT = None
+CONSOLE = Console(stderr=True)
 ENCODER = msgspec.json.Encoder()
 SCRIPTDIR = os.path.dirname(os.path.realpath(__file__))

@@ -93,7 +95,7 @@ def make_progress():
         TaskProgressColumn(show_speed=True),
         TimeRemainingColumn(elapsed_when_finished=True),
     ]
-    return Progress(*columns)
+    return Progress(*columns, console=CONSOLE)


 @cache
@@ -396,6 +398,8 @@ def export_people(anthology, builddir, dryrun):
             data["full"] = f"{data['full']} ({', '.join(diff_script_variants)})"
         if person.comment is not None:
             data["comment"] = person.comment
+        if person.orcid is not None:
+            data["orcid"] = person.orcid
         similar = anthology.people.similar.subset(person_id)
         if len(similar) > 1:
             data["similar"] = list(similar - {person_id})
@@ -567,7 +571,7 @@ def export_anthology(anthology, builddir, clean=False, dryrun=False):
     )

     log_level = log.DEBUG if args["--debug"] else log.INFO
-    tracker = setup_rich_logging(level=log_level)
+    tracker = setup_rich_logging(console=CONSOLE, level=log_level)

     if limit := args["--bib-limit"]:
         BIBLIMIT = int(limit)
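Note on the exported `orcid` field: the commit list mentions ORCID validation including a checksum, but that code is not part of this excerpt. The sketch below shows the standard ISO 7064 MOD 11-2 check that ORCID iDs use. It is an illustration only, not the library's implementation, and the function names are hypothetical.

```python
import re

# Hyphenated form: four groups of four characters; the last may be "X".
_ORCID_PATTERN = re.compile(r"[0-9]{4}-[0-9]{4}-[0-9]{4}-[0-9]{3}[0-9X]")


def orcid_check_digit(base_digits: str) -> str:
    """Compute the ISO 7064 MOD 11-2 check character for 15 base digits."""
    total = 0
    for ch in base_digits:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def is_valid_orcid(orcid: str) -> bool:
    """Return True if the string has ORCID format and a correct checksum."""
    if _ORCID_PATTERN.fullmatch(orcid) is None:
        return False
    digits = orcid.replace("-", "")
    return orcid_check_digit(digits[:15]) == digits[15]


print(is_valid_orcid("0000-0002-1825-0097"))  # ORCID's documented sample iD
```

A check like this could reject mistyped iDs before they reach the exported data, assuming the stored value uses the hyphenated form.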
260 changes: 260 additions & 0 deletions bin/oneoff/transition_to_people_yaml.py
@@ -0,0 +1,260 @@
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
#
# Copyright 2025 Marcel Bollmann <[email protected]>
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

"""Usage: transition_to_people_yaml.py [options]

Creates people.yaml and rewrites author IDs in the XML according to <https://github.com/acl-org/acl-anthology/wiki/Author-Page-Plan#transitioning-the-metadata>.

Options:
  --debug                  Output debug-level log messages.
  -d, --datadir=DIR        Directory with data files. [default: {scriptdir}/../../data]
  -x, --write-xml          Write changes to the XML files.
  -y, --write-yaml         Write the new people.yaml.
  -h, --help               Display this helpful text.
"""

from collections import defaultdict
from docopt import docopt
from importlib.metadata import version as get_version
import itertools as it
import logging as log
import os
from pathlib import Path
import yaml

try:
    from yaml import CLoader as Loader, CDumper as Dumper
except ImportError:  # pragma: no cover
    from yaml import Loader, Dumper  # type: ignore

from acl_anthology import Anthology
from acl_anthology.people import Name
from acl_anthology.utils.logging import setup_rich_logging


def parse_variant_list(anthology):
    # We create a dictionary mapping person IDs to their original entry in
    # name_variants.yaml; this is because there are fields in name_variants.yaml
    # that the Python library does not store (such as 'orcid' or 'degree'), and
    # we might want to transfer them to the new people.yaml
    name_variants = {}
    with open(
        anthology.datadir / "yaml" / "name_variants.yaml", "r", encoding="utf-8"
    ) as f:
        variant_list = yaml.load(f, Loader=Loader)
    for entry in variant_list:
        if "id" in entry:
            name_variants[entry["id"]] = entry
        else:
            people = anthology.people.get_by_name(Name.from_dict(entry["canonical"]))
            assert (
                len(people) == 1
            ), "Canonical name in name_variants.yaml shouldn't be ambiguous"
            name_variants[people[0].id] = entry
    return name_variants


# This exists to serialize names in "flow" style (i.e. one-liner {first: ...,
# last: ...}), without having to force flow style on the entire YAML document
class YAMLName(yaml.YAMLObject):
    yaml_dumper = Dumper
    yaml_tag = "tag:yaml.org,2002:map"  # serialize like a dictionary
    yaml_flow_style = True  # force flow style

    def __init__(self, first, last, script):
        if first is not None:
            self.first = first
        self.last = last
        if script is not None:
            self.script = script


def name_to_yaml(name):
    return YAMLName(name.first, name.last, name.script)


def refactor(anthology, name_variants):
    new_people_dict = {}
    c_removed, c_added = 0, 0

    # These two are to infer if we need to set disable_name_matching: true somewhere
    names_to_ids = defaultdict(list)
    names_with_catchall_id = []
    c_disable_name_matching = 0

    for pid, person in anthology.people.items():
        # We only consider people who are currently defined in name_variants.yaml
        if not person.is_explicit:
            continue

        orig_entry = name_variants[pid]

        # name_variants.yaml may define IDs that are actually never used
        if not person.item_ids:
            log.warning(
                f"Person '{pid}' derived from name_variants.yaml has no papers; discarding"
            )
            continue

        # If person has a comment like "May refer to multiple people" or "May
        # refer to several people", their identity is "unverified", so we:
        # - Don't write them to people.yaml
        # - Remove their ID from the XML
        if person.comment is not None and person.comment.startswith("May refer"):
            log.debug(f"Removing ID '{pid}' ('{person.comment}')")
            for paper in person.papers():
                # Remove their ID from the XML
                for namespec in it.chain(paper.authors, paper.get_editors()):
                    if namespec.id == pid:
                        namespec.id = None
                        c_removed += 1

            # Record the name(s) of this person so we can check later if this ID
            # was important for disambiguation
            names_with_catchall_id.extend(person.names)

            # Don't process this person further
            continue

        # If we reach this point, this person should be considered "verified"
        # under the new system. However, maybe not all of their *names* should
        # go into people.yaml---a name can have been added to `person.names` in
        # different ways:
        #
        # 1. It was listed explicitly in `name_variants.yaml` -- keep
        # 2. It was in the XML with this person's explicit ID -- keep
        # 3. It was added to this person via the name matching mechanism that
        #    compares slugified names -- don't keep, as it was inferred heuristically
        #
        # (This happens in <https://github.com/acl-org/acl-anthology/blob/170ff9706aba87de0e353da690e6b0bb33ea6a98/python/acl_anthology/people/index.py#L252-L299>)
        c = 0
        names_to_keep = {Name.from_dict(orig_entry["canonical"])} | {
            Name.from_dict(name) for name in orig_entry.get("variants", [])
        }  # Case 1

        for paper in person.papers():
            for namespec in it.chain(paper.authors, paper.get_editors()):
                if namespec.id == pid:
                    names_to_keep.add(namespec.name)  # Case 2
                    break
            else:
                # Does *not* already have an explicit ID in the XML; add it.
                # ---
                # NOTE: Doing this in a separate loop to avoid the edge case where
                # a paper might have two authors with identical names,
                # disambiguated by their ID---not sure if that ever happens, but
                # better be safe than sorry.
                for namespec in it.chain(paper.authors, paper.get_editors()):
                    if person.has_name(namespec.name):
                        if namespec.name in names_to_keep:  # Avoid case 3
                            namespec.id = pid
                            c += 1
                            c_added += 1
                        break
                else:
                    # Should never happen
                    log.error(
                        f"Did not find '{pid}' on paper '{paper.full_id}' connected to them",
                    )

        if c > 0:
            log.debug(f"Added explicit ID '{pid}' to {c} papers")

        for name in person.names:
            names_to_ids[name].append(pid)

        # Construct entry for new people.yaml
        entry = {
            # First name is always the canonical one
            "names": [
                name_to_yaml(name) for name in person.names if name in names_to_keep
            ],
        }
        if person.comment is not None:
            entry["comment"] = person.comment
        # These are keys we copy over from the old name_variants.yaml
        for key in ("degree", "similar", "orcid"):
            if key in orig_entry:
                entry[key] = orig_entry[key]

        new_people_dict[pid] = entry

    for name in names_with_catchall_id:
        pids = names_to_ids.get(name, [])
        if len(pids) == 1:
            # There is only one "verified" person with this name, but there was
            # a catch-all ID ("May refer to several people") with this name too,
            # so we need to disable name matching under the new system
            new_people_dict[pids[0]]["disable_name_matching"] = True
            c_disable_name_matching += 1

    log.info(
        f"Removed {c_removed:>5d} explicit IDs from the XML ('May refer to several people' etc.)"
    )
    log.info(f"  Added {c_added:>5d} explicit IDs to the XML")
    log.info(f"Created {len(new_people_dict):>5d} entries for people.yaml")
    log.info(
        f"        {c_disable_name_matching:>5d} of those have `disable_name_matching: true`"
    )

    return new_people_dict


if __name__ == "__main__":
    args = docopt(__doc__)

    log_level = log.DEBUG if args["--debug"] else log.INFO
    tracker = setup_rich_logging(level=log_level)

    if (version := get_version("acl_anthology")) != "0.5.3":
        log.error(
            f"This script needs to run with version 0.5.3 of the acl-anthology library; got {version}"
        )
        exit(1)

    if "{scriptdir}" in args["--datadir"]:
        args["--datadir"] = os.path.abspath(
            args["--datadir"].format(scriptdir=os.path.dirname(os.path.abspath(__file__)))
        )
    datadir = Path(args["--datadir"])
    log.info(f"Using data directory {datadir}")

    anthology = Anthology(datadir=datadir)
    anthology.load_all()

    name_variants = parse_variant_list(anthology)
    log.info(f"  Found {len(name_variants):>5d} entries in name_variants.yaml")

    new_people_dict = refactor(anthology, name_variants)

    if tracker.highest >= log.ERROR:
        log.warning("There were errors; aborting without saving")
        exit(1)

    if args["--write-yaml"]:
        log.info("Writing new people.yaml...")
        with open(datadir / "yaml" / "people.yaml", "w", encoding="utf-8") as f:
            yaml.dump(new_people_dict, f, allow_unicode=True, Dumper=Dumper)
    else:
        log.warning("Not writing people.yaml; use -y/--write-yaml flag")

    if args["--write-xml"]:
        log.info("Saving XML files...")
        for collection in anthology.collections.values():
            collection.save()
    else:
        log.warning("Not modifying XML files; use -x/--write-xml flag")
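Note on the `disable_name_matching` inference in the script above: the rule can be shown in isolation with plain dictionaries. This is a minimal sketch, assuming names are plain strings rather than `Name` objects; the function name and the sample IDs are hypothetical and are not taken from the Anthology data.

```python
from collections import defaultdict


def infer_disable_name_matching(verified, catchall_names):
    """Return IDs of verified people who need disable_name_matching.

    verified:       dict mapping person ID -> list of that person's names
    catchall_names: names that used to carry a catch-all ID
                    ("May refer to several people")
    """
    names_to_ids = defaultdict(list)
    for pid, names in verified.items():
        for name in names:
            names_to_ids[name].append(pid)

    flagged = set()
    for name in catchall_names:
        pids = names_to_ids.get(name, [])
        # Exactly one verified person shares the name: without the flag,
        # papers that lost their catch-all ID would match this person.
        if len(pids) == 1:
            flagged.add(pids[0])
    return flagged


# Hypothetical sample data
verified = {
    "jane-doe-mit": ["Jane Doe", "J. Doe"],
    "john-smith": ["John Smith"],
    "john-smith-2": ["John Smith"],
}
catchall = ["Jane Doe", "John Smith", "Alex Roe"]
flagged = infer_disable_name_matching(verified, catchall)
print(sorted(flagged))  # ['jane-doe-mit']
```

"John Smith" is not flagged here because two verified people share the name, so it is ambiguous already; "Alex Roe" has no verified person at all.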
2 changes: 1 addition & 1 deletion data/xml/1952.earlymt.xml
@@ -40,7 +40,7 @@
     </paper>
     <paper id="6">
       <title>Human translation versus machine translation</title>
-      <author><first>Leon</first><last>Dostert</last></author>
+      <author id="leon-dostert"><first>Leon</first><last>Dostert</last></author>
       <bibkey>dostert-1952-human</bibkey>
     </paper>
     <paper id="7">
2 changes: 1 addition & 1 deletion data/xml/1956.earlymt.xml
@@ -28,7 +28,7 @@
     </paper>
     <paper id="3">
       <title>Organisation and Method in Mechanical Translation Work</title>
-      <author><first>L. E.</first><last>Dostert</last></author>
+      <author id="leon-dostert"><first>L. E.</first><last>Dostert</last></author>
       <url hash="ec10d336">1956.earlymt-1.3</url>
       <bibkey>dostert-1956-organisation</bibkey>
     </paper>
2 changes: 1 addition & 1 deletion data/xml/1957.earlymt.xml
@@ -6,7 +6,7 @@
       <address>Georgetown University</address>
       <month>12-13 April</month>
       <year>1957</year>
-      <editor><first>Léon</first><last>Dostert</last></editor>
+      <editor id="leon-dostert"><first>Léon</first><last>Dostert</last></editor>
       <venue>earlymt</venue>
     </meta>
     <frontmatter>
2 changes: 1 addition & 1 deletion data/xml/1960.earlymt.xml
@@ -56,7 +56,7 @@
     </paper>
     <paper id="8">
       <title>Summation by Chairman</title>
-      <author><first>Leon</first><last>Dostert</last></author>
+      <author id="leon-dostert"><first>Leon</first><last>Dostert</last></author>
       <url hash="86d780f0">1960.earlymt-nsmt.8</url>
       <bibkey>dostert-1960-summation</bibkey>
     </paper>