Closed
Description
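While writing a PR for the parent issue, I noticed the same buggy pattern in get_canonicals_from_reporter: when two variants slugify to the same string, the later one overwrites the earlier one in the lookup dictionary.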
There are 211 collisions, but most of them lead to the same canonicals, so I guess that's OK. Examples of these:
# collision for variant slug 'uspq2d-bna'
# Variant: 'U.S.P.Q.2d (BNA)' ['uspq-2d-bna']
# Variant: 'U.S.P.Q.2D (BNA)' ['uspq-2d-bna']
# ------------'
# collision for variant slug 'uspq'
# Variant: 'USPQ' ['uspq-bna']
# Variant: 'U.S.P.Q.' ['uspq-bna']
# ------------'
# collision for variant slug 'utah-app'
# Variant: 'Utah App.' ['ut-app']
# Variant: 'Utah App' ['ut-app']
The following 14 seem problematic, leading to different canonicals:
collision for variant slug 'vr'
Variant: 'V.R.' ['vt']
Variant: 'Vr.' ['vroom']
------------'
collision for variant slug 'br'
Variant: 'B.R.' ['balt-c-rep']
Variant: 'BR' ['br']
------------'
collision for variant slug 'black-rep'
Variant: 'Black. Rep.' ['blackf']
Variant: 'Black Rep.' ['black']
------------'
collision for variant slug 'cal-app-2d-supp'
Variant: 'Cal. App. 2d Supp.' ['cal-app-supp-2d']
Variant: 'Cal. App. 2d Supp' ['cal-app-2d']
------------'
collision for variant slug 'clr'
Variant: 'CLR' ['conn-l-rptr']
Variant: 'Cl.R.' ['cl-ch']
------------'
collision for variant slug 'dec-commr-pat'
Variant: 'Dec. Comm’r Pat.' ['dec-commr-pat']
Variant: 'Dec. Commr. Pat.' ['dec-com-pat']
------------'
collision for variant slug 'hayw-h'
Variant: 'Hayw. & H.' ['hayw-hdc']
Variant: 'Hayw.& H.' ['hay-haz']
------------'
collision for variant slug 'how'
Variant: 'How.' ['howard']
Variant: 'HOW' ['how']
------------'
collision for variant slug 'johnsny'
Variant: 'Johns.(N.Y.)' ['johns-ch']
Variant: 'Johns.N.Y.' ['johns']
------------'
collision for variant slug 'mt'
Variant: 'Mt.' ['mont']
Variant: 'mt' ['mt']
------------'
collision for variant slug 'okla'
Variant: 'Okla.' ['okla-crim']
Variant: 'OKla.' ['okla']
------------'
collision for variant slug 'pac'
Variant: 'Pa.C.' ['pa-commw']
Variant: 'Pac.' ['p']
------------'
collision for variant slug 'sc'
Variant: 'Sc.' ['scam']
Variant: 'S.C.' ['s-ct']
------------'
collision for variant slug 'wash'
Variant: 'Wash.' ['wash-terr']
Variant: 'WASH' ['wash']
courtlistener/cl/citations/utils.py
Lines 92 to 112 in aeddcaa
def get_canonicals_from_reporter(reporter_slug: str) -> list[SafeString]:
    """
    Disambiguates a reporter slug using a list of variations.

    The list of variations is a dictionary that maps each variation
    to a list of reporters that it could be possibly referring to.

    Args:
        reporter_slug (str): The reporter's name in slug format

    Returns:
        list[str]: A list of potential canonical names for the reporter
    """
    slugified_variations = {}
    for variant, canonicals in VARIATIONS_ONLY.items():
        slugged_canonicals = []
        for canonical in canonicals:
            slugged_canonicals.append(slugify(canonical))
        slugified_variations[str(slugify(variant))] = slugged_canonicals

    return slugified_variations.get(reporter_slug, [])
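One possible way to avoid losing canonicals on a collision is to merge them per variant slug instead of overwriting. The following is only a minimal sketch, not the fix that was actually merged: `simple_slugify` is a stand-in that approximates Django's `slugify` so the example runs without Django, `build_slug_index` is a hypothetical helper name, and `SAMPLE_VARIATIONS` is a tiny illustrative dictionary that reuses variants from this issue with already-slugged canonicals in place of the real `VARIATIONS_ONLY` data.

```python
import re
import unicodedata


def simple_slugify(value: str) -> str:
    """Stand-in approximating Django's slugify (assumption, not the real import)."""
    value = (
        unicodedata.normalize("NFKD", value)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    # Drop everything except word characters, whitespace, and hyphens.
    value = re.sub(r"[^\w\s-]", "", value.lower())
    # Collapse runs of whitespace/hyphens into a single hyphen.
    return re.sub(r"[-\s]+", "-", value).strip("-_")


def build_slug_index(variations, slugify_fn=simple_slugify):
    """Map each variant slug to the union of its slugged canonicals.

    Hypothetical helper: colliding variant slugs extend the existing
    list (in first-seen order, without duplicates) instead of replacing it.
    """
    index = {}
    for variant, canonicals in variations.items():
        bucket = index.setdefault(str(slugify_fn(variant)), [])
        for canonical in canonicals:
            slug = str(slugify_fn(canonical))
            if slug not in bucket:
                bucket.append(slug)
    return index


# Illustrative sample only; canonicals are given as slugs for brevity.
SAMPLE_VARIATIONS = {
    "V.R.": ["vt"],
    "Vr.": ["vroom"],
    "Utah App.": ["ut-app"],
    "Utah App": ["ut-app"],
}

index = build_slug_index(SAMPLE_VARIATIONS)
print(index.get("vr", []))        # both canonicals survive the collision
print(index.get("utah-app", []))  # harmless collision stays a single entry
```

With merging, a lookup for an ambiguous slug such as 'vr' returns every candidate canonical, which matches the function's documented intent of returning "a list of potential canonical names". Whether callers can handle multiple candidates is a separate question this sketch does not address.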
To reproduce
from reporters_db import EDITIONS, VARIATIONS_ONLY
from django.template.defaultfilters import slugify

prevnames = {}
slugified_variations = {}
collision_count = 0
problem_count = 0
for variant, canonicals in VARIATIONS_ONLY.items():
    slugged_canonicals = []
    for canonical in canonicals:
        slugged_canonicals.append(slugify(canonical))
    variant_slug = str(slugify(variant))
    if slugified_variations.get(variant_slug):
        collision_count += 1
        if slugified_variations[variant_slug][0] != slugged_canonicals[0]:
            problem_count += 1
            print(f"collision for variant slug '{variant_slug}'\nVariant: '{variant}'",
                  slugged_canonicals,
                  f"\nVariant: '{prevnames[variant_slug]}'",
                  slugified_variations[variant_slug],
                  end="\n------------'\n"
                  )
    prevnames[variant_slug] = variant
    slugified_variations[variant_slug] = slugged_canonicals