Slugifying variations in get_canonicals_from_reporter discards some data #5389

@grossir

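While writing a PR for the parent issue I noticed the same buggy pattern in get_canonicals_from_reporter: it builds a dictionary keyed by the slugified variant, so when two different variants slugify to the same string, the later entry overwrites the earlier one and the earlier canonicals are lost.

There are 211 such collisions, but most of them lead to the same canonicals, so I guess that's OK. Examples of these: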

# collision for variant slug 'uspq2d-bna'
# Variant: 'U.S.P.Q.2d (BNA)' ['uspq-2d-bna'] 
# Variant: 'U.S.P.Q.2D (BNA)' ['uspq-2d-bna']
# ------------'
# collision for variant slug 'uspq'
# Variant: 'USPQ' ['uspq-bna'] 
# Variant: 'U.S.P.Q.' ['uspq-bna']
# ------------'
# collision for variant slug 'utah-app'
# Variant: 'Utah App.' ['ut-app'] 
# Variant: 'Utah App' ['ut-app']
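
To make the collapse concrete, here is a minimal sketch of why distinct variants end up under one key. `approx_slugify` is a hypothetical stand-in written for this example; it is assumed to approximate the ASCII behaviour of Django's `slugify` and is not the real function. The variant strings are copied from the collision lists in this issue.

```python
import re
import unicodedata


def approx_slugify(value: str) -> str:
    """Hypothetical stand-in approximating Django's slugify (ASCII mode)."""
    # Drop characters with no ASCII equivalent (e.g. a curly apostrophe)
    value = (
        unicodedata.normalize("NFKD", value)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    # Lowercase, then strip everything except word chars, whitespace, hyphens
    value = re.sub(r"[^\w\s-]", "", value.lower())
    # Collapse runs of whitespace and hyphens into a single hyphen
    return re.sub(r"[-\s]+", "-", value).strip("-_")


# Variant pairs taken from the collision lists in this issue
pairs = [
    ("U.S.P.Q.2d (BNA)", "U.S.P.Q.2D (BNA)"),
    ("Utah App.", "Utah App"),
    ("V.R.", "Vr."),
    ("Hayw. & H.", "Hayw.& H."),
]
for first, second in pairs:
    # Both members of each pair produce the same slug, hence the same dict key
    print(f"{first!r} and {second!r} -> {approx_slugify(first)!r}")
```

Punctuation, case and spacing are the only things distinguishing these variants, and all three are discarded by slugification.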

The following 14 seem problematic, leading to different canonicals:

collision for variant slug 'vr'
Variant: 'V.R.' ['vt'] 
Variant: 'Vr.' ['vroom']
------------'
collision for variant slug 'br'
Variant: 'B.R.' ['balt-c-rep'] 
Variant: 'BR' ['br']
------------'
collision for variant slug 'black-rep'
Variant: 'Black. Rep.' ['blackf'] 
Variant: 'Black Rep.' ['black']
------------'
collision for variant slug 'cal-app-2d-supp'
Variant: 'Cal. App. 2d Supp.' ['cal-app-supp-2d'] 
Variant: 'Cal. App. 2d Supp' ['cal-app-2d']
------------'
collision for variant slug 'clr'
Variant: 'CLR' ['conn-l-rptr'] 
Variant: 'Cl.R.' ['cl-ch']
------------'
collision for variant slug 'dec-commr-pat'
Variant: 'Dec. Comm’r Pat.' ['dec-commr-pat'] 
Variant: 'Dec. Commr. Pat.' ['dec-com-pat']
------------'
collision for variant slug 'hayw-h'
Variant: 'Hayw. & H.' ['hayw-hdc'] 
Variant: 'Hayw.& H.' ['hay-haz']
------------'
collision for variant slug 'how'
Variant: 'How.' ['howard'] 
Variant: 'HOW' ['how']
------------'
collision for variant slug 'johnsny'
Variant: 'Johns.(N.Y.)' ['johns-ch'] 
Variant: 'Johns.N.Y.' ['johns']
------------'
collision for variant slug 'mt'
Variant: 'Mt.' ['mont'] 
Variant: 'mt' ['mt']
------------'
collision for variant slug 'okla'
Variant: 'Okla.' ['okla-crim'] 
Variant: 'OKla.' ['okla']
------------'
collision for variant slug 'pac'
Variant: 'Pa.C.' ['pa-commw'] 
Variant: 'Pac.' ['p']
------------'
collision for variant slug 'sc'
Variant: 'Sc.' ['scam'] 
Variant: 'S.C.' ['s-ct']
------------'
collision for variant slug 'wash'
Variant: 'Wash.' ['wash-terr'] 
Variant: 'WASH' ['wash']

def get_canonicals_from_reporter(reporter_slug: str) -> list[SafeString]:
    """
    Disambiguates a reporter slug using a list of variations.
    The list of variations is a dictionary that maps each variation
    to a list of reporters that it could be possibly referring to.
    Args:
        reporter_slug (str): The reporter's name in slug format
    Returns:
        list[str]: A list of potential canonical names for the reporter
    """
    slugified_variations = {}
    for variant, canonicals in VARIATIONS_ONLY.items():
        slugged_canonicals = []
        for canonical in canonicals:
            slugged_canonicals.append(slugify(canonical))
        slugified_variations[str(slugify(variant))] = slugged_canonicals
    return slugified_variations.get(reporter_slug, [])
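
One possible way to avoid the overwrite is to merge the canonicals of every variant that shares a slug instead of assigning to the key. The following is only a sketch under stated assumptions, not the fix adopted in the project: `approx_slugify`, `SAMPLE_VARIATIONS`, `build_slug_index` and `get_canonicals` are hypothetical names, the stand-in slugifier only approximates Django's `slugify`, and the sample dictionary reuses the slugged canonicals printed in this issue rather than the real `VARIATIONS_ONLY` data.

```python
import re
import unicodedata
from collections import defaultdict


def approx_slugify(value: str) -> str:
    """Hypothetical stand-in approximating Django's slugify (ASCII mode)."""
    value = (
        unicodedata.normalize("NFKD", value)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    value = re.sub(r"[^\w\s-]", "", value.lower())
    return re.sub(r"[-\s]+", "-", value).strip("-_")


# Hypothetical sample shaped like VARIATIONS_ONLY (variant -> canonicals)
SAMPLE_VARIATIONS = {
    "V.R.": ["vt"],
    "Vr.": ["vroom"],
    "Utah App.": ["ut-app"],
    "Utah App": ["ut-app"],
}


def build_slug_index(variations: dict[str, list[str]]) -> dict[str, list[str]]:
    """Map each variant slug to the union of canonicals of all its variants."""
    index: defaultdict[str, list[str]] = defaultdict(list)
    for variant, canonicals in variations.items():
        bucket = index[approx_slugify(variant)]
        for canonical in canonicals:
            slug = approx_slugify(canonical)
            # Extend rather than overwrite; skip duplicates, keep order
            if slug not in bucket:
                bucket.append(slug)
    return dict(index)


def get_canonicals(reporter_slug: str, index: dict[str, list[str]]) -> list[str]:
    """Look up all candidate canonicals for a reporter slug."""
    return index.get(reporter_slug, [])


SLUG_INDEX = build_slug_index(SAMPLE_VARIATIONS)
print(get_canonicals("vr", SLUG_INDEX))
print(get_canonicals("utah-app", SLUG_INDEX))
```

With merging, a slug such as 'vr' yields both candidates, while benign collisions such as 'utah-app' still resolve to a single canonical. Building the index once, outside the lookup function, also avoids re-slugifying every variation on each call.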


To reproduce

from django.template.defaultfilters import slugify
from reporters_db import VARIATIONS_ONLY

prevnames = {}  # variant slug -> last variant seen with that slug
slugified_variations = {}  # variant slug -> slugged canonicals
collision_count = 0
problem_count = 0
for variant, canonicals in VARIATIONS_ONLY.items():
    slugged_canonicals = [slugify(canonical) for canonical in canonicals]
    # Compute the slug once per variant, outside the loop over canonicals
    variant_slug = str(slugify(variant))
    if slugified_variations.get(variant_slug):
        collision_count += 1
        # Only the first canonical of each entry is compared
        if slugified_variations[variant_slug][0] != slugged_canonicals[0]:
            problem_count += 1
            print(
                f"collision for variant slug '{variant_slug}'\nVariant: '{variant}'",
                slugged_canonicals,
                f"\nVariant: '{prevnames[variant_slug]}'",
                slugified_variations[variant_slug],
                end="\n------------'\n",
            )

    prevnames[variant_slug] = variant
    slugified_variations[variant_slug] = slugged_canonicals

# Summary of total collisions and of those with differing canonicals
print(f"{collision_count} collisions, {problem_count} with different canonicals")
Status

Done