feat: add NER label normalization across scispaCy models (#538) #574

Open
Ashut0sh-mishra wants to merge 2 commits into allenai:main from
Ashut0sh-mishra:feat/ner-label-normalization-538

@Ashut0sh-mishra

Dug into #538 — turns out the root issue is pretty simple. Each NER model uses labels straight from its training corpus (CRAFT, JNLPBA, BC5CDR, BIONLP13CG), so the same concept ends up with a different label depending on which model you load. Genes show up as PROTEIN in one model, GGP in another, GENE_OR_GENE_PRODUCT in a third. Makes it annoying to work with multiple models at once.

What I did

Added scispacy/label_normalization.py with:

  • A lookup table mapping all 30+ model-specific labels to 8 unified categories
  • normalize_label(label) for single lookups
  • normalize_entities(doc) that relabels a whole Doc in-place

Unknown labels pass through unchanged so this won't break anything for folks using custom models.

Unified categories

| Unified Label | Maps From |
|---|---|
| GENE | PROTEIN, GGP, DNA, RNA, GENE_OR_GENE_PRODUCT |
| CHEMICAL | SIMPLE_CHEMICAL, CHEBI, CHEMICAL |
| DISEASE | PATHOLOGICAL_FORMATION, DISEASE, CANCER |
| CELL | CELL, CELL_TYPE, CELL_LINE, CL |
| ORGANISM | ORGANISM, TAXON |
| ANATOMY | ORGAN, TISSUE, ANATOMICAL_SYSTEM, MULTI-TISSUE_STRUCTURE, etc. |
| CELLULAR_COMPONENT | CELLULAR_COMPONENT, GO |
| OTHER | everything else |
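
For anyone skimming, here is a minimal sketch of how the lookup could work, built only from the rows in the table above. It is not the code in `scispacy/label_normalization.py`: the helper dict `_UNIFIED_TO_SOURCES` is a hypothetical name, the ANATOMY row is truncated exactly where the table says "etc", and the sketch follows the pass-through behavior described earlier rather than mapping unknown labels to OTHER.

```python
from typing import Dict, Tuple

# Hypothetical grouping assembled from the table in this PR description.
# The real LABEL_MAP in the PR may use different names and cover more labels.
_UNIFIED_TO_SOURCES: Dict[str, Tuple[str, ...]] = {
    "GENE": ("PROTEIN", "GGP", "DNA", "RNA", "GENE_OR_GENE_PRODUCT"),
    "CHEMICAL": ("SIMPLE_CHEMICAL", "CHEBI", "CHEMICAL"),
    "DISEASE": ("PATHOLOGICAL_FORMATION", "DISEASE", "CANCER"),
    "CELL": ("CELL", "CELL_TYPE", "CELL_LINE", "CL"),
    "ORGANISM": ("ORGANISM", "TAXON"),
    # Only the labels spelled out in the table; the "etc" entries are omitted.
    "ANATOMY": ("ORGAN", "TISSUE", "ANATOMICAL_SYSTEM", "MULTI-TISSUE_STRUCTURE"),
    "CELLULAR_COMPONENT": ("CELLULAR_COMPONENT", "GO"),
}

# Invert the grouping into a flat source-label -> unified-label lookup.
LABEL_MAP: Dict[str, str] = {
    source: unified
    for unified, sources in _UNIFIED_TO_SOURCES.items()
    for source in sources
}


def normalize_label(label: str) -> str:
    """Return the unified label, or the input unchanged if it is unknown."""
    return LABEL_MAP.get(label, label)
```

Keeping the grouped dict as the source of truth and deriving the flat map from it makes it harder to assign one source label to two categories by accident, which is why the sketch is written that way.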

Tests

16 tests covering every label family plus edge cases like empty docs and unknown labels. All pass locally.

Connection to #388

This pairs nicely with the entity merging work — if you normalize labels first, combining entities from different models becomes way cleaner. Planning to hook this into entity_merging.py in a follow-up.
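
To illustrate why normalizing first helps, here is a rough sketch that works on plain `(start, end, label)` tuples instead of spaCy objects. Everything in it is hypothetical: `merge_normalized` is not a function in `entity_merging.py`, and the small `LABEL_MAP` is just a subset of the table above.

```python
from typing import Dict, List, Set, Tuple

Entity = Tuple[int, int, str]  # (start_char, end_char, label)

# Small illustrative subset of the mapping; not the PR's full LABEL_MAP.
LABEL_MAP: Dict[str, str] = {
    "PROTEIN": "GENE",
    "GGP": "GENE",
    "GENE_OR_GENE_PRODUCT": "GENE",
    "SIMPLE_CHEMICAL": "CHEMICAL",
}


def merge_normalized(*entity_lists: List[Entity]) -> List[Entity]:
    """Normalize labels, then drop exact duplicates across models."""
    seen: Set[Entity] = set()
    for entities in entity_lists:
        for start, end, label in entities:
            # Unknown labels pass through unchanged.
            seen.add((start, end, LABEL_MAP.get(label, label)))
    # Sort by position so the output order does not depend on model order.
    return sorted(seen)


# The same span tagged PROTEIN by one model and GGP by another collapses
# into a single GENE entity once labels are normalized.
model_a = [(0, 4, "PROTEIN"), (10, 17, "CHEMICAL")]
model_b = [(0, 4, "GGP"), (10, 17, "SIMPLE_CHEMICAL"), (20, 25, "CUSTOM")]
merged = merge_normalized(model_a, model_b)
```

Without the normalization step the two models would contribute five distinct entities here instead of three. Overlapping spans with different boundaries are a separate problem that this sketch does not attempt to solve.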

Let me know if you want different category names or additional mappings.

Co-authored-by: nik464 <nikhil18chaudhary@gmail.com>

Ashut0sh-mishra and others added 2 commits April 15, 2026 07:07
Long entity names or cache keys could exceed the 255-character
filesystem limit causing OSError. Changed url_to_filename() to
only preserve the file extension instead of the full trailing URL
component, keeping filenames under 143 bytes (eCryptfs limit).
Added backward-compat lookup for old-style cache entries.
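
A minimal sketch of the naming scheme described above, assuming the filename is built from SHA-256 digests of the URL and an optional ETag. The `etag` parameter, the constants, and the extension cap are assumptions made for illustration, and the backward-compat lookup is left out; the actual change in the commit may differ.

```python
import hashlib
import os
from typing import Optional
from urllib.parse import urlparse

MAX_FILENAME_BYTES = 143  # eCryptfs limit cited in the commit message
MAX_EXT_BYTES = 10  # assumed cap so hash + etag hash + extension stays in budget


def url_to_filename(url: str, etag: Optional[str] = None) -> str:
    """Build a short, deterministic cache filename for a URL."""
    # 64 hex characters, regardless of how long the URL is.
    filename = hashlib.sha256(url.encode("utf-8")).hexdigest()
    if etag:
        # Another 64 hex characters plus a separating dot.
        filename += "." + hashlib.sha256(etag.encode("utf-8")).hexdigest()
    # Keep only the extension of the last path component, not the component itself.
    last_part = urlparse(url).path.rsplit("/", 1)[-1]
    _, ext = os.path.splitext(last_part)
    # Skip extensions that are missing or implausibly long.
    if ext and len(ext.encode("utf-8")) <= MAX_EXT_BYTES:
        filename += ext
    return filename
```

In this sketch the worst case is 64 + 1 + 64 + 10 = 139 bytes, so the result stays within the 143-byte budget however long the original URL component is.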

Fixes allenai#539

Co-authored-by: nik464 <nikhil18chaudhary@gmail.com>

The four NER models (CRAFT, JNLPBA, BC5CDR, BIONLP13CG) each use
different label sets from their training corpora. The same concept
gets labeled differently depending on which model produced it --
e.g. 'PROTEIN' vs 'GGP' vs 'GENE_OR_GENE_PRODUCT' for genes.

Adds scispacy/label_normalization.py with:
- LABEL_MAP: maps every model-specific label to a unified label
- normalize_label(): single label lookup
- normalize_entities(): relabels all ents in a Doc in-place

Unified categories: GENE_OR_GENE_PRODUCT, RNA, CHEMICAL, DISEASE,
CELL, ORGANISM, ANATOMY, BIOLOGICAL_PROCESS. Unknown labels pass
through unchanged.

16 tests covering all label families, edge cases, and Doc-level
normalization.

Fixes allenai#538

Co-authored-by: nik464 <nikhil18chaudhary@gmail.com>