feat: add NER label normalization across scispaCy models (#538) by Ashut0sh-mishra · Pull Request #574 · allenai/scispacy

Ashut0sh-mishra · 2026-04-15T11:09:31Z

Dug into #538 — turns out the root issue is pretty simple. Each NER model uses labels straight from its training corpus (CRAFT, JNLPBA, BC5CDR, BIONLP13CG), so the same concept ends up with a different label depending on which model you load. Genes show up as PROTEIN in one model, GGP in another, GENE_OR_GENE_PRODUCT in a third. Makes it annoying to work with multiple models at once.

What I did

Added scispacy/label_normalization.py with:

- A lookup table mapping all 30+ model-specific labels to 8 unified categories
- normalize_label(label) for single lookups
- normalize_entities(doc) that relabels a whole Doc in-place
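
For readers unfamiliar with how spaCy entities are relabeled, here is a rough sketch of what a Doc-level pass like normalize_entities(doc) might do. It is not the code from this PR: the three-entry LABEL_MAP is an illustrative subset, the example sentence and MY_CUSTOM label are made up, and returning the Doc is an assumption. The spans in doc.ents are created on access, so the sketch builds new Span objects and reassigns doc.ents rather than mutating the existing ones.

```python
import spacy
from spacy.tokens import Doc, Span

# Illustrative subset only; the real table in the PR covers 30+ labels.
LABEL_MAP = {
    "PROTEIN": "GENE",
    "GGP": "GENE",
    "SIMPLE_CHEMICAL": "CHEMICAL",
}


def normalize_entities(doc: Doc) -> Doc:
    """Relabel every entity in `doc` using LABEL_MAP (hypothetical sketch)."""
    new_ents = [
        # Labels missing from the map are kept as they are.
        # Other span attributes (e.g. kb_id) are not carried over here.
        Span(doc, ent.start, ent.end, label=LABEL_MAP.get(ent.label_, ent.label_))
        for ent in doc.ents
    ]
    doc.ents = new_ents  # the same Doc object now carries the new labels
    return doc


# A blank pipeline is enough to demonstrate; no scispaCy model is loaded.
nlp = spacy.blank("en")
doc = nlp("BRCA1 binds aspirin in mice")
doc.ents = [
    Span(doc, 0, 1, label="PROTEIN"),
    Span(doc, 2, 3, label="SIMPLE_CHEMICAL"),
    Span(doc, 4, 5, label="MY_CUSTOM"),
]
normalize_entities(doc)
labels = [(ent.text, ent.label_) for ent in doc.ents]
print(labels)
```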

Unknown labels pass through unchanged so this won't break anything for folks using custom models.
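
The pass-through behavior amounts to a dictionary lookup that falls back to the input label. A minimal sketch, assuming a plain dict as the lookup table; the entries below are a small subset taken from the table in this description, and the actual module may be structured differently. Note that this follows the pass-through sentence above, so nothing is mapped to `OTHER` here.

```python
# Small illustrative subset of the mapping; not the full table from the PR.
LABEL_MAP = {
    "PROTEIN": "GENE",
    "GGP": "GENE",
    "GENE_OR_GENE_PRODUCT": "GENE",
    "SIMPLE_CHEMICAL": "CHEMICAL",
    "CHEBI": "CHEMICAL",
    "CELL_LINE": "CELL",
    "TAXON": "ORGANISM",
}


def normalize_label(label: str) -> str:
    """Return the unified label, or the input unchanged if it is not mapped."""
    return LABEL_MAP.get(label, label)


# The three gene-like labels from different models collapse to one category.
gene_labels = {normalize_label(l) for l in ("PROTEIN", "GGP", "GENE_OR_GENE_PRODUCT")}
print(gene_labels)

# A label from a custom model is left alone.
print(normalize_label("MY_CUSTOM_LABEL"))
```

Because unmapped labels come back untouched, calling the function on output from a custom-trained model is a no-op rather than an error.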

Unified categories

| Unified Label | Maps From |
| --- | --- |
| `GENE` | PROTEIN, GGP, DNA, RNA, GENE_OR_GENE_PRODUCT |
| `CHEMICAL` | SIMPLE_CHEMICAL, CHEBI, CHEMICAL |
| `DISEASE` | PATHOLOGICAL_FORMATION, DISEASE, CANCER |
| `CELL` | CELL, CELL_TYPE, CELL_LINE, CL |
| `ORGANISM` | ORGANISM, TAXON |
| `ANATOMY` | ORGAN, TISSUE, ANATOMICAL_SYSTEM, MULTI-TISSUE_STRUCTURE, etc. |
| `CELLULAR_COMPONENT` | CELLULAR_COMPONENT, GO |
| `OTHER` | everything else |

Tests

16 tests covering every label family plus edge cases like empty docs and unknown labels. All pass locally.

Connection to #388

This pairs nicely with the entity merging work — if you normalize labels first, combining entities from different models becomes way cleaner. Planning to hook this into entity_merging.py in a follow-up.

Let me know if you want different category names or additional mappings.

Co-authored-by: nik464 <nikhil18chaudhary@gmail.com>

Long entity names or cache keys could exceed the 255-character filesystem limit, causing OSError. Changed url_to_filename() to only preserve the file extension instead of the full trailing URL component, keeping filenames under 143 bytes (eCryptfs limit). Added backward-compat lookup for old-style cache entries.

Fixes allenai#539

Co-authored-by: nik464 <nikhil18chaudhary@gmail.com>

The four NER models (CRAFT, JNLPBA, BC5CDR, BIONLP13CG) each use different label sets from their training corpora. The same concept gets labeled differently depending on which model produced it -- e.g. 'PROTEIN' vs 'GGP' vs 'GENE_OR_GENE_PRODUCT' for genes.

Adds scispacy/label_normalization.py with:
- LABEL_MAP: maps every model-specific label to a unified label
- normalize_label(): single label lookup
- normalize_entities(): relabels all ents in a Doc in-place

Unified categories: GENE_OR_GENE_PRODUCT, RNA, CHEMICAL, DISEASE, CELL, ORGANISM, ANATOMY, BIOLOGICAL_PROCESS. Unknown labels pass through unchanged.

16 tests covering all label families, edge cases, and Doc-level normalization.

Fixes allenai#538

Co-authored-by: nik464 <nikhil18chaudhary@gmail.com>

Ashut0sh-mishra and others added 2 commits April 15, 2026 07:07

Ashut0sh-mishra wants to merge 2 commits into allenai:main from Ashut0sh-mishra:feat/ner-label-normalization-538