feat: add NER label normalization across scispaCy models (#538) #574
Open
Ashut0sh-mishra wants to merge 2 commits into allenai:main from
Long entity names or cache keys could exceed the 255-character filesystem limit, causing an OSError. Changed url_to_filename() to preserve only the file extension instead of the full trailing URL component, keeping filenames under 143 bytes (the eCryptfs limit). Added a backward-compatible lookup for old-style cache entries.

Fixes allenai#539

Co-authored-by: nik464 <nikhil18chaudhary@gmail.com>
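To illustrate the idea in this commit, here is a minimal sketch of a hash-plus-extension cache filename. It is not the code from the PR: the name `url_to_filename_sketch`, the optional `etag` argument, and the 10-character extension cap are assumptions made for the example, and the backward-compatible lookup is not shown.

```python
from hashlib import sha256
from pathlib import PurePosixPath
from typing import Optional
from urllib.parse import urlparse

MAX_FILENAME_BYTES = 143  # eCryptfs limit cited in the commit message
MAX_EXTENSION_CHARS = 10  # assumed cap for this sketch, not taken from the PR


def url_to_filename_sketch(url: str, etag: Optional[str] = None) -> str:
    """Build a fixed-length cache filename: sha256(url)[.sha256(etag)][.ext]."""
    # Hashing the whole URL gives 64 hex characters regardless of URL length.
    filename = sha256(url.encode("utf-8")).hexdigest()
    if etag:
        filename += "." + sha256(etag.encode("utf-8")).hexdigest()
    # Keep only the extension of the last path component, not the component itself.
    extension = PurePosixPath(urlparse(url).path).suffix[:MAX_EXTENSION_CHARS]
    # Drop non-ASCII extensions so one character always equals one byte.
    if not extension.isascii():
        extension = ""
    return filename + extension


# A trailing component far longer than 255 characters still yields a short name.
long_url = "https://example.org/data/" + "x" * 300 + ".jsonl"
name = url_to_filename_sketch(long_url, etag="abc123")
assert len(name.encode("utf-8")) <= MAX_FILENAME_BYTES
```

With both hashes present, the longest possible name in this sketch is 64 + 1 + 64 + 10 = 139 bytes, which stays under the 143-byte limit.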
The four NER models (CRAFT, JNLPBA, BC5CDR, BIONLP13CG) each use a different label set taken from their training corpora. The same concept gets labeled differently depending on which model produced it, e.g. 'PROTEIN' vs 'GGP' vs 'GENE_OR_GENE_PRODUCT' for genes.

Adds scispacy/label_normalization.py with:
- LABEL_MAP: maps every model-specific label to a unified label
- normalize_label(): single label lookup
- normalize_entities(): relabels all ents in a Doc in-place

Unified categories: GENE_OR_GENE_PRODUCT, RNA, CHEMICAL, DISEASE, CELL, ORGANISM, ANATOMY, BIOLOGICAL_PROCESS. Unknown labels pass through unchanged.

16 tests covering all label families, edge cases, and Doc-level normalization.

Fixes allenai#538

Co-authored-by: nik464 <nikhil18chaudhary@gmail.com>
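The lookup described in this commit can be sketched roughly as below. This is an illustrative subset, not the PR's actual LABEL_MAP: only the three gene labels come from the commit message, the remaining entries are assumed mappings added for the example, and the unified names follow the list in this commit message.

```python
from typing import Dict

# Illustrative subset only; the real LABEL_MAP in the PR is not reproduced here.
LABEL_MAP: Dict[str, str] = {
    # Gene family: the three labels named in the commit message.
    "PROTEIN": "GENE_OR_GENE_PRODUCT",
    "GGP": "GENE_OR_GENE_PRODUCT",
    "GENE_OR_GENE_PRODUCT": "GENE_OR_GENE_PRODUCT",
    # Assumed mappings, included only to show other label families.
    "CELL_TYPE": "CELL",
    "CELL_LINE": "CELL",
    "SIMPLE_CHEMICAL": "CHEMICAL",
    "TAXON": "ORGANISM",
}


def normalize_label(label: str) -> str:
    """Return the unified label, or the input unchanged if it is not mapped."""
    return LABEL_MAP.get(label, label)


# Labels from different models collapse to one unified name.
gene_labels = {normalize_label(name) for name in ("PROTEIN", "GGP", "GENE_OR_GENE_PRODUCT")}
# A label from a custom model is passed through untouched.
custom = normalize_label("MY_CUSTOM_LABEL")
```

Using `dict.get` with the label itself as the default is what gives the pass-through behaviour for unknown labels without any extra branching.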
Dug into #538, and the root issue turns out to be pretty simple. Each NER model uses labels straight from its training corpus (CRAFT, JNLPBA, BC5CDR, BIONLP13CG), so the same concept ends up with a different label depending on which model you load. Genes show up as `PROTEIN` in one model, `GGP` in another, `GENE_OR_GENE_PRODUCT` in a third. That makes it annoying to work with multiple models at once.
What I did
Added `scispacy/label_normalization.py` with:
- `normalize_label(label)` for single lookups
- `normalize_entities(doc)` that relabels a whole Doc in-place
Unknown labels pass through unchanged, so this won't break anything for folks using custom models.
Unified categories
- GENE
- CHEMICAL
- DISEASE
- CELL
- ORGANISM
- ANATOMY
- CELLULAR_COMPONENT
- OTHER
Tests
16 tests covering every label family plus edge cases like empty docs and unknown labels. All pass locally.
Connection to #388
This pairs nicely with the entity merging work: if you normalize labels first, combining entities from different models becomes much cleaner. Planning to hook this into `entity_merging.py` in a follow-up.
Let me know if you want different category names or additional mappings.
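As a rough picture of why normalizing first helps with merging, here is a small sketch that combines entity spans from two models. It is not the planned `entity_merging.py` integration: the `(start, end, label)` tuple representation, the `merge_normalized` helper, and the two-entry map are all hypothetical, and the unified name follows the commit message.

```python
from typing import Dict, List, Set, Tuple

Entity = Tuple[int, int, str]  # (start_char, end_char, label), a stand-in for spaCy spans

# Hypothetical two-entry map, enough to show the effect on gene labels.
GENE_LABELS: Dict[str, str] = {
    "PROTEIN": "GENE_OR_GENE_PRODUCT",
    "GGP": "GENE_OR_GENE_PRODUCT",
}


def merge_normalized(*model_outputs: List[Entity]) -> List[Entity]:
    """Normalize labels, then drop spans that are identical across models."""
    seen: Set[Entity] = set()
    for entities in model_outputs:
        for start, end, label in entities:
            # Unknown labels fall through unchanged.
            seen.add((start, end, GENE_LABELS.get(label, label)))
    return sorted(seen)


# Two models tag the same span with different gene labels.
model_a = [(0, 5, "PROTEIN")]
model_b = [(0, 5, "GGP"), (20, 27, "DISEASE")]
merged = merge_normalized(model_a, model_b)
# merged == [(0, 5, "GENE_OR_GENE_PRODUCT"), (20, 27, "DISEASE")]
```

Without the normalization step, the two gene spans would be kept as separate entries because their labels differ.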
Co-authored-by: nik464 <nikhil18chaudhary@gmail.com>