feat: add entity merging across multiple NER models (#388) #573
Open
Ashut0sh-mishra wants to merge 2 commits into allenai:main
Long entity names or cache keys could exceed the 255-character filesystem limit, causing an OSError. Changed url_to_filename() to preserve only the file extension instead of the full trailing URL component, keeping filenames under 143 bytes (the eCryptfs limit). Added a backward-compatible lookup for old-style cache entries.
Fixes allenai#539
Co-authored-by: nik464 <nikhil18chaudhary@gmail.com>
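For reviewers who want a picture of the filename change, here is a minimal sketch of the idea: derive a fixed-length name from the URL and append only its extension. This is not the actual url_to_filename() from scispacy. The function name short_cache_filename, the use of SHA-256, and the 16-byte cap on the extension are all assumptions made for illustration, and the backward-compatible lookup is not shown.

```python
import hashlib
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Limit quoted in the commit message (eCryptfs); used here only as a sanity bound.
MAX_FILENAME_BYTES = 143


def short_cache_filename(url: str) -> str:
    """Hypothetical helper: hash the URL and keep only its file extension."""
    # A SHA-256 hex digest is always 64 ASCII characters, whatever the URL length.
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    # Take the extension from the URL path only, ignoring query and fragment.
    suffix = PurePosixPath(urlparse(url).path).suffix
    # Assumption: drop implausibly long "extensions" so the bound always holds.
    if len(suffix.encode("utf-8")) > 16:
        suffix = ""
    return digest + suffix


long_url = "https://example.com/" + "a" * 500 + ".json"
name = short_cache_filename(long_url)
print(len(name.encode("utf-8")) <= MAX_FILENAME_BYTES, name.endswith(".json"))
```

Because the name depends only on the URL, the same URL always maps to the same cache entry, and the length no longer grows with the trailing URL component.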
Adds scispacy/entity_merging.py with:
- merge_overlapping_spans(): keeps the longest non-overlapping spans from a flat list using spacy.util.filter_spans
- merge_entities(): runs text through multiple spaCy models, collects all recognized entities, optionally adds abbreviation long forms, and returns a single Doc with merged ents
Also adds tests/test_entity_merging.py covering overlap resolution, deduplication, and multi-model span merging.
Fixes allenai#388
Co-authored-by: nik464 <nikhil18chaudhary@gmail.com>
Took a stab at #388.
The idea is pretty simple - run your text through multiple NER models, grab all the entities, and keep the longest spans when they overlap. This way you get the best of each model without duplicates or fragments.
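The "longest span wins" rule can be sketched in plain Python over character offsets. This is only an illustration of the rule, not the code in this PR (which relies on spacy.util.filter_spans). The helper keep_longest_spans, the (start, end, label) tuple layout, and the two hard-coded "model outputs" are assumptions for the example.

```python
from typing import Iterable, List, Tuple

# (start_char, end_char, label), with end_char exclusive
Span = Tuple[int, int, str]


def keep_longest_spans(spans: Iterable[Span]) -> List[Span]:
    """Hypothetical helper: drop exact duplicates, then keep the longest
    non-overlapping spans, preferring the earlier start on ties."""
    # Longest first; ties go to the earlier start. The label is part of the
    # key only to make the ordering deterministic.
    ordered = sorted(
        set(spans),
        key=lambda s: (s[1] - s[0], -s[0], s[2]),
        reverse=True,
    )
    kept: List[Span] = []
    for start, end, label in ordered:
        # Keep a span only if it does not overlap anything already kept.
        if all(end <= k_start or start >= k_end for k_start, k_end, _ in kept):
            kept.append((start, end, label))
    return sorted(kept)  # back to document order


text = "Spinal and bulbar muscular atrophy (SBMA) is inherited."


def span_of(phrase: str, label: str) -> Span:
    start = text.index(phrase)
    return (start, start + len(phrase), label)


# Stand-ins for the entities two different NER models might return.
model_a = [span_of("muscular atrophy", "DISEASE"), span_of("SBMA", "DISEASE")]
model_b = [
    span_of("Spinal and bulbar muscular atrophy", "DISEASE"),
    span_of("SBMA", "DISEASE"),
]

merged = keep_longest_spans(model_a + model_b)
print([text[s:e] for s, e, _ in merged])
# ['Spinal and bulbar muscular atrophy', 'SBMA']
```

The fragment "muscular atrophy" is dropped because the longer span from the second model covers it, and the duplicate "SBMA" collapses to a single entity.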
What's in here
- scispacy/entity_merging.py with a merge_entities() function
- Optional AbbreviationDetector step that adds abbreviation long forms
- spacy.util.filter_spans for overlap resolution (longest wins)
- merge_overlapping_spans() helper if you just want to dedupe a list of spans you already have
Tests
Added 6 tests covering:
All pass locally.
For reviewers
- filter_spans is used for overlap resolution
- doc.char_span() is used to avoid mixing Doc objects
- AbbreviationDetector gets auto-added if missing from the pipeline
Let me know if you want me to change anything or add more test cases.
Fixes #388
Co-authored-by: nik464 <nikhil18chaudhary@gmail.com>