feat: add entity merging across multiple NER models (#388) #573

Open
Ashut0sh-mishra wants to merge 2 commits into allenai:main from Ashut0sh-mishra:feat/merge-overlapping-entities-388

@Ashut0sh-mishra

Took a stab at #388.

The idea is pretty simple: run your text through multiple NER models, pool all the entities, and keep the longest span wherever spans overlap. That way you get each model's coverage without duplicates or fragments.

What's in here

  • scispacy/entity_merging.py with a merge_entities() function
  • Loads each model, collects ents from all of them into one pool
  • Optionally pulls in abbreviation long forms from AbbreviationDetector
  • Uses spacy.util.filter_spans for overlap resolution (longest wins)
  • Also added a standalone merge_overlapping_spans() helper if you just want to dedupe a list of spans you already have
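To make the "longest wins" rule concrete, here is a minimal pure-Python sketch of the overlap resolution on plain (start, end, label) tuples. It is not the code in this PR: keep_longest_spans and the sample pool are hypothetical names for illustration, and the logic only mirrors the rule spacy.util.filter_spans applies (prefer the longest span, and on ties the one that starts first).

```python
from typing import List, Set, Tuple

# (start_token, end_token_exclusive, label); a stand-in for spaCy Span objects.
Span = Tuple[int, int, str]


def keep_longest_spans(spans: List[Span]) -> List[Span]:
    """Hypothetical helper: keep the longest non-overlapping spans."""
    # Longest first; among equally long spans, the earlier start wins.
    ordered = sorted(spans, key=lambda s: (s[1] - s[0], -s[0]), reverse=True)
    kept: List[Span] = []
    seen: Set[int] = set()  # token indices already claimed by a kept span
    for start, end, label in ordered:
        # Checking both boundary tokens is enough because spans are
        # processed from longest to shortest.
        if start in seen or (end - 1) in seen:
            continue
        kept.append((start, end, label))
        seen.update(range(start, end))
    # Return spans in document order.
    return sorted(kept, key=lambda s: s[0])


# Pooled entities from two imaginary models (labels are made up).
pool = [
    (0, 2, "CHEMICAL"),  # model A: shorter span
    (0, 3, "DISEASE"),   # model B: longer span covering the same region
    (1, 3, "DISEASE"),   # partial overlap with both of the above
    (5, 6, "GENE"),      # found by model A
    (5, 6, "GENE"),      # exact duplicate found by model B
]
merged = keep_longest_spans(pool)
print(merged)  # [(0, 3, 'DISEASE'), (5, 6, 'GENE')]
```

With real spaCy Span objects the same result would come from calling filter_spans on the pooled list; the tuple version is only meant to show which span survives in each overlap case.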

Tests

Added 6 tests covering:

  • empty input
  • non-overlapping spans
  • full overlap
  • partial overlap
  • duplicates
  • multi-model scenario from the original issue

All pass locally.

For reviewers

  • This is a new file, no existing code was touched
  • Overlap resolution uses spacy's built-in filter_spans
  • Cross-model spans are projected via doc.char_span() to avoid mixing Doc objects
  • AbbreviationDetector gets auto-added if missing from the pipeline
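For anyone reviewing the char_span projection, the sketch below shows the general idea in pure Python: a span found in one tokenization is carried over by character offsets and only accepted if both offsets fall on token boundaries of the target. This is an illustration under assumptions, not the PR's code. token_offsets and char_to_token_span are hypothetical helpers, and the whitespace tokenizer is a stand-in for a real spaCy pipeline. In spaCy itself, Doc.char_span returns None in its default strict alignment mode when the offsets do not line up with token boundaries.

```python
import re
from typing import List, Optional, Tuple


def token_offsets(text: str) -> List[Tuple[int, int]]:
    """Hypothetical whitespace tokenizer returning (start_char, end_char) pairs."""
    return [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def char_to_token_span(
    offsets: List[Tuple[int, int]], start_char: int, end_char: int
) -> Optional[Tuple[int, int]]:
    """Map character offsets to a (start_token, end_token_exclusive) pair.

    Returns None when either offset falls inside a token, which is the
    strict-alignment behaviour this sketch imitates.
    """
    starts = {s: i for i, (s, _) in enumerate(offsets)}
    ends = {e: i + 1 for i, (_, e) in enumerate(offsets)}
    if start_char not in starts or end_char not in ends:
        return None
    start_tok, end_tok = starts[start_char], ends[end_char]
    return (start_tok, end_tok) if start_tok < end_tok else None


text = "Spinal muscular atrophy is rare"
offsets = token_offsets(text)

# An entity from another model, expressed as character offsets 0..23.
projected = char_to_token_span(offsets, 0, 23)
print(projected)  # (0, 3)

# Offsets that start mid-token cannot be projected.
misaligned = char_to_token_span(offsets, 2, 23)
print(misaligned)  # None
```

Spans that fail to project need an explicit decision (skip, or snap to the nearest token boundary); the sketch simply drops them by returning None.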

Let me know if you want me to change anything or add more test cases.

Fixes #388

Co-authored-by: nik464 <nikhil18chaudhary@gmail.com>

Ashut0sh-mishra and others added 2 commits April 15, 2026 07:07
Long entity names or cache keys could exceed the 255-character
filesystem limit causing OSError. Changed url_to_filename() to
only preserve the file extension instead of the full trailing URL
component, keeping filenames under 143 bytes (eCryptfs limit).
Added backward-compat lookup for old-style cache entries.

Fixes allenai#539

Co-authored-by: nik464 <nikhil18chaudhary@gmail.com>

Adds scispacy/entity_merging.py with:
- merge_overlapping_spans(): keeps longest non-overlapping spans
  from a flat list using spacy.util.filter_spans
- merge_entities(): runs text through multiple spaCy models,
  collects all recognized entities, optionally adds abbreviation
  long forms, and returns a single Doc with merged ents

Also adds tests/test_entity_merging.py covering overlap
resolution, deduplication, and multi-model span merging.

Fixes allenai#388

Co-authored-by: nik464 <nikhil18chaudhary@gmail.com>

Development

Successfully merging this pull request may close these issues.

Combining Entities Recognized by Different Models & by the AbbreviationDetector
