Fix _handle_sentence trailing entity not replacing hyphen in label #569
Open
Chessing234 wants to merge 1 commit into allenai:main
`_handle_sentence` walks a BIO-tagged sentence and emits `(start, end, entity_type)` tuples. There are two append sites: one inside the loop (when a non-`O` run is closed by a later `O` tag) and one after the loop (when a run extends all the way to the end of the sentence and is never closed by an `O`).

Commit 43dffb8 ("Replace hyphen with underscore in entity labels") added `.replace("-", "_")` normalization to the in-loop append, so an entity type like `"DRUG-NAME"` is emitted as `"DRUG_NAME"`:

`entities.append((start_index, end_index, entity_type.replace("-", "_")))`

The after-loop append still emits the raw `entity_type`, so any sentence whose last token belongs to an entity with a hyphenated type ends up with a hyphenated label, while every entity closed inside the loop gets an underscored label. Downstream code that compares labels (e.g. per-class F1 bucketing) then silently splits the same entity type into two distinct buckets depending only on whether the sentence ended mid-entity.

Mirror the same `.replace("-", "_")` call on the after-loop append so both code paths normalize identically.
Bug
`_handle_sentence` in `scispacy/data_util.py` has two places where it emits an entity tuple: one inside the per-token loop (when a non-`O` run is closed by a later `O` tag) and one after the loop (when the run extends all the way to the end of the sentence and is never closed). The in-loop branch normalizes the label with `.replace("-", "_")`, but the after-loop branch still emits the raw `entity_type`. As a result, any sentence whose last token belongs to an entity with a hyphenated label ends up with a hyphenated label, while every other entity of the same type in the same file gets an underscored label.
Root cause
```python
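    else:
        # In-loop append: an `O` tag closes the current entity run.
        if in_entity:
            end_index = current_index - 1
            entities.append((start_index, end_index, entity_type.replace("-", "_")))
    ...
# After-loop append: the run reached the end of the sentence unclosed.
if in_entity:
    end_index = current_index - 1
    entities.append((start_index, end_index, entity_type))  # label is not normalized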
```
Commit 43dffb8 ("Replace hyphen with underscore in entity labels") added `.replace("-", "_")` to the in-loop append in 2020 but missed the identical after-loop append right below it. Both branches build the tuple from the same `entity_type` variable, which comes from `entity[2:].upper()`, and are supposed to produce identically normalized labels. Instead, the after-loop branch has been silently emitting the un-normalized form for any sentence whose final token is still inside an entity.

For BIO files with labels like `B-DRUG-NAME` / `I-DRUG-NAME`, sentences that end on a `DRUG-NAME` token produce `"DRUG-NAME"` while sentences that don't produce `"DRUG_NAME"`. Any downstream code that groups or compares by label (per-class F1, label sets, confusion matrices) ends up splitting the same class into two buckets keyed only on whether the sentence happened to end mid-entity.
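The divergence can be reproduced with a small stand-alone sketch. This is not the scispacy implementation: `handle_sentence_sketch`, its `normalize_trailing` flag, and `label_counts` are hypothetical names used only for illustration, and the walk is simplified to the two append sites discussed above.

```python
from collections import Counter
from typing import List, Tuple

Entity = Tuple[int, int, str]


def handle_sentence_sketch(sentence: str, normalize_trailing: bool) -> List[Entity]:
    """Simplified stand-in for the BIO walk (assumption: one 'word TAG' pair per line)."""
    entities: List[Entity] = []
    current_index = 0
    in_entity = False
    entity_type = ""
    start_index = -1
    for line in sentence.split("\n"):
        word, tag = line.split()
        if tag != "O":
            if not in_entity:
                in_entity = True
                entity_type = tag[2:].upper()
                start_index = current_index
        else:
            if in_entity:
                # In-loop append: always normalized.
                entities.append(
                    (start_index, current_index - 1, entity_type.replace("-", "_"))
                )
            in_entity = False
        current_index += len(word) + 1
    if in_entity:
        # After-loop append: normalized only when the flag mimics the fix.
        label = entity_type.replace("-", "_") if normalize_trailing else entity_type
        entities.append((start_index, current_index - 1, label))
    return entities


def label_counts(sentences: List[str], normalize_trailing: bool) -> Counter:
    """Count entity labels the way a per-class metric might bucket them."""
    counts: Counter = Counter()
    for sentence in sentences:
        for _, _, label in handle_sentence_sketch(sentence, normalize_trailing):
            counts[label] += 1
    return counts


closed = "aspirin B-DRUG-NAME\ntablets I-DRUG-NAME\nhelp O"
trailing = "take O\naspirin B-DRUG-NAME\ntablets I-DRUG-NAME"

buggy = label_counts([closed, trailing], normalize_trailing=False)
fixed = label_counts([closed, trailing], normalize_trailing=True)
```

With the flag off, `buggy` holds one `DRUG_NAME` and one `DRUG-NAME`; with it on, `fixed` holds two `DRUG_NAME` entries, which is the single bucket downstream code expects.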
Fix
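Mirror the same `.replace("-", "_")` call on the after-loop append so both code paths normalize identically. One-line change.

```diff
 if in_entity:
     end_index = current_index - 1
-    entities.append((start_index, end_index, entity_type))
+    entities.append((start_index, end_index, entity_type.replace("-", "_")))
```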
No effect for datasets whose label names don't contain hyphens (including the existing `test_read_ner_from_tsv` fixture, which uses `SO` / `TAXON`); for hyphenated label names this restores the invariant the original commit was trying to establish.
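A minimal sketch of why the change is safe for existing data, assuming the normalization is nothing more than the `str.replace` call shown above (`normalize_label` is a hypothetical helper, not a function in scispacy):

```python
def normalize_label(label: str) -> str:
    """Hypothetical helper mirroring the normalization at both append sites."""
    return label.replace("-", "_")


# Labels without hyphens pass through unchanged.
unchanged = [normalize_label(label) for label in ("SO", "TAXON")]

# Hyphenated labels are rewritten, and applying the call twice changes nothing more.
once = normalize_label("DRUG-NAME")
twice = normalize_label(once)
```

Because the call is a no-op on hyphen-free labels and idempotent on the rest, applying it at both append sites cannot alter output that was already normalized.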