Fix _handle_sentence trailing entity not replacing hyphen in label by Chessing234 · Pull Request #569 · allenai/scispacy

Chessing234 · 2026-04-11T09:42:43Z

Bug

`_handle_sentence` in `scispacy/data_util.py` has two places where it emits an entity tuple: one inside the per-token loop (when a non-`O` run is closed by a later `O` tag) and one after the loop (when the run extends to the end of the sentence and is never closed). The in-loop branch normalizes the label with `.replace("-", "_")`, but the after-loop branch still emits the raw `entity_type`. As a result, any sentence whose last token belongs to an entity with a hyphenated label emits that label with its hyphen intact, while every other entity of the same type in the same file gets an underscored label.

Root cause

```python
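    # Inside the per-token loop: an O tag closes the open run.
    else:
        if in_entity:
            end_index = current_index - 1
            entities.append((start_index, end_index, entity_type.replace("-", "_")))
    ...
# After the loop: the run reached the end of the sentence.
if in_entity:
    end_index = current_index - 1
    entities.append((start_index, end_index, entity_type))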
```

Commit 43dffb8 ("Replace hyphen with underscore in entity labels") added `.replace("-", "_")` to the in-loop append in 2020 but missed the identical after-loop append right below it. Both branches build the tuple from the same `entity_type` variable, which comes from `entity[2:].upper()`, and are supposed to produce identically normalized labels. Instead, the after-loop branch has been silently emitting the un-normalized form for any sentence whose final token is still inside an entity.
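The mismatch between the two append sites can be reproduced with a small stand-in. This is a minimal sketch, not the scispacy implementation: `bio_spans` and its `normalize_trailing` flag are hypothetical names, and it tracks token indices rather than the offsets `_handle_sentence` works with.

```python
from typing import List, Tuple

Span = Tuple[int, int, str]


def bio_spans(tags: List[str], normalize_trailing: bool) -> List[Span]:
    """Hypothetical, simplified BIO walker with the same two append sites."""
    spans: List[Span] = []
    in_entity = False
    start = 0
    entity_type = ""
    for i, tag in enumerate(tags):
        if tag.startswith("B-"):
            # A new B- tag closes any run that is still open.
            if in_entity:
                spans.append((start, i - 1, entity_type.replace("-", "_")))
            in_entity, start, entity_type = True, i, tag[2:].upper()
        elif tag.startswith("I-") and in_entity:
            continue
        else:
            # In-loop append site: always normalized.
            if in_entity:
                spans.append((start, i - 1, entity_type.replace("-", "_")))
            in_entity = False
    # After-loop append site: normalized only when the flag is set,
    # which mimics the behavior before and after this PR.
    if in_entity:
        label = entity_type.replace("-", "_") if normalize_trailing else entity_type
        spans.append((start, len(tags) - 1, label))
    return spans


closed = ["B-DRUG-NAME", "I-DRUG-NAME", "O"]
trailing = ["O", "B-DRUG-NAME", "I-DRUG-NAME"]

print(bio_spans(closed, normalize_trailing=False))    # [(0, 1, 'DRUG_NAME')]
print(bio_spans(trailing, normalize_trailing=False))  # [(1, 2, 'DRUG-NAME')]
print(bio_spans(trailing, normalize_trailing=True))   # [(1, 2, 'DRUG_NAME')]
```

With the flag off, the only difference between the two inputs is whether the entity is followed by an `O` tag, yet the labels differ.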

For BIO files with labels like `B-DRUG-NAME` / `I-DRUG-NAME`, sentences that end on a `DRUG-NAME` token produce `"DRUG-NAME"` while sentences that don't produce `"DRUG_NAME"`, and any downstream code that groups or compares by label (per-class F1, label sets, confusion matrices) ends up splitting the same class into two buckets keyed only on whether the sentence happened to end mid-entity.
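To illustrate the downstream effect, here is a hedged sketch of label bucketing with `collections.Counter`. The label list is made up for illustration and is not taken from any real dataset.

```python
from collections import Counter

# Hypothetical labels collected from three sentences; the middle one
# came from a sentence that ended while still inside the entity.
emitted_labels = ["DRUG_NAME", "DRUG-NAME", "DRUG_NAME"]

# Grouping by raw label splits one class into two buckets.
raw_counts = Counter(emitted_labels)
print(raw_counts)

# Normalizing every label the same way merges them back into one bucket.
normalized_counts = Counter(label.replace("-", "_") for label in emitted_labels)
print(normalized_counts)
```

Any per-class metric keyed on `raw_counts` would report `DRUG-NAME` and `DRUG_NAME` as separate classes.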

Fix

Mirror the same `.replace("-", "_")` call on the after-loop append so both code paths normalize identically. One-line change.

```diff
 if in_entity:
     end_index = current_index - 1
-    entities.append((start_index, end_index, entity_type))
+    entities.append((start_index, end_index, entity_type.replace("-", "_")))
```

No effect for datasets whose label names don't contain hyphens (including the existing `test_read_ner_from_tsv` fixture, which uses `SO` / `TAXON`); for hyphenated label names this restores the invariant the original commit was trying to establish.
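One way to guard this invariant is a check that no emitted label still contains a hyphen. The helper below is a hypothetical sketch, not part of scispacy, and it assumes entities are `(start, end, label)` tuples as described above.

```python
from typing import Iterable, List, Tuple


def hyphenated_labels(entities: Iterable[Tuple[int, int, str]]) -> List[str]:
    """Return the distinct labels that still contain a hyphen, sorted."""
    return sorted({label for _, _, label in entities if "-" in label})


# Illustrative tuples only: the second one mimics the pre-fix trailing entity.
before_fix = [(0, 9, "DRUG_NAME"), (20, 31, "DRUG-NAME")]
after_fix = [(0, 9, "DRUG_NAME"), (20, 31, "DRUG_NAME")]

print(hyphenated_labels(before_fix))  # ['DRUG-NAME']
print(hyphenated_labels(after_fix))   # []
```

A test with a hyphenated-label fixture could assert that this list is empty for the entities returned by the reader.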

`_handle_sentence` walks a BIO-tagged sentence and emits `(start, end, entity_type)` tuples. There are two append sites: one inside the loop (when a non-`O` run is closed by a later `O` tag) and one after the loop (when a run extends to the end of the sentence and is never closed by an `O`).

Commit 43dffb8 ("Replace hyphen with underscore in entity labels") added `.replace("-", "_")` normalization to the in-loop append, `entities.append((start_index, end_index, entity_type.replace("-", "_")))`, so an entity type like `"DRUG-NAME"` is emitted as `"DRUG_NAME"`. The after-loop append still emits the raw `entity_type`, so any sentence whose last token belongs to an entity with a hyphenated label ends up with a hyphenated label where every other entity of that type gets an underscored label.

Downstream code that compares labels (e.g. per-class F1 bucketing) then silently splits the same entity type into two distinct buckets depending only on whether the sentence ended mid-entity. Mirror the same `.replace("-", "_")` call on the after-loop append so both code paths normalize identically.

Chessing234 wants to merge 1 commit into `allenai:main` from `Chessing234:fix/handle-sentence-trailing-entity-hyphen-replace`
