
Fix _handle_sentence trailing entity not replacing hyphen in label#569

Open
Chessing234 wants to merge 1 commit intoallenai:mainfrom
Chessing234:fix/handle-sentence-trailing-entity-hyphen-replace

Conversation

@Chessing234

Bug

`_handle_sentence` in `scispacy/data_util.py` has two places where it emits an entity tuple: one inside the per-token loop (when a non-`O` run is closed by a later `O` tag) and one after the loop (when the run extends all the way to the end of the sentence and is never closed). The in-loop branch normalizes the label with `.replace("-", "_")`, but the after-loop branch still emits the raw `entity_type`, so any sentence whose last token belongs to a multi-word hyphenated entity ends up with a hyphenated label while every other entity of the same type in the same file gets an underscored label.
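The shape of the bug can be seen in a minimal standalone sketch of the walk (a simplified re-implementation for illustration, not scispacy's actual code; it ignores `B-`/`I-` distinctions within a run). With the fix applied, both emit sites normalize:

```python
def handle_sentence(tags):
    """Walk a list of BIO tags and return (start, end, label) tuples.

    Simplified sketch of the two-emit-site structure described above.
    """
    entities = []
    in_entity = False
    start_index = 0
    entity_type = None
    for current_index, tag in enumerate(tags):
        if tag == "O":
            if in_entity:
                # In-loop emit: run closed by an O tag (always normalized).
                entities.append(
                    (start_index, current_index - 1, entity_type.replace("-", "_"))
                )
                in_entity = False
        elif not in_entity:
            start_index = current_index
            entity_type = tag[2:].upper()  # strip the "B-"/"I-" prefix
            in_entity = True
    if in_entity:
        # After-loop emit: run reaches the end of the sentence. Before the
        # fix this appended the raw entity_type; normalizing here restores
        # the invariant that labels never contain hyphens.
        entities.append(
            (start_index, len(tags) - 1, entity_type.replace("-", "_"))
        )
    return entities
```

A sentence ending mid-entity (`["O", "B-DRUG-NAME", "I-DRUG-NAME"]`) now yields `(1, 2, "DRUG_NAME")`, matching what the in-loop path produces for a closed run.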

Root cause

```python
else:
    if in_entity:
        end_index = current_index - 1
        entities.append((start_index, end_index, entity_type.replace("-", "_")))
...
if in_entity:
    end_index = current_index - 1
    entities.append((start_index, end_index, entity_type))
```

Commit 43dffb8 ("Replace hyphen with underscore in entity labels") added `.replace("-", "_")` to the in-loop append in 2020 but missed the identical after-loop append right below it. Both branches build the tuple from the same `entity_type` variable (derived from `entity[2:].upper()`) and are supposed to produce identically normalized labels; instead, the after-loop branch has been silently emitting the un-normalized form for any sentence whose final token is still inside an entity.

For BIO files with labels like `B-DRUG-NAME` / `I-DRUG-NAME`, sentences that end on a `DRUG-NAME` token produce `"DRUG-NAME"` while sentences that don't produce `"DRUG_NAME"`, and any downstream code that groups or compares by label (per-class F1, label sets, confusion matrices) ends up splitting the same class into two buckets keyed only on whether the sentence happened to end mid-entity.
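The bucket split is easy to reproduce in isolation (labels here are invented for the example):

```python
from collections import Counter

# Labels as emitted pre-fix: the third sentence happened to end mid-entity,
# so its label kept the hyphen and grouping sees two keys for one class.
pre_fix = Counter(["DRUG_NAME", "DRUG_NAME", "DRUG-NAME"])

# Post-fix, all three sentences emit the same normalized label.
post_fix = Counter(["DRUG_NAME", "DRUG_NAME", "DRUG_NAME"])

print(sorted(pre_fix))   # ['DRUG-NAME', 'DRUG_NAME'] -- two buckets, one class
print(sorted(post_fix))  # ['DRUG_NAME']
```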

Fix

Mirror the same `.replace("-", "_")` call on the after-loop append so both code paths normalize identically. A one-line change.

```diff
 if in_entity:
     end_index = current_index - 1
-    entities.append((start_index, end_index, entity_type))
+    entities.append((start_index, end_index, entity_type.replace("-", "_")))
```

No effect for datasets whose label names don't contain hyphens (including the existing `test_read_ner_from_tsv` fixture, which uses `SO` / `TAXON`); for hyphenated label names this restores the invariant the original commit was trying to establish.

`_handle_sentence` walks a BIO-tagged sentence and emits `(start,
end, entity_type)` tuples. There are two append sites: one inside
the loop (when a non-`O` run is closed by a later `O` tag) and one
after the loop (when a run extends all the way to the end of the
sentence and is never closed by an `O`).

Commit 43dffb8 ("Replace hyphen with underscore in entity labels")
added `.replace("-", "_")` normalization to the in-loop append, so
an entity type like `"DRUG-NAME"` is emitted as `"DRUG_NAME"`:

    entities.append((start_index, end_index, entity_type.replace("-", "_")))

but the after-loop append still emits the raw `entity_type`, so any
sentence whose last token belongs to a multi-word entity ends up
with a hyphenated label while every other entity in the sentence
gets an underscored label. Downstream code that compares labels
(e.g. per-class F1 bucketing) then silently splits the same entity
type into two distinct buckets depending only on whether the
sentence ended mid-entity.

Mirror the same `.replace("-", "_")` call on the after-loop append
so both code paths normalize identically.
