Fix convert_freqs.py writing hardcoded oov_prob instead of computed value#576

Open
Chessing234 wants to merge 2 commits intoallenai:mainfrom
Chessing234:fix/convert-freqs-hardcoded-oov-prob
Conversation

@Chessing234

Bug

`scripts/convert_freqs.py` computes an OOV probability from the input frequency file and unpacks it in `main`:

```python
# in read_freqs(...):
oov_prob = math.log(counts.smoother(0)) - log_total
return probs, oov_prob

# in main(...):
probs, oov_prob = (
    read_freqs(input_path, min_freq=min_word_frequency)
    if input_path is not None
    else ({}, -20)
)
```
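As a rough illustration of where such a value comes from, here is a toy add-alpha smoother; the script's actual `counts.smoother` and the corpus sizes below are assumptions, not the real implementation:

```python
import math

# Hypothetical corpus sizes and smoothing constant, for illustration only.
ALPHA = 1.0
TOTAL_TOKENS = 1_000_000
VOCAB_SIZE = 50_000

# Smoothed denominator: total tokens plus alpha mass spread over the vocabulary.
log_total = math.log(TOTAL_TOKENS + ALPHA * VOCAB_SIZE)

# Under add-alpha smoothing, an unseen (zero-frequency) word gets count ALPHA,
# so its log-probability is log(ALPHA) minus the smoothed log total.
oov_prob = math.log(ALPHA) - log_total
```

The point is only that `oov_prob` is a corpus-dependent quantity, which is why freezing it to one literal in the output is wrong.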

But the serialization step throws that value away:

```python
json.dumps({"lang": "en", "settings": {"oov_prob": -20.502029418945312}})
```

Root cause

The hardcoded float never changes: every lexeme file the script produces records the same OOV probability, regardless of the corpus that was analyzed. It also defeats the `-20` default chosen a few lines above when `input_path is None`: those two branches are supposed to produce distinct settings, but the output file is identical either way.

Fix

Write the `oov_prob` that was just computed (or the `-20` default if no input was supplied) instead of the frozen literal, so the serialized settings reflect the distribution the script was asked to convert.
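A minimal sketch of the corrected serialization, assuming the variable names from the snippets above (`oov_prob`, an English `lang` setting); the script's real write path may differ:

```python
import json

def serialize_settings(oov_prob: float, lang: str = "en") -> str:
    # Write the computed (or default) oov_prob instead of a frozen literal.
    return json.dumps({"lang": lang, "settings": {"oov_prob": oov_prob}})
```

Called with the `-20` default this yields `{"lang": "en", "settings": {"oov_prob": -20}}`; with a real input file it records whatever `read_freqs` computed.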

…type

export_umls_json.py prints per-concept summary statistics. The aliases
block pairs 'one alias' (== 1) with 'more than one alias' (> 1). The
types block pairs 'one type' (== 1) with 'more than one type' (>= 1),
so every single-type concept is counted under both with_one_type_count
and with_more_than_one_type_count, inflating the 'more than one type'
statistic by the number of single-type concepts.

Change >= 1 to > 1 to match the aliases pattern and the variable's name.
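The miscount is easy to see with a toy tally; the data and tallying code here are illustrative, not the script's actual variables:

```python
# Hypothetical concepts: concept id -> number of semantic types.
type_counts = {"C1": 1, "C2": 3, "C3": 1, "C4": 2}

with_one_type_count = sum(1 for n in type_counts.values() if n == 1)

# Buggy comparison: >= 1 also counts every single-type concept.
buggy_more_than_one = sum(1 for n in type_counts.values() if n >= 1)

# Fixed comparison: > 1 matches the variable's name and the aliases block.
fixed_more_than_one = sum(1 for n in type_counts.values() if n > 1)
```

Here the buggy tally reports 4 'more than one type' concepts where only 2 actually have more than one, inflated by exactly the 2 single-type concepts.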
…alue

scripts/convert_freqs.py computes the out-of-vocabulary probability
from the input frequencies in read_freqs:

    oov_prob = math.log(counts.smoother(0)) - log_total
    return probs, oov_prob

and main() already unpacks it:

    probs, oov_prob = (
        read_freqs(input_path, min_freq=min_word_frequency)
        if input_path is not None
        else ({}, -20)
    )

But the file write then ignores it and hardcodes a specific float:

    json.dumps({"lang": "en", "settings": {"oov_prob": -20.502029418945312}})

so every generated lexeme file always reports the same oov_prob
regardless of the corpus. This also defeats the `-20` default chosen
when `input_path is None`: the two branches are supposed to produce
distinct settings, yet the serialized file is identical in both cases.

Use the computed/selected `oov_prob` value so the serialized settings
reflect the actual distribution (or the explicit default when no input
is given).