Fix convert_freqs.py writing hardcoded oov_prob instead of computed value#576

Open
Chessing234 wants to merge 2 commits intoallenai:mainfrom
Chessing234:fix/convert-freqs-hardcoded-oov-prob
Conversation

@Chessing234

Bug

`scripts/convert_freqs.py` computes an OOV probability from the input frequency file and unpacks it in `main`:

```python
# in read_freqs(...):
oov_prob = math.log(counts.smoother(0)) - log_total
return probs, oov_prob

# in main(...):
probs, oov_prob = (
    read_freqs(input_path, min_freq=min_word_frequency)
    if input_path is not None
    else ({}, -20)
)
```
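As a rough illustration of where such a value comes from, here is a toy add-alpha smoother; the script's actual `counts.smoother` and the corpus sizes below are assumptions, not the real implementation:

```python
import math

# Hypothetical corpus sizes and smoothing constant, for illustration only.
ALPHA = 1.0
TOTAL_TOKENS = 1_000_000
VOCAB_SIZE = 50_000

# Smoothed denominator: total tokens plus alpha mass spread over the vocabulary.
log_total = math.log(TOTAL_TOKENS + ALPHA * VOCAB_SIZE)

# Under add-alpha smoothing, an unseen (zero-frequency) word gets count ALPHA,
# so its log-probability is log(ALPHA) minus the smoothed log total.
oov_prob = math.log(ALPHA) - log_total
```

The point is only that `oov_prob` is a corpus-dependent quantity, which is why freezing it to one literal in the output is wrong.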

But the serialization step throws that value away:

```python
json.dumps({"lang": "en", "settings": {"oov_prob": -20.502029418945312}})
```

Root cause

The hardcoded float never changes: every lexeme file the script produces records the same OOV probability, regardless of the corpus that was analyzed. It also defeats the `-20` default chosen a few lines above when `input_path is None`: those two branches are supposed to produce distinct settings, but the output file is identical either way.

Fix

Write the `oov_prob` that was just computed (or the `-20` default if no input was supplied) instead of the frozen literal, so the serialized settings reflect the distribution the script was asked to convert.
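A minimal sketch of the corrected serialization, assuming the variable names from the snippets above (`oov_prob`, an English `lang` setting); the script's real write path may differ:

```python
import json

def serialize_settings(oov_prob: float, lang: str = "en") -> str:
    # Write the computed (or default) oov_prob instead of a frozen literal.
    return json.dumps({"lang": lang, "settings": {"oov_prob": oov_prob}})
```

Called with the `-20` default this yields `{"lang": "en", "settings": {"oov_prob": -20}}`; with a real input file it records whatever `read_freqs` computed.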

…type

export_umls_json.py prints per-concept summary statistics. The aliases
block pairs 'one alias' (== 1) with 'more than one alias' (> 1). The
types block pairs 'one type' (== 1) with 'more than one type' (>= 1),
so every single-type concept is counted under both with_one_type_count
and with_more_than_one_type_count, inflating the 'more than one type'
statistic by the number of single-type concepts.

Change >= 1 to > 1 to match the aliases pattern and the variable's name.
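The miscount is easy to see with a toy tally; the data and tallying code here are illustrative, not the script's actual variables:

```python
# Hypothetical concepts: concept id -> number of semantic types.
type_counts = {"C1": 1, "C2": 3, "C3": 1, "C4": 2}

with_one_type_count = sum(1 for n in type_counts.values() if n == 1)

# Buggy comparison: >= 1 also counts every single-type concept.
buggy_more_than_one = sum(1 for n in type_counts.values() if n >= 1)

# Fixed comparison: > 1 matches the variable's name and the aliases block.
fixed_more_than_one = sum(1 for n in type_counts.values() if n > 1)
```

Here the buggy tally reports 4 'more than one type' concepts where only 2 actually have more than one, inflated by exactly the 2 single-type concepts.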
…alue

scripts/convert_freqs.py computes the out-of-vocabulary probability
from the input frequencies in read_freqs:

    oov_prob = math.log(counts.smoother(0)) - log_total
    return probs, oov_prob

and main() already unpacks it:

    probs, oov_prob = (
        read_freqs(input_path, min_freq=min_word_frequency)
        if input_path is not None
        else ({}, -20)
    )

But the file write then ignores it and hardcodes a specific float:

    json.dumps({"lang": "en", "settings": {"oov_prob": -20.502029418945312}})

so every generated lexeme file always reports the same oov_prob
regardless of the corpus. This also defeats the `-20` default chosen
when `input_path is None`: the two branches are supposed to produce
distinct settings, yet the serialized file is identical in both cases.

Use the computed/selected `oov_prob` value so the serialized settings
reflect the actual distribution (or the explicit default when no input
is given).