
Commit 24b61ba

Modify kenlm dependency for pypi compatibility (#2)
1 parent 60bbb29 commit 24b61ba

7 files changed

Lines changed: 44 additions & 32 deletions


.coveragerc

Lines changed: 1 addition & 3 deletions
```diff
@@ -1,8 +1,6 @@
 [run]
+omit = tests/*
 dynamic_context = test_function
-omit =
-    # No coverage for tests
-    pyctcdecode/tests/*
 
 [report]
 # Regexes for lines to exclude from consideration
```

.github/workflows/tests_and_lint.yml

Lines changed: 2 additions & 0 deletions
```diff
@@ -29,6 +29,7 @@ jobs:
       - name: Install dependencies
         run: |
           python -m pip install --upgrade pip
+          pip install https://github.com/kpu/kenlm/archive/master.zip
           pip install -e .[dev]
       - name: Run lint checks
         run: |
@@ -47,6 +48,7 @@ jobs:
       - name: Install dependencies
         run: |
           python -m pip install --upgrade pip
+          pip install https://github.com/kpu/kenlm/archive/master.zip
           pip install -e .[dev]
       - name: Test with pytest
         run: |
```

README.md

Lines changed: 21 additions & 15 deletions
````diff
@@ -1,10 +1,12 @@
-<a href="http://www.repostatus.org/#active"><img src="http://www.repostatus.org/badges/latest/active.svg" /></a>
+<a href="https://github.com/kensho-technologies/pyctcdecode/actions?query=workflow%3A%22Tests+and+lint%22"><img src="https://github.com/kensho-technologies/pyctcdecode/workflows/Tests%20and%20lint/badge.svg" /></a>
+<a href="https://codecov.io/gh/kensho-technologies/pyctcdecode"><img src="https://codecov.io/gh/kensho-technologies/pyctcdecode/branch/main/graph/badge.svg" /></a>
 <a href="https://opensource.org/licenses/Apache-2.0"><img src="https://img.shields.io/badge/License-Apache%202.0-blue.svg" /></a>
+<a href="http://www.repostatus.org/#active"><img src="http://www.repostatus.org/badges/latest/active.svg" /></a>
 <a href="https://github.com/psf/black"><img src="https://img.shields.io/badge/code%20style-black-000000.svg" /></a>
 
 ## pyctcdecode
 
-A fast and feature-rich CTC beam search decoder for speech recognition written in Python, offering n-gram (kenlm) language model support similar to DeepSpeech, but incorporating many new features such as byte pair encoding to support modern architectures like Nvidia's [Conformer-CTC](tutorials/01_pipeline_nemo.ipynb) or Facebooks's [Wav2Vec2](tutorials/02_asr_huggingface.ipynb).
+A fast and feature-rich CTC beam search decoder for speech recognition written in Python, providing n-gram (kenlm) language model support similar to PaddlePaddle's decoder, but incorporating many new features such as byte pair encoding and real-time decoding to support models like Nvidia's [Conformer-CTC](tutorials/01_pipeline_nemo.ipynb) or Facebook's [Wav2Vec2](tutorials/02_asr_huggingface.ipynb).
 
 ``` bash
 pip install pyctcdecode
@@ -15,10 +17,10 @@ pip install pyctcdecode
 - 🔥 hotword boosting
 - 🤖 handling of BPE vocabulary
 - 👥 multi-LM support for 2+ models
-- 🕒 stateful LM for realtime decoding
+- 🕒 stateful LM for real-time decoding
 - ✨ native frame index annotation of words
 - 💨 fast runtime, comparable to C++ implementation
-- 🐍 easy to modify Python code
+- 🐍 easy-to-modify Python code
 
 ### Quick Start:
 
@@ -45,7 +47,7 @@ decoder = build_ctcdecoder(
 text = decoder.decode(logits)
 ```
 
-if the vocabulary is BPE based, adjust the labels and set the `is_bpe` flag (merging of tokens for the LM is handled automatically):
+If the vocabulary is BPE based, adjust the labels and set the `is_bpe` flag (merging of tokens for the LM is handled automatically):
 
 ``` python
 labels = ["<unk>", "▁bug", "s", "▁bunny"]
@@ -58,14 +60,18 @@ decoder = build_ctcdecoder(
 text = decoder.decode(logits)
 ```
 
-improve domain specificity by adding hotwords during inference:
+Improve domain specificity by adding important contextual words ("hotwords") during inference:
 
 ``` python
 hotwords = ["looney tunes", "anthropomorphic"]
-text = decoder.decode(logits, hotwords=hotwords)
+text = decoder.decode(
+    logits,
+    hotwords=hotwords,
+    hotwords_weight=10.0,
+)
 ```
 
-batch support via multiprocessing:
+Batch support via multiprocessing:
 
 ``` python
 from multiprocessing import Pool
@@ -74,7 +80,7 @@ with Pool() as pool:
 text_list = decoder.decode_batch(logits_list, pool)
 ```
 
-use `pyctcdecode` for a production Conformer-CTC model:
+Use `pyctcdecode` for a pretrained Conformer-CTC model:
 
 ``` python
 import nemo.collections.asr as nemo_asr
@@ -88,25 +94,25 @@ decoder = build_ctcdecoder(asr_model.decoder.vocabulary, is_bpe=True)
 decoder.decode(logits)
 ```
 
-The tutorials folder contains many well documented notebook examples on how to run speech recognition from scratch using pretrained models from Nvidia's [NeMo](https://github.com/NVIDIA/NeMo) and Huggingface/Facebook's [Wav2Vec2](https://huggingface.co/transformers/model_doc/wav2vec2.html).
+The tutorials folder contains many well documented notebook examples on how to run speech recognition using pretrained models from Nvidia's [NeMo](https://github.com/NVIDIA/NeMo) and Huggingface/Facebook's [Wav2Vec2](https://huggingface.co/transformers/model_doc/wav2vec2.html).
 
 For more details on how to use all of pyctcdecode's features, have a look at our [main tutorial](tutorials/00_basic_usage.ipynb).
 
 ### Why pyctcdecode?
 
-The flexibility of using Python allows us to implement various new features while keeping runtime competitive through little tricks like caching and beam pruning. When comparing pyctcdecode's runtime and accuracy to a standard C++ decoders, we see favorable trade offs between speed and accuracy, see code [here](tutorials/03_eval_performance.ipynb).
+In scientific computing, there’s often a tension between a language’s performance and its ease of use for prototyping and experimentation. Although C++ is the conventional choice for CTC decoders, we decided to try building one in Python. This choice allowed us to easily implement experimental features, while keeping runtime competitive through optimizations like caching and beam pruning. We compare the performance of `pyctcdecode` to an industry standard C++ decoder at various beam widths (shown as inline annotations), allowing us to visualize the trade-off of word error rate (y-axis) vs runtime (x-axis). For beam widths of 10 or greater, pyctcdecode yields strictly superior performance, with lower error rates in less time, see code [here](tutorials/03_eval_performance.ipynb).
 
 <p align="center"><img src="docs/images/performance.png"></p>
 
-Python also allows us to do nifty things like hotword support (at inference time) with only a few lines of code.
+The use of Python allows us to easily implement features like hotword support with only a few lines of code.
 
 <p align="center"><img width="800px" src="docs/images/hotwords.png"></p>
 
-The full beam results contain the language model state to enable real time inference as well as word based logit indices (frames) to calculate timing and confidence scores of individual words natively through the decoding process.
-
+`pyctcdecode` can return either a single transcript, or the full results of the beam search algorithm. The latter provides the language model state to enable real-time inference as well as word-based logit indices (frames) to enable word-based timing and confidence score calculations natively through the decoding process.
+
 <p align="center"><img width="450px" src="docs/images/beam_output.png"></p>
 
-Additional features such as BPE vocabulary as well as examples of pyctcdecode as part of a full speech recognition pipeline can be found in the [tutorials section](tutorials).
+Additional features such as BPE vocabulary, as well as examples of `pyctcdecode` as part of a full speech recognition pipeline, can be found in the [tutorials section](tutorials).
 
 ### Resources:
 
````

pyctcdecode/decoder.py

Lines changed: 7 additions & 3 deletions
```diff
@@ -25,10 +25,16 @@
 from .language_model import AbstractLanguageModel, HotwordScorer, LanguageModel
 
 
+logger = logging.getLogger(__name__)
+
+
 try:
     import kenlm  # type: ignore
 except ImportError:
-    pass
+    logger.warning(
+        "kenlm python bindings are not installed. Most likely you want to install it using: "
+        "pip install https://github.com/kpu/kenlm/archive/master.zip"
+    )
 
 
 # type hints
@@ -53,8 +59,6 @@
 NULL_FRAMES: Frames = (-1, -1)  # placeholder that gets replaced with positive integer frame indices
 EMPTY_START_BEAM: Beam = ("", "", "", None, [], NULL_FRAMES, 0.0)
 
-logger = logging.getLogger(__name__)
-
 
 def _normalize_whitespace(text: str) -> str:
     """Efficiently normalize whitespace."""
```
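The hunk above replaces a silent `except ImportError: pass` with a logged warning, so users who skip the manual kenlm install get an actionable hint at import time instead of a confusing failure later. A minimal sketch of this optional-dependency pattern (the `optional_import` helper is illustrative, not part of pyctcdecode):

```python
import importlib
import logging

logger = logging.getLogger(__name__)


def optional_import(module_name: str, install_hint: str):
    """Import an optional dependency, logging a warning instead of failing."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        logger.warning("%s is not installed. %s", module_name, install_hint)
        return None


# kenlm stays optional: decoding without a language model still works.
kenlm = optional_import(
    "kenlm",
    "Most likely you want to install it using: "
    "pip install https://github.com/kpu/kenlm/archive/master.zip",
)
```

The package remains importable either way; only the LM-backed code paths depend on the binding actually being present.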

pyctcdecode/language_model.py

Lines changed: 8 additions & 1 deletion
```diff
@@ -2,6 +2,7 @@
 from __future__ import division
 
 import abc
+import logging
 import re
 from typing import Iterable, List, Optional, Pattern, Tuple
 
@@ -19,10 +20,16 @@
 )
 
 
+logger = logging.getLogger(__name__)
+
+
 try:
     import kenlm  # type: ignore
 except ImportError:
-    pass
+    logger.warning(
+        "kenlm python bindings are not installed. Most likely you want to install it using: "
+        "pip install https://github.com/kpu/kenlm/archive/master.zip"
+    )
 
 
 def _get_empty_lm_state() -> kenlm.State:
```
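Note that with this pattern the warning fires once at import, but functions that reference `kenlm` (such as `_get_empty_lm_state` in the context line above) still fail only when language-model features are actually exercised. A hypothetical sketch of converting that late `NameError` into a clearer error (the guard shown is illustrative, not what the library does):

```python
try:
    import kenlm  # optional dependency; may be absent
except ImportError:
    kenlm = None


def get_empty_lm_state():
    """Return a fresh kenlm state, with a clear error if kenlm is missing."""
    if kenlm is None:
        raise ValueError(
            "kenlm python bindings are not installed; "
            "language-model scoring is unavailable"
        )
    return kenlm.State()
```

This keeps the failure localized to the LM feature that needs the binding, while the rest of the decoder remains usable.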

setup.py

Lines changed: 0 additions & 5 deletions
```diff
@@ -7,10 +7,6 @@
 from setuptools import find_packages, setup  # type: ignore
 
 
-# https://packaging.python.org/guides/single-sourcing-package-version/
-# #single-sourcing-the-version
-
-
 logger = logging.getLogger(__name__)
 
 
@@ -57,7 +53,6 @@ def find_long_description():
         "codecov",
         "flake8",
         "jupyter",
-        "kenlm@https://github.com/kpu/kenlm/archive/master.zip",
         "mypy",
         "nbconvert",
         "nbformat",
```
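The `kenlm@https://...` entry is dropped because PyPI rejects uploads whose dependencies use PEP 508 direct URL references, so kenlm must instead be installed manually from GitHub (as the updated CI workflow now does). An illustrative sketch of the resulting extras list (not the actual `setup.py`):

```python
# Illustrative: extras_require without direct-URL dependencies, which PyPI
# rejects for uploaded releases ("name @ url" PEP 508 direct references).
extras_require = {
    "dev": [
        "codecov",
        "flake8",
        "jupyter",
        # "kenlm@https://github.com/kpu/kenlm/archive/master.zip",  # removed
        "mypy",
        "nbconvert",
        "nbformat",
    ],
}

# Sanity check: no requirement contains a direct URL reference.
assert not any("@" in requirement for requirement in extras_require["dev"])
```

The trade-off is that `pip install pyctcdecode[dev]` no longer pulls in kenlm automatically, hence the logged warnings added in `decoder.py` and `language_model.py`.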

tutorials/02_pipeline_huggingface.ipynb

Lines changed: 5 additions & 5 deletions
```diff
@@ -4,7 +4,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "## How to use pyctcdecode when working with a Hugginface model"
+    "## How to use pyctcdecode when working with a Huggingface model"
    ]
   },
   {
@@ -74,9 +74,9 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "The vocabulary is in a slighly unconventional shape so we will replace `\"<pad>\"` with `\"\"` and `\"|\"` with `\" \"` as well as the other special tokens (which are essentially unused)\n",
+    "The vocabulary is in a slightly unconventional shape so we will replace `\"<pad>\"` with `\"\"` and `\"|\"` with `\" \"` as well as the other special tokens (which are essentially unused)\n",
     "\n",
-    "We need to standaradize the special tokens and then specifically pass which index is the ctc blank token index (since it's not the last). For that reason we have to manually build the Alphabet and the decoder instead of using the convenience wrapper `build_ctcdecoder`."
+    "We need to standardize the special tokens and then specifically pass which index is the ctc blank token index (since it's not the last). For that reason we have to manually build the Alphabet and the decoder instead of using the convenience wrapper `build_ctcdecoder`."
    ]
   },
   {
@@ -108,8 +108,8 @@
     "vocab_list[3] = \"\"\n",
     "# convert space character representation\n",
     "vocab_list[4] = \" \"\n",
-    "# specify ctc blank char index, since conventially it is the last entry of the logit matrix\n",
-    "alphabet = Alphabet.build_bpe_alphabet(vocab_list, ctc_token_idx=0)\n",
+    "# specify ctc blank char index, since conventionally it is the last entry of the logit matrix\n",
+    "alphabet = Alphabet.build_alphabet(vocab_list, ctc_token_idx=0)\n",
     "\n",
     "# build the decoder and decode the logits\n",
     "decoder = BeamSearchDecoderCTC(alphabet)\n",
```
