A fast and feature-rich CTC beam search decoder for speech recognition written in Python, providing n-gram (kenlm) language model support similar to PaddlePaddle's decoder, but incorporating many new features such as byte pair encoding and real-time decoding to support models like Nvidia's [Conformer-CTC](tutorials/01_pipeline_nemo.ipynb) or Facebook's [Wav2Vec2](tutorials/02_asr_huggingface.ipynb).
```bash
pip install pyctcdecode
```
- 🔥 hotword boosting
- 🤖 handling of BPE vocabulary
- 👥 multi-LM support for 2+ models
- 🕒 stateful LM for real-time decoding
- ✨ native frame index annotation of words
- 💨 fast runtime, comparable to C++ implementation
- 🐍 easy-to-modify Python code
### Quick Start:
```python
from pyctcdecode import build_ctcdecoder

# build the decoder from the acoustic model's alphabet and an optional kenlm model
decoder = build_ctcdecoder(
    labels,
    kenlm_model_path,
)
text = decoder.decode(logits)
```
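For intuition, the simplest form of CTC decoding (best-path decoding, with no language model) just takes the argmax at each frame, collapses repeated indices, and drops blanks. The sketch below illustrates that idea in isolation; the labels and per-frame probabilities are made up for the example, and pyctcdecode's beam search additionally rescores candidate paths with the language model:

```python
# Hypothetical labels/probabilities for illustration only.
LABELS = ["", "a", "b", " "]  # index 0 is the CTC blank token

def greedy_ctc_decode(frames):
    """Best-path CTC decoding: argmax per frame, collapse repeats, drop blanks."""
    best_path = [max(range(len(f)), key=f.__getitem__) for f in frames]
    out, prev = [], -1
    for idx in best_path:
        if idx != prev and idx != 0:  # skip repeated indices and blanks
            out.append(LABELS[idx])
        prev = idx
    return "".join(out)

# per-frame probabilities favoring: a, a, blank, a, b, b
frames = [
    [0.1, 0.8, 0.05, 0.05],
    [0.1, 0.8, 0.05, 0.05],
    [0.9, 0.03, 0.03, 0.04],
    [0.1, 0.8, 0.05, 0.05],
    [0.1, 0.05, 0.8, 0.05],
    [0.1, 0.05, 0.8, 0.05],
]
print(greedy_ctc_decode(frames))  # aab
```

Note how the blank between the second and third "a"-frames is what separates the two "a" characters in the output.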
If the vocabulary is BPE based, adjust the labels and set the `is_bpe` flag (merging of tokens for the LM is handled automatically):
```python
labels = ["<unk>", "▁bug", "s", "▁bunny"]
decoder = build_ctcdecoder(
    labels,
    kenlm_model_path,
    is_bpe=True,
)
text = decoder.decode(logits)
```
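In SentencePiece-style BPE vocabularies like the one above, the "▁" marker denotes the start of a new word. The merging that pyctcdecode handles for the LM can be sketched as follows (an illustrative stand-alone function, not the library's internal code):

```python
def merge_bpe_tokens(tokens):
    """Join SentencePiece-style BPE pieces into words: '▁' marks a word start."""
    words = []
    for token in tokens:
        if token.startswith("▁"):
            words.append(token[1:])   # start a new word
        elif words:
            words[-1] += token        # continuation piece of the current word
        else:
            words.append(token)       # edge case: leading continuation piece
    return " ".join(words)

print(merge_bpe_tokens(["▁bug", "s", "▁bunny"]))  # bugs bunny
```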
Improve domain specificity by adding important contextual words ("hotwords") during inference:
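In pyctcdecode, hotwords are passed at decode time via the `hotwords` (and `hotword_weight`) arguments of `decoder.decode`. The underlying idea is a score bonus for hypotheses containing a hotword; the toy sketch below illustrates that idea in isolation (it is not the library's implementation, and the tuple-based beam list is invented for the example):

```python
def boost_hotwords(beams, hotwords, hotword_weight=10.0):
    """Toy hotword boosting: add a fixed log-score bonus for every hotword
    a hypothesis contains, then re-rank. `beams` is a list of
    (text, log_score) pairs, not pyctcdecode internals."""
    rescored = [
        (text, score + sum(hotword_weight for hw in hotwords if hw in text))
        for text, score in beams
    ]
    return sorted(rescored, key=lambda b: b[1], reverse=True)

# Without boosting, "bugs money" narrowly wins; boosting "bunny" flips the ranking.
beams = [("bugs bunny", -12.0), ("bugs money", -11.5)]
print(boost_hotwords(beams, ["bunny"])[0][0])  # bugs bunny
```

With a built decoder this corresponds to something like `decoder.decode(logits, hotwords=["bugs bunny"], hotword_weight=10.0)`.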
The tutorials folder contains many well documented notebook examples on how to run speech recognition using pretrained models from Nvidia's [NeMo](https://github.com/NVIDIA/NeMo) and Huggingface/Facebook's [Wav2Vec2](https://huggingface.co/transformers/model_doc/wav2vec2.html).
For more details on how to use all of pyctcdecode's features, have a look at our [main tutorial](tutorials/00_basic_usage.ipynb).
### Why pyctcdecode?
In scientific computing, there’s often a tension between a language’s performance and its ease of use for prototyping and experimentation. Although C++ is the conventional choice for CTC decoders, we decided to try building one in Python. This choice allowed us to easily implement experimental features, while keeping runtime competitive through optimizations like caching and beam pruning. We compare the performance of `pyctcdecode` to an industry-standard C++ decoder at various beam widths (shown as inline annotations), allowing us to visualize the trade-off of word error rate (y-axis) vs runtime (x-axis). For beam widths of 10 or greater, `pyctcdecode` yields strictly superior performance, with lower error rates in less time; see the code [here](tutorials/03_eval_performance.ipynb).
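As a rough illustration of one of those tricks, beam pruning keeps only the top-scoring partial hypotheses at each decoding step, which bounds the search cost. A minimal sketch (the tuple-based beam state is invented for illustration; the library also applies score-threshold pruning internally):

```python
import heapq

def prune_beams(beams, beam_width):
    """Keep only the `beam_width` highest-scoring partial hypotheses.
    `beams` is a list of (text, log_score) pairs, a toy stand-in for
    the decoder's internal beam state."""
    return heapq.nlargest(beam_width, beams, key=lambda b: b[1])

beams = [("a", -1.0), ("b", -0.5), ("c", -3.0), ("d", -2.0)]
print(prune_beams(beams, beam_width=2))  # [('b', -0.5), ('a', -1.0)]
```

Since pruning runs once per frame, using `heapq.nlargest` keeps the per-step cost at O(n log k) for n candidate beams and beam width k, rather than fully sorting all candidates.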
`pyctcdecode` can return either a single transcript or the full results of the beam search algorithm. The latter provide the language model state to enable real-time inference, as well as word-level logit indices (frames) for computing timing and confidence scores of individual words natively within the decoding process.
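As an illustration of what those frame indices enable, a word's logit-frame span can be converted to approximate start/end times once the acoustic model's frame duration is known. The 20 ms stride below is an assumed example value (it depends on the acoustic model, not on pyctcdecode):

```python
def frames_to_seconds(frame_span, seconds_per_frame=0.02):
    """Convert a (start_frame, end_frame) logit-index span to seconds.
    `seconds_per_frame` is the acoustic model's stride (20 ms assumed here)."""
    start, end = frame_span
    return (start * seconds_per_frame, end * seconds_per_frame)

# e.g. a word spanning logit frames 50..75 maps to roughly 1.0 s .. 1.5 s
print(frames_to_seconds((50, 75)))
```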
Additional features such as BPE vocabulary, as well as examples of `pyctcdecode` as part of a full speech recognition pipeline, can be found in the [tutorials section](tutorials).