Add callback to monitor progress in whisper transcription (#37483)
* Add callback to monitor progress in whisper transcription
* Added `` around variables, rewording
* Add example of `monitor_progress`.
---------
Co-authored-by: Eric B <[email protected]>
synced_gpus (`bool`, *optional*, defaults to `False`):
    Whether to continue running the while loop until max_length (needed to avoid deadlocking with `FullyShardedDataParallel` and DeepSpeed ZeRO Stage 3).
return_timestamps (`bool`, *optional*):
    Whether to return the timestamps with the text. This enables the `WhisperTimestampsLogitsProcessor`.
    For audio longer than 30 seconds, it is necessary to set `return_timestamps=True` (see the long-form sketch further below).
task (`str`, *optional*):
    Task to use for generation, either "translate" or "transcribe".
language (`str` or list of `str`, *optional*):
@@ -533,14 +535,19 @@ def generate(
force_unique_generate_call (`bool`, *optional*):
    Whether to force a unique call to the underlying GenerationMixin's [`~generation.GenerationMixin.generate`] method. This is useful for assisted decoding and testing purposes to ensure that only one call to [`~generation.GenerationMixin.generate`] is made and therefore decoder input token ids and eos token ids are returned.
monitor_progress (`Callable`, *optional*):
    If provided, this function can be called to report the progress of the audio transcription. The function takes a tensor argument `p` of shape `(n, 2)`, where `n` is the batch size. `p[i, 0]` contains the index of the audio frame that is currently being transcribed for batch item `i`. `p[i, 1]` contains the total number of frames for batch item `i`. No return value is expected.
kwargs (`dict[str, Any]`, *optional*):
    Ad hoc parametrization of `generate_config` and/or additional model-specific kwargs that will be forwarded to the `forward` function of the model. If the model is an encoder-decoder model, encoder specific kwargs should not be prefixed and decoder specific kwargs should be prefixed with *decoder_*.
Return:
    [`~utils.ModelOutput`] or `dict[str, Any]` or `torch.LongTensor`:

    One of the following:
        - [`~utils.ModelOutput`] when `return_dict_in_generate=True` and (`return_timestamps=False` or `force_unique_generate_call=True`), including the decoder input ids and end of sequence id.
        - `dict[str, Any]` when (`return_dict_in_generate=True` and `return_timestamps=True`) or `return_segments=True` or `return_token_timestamps=True`.
        - `torch.LongTensor` in all other cases, excluding the decoder input ids and end of sequence id.
" Folks, if you watch the show, you know, I spent a lot of time right over there. Patiently and astutely scrutinizing the boxwood and mahogany chest set of the day's biggest stories developing the central headline pawns, definitely maneuvering an oso topical night to F6, fainting a classic Sicilian, nade door variation on the news, all the while seeing eight moves deep and patiently marshalling the latest press releases into a fisher's shows in Lip Nitsky attack that culminates in the elegant lethal slow-played, all-passant checkmate that is my nightly monologue. But sometimes, sometimes, folks, I. CHEERING AND APPLAUSE Sometimes I startle away, cubside down in the monkey bars of a condemned playground on a super fun site. Get all hept up on goofballs. Rummage that were discarded tag bag of defective toys. Yank out a fist bowl of disembodied doll limbs, toss them on a stained kid's place mat from a defunct dennies. set up a table inside a rusty cargo container down by the Wharf and challenged toothless drifters to the godless bughouse blitz of tournament that is my segment. Meanwhile."
```
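For context, a minimal long-form setup that produces an output like the transcription shown just above might look as follows. This is only a sketch: `raw_audio` (a list of 16 kHz waveforms longer than 30 seconds) and the checkpoint name are illustrative, while the processor and model calls use the standard `WhisperProcessor` / `WhisperForConditionalGeneration` API:

```python
>>> from transformers import WhisperProcessor, WhisperForConditionalGeneration

>>> processor = WhisperProcessor.from_pretrained("openai/whisper-tiny.en")
>>> model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny.en")

>>> # long-form inputs: do not truncate, pad to the longest audio and return the attention mask
>>> inputs = processor(
...     raw_audio, sampling_rate=16_000, return_tensors="pt",
...     truncation=False, padding="longest", return_attention_mask=True,
... )

>>> # audio longer than 30 seconds requires `return_timestamps=True`
>>> pred_ids = model.generate(**inputs, return_timestamps=True)
>>> transcription = processor.batch_decode(pred_ids, skip_special_tokens=True)
```

Passing `return_dict_in_generate=True` (or the other flags listed in the Return section above) changes the return type as described there.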
The `monitor_progress` callback can be used to monitor the progress of the transcription:
```python
>>> from tqdm import tqdm

>>> # prepare inputs like above

>>> # define a callback to monitor the progress of the transcription
>>> # (the callback body below is an illustrative sketch; only the `(n, 2)` tensor signature is documented)
>>> pbar = None
>>> def monitor_progress(p):
...     global pbar
...     if pbar is None:
...         pbar = tqdm(total=p[0, 1].item())
...     pbar.update(p[0, 0].item() - pbar.n)

>>> pred_ids = model.generate(**inputs, return_timestamps=True, monitor_progress=monitor_progress)
>>> pbar.close()
>>> transcription = processor.batch_decode(pred_ids, skip_special_tokens=True)
>>> transcription[0]
" Folks, if you watch the show, you know, I spent a lot of time right over there. Patiently and astutely scrutinizing the boxwood and mahogany chest set of the day's biggest stories developing the central headline pawns, definitely maneuvering an oso topical night to F6, fainting a classic Sicilian, nade door variation on the news, all the while seeing eight moves deep and patiently marshalling the latest press releases into a fisher's shows in Lip Nitsky attack that culminates in the elegant lethal slow-played, all-passant checkmate that is my nightly monologue. But sometimes, sometimes, folks, I. CHEERING AND APPLAUSE Sometimes I startle away, cubside down in the monkey bars of a condemned playground on a super fun site. Get all hept up on goofballs. Rummage that were discarded tag bag of defective toys. Yank out a fist bowl of disembodied doll limbs, toss them on a stained kid's place mat from a defunct dennies. set up a table inside a rusty cargo container down by the Wharf and challenged toothless drifters to the godless bughouse blitz of tournament that is my segment. Meanwhile."
```
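For batched inputs, the same `(n, 2)` tensor can also be reduced to a single overall figure. The reduction below is a hypothetical variant; only the callback signature comes from the parameter description above:

```python
>>> # hypothetical batch-wide report: p[:, 0] holds current frame indices, p[:, 1] holds total frame counts
>>> def monitor_progress(p):
...     done, total = p[:, 0].sum().item(), p[:, 1].sum().item()
...     print(f"transcribed {done}/{total} frames ({100 * done / total:.0f}%)")
```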
- *Shortform transcription*: If passed mel input features are <= 30 seconds, there are two possibilities (a short-form sketch follows this list):
    - `return_timestamps=False`: the whole audio will be transcribed with a single call to GenerationMixin's [`~generation.GenerationMixin.generate`].
    - `return_timestamps=True`: the audio will be transcribed using the same logic as long-form transcription.
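A short-form call, for comparison, needs no special flags. A minimal sketch, assuming `short_audio` is a waveform of at most 30 seconds and reusing the processor and model from the sketch above:

```python
>>> # short-form: a single call to `generate`, timestamps optional
>>> inputs = processor(short_audio, sampling_rate=16_000, return_tensors="pt")
>>> pred_ids = model.generate(inputs.input_features)
>>> processor.batch_decode(pred_ids, skip_special_tokens=True)[0]
```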
@@ -763,6 +794,9 @@ def generate(
# 6 Transcribe audio until we reach the end of all input audios