Speech-to-Text transcription tool for Swiss German and other languages using:
- Faster-Whisper for fast ASR transcription
- PyAnnote for Voice Activity Detection (VAD) and speaker diarization
- CTC Forced Alignment for word-level timestamps using wav2vec2 models

Features:

- 🎤 Speaker Diarization: Automatically identify different speakers in audio
- 📝 Word-level Timestamps: Precise timing for each word via CTC alignment
- 🔊 Voice Activity Detection: Segment audio by speech regions
- 📊 Detailed Logging: Comprehensive logs for avg_logprob, alignment scores, speaker info
- 🎬 SRT Output: Standard subtitle format with optional speaker labels
```bash
# Clone the repository
git clone https://github.com/i4Ds/stt4sg-transcribe
cd stt4sg-transcribe

# Create virtual environment with uv
uv venv
source .venv/bin/activate

# Install dependencies
uv sync
```

PyAnnote models require a HuggingFace token. Set the HF_TOKEN environment variable:

```bash
export HF_TOKEN="your-huggingface-token"
```

Or pass it via the `--hf-token` argument.
```bash
# Transcribe with default settings (VAD on, diarization off)
uv run main.py audio.mp3

# Specify language
uv run main.py audio.mp3 -l de

# Custom output path
uv run main.py audio.mp3 -o output.srt

# Enable speaker diarization
uv run main.py audio.mp3 --diarization

# Disable VAD
uv run main.py audio.mp3 --no-vad

# Disable CTC alignment
uv run main.py audio.mp3 --no-alignment

# Specify number of speakers (requires diarization)
uv run main.py audio.mp3 --diarization -n 2

# Use a specific Whisper model
uv run main.py audio.mp3 -m medium

# Use CPU instead of GPU
uv run main.py audio.mp3 --device cpu
```

```text
usage: main.py [-h] [-o OUTPUT] [--output-dir OUTPUT_DIR] [-m MODEL] [-l LANGUAGE]
               [--task {transcribe,translate}] [--log-progress]
               [--no-vad] [--vad-method VAD_METHOD] [--vad-params VAD_PARAMS]
               [--diarization] [--diarization-method DIARIZATION_METHOD]
               [--diarization-params DIARIZATION_PARAMS] [-n NUM_SPEAKERS]
               [--min-speakers MIN_SPEAKERS] [--max-speakers MAX_SPEAKERS]
               [--no-alignment] [--alignment-model ALIGNMENT_MODEL]
               [--batch-size BATCH_SIZE]
               [--device {cuda,cpu}] [--compute-type {float16,float32,int8}]
               [--no-speaker-labels] [--no-logs] [--hf-token HF_TOKEN]
               audio_path
```
Arguments:

- `audio_path`: Path to the audio file
- `-o, --output OUTPUT`: Output SRT path (default: `outputs/srt/<filename>.srt`)
- `--output-dir OUTPUT_DIR`: Output folder for SRTs (default: `outputs/srt`)
- `-m, --model MODEL`: Whisper model (default: `large-v3`)
- `-l, --language LANGUAGE`: Language code (auto-detect if not specified)
- `--task {transcribe,translate}`: Task to perform (default: `transcribe`)
- `--log-progress`: Log transcription progress

VAD & Diarization:

- `--no-vad`: Disable VAD (enabled by default)
- `--vad-method VAD_METHOD`: VAD method: `pyannote` | `speechbrain` | `nemo` | `silero` (default: `silero`)
- `--vad-params VAD_PARAMS`: JSON dict of VAD parameters
- `--diarization`: Enable speaker diarization (implies VAD)
- `--diarization-method METHOD`: Diarization method: `pyannote` | `nemo` | `speechbrain` (default: `pyannote`)
- `--diarization-params PARAMS`: JSON dict of diarization parameters
- `-n, --num-speakers NUM`: Number of speakers
- `--min-speakers MIN`: Minimum number of speakers (default: 2)
- `--max-speakers MAX`: Maximum number of speakers

Alignment:

- `--no-alignment`: Disable CTC alignment
- `--alignment-model MODEL`: Custom alignment model

Performance:

- `--batch-size BATCH_SIZE`: Batch size for transcription (enables batched inference and **disables** `condition_on_previous_text`)

Device:

- `--device {cuda,cpu}`: Device to use for processing
- `--compute-type TYPE`: Compute type: `float16` | `float32` | `int8`

Output:

- `--no-speaker-labels`: Disable speaker labels in the SRT output
- `--no-logs`: Don't save log files

Auth:

- `--hf-token HF_TOKEN`: HuggingFace token (or set the `HF_TOKEN` environment variable)
Silero VAD supports `threshold` and `neg_threshold`:

```bash
--vad-method silero --vad-params '{"threshold": 0.5, "neg_threshold": 0.365}'
```

PyAnnote VAD supports `min_duration_on` and `min_duration_off` in this setup:

```bash
--vad-method pyannote --vad-params '{"min_duration_on": 0.0, "min_duration_off": 0.0}'
```

PyAnnote diarization supports `pipeline_params`, including segmentation settings:

```bash
--diarization-params '{"pipeline_params": {"segmentation": {"threshold": 0.44, "min_duration_off": 0.58}}}'
```
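For larger parameter sets it can be easier to build the JSON from Python than to quote it by hand. A minimal sketch, using only the flags documented above (the subprocess wrapper itself is not part of the project):

```python
import json
import subprocess

# Build the parameter dicts in Python so the JSON passed to the CLI is
# guaranteed to be well-formed, then invoke the tool with the documented flags.
vad_params = {"threshold": 0.5, "neg_threshold": 0.365}
diarization_params = {
    "pipeline_params": {"segmentation": {"threshold": 0.44, "min_duration_off": 0.58}}
}

subprocess.run(
    [
        "uv", "run", "main.py", "audio.mp3",
        "--diarization",
        "--vad-method", "silero",
        "--vad-params", json.dumps(vad_params),
        "--diarization-params", json.dumps(diarization_params),
    ],
    check=True,
)
```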
Output:

- Default: `outputs/srt/<filename>.srt`
- Custom path via the `-o` argument
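For orientation, a cue in the generated SRT might look like the following when speaker labels are enabled (purely illustrative; the exact label format is determined by the tool):

```text
1
00:00:01,240 --> 00:00:04,180
[SPEAKER_00]: Grüezi mitenand und willkommen.
```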
All logs are saved under `logs/timestamps/` with timestamped filenames:
- `*_vad_diarization.json`: Speech segments and speaker info
- `*_transcription.json`: Transcription with avg_logprob, no_speech_prob
- `*_alignment.json`: Word-level timestamps with alignment confidence scores
- `*_speaker_alignment.json`: Speaker assignments per segment
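Since the logs are plain JSON, they can be inspected programmatically. A minimal sketch that surfaces the least confident segments from a transcription log; the file name and field layout shown here are assumptions based on the fields listed above:

```python
import json
from pathlib import Path

# Hypothetical example file name; real log files are timestamped per run.
log_path = Path("logs/timestamps/audio_transcription.json")

with log_path.open(encoding="utf-8") as f:
    segments = json.load(f)  # assumed: a list of per-segment dicts

# Print the ten least confident segments, using the avg_logprob field
# mentioned in the log description above.
for seg in sorted(segments, key=lambda s: s.get("avg_logprob", 0.0))[:10]:
    print(f'{seg.get("avg_logprob", 0.0):.2f}  {seg.get("text", "").strip()}')
```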
Pipeline:

- VAD (PyAnnote): Detect speech segments to guide transcription
- Transcription (Faster-Whisper): Transcribe using VAD segment timestamps
- CTC Alignment (wav2vec2): Generate word-level timestamps
- Diarization (PyAnnote): Identify different speakers
- Speaker Assignment: Map speakers to segments with purity calculation (sketched after this list)
- SRT Generation: Create subtitle file with speaker labels
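The speaker-assignment step can be pictured as follows: for each transcription segment, sum its overlap with every diarization turn, pick the dominant speaker, and report the fraction of overlapped speech that speaker accounts for as a purity score. A simplified sketch of the idea, not the project's exact implementation:

```python
# Sketch of speaker assignment with a purity score: the fraction of a
# segment's overlapped speech attributable to its dominant speaker.
def assign_speaker(segment, turns):
    """segment: (start, end); turns: list of (start, end, speaker)."""
    seg_start, seg_end = segment
    overlap = {}
    for turn_start, turn_end, speaker in turns:
        dur = min(seg_end, turn_end) - max(seg_start, turn_start)
        if dur > 0:
            overlap[speaker] = overlap.get(speaker, 0.0) + dur
    if not overlap:
        return None, 0.0
    speaker = max(overlap, key=overlap.get)
    purity = overlap[speaker] / sum(overlap.values())
    return speaker, purity

# Example: a 2-5 s segment overlapped mostly by SPEAKER_00
print(assign_speaker((2.0, 5.0), [(0.0, 4.0, "SPEAKER_00"), (4.0, 6.0, "SPEAKER_01")]))
# -> ('SPEAKER_00', 0.666...)
```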
Python API:

```python
from pipeline import TranscriptionPipeline, TranscriptionConfig
config = TranscriptionConfig(
whisper_model="large-v3",
language="de",
use_vad=True,
use_diarization=False,
use_alignment=True,
)
pipeline = TranscriptionPipeline(config)
result = pipeline.transcribe("audio.mp3")
print(result["srt_content"])
```
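The returned `srt_content` string can be written straight to an `.srt` file. The sketch below reuses only names that appear in the example above, this time with diarization switched on:

```python
# Same API as above, with diarization enabled and the result saved to disk.
config = TranscriptionConfig(
    whisper_model="large-v3",
    language="de",
    use_vad=True,
    use_diarization=True,
    use_alignment=True,
)
pipeline = TranscriptionPipeline(config)
result = pipeline.transcribe("audio.mp3")

with open("audio.srt", "w", encoding="utf-8") as f:
    f.write(result["srt_content"])
```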
When running PyTorch, Faster-Whisper, or similar GPU libraries, you may encounter errors such as:

```text
Unable to load libcudnn_cnn.so.9*
Invalid handle. Cannot load symbol cudnnCreateConvolutionDescriptor
```
This usually means either:
- cuDNN 9.x is not installed, or
- The libraries are installed but not visible in your `LD_LIBRARY_PATH`.

Add them to your library path:

```bash
export LD_LIBRARY_PATH="$(python3 -c 'import importlib.util,os;pkgs=["nvidia.cublas","nvidia.cudnn"];print(":".join([os.path.join((spec.submodule_search_locations[0] if (spec:=importlib.util.find_spec(p)) and spec.submodule_search_locations else os.path.dirname(spec.origin)),"lib") for p in pkgs]))'):$LD_LIBRARY_PATH"
```

or, if you use fish:

```fish
set -x LD_LIBRARY_PATH (python3 -c 'import importlib.util,os;pkgs=["nvidia.cublas","nvidia.cudnn"];print(":".join([os.path.join((spec.submodule_search_locations[0] if (spec:=importlib.util.find_spec(p)) and spec.submodule_search_locations else os.path.dirname(spec.origin)),"lib") for p in pkgs]))') $LD_LIBRARY_PATH
```
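To verify that the library now resolves, a quick sanity check from Python (the library name comes from the error message above):

```python
import ctypes

# Raises OSError if libcudnn_cnn cannot be located via LD_LIBRARY_PATH.
ctypes.CDLL("libcudnn_cnn.so.9")
print("cuDNN 9 found")
```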