A macOS app that captures any audio playing on your Mac — YouTube, Zoom, Twitch, livestreams, or any other app — transcribes it, translates it, and displays the original and translation side by side in a floating glass overlay.
Everything runs locally on Apple silicon. No API keys, accounts, or cloud services are required, and your audio never leaves the Mac.
system audio ──► Whisper ──► local LLM ──► glass overlay
BlackHole MLX Ollama PyObjC
- MLX Whisper for transcription
- Gemma 4 through Ollama for translation
- Silero VAD for speech detection
- PyObjC for the always-on-top overlay
- You watch foreign-language media — films, series, news, YouTube, Twitch, documentaries — and want live captions + translation instead of waiting for subtitles.
- You're learning a language and want the original and translation side by side on real speech.
- You sit in calls/meetings in a language you only half-speak.
- You need captions and translation for accessibility on audio that has none.
- You want it fully offline and private.
Most similar projects are cloud-based, transcription-only, file-based, tied to a meeting platform, or designed for OBS.
Live Translation combines:
- system-wide audio capture;
- real-time transcription;
- sentence-level LLM translation;
- a floating bilingual overlay;
- local processing;
- segmentation designed to avoid losing or repeating words.
Requires macOS with Apple silicon.
git clone https://github.com/KazKozDev/live-translation.git
cd live-translation
./setup.shOr install manually:
python3 -m venv .venv
./.venv/bin/pip install -r requirements.txt
brew bundle --file=BrewfileUse requirements.lock.txt if you need the exact pinned environment.
macOS does not allow ordinary apps to capture system output directly, so Live Translation uses BlackHole 2ch.
To keep hearing audio while capturing it:
- Open Audio MIDI Setup
- Click + → Create Multi-Output Device
- Enable your normal output and BlackHole 2ch
- Select the Multi-Output Device in System Settings → Sound → Output
The first launch may request microphone permission. This is required to read BlackHole audio.
Supported MLX Whisper models:
small— lowest resource usagemedium— balancedturbo— recommended for live uselarge— highest accuracy
Pre-download the default models:
./.venv/bin/python -c "from huggingface_hub import snapshot_download; \
[snapshot_download(r) for r in (
'mlx-community/whisper-medium-mlx',
'mlx-community/whisper-large-v3-turbo'
)]"Supported Ollama models:
gemma4:26b-mlx
gemma4:e4b-mlx
gemma4:12b-mlx
Download them with:
for model in gemma4:26b-mlx gemma4:e4b-mlx gemma4:12b-mlx; do
ollama pull "$model"
doneRecommended preset: Whisper Turbo + Gemma 4 26B
Launch LiveTranslate.app, or run:
./.venv/bin/python live_translate_overlay.py --target ruExample with fixed languages and a different Whisper model:
./.venv/bin/python live_translate_overlay.py \
--source es \
--target en \
--whisper large \
--ollama-model gemma4:26b-mlx./install-app.shOr choose another destination:
./install-app.sh ~/AppsThe app is ad-hoc signed. On first launch, right-click it and select Open.
Fixed five-second chunks often make Whisper cut words, repeat phrases, or add false punctuation.
The default pipeline instead:
- detects speech and pauses with Silero VAD;
- creates overlapping speech segments;
- removes duplicated text at segment boundaries;
- fixes punctuation introduced by artificial pauses;
- rebuilds complete sentences;
- sends those sentences to the translation model.
Under normal load, the queue catches up instead of dropping speech. During severe overload, stale audio may be skipped to return to live output.
Enable it with --streaming. Streaming re-transcribes a rolling buffer and commits words after two consecutive passes agree. Use --show-partial to see the unconfirmed tail and --update-seconds 1.5 to control re-transcription frequency.
It produces smoother partial captions but uses more compute and may lose content under load, so it is disabled by default.
Silero VAD removes silence and non-speech before transcription. The pipeline also filters common false outputs such as "thanks for watching", "subtitles by amara.org", and similar phrases. This reduces hallucinations but does not eliminate them completely.
The LLM receives complete sentences plus recent context, producing more coherent translations than word-by-word machine translation.
--source LANGUAGE
--target LANGUAGE
--whisper MODEL
--ollama-model MODEL
--silence-rms VALUE
--vad-min-speech-ms VALUE
--audio-queue SIZE
--show-partial
--streaming
--update-seconds VALUE
View all options:
./.venv/bin/python live_translate_overlay.py --help- macOS and Apple silicon only
- BlackHole setup is required
- larger models increase latency and memory usage
- automatic language detection can be unstable with mixed-language audio
- streaming mode is more compute-demanding
- Whisper can still hallucinate occasionally
live_translate_overlay.py CLI, audio pipeline, and Cocoa overlay
live_translation/ segmentation, cleanup, and translation
LiveTranslate.app macOS app bundle
install-app.sh app installer
setup.sh / Brewfile setup files
requirements*.txt dependencies
tests/ unit tests
./.venv/bin/pip install -r requirements-dev.txt
./.venv/bin/python -m ruff check .
./.venv/bin/python -m pyright
./.venv/bin/python -m pytestMIT — use, modify, and redistribute freely.
Thanks to Alex Ziskind for the benchmark that pointed to the MLX Whisper path.
