Private, on-device live speech translation with local voice cloning.
Bao Translate is a tiny interpreter that runs on your Android phone. You speak, it captions and translates the words, then it speaks the translation out loud. If you enroll a short local voice profile, translated speech can be converted toward your timbre. After the model stack is downloaded, conversation audio and voice profiles stay on the device.
- Live speech translation: microphone audio flows through local VAD, STT, translation, TTS, and playback.
- Streaming captions: English uses a sherpa-onnx Zipformer transducer; 10 other languages use Vosk caption models that are downloaded lazily, on first use; Arabic falls back to chunked Whisper captions.
- On-device translation: Qwen2.5 1.5B is the required LiteRT-LM translation model; Gemma 4 E2B is an optional upgrade.
- On-device speech: Kokoro handles en/es/fr/hi/it/pt/zh; optional Supertonic handles de/ja/ko/ru/ar; Android platform TTS is the fallback when a supplemental voice is unavailable.
- Voice cloning: OpenVoice ONNX tone conversion is provisioned with the required stack and is applied when a local or peer voice profile is available.
- Conversation modes: face-to-face use, continuous translation, per-speaker language selection, Bluetooth audio routing, and Nearby Connections relay between Android devices.
- Model management: downloads are curated, status-tracked, resumable, integrity-checked, and visible in the app.
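The resumable, integrity-checked download behavior in the last bullet can be illustrated with a small sketch. This is not the app's implementation: the use of SHA-256, the size check, and the helper names `sha256_of`, `resume_offset`, and `is_intact` are all assumptions made for illustration.

```python
import hashlib
import os

CHUNK_SIZE = 1 << 20  # read 1 MiB at a time so large model files never sit fully in memory


def sha256_of(path: str) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(CHUNK_SIZE), b""):
            digest.update(chunk)
    return digest.hexdigest()


def resume_offset(partial_path: str, expected_size: int) -> int:
    """Byte offset a resumed download should start from (0 means start over)."""
    if not os.path.exists(partial_path):
        return 0
    size = os.path.getsize(partial_path)
    # A partial file larger than the expected size cannot be trusted, so restart.
    return size if size <= expected_size else 0


def is_intact(path: str, expected_size: int, expected_sha256: str) -> bool:
    """Accept a finished download only if both size and digest match (assumed policy)."""
    return (
        os.path.getsize(path) == expected_size
        and sha256_of(path) == expected_sha256.lower()
    )
```

A resumed transfer would typically request only the bytes from `resume_offset(...)` onward, for example with an HTTP `Range` header, and call `is_intact(...)` once the file is complete.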
flowchart LR
mic["Microphone"] --> vad["Silero VAD"]
vad --> stt["Whisper Base STT"]
vad -. live audio .-> caption{"Caption route"}
caption -->|en| sherpa["sherpa-onnx Zipformer"]
caption -->|"es/fr/de/zh/ja/ko/pt/it/ru/hi"| vosk["Vosk small model"]
caption -->|"ar or missing model"| chunked["chunked Whisper fallback"]
stt --> translate["LiteRT-LM translation"]
translate --> tts{"TTS router"}
tts -->|"en/es/fr/hi/it/pt/zh"| kokoro["Kokoro"]
tts -->|"de/ja/ko/ru/ar if installed"| supertonic["Supertonic"]
tts -->|fallback| platform["Android platform TTS"]
kokoro --> clone{"Voice profile?"}
supertonic --> clone
platform --> clone
clone -->|yes| openvoice["OpenVoice tone conversion"]
clone -->|"no or conversion fails"| playback["Audio playback"]
openvoice --> playback
playback --> speaker["Speaker or selected output"]
playback -. Nearby relay .-> peer["Paired Android device"]
The live caption path is optimized for responsiveness, while Whisper provides the final transcript used for translation. The TTS router picks Kokoro or, when installed, Supertonic according to the target language and falls back to Android platform TTS otherwise; if a voice profile is available, OpenVoice then converts the synthesized audio toward the enrolled speaker's timbre.
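The TTS routing and voice-conversion fallback in the diagram can be sketched in a few lines. The language sets come from the diagram; the function names, the engine labels, and the `convert` callback are hypothetical, and the app itself is an Android project, so this Python is only an illustration of the decision logic.

```python
# Language coverage as listed in the pipeline diagram.
KOKORO_LANGS = {"en", "es", "fr", "hi", "it", "pt", "zh"}
SUPERTONIC_LANGS = {"de", "ja", "ko", "ru", "ar"}


def pick_tts_engine(target_lang: str, supertonic_installed: bool) -> str:
    """Choose a speech engine label for the target language (labels are hypothetical)."""
    if target_lang in KOKORO_LANGS:
        return "kokoro"
    if target_lang in SUPERTONIC_LANGS and supertonic_installed:
        return "supertonic"
    # Anything else, including a Supertonic language without the optional
    # download, falls back to the Android platform engine.
    return "platform"


def apply_voice_profile(audio, profile, convert):
    """Return converted audio, or the original audio when there is no profile
    or the conversion raises (the "no or conversion fails" edge in the diagram)."""
    if profile is None:
        return audio
    try:
        return convert(audio, profile)
    except Exception:
        return audio
```

For example, `pick_tts_engine("de", supertonic_installed=False)` returns `"platform"`, which mirrors the fallback edge of the router.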
flowchart TD
firstRun["First app run"] --> required["Required readiness gate"]
required --> whisper["Whisper Base STT - 148 MB"]
required --> qwen["Qwen2.5 1.5B translation - 1.5 GB"]
required --> silero["Silero VAD - 2 MB"]
required --> kokoro["Kokoro TTS - 142 MB app estimate"]
required --> openvoice["OpenVoice tone conversion - 125 MB app estimate"]
firstRun --> auto["Auto-provisioned upgrade"]
auto --> streaming["English streaming ASR - 44 MB"]
settings["Optional settings download"] --> gemma["Gemma 4 E2B translation - 2.5 GB"]
settings --> supertonic["Supertonic TTS - 80 MB"]
languageUse["First use of non-English caption language"] --> vosk["Lazy Vosk caption model - 32 to 86 MB each"]
Required models must be ready before the core translation loop is considered ready. The English streaming caption model is auto-provisioned but does not block the required readiness gate. Vosk caption models are downloaded only when a language needs them.
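As a rough illustration of the readiness gate, the sketch below treats the core loop as ready only when every required model is installed, and adds up the sizes shown in the diagram. The model identifiers and function names are hypothetical, 1.5 GB is taken as 1500 MB, and the Kokoro and OpenVoice figures are the app estimates quoted above.

```python
# Sizes in MB, copied from the provisioning diagram (identifiers are hypothetical).
REQUIRED_MODELS_MB = {
    "whisper-base-stt": 148,
    "qwen2.5-1.5b-translation": 1500,  # 1.5 GB, taken as 1500 MB
    "silero-vad": 2,
    "kokoro-tts": 142,                 # app estimate
    "openvoice-tone-conversion": 125,  # app estimate
}

# Auto-provisioned, but deliberately not part of the gate.
AUTO_PROVISIONED_MB = {"english-streaming-asr": 44}


def missing_required(installed: set) -> list:
    """Required models that are not installed yet, in a stable order."""
    return sorted(set(REQUIRED_MODELS_MB) - set(installed))


def core_loop_ready(installed: set) -> bool:
    """True only when every required model is present."""
    return not missing_required(installed)


def required_download_mb() -> int:
    """Approximate size of the required stack."""
    return sum(REQUIRED_MODELS_MB.values())
```

Under these assumptions the required stack comes to about 1917 MB, and a device that has all five required models but not the English streaming model still passes the gate.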
sequenceDiagram
participant A as Device A
participant B as Device B
A->>A: Capture, caption, transcribe
A->>A: Translate into listener language
A->>B: Send Nearby payload and speaker metadata
B->>B: Synthesize translated speech
B->>B: Apply peer timbre when available
B->>B: Play through selected output route
Nearby Connections carries the two-device mesh. Each side keeps its own source and target language, and peer timbre metadata lets the receiving device speak the translation in the other speaker's voice when a profile is available.
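To make the relay step concrete, here is a minimal sketch of what a translated-text message with speaker metadata could look like when serialized to bytes. The field names, the JSON encoding, and the `RelayMessage` type are assumptions for illustration; the actual payload format used by the app is not described here.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class RelayMessage:
    """Hypothetical message sent from Device A to Device B."""
    translated_text: str
    source_lang: str
    target_lang: str
    speaker_id: str
    timbre_profile_id: Optional[str] = None  # None when the speaker has no voice profile


def encode(message: RelayMessage) -> bytes:
    """Serialize to UTF-8 JSON bytes, suitable for a byte-oriented payload."""
    return json.dumps(asdict(message), ensure_ascii=False).encode("utf-8")


def decode(payload: bytes) -> RelayMessage:
    """Rebuild the message on the receiving device."""
    return RelayMessage(**json.loads(payload.decode("utf-8")))


def should_apply_peer_timbre(message: RelayMessage, known_profiles: set) -> bool:
    """Convert toward the peer's timbre only when the receiver holds that profile."""
    return (
        message.timbre_profile_id is not None
        and message.timbre_profile_id in known_profiles
    )
```

In this sketch the receiver synthesizes `translated_text` in `target_lang`, then applies voice conversion only if `should_apply_peer_timbre(...)` is true, matching the "when available" step in the sequence diagram.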
Bao Translate supports 12 selectable languages: English, Spanish, French, German, Chinese, Japanese, Korean, Portuguese, Italian, Russian, Arabic, and Hindi.
| Capability | Coverage |
|---|---|
| Translation targets | All 12 selectable languages |
| Source language | Manual selection or Auto detect |
| Live captions | English via sherpa-onnx Zipformer; es/fr/de/zh/ja/ko/pt/it/ru/hi via Vosk; Arabic via chunked Whisper fallback |
| Local speech | Kokoro for en/es/fr/hi/it/pt/zh; Supertonic for de/ja/ko/ru/ar when installed; Android platform TTS fallback |
| Voice conversion | OpenVoice tone conversion for enrolled local and peer voice profiles |
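The live-caption row of the table can be read as a small routing function. The sketch below is one interpretation: the language sets come from the table, while the function name, the engine labels, and the rule that a missing model falls back to chunked Whisper for any language are assumptions drawn from the "ar or missing model" edge of the pipeline diagram.

```python
# Languages served by per-language Vosk caption models, as listed in the table.
VOSK_LANGS = {"es", "fr", "de", "zh", "ja", "ko", "pt", "it", "ru", "hi"}


def pick_caption_engine(lang: str, sherpa_ready: bool, vosk_installed: set) -> str:
    """Choose a live-caption engine label (labels are hypothetical)."""
    if lang == "en" and sherpa_ready:
        return "sherpa-zipformer"
    if lang in VOSK_LANGS and lang in vosk_installed:
        return "vosk"
    # Arabic, or any language whose streaming model is not on the device yet.
    return "chunked-whisper"
```

With this reading, `pick_caption_engine("ja", True, {"es"})` returns `"chunked-whisper"` until the Japanese Vosk model has been downloaded.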
Models are downloaded in-app from curated sources on first use or from settings.
| Role | Model | Provisioning |
|---|---|---|
| Voice activity detection | Silero VAD | Required |
| Speech-to-text | Whisper Base through sherpa-onnx | Required |
| Streaming captions, English | sherpa-onnx Zipformer transducer | Auto-provisioned |
| Streaming captions, 10 languages | Vosk small models | Lazy per language |
| Translation | Qwen2.5 1.5B through LiteRT-LM | Required |
| Translation upgrade | Gemma 4 E2B through LiteRT-LM | Optional |
| Text-to-speech | Kokoro Multi-Lang | Required |
| Supplemental TTS | Supertonic TTS through sherpa-onnx | Optional |
| Fallback TTS | Android TextToSpeech engine | Device-provided |
| Voice conversion | OpenVoice tone converter and reference encoder through ONNX Runtime | Required |
Requirements:
- Android 12 / API 31 or newer.
- Enough local storage for the selected model stack.
- Network access for initial model downloads.
- Optional Bluetooth headset or second Android device for advanced conversation testing.
Steps:
- Install an APK you have access to, or build from source.
- Launch the app and download the required models.
- Optional: enroll your voice in settings.
- Choose source and target languages, pick audio devices if needed, and start translating.
Source builds that need in-app model downloads also need the Hugging Face OAuth client and redirect scheme described in DEVELOPMENT.md.
The Android Gradle project is rooted at Android/src.
cd Android/src
./gradlew :app:assembleDebug
./gradlew :app:testDebugUnitTest
./gradlew :app:connectedDebugAndroidTest
./gradlew :app:smokeE2e
adb install -r app/build/outputs/apk/debug/app-debug.apk
Build notes:
- applicationId = com.bao.translate.
- versionName = 1.0.15; versionCode = 33.
- compileSdk = 37; minSdk = 31; targetSdk = 35.
- The wrapper uses Gradle 9.5.1 with AGP 9.2.1 and Kotlin 2.4.0.
- Gradle toolchains use JDK 26 for compilation and emit Java 17 bytecode.
- The vendored sherpa-onnx AAR lives under Android/src/app/libs/.
- sherpa-onnx and onnxruntime-android both ship libonnxruntime.so; packaging keeps one shared object with jniLibs.pickFirsts.
See DEVELOPMENT.md for local setup, Hugging Face OAuth configuration, and verification notes.
flowchart TB
root["Repository root"] --> android["Android/src - Android Gradle app"]
root --> brand["bao-translate - brand and store assets"]
root --> docs["docs/adr - architecture decisions"]
root --> mcp["mcp - Model Context Protocol guide"]
root --> skills["skills - bundled and featured agent skills"]
root --> allowlists["model_allowlists - curated model lists"]
android --> feature["Bao Translate custom task"]
android --> gallery["AI Edge Gallery foundation"]
feature --> audio["audio routing and playback"]
feature --> stt["VAD, STT, streaming captions"]
feature --> translate["LiteRT-LM translation"]
feature --> tts["Kokoro, Supertonic, platform TTS, OpenVoice"]
feature --> nearby["Nearby conversation mesh"]
| Path | Purpose |
|---|---|
| Android/ | Android application docs and Gradle project |
| Android/src | App source, Gradle wrapper, tests, and resources |
| bao-translate/ | Brand assets and app icon resources |
| docs/ | Architecture decisions and supporting documentation |
| mcp/ | Model Context Protocol integration guide |
| skills/ | Agent skill documentation and examples |
| model_allowlists/ and model_allowlist.json | Curated model allowlists |
| Function_Calling_Guide.md | Guide for adding custom mobile actions |
| Bug_Reporting_Guide.md | Android bug report capture guide |
- Android app guide
- Development setup
- Contribution policy
- Bug reporting
- Function calling
- MCP integration
- Agent skills
- Model allowlists
Recommended local gates before shipping Android changes:
cd Android/src
./gradlew :app:verifyReleaseReady
./gradlew :app:smokeE2e
./gradlew :app:connectedDebugAndroidTest
Use a physical device for full translation, microphone capture, Bluetooth audio routing, voice cloning, and Nearby conversation validation. Emulator coverage is useful for compile, unit, and focused UI checks, but it does not replace hardware verification for the audio pipeline.
Bao Translate builds on these open-source projects:
- Google AI Edge Gallery for the app foundation.
- LiteRT and LiteRT-LM for local model execution.
- sherpa-onnx for on-device STT, streaming ASR, and TTS.
- Vosk for multilingual streaming recognition.
- ONNX Runtime for voice-conversion graphs.
- Whisper for speech recognition.
- Qwen2.5 for required translation.
- Gemma for optional translation.
- Kokoro for multilingual TTS.
- OpenVoice for cross-lingual tone-color conversion.
- Silero VAD for voice activity detection.
- Hugging Face for LiteRT-LM model hosting.
- Found a bug? Use the local bug report template and include the details from the Bug Reporting Guide.
- Have an idea? Use the local feature request template.
- Planning a code change? Read CONTRIBUTING.md first so expectations are clear.
Bao Translate is licensed under the Apache License, Version 2.0. See LICENSE. This project is derived from Google AI Edge Gallery; upstream copyright notices and Apache-2.0 licensing are retained.
