Production TTS pipeline based on Chatterbox-TTS with async job queue, multi-candidate Whisper validation, neural denoising, and automated quality comparison.
Built on Chatterbox TTS by Resemble AI and Chatterbox-TTS-Extended by petermg.
# Clone and start
git clone https://github.com/x90skysn3k/Chatterbox-Pro.git
cd Chatterbox-Pro
# Add your voice reference file
cp /path/to/your-voice.wav voices/default.wav
# Start server (requires NVIDIA GPU + Docker with nvidia-container-toolkit)
docker compose up -d
# Generate speech
curl -X POST http://localhost:8004/tts \
-H 'Content-Type: application/json' \
-d '{"text": "Hello world.", "predefined_voice_id": "default.wav"}' \
| jq .job_id
# Returns: {"job_id": "abc12345", "status": "processing"}
# Poll status
curl http://localhost:8004/status/abc12345
# Download WAV when done
curl http://localhost:8004/result/abc12345 --output speech.wav

pip install -r requirements.txt
./install-patches.sh # Apply local fixes over pip package
python3 server.py # Starts on port 8004

Runs on M1/M2/M3/M4 via MPS (Metal Performance Shaders). Slower than CUDA (~10x) but fully functional.
# Create venv
python3 -m venv venv && source venv/bin/activate
# Install PyTorch with MPS support
pip install torch torchaudio
# Install deps
pip install -r requirements.txt
./install-patches.sh
# Set MPS-specific env vars
export KMP_DUPLICATE_LIB_OK=TRUE
export PYTORCH_ENABLE_MPS_FALLBACK=1
# Start server
python3 server.py

Mac notes:
- faster-whisper requires CUDA; falls back to OpenAI Whisper automatically
- Parallel workers limited to 1 on MPS (can't parallelize GPU ops)
- Generation is ~10x slower than a P40 but works for testing and light use
- Model downloads ~3GB on first run (cached in ~/.cache/huggingface)
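For local experiments, device selection can follow the order the notes above imply (CUDA first, then MPS, then CPU). A minimal sketch, assumed for illustration and not taken from server.py:

```python
# Pick the best available torch device: CUDA, then Apple MPS, then CPU.
import torch

def pick_device() -> str:
    if torch.cuda.is_available():
        return "cuda"
    # MPS is available on Apple Silicon with a recent PyTorch build
    if torch.backends.mps.is_available():
        return "mps"
    return "cpu"

print(pick_device())
```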
| Model | Params | Speed (P40) | CFG/Exagg | Notes |
|---|---|---|---|---|
| standard (default) | 500M | ~10-15 min/scene | Yes | Best quality, full control |
| turbo | 350M | ~2-3 min/scene | No | 2x faster, paralinguistic tags |
| multilingual | 500M | ~10-15 min/scene | Yes | 23 languages |
Select model via API parameter: "model": "standard" (or turbo, multilingual).
| GPU | Compute | Dtype | VRAM | Notes |
|---|---|---|---|---|
| Tesla P40 | 6.1 | float32 | 24GB | Tested, production |
| RTX 2080 Ti | 7.5 | float16 | 11GB | Volta/Turing, half VRAM |
| RTX 3090 | 8.6 | bfloat16 | 24GB | Ampere, fast + efficient |
| A100 | 8.0 | bfloat16 | 40/80GB | Datacenter, best perf |
| RTX 4090 | 8.9 | bfloat16 | 24GB | Ada Lovelace, fastest |
| H100 | 9.0 | bfloat16 | 80GB | Hopper, top tier |
Dtype is auto-detected based on GPU compute capability. Override with env var:
CHATTERBOX_DTYPE=float32 docker compose up # Force float32
CHATTERBOX_DTYPE=bfloat16 docker compose up # Force bfloat16

| Variable | Default | Description |
|---|---|---|
| DEFAULT_VOICE | default.wav | Voice reference file in voices/ |
| CHATTERBOX_DTYPE | auto | Force dtype: auto, float32, float16, bfloat16 |
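A sketch of what the auto-detection implies, derived from the GPU table above (compute 8.0+ gets bfloat16, 7.x gets float16, older gets float32) plus the CHATTERBOX_DTYPE override. Illustrative only; the actual server logic may differ:

```python
# Map GPU compute capability to dtype, honoring the env var override.
import os
import torch

def resolve_dtype() -> torch.dtype:
    named = {"float32": torch.float32, "float16": torch.float16,
             "bfloat16": torch.bfloat16}
    override = os.environ.get("CHATTERBOX_DTYPE", "auto")
    if override in named:
        return named[override]
    if not torch.cuda.is_available():
        return torch.float32
    major, _minor = torch.cuda.get_device_capability()  # e.g. (6, 1) on a P40
    if major >= 8:        # Ampere and newer: native bfloat16
        return torch.bfloat16
    if major == 7:        # Volta/Turing: fp16 tensor cores, no bf16
        return torch.float16
    return torch.float32  # Pascal (P40) and older
```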
Async job queue architecture: submit, poll, download. No long-lived HTTP connections.
Returns { "job_id": "abc12345", "status": "processing" } instantly.
{
"text": "Your narration text here.",
"voice_mode": "predefined",
"predefined_voice_id": "default.wav",
"temperature": 0.75,
"exaggeration": 0.65,
"cfg_weight": 0.4,
"speed_factor": 1.0,
"split_text": true,
"chunk_size": 250,
"seed": 0,
"model": "standard",
"apply_watermark": false,
"top_p": 0.8,
"repetition_penalty": 2.0,
"num_candidates": 2,
"max_attempts": 3,
"skip_normalization": false,
"use_silero_vad": false
}

Returns real-time generation progress:
{
"job_id": "abc12345",
"status": "processing",
"elapsed": 45,
"stage": "generation",
"chunk": "2/5",
"chunk_pct": "40%",
"candidate": "cand 1 attempt 1",
"text": "Most traders lose money..."
}

Returns binary WAV audio. Deletes job after download.
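A minimal Python client for the submit, poll, download flow. Uses the requests package; the 2-second polling interval is an arbitrary choice:

```python
# Submit a TTS job, poll until it finishes, then download the WAV.
import time
import requests

BASE = "http://localhost:8004"

job = requests.post(f"{BASE}/tts", json={
    "text": "Hello world.",
    "predefined_voice_id": "default.wav",
}).json()
job_id = job["job_id"]

# Poll /status until the job leaves the "processing" state
while True:
    status = requests.get(f"{BASE}/status/{job_id}").json()
    if status["status"] != "processing":
        break
    time.sleep(2)

# Download the audio; the server deletes the job after this request
resp = requests.get(f"{BASE}/result/{job_id}")
with open("speech.wav", "wb") as f:
    f.write(resp.content)
```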
When server-side normalization is applied, the response includes headers so clients can skip their own pass:
X-Audio-Normalized: ebu
X-Audio-Loudnorm: I=-16:TP=-1.5:LRA=11
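The values in the header match the pipeline's two-pass EBU R128 loudnorm. A rough sketch of that measure-then-apply flow using ffmpeg's loudnorm filter (the filter options are real ffmpeg syntax; the server's exact invocation is an assumption):

```python
# Two-pass loudnorm: measure first, then apply with linear=true.
import json
import subprocess

def loudnorm_two_pass(src: str, dst: str, i=-16, tp=-1.5, lra=11):
    # Pass 1: measure. loudnorm prints its stats as JSON on stderr.
    measure = subprocess.run(
        ["ffmpeg", "-i", src, "-af",
         f"loudnorm=I={i}:TP={tp}:LRA={lra}:print_format=json",
         "-f", "null", "-"],
        capture_output=True, text=True)
    stats = json.loads(measure.stderr[measure.stderr.rindex("{"):])
    # Pass 2: apply with measured values; linear=true preserves dynamics.
    subprocess.run(
        ["ffmpeg", "-y", "-i", src, "-af",
         f"loudnorm=I={i}:TP={tp}:LRA={lra}:linear=true"
         f":measured_I={stats['input_i']}"
         f":measured_TP={stats['input_tp']}"
         f":measured_LRA={stats['input_lra']}"
         f":measured_thresh={stats['input_thresh']}",
         dst],
        check=True)
```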
{
"status": "healthy",
"version": "1.2.0",
"uptime": "2h 15m 30s",
"uptime_seconds": 8130,
"model": {
"loaded": true,
"type": "standard"
},
"vram": {
"allocated_mb": 3058.8,
"reserved_mb": 3088.0,
"max_allocated_mb": 3058.8,
"total_mb": 24438.8,
"device": "Tesla P40",
"used_pct": 12.5
},
"gpu": {
"compute_capability": "6.1",
"supports_bf16": false,
"supports_fp16": false,
"supports_tf32": false
},
"jobs": { "active": 0, "done": 5, "failed": 0 },
"generation_count": 42,
"disk": { "temp_mb": 0.3, "output_mb": 56.7 }
}

WebSocket endpoint for instant TTS using the Chatterbox Turbo model (350M, 1-step decoder). Text is split into sentence chunks and audio streams as each chunk generates.
Client → { "text": "Hello world.", "voice": "default.wav" }
Server → { "status": "generating", "text_length": 12 }
Server → { "status": "chunks", "count": 1 }
Server → { "status": "chunk", "index": 0, "total": 1, "gen_time": 1.2 }
Server → [binary: PCM16 @ 24kHz mono]
Server → { "status": "done", "chunks": 1, "elapsed": 1.2 }
Audio is raw PCM16 at 24kHz, one binary frame per chunk. Client plays chunks progressively via Web Audio API.
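A client sketch using the Python websockets package. The message shapes follow the protocol above; the ws://.../ws path is a placeholder, substitute the actual endpoint the server exposes:

```python
# Stream TTS chunks over WebSocket and wrap the raw PCM16 in a WAV file.
import asyncio
import json
import wave
import websockets

async def stream_tts(text: str, out_path: str = "stream.wav"):
    # NOTE: "/ws" is a placeholder path, not confirmed by these docs
    async with websockets.connect("ws://localhost:8004/ws") as ws:
        await ws.send(json.dumps({"text": text, "voice": "default.wav"}))
        pcm = bytearray()
        async for msg in ws:
            if isinstance(msg, bytes):   # one binary PCM16 frame per chunk
                pcm.extend(msg)
            elif json.loads(msg).get("status") == "done":
                break
        # Raw PCM16 @ 24kHz mono, per the protocol above
        with wave.open(out_path, "wb") as w:
            w.setnchannels(1)
            w.setsampwidth(2)       # 16-bit samples
            w.setframerate(24000)
            w.writeframes(bytes(pcm))

asyncio.run(stream_tts("Hello world."))
```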
Standalone HTML page with text input, voice selection, waveform visualization, and download. Open in browser to test streaming TTS.
{ "voices": ["default.wav", "custom.wav"] }Multipart file upload. Accepts .wav, .mp3, .flac.
curl -F "file=@my-voice.wav" http://localhost:8004/upload-voice
# → { "filename": "my-voice.wav", "size": 48000 }Text Input
│
▼
Text Preprocessing (spacing, dot-letters, sound words, pause tags)
│
▼
Sentence Tokenization (NLTK) + Smart Batching (min 80 chars, max 300)
│
▼
Per-Chunk Generation (2 candidates, deterministic seeds, parallel workers)
│ top_p + repetition_penalty forwarded to T3 sampling
│
▼
Per-Chunk VAD Trim (Silero VAD removes leading/trailing silence per chunk)
│ preserves all internal pauses, 150ms speech padding
│
▼
Whisper Validation (faster-whisper medium, fuzzy match > 0.85)
│ retry up to 3x per candidate if failed (same 0.85 threshold)
│
▼
Multi-Factor Candidate Scoring (whisper accuracy + speaking rate + duration)
│
▼
Equal-Power Crossfade Concatenation (50ms sqrt overlap-add)
│ + pause tag splicing
│
▼
Post-Concatenation Processing (in order):
│
├─ Auto-Editor (silence trim, threshold=0.04, margin=0.4s)
│ OR Silero VAD (caps internal silence at 500ms, default)
│
├─ pyrnnoise Denoising (neural noise reduction, mono 48kHz)
│
└─ Two-Pass EBU R128 Loudnorm (-16 LUFS, -1.5 TP, 11 LRA)
measure → apply with linear=true (preserves dynamics)
│
▼
Output WAV (192kHz)
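The equal-power crossfade step above can be sketched in a few lines of numpy. Illustrative, assuming float sample arrays at a common sample rate; the sqrt fade curves satisfy fade_in² + fade_out² = 1, so perceived loudness stays flat across the joint (no 3dB dip):

```python
# Equal-power (sqrt) crossfade concatenation with a 50ms overlap-add.
import numpy as np

def equal_power_concat(a: np.ndarray, b: np.ndarray, sr: int,
                       overlap_ms: int = 50) -> np.ndarray:
    n = int(sr * overlap_ms / 1000)
    t = np.linspace(0.0, 1.0, n)
    fade_out = np.sqrt(1.0 - t)   # fade_out**2 + fade_in**2 == 1
    fade_in = np.sqrt(t)
    blended = a[-n:] * fade_out + b[:n] * fade_in
    return np.concatenate([a[:-n], blended, b[n:]])
```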
Silero VAD runs twice with different goals:
- Per-chunk VAD (_vad_trim_chunk): runs on each candidate WAV after generation. Only trims leading/trailing non-speech. Uses min_silence_duration_ms=9999 so it never splits on internal pauses. This ensures chunks have clean edges for crossfade blending.
- Final VAD (_apply_silero_vad_trim): runs on the concatenated audio. Caps any internal silence gaps longer than 500ms. Polishes the overall pacing after crossfade.
When use_silero_vad=true (default), both stages run and auto-editor is skipped. Set use_silero_vad=false to use auto-editor instead (amplitude-based, less intelligent).
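A sketch of the per-chunk trim using the public silero-vad API via torch.hub. The oversized min_silence_duration_ms and the 150ms padding come from the description above; the function itself is illustrative, not the pipeline's actual code:

```python
# Trim leading/trailing non-speech from a chunk without touching
# internal pauses, using Silero VAD.
import torch

model, utils = torch.hub.load("snakers4/silero-vad", "silero_vad",
                              trust_repo=True)
get_speech_timestamps, _, read_audio, _, _ = utils

def trim_edges(path: str, pad_ms: int = 150, sr: int = 16000) -> torch.Tensor:
    audio = read_audio(path, sampling_rate=sr)
    spans = get_speech_timestamps(
        audio, model, sampling_rate=sr,
        min_silence_duration_ms=9999)   # never split on internal pauses
    if not spans:
        return audio
    pad = int(sr * pad_ms / 1000)       # keep ~150ms of speech padding
    start = max(spans[0]["start"] - pad, 0)
    end = min(spans[-1]["end"] + pad, len(audio))
    return audio[start:end]
```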
Compare TTS models and pipeline configurations with standardized test sentences.
# Compare current model (quick: 3 tests)
./run.sh tts-compare-quick
# Compare specific models
./run.sh tts-compare current,turbo,hf-800m
# Full comparison (all 7 tests)
npm run qa:tts-compare -- --models current,turbo --voice default.wav
# Use local M1 Max
npm run qa:tts-compare -- --models current --local --quick

tools/chatterbox-pro/qa-compare/
2026-04-03T14-30-00/
report.html <-- Open this! Audio players + metrics side-by-side
metrics/results.json
audio/
current/
simple-narration.wav
numbers-and-stats.wav
...
turbo/
...
7 standardized sentences in test-corpus.json covering:
- Simple narration (prosody baseline)
- Numbers and stats (pronunciation)
- Proper names (trader names like Livermore, Tudor Jones)
- Emotional range (somber tone, natural pauses)
- Technical terms (VIX, S&P, algo jargon)
- Short punchy sentences (rhythm, emphasis)
- Long complex sentences (breath pacing, dramatic build)
| Metric | Method | Pass threshold |
|---|---|---|
| Text accuracy | Whisper transcription + fuzzy match | > 0.85 |
| Duration | ffprobe | within 10% expected |
| Generation speed | wall clock | lower = better |
| Peak level | ffmpeg astats | -6 to -0.5 dBFS |
| File size | stat | informational |
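The text-accuracy metric pairs a Whisper transcription with a fuzzy match against the source text. difflib is one way to get a 0-1 similarity ratio; whether the pipeline uses difflib or another matcher is not specified here, so treat this as illustrative:

```python
# Fuzzy-match a transcript against the expected text (0.0 - 1.0).
import re
from difflib import SequenceMatcher

def normalize(s: str) -> str:
    return re.sub(r"[^a-z0-9 ]", "", s.lower()).strip()

def text_accuracy(expected: str, transcript: str) -> float:
    return SequenceMatcher(None, normalize(expected),
                           normalize(transcript)).ratio()

# Passes when the ratio clears the 0.85 threshold from the table above
assert text_accuracy("Hello world.", "hello world") > 0.85
```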
| Parameter | Value | Notes |
|---|---|---|
| temperature | 0.75 | Natural variance |
| exaggeration | 0.65 | Authoritative calm |
| cfg_weight | 0.4 | Balanced delivery |
| top_p | 0.8 | Nucleus sampling threshold (T3 default) |
| repetition_penalty | 2.0 | Prevents repeated tokens (T3 default) |
| speed_factor | 1.0 | Must be 1.0 (other values cause double voice) |
| voice | default.wav | Custom reference |
| whisper_model | medium | Best accuracy/speed tradeoff |
| num_candidates | 2 | Per chunk, scored by quality |
| max_attempts | 3 | Retries per candidate |
| skip_normalization | false | Set true if client normalizes |
| use_silero_vad | false | Intelligent silence trim (alt to auto-editor) |
# Server managed by systemd
ssh root@your-server "systemctl restart chatterbox-pro"
ssh root@your-server "journalctl -u chatterbox-pro -f"
# Check VRAM
ssh root@your-server "nvidia-smi"
# Health check
curl http://your-server:8004/health

cd tools/chatterbox-pro
source venv/bin/activate
KMP_DUPLICATE_LIB_OK=TRUE python server.py
# Runs on port 8004 locally

Differences from production: use_faster_whisper=False (faster-whisper needs CUDA), num_parallel_workers=1 (MPS limitation).
VRAM usage grows over time. Server now runs VRAM cleanup between jobs automatically, but restart after every 2-3 videos if needed:
ssh root@your-server "systemctl restart chatterbox-pro"The /health endpoint reports vram.used_pct — restart when approaching 80%.
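A sketch of both halves: a between-jobs cleanup of the kind the docs describe, and the /health-based 80% check. Illustrative only; the server's actual cleanup may differ:

```python
# Release cached VRAM between jobs and check /health for the restart rule.
import gc
import requests
import torch

def cleanup_vram():
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()   # return cached blocks to the driver

def needs_restart(base="http://localhost:8004", limit=80.0) -> bool:
    vram = requests.get(f"{base}/health").json()["vram"]
    return vram["used_pct"] >= limit
```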
faster-whisper occasionally crashes silently on certain audio. The pipeline falls back to longest-transcript selection. If the problem persists, switch to OpenAI Whisper by setting use_faster_whisper=False.
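A sketch of the fallback shape. Both calls below are the libraries' documented entry points; the retry policy is simplified for illustration:

```python
# Prefer faster-whisper on CUDA; fall back to openai-whisper otherwise.
import torch

def transcribe(path: str, size: str = "medium") -> str:
    if torch.cuda.is_available():
        try:
            from faster_whisper import WhisperModel
            model = WhisperModel(size, device="cuda")
            segments, _info = model.transcribe(path)
            return " ".join(seg.text for seg in segments)
        except Exception:
            pass  # fall through to openai-whisper
    import whisper
    return whisper.load_model(size).transcribe(path)["text"]
```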
- Guard two-pass loudnorm against -inf crash on quiet audio
- Fix excessive WAV generation per chunk (1 attempt per candidate, retry on Whisper fail)
- Add trailing noise trimming before Whisper validation
- generate_batch() now has top_p/repetition_penalty parity with generate()
- Server prints version on startup, in /health, and in the UI
- Clean log path (logs/server.log only)
- Raise min chunk length 20→80 chars (prevents TTS hallucinations)
- Equal-power crossfade between chunks (eliminates clicks/pops, no 3dB dip)
- MD5-based conditional caching for voice embeddings
- Forward top_p and repetition_penalty through tts.py → T3 sampling
- Multi-factor scored candidate selection (replaces shortest-duration)
- Silero VAD integration for intelligent silence trimming (opt-in)
- VRAM leak management with proactive cleanup between jobs
- skip_normalization param + X-Audio-Normalized response headers
- /health endpoint with VRAM, GPU, jobs, disk stats
- Rotating file log handler (logs/server.log)
- Fix RNNoise stereo bug (-ac 2 → -ac 1)
- Fix Whisper retry threshold (0.95 → 0.85, matched to initial)
- Two-pass EBU R128 loudnorm (preserves dynamics vs single-pass)
- Reorder post-processing: auto-editor → denoise → loudnorm
- Thread safety fix for _jobs dict access
- Temp file cleanup (remove files >24h old)
- Python 3.10.x
- FFmpeg on PATH
- CUDA 12.8 (P40) or MPS (M1 Max)
- PyTorch 2.7.0
- faster-whisper + openai-whisper
- pyrnnoise 0.3.8
- auto-editor 27.1.1
Full list: requirements.txt
server.py FastAPI async job queue server
Chatter.py Main TTS pipeline (batching, validation, post-processing)
test-corpus.json QA comparison test sentences
deploy-p40.sh P40 deployment script (gitignored)
voices/ Reference audio files
default.wav Production voice
chatterbox/src/ Core model implementations
chatterbox/tts.py ChatterboxTTS class (500M)
chatterbox/vc.py ChatterboxVC (voice conversion)
models/t3/ T3 language model
models/s3gen/ S3Gen vocoder
models/voice_encoder/ Speaker embedding
models/s3tokenizer/ Speech tokenizer
temp/ Temporary candidate WAVs (gitignored)
output/ Final output files (gitignored)
qa-compare/ Comparison run results (gitignored)