Production TTS pipeline based on Chatterbox-TTS with async job queue, multi-candidate Whisper validation, neural denoising, and automated quality comparison.
Built on Chatterbox TTS by Resemble AI and Chatterbox-TTS-Extended by petermg.
# Clone and start
git clone https://github.com/x90skysn3k/Chatterbox-Pro.git
cd Chatterbox-Pro
# Add your voice reference file
cp /path/to/your-voice.wav voices/default.wav
# Start server (requires NVIDIA GPU + Docker with nvidia-container-toolkit)
docker compose up -d
# Generate speech
curl -X POST http://localhost:8004/tts \
-H 'Content-Type: application/json' \
-d '{"text": "Hello world.", "predefined_voice_id": "default.wav"}' \
| jq .job_id
# Returns: {"job_id": "abc12345", "status": "processing"}
# Poll status
curl http://localhost:8004/status/abc12345
# Download WAV when done
curl http://localhost:8004/result/abc12345 --output speech.wav

pip install -r requirements.txt
./install-patches.sh # Apply local fixes over pip package
python3 server.py # Starts on port 8004

Runs on M1/M2/M3/M4 via MPS (Metal Performance Shaders). Slower than CUDA (~10x) but fully functional.
# Create venv
python3 -m venv venv && source venv/bin/activate
# Install PyTorch with MPS support
pip install torch torchaudio
# Install deps
pip install -r requirements.txt
./install-patches.sh
# Set MPS-specific env vars
export KMP_DUPLICATE_LIB_OK=TRUE
export PYTORCH_ENABLE_MPS_FALLBACK=1
# Start server
python3 server.py

Mac notes:
- faster-whisper requires CUDA; falls back to OpenAI Whisper automatically
- Parallel workers limited to 1 on MPS (can't parallelize GPU ops)
- Generation is ~10x slower than a P40 but works for testing and light use
- Model downloads ~3GB on first run (cached in ~/.cache/huggingface)
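For local experiments, device selection can follow the order the notes above imply (CUDA first, then MPS, then CPU). A minimal sketch, assumed for illustration and not taken from server.py:

```python
# Pick the best available torch device: CUDA, then Apple MPS, then CPU.
import torch

def pick_device() -> str:
    if torch.cuda.is_available():
        return "cuda"
    # MPS is available on Apple Silicon with a recent PyTorch build
    if torch.backends.mps.is_available():
        return "mps"
    return "cpu"

print(pick_device())
```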
| Model | Params | Speed (P40) | CFG/Exagg | Notes |
|---|---|---|---|---|
| standard (default) | 500M | ~10-15 min/scene | Yes | Best quality, full control |
| turbo | 350M | ~2-3 min/scene | No | 2x faster, paralinguistic tags |
| multilingual | 500M | ~10-15 min/scene | Yes | 23 languages |
Select model via API parameter: "model": "standard" (or turbo, multilingual).
| GPU | Compute | Dtype | VRAM | Notes |
|---|---|---|---|---|
| Tesla P40 | 6.1 | float32 | 24GB | Tested, production |
| RTX 2080 Ti | 7.5 | float16 | 11GB | Volta/Turing, half VRAM |
| RTX 3090 | 8.6 | bfloat16 | 24GB | Ampere, fast + efficient |
| A100 | 8.0 | bfloat16 | 40/80GB | Datacenter, best perf |
| RTX 4090 | 8.9 | bfloat16 | 24GB | Ada Lovelace, fastest |
| H100 | 9.0 | bfloat16 | 80GB | Hopper, top tier |
Dtype is auto-detected based on GPU compute capability. Override with env var:
CHATTERBOX_DTYPE=float32 docker compose up # Force float32
CHATTERBOX_DTYPE=bfloat16 docker compose up # Force bfloat16

| Variable | Default | Description |
|---|---|---|
| DEFAULT_VOICE | default.wav | Voice reference file in voices/ |
| CHATTERBOX_DTYPE | auto | Force dtype: auto, float32, float16, bfloat16 |
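A sketch of what the auto-detection implies, derived from the GPU table above (compute 8.0+ gets bfloat16, 7.x gets float16, older gets float32) plus the CHATTERBOX_DTYPE override. Illustrative only; the actual server logic may differ:

```python
# Map GPU compute capability to dtype, honoring the env var override.
import os
import torch

def resolve_dtype() -> torch.dtype:
    named = {"float32": torch.float32, "float16": torch.float16,
             "bfloat16": torch.bfloat16}
    override = os.environ.get("CHATTERBOX_DTYPE", "auto")
    if override in named:
        return named[override]
    if not torch.cuda.is_available():
        return torch.float32
    major, _minor = torch.cuda.get_device_capability()  # e.g. (6, 1) on a P40
    if major >= 8:        # Ampere and newer: native bfloat16
        return torch.bfloat16
    if major == 7:        # Volta/Turing: fp16 tensor cores, no bf16
        return torch.float16
    return torch.float32  # Pascal (P40) and older
```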
Async job queue architecture: submit, poll, download. No long-lived HTTP connections.
Returns { "job_id": "abc12345", "status": "processing" } instantly.
{
"text": "Your narration text here.",
"voice_mode": "predefined",
"predefined_voice_id": "default.wav",
"temperature": 0.75,
"exaggeration": 0.65,
"cfg_weight": 0.4,
"speed_factor": 1.0,
"split_text": true,
"chunk_size": 250,
"seed": 0,
"model": "standard",
"apply_watermark": false,
"top_p": 0.8,
"repetition_penalty": 2.0,
"num_candidates": 2,
"max_attempts": 3,
"skip_normalization": false,
"use_silero_vad": false
}

Returns real-time generation progress:
{
"job_id": "abc12345",
"status": "processing",
"elapsed": 45,
"stage": "generation",
"chunk": "2/5",
"chunk_pct": "40%",
"candidate": "cand 1 attempt 1",
"text": "Most traders lose money..."
}

Returns binary WAV audio. Deletes job after download.
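A minimal Python client for the submit, poll, download flow. Uses the requests package; the 2-second polling interval is an arbitrary choice:

```python
# Submit a TTS job, poll until it finishes, then download the WAV.
import time
import requests

BASE = "http://localhost:8004"

job = requests.post(f"{BASE}/tts", json={
    "text": "Hello world.",
    "predefined_voice_id": "default.wav",
}).json()
job_id = job["job_id"]

# Poll /status until the job leaves the "processing" state
while True:
    status = requests.get(f"{BASE}/status/{job_id}").json()
    if status["status"] != "processing":
        break
    time.sleep(2)

# Download the audio; the server deletes the job after this request
resp = requests.get(f"{BASE}/result/{job_id}")
with open("speech.wav", "wb") as f:
    f.write(resp.content)
```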
When server-side normalization is applied, the response includes headers so clients can skip their own pass:
X-Audio-Normalized: ebu
X-Audio-Loudnorm: I=-16:TP=-1.5:LRA=11
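The values in the header match the pipeline's two-pass EBU R128 loudnorm. A rough sketch of that measure-then-apply flow using ffmpeg's loudnorm filter (the filter options are real ffmpeg syntax; the server's exact invocation is an assumption):

```python
# Two-pass loudnorm: measure first, then apply with linear=true.
import json
import subprocess

def loudnorm_two_pass(src: str, dst: str, i=-16, tp=-1.5, lra=11):
    # Pass 1: measure. loudnorm prints its stats as JSON on stderr.
    measure = subprocess.run(
        ["ffmpeg", "-i", src, "-af",
         f"loudnorm=I={i}:TP={tp}:LRA={lra}:print_format=json",
         "-f", "null", "-"],
        capture_output=True, text=True)
    stats = json.loads(measure.stderr[measure.stderr.rindex("{"):])
    # Pass 2: apply with measured values; linear=true preserves dynamics.
    subprocess.run(
        ["ffmpeg", "-y", "-i", src, "-af",
         f"loudnorm=I={i}:TP={tp}:LRA={lra}:linear=true"
         f":measured_I={stats['input_i']}"
         f":measured_TP={stats['input_tp']}"
         f":measured_LRA={stats['input_lra']}"
         f":measured_thresh={stats['input_thresh']}",
         dst],
        check=True)
```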
{
"status": "healthy",
"version": "1.2.0",
"uptime": "2h 15m 30s",
"uptime_seconds": 8130,
"model": {
"loaded": true,
"type": "standard"
},
"vram": {
"allocated_mb": 3058.8,
"reserved_mb": 3088.0,
"max_allocated_mb": 3058.8,
"total_mb": 24438.8,
"device": "Tesla P40",
"used_pct": 12.5
},
"gpu": {
"compute_capability": "6.1",
"supports_bf16": false,
"supports_fp16": false,
"supports_tf32": false
},
"jobs": { "active": 0, "done": 5, "failed": 0 },
"generation_count": 42,
"disk": { "temp_mb": 0.3, "output_mb": 56.7 }
}

WebSocket endpoint for instant TTS using the Chatterbox Turbo model (350M, 1-step decoder). Text is split into sentence chunks and audio streams as each chunk generates.
Client → { "text": "Hello world.", "voice": "default.wav" }
Server → { "status": "generating", "text_length": 12 }
Server → { "status": "chunks", "count": 1 }
Server → { "status": "chunk", "index": 0, "total": 1, "gen_time": 1.2 }
Server → [binary: PCM16 @ 24kHz mono]
Server → { "status": "done", "chunks": 1, "elapsed": 1.2 }
Audio is raw PCM16 at 24kHz, one binary frame per chunk. Client plays chunks progressively via Web Audio API.
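A client sketch using the Python websockets package. The message shapes follow the protocol above; the ws://.../ws path is a placeholder, substitute the actual endpoint the server exposes:

```python
# Stream TTS chunks over WebSocket and wrap the raw PCM16 in a WAV file.
import asyncio
import json
import wave
import websockets

async def stream_tts(text: str, out_path: str = "stream.wav"):
    # NOTE: "/ws" is a placeholder path, not confirmed by these docs
    async with websockets.connect("ws://localhost:8004/ws") as ws:
        await ws.send(json.dumps({"text": text, "voice": "default.wav"}))
        pcm = bytearray()
        async for msg in ws:
            if isinstance(msg, bytes):   # one binary PCM16 frame per chunk
                pcm.extend(msg)
            elif json.loads(msg).get("status") == "done":
                break
        # Raw PCM16 @ 24kHz mono, per the protocol above
        with wave.open(out_path, "wb") as w:
            w.setnchannels(1)
            w.setsampwidth(2)       # 16-bit samples
            w.setframerate(24000)
            w.writeframes(bytes(pcm))

asyncio.run(stream_tts("Hello world."))
```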
Standalone HTML page with text input, voice selection, waveform visualization, and download. Open in browser to test streaming TTS.
{ "voices": ["default.wav", "custom.wav"] }Multipart file upload. Accepts .wav, .mp3, .flac.
curl -F "file=@my-voice.wav" http://localhost:8004/upload-voice
# → { "filename": "my-voice.wav", "size": 48000 }Text Input
│
▼
Text Preprocessing (spacing, dot-letters, sound words, pause tags)
│
▼
Sentence Tokenization (NLTK) + Smart Batching (min 80 chars, max 300)
│
▼
Per-Chunk Generation (2 candidates, deterministic seeds, parallel workers)
│ top_p + repetition_penalty forwarded to T3 sampling
│
▼
Per-Chunk VAD Trim (Silero VAD removes leading/trailing silence per chunk)
│ preserves all internal pauses, 150ms speech padding
│
▼
Whisper Validation (faster-whisper medium, fuzzy match > 0.85)
│ retry up to 3x per candidate if failed (same 0.85 threshold)
│
▼
Multi-Factor Candidate Scoring (whisper accuracy + speaking rate + duration)
│
▼
Equal-Power Crossfade Concatenation (50ms sqrt overlap-add)
│ + pause tag splicing
│
▼
Post-Concatenation Processing (in order):
│
├─ Auto-Editor (silence trim, threshold=0.04, margin=0.4s)
│ OR Silero VAD (caps internal silence at 500ms, default)
│
├─ pyrnnoise Denoising (neural noise reduction, mono 48kHz)
│
└─ Two-Pass EBU R128 Loudnorm (-16 LUFS, -1.5 TP, 11 LRA)
measure → apply with linear=true (preserves dynamics)
│
▼
Output WAV (192kHz)
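The equal-power crossfade step above can be sketched in a few lines of numpy. Illustrative, assuming float sample arrays at a common sample rate; the sqrt fade curves satisfy fade_in² + fade_out² = 1, so perceived loudness stays flat across the joint (no 3dB dip):

```python
# Equal-power (sqrt) crossfade concatenation with a 50ms overlap-add.
import numpy as np

def equal_power_concat(a: np.ndarray, b: np.ndarray, sr: int,
                       overlap_ms: int = 50) -> np.ndarray:
    n = int(sr * overlap_ms / 1000)
    t = np.linspace(0.0, 1.0, n)
    fade_out = np.sqrt(1.0 - t)   # fade_out**2 + fade_in**2 == 1
    fade_in = np.sqrt(t)
    blended = a[-n:] * fade_out + b[:n] * fade_in
    return np.concatenate([a[:-n], blended, b[n:]])
```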
Silero VAD runs twice with different goals:
- Per-chunk VAD (_vad_trim_chunk): runs on each candidate WAV after generation. Only trims leading/trailing non-speech. Uses min_silence_duration_ms=9999 so it never splits on internal pauses. This ensures chunks have clean edges for crossfade blending.
- Final VAD (_apply_silero_vad_trim): runs on the concatenated audio. Caps any internal silence gaps longer than 500ms. Polishes the overall pacing after crossfade.
When use_silero_vad=true (default), both stages run and auto-editor is skipped. Set use_silero_vad=false to use auto-editor instead (amplitude-based, less intelligent).
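A sketch of the per-chunk trim using the public silero-vad API via torch.hub. The oversized min_silence_duration_ms and the 150ms padding come from the description above; the function itself is illustrative, not the pipeline's actual code:

```python
# Trim leading/trailing non-speech from a chunk without touching
# internal pauses, using Silero VAD.
import torch

model, utils = torch.hub.load("snakers4/silero-vad", "silero_vad",
                              trust_repo=True)
get_speech_timestamps, _, read_audio, _, _ = utils

def trim_edges(path: str, pad_ms: int = 150, sr: int = 16000) -> torch.Tensor:
    audio = read_audio(path, sampling_rate=sr)
    spans = get_speech_timestamps(
        audio, model, sampling_rate=sr,
        min_silence_duration_ms=9999)   # never split on internal pauses
    if not spans:
        return audio
    pad = int(sr * pad_ms / 1000)       # keep ~150ms of speech padding
    start = max(spans[0]["start"] - pad, 0)
    end = min(spans[-1]["end"] + pad, len(audio))
    return audio[start:end]
```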
Compare TTS models and pipeline configurations with standardized test sentences.
# Compare current model (quick: 3 tests)
./run.sh tts-compare-quick
# Compare specific models
./run.sh tts-compare current,turbo,hf-800m
# Full comparison (all 7 tests)
npm run qa:tts-compare -- --models current,turbo --voice default.wav
# Use local M1 Max
npm run qa:tts-compare -- --models current --local --quick

tools/chatterbox-pro/qa-compare/
2026-04-03T14-30-00/
report.html <-- Open this! Audio players + metrics side-by-side
metrics/results.json
audio/
current/
simple-narration.wav
numbers-and-stats.wav
...
turbo/
...
7 standardized sentences in test-corpus.json covering:
- Simple narration (prosody baseline)
- Numbers and stats (pronunciation)
- Proper names (trader names like Livermore, Tudor Jones)
- Emotional range (somber tone, natural pauses)
- Technical terms (VIX, S&P, algo jargon)
- Short punchy sentences (rhythm, emphasis)
- Long complex sentences (breath pacing, dramatic build)
| Metric | Method | Pass threshold |
|---|---|---|
| Text accuracy | Whisper transcription + fuzzy match | > 0.85 |
| Duration | ffprobe | within 10% expected |
| Generation speed | wall clock | lower = better |
| Peak level | ffmpeg astats | -6 to -0.5 dBFS |
| File size | stat | informational |
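The text-accuracy metric pairs a Whisper transcription with a fuzzy match against the source text. difflib is one way to get a 0-1 similarity ratio; whether the pipeline uses difflib or another matcher is not specified here, so treat this as illustrative:

```python
# Fuzzy-match a transcript against the expected text (0.0 - 1.0).
import re
from difflib import SequenceMatcher

def normalize(s: str) -> str:
    return re.sub(r"[^a-z0-9 ]", "", s.lower()).strip()

def text_accuracy(expected: str, transcript: str) -> float:
    return SequenceMatcher(None, normalize(expected),
                           normalize(transcript)).ratio()

# Passes when the ratio clears the 0.85 threshold from the table above
assert text_accuracy("Hello world.", "hello world") > 0.85
```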
| Parameter | Value | Notes |
|---|---|---|
| temperature | 0.75 | Natural variance |
| exaggeration | 0.65 | Authoritative calm |
| cfg_weight | 0.4 | Balanced delivery |
| top_p | 0.8 | Nucleus sampling threshold (T3 default) |
| repetition_penalty | 2.0 | Prevents repeated tokens (T3 default) |
| speed_factor | 1.0 | Must be 1.0 (other values cause double voice) |
| voice | default.wav | Custom reference |
| whisper_model | medium | Best accuracy/speed tradeoff |
| num_candidates | 2 | Per chunk, scored by quality |
| max_attempts | 3 | Retries per candidate |
| skip_normalization | false | Set true if client normalizes |
| use_silero_vad | false | Intelligent silence trim (alt to auto-editor) |
# Server managed by systemd
ssh root@your-server "systemctl restart chatterbox-pro"
ssh root@your-server "journalctl -u chatterbox-pro -f"
# Check VRAM
ssh root@your-server "nvidia-smi"
# Health check
curl http://your-server:8004/health

cd tools/chatterbox-pro
source venv/bin/activate
KMP_DUPLICATE_LIB_OK=TRUE python server.py
# Runs on port 8004 locally

Differences from production: use_faster_whisper=False (faster-whisper needs CUDA), num_parallel_workers=1 (MPS limitation).
VRAM usage grows over time. Server now runs VRAM cleanup between jobs automatically, but restart after every 2-3 videos if needed:
ssh root@your-server "systemctl restart chatterbox-pro"The /health endpoint reports vram.used_pct — restart when approaching 80%.
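A sketch of both halves: a between-jobs cleanup of the kind the docs describe, and the /health-based 80% check. Illustrative only; the server's actual cleanup may differ:

```python
# Release cached VRAM between jobs and check /health for the restart rule.
import gc
import requests
import torch

def cleanup_vram():
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()   # return cached blocks to the driver

def needs_restart(base="http://localhost:8004", limit=80.0) -> bool:
    vram = requests.get(f"{base}/health").json()["vram"]
    return vram["used_pct"] >= limit
```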
faster-whisper occasionally crashes silently on certain audio. The pipeline falls back to longest-transcript selection. If the problem persists, switch to OpenAI Whisper by setting use_faster_whisper=False.
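A sketch of the fallback shape. Both calls below are the libraries' documented entry points; the retry policy is simplified for illustration:

```python
# Prefer faster-whisper on CUDA; fall back to openai-whisper otherwise.
import torch

def transcribe(path: str, size: str = "medium") -> str:
    if torch.cuda.is_available():
        try:
            from faster_whisper import WhisperModel
            model = WhisperModel(size, device="cuda")
            segments, _info = model.transcribe(path)
            return " ".join(seg.text for seg in segments)
        except Exception:
            pass  # fall through to openai-whisper
    import whisper
    return whisper.load_model(size).transcribe(path)["text"]
```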
- Guard two-pass loudnorm against -inf crash on quiet audio
- Fix excessive WAV generation per chunk (1 attempt per candidate, retry on Whisper fail)
- Add trailing noise trimming before Whisper validation
- generate_batch() now has top_p/repetition_penalty parity with generate()
- Server prints version on startup, in /health, and in the UI
- Clean log path (logs/server.log only)
- Raise min chunk length 20→80 chars (prevents TTS hallucinations)
- Equal-power crossfade between chunks (eliminates clicks/pops, no 3dB dip)
- MD5-based conditional caching for voice embeddings
- Forward top_p and repetition_penalty through tts.py → T3 sampling
- Multi-factor scored candidate selection (replaces shortest-duration)
- Silero VAD integration for intelligent silence trimming (opt-in)
- VRAM leak management with proactive cleanup between jobs
- skip_normalization param + X-Audio-Normalized response headers
- /health endpoint with VRAM, GPU, jobs, disk stats
- Rotating file log handler (logs/server.log)
- Fix RNNoise stereo bug (-ac 2 → -ac 1)
- Fix Whisper retry threshold (0.95 → 0.85, matched to initial)
- Two-pass EBU R128 loudnorm (preserves dynamics vs single-pass)
- Reorder post-processing: auto-editor → denoise → loudnorm
- Thread safety fix for _jobs dict access
- Temp file cleanup (remove files >24h old)
- Python 3.10.x
- FFmpeg on PATH
- CUDA 12.8 (P40) or MPS (M1 Max)
- PyTorch 2.7.0
- faster-whisper + openai-whisper
- pyrnnoise 0.3.8
- auto-editor 27.1.1
Full list: requirements.txt
server.py FastAPI async job queue server
Chatter.py Main TTS pipeline (batching, validation, post-processing)
test-corpus.json QA comparison test sentences
deploy-p40.sh P40 deployment script (gitignored)
voices/ Reference audio files
default.wav Production voice
chatterbox/src/ Core model implementations
chatterbox/tts.py ChatterboxTTS class (500M)
chatterbox/vc.py ChatterboxVC (voice conversion)
models/t3/ T3 language model
models/s3gen/ S3Gen vocoder
models/voice_encoder/ Speaker embedding
models/s3tokenizer/ Speech tokenizer
temp/ Temporary candidate WAVs (gitignored)
output/ Final output files (gitignored)
qa-compare/ Comparison run results (gitignored)