Chatterbox-Pro

Production TTS pipeline based on Chatterbox-TTS with async job queue, multi-candidate Whisper validation, neural denoising, and automated quality comparison.

Built on Chatterbox TTS by Resemble AI and Chatterbox-TTS-Extended by petermg.

Quick Start (Docker)

# Clone and start
git clone https://github.com/x90skysn3k/Chatterbox-Pro.git
cd Chatterbox-Pro

# Add your voice reference file
cp /path/to/your-voice.wav voices/default.wav

# Start server (requires NVIDIA GPU + Docker with nvidia-container-toolkit)
docker compose up -d

# Generate speech
curl -X POST http://localhost:8004/tts \
  -H 'Content-Type: application/json' \
  -d '{"text": "Hello world.", "predefined_voice_id": "default.wav"}' \
  | jq .job_id
# API response: {"job_id": "abc12345", "status": "processing"}; jq extracts the job_id

# Poll status
curl http://localhost:8004/status/abc12345

# Download WAV when done
curl http://localhost:8004/result/abc12345 --output speech.wav

Quick Start (Linux/CUDA)

pip install -r requirements.txt
./install-patches.sh   # Apply local fixes over pip package
python3 server.py      # Starts on port 8004

Quick Start (Mac — Apple Silicon)

Runs on M1/M2/M3/M4 via MPS (Metal Performance Shaders). Slower than CUDA (~10x) but fully functional.

# Create venv
python3 -m venv venv && source venv/bin/activate

# Install PyTorch with MPS support
pip install torch torchaudio

# Install deps
pip install -r requirements.txt
./install-patches.sh

# Set MPS-specific env vars
export KMP_DUPLICATE_LIB_OK=TRUE
export PYTORCH_ENABLE_MPS_FALLBACK=1

# Start server
python3 server.py

Mac notes:

  • faster-whisper requires CUDA — falls back to OpenAI Whisper automatically
  • Parallel workers limited to 1 on MPS (can't parallelize GPU ops)
  • Generation is ~10x slower than a P40 but works for testing and light use
  • Model downloads ~3GB on first run (cached in ~/.cache/huggingface)

Model Support Matrix

| Model              | Params | Speed (P40)      | CFG/Exagg | Notes                          |
|--------------------|--------|------------------|-----------|--------------------------------|
| standard (default) | 500M   | ~10-15 min/scene | Yes       | Best quality, full control     |
| turbo              | 350M   | ~2-3 min/scene   | No        | 2x faster, paralinguistic tags |
| multilingual       | 500M   | ~10-15 min/scene | Yes       | 23 languages                   |

Select model via API parameter: "model": "standard" (or turbo, multilingual).

GPU Compatibility

| GPU         | Compute | Dtype    | VRAM    | Notes                                   |
|-------------|---------|----------|---------|-----------------------------------------|
| Tesla P40   | 6.1     | float32  | 24GB    | Tested, production                      |
| RTX 2080 Ti | 7.5     | float16  | 11GB    | Volta/Turing tier; fp16 halves VRAM use |
| RTX 3090    | 8.6     | bfloat16 | 24GB    | Ampere, fast + efficient                |
| A100        | 8.0     | bfloat16 | 40/80GB | Datacenter, best perf                   |
| RTX 4090    | 8.9     | bfloat16 | 24GB    | Ada Lovelace, fastest                   |
| H100        | 9.0     | bfloat16 | 80GB    | Hopper, top tier                        |

Dtype is auto-detected based on GPU compute capability. Override with env var:

CHATTERBOX_DTYPE=float32 docker compose up   # Force float32
CHATTERBOX_DTYPE=bfloat16 docker compose up  # Force bfloat16
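
For reference, the auto-detection amounts to a capability-to-dtype mapping like the sketch below. This is a minimal illustration assuming PyTorch, not the server's exact code (which lives in server.py):

import torch

def pick_dtype() -> torch.dtype:
    """Map CUDA compute capability to a dtype, mirroring the GPU table above."""
    if not torch.cuda.is_available():
        return torch.float32            # CPU / MPS: stay in float32
    major, minor = torch.cuda.get_device_capability()
    if major >= 8:                      # Ampere, Ada, Hopper: native bfloat16
        return torch.bfloat16
    if major == 7:                      # Volta/Turing: fp16 tensor cores
        return torch.float16
    return torch.float32                # Pascal (P40): no fast half precision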

Environment Variables

| Variable         | Default     | Description                                   |
|------------------|-------------|-----------------------------------------------|
| DEFAULT_VOICE    | default.wav | Voice reference file in voices/               |
| CHATTERBOX_DTYPE | auto        | Force dtype: auto, float32, float16, bfloat16 |

Server API

Async job queue architecture: submit, poll, download. No long-lived HTTP connections.

Endpoints

POST /tts — Submit generation job

Returns { "job_id": "abc12345", "status": "processing" } instantly.

{
  "text": "Your narration text here.",
  "voice_mode": "predefined",
  "predefined_voice_id": "default.wav",
  "temperature": 0.75,
  "exaggeration": 0.65,
  "cfg_weight": 0.4,
  "speed_factor": 1.0,
  "split_text": true,
  "chunk_size": 250,
  "seed": 0,
  "model": "standard",
  "apply_watermark": false,
  "top_p": 0.8,
  "repetition_penalty": 2.0,
  "num_candidates": 2,
  "max_attempts": 3,
  "skip_normalization": false,
  "use_silero_vad": false
}

GET /status/{job_id} — Poll progress

Returns real-time generation progress:

{
  "job_id": "abc12345",
  "status": "processing",
  "elapsed": 45,
  "stage": "generation",
  "chunk": "2/5",
  "chunk_pct": "40%",
  "candidate": "cand 1 attempt 1",
  "text": "Most traders lose money..."
}

GET /result/{job_id} — Download WAV

Returns binary WAV audio. Deletes job after download.

GET /result/{job_id} — Response Headers

When server-side normalization is applied, the response includes headers so clients can skip their own pass:

X-Audio-Normalized: ebu
X-Audio-Loudnorm: I=-16:TP=-1.5:LRA=11
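
Putting the endpoints together, a minimal polling client looks like this (a sketch assuming the requests package; the loop simply waits for the status to leave "processing", since the terminal status strings aren't spelled out above, and assumes the job succeeds):

import time
import requests

BASE = "http://localhost:8004"

# Submit; the server answers immediately with a job_id
job = requests.post(f"{BASE}/tts", json={
    "text": "Hello world.",
    "predefined_voice_id": "default.wav",
}).json()
job_id = job["job_id"]

# Poll until the job leaves the "processing" state
while requests.get(f"{BASE}/status/{job_id}").json()["status"] == "processing":
    time.sleep(2)

# Download the WAV (the server deletes the job afterwards)
resp = requests.get(f"{BASE}/result/{job_id}")
with open("speech.wav", "wb") as f:
    f.write(resp.content)

# Skip client-side loudness normalization if the server already applied it
if resp.headers.get("X-Audio-Normalized") == "ebu":
    print("normalized server-side:", resp.headers["X-Audio-Loudnorm"])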

GET /health — Server health check

{
  "status": "healthy",
  "version": "1.2.0",
  "uptime": "2h 15m 30s",
  "uptime_seconds": 8130,
  "model": {
    "loaded": true,
    "type": "standard"
  },
  "vram": {
    "allocated_mb": 3058.8,
    "reserved_mb": 3088.0,
    "max_allocated_mb": 3058.8,
    "total_mb": 24438.8,
    "device": "Tesla P40",
    "used_pct": 12.5
  },
  "gpu": {
    "compute_capability": "6.1",
    "supports_bf16": false,
    "supports_fp16": false,
    "supports_tf32": false
  },
  "jobs": { "active": 0, "done": 5, "failed": 0 },
  "generation_count": 42,
  "disk": { "temp_mb": 0.3, "output_mb": 56.7 }
}

WS /stream — Real-time streaming TTS (Turbo)

WebSocket endpoint for instant TTS using the Chatterbox Turbo model (350M, 1-step decoder). Text is split into sentence chunks and audio streams as each chunk generates.

Client → { "text": "Hello world.", "voice": "default.wav" }
Server → { "status": "generating", "text_length": 12 }
Server → { "status": "chunks", "count": 1 }
Server → { "status": "chunk", "index": 0, "total": 1, "gen_time": 1.2 }
Server → [binary: PCM16 @ 24kHz mono]
Server → { "status": "done", "chunks": 1, "elapsed": 1.2 }

Audio is raw PCM16 at 24kHz, one binary frame per chunk. Client plays chunks progressively via Web Audio API.
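
A minimal Python receiver (a sketch using the third-party websockets package, which the repo doesn't necessarily ship) mirrors the flow above: text frames are JSON status, binary frames are raw PCM16:

import asyncio
import json
import websockets

async def stream_tts(text: str, voice: str = "default.wav") -> bytes:
    pcm = bytearray()
    async with websockets.connect("ws://localhost:8004/stream") as ws:
        await ws.send(json.dumps({"text": text, "voice": voice}))
        async for msg in ws:
            if isinstance(msg, bytes):           # binary frame: PCM16 @ 24kHz mono
                pcm.extend(msg)                  # feed to a player here for live playback
            elif json.loads(msg).get("status") == "done":
                break                            # final status frame; others are progress
    return bytes(pcm)

audio = asyncio.run(stream_tts("Hello world."))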

GET /stream-test — Streaming test page

Standalone HTML page with text input, voice selection, waveform visualization, and download. Open in browser to test streaming TTS.

GET /voices — List voice files

{ "voices": ["default.wav", "custom.wav"] }

POST /upload-voice — Upload voice reference

Multipart file upload. Accepts .wav, .mp3, .flac.

curl -F "file=@my-voice.wav" http://localhost:8004/upload-voice
# → { "filename": "my-voice.wav", "size": 48000 }

Audio Pipeline

Text Input
  │
  ▼
Text Preprocessing (spacing, dot-letters, sound words, pause tags)
  │
  ▼
Sentence Tokenization (NLTK) + Smart Batching (min 80 chars, max 300)
  │
  ▼
Per-Chunk Generation (2 candidates, deterministic seeds, parallel workers)
  │  top_p + repetition_penalty forwarded to T3 sampling
  │
  ▼
Per-Chunk VAD Trim (Silero VAD removes leading/trailing silence per chunk)
  │  preserves all internal pauses, 150ms speech padding
  │
  ▼
Whisper Validation (faster-whisper medium, fuzzy match > 0.85)
  │  retry up to 3x per candidate if failed (same 0.85 threshold)
  │
  ▼
Multi-Factor Candidate Scoring (whisper accuracy + speaking rate + duration)
  │
  ▼
Equal-Power Crossfade Concatenation (50ms sqrt overlap-add)
  │  + pause tag splicing
  │
  ▼
Post-Concatenation Processing (in order):
  │
  ├─ Auto-Editor (silence trim, threshold=0.04, margin=0.4s)
  │    OR Silero VAD (caps internal silence at 500ms, default)
  │
  ├─ pyrnnoise Denoising (neural noise reduction, mono 48kHz)
  │
  └─ Two-Pass EBU R128 Loudnorm (-16 LUFS, -1.5 TP, 11 LRA)
       measure → apply with linear=true (preserves dynamics)
  │
  ▼
Output WAV (192kHz)
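
The crossfade step deserves a note: a plain linear fade dips about 3 dB at each chunk boundary because amplitudes, not powers, sum to one, while square-root ramps keep the summed power constant across the overlap. A minimal numpy sketch of the technique (an illustration, not the pipeline's exact code):

import numpy as np

def equal_power_crossfade(a: np.ndarray, b: np.ndarray, sr: int, ms: int = 50) -> np.ndarray:
    """Join two chunks with an equal-power (sqrt) overlap-add crossfade."""
    n = int(sr * ms / 1000)                            # 50ms overlap by default
    t = np.linspace(0.0, 1.0, n)
    fade_in, fade_out = np.sqrt(t), np.sqrt(1.0 - t)   # fade_in**2 + fade_out**2 == 1
    overlap = a[-n:] * fade_out + b[:n] * fade_in
    return np.concatenate([a[:-n], overlap, b[n:]])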

Two-Stage VAD Pipeline

Silero VAD runs twice with different goals:

  1. Per-chunk VAD (_vad_trim_chunk) — Runs on each candidate WAV after generation. Only trims leading/trailing non-speech. Uses min_silence_duration_ms=9999 so it never splits on internal pauses. This ensures chunks have clean edges for crossfade blending.

  2. Final VAD (_apply_silero_vad_trim) — Runs on the concatenated audio. Caps any internal silence gaps longer than 500ms. Polishes the overall pacing after crossfade.

When use_silero_vad=true (default), both stages run and auto-editor is skipped. Set use_silero_vad=false to use auto-editor instead (amplitude-based, less intelligent).
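
A sketch of stage 1 against the public silero-vad hub API (the pipeline's _vad_trim_chunk may differ in detail; the 9999ms silence setting and 150ms padding come from the description above):

import torch

model, utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
get_speech_timestamps, _, read_audio, _, _ = utils

def trim_edges(path: str, sr: int = 16000, pad_ms: int = 150) -> torch.Tensor:
    """Drop leading/trailing non-speech; keep every internal pause."""
    wav = read_audio(path, sampling_rate=sr)
    # min_silence_duration_ms=9999 -> internal pauses never split the
    # utterance, so all speech comes back as one contiguous span
    ts = get_speech_timestamps(wav, model, sampling_rate=sr,
                               min_silence_duration_ms=9999)
    if not ts:
        return wav                          # no speech detected: leave as-is
    pad = int(sr * pad_ms / 1000)           # 150ms speech padding
    start = max(ts[0]["start"] - pad, 0)
    end = min(ts[-1]["end"] + pad, len(wav))
    return wav[start:end]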


QA Model Comparison

Compare TTS models and pipeline configurations with standardized test sentences.

Quick start

# Compare current model (quick: 3 tests)
./run.sh tts-compare-quick

# Compare specific models
./run.sh tts-compare current,turbo,hf-800m

# Full comparison (all 7 tests)
npm run qa:tts-compare -- --models current,turbo --voice default.wav

# Use local M1 Max
npm run qa:tts-compare -- --models current --local --quick

What it produces

tools/chatterbox-pro/qa-compare/
  2026-04-03T14-30-00/
    report.html          <-- Open this! Audio players + metrics side-by-side
    metrics/results.json
    audio/
      current/
        simple-narration.wav
        numbers-and-stats.wav
        ...
      turbo/
        ...

Test corpus

7 standardized sentences in test-corpus.json covering:

  • Simple narration (prosody baseline)
  • Numbers and stats (pronunciation)
  • Proper names (trader names like Livermore, Tudor Jones)
  • Emotional range (somber tone, natural pauses)
  • Technical terms (VIX, S&P, algo jargon)
  • Short punchy sentences (rhythm, emphasis)
  • Long complex sentences (breath pacing, dramatic build)

Metrics computed

| Metric           | Method                              | Pass threshold         |
|------------------|-------------------------------------|------------------------|
| Text accuracy    | Whisper transcription + fuzzy match | > 0.85                 |
| Duration         | ffprobe                             | within 10% of expected |
| Generation speed | wall clock                          | lower = better         |
| Peak level       | ffmpeg astats                       | -6 to -0.5 dBFS        |
| File size        | stat                                | informational          |
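
The exact fuzzy matcher isn't pinned down here; one straightforward way to implement the > 0.85 similarity check is difflib, as in this sketch:

from difflib import SequenceMatcher

def passes_validation(target: str, transcript: str, threshold: float = 0.85) -> bool:
    """Fuzzy-match a Whisper transcript against the chunk's target text."""
    norm = lambda s: " ".join(s.lower().split())   # case/whitespace-insensitive
    return SequenceMatcher(None, norm(target), norm(transcript)).ratio() > threshold

passes_validation("Most traders lose money.", "most traders lose money")   # True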

Voice Settings (Production)

| Parameter          | Value       | Notes                                         |
|--------------------|-------------|-----------------------------------------------|
| temperature        | 0.75        | Natural variance                              |
| exaggeration       | 0.65        | Authoritative calm                            |
| cfg_weight         | 0.4         | Balanced delivery                             |
| top_p              | 0.8         | Nucleus sampling threshold (T3 default)       |
| repetition_penalty | 2.0         | Prevents repeated tokens (T3 default)         |
| speed_factor       | 1.0         | Must be 1.0 (other values cause double voice) |
| voice              | default.wav | Custom reference                              |
| whisper_model      | medium      | Best accuracy/speed tradeoff                  |
| num_candidates     | 2           | Per chunk, scored by quality                  |
| max_attempts       | 3           | Retries per candidate                         |
| skip_normalization | false       | Set true if client normalizes                 |
| use_silero_vad     | false       | Intelligent silence trim (alt to auto-editor) |

Infrastructure

P40 Server (Primary)

# Server managed by systemd
ssh root@your-server "systemctl restart chatterbox-pro"
ssh root@your-server "journalctl -u chatterbox-pro -f"

# Check VRAM
ssh root@your-server "nvidia-smi"

# Health check
curl http://your-server:8004/health

M1 Max (Fallback)

cd tools/chatterbox-pro
source venv/bin/activate
KMP_DUPLICATE_LIB_OK=TRUE python server.py
# Runs on port 8004 locally

Differences: use_faster_whisper=False (needs CUDA), num_parallel_workers=1 (MPS limitation).


Known Issues

Memory leak on P40

VRAM usage grows over time. The server now runs VRAM cleanup between jobs automatically, but restart it after every 2-3 videos if needed:

ssh root@your-server "systemctl restart chatterbox-pro"

The /health endpoint reports vram.used_pct — restart when approaching 80%.
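
That makes restarts easy to automate. A minimal watchdog sketch built on the documented /health fields (hostname and service name are the placeholders used in the examples above):

import subprocess
import requests

def restart_if_needed(threshold: float = 80.0) -> None:
    """Restart chatterbox-pro when VRAM usage approaches the 80% mark."""
    vram = requests.get("http://your-server:8004/health", timeout=10).json()["vram"]
    if vram["used_pct"] >= threshold:
        subprocess.run(["ssh", "root@your-server",
                        "systemctl restart chatterbox-pro"], check=True)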

faster-whisper crashes

faster-whisper occasionally crashes silently on certain audio; the pipeline then falls back to longest-transcript selection. If the problem persists, switch to OpenAI Whisper by setting use_faster_whisper=False.


Changelog

v1.2.0

  • Guard two-pass loudnorm against -inf crash on quiet audio
  • Fix excessive WAV generation per chunk (1 attempt per candidate, retry on Whisper fail)
  • Add trailing noise trimming before Whisper validation
  • generate_batch() now has top_p/repetition_penalty parity with generate()
  • Server prints version on startup, in /health, and UI
  • Clean log path (logs/server.log only)

v1.1.0

  • Raise min chunk length 20→80 chars (prevents TTS hallucinations)
  • Equal-power crossfade between chunks (eliminates clicks/pops, no 3dB dip)
  • MD5-based conditional caching for voice embeddings
  • Forward top_p and repetition_penalty through tts.py → T3 sampling
  • Multi-factor scored candidate selection (replaces shortest-duration)
  • Silero VAD integration for intelligent silence trimming (opt-in)
  • VRAM leak management with proactive cleanup between jobs
  • skip_normalization param + X-Audio-Normalized response headers
  • /health endpoint with VRAM, GPU, jobs, disk stats
  • Rotating file log handler (logs/server.log)
  • Fix RNNoise stereo bug (-ac 2 → -ac 1)
  • Fix Whisper retry threshold (0.95 → 0.85, matched to initial)
  • Two-pass EBU R128 loudnorm (preserves dynamics vs single-pass)
  • Reorder post-processing: auto-editor → denoise → loudnorm
  • Thread safety fix for _jobs dict access
  • Temp file cleanup (remove files >24h old)

Dependencies

  • Python 3.10.x
  • FFmpeg on PATH
  • CUDA 12.8 (P40) or MPS (M1 Max)
  • PyTorch 2.7.0
  • faster-whisper + openai-whisper
  • pyrnnoise 0.3.8
  • auto-editor 27.1.1

Full list: requirements.txt


File Structure

server.py              FastAPI async job queue server
Chatter.py             Main TTS pipeline (batching, validation, post-processing)
test-corpus.json       QA comparison test sentences
deploy-p40.sh          P40 deployment script (gitignored)
voices/                Reference audio files
  default.wav    Production voice
chatterbox/src/        Core model implementations
  chatterbox/tts.py    ChatterboxTTS class (500M)
  chatterbox/vc.py     ChatterboxVC (voice conversion)
  models/t3/           T3 language model
  models/s3gen/        S3Gen vocoder
  models/voice_encoder/ Speaker embedding
  models/s3tokenizer/  Speech tokenizer
temp/                  Temporary candidate WAVs (gitignored)
output/                Final output files (gitignored)
qa-compare/            Comparison run results (gitignored)

About

Production-grade async TTS server built on Chatterbox. Docker support, GPU auto-detection (Pascal→Hopper), multi-model (Standard/Turbo/Multilingual), Whisper validation, conditional caching, pause tags, memory leak fixes, and CI/CD.
