VoxCPM2 Chirp/Click Artifact & Voice Consistency in One-Shot Cloning

Thank you for all your hard work on VoxCPM and the recent VoxCPM2 release. So far I am very impressed with the results for a project I am working on. Using it with nanovllm-voxcpm on an AWS EC2 g6.xlarge (NVIDIA L4 24GB), I am achieving streaming responses within 250ms, and LoRA training completes 2000 steps in approx. 15min.

Are you able to provide some guidance for a small but annoying issue?

One-shot voice cloning on the base VoxCPM2 model produces a chirp/click at the start of every generated audio segment. This appears to be the tail of the reference audio leaking through the DiT's `prefix_feat_cond` conditioning. For example, if my reference audio ends with "enemy", the generated audio appears to start with "emy".

While I have attempted changes through nanovllm-voxcpm as described below, I can reproduce this issue easily with the default VoxCPM LoRA WebUI.

With these changes, some audio segments work fine for some voices, but I am unable to work out what the main factor causing the artifact could be.

**Things attempted:**
1. **Blank audio padding on reference clip** - Did not help; the last audio patch is still used as DiT conditioning, and silence padding made it worse
2. **Reference mode** (send `ref_audio_latents_base64` only) - Eliminated chirp but significantly degraded voice quality; reverted
3. **Zero `prefix_feat_cond` patch** (to nanovllm-voxcpm) - Zeroed the DiT conditioning during prefill to match training behaviour; did not completely resolve
4. **Dual mode - ref_audio + prompt latents together** - Send both for maximum voice conditioning; did not completely resolve
5. **PCM chirp trim - 100ms** (HACK) - Pipeline-level trim of first ~2 patches when cloning is active; addresses residual LM continuation artifact but still does not work completely
6. **Transcript mismatch** - Extra words in the reference text obviously cause problems, but the chirp/click persists even when the transcript has been carefully checked against the audio
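
For reference, the PCM trim in item 5 is roughly equivalent to the sketch below. This is a minimal illustration, not the actual pipeline code: the function name `trim_leading_pcm` is hypothetical, it assumes mono 16-bit little-endian PCM, the sample rate must be supplied by the caller, and the short fade-in is an extra assumption added here to avoid a hard edge at the cut point.

```python
import struct


def trim_leading_pcm(pcm: bytes, sample_rate: int,
                     trim_ms: float = 100.0, fade_ms: float = 5.0) -> bytes:
    """Drop the first `trim_ms` of audio, then fade in over `fade_ms`.

    Assumes mono, 16-bit signed, little-endian PCM (an assumption of this
    sketch, not something confirmed by VoxCPM or nanovllm-voxcpm).
    """
    bytes_per_sample = 2
    trim_samples = int(sample_rate * trim_ms / 1000)
    trimmed = pcm[trim_samples * bytes_per_sample:]

    # Unpack whole samples only; a dangling odd byte is discarded.
    n = len(trimmed) // bytes_per_sample
    samples = list(struct.unpack(f"<{n}h", trimmed[: n * bytes_per_sample]))

    # Linear fade-in so the cut itself does not introduce a new click.
    fade_samples = min(n, int(sample_rate * fade_ms / 1000))
    for i in range(fade_samples):
        samples[i] = int(samples[i] * i / fade_samples)

    return struct.pack(f"<{n}h", *samples)
```

This only hides the artifact at the output stage; it does nothing about the conditioning leak itself, which is why it is marked as a hack above.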


