Thank you for all your hard work on VoxCPM and the recent VoxCPM2 release. So far I am very impressed with the results for a project I am working on. Using it with nanovllm-voxcpm on an AWS EC2 g6.xlarge (NVIDIA L4 24GB), I am achieving streaming responses within 250ms, and LoRA training completes 2000 steps in approx. 15min.
Are you able to provide some guidance for a small but annoying issue?
One-shot voice cloning on the base VoxCPM2 model produces a chirp/click at the start of every generated audio segment. This appears to be the tail of the reference audio leaking through the DiT's `prefix_feat_cond` conditioning. For example, if my reference audio ends with "enemy", the generated audio appears to start with "emy".
While I have attempted changes through nanovllm-voxcpm as described below, I can reproduce this issue easily with the default VoxCPM LoRA WebUI.
With these changes, some audio segments for some voices appear to work fine, but I am unable to work out what the main factor causing the issue could be.
Things attempted:
- Blank audio padding on reference clip - Did not help; the last audio patch is still used as DiT conditioning, and silence padding made it worse
- Reference mode (send `ref_audio_latents_base64` only) - Eliminated chirp but significantly degraded voice quality; reverted
- Zero `prefix_feat_cond` patch (to nanovllm-voxcpm) - Zeroed the DiT conditioning during prefill to match training behaviour; did not completely resolve
- Dual mode - ref_audio + prompt latents together - Send both for maximum voice conditioning; did not completely resolve
- PCM chirp trim - 100ms (HACK) - Pipeline-level trim of first ~2 patches when cloning is active; addresses residual LM continuation artifact but still does not work completely
- Transcript mismatch - Extra words in the reference text obviously cause problems, but even when the transcript is reviewed carefully the chirp/click persists
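
For reference, the PCM trim workaround from the list above can be sketched roughly as follows. This is a minimal illustration and not the actual pipeline code: the function name `trim_leading_pcm`, the 5ms fade-in, and the assumption of mono 16-bit little-endian PCM are all hypothetical, and the sample rate is passed in rather than assumed.

```python
import array
import sys


def trim_leading_pcm(pcm: bytes, sample_rate: int,
                     trim_ms: float = 100.0, fade_ms: float = 5.0) -> bytes:
    """Drop the first trim_ms of audio, then fade in to avoid a new click.

    Assumes mono, 16-bit signed, little-endian PCM (an assumption made for
    this sketch; adjust for other formats).
    """
    samples = array.array("h")
    samples.frombytes(pcm)  # raises ValueError if len(pcm) is odd
    if sys.byteorder == "big":
        samples.byteswap()  # convert little-endian input to native order

    # Number of samples to discard, clamped to the buffer length.
    n_trim = min(len(samples), int(sample_rate * trim_ms / 1000))
    kept = samples[n_trim:]

    # Short linear fade-in so the hard cut does not itself produce a click.
    n_fade = min(len(kept), int(sample_rate * fade_ms / 1000))
    for i in range(n_fade):
        kept[i] = int(kept[i] * i / n_fade)

    if sys.byteorder == "big":
        kept.byteswap()  # back to little-endian for output
    return kept.tobytes()
```

In a streaming setup the trim would only apply to the start of the first chunk of each cloned segment. It removes audio after generation and does not change what the model is conditioned on, which is consistent with it only partially masking the artifact.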