Feature Request: Reliable pause/silence control in Voice Design mode
Context
When using VoxCPM2 in Voice Design mode (text description in parentheses, no reference audio) for long-form narration, sentence-to-sentence pacing is too tight — the next sentence begins almost immediately after a period, which is fatiguing for the listener.
I tried the following approaches to control pause duration. None is reliably effective:
What doesn't work
- Explicit duration instructions in the Voice Design prompt: adding phrases like "take a one-second pause after every period" or "long pauses between sentences" into the parenthetical description does not produce measurable silence between sentences. The model does not interpret timing instructions semantically.
- Inline parenthetical pause commands mid-text: inserting (pause for 1 second) or similar in the middle of the input text causes those words to be spoken verbatim. Parentheses are only parsed as a style descriptor at the very start of the input, not as inline control tokens.
- Ellipsis + em-dash (... — ...) between sentences produces slightly longer pauses than a plain period. This is better than nothing, but the difference is subtle and not nearly enough for comfortable narration pacing. It is also undocumented and pollutes the text for any downstream consumer (subtitles, transcripts, etc.).
Related existing issues
To my knowledge, no existing issue covers input-side pause duration control.
Proposal
Any of the following would resolve this use case — in rough order of preference:
- SSML-style inline break tokens, e.g. <break time="1000ms"/> or [pause:1s]. Deterministic, composable, and familiar from other TTS systems (Amazon Polly, Azure, ElevenLabs).
- Configurable punctuation→pause mapping on model.generate(...), e.g. pause_map={".": 0.8, ",": 0.3, ";": 0.5}. Even if implemented as post-processing on the output waveform, this would unblock most users.
- Documentation on which input patterns reliably produce longer silences in the current model, so users can make informed choices without reverse-engineering.
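To make the second proposal concrete, here is a rough sketch of how a punctuation→pause mapping could work as pure post-processing. It is not VoxCPM code: `split_with_pauses`, `join_with_silence`, and the `synthesize` callable are hypothetical names introduced for illustration, and `synthesize` stands in for whatever wraps the real TTS call and returns mono samples. I have not tested whether the voice stays consistent across separately generated chunks in Voice Design mode.

```python
import re
from typing import Callable, Dict, List, Tuple

# Hypothetical defaults, copied from the pause_map example in the proposal.
DEFAULT_PAUSE_MAP: Dict[str, float] = {".": 0.8, ",": 0.3, ";": 0.5}


def split_with_pauses(
    text: str, pause_map: Dict[str, float]
) -> List[Tuple[str, float]]:
    """Split text after each mapped punctuation mark.

    Returns (chunk, pause_seconds) pairs. Keys of pause_map are assumed to
    be single characters. Text after the last mark gets a pause of 0.0.
    """
    if not pause_map:
        stripped = text.strip()
        return [(stripped, 0.0)] if stripped else []
    # Capture group keeps the punctuation marks in the split result.
    pattern = "([" + re.escape("".join(pause_map)) + "])"
    segments: List[Tuple[str, float]] = []
    buffer = ""
    for part in re.split(pattern, text):
        if part in pause_map:
            chunk = buffer.strip()
            if chunk:  # skip marks with no preceding text, e.g. inside "..."
                segments.append((chunk + part, pause_map[part]))
            buffer = ""
        else:
            buffer += part
    tail = buffer.strip()
    if tail:
        segments.append((tail, 0.0))
    return segments


def join_with_silence(
    segments: List[Tuple[str, float]],
    synthesize: Callable[[str], List[float]],  # hypothetical TTS wrapper
    sample_rate: int,
) -> List[float]:
    """Synthesize each chunk and append the requested silence after it."""
    audio: List[float] = []
    for chunk, pause_seconds in segments:
        audio.extend(synthesize(chunk))
        audio.extend([0.0] * int(round(pause_seconds * sample_rate)))
    return audio
```

Samples are plain Python lists here to keep the sketch dependency-free; a real pipeline would more likely use arrays. The splitter is deliberately naive: it also splits on the period in a decimal such as 0.8, which is one reason a built-in, tokenizer-aware option would be preferable to a user-side workaround.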
Environment
- VoxCPM2 (voxcpm==2.0.2)
- macOS 15.x, Apple M4 Pro
- PyTorch 2.11.0, MPS backend, bfloat16
- Python 3.12