Feature request: reliable pause/silence control for Voice Design mode

## Feature Request: Reliable pause/silence control in Voice Design mode

### Context

When using VoxCPM2 in **Voice Design mode** (text description in parentheses, no reference audio) for long-form narration, sentence-to-sentence pacing is too tight — the next sentence begins almost immediately after a period, which is fatiguing for the listener.

I tried the following approaches to control pause duration. None of them worked reliably.

### What doesn't work

1. **Explicit duration instructions in the Voice Design prompt.** Adding phrases like *"take a one-second pause after every period"* or *"long pauses between sentences"* to the parenthetical description does not produce measurable silence between sentences. The model does not appear to act on timing instructions in the description.

2. **Inline parenthetical pause commands mid-text** — inserting `(pause for 1 second)` or similar in the middle of the input text causes those words to be spoken verbatim. Parentheses are only parsed as a style descriptor at the very start of the input, not as inline control tokens.

3. **Ellipsis + em-dash** (`... — ...`) between sentences produces slightly longer pauses than a plain period — better than nothing, but the difference is subtle and not nearly enough for comfortable narration pacing. It is also undocumented and pollutes the text for any downstream consumer (subtitles, transcripts, etc.).

### Related existing issues

- #56 — "Speed parameter?" (open, maintainer: *"we are considering adding this parameter"*)
- #210 — "Style control instructions ignored" (closed; clarified that Ultimate Cloning ignores style prompts — but this issue is about Voice Design, which is the style-controlled mode)
- #117 — "Support word and punctuation timestamps" (about output timestamps, not input pause control)

To my knowledge, no existing issue covers **input-side pause duration control**.

### Proposal

Any of the following would resolve this use case — in rough order of preference:

1. **SSML-style inline break tokens**, e.g. `<break time="1000ms"/>` or `[pause:1s]`. Deterministic, composable, and familiar from other TTS systems (Amazon Polly, Azure, ElevenLabs).

2. **Configurable punctuation→pause mapping** on `model.generate(...)`, e.g. `pause_map={".": 0.8, ",": 0.3, ";": 0.5}`. Even if implemented as post-processing on the output waveform, this would unblock most users.

3. **Documentation** on which input patterns reliably produce longer silences in the current model, so users can make informed choices without reverse-engineering.
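
To make options 1 and 2 concrete, here is a minimal sketch of the post-processing variant: split the input on break tokens or mapped punctuation, synthesize each spoken segment separately, and splice silence in between. Everything here is an assumption for illustration. The token syntax and `pause_map` are the ones proposed above, and `synthesize` is a hypothetical stand-in for whatever call returns a sequence of samples for a piece of text; it is not the actual VoxCPM API.

```python
import re
from typing import Callable, Dict, List, Sequence, Union

# Hypothetical token syntax from the proposal above:
#   <break time="1000ms"/>   or   [pause:1s]
BREAK_RE = re.compile(
    r'<break\s+time="(\d+(?:\.\d+)?)(ms|s)"\s*/>'
    r'|\[pause:(\d+(?:\.\d+)?)(ms|s)\]'
)

Part = Union[str, float]  # str = text to speak, float = pause in seconds


def split_on_breaks(text: str) -> List[Part]:
    """Split text into spoken segments and explicit pauses (option 1)."""
    parts: List[Part] = []
    pos = 0
    for m in BREAK_RE.finditer(text):
        spoken = text[pos:m.start()].strip()
        if spoken:
            parts.append(spoken)
        # Exactly one of the two alternatives matched; pick its groups.
        if m.group(1) is not None:
            value, unit = m.group(1), m.group(2)
        else:
            value, unit = m.group(3), m.group(4)
        parts.append(float(value) / 1000.0 if unit == "ms" else float(value))
        pos = m.end()
    tail = text[pos:].strip()
    if tail:
        parts.append(tail)
    return parts


def split_on_punctuation(text: str, pause_map: Dict[str, float]) -> List[Part]:
    """Split after each mapped mark that is followed by whitespace or end of text (option 2)."""
    marks = "".join(re.escape(mark) for mark in pause_map)
    pattern = re.compile(rf"([{marks}])(?:\s+|$)")
    parts: List[Part] = []
    pos = 0
    for m in pattern.finditer(text):
        spoken = text[pos:m.end(1)].strip()  # keep the punctuation in the spoken text
        if spoken:
            parts.append(spoken)
        parts.append(float(pause_map[m.group(1)]))
        pos = m.end()
    tail = text[pos:].strip()
    if tail:
        parts.append(tail)
    return parts


def render(
    parts: List[Part],
    synthesize: Callable[[str], Sequence[float]],  # hypothetical TTS call
    sample_rate: int,
) -> List[float]:
    """Concatenate synthesized segments with zero-valued samples for each pause."""
    audio: List[float] = []
    for part in parts:
        if isinstance(part, float):
            audio.extend([0.0] * round(part * sample_rate))
        else:
            audio.extend(synthesize(part))
    return audio
```

With this sketch, `split_on_punctuation("Hello, world. Bye", {".": 0.8, ",": 0.3})` yields `["Hello,", 0.3, "world.", 0.8, "Bye"]`. Note that synthesizing each segment in a separate call may affect voice consistency and prosody across segments, which is one reason native support in the model or in `model.generate(...)` would be preferable to a user-side workaround.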

### Environment

- VoxCPM2 (`voxcpm==2.0.2`)
- macOS 15.x, Apple M4 Pro
- PyTorch 2.11.0, MPS backend, bfloat16
- Python 3.12
