Feature request: reliable pause/silence control for Voice Design mode #276

@ergut

Feature Request: Reliable pause/silence control in Voice Design mode

Context

When using VoxCPM2 in Voice Design mode (text description in parentheses, no reference audio) for long-form narration, sentence-to-sentence pacing is too tight — the next sentence begins almost immediately after a period, which is fatiguing for the listener.

I tried the following approaches to control pause duration. None is reliably effective:

What doesn't work

  1. Explicit duration instructions in the Voice Design prompt: adding phrases like "take a one-second pause after every period" or "long pauses between sentences" to the parenthetical description does not produce measurable silence between sentences. The model does not appear to interpret timing instructions semantically.

  2. Inline parenthetical pause commands mid-text: inserting (pause for 1 second) or similar in the middle of the input text causes those words to be spoken verbatim. Parentheses appear to be parsed as a style descriptor only at the very start of the input, not as inline control tokens.

  3. Ellipsis + em-dash (... — ...) between sentences produces slightly longer pauses than a plain period — better than nothing, but the difference is subtle and not nearly enough for comfortable narration pacing. It is also undocumented and pollutes the text for any downstream consumer (subtitles, transcripts, etc.).
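To make comparisons like the ones above reproducible, the silent gaps in a generated waveform can be measured directly. The following is a minimal sketch, not part of VoxCPM: the function name `silent_gaps`, the -40 dB threshold, the 10 ms frame size, and the 0.15 s minimum gap are all assumptions chosen for illustration, and the waveform is assumed to be a mono float array scaled to [-1, 1].

```python
import numpy as np

def silent_gaps(wave, sample_rate, threshold_db=-40.0, frame_ms=10.0, min_gap_s=0.15):
    """Return (start_s, duration_s) for each low-energy run of at least min_gap_s.

    Assumes a mono float waveform scaled to [-1, 1]; all thresholds are
    illustrative defaults, not values taken from any TTS library.
    """
    wave = np.asarray(wave, dtype=np.float64)
    frame = max(1, int(sample_rate * frame_ms / 1000.0))
    n_frames = len(wave) // frame
    if n_frames == 0:
        return []
    # Per-frame RMS energy, converted to dB relative to full scale (1.0).
    frames = wave[: n_frames * frame].reshape(n_frames, frame)
    rms = np.sqrt(np.mean(frames ** 2, axis=1))
    quiet = 20.0 * np.log10(np.maximum(rms, 1e-10)) < threshold_db
    # Pad with False so every quiet run has both a rising and a falling edge.
    padded = np.concatenate(([False], quiet, [False]))
    edges = np.diff(padded.astype(np.int8))
    starts = np.flatnonzero(edges == 1)   # first quiet frame of each run
    ends = np.flatnonzero(edges == -1)    # first non-quiet frame after each run
    frame_s = frame / sample_rate
    return [
        (float(s * frame_s), float((e - s) * frame_s))
        for s, e in zip(starts, ends)
        if (e - s) * frame_s >= min_gap_s
    ]
```

Running this on the output for a plain period and for the ellipsis variant, with the same text otherwise, gives gap durations that can be compared instead of judged by ear.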

Related existing issues

To my knowledge, no existing issue covers input-side pause duration control.

Proposal

Any of the following would resolve this use case — in rough order of preference:

  1. SSML-style inline break tokens, e.g. <break time="1000ms"/> or [pause:1s]. Deterministic, composable, and familiar from other TTS systems (Amazon Polly, Azure, ElevenLabs).

  2. Configurable punctuation→pause mapping on model.generate(...), e.g. pause_map={".": 0.8, ",": 0.3, ";": 0.5}. Even if implemented as post-processing on the output waveform, this would unblock most users.

  3. Documentation on which input patterns reliably produce longer silences in the current model, so users can make informed choices without reverse-engineering.
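As a rough illustration of how options 1 and 2 could be approximated as pre- and post-processing around the model, here is a minimal sketch. Everything in it is hypothetical: `plan_segments`, `stitch`, and the `synthesize` callable are names invented for this example, and `synthesize` stands in for whatever wrapper a user writes around their own generation call. The token syntax is the one proposed above, and `pause_map` keys are assumed to be single characters.

```python
import re
import numpy as np

# Hypothetical inline tokens from the proposal: [pause:1s] or <break time="1000ms"/>
BREAK_RE = re.compile(
    r'\[pause:(?P<a>\d+(?:\.\d+)?)(?P<au>ms|s)\]'
    r'|<break\s+time="(?P<b>\d+(?:\.\d+)?)(?P<bu>ms|s)"\s*/>'
)

def _split_on_punctuation(text, pause_map):
    """Split after mapped punctuation followed by whitespace; attach its pause."""
    text = text.strip()
    if not text:
        return []
    if not pause_map:
        return [(text, 0.0)]
    marks = "".join(re.escape(p) for p in pause_map)  # single-character keys assumed
    pieces = re.split(rf"(?<=[{marks}])\s+", text)
    return [(p, pause_map.get(p[-1], 0.0)) for p in pieces if p]

def plan_segments(text, pause_map):
    """Return [(chunk, pause_seconds_after_chunk), ...] with break tokens removed."""
    segments, pos = [], 0
    for m in BREAK_RE.finditer(text):
        segments.extend(_split_on_punctuation(text[pos:m.start()], pause_map))
        value, unit = m["a"] or m["b"], m["au"] or m["bu"]
        seconds = float(value) / 1000.0 if unit == "ms" else float(value)
        if segments:
            # An explicit break overrides the punctuation pause; the last one wins.
            segments[-1] = (segments[-1][0], seconds)
        else:
            segments.append(("", seconds))  # leading break: silence before speech
        pos = m.end()
    segments.extend(_split_on_punctuation(text[pos:], pause_map))
    return segments

def stitch(segments, synthesize, sample_rate):
    """Concatenate synthesized chunks, inserting zero samples as silence."""
    parts = []
    for chunk, pause in segments:
        if chunk:
            parts.append(np.asarray(synthesize(chunk), dtype=np.float32))
        if pause > 0:
            parts.append(np.zeros(int(round(pause * sample_rate)), dtype=np.float32))
    return np.concatenate(parts) if parts else np.zeros(0, dtype=np.float32)
```

Because the break tokens are stripped during planning, the chunk texts stay clean for subtitles or transcripts. One caveat that this sketch does not address: synthesizing each chunk in a separate call may change prosody, and in Voice Design mode without reference audio the voice itself may not stay consistent across chunks, so a built-in solution would still be preferable.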

Environment

  • VoxCPM2 (voxcpm==2.0.2)
  • macOS 15.x, Apple M4 Pro
  • PyTorch 2.11.0, MPS backend, bfloat16
  • Python 3.12
