Vui - 100M Parameter On-Device Conversational Text-to-Speech

Vui is a lightweight, open-source text-to-speech model with only 100M parameters, designed for natural conversational speech synthesis. Built on a Llama-style transformer architecture, it generates expressive multi-speaker dialogue with breaths, laughter, hesitations, and other non-verbal sounds.

Vui was trained on 40,000 hours of real audio conversations and runs on consumer GPUs.

Try the live demo on Hugging Face Spaces

Features

100M parameters - small enough for on-device and edge deployment
Conversational speech - trained on real conversations, not studio recordings
Non-verbal sounds - generates [breath], [laugh], [sigh], [hesitate], [tut] naturally
Multi-speaker - COHOST model handles two-speaker dialogues
Voice cloning - clone from audio samples with the base model
Streaming - real-time streaming synthesis with CUDA graph acceleration
Custom audio codec - Fluac, a modified DAC with FSQ that reduces the token rate from 86 Hz to 21.5 Hz (4x reduction)
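The 4x figure in the codec bullet is the ratio of the two frame rates. A minimal sketch of that arithmetic follows; `frames_for` is a hypothetical helper written for this illustration and is not part of the `vui` package.

```python
import math

# Frame rates quoted in the feature list.
DAC_RATE_HZ = 86.0
FLUAC_RATE_HZ = 21.5


def frames_for(seconds: float, rate_hz: float) -> int:
    """Number of codec frames needed to cover `seconds` of audio, rounded up."""
    return math.ceil(seconds * rate_hz)


# Ratio between the two rates: how many DAC frames one Fluac frame replaces.
reduction = DAC_RATE_HZ / FLUAC_RATE_HZ

if __name__ == "__main__":
    print(f"reduction: {reduction:g}x")
    for seconds in (1, 10, 60):
        dac = frames_for(seconds, DAC_RATE_HZ)
        fluac = frames_for(seconds, FLUAC_RATE_HZ)
        print(f"{seconds:>3} s of audio: {dac} frames at 86 Hz, {fluac} frames at 21.5 Hz")
```

For a fixed number of frames in the model's context, the lower rate therefore covers roughly four times as much audio.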

Quick Start

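from vui.model import Vui
from vui.inference import render
from torchcodec.encoders import AudioEncoder

# Load the single-speaker checkpoint, move it to the GPU, and switch to eval mode.
model = Vui.from_pretrained(Vui.ABRAHAM).cuda().eval()

# Synthesize speech; inline tags such as [breath] and [hesitate] request non-verbal sounds.
audio = render(model, """
So [breath] the thing about this is, it's not what you'd expect, right?
Um, it's actually [hesitate] completely different.
""")

# Encode audio[0] at 22,050 Hz and write it to output.wav.
encoder = AudioEncoder(audio[0], sample_rate=22050)
encoder.to_file("output.wav")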

Comparison with Other Small TTS Models

Model	Params	Conversational	Multi-Speaker	Voice Cloning	Breaths & Non-Verbal	Streaming
Vui	100M	Yes	Yes	Yes	Yes	Yes
Kokoro	82M	No	No	No	No	No
Pocket TTS	100M	No	No	Yes	No	No
KittenTTS	14-80M	No	No	No	No	No
Orpheus	150M+	Partial	No	No	Partial	No

Models

Model	Description
`Vui.BASE`	Base checkpoint trained on 40k hours of audio conversations
`Vui.ABRAHAM`	Single-speaker model with context-aware replies
`Vui.COHOST`	Two-speaker model for multi-speaker dialogue

Architecture

Vui is a Llama-style causal transformer that predicts audio tokens from text:

Text encoder: ByT5 byte-level tokenizer
Decoder: 6-layer transformer, 512 dim, 8 heads, RMSNorm, SiLU, RoPE
Audio codec: Fluac - a modified Descript Audio Codec using Finite Scalar Quantization (FSQ) with 9 codebooks at 1000 entries each
Token rate: ~21.5 Hz (vs 86 Hz for standard DAC), enabling longer context windows
Inference: KV caching + CUDA graphs for fast autoregressive generation
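To make the "1000 entries" figure in the codec line concrete, here is a simplified sketch of Finite Scalar Quantization in plain Python. It is not Fluac's implementation. The per-dimension level counts `(8, 5, 5, 5)` are an assumption chosen only because their product is 1000, and the rounding scheme is a simplification (a trainable FSQ layer also needs a straight-through gradient, which is omitted here). All function names are hypothetical.

```python
import math
from typing import List, Sequence

# Assumed level counts per latent dimension; 8 * 5 * 5 * 5 == 1000.
# The README does not state the levels Fluac actually uses.
LEVELS = (8, 5, 5, 5)


def codebook_size(levels: Sequence[int] = LEVELS) -> int:
    """An FSQ codebook is implicit: its size is the product of the level counts."""
    return math.prod(levels)


def fsq_quantize(z: Sequence[float], levels: Sequence[int] = LEVELS) -> List[int]:
    """Quantize each latent value to an integer level in [0, L - 1].

    Simplified scheme: squash with tanh into (-1, 1), rescale to [0, L - 1],
    then round to the nearest integer.
    """
    if len(z) != len(levels):
        raise ValueError("latent vector and levels must have the same length")
    return [round((math.tanh(v) + 1.0) / 2.0 * (n - 1)) for v, n in zip(z, levels)]


def fsq_index(digits: Sequence[int], levels: Sequence[int] = LEVELS) -> int:
    """Pack the per-dimension levels into a single token id (mixed-radix number)."""
    index = 0
    for digit, n in zip(digits, levels):
        index = index * n + digit
    return index


if __name__ == "__main__":
    latent = [10.0, -10.0, 0.0, 10.0]
    digits = fsq_quantize(latent)
    print(codebook_size(), digits, fsq_index(digits))
```

Because every dimension is rounded independently onto a fixed grid, there is no learned codebook to look up; the token id is just the position of the rounded vector on that grid.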

Non-Verbal Sound Tags

Vui understands these inline tags for expressive speech:

[breath]    - breathing sounds
[laugh]     - laughter
[sigh]      - sighing
[hesitate]  - filled pauses / um / uh
[tut]       - tutting

Example:

And I'm Jamie um [laugh] today, we're diving into a [hesitate] topic
that's transforming customer service [breath] voice technology for agents.
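Since only the five tags listed above are documented, it can help to check a script for other bracketed tags before synthesis. The sketch below uses a hypothetical helper, `find_unknown_tags`, that is not part of the `vui` package; it assumes tags are matched exactly as written in lowercase, and it makes no claim about how the model treats tags outside the list.

```python
import re
from typing import List

# The tags documented in this section.
SUPPORTED_TAGS = {"breath", "laugh", "sigh", "hesitate", "tut"}

# Matches anything of the form [text] with no nested brackets.
TAG_PATTERN = re.compile(r"\[([^\[\]]+)\]")


def find_unknown_tags(script: str) -> List[str]:
    """Return bracketed tags in `script` that are not in SUPPORTED_TAGS, in order."""
    return [tag for tag in TAG_PATTERN.findall(script) if tag not in SUPPORTED_TAGS]


if __name__ == "__main__":
    script = (
        "And I'm Jamie um [laugh] today, we're diving into a [hesitate] topic\n"
        "that's transforming customer service [breath] voice technology for agents."
    )
    print(find_unknown_tags(script))
    print(find_unknown_tags("Well [cough] that was odd [laugh]"))
```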

Installation

Before running demo.py, you must accept model terms for Voice Activity Detection and Segmentation on Hugging Face.

Linux

uv pip install -e .

Windows

uv venv
.venv\Scripts\activate
uv pip install -e .
uv pip install triton_windows

Demo

python demo.py

Or try it on Hugging Face Spaces.

Voice Cloning

You can clone voices with the base model: pass an audio sample and the model adapts to the speaker's characteristics. Quality varies, as the model has not been extensively trained for this task.

FAQ

Developed on two 4090s: https://x.com/harrycblum/status/1752698806184063153
The model hallucinates occasionally; this is the best that was achievable with limited compute resources.
VAD slows things down but is needed to remove silence regions.

Attributions

Citation

@software{vui_2025,
  author = {Coultas Blum, Harry},
  month = {01},
  title = {{Vui: 100M Parameter Conversational Text-to-Speech}},
  url = {https://github.com/fluxions-ai/vui},
  version = {1.0.0},
  year = {2025}
}
