@xenova xenova commented Sep 11, 2025

Currently, cfg_weight=0.0 fails because the inference code assumes that classifier-free guidance (CFG) is always performed. As a result, the following code fails:

import torchaudio as ta
from chatterbox.tts import ChatterboxTTS

model = ChatterboxTTS.from_pretrained(device="cpu")

text = "Ezreal and Jinx teamed up with Ahri, Yasuo, and Teemo to take down the enemy's Nexus in an epic late-game pentakill."
wav = model.generate(text, cfg_weight=0.0)
ta.save("test-1.wav", wav, model.sr)

with this error:

    wav = model.generate(text, cfg_weight=0.0)
  File "/Library/Frameworks/Python.framework/Versions/3.13/lib/python3.13/site-packages/chatterbox/tts.py", line 246, in generate
    speech_tokens = self.t3.inference(
        t3_cond=self.conds.t3,
    ...<6 lines>...
        top_p=top_p,
    )
  File "/Library/Frameworks/Python.framework/Versions/3.13/lib/python3.13/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/Library/Frameworks/Python.framework/Versions/3.13/lib/python3.13/site-packages/chatterbox/models/t3/t3.py", line 309, in inference
    inputs_embeds = torch.cat([embeds, bos_embed], dim=1)
RuntimeError: Sizes of tensors must match except in dimension 1. Expected size 1 but got size 2 for tensor number 1 in the list.

This PR fixes the error so that cfg_weight can be set to 0. Generating without guidance is ~2x faster, but it produces slightly worse results.
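For context, here is a minimal sketch of the idea, not the actual diff. It assumes the common CFG formulation `cond + cfg_weight * (cond - uncond)`, which may differ from what `t3.py` does internally. The names `decode_step` and `toy_forward` are hypothetical and exist only for illustration. The point is that the unconditional row is only added to the batch when guidance is enabled, so every tensor that gets concatenated shares the same batch size.

```python
from typing import Callable, List

Logits = List[float]


def decode_step(forward: Callable[[List[str]], List[Logits]], cfg_weight: float) -> Logits:
    """One decoding step with optional classifier-free guidance.

    `forward` is a hypothetical stand-in for the model: it takes one label
    per batch row and returns one logits list per row.
    """
    use_cfg = cfg_weight > 0.0
    # Only build the second (unconditional) row when guidance is enabled.
    rows = ["cond", "uncond"] if use_cfg else ["cond"]
    outputs = forward(rows)

    cond = outputs[0]
    if not use_cfg:
        # Batch size 1: nothing to mix, return the conditional logits as-is.
        return list(cond)

    uncond = outputs[1]
    # Assumed CFG mixing rule (see the note above).
    return [c + cfg_weight * (c - u) for c, u in zip(cond, uncond)]


def toy_forward(rows: List[str]) -> List[Logits]:
    """Toy model with fixed logits per row type, for demonstration only."""
    table = {"cond": [1.0, 2.0], "uncond": [0.0, 0.0]}
    return [table[row] for row in rows]


if __name__ == "__main__":
    print(decode_step(toy_forward, cfg_weight=0.0))
    print(decode_step(toy_forward, cfg_weight=0.5))
```

With cfg_weight=0.0 the sketch runs the model on a single row instead of two, which is consistent with the roughly 2x speedup mentioned above.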
