Add StyleTTS 2 #35790
base: main
Conversation
cc @Cyrilvallez, modeling code is ready for review 🤗
Alright! Super sorry about the (very) long delay! Very nice implementation, not much to complain about here! Great work! 🤗
Mostly just a few recurrent but easy-to-fix things:
- Use only the config in `__init__` if possible. I did not check each one, and some may have additional params that do not really make sense in the config because they are hard-coded, such as the boolean args in `StyleTextToSpeech2AdainResBlock1d`, but let's still try to use the config as much as possible (and add additional parameters if still needed for internal-only purposes).
- If possible, all the `transpose` ops should have explicit dim numbers; let's try not to use `-1` if the number of dims is always the same (not sure here).
- The `weight_norm` function should probably always be one or the other, as layer names depend on it (if needed, let's use a hard PyTorch version check for this one).
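As a minimal sketch of the config-driven `__init__` pattern suggested here (the class name is from the PR, but the signature, config fields, and layer are illustrative assumptions, not the actual code):

```python
import torch
import torch.nn as nn


class StyleTextToSpeech2AdainResBlock1d(nn.Module):
    # Hypothetical sketch: sizes come from the config, not loose positional args
    def __init__(self, config, upsample=False):
        super().__init__()
        self.conv = nn.Conv1d(
            config.hidden_size, config.hidden_size, kernel_size=3, padding=1
        )
        # Internal-only, hard-coded flag kept as a plain argument, not a config field
        self.upsample = upsample
```

Parameters that are always hard-coded at call sites arguably belong as plain `__init__` arguments with defaults, keeping the config reserved for values users can actually vary.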
```python
if hasattr(nn.utils.parametrizations, "weight_norm"):
    weight_norm = nn.utils.parametrizations.weight_norm
```
We can only use this one anyway due to the layer naming in the state dicts, no? Did you add the check for older torch versions? Also, IMO it would help clarity to wrap the layer declaration directly with the function, as opposed to applying it afterwards.
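A hedged sketch of what wrapping at declaration time could look like (the layer and sizes are illustrative, with a fallback import for older torch versions):

```python
import torch
import torch.nn as nn

try:
    # torch >= 2.1: parametrization-based variant (changes state-dict key names)
    from torch.nn.utils.parametrizations import weight_norm
except ImportError:
    # older torch: the deprecated functional variant
    from torch.nn.utils import weight_norm

# Wrap the layer directly at declaration instead of applying weight_norm afterwards
conv = weight_norm(nn.Conv1d(16, 16, kernel_size=3, padding=1))
```

Note that the two variants produce different parameter names in the state dict (`weight_g`/`weight_v` vs. `parametrizations.weight.original0`/`original1`), which is why sticking to one matters for checkpoint loading.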
True! We do not need to check for older torch versions, right? `parametrizations.weight_norm` is supported anyway for torch>=2.1.
```python
    batch_first=True,
    enforce_sorted=False
)
self.lstm.flatten_parameters()
```
Hmm, do we need this call on every forward?
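For context, `flatten_parameters()` compacts the LSTM weights into one contiguous buffer for cuDNN; it usually only needs to run once after construction (or after moving the module to the GPU), not on every forward. A minimal sketch of that pattern, with illustrative names:

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence


class PackedLSTM(nn.Module):
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.lstm = nn.LSTM(input_size, hidden_size, batch_first=True)
        # Compact weights once here, rather than on every forward call
        self.lstm.flatten_parameters()

    def forward(self, x, lengths):
        # Pack variable-length sequences so padding is skipped by the LSTM
        packed = pack_padded_sequence(x, lengths, batch_first=True, enforce_sorted=False)
        out, _ = self.lstm(packed)
        out, _ = pad_packed_sequence(out, batch_first=True)
        return out
```

A per-forward call is mainly needed when the weights can become non-contiguous between calls (e.g. under `DataParallel`).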
```python
class StyleTextToSpeech2Decoder(StyleTextToSpeech2PretrainedModel):
    base_model_prefix = "decoder"
    config_class = StyleTextToSpeech2DecoderConfig
    main_input_name = "hidden_states"
```
Here it's actually different from the default, but from a quick glance at the library I'm not sure it's useful to set it.
Also, I assume modular could not be used at all here?
Any update on this?
Hey @monuminu, this PR went stale due to other priorities. I’ll try to get it moving ASAP, though I’m not sure I’ll manage before the summer break. Thanks for checking in, I’ll bump its priority.
What does this PR do?
Adds StyleTTS 2, supporting the original model as well as other checkpoints like Kokoro.
🆕 🔥 This implementation also adds batch support (early benchmarks show ~50% inference speed improvement for BS 128) and mask support for padded inputs.
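As a generic illustration of the mask support mentioned above (not the PR's actual API): padded positions in a batch of unequal-length inputs can be zeroed with a boolean length mask:

```python
import torch

lengths = torch.tensor([5, 3])          # true lengths of two batched inputs
max_len = int(lengths.max())
# (batch, max_len) boolean mask: True on real positions, False on padding
mask = torch.arange(max_len)[None, :] < lengths[:, None]
hidden = torch.randn(2, max_len, 8)     # e.g. padded hidden states
hidden = hidden * mask.unsqueeze(-1)    # zero out padded frames
```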
Note
This implementation differs slightly from the original codebase. The aim here is to have the clearest possible correspondence in naming and structure with the original papers (StyleTTS 2, which builds on StyleTTS).
TODO
Benchmarks
BS 1