
Conversation


@Deep-unlearning commented Apr 29, 2025

What does this PR do?

This PR adds support for XCodec2, a high-fidelity general neural audio codec used in the Llasa text-to-speech model, to the Transformers library.

This model is composed of 5 components:

  • A Semantic Encoder
  • An Acoustic Encoder
  • A VectorQuantizer
  • A Semantic Decoder
  • An Acoustic Decoder
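
For context, a rough sketch of the intended end-to-end usage once conversion and batching land (the repo id, the decode call, and the 16 kHz rate are assumptions at this stage, not part of the PR yet):

import numpy as np
from transformers import AutoFeatureExtractor, AutoModel

# Placeholder repo id; the converted checkpoint is not on the Hub yet.
model = AutoModel.from_pretrained("hf-audio/xcodec2")
feature_extractor = AutoFeatureExtractor.from_pretrained("hf-audio/xcodec2")

audio_sample = np.zeros(16000, dtype=np.float32)  # 1 s of silence, assuming a 16 kHz model
inputs = feature_extractor(raw_audio=audio_sample, return_tensors="pt")

audio_codes = model.encode(inputs["input_values"], return_dict=False)  # discrete codes
reconstructed = model.decode(audio_codes, return_dict=False)           # waveform back (decode API assumed)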

This is still a draft PR. Work done so far:

  • Adapted the model to Transformers format in modeling_xcodec2.py and modular_xcodec2.py.

Todo

  • Add the checkpoint conversion scripts and push to the hub
  • Support batch inference
  • Write tests
  • Add documentation

Who can review?

cc: @ArthurZucker
cc: @eustlb @Vaibhavs10 for visibility

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@ArthurZucker

Feel free to ping me once this is ready!

@Deep-unlearning changed the title from [WiP] Add xcodec model to [WiP] Add xcodec2 model on Jun 3, 2025
@Deep-unlearning marked this pull request as ready for review June 5, 2025 16:03
@ArthurZucker removed the request for review from Rocketknight1 July 7, 2025 12:05
@ArthurZucker removed their request for review July 7, 2025 12:05
@Deep-unlearning requested a review from eustlb July 7, 2025 13:53
@ebezzam added the Audio label Jul 22, 2025
@ebezzam self-requested a review July 24, 2025 15:56
@ebezzam left a comment

@Deep-unlearning this is my first time reviewing a model addition, so sorry if I'm nit-picky! I recently did a deep dive through DAC and EnCodec, so most of my comments are about making things consistent with those models:

  • simplifying the configuration
  • whether we keep nn.Sequential and weight_norm. @eustlb will probably know better for that
  • similar integration tests

class XCodec2IntegrationTest(unittest.TestCase):
    def test_integration(self):
        expected_rmse = 0.07212554663419724
        expected_codes = [

Wrap this in

# fmt: off
expected_codes = [...]
# fmt: on

to prevent make fixup from putting each element on its own line.

audio_codes = model.encode(inputs["input_values"], return_dict=False)
codes = audio_codes.squeeze(0).squeeze(0).tolist()

self.assertEqual(codes, expected_codes)

Check new tests from EnCodec that @eustlb and I iterated on, namely:

  • checking with torch.testing.assert_close (see the sketch after this list)
  • making gist out of the script you used to compute the expected outputs
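
Concretely, that amounts to something like the following (a sketch only; the reference tensors would come from the gisted script, and the variable names are placeholders):

import torch

def check_outputs(audio_codes, decoded_audio, expected_codes, expected_audio):
    # Integer codes are compared exactly by assert_close; the waveform is
    # compared with the default float tolerances instead of a hand-rolled RMSE check.
    torch.testing.assert_close(audio_codes.cpu(), expected_codes)
    torch.testing.assert_close(decoded_audio.cpu(), expected_audio)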



# Copied from transformers.tests.encodec.test_modeling_encodec.compute_rmse
def compute_rmse(arr1, arr2):

Not being used? Also, there's a newer version of this helper.

arr_enc_dec = input_values_enc_dec[0].cpu().numpy()
arr_enc_dec_truncated = arr_enc_dec[:, : arr.shape[1]]
rmse = np.sqrt(((arr - arr_enc_dec_truncated) ** 2).mean())
self.assertTrue(np.abs(rmse - expected_rmse) < 1e-6)

adding a batch test?
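
Something along these lines could work (a sketch; the encode signature and output layout are assumptions based on the snippet above):

import torch

def assert_batching_consistent(model, input_values: torch.Tensor):
    # Encoding a batch should match encoding each sample on its own
    # (equal-length inputs assumed here, so padding does not come into play).
    batched_codes = model.encode(input_values, return_dict=False)
    per_sample = [model.encode(x.unsqueeze(0), return_dict=False) for x in input_values]
    torch.testing.assert_close(batched_codes, torch.cat(per_sample, dim=0))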

return x


class XCodec2DecoderLayer(LlamaDecoderLayer):

Note for @eustlb
This component is meant to replace TransformerBlock from the original implementation

@eustlb left a comment

Thanks for the work, @Deep-unlearning! 🤗
For this first review pass, I focused on reacting to @ebezzam’s comments. I’ll take a broader look once those are addressed.

return weight_norm(nn.Conv1d(*args, **kwargs))


class XCodec2SnakeBeta(nn.Module):

Both are OK IMO; since it's specifically a modified snake here, I would actually keep XCodec2SnakeBeta.
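
For reference, the snake-beta activation the name refers to, roughly as in BigVGAN-style implementations (a sketch; the exact parametrization in this PR may differ):

import torch
from torch import nn

class SnakeBetaSketch(nn.Module):
    def __init__(self, channels: int, alpha_logscale: bool = True):
        super().__init__()
        init = torch.zeros(channels) if alpha_logscale else torch.ones(channels)
        self.alpha = nn.Parameter(init.clone())  # frequency of the periodic term
        self.beta = nn.Parameter(init.clone())   # magnitude of the periodic term
        self.alpha_logscale = alpha_logscale

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time)
        alpha, beta = self.alpha, self.beta
        if self.alpha_logscale:
            alpha, beta = torch.exp(alpha), torch.exp(beta)
        alpha = alpha[None, :, None]
        beta = beta[None, :, None]
        # snake-beta: x + 1/beta * sin^2(alpha * x)
        return x + (1.0 / (beta + 1e-9)) * torch.sin(alpha * x) ** 2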

Comment on lines 556 to 573
class EncoderBlock(nn.Module):
    def __init__(self, dim: int = 16, stride: int = 1, dilations=(1, 3, 9)):
        super().__init__()
        runits = [ResidualUnit(dim // 2, dilation=d) for d in dilations]
        self.block = nn.Sequential(
            *runits,
            Activation1d(activation=XCodec2SnakeBeta(dim // 2, alpha_logscale=True)),
            WNConv1d(
                dim // 2,
                dim,
                kernel_size=2 * stride,
                stride=stride,
                padding=stride // 2 + stride % 2,
            ),
        )

    def forward(self, x):
        return self.block(x)

If simple and clear, yes!

nn.init.constant_(m.bias, 0)


class XCodec2CodecEncoder_Transformer(nn.Module):

Use naming aligned with Transformers conventions (I'll let you check other modeling files); we don't use underscores in class names.

Comment on lines 1518 to 1532
semantic_model_config = AutoConfig.from_pretrained("facebook/w2v-bert-2.0", output_hidden_states=True)
self.semantic_model = AutoModel.from_config(semantic_model_config)
self.semantic_model.eval()

self.SemanticEncoder_module = XCodec2SemanticEncoder(
config.semantic_hidden_size, config.semantic_hidden_size, config.semantic_hidden_size
)

self.CodecEnc = XCodec2CodecEncoder_Transformer()

self.generator = XCodec2CodecDecoderVocos(config=config)

self.fc_prior = nn.Linear(2048, 2048)
self.fc_post_a = nn.Linear(2048, 1024)
feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/w2v-bert-2.0")

@eustlb should loading other checkpoints with from_pretrained be wrapped in something like a processor? Or is it fine to use inside modeling code?
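
One possible direction (a sketch under assumptions, not necessarily how this PR should resolve it): carry the w2v-bert config inside XCodec2Config and build the semantic tower with from_config, so the modeling code never downloads anything; the w2v-bert feature extraction would then live user-side or in a processor.

from transformers import AutoModel, Wav2Vec2BertConfig

# Hypothetical: XCodec2Config would hold a Wav2Vec2BertConfig sub-config
# (populated by the conversion script), so __init__ only needs
#     self.semantic_model = AutoModel.from_config(config.semantic_model_config)
# with the weights coming from the converted XCodec2 checkpoint itself.
semantic_model_config = Wav2Vec2BertConfig(output_hidden_states=True)
semantic_model = AutoModel.from_config(semantic_model_config)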

@ebezzam requested a review from eustlb September 4, 2025 10:57
@eustlb left a comment

Let's iterate on the attention implementation. A lot of lines seem to come from handling non-causal attention, which should simply be handled by setting self.is_causal = False when inheriting, and passing the attention mask when required.
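
Concretely, the suggestion amounts to something like this (a sketch, assuming the modular inheritance pattern used elsewhere in the PR):

from transformers.models.llama.modeling_llama import LlamaAttention

class XCodec2Attention(LlamaAttention):
    def __init__(self, config, layer_idx: int):
        super().__init__(config, layer_idx)
        # Bidirectional attention for the codec transformer; no other override needed.
        self.is_causal = False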

value_states,
attention_mask,
dropout=0.0 if not self.training else self.attention_dropout,
is_causal=self.is_causal,

From what I understood, this is_causal is the only diff from LlamaAttention. I don't get why it's necessary, though?
Normally, setting self.is_causal = False should be enough (see here)? Or of course I may be missing a specificity here.


I think you're right. I hadn't seen the getattr(module, "is_causal", True) you pointed to, so I had kept Steven's version, since it wasn't clear how/if self.is_causal is used in LlamaAttention given that it isn't passed to attention_interface.

In my opinion, this could be made clearer by having a similar comment in LlamaAttention here.

return x


class Xcodec2DecoderLayer(LlamaDecoderLayer):

I don't get what we're doing here. Let's say we're doing non-causal attention. To get a specific non-causal attention pattern, simply passing the correct attention mask to LlamaAttention is enough. When no attention_mask is provided, simply having self.is_causal = False in LlamaAttention is enough to get non-causal attention; otherwise it means there's a (BIG) bug.


Looking into it. The original author's implementation seems closer to LlamaDecoderLayer, but that leads to some functional issues when passing attention_mask as-is.
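
If the mask format is the issue, one standard way to feed a padding-only (non-causal) mask is to expand it into the additive 4D form the attention layer consumes, e.g. (an illustration, not code from this PR):

import torch

def expand_padding_mask(padding_mask: torch.Tensor, dtype: torch.dtype = torch.float32) -> torch.Tensor:
    # padding_mask: (batch, seq_len), 1 for real frames, 0 for padding.
    mask = padding_mask[:, None, None, :].to(dtype)  # (batch, 1, 1, seq_len)
    # 0.0 where attention is allowed, a large negative value where it is masked.
    return (1.0 - mask) * torch.finfo(dtype).min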

Comment on lines 307 to 308
self.alpha.requires_grad = alpha_trainable
self.beta.requires_grad = alpha_trainable
@ebezzam commented Sep 5, 2025

@eustlb FYI, the original model has this set to True and it's never set to False at any point (usage). I will remove it, as I don't see why we would want these parameters to be trainable during inference.

github-actions bot commented Sep 5, 2025

[For maintainers] Suggested jobs to run (before merge)

run-slow: auto, xcodec, xcodec2

@@ -539,7 +539,7 @@ def forward(

>>> inputs = feature_extractor(raw_audio=audio_sample, return_tensors="pt")

>>> outputs = model(**inputs)
>>> outputs = model(inputs["input_values"])

@eustlb DAC, Xcodec, and Xcodec2 don't support model(**inputs), as padding_mask is not an accepted input. Is that fine? Or should padding_mask be added as an input even if it isn't used?
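
One low-cost option (sketched here as an assumption; whether to do it is exactly the open question above) is to accept padding_mask in forward for API parity with EnCodec, even if it is currently unused:

from typing import Optional

import torch
from torch import nn

class ForwardSignatureSketch(nn.Module):
    # Hypothetical signature only: accepting padding_mask (even if unused) lets
    # users call model(**feature_extractor(...)) as they would with EnCodec.
    def forward(self, input_values: torch.Tensor, padding_mask: Optional[torch.Tensor] = None):
        return input_values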
