[WiP] Add xcodec2 model #37868
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Feel free to ping me once this is ready!
@Deep-unlearning my first time reviewing a model addition, so sorry if I'm nit-picky! I recently did a deep dive through DAC and EnCodec, so most of my comments are about making things consistent with those models:
- simplifying the configuration
- whether we keep `nn.Sequential` and `weight_norm` (@eustlb will probably know better for that)
- similar integration tests
```python
class XCodec2IntegrationTest(unittest.TestCase):
    def test_integration(self):
        expected_rmse = 0.07212554663419724
        expected_codes = [
```
Wrap it in

```python
# fmt: off
expected_codes = [...]
# fmt: on
```

to avoid `make fixup` putting each element on a new line.
```python
audio_codes = model.encode(inputs["input_values"], return_dict=False)
codes = audio_codes.squeeze(0).squeeze(0).tolist()

self.assertEqual(codes, expected_codes)
```
```python
# Copied from transformers.tests.encodec.test_modeling_encodec.compute_rmse
def compute_rmse(arr1, arr2):
```
Not being used? Also, there's a newer version.
```python
arr_enc_dec = input_values_enc_dec[0].cpu().numpy()
arr_enc_dec_truncated = arr_enc_dec[:, : arr.shape[1]]
rmse = np.sqrt(((arr - arr_enc_dec_truncated) ** 2).mean())
self.assertTrue(np.abs(rmse - expected_rmse) < 1e-6)
```
adding a batch test?
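A rough sketch of what that could look like, loosely following the EnCodec integration tests (the checkpoint id, `padding=True` support, and the `encode()` output shape are assumptions):

```python
def test_batch(self):
    # Sketch only: checkpoint id, padding support, and encode() output shape
    # are assumptions, loosely following the EnCodec integration tests.
    librispeech_dummy = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
    audio_samples = [x["array"] for x in librispeech_dummy[:2]["audio"]]

    model = XCodec2Model.from_pretrained("<checkpoint-id>").to(torch_device)
    feature_extractor = AutoFeatureExtractor.from_pretrained("<checkpoint-id>")
    inputs = feature_extractor(raw_audio=audio_samples, padding=True, return_tensors="pt").to(torch_device)

    with torch.no_grad():
        audio_codes = model.encode(inputs["input_values"], return_dict=False)

    # Expect one row of codes per batch element; exact values would be
    # pinned the same way as in test_integration.
    self.assertEqual(audio_codes.shape[0], 2)
```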
```python
    return x


class XCodec2DecoderLayer(LlamaDecoderLayer):
```
Note for @eustlb: this component is meant to replace `TransformerBlock` from the original implementation.
Thanks for the work, @Deep-unlearning! 🤗
For this first review pass, I focused on reacting to @ebezzam’s comments. I’ll take a broader look once those are addressed.
```python
    return weight_norm(nn.Conv1d(*args, **kwargs))


class XCodec2SnakeBeta(nn.Module):
```
Both are OK IMO; since it's specifically a modified snake here, I would actually keep `XCodec2SnakeBeta`.
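For context, a minimal sketch of what a snake-beta activation computes (BigVGAN-style, with log-scale parameters as in the `alpha_logscale=True` usage below); this is an illustration, not the PR's exact code:

```python
import torch
from torch import nn


class SnakeBetaSketch(nn.Module):
    # Illustrative sketch of a snake-beta activation (BigVGAN-style),
    # not the PR's exact implementation.
    def __init__(self, channels: int, alpha_logscale: bool = True):
        super().__init__()
        self.alpha_logscale = alpha_logscale
        # log-scale init: zeros -> exp(0) = 1; linear-scale init: ones
        init = torch.zeros(channels) if alpha_logscale else torch.ones(channels)
        self.alpha = nn.Parameter(init.clone())  # frequency of the sin term
        self.beta = nn.Parameter(init.clone())   # magnitude of the sin term

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time); broadcast the per-channel parameters
        alpha = self.alpha.exp() if self.alpha_logscale else self.alpha
        beta = self.beta.exp() if self.alpha_logscale else self.beta
        alpha = alpha[None, :, None]
        beta = beta[None, :, None]
        return x + (1.0 / (beta + 1e-9)) * torch.sin(x * alpha) ** 2
```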
```python
class EncoderBlock(nn.Module):
    def __init__(self, dim: int = 16, stride: int = 1, dilations=(1, 3, 9)):
        super().__init__()
        runits = [ResidualUnit(dim // 2, dilation=d) for d in dilations]
        self.block = nn.Sequential(
            *runits,
            Activation1d(activation=XCodec2SnakeBeta(dim // 2, alpha_logscale=True)),
            WNConv1d(
                dim // 2,
                dim,
                kernel_size=2 * stride,
                stride=stride,
                padding=stride // 2 + stride % 2,
            ),
        )

    def forward(self, x):
        return self.block(x)
```
If simple and clear, yes!
```python
        nn.init.constant_(m.bias, 0)


class XCodec2CodecEncoder_Transformer(nn.Module):
```
Use naming aligned with Transformers conventions (I'll let you check other modeling files): we don't use underscores in class names, so e.g. `XCodec2CodecEncoderTransformer` rather than `XCodec2CodecEncoder_Transformer`.
```python
semantic_model_config = AutoConfig.from_pretrained("facebook/w2v-bert-2.0", output_hidden_states=True)
self.semantic_model = AutoModel.from_config(semantic_model_config)
self.semantic_model.eval()

self.SemanticEncoder_module = XCodec2SemanticEncoder(
    config.semantic_hidden_size, config.semantic_hidden_size, config.semantic_hidden_size
)

self.CodecEnc = XCodec2CodecEncoder_Transformer()

self.generator = XCodec2CodecDecoderVocos(config=config)

self.fc_prior = nn.Linear(2048, 2048)
self.fc_post_a = nn.Linear(2048, 1024)
feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/w2v-bert-2.0")
```
@eustlb should loading other checkpoints with `from_pretrained` be wrapped in something like a processor, or is it fine to use inside modeling code?
Let's iterate on the attention implementation. A lot of lines seem to come from handling non-causal attention, which should be handled simply by setting `self.is_causal = False` when inheriting, and passing an attention mask when required.
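Roughly along these lines (a sketch of the modular pattern; the class name is illustrative):

```python
from transformers.models.llama.modeling_llama import LlamaAttention


class XCodec2Attention(LlamaAttention):
    # Sketch: inherit everything and flip the flag; the shared attention code
    # picks it up via getattr(module, "is_causal", True).
    def __init__(self, config, layer_idx: int):
        super().__init__(config, layer_idx)
        self.is_causal = False
```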
```python
    value_states,
    attention_mask,
    dropout=0.0 if not self.training else self.attention_dropout,
    is_causal=self.is_causal,
```
From what I understood, this `is_causal` is the only diff with `LlamaAttention`. I don't get why it's necessary, though? Normally, setting `self.is_causal = False` should be enough (see here)? Or of course I am missing a specificity here.
I think you're right. I didn't see where you pointed out that there is a `getattr(module, "is_causal", True)`, so I had kept Steven's version, since it wasn't clear how/if `self.is_causal` is used in `LlamaAttention` given that it isn't passed to `attention_interface`. In my opinion, it could be made clearer by having a similar comment in `LlamaAttention` here.
```python
    return x


class Xcodec2DecoderLayer(LlamaDecoderLayer):
```
I don't get what we're doing here. Let's say we're doing non-causal attention. To get a specific non-causal attention, simply passing the correct attention mask to `LlamaAttention` is enough. When no `attention_mask` is provided, simply having `self.is_causal = False` in `LlamaAttention` is enough to have non-causal attention; otherwise it means there's a (BIG) bug.
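For illustration, such a non-causal padding mask could be built along these lines (a sketch; the helper name is hypothetical):

```python
import torch


def bidirectional_attention_mask(padding_mask: torch.Tensor, dtype: torch.dtype = torch.float32) -> torch.Tensor:
    # padding_mask: (batch, seq_len), 1 for real tokens, 0 for padding.
    # Returns an additive mask of shape (batch, 1, seq_len, seq_len):
    # 0 where attention is allowed, a large negative value where masked.
    mask = padding_mask[:, None, None, :].to(dtype)
    return (1.0 - mask) * torch.finfo(dtype).min
```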
Looking into it. The original author's implementation seems closer to `LlamaDecoderLayer`, but that leads to some functional issues when passing `attention_mask` as is.
```python
self.alpha.requires_grad = alpha_trainable
self.beta.requires_grad = alpha_trainable
```
[For maintainers] Suggested jobs to run (before merge): `run-slow: auto, xcodec, xcodec2`
```diff
@@ -539,7 +539,7 @@ def forward(

 >>> inputs = feature_extractor(raw_audio=audio_sample, return_tensors="pt")

->>> outputs = model(**inputs)
+>>> outputs = model(inputs["input_values"])
```
@eustlb DAC, Xcodec, and Xcodec2 don't support `model(**inputs)`, as `padding_mask` is not an accepted input. Is that fine, or should `padding_mask` be added as an input even if it isn't used?
What does this PR do?

This PR adds support for XCodec2, a high-fidelity general neural audio codec used in Llasa (a text-to-speech model), to the Transformers library.

This model is composed of 5 components:

This is still a draft PR. Work done so far:
- `modeling_xcodec2.py` and `modular_xcodec2.py`

Todo
Who can review?
cc: @ArthurZucker
cc: @eustlb @Vaibhavs10 for visibility