
Asking about best practices for fine-tuning? #271

@Crestina2001

Description


Goal

Mimic a specific timbre.

Dataset preparation

Question 1: What is the minimum total length of all the audio segments combined?
Question 2: What is the recommended length range for each segment?
Question 3: Is it beneficial to trim leading and trailing silence? What about leading and trailing breaths? (I have a method to trim them, so I am just wondering whether we should.)
Question 4: Do we need to normalize the volume?
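To make Questions 3 and 4 concrete, here is roughly what I mean by trimming and normalization. This is just a NumPy sketch in my own code, not anything from this repo; the dB thresholds are placeholder values I picked:

```python
import numpy as np

def trim_silence(y: np.ndarray, threshold_db: float = -40.0) -> np.ndarray:
    """Cut leading/trailing samples whose amplitude falls below a dB
    threshold relative to the clip's peak (a simple stand-in for
    librosa.effects.trim)."""
    peak = np.max(np.abs(y))
    if peak == 0:
        return y
    thresh = peak * 10 ** (threshold_db / 20)
    loud = np.nonzero(np.abs(y) > thresh)[0]
    if loud.size == 0:
        return y
    return y[loud[0]: loud[-1] + 1]

def peak_normalize(y: np.ndarray, target_db: float = -1.0) -> np.ndarray:
    """Rescale the waveform so its peak sits at target_db dBFS."""
    peak = np.max(np.abs(y))
    if peak == 0:
        return y
    return y * (10 ** (target_db / 20) / peak)
```

I apply `trim_silence` first, then `peak_normalize`, per clip; my question is whether this (and the breath removal) actually helps the fine-tune.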

Hyper-parameters and success metrics

Question 5: How should batch size and maximum steps be tuned? Is there an empirical formula relating dataset size to maximum steps?
Question 6: I have noticed a phenomenon during fine-tuning: the validation loss rises after a few epochs, yet the timbre similarity keeps improving. Do we need to watch the validation loss at all, or is the training loss the only thing to monitor?
Question 7: What loss/diff value is considered "good"? It seems to drop unsteadily.
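For Question 5, the only relationship I currently assume between dataset size and maximum steps is the basic epoch arithmetic below. The function name, defaults, and the whole approach are my own guess, not from this project:

```python
import math

def max_steps_for_epochs(num_clips: int, batch_size: int, epochs: int,
                         grad_accum: int = 1) -> int:
    """Convert a target epoch count into a max-steps setting: one
    optimizer step consumes batch_size * grad_accum clips, and a partial
    final batch still counts as a step."""
    steps_per_epoch = math.ceil(num_clips / (batch_size * grad_accum))
    return steps_per_epoch * epochs
```

For example, 100 clips at batch size 8 for 10 epochs gives 130 steps. What I don't know is how to pick the epoch count itself as the dataset grows, which is the empirical part I'm asking about.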

Other points of confusion

Question 8: Without reference audio, given only the LoRA fine-tuned model, the timbre can vary a lot: the variation shows up as changes in emotion, volume, and pitch. It is unrealistic for the training clips to all be very similar in emotion, tone, etc., and we do want some emotional variation when the model reads long texts, but the changes are too abrupt and often unintended. What is the best way to improve the stability of the trained model?

Many thanks for any helpful suggestions!
