Goal
mimic a specific timbre
Dataset preparation
Question 1: What is the minimum total duration of all audio segments combined?
Question 2: What is the recommended duration range for each individual segment?
Question 3: Is it beneficial to trim the leading and trailing silence? What about leading and trailing breaths? (I already have a method to trim them, so I only wonder whether we should.)
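For context, this is roughly how my trimming works (a minimal stdlib sketch, not from any particular library; the function name `trim_silence` and the 0.01 amplitude threshold are my own choices):

```python
def trim_silence(samples, threshold=0.01):
    """Drop leading/trailing samples whose absolute amplitude is below threshold.

    samples: a sequence of floats in [-1, 1]; threshold: assumed amplitude cutoff.
    """
    start, end = 0, len(samples)
    # Advance past quiet samples at the head of the clip.
    while start < end and abs(samples[start]) < threshold:
        start += 1
    # Back off past quiet samples at the tail of the clip.
    while end > start and abs(samples[end - 1]) < threshold:
        end -= 1
    return list(samples[start:end])

# Example: silence on both ends is removed, speech in the middle is kept.
print(trim_silence([0.0, 0.0, 0.5, -0.3, 0.0]))  # -> [0.5, -0.3]
```

A breath is louder than this threshold, which is why I handle breaths with a separate detector rather than a simple amplitude cutoff.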
Question 4: Do we need to normalize the volume of the clips?
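If normalization is recommended, I would do something like simple peak normalization (a sketch; `peak_normalize` and the 0.95 target peak are assumed values, and RMS/loudness normalization would be an alternative):

```python
def peak_normalize(samples, target_peak=0.95):
    """Scale samples so the loudest sample hits target_peak (assumed value)."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return list(samples)  # all-silent clip: nothing to scale
    gain = target_peak / peak
    return [s * gain for s in samples]

# Example: the loudest sample (-0.5) is scaled up to -0.95.
out = peak_normalize([0.1, -0.5, 0.25])
print(out)  # -> [0.19, -0.95, 0.475]
```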
Hyper-parameters and success metrics
Question 5: How should batch size and maximum steps be tuned? Is there an empirical formula relating dataset size to maximum steps?
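For example, I currently derive max steps from a target epoch count with plain arithmetic (the epoch count itself is exactly the number I don't know how to pick):

```python
def max_steps_for_epochs(num_clips, batch_size, epochs):
    """Steps needed for the model to see the dataset `epochs` times."""
    # Ceiling division: a partial final batch still costs one step.
    steps_per_epoch = -(-num_clips // batch_size)
    return steps_per_epoch * epochs

# Example: 100 clips, batch size 8 -> 13 steps/epoch; 20 epochs -> 260 steps.
print(max_steps_for_epochs(100, 8, 20))  # -> 260
```

Is there a rule of thumb for how many epochs (or total steps) a dataset of a given size warrants before overfitting?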
Question 6: I have noticed a phenomenon during fine-tuning: the validation loss rises after a few epochs, yet the timbre similarity keeps improving. Do we need to watch the validation loss at all, or is the training loss the only metric worth monitoring?
Question 7: What loss value is considered 'good'? The loss seems to drop unsteadily.
Other questions
Question 8: Without reference audio, given only the LoRA-fine-tuned model, the output timbre may vary a lot: the variation shows up as changes in emotion, volume, and pitch. It is unrealistic to expect the training clips to be uniform in emotion and tone, and we do want some emotional variation when the model reads long texts. But the changes are too abrupt and often unintended. What is the best way to improve the stability of the trained model?
Many thanks for any helpful suggestions!