Goal
mimic a specific timbre
Dataset preparation
Question 1: What is the minimum total duration of all audio segments combined?
Question 2: What is the recommended duration range for each individual segment?
Question 3: Is it beneficial to trim the leading and trailing silence? What about leading and trailing breaths? (I already have a method to trim them, so I only wonder whether we should.)
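For context, this is roughly how my trimming works (a minimal stdlib sketch, not from any particular library; the function name `trim_silence` and the 0.01 amplitude threshold are my own choices):

```python
def trim_silence(samples, threshold=0.01):
    """Drop leading/trailing samples whose absolute amplitude is below threshold.

    samples: a sequence of floats in [-1, 1]; threshold: assumed amplitude cutoff.
    """
    start, end = 0, len(samples)
    # Advance past quiet samples at the head of the clip.
    while start < end and abs(samples[start]) < threshold:
        start += 1
    # Back off past quiet samples at the tail of the clip.
    while end > start and abs(samples[end - 1]) < threshold:
        end -= 1
    return list(samples[start:end])

# Example: silence on both ends is removed, speech in the middle is kept.
print(trim_silence([0.0, 0.0, 0.5, -0.3, 0.0]))  # -> [0.5, -0.3]
```

A breath is louder than this threshold, which is why I handle breaths with a separate detector rather than a simple amplitude cutoff.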
Question 4: Do we need to normalize the volume of the clips?
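If normalization is recommended, I would do something like simple peak normalization (a sketch; `peak_normalize` and the 0.95 target peak are assumed values, and RMS/loudness normalization would be an alternative):

```python
def peak_normalize(samples, target_peak=0.95):
    """Scale samples so the loudest sample hits target_peak (assumed value)."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return list(samples)  # all-silent clip: nothing to scale
    gain = target_peak / peak
    return [s * gain for s in samples]

# Example: the loudest sample (-0.5) is scaled up to -0.95.
out = peak_normalize([0.1, -0.5, 0.25])
print(out)  # -> [0.19, -0.95, 0.475]
```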
Hyper-parameters and success metrics
Question 5: How should batch size and maximum steps be tuned? Is there an empirical formula relating dataset size to maximum steps?
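For example, I currently derive max steps from a target epoch count with plain arithmetic (the epoch count itself is exactly the number I don't know how to pick):

```python
def max_steps_for_epochs(num_clips, batch_size, epochs):
    """Steps needed for the model to see the dataset `epochs` times."""
    # Ceiling division: a partial final batch still costs one step.
    steps_per_epoch = -(-num_clips // batch_size)
    return steps_per_epoch * epochs

# Example: 100 clips, batch size 8 -> 13 steps/epoch; 20 epochs -> 260 steps.
print(max_steps_for_epochs(100, 8, 20))  # -> 260
```

Is there a rule of thumb for how many epochs (or total steps) a dataset of a given size warrants before overfitting?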
Question 6: I have noticed a phenomenon during fine-tuning: the validation loss rises after a few epochs, yet the timbre similarity keeps improving. Do we need to watch the validation loss at all, or is the training loss the only metric worth monitoring?
Question 7: What loss value is considered 'good'? The loss seems to drop unsteadily.
Other questions
Question 8: Without reference audio, given only the LoRA-fine-tuned model, the output timbre may vary a lot: the variation shows up as changes in emotion, volume, and pitch. It is unrealistic to expect the training clips to be uniform in emotion and tone, and we do want some emotional variation when the model reads long texts. But the changes are too abrupt and often unintended. What is the best way to improve the stability of the trained model?
Many thanks for any helpful suggestions!