I've been looking at the function extract_mfcc_from_audio() in feature_extraction_pipline.py, and I noticed that the number of frames (num_frames) is derived dynamically from the corresponding video file via ffmpeg.probe(). Since different videos can have different durations, the extracted Mel spectrograms may end up with different shapes (i.e., different numbers of time steps).
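For context, my understanding is that the frame count comes from probing the video stream, roughly like the sketch below (the helper name, the stream-selection logic, and the fallback are my assumptions, not copied from your code):

```python
import ffmpeg

def get_num_frames(video_path: str) -> int:
    # Probe the container and pick the first video stream.
    probe = ffmpeg.probe(video_path)
    video_stream = next(
        s for s in probe["streams"] if s["codec_type"] == "video"
    )
    # 'nb_frames' is not always populated, so fall back to
    # duration * frame rate when it is missing.
    if "nb_frames" in video_stream:
        return int(video_stream["nb_frames"])
    duration = float(video_stream["duration"])
    num, den = map(int, video_stream["avg_frame_rate"].split("/"))
    return round(duration * num / den)
```

If that reading is right, num_frames (and hence the spectrogram's time axis) varies with video length.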
I have a few questions about how you handle this variability:
Do all videos in your dataset have the same duration, ensuring a consistent num_frames?
If not, how do you handle the case where the extracted Mel spectrograms have different shapes?
Do you apply padding or truncation later in the pipeline (something like the sketch after these questions)?
Does your model support variable-length inputs?
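To make the padding/truncation question concrete, here is the kind of step I have in mind; this is a hypothetical sketch, and max_frames is a parameter I made up rather than something I found in your code:

```python
import numpy as np

def pad_or_truncate(mel: np.ndarray, max_frames: int) -> np.ndarray:
    # mel has shape (n_mels, num_frames); num_frames varies per video.
    n_mels, num_frames = mel.shape
    if num_frames >= max_frames:
        # Truncate extra time steps.
        return mel[:, :max_frames]
    # Zero-pad along the time axis up to max_frames.
    padding = np.zeros((n_mels, max_frames - num_frames), dtype=mel.dtype)
    return np.concatenate([mel, padding], axis=1)
```

If instead the model consumes variable-length inputs directly (e.g., via masking or a sequence model), that would also answer my question.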