I've been looking at the function extract_mfcc_from_audio() in feature_extraction_pipline.py, and I noticed that the number of frames (num_frames) is derived dynamically from the corresponding video file via ffmpeg.probe(). Since different videos can have different durations, the extracted Mel spectrograms may end up with different shapes (i.e., different numbers of time steps).
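For context, my understanding is that the frame count comes from probing the video stream, roughly like the sketch below (the helper name, the stream-selection logic, and the fallback are my assumptions, not copied from your code):

```python
import ffmpeg

def get_num_frames(video_path: str) -> int:
    # Probe the container and pick the first video stream.
    probe = ffmpeg.probe(video_path)
    video_stream = next(
        s for s in probe["streams"] if s["codec_type"] == "video"
    )
    # 'nb_frames' is not always populated, so fall back to
    # duration * frame rate when it is missing.
    if "nb_frames" in video_stream:
        return int(video_stream["nb_frames"])
    duration = float(video_stream["duration"])
    num, den = map(int, video_stream["avg_frame_rate"].split("/"))
    return round(duration * num / den)
```

If that reading is right, num_frames (and hence the spectrogram's time axis) varies with video length.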
I have a few questions about how you handle this variability:
Do all videos in your dataset have the same duration, ensuring a consistent num_frames?
If not, how do you handle the case where the extracted Mel spectrograms have different shapes?
Do you apply padding or truncation later in the pipeline (something like the sketch after these questions)?
Does your model support variable-length inputs?
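To make the padding/truncation question concrete, here is the kind of step I have in mind; this is a hypothetical sketch, and max_frames is a parameter I made up rather than something I found in your code:

```python
import numpy as np

def pad_or_truncate(mel: np.ndarray, max_frames: int) -> np.ndarray:
    # mel has shape (n_mels, num_frames); num_frames varies per video.
    n_mels, num_frames = mel.shape
    if num_frames >= max_frames:
        # Truncate extra time steps.
        return mel[:, :max_frames]
    # Zero-pad along the time axis up to max_frames.
    padding = np.zeros((n_mels, max_frames - num_frames), dtype=mel.dtype)
    return np.concatenate([mel, padding], axis=1)
```

If instead the model consumes variable-length inputs directly (e.g., via masking or a sequence model), that would also answer my question.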