[MVEB] PE-AV Model, Kinetics400 Dataset, RavdessAV Dataset #4199
AdnanElAssadi56 wants to merge 70 commits into embeddings-benchmark:main from
Conversation
|
@Samoed I edited the collator to handle one video item. |
|
Yeah, I forgot to update collator after changing to one video |
|
Results from |
|
It is hard to match the results for RAVDESS. I think you can run one of these tasks (from https://arxiv.org/pdf/2512.19687) |
|
MSR-VTT: |
|
@Samoed Anything else here? |
|
Can you resolve my comments and make CI green? |
|
I think tests are failing because of |
|
Strange. I get it without problems:

```python
import mteb

meta = mteb.models.ModelMeta.from_hub("facebook/pe-av-base-16-frame")
meta.n_embedding_parameters
# 51576832
```
|
|
@Samoed Tests resolved here |
isaac-chung
left a comment
Got a few non-blocking questions. Can be addressed in a separate PR.
mteb/tasks/video/classification/eng/kinetics400_classification.py
|
@Samoed @isaac-chung Changed input_column to list. |
|
Lint is giving an error because the list is mutable |
|
It's looking for something like this, I think:

```python
from typing import ClassVar

input_column_name: ClassVar[list[str]] = ["video", "audio"]
```
|
|
@Samoed Can you give a look here when you have the time? |
|
How do you tell VA2C and V2C tasks apart? Is it that only in VA2C tasks we process the audio, regardless of whether it comes from the video or from a separate column? |
|
@Samoed @isaac-chung @KennethEnevoldsen |
|
One main thing we should clarify is how to handle video with and without audio, plus separate audio.
It basically looks like this, with the audio column being present when the video has audio.
|
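To make the "audio column present only when the video has audio" case concrete, here is a hypothetical sketch of collating such a batch; `BatchedInput` and the column names are assumptions for illustration, not the actual mteb API:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BatchedInput:
    frames: list  # one frames array per clip
    audio: Optional[list]  # None when no clip in the batch has audio


def collate(examples: list) -> BatchedInput:
    frames = [ex["frames"] for ex in examples]
    # Keep an audio list only if at least one clip carries audio; pad the
    # video-only clips with None so indices stay aligned with `frames`.
    if any(ex.get("audio") is not None for ex in examples):
        audio = [ex.get("audio") for ex in examples]
    else:
        audio = None
    return BatchedInput(frames=frames, audio=audio)


batch = collate([{"frames": [1, 2]}, {"frames": [3], "audio": [0.5]}])
print(batch.audio)  # [None, [0.5]]
```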
|
Results from |

Do I merge, @KennethEnevoldsen @Samoed @isaac-chung? |
|
No, please address comments from our reviews |
Is there anything pending? |
|
The comments that are unresolved |
…x collator output

- Revert input_column_name from Mapping[str, str] to str | Sequence[str]
- Remove VideoInputItem wrapper, pass frames tensor directly
- Make VideoCollator return BatchedInput (consistent with AudioCollator)
- MultimodalCollator uses static methods instead of chaining collators
|
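The "static methods instead of chaining collators" point in the commit above could look roughly like this; the class and method names are hypothetical, not the actual mteb implementation:

```python
class MultimodalCollator:
    """Each modality is collated by its own static method, so no
    collator needs to wrap or chain into another one."""

    @staticmethod
    def collate_frames(examples: list) -> list:
        return [ex["frames"] for ex in examples]

    @staticmethod
    def collate_audio(examples: list) -> list:
        # Video-only clips simply contribute None for the audio slot.
        return [ex.get("audio") for ex in examples]

    def __call__(self, examples: list) -> dict:
        return {
            "frames": self.collate_frames(examples),
            "audio": self.collate_audio(examples),
        }


batch = MultimodalCollator()([{"frames": 1}, {"frames": 2, "audio": 3}])
print(batch)  # {'frames': [1, 2], 'audio': [None, 3]}
```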
@Samoed Any more points? |
KennethEnevoldsen
left a comment
Alright, it sounds like we can keep it as a sequence, but we need to document the limitations.
I don't see why we don't want to support text+video in classification - it seems like we are avoiding creating a general solution. We will have to deal with this at some point regardless.
mteb/tasks/video/classification/eng/kinetics400_classification.py
…ations

- Rename VideoCollator -> FramesCollator, MultimodalCollator -> VideoCollator
- Update VideoInput docstring to clarify frames-only, audio in AudioInput
- Update input_column_name docs in classification/clustering base classes
- Use ClassVar[Sequence[str]] for video task input_column_name
- Extract isinstance check to top of zeroshot evaluator __call__
- Improve task_pipelines.py skip comment for multi-column tasks
- Add TODO for MSR-VTT dataset reupload
|
Added relevant issues from the discussions above and resolved the conversations (commented the issue links here as well). |

(From closed PR)
Adds the following:
mteb/kinetics-400
mteb/RAVDESS_AV
PE-AV (Facebook) Close #3797
Also includes some remaining components from the parallel video integration work we accidentally did.