[MVEB] PE-AV Model, Kinetics400 Dataset, RavdessAV Dataset#4199

Open
AdnanElAssadi56 wants to merge 70 commits into embeddings-benchmark:main from AdnanElAssadi56:mveb-video-integration

@AdnanElAssadi56
Contributor

@AdnanElAssadi56 AdnanElAssadi56 commented Mar 5, 2026

(From closed PR)

Adds the following:
- mteb/kinetics-400
- mteb/RAVDESS_AV
- PE-AV (Facebook)

Closes #3797

Also includes some remaining components from the parallel video integration work we accidentally did.

@Samoed Samoed added the new model, new dataset, and video labels Mar 5, 2026
@AdnanElAssadi56
Contributor Author

@Samoed I edited the collator to handle one video item.

@Samoed
Member

Samoed commented Mar 5, 2026

Yeah, I forgot to update the collator after changing to one video.

@AdnanElAssadi56
Contributor Author

Results from "facebook/pe-av-small":

RAVDESS_AVClustering.json

@Samoed
Member

Samoed commented Mar 5, 2026

It's hard to match results for RAVDESS. I think you can run one of these tasks instead (from https://arxiv.org/pdf/2512.19687)
image

@AdnanElAssadi56
Contributor Author

MSR-VTT:
There is some discrepancy, possibly because we are using audio as well.
MSRVTTV2T.json

@AdnanElAssadi56
Contributor Author

@Samoed Anything else here?

@Samoed
Member

Samoed commented Mar 7, 2026

Can you resolve my comments and make CI green?

@AdnanElAssadi56
Contributor Author

I think the tests are failing because of n_embedding_parameters. It can't be calculated by the method in ModelMeta:
Could not calculate embedding parameters for facebook/pe-av-base-16-frame as config.json could not be loaded

@Samoed
Member

Samoed commented Mar 7, 2026

Strange. I get it without problems

import mteb

meta = mteb.models.ModelMeta.from_hub("facebook/pe-av-base-16-frame")
meta.n_embedding_parameters
# 51576832

@AdnanElAssadi56
Contributor Author

AdnanElAssadi56 commented Mar 7, 2026

@Samoed Tests resolved here

Collaborator

@isaac-chung isaac-chung left a comment

Got a few non-blocking questions. Can be addressed in a separate PR.

@AdnanElAssadi56
Contributor Author

@Samoed @isaac-chung Changed input_column to list.

@AdnanElAssadi56
Contributor Author

Lint is giving an error because the list is mutable.

@isaac-chung
Collaborator

It's looking for something like this I think:

from typing import ClassVar

input_column_name: ClassVar[list[str]] = ["video", "audio"]
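
For context, a minimal sketch of why a linter flags the unannotated version, assuming the rule involved is the common "mutable class attribute" check (for example Ruff's RUF012). The class names below are hypothetical and not taken from mteb.

```python
from typing import ClassVar


class SharedDefault:
    # Unannotated mutable class attribute: every instance shares this one list.
    input_column_name = ["video", "audio"]


class AnnotatedDefault:
    # ClassVar states explicitly that the list belongs to the class itself.
    input_column_name: ClassVar[list[str]] = ["video", "audio"]


a, b = SharedDefault(), SharedDefault()
a.input_column_name.append("text")  # mutates the class-level list
shared_after_mutation = b.input_column_name  # b sees the change too

annotated_columns = AnnotatedDefault.input_column_name
```

Note that ClassVar only documents intent; it does not make the list immutable at runtime. A tuple would be the stricter choice if the columns should never change.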

@AdnanElAssadi56
Contributor Author

@Samoed Can you take a look here when you have time?

@isaac-chung
Collaborator

isaac-chung commented Mar 13, 2026

How do you tell VA2C and V2C tasks apart? Is it that we process the audio only in VA2C tasks, regardless of whether it comes from the video or from a separate column?

@AdnanElAssadi56
Contributor Author

@Samoed @isaac-chung @KennethEnevoldsen
This is somewhat of a blocker right now. Can we discuss the approach here if you are available?

@isaac-chung
Collaborator

One main thing we should clarify is how to handle video with and without audio + separate audio

@AdnanElAssadi56
Contributor Author

> @AdnanElAssadi56 so how does a dataset with video (incl. audio) look on HF, and how is it different if it also includes audio?

It basically looks like this, with an audio column present when the video has audio.

| video | audio | label |
| --- | --- | --- |
| VideoDecoder_0 | {"array": [...], "sr": 16000} | 0 |
| VideoDecoder_1 | {"array": [...], "sr": 16000} | 3 |
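
One possible reading of this layout is that a V2C task declares only the video column while a VA2C task declares both video and audio. The sketch below illustrates that assumption with plain dicts standing in for a Hugging Face dataset row; `select_inputs` is a hypothetical helper, not an mteb function.

```python
from typing import Any, Mapping, Sequence, Union


def select_inputs(
    row: Mapping[str, Any],
    input_column_name: Union[str, Sequence[str]],
) -> dict[str, Any]:
    """Return only the columns a task declares as its inputs."""
    # Accept either a single column name or a sequence of names.
    if isinstance(input_column_name, str):
        columns = [input_column_name]
    else:
        columns = list(input_column_name)
    # Fail loudly if a declared column is absent (e.g. a video without audio).
    missing = [c for c in columns if row.get(c) is None]
    if missing:
        raise KeyError(f"row is missing input columns: {missing}")
    return {c: row[c] for c in columns}


# A row shaped like the table above; the string is a stand-in for a decoder object.
row = {
    "video": "VideoDecoder_0",
    "audio": {"array": [0.0, 0.1], "sr": 16000},
    "label": 0,
}

v2c_inputs = select_inputs(row, "video")             # video only
va2c_inputs = select_inputs(row, ["video", "audio"])  # video plus audio
```

Under this reading the dataset row is identical for both task types, and only the declared input columns differ.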

@AdnanElAssadi56
Contributor Author

AdnanElAssadi56 commented Apr 10, 2026

Results from pe-av-small
RAVDESSAVClustering.json

Do I merge, @KennethEnevoldsen @Samoed @isaac-chung?

@Samoed
Member

Samoed commented Apr 10, 2026

No, please address comments from our reviews

@AdnanElAssadi56
Contributor Author

> No, please address comments from our reviews

Is there anything pending?

@Samoed
Member

Samoed commented Apr 10, 2026

Comments that are unresolved

…x collator output

- Revert input_column_name from Mapping[str, str] to str | Sequence[str]
- Remove VideoInputItem wrapper, pass frames tensor directly
- Make VideoCollator return BatchedInput (consistent with AudioCollator)
- MultimodalCollator uses static methods instead of chaining collators
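
As a rough illustration of the last two points, the sketch below shows a combined collator that calls static helpers directly and merges their outputs into one batch dict, rather than feeding one collator's output into another. All class names are hypothetical, a plain dict stands in for BatchedInput, and lists stand in for tensors; this is not the code from the PR.

```python
from typing import Any

Item = dict[str, Any]


class FramesCollatorSketch:
    """Hypothetical frames-only collator."""

    @staticmethod
    def collate(items: list[Item]) -> list[Any]:
        # Gather per-item frames into one batch-level list.
        return [item["video"] for item in items]


class AudioCollatorSketch:
    """Hypothetical audio-only collator."""

    @staticmethod
    def collate(items: list[Item]) -> list[Any]:
        # Keep only the waveform array from each audio dict.
        return [item["audio"]["array"] for item in items]


class CombinedCollatorSketch:
    """Builds one batch dict by calling the static helpers side by side."""

    def __call__(self, items: list[Item]) -> dict[str, list[Any]]:
        batch = {"video": FramesCollatorSketch.collate(items)}
        # Add audio only when every item in the batch carries it.
        if all(item.get("audio") is not None for item in items):
            batch["audio"] = AudioCollatorSketch.collate(items)
        return batch


items = [
    {"video": "frames_0", "audio": {"array": [0.1, 0.2], "sr": 16000}},
    {"video": "frames_1", "audio": {"array": [0.3], "sr": 16000}},
]
batch = CombinedCollatorSketch()(items)
video_only_batch = CombinedCollatorSketch()([{"video": "frames_0"}])
```

Because the helpers are static, each modality can be collated independently and no collator instance needs to hold a reference to another.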
@AdnanElAssadi56
Contributor Author

@Samoed Any more points?

@KennethEnevoldsen KennethEnevoldsen mentioned this pull request Apr 14, 2026
64 tasks
Contributor

@KennethEnevoldsen KennethEnevoldsen left a comment

Alright, it sounds like we can keep it as a sequence, but we need to document the limitations.

I don't see why we don't want to support text+video in classification - it seems like we are avoiding creating a general solution. We will have to deal with this regardless at some point.

…ations

- Rename VideoCollator -> FramesCollator, MultimodalCollator -> VideoCollator
- Update VideoInput docstring to clarify frames-only, audio in AudioInput
- Update input_column_name docs in classification/clustering base classes
- Use ClassVar[Sequence[str]] for video task input_column_name
- Extract isinstance check to top of zeroshot evaluator __call__
- Improve task_pipelines.py skip comment for multi-column tasks
- Add TODO for MSR-VTT dataset reupload
@AdnanElAssadi56
Contributor Author

@KennethEnevoldsen @Samoed

Added relevant issues from the discussions above and resolved the conversations (commented the issue links here as well).
