Hello @Revliter ,
Thank you so much for your previous response on issue #312 — I really appreciate the time you've taken to help. I've been digging deeper into the text encoder training for InternVideo-Next and have run into a few questions I'd love your input on.
Question 1 — Correct Training Setup: Paper vs. InternVideo2
In the paper, you mention freezing the ViT backbone and training only the text encoder. However, in issue #312 you pointed me toward the InternVideo2 multi-modality training, which uses a slightly different setup:
- Vision backbone → fully frozen
- Text backbone → fully frozen
- clip-projector (vision side) → unfrozen
- Alignment layer added on the vision side
Could you clarify which approach is correct for reproducing the InternVideo-Next zero-shot T2V results? Specifically, should I follow the InternVideo2 setup exactly, or adapt it to better match the paper (e.g., unfreeze the text backbone, and optionally freeze/unfreeze the clip-projector and add the alignment layer on the text and/or vision side)? I've sketched the two setups I'm weighing below.
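To make this concrete, here is a minimal PyTorch sketch of the two setups as I currently understand them. The attribute names (`vision_backbone`, `text_backbone`, `clip_projector`, `vision_align_layer`) are my own placeholders, not the actual module names from the repo:

```python
import torch.nn as nn

def set_trainable(module: nn.Module, trainable: bool) -> None:
    """Freeze or unfreeze every parameter of a module."""
    for p in module.parameters():
        p.requires_grad = trainable

# Setup A -- how I read the InternVideo2 multi-modality recipe:
# both backbones frozen, only the vision-side clip-projector
# (plus the added alignment layer) is trained.
def setup_internvideo2_style(model):
    set_trainable(model.vision_backbone, False)
    set_trainable(model.text_backbone, False)
    set_trainable(model.clip_projector, True)      # vision-side projector
    set_trainable(model.vision_align_layer, True)  # newly added alignment layer

# Setup B -- how I read the paper: ViT frozen, text encoder trained.
def setup_paper_style(model):
    set_trainable(model.vision_backbone, False)
    set_trainable(model.text_backbone, True)
    # Unclear to me: should clip_projector stay frozen here, and should the
    # alignment layer sit on the text side, the vision side, or both?
```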
Question 2 — Dimension Alignment with SigLIP2 1B Teacher
You mentioned that SigLIP2 1B (giant opt) was used as a teacher in Stage 1 pretraining. However, its embedding dimensionality is quite different from that of the resulting InternVideo-Next vision encoder. How was dimension alignment handled between the two models?
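For reference, this is what I assume "dimension alignment" means here: a learnable linear head that projects the student embedding into the teacher's space before the distillation loss. Both dimensions below are placeholder values I made up, and the loss is just a generic cosine objective, not necessarily the one you used:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

STUDENT_DIM = 1024   # assumed InternVideo-Next vision embedding dim (placeholder)
TEACHER_DIM = 1536   # assumed SigLIP2 1B (giant opt) embedding dim (placeholder)

align_head = nn.Linear(STUDENT_DIM, TEACHER_DIM)

def distill_loss(student_emb: torch.Tensor, teacher_emb: torch.Tensor) -> torch.Tensor:
    """Cosine alignment loss after projecting the student into teacher space."""
    projected = F.normalize(align_head(student_emb), dim=-1)
    target = F.normalize(teacher_emb, dim=-1)
    return (1.0 - (projected * target).sum(dim=-1)).mean()
```

Is this roughly what happens in Stage 1, or is the alignment handled differently (e.g., on the teacher side, or with an MLP)?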
Additionally (I'm not sure whether you tested this), I would expect the InternVideo-Next vision encoder to drift away from SigLIP2's embedding space after Stage 2, making the two spaces incomparable at that point. Is that right?
Question 3 — Text-Side Training Settings, Epochs, and Room for Improvement
A few related sub-questions here:
- Training config: Do the text-side training settings (temperature, epochs, weight decay, learning rate) fully follow the InternVideo2 configs? (I've put a concrete sketch of the knobs I mean after this list.)
- Epoch discrepancy: In the paper, zero-shot T2V results are compared against InternVideo2 CLIP-L/14, which was trained for 3 epochs, whereas the InternVideo-Next multi-modality probe (Section: Multi-modality Tasks) mentions 5 epochs. Could you clarify this difference?
- Are these results final? You refer to these experiments as probes — do you believe there is room for improvement with further tuning (e.g., dataset size, text encoder size, hyperparameters), or are the reported numbers the expected ceiling for this configuration?
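Just to make the config question concrete, these are the settings I'm asking about. Every value below is a placeholder guess of mine (except the 3-vs-5 epoch numbers quoted from the paper), not something taken from the released configs:

```python
# Hypothetical sketch of the text-side probe config I would naively try.
text_side_probe_config = dict(
    temperature=0.07,    # learnable init or fixed? (unclear to me)
    epochs=5,            # paper's multi-modality probe says 5; InternVideo2 CLIP-L/14 used 3
    weight_decay=0.02,   # placeholder
    learning_rate=4e-5,  # placeholder
)
```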
I find this work incredibly insightful and plan to use the vision encoder in my diploma thesis given its strong potential. These clarifications would really help me move forward.
Thank you so much in advance for your time and help!