LeJEPA for time series / video #27

@chemeris

Description

Very inspiring work on SIGReg/LeJEPA!

I noticed that your demo GIFs are videos, but the code and the paper only mention images. We're training a time-series encoder using JEPA, which is more similar to video than to images due to its time-domain nature, and we're now trying to apply LeJEPA instead of the classical JEPA. I would appreciate your thoughts on how to apply LeJEPA when the time dimension is present.

  1. In your video demos, do you train the encoder on images and then apply it to the video frame by frame? I.e., is it a true video encoder, or just an image encoder applied to video?
  2. My thinking is that in the case of video or time series, we would need a predictor as in V-JEPA, with SIGReg replacing EMA+StopGrad. Perhaps the predictor can be made much simpler (an MLP instead of a transformer), but I doubt we can do without it entirely. The training objective would then be prediction, not invariance. Does this match your intuition? And will SIGReg work with a prediction objective instead of an invariance one?
  3. Our preliminary experiments show that when training a time-series JEPA with SIGReg, the embeddings collapse along the time dimension. So we applied SIGReg twice: across all time-step embeddings of each batch sample individually, and across time-aggregated embeddings between batch samples. This seems to prevent the collapse, but we haven't finished the downstream-task validations yet (downstream evaluation is a bit more involved for time series than in CV, unfortunately).
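For concreteness, here is a minimal sketch of the dual application described in point 3. It assumes embeddings of shape `(batch, time, dim)`, and `sigreg_1d` is a hypothetical moment-matching stand-in for the actual univariate test statistic SIGReg uses on random projections; the function names and the projection count are my own, not from the paper or repo.

```python
import numpy as np

def sigreg_1d(x):
    # Hypothetical stand-in for SIGReg's univariate goodness-of-fit
    # statistic: penalize deviation of a 1-D sample from N(0, 1)
    # via its first two moments.
    return x.mean() ** 2 + (x.var() - 1.0) ** 2

def sigreg(z, n_proj=16, seed=0):
    # z: (n, dim) embeddings. Project onto random unit directions and
    # average the 1-D penalty over projections (sliced surrogate for
    # testing isotropic Gaussianity of the joint distribution).
    rng = np.random.default_rng(seed)
    dirs = rng.standard_normal((n_proj, z.shape[1]))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    proj = z @ dirs.T  # (n, n_proj)
    return np.mean([sigreg_1d(proj[:, k]) for k in range(n_proj)])

def dual_sigreg(z):
    # z: (batch, time, dim) time-series embeddings.
    # 1) per-sample term: regularize each sample's time-step embeddings,
    #    which is what prevented the time-dimension collapse we saw.
    per_sample = np.mean([sigreg(z[b]) for b in range(z.shape[0])])
    # 2) cross-sample term: regularize time-aggregated embeddings
    #    across the batch, as in the standard image setting.
    cross_sample = sigreg(z.mean(axis=1))
    return per_sample + cross_sample
```

A fully collapsed sequence (all-zero embeddings) is penalized by both terms, while well-spread roughly Gaussian embeddings score close to zero under each.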

PS If anyone else is interested in time-series self-supervised JEPA / representation learning - I'd be very interested to chat.
