Stefan A. Baumann · Felix Krause · Ming Gui · Björn Ommer
CompVis @ LMU Munich, MCML
* equal contribution
NeurIPS 2025 Spotlight
We present DisMo, a paradigm that learns a semantic motion representation space from videos, disentangled from static content information such as appearance, structure, viewing angle, and even object category. We leverage this invariance and condition off-the-shelf video models on the extracted motion embeddings. This setup achieves state-of-the-art performance on open-world motion transfer, with a high degree of transferability in cross-category and cross-viewpoint settings. Beyond that, DisMo's learned representations are suitable for downstream tasks such as zero-shot action classification.
We have tested our setup on Ubuntu 22.04.4 LTS.
First, clone the repository into your desired location:
git clone [email protected]:CompVis/DisMo.git
cd DisMo
We recommend using a package manager, e.g., Miniconda. Once it is installed, you can create and activate a new environment:
conda create -n dismo python=3.11
conda activate dismo
Afterwards, install PyTorch. We have tested this setup with PyTorch 2.7.1 and CUDA 12.6:
pip install torch==2.7.1 torchvision==0.22.1 --index-url https://download.pytorch.org/whl/cu126
If you need to install an alternative version, e.g., due to incompatible CUDA versions, see the official instructions.
Finally, install all other packages:
pip install -r requirements.txt
(Optional) We use the torchcodec package for data loading, which expects ffmpeg to be installed. If you plan to train DisMo yourself and don't have an ffmpeg version installed yet, an easy way to get one is via conda:
conda install ffmpeg
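To quickly verify that video decoding works, you can decode a single frame with torchcodec. This is a minimal check based on torchcodec's documented basic usage; the video path is a placeholder:
from torchcodec.decoders import VideoDecoder

decoder = VideoDecoder("/path/to/some/video.mp4")
first_frame = decoder[0]  # decoded frame as a uint8 tensor
print(first_frame.shape)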
To use DisMo for motion transfer purposes, we provide the code and LoRA weights of an adapted CogVideoX-5B-I2V video model, conditioned on motion embeddings and text prompts. The simplest way to use it is via torch.hub:
cogvideox = torch.hub.load("CompVis/DisMo", "cogvideox5b_i2v_large")
Alternatively, you can also instantiate and load the model yourself:
from dismo.video_model_finetuning.cogvideox import CogVideoXMotionAdapter_5B_TI2V_Large
cogvideox = CogVideoXMotionAdapter_5B_TI2V_Large()
state_dict = torch.load("/path/to/finetuned/cogvideox/checkpoint/cogvideox5b_i2v_large.pt")
cogvideox.load_state_dict(state_dict, strict=False)
cogvideox.requires_grad_(False)
cogvideox.eval()
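In either case, you will typically want to move the model to a GPU before sampling. A minimal sketch; whether a reduced-precision dtype such as torch.bfloat16 is appropriate here is an assumption on our part, so check the code for the intended setup:
cogvideox = cogvideox.to("cuda")  # optionally also pass dtype=torch.bfloat16 (assumption, see above)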
You can then use the model's sample function to generate new videos by transferring motion from motion_videos to images. Since CogVideoX is a text-to-video model at its core, we recommend additionally providing descriptive prompts alongside the target images for better generation results:
generated_videos = cogvideox.sample(
    motion_videos=driving_videos,
    images=target_images,
    prompts=target_text_prompts,
)
The sample function comes with some other arguments (e.g., classifier-free text guidance). Please have a look at the code for more details.
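For reference, here is a minimal sketch of how the inputs above might be prepared. It assumes that driving videos and target images are passed as tensors in the same [B, T, H, W, C] / [B, H, W, C], (-1, 1) convention used by the motion extractor below; the exact shapes, resolutions, and preprocessing that sample expects may differ, so please check its signature in the code:
import torchvision

# driving video: read_video returns [T, H, W, C] uint8 frames
frames, _, _ = torchvision.io.read_video("/path/to/driving_video.mp4", pts_unit="sec")
driving_videos = frames.float().div(127.5).sub(1).unsqueeze(0)  # [1, T, H, W, C] in (-1, 1)

# target image: read_image returns [C, H, W] uint8
image = torchvision.io.read_image("/path/to/target_image.png")
target_images = image.permute(1, 2, 0).float().div(127.5).sub(1).unsqueeze(0)  # [1, H, W, C] in (-1, 1)

target_text_prompts = ["A short description of the target scene."]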
During motion transfer, the video model internally uses DisMo's pre-trained motion extractor for encoding input videos into motion embeddings. However, the motion extractor can also be used as a standalone model to extract sequences of motion embeddings from input videos. This might be useful for video analysis purposes or other downstream tasks. Once again, the easiest way to load the model is via torch.hub:
motion_extractor = torch.hub.load("CompVis/DisMo", "motion_extractor_large")
Similarly, you can also manually instantiate and load the model:
from dismo.model import MotionExtractor_Large
motion_extractor = MotionExtractor_Large()
state_dict = torch.load("/path/to/motion/extractor/checkpoint/motion_extractor_large.pt")
motion_extractor.load_state_dict(state_dict)
motion_extractor.requires_grad_(False)
motion_extractor.eval()
To extract motion sequences from arbitrarily long videos, we provide the forward_sliding function, which extracts embeddings consecutively in a sliding-window fashion. This is necessary since DisMo only saw video clips of length 8 during training:
import torch
# videos are expected to have shape [B, T, H, W, C] in (-1, 1) range
B, num_frames = 2, 32  # example batch size and number of frames
dummy_video = torch.rand((B, num_frames, 256, 256, 3)).mul(2).sub(1)
# we get a motion embedding for each frame, except for the last 4
motion_embeddings = motion_extractor.forward_sliding(dummy_video)
Note that the resulting motion embeddings have a temporal length of num_frames - 4, since the longest possible prediction distance was set to 4 during training.
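As an example of how these embeddings might be used downstream, e.g., for retrieval or nearest-neighbour action classification, one simple option is to pool them over time into a clip-level descriptor. The mean pooling below is an assumption for illustration, not necessarily the protocol used in the paper, and it assumes the embeddings have shape [B, num_frames - 4, D]:
import torch.nn.functional as F

# pool the per-frame motion embeddings over time and L2-normalize them
clip_embeddings = F.normalize(motion_embeddings.mean(dim=1), dim=-1)  # [B, D]

# pairwise cosine similarities between the clips in the batch
similarity = clip_embeddings @ clip_embeddings.T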
If you want to train DisMo yourself, we provide a training script that is suitable for multi-GPU training. Please note that the script instantiates DisMo with default parameters. To train other variants (e.g., changing the width, depth, etc.), you must modify train.py accordingly. The same holds true for video model adaptation.
DisMo needs unlabelled videos for training. This repository takes advantage of the webdataset library and format for efficient and scalable data loading. Please refer to their page for further instructions on how to shard your video files accordingly.
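As a rough illustration, here is what sharding could look like with webdataset's ShardWriter. The directory layout, shard size, and the "mp4" sample key are assumptions; check the dataset code in this repository for the keys it actually expects:
import glob
import webdataset as wds

video_paths = sorted(glob.glob("/path/to/raw/videos/*.mp4"))

# pack the raw video files into .tar shards of up to 1000 samples each
with wds.ShardWriter("/path/to/preprocessed/shards/videos-%06d.tar", maxcount=1000) as sink:
    for i, path in enumerate(video_paths):
        with open(path, "rb") as f:
            sink.write({"__key__": f"{i:08d}", "mp4": f.read()})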
Single-GPU training can be launched via
python train.py --data_paths /path/to/preprocessed/shards --out_dir output/test --compile True
Similarly, multi-GPU training, e.g., on 2 GPUs, can be launched using torchrun:
torchrun --nnodes 1 --nproc-per-node 2 train.py [...]
Training can be continued from a previous checkpoint by specifying, e.g., --load_checkpoint output/test/checkpoints/checkpoint_0100000.pt.
Remove --compile True for a significantly faster startup time, at the cost of slower training and significantly increased VRAM usage.
We release the weights of our pre-trained motion extractor and the LoRA weights of an adapted CogVideoX-5B-I2V model via HuggingFace under the CC BY-NC 4.0 license. If you are interested in using our model weights commercially, please contact us. We will release other model variants in the future, e.g., more sophisticated fine-tuned video models. Due to legal concerns, we do not release the weights of the frame generator that was trained alongside the motion extractor.
- Some code is adapted from flow-poke-transformer by Stefan A. Baumann et al. (LMU), which in turn adapts some code from k-diffusion by Katherine Crowson (MIT)
- The code for fine-tuning CogVideoX models is adapted from CogKit (Apache 2.0)
- The DINOv2 code is adapted from minDinoV2 by Simo Ryu, which is based on the official implementation by Oquab et al. (Apache 2.0)
If you find our work useful, please cite our paper:
@inproceedings{resslerdismo,
  title={DisMo: Disentangled Motion Representations for Open-World Motion Transfer},
  author={Ressler-Antal, Thomas and Fundel, Frank and Alaya, Malek Ben and Baumann, Stefan Andreas and Krause, Felix and Gui, Ming and Ommer, Bj{\"o}rn},
  booktitle={The Thirty-ninth Annual Conference on Neural Information Processing Systems},
  year={2025}
}