MotionStreamer: Streaming Motion Generation via Diffusion-based Autoregressive Model in Causal Latent Space
Lixing Xiao1 · Shunlin Lu2 · Huaijin Pi3 · Ke Fan4 · Liang Pan3 · Yueer Zhou1 · Ziyong Feng5 · Xiaowei Zhou1 · Sida Peng1† · Jingbo Wang6
1Zhejiang University 2The Chinese University of Hong Kong, Shenzhen 3The University of Hong Kong
4Shanghai Jiao Tong University 5DeepGlint 6Shanghai AI Lab
ICCV 2025
- [2025-06] MotionStreamer has been accepted to ICCV 2025! 🎉
- Release the processing script of 272-dim motion representation.
- Release the processed 272-dim Motion Representation of HumanML3D dataset. Only for academic usage.
- Release the training code and checkpoint of our TMR-based motion evaluator trained on the processed 272-dim HumanML3D dataset.
- Release the training and evaluation code as well as checkpoint of Causal TAE.
- Release the training code of original motion generation model and streaming generation model (MotionStreamer).
- Release the checkpoint and demo inference code of original motion generation model.
- Release complete code for MotionStreamer.
For details on how to obtain the 272-dim motion representation, and for other useful tools (e.g., visualization and conversion to BVH format), please refer to our GitHub repo.
conda env create -f environment.yaml
conda activate mgpt
All of our models and data are hosted on Hugging Face. If Hugging Face is not directly accessible, you can use the HF-mirror endpoint as follows:
pip install -U huggingface_hub
export HF_ENDPOINT=https://hf-mirror.com
To facilitate researchers, we provide the processed 272-dim Motion Representation of:
HumanML3D dataset at this link.
BABEL dataset at this link.
❗️❗️❗️ The processed data is solely for academic purposes. Make sure you read through the AMASS License.
- Download the processed 272-dim HumanML3D dataset as follows:
huggingface-cli download --repo-type dataset --resume-download lxxiao/272-dim-HumanML3D --local-dir ./humanml3d_272
cd ./humanml3d_272
unzip texts.zip
unzip motion_data.zip
The dataset is organized as:
./humanml3d_272
├── mean_std
│   ├── Mean.npy
│   └── Std.npy
├── split
│   ├── train.txt
│   ├── val.txt
│   └── test.txt
├── texts
│   ├── 000000.txt
│   └── ...
└── motion_data
    ├── 000000.npy
    └── ...
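As a minimal sketch of how the files above might be consumed, the snippet below loads one motion clip and z-normalizes it with the provided statistics. It assumes that each file in motion_data holds an array of shape (num_frames, 272) and that Mean.npy and Std.npy are per-dimension vectors of length 272. Neither these shapes nor the helper name load_normalized_motion come from the repository, so check them against the actual data.

```python
from pathlib import Path

import numpy as np


def load_normalized_motion(root, sample_id, eps=1e-8):
    """Load one clip and z-normalize it per feature dimension.

    Hypothetical helper: assumes motion_data/<id>.npy has shape (T, 272)
    and mean_std/{Mean,Std}.npy have shape (272,).
    """
    root = Path(root)
    motion = np.load(root / "motion_data" / f"{sample_id}.npy")
    mean = np.load(root / "mean_std" / "Mean.npy")
    std = np.load(root / "mean_std" / "Std.npy")
    # eps guards against division by zero for constant dimensions.
    return (motion - mean) / (std + eps)
```

For example, load_normalized_motion("./humanml3d_272", "000000") would return the normalized clip for the first sample under these assumptions.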
- Download the processed 272-dim BABEL dataset as follows:
huggingface-cli download --repo-type dataset --resume-download lxxiao/272-dim-BABEL --local-dir ./babel_272
cd ./babel_272
unzip texts.zip
unzip motion_data.zip
The dataset is organized as:
./babel_272
├── t2m_babel_mean_std
│   ├── Mean.npy
│   └── Std.npy
├── split
│   ├── train.txt
│   └── val.txt
├── texts
│   ├── 000000.txt
│   └── ...
└── motion_data
    ├── 000000.npy
    └── ...
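After unzipping, it can be useful to confirm that every sample listed in a split has both a text file and a motion file. The sketch below does this for either the HumanML3D or the BABEL layout. It assumes that each split file lists one sample ID per line, which the README does not state, and find_missing_files is a hypothetical helper name.

```python
from pathlib import Path


def find_missing_files(root, split="train"):
    """Return split IDs that lack a text file or a motion file.

    Assumes split/<split>.txt lists one sample ID per line (an assumption,
    not confirmed by the README).
    """
    root = Path(root)
    lines = (root / "split" / f"{split}.txt").read_text().splitlines()
    missing = []
    for sample_id in (line.strip() for line in lines):
        if not sample_id:
            continue  # skip blank lines
        has_text = (root / "texts" / f"{sample_id}.txt").is_file()
        has_motion = (root / "motion_data" / f"{sample_id}.npy").is_file()
        if not (has_text and has_motion):
            missing.append(sample_id)
    return missing
```

An empty list from find_missing_files("./babel_272", "val") would indicate that the validation split is complete under these assumptions.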
- Download the processed streaming 272-dim BABEL dataset as follows:
huggingface-cli download --repo-type dataset --resume-download lxxiao/272-dim-BABEL-stream --local-dir ./babel_272_stream
cd ./babel_272_stream
unzip train_stream.zip
unzip train_stream_text.zip
unzip val_stream.zip
unzip val_stream_text.zip
The dataset is organized as:
./babel_272_stream
├── train_stream
│   ├── seq1.npy
│   └── ...
├── train_stream_text
│   ├── seq1.txt
│   └── ...
├── val_stream
│   ├── seq1.npy
│   └── ...
└── val_stream_text
    ├── seq1.txt
    └── ...
NOTE: We process the original BABEL dataset to support training of streaming motion generation. For example, suppose a motion sequence A is annotated in BABEL as subsequences (A1, A2, A3, A4), with text descriptions (A1_t, A2_t, A3_t, A4_t).
Then BABEL-stream is constructed as:
seq1: (A1, A2) --- seq1_text: (A1_t*A2_t#A1_length)
seq2: (A2, A3) --- seq2_text: (A2_t*A3_t#A2_length)
seq3: (A3, A4) --- seq3_text: (A3_t*A4_t#A3_length)
Here, * and # are separator symbols, and A1_length is the number of frames in subsequence A1.
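The text format above can be parsed with a few lines of Python. This is a minimal sketch that assumes each text entry is a single line of the form first_text*second_text#first_length and that the descriptions themselves contain no * character. StreamCaption and parse_stream_caption are hypothetical names, not part of the repository.

```python
from typing import NamedTuple


class StreamCaption(NamedTuple):
    first_text: str
    second_text: str
    first_length: int  # number of frames in the first subsequence


def parse_stream_caption(line: str) -> StreamCaption:
    """Split 'A1_t*A2_t#A1_length' into its three fields."""
    # Split on the last '#' so the frame count is always the final field.
    texts, _, length = line.strip().rpartition("#")
    # Split on the first '*' to separate the two descriptions.
    first_text, _, second_text = texts.partition("*")
    return StreamCaption(first_text.strip(), second_text.strip(), int(length))
```

If the frames in seqN.npy are stored in order, first_length marks where the second description takes over, so seq[:first_length] and seq[first_length:] would correspond to the two subsequences. A line without a # raises ValueError, because the frame count cannot be parsed.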
- Train our TMR-based motion evaluator on the processed 272-dim HumanML3D dataset:
bash TRAIN_evaluator_272.sh
After training for 100 epochs, the checkpoint will be stored at:
Evaluator_272/experiments/temos/EXP1/checkpoints/
⬇️ We provide the evaluator checkpoint on Hugging Face. Download it as follows:
python humanml3d_272/prepare/download_evaluator_ckpt.py
The downloaded checkpoint will be stored at:
Evaluator_272/
- Train the Causal TAE:
bash TRAIN_causal_TAE.sh ${NUM_GPUS}
e.g., if you have 8 GPUs, run: bash TRAIN_causal_TAE.sh 8
The checkpoint will be stored at:
Experiments/causal_TAE_t2m_272/
Tensorboard visualization:
tensorboard --logdir='Experiments/causal_TAE_t2m_272'
⬇️ We provide the Causal TAE checkpoint on Hugging Face. Download it as follows:
python humanml3d_272/prepare/download_Causal_TAE_t2m_272_ckpt.py
- Train the text-to-motion model:
We provide scripts to train the original text-to-motion generation model with Llama blocks, the Two-Forward strategy, and QK-Norm, using the motion latents encoded by the Causal TAE (trained in the first stage).
3.1 Get motion latents:
python get_latent.py --resume-pth Causal_TAE/net_last.pth --latent_dir humanml3d_272/t2m_latents
3.2 Download the sentence-T5-XXL model from Hugging Face:
huggingface-cli download --resume-download sentence-transformers/sentence-t5-xxl --local-dir sentencet5-xxl/
3.3 Train the text-to-motion generation model:
bash TRAIN_t2m.sh ${NUM_GPUS}
e.g., if you have 8 GPUs, run: bash TRAIN_t2m.sh 8
The checkpoint will be stored at:
Experiments/t2m_model/
Tensorboard visualization:
tensorboard --logdir='Experiments/t2m_model'
⬇️ We provide the text-to-motion model checkpoint on Hugging Face. Download it as follows:
python humanml3d_272/prepare/download_t2m_model_ckpt.py
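For readers unfamiliar with the QK-Norm mentioned for this model, the idea is generally to normalize queries and keys before the attention dot product so that the logits stay bounded. The NumPy sketch below shows one common formulation (L2 normalization with a scalar temperature) inside single-head causal attention. It is an illustration only, not the repository's implementation, which may use a different normalization layer, multiple heads, and PyTorch. The function name and the scale value are assumptions.

```python
import numpy as np


def qk_norm_causal_attention(q, k, v, scale=10.0, eps=1e-6):
    """Single-head causal attention with L2-normalized queries and keys.

    q, k, v: arrays of shape (T, d). `scale` stands in for the learnable
    temperature that replaces the usual 1/sqrt(d) factor; 10.0 is arbitrary.
    """
    q = q / (np.linalg.norm(q, axis=-1, keepdims=True) + eps)
    k = k / (np.linalg.norm(k, axis=-1, keepdims=True) + eps)
    logits = scale * (q @ k.T)  # cosine similarities, bounded by +/- scale
    # Causal mask: position t may only attend to positions <= t.
    num_steps = q.shape[0]
    future = np.triu(np.ones((num_steps, num_steps), dtype=bool), k=1)
    logits = np.where(future, -np.inf, logits)
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v
```

Because the logits are cosine similarities times a fixed scale, their magnitude does not grow with the norm of the queries or keys, which is the usual motivation for this kind of normalization.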
- Train the streaming motion generation model (MotionStreamer):
We provide scripts to train the streaming motion generation model (MotionStreamer) with Llama blocks, the Two-Forward strategy, and QK-Norm, using the motion latents encoded by the Causal TAE (this requires training a new Causal TAE on both HumanML3D-272 and BABEL-272 data).
4.1 Train a Causal TAE using both HumanML3D-272 and BABEL-272 data:
bash TRAIN_causal_TAE.sh ${NUM_GPUS} t2m_babel_272
e.g., if you have 8 GPUs, run: bash TRAIN_causal_TAE.sh 8 t2m_babel_272
The checkpoint will be stored at:
Experiments/causal_TAE_t2m_babel_272/
Tensorboard visualization:
tensorboard --logdir='Experiments/causal_TAE_t2m_babel_272'
⬇️ We provide the Causal TAE checkpoint trained on both HumanML3D-272 and BABEL-272 data on Hugging Face. Download it as follows:
python humanml3d_272/prepare/download_Causal_TAE_t2m_babel_272_ckpt.py
4.2 Get motion latents of both HumanML3D-272 and the processed BABEL-272-stream dataset:
python get_latent.py --resume-pth Causal_TAE_t2m_babel/net_last.pth --latent_dir babel_272_stream/t2m_babel_latents --dataname t2m_babel_272
4.3 Train MotionStreamer model:
bash TRAIN_motionstreamer.sh ${NUM_GPUS}
e.g., if you have 8 GPUs, run: bash TRAIN_motionstreamer.sh 8
The checkpoint will be stored at:
Experiments/motionstreamer_model/
Tensorboard visualization:
tensorboard --logdir='Experiments/motionstreamer_model'
- Evaluate the metrics of the processed 272-dim HumanML3D dataset:
bash EVAL_GT.sh
(FID, R@1, R@2, R@3, Diversity, and MM-Dist (Matching Score) are reported.)
- Evaluate the metrics of the Causal TAE:
bash EVAL_causal_TAE.sh
(FID and MPJPE (mm) are reported.)
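MPJPE (mean per-joint position error) is the average Euclidean distance between predicted and ground-truth joint positions. The sketch below computes it in millimeters, assuming inputs of shape (num_frames, num_joints, 3) expressed in meters. The shape, the unit, and the function name mpjpe_mm are assumptions of this example, and the evaluation script may apply additional alignment steps.

```python
import numpy as np


def mpjpe_mm(pred, target):
    """Mean per-joint position error in millimeters.

    pred, target: arrays of shape (T, J, 3) with joint positions in meters
    (the unit and shape are assumptions of this sketch).
    """
    per_joint_error = np.linalg.norm(pred - target, axis=-1)  # shape (T, J)
    return float(per_joint_error.mean() * 1000.0)
```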
- Evaluate the metrics of the text-to-motion model:
bash EVAL_t2m.sh
(FID, R@1, R@2, R@3, Diversity, and MM-Dist (Matching Score) are reported.)
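To make the retrieval metrics concrete, the sketch below computes R@k and MM-Dist from row-aligned text and motion embeddings, where row i of each array belongs to the same sample. It uses Euclidean distance over the whole set. The evaluation scripts may use a different distance, batching scheme, or tie-breaking rule, and retrieval_metrics is a hypothetical name.

```python
import numpy as np


def retrieval_metrics(text_emb, motion_emb, ks=(1, 2, 3)):
    """Return ({k: R@k}, MM-Dist) for row-aligned embeddings of shape (N, D)."""
    # Pairwise Euclidean distances between every text and every motion: (N, N).
    diffs = text_emb[:, None, :] - motion_emb[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    # Rank of the matching motion for each text (0 means nearest).
    order = np.argsort(dists, axis=1)
    ranks = np.argmax(order == np.arange(len(dists))[:, None], axis=1)
    r_at_k = {k: float(np.mean(ranks < k)) for k in ks}
    # MM-Dist: mean distance between each text and its own motion.
    mm_dist = float(np.mean(np.diag(dists)))
    return r_at_k, mm_dist
```

R@k is the fraction of texts whose matching motion appears among the k nearest motions, so it is non-decreasing in k, while a lower MM-Dist indicates closer text-motion alignment.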
- Run inference with the text-to-motion model:
[Option 1] Recover from joint positions
python demo_t2m.py --text 'a person is walking like a mummy.' --mode pos --resume-pth Causal_TAE/net_last.pth --resume-trans Experiments/t2m_model/latest.pth
[Option 2] Recover from joint rotations
python demo_t2m.py --text 'a person is walking like a mummy.' --mode rot --resume-pth Causal_TAE/net_last.pth --resume-trans Experiments/t2m_model/latest.pth
In our 272-dim representation, Inverse Kinematics (IK) is not needed. For further conversion to BVH format, please refer to this repo (Step 6: Representation_272 to BVH conversion). The resulting BVH animation can be visualized and edited in Blender.
This repository builds upon the following awesome datasets and projects:
If our project is helpful for your research, please consider citing:
@InProceedings{Xiao_2025_ICCV,
author = {Xiao, Lixing and Lu, Shunlin and Pi, Huaijin and Fan, Ke and Pan, Liang and Zhou, Yueer and Feng, Ziyong and Zhou, Xiaowei and Peng, Sida and Wang, Jingbo},
title = {MotionStreamer: Streaming Motion Generation via Diffusion-based Autoregressive Model in Causal Latent Space},
booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
month = {October},
year = {2025},
pages = {10086-10096}
}