
CFSynthesis: Controllable and Free-view 3D Human Video Synthesis


¹Zhejiang University   ²The Chinese University of Hong Kong

Overview

⚒️ Installation

Prerequisites: Python >= 3.10, CUDA >= 11.7, and ffmpeg.

  • Tested GPUs: A100; at least 40 GB of GPU memory is required.

Install dependencies:

pip install -r requirements.txt
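
As a quick sanity check (a minimal sketch, assuming PyTorch is installed via requirements.txt), you can verify that a GPU and ffmpeg are visible:

import shutil
import torch

# A CUDA-capable GPU with >= 40 GB memory is required for training.
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"GPU: {props.name}, {props.total_memory / 1024**3:.0f} GB")

# ffmpeg must be on PATH for the video-synthesis step.
print("ffmpeg:", shutil.which("ffmpeg") or "not found")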

🚀 Training and Inference

Prepare Datasets

The data processing code is located in CFSynthesis/render_dataset.
Place your training data in the corresponding folder using the following layout (we use ASIT as an example):

render_dataset/path/to/datasets
  ├── gBR_sFM_c08_d06_mBR5
  │   ├── gBR_sFM_c08_d06_mBR5_0001.png
  │   ├── gBR_sFM_c08_d06_mBR5_0002.png
  │   ...
  ├── gLO_sFM_c01_d13_mLO1
  │   ├── gLO_sFM_c01_d13_mLO1_0001.png
  │   ├── gLO_sFM_c01_d13_mLO1_0002.png

⚠️ Note: You need to save the data at a resolution of 512×512.
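
If your frames are not already 512×512, here is a minimal resizing sketch (assuming PNG frames and Pillow; the dataset path is a placeholder):

from pathlib import Path
from PIL import Image

dataset_dir = Path("render_dataset/path/to/datasets")  # placeholder path

# Resize every frame in place to the required 512x512 resolution.
for frame in sorted(dataset_dir.rglob("*.png")):
    img = Image.open(frame).convert("RGB")
    if img.size != (512, 512):
        img.resize((512, 512), Image.LANCZOS).save(frame)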

We use the following tools (please ensure that all dependencies and pretrained checkpoints are properly set up):

Install Detectron2:

git clone https://github.com/facebookresearch/detectron2.git
python -m pip install -e detectron2
# Download the DensePose checkpoint into the DensePose project directory:
wget https://dl.fbaipublicfiles.com/densepose/densepose_rcnn_R_101_FPN_s1x/165712084/model_final_c6ab63.pkl \
  -P /path/to/detectron2/projects/DensePose/checkpoints/

Generate UV maps and backgrounds:

bash process.sh path/to/datasets /absolute/path/to/detectron2

Generate foregrounds:

python select_inference.py --input_img_path path/to/datasets/gt --save_path path/to/datasets/ref
python SemanticGuidedHumanMatting/seg_bg_image_folder.py \
    --images_dir path/to/datasets/ref \
    --result_dir path/to/datasets/ref_seg \
    --pretrained_weight SemanticGuidedHumanMatting/pretrained/SGHM-ResNet50.pth

You should now have the following data structure:

/path/to/datasets
  ├── gt
  │   ├── gBR_sFM_c08_d06_mBR5_0001.png
  │   ├── gBR_sFM_c08_d06_mBR5_0002.png
  │   ├── gLO_sFM_c01_d13_mLO1_0001.png
  │   ├── gLO_sFM_c01_d13_mLO1_0002.png
  │   ...
  ├── ref_control
  │   ├── gBR_sFM_c08_d06_mBR5_0001.png
  │   ├── gBR_sFM_c08_d06_mBR5_0002.png
  │   ├── gLO_sFM_c01_d13_mLO1_0001.png
  │   ├── gLO_sFM_c01_d13_mLO1_0002.png
  ├── cond
  │   ├── gBR_sFM_c08_d06_mBR5_0001.png
  │   ├── gBR_sFM_c08_d06_mBR5_0002.png
  │   ├── gLO_sFM_c01_d13_mLO1_0001.png
  │   ├── gLO_sFM_c01_d13_mLO1_0002.png
  ├── images-seg
  │   ├── gBR_sFM_c08_d06_mBR5_0001.png
  │   ├── gBR_sFM_c08_d06_mBR5_0002.png
  │   ├── gLO_sFM_c01_d13_mLO1_0001.png
  │   ├── gLO_sFM_c01_d13_mLO1_0002.png
  ├── ref
  │   ├── gLO_sFM_c01_d13_mLO1.png 
  │   ├── gLO_sFM_c08_d13_mLO1.png 

Synthesize videos:

python tools/synthesize_video.py --root /path/to/datasets --fps 30 --clean

This produces the following structure:

/path/to/datasets
  ├── gt
  │   ├── gBR_sFM_c08_d06_mBR5.mp4
  │   ├── gLO_sFM_c01_d13_mLO1.mp4
  │   ...
  ├── ref_control
  │   ├── gBR_sFM_c08_d06_mBR5.mp4
  │   ├── gLO_sFM_c01_d13_mLO1.mp4
  ├── cond
  │   ├── gBR_sFM_c08_d06_mBR5.mp4
  │   ├── gLO_sFM_c01_d13_mLO1.mp4
  ├── images-seg
  │   ├── gBR_sFM_c08_d06_mBR5.mp4
  │   ├── gLO_sFM_c01_d13_mLO1.mp4
  ├── ref
  │   ├── gLO_sFM_c01_d13_mLO1.png 
  │   ├── gLO_sFM_c08_d13_mLO1.png 
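
For reference, one clip's frames can also be encoded into an .mp4 directly with ffmpeg; a minimal sketch of that step (the clip name and path are placeholders, and the exact options used by synthesize_video.py may differ):

import subprocess
from pathlib import Path

frames_dir = Path("/path/to/datasets/gt")  # placeholder path
clip = "gBR_sFM_c08_d06_mBR5"

# Encode the numbered frames of one clip into an H.264 video at 30 fps.
subprocess.run([
    "ffmpeg", "-y", "-framerate", "30", "-start_number", "1",
    "-i", str(frames_dir / f"{clip}_%04d.png"),
    "-c:v", "libx264", "-pix_fmt", "yuv420p",
    str(frames_dir / f"{clip}.mp4"),
], check=True)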

Inference

Run inference:

python -m scripts.pipeline.pose2vid \
  --config ./configs/animation/animation.yaml -W 512 -H 512 -L 96

🏋️‍♂️ Training

Data Preparation

Extract the meta info of your dataset:

python tools/extract_meta_info.py --root_path /path/to/your/video_dir/gt --dataset_name asit 

Update the training config to point to the generated meta file:

data:
  meta_paths:
    - "./data/asit_meta.json"

Stage 1

Download base models from Hugging Face.
We recommend using git lfs to download large files.
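
A minimal download sketch using huggingface_hub (the repo IDs below are assumptions inferred from the folder names; verify them against the checkpoints this project actually expects):

from huggingface_hub import snapshot_download

# Assumed repo IDs; adjust if the project documents different sources.
for repo_id, folder in [
    ("stabilityai/sd-vae-ft-mse", "sd-vae-ft-mse"),
    ("lllyasviel/control_v11p_sd15_openpose", "control_v11p_sd15_openpose"),
    ("stable-diffusion-v1-5/stable-diffusion-v1-5", "stable-diffusion-v1-5"),
]:
    snapshot_download(repo_id=repo_id, local_dir=f"pretrained_weights/{folder}")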

Place the models as follows:

pretrained_weights
|-- ckpts  
|   |-- denoising_unet.pth
|   |-- guidance_encoder_depth.pth
|   |-- guidance_encoder_dwpose.pth
|   |-- guidance_encoder_normal.pth
|   |-- guidance_encoder_semantic_map.pth
|   |-- reference_unet.pth
|-- control_v11p_sd15_openpose
|   |-- diffusion_pytorch_model.bin
|-- image_encoder
|   |-- config.json
|   `-- pytorch_model.bin
|-- sd-vae-ft-mse
|   |-- config.json
|   |-- diffusion_pytorch_model.bin
|   `-- diffusion_pytorch_model.safetensors
`-- stable-diffusion-v1-5
    |-- feature_extractor
    |   `-- preprocessor_config.json
    |-- model_index.json
    |-- unet
    |   |-- config.json
    |   `-- diffusion_pytorch_model.bin
    `-- v1-inference.yaml

Run Stage 1 training:

accelerate launch train_stage_1.py --config configs/train/stage1.yaml

Stage 2

Download the pretrained motion module weights (mm_sd_v15_v2.ckpt) and place the file under ./pretrained_weights.
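
A minimal download sketch (the repo ID is an assumption; this checkpoint is the AnimateDiff motion module, commonly hosted on the Hugging Face Hub):

from huggingface_hub import hf_hub_download

# Assumed source repo for the AnimateDiff motion module checkpoint.
hf_hub_download(repo_id="guoyww/animatediff",
                filename="mm_sd_v15_v2.ckpt",
                local_dir="./pretrained_weights")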

Specify Stage 1 weights in the config file stage2.yaml:

stage1_ckpt_dir: './exp_output/stage1'
stage1_ckpt_step: 30000 

Run Stage 2 training:

accelerate launch train_stage_2.py --config configs/train/stage2.yaml

🙏 Acknowledgements

This project builds upon excellent prior work; we thank the authors for releasing their code and models.


🎓 Citation

If you find this codebase useful, please cite:

@inproceedings{cui2025cfsynthesis,
  title={CFSynthesis: Controllable and Free-view 3D Human Video Synthesis},
  author={Cui, Liyuan and Xu, Xiaogang and Dong, Wenqi and Yang, Zesong and Bao, Hujun and Cui, Zhaopeng},
  booktitle={Proceedings of the 2025 International Conference on Multimedia Retrieval},
  pages={135--144},
  year={2025}
}
