This repository is a lightweight fork of the CDM module released in ByteDance-Seed/manip-as-in-sim-suite, a depth estimation library that leverages Vision Transformer encoders to turn noisy RGB-D sensor readings into clean, metric depth maps.
The goal of the fork is to make the camera depth models easy to install and use for inference as a standalone Python package. To this end:
- The original models are wrapped in a `camera_depth_models` package.
- The inference API has been slightly simplified: it handles conversions internally, so you can just call `model.infer_depth` and get a metric depth estimate.
Install the package in editable mode:

```bash
pip install -e .
```

Pretrained CDM models for each supported camera are available on Hugging Face:
https://huggingface.co/collections/depth-anything/camera-depth-models-68b521181dedd223f4b020db
Download the checkpoints you need (for example, `cdm_d435.ckpt`) and place them wherever you keep large model files.
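If you prefer to fetch checkpoints programmatically, something like the sketch below should work with `huggingface_hub`. Note that the exact `repo_id` used here is an assumption for illustration only; check the collection page above for the real repository names.

```python
# Hypothetical download sketch: the repo_id below is a placeholder, not a
# confirmed repository name. Replace it with the repository listed on the
# Hugging Face collection page for your camera.
from huggingface_hub import hf_hub_download

ckpt_path = hf_hub_download(
    repo_id="depth-anything/cdm-d435",  # assumption: substitute the actual repo
    filename="cdm_d435.ckpt",
)
print(ckpt_path)  # local path to the cached checkpoint
```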
We provide a minimal example in `scripts/main.py`. Update the `model_path` as needed and run it. The script loads a model with `camera_depth_models.load_model`, runs `infer_depth`, and displays the RGB input, the raw sensor depth, and the estimated metric depth side by side.
```python
import cv2
import numpy as np
from camera_depth_models import load_model

device = "cuda"
model = load_model("vitl", "path/to/model.ckpt", device)

rgb = cv2.imread("assets/example_data/color_12.png")[:, :, ::-1]  # BGR -> RGB
depth = cv2.imread("assets/example_data/depth_12.png", cv2.IMREAD_UNCHANGED) / 1000.0

# Returns metric depth (meters), not inverse depth
pred_depth = model.infer_depth(rgb, depth, input_size=518)
```

Camera Depth Models are sensor-specific depth networks trained to produce clean, simulation-like depth maps from noisy real-world inputs. By bridging the visual gap between simulation and reality, CDMs allow robotic policies trained in simulation to operate on real hardware with minimal adaptation.
- Metric Depth Estimation – produces absolute depth in meters.
- Multi-Camera Support – tuned checkpoints for Intel RealSense D405/D435/L515, Stereolabs ZED 2i, and Azure Kinect.
- Real-time Ready – lightweight inference suitable for robot control loops.
- Sim-to-Real Transfer – outputs depth maps that mimic the noise profile of simulation.
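To check whether CDM inference fits your control-loop budget, you can time `infer_depth` on your own hardware. This is a rough sketch, not a benchmark: the checkpoint path is a placeholder, the example images are the ones shipped in `assets/example_data`, and throughput depends entirely on your GPU.

```python
# Rough latency check; the checkpoint path is a placeholder and the numbers
# you get are specific to your hardware.
import time

import cv2
from camera_depth_models import load_model

model = load_model("vitl", "path/to/model.ckpt", "cuda")
rgb = cv2.imread("assets/example_data/color_12.png")[:, :, ::-1]
depth = cv2.imread("assets/example_data/depth_12.png", cv2.IMREAD_UNCHANGED) / 1000.0

model.infer_depth(rgb, depth, input_size=518)  # warm-up run
start = time.perf_counter()
for _ in range(20):
    model.infer_depth(rgb, depth, input_size=518)
fps = 20 / (time.perf_counter() - start)
print(f"~{fps:.1f} depth maps per second")
```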
CDM relies on a dual-branch Vision Transformer design (a simplified sketch of the fusion idea follows the encoder list below):
- RGB Branch extracts semantic context from RGB images.
- Depth Branch processes noisy depth measurements.
- Cross-Attention Fusion blends semantic cues and scale cues.
- DPT Decoder reconstructs the final metric depth map.
Supported ViT encoder sizes:
- `vits`: 64 features / 384 channels
- `vitb`: 128 features / 768 channels
- `vitl`: 256 features / 1024 channels (all released checkpoints use this configuration)
- `vitg`: 384 features / 1536 channels
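The listing below is a minimal, illustrative PyTorch sketch of the dual-branch fusion idea only, not the actual CDM implementation: simple patch embeddings stand in for the two ViT branches, a single cross-attention layer plays the role of the fusion blocks, and a linear head stands in for the DPT decoder. All module names and hyperparameters are assumptions; the width of 1024 merely mirrors the `vitl` configuration.

```python
# Illustrative sketch of the dual-branch + cross-attention idea (not the real
# CDM code). Shapes, names, and hyperparameters here are assumptions.
import torch
import torch.nn as nn


class DualBranchFusion(nn.Module):
    def __init__(self, dim=1024, num_heads=16):
        super().__init__()
        # Stand-ins for the RGB and depth ViT branches: each maps an image to
        # a sequence of patch tokens of width `dim`.
        self.rgb_patch = nn.Conv2d(3, dim, kernel_size=14, stride=14)
        self.depth_patch = nn.Conv2d(1, dim, kernel_size=14, stride=14)
        # Cross-attention: depth tokens query semantic context from RGB tokens.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Stand-in for the DPT decoder: one depth value per patch.
        self.head = nn.Linear(dim, 1)

    def forward(self, rgb, depth):
        rgb_tok = self.rgb_patch(rgb).flatten(2).transpose(1, 2)      # (B, N, dim)
        depth_tok = self.depth_patch(depth).flatten(2).transpose(1, 2)
        fused, _ = self.cross_attn(depth_tok, rgb_tok, rgb_tok)       # depth attends to RGB
        return self.head(fused).squeeze(-1)                           # (B, N) coarse depth


rgb = torch.randn(1, 3, 518, 518)
depth = torch.randn(1, 1, 518, 518)
print(DualBranchFusion()(rgb, depth).shape)  # torch.Size([1, 1369])
```

In the released models, the decoder is a DPT head that reconstructs a full-resolution metric depth map rather than a single value per patch.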
The upstream authors train CDMs using synthetic datasets augmented with learned camera-specific noise models:
- Noise Modeling – learn hole/value noise patterns from real sensor captures.
- Synthetic Data Generation – apply the noise models to clean simulation depth (a toy version of this step is sketched below).
- CDM Training – train the ViT-based model on this synthetic-but-realistic corpus.
Datasets include HyperSim, DREDS, HISS, and IRS (over 280k images).
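As a concrete, if simplified, picture of the first two steps, the snippet below adds "value" jitter and "hole" dropout to a clean simulated depth map. The constants are made up for illustration; the real pipeline learns these noise statistics per camera rather than using fixed parameters.

```python
# Toy camera-noise augmentation on clean simulated depth (meters). The hole
# rate and Gaussian scale are placeholders, not learned sensor statistics.
import numpy as np


def add_sensor_noise(clean_depth, hole_rate=0.05, rel_sigma=0.01, rng=None):
    rng = rng or np.random.default_rng(0)
    noisy = clean_depth.copy()
    # "Value" noise: depth-dependent Gaussian jitter.
    noisy += rng.normal(0.0, rel_sigma, clean_depth.shape) * clean_depth
    # "Hole" noise: randomly drop pixels to zero, as real sensors do on
    # reflective or low-texture surfaces.
    holes = rng.random(clean_depth.shape) < hole_rate
    noisy[holes] = 0.0
    return noisy


clean = np.full((480, 640), 1.5, dtype=np.float32)  # flat wall at 1.5 m
noisy = add_sensor_noise(clean)
print(noisy.min(), noisy.max())
```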
We distribute checkpoints for:
- Intel RealSense D405 / D435 / L515
- Stereolabs ZED 2i (Quality + Neural modes)
- Microsoft Azure Kinect
The released CDMs achieve state-of-the-art accuracy on metric depth estimation:
- Higher accuracy than prompt-guided monocular depth estimators.
- Strong zero-shot generalization across camera hardware.
- Fast enough for closed-loop manipulation policies.
If you use Camera Depth Models in your research, please cite the original paper:
```bibtex
@article{liu2025manipulation,
  title={Manipulation as in Simulation: Enabling Accurate Geometry Perception in Robots},
  author={Liu, Minghuan and Zhu, Zhengbang and Han, Xiaoshen and Hu, Peng and Lin, Haotong and
          Li, Xinyao and Chen, Jingxiao and Xu, Jiafeng and Yang, Yichu and Lin, Yunfeng and
          Li, Xinghang and Yu, Yong and Zhang, Weinan and Kong, Tao and Kang, Bingyi},
  journal={arXiv preprint},
  year={2025}
}
```

This project is distributed under the Apache 2.0 License. See LICENSE for the full text.