
Very Poor Reconstruction with Custom Images or am I doing it WRONG? #87

@AntarCreates

Description


Hi, I tried to reconstruct my custom scene with DepthSplat from COLMAP data using a pretrained weight, but the quality turned out extremely bad (image below for reference). Based on the community demos (many of which use the standard datasets), I assume I did something wrong. A complete report on my approach follows.

This is from DepthSplat:

[Image]

This is from 3DGS as a baseline:

[Image]

DepthSplat on COLMAP Data - Inference Report

Dataset

  • Source: Custom COLMAP reconstruction with 273 registered images
  • Original resolution: 3840×2160 (4K)
  • Camera model: PINHOLE
  • Scene: Indoor scan (mlss_data)

Setup

  • Environment: Conda with Python 3.10
  • PyTorch: 2.4.0 with CUDA 12.4
  • GPU: NVIDIA RTX 4090 (24GB VRAM)
  • Model: depthsplat-gs-base-re10kdl3dv-448x768-randview2-6-f8ddd845.pth

Note: Setup worked flawlessly on the RTX 4090 with CUDA 12.4. xformers 0.0.27.post2 had no compatibility issues (unlike the RTX 5090, which did have xformers problems).

Data Preparation

Approach: Followed official dataset conversion methodology from src/scripts/convert_dl3dv_test.py, adapted for COLMAP input.

1. Image Resizing

Resized COLMAP-registered images from 4K to 1920×1080 (2K) to meet the "2k+" requirement:

# Created images_2 directory with 273 images at 1920×1080
python resize_script.py  # Using PIL with LANCZOS resampling
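A minimal sketch of what `resize_script.py` does, assuming the usual `images/` → `images_2/` layout (the directory names and JPEG quality setting here are assumptions, not the exact script):

```python
# Downscale 4K COLMAP-registered images to 1920x1080 with LANCZOS
# resampling, writing results into an images_2 directory.
from pathlib import Path
from PIL import Image

SRC = Path("images")    # 3840x2160 inputs (assumed location)
DST = Path("images_2")  # 1920x1080 outputs
DST.mkdir(exist_ok=True)

if SRC.exists():
    for p in sorted(SRC.glob("*.jpg")):
        img = Image.open(p)
        img = img.resize((1920, 1080), Image.LANCZOS)
        img.save(DST / p.name, quality=95)
```

Resizing before (rather than after) conversion keeps the normalized intrinsics consistent, since fx/w and fy/h are unchanged by a uniform downscale.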

2. Conversion to .torch Format

Converted COLMAP binary format to DepthSplat's expected format per official guidelines:

Key data structure:

{
    'key': '000000',
    'url': 'mlss_colmap_scene',
    'timestamps': torch.tensor([0, 1, 2, ..., 272], dtype=int64),
    'cameras': torch.tensor(shape=[273, 18], dtype=float32),
    'images': [list of 273 JPG tensors as raw bytes]
}

Camera format (18 values per camera):

  • [fx/w, fy/h, cx/w, cy/h, 0.0, 0.0, w2c_00, w2c_01, ..., w2c_23]
  • Intrinsics are normalized by image dimensions (per README camera conventions)
  • Extrinsics are W2C (world-to-camera) matrices flattened to 12 values, OpenCV convention
  • COLMAP already provides W2C matrices (from qvec/tvec), stored directly
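The per-camera row described above can be sketched as follows; `qvec2rotmat` follows the convention of COLMAP's `read_write_model.py`, and `camera_row` is a hypothetical helper name:

```python
# Assemble one 18-value camera row from COLMAP qvec/tvec plus PINHOLE
# intrinsics: [fx/w, fy/h, cx/w, cy/h, 0, 0, <3x4 W2C flattened>].
import numpy as np

def qvec2rotmat(q):
    """COLMAP quaternion (w, x, y, z) -> 3x3 world-to-camera rotation."""
    w, x, y, z = q
    return np.array([
        [1 - 2*y*y - 2*z*z, 2*x*y - 2*z*w,     2*x*z + 2*y*w],
        [2*x*y + 2*z*w,     1 - 2*x*x - 2*z*z, 2*y*z - 2*x*w],
        [2*x*z - 2*y*w,     2*y*z + 2*x*w,     1 - 2*x*x - 2*y*y],
    ])

def camera_row(fx, fy, cx, cy, w, h, qvec, tvec):
    R = qvec2rotmat(qvec)                                              # W2C rotation
    w2c = np.concatenate([R, np.asarray(tvec).reshape(3, 1)], axis=1)  # 3x4 [R | t]
    intr = [fx / w, fy / h, cx / w, cy / h, 0.0, 0.0]                  # normalized
    return np.array(intr + w2c.flatten().tolist(), dtype=np.float32)   # shape (18,)
```

Because COLMAP's qvec/tvec are already world-to-camera, no inversion is needed before flattening.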

Verified: our data structure matches the official DL3DV format:

# Example from our dataset:
Cameras shape: torch.Size([273, 18])
Intrinsics: fx/w=0.6473, fy/h=1.1566, cx/w=0.5000, cy/h=0.5000
W2C matrix (3x4): stored as 12 flattened values

Important: Data must be wrapped in a list: [data_dict] when saving to .torch file.
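A minimal sketch of the save step, using the paths from the dataset structure below (the tensor contents here are placeholders, not real data):

```python
# Save the chunk wrapped in a list, [data_dict], to match the
# DL3DV-style .torch chunk layout described above.
from pathlib import Path
import torch

out_dir = Path("datasets/mlss_2k/test")
out_dir.mkdir(parents=True, exist_ok=True)

data_dict = {
    "key": "000000",
    "url": "mlss_colmap_scene",
    "timestamps": torch.arange(273, dtype=torch.int64),
    "cameras": torch.zeros(273, 18, dtype=torch.float32),  # filled from COLMAP
    "images": [],  # raw JPG bytes per frame in practice
}
torch.save([data_dict], out_dir / "000000.torch")
```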

3. Dataset Structure

depthsplat/
├── datasets/
│   └── mlss_2k/
│       └── test/
│           ├── 000000.torch
│           └── index.json  # {"000000": {}}
└── assets/
    └── mlss_2k_eval_index.json

Evaluation index format:

{
  "000000": {
    "context": [0, 45, 90, 135, 180, 225],
    "target": [1, 2, 3, ...]
  }
}
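The index above can be generated programmatically; this sketch picks N evenly spaced context views out of the 273 frames and uses the rest as render targets (the output path is the one from the dataset structure):

```python
# Write an evaluation index with evenly spaced context views;
# all remaining frames become targets.
import json
from pathlib import Path

num_frames, num_context = 273, 6
step = num_frames // num_context                    # 45 here
context = [i * step for i in range(num_context)]    # [0, 45, 90, 135, 180, 225]
target = [i for i in range(num_frames) if i not in context]

Path("assets").mkdir(exist_ok=True)
with open("assets/mlss_2k_eval_index.json", "w") as f:
    json.dump({"000000": {"context": context, "target": target}}, f, indent=2)
```

Changing `num_context` to 8 reproduces the [0, 34, 68, ...] spacing used in Attempt 2.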

Inference Commands

Attempt 1: 4 views @ 256×448 (Low Quality)

PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True python -m src.main \
  +experiment=dl3dv \
  dataset.roots=[datasets/mlss_2k] \
  dataset.image_shape=[256,448] \
  dataset.ori_image_shape=[1080,1920] \
  model.encoder.num_scales=2 \
  model.encoder.upsample_factor=4 \
  model.encoder.lowest_feature_resolution=8 \
  model.encoder.monodepth_vit_type=vitb \
  model.encoder.gaussian_adapter.gaussian_scale_max=0.1 \
  checkpointing.pretrained_model=pretrained/depthsplat-gs-base-re10kdl3dv-448x768-randview2-6-f8ddd845.pth \
  mode=test \
  dataset/view_sampler=evaluation \
  dataset.view_sampler.num_context_views=4 \
  dataset.view_sampler.index_path=assets/mlss_2k_eval_index.json \
  test.save_gaussian=true \
  test.compute_scores=false \
  output_dir=outputs/mlss_2k_test

Result: 41MB .ply file, poor quality

Attempt 2: 8 views @ 448×768 (Better Quality)

# Same command, changed:
dataset.image_shape=[448,768]
dataset.view_sampler.num_context_views=8
# Updated context views in eval index to: [0, 34, 68, 102, 136, 170, 204, 238]

Result: 169MB .ply file, improved quality

Attempt 3: 6 views @ 512×960 (Best Quality - SUCCESS)

# Same command, changed:
dataset.image_shape=[512,960]
dataset.view_sampler.num_context_views=6
test.render_chunk_size=5  # Added for memory management
# Updated context views in eval index to: [0, 45, 90, 135, 180, 225]

Result: 183MB .ply file, best achievable quality

  • Encoder: 0.76s per scene
  • Decoder: ~0.007s per rendered view
  • Peak VRAM: 14.4 GB

Issues Encountered

1. CUDA Out of Memory (OOM)

Problem: Higher view counts and resolutions consistently hit OOM on 24GB RTX 4090.

Configurations that failed:

  • 12 views @ 512×960: OOM during encoding (tried to allocate 11.25 GiB)
  • 12 views @ 448×768: OOM during rendering (tried to allocate 19.93 GiB)
  • 6 views @ 512×960 with metrics: OOM during LPIPS computation (tried to allocate 5.62 GiB)

Workaround: Settled on 6 views @ 512×960 without metrics computation.

2. Metrics Computation Failed

Problem: test.compute_scores=true with LPIPS requires ~5-6GB additional VRAM beyond rendering.

Impact: Could not generate SSIM/PSNR/LPIPS metrics even at 448×768 resolution.

Attempted:

  • 6 views @ 512×960 with metrics: OOM
  • 6 views @ 448×768 with metrics: OOM

3. Memory Scaling

Observation: Memory usage scales significantly with:

  • Number of context views (more views = more Gaussians generated)
  • Input resolution (higher res = larger feature maps)
  • LPIPS perceptual loss network (separate VGG-based network)

Note: The memory bottleneck appears to be in the rendering phase where Gaussians are rasterized, not during encoding.

Results Summary

| Run | Views | Input Res | Output Size | Status | VRAM Peak |
|-----|-------|-----------|-------------|--------|-----------|
| 1 | 4 | 256×448 | 41 MB | ✅ Success | ~8 GB |
| 2 | 8 | 448×768 | 169 MB | ✅ Success | ~11 GB |
| 3 | 6 | 512×960 | 183 MB | ✅ Success | 14.4 GB |
| 4 | 12 | 512×960 | - | ❌ OOM | - |
| 5 | 12 | 448×768 | - | ❌ OOM | - |
| 6 | 6 | 512×960 + metrics | - | ❌ OOM | - |

Best result: Run 3 (outputs/mlss_2k_6v_512x960/gaussians/000000.ply)

Questions for Authors

  1. Data format verification: We followed the official convert_dl3dv_test.py structure with COLMAP as input source:

    • Normalized intrinsics: ✅
    • W2C matrices in OpenCV convention: ✅
    • Wrapped in list format: ✅
• Inference ran correctly and successfully produced splats

    Is this the correct approach for COLMAP data?

  2. Memory optimization: Are there recommended settings to reduce VRAM usage for higher resolutions? We tried:

    • PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
    • test.render_chunk_size=5
    • Still hit OOM with 12 views @ 512×960
  3. Metrics computation: Is LPIPS the bottleneck? Could we compute only PSNR/SSIM without LPIPS?

  4. Expected quality: With 6 views @ 512×960, is the result comparable to your paper's quality, or would 12 views significantly improve it?

Hardware Context

  • RTX 4090 worked perfectly with CUDA 12.4 + xformers 0.0.27.post2
  • Previous attempt on RTX 5090 failed due to xformers compatibility issues
  • 24GB VRAM appears limiting for higher-resolution inference with multiple views

Generated: March 5, 2026
