Hi, I tried to reconstruct my custom scene with DepthSplat from COLMAP data using pretrained weights, but the quality turned out extremely bad (image for reference). Based on the community demos (many of which use the standard datasets), I assume I did something wrong. I'm including a complete report on my approach below.
This is from DepthSplat

This is from 3DGS as baseline
DepthSplat on COLMAP Data - Inference Report
Dataset
- Source: Custom COLMAP reconstruction with 273 registered images
- Original resolution: 3840×2160 (4K)
- Camera model: PINHOLE
- Scene: Indoor scan (mlss_data)
Setup
- Environment: Conda with Python 3.10
- PyTorch: 2.4.0 with CUDA 12.4
- GPU: NVIDIA RTX 4090 (24GB VRAM)
- Model: depthsplat-gs-base-re10kdl3dv-448x768-randview2-6-f8ddd845.pth
Note: Setup worked flawlessly on RTX 4090 with CUDA 12.4. xformers 0.0.27.post2 had no compatibility issues (unlike RTX 5090 which had xformers problems).
Data Preparation
Approach: Followed official dataset conversion methodology from src/scripts/convert_dl3dv_test.py, adapted for COLMAP input.
1. Image Resizing
Resized COLMAP-registered images from 4K to 1920×1080 (2K) to meet the "2k+" requirement:
```shell
# Created images_2 directory with 273 images at 1920×1080
python resize_script.py  # uses PIL with LANCZOS resampling
```
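For completeness, the resize step was essentially the following sketch (the function name and paths are my own choices; only the LANCZOS resampling and the 1920×1080 target come from the report):

```python
# resize_script.py -- downscale COLMAP-registered images from 4K to 2K.
# Illustrative sketch: function name and directory layout are mine.
from pathlib import Path

from PIL import Image

def resize_images(src_dir: str, dst_dir: str, size=(1920, 1080)) -> int:
    """Resize every .jpg in src_dir to `size` with LANCZOS, writing to dst_dir."""
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    count = 0
    for img_path in sorted(Path(src_dir).glob("*.jpg")):
        with Image.open(img_path) as im:
            im.resize(size, Image.LANCZOS).save(dst / img_path.name, quality=95)
        count += 1
    return count
```

Called as `resize_images("images", "images_2")` from the scene root, this produced the 273-image `images_2` directory.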
2. Conversion to .torch Format
Converted COLMAP binary format to DepthSplat's expected format per official guidelines:
Key data structure:
```python
{
    'key': '000000',
    'url': 'mlss_colmap_scene',
    'timestamps': torch.tensor([0, 1, 2, ..., 272], dtype=int64),
    'cameras': torch.tensor(shape=[273, 18], dtype=float32),
    'images': [list of 273 JPG tensors as raw bytes]
}
```
Camera format (18 values per camera):
```
[fx/w, fy/h, cx/w, cy/h, 0.0, 0.0, w2c_00, w2c_01, ..., w2c_23]
```
- Intrinsics are normalized by image dimensions (per README camera conventions)
- Extrinsics are W2C (world-to-camera) matrices flattened to 12 values, OpenCV convention
- COLMAP already provides W2C matrices (from qvec/tvec), stored directly
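To make the per-camera layout concrete, here is a sketch of assembling one 18-value row from COLMAP's qvec/tvec (the helper names and the use of NumPy are mine, not from the official converter):

```python
# Assemble one 18-value camera row from COLMAP outputs (illustrative sketch):
# [fx/w, fy/h, cx/w, cy/h, 0, 0, then the 3x4 W2C matrix flattened row-major].
import numpy as np

def qvec_to_rotmat(q):
    """COLMAP quaternion (w, x, y, z) -> 3x3 rotation matrix."""
    w, x, y, z = q
    return np.array([
        [1 - 2*y*y - 2*z*z, 2*x*y - 2*w*z,     2*x*z + 2*w*y],
        [2*x*y + 2*w*z,     1 - 2*x*x - 2*z*z, 2*y*z - 2*w*x],
        [2*x*z - 2*w*y,     2*y*z + 2*w*x,     1 - 2*x*x - 2*y*y],
    ])

def camera_row(fx, fy, cx, cy, w, h, qvec, tvec):
    """18 floats: normalized intrinsics, two zeros, flattened 3x4 W2C."""
    w2c = np.hstack([qvec_to_rotmat(qvec), np.asarray(tvec, float).reshape(3, 1)])
    intr = [fx / w, fy / h, cx / w, cy / h, 0.0, 0.0]
    return np.concatenate([intr, w2c.reshape(-1)]).astype(np.float32)
```

Stacking 273 such rows gives the `[273, 18]` cameras tensor above.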
Verified: Our data structure matches the official DL3DV format:
```
# Example from our dataset:
Cameras shape: torch.Size([273, 18])
Intrinsics: fx/w=0.6473, fy/h=1.1566, cx/w=0.5000, cy/h=0.5000
W2C matrix (3x4): stored as 12 flattened values
```
Important: Data must be wrapped in a list: [data_dict] when saving to .torch file.
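A tiny sketch of the save/load round-trip showing that list wrapper (function names are mine):

```python
# The DL3DV-style loader expects a list of scene dicts in each .torch file,
# even when there is only one scene. Illustrative helper names.
import torch

def save_scene(data_dict: dict, out_path: str) -> None:
    torch.save([data_dict], out_path)  # note the [data_dict] wrapper

def load_scene(path: str) -> dict:
    return torch.load(path, weights_only=False)[0]
```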
3. Dataset Structure
```
depthsplat/
├── datasets/
│   └── mlss_2k/
│       └── test/
│           ├── 000000.torch
│           └── index.json          # {"000000": {}}
└── assets/
    └── mlss_2k_eval_index.json
```
Evaluation index format:
```
{
  "000000": {
    "context": [0, 45, 90, 135, 180, 225],
    "target": [1, 2, 3, ...]
  }
}
```
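The context indices are just evenly spaced frames; a minimal generator (the spacing logic is my own, but it reproduces the index sets used in the runs):

```python
# Build the evaluation index: evenly spaced context views over the 273
# frames; all remaining frames become render targets. Illustrative sketch.
import json

def make_eval_index(num_frames: int, num_context: int) -> dict:
    step = num_frames // num_context
    context = [i * step for i in range(num_context)]
    target = [i for i in range(num_frames) if i not in context]
    return {"000000": {"context": context, "target": target}}

# 6 context views over 273 frames -> context [0, 45, 90, 135, 180, 225]
index_json = json.dumps(make_eval_index(273, 6))
```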
Inference Commands
Attempt 1: 4 views @ 256×448 (Low Quality)
```shell
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True python -m src.main \
  +experiment=dl3dv \
  dataset.roots=[datasets/mlss_2k] \
  dataset.image_shape=[256,448] \
  dataset.ori_image_shape=[1080,1920] \
  model.encoder.num_scales=2 \
  model.encoder.upsample_factor=4 \
  model.encoder.lowest_feature_resolution=8 \
  model.encoder.monodepth_vit_type=vitb \
  model.encoder.gaussian_adapter.gaussian_scale_max=0.1 \
  checkpointing.pretrained_model=pretrained/depthsplat-gs-base-re10kdl3dv-448x768-randview2-6-f8ddd845.pth \
  mode=test \
  dataset/view_sampler=evaluation \
  dataset.view_sampler.num_context_views=4 \
  dataset.view_sampler.index_path=assets/mlss_2k_eval_index.json \
  test.save_gaussian=true \
  test.compute_scores=false \
  output_dir=outputs/mlss_2k_test
```
Result: 41MB .ply file, poor quality
Attempt 2: 8 views @ 448×768 (Better Quality)
```shell
# Same command, changed:
dataset.image_shape=[448,768]
dataset.view_sampler.num_context_views=8
# Updated context views in eval index to: [0, 34, 68, 102, 136, 170, 204, 238]
```
Result: 169MB .ply file, improved quality
Attempt 3: 6 views @ 512×960 (Best Quality - SUCCESS)
```shell
# Same command, changed:
dataset.image_shape=[512,960]
dataset.view_sampler.num_context_views=6
test.render_chunk_size=5  # added for memory management
# Updated context views in eval index to: [0, 45, 90, 135, 180, 225]
```
Result: 183MB .ply file, best achievable quality
Performance:
- Encoder: 0.76s per scene
- Decoder: ~0.007s per rendered view
- Peak VRAM: 14.4 GB
Issues Encountered
1. CUDA Out of Memory (OOM)
Problem: Higher view counts and resolutions consistently hit OOM on 24GB RTX 4090.
Configurations that failed:
- 12 views @ 512×960: OOM during encoding (tried to allocate 11.25 GiB)
- 12 views @ 448×768: OOM during rendering (tried to allocate 19.93 GiB)
- 6 views @ 512×960 with metrics: OOM during LPIPS computation (tried to allocate 5.62 GiB)
Workaround: Settled on 6 views @ 512×960 without metrics computation.
2. Metrics Computation Failed
Problem: test.compute_scores=true with LPIPS requires ~5-6GB additional VRAM beyond rendering.
Impact: Could not generate SSIM/PSNR/LPIPS metrics even at 448×768 resolution.
Attempted:
- 6 views @ 512×960 with metrics: OOM
- 6 views @ 448×768 with metrics: OOM
3. Memory Scaling
Observation: Memory usage scales significantly with:
- Number of context views (more views = more Gaussians generated)
- Input resolution (higher res = larger feature maps)
- LPIPS perceptual loss network (separate VGG-based network)
Note: The memory bottleneck appears to be in the rendering phase where Gaussians are rasterized, not during encoding.
Results Summary
| Run | Views | Input Res | Output Size | Status | VRAM Peak |
|-----|-------|-----------|-------------|--------|-----------|
| 1 | 4 | 256×448 | 41 MB | ✅ Success | ~8 GB |
| 2 | 8 | 448×768 | 169 MB | ✅ Success | ~11 GB |
| 3 | 6 | 512×960 | 183 MB | ✅ Success | 14.4 GB |
| 4 | 12 | 512×960 | - | ❌ OOM | - |
| 5 | 12 | 448×768 | - | ❌ OOM | - |
| 6 | 6 | 512×960 + metrics | - | ❌ OOM | - |
Best result: Run 3 (outputs/mlss_2k_6v_512x960/gaussians/000000.ply)
Questions for Authors
1. Data format verification: We followed the official convert_dl3dv_test.py structure with COLMAP as the input source:
   - Normalized intrinsics: ✅
   - W2C matrices in OpenCV convention: ✅
   - Wrapped in list format: ✅
   - Inference ran correctly and produced splats
   Is this the correct approach for COLMAP data?
2. Memory optimization: Are there recommended settings to reduce VRAM usage at higher resolutions? We tried PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True and test.render_chunk_size=5, but still hit OOM with 12 views @ 512×960.
3. Metrics computation: Is LPIPS the bottleneck? Could we compute only PSNR/SSIM without LPIPS?
4. Expected quality: With 6 views @ 512×960, is the result comparable to your paper's quality, or would 12 views significantly improve it?
Hardware Context
- RTX 4090 worked perfectly with CUDA 12.4 + xformers 0.0.27.post2
- Previous attempt on RTX 5090 failed due to xformers compatibility issues
- 24GB VRAM appears limiting for higher-resolution inference with multiple views
Generated: March 5, 2026