
Very Poor Reconstruction with Custom Images or am I doing it WRONG? #87

@AntarCreates

Description


Hi, I tried to reconstruct my custom scene with DepthSplat from COLMAP data using a pretrained weight, but the quality turned out extremely bad (image below for reference). Based on the community demos (many of which use the standard datasets), I assume I did something wrong. A complete report on my approach follows.

This is from DepthSplat:

[Image]

This is from 3DGS as a baseline:

[Image]

DepthSplat on COLMAP Data - Inference Report

Dataset

  • Source: Custom COLMAP reconstruction with 273 registered images
  • Original resolution: 3840×2160 (4K)
  • Camera model: PINHOLE
  • Scene: Indoor scan (mlss_data)

Setup

  • Environment: Conda with Python 3.10
  • PyTorch: 2.4.0 with CUDA 12.4
  • GPU: NVIDIA RTX 4090 (24GB VRAM)
  • Model: depthsplat-gs-base-re10kdl3dv-448x768-randview2-6-f8ddd845.pth

Note: Setup worked flawlessly on the RTX 4090 with CUDA 12.4. xformers 0.0.27.post2 had no compatibility issues (unlike the RTX 5090, which did have xformers problems).

Data Preparation

Approach: Followed official dataset conversion methodology from src/scripts/convert_dl3dv_test.py, adapted for COLMAP input.

1. Image Resizing

Resized COLMAP-registered images from 4K to 1920×1080 (2K) to meet the "2k+" requirement:

# Created images_2 directory with 273 images at 1920×1080
python resize_script.py  # Using PIL with LANCZOS resampling
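A minimal sketch of what `resize_script.py` does, assuming the usual `images/` → `images_2/` layout (the directory names and JPEG quality setting here are assumptions, not the exact script):

```python
# Downscale 4K COLMAP-registered images to 1920x1080 with LANCZOS
# resampling, writing results into an images_2 directory.
from pathlib import Path
from PIL import Image

SRC = Path("images")    # 3840x2160 inputs (assumed location)
DST = Path("images_2")  # 1920x1080 outputs
DST.mkdir(exist_ok=True)

if SRC.exists():
    for p in sorted(SRC.glob("*.jpg")):
        img = Image.open(p)
        img = img.resize((1920, 1080), Image.LANCZOS)
        img.save(DST / p.name, quality=95)
```

Resizing before (rather than after) conversion keeps the normalized intrinsics consistent, since fx/w and fy/h are unchanged by a uniform downscale.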

2. Conversion to .torch Format

Converted COLMAP binary format to DepthSplat's expected format per official guidelines:

Key data structure:

{
    'key': '000000',
    'url': 'mlss_colmap_scene',
    'timestamps': torch.tensor([0, 1, 2, ..., 272], dtype=int64),
    'cameras': torch.tensor(shape=[273, 18], dtype=float32),
    'images': [list of 273 JPG tensors as raw bytes]
}

Camera format (18 values per camera):

  • [fx/w, fy/h, cx/w, cy/h, 0.0, 0.0, w2c_00, w2c_01, ..., w2c_23]
  • Intrinsics are normalized by image dimensions (per README camera conventions)
  • Extrinsics are W2C (world-to-camera) matrices flattened to 12 values, OpenCV convention
  • COLMAP already provides W2C matrices (from qvec/tvec), stored directly
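The per-camera row described above can be sketched as follows; `qvec2rotmat` follows the convention of COLMAP's `read_write_model.py`, and `camera_row` is a hypothetical helper name:

```python
# Assemble one 18-value camera row from COLMAP qvec/tvec plus PINHOLE
# intrinsics: [fx/w, fy/h, cx/w, cy/h, 0, 0, <3x4 W2C flattened>].
import numpy as np

def qvec2rotmat(q):
    """COLMAP quaternion (w, x, y, z) -> 3x3 world-to-camera rotation."""
    w, x, y, z = q
    return np.array([
        [1 - 2*y*y - 2*z*z, 2*x*y - 2*z*w,     2*x*z + 2*y*w],
        [2*x*y + 2*z*w,     1 - 2*x*x - 2*z*z, 2*y*z - 2*x*w],
        [2*x*z - 2*y*w,     2*y*z + 2*x*w,     1 - 2*x*x - 2*y*y],
    ])

def camera_row(fx, fy, cx, cy, w, h, qvec, tvec):
    R = qvec2rotmat(qvec)                                              # W2C rotation
    w2c = np.concatenate([R, np.asarray(tvec).reshape(3, 1)], axis=1)  # 3x4 [R | t]
    intr = [fx / w, fy / h, cx / w, cy / h, 0.0, 0.0]                  # normalized
    return np.array(intr + w2c.flatten().tolist(), dtype=np.float32)   # shape (18,)
```

Because COLMAP's qvec/tvec are already world-to-camera, no inversion is needed before flattening.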

Verified: our data structure matches the official DL3DV format:

# Example from our dataset:
Cameras shape: torch.Size([273, 18])
Intrinsics: fx/w=0.6473, fy/h=1.1566, cx/w=0.5000, cy/h=0.5000
W2C matrix (3x4): stored as 12 flattened values

Important: Data must be wrapped in a list: [data_dict] when saving to .torch file.
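A minimal sketch of the save step, using the paths from the dataset structure below (the tensor contents here are placeholders, not real data):

```python
# Save the chunk wrapped in a list, [data_dict], to match the
# DL3DV-style .torch chunk layout described above.
from pathlib import Path
import torch

out_dir = Path("datasets/mlss_2k/test")
out_dir.mkdir(parents=True, exist_ok=True)

data_dict = {
    "key": "000000",
    "url": "mlss_colmap_scene",
    "timestamps": torch.arange(273, dtype=torch.int64),
    "cameras": torch.zeros(273, 18, dtype=torch.float32),  # filled from COLMAP
    "images": [],  # raw JPG bytes per frame in practice
}
torch.save([data_dict], out_dir / "000000.torch")
```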

3. Dataset Structure

depthsplat/
├── datasets/
│   └── mlss_2k/
│       └── test/
│           ├── 000000.torch
│           └── index.json  # {"000000": {}}
└── assets/
    └── mlss_2k_eval_index.json

Evaluation index format:

{
  "000000": {
    "context": [0, 45, 90, 135, 180, 225],
    "target": [1, 2, 3, ...]
  }
}
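The index above can be generated programmatically; this sketch picks N evenly spaced context views out of the 273 frames and uses the rest as render targets (the output path is the one from the dataset structure):

```python
# Write an evaluation index with evenly spaced context views;
# all remaining frames become targets.
import json
from pathlib import Path

num_frames, num_context = 273, 6
step = num_frames // num_context                    # 45 here
context = [i * step for i in range(num_context)]    # [0, 45, 90, 135, 180, 225]
target = [i for i in range(num_frames) if i not in context]

Path("assets").mkdir(exist_ok=True)
with open("assets/mlss_2k_eval_index.json", "w") as f:
    json.dump({"000000": {"context": context, "target": target}}, f, indent=2)
```

Changing `num_context` to 8 reproduces the [0, 34, 68, ...] spacing used in Attempt 2.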

Inference Commands

Attempt 1: 4 views @ 256×448 (Low Quality)

PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True python -m src.main \
  +experiment=dl3dv \
  dataset.roots=[datasets/mlss_2k] \
  dataset.image_shape=[256,448] \
  dataset.ori_image_shape=[1080,1920] \
  model.encoder.num_scales=2 \
  model.encoder.upsample_factor=4 \
  model.encoder.lowest_feature_resolution=8 \
  model.encoder.monodepth_vit_type=vitb \
  model.encoder.gaussian_adapter.gaussian_scale_max=0.1 \
  checkpointing.pretrained_model=pretrained/depthsplat-gs-base-re10kdl3dv-448x768-randview2-6-f8ddd845.pth \
  mode=test \
  dataset/view_sampler=evaluation \
  dataset.view_sampler.num_context_views=4 \
  dataset.view_sampler.index_path=assets/mlss_2k_eval_index.json \
  test.save_gaussian=true \
  test.compute_scores=false \
  output_dir=outputs/mlss_2k_test

Result: 41MB .ply file, poor quality

Attempt 2: 8 views @ 448×768 (Better Quality)

# Same command, changed:
dataset.image_shape=[448,768]
dataset.view_sampler.num_context_views=8
# Updated context views in eval index to: [0, 34, 68, 102, 136, 170, 204, 238]

Result: 169MB .ply file, improved quality

Attempt 3: 6 views @ 512×960 (Best Quality - SUCCESS)

# Same command, changed:
dataset.image_shape=[512,960]
dataset.view_sampler.num_context_views=6
test.render_chunk_size=5  # Added for memory management
# Updated context views in eval index to: [0, 45, 90, 135, 180, 225]

Result: 183MB .ply file, best achievable quality

  • Encoder: 0.76s per scene
  • Decoder: ~0.007s per rendered view
  • Peak VRAM: 14.4 GB

Issues Encountered

1. CUDA Out of Memory (OOM)

Problem: Higher view counts and resolutions consistently hit OOM on 24GB RTX 4090.

Configurations that failed:

  • 12 views @ 512×960: OOM during encoding (tried to allocate 11.25 GiB)
  • 12 views @ 448×768: OOM during rendering (tried to allocate 19.93 GiB)
  • 6 views @ 512×960 with metrics: OOM during LPIPS computation (tried to allocate 5.62 GiB)

Workaround: Settled on 6 views @ 512×960 without metrics computation.

2. Metrics Computation Failed

Problem: test.compute_scores=true with LPIPS requires ~5-6GB additional VRAM beyond rendering.

Impact: Could not generate SSIM/PSNR/LPIPS metrics even at 448×768 resolution.

Attempted:

  • 6 views @ 512×960 with metrics: OOM
  • 6 views @ 448×768 with metrics: OOM

3. Memory Scaling

Observation: Memory usage scales significantly with:

  • Number of context views (more views = more Gaussians generated)
  • Input resolution (higher res = larger feature maps)
  • LPIPS perceptual loss network (separate VGG-based network)

Note: The memory bottleneck appears to be in the rendering phase where Gaussians are rasterized, not during encoding.

Results Summary

| Run | Views | Input Res | Output Size | Status | VRAM Peak |
|-----|-------|-----------|-------------|--------|-----------|
| 1 | 4 | 256×448 | 41 MB | ✅ Success | ~8 GB |
| 2 | 8 | 448×768 | 169 MB | ✅ Success | ~11 GB |
| 3 | 6 | 512×960 | 183 MB | ✅ Success | 14.4 GB |
| 4 | 12 | 512×960 | - | ❌ OOM | - |
| 5 | 12 | 448×768 | - | ❌ OOM | - |
| 6 | 6 | 512×960 + metrics | - | ❌ OOM | - |

Best result: Run 3 (outputs/mlss_2k_6v_512x960/gaussians/000000.ply)

Questions for Authors

  1. Data format verification: We followed the official convert_dl3dv_test.py structure with COLMAP as input source:

    • Normalized intrinsics: ✅
    • W2C matrices in OpenCV convention: ✅
    • Wrapped in list format: ✅
• Inference ran correctly and successfully produced splats

    Is this the correct approach for COLMAP data?

  2. Memory optimization: Are there recommended settings to reduce VRAM usage for higher resolutions? We tried:

    • PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
    • test.render_chunk_size=5
    • Still hit OOM with 12 views @ 512×960
  3. Metrics computation: Is LPIPS the bottleneck? Could we compute only PSNR/SSIM without LPIPS?

  4. Expected quality: With 6 views @ 512×960, is the result comparable to your paper's quality, or would 12 views significantly improve it?

Hardware Context

  • RTX 4090 worked perfectly with CUDA 12.4 + xformers 0.0.27.post2
  • Previous attempt on RTX 5090 failed due to xformers compatibility issues
  • 24GB VRAM appears limiting for higher-resolution inference with multiple views

Generated: March 5, 2026
