[Performance] Reducing I/O Overhead in MultiBandTiffDataset Startup #1

@karthik-0306

Hi Orion AI Lab Team,

I've been profiling the TIRAuxCloud data pipeline and identified a significant performance bottleneck during the initialization phase of the MultiBandTiffDataset.

The Problem:
Currently, preload_band_maps performs a linear scan of every TIFF file in the dataset to extract band descriptions. For large datasets (e.g., 20k+ patches), this creates an O(N) startup delay, where N is the number of files, that can take several minutes, most of it spent on redundant disk I/O.

Observations:
Looking at the training_file_lists, the data is already partitioned by sensor (Landsat vs. VIIRS). Since band mappings are consistent within these data products, re-reading metadata for every individual file is unnecessary.
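
The consistency assumption is easy to spot-check before relying on it. Below is a minimal sketch, not code from the repository: `fingerprint` and `read_band_map` are hypothetical callables (path to sensor key, and path to band mapping), and the sample size is arbitrary.

```python
import random
from collections import defaultdict


def check_band_maps_consistent(paths, fingerprint, read_band_map,
                               samples_per_group=5, seed=0):
    """Sample a few files per sensor group and compare their band maps.

    `fingerprint` and `read_band_map` are assumed callables supplied by the
    caller; they are placeholders, not functions from the actual codebase.
    Returns {sensor_key: [paths whose map differs from the group's reference]}.
    """
    # Group file paths by their sensor fingerprint.
    groups = defaultdict(list)
    for path in paths:
        groups[fingerprint(path)].append(path)

    rng = random.Random(seed)  # seeded so the check is reproducible
    inconsistent = {}
    for key, members in groups.items():
        sample = rng.sample(members, min(samples_per_group, len(members)))
        reference = read_band_map(sample[0])
        mismatched = [p for p in sample[1:] if read_band_map(p) != reference]
        if mismatched:
            inconsistent[key] = mismatched
    return inconsistent
```

An empty result only means the sampled files agree; it does not prove that every file in a group shares the same mapping.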

Proposed Optimization:
I suggest implementing a Lazy Metadata Cache. Instead of a proactive scan, the loader would:

Identify the sensor "fingerprint" (via the filename or the first-band header, whichever you prefer). This works both for datasets that are already separated by sensor and for a merged dataset drawn from several satellites.
Cache the mapping for that sensor type.
Reuse the cached map for all subsequent files of the same type.

This would reduce the startup cost from one metadata read per file (thousands of I/O operations) to one read per sensor type, incurred the first time a sample of that type is loaded, making the framework significantly more scalable.
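
To make the proposal concrete, here is a rough sketch of the three steps. It is not the refactored loader itself: `LazyBandMapCache`, `fingerprint_from_filename`, and `read_band_map_rasterio` are names made up for illustration, the filename convention (sensor name before the first underscore) is an assumption, and the use of rasterio is also an assumption about how the TIFF headers would be read.

```python
from pathlib import Path
from typing import Callable, Dict

BandMap = Dict[str, int]  # band description -> 1-based band index


def fingerprint_from_filename(path: str) -> str:
    """Hypothetical fingerprint: filename prefix before the first underscore."""
    return Path(path).name.split("_", 1)[0].lower()


def read_band_map_rasterio(path: str) -> BandMap:
    """Read band descriptions from a single TIFF header (assumes rasterio)."""
    import rasterio  # imported lazily so the cache has no hard dependency

    with rasterio.open(path) as src:
        # src.descriptions holds one entry per band; unset entries are None.
        return {d: i for i, d in enumerate(src.descriptions, start=1) if d}


class LazyBandMapCache:
    """Reads band metadata once per sensor fingerprint, on first use."""

    def __init__(self,
                 read_band_map: Callable[[str], BandMap],
                 fingerprint: Callable[[str], str] = fingerprint_from_filename):
        self._read = read_band_map
        self._fingerprint = fingerprint
        self._maps: Dict[str, BandMap] = {}

    def get(self, path: str) -> BandMap:
        key = self._fingerprint(path)      # step 1: identify the sensor
        if key not in self._maps:
            self._maps[key] = self._read(path)  # step 2: cache on first sight
        return self._maps[key]             # step 3: reuse for later files
```

With this shape, the dataset would call `cache.get(path)` when loading a sample instead of scanning everything up front. If the dataset is used with multi-process loading, each worker process would hold its own cache, so the cost becomes one read per sensor type per worker rather than one per file.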

I'd be happy to submit a PR with this refactored logic if you're open to it!
