Hi Orion AI Lab Team,
I've been profiling the TIRAuxCloud data pipeline and identified a significant performance bottleneck during the initialization phase of the MultiBandTiffDataset.
The Problem:
Currently, preload_band_maps performs a linear scan of every TIFF file in the dataset to extract band descriptions. For large datasets (e.g., 20k+ patches), this creates an O(N) startup delay that can take several minutes due to redundant disk I/O.
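For illustration, the eager preload presumably boils down to a loop like the one below (function and parameter names are my assumption, not the actual implementation); the key point is that every file is opened at construction time, so startup I/O grows linearly with the number of patches:

```python
def preload_band_maps_eager(paths, read_band_map):
    """Hypothetical sketch of the current eager scan.

    `read_band_map` stands in for whatever helper extracts band
    descriptions from a single TIFF. Every path triggers a disk read
    before training can start -- O(N) I/O operations up front.
    """
    band_maps = {}
    for path in paths:  # one disk read per file, all before the first batch
        band_maps[path] = read_band_map(path)
    return band_maps
```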
Observations:
Looking at the training_file_lists, the data is already partitioned by sensor (Landsat vs. VIIRS). Since band mappings are consistent within these data products, re-reading metadata for every individual file is unnecessary.
Proposed Optimization:
I suggest implementing a Lazy Metadata Cache. Instead of a proactive scan, the loader would:
1. Identify the sensor "fingerprint" (via the filename or the first band's header, whichever you prefer for datasets that are already separated by sensor versus a merged dataset combining multiple satellites).
2. Cache the band mapping for that sensor type.
3. Reuse the cached map for all subsequent files of the same type.
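The steps above could be sketched roughly as follows (a minimal illustration, not a drop-in patch: the class name, the filename patterns, and the injected `read_band_map` helper are all assumptions to be adapted to the actual codebase):

```python
import re
from typing import Callable, Dict

class LazyBandMapCache:
    """Caches band-name -> band-index maps keyed by sensor fingerprint.

    `read_band_map` is a hypothetical callable (e.g. wrapping whatever
    TIFF reader the dataset already uses) that extracts the band map
    from a single file. It is invoked at most once per sensor type.
    """

    # Hypothetical filename patterns; adjust to the dataset's naming scheme.
    SENSOR_PATTERNS = {
        "landsat": re.compile(r"landsat|LC0[89]", re.IGNORECASE),
        "viirs": re.compile(r"viirs|VNP|VJ1", re.IGNORECASE),
    }

    def __init__(self, read_band_map: Callable[[str], Dict[str, int]]):
        self._read_band_map = read_band_map
        self._cache: Dict[str, Dict[str, int]] = {}

    def fingerprint(self, path: str) -> str:
        """Step 1: derive a sensor key from the filename."""
        for sensor, pattern in self.SENSOR_PATTERNS.items():
            if pattern.search(path):
                return sensor
        # Unknown sensors fall back to a per-path key (one read each).
        return path

    def band_map(self, path: str) -> Dict[str, int]:
        """Steps 2-3: read once per sensor type, then reuse the cached map."""
        key = self.fingerprint(path)
        if key not in self._cache:
            # First file of this sensor type: pay the I/O cost exactly once.
            self._cache[key] = self._read_band_map(path)
        return self._cache[key]
```

With this in place, only the first file of each sensor type touches disk; the remaining lookups are dictionary hits.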
This would reduce the startup cost from thousands of I/O operations to virtually zero after the first sample is cached, making the framework significantly more scalable.
I'd be happy to submit a PR with this refactored logic if you're open to it!