Description
I work on a dataset with 1000 images and create and delete a lot of labels for these images. Writing and deleting label objects via write_element / delete_element_from_disk can take up to 60 s per element when there are >10k elements in the SpatialData object. The slowdown mostly happens in elements_paths_on_disk in spatialdata._core.spatialdata. The following change fixed the issue for me; with it, saving ~1000 labels takes about 40 s:
def elements_paths_on_disk(self) -> list[str]:
    """
    Get the paths of the elements saved in the Zarr store.

    Returns
    -------
    A list of paths of the elements saved in the Zarr store.
    """
    if self.path is None:
        raise ValueError("The SpatialData object is not backed by a Zarr store.")
    store = parse_url(self.path, mode="r").store
    elements_in_zarr = []
    groups_stored = store.listdir()
    for group in groups_stored:
        if group in ["images", "labels", "points", "shapes"]:
            group_elems = [os.path.join(group, elem) for elem in store.listdir(group)]
            elements_in_zarr.extend(group_elems)
    return elements_in_zarr
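To check that the listing step really is the bottleneck, elements_paths_on_disk can be timed on its own. This is a minimal sketch; 'test' is a hypothetical store path matching the reproduction code below, and it assumes the store has already been written:

import time

import spatialdata as sd

# Assumes a Zarr store already exists at 'test' (hypothetical path,
# matching the reproduction code below).
sdata = sd.read_zarr('test')

# Time only the path-listing step that write_element and
# delete_element_from_disk rely on.
start = time.perf_counter()
paths = sdata.elements_paths_on_disk()
print(f'Listed {len(paths)} element paths in {time.perf_counter() - start:.2f} s')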
In delete_element_from_disk, the call to write_consolidated_metadata also takes a long time (~1 min). Users who expect to delete many elements should call sdata.write() with consolidate_metadata=False, or delete_element_from_disk should be rewritten so that, when it is given a list of elements to delete, write_consolidated_metadata is called only once after the last deletion. A sketch of the first workaround is shown below.
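A minimal sketch of that workaround, under the assumption that delete_element_from_disk only re-consolidates when the store already contains consolidated metadata; 'test_no_consolidation' is a hypothetical path for a store written with consolidate_metadata=False:

import spatialdata as sd

# Hypothetical path; assumes the store was written with
# consolidate_metadata=False, so each deletion skips re-consolidation
# (assumption: delete_element_from_disk only re-consolidates when the
# store already contains consolidated metadata).
sdata = sd.read_zarr('test_no_consolidation')

for name in [f'test{i}' for i in range(1500)]:
    sdata.delete_element_from_disk(name)

# Consolidate the metadata a single time after all deletions.
sdata.write_consolidated_metadata()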
Code to reproduce the problem:
from spatialdata.datasets import blobs
import numpy as np
import time
import spatialdata as sd

sdata = blobs()
sdata.write('test', consolidate_metadata=True)
test = np.empty((1, 1), dtype=np.uint8)

for i in range(1500):
    sdata[f'test{i}'] = sd.models.Labels2DModel().parse(test, dims=('y', 'x'))
    start = time.time()
    sdata.write_element(f'test{i}')
    print(f'Wrote test{i} in ', time.time() - start)

for i in range(1500):
    start = time.time()
    sdata.delete_element_from_disk(f'test{i}')
    print(f'Deleted test{i} in ', time.time() - start)