
write_element and delete_element_from_disk are very slow when SpatialData object contains large number of elements #938

@mjheid


I work on a dataset with 1000 images and create and delete many labels for these images. Writing and deleting label elements via write_element / delete_element_from_disk can take up to 60 s per element when the SpatialData object contains more than 10k elements. Most of the slowdown occurs in elements_paths_on_disk in spatialdata._core.spatialdata. The following change fixed the issue for me; with it, saving ~1000 labels takes about 40 s:

def elements_paths_on_disk(self) -> list[str]:
    """
    Get the paths of the elements saved in the Zarr store.

    Returns
    -------
    A list of paths of the elements saved in the Zarr store.
    """
    if self.path is None:
        raise ValueError("The SpatialData object is not backed by a Zarr store.")
    store = parse_url(self.path, mode="r").store
    elements_in_zarr = []

    # List only the top-level groups and the direct children of each
    # element-type group.
    groups_stored = store.listdir()
    for group in groups_stored:
        if group in ["images", "labels", "points", "shapes"]:
            group_elems = [os.path.join(group, elem) for elem in store.listdir(group)]
            elements_in_zarr.extend(group_elems)
    return elements_in_zarr
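
The listing strategy used in this change (read the element-type groups, then only their direct children) can be tried out without spatialdata. The following is a minimal sketch assuming a directory-backed store; `list_element_paths`, `make_toy_store` and the toy layout are hypothetical names for illustration and not part of the spatialdata API.

```python
import os
import tempfile

# Element-type groups, as in the snippet above.
ELEMENT_GROUPS = ("images", "labels", "points", "shapes")


def list_element_paths(root):
    """Return 'group/element' paths by listing two directory levels only."""
    paths = []
    for group in ELEMENT_GROUPS:
        group_dir = os.path.join(root, group)
        if not os.path.isdir(group_dir):
            continue
        with os.scandir(group_dir) as entries:
            # Only direct children are read; the contents of each element
            # (chunks, nested arrays) are never visited.
            paths.extend(f"{group}/{e.name}" for e in entries if e.is_dir())
    return sorted(paths)


def make_toy_store(root):
    """Create a tiny directory tree that mimics the store layout (hypothetical)."""
    for rel in ("images/img0/0", "labels/lab0/0", "labels/lab1/0", "tables/t0"):
        os.makedirs(os.path.join(root, rel))


with tempfile.TemporaryDirectory() as tmp:
    make_toy_store(tmp)
    found = list_element_paths(tmp)

print(found)  # ['images/img0', 'labels/lab0', 'labels/lab1']
```

The cost of this listing grows with the number of elements, not with the number of files stored inside each element.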

In delete_element_from_disk, the call to write_consolidated_metadata takes a long time (~1 min). Users who expect to delete many images should call sdata.write() with consolidate_metadata=False. Alternatively, delete_element_from_disk could be rewritten to accept a list of elements and call write_consolidated_metadata only once, after the whole list has been processed.
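
The second suggestion, consolidating once per batch instead of once per element, can be sketched with a toy class. `ToyStore` and its methods are hypothetical stand-ins and not the spatialdata API; `consolidate` only counts calls in place of the expensive metadata rewrite.

```python
class ToyStore:
    """Hypothetical stand-in for an on-disk store; not the spatialdata API."""

    def __init__(self, names):
        self.elements = set(names)
        self.consolidations = 0

    def consolidate(self):
        # Stands in for the expensive consolidated-metadata rewrite.
        self.consolidations += 1

    def delete(self, name, consolidate=True):
        # Per-element deletion: consolidates after every call by default.
        self.elements.remove(name)
        if consolidate:
            self.consolidate()

    def delete_many(self, names):
        # Batched deletion: skip consolidation per element, do it once at the end.
        for name in names:
            self.delete(name, consolidate=False)
        self.consolidate()


names = [f"test{i}" for i in range(1500)]

one_by_one = ToyStore(names)
for name in names:
    one_by_one.delete(name)

batched = ToyStore(names)
batched.delete_many(names)

print(one_by_one.consolidations, batched.consolidations)  # 1500 1
```

If each consolidation takes on the order of a minute, as reported above, the difference between 1500 calls and a single call dominates the total runtime.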

Code to reproduce problem:

from spatialdata.datasets import blobs
import numpy as np
import time
import spatialdata as sd

sdata = blobs()
sdata.write('test', consolidate_metadata=True)

test = np.empty((1,1), dtype=np.uint8)
for i in range(1500):
    sdata[f'test{i}'] = sd.models.Labels2DModel().parse(test, dims=('y','x'))
    start = time.time()
    sdata.write_element(f'test{i}')
    print(f'Wrote test{i} in ', time.time()-start)
for i in range(1500):
    start = time.time()
    sdata.delete_element_from_disk(f'test{i}')
    print(f'Deleted test{i} in ', time.time()-start)
