Multi-document validation #41

@clbarnes

In Rust, ome_zarr_metadata takes a similar strategy of representing and validating metadata without performing IO. However, a lot of OME-Zarr validation needs metadata from node B in order to properly validate node A (e.g. transformations which refer to arrays by path).

The strategy I'm working on in a local branch is to allow all OME-Zarr objects to report any references they have, so that the caller can fetch those nodes' metadata however they deem appropriate and then pass it back in for a second round of validation which only addresses the remote references. There's scope for a shallow version (where only direct references from the node of interest are gathered) or a deep one (where you keep expanding the graph whenever a referred-to node has its own references).

The API looks something like this:

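# Assumed imports; the original sketch omits them.
from abc import ABC, abstractmethod
from dataclasses import dataclass
from pathlib import Path

@dataclass
class ContextValue:
    zarr_metadata: ZarrV3Metadata
    ome: OMEZarrMetadata

@dataclass
class ValidationContext:
    node_path: Path
    """Absolute path to this Zarr node within the store."""

    references: dict[Path, ContextValue]
    """Paths must be absolute within the store and must contain `node_path`."""

    def index_relative(self, path: Path) -> ContextValue:
        abs_path = self.node_path / path
        return self.references[abs_path]

# Inheriting from ABC is what makes @abstractmethod enforceable at instantiation.
class ReferrerMixin(ABC):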
    """API for an OME-Zarr metadata object which may refer to other Zarr nodes."""
    @abstractmethod
    def gather_references(self, paths: set[Path]):
        pass

    @abstractmethod
    def validate_references(self, context: ValidationContext):
        pass

class MultiscaleImageDataset(ReferrerMixin):
    def gather_references(self, paths: set[Path]):
        paths.add(self.path)

    def validate_references(self, context: ValidationContext):
        ctx = context.index_relative(self.path)
        assert ctx.zarr_metadata.is_array()

class MultiscaleImage(ReferrerMixin):
    def gather_references(self, paths: set[Path]):
        for ds in self.datasets:
            ds.gather_references(paths)
        ...
   
    def validate_references(self, context: ValidationContext):
        dtypes = set()
        for ds in self.datasets:
            ds.validate_references(context)
            ds_ctx = context.index_relative(ds.path)
            dtypes.add(ds_ctx.zarr_metadata.data_type)
            assert len(dtypes) <= 1
            ...
        ...

# and so on for other OME-Zarr metadata classes

store_url = "https://my.zarr.store"
node_root = Path("/node/of/interest")
node_of_interest = get_zarr_node(store_url, node_root)
root_ome = parse_ome(node_of_interest.attributes)
refs = set()
root_ome.gather_references(refs)
context = ValidationContext(node_root, {node_root: ContextValue(node_of_interest, root_ome)})
for ref in refs:
    path = node_root / ref
    node = get_zarr_node(store_url, path)
    # ValidationContext is not subscriptable; add to its references dict instead
    context.references[path] = ContextValue(node, parse_ome(node.attributes))
    # extend the graph if you want

root_ome.validate_references(context)
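
The loop above is the shallow version. For the deep version, one possible shape is a breadth-first worklist that keeps gathering until no new paths turn up. This is only a sketch under assumptions: `gather_deep`, `Referrer`, `fetch`, and `FakeNode` are hypothetical names that are not part of the branch, and it assumes every reference is relative to the node that reports it. The caller still owns all IO through `fetch`.

```python
from __future__ import annotations

from collections import deque
from dataclasses import dataclass, field
from pathlib import PurePosixPath
from typing import Callable, Protocol


class Referrer(Protocol):
    """Anything exposing gather_references as in the sketch above."""

    def gather_references(self, paths: set[PurePosixPath]) -> None: ...


def gather_deep(
    root_path: PurePosixPath,
    root_ome: Referrer,
    fetch: Callable[[PurePosixPath], Referrer],
) -> dict[PurePosixPath, Referrer]:
    """Expand the reference graph breadth-first, starting at root_path.

    `fetch` is a hypothetical caller-supplied function that loads and parses
    the node at one absolute path. Each path is fetched at most once.
    """
    found: dict[PurePosixPath, Referrer] = {root_path: root_ome}
    queue = deque([root_path])
    while queue:
        current = queue.popleft()
        relative: set[PurePosixPath] = set()
        found[current].gather_references(relative)
        for rel in relative:
            absolute = current / rel  # assumed: refs are relative to the referrer
            if absolute not in found:  # skip repeats, which also stops cycles
                found[absolute] = fetch(absolute)
                queue.append(absolute)
    return found


@dataclass
class FakeNode:
    """In-memory stand-in for parsed OME metadata, for demonstration only."""

    refs: list[str] = field(default_factory=list)

    def gather_references(self, paths: set[PurePosixPath]) -> None:
        paths.update(PurePosixPath(r) for r in self.refs)


# A toy store: an image with two arrays and a labels group that has its own refs.
store = {
    PurePosixPath("/img"): FakeNode(["0", "1", "labels"]),
    PurePosixPath("/img/0"): FakeNode(),
    PurePosixPath("/img/1"): FakeNode(),
    PurePosixPath("/img/labels"): FakeNode(["cells"]),
    PurePosixPath("/img/labels/cells"): FakeNode(["0"]),
    PurePosixPath("/img/labels/cells/0"): FakeNode(),
}
root = PurePosixPath("/img")
graph = gather_deep(root, store[root], store.__getitem__)
print(sorted(str(p) for p in graph))
```

A shallow pass over the same toy store would stop at the three direct children of `/img`. Note that each deeply gathered node would presumably need its own `ValidationContext` (with its own `node_path`) for the second validation round, since `index_relative` resolves against a single node.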
