Multi-document validation #41

In Rust, [ome_zarr_metadata](https://github.com/zarrs/ome_zarr_metadata) takes a similar strategy of representing and validating metadata without performing IO. However, a lot of OME-Zarr validation needs metadata from Node B in order to properly validate Node A (e.g. transformations that refer to arrays by path).

The strategy I'm working on in a local branch is to let every OME-Zarr object report the references it holds, so that the caller can fetch those nodes' metadata however they see fit and then pass it back in for a second round of validation which _only_ addresses the remote references. There's scope for a shallow version (where only direct references from the node of interest are gathered) or a deep one (where you keep expanding the graph whenever a referred-to node has its own references).

The API looks something like:

```python
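from __future__ import annotations  # metadata types below are placeholders

from abc import ABC, abstractmethod
from dataclasses import dataclass
from pathlib import Path


@dataclass
class ContextValue:
    zarr_metadata: ZarrV3Metadata
    ome: OMEZarrMetadata

@dataclass
class ValidationContext:
    node_path: Path
    """Absolute path to this Zarr node within the store."""

    references: dict[Path, ContextValue]
    """Keys are absolute paths within the store; `node_path` itself must be present."""

    def index_relative(self, path: Path) -> ContextValue:
        # References are given relative to the node of interest.
        abs_path = self.node_path / path
        return self.references[abs_path]

class ReferrerMixin(ABC):
    """API for an OME-Zarr metadata object which may refer to other Zarr nodes."""
    @abstractmethod
    def gather_references(self, paths: set[Path]) -> None:
        """Add the relative path of every referenced node to `paths`."""

    @abstractmethod
    def validate_references(self, context: ValidationContext) -> None:
        """Validate against the metadata of referenced nodes held in `context`."""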

class MultiscaleImageDataset(ReferrerMixin):
    def gather_references(self, paths: set[Path]):
        paths.add(self.path)

    def validate_references(self, context: ValidationContext):
        ctx = context.index_relative(self.path)
        assert ctx.zarr_metadata.is_array()

class MultiscaleImage(ReferrerMixin):
    def gather_references(self, paths: set[Path]):
        for ds in self.datasets:
            ds.gather_references(paths)
        ...
   
    def validate_references(self, context: ValidationContext):
        dtypes = set()
        for ds in self.datasets:
            ds.validate_references(context)
            ds_ctx = context.index_relative(ds.path)
            dtypes.add(ds_ctx.zarr_metadata.data_type)
            assert len(dtypes) <= 1
            ...
        ...

# and so on for other OME-Zarr metadata classes

store_url = "https://my.zarr.store"
node_root = Path("/node/of/interest")
node_of_interest = get_zarr_node(store_url, node_root)
root_ome = parse_ome(node_of_interest.attributes)
refs = set()
root_ome.gather_references(refs)
context = ValidationContext(node_root, {node_root: ContextValue(node_of_interest, root_ome)})
for ref in refs:
    path = node_root / ref
    node = get_zarr_node(store_url, path)
    context.references[path] = ContextValue(node, parse_ome(node.attributes))
    # extend the graph if you want

root_ome.validate_references(context)
```
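For the deep variant, the "extend the graph" step could be a breadth-first walk over the reference graph. The following is only a minimal sketch under some assumptions: `gather_deep`, `Referrer` and `FakeNode` are hypothetical names, `fetch` stands in for whatever IO the caller chooses, each reference is taken to be relative to the node that declares it, and paths contain no `..` segments. It uses `PurePosixPath` rather than `Path` so that store paths behave the same on every platform.

```python
from collections import deque
from pathlib import PurePosixPath
from typing import Callable, Protocol


class Referrer(Protocol):
    """Anything that can report the relative paths it refers to."""

    def gather_references(self, paths: set[PurePosixPath]) -> None: ...


def gather_deep(
    root_path: PurePosixPath,
    fetch: Callable[[PurePosixPath], Referrer],
) -> dict[PurePosixPath, Referrer]:
    """Expand the reference graph breadth-first, starting at `root_path`.

    Returns every reachable node keyed by its absolute path in the store.
    """
    seen: dict[PurePosixPath, Referrer] = {}
    queue = deque([root_path])
    while queue:
        path = queue.popleft()
        if path in seen:  # guards against cycles and repeated references
            continue
        node = fetch(path)  # caller-supplied IO
        seen[path] = node
        relative: set[PurePosixPath] = set()
        node.gather_references(relative)
        # Resolve each reference against the node that declared it.
        queue.extend(path / rel for rel in relative)
    return seen


class FakeNode:
    """In-memory stand-in for parsed metadata (illustration only)."""

    def __init__(self, *refs: str):
        self.refs = [PurePosixPath(r) for r in refs]

    def gather_references(self, paths: set[PurePosixPath]) -> None:
        paths.update(self.refs)


# A made-up store layout; the node names carry no meaning.
store = {
    PurePosixPath("/img"): FakeNode("0", "1", "labels"),
    PurePosixPath("/img/0"): FakeNode(),
    PurePosixPath("/img/1"): FakeNode(),
    PurePosixPath("/img/labels"): FakeNode("cells"),
    PurePosixPath("/img/labels/cells"): FakeNode(),
}
found = gather_deep(PurePosixPath("/img"), store.__getitem__)
print(sorted(str(p) for p in found))
```

The shallow variant is the same loop stopped after the first node. Keeping `fetch` as a plain callable means the metadata classes never perform IO themselves.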
