URL syntax for icechunk #576

@jbms

I'm in the process of implementing support for ZEP 8-style URL syntax in Neuroglancer, and alongside that I'm planning to add support for the icechunk format.

Here are some examples of existing URLs that are supported or will be supported:

gs://bucket/path/to/array/|zarr3:
gs://bucket/path/to/group/|zarr3:path/to/array/
gs://bucket/path/to/file.zip|zip:path/to/array/|zarr3:
gs://bucket/path/to/file.zip|zip:path/to/nested.zip|zip:path/to/array/|zarr3:
gs://bucket/path/to/ocdbt/|ocdbt:path/to/array/|zarr3:

Note that the URL consists of a "pipeline" of |-separated components: the first component must be a base kvstore protocol (e.g. gs, s3, http), followed by zero or more kvstore adapter schemes such as zip, followed by a data format scheme such as zarr2, zarr3, precomputed, or n5. There is also format auto-detection, which can add the necessary kvstore adapter schemes and the final data format scheme automatically. For example, if you type just gs://bucket/path/to/file.zip, it will first be completed to gs://bucket/path/to/file.zip|zip: (based on the content, not the filename), and then, if there is a zarr array or group at the root of the zip file, it will be further completed to gs://bucket/path/to/file.zip|zip:|zarr3:.
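To make the pipeline structure concrete, here is a minimal Python sketch that splits such a URL into its components. It is only an illustration, not Neuroglancer's actual parser: the names `Component` and `parse_pipeline` are made up, and it assumes that `|` never appears inside a path (no escaping is handled).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    scheme: str  # e.g. "gs", "zip", "zarr3"
    path: str    # everything after "scheme://" or "scheme:"


def parse_pipeline(url: str) -> list[Component]:
    """Split a |-separated pipeline URL into (scheme, path) components.

    Assumption: "|" is never part of a path, so a plain split is enough.
    """
    components = []
    for index, part in enumerate(url.split("|")):
        # The base kvstore uses "scheme://", adapters and formats use "scheme:".
        separator = "://" if index == 0 else ":"
        scheme, found, path = part.partition(separator)
        if not found or not scheme:
            raise ValueError(f"component {index} has no scheme: {part!r}")
        components.append(Component(scheme, path))
    return components


parsed = parse_pipeline("gs://bucket/path/to/file.zip|zip:path/to/array/|zarr3:")
```

Here `parsed` holds three components: the base kvstore `gs`, the adapter `zip`, and the final format `zarr3` with an empty path.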

Note that whether some part of the path goes before or after the |zarr3: currently doesn't matter, but if a group storage transformer were used, it would matter. I'm planning to normalize URLs so that the outer kvstore URL points to the topmost valid zarr v3 group. E.g. if we normalize to gs://bucket/path/to/group/|zarr3:path/to/array/ then that means:

  • gs://bucket/path/to/zarr.json does NOT exist, but
  • gs://bucket/path/to/group/zarr.json does exist, and
  • gs://bucket/path/to/group/path/zarr.json does exist, and
  • gs://bucket/path/to/group/path/to/zarr.json does exist, and
  • gs://bucket/path/to/group/path/to/array/zarr.json does exist
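The normalization rule above could be sketched roughly as follows. This is a simplified illustration under stated assumptions: `normalize_zarr3_url` and the `exists` callback are hypothetical names, the search starts at the bucket root, and the sketch only checks that a `zarr.json` key exists, without verifying that it actually describes a valid group.

```python
from typing import Callable


def normalize_zarr3_url(base: str, path: str, exists: Callable[[str], bool]) -> str:
    """Move the "|zarr3:" split point to the topmost prefix that has a zarr.json.

    base:   kvstore root, e.g. "gs://bucket/"
    path:   path from that root to the array, e.g. "path/to/group/path/to/array/"
    exists: hypothetical callback reporting whether a key exists in the kvstore
    """
    segments = [segment for segment in path.split("/") if segment]
    # Try the shortest prefix first so that the topmost match wins.
    for count in range(len(segments) + 1):
        prefix = "".join(segment + "/" for segment in segments[:count])
        if exists(base + prefix + "zarr.json"):
            rest = "".join(segment + "/" for segment in segments[count:])
            return f"{base}{prefix}|zarr3:{rest}"
    raise ValueError("no zarr.json found along the path")


# In-memory stand-in for the bucket contents listed above.
existing_keys = {
    "gs://bucket/path/to/group/zarr.json",
    "gs://bucket/path/to/group/path/zarr.json",
    "gs://bucket/path/to/group/path/to/zarr.json",
    "gs://bucket/path/to/group/path/to/array/zarr.json",
}
normalized = normalize_zarr3_url(
    "gs://bucket/", "path/to/group/path/to/array/", existing_keys.__contains__
)
```

With these keys, `normalized` is `gs://bucket/path/to/group/|zarr3:path/to/array/`, matching the example.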

With this background out of the way, for icechunk, there are two questions:

  1. How to encode the branch / tag / snapshot in the URL
  2. Whether to treat icechunk as a key-value store adapter, or as a final data format in place of zarr3.

Some possible options:

gs://bucket/path/to/icechunk_repo/|icechunk:branch.main/path/to/array/|
gs://bucket/path/to/icechunk_repo/|icechunk:branch.main@path/to/array/|
gs://bucket/path/to/icechunk_repo/|icechunk:branch.main|zarr3:path/to/array/|

gs://bucket/path/to/icechunk_repo/refs/branch.main/|icechunk:path/to/array/|
gs://bucket/path/to/icechunk_repo/refs/branch.main/|icechunk:|zarr3:path/to/array/|
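As an illustration of what the first candidate would require from a parser, here is a rough Python sketch that splits the path of an icechunk: component such as branch.main/path/to/array/ into the ref and the node path. Everything here is hypothetical: the function name `split_icechunk_path`, the `tag.` and `snapshot.` prefixes (only `branch.` appears in the candidates above), and the assumption that ref names never contain `/`. If ref names could contain `/`, this form would be ambiguous, which is one reason a distinct separator like `@` might be preferable.

```python
def split_icechunk_path(component_path: str) -> tuple[str, str, str]:
    """Split "branch.main/path/to/array/" into (kind, name, node_path).

    Assumptions (not from any spec): refs are written as "<kind>.<name>",
    the kinds are "branch", "tag" and "snapshot", and names contain no "/".
    """
    ref, _, node_path = component_path.partition("/")
    kind, found, name = ref.partition(".")
    if not found or not name or kind not in ("branch", "tag", "snapshot"):
        raise ValueError(f"cannot parse ref from {component_path!r}")
    return kind, name, node_path


ref_parts = split_icechunk_path("branch.main/path/to/array/")
```

Under these assumptions, `ref_parts` is the branch `main` together with the node path `path/to/array/`.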

Unlike e.g. zarr-python, Neuroglancer needs a URL syntax in order to support icechunk at all, but it would be nice to standardize on a syntax that will also be supported by other tools in the future.

Choosing a URL syntax that includes a final |zarr3: component most closely corresponds to the current zarr-python integration, where icechunk just behaves as a key-value store and translates the metadata and chunks back to the standard zarr v3 metadata encoding and key encoding. I think the right choice, though, depends on how you expect icechunk to evolve in the future.
