Conversation

emmanuelmathot

@emmanuelmathot emmanuelmathot commented Aug 22, 2025

This PR addresses terminology inconsistencies and potential confusion around key terms used throughout the specification, particularly around "groups", "datasets", and hierarchical structures.

Changes

  1. Updated abstract and intro section to emphasize Zarr's storage foundation and the purpose of GeoZarr's semantic additions

  2. Added explicit reference to Zarr core specification terminology

  3. Updated and clarified key terms:

    • Store: Root-level container in a Zarr hierarchy
    • Dataset: A group containing data variables and their coordinate variables
    • Multiscale Group: A group containing child groups at different resolutions
    • Child Group: Any group contained within another group
  4. Updated terminology usage in:

    • Clause 4 (Terms and Definitions)
    • Clause 7 (Unified Data Model) including UML diagram
    • Clause 9 (Zarr Encoding) core and overviews sections
  5. Added proper cross-references using AsciiDoc syntax (e.g., <<term-dataset>>)

Impact

These changes:

  • Resolve the overloading of the term "dataset"
  • Clarify the hierarchical relationships between components
  • Align with Zarr core specification terminology
  • Make the specification more precise and easier to understand

Resolves discussion from #86 regarding terminology consistency.

@d-v-b, @maxrjones, @geospatial-jeff

…Model, addressing semantic constructs and use cases for geospatial data workflows.
@d-v-b

d-v-b commented Sep 3, 2025

@christophenoel, maybe we can continue our discussion from #86 (comment) here, since it touches on the definition of a "multiscale group" in this PR.

The price of consistency here is complexity -- clients will need to distinguish between "dataset that's just a collection of variables"

I disagree:

Client not supporting overviews: just reading the dataset as usual (strong requirement for some stakeholders)
Client supporting overviews: reading the multiscales attribute to retrieve the downscales

This simplifies the dataset definition, because datasets are no longer tasked with two roles (containing variables and / or containing other datasets).

In CDM/NetCDF a dataset can include other datasets... that's a matter of fact.

Does CDM / NetCDF already define a layout for representing overviews, or is the overview layout a new contribution from GeoZarr? If there's already a convention, then it might make sense to follow it. But if the convention is being created in this spec, then you have the opportunity to define your own semantics that are narrower than NetCDF / CDM.

@christophenoel

Very relevant point.

In NetCDF/CDM, a Dataset is typically represented by the root group.
In xarray, a Dataset is a collection of variables, coordinates, and dimensions that can be opened or displayed.

This does not prevent a Dataset from containing groups, even if those groups are not directly exposed when opening the file in xarray.

The reason for using the term Dataset in this context is to emphasise that it corresponds to a group node in the hierarchy that actually contains variables (as opposed to a purely structural container group). This also allows requirements to be defined on this type of entity.

@christophenoel

Does CDM / NetCDF already define a layout for representing overviews, or is the overview layout a new contribution from GeoZarr? If there's already a convention, then it might make sense to follow it. But if the convention is being created in this spec, then you have the opportunity to define your own semantics that are narrower than NetCDF / CDM.

Overview is a feature requested by stakeholders in order to support capabilities comparable to COG.

That said, note that this feature must remain consistent with the foundation of the GeoZarr unified data model, which is the Common Data Model (CDM). The intent is to provide an extension point within the model, not to alter or redefine the model itself.

@christophenoel

For information:

  1. In EOPF: no notion of Dataset

  2. ESA EOF -EOS API Evolution Best Practices (released soon) - in their data model (based on EOPF)

An EO Product is structured as a hierarchy rooted at a top-level dataset entity, which may include one or more child datasets. Each dataset node contains variables along with their associated coordinates, dimensions, and metadata.

@d-v-b

d-v-b commented Sep 3, 2025

I think these two options are both equally consistent with CDM, which is to say "not at all", because CDM requires that the attributes of a group be a string or a 1-dimensional scalar, and that excludes JSON objects like the "multiscales" used in both examples:

Option A: multiscale group distinct from dataset

Multiscale Dataset

A Zarr group that only contains other Zarr groups. It has a special attribute that declares the names of the subgroups it contains.

Members

Only datasets. No DataArrays.

Attributes

| field | constraint | required | notes |
|---|---|---|---|
| "multiscales" | multiscales object | yes | this field declares the names of sub-datasets |

Regular Dataset

A Zarr group that contains dataarrays

Members

only DataArrays.

Attributes

| field | constraint | required | notes |
|---|---|---|---|
| "grid_mapping" | string | sure | name of the grid mapping variable |
| ...other CF attributes | ... | ... | ... |

DataArray

Zarr array that represents measured quantities.

Example

This is the ONLY way to represent this set of overviews.

/measurements/r10m/          # Multiscale root group with multiscales metadata
├── 0/                       # Native resolution (zoom level 0)  
│   ├── band1                # Data variable at zoom level 0
│   ├── band2                # Data variable at zoom level 0
│   └── spatial_ref          # Coordinate reference variable
├── 1/                       # First overview level
│   ├── band1                # Data variable at zoom level 1
│   ├── band2                # Data variable at zoom level 1
│   └── spatial_ref          # Coordinate reference variable
└── 2/                       # Second overview level
    ├── band1                # Data variable at zoom level 2
    ├── band2                # Data variable at zoom level 2
    └── spatial_ref          # Coordinate reference variable
----
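
As a rough illustration of why Option A lends itself to automated traversal, here is a minimal Python sketch. It does not use a real Zarr library: the hierarchy is modelled as nested dicts, and the `classify` and `level` helpers, the dict layout, and the placeholder value of `"multiscales"` are all hypothetical and not taken from the spec.

```python
# Hypothetical in-memory model (not a real Zarr API): a group is a dict with
# "attrs" and "members"; an array is represented by the string "array".

def classify(group):
    """Classify a group under Option A's strict membership rules."""
    members = group["members"].values()
    has_arrays = any(m == "array" for m in members)
    has_groups = any(isinstance(m, dict) for m in members)
    if "multiscales" in group["attrs"]:
        # A multiscale group may contain only datasets (groups), never arrays.
        if has_arrays:
            raise ValueError("multiscale group must not contain arrays")
        return "multiscale"
    # A regular dataset may contain only arrays, never child groups.
    if has_groups:
        raise ValueError("dataset must not contain child groups")
    return "dataset"


def level(names=("band1", "band2", "spatial_ref")):
    """Build a dataset-like group holding only arrays."""
    return {"attrs": {}, "members": {n: "array" for n in names}}


r10m = {
    "attrs": {"multiscales": ["0", "1", "2"]},  # placeholder; real schema not shown here
    "members": {"0": level(), "1": level(), "2": level()},
}

print(classify(r10m))                  # multiscale
print(classify(r10m["members"]["0"]))  # dataset
```

Because each group type has exactly one allowed shape, the classification needs only the attributes and the member kinds, with no fallback cases.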

Option B: Dataset definition is overloaded to support multiscales

Dataset

A Zarr group that contains other Zarr groups (other datasets), some of which might be overviews, if the "multiscales" key is present in this group's attributes. It might also contain Zarr arrays, some of which might also be part of the set of overviews.

Members

DataArrays or Datasets. If "multiscales" is present in attributes, then the datasets described in "multiscales"
MUST be contained in this group.

Attributes

| field | constraint | required | notes |
|---|---|---|---|
| "grid_mapping" | string | sure | name of the grid mapping variable |
| "multiscales" | multiscales object | only if this is a multiscale dataset | presence of this key means the dataset contains sub-datasets that should be treated as overviews |
| ...other CF attributes | ... | ... | ... |

DataArray

Zarr array that represents measured quantities.

Example

There are two ways to represent the same overviews:

/measurements/r10m/          # Dataset Group with native resolution and multiscales metadata
├── band1                    # Native resolution variable
├── band2
├── spatial_ref
├── 1/                       # First overview level
│   ├── band1
│   ├── band2
│   └── spatial_ref
└── 2/                       # Second overview level
    ├── band1
    ├── band2
    └── spatial_ref
----

and

/measurements/r10m/          # Multiscale root group with multiscales metadata
├── 0/                       # Native resolution (zoom level 0)  
│   ├── band1                # Data variable at zoom level 0
│   ├── band2                # Data variable at zoom level 0
│   └── spatial_ref          # Coordinate reference variable
├── 1/                       # First overview level
│   ├── band1                # Data variable at zoom level 1
│   ├── band2                # Data variable at zoom level 1
│   └── spatial_ref          # Coordinate reference variable
└── 2/                       # Second overview level
    ├── band1                # Data variable at zoom level 2
    ├── band2                # Data variable at zoom level 2
    └── spatial_ref          # Coordinate reference variable
----
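
To make the cost of the two layouts concrete, the sketch below shows the branching a reader might need under Option B just to locate the native-resolution data. It is an assumption-laden illustration: the nested-dict model, the `native_level_path` helper, the convention that a child named `"0"` holds the native level, and the placeholder `"multiscales"` values are hypothetical and not part of any spec.

```python
# Hypothetical in-memory model (not a real Zarr API): a group is a dict with
# "attrs" and "members"; an array is represented by the string "array".

def level(names=("band1", "band2", "spatial_ref")):
    """Build a dataset-like group holding only arrays."""
    return {"attrs": {}, "members": {n: "array" for n in names}}


def native_level_path(group, path):
    """Return the path of the native-resolution dataset under Option B.

    Assumption: the native level is either the group itself (arrays stored
    directly in it) or a child group named "0".
    """
    members = group["members"]
    if any(m == "array" for m in members.values()):
        return path          # first layout: native arrays live in the group itself
    if isinstance(members.get("0"), dict):
        return f"{path}/0"   # second layout: native arrays live in child group "0"
    raise ValueError("no native-resolution data found")


# First layout: native arrays next to the overview groups "1" and "2".
layout_1 = level()
layout_1["attrs"]["multiscales"] = ["1", "2"]  # placeholder value
layout_1["members"].update({"1": level(), "2": level()})

# Second layout: every level, including native, is a child group.
layout_2 = {
    "attrs": {"multiscales": ["0", "1", "2"]},  # placeholder value
    "members": {"0": level(), "1": level(), "2": level()},
}

print(native_level_path(layout_1, "/measurements/r10m"))  # /measurements/r10m
print(native_level_path(layout_2, "/measurements/r10m"))  # /measurements/r10m/0
```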

@d-v-b

d-v-b commented Sep 3, 2025

In the above example, the first option is much more friendly for static analysis and automated traversal. The price is that we need an additional type definition (for our multiscale dataset). But neither A nor B introduces any new elements to netcdf / cdm (other than JSON as a type for attributes).

Rather, A is a narrower subset of netcdf / cdm. This should be fine unless your expectation is that literally any cdm-compliant dataset should be convertible to geozarr without any layout transformations, and this seems like a tall order!

@christophenoel

christophenoel commented Sep 3, 2025

because CDM requires that the attributes of a group be a string or a 1-dimensional scalar, and that excludes JSON objects

You raise a very critical point: because of the CDM/NetCDF limitations all JSON objects are typically escaped. I would be interested to discuss if this constraint must be relaxed in GeoZarr model, and the encoding to Zarr describing when/what to unescape.

Option A: multiscale group distinct from dataset

Option B: Dataset definition is overloaded to support multiscales

Thank you for summarising the proposals.

From my perspective, option A is not feasible:

  • First, adding overviews would alter the structure of an existing product (and thus being resource-consuming)
  • Second, several of our clients and partners have explicitly stated that the introduction of overviews must not require any change in the current reading process (backward compatibility). Other standards (e.g. NetCDF, COG, GeoTIFF) introduced extensions without breaking existing readers.

@d-v-b

d-v-b commented Sep 3, 2025

You raise a very critical point: because of the CDM/NetCDF limitations all JSON objects are typically escaped. I would be interested to discuss if this constraint must be relaxed in GeoZarr model, and the encoding to Zarr describing when/what to unescape.

Should I open a separate issue for this discussion? The examples in the draft geozarr spec use JSON for attributes, as opposed to raw strings.

@christophenoel

Should I open a separate issue for this discussion?

Yes, feel free to open a separate issue for this.
I meant that NetCDF attributes are typically escaped. We work with JSON objects, but this is not actually documented.
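
For readers unfamiliar with the distinction, the following Python sketch contrasts an "escaped" attribute (a JSON object serialised into a string) with the same object stored natively as JSON. The content of `multiscales` here is a hypothetical placeholder, not the schema from the spec, and only the standard `json` module is used.

```python
import json

multiscales = {"levels": ["0", "1", "2"]}  # hypothetical content, for illustration only

# NetCDF-style: the object is serialised into a string attribute ("escaped").
escaped_attrs = {"multiscales": json.dumps(multiscales)}

# Zarr-style: attributes are JSON, so the object can be stored as-is.
native_attrs = {"multiscales": multiscales}

# What each form looks like once written to a JSON metadata document.
print(json.dumps(escaped_attrs))
print(json.dumps(native_attrs))

# A reader of the escaped form needs an extra decoding step to get the object back.
decoded = json.loads(escaped_attrs["multiscales"])
print(decoded == native_attrs["multiscales"])
```

The open question in this thread is which attributes a reader should decode this way, which is why the when/what-to-unescape rule would need to be written down.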

@d-v-b

d-v-b commented Sep 3, 2025

An additional problem with option B is that it forces the data variables for the source dataset to share the same namespace with the names of the overview datasets. Ideally these would be completely separate namespaces to prevent any possible name collisions.
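
A minimal sketch of the collision check a producer would need under option B, assuming variable names and overview group names are available as plain lists. The `name_collisions` helper is hypothetical and only illustrates the shared-namespace concern.

```python
def name_collisions(variable_names, overview_names):
    """Return names used both for a data variable and for an overview group."""
    return sorted(set(variable_names) & set(overview_names))


# Names from the example layout: no collision.
print(name_collisions(["band1", "band2", "spatial_ref"], ["1", "2"]))  # []

# A variable that happens to be called "1" clashes with overview level "1".
print(name_collisions(["band1", "1"], ["1", "2"]))  # ['1']
```

Under option A this check is unnecessary, since variables and overview groups never share a parent group.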

First, adding overviews would alter the structure of an existing product (and thus being resource-consuming)

That's correct. Option A strongly incentivizes data producers to plan for overviews by creating the layout accordingly. From a Zarr POV, it's generally best to create a hierarchy exactly once instead of adding arrays and groups piecemeal, so this might be guiding data producers in the right direction. But I can also see how this workflow might be inconvenient.

Second, several of our clients and partners have explicitly stated that the introduction of overviews must not require any change in the current reading process (backward compatibility). Other standards (e.g. NetCDF, COG, GeoTIFF) introduced extensions without breaking existing readers.

I would be curious to hear more about how option A would break existing readers. Is the name of the dataset of particular importance to readers, such that opening r10m/0 is a breaking change compared to opening r10m?


A container for datasets, variables, dimensions, and metadata in Zarr. Groups may be nested to represent a logical hierarchy (e.g., for resolutions or collections).
A group that contains one or more data variables along with their associated coordinate variables, having a consistent relationship between these components. A dataset represents a coherent set of related data arrays and follows the unified data model.

Can we use "Unified Data Model" in capitals wherever it is a formal reference to the Clause 7 definition?

Does it say what the group is first? A group is probably still a container for datasets that can be nested.

Author

We inherit the group terminology from Zarr. The section starts with:

GeoZarr specification inherits https://zarr-specs.readthedocs.io/en/latest/v3/core/index.html#concepts-and-terminology[concepts and terminology from the Zarr core specification].
The following terms adds Geozarr specificity to the existing Zarr terminology

I would like to avoid repeating the Zarr terminology in order to limit the maintenance if they evolve.

Author

capitals solved

@christophenoel

christophenoel commented Sep 4, 2025

I would be curious to hear more about how option A would break existing readers. Is the name of the dataset of particular importance to readers, such that opening r10m/0 is a breaking change compared to opening r10m?

I think the discussion on multiscales is diverging from the base terminology discussion, so I will reply in the related issue #83.


=== Foundational Model and Standards Reuse

GeoZarr adopts established data model concepts because Zarr itself provides only array storage without semantic interpretation. The Unidata Common Data Model (CDM) provides the conceptual framework for understanding dimensions, variables, and attributes, while CF Conventions provide standardized metadata semantics. This reuse ensures compatibility with existing scientific software while avoiding reinvention of proven concepts.

==== Common Data Model (CDM)

The CDM defines a generalised schema for representing array-based scientific datasets. The following constructs are reused directly within the unified model:
The CDM defines a generalised schema for representing array-based scientific datasets. The following constructs are reused directly within the Unified Data Model:

- **Dimensions** – Integer-valued, named axes that define the extents of data variables.
- **Coordinate Variables** – Variables that supply coordinate values along dimensions, establishing spatial or temporal context.
- **Data Variables** – Multidimensional arrays representing observed or simulated phenomena, associated with dimensions and coordinate variables.
- **Attributes** – Key-value metadata elements used to describe variables and datasets semantically.

The CDM uses a particular type system for attributes that is not a 1:1 match for Zarr's attributes type system.

Author

Do you have a suggestion for describing that here?

Author

That's what I added so far:

  • Clarified that CDM concepts are adapted for Zarr's JSON type system
  • Acknowledged differences while preserving semantic compatibility


This clause defines the structural organisation of stores conforming to the unified data model (UDM). It consolidates the foundational elements and optional extensions into a coherent architecture suitable for Zarr encoding, while remaining format-agnostic. The model establishes a modular and extensible framework that supports structured representation of multidimensional, geospatially-referenced resources.
This clause defines the structural organisation of stores conforming to the Unified Data Model (UDM). It consolidates the foundational elements and optional extensions into a coherent architecture suitable for Zarr encoding, while remaining format-agnostic. The model establishes a modular and extensible framework that supports structured representation of multidimensional, geospatially-referenced resources.

The term "store" is already defined in the Zarr spec: see https://zarr-specs.readthedocs.io/en/latest/v3/core/index.html#id25. So the use of the same term with a different meaning in GeoZarr will likely become a point of confusion.

Author

@emmanuelmathot emmanuelmathot Sep 4, 2025

I actually intend to use the term "store" as it is defined in Zarr, so I will clarify this and reference the Zarr terminology.

…concepts, refining terminology, and ensuring consistent references to hierarchies and stores across multiple sections.
…'s type system and ensure compatibility with CDM semantics.
@emmanuelmathot emmanuelmathot requested a review from d-v-b September 4, 2025 13:57

@christophenoel christophenoel left a comment

Completed a new reading. Congrats on the work.


GeoZarr aims to bridge scientific and geospatial communities by enabling round-trip transformations with formats such as NetCDF and GeoTIFF, and supporting compatibility with tools in the scientific Python and geospatial ecosystems. This Standard enables scalable, standards-compliant, and semantically rich data structures for cloud-native Earth observation applications.
By providing a standardized framework for geospatial semantics, GeoZarr enables scientific and geospatial applications to fully utilize cloud-native storage architectures while maintaining the rich metadata and coordinate referencing required for Earth observation workflows. The result is a modern, scalable approach to storing and accessing geospatial data that meets the needs of both data providers and consumers.

I like this introduction.


Typical use cases include the storage, transformation, discovery, and processing of raster and gridded data, data cubes with temporal or vertical dimensions, and catalogue-enabled datasets integrated with metadata standards such as STAC and OGC Tile Matrix Sets.
=== Why GeoZarr Exists

I think we may be missing an important clarification to justify the purpose of GeoZarr: there are already existing conventions for geospatial data in Zarr, as implemented in Xarray, NCZarr, and GDAL. Those conventions primarily translate aspects of the CF/NetCDF data model into Zarr encoding.

However:

  1. The CF/NetCDF data model itself may lack certain capabilities, such as support for multiscale overviews, affine transforms, etc.
  2. The current encoding conventions to Zarr – for example, mapping all NetCDF attributes into Zarr string attributes – may not be optimal and could be revisited.


A conceptual model that defines how to structure geospatial data in Zarr using CDM-based constructs, including support for coordinate referencing, metadata integration, and multiscale representations.
A conceptual model that defines how to structure geospatial data in Zarr using CDM-based constructs, including support for coordinate referencing, metadata integration, and multiscale representations. The Unified Data Model provides a standardized framework for expressing spatial relationships, coordinate systems, and scientific metadata.

I believe the current definition is not ideal, since an abstract model should not be defined for a specific format. Instead, it should stand independently and be applicable across formats, with Zarr being one possible encoding of that model (as for CDM, CF abstract model, UDM, etc.)

Suggested change
A conceptual model that defines how to structure geospatial data in Zarr using CDM-based constructs, including support for coordinate referencing, metadata integration, and multiscale representations. The Unified Data Model provides a standardized framework for expressing spatial relationships, coordinate systems, and scientific metadata.
A conceptual model for structuring geospatial data using CDM-based constructs. It enables consistent representation of coordinate referencing, metadata integration, and multiscale data. The Unified Data Model provides a standard framework for describing spatial relationships, coordinate systems, and scientific metadata, which can then be encoded in formats such as Zarr.


- **Dimensions** – Integer-valued, named axes that define the extents of data variables.
- **Coordinate Variables** – Variables that supply coordinate values along dimensions, establishing spatial or temporal context.
- **Data Variables** – Multidimensional arrays representing observed or simulated phenomena, associated with dimensions and coordinate variables.
- **Attributes** – Key-value metadata elements used to describe variables and datasets semantically.
- **Groups** – Optional hierarchical containers enabling logical organisation of resources and metadata.

The unified data model adopts these CDM components without modification excluding the user-defined types. Semantic interpretation remains consistent with the original CDM specification. GeoZarr structures are mapped to CDM constructs to ensure compatibility and clarity.
The Unified Data Model adopts these CDM components with adaptations for Zarr's type system. While the conceptual structure remains consistent with the original CDM specification, attribute types are mapped to Zarr's JSON-compatible type system. GeoZarr structures preserve CDM semantics while conforming to Zarr's encoding constraints.
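
As a hedged illustration of the type-system gap discussed in this thread, the Python sketch below checks whether an attribute value fits a simplified reading of CDM attributes (a string, a number, or a flat list of those) or needs Zarr's broader JSON types. The `is_cdm_style` helper and this simplification are assumptions for illustration, not definitions from CDM or GeoZarr.

```python
def is_cdm_style(value):
    """True if value is a string, a number, or a flat list of strings/numbers.

    Simplified assumption: anything else (nested lists, JSON objects) is
    treated as requiring Zarr's JSON-compatible attribute types.
    """
    scalar_types = (str, int, float)
    if isinstance(value, scalar_types):
        return True
    if isinstance(value, list):
        return all(isinstance(v, scalar_types) for v in value)
    return False


print(is_cdm_style("spatial_ref"))             # True
print(is_cdm_style([0.0, 10.0]))               # True
print(is_cdm_style({"levels": ["0", "1"]}))    # False
print(is_cdm_style([[1, 2], [3, 4]]))          # False
```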

I think it is a very good point to add.


Multiscale datasets are composed of a set of Zarr groups representing multiple zoom levels. Each level stores coarser-resolution resampled versions of the original data variables.
A <<term-multiscale-group,multiscale group>> contains one or more child groups, where each child group is a <<term-dataset,dataset>> representing a zoom level of the data. Additional resolution levels can be added over time, with each new level storing a coarser-resolution resampled version of the original data variables.

For multiscale support, I think we need a consensus that gives flexibility to data producers:

Suggested change
The multiscale group may include or exclude the native data, and the child zoom-level groups may likewise include or exclude the native level (0). This flexibility allows producers to handle different scenarios, such as adding overviews later to an existing archive.
