Clarify terminology across specification #89
…Model, addressing semantic constructs and use cases for geospatial data workflows.
… and completeness
@christophenoel , maybe we continue here our discussion from #86 (comment), which touches on the definition of a "multiscale group" in this PR.
Does CDM / NetCDF already define a layout for representing overviews, or is the overview layout a new contribution from GeoZarr? If there's already a convention, then it might make sense to follow it. But if the convention is being created in this spec, then you have the opportunity to define your own semantics that are narrower than NetCDF / CDM. |
Very relevant point. In NetCDF/CDM, a Dataset is typically represented by the root group. This does not prevent a Dataset from containing groups, even if those groups are not directly exposed when opening the file in xarray. The reason for using the term Dataset in this context is to emphasise that it corresponds to a group node in the hierarchy that actually contains variables (as opposed to a purely structural container group). This also makes it possible to define requirements on this type of entity. |
Overview is a feature requested by stakeholders in order to support capabilities comparable to COG. That said, note that this feature must remain consistent with the foundation of the GeoZarr unified data model, which is the Common Data Model (CDM). The intent is to provide an extension point within the model, not to alter or redefine the model itself. |
For information:
|
I think these two options are both equally consistent with CDM, which is to say "not at all", because CDM requires that the

Option A: multiscale group distinct from dataset

Multiscale Dataset
A Zarr group that only contains other Zarr groups. It has a special attribute that declares the names of the subgroups it contains.
Members: Only datasets. No DataArrays.
Attributes

Regular Dataset
A Zarr group that contains DataArrays.
Members: Only DataArrays.
Attributes

DataArray
Zarr array that represents measured quantities.

Example
This is the ONLY way to represent this set of overviews.

Option B: Dataset definition is overloaded to support multiscales

Dataset
A Zarr group that contains other Zarr groups (other datasets), some of which might be overviews, if the
Members: DataArrays or Datasets. If
Attributes

DataArray
Zarr array that represents measured quantities.

Example
There are two ways to represent the same overviews:
and
|
In the above example, the first option is much more friendly for static analysis and automated traversal. The price is that we need an additional type definition (for our multiscale dataset). But neither A nor B introduces any new elements to NetCDF / CDM (other than JSON as a type for attributes). Rather, A is a narrower subset of NetCDF / CDM. This should be fine unless your expectation is that literally any CDM-compliant dataset should be convertible to GeoZarr without any layout transformations, and this seems like a tall order! |
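To make the static-analysis argument concrete, here is a minimal sketch of how Option A could be checked mechanically. It uses plain dictionaries as a stand-in for a Zarr hierarchy, not the zarr-python API, and the `kind`, `members`, `attrs` keys and the `multiscales` attribute name are assumptions made up for illustration; the specification may choose different names.

```python
# Hypothetical in-memory stand-in for a Zarr hierarchy (not the zarr-python API).
# The "multiscales" attribute name is an assumption used only for this sketch.

def array(dims):
    return {"kind": "array", "dims": dims}

def group(members, attrs=None):
    return {"kind": "group", "members": members, "attrs": attrs or {}}

def is_dataset(node):
    """Option A regular dataset: a non-empty group whose members are all arrays."""
    return (
        node["kind"] == "group"
        and len(node["members"]) > 0
        and all(m["kind"] == "array" for m in node["members"].values())
    )

def is_multiscale(node):
    """Option A multiscale dataset: only dataset subgroups, all declared in attrs."""
    if node["kind"] != "group":
        return False
    declared = node["attrs"].get("multiscales")
    if declared is None:
        return False
    members = node["members"]
    return set(declared) == set(members) and all(
        is_dataset(m) for m in members.values()
    )

def level(dims):
    # One zoom level: a data variable plus its coordinate arrays.
    return group({"temperature": array(dims), "x": array(["x"]), "y": array(["y"])})

pyramid = group(
    {"0": level(["y", "x"]), "1": level(["y", "x"])},
    attrs={"multiscales": ["0", "1"]},
)

# Mixing an array into the multiscale group violates the Option A definition.
mixed = group(
    {"0": level(["y", "x"]), "temperature": array(["y", "x"])},
    attrs={"multiscales": ["0"]},
)
```

Because each node type admits exactly one kind of member, a validator only needs the two predicates above; under Option B the same check would have to branch on attributes at every level.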
You raise a very critical point: because of the CDM/NetCDF limitations, all JSON objects are typically escaped. I would be interested to discuss whether this constraint should be relaxed in the GeoZarr model, and whether the Zarr encoding should describe when and what to unescape.
Thank you for summarising the proposals. From my perspective, option A is not feasible:
|
Should I open a separate issue for this discussion? The examples in the draft geozarr spec use JSON for attributes, as opposed to raw strings. |
Yes, feel free to open a separate issue for this. |
An additional problem with option B is that it forces the data variables for the source dataset to share the same namespace with the names of the overview datasets. Ideally these would be completely separate namespaces to prevent any possible name collisions.
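The namespace concern can be illustrated with a small sketch. The variable and overview names below are hypothetical; the point is only that, when both live in the same group, any overlap between the two sets of names is ambiguous.

```python
def name_collisions(variable_names, overview_names):
    """Return names that would be ambiguous if both lived in one group."""
    return sorted(set(variable_names) & set(overview_names))

# Hypothetical names: a producer labels data variables by band index
# and also wants overview levels named by zoom level.
variables = ["1", "2", "x", "y"]
overviews = ["1", "2", "3"]
collisions = name_collisions(variables, overviews)
```

With separate namespaces (as in Option A) this check is unnecessary, since data variables and overview datasets never share a parent group.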
That's correct. Option A strongly incentivizes data producers to plan for overviews by creating the layout accordingly. From a Zarr POV, it's generally best to create a hierarchy exactly once instead of adding arrays and groups piecemeal, so this might be guiding data producers in the right direction. But I can also see how this workflow might be inconvenient.
I would be curious to hear more about how option A would break existing readers. Is the name of the dataset of particular importance to readers, such that opening |
|
||
A container for datasets, variables, dimensions, and metadata in Zarr. Groups may be nested to represent a logical hierarchy (e.g., for resolutions or collections). | ||
A group that contains one or more data variables along with their associated coordinate variables, having a consistent relationship between these components. A dataset represents a coherent set of related data arrays and follows the unified data model. |
Can we use Unified Data Model in capitals wherever it is a formal reference to the clause 7 definition?
Does it say what the group is first? A group is probably still a container for datasets that can be nested.
We inherit the group terminology from Zarr. The section starts with:
GeoZarr specification inherits https://zarr-specs.readthedocs.io/en/latest/v3/core/index.html#concepts-and-terminology[concepts and terminology from the Zarr core specification].
The following terms add GeoZarr specificity to the existing Zarr terminology.
I would like to avoid repeating the Zarr terminology, to limit maintenance if it evolves.
capitals solved
I think the discussion on multiscales is diverging from the base terminology discussion, so I will reply in the related issue #83. |
|
||
=== Foundational Model and Standards Reuse | ||
|
||
GeoZarr adopts established data model concepts because Zarr itself provides only array storage without semantic interpretation. The Unidata Common Data Model (CDM) provides the conceptual framework for understanding dimensions, variables, and attributes, while CF Conventions provide standardized metadata semantics. This reuse ensures compatibility with existing scientific software while avoiding reinvention of proven concepts. | ||
|
||
==== Common Data Model (CDM) | ||
|
||
The CDM defines a generalised schema for representing array-based scientific datasets. The following constructs are reused directly within the unified model: | ||
The CDM defines a generalised schema for representing array-based scientific datasets. The following constructs are reused directly within the Unified Data Model: | ||
|
||
- **Dimensions** – Integer-valued, named axes that define the extents of data variables. | ||
- **Coordinate Variables** – Variables that supply coordinate values along dimensions, establishing spatial or temporal context. | ||
- **Data Variables** – Multidimensional arrays representing observed or simulated phenomena, associated with dimensions and coordinate variables. | ||
- **Attributes** – Key-value metadata elements used to describe variables and datasets semantically. |
The CDM uses a particular type system for attributes that is not a 1:1 match for Zarr's attribute type system.
Do you have a suggestion for describing that here?
That's what I added so far:
- Clarified that CDM concepts are adapted for Zarr's JSON type system
- Acknowledged differences while preserving semantic compatibility
|
||
This clause defines the structural organisation of stores conforming to the unified data model (UDM). It consolidates the foundational elements and optional extensions into a coherent architecture suitable for Zarr encoding, while remaining format-agnostic. The model establishes a modular and extensible framework that supports structured representation of multidimensional, geospatially-referenced resources. | ||
This clause defines the structural organisation of stores conforming to the Unified Data Model (UDM). It consolidates the foundational elements and optional extensions into a coherent architecture suitable for Zarr encoding, while remaining format-agnostic. The model establishes a modular and extensible framework that supports structured representation of multidimensional, geospatially-referenced resources. |
The term "store" is already defined in the Zarr spec: see https://zarr-specs.readthedocs.io/en/latest/v3/core/index.html#id25. So the use of the same term with a different meaning in GeoZarr will likely become a point of confusion.
I actually intend to use the term "store" as it is defined in Zarr, so I will clarify and reference the Zarr terminology.
…concepts, refining terminology, and ensuring consistent references to hierarchies and stores across multiple sections.
…'s type system and ensure compatibility with CDM semantics.
…hancing clarity in class diagram
Completed a new reading. Congrats on the work.
|
||
GeoZarr aims to bridge scientific and geospatial communities by enabling round-trip transformations with formats such as NetCDF and GeoTIFF, and supporting compatibility with tools in the scientific Python and geospatial ecosystems. This Standard enables scalable, standards-compliant, and semantically rich data structures for cloud-native Earth observation applications. | ||
By providing a standardized framework for geospatial semantics, GeoZarr enables scientific and geospatial applications to fully utilize cloud-native storage architectures while maintaining the rich metadata and coordinate referencing required for Earth observation workflows. The result is a modern, scalable approach to storing and accessing geospatial data that meets the needs of both data providers and consumers. | ||
|
I like this introduction.
|
||
Typical use cases include the storage, transformation, discovery, and processing of raster and gridded data, data cubes with temporal or vertical dimensions, and catalogue-enabled datasets integrated with metadata standards such as STAC and OGC Tile Matrix Sets. | ||
=== Why GeoZarr Exists |
I think we may be missing an important clarification to justify the purpose of GeoZarr: there are already existing conventions for geospatial data in Zarr, as implemented in Xarray, NCZarr, and GDAL. Those conventions primarily translate aspects of the CF/NetCDF data model into Zarr encoding.
However:
- The CF/NetCDF data model itself may lack certain capabilities, such as support for multiscale overviews, affine transforms, etc.
- The current encoding conventions to Zarr – for example, mapping all NetCDF attributes into Zarr string attributes – may not be optimal and could be revisited.
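As an illustration of the second point, the sketch below contrasts a string-only attribute encoding with a native JSON one, using only the Python standard library. The `transform` attribute and its `scale` and `translation` keys are hypothetical and are not taken from any existing convention.

```python
import json

# Hypothetical structured attribute; keys are illustrative only.
transform = {"scale": [10.0, -10.0], "translation": [500000.0, 4200000.0]}

# String-only convention: the object is serialised to a string first,
# so its quotes are escaped again when the attributes document is written.
escaped_attrs = json.dumps({"transform": json.dumps(transform)})

# Native JSON convention: the object is stored directly.
native_attrs = json.dumps({"transform": transform})

def read_transform(attrs_text):
    value = json.loads(attrs_text)["transform"]
    # A reader of the string-only form needs a rule for when to unescape.
    return json.loads(value) if isinstance(value, str) else value
```

Both forms carry the same information, but the string-only form requires every reader to know which attributes hold embedded JSON, which is the kind of rule an encoding specification would have to state explicitly.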
|
||
A conceptual model that defines how to structure geospatial data in Zarr using CDM-based constructs, including support for coordinate referencing, metadata integration, and multiscale representations. | ||
A conceptual model that defines how to structure geospatial data in Zarr using CDM-based constructs, including support for coordinate referencing, metadata integration, and multiscale representations. The Unified Data Model provides a standardized framework for expressing spatial relationships, coordinate systems, and scientific metadata. |
I believe the current definition is not ideal, since an abstract model should not be defined for a specific format. Instead, it should stand independently and be applicable across formats, with Zarr being one possible encoding of that model (as for CDM, CF abstract model, UDM, etc.)
A conceptual model that defines how to structure geospatial data in Zarr using CDM-based constructs, including support for coordinate referencing, metadata integration, and multiscale representations. The Unified Data Model provides a standardized framework for expressing spatial relationships, coordinate systems, and scientific metadata. | |
A conceptual model for structuring geospatial data using CDM-based constructs. It enables consistent representation of coordinate referencing, metadata integration, and multiscale data. The Unified Data Model provides a standard framework for describing spatial relationships, coordinate systems, and scientific metadata, which can then be encoded in formats such as Zarr. |
|
||
- **Dimensions** – Integer-valued, named axes that define the extents of data variables. | ||
- **Coordinate Variables** – Variables that supply coordinate values along dimensions, establishing spatial or temporal context. | ||
- **Data Variables** – Multidimensional arrays representing observed or simulated phenomena, associated with dimensions and coordinate variables. | ||
- **Attributes** – Key-value metadata elements used to describe variables and datasets semantically. | ||
- **Groups** – Optional hierarchical containers enabling logical organisation of resources and metadata. | ||
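For readers less familiar with the CDM, the constructs listed above can be sketched as plain Python dataclasses. This is an illustrative model only: the class and field names are assumptions, and coordinate variables are identified with the usual NetCDF rule (a one-dimensional variable that shares its name with its dimension).

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Tuple

@dataclass
class Dimension:
    name: str
    size: int  # integer-valued extent of the axis

@dataclass
class Variable:
    name: str
    dims: Tuple[str, ...]
    attrs: Dict[str, Any] = field(default_factory=dict)

    def is_coordinate(self) -> bool:
        # NetCDF rule: a 1-D variable named after its own dimension.
        return self.dims == (self.name,)

@dataclass
class Group:
    name: str
    dimensions: List[Dimension] = field(default_factory=list)
    variables: List[Variable] = field(default_factory=list)
    groups: List["Group"] = field(default_factory=list)
    attrs: Dict[str, Any] = field(default_factory=dict)

    def coordinate_variables(self) -> List[Variable]:
        return [v for v in self.variables if v.is_coordinate()]

    def data_variables(self) -> List[Variable]:
        return [v for v in self.variables if not v.is_coordinate()]

root = Group(
    name="/",
    dimensions=[Dimension("time", 12), Dimension("y", 256), Dimension("x", 256)],
    variables=[
        Variable("time", ("time",)),
        Variable("y", ("y",)),
        Variable("x", ("x",)),
        Variable("temperature", ("time", "y", "x"), {"units": "K"}),
    ],
)
```

Here `root.data_variables()` yields only `temperature`, while the three one-dimensional variables are classified as coordinate variables.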
|
||
The unified data model adopts these CDM components without modification excluding the user-defined types. Semantic interpretation remains consistent with the original CDM specification. GeoZarr structures are mapped to CDM constructs to ensure compatibility and clarity. | ||
The Unified Data Model adopts these CDM components with adaptations for Zarr's type system. While the conceptual structure remains consistent with the original CDM specification, attribute types are mapped to Zarr's JSON-compatible type system. GeoZarr structures preserve CDM semantics while conforming to Zarr's encoding constraints. |
I think it is a very good point to add.
|
||
Multiscale datasets are composed of a set of Zarr groups representing multiple zoom levels. Each level stores coarser-resolution resampled versions of the original data variables. | ||
A <<term-multiscale-group,multiscale group>> contains one or more child groups, where each child group is a <<term-dataset,dataset>> representing a zoom level of the data. Additional resolution levels can be added over time, with each new level storing a coarser-resolution resampled version of the original data variables. | ||
|
For multiscale support, I think we need a consensus that gives flexibility to data producers:
The multiscale group may include or exclude the native data, and the child zoom-level groups may likewise include or exclude the native level (0). This flexibility allows producers to handle different scenarios, such as adding overviews later to an existing archive. |
This PR addresses terminology inconsistencies and potential confusion around key terms used throughout the specification, particularly around "groups", "datasets", and hierarchical structures.
Changes
Updated the abstract and introduction to emphasise Zarr's storage foundation and the purpose of GeoZarr's semantic additions
Added explicit reference to Zarr core specification terminology
Updated and clarified key terms:
Updated terminology usage in:
Added proper cross-references using AsciiDoc syntax (e.g., <<term-dataset>>)
Impact
These changes:
Resolves discussion from #86 regarding terminology consistency.
@d-v-b, @maxrjones, @geospatial-jeff