
EOPF GeoZarr

GeoZarr-compliant data model for EOPF (Earth Observation Processing Framework) datasets.

Overview

This library provides tools to convert EOPF datasets to a GeoZarr-spec 0.4 compliant format while maintaining native projections and using /2 downsampling for multiscale support.

Key Features

  • GeoZarr Specification Compliance: Full compliance with GeoZarr spec 0.4
  • Native CRS Preservation: No reprojection to TMS, maintains original coordinate reference systems
  • Multiscale Support: COG-style /2 downsampling with overview levels as child groups
  • CF Conventions: Proper CF standard names and grid_mapping attributes
  • Robust Processing: Band-by-band writing with validation and retry logic
  • S3 Support: Direct output to S3-compatible object storage with automatic credential validation
  • Parallel Processing: Optional dask cluster support for parallel chunk processing
  • Chunk Alignment: Automatic chunk alignment to prevent data corruption with dask

GeoZarr Compliance Features

  • _ARRAY_DIMENSIONS attributes on all arrays
  • CF standard names for all variables
  • grid_mapping attributes referencing CF grid_mapping variables
  • GeoTransform attributes in grid_mapping variables
  • Proper multiscales metadata structure
  • Native CRS tile matrix sets
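
For illustration, the attributes involved look roughly like this (a sketch; the variable names and values below are hypothetical, not actual tool output):

# Hypothetical attributes on a data variable:
data_variable_attrs = {
    "_ARRAY_DIMENSIONS": ["y", "x"],                   # named dimensions for the array
    "standard_name": "toa_bidirectional_reflectance",  # CF standard name
    "grid_mapping": "spatial_ref",                     # points at the grid_mapping variable
}

# Hypothetical attributes on the referenced grid_mapping variable:
grid_mapping_variable_attrs = {
    "grid_mapping_name": "transverse_mercator",        # CF grid mapping (e.g. for UTM)
    "GeoTransform": "600000.0 10.0 0.0 5900040.0 0.0 -10.0",  # GDAL-style affine transform
}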

Installation

pip install eopf-geozarr

For development:

git clone <repository-url>
cd eopf-geozarr
pip install -e ".[dev]"

Quick Start

Command Line Interface

After installation, you can use the eopf-geozarr command:

# Convert EOPF dataset to GeoZarr format (local output)
eopf-geozarr convert input.zarr output.zarr

# Convert EOPF dataset to GeoZarr format (S3 output)
eopf-geozarr convert input.zarr s3://my-bucket/path/to/output.zarr

# Convert with parallel processing using dask cluster
eopf-geozarr convert input.zarr output.zarr --dask-cluster

# Convert with dask cluster and verbose output
eopf-geozarr convert input.zarr output.zarr --dask-cluster --verbose

# Get information about a dataset
eopf-geozarr info input.zarr

# Validate GeoZarr compliance
eopf-geozarr validate output.zarr

# Get help
eopf-geozarr --help

S3 Support

The library supports direct output to S3-compatible storage, including custom providers like OVH Cloud. Simply provide an S3 URL as the output path:

# Convert to S3
eopf-geozarr convert local_input.zarr s3://my-bucket/geozarr-data/output.zarr --verbose

S3 Configuration

Before using S3 output, ensure your S3 credentials are configured:

For AWS S3:

export AWS_ACCESS_KEY_ID=your_access_key
export AWS_SECRET_ACCESS_KEY=your_secret_key
export AWS_DEFAULT_REGION=us-east-1

For OVH Cloud Object Storage:

export AWS_ACCESS_KEY_ID=your_ovh_access_key
export AWS_SECRET_ACCESS_KEY=your_ovh_secret_key
export AWS_DEFAULT_REGION=gra  # or other OVH region
export AWS_ENDPOINT_URL=https://s3.gra.cloud.ovh.net  # OVH endpoint

For other S3-compatible providers:

export AWS_ACCESS_KEY_ID=your_access_key
export AWS_SECRET_ACCESS_KEY=your_secret_key
export AWS_DEFAULT_REGION=your_region
export AWS_ENDPOINT_URL=https://your-s3-endpoint.com

Alternative: AWS CLI Configuration

aws configure
# Note: For custom endpoints, you'll still need to set AWS_ENDPOINT_URL

S3 Features

  • Custom Endpoints: Support for any S3-compatible storage (AWS, OVH Cloud, MinIO, etc.)
  • Automatic Validation: The tool validates S3 access before starting conversion
  • Credential Detection: Automatically detects and validates S3 credentials
  • Error Handling: Provides helpful error messages for S3 configuration issues
  • Performance: Optimized for S3 with proper chunking and retry logic

Parallel Processing with Dask

The library supports parallel processing using dask clusters for improved performance on large datasets:

# Enable dask cluster for parallel processing
eopf-geozarr convert input.zarr output.zarr --dask-cluster

# With verbose output to see cluster information
eopf-geozarr convert input.zarr output.zarr --dask-cluster --verbose

Dask Features

  • Local Cluster: Automatically starts a local dask cluster with multiple workers
  • Dashboard Access: Provides access to the dask dashboard for monitoring (shown in verbose mode)
  • Automatic Cleanup: Properly closes the cluster even if errors occur during processing
  • Chunk Alignment: Automatically aligns Zarr chunks with dask chunks to prevent data corruption
  • Memory Efficiency: Better memory management through parallel chunk processing
  • Error Handling: Graceful handling of dask import errors with helpful installation instructions

Chunk Alignment

The library includes advanced chunk alignment logic to prevent the common issue of overlapping chunks when using dask:

  • Smart Detection: Automatically detects if data is dask-backed and uses existing chunk structure
  • Aligned Calculation: Uses calculate_aligned_chunk_size() to find optimal chunk sizes that divide evenly into data dimensions
  • Proper Rechunking: Ensures datasets are rechunked to match encoding before writing
  • Fallback Logic: For non-dask arrays, uses reasonable chunk sizes that don't exceed data dimensions

This prevents errors like:

❌ Failed to write tci after 2 attempts: Specified Zarr chunks encoding['chunks']=(1, 3660, 3660)
for variable named 'tci' would overlap multiple Dask chunks
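
Conceptually, the alignment picks the largest chunk size no larger than the target that divides the dimension evenly. A minimal sketch of that idea (not the library's actual implementation of calculate_aligned_chunk_size):

def aligned_chunk_size(dimension_size: int, target_chunk_size: int) -> int:
    """Largest size <= target_chunk_size that divides dimension_size evenly."""
    for size in range(min(target_chunk_size, dimension_size), 0, -1):
        if dimension_size % size == 0:
            return size

aligned_chunk_size(5490, 3660)  # 2745: two chunks of 2745 tile 5490 exactly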

S3 Python API

import os
import xarray as xr
from eopf_geozarr import create_geozarr_dataset

# Configure for OVH Cloud (example)
os.environ['AWS_ACCESS_KEY_ID'] = 'your_ovh_access_key'
os.environ['AWS_SECRET_ACCESS_KEY'] = 'your_ovh_secret_key'
os.environ['AWS_DEFAULT_REGION'] = 'gra'
os.environ['AWS_ENDPOINT_URL'] = 'https://s3.gra.cloud.ovh.net'

# Load your EOPF DataTree
dt = xr.open_datatree("path/to/eopf/dataset.zarr", engine="zarr")

# Convert directly to S3
dt_geozarr = create_geozarr_dataset(
    dt_input=dt,
    groups=["/measurements/r10m", "/measurements/r20m", "/measurements/r60m"],
    output_path="s3://my-bucket/geozarr-data/output.zarr",
    spatial_chunk=4096,
    min_dimension=256,
    tile_width=256,
    max_retries=3
)

Python API

import xarray as xr
from eopf_geozarr import create_geozarr_dataset

# Load your EOPF DataTree
dt = xr.open_datatree("path/to/eopf/dataset.zarr", engine="zarr")

# Define groups to convert (e.g., resolution groups)
groups = ["/measurements/r10m", "/measurements/r20m", "/measurements/r60m"]

# Convert to GeoZarr compliant format
dt_geozarr = create_geozarr_dataset(
    dt_input=dt,
    groups=groups,
    output_path="path/to/output/geozarr.zarr",
    spatial_chunk=4096,
    min_dimension=256,
    tile_width=256,
    max_retries=3
)

API Reference

Main Functions

create_geozarr_dataset

Create a GeoZarr-spec 0.4 compliant dataset from EOPF data.

Parameters:

  • dt_input (xr.DataTree): Input EOPF DataTree
  • groups (List[str]): List of group names to process as GeoZarr datasets
  • output_path (str): Output path for the Zarr store
  • spatial_chunk (int, default=4096): Spatial chunk size for encoding
  • min_dimension (int, default=256): Minimum dimension for overview levels
  • tile_width (int, default=256): Tile width for TMS compatibility
  • max_retries (int, default=3): Maximum number of retries for network operations

Returns:

  • xr.DataTree: DataTree containing the GeoZarr compliant data

setup_datatree_metadata_geozarr_spec_compliant

Set up GeoZarr-spec compliant CF standard names and CRS information.

Parameters:

  • dt (xr.DataTree): The data tree containing the datasets to process
  • groups (List[str]): List of group names to process as GeoZarr datasets

Returns:

  • Dict[str, xr.Dataset]: Dictionary of datasets with GeoZarr compliance applied
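
A hedged usage sketch based on the signature above (assuming the function is exported from the top-level package like create_geozarr_dataset; the group names are the examples used earlier in this README):

import xarray as xr
from eopf_geozarr import setup_datatree_metadata_geozarr_spec_compliant

dt = xr.open_datatree("path/to/eopf/dataset.zarr", engine="zarr")
datasets = setup_datatree_metadata_geozarr_spec_compliant(
    dt, ["/measurements/r10m", "/measurements/r20m", "/measurements/r60m"]
)
for group_name, ds in datasets.items():
    print(group_name, list(ds.data_vars))  # each dataset now carries CF/CRS metadata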

Utility Functions

downsample_2d_array

Downsample a 2D array using block averaging.
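
Block averaging replaces each non-overlapping 2x2 block of pixels with its mean. A minimal NumPy sketch of the technique (not the library's implementation; this version trims a trailing row/column when a dimension is odd):

import numpy as np

def downsample_2x(arr: np.ndarray) -> np.ndarray:
    """Halve both dimensions by averaging non-overlapping 2x2 blocks."""
    h, w = arr.shape
    arr = arr[: h // 2 * 2, : w // 2 * 2]  # drop a trailing row/column if odd
    return arr.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))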

calculate_aligned_chunk_size

Calculate a chunk size that divides evenly into the dimension size. This ensures that Zarr chunks align properly with the data dimensions, preventing chunk overlap issues when writing with Dask.

Parameters:

  • dimension_size (int): Size of the dimension to chunk
  • target_chunk_size (int): Desired chunk size

Returns:

  • int: Aligned chunk size that divides evenly into dimension_size

Example:

from eopf_geozarr.conversion.utils import calculate_aligned_chunk_size

# For a dimension of size 5490 with target chunk size 3660
aligned_size = calculate_aligned_chunk_size(5490, 3660)  # Returns 2745

is_grid_mapping_variable

Check if a variable is a grid_mapping variable by looking for references to it.

validate_existing_band_data

Validate that a specific band exists and is complete in the dataset.

Architecture

The library is organized into the following modules:

  • conversion: Core conversion tools for EOPF to GeoZarr transformation
    • geozarr.py: Main conversion functions and GeoZarr spec compliance
    • utils.py: Utility functions for data processing and validation
  • data_api: Data access API (future development with pydantic-zarr)

GeoZarr Specification Compliance

This library implements the GeoZarr specification 0.4 with the following key requirements:

  1. Array Dimensions: All arrays must have _ARRAY_DIMENSIONS attributes
  2. CF Standard Names: All variables must have CF-compliant standard_name attributes
  3. Grid Mapping: Data variables must reference CF grid_mapping variables via grid_mapping attributes
  4. Multiscales Structure: Overview levels are stored as child groups with proper tile matrix metadata
  5. Native CRS: Coordinate reference systems are preserved without reprojection
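
A quick way to spot-check these attributes on a converted store (a sketch; the group and array names depend on your input product):

import zarr

root = zarr.open_group("path/to/output/geozarr.zarr", mode="r")
group = root["measurements/r10m"]  # example group from this README
for name, array in group.arrays():
    print(name, dict(array.attrs))  # expect _ARRAY_DIMENSIONS, standard_name, grid_mapping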

Development

Setting up Development Environment

# Clone the repository
git clone <repository-url>
cd eopf-geozarr

# Install in development mode with all dependencies
pip install -e ".[dev,docs,all]"

# Install pre-commit hooks
pre-commit install

Running Tests

pytest

Code Quality

The project uses:

  • Black for code formatting
  • isort for import sorting
  • flake8 for linting
  • mypy for type checking
  • pre-commit for automated checks

Building Documentation

cd docs
make html

Contributing

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Make your changes
  4. Run tests and ensure code quality checks pass
  5. Commit your changes (git commit -m 'Add amazing feature')
  6. Push to the branch (git push origin feature/amazing-feature)
  7. Open a Pull Request

License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

Support

For questions, issues, or contributions, please visit the GitHub repository.