GeoZarr-compliant data model for EOPF (Earth Observation Processing Framework) datasets.
This library provides tools to convert EOPF datasets to a GeoZarr-spec 0.4 compliant format while maintaining native projections and using /2 downsampling logic for multiscale support.
- GeoZarr Specification Compliance: Full compliance with GeoZarr spec 0.4
- Native CRS Preservation: No reprojection to TMS, maintains original coordinate reference systems
- Multiscale Support: COG-style /2 downsampling with overview levels as children groups
- CF Conventions: Proper CF standard names and grid_mapping attributes
- Robust Processing: Band-by-band writing with validation and retry logic
- S3 Support: Direct output to Amazon S3 buckets with automatic credential validation
- Parallel Processing: Optional dask cluster support for parallel chunk processing
- Chunk Alignment: Automatic chunk alignment to prevent data corruption with dask
- `_ARRAY_DIMENSIONS` attributes on all arrays
- CF standard names for all variables
- `grid_mapping` attributes referencing CF grid_mapping variables
- `GeoTransform` attributes in grid_mapping variables
- Proper multiscales metadata structure
- Native CRS tile matrix sets
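Concretely, the attribute layout on a compliant variable looks roughly like this (an illustrative sketch with example values, not actual library output):

```python
# Illustrative sketch (not actual library output): the attribute layout GeoZarr expects.
import numpy as np
import xarray as xr

ds = xr.Dataset({"b04": (("y", "x"), np.zeros((1024, 1024), dtype="uint16"))})
ds["b04"].attrs["standard_name"] = "toa_bidirectional_reflectance"  # CF standard name
ds["b04"].attrs["grid_mapping"] = "spatial_ref"  # points at the CF grid_mapping variable
# The grid_mapping variable carries the native CRS plus an affine GeoTransform string:
ds["spatial_ref"] = xr.DataArray(0, attrs={"crs_wkt": "...", "GeoTransform": "600000 10 0 5000000 0 -10"})
# xarray's Zarr backend adds the _ARRAY_DIMENSIONS attribute to every array on write.
```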
Install with pip:

pip install eopf-geozarr
For development:
git clone <repository-url>
cd eopf-geozarr
pip install -e ".[dev]"
After installation, you can use the `eopf-geozarr` command:
# Convert EOPF dataset to GeoZarr format (local output)
eopf-geozarr convert input.zarr output.zarr
# Convert EOPF dataset to GeoZarr format (S3 output)
eopf-geozarr convert input.zarr s3://my-bucket/path/to/output.zarr
# Convert with parallel processing using dask cluster
eopf-geozarr convert input.zarr output.zarr --dask-cluster
# Convert with dask cluster and verbose output
eopf-geozarr convert input.zarr output.zarr --dask-cluster --verbose
# Get information about a dataset
eopf-geozarr info input.zarr
# Validate GeoZarr compliance
eopf-geozarr validate output.zarr
# Get help
eopf-geozarr --help
The library supports direct output to S3-compatible storage, including custom providers like OVH Cloud. Simply provide an S3 URL as the output path:
# Convert to S3
eopf-geozarr convert local_input.zarr s3://my-bucket/geozarr-data/output.zarr --verbose
Before using S3 output, ensure your S3 credentials are configured:
For AWS S3:
export AWS_ACCESS_KEY_ID=your_access_key
export AWS_SECRET_ACCESS_KEY=your_secret_key
export AWS_DEFAULT_REGION=us-east-1
For OVH Cloud Object Storage:
export AWS_ACCESS_KEY_ID=your_ovh_access_key
export AWS_SECRET_ACCESS_KEY=your_ovh_secret_key
export AWS_DEFAULT_REGION=gra # or other OVH region
export AWS_ENDPOINT_URL=https://s3.gra.cloud.ovh.net # OVH endpoint
For other S3-compatible providers:
export AWS_ACCESS_KEY_ID=your_access_key
export AWS_SECRET_ACCESS_KEY=your_secret_key
export AWS_DEFAULT_REGION=your_region
export AWS_ENDPOINT_URL=https://your-s3-endpoint.com
Alternative: AWS CLI Configuration
aws configure
# Note: For custom endpoints, you'll still need to set AWS_ENDPOINT_URL
- Custom Endpoints: Support for any S3-compatible storage (AWS, OVH Cloud, MinIO, etc.)
- Automatic Validation: The tool validates S3 access before starting conversion
- Credential Detection: Automatically detects and validates S3 credentials
- Error Handling: Provides helpful error messages for S3 configuration issues
- Performance: Optimized for S3 with proper chunking and retry logic
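If you want to verify S3 access yourself before starting a long conversion, a quick probe with s3fs works (an assumption for illustration; the tool's own validation may differ, and any S3 client will do):

```python
# Quick manual probe of S3 access; uses s3fs (illustrative, not the tool's internal check).
import os
import s3fs

fs = s3fs.S3FileSystem(
    key=os.environ["AWS_ACCESS_KEY_ID"],
    secret=os.environ["AWS_SECRET_ACCESS_KEY"],
    client_kwargs={"endpoint_url": os.environ.get("AWS_ENDPOINT_URL")},  # None for plain AWS
)
fs.ls("my-bucket")  # raises an error if credentials or endpoint are misconfigured
```

Here `my-bucket` is a placeholder for your target bucket.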
The library supports parallel processing using dask clusters for improved performance on large datasets:
# Enable dask cluster for parallel processing
eopf-geozarr convert input.zarr output.zarr --dask-cluster
# With verbose output to see cluster information
eopf-geozarr convert input.zarr output.zarr --dask-cluster --verbose
- Local Cluster: Automatically starts a local dask cluster with multiple workers
- Dashboard Access: Provides access to the dask dashboard for monitoring (shown in verbose mode)
- Automatic Cleanup: Properly closes the cluster even if errors occur during processing
- Chunk Alignment: Automatically aligns Zarr chunks with dask chunks to prevent data corruption
- Memory Efficiency: Better memory management through parallel chunk processing
- Error Handling: Graceful handling of dask import errors with helpful installation instructions
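In your own scripts, the equivalent of what `--dask-cluster` sets up can be done with dask.distributed (a minimal sketch; worker counts and memory limits are left at their defaults):

```python
# Minimal local dask cluster, roughly what --dask-cluster provides (a sketch).
from dask.distributed import Client, LocalCluster

cluster = LocalCluster()      # local workers sized to the machine
client = Client(cluster)
print(client.dashboard_link)  # dashboard URL for monitoring, as shown in --verbose mode
try:
    pass  # run the conversion here, e.g. create_geozarr_dataset(...)
finally:
    client.close()
    cluster.close()           # clean up even if the conversion fails
```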
The library includes advanced chunk alignment logic to prevent the common issue of overlapping chunks when using dask:
- Smart Detection: Automatically detects if data is dask-backed and uses existing chunk structure
- Aligned Calculation: Uses `calculate_aligned_chunk_size()` to find optimal chunk sizes that divide evenly into data dimensions
- Proper Rechunking: Ensures datasets are rechunked to match encoding before writing
- Fallback Logic: For non-dask arrays, uses reasonable chunk sizes that don't exceed data dimensions
This prevents errors like:
❌ Failed to write tci after 2 attempts: Specified Zarr chunks encoding['chunks']=(1, 3660, 3660)
for variable named 'tci' would overlap multiple Dask chunks
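The conversion pipeline handles this alignment for you; if you hit the same error in your own dask code, rechunking with the library's utility resolves it (a sketch, assuming spatial dimensions named `x` and `y`):

```python
# Rechunk spatial dimensions to sizes that divide evenly, avoiding the overlap error above.
import xarray as xr
from eopf_geozarr.conversion.utils import calculate_aligned_chunk_size

ds = xr.open_dataset("path/to/input.zarr", engine="zarr", chunks={})
aligned = {
    dim: calculate_aligned_chunk_size(ds.sizes[dim], 3660)  # 3660 = target chunk size
    for dim in ("x", "y")
    if dim in ds.sizes
}
ds = ds.chunk(aligned)  # Zarr chunk boundaries now coincide with dask chunk boundaries
```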
import os
import xarray as xr
from eopf_geozarr import create_geozarr_dataset
# Configure for OVH Cloud (example)
os.environ['AWS_ACCESS_KEY_ID'] = 'your_ovh_access_key'
os.environ['AWS_SECRET_ACCESS_KEY'] = 'your_ovh_secret_key'
os.environ['AWS_DEFAULT_REGION'] = 'gra'
os.environ['AWS_ENDPOINT_URL'] = 'https://s3.gra.cloud.ovh.net'
# Load your EOPF DataTree
dt = xr.open_datatree("path/to/eopf/dataset.zarr", engine="zarr")
# Convert directly to S3
dt_geozarr = create_geozarr_dataset(
    dt_input=dt,
    groups=["/measurements/r10m", "/measurements/r20m", "/measurements/r60m"],
    output_path="s3://my-bucket/geozarr-data/output.zarr",
    spatial_chunk=4096,
    min_dimension=256,
    tile_width=256,
    max_retries=3
)
import xarray as xr
from eopf_geozarr import create_geozarr_dataset
# Load your EOPF DataTree
dt = xr.open_datatree("path/to/eopf/dataset.zarr", engine="zarr")
# Define groups to convert (e.g., resolution groups)
groups = ["/measurements/r10m", "/measurements/r20m", "/measurements/r60m"]
# Convert to GeoZarr compliant format
dt_geozarr = create_geozarr_dataset(
    dt_input=dt,
    groups=groups,
    output_path="path/to/output/geozarr.zarr",
    spatial_chunk=4096,
    min_dimension=256,
    tile_width=256,
    max_retries=3
)
`create_geozarr_dataset`: Create a GeoZarr-spec 0.4 compliant dataset from EOPF data.
Parameters:

- `dt_input` (xr.DataTree): Input EOPF DataTree
- `groups` (List[str]): List of group names to process as GeoZarr datasets
- `output_path` (str): Output path for the Zarr store
- `spatial_chunk` (int, default=4096): Spatial chunk size for encoding
- `min_dimension` (int, default=256): Minimum dimension for overview levels
- `tile_width` (int, default=256): Tile width for TMS compatibility
- `max_retries` (int, default=3): Maximum number of retries for network operations

Returns:

- `xr.DataTree`: DataTree containing the GeoZarr compliant data
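As a rule of thumb, `min_dimension` bounds how many /2 overview levels are produced: halving stops once the grid would drop below it (a back-of-envelope sketch, not the library's exact logic):

```python
# Back-of-envelope estimate of /2 overview levels (not the library's exact logic).
import math

def estimated_overview_levels(native_size: int, min_dimension: int = 256) -> int:
    # Number of times the dimension can be halved before falling below min_dimension.
    return max(0, math.floor(math.log2(native_size / min_dimension)))

print(estimated_overview_levels(10980))  # Sentinel-2 10 m grid: 5 levels (10980 / 2**5 > 256)
```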
Set up GeoZarr-spec compliant CF standard names and CRS information.
Parameters:

- `dt` (xr.DataTree): The data tree containing the datasets to process
- `groups` (List[str]): List of group names to process as GeoZarr datasets

Returns:

- `Dict[str, xr.Dataset]`: Dictionary of datasets with GeoZarr compliance applied
Downsample a 2D array using block averaging.
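For reference, block averaging collapses each 2×2 block of pixels to its mean; a standalone NumPy illustration of the idea (not the library's implementation):

```python
# Standalone illustration of 2x2 block averaging (not the library's implementation).
import numpy as np

def block_average_2x2(arr: np.ndarray) -> np.ndarray:
    h, w = arr.shape
    h, w = h - h % 2, w - w % 2  # trim odd edges so 2x2 blocks tile evenly
    return arr[:h, :w].reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

a = np.arange(16, dtype=float).reshape(4, 4)
print(block_average_2x2(a))  # [[2.5, 4.5], [10.5, 12.5]]
```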
Calculate a chunk size that divides evenly into the dimension size. This ensures that Zarr chunks align properly with the data dimensions, preventing chunk overlap issues when writing with Dask.
Parameters:

- `dimension_size` (int): Size of the dimension to chunk
- `target_chunk_size` (int): Desired chunk size

Returns:

- `int`: Aligned chunk size that divides evenly into `dimension_size`
Example:
from eopf_geozarr.conversion.utils import calculate_aligned_chunk_size
# For a dimension of size 5490 with target chunk size 3660
aligned_size = calculate_aligned_chunk_size(5490, 3660)  # Returns 2745, the largest divisor of 5490 not exceeding 3660
Check if a variable is a grid_mapping variable by looking for references to it.
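In spirit, the check scans the dataset for variables whose `grid_mapping` attribute names the candidate (a hypothetical re-implementation for illustration; the library's version may differ):

```python
# Hypothetical sketch: a variable is a grid_mapping variable if any data variable
# references it through its grid_mapping attribute.
import xarray as xr

def is_grid_mapping_variable(ds: xr.Dataset, name: str) -> bool:
    return any(var.attrs.get("grid_mapping") == name for var in ds.data_vars.values())
```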
Validate that a specific band exists and is complete in the dataset.
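A minimal version of that check might look like this (hypothetical sketch; the library's validation is more thorough):

```python
# Hypothetical sketch of a band existence/completeness check.
import xarray as xr

def band_exists_and_complete(ds: xr.Dataset, band: str) -> bool:
    if band not in ds.data_vars:
        return False
    return not bool(ds[band].isnull().all())  # reject bands with no valid data at all
```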
The library is organized into the following modules:
- `conversion`: Core conversion tools for EOPF to GeoZarr transformation
  - `geozarr.py`: Main conversion functions and GeoZarr spec compliance
  - `utils.py`: Utility functions for data processing and validation
- `data_api`: Data access API (future development with pydantic-zarr)
This library implements the GeoZarr specification 0.4 with the following key requirements:
- Array Dimensions: All arrays must have `_ARRAY_DIMENSIONS` attributes
- CF Standard Names: All variables must have CF-compliant `standard_name` attributes
- Grid Mapping: Data variables must reference CF grid_mapping variables via `grid_mapping` attributes
- Multiscales Structure: Overview levels are stored as children groups with proper tile matrix metadata
- Native CRS: Coordinate reference systems are preserved without reprojection
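Beyond `eopf-geozarr validate`, you can spot-check these requirements directly with the zarr API (an illustrative walk, assuming Zarr v2-style attributes):

```python
# Illustrative spot-check of _ARRAY_DIMENSIONS across the store (assumes Zarr v2-style attrs).
import zarr

def walk_arrays(group, prefix=""):
    for name, arr in group.arrays():
        yield f"{prefix}{name}", arr
    for name, sub in group.groups():
        yield from walk_arrays(sub, f"{prefix}{name}/")

root = zarr.open_group("output.zarr", mode="r")
for path, arr in walk_arrays(root):
    assert "_ARRAY_DIMENSIONS" in arr.attrs, f"{path} is missing _ARRAY_DIMENSIONS"
```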
# Clone the repository
git clone <repository-url>
cd eopf-geozarr
# Install in development mode with all dependencies
pip install -e ".[dev,docs,all]"
# Install pre-commit hooks
pre-commit install
Run the tests:

pytest
The project uses:
- Black for code formatting
- isort for import sorting
- flake8 for linting
- mypy for type checking
- pre-commit for automated checks
Build the documentation:

cd docs
make html
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Make your changes
- Run tests and ensure code quality checks pass
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
- Built on top of the excellent xarray and zarr libraries
- Follows the GeoZarr specification for geospatial data in Zarr
- Designed for compatibility with EOPF datasets
For questions, issues, or contributions, please visit the GitHub repository.