# EOPF GeoZarr

GeoZarr compliant data model for EOPF (Earth Observation Processing Framework) datasets.

Turn EOPF datasets into a GeoZarr-style Zarr v3 store while:
- Preserving native CRS (no forced TMS reprojection)
- Adding CF + GeoZarr compliant metadata
- Building /2 multiscale overviews
- Writing robust, retry-aware band data with validation

## Overview

This library converts EOPF datatrees into GeoZarr spec 0.4 aligned Zarr v3 stores without forcing web-mercator-style tiling. It focuses on scientific fidelity (native CRS), robust metadata (CF + GeoZarr), and operational resilience (retries plus completeness auditing) while supporting multiscale /2 overviews.

## Key Features

- **GeoZarr Specification Compliance** (0.4 features implemented)
- **Native CRS Preservation** (UTM, polar, arbitrary projections)
- **Multiscale /2 Overviews** (COG-style hierarchy as child groups)
- **CF Conventions** (`standard_name`, `grid_mapping`, `_ARRAY_DIMENSIONS`)
- **Resilient Writing** (band-by-band with retries & auditing)
- **S3 & S3-Compatible Support** (AWS, OVH, MinIO, custom endpoints)
- **Optional Parallel Processing** (local Dask cluster)
- **Automatic Chunk Alignment** (prevents overlapping Dask/Zarr chunks)
- **HTML Summary & Validation Tools**
- **STAC & Benchmark Commands**
- **Consolidated Metadata** (faster open)

## GeoZarr Compliance Features

- `_ARRAY_DIMENSIONS` attributes on all arrays
- CF standard names for all variables
- `grid_mapping` attributes referencing CF grid_mapping variables
- `GeoTransform` attributes in grid_mapping variables
- Proper multiscales metadata structure
- Native CRS tile matrix sets

## Installation

```bash
pip install eopf-geozarr
```

Development (uv):
```bash
uv sync --frozen
uv run eopf-geozarr --help
```

Editable (pip):
```bash
pip install -e '.[dev]'
```

## Quick Start

### Command Line Interface

After installation, use the `eopf-geozarr` command:

```bash
# Convert EOPF dataset to GeoZarr format (local output)
eopf-geozarr convert input.zarr output.zarr

# Convert specific groups (e.g. resolution groups)
eopf-geozarr convert input.zarr output.zarr --groups /measurements/r10m /measurements/r20m

# Convert EOPF dataset to GeoZarr format (S3 output)
eopf-geozarr convert input.zarr s3://my-bucket/path/to/output.zarr

# Convert with parallel processing using a Dask cluster
eopf-geozarr convert input.zarr output.zarr --dask-cluster

# Convert with Dask cluster and verbose output
eopf-geozarr convert input.zarr output.zarr --dask-cluster --verbose

# Generate an HTML summary while inspecting
eopf-geozarr info output.zarr --html report.html

# Get information about a dataset
eopf-geozarr info input.zarr

# Validate GeoZarr compliance
eopf-geozarr validate output.zarr

# Benchmark access patterns (optional)
eopf-geozarr benchmark output.zarr --samples 8 --window 1024 1024

# Produce draft STAC artifacts
eopf-geozarr stac output.zarr stac_collection.json \
  --bbox "minx miny maxx maxy" --start 2025-01-01T00:00:00Z --end 2025-01-31T23:59:59Z

# Get help
eopf-geozarr --help
```

#### Notes

- Parent groups auto-expand to leaf datasets when omitted.
- Multiscale overviews are generated with /2 coarsening and attached as child groups.
- Defaults: Blosc Zstd (level 3), conservative chunking, metadata consolidation enabled.
- Use `--groups` to limit processing or speed up experimentation.

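The /2 coarsening behind the overviews can be pictured as plain block averaging. The sketch below is an illustrative NumPy reimplementation, not the library's internal code:

```python
import numpy as np


def downsample_by_2(band: np.ndarray) -> np.ndarray:
    """Average non-overlapping 2x2 blocks to produce one /2 overview level."""
    h, w = band.shape
    # Trim odd edges so the array reshapes cleanly into 2x2 blocks
    band = band[: h - h % 2, : w - w % 2]
    return band.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))


full = np.arange(16, dtype=float).reshape(4, 4)
overview = downsample_by_2(full)  # shape (2, 2)
```

Applying the function repeatedly yields the COG-style pyramid of child groups described above.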
## S3 Support

Environment variables:
```bash
export AWS_ACCESS_KEY_ID=...
export AWS_SECRET_ACCESS_KEY=...
export AWS_DEFAULT_REGION=eu-west-1
export AWS_ENDPOINT_URL=https://s3.your-endpoint.example  # optional custom endpoint
```

Write directly:
```bash
eopf-geozarr convert input.zarr s3://my-bucket/path/output_geozarr.zarr --groups /measurements/r10m
```

Features:
- Credential validation before write
- Custom endpoints (OVH, MinIO, etc.)
- Retry logic around object writes

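The retry behavior around object writes can be sketched as an exponential backoff loop. This is a hedged illustration; the library's actual attempt count, delays, and exception types may differ:

```python
import time


def write_with_retries(write_fn, max_attempts: int = 3, base_delay: float = 0.5):
    """Call write_fn(), retrying transient failures with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return write_fn()
        except OSError:  # stand-in for transient S3 errors
            if attempt == max_attempts:
                raise
            time.sleep(base_delay * 2 ** (attempt - 1))


# Example: a flaky writer that succeeds on the third call
calls = {"n": 0}


def flaky_write():
    calls["n"] += 1
    if calls["n"] < 3:
        raise OSError("transient failure")
    return "ok"


result = write_with_retries(flaky_write, base_delay=0.01)
```

Combined with the completeness auditing mentioned above, this keeps a partially failed upload from silently producing a truncated store.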
## Parallel Processing with Dask

```bash
eopf-geozarr convert input.zarr out.zarr --dask-cluster --verbose
```

Benefits:
- Local cluster auto-start & cleanup
- Chunk alignment to prevent overlapping writes
- Better memory distribution for large scenes

## Python API

High-level dataset conversion:
```python
import xarray as xr
from eopf_geozarr import create_geozarr_dataset

dt = xr.open_datatree("path/to/eopf.zarr", engine="zarr")
out = create_geozarr_dataset(
    dt_input=dt,
    groups=["/measurements/r10m", "/measurements/r20m"],
    output_path="/tmp/out_geozarr.zarr",
    spatial_chunk=4096,
    min_dimension=256,
    tile_width=256,
)
```

Selective writer usage (advanced):
```python
from eopf_geozarr.conversion.geozarr import GeoZarrWriter

writer = GeoZarrWriter(output_path="/tmp/out.zarr", spatial_chunk=4096)
# writer.write_group(...)
```

## API Reference

`create_geozarr_dataset(dt_input, groups, output_path, spatial_chunk=4096, ...) -> xr.DataTree`
: Produce a GeoZarr-compliant hierarchy.

`setup_datatree_metadata_geozarr_spec_compliant(dt, groups) -> dict[str, xr.Dataset]`
: Apply CF + GeoZarr metadata to selected groups.

`downsample_2d_array(source_data, target_h, target_w) -> np.ndarray`
: Block-average /2 overview generation primitive.

`calculate_aligned_chunk_size(dimension_size, target_chunk_size) -> int`
: Returns an evenly dividing chunk size to avoid overlap.

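The alignment rule behind `calculate_aligned_chunk_size` can be illustrated as: pick the largest chunk size no bigger than the target that divides the dimension evenly, so Dask chunks and Zarr encoding chunks never overlap. The function below is a plausible sketch of that rule, not the library's exact implementation:

```python
def aligned_chunk_size(dimension_size: int, target: int) -> int:
    """Largest divisor of dimension_size that is <= target (illustrative)."""
    for candidate in range(min(target, dimension_size), 0, -1):
        if dimension_size % candidate == 0:
            return candidate
    return dimension_size  # unreachable: 1 always divides


# A Sentinel-2 10 m grid is 10980 pixels wide; a 4096 target snaps down
chunk = aligned_chunk_size(10980, 4096)  # -> 3660, and 10980 % 3660 == 0
```

Because every chunk boundary falls on a multiple of the chosen size, no two Dask tasks ever write into the same Zarr chunk.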
## Architecture

```
eopf_geozarr/
  commands/    # CLI subcommands (convert, validate, info, stac, benchmark)
  conversion/  # Core geozarr pipeline, helpers, multiscales, encodings
  metrics.py   # Lightweight metrics hooks (optional)
```

## Contributing to GeoZarr Specification

Upstream issue discussions influenced:
- Arbitrary CRS preservation
- Chunking performance & strategies
- Multiscale hierarchy clarity

## Benchmark & STAC Commands

Benchmark:
```bash
eopf-geozarr benchmark /tmp/out_geozarr.zarr --samples 8 --window 1024 1024
```

STAC draft artifacts:
```bash
eopf-geozarr stac /tmp/out_geozarr.zarr /tmp/collection.json \
  --bbox "minx miny maxx maxy" --start 2025-01-01T00:00:00Z --end 2025-01-31T23:59:59Z
```

## What Gets Written

- `_ARRAY_DIMENSIONS` per variable (deterministic axis order)
- Per-variable `grid_mapping` referencing `spatial_ref`
- Multiscales metadata on parent groups; /2 overviews
- Blosc Zstd compression, conservative chunking
- Consolidated metadata index
- Band attribute propagation across levels

## Consolidated Metadata

Consolidated metadata speeds up opening the store. Its place in the core Zarr v3 spec is still under discussion; if strict minimalism is required, disable consolidation. The data stays valid either way.

## Troubleshooting

| Symptom | Cause | Fix |
|---------|-------|-----|
| Parent group appears empty | Only leaf groups hold arrays | Use `--groups` or rely on auto-expansion |
| Overlapping chunk error | Misaligned Dask vs. encoding chunks | Allow automatic chunk alignment or reduce `spatial_chunk` |
| S3 auth failure | Missing env vars or endpoint | Export `AWS_*` vars / set `AWS_ENDPOINT_URL` |
| HTML path is a directory | Provided path is not a file | A default filename is created inside |

## Development & Contributing

The preferred (reproducible) workflow uses [uv](https://github.com/astral-sh/uv):

```bash
git clone <repo-url>
cd eopf-geozarr

# Ensure uv is installed (macOS/Linux quick install)
curl -Ls https://astral.sh/uv/install.sh | sh  # or follow the official docs

# Create and sync the environment with dev extras
uv sync --extra dev

# Run tools through uv (ensures the correct virtual env)
uv run pre-commit install
uv run pytest -q
```

Common tasks:
```bash
uv run ruff check .
uv run mypy src
uv run eopf-geozarr --help
```

Fallback (less reproducible) pip editable install:
```bash
pip install -e '.[dev]'
pre-commit install
pytest
```

Quality stack: Ruff (lint + format), isort (via Ruff), Mypy (strict), Pytest, Coverage.

## License & Acknowledgments

Apache 2.0. Built atop xarray, zarr, and dask; follows the evolving GeoZarr specification.

---
For questions or issues, open a GitHub issue.