# EOPF GeoZarr

GeoZarr-compliant data model for EOPF (Earth Observation Processing Framework) datasets.

Turn EOPF datasets into a GeoZarr-style Zarr v3 store while:

- Preserving the native CRS (no forced TMS reprojection)
- Adding CF- and GeoZarr-compliant metadata
- Building /2 multiscale overviews
- Writing band data robustly, with retries and validation

## Overview

This library converts EOPF datatrees into Zarr v3 stores aligned with GeoZarr spec 0.4, without forcing web-mercator-style tiling. It focuses on scientific fidelity (native CRS), robust metadata (CF + GeoZarr), and operational resilience (retries and completeness auditing), while supporting /2 multiscale overviews.

## Key Features

- **GeoZarr Specification Compliance** (0.4 features implemented)
- **Native CRS Preservation** (UTM, polar, arbitrary projections)
- **Multiscale /2 Overviews** (COG-style hierarchy as child groups)
- **CF Conventions** (`standard_name`, `grid_mapping`, `_ARRAY_DIMENSIONS`)
- **Resilient Writing** (band-by-band with retries & auditing)
- **S3 & S3-Compatible Support** (AWS, OVH, MinIO, custom endpoints)
- **Optional Parallel Processing** (local Dask cluster)
- **Automatic Chunk Alignment** (prevents overlapping Dask/Zarr chunks)
- **HTML Summary & Validation Tools**
- **STAC & Benchmark Commands**
- **Consolidated Metadata** (faster opens)

## GeoZarr Compliance Features

- `_ARRAY_DIMENSIONS` attributes on all arrays
- CF grid mapping variables with `GeoTransform`
- Per-variable `grid_mapping` references
- Multiscales metadata structure on parent groups
- Native-CRS tile matrix logic (no forced EPSG:3857)
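
These attributes can be checked directly at the Zarr level after a conversion. The walk below is a minimal sketch: only the attribute names listed above are assumed, and the array and group names depend entirely on the product that was converted.

```python
# Sketch: walk a converted store and print its GeoZarr-related attributes.
# Only the attribute names from the list above are assumed; group and array
# names vary by product and multiscale level.
import zarr

def show(group, prefix=""):
    for name, array in group.arrays():
        dims = array.attrs.get("_ARRAY_DIMENSIONS")
        gm = array.attrs.get("grid_mapping")
        print(f"{prefix}/{name}: dims={dims}, grid_mapping={gm}")
    for name, sub in group.groups():
        if "multiscales" in sub.attrs:
            print(f"{prefix}/{name}: has multiscales metadata")
        show(sub, f"{prefix}/{name}")

root = zarr.open_group("/tmp/out_geozarr.zarr", mode="r")
show(root)
```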

## Installation

Stable release:

```bash
pip install eopf-geozarr
```

Development (uv):

```bash
uv sync --frozen
uv run eopf-geozarr --help
```

Editable install (pip):

```bash
pip install -e .[dev]
```

## Workflows

For Argo / batch orchestration, use https://github.com/EOPF-Explorer/data-model-pipeline.

## Quick Start (CLI)

Convert local → local:

```bash
eopf-geozarr convert input.zarr output_geozarr.zarr --groups /measurements/r10m /measurements/r20m
```

Remote → local:

```bash
eopf-geozarr convert \
  "https://.../S2B_MSIL2A_... .zarr" \
  "/tmp/S2B_MSIL2A_..._geozarr.zarr" \
  --groups /measurements/reflectance --verbose
```

Notes:

- Parent groups auto-expand to leaf datasets
- Overviews use /2 coarsening and are attached to the parent group's multiscales metadata
- Defaults: Blosc Zstd level 3, conservative chunking, metadata consolidation after the write

Inspect, generate an HTML report, and validate (validation counts only real data variables, skipping `spatial_ref`/`crs`):

```bash
eopf-geozarr info /tmp/..._geozarr.zarr --html report.html
eopf-geozarr validate /tmp/..._geozarr.zarr
```

## S3 Support

Environment variables:

```bash
export AWS_ACCESS_KEY_ID=...
export AWS_SECRET_ACCESS_KEY=...
export AWS_DEFAULT_REGION=eu-west-1
export AWS_ENDPOINT_URL=https://s3.your-endpoint.example  # optional custom endpoint (OVH, MinIO, etc.)
```

Write directly to S3:

```bash
eopf-geozarr convert input.zarr s3://my-bucket/path/output_geozarr.zarr --groups /measurements/r10m
```

Features:

- Credential validation before writing
- Custom endpoints (OVH, MinIO, etc.)
- Retry logic around object writes

## Parallel Processing with Dask

```bash
eopf-geozarr convert input.zarr out.zarr --dask-cluster --verbose
```

Benefits:

- Local cluster auto-start and cleanup
- Chunk alignment to prevent overlapping writes
- Better memory distribution for large scenes
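
When the conversion is driven from Python instead of the CLI, a similar setup can be approximated by starting a local cluster explicitly. This is a sketch under the assumption that `create_geozarr_dataset` uses whatever Dask scheduler is currently active (the `--dask-cluster` flag manages such a cluster for the CLI); the worker counts and memory limit are illustrative, not recommendations.

```python
# Sketch: explicit local Dask cluster around the Python API.
# Assumption: the conversion picks up the active distributed client.
import xarray as xr
from dask.distributed import Client, LocalCluster

from eopf_geozarr import create_geozarr_dataset

cluster = LocalCluster(n_workers=4, threads_per_worker=2, memory_limit="4GB")
client = Client(cluster)
try:
    dt = xr.open_datatree("path/to/eopf.zarr", engine="zarr")
    create_geozarr_dataset(
        dt_input=dt,
        groups=["/measurements/r10m"],
        output_path="/tmp/out_geozarr.zarr",
        spatial_chunk=4096,
    )
finally:
    client.close()
    cluster.close()
```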

## Python API

High-level dataset conversion:

```python
import xarray as xr
from eopf_geozarr import create_geozarr_dataset

dt = xr.open_datatree("path/to/eopf.zarr", engine="zarr")
out = create_geozarr_dataset(
    dt_input=dt,
    groups=["/measurements/r10m", "/measurements/r20m"],
    output_path="/tmp/out_geozarr.zarr",
    spatial_chunk=4096,
    min_dimension=256,
    tile_width=256,
)
```

Selective writer usage (advanced):

```python
from eopf_geozarr.conversion.geozarr import GeoZarrWriter

writer = GeoZarrWriter(output_path="/tmp/out.zarr", spatial_chunk=4096)
# writer.write_group(...)
```

## API Reference

`create_geozarr_dataset(dt_input, groups, output_path, spatial_chunk=4096, ...) -> xr.DataTree`
: Produce a GeoZarr-compliant hierarchy.

`setup_datatree_metadata_geozarr_spec_compliant(dt, groups) -> dict[str, xr.Dataset]`
: Apply CF + GeoZarr metadata to the selected groups.

`downsample_2d_array(source_data, target_h, target_w) -> np.ndarray`
: Block-average /2 overview generation primitive.

`calculate_aligned_chunk_size(dimension_size, target_chunk_size) -> int`
: Return a chunk size that divides the dimension evenly, avoiding overlapping chunks.
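
To make the last two helpers concrete, here is an illustrative sketch of the behaviour they describe. It is not the library's implementation (the real helpers may treat edge cases such as non-divisible dimensions differently); it only shows what "block averaging" and an "evenly dividing chunk" mean.

```python
# Illustrative sketch only, not the eopf_geozarr implementation.
import numpy as np

def calculate_aligned_chunk_size(dimension_size: int, target_chunk_size: int) -> int:
    """Largest chunk <= target that divides the dimension evenly."""
    for chunk in range(min(target_chunk_size, dimension_size), 0, -1):
        if dimension_size % chunk == 0:
            return chunk
    return dimension_size

def downsample_2d_array(source_data: np.ndarray, target_h: int, target_w: int) -> np.ndarray:
    """Block-average a 2D array down to (target_h, target_w); assumes exact divisibility."""
    h, w = source_data.shape
    return source_data.reshape(target_h, h // target_h, target_w, w // target_w).mean(axis=(1, 3))

calculate_aligned_chunk_size(10980, 4096)                 # -> 3660 for a Sentinel-2 10 m grid
downsample_2d_array(np.arange(16.0).reshape(4, 4), 2, 2)  # -> [[2.5, 4.5], [10.5, 12.5]]
```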

## Architecture

```
eopf_geozarr/
  commands/     # CLI subcommands (convert, validate, info, stac, benchmark)
  conversion/   # Core GeoZarr pipeline, helpers, multiscales, encodings
  metrics.py    # Lightweight metrics hooks (optional)
```

## Contributing to the GeoZarr Specification

Topics this work has contributed to in upstream specification discussions:

- Arbitrary CRS preservation
- Chunking performance and strategies
- Multiscale hierarchy clarity

## Benchmark & STAC Commands

Benchmark read performance:

```bash
eopf-geozarr benchmark /tmp/out_geozarr.zarr --samples 8 --window 1024 1024
```

Generate a draft STAC collection:

```bash
eopf-geozarr stac /tmp/out_geozarr.zarr /tmp/collection.json \
  --bbox "minx miny maxx maxy" --start 2025-01-01T00:00:00Z --end 2025-01-31T23:59:59Z
```

## What Gets Written

- `_ARRAY_DIMENSIONS` per variable (deterministic axis order)
- Per-variable `grid_mapping` referencing `spatial_ref`, which holds the CRS/georeferencing
- Multiscales metadata on parent groups; /2 overviews
- Blosc Zstd compression, conservative chunking
- Consolidated metadata index
- Band attributes propagated across overview levels (`grid_mapping` reattached at each level)

## Consolidated Metadata

Consolidated metadata speeds up opening the store. It is not yet part of the core Zarr v3 specification (discussion is ongoing), but the data remains valid either way; disable consolidation during the write, or remove the index afterwards, if strict spec minimalism is required.
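
As a small sketch of the round trip (forwarding of the `consolidated` flag through xarray's zarr backend is assumed here; support for the Zarr v3 consolidated index depends on your zarr-python version):

```python
import xarray as xr
import zarr

# (Re)write the consolidated index, e.g. after manual edits to the store.
zarr.consolidate_metadata("/tmp/out_geozarr.zarr")

# Readers can opt out of the index at open time if strict minimalism is preferred.
dt = xr.open_datatree("/tmp/out_geozarr.zarr", engine="zarr", consolidated=False)
```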

## Troubleshooting

| Symptom | Cause | Fix |
|---------|-------|-----|
| Parent group appears empty | Only leaf groups hold arrays | Select leaf groups with `--groups`, or rely on auto-expansion |
| Overlapping chunk error | Dask chunks misaligned with encoding chunks | Allow automatic chunk alignment, or reduce `spatial_chunk` |
| S3 auth failure | Missing environment variables or endpoint | Export the `AWS_*` variables and, for custom endpoints, set `AWS_ENDPOINT_URL` |
| HTML output path is a directory | The provided path is not a file | A default filename is created inside the directory |

## Development & Contributing

```bash
git clone <repo-url>
cd eopf-geozarr
pip install -e '.[dev]'
pre-commit install
pytest
```

Quality stack: Black, isort, Ruff, Mypy, Pytest, Coverage.

## License & Acknowledgments

Apache 2.0. Built on top of xarray, zarr, and dask; follows the evolving GeoZarr specification.

---

For questions or issues, open a GitHub issue.