|
1 | 1 | # EOPF GeoZarr
|
2 | 2 |
|
3 |
| -Turn EOPF datasets into a GeoZarr-style Zarr v3 store. Keep the data values intact and add standard geospatial metadata, multiscale overviews, and per-variable dimensions. |
4 |
| - |
5 |
| -## Quick Start |
6 |
| - |
7 |
| -Install (uv): |
| 3 | +GeoZarr-compliant data model for EOPF (Earth Observation Processing Framework) datasets. |
| 4 | + |
| 5 | +Turn EOPF datasets into a GeoZarr-style Zarr v3 store while: |
| 6 | +- Preserving native CRS (no forced TMS reprojection) |
| 7 | +- Adding CF + GeoZarr compliant metadata |
| 8 | +- Building /2 multiscale overviews |
| 9 | +- Writing robust, retry-aware band data with validation |
| 10 | + |
| 11 | +--- |
| 12 | +## Table of Contents |
| 13 | +1. Overview & Key Features |
| 14 | +2. GeoZarr Compliance Details |
| 15 | +3. Installation |
| 16 | +4. Quick Start (CLI) |
| 17 | +5. S3 / Object Storage |
| 18 | +6. Parallel Processing (Dask) |
| 19 | +7. Python API Examples |
| 20 | +8. API Reference (Selected) |
| 21 | +9. Architecture |
| 22 | +10. Contributing to GeoZarr Spec |
| 23 | +11. Benchmark & STAC |
| 24 | +12. What Gets Written |
| 25 | +13. Consolidated Metadata |
| 26 | +14. Troubleshooting |
| 27 | +15. Development & Contributing |
| 28 | +16. License & Acknowledgments |
| 29 | + |
| 30 | +--- |
| 31 | +## 1. Overview & Key Features |
| 32 | + |
| 33 | +- Full GeoZarr spec 0.4 alignment |
| 34 | +- Native CRS preservation (UTM, polar, etc.) |
| 35 | +- Multiscale /2 overviews (COG-style) with proper hierarchy |
| 36 | +- CF `standard_name`, `_ARRAY_DIMENSIONS`, `grid_mapping` correctness |
| 37 | +- Band-by-band resilient writing (retry + completeness auditing) |
| 38 | +- S3 & generic S3-compatible storage support (AWS, OVH, MinIO) |
| 39 | +- Optional Dask-backed parallel chunk processing |
| 40 | +- Chunk alignment to avoid overlapping dask/Zarr chunk layouts |
| 41 | +- HTML summaries, validation, STAC scaffolding, benchmarking |
| 42 | + |
| 43 | +## 2. GeoZarr Compliance Details |
| 44 | + |
| 45 | +Implements: |
| 46 | +- `_ARRAY_DIMENSIONS` on all arrays |
| 47 | +- CF grid mapping variables with `GeoTransform` info |
| 48 | +- Per-variable `grid_mapping` references |
| 49 | +- Multiscales metadata structure on parent groups |
| 50 | +- Native CRS tile matrix logic (no forced EPSG:3857) |
| 51 | + |
| 52 | +## 3. Installation |
| 53 | + |
| 54 | +Stable: |
| 55 | +```bash |
| 56 | +pip install eopf-geozarr |
| 57 | +``` |
8 | 58 |
|
| 59 | +Development (uv): |
9 | 60 | ```bash
|
10 | 61 | uv sync --frozen
|
11 | 62 | uv run eopf-geozarr --help
|
12 | 63 | ```
|
13 | 64 |
|
14 |
| -Or pip: |
15 |
| - |
| 65 | +Editable (pip): |
16 | 66 | ```bash
|
17 |
| -pip install -e . |
| 67 | +pip install -e '.[dev]' |
18 | 68 | ```
|
19 | 69 |
|
20 |
| -## Workflows |
| 70 | +## 4. Quick Start (CLI) |
21 | 71 |
|
22 |
| -For Argo / batch orchestration use: https://github.com/EOPF-Explorer/data-model-pipeline |
23 |
| - |
24 |
| -## Convert |
| 72 | +Convert local → local: |
| 73 | +```bash |
| 74 | +eopf-geozarr convert input.zarr output_geozarr.zarr --groups /measurements/r10m /measurements/r20m |
| 75 | +``` |
25 | 76 |
|
26 | 77 | Remote → local:
|
27 |
| - |
28 | 78 | ```bash
|
29 |
| -uv run eopf-geozarr convert \ |
| 79 | +eopf-geozarr convert \ |
30 | 80 | "https://.../S2B_MSIL2A_... .zarr" \
|
31 | 81 | "/tmp/S2B_MSIL2A_..._geozarr.zarr" \
|
32 |
| - --groups /measurements/reflectance \ |
33 |
| - --verbose |
| 82 | + --groups /measurements/reflectance --verbose |
34 | 83 | ```
|
35 | 84 |
|
36 | 85 | Notes:
|
37 |
| -- Parent groups auto-expand to leaf datasets. |
38 |
| -- Overviews use /2 coarsening; multiscales live on parent groups. |
39 |
| -- Defaults: Blosc Zstd, conservative chunking, metadata consolidation after write. |
| 86 | +- Parent groups auto-expand to leaf datasets |
| 87 | +- Overviews: /2 coarsening, attached at parent multiscales |
| 88 | +- Defaults: Blosc Zstd level 3, conservative chunking, metadata consolidation |
40 | 89 |
|
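To make the /2 coarsening note concrete, here is a minimal sketch of how a pyramid of overview shapes could be derived. The function name `overview_shapes`, the floor division, and the stopping rule (stop once the next level would drop below `min_dimension` on either axis) are assumptions for illustration, not the package's actual logic.

```python
def overview_shapes(height, width, min_dimension=256):
    """Return (height, width) per pyramid level, full resolution first."""
    shapes = [(height, width)]
    # Assumed rule: keep halving while the next level would still be
    # at least min_dimension pixels along its shorter axis.
    while min(height, width) // 2 >= min_dimension:
        height, width = height // 2, width // 2
        shapes.append((height, width))
    return shapes


# A 10980 x 10980 grid (the size of a Sentinel-2 10 m tile).
shapes = overview_shapes(10980, 10980)
for level, shape in enumerate(shapes):
    print(level, shape)
```

Each level halves both axes, so every overview holds roughly a quarter of the pixels of the level above it.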
41 |
| -## S3 |
| 90 | +Info / HTML / Validate: |
| 91 | +```bash |
| 92 | +eopf-geozarr info /tmp/..._geozarr.zarr --html report.html |
| 93 | +eopf-geozarr validate /tmp/..._geozarr.zarr |
| 94 | +``` |
42 | 95 |
|
43 |
| -Env for S3/S3-compatible storage: |
| 96 | +## 5. S3 / Object Storage |
44 | 97 |
|
| 98 | +Environment vars: |
45 | 99 | ```bash
|
46 | 100 | export AWS_ACCESS_KEY_ID=...
|
47 | 101 | export AWS_SECRET_ACCESS_KEY=...
|
48 |
| -export AWS_REGION=eu-west-1 |
49 |
| -# Custom endpoint (OVH, MinIO, etc.) |
50 |
| -export AWS_ENDPOINT_URL=https://s3.your-endpoint.example |
| 102 | +export AWS_DEFAULT_REGION=eu-west-1 |
| 103 | +export AWS_ENDPOINT_URL=https://s3.your-endpoint.example # optional custom endpoint |
51 | 104 | ```
|
52 | 105 |
|
53 |
| -Write to S3: |
54 |
| - |
| 106 | +Write directly: |
55 | 107 | ```bash
|
56 |
| -uv run eopf-geozarr convert \ |
57 |
| - "https://.../S2B_MSIL2A_... .zarr" \ |
58 |
| - "s3://your-bucket/path/S2B_MSIL2A_..._geozarr.zarr" \ |
59 |
| - --groups /measurements/reflectance \ |
60 |
| - --verbose |
| 108 | +eopf-geozarr convert input.zarr s3://my-bucket/path/output_geozarr.zarr --groups /measurements/r10m |
61 | 109 | ```
|
62 | 110 |
|
63 |
| -## Info & Validate |
| 111 | +Features: |
| 112 | +- Credential validation before write |
| 113 | +- Custom endpoints (OVH, MinIO, etc.) |
| 114 | +- Retry logic around object writes |
64 | 115 |
|
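A pre-flight check along the lines of "credential validation before write" might look like the sketch below. The helper `missing_s3_env` is hypothetical and only inspects environment variables; the package's real validation may work differently (for example by contacting the endpoint).

```python
import os

# Variables the S3 section above asks you to export; region and endpoint
# are treated as optional in this sketch (an assumption).
REQUIRED_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")


def missing_s3_env(environ=None):
    """Return the names of required S3 variables that are unset or empty."""
    environ = os.environ if environ is None else environ
    return [name for name in REQUIRED_VARS if not environ.get(name)]


# Example with an explicit mapping instead of the real environment.
example = {"AWS_ACCESS_KEY_ID": "abc", "AWS_DEFAULT_REGION": "eu-west-1"}
missing = missing_s3_env(example)
if missing:
    print("Missing S3 settings:", ", ".join(missing))
```

Running such a check before a long conversion avoids discovering an authentication problem only when the first object write fails.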
65 |
| -Summary: |
| 116 | +## 6. Parallel Processing (Dask) |
66 | 117 |
|
67 | 118 | ```bash
|
68 |
| -uv run eopf-geozarr info "/tmp/S2B_MSIL2A_..._geozarr.zarr" |
| 119 | +eopf-geozarr convert input.zarr out.zarr --dask-cluster --verbose |
69 | 120 | ```
|
| 121 | +Benefits: |
| 122 | +- Local cluster auto-start & cleanup |
| 123 | +- Chunk alignment to prevent overlapping writes |
| 124 | +- Better memory distribution for large scenes |
70 | 125 |
|
71 |
| -HTML report: |
| 126 | +## 7. Python API Examples |
72 | 127 |
|
73 |
| -```bash |
74 |
| -uv run eopf-geozarr info "/tmp/S2B_MSIL2A_..._geozarr.zarr" --html /tmp/summary.html |
| 128 | +High-level dataset conversion: |
| 129 | +```python |
| 130 | +import xarray as xr |
| 131 | +from eopf_geozarr import create_geozarr_dataset |
| 132 | + |
| 133 | +dt = xr.open_datatree("path/to/eopf.zarr", engine="zarr") |
| 134 | +out = create_geozarr_dataset( |
| 135 | + dt_input=dt, |
| 136 | + groups=["/measurements/r10m", "/measurements/r20m"], |
| 137 | + output_path="/tmp/out_geozarr.zarr", |
| 138 | + spatial_chunk=4096, |
| 139 | + min_dimension=256, |
| 140 | + tile_width=256, |
| 141 | +) |
| 142 | +``` |
| 143 | + |
| 144 | +Selective writer usage (advanced): |
| 145 | +```python |
| 146 | +from eopf_geozarr.conversion.geozarr import GeoZarrWriter |
| 147 | +writer = GeoZarrWriter(output_path="/tmp/out.zarr", spatial_chunk=4096) |
| 148 | +# writer.write_group(...) |
75 | 149 | ```
|
76 | 150 |
|
77 |
| -Validate (counts only real data vars, skips `spatial_ref`/`crs`): |
| 151 | +## 8. API Reference (Selected) |
78 | 152 |
|
79 |
| -```bash |
80 |
| -uv run eopf-geozarr validate "/tmp/S2B_MSIL2A_..._geozarr.zarr" |
| 153 | +`create_geozarr_dataset(dt_input, groups, output_path, spatial_chunk=4096, ...) -> xr.DataTree` |
| 154 | +: Produce a GeoZarr-compliant hierarchy. |
| 155 | + |
| 156 | +`setup_datatree_metadata_geozarr_spec_compliant(dt, groups) -> dict[str, xr.Dataset]` |
| 157 | +: Apply CF + GeoZarr metadata to selected groups. |
| 158 | + |
| 159 | +`downsample_2d_array(source_data, target_h, target_w) -> np.ndarray` |
| 160 | +: Block-average /2 overview generation primitive. |
| 161 | + |
| 162 | +`calculate_aligned_chunk_size(dimension_size, target_chunk_size) -> int` |
| 163 | +: Returns a chunk size that evenly divides the dimension, so chunks do not overlap. |
| 164 | + |
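One plausible way to pick an evenly dividing chunk size is to search downward from the target for the largest divisor of the dimension. The function below is an illustrative stand-in named `aligned_chunk_size`; it is not the source of `calculate_aligned_chunk_size`, whose exact strategy is not shown here.

```python
def aligned_chunk_size(dimension_size, target_chunk_size):
    """Largest chunk size <= target that divides dimension_size exactly."""
    if dimension_size <= target_chunk_size:
        # The whole dimension fits in a single chunk.
        return dimension_size
    # Walk down from the target until a divisor is found; 1 always divides.
    for candidate in range(target_chunk_size, 0, -1):
        if dimension_size % candidate == 0:
            return candidate
    return 1


# 10980 is not divisible by 4096, so the search settles on 3660 (10980 / 3).
print(aligned_chunk_size(10980, 4096))
```

With an evenly dividing size, every chunk is full and Dask block boundaries can coincide with Zarr chunk boundaries, so no two tasks write to the same chunk.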
| 165 | +## 9. Architecture |
| 166 | + |
| 167 | +``` |
| 168 | +eopf_geozarr/ |
| 169 | + commands/ # CLI subcommands (convert, validate, info, stac, benchmark) |
| 170 | + conversion/ # Core geozarr pipeline, helpers, multiscales, encodings |
| 171 | + metrics.py # Lightweight metrics hooks (optional) |
81 | 172 | ```
|
82 | 173 |
|
83 |
| -## Benchmark (optional) |
| 174 | +## 10. Contributing to GeoZarr Spec |
84 | 175 |
|
| 176 | +This work has fed into upstream GeoZarr spec discussions on: |
| 177 | +- Arbitrary CRS preservation |
| 178 | +- Chunking performance & strategies |
| 179 | +- Multiscale hierarchy clarity |
| 180 | + |
| 181 | +## 11. Benchmark & STAC |
| 182 | + |
| 183 | +Benchmark: |
85 | 184 | ```bash
|
86 |
| -uv run eopf-geozarr benchmark "/tmp/..._geozarr.zarr" --samples 8 --window 1024 1024 |
| 185 | +eopf-geozarr benchmark /tmp/out_geozarr.zarr --samples 8 --window 1024 1024 |
87 | 186 | ```
|
88 | 187 |
|
89 |
| -## STAC |
90 |
| - |
| 188 | +STAC draft artifacts: |
91 | 189 | ```bash
|
92 |
| -uv run eopf-geozarr stac \ |
93 |
| - "/tmp/..._geozarr.zarr" \ |
94 |
| - "/tmp/..._collection.json" \ |
95 |
| - --bbox "minx miny maxx maxy" \ |
96 |
| - --start "YYYY-MM-DDTHH:MM:SSZ" \ |
97 |
| - --end "YYYY-MM-DDTHH:MM:SSZ" |
| 190 | +eopf-geozarr stac /tmp/out_geozarr.zarr /tmp/collection.json \ |
| 191 | + --bbox "minx miny maxx maxy" --start 2025-01-01T00:00:00Z --end 2025-01-31T23:59:59Z |
98 | 192 | ```
|
99 | 193 |
|
100 |
| -## Python API |
| 194 | +## 12. What Gets Written |
101 | 195 |
|
102 |
| -```python |
103 |
| -from eopf_geozarr.conversion.geozarr import GeoZarrWriter |
104 |
| -from eopf_geozarr.validation.validate import validate_store |
105 |
| -from eopf_geozarr.info.summary import summarize |
| 196 | +- `_ARRAY_DIMENSIONS` per variable (deterministic axis order) |
| 197 | +- Per-variable `grid_mapping` referencing `spatial_ref` |
| 198 | +- Multiscales metadata on parent groups; /2 overviews |
| 199 | +- Blosc Zstd compression, conservative chunking |
| 200 | +- Consolidated metadata index |
| 201 | +- Band attribute propagation across levels |
106 | 202 |
|
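The per-variable attributes listed above can be spot-checked with a few lines of plain Python. This is a sketch only: `check_variable_attrs` is a hypothetical helper, the dimension names `y` and `x` and the band name `b02` are assumed, and it is not what the `validate` command runs.

```python
def check_variable_attrs(attrs, ndim, known_variables):
    """Return a list of problems found in one variable's attributes."""
    problems = []

    dims = attrs.get("_ARRAY_DIMENSIONS")
    if dims is None:
        problems.append("missing _ARRAY_DIMENSIONS")
    elif len(dims) != ndim:
        problems.append(f"_ARRAY_DIMENSIONS has {len(dims)} names for {ndim} axes")

    grid_mapping = attrs.get("grid_mapping")
    if grid_mapping is None:
        problems.append("missing grid_mapping")
    elif grid_mapping not in known_variables:
        problems.append(f"grid_mapping points at unknown variable {grid_mapping!r}")

    return problems


# A band that follows the layout described above.
band_attrs = {"_ARRAY_DIMENSIONS": ["y", "x"], "grid_mapping": "spatial_ref"}
ok = check_variable_attrs(band_attrs, ndim=2, known_variables={"b02", "spatial_ref"})

# A band with too few dimension names and no grid mapping.
bad = check_variable_attrs({"_ARRAY_DIMENSIONS": ["y"]}, ndim=2, known_variables={"spatial_ref"})
print(ok, bad)
```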
107 |
| -src = "https://.../S2B_MSIL2A_... .zarr" |
108 |
| -dst = "/tmp/S2B_MSIL2A_..._geozarr.zarr" |
| 203 | +## 13. Consolidated Metadata |
109 | 204 |
|
110 |
| -writer = GeoZarrWriter(src, dst, storage_options={}) |
111 |
| -writer.write(groups=["/measurements/reflectance"], verbose=True) |
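| 205 | +Consolidated metadata speeds up opening the store. It is not yet part of the core Zarr v3 spec, so some tools may flag it; the data stays valid. Disable consolidation during writes if you need a strictly minimal store. |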
112 | 206 |
|
113 |
| -report = validate_store(dst) |
114 |
| -print(report.ok) |
| 207 | +## 14. Troubleshooting |
115 | 208 |
|
116 |
| -tree = summarize(dst) |
117 |
| -print(tree["summary"]) # or write HTML via CLI |
118 |
| -``` |
| 209 | +| Symptom | Cause | Fix | |
| 210 | +|---------|-------|-----| |
| 211 | +| Parent group empty | Only leaf groups hold arrays | Use `--groups` or rely on auto-expansion | |
| 212 | +| Overlapping chunk error | Misaligned Dask vs. encoding chunks | Allow automatic chunk alignment or reduce `spatial_chunk` | |
| 213 | +| S3 auth failure | Missing env vars or endpoint | Export `AWS_*` vars / set `AWS_ENDPOINT_URL` | |
| 214 | +| HTML path is a directory | Provided path is not a file | A default filename is created inside | |
119 | 215 |
|
120 |
| -## What it writes |
| 216 | +## 15. Development & Contributing |
121 | 217 |
|
122 |
| -- `_ARRAY_DIMENSIONS` per variable (correct axis order). |
123 |
| -- `grid_mapping = "spatial_ref"` per variable; `spatial_ref` holds CRS/georeferencing. |
124 |
| -- Multiscales on parent groups; /2 overviews. |
125 |
| -- Blosc Zstd compression; conservative chunking; consolidated metadata. |
126 |
| -- Overviews keep per-band attributes (grid_mapping reattached across levels). |
| 218 | +```bash |
| 219 | +git clone <repo-url> |
| 220 | +cd eopf-geozarr |
| 221 | +pip install -e '.[dev]' |
| 222 | +pre-commit install |
| 223 | +pytest |
| 224 | +``` |
127 | 225 |
|
128 |
| -## Consolidated metadata |
| 226 | +Quality stack: Black, isort, Ruff, Mypy, Pytest, Coverage. |
129 | 227 |
|
130 |
| -Speeds up reads. Some tools note it isn’t in the core Zarr v3 spec yet; data stays valid. You can disable consolidation during writes or remove the index if preferred. |
| 228 | +## 16. License & Acknowledgments |
131 | 229 |
|
132 |
| -## Troubleshooting |
| 230 | +Apache 2.0. Built atop xarray, zarr, dask; follows evolving GeoZarr specification. |
133 | 231 |
|
134 |
| -- Parent group shows no data vars: select leaves (CLI auto-expands). |
135 |
| -- S3 errors: check env vars and `AWS_ENDPOINT_URL` for custom endpoints. |
136 |
| -- HTML path is a directory: a default filename is created inside. |
| 232 | +--- |
| 233 | +For questions or problems, open a GitHub issue. |
137 | 234 |
|