Commit 5ecb022

docs(readme): usage, API sections and quick start
1 parent 1e27166 commit 5ecb022

1 file changed

+203
-77
lines changed

README.md

Lines changed: 203 additions & 77 deletions
@@ -1,137 +1,263 @@
11
# EOPF GeoZarr
22

3-
Turn EOPF datasets into a GeoZarr-style Zarr v3 store. Keep the data values intact and add standard geospatial metadata, multiscale overviews, and per-variable dimensions.
3+
GeoZarr-compliant data model for EOPF (Earth Observation Processing Framework) datasets.
44

5-
## Quick Start
5+
Turn EOPF datasets into a GeoZarr-style Zarr v3 store while:
6+
- Preserving native CRS (no forced TMS reprojection)
7+
- Adding CF + GeoZarr compliant metadata
8+
- Building /2 multiscale overviews
9+
- Writing robust, retry-aware band data with validation
10+
11+
## Overview
12+
13+
This library converts EOPF datatrees into Zarr v3 stores aligned with GeoZarr spec 0.4, without forcing web-mercator-style tiling. It focuses on scientific fidelity (native CRS), robust metadata (CF + GeoZarr), and operational resilience (retries and completeness auditing), and it supports multiscale /2 overviews.
14+
15+
## Key Features
616

7-
Install (uv):
17+
- **GeoZarr Specification Compliance** (0.4 features implemented)
18+
- **Native CRS Preservation** (UTM, polar, arbitrary projections)
19+
- **Multiscale /2 Overviews** (COG-style hierarchy as child groups)
20+
- **CF Conventions** (`standard_name`, `grid_mapping`, `_ARRAY_DIMENSIONS`)
21+
- **Resilient Writing** (band-by-band with retries & auditing)
22+
- **S3 & S3-Compatible Support** (AWS, OVH, MinIO, custom endpoints)
23+
- **Optional Parallel Processing** (local Dask cluster)
24+
- **Automatic Chunk Alignment** (prevents overlapping Dask/Zarr chunks)
25+
- **HTML Summary & Validation Tools**
26+
- **STAC & Benchmark Commands**
27+
- **Consolidated Metadata** (faster open)
28+
29+
## GeoZarr Compliance Features
30+
31+
- `_ARRAY_DIMENSIONS` attributes on all arrays
32+
- CF standard names for all variables
33+
- `grid_mapping` attributes referencing CF grid_mapping variables
34+
- `GeoTransform` attributes in grid_mapping variables
35+
- Proper multiscales metadata structure
36+
- Native CRS tile matrix sets
37+
38+
## Installation
839

40+
```bash
41+
pip install eopf-geozarr
42+
```
43+
44+
Development (uv):
945
```bash
1046
uv sync --frozen
1147
uv run eopf-geozarr --help
1248
```
1349

14-
Or pip:
15-
50+
Editable (pip):
1651
```bash
17-
pip install -e .
52+
pip install -e '.[dev]'
1853
```
1954

20-
## Workflows
21-
22-
For Argo / batch orchestration use: https://github.com/EOPF-Explorer/data-model-pipeline
55+
## Quick Start
2356

24-
## Convert
57+
### Command Line Interface
2558

26-
Remote → local:
59+
After installation, you can use the `eopf-geozarr` command:
2760

2861
```bash
29-
uv run eopf-geozarr convert \
30-
"https://.../S2B_MSIL2A_... .zarr" \
31-
"/tmp/S2B_MSIL2A_..._geozarr.zarr" \
32-
--groups /measurements/reflectance \
33-
--verbose
34-
```
62+
# Convert EOPF dataset to GeoZarr format (local output)
63+
eopf-geozarr convert input.zarr output.zarr
64+
65+
# Convert specific groups (e.g. resolution groups)
66+
eopf-geozarr convert input.zarr output.zarr --groups /measurements/r10m /measurements/r20m
3567

36-
Notes:
37-
- Parent groups auto-expand to leaf datasets.
38-
- Overviews use /2 coarsening; multiscales live on parent groups.
39-
- Defaults: Blosc Zstd, conservative chunking, metadata consolidation after write.
68+
# Convert EOPF dataset to GeoZarr format (S3 output)
69+
eopf-geozarr convert input.zarr s3://my-bucket/path/to/output.zarr
4070

41-
## S3
71+
# Convert with parallel processing using dask cluster
72+
eopf-geozarr convert input.zarr output.zarr --dask-cluster
4273

43-
Env for S3/S3-compatible storage:
74+
# Convert with dask cluster and verbose output
75+
eopf-geozarr convert input.zarr output.zarr --dask-cluster --verbose
4476

77+
# Generate an HTML summary while inspecting
78+
eopf-geozarr info output.zarr --html report.html
79+
80+
# Get information about a dataset
81+
eopf-geozarr info input.zarr
82+
83+
# Validate GeoZarr compliance
84+
eopf-geozarr validate output.zarr
85+
86+
# Benchmark access patterns (optional)
87+
eopf-geozarr benchmark output.zarr --samples 8 --window 1024 1024
88+
89+
# Produce draft STAC artifacts
90+
eopf-geozarr stac output.zarr stac_collection.json \
91+
--bbox "minx miny maxx maxy" --start 2025-01-01T00:00:00Z --end 2025-01-31T23:59:59Z
92+
93+
# Get help
94+
eopf-geozarr --help
95+
```
96+
97+
#### Notes
98+
- Parent groups auto-expand to leaf datasets when omitted.
99+
- Multiscale overviews are generated with /2 coarsening and attached as child groups.
100+
- Defaults: Blosc Zstd (level 3), conservative chunking, metadata consolidation enabled.
101+
- Use `--groups` to limit processing or speed up experimentation.
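
The /2 coarsening mentioned in the notes above can be illustrated with a small sketch that lists the array shape at each overview level. This is not the library's code: the function name `overview_shapes` is hypothetical, and both the floor division and the stop rule (stop once a side would fall below `min_dimension`) are assumptions made for illustration.

```python
def overview_shapes(height: int, width: int, min_dimension: int = 256) -> list[tuple[int, int]]:
    """List (height, width) per level, halving each side per level.

    Hypothetical helper: stops before a level whose smaller side would
    drop below min_dimension (assumed stop rule, not taken from the library).
    """
    shapes = [(height, width)]
    while True:
        h, w = shapes[-1][0] // 2, shapes[-1][1] // 2
        if min(h, w) < min_dimension:
            break
        shapes.append((h, w))
    return shapes


# A 10980 x 10980 band (a typical 10 m Sentinel-2 tile size) as an example input.
for level, shape in enumerate(overview_shapes(10980, 10980)):
    print(level, shape)
```

Each level holds roughly a quarter of the pixels of the level above it, so the whole pyramid adds about one third to the size of the full-resolution data.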
102+
103+
## S3 Support
104+
105+
Environment vars:
45106
```bash
46107
export AWS_ACCESS_KEY_ID=...
47108
export AWS_SECRET_ACCESS_KEY=...
48-
export AWS_REGION=eu-west-1
49-
# Custom endpoint (OVH, MinIO, etc.)
50-
export AWS_ENDPOINT_URL=https://s3.your-endpoint.example
109+
export AWS_DEFAULT_REGION=eu-west-1
110+
export AWS_ENDPOINT_URL=https://s3.your-endpoint.example # optional custom endpoint
51111
```
52112

53-
Write to S3:
54-
113+
Write directly:
55114
```bash
56-
uv run eopf-geozarr convert \
57-
"https://.../S2B_MSIL2A_... .zarr" \
58-
"s3://your-bucket/path/S2B_MSIL2A_..._geozarr.zarr" \
59-
--groups /measurements/reflectance \
60-
--verbose
115+
eopf-geozarr convert input.zarr s3://my-bucket/path/output_geozarr.zarr --groups /measurements/r10m
61116
```
62117

63-
## Info & Validate
118+
Features:
119+
- Credential validation before write
120+
- Custom endpoints (OVH, MinIO, etc.)
121+
- Retry logic around object writes
64122

65-
Summary:
123+
## Parallel Processing with Dask
66124

67125
```bash
68-
uv run eopf-geozarr info "/tmp/S2B_MSIL2A_..._geozarr.zarr"
126+
eopf-geozarr convert input.zarr out.zarr --dask-cluster --verbose
69127
```
128+
Benefits:
129+
- Local cluster auto-start & cleanup
130+
- Chunk alignment to prevent overlapping writes
131+
- Better memory distribution for large scenes
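
To make the chunk alignment point concrete: one way to avoid a Dask chunk straddling two Zarr chunks is to pick a chunk size that divides the dimension evenly. The sketch below assumes "aligned" means the largest divisor of the dimension that does not exceed the target; the name `largest_dividing_chunk` is hypothetical and the library's `calculate_aligned_chunk_size` may use a different rule.

```python
def largest_dividing_chunk(dimension_size: int, target_chunk_size: int) -> int:
    """Largest divisor of dimension_size that is <= target_chunk_size.

    Hypothetical helper illustrating one possible alignment rule.
    """
    if dimension_size <= 0 or target_chunk_size <= 0:
        raise ValueError("sizes must be positive")
    # Walk down from the target; 1 always divides, so the loop always returns.
    for candidate in range(min(dimension_size, target_chunk_size), 0, -1):
        if dimension_size % candidate == 0:
            return candidate
    return 1


print(largest_dividing_chunk(10980, 4096))  # 3660, since 10980 = 3 * 3660
```

With an evenly dividing chunk, every chunk along the dimension has the same size, so parallel writers never touch the same stored chunk.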
70132

71-
HTML report:
133+
## Python API
72134

73-
```bash
74-
uv run eopf-geozarr info "/tmp/S2B_MSIL2A_..._geozarr.zarr" --html /tmp/summary.html
135+
High-level dataset conversion:
136+
```python
137+
import xarray as xr
138+
from eopf_geozarr import create_geozarr_dataset
139+
140+
dt = xr.open_datatree("path/to/eopf.zarr", engine="zarr")
141+
out = create_geozarr_dataset(
142+
dt_input=dt,
143+
groups=["/measurements/r10m", "/measurements/r20m"],
144+
output_path="/tmp/out_geozarr.zarr",
145+
spatial_chunk=4096,
146+
min_dimension=256,
147+
tile_width=256,
148+
)
149+
```
150+
151+
Selective writer usage (advanced):
152+
```python
153+
from eopf_geozarr.conversion.geozarr import GeoZarrWriter
154+
writer = GeoZarrWriter(output_path="/tmp/out.zarr", spatial_chunk=4096)
155+
# writer.write_group(...)
75156
```
76157

77-
Validate (counts only real data vars, skips `spatial_ref`/`crs`):
158+
## API Reference
78159

79-
```bash
80-
uv run eopf-geozarr validate "/tmp/S2B_MSIL2A_..._geozarr.zarr"
160+
`create_geozarr_dataset(dt_input, groups, output_path, spatial_chunk=4096, ...) -> xr.DataTree`
161+
: Produce a GeoZarr-compliant hierarchy.
162+
163+
`setup_datatree_metadata_geozarr_spec_compliant(dt, groups) -> dict[str, xr.Dataset]`
164+
: Apply CF + GeoZarr metadata to selected groups.
165+
166+
`downsample_2d_array(source_data, target_h, target_w) -> np.ndarray`
167+
: Block-average /2 overview generation primitive.
168+
169+
`calculate_aligned_chunk_size(dimension_size, target_chunk_size) -> int`
170+
: Returns evenly dividing chunk to avoid overlap.
171+
172+
## Architecture
173+
174+
```
175+
eopf_geozarr/
176+
commands/ # CLI subcommands (convert, validate, info, stac, benchmark)
177+
conversion/ # Core geozarr pipeline, helpers, multiscales, encodings
178+
metrics.py # Lightweight metrics hooks (optional)
81179
```
82180

83-
## Benchmark (optional)
181+
## Contributing to GeoZarr Specification
182+
183+
Upstream issue discussions influenced:
184+
- Arbitrary CRS preservation
185+
- Chunking performance & strategies
186+
- Multiscale hierarchy clarity
187+
188+
## Benchmark & STAC Commands
84189

190+
Benchmark:
85191
```bash
86-
uv run eopf-geozarr benchmark "/tmp/..._geozarr.zarr" --samples 8 --window 1024 1024
192+
eopf-geozarr benchmark /tmp/out_geozarr.zarr --samples 8 --window 1024 1024
87193
```
88194

89-
## STAC
90-
195+
STAC draft artifacts:
91196
```bash
92-
uv run eopf-geozarr stac \
93-
"/tmp/..._geozarr.zarr" \
94-
"/tmp/..._collection.json" \
95-
--bbox "minx miny maxx maxy" \
96-
--start "YYYY-MM-DDTHH:MM:SSZ" \
97-
--end "YYYY-MM-DDTHH:MM:SSZ"
197+
eopf-geozarr stac /tmp/out_geozarr.zarr /tmp/collection.json \
198+
--bbox "minx miny maxx maxy" --start 2025-01-01T00:00:00Z --end 2025-01-31T23:59:59Z
98199
```
99200

100-
## Python API
201+
## What Gets Written
101202

102-
```python
103-
from eopf_geozarr.conversion.geozarr import GeoZarrWriter
104-
from eopf_geozarr.validation.validate import validate_store
105-
from eopf_geozarr.info.summary import summarize
203+
- `_ARRAY_DIMENSIONS` per variable (deterministic axis order)
204+
- Per-variable `grid_mapping` referencing `spatial_ref`
205+
- Multiscales metadata on parent groups; /2 overviews
206+
- Blosc Zstd compression, conservative chunking
207+
- Consolidated metadata index
208+
- Band attribute propagation across levels
209+
210+
## Consolidated Metadata
211+
212+
Consolidated metadata speeds up opening a store. Its place in the spec is still under discussion; disable consolidation if you need a strictly minimal store.
213+
214+
## Troubleshooting
215+
216+
| Symptom | Cause | Fix |
217+
|---------|-------|-----|
218+
| Parent group empty | Only leaf groups hold arrays | Use `--groups` or rely on auto-expansion |
219+
| Overlapping chunk error | Misaligned dask vs encoding chunks | Allow auto chunk alignment or reduce spatial_chunk |
220+
| S3 auth failure | Missing env vars or endpoint | Export AWS_* vars / set AWS_ENDPOINT_URL |
221+
| HTML path is a directory | Provided path not file | A default filename is created inside |
222+
223+
## Development & Contributing
224+
Preferred (reproducible) workflow uses [uv](https://github.com/astral-sh/uv):
106225

107-
src = "https://.../S2B_MSIL2A_... .zarr"
108-
dst = "/tmp/S2B_MSIL2A_..._geozarr.zarr"
226+
```bash
227+
git clone <repo-url>
228+
cd eopf-geozarr
109229

110-
writer = GeoZarrWriter(src, dst, storage_options={})
111-
writer.write(groups=["/measurements/reflectance"], verbose=True)
230+
# Ensure uv is installed (macOS/Linux quick install)
231+
curl -Ls https://astral.sh/uv/install.sh | sh # or follow official docs
112232

113-
report = validate_store(dst)
114-
print(report.ok)
233+
# Create and sync environment with dev extras
234+
uv sync --extra dev
115235

116-
tree = summarize(dst)
117-
print(tree["summary"]) # or write HTML via CLI
236+
# Run tools through uv (ensures correct virtual env)
237+
uv run pre-commit install
238+
uv run pytest -q
118239
```
119240

120-
## What it writes
241+
Common tasks:
242+
```bash
243+
uv run ruff check .
244+
uv run mypy src
245+
uv run eopf-geozarr --help
246+
```
121247

122-
- `_ARRAY_DIMENSIONS` per variable (correct axis order).
123-
- `grid_mapping = "spatial_ref"` per variable; `spatial_ref` holds CRS/georeferencing.
124-
- Multiscales on parent groups; /2 overviews.
125-
- Blosc Zstd compression; conservative chunking; consolidated metadata.
126-
- Overviews keep per-band attributes (grid_mapping reattached across levels).
248+
Fallback (less reproducible) pip editable install:
249+
```bash
250+
pip install -e '.[dev]'
251+
pre-commit install
252+
pytest
253+
```
127254

128-
## Consolidated metadata
255+
Quality stack: Ruff (lint + format), isort (via Ruff), Mypy (strict), Pytest, Coverage.
129256

130-
Speeds up reads. Some tools note it isn’t in the core Zarr v3 spec yet; data stays valid. You can disable consolidation during writes or remove the index if preferred.
257+
## License & Acknowledgments
131258

132-
## Troubleshooting
259+
Apache 2.0. Built on xarray, zarr, and dask; follows the evolving GeoZarr specification.
133260

134-
- Parent group shows no data vars: select leaves (CLI auto-expands).
135-
- S3 errors: check env vars and `AWS_ENDPOINT_URL` for custom endpoints.
136-
- HTML path is a directory: a default filename is created inside.
261+
---
262+
For questions or problems, open a GitHub issue.
137263
