
Commit bf8cacf (parent 1e27166)

docs(readme): restore comprehensive usage and API sections while keeping concise quick start

File tree: 1 file changed (+156 additions, -74 deletions)

README.md

# EOPF GeoZarr

GeoZarr-compliant data model for EOPF (Earth Observation Processing Framework) datasets.

Turn EOPF datasets into a GeoZarr-style Zarr v3 store while:

- Preserving the native CRS (no forced TMS reprojection)
- Adding CF- and GeoZarr-compliant metadata
- Building /2 multiscale overviews
- Writing robust, retry-aware band data with validation

## Overview

This library converts EOPF datatrees into Zarr v3 stores aligned with the GeoZarr 0.4 specification, without forcing Web Mercator-style tiling. It focuses on scientific fidelity (native CRS), robust metadata (CF + GeoZarr), and operational resilience (retries and completeness auditing), while supporting multiscale /2 overviews.

## Key Features

- **GeoZarr Specification Compliance** (0.4 features implemented)
- **Native CRS Preservation** (UTM, polar, arbitrary projections)
- **Multiscale /2 Overviews** (COG-style hierarchy as child groups)
- **CF Conventions** (`standard_name`, `grid_mapping`, `_ARRAY_DIMENSIONS`)
- **Resilient Writing** (band-by-band with retries and auditing)
- **S3 & S3-Compatible Support** (AWS, OVH, MinIO, custom endpoints)
- **Optional Parallel Processing** (local Dask cluster)
- **Automatic Chunk Alignment** (prevents overlapping Dask/Zarr chunks)
- **HTML Summary & Validation Tools**
- **STAC & Benchmark Commands**
- **Consolidated Metadata** (faster opens)

## GeoZarr Compliance Features

- `_ARRAY_DIMENSIONS` attributes on all arrays
- CF grid mapping variables with `GeoTransform`
- Per-variable `grid_mapping` references
- Multiscales metadata structure on parent groups
- Native CRS tile matrix logic (no forced EPSG:3857)
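
As a quick check, these attributes can be inspected directly with `zarr`. The sketch below is illustrative only: the store path, group layout, and band name are placeholders, not something the converter guarantees.

```python
import zarr

# Illustrative only: open a converted store and look at GeoZarr/CF attributes.
root = zarr.open_group("/tmp/out_geozarr.zarr", mode="r")
band = root["measurements/r10m/b04"]            # example band; actual layout may differ
print(band.attrs.get("_ARRAY_DIMENSIONS"))      # e.g. ["y", "x"]
print(band.attrs.get("grid_mapping"))           # expected: "spatial_ref"
print(root["measurements/r10m"].attrs.get("multiscales"))  # multiscales metadata on the parent group
```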

## Installation

Stable:

```bash
pip install eopf-geozarr
```

Development (uv):

```bash
uv sync --frozen
uv run eopf-geozarr --help
```

Editable (pip):

```bash
pip install -e '.[dev]'
```

## Quick Start (CLI)

Local → local:

```bash
eopf-geozarr convert input.zarr output_geozarr.zarr --groups /measurements/r10m /measurements/r20m
```

Remote → local:

```bash
eopf-geozarr convert \
  "https://.../S2B_MSIL2A_... .zarr" \
  "/tmp/S2B_MSIL2A_..._geozarr.zarr" \
  --groups /measurements/reflectance --verbose
```

Notes:

- Parent groups auto-expand to leaf datasets
- Overviews: /2 coarsening, attached at parent multiscales
- Defaults: Blosc Zstd level 3, conservative chunking, metadata consolidation

Info / HTML / validate:

```bash
eopf-geozarr info /tmp/..._geozarr.zarr --html report.html
eopf-geozarr validate /tmp/..._geozarr.zarr
```

## S3 Support

Environment variables:

```bash
export AWS_ACCESS_KEY_ID=...
export AWS_SECRET_ACCESS_KEY=...
export AWS_DEFAULT_REGION=eu-west-1
export AWS_ENDPOINT_URL=https://s3.your-endpoint.example  # optional custom endpoint
```

Write directly to S3:

```bash
eopf-geozarr convert input.zarr s3://my-bucket/path/output_geozarr.zarr --groups /measurements/r10m
```

Features:

- Credential validation before writing
- Custom endpoints (OVH, MinIO, etc.)
- Retry logic around object writes
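
A converted store can then be opened straight back from S3. A minimal sketch, assuming `s3fs` is installed; the bucket path is illustrative, and credentials plus any custom endpoint are picked up from the `AWS_*` variables above.

```python
import xarray as xr

# Minimal sketch: open a converted store directly from S3 (requires s3fs).
# Credentials and endpoint are taken from the AWS_* environment variables.
dt = xr.open_datatree("s3://my-bucket/path/output_geozarr.zarr", engine="zarr")
print(dt)
```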

## Parallel Processing with Dask

```bash
eopf-geozarr convert input.zarr out.zarr --dask-cluster --verbose
```

Benefits:

- Local cluster auto-start & cleanup
- Chunk alignment to prevent overlapping writes
- Better memory distribution for large scenes
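
A rough Python-API equivalent of `--dask-cluster` is to start a local Dask cluster yourself before converting. The sketch below is illustrative: worker counts and paths are placeholders, `chunks={}` is assumed to trigger lazy Dask-backed loading, and the `create_geozarr_dataset` call matches the Python API section below.

```python
import xarray as xr
from dask.distributed import Client, LocalCluster

from eopf_geozarr import create_geozarr_dataset

# Illustrative sketch: start a local Dask cluster, convert, then clean up.
cluster = LocalCluster(n_workers=4, threads_per_worker=2)
client = Client(cluster)
try:
    dt = xr.open_datatree("input.zarr", engine="zarr", chunks={})
    create_geozarr_dataset(
        dt_input=dt,
        groups=["/measurements/r10m"],
        output_path="/tmp/out_geozarr.zarr",
        spatial_chunk=4096,
    )
finally:
    client.close()
    cluster.close()
```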

## Python API

High-level dataset conversion:

```python
import xarray as xr
from eopf_geozarr import create_geozarr_dataset

dt = xr.open_datatree("path/to/eopf.zarr", engine="zarr")
out = create_geozarr_dataset(
    dt_input=dt,
    groups=["/measurements/r10m", "/measurements/r20m"],
    output_path="/tmp/out_geozarr.zarr",
    spatial_chunk=4096,
    min_dimension=256,
    tile_width=256,
)
```

Selective writer usage (advanced):

```python
from eopf_geozarr.conversion.geozarr import GeoZarrWriter

writer = GeoZarrWriter(output_path="/tmp/out.zarr", spatial_chunk=4096)
# writer.write_group(...)
```

## API Reference

`create_geozarr_dataset(dt_input, groups, output_path, spatial_chunk=4096, ...) -> xr.DataTree`
: Produce a GeoZarr-compliant hierarchy.

`setup_datatree_metadata_geozarr_spec_compliant(dt, groups) -> dict[str, xr.Dataset]`
: Apply CF + GeoZarr metadata to selected groups.

`downsample_2d_array(source_data, target_h, target_w) -> np.ndarray`
: Block-average /2 overview generation primitive.

`calculate_aligned_chunk_size(dimension_size, target_chunk_size) -> int`
: Returns an evenly dividing chunk size to avoid overlap.
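
The last two helpers embody simple numeric ideas. The self-contained sketch below illustrates them (block-averaged /2 overviews and overlap-free chunk sizes); it is not the library's actual implementation, and edge handling may differ.

```python
import numpy as np

def block_average_2x(a: np.ndarray) -> np.ndarray:
    """Illustrative /2 overview: average non-overlapping 2x2 blocks.
    Assumes even dimensions are trimmed; downsample_2d_array may handle edges differently."""
    h, w = a.shape
    return a[: h - h % 2, : w - w % 2].reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def aligned_chunk(dimension_size: int, target_chunk_size: int) -> int:
    """Illustrative chunk alignment: largest divisor of dimension_size not exceeding
    target_chunk_size, so Zarr chunks tile the array without overlap."""
    for candidate in range(min(target_chunk_size, dimension_size), 0, -1):
        if dimension_size % candidate == 0:
            return candidate
    return dimension_size

level1 = block_average_2x(np.arange(16, dtype=float).reshape(4, 4))  # shape (2, 2)
print(aligned_chunk(10980, 4096))  # 3660 for a 10980-pixel Sentinel-2 10 m grid
```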

## Architecture

```
eopf_geozarr/
  commands/     # CLI subcommands (convert, validate, info, stac, benchmark)
  conversion/   # Core geozarr pipeline, helpers, multiscales, encodings
  metrics.py    # Lightweight metrics hooks (optional)
```

## Contributing to GeoZarr Specification

Upstream GeoZarr specification issue discussions have influenced:

- Arbitrary CRS preservation
- Chunking performance and strategies
- Multiscale hierarchy clarity

## Benchmark & STAC Commands

Benchmark:

```bash
eopf-geozarr benchmark /tmp/out_geozarr.zarr --samples 8 --window 1024 1024
```

STAC draft artifacts:

```bash
eopf-geozarr stac /tmp/out_geozarr.zarr /tmp/collection.json \
  --bbox "minx miny maxx maxy" --start 2025-01-01T00:00:00Z --end 2025-01-31T23:59:59Z
```

## What Gets Written

- `_ARRAY_DIMENSIONS` per variable (deterministic axis order)
- Per-variable `grid_mapping` referencing `spatial_ref`
- Multiscales metadata on parent groups; /2 overviews
- Blosc Zstd compression, conservative chunking
- Consolidated metadata index
- Band attribute propagation across overview levels

## Consolidated Metadata

Consolidated metadata speeds up opening the store. It is not yet part of the core Zarr v3 specification (spec discussion is ongoing); the data stays valid either way, and you can disable consolidation during writes or remove the index if strict minimalism is required.
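
If consolidation is skipped, or the index needs rebuilding, it can be added after the fact with `zarr`. A minimal sketch with an illustrative path:

```python
import zarr

# Illustrative: rebuild the consolidated index for an existing store, then reuse it for fast opens.
zarr.consolidate_metadata("/tmp/out_geozarr.zarr")
root = zarr.open_consolidated("/tmp/out_geozarr.zarr", mode="r")
print(root)
```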

## Troubleshooting

| Symptom | Cause | Fix |
|---------|-------|-----|
| Parent group appears empty | Only leaf groups hold arrays | Use `--groups` with leaf paths or rely on auto-expansion |
| Overlapping chunk error | Dask chunks misaligned with encoding chunks | Allow automatic chunk alignment or reduce `spatial_chunk` |
| S3 auth failure | Missing environment variables or endpoint | Export the `AWS_*` variables or set `AWS_ENDPOINT_URL` |
| HTML path is a directory | Provided path is not a file | A default filename is created inside the directory |

## Development & Contributing

```bash
git clone <repo-url>
cd eopf-geozarr
pip install -e '.[dev]'
pre-commit install
pytest
```

Quality stack: Black, isort, Ruff, Mypy, Pytest, Coverage.

## License & Acknowledgments

Apache 2.0. Built on xarray, zarr, and dask; follows the evolving GeoZarr specification.

---

For questions or issues, open a GitHub issue.