Commit 104d881

docs(readme): restore comprehensive usage and API sections while keeping concise quick start
1 parent 1e27166 commit 104d881

File tree

1 file changed

+173
-76
lines changed

README.md

Lines changed: 173 additions & 76 deletions
Original file line numberDiff line numberDiff line change
@@ -1,137 +1,234 @@
11
# EOPF GeoZarr
22

3-
Turn EOPF datasets into a GeoZarr-style Zarr v3 store. Keep the data values intact and add standard geospatial metadata, multiscale overviews, and per-variable dimensions.
4-
5-
## Quick Start
6-
7-
Install (uv):
3+
GeoZarr compliant data model for EOPF (Earth Observation Processing Framework) datasets.
4+
5+
Turn EOPF datasets into a GeoZarr-style Zarr v3 store while:
6+
- Preserving native CRS (no forced TMS reprojection)
7+
- Adding CF + GeoZarr compliant metadata
8+
- Building /2 multiscale overviews
9+
- Writing robust, retry-aware band data with validation
10+
11+
---
12+
## Table of Contents
13+
1. Overview & Key Features
14+
2. GeoZarr Compliance Details
15+
3. Installation
16+
4. Quick Start (CLI)
17+
5. S3 / Object Storage
18+
6. Parallel Processing (Dask)
19+
7. Python API Examples
20+
8. API Reference (Selected Functions)
21+
9. Architecture
22+
10. Contributing to GeoZarr Spec
23+
11. Benchmark & STAC Commands
24+
12. What Gets Written
25+
13. Consolidated Metadata
26+
14. Troubleshooting
27+
15. Development & Contributing
28+
16. License & Acknowledgments
29+
30+
---
31+
## 1. Overview & Key Features
32+
33+
- Full GeoZarr spec 0.4 alignment
34+
- Native CRS preservation (UTM, polar, etc.)
35+
- Multiscale /2 overviews (COG-style) with proper hierarchy
36+
- CF `standard_name`, `_ARRAY_DIMENSIONS`, `grid_mapping` correctness
37+
- Band-by-band resilient writing (retry + completeness auditing)
38+
- S3 & generic S3-compatible storage support (AWS, OVH, MinIO)
39+
- Optional Dask-backed parallel chunk processing
40+
- Chunk alignment to avoid overlapping dask/Zarr chunk layouts
41+
- HTML summaries, validation, STAC scaffolding, benchmarking
42+
43+
## 2. GeoZarr Compliance Details
44+
45+
Implements:
46+
- `_ARRAY_DIMENSIONS` on all arrays
47+
- CF grid mapping variables with `GeoTransform` info
48+
- Per-variable `grid_mapping` references
49+
- Multiscales metadata structure on parent groups
50+
- Native CRS tile matrix logic (no forced EPSG:3857)
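As a rough illustration of the attributes listed above, the sketch below builds the per-variable and grid-mapping attribute dictionaries for a north-up raster. The helper name `gdal_geotransform` and the sample coordinates are hypothetical; the `GeoTransform` ordering follows the GDAL convention (origin x, pixel width, row rotation, origin y, column rotation, negative pixel height). The exact attribute set this package writes is not shown here, so treat this as an assumption rather than the package's output.

```python
def gdal_geotransform(x_min: float, y_max: float, pixel_size: float) -> str:
    """Format a north-up affine transform as a GDAL-style GeoTransform string."""
    # Order: origin x, pixel width, row rotation, origin y, column rotation, pixel height.
    values = (x_min, pixel_size, 0.0, y_max, 0.0, -pixel_size)
    return " ".join(str(v) for v in values)


# Attributes on a data variable (hypothetical band on a y/x grid).
band_attrs = {
    "_ARRAY_DIMENSIONS": ["y", "x"],
    "grid_mapping": "spatial_ref",
}

# Attributes on the grid mapping variable referenced above (sample values only).
spatial_ref_attrs = {
    "GeoTransform": gdal_geotransform(600000.0, 5000040.0, 10.0),
}

print(band_attrs)
print(spatial_ref_attrs)
```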
51+
52+
## 3. Installation
53+
54+
Stable:
55+
```bash
56+
pip install eopf-geozarr
57+
```
858

59+
Development (uv):
960
```bash
1061
uv sync --frozen
1162
uv run eopf-geozarr --help
1263
```
1364

14-
Or pip:
15-
65+
Editable (pip):
1666
```bash
17-
pip install -e .
67+
pip install -e '.[dev]'
1868
```
1969

20-
## Workflows
70+
## 4. Quick Start (CLI)
2171

22-
For Argo / batch orchestration use: https://github.com/EOPF-Explorer/data-model-pipeline
23-
24-
## Convert
72+
Convert local → local:
73+
```bash
74+
eopf-geozarr convert input.zarr output_geozarr.zarr --groups /measurements/r10m /measurements/r20m
75+
```
2576

2677
Remote → local:
27-
2878
```bash
29-
uv run eopf-geozarr convert \
79+
eopf-geozarr convert \
3080
"https://.../S2B_MSIL2A_... .zarr" \
3181
"/tmp/S2B_MSIL2A_..._geozarr.zarr" \
32-
--groups /measurements/reflectance \
33-
--verbose
82+
--groups /measurements/reflectance --verbose
3483
```
3584

3685
Notes:
37-
- Parent groups auto-expand to leaf datasets.
38-
- Overviews use /2 coarsening; multiscales live on parent groups.
39-
- Defaults: Blosc Zstd, conservative chunking, metadata consolidation after write.
86+
- Parent groups auto-expand to leaf datasets
87+
- Overviews: /2 coarsening, attached at parent multiscales
88+
- Defaults: Blosc Zstd level 3, conservative chunking, metadata consolidation
4089

41-
## S3
90+
Info / HTML / Validate:
91+
```bash
92+
eopf-geozarr info /tmp/..._geozarr.zarr --html report.html
93+
eopf-geozarr validate /tmp/..._geozarr.zarr
94+
```
4295

43-
Env for S3/S3-compatible storage:
96+
## 5. S3 / Object Storage
4497

98+
Environment variables:
4599
```bash
46100
export AWS_ACCESS_KEY_ID=...
47101
export AWS_SECRET_ACCESS_KEY=...
48-
export AWS_REGION=eu-west-1
49-
# Custom endpoint (OVH, MinIO, etc.)
50-
export AWS_ENDPOINT_URL=https://s3.your-endpoint.example
102+
export AWS_DEFAULT_REGION=eu-west-1
103+
export AWS_ENDPOINT_URL=https://s3.your-endpoint.example # optional custom endpoint
51104
```
52105

53-
Write to S3:
54-
106+
Write directly:
55107
```bash
56-
uv run eopf-geozarr convert \
57-
"https://.../S2B_MSIL2A_... .zarr" \
58-
"s3://your-bucket/path/S2B_MSIL2A_..._geozarr.zarr" \
59-
--groups /measurements/reflectance \
60-
--verbose
108+
eopf-geozarr convert input.zarr s3://my-bucket/path/output_geozarr.zarr --groups /measurements/r10m
61109
```
62110

63-
## Info & Validate
111+
Features:
112+
- Credential validation before write
113+
- Custom endpoints (OVH, MinIO, etc.)
114+
- Retry logic around object writes
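The credential check mentioned above could look something like the following sketch. It only tests that the two key variables from the export block are set; `missing_s3_env` and `REQUIRED_VARS` are hypothetical names, and the package's own validation may check more (for example, by contacting the endpoint).

```python
import os
from typing import Mapping, Optional

# Variables taken from the export block above; the region and endpoint are
# treated as optional in this sketch.
REQUIRED_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")


def missing_s3_env(env: Optional[Mapping[str, str]] = None) -> list:
    """Return the names of required S3 variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


missing = missing_s3_env()
if missing:
    print("Missing S3 credentials:", ", ".join(missing))
else:
    print("S3 credential variables are set")
```

Running a check like this before a long conversion avoids discovering an authentication problem only when the first object write fails.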
64115

65-
Summary:
116+
## 6. Parallel Processing (Dask)
66117

67118
```bash
68-
uv run eopf-geozarr info "/tmp/S2B_MSIL2A_..._geozarr.zarr"
119+
eopf-geozarr convert input.zarr out.zarr --dask-cluster --verbose
69120
```
121+
Benefits:
122+
- Local cluster auto-start & cleanup
123+
- Chunk alignment to prevent overlapping writes
124+
- Better memory distribution for large scenes
70125

71-
HTML report:
126+
## 7. Python API Examples
72127

73-
```bash
74-
uv run eopf-geozarr info "/tmp/S2B_MSIL2A_..._geozarr.zarr" --html /tmp/summary.html
128+
High-level dataset conversion:
129+
```python
130+
import xarray as xr
131+
from eopf_geozarr import create_geozarr_dataset
132+
133+
dt = xr.open_datatree("path/to/eopf.zarr", engine="zarr")
134+
out = create_geozarr_dataset(
135+
dt_input=dt,
136+
groups=["/measurements/r10m", "/measurements/r20m"],
137+
output_path="/tmp/out_geozarr.zarr",
138+
spatial_chunk=4096,
139+
min_dimension=256,
140+
tile_width=256,
141+
)
142+
```
143+
144+
Selective writer usage (advanced):
145+
```python
146+
from eopf_geozarr.conversion.geozarr import GeoZarrWriter
147+
writer = GeoZarrWriter(output_path="/tmp/out.zarr", spatial_chunk=4096)
148+
# writer.write_group(...)
75149
```
76150

77-
Validate (counts only real data vars, skips `spatial_ref`/`crs`):
151+
## 8. API Reference (Selected Functions)
78152

79-
```bash
80-
uv run eopf-geozarr validate "/tmp/S2B_MSIL2A_..._geozarr.zarr"
153+
`create_geozarr_dataset(dt_input, groups, output_path, spatial_chunk=4096, ...) -> xr.DataTree`
154+
: Produce a GeoZarr-compliant hierarchy.
155+
156+
`setup_datatree_metadata_geozarr_spec_compliant(dt, groups) -> dict[str, xr.Dataset]`
157+
: Apply CF + GeoZarr metadata to selected groups.
158+
159+
`downsample_2d_array(source_data, target_h, target_w) -> np.ndarray`
160+
: Block-average /2 overview generation primitive.
161+
162+
`calculate_aligned_chunk_size(dimension_size, target_chunk_size) -> int`
163+
: Return a chunk size that divides the dimension evenly, so Dask and Zarr chunks do not overlap.
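One plausible way to pick an evenly dividing chunk is to take the largest divisor of the dimension that does not exceed the target. The sketch below does that; `aligned_chunk_size` is a hypothetical stand-in and may not match how `calculate_aligned_chunk_size` is actually implemented.

```python
def aligned_chunk_size(dimension_size: int, target_chunk_size: int) -> int:
    """Largest divisor of dimension_size that is <= target_chunk_size."""
    if dimension_size <= 0 or target_chunk_size <= 0:
        raise ValueError("sizes must be positive")
    if target_chunk_size >= dimension_size:
        # A single chunk covers the whole dimension.
        return dimension_size
    # Walk down from the target until a divisor is found; 1 always divides.
    for candidate in range(target_chunk_size, 0, -1):
        if dimension_size % candidate == 0:
            return candidate
    return 1


# A 10980-pixel axis with a 4096 target yields 3660 (10980 / 3).
print(aligned_chunk_size(10980, 4096))
```

Because every chunk then has the same size and the last one is never partial, a Dask chunk never straddles two Zarr chunks along that axis.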
164+
165+
## 9. Architecture
166+
167+
```
168+
eopf_geozarr/
169+
commands/ # CLI subcommands (convert, validate, info, stac, benchmark)
170+
conversion/ # Core geozarr pipeline, helpers, multiscales, encodings
171+
metrics.py # Lightweight metrics hooks (optional)
81172
```
82173

83-
## Benchmark (optional)
174+
## 10. Contributing to GeoZarr Spec
84175

176+
Upstream issue discussions influenced:
177+
- Arbitrary CRS preservation
178+
- Chunking performance & strategies
179+
- Multiscale hierarchy clarity
180+
181+
## 11. Benchmark & STAC Commands
182+
183+
Benchmark:
85184
```bash
86-
uv run eopf-geozarr benchmark "/tmp/..._geozarr.zarr" --samples 8 --window 1024 1024
185+
eopf-geozarr benchmark /tmp/out_geozarr.zarr --samples 8 --window 1024 1024
87186
```
88187

89-
## STAC
90-
188+
STAC draft artifacts:
91189
```bash
92-
uv run eopf-geozarr stac \
93-
"/tmp/..._geozarr.zarr" \
94-
"/tmp/..._collection.json" \
95-
--bbox "minx miny maxx maxy" \
96-
--start "YYYY-MM-DDTHH:MM:SSZ" \
97-
--end "YYYY-MM-DDTHH:MM:SSZ"
190+
eopf-geozarr stac /tmp/out_geozarr.zarr /tmp/collection.json \
191+
--bbox "minx miny maxx maxy" --start 2025-01-01T00:00:00Z --end 2025-01-31T23:59:59Z
98192
```
99193

100-
## Python API
194+
## 12. What Gets Written
101195

102-
```python
103-
from eopf_geozarr.conversion.geozarr import GeoZarrWriter
104-
from eopf_geozarr.validation.validate import validate_store
105-
from eopf_geozarr.info.summary import summarize
196+
- `_ARRAY_DIMENSIONS` per variable (deterministic axis order)
197+
- Per-variable `grid_mapping` referencing `spatial_ref`
198+
- Multiscales metadata on parent groups; /2 overviews
199+
- Blosc Zstd compression, conservative chunking
200+
- Consolidated metadata index
201+
- Band attribute propagation across levels
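To make the /2 overviews concrete, the sketch below halves a 2D array by averaging each 2x2 block with NumPy. `block_average_half` is a hypothetical helper with a simplified signature; the package's `downsample_2d_array` takes explicit target dimensions and may handle edges, dtypes, and nodata differently.

```python
import numpy as np


def block_average_half(data: np.ndarray) -> np.ndarray:
    """Halve both axes of a 2D array by averaging non-overlapping 2x2 blocks."""
    h, w = data.shape
    h2, w2 = h // 2, w // 2
    # Drop a trailing odd row/column so the array splits into whole 2x2 blocks.
    trimmed = data[: h2 * 2, : w2 * 2].astype(np.float64)
    # Reshape to (h2, 2, w2, 2) and average over the two block axes.
    return trimmed.reshape(h2, 2, w2, 2).mean(axis=(1, 3))


level0 = np.arange(16).reshape(4, 4)
level1 = block_average_half(level0)
print(level1)
```

Applying the function repeatedly produces successive overview levels, each half the size of the previous one.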
106202

107-
src = "https://.../S2B_MSIL2A_... .zarr"
108-
dst = "/tmp/S2B_MSIL2A_..._geozarr.zarr"
203+
## 13. Consolidated Metadata
109204

110-
writer = GeoZarrWriter(src, dst, storage_options={})
111-
writer.write(groups=["/measurements/reflectance"], verbose=True)
205+
Consolidated metadata speeds up opening a store. It is not yet part of the core Zarr v3 spec, and discussion is ongoing; disable consolidation during writes if strict spec minimalism is required.
112206

113-
report = validate_store(dst)
114-
print(report.ok)
207+
## 14. Troubleshooting
115208

116-
tree = summarize(dst)
117-
print(tree["summary"]) # or write HTML via CLI
118-
```
209+
| Symptom | Cause | Fix |
210+
|---------|-------|-----|
211+
| Parent group empty | Only leaf groups hold arrays | Use `--groups` or rely on auto-expansion |
212+
| Overlapping chunk error | Misaligned Dask vs. encoding chunks | Allow automatic chunk alignment or reduce `spatial_chunk` |
213+
| S3 auth failure | Missing env vars or endpoint | Export AWS_* vars / set AWS_ENDPOINT_URL |
214+
| HTML path is a directory | Provided path is a directory, not a file | A default filename is created inside it |
119215

120-
## What it writes
216+
## 15. Development & Contributing
121217

122-
- `_ARRAY_DIMENSIONS` per variable (correct axis order).
123-
- `grid_mapping = "spatial_ref"` per variable; `spatial_ref` holds CRS/georeferencing.
124-
- Multiscales on parent groups; /2 overviews.
125-
- Blosc Zstd compression; conservative chunking; consolidated metadata.
126-
- Overviews keep per-band attributes (grid_mapping reattached across levels).
218+
```bash
219+
git clone <repo-url>
220+
cd eopf-geozarr
221+
pip install -e '.[dev]'
222+
pre-commit install
223+
pytest
224+
```
127225

128-
## Consolidated metadata
226+
Quality stack: Black, isort, Ruff, Mypy, Pytest, Coverage.
129227

130-
Speeds up reads. Some tools note it isn’t in the core Zarr v3 spec yet; data stays valid. You can disable consolidation during writes or remove the index if preferred.
228+
## 16. License & Acknowledgments
131229

132-
## Troubleshooting
230+
Licensed under Apache 2.0. Built on xarray, Zarr, and Dask; follows the evolving GeoZarr specification.
133231

134-
- Parent group shows no data vars: select leaves (CLI auto-expands).
135-
- S3 errors: check env vars and `AWS_ENDPOINT_URL` for custom endpoints.
136-
- HTML path is a directory: a default filename is created inside.
232+
---
233+
For questions or problems, open a GitHub issue.
137234
