@wietzesuijker wietzesuijker commented Sep 15, 2025

Modular refactor for clearer CLI, safer/idempotent writes, resilient multiscales, optional metrics; preserves public high-level API (create_geozarr_dataset).

Core Changes

  • CLI split: convert, validate, info (--html), benchmark, stac; adds --overwrite {fail,skip,merge,replace}, --metrics-out, --crs-groups.
  • Orchestration: GeoZarrWriter centralizes base write, multiscales, consolidation, audits.
  • Multiscales: resumable, per-level completeness + retry, optional max level cap, unchanged external metadata shape.
  • Integrity: band/group completeness audit (missing/zero-byte chunks), skip intact arrays, rewrite incomplete ones.
  • Encoding: unified create_geozarr_encoding (excludes CRS/grid mapping vars, env soft cap EOPF_MAX_CHUNK_BYTES).
  • CRS: best-effort injection for supplemental groups (warn, don’t fail).
  • Metrics (optional): per-step timing + structural summary JSON.
  • Validation & consolidation: tolerant fallbacks (warnings instead of hard errors).
  • Extras: benchmark random access, minimal STAC draft, HTML structural report.
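The integrity audit above (missing or zero-byte chunks) can be illustrated with a minimal sketch. It assumes a local directory store with one file per chunk; `audit_chunks` and `is_intact` are hypothetical names for illustration, not functions from this PR. Note that Zarr may legitimately omit chunks that contain only the fill value, so a real audit would likely need to account for that.

```python
from pathlib import Path


def audit_chunks(array_dir: Path, expected_keys: list[str]) -> dict[str, list[str]]:
    """Classify expected chunk keys as missing or zero-byte (local store assumed)."""
    missing, zero_byte = [], []
    for key in expected_keys:
        chunk = array_dir / key
        if not chunk.is_file():
            missing.append(key)
        elif chunk.stat().st_size == 0:
            zero_byte.append(key)
    return {"missing": missing, "zero_byte": zero_byte}


def is_intact(report: dict[str, list[str]]) -> bool:
    """An array counts as intact only if no chunk is missing or empty."""
    return not report["missing"] and not report["zero_byte"]
```

Under this sketch, an array whose report is intact would be skipped on a rerun, and any other array would be cleaned up and rewritten.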

Reliability / Safety

  • Default --overwrite mode is still fail.
  • Defensive cleanup before rewrites of partial arrays.
  • Multiscale code was hardened (safer, resumable, validated) without changing the API, metadata, or on-disk layout.
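As a rough illustration of the resumable, per-level behaviour described above, the loop below skips levels that are already complete, cleans up partial output before each write, and retries a bounded number of times. All names (`build_levels` and the callback parameters) are hypothetical, and the choice of `OSError` as the retryable error is an assumption; the actual implementation in this PR may differ.

```python
from typing import Callable


def build_levels(
    levels: list[int],
    is_complete: Callable[[int], bool],
    write_level: Callable[[int], None],
    cleanup: Callable[[int], None],
    max_retries: int = 2,
) -> dict[int, str]:
    """Write each overview level, skipping complete ones and retrying failures."""
    status: dict[int, str] = {}
    for level in levels:
        if is_complete(level):
            status[level] = "skipped"  # resumable: nothing to redo
            continue
        for _attempt in range(max_retries + 1):
            cleanup(level)  # defensive: remove partial output before (re)writing
            try:
                write_level(level)
            except OSError:
                continue  # assumed transient I/O failure, try again
            if is_complete(level):
                status[level] = "written"
                break
        else:
            status[level] = "failed"  # all attempts exhausted
    return status
```

Because completeness is checked both before and after each write, rerunning the same call after an interruption only touches the levels that are still incomplete.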

Dev & Docs

  • Restored richer README (trimmed tone, added uv workflow).
  • Targeted new tests (writer, encoding, helpers, multiscales path).
  • Added uv.lock to .gitignore to keep the diff small. Happy to revert if preferred.

Backward Compatibility

  • External entry (create_geozarr_dataset) intact; legacy unused internals removed.
  • Re-exports maintain import paths.

Performance Notes

  • Skips already-valid data/multiscales on reruns.
  • The chunk-size soft cap (EOPF_MAX_CHUNK_BYTES) bounds per-chunk memory use.
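One way a byte-based soft cap on chunk size could work is sketched below: the largest chunk dimension is halved until the chunk fits under the cap. The environment variable name comes from this PR, but the helper names, the 8 MiB default, and the halving strategy are assumptions for illustration only.

```python
import math
import os


def max_chunk_bytes(default: int = 8 * 1024**2) -> int:
    """Read the soft cap from EOPF_MAX_CHUNK_BYTES (default value is assumed)."""
    return int(os.environ.get("EOPF_MAX_CHUNK_BYTES", default))


def cap_chunk_shape(shape: tuple[int, ...], itemsize: int, max_bytes: int) -> tuple[int, ...]:
    """Halve the largest chunk dimension until the chunk fits under max_bytes."""
    chunks = list(shape)
    while math.prod(chunks) * itemsize > max_bytes and max(chunks) > 1:
        i = chunks.index(max(chunks))
        chunks[i] = math.ceil(chunks[i] / 2)
    return tuple(chunks)
```

For example, a 4096 x 4096 chunk of 2-byte values (32 MiB) is reduced to 2048 x 2048 (8 MiB) under an 8 MiB cap, while chunks already under the cap are returned unchanged.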

Follow-Up Suggestions

  • Builder Pattern Core API (pipeline-friendly): geo = GeoZarrBuilder(config).with_input(dataset_or_url).select(groups_or_vars).inject_crs(crs_groups).write(output_path).
    Benefits: composability, clearer intent sequencing, easier partial reuse (e.g., build → metrics → dry-run).
  • Structured logging (JSON) + log levels.
  • Pluggable metrics sinks (stdout, file, HTTP).
  • Multiscales schema version tag & optional validation hook.
  • Extended dtype/compression strategy matrix & autotuning.
  • Incremental resume mode (true “merge” semantics with per-var diff).
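To make the builder suggestion concrete, here is a minimal sketch of a fluent builder that only records its steps and returns them as a plan. The method names follow the bullet above; everything else (the `steps` list, the `dry_run` flag, the plan format) is hypothetical, since no such class exists in this PR.

```python
from dataclasses import dataclass, field


@dataclass
class GeoZarrBuilder:
    """Hypothetical fluent builder: each method records a step and returns self."""

    config: dict = field(default_factory=dict)
    steps: list = field(default_factory=list)

    def with_input(self, source):
        self.steps.append(("input", source))
        return self

    def select(self, names):
        self.steps.append(("select", tuple(names)))
        return self

    def inject_crs(self, crs_groups):
        self.steps.append(("inject_crs", tuple(crs_groups)))
        return self

    def write(self, output_path, dry_run=True):
        self.steps.append(("write", output_path))
        if not dry_run:
            # A real implementation would execute the recorded steps here.
            raise NotImplementedError("execution is outside this sketch")
        return list(self.steps)


# Placeholder inputs; a dry run returns the recorded plan without writing.
plan = (
    GeoZarrBuilder({"overwrite": "fail"})
    .with_input("input.zarr")
    .select(["group_a"])
    .inject_crs(["group_b"])
    .write("output.zarr")
)
```

Recording steps before executing them is what would make dry runs and partial reuse straightforward in such a design.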

(Inherits some features from an early-stage metrics exploration: https://github.com/wietzesuijker/data-model/tree/feat/convert-metrics)

@wietzesuijker wietzesuijker force-pushed the feat/metrics-pr26 branch 2 times, most recently from 104d881 to bf8cacf Compare September 17, 2025 00:36
@wietzesuijker wietzesuijker changed the title from "metrics(v1): structured conversion metrics + CLI flags" to "conversion refactor with resumable multiscales + JSON metrics" Sep 17, 2025