Skip to content

Conversation

wietzesuijker
Copy link
Collaborator

@wietzesuijker wietzesuijker commented Aug 25, 2025

PR drops the mock-up and introduces a pipeline that works locally (for me).

It is still early stage, but hopefully this gives a baseline.

Getting the env right was quite fiddly (rasterio 1.4.3 / gdal).

This is the result of a couple of iterations. In an earlier stage I just bolted on top of the data-model repo. I separated it to this pipeline repo for clarity.

There are some reporting remnants that depend on / point to this unmerged data-model feature branch and related PR EOPF-Explorer/data-model#26.

@wietzesuijker
Copy link
Collaborator Author

wietzesuijker commented Aug 25, 2025

Hi! I would appreciate any feedback on this PR. I went ahead and already merged it, since this is quite early stage. I recommend reading the related issue first and then reviewing the code in the new data-model-pipeline repo. The README is a good point to understand what's going on.

@j08lue j08lue deleted the feat/initial-workflow branch August 26, 2025 06:48
Copy link

@j08lue j08lue left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall nice setup with all the build tooling, and complete with docs.

Does this need to be this complex, though? I added a couple of comments below to understand the reasons or encourage simplification.

Since this is merged already, perhaps these can turn into issues for a revision, or so.

Comment on lines +24 to +29
# Pull lockfiles from data-model@<ref> (optionally a subdir)
RUN set -eux; \
mkdir -p /tmp/dm; \
curl -fsSL -o /tmp/dm/pyproject.toml "${DM_RAW_BASE}/${EOPF_GEOZARR_REF}/${EOPF_GEOZARR_SUBDIR}pyproject.toml"; \
curl -fsSL -o /tmp/dm/uv.lock "${DM_RAW_BASE}/${EOPF_GEOZARR_REF}/${EOPF_GEOZARR_SUBDIR}uv.lock"; \
test -s /tmp/dm/pyproject.toml && test -s /tmp/dm/uv.lock
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we instead not just uv pip install git+ from a tag of the data-model repo or so?

I have not seen this way of managing dependencies - to copy another project's project.toml.

ENV PYTHONUNBUFFERED=1 PIP_NO_CACHE_DIR=1
ARG PORTABLE_BUILD=0
ARG EOPF_GEOZARR_REF=main
ARG EOPF_GEOZARR_SUBDIR=
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need this logic to add the data-model project as a subdir?

I see this option is nicely mentioned under https://github.com/EOPF-Explorer/data-model-pipeline?tab=readme-ov-file#variants, but I'm not sure why it's needed.

Comment on lines +66 to +69
RUN apt-get update && DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends \
libstdc++6 libgomp1 libexpat1 \
gdal-bin proj-bin curl \
&& rm -rf /var/lib/apt/lists/*
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we not instead use a parent image that has GDAL built and installed?

Or is the aim to keep dependencies simple?

Comment on lines +87 to +129
python - "$STAC" $GROUPS <<'PY'
import re, sys, xarray as xr
stac, *groups = sys.argv[1:]
if any(re.fullmatch(r"\d+", g.strip()) for g in groups):
print("ERROR: 'groups' must be paths like /measurements/reflectance/r20m (got numeric token).")
sys.exit(64)
def open_dt(url):
try:
return xr.open_datatree(url, engine="zarr", consolidated=True)
except Exception:
return xr.open_datatree(url, engine="zarr", consolidated=None)
dt = open_dt(stac)
missing = []
for g in groups:
key = g.strip("/")
try:
_ = dt[key]
print(f"[ok] {g}")
except Exception as e:
print(f"[missing] {g}: {e}")
missing.append(g)
if missing:
try:
meas = dt["measurements"]
names = set()
children = getattr(meas, "groups", None) or getattr(meas, "children", None)
if isinstance(children, dict):
names.update(children.keys())
elif isinstance(children, (list, tuple)):
for s in children:
if isinstance(s, str) and s:
names.add(s.split("/", 1)[0])
if names:
print("candidates under /measurements:")
for n in sorted(names)[:50]:
print(" /measurements/" + n)
except Exception:
pass
sys.exit(64)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Feels like at least this part should be a separate script, so it's easier to read and test. Maybe also the shell script part could be moved out of this yaml?

tests-output/

# uv
uv.lock
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why ignore the lockfile?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants