Feat/initial workflow #1
4ac2bef to cc6d716
Hi! I would appreciate any feedback on this PR. I went ahead and merged it already, since this is quite early stage. I recommend reading the related issue first and then reviewing the code in the new data-model-pipeline repo. The README is a good starting point for understanding what's going on.
Overall a nice setup, with all the build tooling and complete docs.
Does it need to be this complex, though? I added a couple of comments below to understand the reasons, or to encourage simplification.
Since this is merged already, perhaps these can turn into issues for a revision.
```dockerfile
# Pull lockfiles from data-model@<ref> (optionally a subdir)
RUN set -eux; \
    mkdir -p /tmp/dm; \
    curl -fsSL -o /tmp/dm/pyproject.toml "${DM_RAW_BASE}/${EOPF_GEOZARR_REF}/${EOPF_GEOZARR_SUBDIR}pyproject.toml"; \
    curl -fsSL -o /tmp/dm/uv.lock "${DM_RAW_BASE}/${EOPF_GEOZARR_REF}/${EOPF_GEOZARR_SUBDIR}uv.lock"; \
    test -s /tmp/dm/pyproject.toml && test -s /tmp/dm/uv.lock
```
Could we not instead just `uv pip install git+` from a tag of the data-model repo, or so? I have not seen this way of managing dependencies before: copying another project's pyproject.toml.
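A sketch of the suggested alternative, installing directly from a git ref of the data-model repo (the tag `v0.1.0` is a hypothetical example; whether the repo has such tags is an assumption):

```dockerfile
# Install straight from a git ref instead of copying the other project's lockfiles
# (v0.1.0 is a hypothetical tag name)
RUN uv pip install "git+https://github.com/EOPF-Explorer/data-model.git@v0.1.0"
```

This would pin the dependency via the tag itself, at the cost of losing the transitive pins that uv.lock provides.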
```dockerfile
ENV PYTHONUNBUFFERED=1 PIP_NO_CACHE_DIR=1
ARG PORTABLE_BUILD=0
ARG EOPF_GEOZARR_REF=main
ARG EOPF_GEOZARR_SUBDIR=
```
Why do we need this logic to add the data-model project as a subdir? I see this option is nicely mentioned under https://github.com/EOPF-Explorer/data-model-pipeline?tab=readme-ov-file#variants, but I'm not sure why it's needed.
```dockerfile
RUN apt-get update && DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends \
    libstdc++6 libgomp1 libexpat1 \
    gdal-bin proj-bin curl \
    && rm -rf /var/lib/apt/lists/*
```
Could we not instead use a parent image that has GDAL built and installed?
Or is the aim to keep dependencies simple?
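For illustration, the parent-image alternative could look roughly like this (the OSGeo project publishes GDAL images at `ghcr.io/osgeo/gdal`; the exact tag here is an assumption):

```dockerfile
# Base image that already ships GDAL and PROJ (tag is illustrative)
FROM ghcr.io/osgeo/gdal:ubuntu-small-3.8.5

# Only curl would still need installing for the lockfile fetch
RUN apt-get update && DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends \
    curl \
    && rm -rf /var/lib/apt/lists/*
```

The trade-off is pulling a larger, externally maintained base image versus installing a handful of apt packages on a slim one.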
```shell
python - "$STAC" $GROUPS <<'PY'
import re, sys, xarray as xr

stac, *groups = sys.argv[1:]
if any(re.fullmatch(r"\d+", g.strip()) for g in groups):
    print("ERROR: 'groups' must be paths like /measurements/reflectance/r20m (got numeric token).")
    sys.exit(64)

def open_dt(url):
    try:
        return xr.open_datatree(url, engine="zarr", consolidated=True)
    except Exception:
        return xr.open_datatree(url, engine="zarr", consolidated=None)

dt = open_dt(stac)
missing = []
for g in groups:
    key = g.strip("/")
    try:
        _ = dt[key]
        print(f"[ok] {g}")
    except Exception as e:
        print(f"[missing] {g}: {e}")
        missing.append(g)

if missing:
    try:
        meas = dt["measurements"]
        names = set()
        children = getattr(meas, "groups", None) or getattr(meas, "children", None)
        if isinstance(children, dict):
            names.update(children.keys())
        elif isinstance(children, (list, tuple)):
            for s in children:
                if isinstance(s, str) and s:
                    names.add(s.split("/", 1)[0])
        if names:
            print("candidates under /measurements:")
            for n in sorted(names)[:50]:
                print("  /measurements/" + n)
    except Exception:
        pass
    sys.exit(64)
PY
```
Feels like at least this part should be a separate script, so it's easier to read and test. Maybe the shell script part could also be moved out of this YAML?
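As a minimal sketch of that suggestion: the stdlib-only argument validation could move into a standalone module that unit tests can import (the filename `check_groups.py` and the function names are hypothetical; the xarray-dependent group check is elided):

```python
#!/usr/bin/env python3
"""Hypothetical check_groups.py: argument validation pulled out of the workflow YAML."""
import re
import sys

EX_USAGE = 64  # same exit code the inline script uses for bad arguments


def numeric_tokens(groups):
    """Return the group arguments that are bare numbers instead of paths."""
    return [g for g in groups if re.fullmatch(r"\d+", g.strip())]


def main(argv):
    stac, *groups = argv
    if numeric_tokens(groups):
        print("ERROR: 'groups' must be paths like /measurements/reflectance/r20m (got numeric token).")
        return EX_USAGE
    # ...the xarray-based existence check from the workflow would follow here...
    return 0


if __name__ == "__main__":
    sys.exit(main(sys.argv[1:]))
```

The workflow step would then shrink to `python check_groups.py "$STAC" $GROUPS`, and the validation logic becomes directly testable.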
```
tests-output/

# uv
uv.lock
```
Why ignore the lockfile?
This PR drops the mock-up and introduces a pipeline that works locally (for me).
It is still early stage, but hopefully this gives a baseline.
Getting the environment right was quite fiddly (rasterio 1.4.3 / GDAL).
This is the result of a couple of iterations: in an earlier stage I just bolted it on top of the data-model repo, then separated it into this pipeline repo for clarity.
There are some reporting remnants that depend on / point to the unmerged data-model feature branch and related PR EOPF-Explorer/data-model#26.