APPS-3915 private ecr nextflow#1554
Open
bomquangia wants to merge 48 commits into
Open
Conversation
|
SAST scan cancelled: new commit detected on PR |
Introduces a separate JIT/role for pulling Docker images from a private AWS ECR registry (least-privilege, decoupled from the existing S3-workdir role). Adds aws_ecr() to isolate ECR commands from workdir AWS_* env vars, writes short-lived JITs with mode 600 via file:// to avoid leaking under set -x, and refreshes both tokens on the same loop with non-empty validation and consecutive-failure escalation.
Fixes from adversarial review of the initial ECR work: - Fail-fast in aws_login when sts:AssumeRoleWithWebIdentity fails (prior code echoed "Successfully configured" even on failure). - Pipe dx-jobutil-get-identity-token directly to file under umask 077 so the JWT is never put in a shell variable that would leak under set -x. - Validate JIT shape (3 base64url segments) before promoting; rejects partial writes and proxy error pages that pass a size-only check. - Broaden aws_ecr unset list (AWS_CONFIG_FILE, AWS_SHARED_CREDENTIALS_FILE, AWS_CONTAINER_CREDENTIALS_*) and pin AWS_CONFIG_FILE to a dedicated dx-ecr-config path so the [ecr] profile never collides with the customer's ~/.aws/config. - Drop the post-write STS sanity call in ecr_aws_login to save one API hit per task subjob startup. - Refresh at 50% TTL (180s, was 240s) to leave more headroom on slow workers where the JIT could otherwise expire mid-flight. - Make workdir login failure abort the head job; ECR failure remains a loud warning so pipelines that don't actually use ECR aren't blocked.
Review-3 follow-up: when dx-jobutil-get-identity-token fails, the redirect target was already truncated to zero bytes by the shell. Add an explicit rm -f so we don't leave a 0-byte mode-600 file lingering on the worker.
Phase 2.1-2.3 of private ECR support for Nextflow builds.
- nextflow_utils.parse_nextflow_config_dx_fields(): best-effort parser
for dnanexus.* (workdir + ECR JIT) and aws.region. Handles dotted and
scope-block syntax; strips // and /* */ comments. Returns {} cleanly
when nextflow.config is absent (e.g. --repository <url> builds, where
NPI clones the repo and reads the config server-side).
- build_pipeline_with_npi() forwards parsed config as NPI input fields.
All fields are optional in the NPI input spec, so older NPI versions
silently ignore unknown keys (release-order risk noted in plan).
- New CLI flag --ecr-region: build-only override that lets the importer
job authenticate to a non-default-region ECR (e.g. bundle a us-west-2
image into a pipeline whose runtime aws.region is us-east-1). Has no
runtime effect; runtime ECR pulls always use aws.region per Q3.
- nextflow_templates.py unchanged: per Q4 (nextflow.config is single
source of truth) ECR config travels with the shipped resources_dir,
not as applet inputs.
Phase 2.5 of private ECR support for Nextflow builds. When `dx build --nextflow --cache-docker` resolves and pulls container images referenced by a pipeline, detect AWS ECR registry hostnames and run `aws --profile ecr ecr get-login-password | docker login` before each pull. - _extract_ecr_host_and_region: hostname parser matching the bash-side is_ecr_host helper. Commercial AWS partition only; GovCloud (us-gov-*) and China (.amazonaws.com.cn) are rejected — their STS/ECR endpoints haven't been validated. - _ecr_docker_login: shells out to `aws` rather than adding boto3 as a dxpy dep. Cached per (host, region) for the build's lifetime to avoid STS rate-limit cascades on wide-fanout pipelines. - ensure_ecr_login_for_image: public no-op-for-non-ECR helper called from both digest resolution (`docker manifest inspect`) and the actual `docker pull` in DockerImageRef._cache. The [ecr] AWS profile is set up by the importer job entrypoint (npi.sh's ecr_aws_login). If it isn't, the `aws` call fails and we log a warning then continue — the subsequent `docker pull` will surface the auth error to the user clearly.
Fixes from adversarial review of Phase 2 build-time ECR work. - nextflow_utils._strip_groovy_comments: mask string literals before comment stripping. Previously a legal config value containing `//` (e.g. `'job://workspace_id'` for ecrJobTokenSubjectClaims, S3 URLs) had its closing quote eaten by the line-comment regex, silently dropping the field. Block comments now also preserve newlines so line-anchored regexes don't shift. Removed dead _extract_quoted_value. - nextflow_builder: add _npi_input_names() and gate forwarding of new ECR/AWS input fields on the deployed NPI's input spec. The platform's applet/run rejects unknown inputs with InvalidInput, so forwarding unconditionally would break every build against an older NPI. Drop unrecognised fields with a clear warning so builds still succeed for non-ECR pipelines on older importers. - dx.py: format-validate --ecr-region against the AWS region naming pattern so typos / shell metacharacters fail before launching a remote job. - collect_images: reject us-iso-* and us-isob-* regions in addition to us-gov-* (their STS/ECR endpoints are not validated for this flow). Corrected the misleading token-lifetime comment that conflated the 1h STS session with the 12h docker registry token. Mirror docker login via `sudo -n` so subsequent `sudo docker pull` in DockerImageRef._cache finds the auth in /root/.docker/config.json rather than falling back to anonymous access. - ImageRef.DockerImageRef._cache: fail loud when ensure_ecr_login_for_image returns False on a confirmed-ECR image, with a message naming the host and pointing at the relevant nextflow.config keys + IAM permissions. Avoids the previous behaviour of letting `sudo docker pull` surface a generic "no basic auth credentials" error several layers down.
Fixes from pass 2 of the adversarial review. - collect_images._ecr_docker_login: drop the per-image `--region` arg from `aws ecr get-login-password` and rely on the [ecr] profile's region setting. This makes the user's `--ecr-region` build-time override actually take effect — without it the per-image hostname region always won and the override was informational only. Per design Q3 a build authenticates to one ECR region; pipelines mixing regions must split into multiple `--cache-docker --ecr-region` builds. - collect_images._ecr_docker_login: treat sudo docker login failure as fatal (return False) rather than logging a warning and continuing. If /root/.docker/config.json never gets the ECR token, the subsequent `sudo docker pull` in DockerImageRef._cache falls back to anonymous access and fails with a generic registry-auth error, defeating the fail-loud branch in _cache. Most likely cause is the importer image lacking NOPASSWD sudoers for docker — this surfaces it loudly instead of as a confusing pull failure. - nextflow_builder build_pipeline_with_npi: warn on every dropped config field, not just ECR-specific. Previously a silent drop of `aws_region` (e.g. against an older NPI that does not declare it) surfaced inside the importer as an opaque "ECR is configured but no AWS region is available" error.
Pass 3 verdict was Ship It with only Low/Nit findings. Cleaning up the two findings worth applying inline. - dx.py: drop the local `import re as _re` inside the build validation block; `re` is already imported at module scope (line 22).
Adds 39 unit tests covering the new Phase 2 build-time ECR code paths.
test_nextflow_collect_images.py — appended:
- TestEcrHostExtraction (15 cases): commercial AWS partition matches,
mixed-case lowering, public-registry rejection, GovCloud / Secret
(us-iso, us-isob) / China (.cn) partition rejection, malformed
hostname rejection, None/empty handling.
- TestEcrDockerLogin (4 cases): happy path verifies the exact command
shape (`aws --profile ecr ecr get-login-password` with no --region,
the docker login --password-stdin pipe, the sudo mirror); aws failure
short-circuits before docker; sudo mirror failure returns False
(matches the design — without root's docker config, sudo docker pull
later falls back to anonymous and breaks); cache hit skips both
subprocess calls on the second invocation.
- TestEnsureEcrLoginForImage (2 cases): no-op for non-ECR refs;
delegates to _ecr_docker_login with the parsed (host, region) for
ECR refs.
test_nextflow_config_parser.py — new file:
- TestParseDottedForm: all 7 keys forwarded; both single- and
double-quoted values accepted.
- TestParseScopeBlockForm: dnanexus { ... } and aws { ... } block
styles; dotted-form-wins-over-scope-block precedence.
- TestComments: regression test for the pass-1 bug — `'job://...'`
URI value must not have its `//` eaten by line-comment stripping.
Plus trailing line comment stripped, block comment stripped,
fully-commented-out key ignored.
- TestParseEdgeCases: None / empty / missing-file / unrelated-config
inputs all return {} cleanly.
- TestStripGroovyComments: direct unit tests on the comment stripper,
asserting URI preservation in both quote styles, line comment
removal, block comment removal, newline preservation across block
comments (so downstream line-anchored regexes don't shift).
All 39 tests pass against the current implementation.
Adds 12 tests covering the bash fragment in DxBashLib.groovy.dxLib() that runs at task-runtime on the worker. The bash is sliced out of the Groovy source at test time (with Groovy escape resolution `\\` -> `\` etc.) and sourced into a bash harness with `aws` and `docker` mocked via PATH so we can assert on call arguments and exit codes without launching a real DNAnexus job. Coverage: - is_ecr_host: commercial AWS partition accepted; GovCloud, China, public registries rejected. - extract_ecr_host_from_image: ECR image -> host extracted; non-ECR image -> silent (empty); mixed-case input lowercased. - ecr_region_from_host: region parsed correctly. - _is_ecr_auth_error: recognises `denied`, `ExpiredToken`; does not flag generic network errors (avoids unnecessary STS relogin cascades on wide-fanout pipelines). - ecr_docker_login: cache-file hit short-circuits without invoking `aws` or `docker` (verified via mock-call log). Source location is auto-discovered via the dnanexus parent dir, preferring nextaur-app worktrees over the main checkout (since the Phase 1 ECR helpers are still on a feature branch). Falls back to $DX_NEXTAUR_BASH_LIB env var override for CI. Tests skip cleanly when no source is available.
Six bugs were found in a final cross-cutting review pass that earlier adversarial passes missed. BUG-1 (Bug): Runtime ECR fail-loud regression - nextflow.sh head + task subjob: ecr_aws_login failure was logged-and- continued. Plan and NPI side both treat configured-but-failed ECR as fatal; runtime path silently kept the wrong behaviour. With ECR mis- configured every task would degrade into the docker-pull retry loop and fail with opaque registry errors — 3*retry*N tasks of cost. - Fix: dx-jobutil-report-error + exit 1 when ecrRoleArnToAssume is set but ecr_aws_login returns non-zero. ecr_aws_login itself still returns 0 cleanly when ECR is not configured, so non-ECR pipelines are unaffected. BUG-2 (Bug): Newline injection in nextflow.config parser - Python's `[^']*` and `[^"]*` capture classes match newlines (the `(?m)` flag only affects `^` and `$`). A pipeline author could embed `\n[default]\naws_access_key_id = ATTACKER` inside aws.region and the value would round-trip through the parser into NPI's ~/.aws/credentials heredoc, injecting a rogue [default] profile. - Fix: parse_nextflow_config_dx_fields now rejects any captured value containing CR or LF — same discipline as AwsUtils.shellSingleQuote on the runtime path. BUG-3 (Medium): aws_region drop warning was generic - When the user configures ECR but the deployed NPI does not declare aws_region, dropping it produced only the generic "may fall back to defaults or fail" warning — but ECR is GUARANTEED to fail without it. - Fix: detect ECR intent + dropped aws_region and escalate to a specific "ECR is configured but no AWS region is available" warning pointing at the importer-version mismatch. BUG-3 (refactor): extract _apply_npi_input_gate helper - The forwarding-gate logic was inline in build_pipeline_with_npi and hard to test in isolation. Extracted to a pure helper so the warning- categorisation logic is unit-testable. BUG-5 (Low): _ECR_LOGGED_IN_HOSTS lifetime contract - Documented that the cache is correct only for the lifetime of a single dx-build process and added a reset_ecr_login_cache() helper for any future long-lived caller. BUG-9: tests for the gaps - test_nextflow_config_parser.py: 6 newline-rejection tests plus a "clean value still accepted" sanity check (BUG-2 regression). - test_nextflow_collect_images.py: TestDockerImageRefEcrFailLoud with 2 cases (raises on ECR auth failure, proceeds on success) for the BUG-1 fail-loud branch in DockerImageRef._cache. - test_nextflow_builder_npi_gate.py: 8 tests covering the extracted gate helper — describe-failure, all-accepted, ECR-cluster drop, aws_region escalation (BUG-3 regression), generic-warning fallback, mixed drop. 97/97 tests passing (was 81 before; added 16 new tests).
F10 (Low symmetry): runtime ecr_aws_login now rejects malformed awsRegion - Mirrors the npi.sh region regex check and the dx.py CLI flag check. - Defense-in-depth: AwsUtils.shellSingleQuote already strips \r\n before writing /.dx-aws.env, so direct INI injection is not reachable here, but a malformed region (e.g. shell metacharacters) would otherwise produce a confusing AWS API error far from the source. F7 test parity: test_is_ecr_host_iso_rejected - New regression test confirming the bash is_ecr_host helper rejects us-iso-* and us-isob-* partitions (matching the Python-side _extract_ecr_host_and_region). The corresponding bash-side fix lives in nextaur-app/.../DxBashLib.groovy. 97 -> 98 tests passing.
Fixes from the 8th and 9th adversarial review passes. Pass-8 release blocker / bugs: - nextflow.sh::ecr_aws_login: add post-write `aws_ecr sts get-caller-identity` probe so trust-policy / role-ARN errors fail-fast in the entrypoint instead of per-task at first `docker pull` (F3). - nextflow_builder._apply_npi_input_gate: escalate from warn-and-continue to `raise DXError` when ECR was clearly requested but the deployed NPI cannot honour it (older NPI lacking ecr_* slots, dropped aws_region with ECR intent, undescribable NPI). Non-ECR pipelines still degrade to a warning so older-NPI builds keep working (F8). Pass-9 release blockers / bugs: - nextflow_builder.prepare_custom_inputs: restore APPS-3872 implementation that was accidentally reverted during a rebase. Re-establishes the description + help_text join, conditional `Default value:`, and the clean prefix logic. Verified against the existing schema tests (F9-1). - Pre-flight validation: new `preflight_validate_for_cache_docker(...)` helper, called from dx_build_app.py BEFORE the local pipeline source is uploaded to the user's project. Without this, any DXError raised by the in-build gate would leave an orphaned `.nf_source/<basename>/` upload behind, forcing the user to `dx rm -r` before retrying (F9-2). - Pre-flight tightening: with `--cache-docker`, raise unconditionally on NPI describe failure (don't depend on locally-detectable ECR intent). Closes the leak where a parser miss + describe failure produced a silent build (F9-4). - `--repository <url>` mode: pre-flight describes the deployed NPI and refuses to launch when the importer lacks ECR input slots, so a remote repo using `dnanexus.ecrRoleArnToAssume` cannot silently bypass ECR auth (F9-3). Lows: - collect_images._extract_ecr_host_and_region: strip `docker://` / `oras://` URI scheme prefixes for symmetry with the Groovy bash side (F9-9). Defensive — today's only caller already strips. - collect_images._resolve_digest: comment the best-effort vs fail-loud duplication of `ensure_ecr_login_for_image` (F9-6). - nextflow_utils._strip_groovy_comments: docstring lists what's NOT handled (triple-quoted strings, slashy strings, includeConfig chains) plus the `_accept` newline backstop (F9-7). - nextflow.sh::ecr_aws_login: comment the STS rate-limit trade-off of the per-task probe (F9-8). - _apply_npi_input_gate: docstring updated to enumerate raise vs warn paths after the F8 fail-fast escalation (F9-5). Tests: 99 -> 102. New TestPreflightValidateForCacheDocker covers F9-2/3/4 paths (describe-failure raises unconditionally on --cache-docker; --repository mode against an older NPI raises; --repository mode against a complete NPI passes).
… not common practice
…images When OIDC-based ECR auth is configured (dnanexus.ecrRoleArnToAssume), Nextaur must contact ECR at runtime to resolve the digest of latest/no-tag images before each cache lookup -- so runtime workers need OIDC credentials even for fully pre-cached pipelines. collect_images.py: - _is_floating_ecr_tag(): detects ECR images with latest or no tag - collect_docker_images(): when DX_ECR_ROLE_ARN_TO_ASSUME is set and a floating-tag ECR image is found, raise ImageRefFactoryError (default) or log a warning if DX_ECR_ALLOW_LATEST is set nextflow_builder.py: add ecr_allow_latest to _ECR_SPECIFIC_INPUTS and build_pipeline_with_npi() signature; forward to NPI when True dx.py: add --ecr-allow-latest CLI flag with validation (requires --cache-docker); set DX_ECR_ALLOW_LATEST env var before NPI run dx_build_app.py: thread ecr_allow_latest=args.ecr_allow_latest through to build_pipeline_with_npi()
…ating-tag ECR The escape hatch was misleading: even with --ecr-allow-latest the pre-cached image is silently bypassed every run. Nextaur's head job calls `docker manifest inspect` to resolve the digest of latest/no-tag images before the cache lookup, but the head job Docker daemon is never logged into ECR (only the AWS CLI [ecr] profile is set up). The inspect fails, digest stays null, cache lookup finds nothing, and the build is dead weight. collect_images.py: replace the conditional error + DX_ECR_ALLOW_LATEST warning branch with a single unconditional ImageRefFactoryError that explains the root cause and directs the user to pin the image tag or digest. nextflow_builder.py: remove ecr_allow_latest from _ECR_SPECIFIC_INPUTS and build_pipeline_with_npi() signature. dx.py: remove --ecr-allow-latest flag and its --cache-docker validation. dx_build_app.py: remove ecr_allow_latest= kwarg from build_pipeline_with_npi call.
_npi_input_names() + _apply_npi_input_gate() are transition-period scaffolding. Once the NPI version with ecr_role_arn_to_assume/ecr_job_token_audience/ ecr_job_token_subject_claims/ecr_region_override in its input spec is the minimum deployed version, this entire gate (including the describe() API call on every dx build --nextflow) can be deleted and replaced with unconditional forwarding.
eaaa3db to
022f595
Compare
test_nextflow_builder_npi_gate.py tests only the transition-period gate; add a matching TODO so it gets deleted together with the production code once the new NPI is the minimum deployed version.
… image hostname ECR tokens are region-scoped — the token must come from the same region as the registry being logged into. The region is already unambiguously encoded in every ECR image hostname (e.g. 123456789.dkr.ecr.us-east-1.amazonaws.com -> us-east-1), so deriving it from the hostname is both correct and sufficient. Changes: - collect_images._ecr_docker_login: restore --region <region> to the aws ecr get-login-password call, derived from the image hostname via _extract_ecr_host_and_region. Drops the old reliance on the [ecr] profile's region setting. Also means pipelines with images in multiple ECR regions work correctly without any extra flags (cache key is still (host, region)). - nextflow_builder: remove ecr_region_override from _ECR_SPECIFIC_INPUTS; remove ecr_region param from preflight_validate_for_cache_docker() and build_pipeline_with_npi(); update warning messages and docstrings. - dx.py: remove --ecr-region argument definition and all validation logic. - dx_build_app.py: remove ecr_region=args.ecr_region from both call sites. - test_nextflow_collect_images: update assertion to expect --region in aws call. - test_nextflow_builder_npi_gate: remove ecr_region_override from test fixtures and preflight call signatures.
aws.region in nextflow.config was forwarded to NPI as aws_region so
ecr_login.py could set the STS endpoint region and write it into the
[ecr] AWS profile. Both uses are now gone:
- STS region: boto3.client('sts') without region_name resolves the
endpoint from EC2 IMDS. STS is a global service — any regional endpoint
issues credentials valid for ECR calls in any region, including regions
different from where the NPI EC2 instance runs.
- Profile region: collect_images._ecr_docker_login now passes --region
derived from each image hostname directly to aws ecr get-login-password,
so the [ecr] profile's region setting is never consulted.
Changes:
- nextflow_utils: remove ('aws.region', 'aws_region') from _NEXTFLOW_DX_CONFIG_KEYS.
- nextflow_builder: remove aws_region_dropped check and the ECR+region
DXError from _apply_npi_input_gate; update docstrings and warning messages.
- test_nextflow_config_parser: remove aws_region parse assertions; convert
test_aws_block to assert aws_region is NOT in result; remove two
aws_region newline-injection tests (key no longer parsed).
- test_nextflow_builder_npi_gate: remove test_aws_region_drop_with_ecr_intent_raises
and test_aws_region_drop_without_ecr_intent_uses_generic_warning; update
fixtures that used aws_region.
Six targeted fixes from the Opus review pass: - nextflow.sh (_fetch_jit_to_file): add a line-count guard before the JWT regex check. A multi-line file (proxy error page, truncated write) could previously pass the grep if any single line matched the base64url pattern. Now rejects any file with more than one newline before checking the shape. (B6) - nextflow_utils.py (dotted-form regex): allow an optional trailing `;` before end-of-line. Nextflow config files sometimes use semicolons as statement terminators; without this the parser silently skipped those values. (B11) Corresponding doc fixes in nextaur-app/doc/: - ECR_Private_Registry.md (Multi-Region): runtime ECR pulls also derive region from the image hostname (DxBashLib.ecr_docker_login calls ecr_region_from_host and passes --region explicitly). Note that aws.region is still required at runtime for the [ecr] profile — the per-task --region arg overrides it for actual docker login calls. (D1) - APPS-3915_changes_overview.md: correct the ecr_aws_login description — failure IS fatal (exit 1) when ecrRoleArnToAssume is set. (D2) - Both doc files: floating-tag guard description now explicitly qualifies that the check fires only when ECR OIDC is configured (DX_ECR_ROLE_ARN_TO_ASSUME is set). (D4) - file-apps ecr_login.py: include dx-jobutil-get-identity-token stderr in the failure message so operators can see the API error without inspecting the raw job log. (S5)
… guard The regex '^[A-Za-z0-9_-]+...' assumed base64url encoding but the DNAnexus JIT token uses standard base64 (+ / =) in the signature, causing every token to be rejected. Remove the fragile format check — STS validates the token on AssumeRoleWithWebIdentity and produces a clear error on rejection. Retain the wc -l > 1 guard which catches the actual threat (proxy error pages) and add an explicit error message when it fires.
…ME override The deployed NPI (v0.10.1) does not declare ecr_* input slots, so _apply_npi_input_gate was blocking every dx build --nextflow for ECR pipelines. The runtime ECR auth path does not need those NPI inputs — DxOptions.groovy reads ecrRoleArnToAssume directly from the bundled nextflow.config at orchestrator-job time. Downgrade the DXError to a warning so runtime ECR tests can proceed. The --cache-docker preflight check (preflight_validate_for_cache_docker) is unchanged: the NPI genuinely must pull from ECR during the build job, so that path correctly requires an updated importer. Add DX_NPI_NAME env-var override to get_importer_name() so QE tests can point at a custom importer (one that declares ECR input slots) without publishing a new version to production. Cache-docker ECR tests will set DX_NPI_NAME=<test-npi-name> once a test importer is deployed in vdev.
…rge) Add get_importer_object() to nextflow_utils: returns DXApplet when DX_NPI_NAME starts with 'applet-', DXApp otherwise. Replace all three DXApp(name=get_importer_name()) call sites in nextflow_builder.py. This allows QE to set DX_NPI_NAME=applet-xxxx for cache-docker ECR testing before a proper app version of the NPI is published. Revert this commit (keep only get_importer_name changes) once the test NPI is published as a real app.
DXApp.run() takes app_input= but DXApplet.run() takes applet_input= as a positional argument. When DX_NPI_NAME points to an applet-ID, get_importer_object() returns a DXApplet and the wrong keyword caused TypeError. Branch on isinstance(importer, dxpy.DXApplet) to pick the correct kwarg name.
Remove ecr_aws_login() call from the head job main() entrypoint. The head job never pulls container images so eager ECR credential validation there mis-attributes failures and blocks pipelines that have ECR config set but no ECR images in a given run. Each task worker already calls ecr_aws_login() in nf_task_entry() independently. Update stale comments in ecr_aws_login() to reflect it is now called only from task workers, not the head job. Update nextaur_assets_25_10.staging.json to point aws:us-east-1 at the newly built NPI asset (record-J85x02j0kp5pgbZBgb9Q9JV2) which includes all APPS-3915 ECR changes. Bump toolkit_version suffix to 0.409.0.ecr1 for log traceability.
…ll back to nextflow_pipeline_importer
record-J85x02j0kp5pgbZBgb9Q9JV2 was created in a temporary test project that was subsequently cleaned up, making it inaccessible. All runtime QE tests were failing with "Cannot find latest version of nextaur plugin" because the applets could not fetch the nextaur plugin bundle at startup. Revert to record-J74P910042Jbfy169vg31z9j (the stable pre-APPS-3915 release record) so non-ECR tests that install dxpy directly from this branch continue to work. ECR-specific QE tests use build-nextaur=true which creates a fresh nextaur asset with ECR support at test time via update_and_install.sh, overwriting this committed fallback.
…input)
The local preflight and NPI-side floating-tag guard now both read ECR
intent directly from dnanexus { ecrRoleArnToAssume } in nextflow.config,
so no explicit NPI input field is required. This removes the dependency
on a custom NPI for the floating-tag rejection path.
Changes:
- collect_images.py: add scan_ecr_floating_tags_in_config(src_dir, profile)
that brace-counts profile blocks to identify floating ECR tags without
running `nextflow inspect`; lazy-imports parse_nextflow_config_dx_fields
to avoid circular import; _is_floating_ecr_tag helper detects :latest
and tag-less ECR refs
- nextflow_builder.py: preflight_validate_for_cache_docker gains a profile
param and fires the floating-tag guard after the NPI gate using the new
scanner; raises DXError before .nf_source/ upload so no orphaned data
- dx_build_app.py: pass profile=args.profile to preflight_validate_for_cache_docker
- test_nextflow_builder_npi_gate.py: add 13 unit tests covering
TestScanEcrFloatingTagsInConfig (8 cases) and TestPreflightFloatingTagGuard
(5 cases); all 27 tests pass
…e ECR
Previously, build-time ECR credentials were read from nextflow.config
(ecrRoleArnToAssume), which caused them to be bundled into the resulting
applet permanently. Every runtime execution would then attempt an OIDC/STS
exchange even when all images were safely cached in DNAnexus storage —
breaking the applet if the private registry later revoked access.
This commit implements Option 2: build-time ECR credentials are passed
exclusively via explicit CLI flags and are never read from nextflow.config.
Design principle (now documented in ECR_Private_Registry.md):
--ecr-role-arn / --ecr-job-token-audience / --ecr-job-token-subject-claims
Build-time only. Forwarded as explicit NPI inputs. Never bundled into
the applet. Runtime has zero ECR dependency after images are cached.
nextflow.config: dnanexus.ecrRoleArnToAssume
Runtime only. Bundled into the applet permanently. Drives task-worker
ECR auth on every run. Irrelevant to --cache-docker builds.
Changes:
- dx.py: add --ecr-role-arn, --ecr-job-token-audience,
--ecr-job-token-subject-claims args (only valid with --cache-docker);
validation rejects them without --cache-docker
- dx_build_app.py: thread ecr_role_arn through preflight and build_pipeline_with_npi
- nextflow_builder.py:
- _apply_npi_input_gate: silently skip ECR-specific config fields instead
of forwarding them; warning text fixed to not claim runtime ECR works
- preflight_validate_for_cache_docker: takes ecr_role_arn param; ECR slot
check and floating-tag guard fire on the flag, not on config content
- build_pipeline_with_npi: takes ecr_role_arn/audience/subject_claims;
forwards them directly to NPI input_hash bypassing config-forwarding
- collect_images.py: NPI-side floating-tag guard reads ecr_role_arn_to_assume
from env var (the NPI input exported by npi.sh), not from nextflow.config
- test_nextflow_builder_npi_gate.py: 32 tests updated/added to reflect new
design; key new tests prove ECR config in nextflow.config does not trigger
preflight checks when --ecr-role-arn flag is absent
When a .nf_source/<basename>/ folder already exists in the destination project and --ensure-upload is passed, reuse the existing upload instead of raising an AppBuilderException. This mirrors how --ensure-upload works for the local-build path and enables CI pipelines to persist SOURCE_PROJECT across GHA runs without needing to clean up between retries. Without this fix, the second run of any test that uses --ensure-upload with --cache-docker would fail with "Folder .nf_source/<name> exists... Remove the directory to avoid file duplication and retry".
…ir gate from preflight Two changes to preflight_validate_for_cache_docker: 1. Reorder: floating-tag guard fires BEFORE the ECR slot check when --ecr-role-arn is passed. Previously, the slot check ran first, which blocked test_cache_docker_floating_tag_rejected on environments where the NPI has not yet been upgraded to declare ECR inputs. User config errors (floating tags) should be surfaced before infrastructure gaps. 2. Remove _apply_npi_input_gate call from preflight. Workdir field forwarding (iam_role_arn_to_assume etc.) is irrelevant to the preflight's job of catching errors before .nf_source/ upload. The gate runs again in build_pipeline_with_npi where it actually populates the NPI input_hash. Removing it also eliminates the redundant _npi_input_names() describe call that was only needed to gate workdir fields. Add unit test TestPreflightFloatingTagGuard:: test_floating_tag_raises_before_ecr_slot_check_when_npi_lacks_ecr_inputs to pin the ordering guarantee.
… gate from preflight
…e_nextflow_config_dx_fields, _strip_groovy_comments
The fail-loud message pointed users at dnanexus.ecrRoleArnToAssume in nextflow.config, but that drives runtime ECR auth. Build-time --cache-docker pulls authenticate via the --ecr-role-arn / --ecr-job-token-audience flags; point the message there instead.
Drop preflight_validate_for_cache_docker and the local nextflow.config
floating-tag scanner. The scanner was a best-effort regex that missed the
common process { container = ... } block form, and NPI's collect_docker_images
already rejects floating-tag ECR images authoritatively on the real evaluated
config. Relying on NPI removes the duplicated, incomplete client-side check.
Also drops the now-dead _npi_input_names / ECR slot gate, the unused src_dir
parameter, the reset_ecr_login_cache helper, and the npi_gate test module.
Replace the `aws ecr get-login-password` subprocess with a boto3 call (_get_ecr_password) that reads the [ecr] AWS profile directly. boto3 is already available on the NPI worker alongside dxpy, so this drops the dependency on the aws CLI binary being present. Update the collect_images ECR tests to mock _get_ecr_password accordingly.
Keep the DX_NPI_NAME override (the nextaur nf-qe workflow sets it via the npi-name input to point cache-docker ECR tests at a custom NPI), but drop the verbose docstrings and the run_input_kwarg branching: DXApp.run() and DXApplet.run() both take the input hash as the first positional argument, so get_importer_object().run(input_hash, ...) covers app and applet alike.
A task that uses a pre-cached image (loaded from DNAnexus storage) needs no ECR access at all, so a fatal ecr_aws_login at task entry wrongly killed --cache-docker pipelines whose ECR access was later revoked or misconfigured. Make ecr_aws_login non-fatal: log a warning and continue. ECR auth is still enforced where it is actually needed — the docker-pull step, where nextaur's nxf_docker_pull returns the non-retriable exit code 2, so a task that genuinely needs ECR still fails fast.
…rker Record private AWS ECR registry support for Nextflow under the Unreleased "Added" section, and restore toolkit_version.py to 0.409.0 (the .ecr1 suffix was a non-PEP440 dev marker that would break packaging).
…port refresh_web_identity_token_loop: keep the workdir token refresh identical to master (cadence, retry count, logging) and add the ECR token refresh as a separate parallel block — the previous rewrite (_refresh_token_file helper, consecutive-failure counters, 240->180s cadence) altered the unrelated workdir path. nextflow_utils.get_importer_name: import environ at module scope instead of inside the function.
The branch had repurposed --ensure-upload, in dx build --nextflow, to mean
'reuse the already-uploaded .nf_source folder, skip the upload'. That inverts
the flag's documented meaning ('upload unconditionally') and is a silent
stale-build footgun — a user who edits their pipeline and rebuilds with
--ensure-upload would get the old source. Restore the unconditional
folder-exists error; the QE need is handled QE-side by building into a unique
per-build folder.
…lure is fatal
The head job is not just a passive actor: it runs 'docker manifest inspect'
(DockerImage.implicitLatestGetDigest) to resolve the digest of floating-tag
images for the cache lookup, which needs ECR auth against a private registry.
Nextaur runs that command with allowed exit codes {0}, so the auth failure
throws and the run errors out on the head — it is NOT a silent cache bypass as
the comment and guard message claimed. Fix the nextflow.sh head-job comment and
the floating-tag guard error message accordingly.
…tinel parity, doc re-sync to Option 1 * nextflow.sh refresh_web_identity_token_loop: route both workdir and ECR refresh writes through _fetch_jit_to_file. The previous body wrote the tmp token file under the default umask (0644) and 'mv -f' preserved that mode, silently re-widening the JIT file from 0600 to 0644 every 240s. Going through _fetch_jit_to_file gives us (umask 077), the new explicit chmod 600 (covers the pre-existing-target case where the redirect would inherit 0644), and the multi-line / proxy-interception JWT-shape check for free. * _fetch_jit_to_file: add chmod 600 after the umask-wrapped write — umask only sets the mode for newly-created files; a pre-existing 0644 leftover would otherwise stay 0644. * collect_images.py: floating-tag guard message updated to match Option 1 reality (non-fatal head-side manifest-inspect failure -> silent cache miss -> dead-weight cache); was still claiming 'run errors out on the head'. * collect_images.py: null-sentinel parity with ecr_login.py — treat None / '' / literal 'null' identically as 'ECR not configured'.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
No description provided.