Skip to content

APPS-3915 private ecr nextflow#1554

Open
bomquangia wants to merge 48 commits into
masterfrom
APPS-3915-private-ecr-nextflow
Open

APPS-3915 private ecr nextflow#1554
bomquangia wants to merge 48 commits into
masterfrom
APPS-3915-private-ecr-nextflow

Conversation

@bomquangia

Copy link
Copy Markdown
Contributor

No description provided.

@bomquangia bomquangia changed the title Apps 3915 private ecr nextflow APPS-3915 private ecr nextflow May 6, 2026
@kodem-security

Copy link
Copy Markdown

SAST scan cancelled: new commit detected on PR

bomquangia added 17 commits May 11, 2026 18:17
Introduces a separate JIT/role for pulling Docker images from a private AWS
ECR registry (least-privilege, decoupled from the existing S3-workdir role).
Adds aws_ecr() to isolate ECR commands from workdir AWS_* env vars, writes
short-lived JITs with mode 600 via file:// to avoid leaking under set -x,
and refreshes both tokens on the same loop with non-empty validation and
consecutive-failure escalation.
Fixes from adversarial review of the initial ECR work:
- Fail-fast in aws_login when sts:AssumeRoleWithWebIdentity fails (prior
  code echoed "Successfully configured" even on failure).
- Pipe dx-jobutil-get-identity-token directly to file under umask 077 so
  the JWT is never put in a shell variable that would leak under set -x.
- Validate JIT shape (3 base64url segments) before promoting; rejects
  partial writes and proxy error pages that pass a size-only check.
- Broaden aws_ecr unset list (AWS_CONFIG_FILE, AWS_SHARED_CREDENTIALS_FILE,
  AWS_CONTAINER_CREDENTIALS_*) and pin AWS_CONFIG_FILE to a dedicated
  dx-ecr-config path so the [ecr] profile never collides with the
  customer's ~/.aws/config.
- Drop the post-write STS sanity call in ecr_aws_login to save one API
  hit per task subjob startup.
- Refresh at 50% TTL (180s, was 240s) to leave more headroom on slow
  workers where the JIT could otherwise expire mid-flight.
- Make workdir login failure abort the head job; ECR failure remains a
  loud warning so pipelines that don't actually use ECR aren't blocked.
Review-3 follow-up: when dx-jobutil-get-identity-token fails, the redirect
target was already truncated to zero bytes by the shell. Add an explicit
rm -f so we don't leave a 0-byte mode-600 file lingering on the worker.
Phase 2.1-2.3 of private ECR support for Nextflow builds.

- nextflow_utils.parse_nextflow_config_dx_fields(): best-effort parser
  for dnanexus.* (workdir + ECR JIT) and aws.region. Handles dotted and
  scope-block syntax; strips // and /* */ comments. Returns {} cleanly
  when nextflow.config is absent (e.g. --repository <url> builds, where
  NPI clones the repo and reads the config server-side).

- build_pipeline_with_npi() forwards parsed config as NPI input fields.
  All fields are optional in the NPI input spec, so older NPI versions
  silently ignore unknown keys (release-order risk noted in plan).

- New CLI flag --ecr-region: build-only override that lets the importer
  job authenticate to a non-default-region ECR (e.g. bundle a us-west-2
  image into a pipeline whose runtime aws.region is us-east-1). Has no
  runtime effect; runtime ECR pulls always use aws.region per Q3.

- nextflow_templates.py unchanged: per Q4 (nextflow.config is single
  source of truth) ECR config travels with the shipped resources_dir,
  not as applet inputs.
Phase 2.5 of private ECR support for Nextflow builds.

When `dx build --nextflow --cache-docker` resolves and pulls container images
referenced by a pipeline, detect AWS ECR registry hostnames and run
`aws --profile ecr ecr get-login-password | docker login` before each pull.

- _extract_ecr_host_and_region: hostname parser matching the bash-side
  is_ecr_host helper. Commercial AWS partition only; GovCloud (us-gov-*)
  and China (.amazonaws.com.cn) are rejected — their STS/ECR endpoints
  haven't been validated.
- _ecr_docker_login: shells out to `aws` rather than adding boto3 as a
  dxpy dep. Cached per (host, region) for the build's lifetime to avoid
  STS rate-limit cascades on wide-fanout pipelines.
- ensure_ecr_login_for_image: public no-op-for-non-ECR helper called from
  both digest resolution (`docker manifest inspect`) and the actual
  `docker pull` in DockerImageRef._cache.

The [ecr] AWS profile is set up by the importer job entrypoint
(npi.sh's ecr_aws_login). If it isn't, the `aws` call fails and we log a
warning then continue — the subsequent `docker pull` will surface the
auth error to the user clearly.
Fixes from adversarial review of Phase 2 build-time ECR work.

- nextflow_utils._strip_groovy_comments: mask string literals before
  comment stripping. Previously a legal config value containing `//`
  (e.g. `'job://workspace_id'` for ecrJobTokenSubjectClaims, S3 URLs)
  had its closing quote eaten by the line-comment regex, silently
  dropping the field. Block comments now also preserve newlines so
  line-anchored regexes don't shift. Removed dead _extract_quoted_value.

- nextflow_builder: add _npi_input_names() and gate forwarding of new
  ECR/AWS input fields on the deployed NPI's input spec. The platform's
  applet/run rejects unknown inputs with InvalidInput, so forwarding
  unconditionally would break every build against an older NPI. Drop
  unrecognised fields with a clear warning so builds still succeed for
  non-ECR pipelines on older importers.

- dx.py: format-validate --ecr-region against the AWS region naming
  pattern so typos / shell metacharacters fail before launching a
  remote job.

- collect_images: reject us-iso-* and us-isob-* regions in addition to
  us-gov-* (their STS/ECR endpoints are not validated for this flow).
  Corrected the misleading token-lifetime comment that conflated the
  1h STS session with the 12h docker registry token. Mirror docker
  login via `sudo -n` so subsequent `sudo docker pull` in
  DockerImageRef._cache finds the auth in /root/.docker/config.json
  rather than falling back to anonymous access.

- ImageRef.DockerImageRef._cache: fail loud when ensure_ecr_login_for_image
  returns False on a confirmed-ECR image, with a message naming the host
  and pointing at the relevant nextflow.config keys + IAM permissions.
  Avoids the previous behaviour of letting `sudo docker pull` surface a
  generic "no basic auth credentials" error several layers down.
Fixes from pass 2 of the adversarial review.

- collect_images._ecr_docker_login: drop the per-image `--region` arg
  from `aws ecr get-login-password` and rely on the [ecr] profile's
  region setting. This makes the user's `--ecr-region` build-time
  override actually take effect — without it the per-image hostname
  region always won and the override was informational only. Per
  design Q3 a build authenticates to one ECR region; pipelines mixing
  regions must split into multiple `--cache-docker --ecr-region`
  builds.

- collect_images._ecr_docker_login: treat sudo docker login failure
  as fatal (return False) rather than logging a warning and continuing.
  If /root/.docker/config.json never gets the ECR token, the
  subsequent `sudo docker pull` in DockerImageRef._cache falls back to
  anonymous access and fails with a generic registry-auth error,
  defeating the fail-loud branch in _cache. Most likely cause is the
  importer image lacking NOPASSWD sudoers for docker — this surfaces
  it loudly instead of as a confusing pull failure.

- nextflow_builder build_pipeline_with_npi: warn on every dropped
  config field, not just ECR-specific. Previously a silent drop of
  `aws_region` (e.g. against an older NPI that does not declare it)
  surfaced inside the importer as an opaque "ECR is configured but
  no AWS region is available" error.
Pass 3 verdict was Ship It with only Low/Nit findings. Cleaning up the
two findings worth applying inline.

- dx.py: drop the local `import re as _re` inside the build validation
  block; `re` is already imported at module scope (line 22).
Adds 39 unit tests covering the new Phase 2 build-time ECR code paths.

test_nextflow_collect_images.py — appended:
- TestEcrHostExtraction (15 cases): commercial AWS partition matches,
  mixed-case lowering, public-registry rejection, GovCloud / Secret
  (us-iso, us-isob) / China (.cn) partition rejection, malformed
  hostname rejection, None/empty handling.
- TestEcrDockerLogin (4 cases): happy path verifies the exact command
  shape (`aws --profile ecr ecr get-login-password` with no --region,
  the docker login --password-stdin pipe, the sudo mirror); aws failure
  short-circuits before docker; sudo mirror failure returns False
  (matches the design — without root's docker config, sudo docker pull
  later falls back to anonymous and breaks); cache hit skips both
  subprocess calls on the second invocation.
- TestEnsureEcrLoginForImage (2 cases): no-op for non-ECR refs;
  delegates to _ecr_docker_login with the parsed (host, region) for
  ECR refs.

test_nextflow_config_parser.py — new file:
- TestParseDottedForm: all 7 keys forwarded; both single- and
  double-quoted values accepted.
- TestParseScopeBlockForm: dnanexus { ... } and aws { ... } block
  styles; dotted-form-wins-over-scope-block precedence.
- TestComments: regression test for the pass-1 bug — `'job://...'`
  URI value must not have its `//` eaten by line-comment stripping.
  Plus trailing line comment stripped, block comment stripped,
  fully-commented-out key ignored.
- TestParseEdgeCases: None / empty / missing-file / unrelated-config
  inputs all return {} cleanly.
- TestStripGroovyComments: direct unit tests on the comment stripper,
  asserting URI preservation in both quote styles, line comment
  removal, block comment removal, newline preservation across block
  comments (so downstream line-anchored regexes don't shift).

All 39 tests pass against the current implementation.
Adds 12 tests covering the bash fragment in DxBashLib.groovy.dxLib() that
runs at task-runtime on the worker. The bash is sliced out of the Groovy
source at test time (with Groovy escape resolution `\\` -> `\` etc.) and
sourced into a bash harness with `aws` and `docker` mocked via PATH so we
can assert on call arguments and exit codes without launching a real
DNAnexus job.

Coverage:
- is_ecr_host: commercial AWS partition accepted; GovCloud, China,
  public registries rejected.
- extract_ecr_host_from_image: ECR image -> host extracted; non-ECR
  image -> silent (empty); mixed-case input lowercased.
- ecr_region_from_host: region parsed correctly.
- _is_ecr_auth_error: recognises `denied`, `ExpiredToken`; does not
  flag generic network errors (avoids unnecessary STS relogin
  cascades on wide-fanout pipelines).
- ecr_docker_login: cache-file hit short-circuits without invoking
  `aws` or `docker` (verified via mock-call log).

Source location is auto-discovered via the dnanexus parent dir,
preferring nextaur-app worktrees over the main checkout (since the
Phase 1 ECR helpers are still on a feature branch). Falls back to
$DX_NEXTAUR_BASH_LIB env var override for CI. Tests skip cleanly when
no source is available.
Six bugs were found in a final cross-cutting review pass that earlier
adversarial passes missed.

BUG-1 (Bug): Runtime ECR fail-loud regression
- nextflow.sh head + task subjob: ecr_aws_login failure was logged-and-
  continued. Plan and NPI side both treat configured-but-failed ECR as
  fatal; runtime path silently kept the wrong behaviour. With ECR mis-
  configured every task would degrade into the docker-pull retry loop
  and fail with opaque registry errors — 3*retry*N tasks of cost.
- Fix: dx-jobutil-report-error + exit 1 when ecrRoleArnToAssume is set
  but ecr_aws_login returns non-zero. ecr_aws_login itself still
  returns 0 cleanly when ECR is not configured, so non-ECR pipelines
  are unaffected.

BUG-2 (Bug): Newline injection in nextflow.config parser
- Python's `[^']*` and `[^"]*` capture classes match newlines (the
  `(?m)` flag only affects `^` and `$`). A pipeline author could
  embed `\n[default]\naws_access_key_id = ATTACKER` inside aws.region
  and the value would round-trip through the parser into NPI's
  ~/.aws/credentials heredoc, injecting a rogue [default] profile.
- Fix: parse_nextflow_config_dx_fields now rejects any captured value
  containing CR or LF — same discipline as AwsUtils.shellSingleQuote
  on the runtime path.

BUG-3 (Medium): aws_region drop warning was generic
- When the user configures ECR but the deployed NPI does not declare
  aws_region, dropping it produced only the generic "may fall back to
  defaults or fail" warning — but ECR is GUARANTEED to fail without it.
- Fix: detect ECR intent + dropped aws_region and escalate to a
  specific "ECR is configured but no AWS region is available" warning
  pointing at the importer-version mismatch.

BUG-3 (refactor): extract _apply_npi_input_gate helper
- The forwarding-gate logic was inline in build_pipeline_with_npi and
  hard to test in isolation. Extracted to a pure helper so the warning-
  categorisation logic is unit-testable.

BUG-5 (Low): _ECR_LOGGED_IN_HOSTS lifetime contract
- Documented that the cache is correct only for the lifetime of a
  single dx-build process and added a reset_ecr_login_cache() helper
  for any future long-lived caller.

BUG-9: tests for the gaps
- test_nextflow_config_parser.py: 6 newline-rejection tests plus a
  "clean value still accepted" sanity check (BUG-2 regression).
- test_nextflow_collect_images.py: TestDockerImageRefEcrFailLoud
  with 2 cases (raises on ECR auth failure, proceeds on success)
  for the BUG-1 fail-loud branch in DockerImageRef._cache.
- test_nextflow_builder_npi_gate.py: 8 tests covering the extracted
  gate helper — describe-failure, all-accepted, ECR-cluster drop,
  aws_region escalation (BUG-3 regression), generic-warning fallback,
  mixed drop.

97/97 tests passing (was 81 before; added 16 new tests).
F10 (Low symmetry): runtime ecr_aws_login now rejects malformed awsRegion
- Mirrors the npi.sh region regex check and the dx.py CLI flag check.
- Defense-in-depth: AwsUtils.shellSingleQuote already strips \r\n before
  writing /.dx-aws.env, so direct INI injection is not reachable here,
  but a malformed region (e.g. shell metacharacters) would otherwise
  produce a confusing AWS API error far from the source.

F7 test parity: test_is_ecr_host_iso_rejected
- New regression test confirming the bash is_ecr_host helper rejects
  us-iso-* and us-isob-* partitions (matching the Python-side
  _extract_ecr_host_and_region). The corresponding bash-side fix lives
  in nextaur-app/.../DxBashLib.groovy.

97 -> 98 tests passing.
Fixes from the 8th and 9th adversarial review passes.

Pass-8 release blocker / bugs:
- nextflow.sh::ecr_aws_login: add post-write `aws_ecr sts
  get-caller-identity` probe so trust-policy / role-ARN errors fail-fast
  in the entrypoint instead of per-task at first `docker pull` (F3).
- nextflow_builder._apply_npi_input_gate: escalate from warn-and-continue
  to `raise DXError` when ECR was clearly requested but the deployed NPI
  cannot honour it (older NPI lacking ecr_* slots, dropped aws_region
  with ECR intent, undescribable NPI). Non-ECR pipelines still degrade
  to a warning so older-NPI builds keep working (F8).

Pass-9 release blockers / bugs:
- nextflow_builder.prepare_custom_inputs: restore APPS-3872 implementation
  that was accidentally reverted during a rebase. Re-establishes the
  description + help_text join, conditional `Default value:`, and the
  clean prefix logic. Verified against the existing schema tests (F9-1).
- Pre-flight validation: new `preflight_validate_for_cache_docker(...)`
  helper, called from dx_build_app.py BEFORE the local pipeline source
  is uploaded to the user's project. Without this, any DXError raised by
  the in-build gate would leave an orphaned `.nf_source/<basename>/`
  upload behind, forcing the user to `dx rm -r` before retrying (F9-2).
- Pre-flight tightening: with `--cache-docker`, raise unconditionally on
  NPI describe failure (don't depend on locally-detectable ECR intent).
  Closes the leak where a parser miss + describe failure produced a
  silent build (F9-4).
- `--repository <url>` mode: pre-flight describes the deployed NPI and
  refuses to launch when the importer lacks ECR input slots, so a remote
  repo using `dnanexus.ecrRoleArnToAssume` cannot silently bypass ECR
  auth (F9-3).

Lows:
- collect_images._extract_ecr_host_and_region: strip `docker://` /
  `oras://` URI scheme prefixes for symmetry with the Groovy bash side
  (F9-9). Defensive — today's only caller already strips.
- collect_images._resolve_digest: comment the best-effort vs fail-loud
  duplication of `ensure_ecr_login_for_image` (F9-6).
- nextflow_utils._strip_groovy_comments: docstring lists what's NOT
  handled (triple-quoted strings, slashy strings, includeConfig chains)
  plus the `_accept` newline backstop (F9-7).
- nextflow.sh::ecr_aws_login: comment the STS rate-limit trade-off of
  the per-task probe (F9-8).
- _apply_npi_input_gate: docstring updated to enumerate raise vs warn
  paths after the F8 fail-fast escalation (F9-5).

Tests: 99 -> 102. New TestPreflightValidateForCacheDocker covers F9-2/3/4
paths (describe-failure raises unconditionally on --cache-docker;
--repository mode against an older NPI raises; --repository mode against
a complete NPI passes).
…images

When OIDC-based ECR auth is configured (dnanexus.ecrRoleArnToAssume),
Nextaur must contact ECR at runtime to resolve the digest of latest/no-tag
images before each cache lookup -- so runtime workers need OIDC credentials
even for fully pre-cached pipelines.

collect_images.py:
- _is_floating_ecr_tag(): detects ECR images with latest or no tag
- collect_docker_images(): when DX_ECR_ROLE_ARN_TO_ASSUME is set and a
  floating-tag ECR image is found, raise ImageRefFactoryError (default) or
  log a warning if DX_ECR_ALLOW_LATEST is set

nextflow_builder.py: add ecr_allow_latest to _ECR_SPECIFIC_INPUTS and
build_pipeline_with_npi() signature; forward to NPI when True

dx.py: add --ecr-allow-latest CLI flag with validation (requires
--cache-docker); set DX_ECR_ALLOW_LATEST env var before NPI run

dx_build_app.py: thread ecr_allow_latest=args.ecr_allow_latest through to
build_pipeline_with_npi()
…ating-tag ECR

The escape hatch was misleading: even with --ecr-allow-latest the pre-cached
image is silently bypassed every run. Nextaur's head job calls
`docker manifest inspect` to resolve the digest of latest/no-tag images
before the cache lookup, but the head job Docker daemon is never logged into
ECR (only the AWS CLI [ecr] profile is set up). The inspect fails, digest
stays null, cache lookup finds nothing, and the build is dead weight.

collect_images.py: replace the conditional error + DX_ECR_ALLOW_LATEST
warning branch with a single unconditional ImageRefFactoryError that explains
the root cause and directs the user to pin the image tag or digest.

nextflow_builder.py: remove ecr_allow_latest from _ECR_SPECIFIC_INPUTS and
build_pipeline_with_npi() signature.

dx.py: remove --ecr-allow-latest flag and its --cache-docker validation.

dx_build_app.py: remove ecr_allow_latest= kwarg from build_pipeline_with_npi call.
_npi_input_names() + _apply_npi_input_gate() are transition-period scaffolding.
Once the NPI version with ecr_role_arn_to_assume/ecr_job_token_audience/
ecr_job_token_subject_claims/ecr_region_override in its input spec is the minimum
deployed version, this entire gate (including the describe() API call on every
dx build --nextflow) can be deleted and replaced with unconditional forwarding.
@bomquangia bomquangia force-pushed the APPS-3915-private-ecr-nextflow branch from eaaa3db to 022f595 Compare May 11, 2026 11:18
test_nextflow_builder_npi_gate.py tests only the transition-period gate;
add a matching TODO so it gets deleted together with the production code
once the new NPI is the minimum deployed version.
… image hostname

ECR tokens are region-scoped — the token must come from the same region as the
registry being logged into. The region is already unambiguously encoded in every
ECR image hostname (e.g. 123456789.dkr.ecr.us-east-1.amazonaws.com -> us-east-1),
so deriving it from the hostname is both correct and sufficient.

Changes:
- collect_images._ecr_docker_login: restore --region <region> to the aws ecr
  get-login-password call, derived from the image hostname via
  _extract_ecr_host_and_region. Drops the old reliance on the [ecr] profile's
  region setting. Also means pipelines with images in multiple ECR regions work
  correctly without any extra flags (cache key is still (host, region)).
- nextflow_builder: remove ecr_region_override from _ECR_SPECIFIC_INPUTS;
  remove ecr_region param from preflight_validate_for_cache_docker() and
  build_pipeline_with_npi(); update warning messages and docstrings.
- dx.py: remove --ecr-region argument definition and all validation logic.
- dx_build_app.py: remove ecr_region=args.ecr_region from both call sites.
- test_nextflow_collect_images: update assertion to expect --region in aws call.
- test_nextflow_builder_npi_gate: remove ecr_region_override from test fixtures
  and preflight call signatures.
aws.region in nextflow.config was forwarded to NPI as aws_region so
ecr_login.py could set the STS endpoint region and write it into the
[ecr] AWS profile. Both uses are now gone:

- STS region: boto3.client('sts') without region_name resolves the
  endpoint from EC2 IMDS. STS is a global service — any regional endpoint
  issues credentials valid for ECR calls in any region, including regions
  different from where the NPI EC2 instance runs.

- Profile region: collect_images._ecr_docker_login now passes --region
  derived from each image hostname directly to aws ecr get-login-password,
  so the [ecr] profile's region setting is never consulted.

Changes:
- nextflow_utils: remove ('aws.region', 'aws_region') from _NEXTFLOW_DX_CONFIG_KEYS.
- nextflow_builder: remove aws_region_dropped check and the ECR+region
  DXError from _apply_npi_input_gate; update docstrings and warning messages.
- test_nextflow_config_parser: remove aws_region parse assertions; convert
  test_aws_block to assert aws_region is NOT in result; remove two
  aws_region newline-injection tests (key no longer parsed).
- test_nextflow_builder_npi_gate: remove test_aws_region_drop_with_ecr_intent_raises
  and test_aws_region_drop_without_ecr_intent_uses_generic_warning; update
  fixtures that used aws_region.
Six targeted fixes from the Opus review pass:

- nextflow.sh (_fetch_jit_to_file): add a line-count guard before the JWT
  regex check. A multi-line file (proxy error page, truncated write) could
  previously pass the grep if any single line matched the base64url pattern.
  Now rejects any file with more than one newline before checking the shape.
  (B6)

- nextflow_utils.py (dotted-form regex): allow an optional trailing `;`
  before end-of-line. Nextflow config files sometimes use semicolons as
  statement terminators; without this the parser silently skipped those
  values. (B11)

Corresponding doc fixes in nextaur-app/doc/:
- ECR_Private_Registry.md (Multi-Region): runtime ECR pulls also derive
  region from the image hostname (DxBashLib.ecr_docker_login calls
  ecr_region_from_host and passes --region explicitly). Note that aws.region
  is still required at runtime for the [ecr] profile — the per-task --region
  arg overrides it for actual docker login calls. (D1)

- APPS-3915_changes_overview.md: correct the ecr_aws_login description —
  failure IS fatal (exit 1) when ecrRoleArnToAssume is set. (D2)

- Both doc files: floating-tag guard description now explicitly qualifies
  that the check fires only when ECR OIDC is configured
  (DX_ECR_ROLE_ARN_TO_ASSUME is set). (D4)

- file-apps ecr_login.py: include dx-jobutil-get-identity-token stderr in
  the failure message so operators can see the API error without inspecting
  the raw job log. (S5)
@bomquangia bomquangia marked this pull request as ready for review May 11, 2026 12:48
… guard

The regex '^[A-Za-z0-9_-]+...' assumed base64url encoding but the DNAnexus
JIT token uses standard base64 (+ / =) in the signature, causing every
token to be rejected. Remove the fragile format check — STS validates the
token on AssumeRoleWithWebIdentity and produces a clear error on rejection.
Retain the wc -l > 1 guard which catches the actual threat (proxy error
pages) and add an explicit error message when it fires.
…ME override

The deployed NPI (v0.10.1) does not declare ecr_* input slots, so
_apply_npi_input_gate was blocking every dx build --nextflow for ECR
pipelines. The runtime ECR auth path does not need those NPI inputs —
DxOptions.groovy reads ecrRoleArnToAssume directly from the bundled
nextflow.config at orchestrator-job time. Downgrade the DXError to a
warning so runtime ECR tests can proceed.

The --cache-docker preflight check (preflight_validate_for_cache_docker)
is unchanged: the NPI genuinely must pull from ECR during the build job,
so that path correctly requires an updated importer.

Add DX_NPI_NAME env-var override to get_importer_name() so QE tests can
point at a custom importer (one that declares ECR input slots) without
publishing a new version to production. Cache-docker ECR tests will set
DX_NPI_NAME=<test-npi-name> once a test importer is deployed in vdev.
…rge)

Add get_importer_object() to nextflow_utils: returns DXApplet when
DX_NPI_NAME starts with 'applet-', DXApp otherwise. Replace all three
DXApp(name=get_importer_name()) call sites in nextflow_builder.py.

This allows QE to set DX_NPI_NAME=applet-xxxx for cache-docker ECR
testing before a proper app version of the NPI is published.
Revert this commit (keep only get_importer_name changes) once the
test NPI is published as a real app.
bomquangia added 24 commits May 18, 2026 22:20
DXApp.run() takes app_input= but DXApplet.run() takes applet_input=
as a positional argument. When DX_NPI_NAME points to an applet-ID,
get_importer_object() returns a DXApplet and the wrong keyword caused
TypeError. Branch on isinstance(importer, dxpy.DXApplet) to pick the
correct kwarg name.
Remove ecr_aws_login() call from the head job main() entrypoint. The head
job never pulls container images so eager ECR credential validation there
mis-attributes failures and blocks pipelines that have ECR config set but no
ECR images in a given run. Each task worker already calls ecr_aws_login() in
nf_task_entry() independently.

Update stale comments in ecr_aws_login() to reflect it is now called only
from task workers, not the head job.

Update nextaur_assets_25_10.staging.json to point aws:us-east-1 at the
newly built NPI asset (record-J85x02j0kp5pgbZBgb9Q9JV2) which includes
all APPS-3915 ECR changes.

Bump toolkit_version suffix to 0.409.0.ecr1 for log traceability.
record-J85x02j0kp5pgbZBgb9Q9JV2 was created in a temporary test project
that was subsequently cleaned up, making it inaccessible. All runtime QE
tests were failing with "Cannot find latest version of nextaur plugin"
because the applets could not fetch the nextaur plugin bundle at startup.

Revert to record-J74P910042Jbfy169vg31z9j (the stable pre-APPS-3915
release record) so non-ECR tests that install dxpy directly from this
branch continue to work. ECR-specific QE tests use build-nextaur=true
which creates a fresh nextaur asset with ECR support at test time via
update_and_install.sh, overwriting this committed fallback.
…input)

The local preflight and NPI-side floating-tag guard now both read ECR
intent directly from dnanexus { ecrRoleArnToAssume } in nextflow.config,
so no explicit NPI input field is required. This removes the dependency
on a custom NPI for the floating-tag rejection path.

Changes:
- collect_images.py: add scan_ecr_floating_tags_in_config(src_dir, profile)
  that brace-counts profile blocks to identify floating ECR tags without
  running `nextflow inspect`; lazy-imports parse_nextflow_config_dx_fields
  to avoid circular import; _is_floating_ecr_tag helper detects :latest
  and tag-less ECR refs
- nextflow_builder.py: preflight_validate_for_cache_docker gains a profile
  param and fires the floating-tag guard after the NPI gate using the new
  scanner; raises DXError before .nf_source/ upload so no orphaned data
- dx_build_app.py: pass profile=args.profile to preflight_validate_for_cache_docker
- test_nextflow_builder_npi_gate.py: add 13 unit tests covering
  TestScanEcrFloatingTagsInConfig (8 cases) and TestPreflightFloatingTagGuard
  (5 cases); all 27 tests pass
…e ECR

Previously, build-time ECR credentials were read from nextflow.config
(ecrRoleArnToAssume), which caused them to be bundled into the resulting
applet permanently. Every runtime execution would then attempt an OIDC/STS
exchange even when all images were safely cached in DNAnexus storage —
breaking the applet if the private registry later revoked access.

This commit implements Option 2: build-time ECR credentials are passed
exclusively via explicit CLI flags and are never read from nextflow.config.

Design principle (now documented in ECR_Private_Registry.md):
  --ecr-role-arn / --ecr-job-token-audience / --ecr-job-token-subject-claims
    Build-time only. Forwarded as explicit NPI inputs. Never bundled into
    the applet. Runtime has zero ECR dependency after images are cached.

  nextflow.config: dnanexus.ecrRoleArnToAssume
    Runtime only. Bundled into the applet permanently. Drives task-worker
    ECR auth on every run. Irrelevant to --cache-docker builds.

Changes:
- dx.py: add --ecr-role-arn, --ecr-job-token-audience,
  --ecr-job-token-subject-claims args (only valid with --cache-docker);
  validation rejects them without --cache-docker
- dx_build_app.py: thread ecr_role_arn through preflight and build_pipeline_with_npi
- nextflow_builder.py:
  - _apply_npi_input_gate: silently skip ECR-specific config fields instead
    of forwarding them; warning text fixed to not claim runtime ECR works
  - preflight_validate_for_cache_docker: takes ecr_role_arn param; ECR slot
    check and floating-tag guard fire on the flag, not on config content
  - build_pipeline_with_npi: takes ecr_role_arn/audience/subject_claims;
    forwards them directly to NPI input_hash bypassing config-forwarding
- collect_images.py: NPI-side floating-tag guard reads ecr_role_arn_to_assume
  from env var (the NPI input exported by npi.sh), not from nextflow.config
- test_nextflow_builder_npi_gate.py: 32 tests updated/added to reflect new
  design; key new tests prove ECR config in nextflow.config does not trigger
  preflight checks when --ecr-role-arn flag is absent
When a .nf_source/<basename>/ folder already exists in the destination
project and --ensure-upload is passed, reuse the existing upload instead
of raising an AppBuilderException. This mirrors how --ensure-upload works
for the local-build path and enables CI pipelines to persist SOURCE_PROJECT
across GHA runs without needing to clean up between retries.

Without this fix, the second run of any test that uses --ensure-upload with
--cache-docker would fail with "Folder .nf_source/<name> exists... Remove
the directory to avoid file duplication and retry".
…ir gate from preflight

Two changes to preflight_validate_for_cache_docker:

1. Reorder: floating-tag guard fires BEFORE the ECR slot check when
   --ecr-role-arn is passed. Previously, the slot check ran first, which
   blocked test_cache_docker_floating_tag_rejected on environments where the
   NPI has not yet been upgraded to declare ECR inputs. User config errors
   (floating tags) should be surfaced before infrastructure gaps.

2. Remove _apply_npi_input_gate call from preflight. Workdir field forwarding
   (iam_role_arn_to_assume etc.) is irrelevant to the preflight's job of
   catching errors before .nf_source/ upload. The gate runs again in
   build_pipeline_with_npi where it actually populates the NPI input_hash.
   Removing it also eliminates the redundant _npi_input_names() describe call
   that was only needed to gate workdir fields.

Add unit test TestPreflightFloatingTagGuard::
  test_floating_tag_raises_before_ecr_slot_check_when_npi_lacks_ecr_inputs
to pin the ordering guarantee.
…e_nextflow_config_dx_fields, _strip_groovy_comments
The fail-loud message pointed users at dnanexus.ecrRoleArnToAssume in
nextflow.config, but that drives runtime ECR auth. Build-time --cache-docker
pulls authenticate via the --ecr-role-arn / --ecr-job-token-audience flags;
point the message there instead.
Drop preflight_validate_for_cache_docker and the local nextflow.config
floating-tag scanner. The scanner was a best-effort regex that missed the
common process { container = ... } block form, and NPI's collect_docker_images
already rejects floating-tag ECR images authoritatively on the real evaluated
config. Relying on NPI removes the duplicated, incomplete client-side check.

Also drops the now-dead _npi_input_names / ECR slot gate, the unused src_dir
parameter, the reset_ecr_login_cache helper, and the npi_gate test module.
Replace the `aws ecr get-login-password` subprocess with a boto3 call
(_get_ecr_password) that reads the [ecr] AWS profile directly. boto3 is
already available on the NPI worker alongside dxpy, so this drops the
dependency on the aws CLI binary being present. Update the collect_images
ECR tests to mock _get_ecr_password accordingly.
Keep the DX_NPI_NAME override (the nextaur nf-qe workflow sets it via the
npi-name input to point cache-docker ECR tests at a custom NPI), but drop
the verbose docstrings and the run_input_kwarg branching: DXApp.run() and
DXApplet.run() both take the input hash as the first positional argument,
so get_importer_object().run(input_hash, ...) covers app and applet alike.
A task that uses a pre-cached image (loaded from DNAnexus storage) needs no
ECR access at all, so a fatal ecr_aws_login at task entry wrongly killed
--cache-docker pipelines whose ECR access was later revoked or misconfigured.
Make ecr_aws_login non-fatal: log a warning and continue. ECR auth is still
enforced where it is actually needed — the docker-pull step, where nextaur's
nxf_docker_pull returns the non-retriable exit code 2, so a task that genuinely
needs ECR still fails fast.
…rker

Record private AWS ECR registry support for Nextflow under the Unreleased
"Added" section, and restore toolkit_version.py to 0.409.0 (the .ecr1 suffix
was a non-PEP440 dev marker that would break packaging).
…port

refresh_web_identity_token_loop: keep the workdir token refresh identical to
master (cadence, retry count, logging) and add the ECR token refresh as a
separate parallel block — the previous rewrite (_refresh_token_file helper,
consecutive-failure counters, 240->180s cadence) altered the unrelated workdir
path. nextflow_utils.get_importer_name: import environ at module scope instead
of inside the function.
The branch had repurposed --ensure-upload, in dx build --nextflow, to mean
'reuse the already-uploaded .nf_source folder, skip the upload'. That inverts
the flag's documented meaning ('upload unconditionally') and is a silent
stale-build footgun — a user who edits their pipeline and rebuilds with
--ensure-upload would get the old source. Restore the unconditional
folder-exists error; the QE need is handled QE-side by building into a unique
per-build folder.
…lure is fatal

The head job is not just a passive actor: it runs 'docker manifest inspect'
(DockerImage.implicitLatestGetDigest) to resolve the digest of floating-tag
images for the cache lookup, which needs ECR auth against a private registry.
Nextaur runs that command with allowed exit codes {0}, so the auth failure
throws and the run errors out on the head — it is NOT a silent cache bypass as
the comment and guard message claimed. Fix the nextflow.sh head-job comment and
the floating-tag guard error message accordingly.
…tinel parity, doc re-sync to Option 1

* nextflow.sh refresh_web_identity_token_loop: route both workdir and ECR
  refresh writes through _fetch_jit_to_file. The previous body wrote the tmp
  token file under the default umask (0644) and 'mv -f' preserved that mode,
  silently re-widening the JIT file from 0600 to 0644 every 240s. Going through
  _fetch_jit_to_file gives us (umask 077), the new explicit chmod 600 (covers
  the pre-existing-target case where the redirect would inherit 0644), and the
  multi-line / proxy-interception JWT-shape check for free.
* _fetch_jit_to_file: add chmod 600 after the umask-wrapped write — umask only
  sets the mode for newly-created files; a pre-existing 0644 leftover would
  otherwise stay 0644.
* collect_images.py: floating-tag guard message updated to match Option 1
  reality (non-fatal head-side manifest-inspect failure -> silent cache miss ->
  dead-weight cache); was still claiming 'run errors out on the head'.
* collect_images.py: null-sentinel parity with ecr_login.py — treat None / ''
  / literal 'null' identically as 'ECR not configured'.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant