Skip to content

Consolidation Of external dependencies#98

Open
aruokhai wants to merge 4 commits intomasterfrom
consolidate_fiels
Open

Consolidation Of external dependencies#98
aruokhai wants to merge 4 commits intomasterfrom
consolidate_fiels

Conversation

@aruokhai
Copy link
Copy Markdown
Collaborator

No description provided.

…latten scaffolding

Consolidates the four systemd units (enclave-watchdog, gvproxy,
enclave-imds-proxy, supervisor) into a single enclave-supervisor.service
that runs gvproxy, the IMDS vsock forwarder, and the lifecycle watchdog
as in-process Go libraries. The systemd unit is now inlined into
user_data.sh.tftpl rather than shipped as a standalone file.

Vendors github.com/brave/nitriding-daemon@v1.4.2 into runtime/nitriding/
and github.com/brave/viproxy@v0.1.2 into runtime/viproxy/. The runtime
binary is now the EIF entrypoint — start.sh is gone.

Watchdog owns the enclave in both native and test modes: nitro-cli on
real Nitro hosts, ENCLAVE_START_CMD/ENCLAVE_STOP_CMD overrides in the
QEMU test harness. lifecycleStop/Start latch the stopped flag in both
paths to prevent the poll loop from racing migrations. deadThreshold
bumped from 30s to 60s to comfortably cover QEMU + nitriding-TLS cold
boot.

Build artifacts moved from enclave/artifacts/ to .enclave/artifacts/
with flake.{nix,lock} kept in enclave/. test/boot-qemu.sh merged into
test/run.sh as boot_qemu() + --boot-only dispatch.

Also drops: dokploy scaffolding, CLI test matrix (test/cli/*-app/ and
.github/workflows/cli.yml), test/vsock-proxy.py (replaced by in-process
IMDS forwarder).
… tofu tree

Splits deployment scaffolding off from `enclave init`. The init command now
emits only build-time files (enclave/enclave.yaml, enclave/flake.nix, CI
workflow templates). A new `enclave tofu` subcommand emits the OpenTofu
deployment tree, with merge-only-new semantics so re-runs preserve
user-edited resources. `enclave tofu` also writes terraform.tfvars.json
from enclave.yaml — that responsibility moved out of `enclave build`,
which now only produces EIF + supervisor artifacts.

Consolidates the scaffolded tofu tree from 17 files to 5 — one main.tf
per module with section banners, plus the user_data.sh.tftpl template.
Files merged: root main.tf + variables.tf + outputs.tf → tofu/main.tf;
8 enclave-module .tf files (kms, iam, s3, ssm, vpc, ec2, vars, outputs)
→ modules/enclave/main.tf; backend bootstrap 3 files →
modules/backend/main.tf. OpenTofu concatenates all .tf in a directory
before parsing, so this is a pure refactor with no behavioral change.

Drops the now-redundant `enclave tfvars` command (subset of `enclave
tofu`) and the obsolete tofu/migrate-state.sh helper (one-shot
flat-to-module migration that's already done for any live deployment).

Also flatten the scaffolded tofu/ tree to live at the repo root rather
than under enclave/tofu/, and update CI workflow templates + scaffolded
deploy-enclave.yml to call `enclave tofu` between `enclave build` and
`tofu apply`. Fixes stale `go install …/cmd/enclave@latest` references
(package now at cli/cmd/enclave).

Verified: integration test (`make test-run`) passes 55/55 against the
consolidated layout, including migration, rollback, and supervisor
relauncher resilience.
Copy link
Copy Markdown

@arkanaai arkanaai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review — Consolidation of External Dependencies

Thorough review of +7,912 / −4,688 lines. The consolidation approach (vendoring nitriding v1.4.2 + viproxy, replacing shell scripts + systemd units with Go binaries) is architecturally sound. The supervisor errgroup coordination is well-designed. However, there are several issues to resolve before merge.


🔴 BUG — PCR path broken after cd tofu

cli/framework_files.go (deploy workflow template) and test/app/.github/workflows/deploy-enclave.yml

The deploy step does:

enclave tofu
cd tofu
tofu init
tofu apply -auto-approve
pcr0=$(jq -r '.PCR0' .enclave/artifacts/pcr.json)

After cd tofu, the relative path .enclave/artifacts/pcr.json resolves to tofu/.enclave/artifacts/pcr.json — which doesn't exist. PCR values will be null, breaking the verification step and the deployment manifest.

Fix: Use ../.enclave/artifacts/pcr.json after the cd tofu, or read PCR values before changing directory.

This affects both the CLI-generated template (used by every enclave init consumer) and the test app workflow.


🟡 MEDIUM — readyHandler panics on double-call

runtime/nitriding/handlers.goreadyHandler()

func readyHandler(e *Enclave) http.HandlerFunc {
    return func(w http.ResponseWriter, r *http.Request) {
        close(e.ready)   // panics if called twice
    }
}

close() on an already-closed channel panics. This is internal-only (bound to 127.0.0.1), but a misbehaving enclave application calling GET /enclave/ready twice will crash the entire nitriding runtime. Wrap in sync.Once.

I understand this is upstream nitriding code, but since you're now owning this fork, it's worth fixing.


🟡 MEDIUM — InsecureSkipVerify: true in watchdog health probe (production path)

supervisor/watchdog.goisRunning() fallback

client := &http.Client{
    Transport: &http.Transport{
        TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
    },
}
resp, err := client.Get(url + "/health")

When nitro-cli is not on PATH (and no ENCLAVE_START_CMD override), the watchdog falls through to this TLS health probe with cert verification disabled. This path is reachable in production if nitro-cli is missing from PATH for any reason. The enclave generates self-signed certs so verification won't work out of the box, but InsecureSkipVerify means any MITM on the vsock/loopback can impersonate the health endpoint.

Suggestion: Pin the expected self-signed cert fingerprint, or at minimum document why this is acceptable and add a log warning when this path is taken.


🟡 MEDIUM — Nonce cache goroutine DoS on public endpoint

runtime/nitriding/cache.goAdd()

Each nonce request spawns a goroutine that sleeps for TTL (1 min). The nonce endpoint (GET /enclave/nonce) is on the public-facing HTTPS server. An attacker can spawn unbounded goroutines by requesting nonces at high rate.

Upstream acknowledges this in the code comments. Since you're vendoring this, consider adding rate-limiting middleware on the nonce endpoint, or replacing the goroutine-per-entry design with a periodic sweep.


🟡 MEDIUM — Context leak in ACME cert retrieval

runtime/nitriding/enclave.gosetupAcme()

go func() {
    for {
        ctx, cancel := context.WithTimeout(context.Background(), 5*time.Minute)
        defer cancel()    // <-- deferred until goroutine exits, not loop iteration
        rawData, err = cache.Get(ctx, e.cfg.FQDN)

defer cancel() inside a for-loop defers until the goroutine returns, not per iteration. Each iteration leaks a context/timer. Move cancel() to an explicit call before time.Sleep.


🟠 LOW — Shell injection surface via env vars

supervisor/watchdog.gostartEnclave() / terminateEnclave()
supervisor/migrate.golifecycleStop() / lifecycleStart()

if cmd := os.Getenv("ENCLAVE_START_CMD"); cmd != "" {
    out, err := exec.CommandContext(ctx, "sh", "-c", cmd).CombinedOutput()

ENCLAVE_START_CMD and ENCLAVE_STOP_CMD are passed to sh -c. These are documented as test harness overrides, but the code path exists in the production binary. If an attacker can write to the process environment (SSM parameter injection, container escape), they get arbitrary command execution.

Suggestion: Gate behind a --test-mode flag or build tag rather than silently checking env vars.


🟠 LOW — S3 assets bucket lacks SSL enforcement

cli/tofu_files.goaws_s3_bucket "assets" resource

The storage bucket correctly enforces TLS via aws_s3_bucket_policy with "aws:SecureTransport" = "false", but the assets bucket (storing EIF images and supervisor binaries) has no such policy. An IAM-authenticated actor could download artifacts over HTTP.


🟠 LOW — hashHandler missing return after error

runtime/nitriding/handlers.gohashHandler()

After the non-errTooMuchToRead error from io.ReadAll, the handler logs the error but does not return — execution falls through to base64 decode of partial data. Upstream bug, low impact (decode would fail), but sloppy for attestation-critical code.


ℹ️ Downstream Impact

  • enclave-skeleton repo still has the old-style systemd units + shell scripts. Needs updating after this merges or it will confuse users.
  • 21 open PRs on this repo, several (especially #53 and dependabot go.mod PRs #88-97) will conflict with this change.
  • No Go import consumers found across the org — this is consumed as a CLI tool, not a library.

✅ What looks good

  • Vendored nitriding is faithful to upstream v1.4.2 with minimal, well-documented deviations (documented in UPSTREAM.md)
  • setters.go extension is well-scoped — direct in-process call replacing HTTP round-trip
  • Supervisor errgroup coordination is correct (gvproxy → ready signal → watchdog → enclave)
  • viproxy vendoring is clean
  • No hardcoded secrets or credentials found
  • GitHub Actions use OIDC for AWS auth, no static keys
  • Test coverage exists for nitriding components (keysync, handlers, attestation, cache, certcache)

Verdict: Request changes for the PCR path bug (blocking — it breaks every deployment). The readyHandler panic and InsecureSkipVerify in production are worth fixing before merge. The rest are improvements that can be tracked as follow-ups.

⚠️ This PR touches attestation and key synchronization paths. Requesting human review per protocol-critical code policy.

🤖 Reviewed by Arkana

Copy link
Copy Markdown

@arkanaai arkanaai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Follow-up Review — Commit 3ebba33f (resource cleanup + error handling)

New commit reviewed. The changes here are minor cleanup — all look correct. Notes below.


✅ LGTM — Lint fixes in cli/init.go and supervisor/imds_proxy.go

defer f.Close()defer func() { _ = f.Close() }() — suppresses errcheck lint warning for deferred close on read-only or best-effort paths. Fine.

✅ LGTM — supervisor/watchdog.go:108-110 — terminateEnclave error logging

Previously silently discarded the error from terminateEnclave() on context cancellation. Now logs it at slog.Warn. Correct — shutdown errors shouldn't be fatal but shouldn't be silent either.

✅ LGTM — runtime/nitriding/handlers_test.go — Skip TestProxyHandler

The skip comment clearly explains why: pubSrv.Addr is unpopulated in library mode, the test tries to bind :443, and this code path isn't exercised in the actual runtime integration. Reasonable.

✅ LGTM — .github/workflows/integration-test.ymlAPP_LOCAL_PATH + go mod vendor

Setting APP_LOCAL_PATH=$GITHUB_WORKSPACE/test/app so the Nix flake builds the test app from the workspace tree instead of fetching a pinned revision. Avoids needing to re-pin nix_rev/nix_hash on every test app change. The additional go mod vendor in test/app/ is needed since vendor/ is gitignored and Nix requires it when building from local source. Makes sense.


🔴 Outstanding issues from previous review — still unresolved

The following issues flagged in my initial review remain unfixed in this commit. These must be addressed before merge:

  1. PCR path broken after cd tofu (cli/framework_files.go:635,894, test/app/.github/workflows/deploy-enclave.yml:47) — .enclave/artifacts/pcr.json is read after cd tofu, resolving to the wrong directory. PCR values will be null.

  2. readyHandler panics on double-call (runtime/nitriding/handlers.go:175) — close(e.ready) without sync.Once protection. Second call crashes the runtime.

  3. InsecureSkipVerify: true in production watchdog path (supervisor/watchdog.go) — health probe disables TLS verification.

  4. Nonce cache unbounded goroutine spawn (runtime/nitriding/cache.go) — public endpoint, no rate limiting.

  5. Context leak in ACME setupAcme() loop (runtime/nitriding/enclave.go) — defer cancel() inside for-loop never fires until goroutine exits.

Items 1 and 2 are the most critical. Item 1 breaks every deployment that uses the scaffolded workflow. Item 2 can crash the enclave runtime.

🤖 Reviewed by Arkana

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant