Conversation
…latten scaffolding
Consolidates the four systemd units (enclave-watchdog, gvproxy,
enclave-imds-proxy, supervisor) into a single enclave-supervisor.service
that runs gvproxy, the IMDS vsock forwarder, and the lifecycle watchdog
as in-process Go libraries. The systemd unit is now inlined into
user_data.sh.tftpl rather than shipped as a standalone file.
Vendors github.com/brave/nitriding-daemon@v1.4.2 into runtime/nitriding/
and github.com/brave/viproxy@v0.1.2 into runtime/viproxy/. The runtime
binary is now the EIF entrypoint — start.sh is gone.
Watchdog owns the enclave in both native and test modes: nitro-cli on
real Nitro hosts, ENCLAVE_START_CMD/ENCLAVE_STOP_CMD overrides in the
QEMU test harness. lifecycleStop/Start latch the stopped flag in both
paths to prevent the poll loop from racing migrations. deadThreshold
bumped from 30s to 60s to comfortably cover QEMU + nitriding-TLS cold
boot.
Build artifacts moved from enclave/artifacts/ to .enclave/artifacts/
with flake.{nix,lock} kept in enclave/. test/boot-qemu.sh merged into
test/run.sh as boot_qemu() + --boot-only dispatch.
Also drops: dokploy scaffolding, CLI test matrix (test/cli/*-app/ and
.github/workflows/cli.yml), test/vsock-proxy.py (replaced by in-process
IMDS forwarder).
… tofu tree Splits deployment scaffolding off from `enclave init`. The init command now emits only build-time files (enclave/enclave.yaml, enclave/flake.nix, CI workflow templates). A new `enclave tofu` subcommand emits the OpenTofu deployment tree, with merge-only-new semantics so re-runs preserve user-edited resources. `enclave tofu` also writes terraform.tfvars.json from enclave.yaml — that responsibility moved out of `enclave build`, which now only produces EIF + supervisor artifacts. Consolidates the scaffolded tofu tree from 17 files to 5 — one main.tf per module with section banners, plus the user_data.sh.tftpl template. Files merged: root main.tf + variables.tf + outputs.tf → tofu/main.tf; 8 enclave-module .tf files (kms, iam, s3, ssm, vpc, ec2, vars, outputs) → modules/enclave/main.tf; backend bootstrap 3 files → modules/backend/main.tf. OpenTofu concatenates all .tf in a directory before parsing, so this is a pure refactor with no behavioral change. Drops the now-redundant `enclave tfvars` command (subset of `enclave tofu`) and the obsolete tofu/migrate-state.sh helper (one-shot flat-to-module migration that's already done for any live deployment). Also flatten the scaffolded tofu/ tree to live at the repo root rather than under enclave/tofu/, and update CI workflow templates + scaffolded deploy-enclave.yml to call `enclave tofu` between `enclave build` and `tofu apply`. Fixes stale `go install …/cmd/enclave@latest` references (package now at cli/cmd/enclave). Verified: integration test (`make test-run`) passes 55/55 against the consolidated layout, including migration, rollback, and supervisor relauncher resilience.
There was a problem hiding this comment.
Code Review — Consolidation of External Dependencies
Thorough review of +7,912 / −4,688 lines. The consolidation approach (vendoring nitriding v1.4.2 + viproxy, replacing shell scripts + systemd units with Go binaries) is architecturally sound. The supervisor errgroup coordination is well-designed. However, there are several issues to resolve before merge.
🔴 BUG — PCR path broken after cd tofu
cli/framework_files.go (deploy workflow template) and test/app/.github/workflows/deploy-enclave.yml
The deploy step does:
enclave tofu
cd tofu
tofu init
tofu apply -auto-approve
pcr0=$(jq -r '.PCR0' .enclave/artifacts/pcr.json)After cd tofu, the relative path .enclave/artifacts/pcr.json resolves to tofu/.enclave/artifacts/pcr.json — which doesn't exist. PCR values will be null, breaking the verification step and the deployment manifest.
Fix: Use ../.enclave/artifacts/pcr.json after the cd tofu, or read PCR values before changing directory.
This affects both the CLI-generated template (used by every enclave init consumer) and the test app workflow.
🟡 MEDIUM — readyHandler panics on double-call
runtime/nitriding/handlers.go — readyHandler()
func readyHandler(e *Enclave) http.HandlerFunc {
return func(w http.ResponseWriter, r *http.Request) {
close(e.ready) // panics if called twice
}
}close() on an already-closed channel panics. This is internal-only (bound to 127.0.0.1), but a misbehaving enclave application calling GET /enclave/ready twice will crash the entire nitriding runtime. Wrap in sync.Once.
I understand this is upstream nitriding code, but since you're now owning this fork, it's worth fixing.
🟡 MEDIUM — InsecureSkipVerify: true in watchdog health probe (production path)
supervisor/watchdog.go — isRunning() fallback
client := &http.Client{
Transport: &http.Transport{
TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
},
}
resp, err := client.Get(url + "/health")When nitro-cli is not on PATH (and no ENCLAVE_START_CMD override), the watchdog falls through to this TLS health probe with cert verification disabled. This path is reachable in production if nitro-cli is missing from PATH for any reason. The enclave generates self-signed certs so verification won't work out of the box, but InsecureSkipVerify means any MITM on the vsock/loopback can impersonate the health endpoint.
Suggestion: Pin the expected self-signed cert fingerprint, or at minimum document why this is acceptable and add a log warning when this path is taken.
🟡 MEDIUM — Nonce cache goroutine DoS on public endpoint
runtime/nitriding/cache.go — Add()
Each nonce request spawns a goroutine that sleeps for TTL (1 min). The nonce endpoint (GET /enclave/nonce) is on the public-facing HTTPS server. An attacker can spawn unbounded goroutines by requesting nonces at high rate.
Upstream acknowledges this in the code comments. Since you're vendoring this, consider adding rate-limiting middleware on the nonce endpoint, or replacing the goroutine-per-entry design with a periodic sweep.
🟡 MEDIUM — Context leak in ACME cert retrieval
runtime/nitriding/enclave.go — setupAcme()
go func() {
for {
ctx, cancel := context.WithTimeout(context.Background(), 5*time.Minute)
defer cancel() // <-- deferred until goroutine exits, not loop iteration
rawData, err = cache.Get(ctx, e.cfg.FQDN)defer cancel() inside a for-loop defers until the goroutine returns, not per iteration. Each iteration leaks a context/timer. Move cancel() to an explicit call before time.Sleep.
🟠 LOW — Shell injection surface via env vars
supervisor/watchdog.go — startEnclave() / terminateEnclave()
supervisor/migrate.go — lifecycleStop() / lifecycleStart()
if cmd := os.Getenv("ENCLAVE_START_CMD"); cmd != "" {
out, err := exec.CommandContext(ctx, "sh", "-c", cmd).CombinedOutput()ENCLAVE_START_CMD and ENCLAVE_STOP_CMD are passed to sh -c. These are documented as test harness overrides, but the code path exists in the production binary. If an attacker can write to the process environment (SSM parameter injection, container escape), they get arbitrary command execution.
Suggestion: Gate behind a --test-mode flag or build tag rather than silently checking env vars.
🟠 LOW — S3 assets bucket lacks SSL enforcement
cli/tofu_files.go — aws_s3_bucket "assets" resource
The storage bucket correctly enforces TLS via aws_s3_bucket_policy with "aws:SecureTransport" = "false", but the assets bucket (storing EIF images and supervisor binaries) has no such policy. An IAM-authenticated actor could download artifacts over HTTP.
🟠 LOW — hashHandler missing return after error
runtime/nitriding/handlers.go — hashHandler()
After the non-errTooMuchToRead error from io.ReadAll, the handler logs the error but does not return — execution falls through to base64 decode of partial data. Upstream bug, low impact (decode would fail), but sloppy for attestation-critical code.
ℹ️ Downstream Impact
enclave-skeletonrepo still has the old-style systemd units + shell scripts. Needs updating after this merges or it will confuse users.- 21 open PRs on this repo, several (especially #53 and dependabot go.mod PRs #88-97) will conflict with this change.
- No Go import consumers found across the org — this is consumed as a CLI tool, not a library.
✅ What looks good
- Vendored nitriding is faithful to upstream v1.4.2 with minimal, well-documented deviations (documented in
UPSTREAM.md) setters.goextension is well-scoped — direct in-process call replacing HTTP round-trip- Supervisor errgroup coordination is correct (gvproxy → ready signal → watchdog → enclave)
viproxyvendoring is clean- No hardcoded secrets or credentials found
- GitHub Actions use OIDC for AWS auth, no static keys
- Test coverage exists for nitriding components (keysync, handlers, attestation, cache, certcache)
Verdict: Request changes for the PCR path bug (blocking — it breaks every deployment). The readyHandler panic and InsecureSkipVerify in production are worth fixing before merge. The rest are improvements that can be tracked as follow-ups.
🤖 Reviewed by Arkana
There was a problem hiding this comment.
Follow-up Review — Commit 3ebba33f (resource cleanup + error handling)
New commit reviewed. The changes here are minor cleanup — all look correct. Notes below.
✅ LGTM — Lint fixes in cli/init.go and supervisor/imds_proxy.go
defer f.Close() → defer func() { _ = f.Close() }() — suppresses errcheck lint warning for deferred close on read-only or best-effort paths. Fine.
✅ LGTM — supervisor/watchdog.go:108-110 — terminateEnclave error logging
Previously silently discarded the error from terminateEnclave() on context cancellation. Now logs it at slog.Warn. Correct — shutdown errors shouldn't be fatal but shouldn't be silent either.
✅ LGTM — runtime/nitriding/handlers_test.go — Skip TestProxyHandler
The skip comment clearly explains why: pubSrv.Addr is unpopulated in library mode, the test tries to bind :443, and this code path isn't exercised in the actual runtime integration. Reasonable.
✅ LGTM — .github/workflows/integration-test.yml — APP_LOCAL_PATH + go mod vendor
Setting APP_LOCAL_PATH=$GITHUB_WORKSPACE/test/app so the Nix flake builds the test app from the workspace tree instead of fetching a pinned revision. Avoids needing to re-pin nix_rev/nix_hash on every test app change. The additional go mod vendor in test/app/ is needed since vendor/ is gitignored and Nix requires it when building from local source. Makes sense.
🔴 Outstanding issues from previous review — still unresolved
The following issues flagged in my initial review remain unfixed in this commit. These must be addressed before merge:
-
PCR path broken after
cd tofu(cli/framework_files.go:635,894,test/app/.github/workflows/deploy-enclave.yml:47) —.enclave/artifacts/pcr.jsonis read aftercd tofu, resolving to the wrong directory. PCR values will benull. -
readyHandlerpanics on double-call (runtime/nitriding/handlers.go:175) —close(e.ready)withoutsync.Onceprotection. Second call crashes the runtime. -
InsecureSkipVerify: truein production watchdog path (supervisor/watchdog.go) — health probe disables TLS verification. -
Nonce cache unbounded goroutine spawn (
runtime/nitriding/cache.go) — public endpoint, no rate limiting. -
Context leak in ACME
setupAcme()loop (runtime/nitriding/enclave.go) —defer cancel()inside for-loop never fires until goroutine exits.
Items 1 and 2 are the most critical. Item 1 breaks every deployment that uses the scaffolded workflow. Item 2 can crash the enclave runtime.
🤖 Reviewed by Arkana
… in deployment workflow
No description provided.