ci: E2E test automation for full deployment lifecycle by owenwahlgren · Pull Request #14 · ava-labs/avalanche-deploy

owenwahlgren · 2026-04-13T23:05:57Z

Summary

Adds automated E2E testing that runs on every merge to main and weekly (Sunday 4AM UTC), exercising the full deployment lifecycle on real AWS infrastructure against Fuji testnet.

Before this PR: CI runs only lint, syntax, and dry-run checks. The E2E scripts exist but never run automatically, so regressions in Ansible roles, Terraform configs, the create-l1 tool, or add-on deployments pass CI undetected.

After this PR: Every capability in the repo is tested against real infrastructure automatically.

What's tested

L1 Full Stack (~90 min): Infra creation → node deployment → P-Chain sync → L1 creation → ValidatorManager initialization → monitoring → Blockscout → eRPC → Graph Node → Faucet → Safe multisig → ICM Relayer → health checks → rolling restart → rolling upgrade → L1 reset → teardown

Primary Network Lifecycle (~120 min): Infra creation → node deployment → P/X/C sync → upgrade/downgrade → staking key backup → key restoration → snapshots → prepare-migration → validator migration → health checks → rolling restart → teardown

Coverage gaps filled

Capability	Before	After
Safe multisig	Not tested	Deploy + health check (UI, CGW)
ICM Relayer	Not tested	Deploy + health check
Faucet	Optional	Always (ewoq key)
ValidatorManager	Conditional	Auto-installs Foundry + icm-contracts
L1 chain verification	None	RPC + block number check
Add-on health checks	Exit code only	HTTP health check with retries
L1 upgrade	Not tested	Rolling upgrade + chain survival
Key restoration	Not tested	Restore from S3, verify on target
Prepare migration	Not tested	Explicit playbook test
Automated CI runs	Never	Every merge + weekly
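The "HTTP health check with retries" row can be illustrated with a small polling helper. This is a minimal Python sketch, not the repo's shell implementation: the names `wait_until_healthy` and `http_probe`, the attempt count, and the delay are all assumptions chosen for illustration.

```python
import http.client
import time
import urllib.error
import urllib.request
from typing import Callable


def wait_until_healthy(
    probe: Callable[[], bool],
    attempts: int = 10,
    delay_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call probe() up to `attempts` times; sleep `delay_s` between failures."""
    for attempt in range(1, attempts + 1):
        if probe():
            return True
        if attempt < attempts:  # no point sleeping after the final failure
            sleep(delay_s)
    return False


def http_probe(url: str, timeout_s: float = 5.0) -> Callable[[], bool]:
    """Build a probe that passes only when `url` answers with a 2xx status."""

    def probe() -> bool:
        try:
            with urllib.request.urlopen(url, timeout=timeout_s) as resp:
                return 200 <= resp.status < 300
        except (urllib.error.URLError, http.client.HTTPException, OSError):
            # Connection refused, timeout, or a 4xx/5xx all count as "not ready".
            return False

    return probe
```

A call such as `wait_until_healthy(http_probe("http://<host>:4000/"))` would then poll the eRPC port; the host and path are placeholders. The point of the pattern is that a deploy step exiting with code 0 says nothing about whether the service is actually answering requests.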

Bugs fixed during review

create-l1 --json stdout pollution: JSON parsing always failed (progress logs mixed with JSON). Now sources l1.env directly.
make backup-keys in L1 E2E: targeted Primary Network inventory, wrong playbook. Removed (keys backed up during deploy).
eRPC health check: hit port 4001 (Blockscout) instead of 4000 (eRPC proxy).
init-validator-manager build failure: continued into init steps with missing binary. Now skips gracefully.
TF_VARFILE relative paths: broke after cd into terraform dir. Now resolved to absolute before any cd.
Workflow secrets in `if:` conditions: the secrets context is invalid there, so the Slack step never fired. Fixed with env var pattern.
Workflow inputs.skip_destroy: boolean/string type mismatch. Teardown ran even when skip requested.
Concurrency group collision: push and schedule shared one slot, could cancel each other mid-run.
Teardown state file guard: silently skipped destroy if state absent. Now always attempts destroy.
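The first fix in the list is easy to reproduce in isolation. The following is a minimal Python sketch under stated assumptions: the progress line, the `EXAMPLE_*` keys, and the `parse_env_file` helper are hypothetical, and the actual E2E script sources `l1.env` from shell rather than parsing it like this.

```python
import json


def parse_env_file(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


# Hypothetical stdout: a progress log line followed by the JSON document.
polluted_stdout = 'progress: step one done\n{"EXAMPLE_ID": "abc123"}\n'

try:
    json.loads(polluted_stdout)
    parsed_ok = True
except json.JSONDecodeError:
    # The leading log line makes the whole stream invalid JSON.
    parsed_ok = False

# Hypothetical env file contents, standing in for l1.env.
env_text = '# written by the tool\nEXAMPLE_ID=abc123\nEXAMPLE_URL="http://127.0.0.1:9650"\n'
env = parse_env_file(env_text)

print(parsed_ok)
print(env["EXAMPLE_ID"])
```

Reading a file the tool writes for that purpose sidesteps the problem, because the file's contents do not change when log lines are added to stdout.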

Cost

~$29/month at ~12 runs/month (8 merges + 4 weekly). Each run: L1 ~$0.47, Primary ~$1.92.
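These figures can be cross-checked against the hourly rates and durations quoted elsewhere in this PR (~$0.31/hr for ~90 min, ~$0.96/hr for ~120 min). A short Python sketch of the arithmetic, assuming every run executes both jobs for their nominal duration:

```python
from decimal import ROUND_HALF_UP, Decimal

CENT = Decimal("0.01")


def run_cost(hourly_usd: str, minutes: int) -> Decimal:
    """Cost of one run in USD, rounded to the nearest cent."""
    cost = Decimal(hourly_usd) * Decimal(minutes) / Decimal(60)
    return cost.quantize(CENT, rounding=ROUND_HALF_UP)


l1 = run_cost("0.31", 90)        # 0.465 rounds to 0.47
primary = run_cost("0.96", 120)  # 1.92
runs_per_month = 8 + 4           # 8 merges + 4 weekly runs
monthly = (l1 + primary) * runs_per_month

print(l1, primary, monthly)  # 0.47 1.92 28.68
```

The result, $28.68, matches the quoted ~$29/month. Runs that fail early or hang before teardown would move the real number in either direction.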

Required GitHub secrets

Secret	Required?
`AWS_ACCESS_KEY_ID`	Yes
`AWS_SECRET_ACCESS_KEY`	Yes
`AVALANCHE_PRIVATE_KEY`	Yes
`RELAYER_KEY`	No (ICM Relayer skipped if unset)
`SLACK_WEBHOOK_URL`	No (no notification if unset)
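The required/optional split in the table amounts to a small preflight check. A hedged Python sketch follows: the function name `plan_from_secrets` and the feature labels are illustrative, and treating an empty value as unset is an assumption about how missing secrets reach the job environment.

```python
from typing import Mapping

REQUIRED = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "AVALANCHE_PRIVATE_KEY")
OPTIONAL = {
    "RELAYER_KEY": "ICM Relayer",
    "SLACK_WEBHOOK_URL": "Slack notification",
}


def plan_from_secrets(env: Mapping[str, str]) -> tuple[list[str], list[str]]:
    """Return (missing required secrets, optional features to skip)."""
    # An empty string is treated the same as an absent variable.
    missing = [name for name in REQUIRED if not env.get(name)]
    skipped = [feature for name, feature in OPTIONAL.items() if not env.get(name)]
    return missing, skipped


# Example: all required secrets present, both optional ones absent.
example_env = {
    "AWS_ACCESS_KEY_ID": "placeholder",
    "AWS_SECRET_ACCESS_KEY": "placeholder",
    "AVALANCHE_PRIVATE_KEY": "placeholder",
}
missing, skipped = plan_from_secrets(example_env)
print(missing)  # []
print(skipped)  # ['ICM Relayer', 'Slack notification']
```

Failing fast on a missing required secret is cheaper than discovering it after infrastructure has already been created.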

Test plan

make test-e2e-dry passes (both scripts)
shellcheck -S error passes on both scripts
YAML validation passes on both workflow files
terraform fmt -check passes on CI tfvars
All tfvar names validated against variables.tf
Code review pass: 9 bugs found and fixed
Manual workflow_dispatch run after secrets are configured

Add `--vm-id` flag to the create-l1 tool so it can deploy chains with any AvalancheGo VM plugin, not just SubnetEVM. Also add Ansible support for installing custom VM plugin binaries on validator nodes.

Changes:
- tools/create-l1/main.go: Add --vm-id flag, parse with ids.FromString, pass to IssueCreateChainTx instead of hardcoded constants.SubnetEVMID
- ansible/roles/avalanchego/defaults/main.yml: Add custom_vm_id and custom_vm_binary_path variables
- ansible/roles/avalanchego/tasks/main.yml: Add task to copy custom VM binary to the plugins directory when variables are set

Closes #7

Add automated E2E testing that runs on every merge to main and weekly, exercising the full deployment lifecycle on real AWS infrastructure.

L1 E2E enhancements:
- Safe multisig deployment + health checks (was untested)
- ICM Relayer deployment + health check (was untested)
- Faucet now always runs using ewoq key (was optional)
- ValidatorManager auto-installs Foundry and clones icm-contracts
- L1 chain RPC verification after configuration
- HTTP health checks with retries for every add-on
- Rolling upgrade test with chain survival verification
- Fix: create-l1 output parsing (source l1.env, not broken JSON)
- Fix: remove wrong make backup-keys call (targeted Primary inventory)
- Fix: eRPC health check used Blockscout port (4001 -> 4000)

Primary Network E2E enhancements:
- Staking key restoration test (restore from S3 to second node)
- Explicit prepare-migration step before validator migration
- TF_VARFILE support for CI-specific instance types

New files:
- .github/workflows/e2e.yml: Two parallel jobs with cancellation-safe teardown, Slack failure notifications, and job summaries
- .github/workflows/cleanup-ci-resources.yml: Weekly orphan instance termination for leaked CI infrastructure
- tests/ci/l1.tfvars: Cost-optimized L1 instances (~$0.31/hr)
- tests/ci/primary.tfvars: Primary Network instances (~$0.96/hr)

Required GitHub secrets: AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AVALANCHE_PRIVATE_KEY. Optional: RELAYER_KEY, SLACK_WEBHOOK_URL.

Address CodeQL findings — restrict GITHUB_TOKEN to minimum required permissions. E2E workflow needs contents:read for checkout. Cleanup workflow needs no token permissions (only uses AWS credentials).

owenwahlgren added 2 commits April 1, 2026 22:12

github-advanced-security AI found potential problems Apr 13, 2026


Comment thread .github/workflows/cleanup-ci-resources.yml Fixed

Comment thread .github/workflows/e2e.yml Fixed

Comment thread .github/workflows/e2e.yml Fixed

Comment thread .github/workflows/e2e.yml Fixed

ci: add explicit permissions blocks to workflows

2b7554d



owenwahlgren wants to merge 3 commits into main from ci/e2e-nightly
