ci: E2E test automation for full deployment lifecycle #14

Open
owenwahlgren wants to merge 3 commits into main from ci/e2e-nightly

@owenwahlgren (Collaborator)

Summary

Adds automated E2E testing that runs on every merge to main and weekly (Sunday 4AM UTC), exercising the full deployment lifecycle on real AWS infrastructure against Fuji testnet.

Before this PR: CI only runs lint + syntax + dry-run. The E2E scripts exist but are never run automatically. Regressions in Ansible roles, Terraform configs, the create-l1 tool, or add-on deployments sail through CI undetected.

After this PR: Every capability in the repo is tested against real infrastructure automatically.

What's tested

L1 Full Stack (~90 min): Infra creation → node deployment → P-Chain sync → L1 creation → ValidatorManager initialization → monitoring → Blockscout → eRPC → Graph Node → Faucet → Safe multisig → ICM Relayer → health checks → rolling restart → rolling upgrade → L1 reset → teardown

Primary Network Lifecycle (~120 min): Infra creation → node deployment → P/X/C sync → upgrade/downgrade → staking key backup → key restoration → snapshots → prepare-migration → validator migration → health checks → rolling restart → teardown

Coverage gaps filled

| Capability | Before | After |
| --- | --- | --- |
| Safe multisig | Not tested | Deploy + health check (UI, CGW) |
| ICM Relayer | Not tested | Deploy + health check |
| Faucet | Optional | Always (ewoq key) |
| ValidatorManager | Conditional | Auto-installs Foundry + icm-contracts |
| L1 chain verification | None | RPC + block number check |
| Add-on health checks | Exit code only | HTTP health check with retries |
| L1 upgrade | Not tested | Rolling upgrade + chain survival |
| Key restoration | Not tested | Restore from S3, verify on target |
| Prepare migration | Not tested | Explicit playbook test |
| Automated CI runs | Never | Every merge + weekly |
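
The move from "exit code only" to "HTTP health check with retries" could look roughly like the sketch below. It is not the code from the E2E scripts: the function names `retry` and `http_ok`, the attempt count, the delay, and the URL in the usage comment are all assumptions made for illustration.

```shell
# retry ATTEMPTS DELAY_SECONDS CMD [ARGS...]
# Runs CMD until it succeeds or ATTEMPTS is exhausted.
# (Hypothetical helper; the real scripts may structure this differently.)
retry() {
  retry_attempts=$1
  retry_delay=$2
  shift 2
  retry_i=1
  while [ "$retry_i" -le "$retry_attempts" ]; do
    if "$@"; then
      return 0
    fi
    # Sleep between attempts, but not after the final failure.
    if [ "$retry_i" -lt "$retry_attempts" ]; then
      sleep "$retry_delay"
    fi
    retry_i=$((retry_i + 1))
  done
  return 1
}

# http_ok URL
# Succeeds only if the endpoint answers without an HTTP error status;
# curl --fail turns HTTP responses >= 400 into a non-zero exit code.
http_ok() {
  curl --fail --silent --show-error --max-time 10 --output /dev/null "$1"
}

# Example call (host and path are placeholders, not taken from the PR):
#   retry 12 10 http_ok "http://${NODE_IP}:4000/" || exit 1
```

Separating the retry loop from the probe keeps the loop reusable for non-HTTP checks, such as waiting for a block number to advance.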

Bugs fixed during review

  • create-l1 --json stdout pollution: JSON parsing always failed (progress logs mixed with JSON). Now sources l1.env directly.
  • make backup-keys in L1 E2E: targeted Primary Network inventory, wrong playbook. Removed (keys backed up during deploy).
  • eRPC health check: hit port 4001 (Blockscout) instead of 4000 (eRPC proxy).
  • init-validator-manager build failure: continued into init steps with missing binary. Now skips gracefully.
  • TF_VARFILE relative paths: broke after cd into terraform dir. Now resolved to absolute before any cd.
  • Workflow secrets in if: conditions: the secrets context is not valid there, so the Slack step never fired. Fixed with the env var pattern.
  • Workflow inputs.skip_destroy: boolean/string type mismatch. Teardown ran even when skip requested.
  • Concurrency group collision: push and schedule shared one slot, could cancel each other mid-run.
  • Teardown state file guard: silently skipped destroy if state absent. Now always attempts destroy.
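
As an illustration of the TF_VARFILE fix in the list above, here is a minimal sketch of resolving a relative path to an absolute one before any `cd`. The helper name `abs_path` and the directory in the trailing comment are assumptions; the actual scripts may use a different approach such as `realpath`.

```shell
# abs_path PATH
# Prints PATH unchanged if it is already absolute, otherwise prefixes it
# with the current working directory. (Hypothetical helper name.)
abs_path() {
  case $1 in
    /*) printf '%s\n' "$1" ;;
    *)  printf '%s/%s\n' "$PWD" "$1" ;;
  esac
}

# Resolve once, while still in the directory the caller used for the
# relative path. After this, changing directory cannot break the reference.
if [ -n "${TF_VARFILE:-}" ]; then
  TF_VARFILE=$(abs_path "$TF_VARFILE")
fi

# Later, from inside the Terraform directory (directory name assumed):
#   cd terraform/... && terraform apply -var-file="$TF_VARFILE"
```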

Cost

~$29/month at ~12 runs/month (8 merges + 4 weekly). Each run: L1 ~$0.47, Primary ~$1.92.
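
The monthly estimate can be checked with integer arithmetic in cents. This is only a worked calculation using the figures quoted above; the variable names are illustrative.

```shell
# Per-run costs from the PR description, in cents.
l1_cents=47        # L1 Full Stack run, ~$0.47
primary_cents=192  # Primary Network run, ~$1.92

# 8 merge-triggered runs + 4 weekly runs per month.
runs_per_month=$((8 + 4))

per_run_cents=$((l1_cents + primary_cents))
monthly_cents=$((per_run_cents * runs_per_month))

printf 'Per run: $%d.%02d\n' "$((per_run_cents / 100))" "$((per_run_cents % 100))"
printf 'Per month (%d runs): $%d.%02d\n' "$runs_per_month" \
  "$((monthly_cents / 100))" "$((monthly_cents % 100))"
# Per run: $2.39
# Per month (12 runs): $28.68
```

The result, $28.68, rounds to the ~$29/month quoted above.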

Required GitHub secrets

| Secret | Required? |
| --- | --- |
| AWS_ACCESS_KEY_ID | Yes |
| AWS_SECRET_ACCESS_KEY | Yes |
| AVALANCHE_PRIVATE_KEY | Yes |
| RELAYER_KEY | No (ICM Relayer skipped if unset) |
| SLACK_WEBHOOK_URL | No (no notification if unset) |
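
A script could enforce the required/optional split with a small guard like the one below. This is a sketch under the assumption that the secrets reach the script as environment variables with the same names; `require_env` is a hypothetical helper, not something this PR is known to contain.

```shell
# require_env NAME...
# Returns non-zero if any named variable is unset or empty, and reports
# every missing name (not just the first) on stderr.
require_env() {
  require_env_missing=0
  for require_env_name in "$@"; do
    # Indirect lookup of the variable whose name is in require_env_name.
    eval "require_env_value=\${$require_env_name:-}"
    if [ -z "$require_env_value" ]; then
      echo "missing required secret: $require_env_name" >&2
      require_env_missing=1
    fi
  done
  return "$require_env_missing"
}

# Example use at the top of an E2E script:
#   require_env AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY AVALANCHE_PRIVATE_KEY || exit 1
#   if [ -z "${RELAYER_KEY:-}" ]; then
#     echo "RELAYER_KEY unset, skipping ICM Relayer"
#   fi
```

Failing fast on missing secrets avoids creating AWS infrastructure for a run that cannot finish.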

Test plan

  • make test-e2e-dry passes (both scripts)
  • shellcheck -S error passes on both scripts
  • YAML validation passes on both workflow files
  • terraform fmt -check passes on CI tfvars
  • All tfvar names validated against variables.tf
  • Code review pass: 9 bugs found and fixed
  • Manual workflow_dispatch run after secrets are configured

Add `--vm-id` flag to the create-l1 tool so it can deploy chains with
any AvalancheGo VM plugin, not just SubnetEVM. Also add Ansible support
for installing custom VM plugin binaries on validator nodes.

Changes:
- tools/create-l1/main.go: Add --vm-id flag, parse with ids.FromString,
  pass to IssueCreateChainTx instead of hardcoded constants.SubnetEVMID
- ansible/roles/avalanchego/defaults/main.yml: Add custom_vm_id and
  custom_vm_binary_path variables
- ansible/roles/avalanchego/tasks/main.yml: Add task to copy custom VM
  binary to the plugins directory when variables are set

Closes #7

Add automated E2E testing that runs on every merge to main and weekly,
exercising the full deployment lifecycle on real AWS infrastructure.

L1 E2E enhancements:
- Safe multisig deployment + health checks (was untested)
- ICM Relayer deployment + health check (was untested)
- Faucet now always runs using ewoq key (was optional)
- ValidatorManager auto-installs Foundry and clones icm-contracts
- L1 chain RPC verification after configuration
- HTTP health checks with retries for every add-on
- Rolling upgrade test with chain survival verification
- Fix: create-l1 output parsing (source l1.env, not broken JSON)
- Fix: remove wrong make backup-keys call (targeted Primary inventory)
- Fix: eRPC health check used Blockscout port (4001 -> 4000)
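
The create-l1 parsing fix (source l1.env rather than parse JSON from a stdout stream that also carries progress logs) might be sketched as follows. The function name `load_l1_env` and the file location in the usage comment are assumptions, and the variable names inside l1.env are not shown in this PR.

```shell
# load_l1_env FILE
# Sources a KEY=VALUE env file and exports everything it defines, so that
# child processes (make, ansible-playbook, ...) also see the values.
# (Hypothetical helper; pass a path containing a slash so that "." does
# not search PATH for the file.)
load_l1_env() {
  load_l1_env_file=$1
  if [ ! -f "$load_l1_env_file" ]; then
    echo "env file not found: $load_l1_env_file" >&2
    return 1
  fi
  set -a                  # auto-export every variable assigned while sourcing
  . "$load_l1_env_file"
  set +a
}

# Example (path is a placeholder):
#   load_l1_env ./l1.env || exit 1
```

Reading the file avoids depending on what else the tool prints to stdout.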

Primary Network E2E enhancements:
- Staking key restoration test (restore from S3 to second node)
- Explicit prepare-migration step before validator migration
- TF_VARFILE support for CI-specific instance types

New files:
- .github/workflows/e2e.yml: Two parallel jobs with cancellation-safe
  teardown, Slack failure notifications, and job summaries
- .github/workflows/cleanup-ci-resources.yml: Weekly orphan instance
  termination for leaked CI infrastructure
- tests/ci/l1.tfvars: Cost-optimized L1 instances (~$0.31/hr)
- tests/ci/primary.tfvars: Primary Network instances (~$0.96/hr)

Required GitHub secrets: AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY,
AVALANCHE_PRIVATE_KEY. Optional: RELAYER_KEY, SLACK_WEBHOOK_URL.
Comment thread .github/workflows/cleanup-ci-resources.yml Fixed
Comment thread .github/workflows/e2e.yml Fixed
Comment thread .github/workflows/e2e.yml Fixed
Comment thread .github/workflows/e2e.yml Fixed
Address CodeQL findings — restrict GITHUB_TOKEN to minimum required
permissions. E2E workflow needs contents:read for checkout. Cleanup
workflow needs no token permissions (only uses AWS credentials).