ci: E2E test automation for full deployment lifecycle #14
Open
owenwahlgren wants to merge 3 commits into main
Add `--vm-id` flag to the create-l1 tool so it can deploy chains with any AvalancheGo VM plugin, not just SubnetEVM. Also add Ansible support for installing custom VM plugin binaries on validator nodes.

Changes:
- tools/create-l1/main.go: Add --vm-id flag, parse with ids.FromString, pass to IssueCreateChainTx instead of hardcoded constants.SubnetEVMID
- ansible/roles/avalanchego/defaults/main.yml: Add custom_vm_id and custom_vm_binary_path variables
- ansible/roles/avalanchego/tasks/main.yml: Add task to copy custom VM binary to the plugins directory when variables are set

Closes #7
Add automated E2E testing that runs on every merge to main and weekly, exercising the full deployment lifecycle on real AWS infrastructure.

L1 E2E enhancements:
- Safe multisig deployment + health checks (was untested)
- ICM Relayer deployment + health check (was untested)
- Faucet now always runs using ewoq key (was optional)
- ValidatorManager auto-installs Foundry and clones icm-contracts
- L1 chain RPC verification after configuration
- HTTP health checks with retries for every add-on
- Rolling upgrade test with chain survival verification
- Fix: create-l1 output parsing (source l1.env, not broken JSON)
- Fix: remove wrong make backup-keys call (targeted Primary inventory)
- Fix: eRPC health check used Blockscout port (4001 -> 4000)

Primary Network E2E enhancements:
- Staking key restoration test (restore from S3 to second node)
- Explicit prepare-migration step before validator migration
- TF_VARFILE support for CI-specific instance types

New files:
- .github/workflows/e2e.yml: Two parallel jobs with cancellation-safe teardown, Slack failure notifications, and job summaries
- .github/workflows/cleanup-ci-resources.yml: Weekly orphan instance termination for leaked CI infrastructure
- tests/ci/l1.tfvars: Cost-optimized L1 instances (~$0.31/hr)
- tests/ci/primary.tfvars: Primary Network instances (~$0.96/hr)

Required GitHub secrets: AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AVALANCHE_PRIVATE_KEY.
Optional: RELAYER_KEY, SLACK_WEBHOOK_URL.
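To illustrate the create-l1 output-parsing fix from this commit, here is a minimal sketch of why reading an env file is more robust than parsing stdout that mixes progress logs with JSON. The `fake_create_l1` function, the variable names `SUBNET_ID` and `CHAIN_ID`, and the file contents are all hypothetical stand-ins; the real tool's output and the real contents of `l1.env` are not shown in this PR.

```shell
#!/bin/sh
# Hypothetical stand-in for the create-l1 tool: progress logs and JSON
# share stdout, and results are also written to an env file ($1).
fake_create_l1() {
  echo "Creating subnet..."
  echo '{"subnet_id": "example-subnet"}'
  echo "Done."
  # Assumed KEY=value format; the variable names are illustrative only.
  printf 'SUBNET_ID=example-subnet\nCHAIN_ID=example-chain\n' > "$1"
}

workdir=$(mktemp -d)
env_file="$workdir/l1.env"

# Captured stdout contains log lines around the JSON, so it is not a
# valid JSON document as a whole and a strict parser would reject it.
fake_create_l1 "$env_file" > "$workdir/stdout.log"

# Sourcing the env file avoids depending on stdout at all.
. "$env_file"

echo "subnet: $SUBNET_ID"
echo "chain:  $CHAIN_ID"
```

Sourcing executes the file as shell code, so this pattern is only appropriate when the file is produced by a trusted tool in the same job.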
Address CodeQL findings — restrict GITHUB_TOKEN to minimum required permissions. E2E workflow needs contents:read for checkout. Cleanup workflow needs no token permissions (only uses AWS credentials).
Summary
Adds automated E2E testing that runs on every merge to main and weekly (Sunday 4AM UTC), exercising the full deployment lifecycle on real AWS infrastructure against Fuji testnet.
Before this PR: CI only runs lint + syntax + dry-run. The E2E scripts exist but are never run automatically. Regressions in Ansible roles, Terraform configs, the create-l1 tool, or add-on deployments sail through CI undetected.
After this PR: Every capability in the repo is tested against real infrastructure automatically.
What's tested
L1 Full Stack (~90 min): Infra creation → node deployment → P-Chain sync → L1 creation → ValidatorManager initialization → monitoring → Blockscout → eRPC → Graph Node → Faucet → Safe multisig → ICM Relayer → health checks → rolling restart → rolling upgrade → L1 reset → teardown
Primary Network Lifecycle (~120 min): Infra creation → node deployment → P/X/C sync → upgrade/downgrade → staking key backup → key restoration → snapshots → prepare-migration → validator migration → health checks → rolling restart → teardown
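Both pipelines include health-check steps, and the commit message describes them as HTTP checks with retries. A minimal sketch of that pattern might look like the following. The function names `retry` and `check_http`, the attempt count, the delay, and the URL in the usage comment are assumptions for illustration, not the scripts' actual implementation.

```shell
#!/bin/sh
# retry ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it exits 0 or ATTEMPTS tries have been made.
retry() {
  attempts=$1
  delay=$2
  shift 2
  n=1
  while :; do
    if "$@"; then
      return 0
    fi
    if [ "$n" -ge "$attempts" ]; then
      echo "failed after $n attempts: $*" >&2
      return 1
    fi
    n=$((n + 1))
    sleep "$delay"
  done
}

# check_http URL
# Fails on connection errors, timeouts, or HTTP status 400 and above.
check_http() {
  curl --fail --silent --show-error --max-time 5 --output /dev/null "$1"
}

# Usage (hypothetical endpoint; substitute the add-on's real URL):
#   retry 10 6 check_http "http://127.0.0.1:8080/health"
```

Keeping the retry loop separate from the HTTP probe means the same loop can wrap other readiness checks, such as an RPC call against the L1 chain.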
Coverage gaps filled
Bugs fixed during review
- `create-l1 --json` stdout pollution: JSON parsing always failed (progress logs mixed with JSON). Now sources `l1.env` directly.
- `make backup-keys` in L1 E2E: targeted Primary Network inventory, wrong playbook. Removed (keys backed up during deploy).
- `init-validator-manager` build failure: continued into init steps with missing binary. Now skips gracefully.
- … `cd` into terraform dir. Now resolved to absolute before any `cd`.
- `secrets` in `if:`: invalid context, Slack step never fired. Fixed with env var pattern.
- `inputs.skip_destroy`: boolean/string type mismatch. Teardown ran even when skip requested.
- `push` and `schedule` shared one slot, could cancel each other mid-run.
Cost
~$29/month at ~12 runs/month (8 merges + 4 weekly). Each run: L1 ~$0.47, Primary ~$1.92.
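The per-run figures follow from the hourly rates in the CI tfvars and the approximate durations listed under "What's tested". A small sketch of the arithmetic, assuming each job is billed for its full nominal duration (90 and 120 minutes) and nothing else:

```shell
#!/bin/sh
# cost RATE_PER_HOUR MINUTES -> dollars with three decimals
cost() {
  awk -v rate="$1" -v min="$2" 'BEGIN { printf "%.3f", rate * min / 60 }'
}

l1_cost=$(cost 0.31 90)        # L1: ~$0.31/hr for ~90 min
primary_cost=$(cost 0.96 120)  # Primary: ~$0.96/hr for ~120 min

# 12 runs/month = 8 merges + 4 weekly runs (the PR's own estimate)
monthly=$(awk -v a="$l1_cost" -v b="$primary_cost" -v runs=12 \
  'BEGIN { printf "%.2f", (a + b) * runs }')

echo "L1 per run:        \$$l1_cost"
echo "Primary per run:   \$$primary_cost"
echo "Monthly (12 runs): \$$monthly"
```

This gives roughly $0.47 and $1.92 per run and just under $29 per month, matching the figures above. Actual AWS charges would also depend on storage, data transfer, and any run that exceeds its nominal duration.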
Required GitHub secrets
- AWS_ACCESS_KEY_ID
- AWS_SECRET_ACCESS_KEY
- AVALANCHE_PRIVATE_KEY
- RELAYER_KEY (optional)
- SLACK_WEBHOOK_URL (optional)
Test plan
- `make test-e2e-dry` passes (both scripts)
- `shellcheck -S error` passes on both scripts
- `terraform fmt -check` passes on CI tfvars
- `workflow_dispatch` run after secrets are configured