Goal
Make .github/skills/ctf-testing/ easier to run, debug, and extend after the current refactor is complete. The current scripts work as an end-to-end black-box test, but they are hard to narrow down when one challenge fails.
Current observations
- deploy_and_test.sh handles deployment, SSH, reboot testing, and cleanup in one local script.
- test_ctf_challenges.sh is one large VM-side script that solves all 18 challenges in a single run.
- The suite has pass and fail counters, but failures do not always include enough command output to quickly see what changed.
- There is no simple way to run only one challenge test, one section, or a fast smoke test after a targeted change.
Research notes
- bats-core provides TAP-compliant Bash tests, filtering, setup and teardown hooks, and cleaner failure output.
- TAP output can make test results easier to parse in CI and easier to summarize in future automation.
- Even if we do not adopt bats, splitting the VM-side tests into reusable functions would make targeted runs and debugging simpler.
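To make the TAP idea concrete without committing to bats, here is a minimal sketch of a TAP emitter in plain Bash. The helper names `tap_plan` and `tap_check` and the counters are hypothetical and do not exist in the current scripts. The sketch shows one way a check could print its captured output as diagnostics only when it fails.

```shell
#!/usr/bin/env bash
# Sketch only: tap_plan, tap_check, tap_count, and tap_failed are hypothetical names.

tap_count=0
tap_failed=0

# Print the TAP plan line, for example "1..18".
tap_plan() {
  printf '1..%d\n' "$1"
}

# Usage: tap_check "description" command [args...]
# Runs the command, prints one TAP result line, and on failure adds the
# exit status and captured output as "# " diagnostic lines.
tap_check() {
  local desc output status
  desc=$1
  shift
  tap_count=$((tap_count + 1))
  if output=$("$@" 2>&1); then
    printf 'ok %d - %s\n' "$tap_count" "$desc"
  else
    status=$?
    tap_failed=$((tap_failed + 1))
    printf 'not ok %d - %s\n' "$tap_count" "$desc"
    printf '# exit status: %d\n' "$status"
    if [ -n "$output" ]; then
      printf '%s\n' "$output" | sed 's/^/# /'
    fi
  fi
  # Always return 0 so one failing check does not abort the run;
  # the caller can decide the final exit code from tap_failed.
  return 0
}
```

A caller would print the plan once, wrap each challenge solve in `tap_check`, and finish with something like `[ "$tap_failed" -eq 0 ]` to set the exit code. Because output is captured per check, a passing run stays quiet while a failing check shows what the command printed.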
Possible approaches to evaluate
- Add flags such as --challenge 10, --section verify, --section export, and --smoke.
- Split each challenge solve into a named function with consistent diagnostics on failure.
- Add a machine-readable summary, such as TAP or JSON, while keeping plain terminal output readable.
- Keep deploy cleanup safe, but add clearer failure artifacts, for example setup log tail, failed service status, and recent journal lines.
- Consider bats-core only if the dependency cost is worth it for local and CI usage.
- Update SKILL.md so agents know when to run smoke, one-provider, all-provider, and reboot tests.
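As a rough illustration of the flag and named-function approaches, the sketch below maps the proposed flags onto test function names and runs only the selected ones. Everything in it is an assumption for discussion: the function names (`challenge_1`, `section_verify`), the `ALL_TESTS` and `SMOKE_TESTS` lists, and the exact flag semantics are hypothetical and not taken from the existing scripts.

```shell
#!/usr/bin/env bash
# Sketch only: function names, test lists, and flag semantics are assumptions.

# Hypothetical stand-ins; real functions would solve a challenge and assert on it.
challenge_1()    { echo "challenge 1 solved"; }
challenge_10()   { echo "challenge 10 solved"; }
section_verify() { echo "verify section checked"; }

ALL_TESTS="challenge_1 challenge_10 section_verify"
SMOKE_TESTS="challenge_1"

# Translate flags into a space-separated list of test function names.
select_tests() {
  local selected=""
  while [ $# -gt 0 ]; do
    case $1 in
      --challenge|--section)
        [ $# -ge 2 ] || { echo "$1 needs a value" >&2; return 2; }
        # Only accept simple identifiers so the value is safe to use as a name.
        case $2 in
          ''|*[!A-Za-z0-9_]*) echo "invalid value for $1: $2" >&2; return 2 ;;
        esac
        selected="${selected:+$selected }${1#--}_$2"
        shift 2
        ;;
      --smoke)
        selected="${selected:+$selected }$SMOKE_TESTS"
        shift
        ;;
      *)
        echo "unknown option: $1" >&2
        return 2
        ;;
    esac
  done
  # No filter flags means the full suite.
  echo "${selected:-$ALL_TESTS}"
}

# Run the selected tests in order, stopping at the first failure.
run_tests() {
  local names name
  names=$(select_tests "$@") || return 2
  for name in $names; do
    if ! command -v "$name" >/dev/null 2>&1; then
      echo "no such test: $name" >&2
      return 2
    fi
    "$name" || return 1
  done
}
```

With this shape, `run_tests --challenge 10` would run only `challenge_10`, `run_tests --smoke` would run the small subset, and `run_tests` with no flags would run everything. Keeping selection (`select_tests`) separate from execution makes it easy to add a dry-run or listing mode later.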
Acceptance criteria
- A contributor can run one challenge test without running the full suite.
- Failures show the command or artifact that caused the failure, without leaking more than needed.
- The full provider flow still supports --with-reboot.
- The skill documentation explains the new test modes clearly.
- Any new test dependency is justified and documented.