Skip to content

fix(local-gate): six targeted fixes to make local test gate reliable#407

Open
lollonet wants to merge 1 commit into
mainfrom
fix/local-gate-reliability
Open

fix(local-gate): six targeted fixes to make local test gate reliable#407
lollonet wants to merge 1 commit into
mainfrom
fix/local-gate-reliability

Conversation

@lollonet
Copy link
Copy Markdown
Owner

Sintesi

Hardening pre-tag release: 6 fix mirate per rendere il local test gate affidabile su macOS / CI generici. Nessun refactor ampio, ciascun fix è minimal e verificato.

Files modificati

File Fix
scripts/common/unified-log.sh Bash 3.2 compat (#1)
scripts/device-smoke.sh SMOKE_SKIP_NETWORK gate (#2)
tests/test_device_smoke.sh Mock Linux-only binaries + SMOKE_SKIP_NETWORK (#2)
tests/test_device_smoke_parallel_network.sh Tolera nuova indentazione (#2 collateral)
tests/test_metadata_plugin_bugs.sh Scope finally a ws_handler (#3)
client/tests/test_discover_server.sh _update_server → _apply_server (#4)
scripts/deploy.sh compose down → up -d --force-recreate (#5)
client/common/scripts/setup.sh Rimuove curl

Fix dettagliate

1. Logging Bash 3.2

unified-log.sh:103 usava printf '[%(%H:%M:%S)T]' (builtin Bash 4.2+). Su macOS default (Bash 3.2) la format string esce letterale → log lines senza timestamp. Sostituito con \$(date +%H:%M:%S). Fork cost trascurabile vs il costo del gate.

2. device-smoke test deterministico

Lo script eseguiva real DNS/NTP/mDNS/arping + journalctl + /proc reads + HTTP probes contro snapserver/metadata-service. Su macOS CI il test hangava in timeout. Introdotto SMOKE_SKIP_NETWORK=1 che gata:

  • I 4 net check paralleli (_net_check_*)
  • I live HTTP probes (JSON-RPC, /health, /status)
  • I 10 modular check Linux-only (boot_health, mounts, qos, timers, system, audio, env, mdns, snapcast, containers)
  • Lo scan journalctl Recent Errors

Production runs (firstboot, fleet-smoke, ADR-005) lasciano la flag UNSET → comportamento invariato. Test mocka i remaining Linux-only binaries (journalctl, ip, ss, vcgencmd, tc, ...) come no-op.

3. metadata-service test falso positivo

test_metadata_plugin_bugs.sh:88 usava awk '/finally:/,/logger.info.*disconnected/' che matcha il primo finally: del file (riga 665 in fetch_album_artwork), non quello di ws_handler (riga 1919). Falso positivo: ws_handler poteva regredire e il test ancora passare. Fix: scope l'awk al body di ws_handler bound tra ^async def ws_handler e il prossimo top-level def. Aggiunta belt-and-braces che il blocco estratto sia non-vuoto.

4. discover-server test obsoleto

La funzione fu rinominata da _update_server a _apply_server in un refactor precedente; il test continuava ad estrarre il nome vecchio e fallava silenziosamente (6 FAIL / 2 PASS). Riscritto per estrarre _apply_server + _current_host, stub _log e _compose_up. Mantiene gli assert su ordering (.env flushed BEFORE compose up). 13/13 PASS.

5. deploy lifecycle

install_systemd_service faceva docker compose down prima di systemctl start per assicurarsi che ExecStart leggesse il nuovo .env. Distruttivo: 30-40s silenzio audio + teardown network/image. Sostituito con docker compose up -d --force-recreate: re-evaluates compose+env, ricrea solo ciò che è cambiato, no-op altrimenti. Costo tipico 5-10s. Il || true mantiene resilienza quando il compose project non esiste ancora (fresh install).

6. Security: no curl|sh fallback

client/common/scripts/setup.sh:541 aveva curl -fsSL https://get.docker.com | sh come last-resort. prepare-sd.sh ship install-docker.sh su ogni SD layout, quindi la sua assenza significa SD malpreparata. Sostituito con failure esplicito che elenca i candidate paths controllati + recovery action (re-run prepare-sd.sh). Fail loud > pipe silenzioso.

Validation locale

Comando Risultato
bash -n su tutti i file modificati clean
shellcheck -S warning -x su tutti i file modificati clean
tests/test_logging_consolidation.sh 38/38 PASS
tests/test_device_smoke.sh 7/7 PASS (era timeout pre-fix)
tests/test_metadata_plugin_bugs.sh 12/12 PASS
client/tests/test_discover_server.sh 13/13 PASS (era 2/8 pre-fix)
tests/test_prepare_sd_required_files.sh 14/14 PASS
Server suite full 51/51 PASS
Client suite full 5/5 PASS
.venv/bin/python -m pytest tests client/tests -q 177/177 PASS

Rischi residui

  • Pi 5 non testato: il fix Add Claude Code GitHub Workflow #5 (force-recreate vs down) è invariato per il path "no systemd unit" (manual deploy.sh). Validation su Pi 4 reale prima del tag.
  • SMOKE_SKIP_NETWORK semantica overload: il flag ora gata sia network reali sia HTTP probes locali sia check modulari Linux-only. Documentato inline; rinaming a SMOKE_LOCAL_GATE=1 o SMOKE_PORTABLE=1 deferibile se il nome diventa fuorviante (out of scope qui).
  • Fix docs: enforce SSOT across documentation #6 fail-explicit: se un utente avanzato fa install standalone senza prepare-sd, ora deve copiare install-docker.sh manualmente o pre-installare Docker. Trade-off accettato per sicurezza.

No CHANGELOG entry — parallel-branch rule, accumulato per release-tag v0.7.5.

Pre-release-tag hardening pass. All fixes are minimal — no broad refactors.

1. scripts/common/unified-log.sh
   `printf '[%(%H:%M:%S)T]'` requires Bash 4.2+; macOS default is 3.2,
   silently no-ops there and produces log lines without timestamps.
   Replace with `$(date +%H:%M:%S)` — portable across Bash 3.2 and 5+,
   fork cost negligible vs. running the gate.

2. scripts/device-smoke.sh + tests/test_device_smoke.sh
   The smoke script runs real dig/chronyc/avahi-browse/arping plus
   journalctl + /proc reads + live HTTP probes against snapserver and
   metadata-service. On macOS / generic CI runners every one of those
   fails with no DNS / no mDNS responder / no Linux kernel — the test
   hung in timeout. Introduce `SMOKE_SKIP_NETWORK=1` as an escape hatch
   that gates: parallel net checks (dns/ntp/mdns/arping), live HTTP
   probes (Snapcast JSON-RPC, metadata /health, /status), the modular
   Linux-only checks (boot_health, mounts, qos, timers, system, audio,
   env, mdns, snapcast), and the journalctl recent-errors scan.
   Production runs (firstboot, fleet-smoke, ADR-005 release gate) leave
   the flag UNSET so behaviour is unchanged. Test mocks the remaining
   Linux-only binaries (journalctl, ip, ss, vcgencmd, tc, …) as no-ops.

3. tests/test_metadata_plugin_bugs.sh
   The disconnect-path assertion used
   `awk '/finally:/,/logger.info.*disconnected/'` which matches the
   FIRST `finally:` in metadata-service.py (line 665, inside
   `fetch_album_artwork`'s connection cleanup) — not `ws_handler`'s
   own finally at line 1919. False-positive: ws_handler could
   regress and the test would still pass. Scope the awk to
   `ws_handler()`'s body by bounding it between
   `^async def ws_handler` and the next top-level `def`/`async def`.
   Also assert the extracted block is non-empty (belt-and-braces).

4. client/tests/test_discover_server.sh
   The function was renamed from `_update_server` to `_apply_server` in
   an earlier refactor; the test kept extracting the old name and
   silently broke (6 FAILs, 2 PASSes). Rewrite to extract
   `_apply_server` + `_current_host`, stub `_log` and `_compose_up`,
   and assert the same invariants: no-change → return 1, otherwise
   write .env BEFORE invoking compose so the new SNAPSERVER_HOST is
   visible to the recreate, compose-failure preserves the updated .env,
   non-watch mode writes .env but doesn't restart. 13/13 PASS.

5. scripts/deploy.sh
   `install_systemd_service` ran `docker compose down >/dev/null 2>&1`
   before `systemctl start` to ensure ExecStart picks up the
   just-written .env. That's destructive: 30-40 s of audio silence in
   `both` mode plus a full network/image teardown. Replace with
   `docker compose up -d --force-recreate` which re-evaluates compose
   + env, recreates only what actually changed, no-ops otherwise.
   Typical cost 5-10 s. The `|| true` keeps deploy.sh resilient when
   no compose project exists yet (fresh install). The systemd unit's
   own ExecStartPre mem-drift guard covers a separate edge (cgroup v2
   not active at first create); this line covers the .env-change path.

6. client/common/scripts/setup.sh
   `curl -fsSL https://get.docker.com | sh` as a last-resort fallback
   when install-docker.sh isn't found is weak — pipes an unverified
   script from the internet into root sh. prepare-sd.sh ships
   install-docker.sh on every SD layout, so its absence means the SD
   card was prepared wrong. Replace with explicit failure listing the
   candidate paths it checked and the recovery action (re-run
   prepare-sd.sh). Failing loud is safer than silently piping.

Test-suite delta after fixes (excluding test_compose_health.sh per CI):
- Server: 50 → 51 PASS, 0 FAIL (was 49 PASS + 1 FAIL = test_device_smoke
  on timeout. Now passes deterministically.)
- Client: 5/5 PASS (test_discover_server.sh was failing pre-fix, now
  green.)
- Pytest: 177/177 PASS (no change, but verified unaffected).

Test_device_smoke_parallel_network.sh also updated to tolerate the
new indentation under the SMOKE_SKIP_NETWORK guard.

No CHANGELOG entry — parallel-branch rule, accumulated for v0.7.5.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown

Ready to merge — no blockers. Three minor nits below.


Findings

LOW — Duplicate skip message in device-smoke.sh "Recent Errors" section
scripts/device-smoke.sh (lines ~989 and ~998)

When SMOKE_SKIP_NETWORK=1, the "recent-errors journalctl scan skipped" message prints twice: once in the dangling if block before section "Recent Errors", and once inside the section. The pre-section block does no extra work and the message is identical.

Suggested fix: remove the pre-section if block entirely — the in-section guard and message are sufficient.

-# Recent-errors journalctl scan is also Linux-only; same gate.
-if [[ "${SMOKE_SKIP_NETWORK:-0}" == "1" ]]; then
-    info "SMOKE_SKIP_NETWORK=1 — journalctl recent-error scan skipped"
-fi
-
 section "Recent Errors"

LOW — Non-watch mode test doesn't assert compose was NOT called
client/tests/test_discover_server.sh

The test_apply "10.0.0.5" "192.168.1.20" "false" 0 0 ... case verifies that .env is written, but never asserts that _compose_up was skipped. The stub writes to $COMPOSE_LOG on any invocation, so a regression that accidentally calls compose in non-watch mode would go undetected.

Suggested addition inside test_apply, after the watch-mode assertion block:

if [[ "$watch_mode" == "false" ]]; then
    assert_eq "" "$restarted_with" \
        "$desc: non-watch mode must not invoke compose up"
fi

LOW — Multi-line comment blocks exceed CLAUDE.md convention
scripts/common/unified-log.sh:98–104, client/common/scripts/setup.sh:540–547, several blocks in scripts/device-smoke.sh

CLAUDE.md states: "Never write multi-paragraph docstrings or multi-line comment blocks — one short line max." Several comment blocks added in this PR run 5–7 lines. The WHY in each is valid, but the verbosity is excessive by project convention.


Checklist

Item Status
set -euo pipefail present in modified scripts
Variables quoted throughout
No secrets/credentials introduced
ASCII-only output for /dev/tty1 paths
No curl | sh (security fix verified)
Audio format / named pipe paths unchanged ➖ (not touched)
SSOT docs updated for changed behaviour ➖ (no arch/port/source changes)
CHANGELOG updated ➖ (PR notes parallel-branch accumulation policy)
Test coverage for new code paths ✅ (with one gap noted above)
No untrusted input in CI run: steps
SMOKE_SKIP_NETWORK unset in production paths

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant