fix(tmux): kill process groups; flush cwd+listener sweep; status wrong-owner check by hefgi · Pull Request #31 · hefgi/ecluse

hefgi · 2026-06-16T15:21:17Z

Fixes #30 — orphaned descendants of tmux-spawned services survive ecluse down and ecluse flush, silently colliding with future sessions.

What's wrong

In tmux mode, ecluse down runs tmux kill-session, which only sends SIGHUP to each pane's foreground process. Any multi-level child tree — sh → pnpm → node → vite, plus anything that calls setsid() like workerd — reparents and survives, ending up adopted by launchd/init while still holding its ports. After a few up/down cycles those orphans accumulate. The next ecluse up silently lands a new service on a port already held by a zombie, so the user's browser ends up serving a different worktree's content. ecluse status doesn't notice — the recorded PID is still alive and something is bound to the port, so it reports ✓ up.

ecluse flush inherits the same defect. The nohup path was already correct (fixed in PR #18 with kill_process_group + TERM→KILL grace) — tmux just never got the same treatment.

What changed

1. kill_tmux now group-kills every pane (commit 1). Before tmux kill-session, enumerate every pane PID via tmux list-panes -s -t <session> -F '#{pane_pid}' and signal each as a process group through the existing kill_process_group helper. Same TERM→KILL grace (2s) as the nohup path. New unit test mirrors kill_nohup_kills_whole_process_group — a service that launches a sleep & echo $! > child.pid; wait background child must have the child dead after kill_services.

2. ecluse flush sweeps cwd and listener ports (commit 2). Two new steps between docker compose down and worktree removal:

cwd sweep: for each subdirectory under worktree_dir, list every PID with a file open inside it (lsof +d via the existing sync::pids_in_directory) and group-kill it. Skips flush's own PID so the command doesn't suicide.
listener sweep: enumerate every base_port + slot*slot_stride and extra_ports[].base_port + slot*slot_stride across all max_slots, call validate::port_listener(port) for each, group-kill any listener PID found.
The flush confirmation prompt warns that editors/shells with files open in worktrees will be killed; --yes is unchanged for CI.

3. ecluse status flags wrong-owner ports (commit 3). New listener-identity check for every managed native service: if port_listener(port) returns a PID that's neither the stored PID nor a descendant of it (via the existing whose_pid::is_descendant), the service is reported as ✗ wrong owner (PID N) instead of healthy. JSON output gains listener_pid and wrong_owner fields. Exit code semantics unchanged — wrong-owner trips the existing exit 1 path. Six new tests cover the four-way state matrix (managed × healthy × wrong-owner combinations, including the precedence rule that wrong-owner wins over healthy=true).

4. Docs (commit 4): CHANGELOG [Unreleased] entries citing #30, a new SKILL.md troubleshooting subsection (Wrong content served on the configured URL after multiple up/down cycles), and an updated docs/src/limits.md process-management section.

Visibility changes

sync::pids_in_directory — private → pub(crate) (used by flush)
whose_pid::is_descendant — private → pub(crate) (used by status)
process::kill_process_group_with_grace — new pub wrapper around the existing module-private kill_process_group so main can drive group-kills

No new config surface. No state.json schema change. JSON output gains two additive fields. Behavior delta for tmux users: services spawned via ecluse up now actually die on ecluse down — which is the documented contract.

Verification

cargo fmt --check, cargo clippy --all-targets -- -D warnings, cargo test --bin ecluse (446 unit tests passing — 6 new) all green locally.

Manual reproduction of the bug report against the fixed binary:

# In a pnpm-based monorepo:
ecluse up feat-a; ecluse up feat-b; ecluse up feat-c
pgrep -fl 'vite.js --port' | tee before-down.txt
for s in feat-a feat-b feat-c; do ecluse down "$s" --keep-worktree; done
pgrep -fl 'vite.js --port'   # should return nothing

# Verify status wrong-owner detection:
ecluse up feat-foo
# Kill its tracked PID by signal only (not its descendants), then bind something else on its port:
nc -l 7301 &
ecluse status feat-foo
# Expected: ✗ wrong owner (PID <nc-pid>)

Test plan

ecluse down (tmux) kills pnpm wrapper chains end-to-end
ecluse flush --yes reaps setsid()-detached children with cwd in any worktree
ecluse flush --yes kills any listener on base_port + slot*slot_stride for every slot
ecluse status flags a hijacked port as ✗ wrong owner (PID N), exit 1
ecluse status --json includes listener_pid and wrong_owner for native services
CI green (clippy --all-targets, fmt --check, test --bin ecluse)

`tmux kill-session` only delivers SIGHUP to each pane's foreground process. Wrapper chains like `sh → pnpm → node → vite` reparent and survive the signal — they end up adopted by launchd/init while still holding their ports. A few `ecluse up`/`down` cycles accumulate enough orphans that the next `ecluse up` lands a service on a port already held by a previous session's zombie, silently serving the wrong worktree's content to the user. The nohup path was fixed in PR #18 with TERM→KILL grace on the whole process group. This commit applies the same pattern to tmux: - `kill_tmux` now enumerates every pane PID across all windows of the session (`tmux list-panes -s -t <session> -F '#{pane_pid}'`) and signals each as a process group via the existing `kill_process_group` helper. The session is then `tmux kill-session`'d to remove the now empty windows. - New `tmux_session_pane_pids` helper, private to the module. - New `kill_tmux_kills_whole_process_group` test mirrors the nohup regression test from PR #18 — a service that launches a backgrounded `sleep` child via `sleep 300 & echo $! > child.pid; wait`, after `kill_services` the child PID must be dead. Refs #30

`ecluse flush` previously inherited the same kill-too-narrow defect as `ecluse down`: its tmux step only ran `tmux kill-session`, so multi-level descendants (pnpm → node → vite → workerd) survived as orphans. With the tmux fix from the previous commit, `down` cleans up correctly, but flush still needs to handle the case where state.json has lost track of sessions whose orphans never made it into a pid file in the first place. Two new sweeps run between step 3 (docker compose down) and step 4 (worktree removal): 3a. cwd sweep: for each subdirectory under `worktree_dir`, list every process whose cwd is inside it (via `sync::pids_in_directory`, which wraps `lsof +d`) and TERM→KILL its process group. Runs before worktree removal so `git worktree remove` doesn't race a live process holding file handles. Skips flush's own PID so the command doesn't suicide. 3b. listener sweep: enumerate every `base_port + slot*slot_stride` and `extra_ports[].base_port + slot*slot_stride` across every configured service and every slot 1..=max_slots. For each port, `validate::port_listener` returns the listener PID (if any); TERM→KILL its process group. Catches detached daemons that no longer have an open file inside the worktree (e.g. workerd's proxy worker) but still hold a configured port. Deduplicates across the (service × slot × port) cross-product so a single multi-port process is hit once. The flush confirmation prompt is updated to warn that editors and shells with files open in the worktree will be killed. CI workflows passing `--yes` are unaffected. Visibility changes: - `sync::pids_in_directory`: private → `pub(crate)` (called from main). - `process::kill_process_group_with_grace`: new `pub` wrapper around the module-private `kill_process_group`, so main can drive group-kills without reaching into private machinery. The new sweeps don't need dedicated tests — `pids_in_directory`, `port_listener`, and `kill_process_group` each have existing unit coverage; the flush command composes them. A full integration test would require provisioning a git repo with a worktree plus a controllable subject process, which is out of proportion for the correctness-by-composition gain. Refs #30

When a previous session's orphan grabs the port that a new session's service was configured for, `ecluse status` previously reported the new service as healthy: the stored PID was alive (in a tmux pane) AND something was responding on the configured port. The fact that the "something" was a completely different process — serving the wrong worktree's content — was invisible. The user only noticed when a stale build appeared in their browser. Status now performs a listener-identity check for every managed native service: `validate::port_listener(port)` returns the actual listener PID, and if it's neither the stored PID nor a descendant (via `whose_pid::is_descendant`), the service is flagged `wrong_owner` and rendered as `✗ wrong owner (PID N)`. Exit code is unchanged: a wrong- owner row simply trips the existing `healthy=false → exit 1` path. `ServiceStatus` gains two fields: - `listener_pid: Option<u32>` — whoever is actually bound to the port, for diagnosis. Always populated when a port is given AND a listener was found. - `wrong_owner: bool` — true iff the listener is not the stored PID or one of its descendants. JSON output gains both fields verbatim. Human-table output renders `wrong_owner` via the new `status_str` helper, extracted from the inline closure in `cmd_status` so the four-way state machine (managed vs. unmanaged × healthy/down/wrong-owner) is unit-testable. Six new tests cover every branch including the precedence rule (wrong_owner wins over healthy=true). Visibility change: - `whose_pid::is_descendant`: private → `pub(crate)` for use in status. Docker services aren't checked — their host port is owned by dockerd or its rootless proxy, not by any process inside the container, so the listener-PID heuristic doesn't apply. Fixes #30

- CHANGELOG: three Unreleased entries for the down/flush/status fixes. - SKILL.md: new troubleshooting subsection 'Wrong content served on the configured URL after multiple up/down cycles' covering the symptom, the root cause, the 0.3.2+ status row format, and recovery on any version. - docs/src/limits.md: update the 'Process management is spawn-and-kill only' section to mention process-group kill, the setsid escape hatch, and the new status wrong-owner check.

hefgi added 4 commits June 16, 2026 16:11

hefgi mentioned this pull request Jun 16, 2026

ecluse down kills top-level pnpm wrapper but vite descendants survive as orphans; ecluse flush also misses them #30

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(tmux): kill process groups; flush cwd+listener sweep; status wrong-owner check#31

fix(tmux): kill process groups; flush cwd+listener sweep; status wrong-owner check#31
hefgi wants to merge 4 commits into
mainfrom
fix/tmux-kill-process-group-and-flush-sweep

hefgi commented Jun 16, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

hefgi commented Jun 16, 2026

What's wrong

What changed

Visibility changes

Verification

Test plan

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant