fix(tmux): kill process groups; flush cwd+listener sweep; status wrong-owner check#31
Open
hefgi wants to merge 4 commits into
Open
fix(tmux): kill process groups; flush cwd+listener sweep; status wrong-owner check#31hefgi wants to merge 4 commits into
hefgi wants to merge 4 commits into
Conversation
`tmux kill-session` only delivers SIGHUP to each pane's foreground process. Wrapper chains like `sh → pnpm → node → vite` reparent and survive the signal — they end up adopted by launchd/init while still holding their ports. A few `ecluse up`/`down` cycles accumulate enough orphans that the next `ecluse up` lands a service on a port already held by a previous session's zombie, silently serving the wrong worktree's content to the user. The nohup path was fixed in PR #18 with TERM→KILL grace on the whole process group. This commit applies the same pattern to tmux: - `kill_tmux` now enumerates every pane PID across all windows of the session (`tmux list-panes -s -t <session> -F '#{pane_pid}'`) and signals each as a process group via the existing `kill_process_group` helper. The session is then `tmux kill-session`'d to remove the now empty windows. - New `tmux_session_pane_pids` helper, private to the module. - New `kill_tmux_kills_whole_process_group` test mirrors the nohup regression test from PR #18 — a service that launches a backgrounded `sleep` child via `sleep 300 & echo $! > child.pid; wait`, after `kill_services` the child PID must be dead. Refs #30
`ecluse flush` previously inherited the same kill-too-narrow defect as
`ecluse down`: its tmux step only ran `tmux kill-session`, so multi-level
descendants (pnpm → node → vite → workerd) survived as orphans. With
the tmux fix from the previous commit, `down` cleans up correctly, but
flush still needs to handle the case where state.json has lost track of
sessions whose orphans never made it into a pid file in the first place.
Two new sweeps run between step 3 (docker compose down) and step 4
(worktree removal):
3a. cwd sweep: for each subdirectory under `worktree_dir`, list every
process whose cwd is inside it (via `sync::pids_in_directory`,
which wraps `lsof +d`) and TERM→KILL its process group. Runs
before worktree removal so `git worktree remove` doesn't race a
live process holding file handles. Skips flush's own PID so the
command doesn't suicide.
3b. listener sweep: enumerate every `base_port + slot*slot_stride`
and `extra_ports[].base_port + slot*slot_stride` across every
configured service and every slot 1..=max_slots. For each port,
`validate::port_listener` returns the listener PID (if any);
TERM→KILL its process group. Catches detached daemons that no
longer have an open file inside the worktree (e.g. workerd's
proxy worker) but still hold a configured port. Deduplicates
across the (service × slot × port) cross-product so a single
multi-port process is hit once.
The flush confirmation prompt is updated to warn that editors and shells
with files open in the worktree will be killed. CI workflows passing
`--yes` are unaffected.
Visibility changes:
- `sync::pids_in_directory`: private → `pub(crate)` (called from main).
- `process::kill_process_group_with_grace`: new `pub` wrapper around the
module-private `kill_process_group`, so main can drive group-kills
without reaching into private machinery.
The new sweeps don't need dedicated tests — `pids_in_directory`,
`port_listener`, and `kill_process_group` each have existing unit
coverage; the flush command composes them. A full integration test
would require provisioning a git repo with a worktree plus a
controllable subject process, which is out of proportion for the
correctness-by-composition gain.
Refs #30
When a previous session's orphan grabs the port that a new session's service was configured for, `ecluse status` previously reported the new service as healthy: the stored PID was alive (in a tmux pane) AND something was responding on the configured port. The fact that the "something" was a completely different process — serving the wrong worktree's content — was invisible. The user only noticed when a stale build appeared in their browser. Status now performs a listener-identity check for every managed native service: `validate::port_listener(port)` returns the actual listener PID, and if it's neither the stored PID nor a descendant (via `whose_pid::is_descendant`), the service is flagged `wrong_owner` and rendered as `✗ wrong owner (PID N)`. Exit code is unchanged: a wrong- owner row simply trips the existing `healthy=false → exit 1` path. `ServiceStatus` gains two fields: - `listener_pid: Option<u32>` — whoever is actually bound to the port, for diagnosis. Always populated when a port is given AND a listener was found. - `wrong_owner: bool` — true iff the listener is not the stored PID or one of its descendants. JSON output gains both fields verbatim. Human-table output renders `wrong_owner` via the new `status_str` helper, extracted from the inline closure in `cmd_status` so the four-way state machine (managed vs. unmanaged × healthy/down/wrong-owner) is unit-testable. Six new tests cover every branch including the precedence rule (wrong_owner wins over healthy=true). Visibility change: - `whose_pid::is_descendant`: private → `pub(crate)` for use in status. Docker services aren't checked — their host port is owned by dockerd or its rootless proxy, not by any process inside the container, so the listener-PID heuristic doesn't apply. Fixes #30
- CHANGELOG: three Unreleased entries for the down/flush/status fixes. - SKILL.md: new troubleshooting subsection 'Wrong content served on the configured URL after multiple up/down cycles' covering the symptom, the root cause, the 0.3.2+ status row format, and recovery on any version. - docs/src/limits.md: update the 'Process management is spawn-and-kill only' section to mention process-group kill, the setsid escape hatch, and the new status wrong-owner check.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Fixes #30 — orphaned descendants of tmux-spawned services survive
ecluse downandecluse flush, silently colliding with future sessions.What's wrong
In tmux mode,
ecluse downrunstmux kill-session, which only sends SIGHUP to each pane's foreground process. Any multi-level child tree —sh → pnpm → node → vite, plus anything that callssetsid()like workerd — reparents and survives, ending up adopted bylaunchd/initwhile still holding its ports. After a fewup/downcycles those orphans accumulate. The nextecluse upsilently lands a new service on a port already held by a zombie, so the user's browser ends up serving a different worktree's content.ecluse statusdoesn't notice — the recorded PID is still alive and something is bound to the port, so it reports✓ up.ecluse flushinherits the same defect. The nohup path was already correct (fixed in PR #18 withkill_process_group+ TERM→KILL grace) — tmux just never got the same treatment.What changed
1.
kill_tmuxnow group-kills every pane (commit 1). Beforetmux kill-session, enumerate every pane PID viatmux list-panes -s -t <session> -F '#{pane_pid}'and signal each as a process group through the existingkill_process_grouphelper. Same TERM→KILL grace (2s) as the nohup path. New unit test mirrorskill_nohup_kills_whole_process_group— a service that launches asleep & echo $! > child.pid; waitbackground child must have the child dead afterkill_services.2.
ecluse flushsweeps cwd and listener ports (commit 2). Two new steps between docker compose down and worktree removal:worktree_dir, list every PID with a file open inside it (lsof +dvia the existingsync::pids_in_directory) and group-kill it. Skips flush's own PID so the command doesn't suicide.base_port + slot*slot_strideandextra_ports[].base_port + slot*slot_strideacross allmax_slots, callvalidate::port_listener(port)for each, group-kill any listener PID found.The flush confirmation prompt warns that editors/shells with files open in worktrees will be killed;
--yesis unchanged for CI.3.
ecluse statusflags wrong-owner ports (commit 3). New listener-identity check for every managed native service: ifport_listener(port)returns a PID that's neither the stored PID nor a descendant of it (via the existingwhose_pid::is_descendant), the service is reported as✗ wrong owner (PID N)instead of healthy. JSON output gainslistener_pidandwrong_ownerfields. Exit code semantics unchanged — wrong-owner trips the existingexit 1path. Six new tests cover the four-way state matrix (managed × healthy × wrong-owner combinations, including the precedence rule that wrong-owner wins over healthy=true).4. Docs (commit 4): CHANGELOG
[Unreleased]entries citing #30, a new SKILL.md troubleshooting subsection (Wrong content served on the configured URL after multiple up/down cycles), and an updateddocs/src/limits.mdprocess-management section.Visibility changes
sync::pids_in_directory— private →pub(crate)(used by flush)whose_pid::is_descendant— private →pub(crate)(used by status)process::kill_process_group_with_grace— newpubwrapper around the existing module-privatekill_process_groupso main can drive group-killsNo new config surface. No state.json schema change. JSON output gains two additive fields. Behavior delta for tmux users: services spawned via
ecluse upnow actually die onecluse down— which is the documented contract.Verification
cargo fmt --check,cargo clippy --all-targets -- -D warnings,cargo test --bin ecluse(446 unit tests passing — 6 new) all green locally.Manual reproduction of the bug report against the fixed binary:
Test plan
ecluse down(tmux) kills pnpm wrapper chains end-to-endecluse flush --yesreapssetsid()-detached children with cwd in any worktreeecluse flush --yeskills any listener onbase_port + slot*slot_stridefor every slotecluse statusflags a hijacked port as✗ wrong owner (PID N), exit 1ecluse status --jsonincludeslistener_pidandwrong_ownerfor native services