fix(tmux): kill process groups; flush cwd+listener sweep; status wrong-owner check #31

Open
hefgi wants to merge 4 commits into main from fix/tmux-kill-process-group-and-flush-sweep

@hefgi hefgi commented Jun 16, 2026

Owner

Fixes #30 — orphaned descendants of tmux-spawned services survive ecluse down and ecluse flush, silently colliding with future sessions.

What's wrong

In tmux mode, ecluse down runs tmux kill-session, which only sends SIGHUP to each pane's foreground process. Any multi-level child tree (sh → pnpm → node → vite), plus anything that calls setsid() such as workerd, reparents and survives, ending up adopted by launchd/init while still holding its ports. After a few up/down cycles those orphans accumulate. The next ecluse up silently lands a new service on a port already held by an orphan, so the user's browser ends up serving a different worktree's content. ecluse status doesn't notice: the recorded PID is still alive and something is bound to the port, so it reports ✓ up.

ecluse flush inherits the same defect. The nohup path was already correct (fixed in PR #18 with kill_process_group + TERM→KILL grace) — tmux just never got the same treatment.

What changed

1. kill_tmux now group-kills every pane (commit 1). Before tmux kill-session, it enumerates every pane PID via tmux list-panes -s -t <session> -F '#{pane_pid}' and signals each as a process group through the existing kill_process_group helper, with the same TERM→KILL grace (2s) as the nohup path. A new unit test, kill_tmux_kills_whole_process_group, mirrors kill_nohup_kills_whole_process_group: a service launches a backgrounded child via sleep 300 & echo $! > child.pid; wait, and that child must be dead after kill_services.

2. ecluse flush sweeps cwd and listener ports (commit 2). Two new steps between docker compose down and worktree removal:

  • cwd sweep: for each subdirectory under worktree_dir, list every PID with a file open inside it (lsof +d via the existing sync::pids_in_directory) and group-kill it. Flush's own PID is skipped so the command doesn't kill itself.
  • listener sweep: enumerate every base_port + slot*slot_stride and extra_ports[].base_port + slot*slot_stride for every slot up to max_slots, call validate::port_listener(port) for each, and group-kill any listener PID found.

The flush confirmation prompt now warns that editors and shells with files open in worktrees will be killed; CI workflows passing --yes are unaffected.

3. ecluse status flags wrong-owner ports (commit 3). New listener-identity check for every managed native service: if port_listener(port) returns a PID that's neither the stored PID nor a descendant of it (via the existing whose_pid::is_descendant), the service is reported as ✗ wrong owner (PID N) instead of healthy. JSON output gains listener_pid and wrong_owner fields. Exit code semantics unchanged — wrong-owner trips the existing exit 1 path. Six new tests cover the four-way state matrix (managed × healthy × wrong-owner combinations, including the precedence rule that wrong-owner wins over healthy=true).

4. Docs (commit 4): CHANGELOG [Unreleased] entries citing #30, a new SKILL.md troubleshooting subsection (Wrong content served on the configured URL after multiple up/down cycles), and an updated docs/src/limits.md process-management section.

Visibility changes

  • sync::pids_in_directory — private → pub(crate) (used by flush)
  • whose_pid::is_descendant — private → pub(crate) (used by status)
  • process::kill_process_group_with_grace — new pub wrapper around the existing module-private kill_process_group so main can drive group-kills

No new config surface. No state.json schema change. JSON output gains two additive fields. Behavior delta for tmux users: services spawned via ecluse up now actually die on ecluse down — which is the documented contract.

Verification

cargo fmt --check, cargo clippy --all-targets -- -D warnings, cargo test --bin ecluse (446 unit tests passing — 6 new) all green locally.

Manual reproduction of the bug report against the fixed binary:

# In a pnpm-based monorepo:
ecluse up feat-a; ecluse up feat-b; ecluse up feat-c
pgrep -fl 'vite.js --port' | tee before-down.txt
for s in feat-a feat-b feat-c; do ecluse down "$s" --keep-worktree; done
pgrep -fl 'vite.js --port'   # should return nothing
# Verify status wrong-owner detection:
ecluse up feat-foo
# Kill its tracked PID by signal only (not its descendants), then bind something else on its port:
nc -l 7301 &
ecluse status feat-foo
# Expected: ✗ wrong owner (PID <nc-pid>)

Test plan

  • ecluse down (tmux) kills pnpm wrapper chains end-to-end
  • ecluse flush --yes reaps setsid()-detached children with cwd in any worktree
  • ecluse flush --yes kills any listener on base_port + slot*slot_stride for every slot
  • ecluse status flags a hijacked port as ✗ wrong owner (PID N), exit 1
  • ecluse status --json includes listener_pid and wrong_owner for native services
  • CI green (clippy --all-targets, fmt --check, test --bin ecluse)

hefgi added 4 commits June 16, 2026 16:11
`tmux kill-session` only delivers SIGHUP to each pane's foreground
process. Wrapper chains like `sh → pnpm → node → vite` reparent and
survive the signal — they end up adopted by launchd/init while still
holding their ports. A few `ecluse up`/`down` cycles accumulate enough
orphans that the next `ecluse up` lands a service on a port already
held by a previous session's orphan, silently serving the wrong
worktree's content to the user.

The nohup path was fixed in PR #18 with TERM→KILL grace on the whole
process group. This commit applies the same pattern to tmux:

- `kill_tmux` now enumerates every pane PID across all windows of the
  session (`tmux list-panes -s -t <session> -F '#{pane_pid}'`) and
  signals each as a process group via the existing `kill_process_group`
  helper. The session is then `tmux kill-session`'d to remove the now
  empty windows.
- New `tmux_session_pane_pids` helper, private to the module.
- New `kill_tmux_kills_whole_process_group` test mirrors the nohup
  regression test from PR #18: a service launches a backgrounded
  `sleep` child via `sleep 300 & echo $! > child.pid; wait`, and after
  `kill_services` the child PID must be dead.
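
A minimal sketch of the TERM→KILL group-kill pattern described above, not the code in this PR. `parse_pane_pids` and `kill_group_with_grace` are hypothetical names. The sketch declares libc's `kill` by hand to stay dependency-free, and assumes a Unix target where SIGTERM is 15 and SIGKILL is 9 (as on Linux and macOS) and that each pane PID is also its process-group ID.

```rust
use std::thread::sleep;
use std::time::{Duration, Instant};

// Hand-written declaration of libc's kill(2); std already links libc on Unix.
unsafe extern "C" {
    fn kill(pid: i32, sig: i32) -> i32;
}

const SIGTERM: i32 = 15; // assumption: Linux/macOS numbering
const SIGKILL: i32 = 9;

/// Parse the output of `tmux list-panes -s -t <session> -F '#{pane_pid}'`,
/// which prints one PID per line. Blank or malformed lines are skipped.
fn parse_pane_pids(output: &str) -> Vec<i32> {
    output
        .lines()
        .filter_map(|line| line.trim().parse::<i32>().ok())
        // kill(-1, ..) would signal every process we may signal; kill(0, ..)
        // targets our own group. Refuse both.
        .filter(|pid| *pid > 1)
        .collect()
}

/// Hypothetical stand-in for the group-kill helper: TERM the whole group,
/// poll until `grace` elapses, then KILL whatever is left.
/// Returns true if the group disappeared before SIGKILL was needed.
fn kill_group_with_grace(pgid: i32, grace: Duration) -> bool {
    if pgid <= 1 {
        return false; // refuse dangerous targets
    }
    // A negative pid addresses the process group rather than one process.
    unsafe { kill(-pgid, SIGTERM) };
    let deadline = Instant::now() + grace;
    while Instant::now() < deadline {
        // Signal 0 only probes; any error (normally ESRCH) is treated as "gone".
        if unsafe { kill(-pgid, 0) } != 0 {
            return true;
        }
        sleep(Duration::from_millis(50));
    }
    unsafe { kill(-pgid, SIGKILL) };
    false
}
```

Under these assumptions, a caller would feed each PID from `parse_pane_pids` to `kill_group_with_grace` with a 2s grace before running `tmux kill-session`.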

Refs #30

`ecluse flush` previously inherited the same kill-too-narrow defect as
`ecluse down`: its tmux step only ran `tmux kill-session`, so multi-level
descendants (pnpm → node → vite → workerd) survived as orphans. With
the tmux fix from the previous commit, `down` cleans up correctly, but
flush still needs to handle the case where state.json has lost track of
sessions whose orphans never made it into a pid file in the first place.

Two new sweeps run between step 3 (docker compose down) and step 4
(worktree removal):

  3a. cwd sweep: for each subdirectory under `worktree_dir`, list every
      process with a file open inside it (via `sync::pids_in_directory`,
      which wraps `lsof +d`) and TERM→KILL its process group. Runs
      before worktree removal so `git worktree remove` doesn't race a
      live process holding file handles. Skips flush's own PID so the
      command doesn't kill itself.

  3b. listener sweep: enumerate every `base_port + slot*slot_stride`
      and `extra_ports[].base_port + slot*slot_stride` across every
      configured service and every slot 1..=max_slots. For each port,
      `validate::port_listener` returns the listener PID (if any);
      TERM→KILL its process group. Catches detached daemons that no
      longer have an open file inside the worktree (e.g. workerd's
      proxy worker) but still hold a configured port. Deduplicates
      across the (service × slot × port) cross-product so a single
      multi-port process is hit once.
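
The listener sweep's enumeration and deduplication can be sketched as below. This is an illustration under stated assumptions, not the PR's code: `ServicePorts`, `sweep_ports`, and `listener_pids` are hypothetical names, `slot_stride` is assumed to be a single global value, and the `port_listener` closure stands in for `validate::port_listener`.

```rust
use std::collections::BTreeSet;

/// Hypothetical, trimmed-down view of one configured service's ports.
struct ServicePorts {
    base_port: u16,
    extra_base_ports: Vec<u16>, // stands in for extra_ports[].base_port
}

/// Every port that slots 1..=max_slots could occupy, deduplicated.
fn sweep_ports(services: &[ServicePorts], slot_stride: u16, max_slots: u16) -> BTreeSet<u16> {
    let mut ports = BTreeSet::new();
    for svc in services {
        let bases = std::iter::once(svc.base_port).chain(svc.extra_base_ports.iter().copied());
        for base in bases {
            for slot in 1..=max_slots {
                // base + slot * slot_stride, skipping anything past the u16 port range.
                let port = slot
                    .checked_mul(slot_stride)
                    .and_then(|offset| base.checked_add(offset));
                if let Some(port) = port {
                    ports.insert(port);
                }
            }
        }
    }
    ports
}

/// Resolve each port to its listener PID (if any). Collecting into a set means
/// a process that holds several configured ports appears only once.
fn listener_pids(
    ports: &BTreeSet<u16>,
    port_listener: impl Fn(u16) -> Option<u32>,
) -> BTreeSet<u32> {
    ports.iter().filter_map(|&port| port_listener(port)).collect()
}
```

Each PID in the resulting set would then be group-killed once, regardless of how many configured ports it holds.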

The flush confirmation prompt is updated to warn that editors and shells
with files open in the worktree will be killed. CI workflows passing
`--yes` are unaffected.

Visibility changes:
- `sync::pids_in_directory`: private → `pub(crate)` (called from main).
- `process::kill_process_group_with_grace`: new `pub` wrapper around the
  module-private `kill_process_group`, so main can drive group-kills
  without reaching into private machinery.

The new sweeps don't need dedicated tests — `pids_in_directory`,
`port_listener`, and `kill_process_group` each have existing unit
coverage; the flush command composes them. A full integration test
would require provisioning a git repo with a worktree plus a
controllable subject process, which is out of proportion for the
correctness-by-composition gain.

Refs #30

When a previous session's orphan grabs the port that a new session's
service was configured for, `ecluse status` previously reported the new
service as healthy: the stored PID was alive (in a tmux pane) AND
something was responding on the configured port. The fact that the
"something" was a completely different process — serving the wrong
worktree's content — was invisible. The user only noticed when a stale
build appeared in their browser.

Status now performs a listener-identity check for every managed native
service: `validate::port_listener(port)` returns the actual listener
PID, and if it's neither the stored PID nor a descendant (via
`whose_pid::is_descendant`), the service is flagged `wrong_owner` and
rendered as `✗ wrong owner (PID N)`. Exit code is unchanged: a wrong-
owner row simply trips the existing `healthy=false → exit 1` path.

`ServiceStatus` gains two fields:
- `listener_pid: Option<u32>` — whoever is actually bound to the port,
  for diagnosis. Always populated when a port is given AND a listener
  was found.
- `wrong_owner: bool` — true iff the listener is not the stored PID or
  one of its descendants.
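
As a rough illustration of the listener-identity rule, the sketch below computes `wrong_owner` from a pid → parent-pid map. The map is a hypothetical snapshot of the process table; how `whose_pid::is_descendant` actually obtains parent links is not shown in this PR text, and treating "no listener" as not-wrong-owner is an assumption.

```rust
use std::collections::HashMap;

/// Walk parent links upward from `pid`; true if `ancestor` is reached.
/// `parents` maps pid -> parent pid (hypothetical process-table snapshot).
fn is_descendant(pid: u32, ancestor: u32, parents: &HashMap<u32, u32>) -> bool {
    let mut current = pid;
    // Bound the walk so a cyclic or corrupt snapshot cannot loop forever.
    for _ in 0..parents.len() {
        match parents.get(&current) {
            Some(&ppid) if ppid == ancestor => return true,
            Some(&ppid) => current = ppid,
            None => return false,
        }
    }
    false
}

/// True iff something is listening and it is neither the stored PID
/// nor one of its descendants.
fn wrong_owner(stored_pid: u32, listener_pid: Option<u32>, parents: &HashMap<u32, u32>) -> bool {
    match listener_pid {
        // Assumption: no listener means "down", not "wrong owner".
        None => false,
        Some(listener) => {
            listener != stored_pid && !is_descendant(listener, stored_pid, parents)
        }
    }
}
```

With a stored PID of 100 and a chain 300 → 200 → 100, a listener at PID 300 counts as owned, while a listener whose parent chain never reaches 100 is flagged.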

JSON output gains both fields verbatim. Human-table output renders
`wrong_owner` via the new `status_str` helper, extracted from the
inline closure in `cmd_status` so the four-way state machine (managed
vs. unmanaged × healthy/down/wrong-owner) is unit-testable. Six new
tests cover every branch including the precedence rule (wrong_owner
wins over healthy=true).
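
The precedence rule can be illustrated with a small sketch. It is not the `status_str` from this commit: the `managed` and `healthy` fields and the `✗ down` string are placeholders assumed for illustration; only `✓ up` and `✗ wrong owner (PID N)` appear in this PR.

```rust
/// Hypothetical subset of the status row; `managed` and `healthy` are
/// assumed field names, `listener_pid` and `wrong_owner` come from the PR.
struct ServiceStatus {
    managed: bool,
    healthy: bool,
    wrong_owner: bool,
    listener_pid: Option<u32>,
}

/// Render one status cell. The wrong-owner branch is checked first, so it
/// wins even when `healthy` is true.
fn status_str(s: &ServiceStatus) -> String {
    if s.managed && s.wrong_owner {
        return match s.listener_pid {
            Some(pid) => format!("✗ wrong owner (PID {})", pid),
            None => "✗ wrong owner".to_string(),
        };
    }
    if s.healthy {
        "✓ up".to_string()
    } else {
        "✗ down".to_string() // placeholder wording
    }
}
```

Checking `wrong_owner` before `healthy` is what keeps a hijacked port from being reported as up.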

Visibility change:
- `whose_pid::is_descendant`: private → `pub(crate)` for use in status.

Docker services aren't checked — their host port is owned by dockerd
or its rootless proxy, not by any process inside the container, so
the listener-PID heuristic doesn't apply.

Fixes #30

- CHANGELOG: three Unreleased entries for the down/flush/status fixes.
- SKILL.md: new troubleshooting subsection 'Wrong content served on the
  configured URL after multiple up/down cycles' covering the symptom,
  the root cause, the 0.3.2+ status row format, and recovery on any
  version.
- docs/src/limits.md: update the 'Process management is spawn-and-kill
  only' section to mention process-group kill, the setsid escape hatch,
  and the new status wrong-owner check.
Development

Successfully merging this pull request may close these issues.

ecluse down kills top-level pnpm wrapper but vite descendants survive as orphans; ecluse flush also misses them

1 participant