
Integrity scan per-task timeout + defensive worker init (v2.6.42)#57

Merged
ttlequals0 merged 2 commits into main from fix/integrity-scan-hang-2.6.42 on May 1, 2026
Conversation

@ttlequals0 (Owner)

Summary

  • Production integrity scan was stuck at 5000 active with zero counter movement for 17+ minutes (Loki heartbeat repeating identically). Root cause: when a Celery worker dies mid-scan, the in-flight calculate_file_hash_task IDs sit in Redis as PENDING forever, and _run_file_changes_check had no per-task timeout to abandon them. safe_task_ready deliberately returns False on Redis errors, so dead tasks were retained in active_tasks indefinitely.
  • Stamp every active_tasks entry with a time.monotonic() submitted_at stamp. Entries older than INTEGRITY_TASK_TIMEOUT_SECS (default 1800 s, env-overridable) are revoked, logged at WARNING, and dropped from the active set so the producer can advance (a sketch follows this list). Healthy tasks under the threshold keep their existing transient-Redis-retry behavior.
  • Replace the misleading cumulative files_queued heartbeat label with remaining (unsubmitted) and abandoned counts.
  • _setup_worker_process now logs entry/exit and wraps db.engine.dispose() in try/except, so a fork-time failure surfaces a stack trace instead of silently leaving the child unable to log task activity (regression observed post v2.6.41 where the worker container went silent for ~9 hours).
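
A minimal sketch of the abandonment branch described above, assuming active_tasks maps task IDs to a dict holding the Celery AsyncResult and the monotonic stamp (the reap_stuck_tasks helper name and exact dict shape are illustrative, not the PR's literal code):

```python
import logging
import os
import time

logger = logging.getLogger(__name__)

# Default 1800 s (30 min), overridable via the environment (per this PR).
INTEGRITY_TASK_TIMEOUT_SECS = int(os.environ.get("INTEGRITY_TASK_TIMEOUT_SECS", "1800"))


def reap_stuck_tasks(active_tasks: dict) -> int:
    """Revoke and drop entries older than the timeout; return the abandoned count.

    active_tasks: task_id -> {"result": AsyncResult, "submitted_at": float},
    where submitted_at is the time.monotonic() stamp taken at submission.
    """
    abandoned = 0
    now = time.monotonic()
    for task_id in list(active_tasks):  # copy keys so we can delete while iterating
        age = now - active_tasks[task_id]["submitted_at"]
        if age > INTEGRITY_TASK_TIMEOUT_SECS:
            logger.warning("Abandoning stuck hash task %s after %.0f s", task_id, age)
            active_tasks[task_id]["result"].revoke()  # best-effort; the worker may be dead
            del active_tasks[task_id]
            abandoned += 1
    return abandoned
```

Tasks under the threshold are never touched here, so the existing transient-Redis-error retry path still applies to them.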

Version

2.6.42 (was 2.6.41).

Test plan

  • Full pytest suite: 326 passed, 8 skipped (parity with v2.6.41 + 6 new tests)
  • Docker image built for linux/amd64 and pushed: ttlequals0/pixelprobe:2.6.42 (digest sha256:b8eb44bb2801dce8b62267aad5f56d847a26f6e27e3e205c43561c7a2f92b0e3)
  • Trivy clean of HIGH/CRITICAL severities
  • Portainer webhook fired for production deploy (HTTP 204)
  • Confirm /api/version reports 2.6.42
  • Confirm "_setup_worker_process: starting" / "complete in worker pid=..." lines appear in pixelprobe-celery-worker logs after container restart
  • Confirm a fresh integrity scan progresses (heartbeat counters move every 30 s)

ttlequals0 added 2 commits May 1, 2026 11:58
…2.6.42)

The integrity scan producer in `_run_file_changes_check` had no per-task
timeout, so when Celery workers died mid-scan the 5,000 in-flight
`calculate_file_hash_task` IDs remained PENDING in Redis forever and the
loop pinned at MAX_CONCURRENT_SMALL with no submissions or completions.
`safe_task_ready` was deliberately built to return False on Redis errors,
which kept stuck tasks in `active_tasks` indefinitely.

- Stamp every `active_tasks` entry with a `time.monotonic()` `submitted_at`.
  Entries older than `INTEGRITY_TASK_TIMEOUT_SECS` (default 1800, env-overridable)
  are revoked, logged at WARNING, and dropped from the active set so the
  producer can advance. Healthy tasks under the timeout retain their
  existing transient-Redis-error retry behavior.
- Replace the misleading cumulative `files_queued` count in the heartbeat
  with explicit `remaining` (unsubmitted files) and `abandoned` counts,
  e.g.: `Progress: 225881/1167919 processed, 5000 active, 937038 remaining,
  0 abandoned`.
- `_setup_worker_process` now logs entry/exit and wraps `db.engine.dispose()`
  in a try/except so a fork-time failure surfaces a stack trace instead of
  silently leaving the child unable to log task activity (regression
  observed post v2.6.41 where the worker container went silent for ~9
  hours after a restart). See the sketch after this list.
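
A hedged sketch of the defensive init: the worker_process_init signal hookup is standard Celery, but the `db` import path is an assumption about this app's layout, and the log wording simply mirrors the lines named in the test plan:

```python
import logging
import os

from celery.signals import worker_process_init

logger = logging.getLogger(__name__)


@worker_process_init.connect
def _setup_worker_process(**kwargs):
    """Runs in each forked worker child; must never fail silently."""
    logger.info("_setup_worker_process: starting in worker pid=%s", os.getpid())
    try:
        from app import db  # hypothetical import path for the Flask-SQLAlchemy handle

        # Drop connections inherited across fork(); children must not share
        # the parent's database sockets.
        db.engine.dispose()
    except Exception:
        # Surface a stack trace instead of leaving the child unable to log.
        logger.exception("_setup_worker_process: db.engine.dispose() failed")
    logger.info("_setup_worker_process: complete in worker pid=%s", os.getpid())
```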

New unit tests in `tests/unit/test_maintenance_service.py` cover the
INTEGRITY_TASK_TIMEOUT_SECS env override, the `active_tasks` dict-shape
contract, and the monotonic age math used by the abandonment branch.
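
A sketch of what two of those tests might look like; the import path and the get_task_timeout_secs helper are hypothetical stand-ins for however the service actually exposes the setting:

```python
import time


def test_timeout_env_override(monkeypatch):
    # Assumes the service re-reads the env var through a helper rather than
    # caching it at import time; names here are illustrative.
    monkeypatch.setenv("INTEGRITY_TASK_TIMEOUT_SECS", "60")
    from pixelprobe.services import maintenance_service  # hypothetical path

    assert maintenance_service.get_task_timeout_secs() == 60


def test_monotonic_age_math():
    # The abandonment branch compares a monotonic delta against the timeout.
    submitted_at = time.monotonic() - 1801  # a task just past the 1800 s default
    assert time.monotonic() - submitted_at > 1800
```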

All 326 tests pass (parity with v2.6.41 baseline + 6 new). Trivy clean of
HIGH/CRITICAL on `ttlequals0/pixelprobe:2.6.42`.
…is fallback

Production showed the integrity-scan UI frozen at "75,000 of 1,167,919" while
the backend was advancing past 728,169 at ~285 files/sec. Two related
issues, both fixed in v2.6.42:

- The Phase 2 producer loop's heartbeat block only wrote `last_heartbeat` to
  the FileChangesState row. Multi-file Phase 2 left `phase_current` and
  `files_processed` frozen at their initialization value, so the UI poll
  never saw motion.
- The periodic-update block fired only when `total_files_processed % 100 == 0`.
  At 5000-active steady state the producer batches thousands of completions
  per outer-loop iteration, almost never landing exactly on a multiple of
  100. After the lucky alignment at 75,000 the modulo check stayed false for
  the rest of the scan (a toy reproduction follows this list).
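
To make the modulo failure concrete, a toy reproduction (the batch sizes are illustrative, not measured):

```python
# With batched completions the counter rarely lands exactly on a multiple of
# 100, so `total % 100 == 0` almost never fires after a lucky alignment.
total, updates_modulo, updates_delta, last_update = 0, 0, 0, 0
for batch in [75_000, 3_217, 4_981, 2_650, 5_003]:  # illustrative batch sizes
    total += batch
    if total % 100 == 0:             # old check: exact multiples only
        updates_modulo += 1
    if total - last_update >= 100:   # new check: fires whenever >=100 files advanced
        updates_delta += 1
        last_update = total
print(updates_modulo, updates_delta)  # 1 5: modulo fired once (at 75,000), delta every batch
```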

Replaced the modulo with a delta check (a `last_progress_update` tracker),
made the heartbeat (now 10 s) unconditionally write `phase_current`,
`files_processed`, and `progress_message`, and consolidated the two blocks
into a single `write_progress_snapshot` closure to prevent drift.

Also added Redis-backed real-time progress for the integrity scan,
mirroring the v2.5.67 pattern used by the regular scan: new helpers in
`progress_utils.py` (`get/update/clear_file_changes_progress_redis`, under
a separate `file_changes_progress:` key namespace), the producer writes to
Redis on every heartbeat and periodic tick, and `/api/file-changes-status`
prefers the Redis values while the scan is active (a sketch of the helpers
follows this paragraph).
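
A minimal sketch of the Redis helpers, assuming a redis-py client and JSON-serialized progress dicts; the exact signatures in progress_utils.py may differ, and the TTL is an assumed safeguard, not taken from the PR:

```python
import json

KEY_PREFIX = "file_changes_progress:"  # namespace kept separate from the regular scan


def update_file_changes_progress_redis(redis_client, scan_id, progress: dict, ttl=3600):
    # setex gives the snapshot a TTL so a crashed scan cannot leave stale keys.
    redis_client.setex(f"{KEY_PREFIX}{scan_id}", ttl, json.dumps(progress))


def get_file_changes_progress_redis(redis_client, scan_id):
    raw = redis_client.get(f"{KEY_PREFIX}{scan_id}")
    return json.loads(raw) if raw else None


def clear_file_changes_progress_redis(redis_client, scan_id):
    redis_client.delete(f"{KEY_PREFIX}{scan_id}")
```

/api/file-changes-status can then prefer the Redis snapshot over the FileChangesState row whenever the scan is active, falling back to the row when the key is absent.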

Tests: 9 new unit tests covering the Redis round-trip, the dict-shape
contract, and the delta-vs-modulo semantics. Full suite: 335 passed, 8
skipped.
ttlequals0 merged commit 701d9b4 into main on May 1, 2026
10 checks passed
ttlequals0 deleted the fix/integrity-scan-hang-2.6.42 branch on May 1, 2026 at 20:57