
Integrity scan per-task timeout + defensive worker init (v2.6.42)#57

Merged
ttlequals0 merged 2 commits into main from fix/integrity-scan-hang-2.6.42 on May 1, 2026
Conversation

@ttlequals0 (Owner)

Summary

  • Production integrity scan was stuck at 5000 active with zero counter movement for 17+ minutes (Loki heartbeat repeating identically). Root cause: when a Celery worker dies mid-scan, the in-flight calculate_file_hash_task IDs sit in Redis as PENDING forever, and _run_file_changes_check had no per-task timeout to abandon them. safe_task_ready deliberately returns False on Redis errors, so dead tasks were retained in active_tasks indefinitely.
  • Stamp every active_tasks entry with a time.monotonic() submitted_at stamp. Entries older than INTEGRITY_TASK_TIMEOUT_SECS (default 1800 s, env-overridable) are revoked, logged at WARNING, and dropped from the active set so the producer can advance (a sketch follows this list). Healthy tasks under the threshold keep their existing transient-Redis-retry behavior.
  • Replace the misleading cumulative files_queued heartbeat label with remaining (unsubmitted) and abandoned counts.
  • _setup_worker_process now logs entry/exit and wraps db.engine.dispose() in try/except, so a fork-time failure surfaces a stack trace instead of silently leaving the child unable to log task activity (regression observed post v2.6.41 where the worker container went silent for ~9 hours).
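
A minimal sketch of the abandonment branch described above, assuming active_tasks maps task IDs to a dict holding the Celery AsyncResult and the monotonic stamp (the reap_stuck_tasks helper name and exact dict shape are illustrative, not the PR's literal code):

```python
import logging
import os
import time

logger = logging.getLogger(__name__)

# Default 1800 s (30 min), overridable via the environment (per this PR).
INTEGRITY_TASK_TIMEOUT_SECS = int(os.environ.get("INTEGRITY_TASK_TIMEOUT_SECS", "1800"))


def reap_stuck_tasks(active_tasks: dict) -> int:
    """Revoke and drop entries older than the timeout; return the abandoned count.

    active_tasks: task_id -> {"result": AsyncResult, "submitted_at": float},
    where submitted_at is the time.monotonic() stamp taken at submission.
    """
    abandoned = 0
    now = time.monotonic()
    for task_id in list(active_tasks):  # copy keys so we can delete while iterating
        age = now - active_tasks[task_id]["submitted_at"]
        if age > INTEGRITY_TASK_TIMEOUT_SECS:
            logger.warning("Abandoning stuck hash task %s after %.0f s", task_id, age)
            active_tasks[task_id]["result"].revoke()  # best-effort; the worker may be dead
            del active_tasks[task_id]
            abandoned += 1
    return abandoned
```

Tasks under the threshold are never touched here, so the existing transient-Redis-error retry path still applies to them.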

Version

2.6.42 (was 2.6.41).

Test plan

  • Full pytest suite: 326 passed, 8 skipped (parity with v2.6.41 + 6 new tests)
  • Docker image built for linux/amd64 and pushed: ttlequals0/pixelprobe:2.6.42 (digest sha256:b8eb44bb2801dce8b62267aad5f56d847a26f6e27e3e205c43561c7a2f92b0e3)
  • Trivy clean of HIGH/CRITICAL severities
  • Portainer webhook fired for production deploy (HTTP 204)
  • Confirm /api/version reports 2.6.42
  • Confirm "_setup_worker_process: starting" / "complete in worker pid=..." lines appear in pixelprobe-celery-worker logs after container restart
  • Confirm a fresh integrity scan progresses (heartbeat counters move every 30 s)

ttlequals0 added 2 commits May 1, 2026 11:58
…2.6.42)

The integrity scan producer in `_run_file_changes_check` had no per-task
timeout, so when Celery workers died mid-scan the 5,000 in-flight
`calculate_file_hash_task` IDs remained PENDING in Redis forever and the
loop pinned at MAX_CONCURRENT_SMALL with no submissions or completions.
`safe_task_ready` was deliberately built to return False on Redis errors,
which kept stuck tasks in `active_tasks` indefinitely.

- Stamp every `active_tasks` entry with a `time.monotonic()` `submitted_at`.
  Entries older than `INTEGRITY_TASK_TIMEOUT_SECS` (default 1800, env-overridable)
  are revoked, logged at WARNING, and dropped from the active set so the
  producer can advance. Healthy tasks under the timeout retain their
  existing transient-Redis-error retry behavior.
- Replace the misleading cumulative `files_queued` count in the heartbeat
  with explicit `remaining` (unsubmitted files) and `abandoned` counts,
  e.g.: `Progress: 225881/1167919 processed, 5000 active, 937038 remaining,
  0 abandoned`.
- `_setup_worker_process` now logs entry/exit and wraps `db.engine.dispose()`
  in a try/except so a fork-time failure surfaces a stack trace instead of
  silently leaving the child unable to log task activity (regression
  observed post v2.6.41 where the worker container went silent for ~9
  hours after a restart). See the sketch after this list.
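
A hedged sketch of the defensive init: the worker_process_init signal hookup is standard Celery, but the `db` import path is an assumption about this app's layout, and the log wording simply mirrors the lines named in the test plan:

```python
import logging
import os

from celery.signals import worker_process_init

logger = logging.getLogger(__name__)


@worker_process_init.connect
def _setup_worker_process(**kwargs):
    """Runs in each forked worker child; must never fail silently."""
    logger.info("_setup_worker_process: starting in worker pid=%s", os.getpid())
    try:
        from app import db  # hypothetical import path for the Flask-SQLAlchemy handle

        # Drop connections inherited across fork(); children must not share
        # the parent's database sockets.
        db.engine.dispose()
    except Exception:
        # Surface a stack trace instead of leaving the child unable to log.
        logger.exception("_setup_worker_process: db.engine.dispose() failed")
    logger.info("_setup_worker_process: complete in worker pid=%s", os.getpid())
```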

New unit tests in `tests/unit/test_maintenance_service.py` cover the
INTEGRITY_TASK_TIMEOUT_SECS env override, the `active_tasks` dict-shape
contract, and the monotonic age math used by the abandonment branch.
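
A sketch of what two of those tests might look like; the import path and the get_task_timeout_secs helper are hypothetical stand-ins for however the service actually exposes the setting:

```python
import time


def test_timeout_env_override(monkeypatch):
    # Assumes the service re-reads the env var through a helper rather than
    # caching it at import time; names here are illustrative.
    monkeypatch.setenv("INTEGRITY_TASK_TIMEOUT_SECS", "60")
    from pixelprobe.services import maintenance_service  # hypothetical path

    assert maintenance_service.get_task_timeout_secs() == 60


def test_monotonic_age_math():
    # The abandonment branch compares a monotonic delta against the timeout.
    submitted_at = time.monotonic() - 1801  # a task just past the 1800 s default
    assert time.monotonic() - submitted_at > 1800
```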

All 326 tests pass (parity with v2.6.41 baseline + 6 new). Trivy clean of
HIGH/CRITICAL on `ttlequals0/pixelprobe:2.6.42`.
…is fallback

Production showed the integrity-scan UI frozen at "75,000 of 1,167,919" while
the backend was advancing past 728,169 at ~285 files/sec. Two related
issues, both fixed in v2.6.42:

- The Phase 2 producer loop's heartbeat block only wrote `last_heartbeat` to
  the FileChangesState row. Multi-file Phase 2 left `phase_current` and
  `files_processed` frozen at their initialization value, so the UI poll
  never saw motion.
- The periodic-update block fired only when `total_files_processed % 100 == 0`.
  At 5000-active steady state the producer batches thousands of completions
  per outer-loop iteration, almost never landing exactly on a multiple of
  100. After the lucky alignment at 75,000 the modulo check stayed false for
  the rest of the scan (a toy reproduction follows this list).
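
To make the modulo failure concrete, a toy reproduction (the batch sizes are illustrative, not measured):

```python
# With batched completions the counter rarely lands exactly on a multiple of
# 100, so `total % 100 == 0` almost never fires after a lucky alignment.
total, updates_modulo, updates_delta, last_update = 0, 0, 0, 0
for batch in [75_000, 3_217, 4_981, 2_650, 5_003]:  # illustrative batch sizes
    total += batch
    if total % 100 == 0:             # old check: exact multiples only
        updates_modulo += 1
    if total - last_update >= 100:   # new check: fires whenever >=100 files advanced
        updates_delta += 1
        last_update = total
print(updates_modulo, updates_delta)  # 1 5: modulo fired once (at 75,000), delta every batch
```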

Replaced the modulo with a delta check (a `last_progress_update` tracker),
made the heartbeat (now 10 s) unconditionally write `phase_current`,
`files_processed`, and `progress_message`, and consolidated the two blocks
into a single `write_progress_snapshot` closure to prevent drift.

Also added Redis-backed real-time progress for the integrity scan,
mirroring the v2.5.67 pattern used by the regular scan: new helpers in
`progress_utils.py` (`get/update/clear_file_changes_progress_redis`, under
a separate `file_changes_progress:` key namespace), the producer writes to
Redis on every heartbeat and periodic tick, and `/api/file-changes-status`
prefers the Redis values while the scan is active (a sketch of the helpers
follows this paragraph).
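
A minimal sketch of the Redis helpers, assuming a redis-py client and JSON-serialized progress dicts; the exact signatures in progress_utils.py may differ, and the TTL is an assumed safeguard, not taken from the PR:

```python
import json

KEY_PREFIX = "file_changes_progress:"  # namespace kept separate from the regular scan


def update_file_changes_progress_redis(redis_client, scan_id, progress: dict, ttl=3600):
    # setex gives the snapshot a TTL so a crashed scan cannot leave stale keys.
    redis_client.setex(f"{KEY_PREFIX}{scan_id}", ttl, json.dumps(progress))


def get_file_changes_progress_redis(redis_client, scan_id):
    raw = redis_client.get(f"{KEY_PREFIX}{scan_id}")
    return json.loads(raw) if raw else None


def clear_file_changes_progress_redis(redis_client, scan_id):
    redis_client.delete(f"{KEY_PREFIX}{scan_id}")
```

/api/file-changes-status can then prefer the Redis snapshot over the FileChangesState row whenever the scan is active, falling back to the row when the key is absent.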

Tests: 9 new unit tests covering the Redis round-trip, the dict-shape
contract, and the delta-vs-modulo semantics. Full suite: 335 passed, 8
skipped.
ttlequals0 merged commit 701d9b4 into main on May 1, 2026
10 checks passed
ttlequals0 deleted the fix/integrity-scan-hang-2.6.42 branch on May 1, 2026 at 20:57