Integrity scan per-task timeout + defensive worker init (v2.6.42) #57
Merged
ttlequals0 merged 2 commits into main on May 1, 2026
Conversation
…2.6.42)

The integrity scan producer in `_run_file_changes_check` had no per-task timeout, so when Celery workers died mid-scan the 5,000 in-flight `calculate_file_hash_task` IDs remained `PENDING` in Redis forever and the loop pinned at `MAX_CONCURRENT_SMALL` with no submissions or completions. `safe_task_ready` was deliberately built to return `False` on Redis errors, which kept stuck tasks in `active_tasks` indefinitely.

- Stamp every `active_tasks` entry with a `time.monotonic()` `submitted_at`. Entries older than `INTEGRITY_TASK_TIMEOUT_SECS` (default 1800, env-overridable) are revoked, logged at WARNING, and dropped from the active set so the producer can advance. Healthy tasks under the timeout retain their existing transient-Redis-error retry behavior.
- Replace the misleading cumulative `files_queued` count in the heartbeat with explicit `remaining` (unsubmitted files) and `abandoned` counts, e.g.: `Progress: 225881/1167919 processed, 5000 active, 937038 remaining, 0 abandoned`.
- `_setup_worker_process` now logs entry/exit and wraps `db.engine.dispose()` in a try/except, so a fork-time failure surfaces a stack trace instead of silently leaving the child unable to log task activity (regression observed post v2.6.41, where the worker container went silent for ~9 hours after a restart).

New unit tests in `tests/unit/test_maintenance_service.py` cover the `INTEGRITY_TASK_TIMEOUT_SECS` env override, the `active_tasks` dict-shape contract, and the monotonic age math used by the abandonment branch. All 326 tests pass (parity with the v2.6.41 baseline + 6 new). Trivy is clean of HIGH/CRITICAL findings on `ttlequals0/pixelprobe:2.6.42`.
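A minimal sketch of the abandonment branch described above, assuming a hypothetical `active_tasks` dict keyed by task ID with `submitted_at` stamps. The helper name `reap_abandoned_tasks` and the `revoke` callable are illustrative, not the shipped code; only `INTEGRITY_TASK_TIMEOUT_SECS` and the `time.monotonic()` stamping come from the PR itself:

```python
import os
import time

# Env-overridable timeout, defaulting to 1800 s as described in the PR.
INTEGRITY_TASK_TIMEOUT_SECS = int(os.environ.get("INTEGRITY_TASK_TIMEOUT_SECS", "1800"))


def reap_abandoned_tasks(active_tasks, timeout=INTEGRITY_TASK_TIMEOUT_SECS, revoke=None):
    """Drop entries older than `timeout`; return the number abandoned.

    `active_tasks` maps task_id -> {"submitted_at": <time.monotonic() stamp>, ...};
    `revoke` is an optional callable invoked per stale task (e.g. a Celery revoke).
    """
    now = time.monotonic()
    abandoned = 0
    for task_id in list(active_tasks):
        age = now - active_tasks[task_id]["submitted_at"]
        if age > timeout:
            if revoke is not None:
                revoke(task_id)        # best-effort revoke of the stuck task
            del active_tasks[task_id]  # free the slot so the producer can advance
            abandoned += 1
    return abandoned
```

Using `time.monotonic()` rather than `time.time()` matters here: the age math stays correct across wall-clock adjustments (NTP steps, DST), which a long-running producer loop is exposed to.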
…is fallback

Production showed the integrity-scan UI frozen at "75,000 of 1,167,919" while the backend was advancing past 728,169 at ~285 files/sec. Two coordinated issues, both folded into v2.6.42:

- The Phase 2 producer loop's heartbeat block only wrote `last_heartbeat` to the `FileChangesState` row. Multi-file Phase 2 left `phase_current` and `files_processed` frozen at their initialization values, so the UI poll never saw motion.
- The periodic-update block fired only when `total_files_processed % 100 == 0`. At 5,000-active steady state the producer batches thousands of completions per outer-loop iteration, almost never landing exactly on a multiple of 100. After the lucky alignment at 75,000, the modulo check stayed false for the rest of the scan.

Replaced the modulo with a delta check (a `last_progress_update` tracker), made the heartbeat (now 10s) write `phase_current`/`files_processed`/`progress_message` unconditionally, and consolidated the two blocks into a single `write_progress_snapshot` closure to prevent drift.

Added Redis-backed real-time progress for the integrity scan that mirrors the v2.5.67 pattern for the regular scan: new helpers in `progress_utils.py` (`get`/`update`/`clear_file_changes_progress_redis`, with a separate `file_changes_progress:` key namespace), the producer writes Redis on every heartbeat and periodic tick, and `/api/file-changes-status` prefers Redis values when the scan is active.

Tests: 9 new unit tests for the Redis round-trip, the dict-shape contract, and the delta-vs-modulo semantics. Full suite: 335 passed, 8 skipped.
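The delta-vs-modulo point above can be demonstrated in a few lines. The function names and the batch size of 237 are illustrative; the starting total of 75,000 and the step of 100 are the values from the incident:

```python
def modulo_fires(total, step=100):
    # Old check: true only when the cumulative count is an exact multiple of `step`.
    return total % step == 0


def delta_fires(total, last_update, step=100):
    # New check: true once at least `step` completions accrued since the last write,
    # regardless of how many completions the producer batched in one iteration.
    return total - last_update >= step


total = last_update = 75_000          # the "lucky alignment" where the UI froze
modulo_hits = delta_hits = 0
for _ in range(50):                   # 50 producer iterations, 237 completions each
    total += 237
    if modulo_fires(total):
        modulo_hits += 1
    if delta_fires(total, last_update):
        delta_hits += 1
        last_update = total           # tracker advances only when a write happens
# 75_000 + 237*k is a multiple of 100 only when k is a multiple of 100 (gcd(237, 100) = 1),
# so the modulo check never fires in these 50 iterations, while the delta check
# fires on every one of them.
```

This is exactly the failure mode described: once completions arrive in batches whose size is coprime to the step, landing on an exact multiple becomes a rare accident rather than a regular event.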
Summary
Production heartbeat logs pinned at `5000 active` with zero counter movement for 17+ minutes (the Loki heartbeat repeating identically). Root cause: when a Celery worker dies mid-scan, the in-flight `calculate_file_hash_task` IDs sit in Redis as `PENDING` forever, and `_run_file_changes_check` had no per-task timeout to abandon them. `safe_task_ready` deliberately returns `False` on Redis errors, so dead tasks were retained in `active_tasks` indefinitely.

- Stamp every `active_tasks` entry with a `time.monotonic()` `submitted_at`. Entries older than `INTEGRITY_TASK_TIMEOUT_SECS` (default 1800, env-overridable) are revoked, logged at WARNING, and dropped from the active set. Healthy tasks under the threshold keep their existing transient-Redis-retry behavior.
- Replace the `files_queued` heartbeat label with `remaining` (unsubmitted) and `abandoned` counts.
- `_setup_worker_process` now logs entry/exit and wraps `db.engine.dispose()` in try/except, so a fork-time failure surfaces a stack trace instead of silently leaving the child unable to log task activity (regression observed post v2.6.41, where the worker container went silent for ~9 hours).

Version

`2.6.42` (was `2.6.41`).

Test plan
- Built for `linux/amd64` and pushed: `ttlequals0/pixelprobe:2.6.42` (digest `sha256:b8eb44bb2801dce8b62267aad5f56d847a26f6e27e3e205c43561c7a2f92b0e3`)
- `/api/version` reports `2.6.42`
- `_setup_worker_process: starting / complete in worker pid=...` lines appear in `pixelprobe-celery-worker` logs after container restart