JPEG pixel corruption detection, video freeze detection, parallel chunk scanning fixes #46
Merged
Conversation
Add pixel-level JPEG corruption detection that catches files passing PIL and ImageMagick but containing visible garbage (rainbow bands, decoder fill). Uses two signals: sustained chaos (8+ consecutive chaotic rows) and bottom-anchored solid fill (30+ identical rows reaching image bottom). Designed to avoid false positives on high-contrast thumbnails and photos.
Solid fill detection now requires 3+ chaotic rows in the 10 rows preceding the fill streak. This eliminates false positives on channel art with solid backgrounds (Fireship fanart/banner) while still catching corruption where decoder fill follows garbage data.
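The two signals above can be sketched in pure Python. This is an illustrative reconstruction, not the project's code: the thresholds (8 chaotic rows, 30 fill rows, 3 chaotic rows in the preceding 10) come from the PR text, but the chaos metric, function names, and row representation (each row as a list of grayscale values) are assumptions.

```python
# Hypothetical sketch of the two pixel-level corruption signals.
# Thresholds come from the PR description; the chaos metric is an assumption.
CHAOS_RUN = 8        # sustained chaos: 8+ consecutive chaotic rows
FILL_RUN = 30        # bottom-anchored fill: 30+ identical rows reaching bottom
PRECEDING_CHAOS = 3  # fill must follow 3+ chaotic rows in the prior 10 rows

def row_is_chaotic(row, threshold=60):
    """Assumed metric: mean absolute difference between adjacent pixels."""
    diffs = [abs(a - b) for a, b in zip(row, row[1:])]
    return sum(diffs) / max(len(diffs), 1) > threshold

def detect_corruption(rows):
    chaotic = [row_is_chaotic(r) for r in rows]

    # Signal 1: sustained chaos anywhere in the image.
    run = 0
    for flag in chaotic:
        run = run + 1 if flag else 0
        if run >= CHAOS_RUN:
            return True

    # Signal 2: solid fill anchored at the image bottom, counted upward.
    fill = 0
    for i in range(len(rows) - 1, 0, -1):
        if rows[i] == rows[i - 1]:
            fill += 1
        else:
            break
    if fill + 1 >= FILL_RUN:
        start = len(rows) - (fill + 1)
        # Require chaos just before the fill streak, so channel art with a
        # solid background (no preceding garbage) is not flagged.
        if sum(chaotic[max(0, start - 10):start]) >= PRECEDING_CHAOS:
            return True
    return False

# A 50-row solid "banner" image: fill streak present, but no preceding chaos.
print(detect_corruption([list(range(64))] * 50))  # False
```

The last line models the false-positive case the commit describes: a solid background alone is not enough, because the fill streak is not preceded by chaotic rows.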
Add file size guard (skip >10MB), image dimensions guard (skip >30MP), and 30s timeout to prevent OOM kills and hangs when scanning large DSLR photos. Move dimensions check before RGB conversion to avoid unnecessary memory allocation. Fix chaos_region_start tracking bug in detail strings.
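A minimal sketch of the size and dimension guards. The helper name and signature are illustrative, not the project's API; the 30-second timeout is noted but its enforcement mechanism is not shown here.

```python
MAX_BYTES = 10 * 1024 * 1024   # skip files larger than 10MB
MAX_PIXELS = 30_000_000        # skip images larger than 30 megapixels
ANALYSIS_TIMEOUT_S = 30        # per-file wall-clock cap (enforcement not shown)

def should_analyze(size_bytes, width, height):
    """Illustrative guard (name and signature are assumptions).
    Runs BEFORE any RGB conversion, so no pixel buffer is ever
    allocated for a file that will be skipped anyway."""
    if size_bytes > MAX_BYTES:
        return False
    if width * height > MAX_PIXELS:
        return False
    return True

print(should_analyze(2_000_000, 6000, 4000))  # True: 24MP DSLR JPEG under 10MB
```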
Add pool_reset_on_return='rollback' to SQLAlchemy engine options and db.engine.dispose() on DatabaseError/OperationalError in the scan error handler. Prevents psycopg2 PGRES_TUPLES_OK errors from blocking all subsequent scans after a worker crash.
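The engine option might look like the sketch below, demonstrated against in-memory SQLite for portability; the PR applies it to the app's Postgres engine through Flask-SQLAlchemy's engine options.

```python
from sqlalchemy import create_engine, text

engine = create_engine(
    "sqlite://",
    # Roll back any open transaction whenever a connection is returned to
    # the pool, so a crashed task cannot hand a dirty connection to the
    # next scan.
    pool_reset_on_return="rollback",
)

with engine.connect() as conn:
    result = conn.execute(text("SELECT 1")).scalar()

# On DatabaseError/OperationalError the PR additionally calls
# engine.dispose(), which discards every pooled connection so the next
# checkout opens a fresh one.
```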
Eliminate redundant Image.open() in JPEG pixel analysis by passing the already-loaded PIL Image from the caller. Previously every JPEG was opened 3 times (verify, load, pixel analysis), causing cumulative memory growth that killed the worker after ~700 files. Also adds pool_reset_on_return='rollback' and db.engine.dispose() on DatabaseError to recover from corrupted connections after worker crashes.
…6.15) The JPEG pixel analysis was using PIL's PixelAccess C extension for 80M+ pixel reads across 8000 files, causing the forked worker to silently crash (no OOM, no error, no segfault in dmesg). Replace with img.tobytes() to extract raw pixel data once as a Python bytes object, then compute row averages from bytes indexing -- zero PIL C calls during the analysis loop. Also add worker_max_memory_per_child=512MB to Celery config as a safety net for long-running image scans.
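The tobytes() approach reduces to one bytes extraction followed by pure-Python indexing. The sketch below uses a synthetic buffer; with a real image the buffer would come from `img.tobytes()` on an RGB-converted PIL image. The helper name is an assumption.

```python
# One raw-bytes extraction, then row averages from bytes indexing --
# zero PIL C calls inside the analysis loop.
def row_averages(raw, width, height, channels=3):
    stride = width * channels
    averages = []
    for y in range(height):
        row = raw[y * stride:(y + 1) * stride]
        averages.append(sum(row) / stride)  # indexing bytes yields ints
    return averages

# Synthetic 4x2 RGB buffer: first row all 10s, second row all 200s.
raw = bytes([10] * 12 + [200] * 12)
print(row_averages(raw, width=4, height=2))  # [10.0, 200.0]
```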
512MB was too aggressive -- the worker process uses ~300-400MB just for Python + Celery + Flask + SQLAlchemy before scanning any files. It was getting killed immediately during discovery phase, corrupting the DB connection and preventing any scans from completing.
…6.17) Root cause: Pillow Image.close() does not deallocate pixel data (python-pillow/Pillow#3610). tobytes() created 36MB Python allocations per image that bypassed PIL's block allocator, fragmenting memory over thousands of files until the worker was killed. Fixes:
- Downscale image to ~200px wide before pixel analysis (90KB vs 36MB)
- Add gc.collect() after each scan chunk (PIL circular ref cleanup)
- Set PILLOW_BLOCKS_MAX=256 in docker-compose (PIL block reuse)
- Remove worker_max_memory_per_child (was killing workers and corrupting DB connections via psycopg2 fork inheritance)
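The PILLOW_BLOCKS_MAX change is an environment variable in docker-compose; the fragment below is illustrative (the service name "worker" is an assumption):

```yaml
services:
  worker:
    environment:
      # Let Pillow keep up to 256 freed memory blocks for reuse instead of
      # returning them to the allocator, reducing fragmentation across
      # thousands of decoded images.
      - PILLOW_BLOCKS_MAX=256
```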
….18) gc.collect() at the end of each chunk triggers PIL C-level destructors in the forked worker process, causing a silent crash. Previous versions without gc.collect() survived 8000+ files across multiple chunks. With gc.collect(), the worker died after the first chunk (~1000 files).
Root cause: _create_scanning_chunks() loaded ALL file paths into memory via .all(), consuming ~200MB+ for 600K files. The worker accumulated this on top of discovery/adding phase memory, then died silently when PIL/ImageMagick processing started. Fixes:
- Replace .all() with yield_per() streaming in chunk creation -- holds only one chunk (~1000 paths) in memory at a time
- Add composite index (scan_status, file_path) for optimal pending file query performance
- Add cancel_futures=True to ThreadPoolExecutor shutdown to prevent indefinite hangs at chunk boundaries
- Revert pool_reset_on_return and db.engine.dispose that were treating symptoms of this underlying bug
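The .all() to yield_per() change might look like the sketch below. The model is a minimal stand-in whose names mirror the PR text (including the composite index), shown against in-memory SQLite rather than the app's Postgres database.

```python
from sqlalchemy import Column, Index, Integer, String, create_engine
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()

class ScanResult(Base):
    """Minimal stand-in for the real model; names mirror the PR text."""
    __tablename__ = "scan_results"
    id = Column(Integer, primary_key=True)
    file_path = Column(String, nullable=False)
    scan_status = Column(String, default="pending")
    # Composite index from the PR for the pending-file query.
    __table_args__ = (Index("ix_status_path", "scan_status", "file_path"),)

engine = create_engine("sqlite://")
Base.metadata.create_all(engine)

CHUNK = 1000
with Session(engine) as session:
    session.add_all([ScanResult(file_path=f"/media/{i}.jpg") for i in range(2500)])
    session.commit()

    chunks, current = [], []
    query = session.query(ScanResult.file_path).filter(
        ScanResult.scan_status == "pending"
    )
    # yield_per streams rows in batches instead of materializing .all(),
    # so only about one chunk of paths is resident at a time.
    for row in query.yield_per(CHUNK):
        current.append(row.file_path)
        if len(current) == CHUNK:
            chunks.append(current)
            current = []
    if current:
        chunks.append(current)

print([len(c) for c in chunks])  # [1000, 1000, 500]
```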
…2.6.21) Root cause confirmed via pg_stat_activity: 20 ThreadPoolExecutor threads all executed UPDATE scan_state on the same row simultaneously, creating a PostgreSQL row-level lock convoy that permanently blocked scanning. Fix: progress updates now happen once per batch (100 files) from the main thread after all futures complete, instead of per-file from inside the as_completed loop. Uses db.session directly instead of deprecated Session(bind=) API.
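The single-writer batching pattern can be sketched with stdlib primitives. Here `progress_writes` is a stand-in for the UPDATE scan_state statement, and `scan_file` is a dummy; the point is that workers never write progress, and the main thread writes once per batch after all futures complete.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

BATCH_SIZE = 100
progress_writes = []  # stand-in for UPDATE scan_state

def scan_file(path):   # dummy per-file work
    return path, "healthy"

def scan_batch(paths, workers=20):
    done = 0
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(scan_file, p) for p in paths]
        for future in as_completed(futures):
            future.result()
            done += 1              # NO per-file DB write here
    progress_writes.append(done)   # single progress write per batch,
    return done                    # from the main thread only

total = 0
for start in range(0, 250, BATCH_SIZE):
    batch = [f"/f/{i}" for i in range(start, min(start + BATCH_SIZE, 250))]
    total += scan_batch(batch)

print(total, len(progress_writes))  # 250 files scanned, 3 progress writes
```

With 20 threads all doing per-file UPDATEs on the same row, every write queues behind the row lock; with one write per 100 files from one thread, there is nothing to convoy on.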
…es (v2.6.22) The previous fix (v2.6.21) only fixed the parallel FILE path inside _scan_chunk_files. But with MAX_WORKERS=20, the code takes the parallel CHUNK path (_parallel_scan_chunks), which spawns 20 chunk-level threads each running _scan_chunk_files with num_workers=1 (sequential). Those 20 sequential threads each did per-file UPDATE scan_state, creating the same row-lock convoy. Fix: when use_atomic_increment=True (parallel chunk mode), skip per-file scan_state DB updates entirely. The chunk completion handler in _parallel_scan_chunks already handles aggregate progress.
Even with explicit scan_state updates skipped, the ORM object could be dirty from earlier attribute access. db.session.commit() flushes ALL dirty objects, causing the same UPDATE scan_state contention. Fix: db.session.expire(scan_state) before commit in parallel mode, preventing the ORM from flushing stale scan_state attributes that create row-level lock convoys.
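The expire-before-commit behavior can be demonstrated with a plain SQLAlchemy Session against in-memory SQLite (the app uses Flask-SQLAlchemy's db.session; the model here is a minimal stand-in).

```python
from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()

class ScanState(Base):  # minimal stand-in for the real model
    __tablename__ = "scan_state"
    id = Column(Integer, primary_key=True)
    phase = Column(String)

engine = create_engine("sqlite://")
Base.metadata.create_all(engine)

session = Session(engine)
session.add(ScanState(id=1, phase="scanning"))
session.commit()

state = session.get(ScanState, 1)
state.phase = "dirty-from-earlier-access"  # object is now dirty

# Without expire(), the next commit would flush this UPDATE and join the
# row-lock convoy. expire() discards the pending in-memory changes, so
# commit() has nothing to flush for this row.
session.expire(state)
session.commit()

fresh = session.get(ScanState, 1)
print(fresh.phase)  # "scanning" -- the dirty value was never flushed
```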
….24) The expire(scan_state) fix for the row-lock convoy also stopped last_update from being written by the scan worker. The UI progress worker that should maintain last_update independently fails to launch when Redis is congested from previous crash retries (the .delay() call succeeds but the task is never picked up by a worker). Fix: raw SQL UPDATE scan_state SET last_update at chunk completion. This runs once per chunk (every few minutes) from one thread at a time, avoiding the contention that caused the original deadlock while keeping the scan alive for the stuck scan checker.
Collapsible grid below the main progress bar showing each parallel chunk worker's status during the scanning phase. Shows directory path, files scanned/total, and a mini progress bar per chunk. Collapsed by default, expandable via a toggle button.
Backend: extend /api/scan-status with a chunks array from the ScanChunk table.
Frontend: safe DOM-based rendering (no innerHTML), dark mode support, mobile responsive with hidden progress bars and truncated paths.
Root cause: _parallel_scan_chunks worker threads shared Flask's scoped db.session, causing "concurrent operations are not permitted" errors. Every chunk immediately errored, making scans complete with 0 files. Fixed by calling db.session.remove() at thread start for a fresh session. Also includes per-worker progress grid UI (collapsible, shows each chunk worker's status during parallel scanning).
A single chunk failure (ResourceClosedError from stale DB connection) propagated from scan_chunk() through future.result() and crashed the entire scan task. Added try/except around _scan_chunk_files so failed chunks are marked as error and retried later, without killing the scan. Reverted db.session.remove() which was corrupting the main thread's session -- Flask-SQLAlchemy already provides thread-local scoping via app_context().
…ng (v2.6.28) Two issues:
1. Progress bar disappeared after clicking Start Scan because the API returned the previous scan's "completed" status before the new one initialized. Added 15-second grace period after user-initiated start to ignore stale completed status.
2. Worker grid showed no chunks because chunk objects created in the main thread's session weren't visible to worker thread sessions. Used db.session.merge() at chunk processing start and raw SQL UPDATE at chunk completion to ensure cross-thread visibility.
Fixes 7 interconnected bugs from mixing ORM, raw SQL, and server-side cursors across threads:
1. Replace yield_per() with limit/offset pagination -- eliminates the server-side cursor that held transactions open across operations
2. Pass chunk DB IDs to worker threads instead of ORM objects -- each thread queries fresh objects in its own session
3. Remove ALL scan_state writes from worker threads when in parallel chunk mode -- the main thread's as_completed loop is the single writer
4. Expire the chunk ORM object before the raw SQL completion update, to prevent the ORM from overwriting raw SQL values on commit
5. Add rollback before error-handling SQL in exception handlers
6. Remove pointless double commit and unnecessary expire calls
7. Worker grid UI: 3-second delay on scan start to avoid stale status; chunk files_scanned updated every 100 files via raw SQL
Two regressions from the limit/offset pagination change:
1. Chunk creation used OFFSET/LIMIT, which shifts when concurrent processes change file status between queries, skipping ~23% of files. Replaced with keyset pagination (WHERE file_path > last_path), which is stable regardless of concurrent changes.
2. _retry_pending_files loaded ALL remaining pending files (90K+) via .all() and rescanned them sequentially, blocking scan completion for hours. Now uses count() first and skips the retry if more than 1000 files remain (they get picked up on the next scheduled scan).
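Keyset pagination can be sketched with stdlib sqlite3. Table and column names mirror the PR's description; the key insight is that each page is anchored to the last file_path seen, not to a row offset, so rows changing status between queries cannot shift the window.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE scan_results (file_path TEXT PRIMARY KEY, scan_status TEXT)"
)
conn.executemany(
    "INSERT INTO scan_results VALUES (?, 'pending')",
    [(f"/media/{i:04d}.jpg",) for i in range(2500)],
)

def pending_chunks(conn, chunk_size=1000):
    """Yield chunks of pending paths via keyset (seek) pagination."""
    last_path = ""  # every path sorts after the empty string
    while True:
        rows = conn.execute(
            "SELECT file_path FROM scan_results "
            "WHERE scan_status = 'pending' AND file_path > ? "
            "ORDER BY file_path LIMIT ?",
            (last_path, chunk_size),
        ).fetchall()
        if not rows:
            return
        yield [r[0] for r in rows]
        last_path = rows[-1][0]  # anchor the next page to the last key seen

chunks = list(pending_chunks(conn))
print([len(c) for c in chunks])  # [1000, 1000, 500]
```

An OFFSET-based loop over the same table would skip rows whenever a concurrent writer flipped earlier rows out of 'pending' between pages; the WHERE file_path > ? predicate is immune to that.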
SQLAlchemy query(ScanResult.file_path) returns Row objects where row[0] can fail with IndexError in some versions. Use row.file_path named attribute access instead.
…2.6.37)
- Fix scan startup crash from an undefined offset variable in keyset pagination.
- Add startup migration to sync missing scan_state/scan_chunks columns.
- Fix batch pagination skipping files on a shrinking pending result set.
- Fix scans completing with ~65% of files still pending by adding a raw SQL status update via Flask's db.session after each scan_file() call, bridging the cross-connection visibility gap with PixelProbe's separate StaticPool session.
- Remove the hardcoded 30-chunk limit on the scan progress grid.
Summary
Version
2.6.37
Test plan