
JPEG pixel corruption detection, video freeze detection, parallel chunk scanning fixes#46

Merged
ttlequals0 merged 22 commits into main from jpg-color-corruption on Apr 6, 2026

Conversation

@ttlequals0 (Owner) commented Apr 4, 2026

Summary

  • JPEG pixel corruption detection: Detect visually corrupted JPEGs that pass PIL/ImageMagick validation but contain visible garbage (rainbow bands, solid color fill) in decoded pixel data
  • Video freeze detection: Detect videos with frozen frames using FFmpeg freezedetect filter with black frame filtering to reduce false positives
  • Per-worker scan progress grid: Collapsible UI grid showing each parallel chunk worker's status, directory, and progress
  • Parallel chunk scanning fixes: Fixed scan startup crash (undefined offset), schema migration for missing columns, batch pagination on shrinking result sets, dual-session pending file issue, and removed hardcoded 30-chunk limit on progress grid
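The freeze detection described above uses FFmpeg's freezedetect filter. As a hedged sketch of the kind of invocation involved (freezedetect and blackdetect are real FFmpeg filters with these option names; the thresholds and the exact pipeline PixelProbe runs are assumptions):

```shell
# Log freeze intervals: n = noise tolerance, d = minimum freeze duration.
# blackdetect runs alongside so black intervals can be excluded afterwards
# (the PR filters black frames to reduce false positives).
ffmpeg -hide_banner -i input.mp4 \
  -vf "freezedetect=n=-60dB:d=2,blackdetect=d=2:pix_th=0.10" \
  -an -f null - 2>&1 | grep -E "freezedetect|blackdetect"
```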

Version

2.6.37

Test plan

  • Docker image built and pushed (ttlequals0/pixelprobe:2.6.37 + latest)
  • Deployed to production, verified 1.15M files scanned with 0 pending
  • JPEG pixel corruption detection tested with synthetic images (7 unit tests)
  • Chunk progress grid displays all chunks (no 30-chunk cap)

Add pixel-level JPEG corruption detection that catches files passing PIL
and ImageMagick but containing visible garbage (rainbow bands, decoder
fill). Uses two signals: sustained chaos (8+ consecutive chaotic rows)
and bottom-anchored solid fill (30+ identical rows reaching image bottom).
Designed to avoid false positives on high-contrast thumbnails and photos.

Solid fill detection now requires 3+ chaotic rows in the 10 rows
preceding the fill streak. This eliminates false positives on channel
art with solid backgrounds (Fireship fanart/banner) while still
catching corruption where decoder fill follows garbage data.
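The two signals can be sketched in pure Python over raw RGB bytes (function names and thresholds here are illustrative, not PixelProbe's actual code; the real check additionally requires chaotic rows just before a fill streak, as described above):

```python
def row_signals(pixels: bytes, width: int, height: int):
    """Per-row (mean, neighbour-jump) stats from raw RGB bytes
    (the img.tobytes() output, 3 bytes per pixel)."""
    stride = width * 3
    out = []
    for y in range(height):
        row = pixels[y * stride:(y + 1) * stride]
        mean = sum(row) / len(row)
        # Average absolute difference between horizontally adjacent
        # pixels: very large values suggest decoder garbage, not texture.
        jump = sum(abs(row[i] - row[i - 3]) for i in range(3, len(row))) / (len(row) - 3)
        out.append((mean, jump))
    return out


def looks_corrupted(rows, jump_th=60.0, chaos_run=8, fill_run=30):
    """Signal 1: sustained chaos (8+ consecutive chaotic rows).
    Signal 2: bottom-anchored solid fill (30+ identical rows reaching
    the image bottom).  Thresholds are illustrative; the real check
    also demands chaos just before the fill streak to spare artwork
    with solid backgrounds."""
    run = 0
    for _, jump in rows:
        run = run + 1 if jump > jump_th else 0
        if run >= chaos_run:
            return True
    tail = [round(m, 3) for m, _ in rows[-fill_run:]]
    return len(rows) >= fill_run and len(set(tail)) == 1
```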

Add file size guard (skip >10MB), image dimensions guard (skip >30MP),
and 30s timeout to prevent OOM kills and hangs when scanning large DSLR
photos. Move dimensions check before RGB conversion to avoid unnecessary
memory allocation. Fix chaos_region_start tracking bug in detail strings.

Add pool_reset_on_return='rollback' to SQLAlchemy engine options and
db.engine.dispose() on DatabaseError/OperationalError in the scan error
handler. Prevents psycopg2 PGRES_TUPLES_OK errors from blocking all
subsequent scans after a worker crash.
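A minimal sketch of these two measures (pool_reset_on_return and Engine.dispose() are real SQLAlchemy APIs; the handler shape is assumed, and note that a later commit in this PR reverts both once the underlying memory bug was found):

```python
# Engine option: issue a ROLLBACK whenever a connection is returned to
# the pool, so a crashed worker cannot strand an open transaction on a
# pooled connection.
SQLALCHEMY_ENGINE_OPTIONS = {"pool_reset_on_return": "rollback"}

# In the scan error handler (illustrative shape, not PixelProbe's code):
# except (OperationalError, DatabaseError):
#     db.session.rollback()
#     db.engine.dispose()   # drop every pooled connection; the next
#                           # checkout opens a fresh, uncorrupted one
```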

Eliminate redundant Image.open() in JPEG pixel analysis by passing the
already-loaded PIL Image from the caller. Previously every JPEG was
opened 3 times (verify, load, pixel analysis), causing cumulative memory
growth that killed the worker after ~700 files.

Also adds pool_reset_on_return='rollback' and db.engine.dispose() on
DatabaseError to recover from corrupted connections after worker crashes.
…6.15)

The JPEG pixel analysis was using PIL's PixelAccess C extension for
80M+ pixel reads across 8000 files, causing the forked worker to
silently crash (no OOM, no error, no segfault in dmesg). Replace with
img.tobytes() to extract raw pixel data once as a Python bytes object,
then compute row averages from bytes indexing -- zero PIL C calls during
the analysis loop.

Also add worker_max_memory_per_child=512MB to Celery config as a safety
net for long-running image scans.

512MB was too aggressive -- the worker process uses ~300-400MB just for
Python + Celery + Flask + SQLAlchemy before scanning any files. It was
getting killed immediately during discovery phase, corrupting the DB
connection and preventing any scans from completing.
…6.17)

Root cause: Pillow Image.close() does not deallocate pixel data
(python-pillow/Pillow#3610). tobytes() created 36MB Python allocations
per image that bypassed PIL's block allocator, fragmenting memory over
thousands of files until the worker was killed.

Fixes:
- Downscale image to ~200px wide before pixel analysis (90KB vs 36MB)
- Add gc.collect() after each scan chunk (PIL circular ref cleanup)
- Set PILLOW_BLOCKS_MAX=256 in docker-compose (PIL block reuse)
- Remove worker_max_memory_per_child (was killing workers and corrupting
  DB connections via psycopg2 fork inheritance)
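The memory figures are easy to sanity-check. Assuming a 4000x3000 DSLR frame (an assumed example size) and RGB at 3 bytes per pixel:

```python
# Raw pixel buffer sizes, RGB = 3 bytes per pixel.
full_w, full_h = 4000, 3000                 # assumed DSLR-sized frame
full = full_w * full_h * 3                  # 36,000,000 bytes (~36 MB)

target_w = 200                              # analysis width from the fix
small_h = round(full_h * target_w / full_w) # keep aspect ratio -> 150
small = target_w * small_h * 3              # 90,000 bytes (~90 KB)

# With Pillow the downscale itself would be something like:
#   img.thumbnail((200, 10_000))  # cap width at 200px, preserve aspect
```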
….18)

gc.collect() at the end of each chunk triggers PIL C-level destructors
in the forked worker process, causing a silent crash. Previous versions
without gc.collect() survived 8000+ files across multiple chunks. With
gc.collect(), the worker died after the first chunk (~1000 files).

Root cause: _create_scanning_chunks() loaded ALL file paths into memory
via .all(), consuming ~200MB+ for 600K files. The worker accumulated
this on top of discovery/adding phase memory, then died silently when
PIL/ImageMagick processing started.

Fixes:
- Replace .all() with yield_per() streaming in chunk creation -- holds
  only one chunk (~1000 paths) in memory at a time
- Add composite index (scan_status, file_path) for optimal pending
  file query performance
- Add cancel_futures=True to ThreadPoolExecutor shutdown to prevent
  indefinite hangs at chunk boundaries
- Revert pool_reset_on_return and db.engine.dispose that were treating
  symptoms of this underlying bug
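The streaming idea can be sketched with stdlib sqlite3 standing in for the real Postgres/SQLAlchemy setup (the SQLAlchemy form is query(...).yield_per(batch); the table and column names here are illustrative):

```python
import sqlite3

def stream_pending(conn, batch=1000):
    """Yield pending file paths in fixed-size batches instead of
    loading every row at once (the .all() call that consumed ~200MB
    for 600K files).  Only one batch is held in memory at a time."""
    cur = conn.execute(
        "SELECT file_path FROM scan_results "
        "WHERE scan_status = 'pending' ORDER BY file_path")
    while True:
        rows = cur.fetchmany(batch)
        if not rows:
            return
        for (path,) in rows:
            yield path
```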
…2.6.21)

Root cause confirmed via pg_stat_activity: 20 ThreadPoolExecutor threads
all executed UPDATE scan_state on the same row simultaneously, creating
a PostgreSQL row-level lock convoy that permanently blocked scanning.

Fix: progress updates now happen once per batch (100 files) from the
main thread after all futures complete, instead of per-file from inside
the as_completed loop. Uses db.session directly instead of deprecated
Session(bind=) API.
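The single-writer pattern can be sketched with the stdlib thread pool (all names are illustrative, not PixelProbe's API; the point is that only the main thread ever writes shared progress, and only once per batch):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def scan_all(paths, scan_file, update_progress, batch=100, workers=20):
    """Scan files in a thread pool, but write shared progress from the
    MAIN thread only, once per `batch` completions.  Worker threads
    never touch the scan_state row, so no row-level lock convoy forms."""
    done = 0
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(scan_file, p) for p in paths]
        for fut in as_completed(futures):
            fut.result()                 # propagate worker exceptions
            done += 1
            if done % batch == 0:
                update_progress(done)    # single writer, main thread
    if done % batch:
        update_progress(done)            # flush the final partial batch
    return done
```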
…es (v2.6.22)

The previous fix (v2.6.21) only fixed the parallel FILE path inside
_scan_chunk_files. But with MAX_WORKERS=20, the code takes the parallel
CHUNK path (_parallel_scan_chunks), which spawns 20 chunk-level threads
each running _scan_chunk_files with num_workers=1 (sequential). Those
20 sequential threads each did per-file UPDATE scan_state, creating the
same row-lock convoy.

Fix: when use_atomic_increment=True (parallel chunk mode), skip per-file
scan_state DB updates entirely. The chunk completion handler in
_parallel_scan_chunks already handles aggregate progress.

Even with explicit scan_state updates skipped, the ORM object could be
dirty from earlier attribute access. db.session.commit() flushes ALL
dirty objects, causing the same UPDATE scan_state contention.

Fix: db.session.expire(scan_state) before commit in parallel mode,
preventing the ORM from flushing stale scan_state attributes that
create row-level lock convoys.
….24)

The expire(scan_state) fix for the row-lock convoy also stopped
last_update from being written by the scan worker. The UI progress
worker that should maintain last_update independently fails to launch
when Redis is congested from previous crash retries (the .delay() call
succeeds but the task is never picked up by a worker).

Fix: raw SQL UPDATE scan_state SET last_update at chunk completion.
This runs once per chunk (every few minutes) from one thread at a time,
avoiding the contention that caused the original deadlock while keeping
the scan alive for the stuck scan checker.

Collapsible grid below the main progress bar showing each parallel
chunk worker's status during scanning phase. Shows directory path,
files scanned/total, and mini progress bar per chunk. Collapsed by
default, expandable via toggle button.

Backend: extend /api/scan-status with chunks array from ScanChunk table
Frontend: safe DOM-based rendering (no innerHTML), dark mode support,
mobile responsive with hidden progress bars and truncated paths.
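A hedged sketch of the extended /api/scan-status payload shape (only the chunks array sourced from the ScanChunk table is stated in the PR; the field names below are assumptions for illustration):

```json
{
  "status": "scanning",
  "chunks": [
    {"id": 1, "directory": "/media/photos", "status": "scanning",
     "files_scanned": 420, "files_total": 1000},
    {"id": 2, "directory": "/media/video", "status": "pending",
     "files_scanned": 0, "files_total": 1000}
  ]
}
```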

Root cause: _parallel_scan_chunks worker threads shared Flask's scoped
db.session, causing "concurrent operations are not permitted" errors.
Every chunk immediately errored, making scans complete with 0 files.
Fixed by calling db.session.remove() at thread start for a fresh session.

Also includes per-worker progress grid UI (collapsible, shows each
chunk worker's status during parallel scanning).

A single chunk failure (ResourceClosedError from stale DB connection)
propagated from scan_chunk() through future.result() and crashed the
entire scan task. Added try/except around _scan_chunk_files so failed
chunks are marked as error and retried later, without killing the scan.

Reverted db.session.remove() which was corrupting the main thread's
session -- Flask-SQLAlchemy already provides thread-local scoping via
app_context().
…ng (v2.6.28)

Two issues:
1. Progress bar disappeared after clicking Start Scan because the API
   returned the previous scan's "completed" status before the new one
   initialized. Added 15-second grace period after user-initiated start
   to ignore stale completed status.

2. Worker grid showed no chunks because chunk objects created in the
   main thread's session weren't visible to worker thread sessions.
   Used db.session.merge() at chunk processing start and raw SQL UPDATE
   at chunk completion to ensure cross-thread visibility.

Fixes 7 interconnected bugs from mixing ORM, raw SQL, and server-side
cursors across threads:

1. Replace yield_per() with limit/offset pagination -- eliminates
   server-side cursor that held transactions open across operations
2. Pass chunk DB IDs to worker threads instead of ORM objects --
   each thread queries fresh objects in its own session
3. Remove ALL scan_state writes from worker threads when in parallel
   chunk mode -- main thread as_completed loop is the single writer
4. Expire chunk ORM before raw SQL completion to prevent ORM from
   overwriting raw SQL values on commit
5. Add rollback before error-handling SQL in exception handlers
6. Remove pointless double commit and unnecessary expire calls
7. Worker grid UI: 3-second delay on scan start to avoid stale status,
   chunk files_scanned updated every 100 files via raw SQL

@ttlequals0 ttlequals0 changed the title JPEG pixel corruption detection (v2.6.10) JPEG pixel corruption detection + parallel scan fixes (v2.6.30) Apr 5, 2026

Two regressions from the limit/offset pagination change:

1. Chunk creation used OFFSET/LIMIT which shifts when concurrent
   processes change file status between queries, skipping ~23% of
   files. Replaced with keyset pagination (WHERE file_path > last_path)
   which is stable regardless of concurrent changes.

2. _retry_pending_files loaded ALL remaining pending files (90K+) via
   .all() and rescanned them sequentially, blocking scan completion for
   hours. Now uses count() first and skips retry if > 1000 files remain
   (they get picked up on the next scheduled scan).
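Keyset pagination can be sketched with stdlib sqlite3 in place of the real Postgres setup (schema and names are illustrative): each query resumes strictly after the last path seen, so rows changing status between queries cannot shift the window.

```python
import sqlite3

def keyset_batches(conn, batch=1000):
    """Keyset pagination over pending files: WHERE file_path > last
    seen path, ORDER BY file_path.  Unlike OFFSET/LIMIT, concurrent
    status changes cannot cause files to be skipped."""
    last = ""
    while True:
        rows = conn.execute(
            "SELECT file_path FROM scan_results "
            "WHERE scan_status = 'pending' AND file_path > ? "
            "ORDER BY file_path LIMIT ?", (last, batch)).fetchall()
        if not rows:
            return
        yield [r[0] for r in rows]
        last = rows[-1][0]
```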

SQLAlchemy query(ScanResult.file_path) returns Row objects where
row[0] can fail with IndexError in some versions. Use row.file_path
named attribute access instead.
…2.6.37)

Fix scan startup crash from undefined offset variable in keyset pagination.
Add startup migration to sync missing scan_state/scan_chunks columns.
Fix batch pagination skipping files on shrinking pending result set.
Fix scans completing with ~65% files still pending by adding raw SQL
status update via Flask's db.session after each scan_file() call,
bridging the cross-connection visibility gap with PixelProbe's separate
StaticPool session. Remove hardcoded 30-chunk limit on scan progress grid.

@ttlequals0 ttlequals0 changed the title JPEG pixel corruption detection + parallel scan fixes (v2.6.30) JPEG pixel corruption detection, video freeze detection, parallel chunk scanning fixes Apr 6, 2026

@ttlequals0 ttlequals0 merged commit a260b69 into main Apr 6, 2026
6 of 9 checks passed
@ttlequals0 ttlequals0 deleted the jpg-color-corruption branch April 6, 2026 20:57