Create pull request #26

Merged
vadanamu merged 2 commits into vadanamu:public from donghyuka:dev_cpp
Apr 9, 2026
@donghyuka

Collaborator

[preprocess] Fix concurrent Pod5FileReader access crash
[preprocess] Optimize dispatch and convert-later pattern

When there are fewer POD5 files than CPUs (pod5_files <
cpu_count), POD5 records are redistributed across all workers.
Multiple workers then access the same Pod5FileReader_t*
concurrently, causing a data race that crashes the process
(the POD5 C API is not thread-safe for same-reader access).

Fix: each worker lazily opens its own Pod5FileReader_t* per
POD5 file on first access, stored in a local map. Readers
are closed when the worker completes.

- Add file_path field to Pod5RecordMeta
- Lazy-open per-worker readers in MergedDataWorker (streaming)
- Lazy-open per-worker readers in process_merged_data_meta_worker
  (consistency mode)
- Close per-worker readers on completion
- Skip records when pod5_open_file fails (avoid falling back
  to the shared reader, which would reintroduce the data race)
- Add error logging for pod5_open_file failures in both paths
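
A minimal sketch of the per-worker lazy-open pattern described above, not the code from this PR. `WorkerReaderCache`, its method names, and the injected `open`/`close` callables are hypothetical; in the real workers the callables would presumably wrap `pod5_open_file` and the matching close call on `Pod5FileReader_t*`. The class is a template so the sketch compiles without the POD5 headers.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical per-worker cache: at most one reader handle per POD5 file path.
// Reader stands in for Pod5FileReader_t; open/close are injected callables.
template <typename Reader>
class WorkerReaderCache {
public:
    using OpenFn = std::function<Reader*(const std::string&)>;
    using CloseFn = std::function<void(Reader*)>;

    WorkerReaderCache(OpenFn open, CloseFn close)
        : open_(std::move(open)), close_(std::move(close)) {
        assert(open_ && close_);
    }

    // Not copyable: a cache belongs to exactly one worker.
    WorkerReaderCache(const WorkerReaderCache&) = delete;
    WorkerReaderCache& operator=(const WorkerReaderCache&) = delete;

    // Readers are closed when the worker (and its cache) goes away.
    ~WorkerReaderCache() { close_all(); }

    // Returns this worker's reader for file_path, opening it on first access.
    // Returns nullptr if the open fails; the caller should log and skip the
    // record rather than fall back to a reader shared with other workers.
    Reader* get(const std::string& file_path) {
        auto it = readers_.find(file_path);
        if (it != readers_.end()) return it->second;
        Reader* reader = open_(file_path);
        if (reader == nullptr) return nullptr;  // failures are not cached
        readers_.emplace(file_path, reader);
        return reader;
    }

    // Closes every reader this worker opened.
    void close_all() {
        for (auto& entry : readers_) close_(entry.second);
        readers_.clear();
    }

    std::size_t size() const { return readers_.size(); }

private:
    OpenFn open_;
    CloseFn close_;
    std::unordered_map<std::string, Reader*> readers_;
};
```

Each worker thread would construct its own cache and look up the reader by the record's file_path. Because no handle is shared between threads, the cache itself needs no locking.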

Tested on node05 with pod5_8 dataset (8 files, 256 workers):
- Before fix: crash during batch 1 processing (same as AWS)
- After fix: completed in ~30 min, 33,891 NPZ files output

Signed-off-by: donghyuk <donghyuk@genome4me.com>

[preprocess] Optimize dispatch and convert-later pattern

- Back-pressure with configurable max_queue_size
  (default: process_once * 4 * workers, -Q CLI option)
- Half-watermark notify: wake read_loop only when queue
  drops below 50%
- Convert-later: defer bam1_t to BamRecord conversion
  until batch processing, and skip conversion for unmatched
  reads to reduce memory consumption
- NpzWriter: offset/count-based save_chunk, flat array
  reuse, rvalue add_records
- RecordMerger: reuse single instance across batches
- Replace queue with deque in BAM read/dispatch
  for efficient range insert/erase
- Elapsed time logging (HH:MM:SS.ss format)
- Memory monitoring via get_rss_mb()
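
The back-pressure and half-watermark items above can be sketched as a bounded queue, under assumptions. `BoundedQueue` and `default_max_queue_size` are hypothetical names, and the PR's actual read_loop and dispatch code may be structured differently; only the default-size formula and the 50% threshold come from the description.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <mutex>
#include <utility>

// Default capacity as stated in the description: process_once * 4 * workers.
inline std::size_t default_max_queue_size(std::size_t process_once,
                                          std::size_t workers) {
    return process_once * 4 * workers;
}

// Hypothetical bounded queue with half-watermark wake-up of the producer.
template <typename T>
class BoundedQueue {
public:
    explicit BoundedQueue(std::size_t max_size) : max_size_(max_size) {
        assert(max_size_ > 0);
    }

    // Producer side (the read loop): blocks while the queue is full.
    void push(T item) {
        std::unique_lock<std::mutex> lock(mu_);
        not_full_.wait(lock, [this] { return q_.size() < max_size_; });
        q_.push_back(std::move(item));
        not_empty_.notify_one();
    }

    // Consumer side (a worker): blocks while the queue is empty.
    T pop() {
        std::unique_lock<std::mutex> lock(mu_);
        not_empty_.wait(lock, [this] { return !q_.empty(); });
        T item = std::move(q_.front());
        q_.pop_front();
        // Half-watermark: wake the producer only once the queue is below
        // 50% of capacity, so it refills in large bursts instead of being
        // woken after every single pop.
        if (q_.size() * 2 < max_size_) not_full_.notify_all();
        return item;
    }

    std::size_t size() const {
        std::lock_guard<std::mutex> lock(mu_);
        return q_.size();
    }

private:
    const std::size_t max_size_;
    mutable std::mutex mu_;
    std::condition_variable not_full_;
    std::condition_variable not_empty_;
    std::deque<T> q_;
};
```

The condition `q_.size() * 2 < max_size_` is always true for an empty queue, so a blocked producer is eventually woken even with a capacity of 1. A deque is used here to match the queue-to-deque change listed above.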

pod5_81 benchmark (node07 SSD, 256 workers, n=1000):
  Before: 266 min, 1675 GB peak RSS
  After:   91 min, 1119 GB peak RSS (2.9x faster, 33% less memory)
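
For reference, the HH:MM:SS.ss elapsed-time format listed above could be produced by a helper like the following. `format_elapsed` is a hypothetical name and this is only a sketch; the logging code in the PR is not shown in this description.

```cpp
#include <cassert>
#include <cmath>
#include <cstdio>
#include <string>

// Hypothetical helper: formats a duration in seconds as HH:MM:SS.ss.
std::string format_elapsed(double seconds) {
    assert(seconds >= 0.0 && "elapsed time must be non-negative");
    // Work in whole centiseconds so rounding carries into seconds and minutes.
    long long centis = std::llround(seconds * 100.0);
    long long hours = centis / 360000;
    long long minutes = (centis / 6000) % 60;
    long long secs = (centis / 100) % 60;
    long long frac = centis % 100;
    char buf[32];
    std::snprintf(buf, sizeof buf, "%02lld:%02lld:%02lld.%02lld",
                  hours, minutes, secs, frac);
    return std::string(buf);
}
```

With this helper, the 266 min and 91 min benchmark runs would be logged as 04:26:00.00 and 01:31:00.00.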

Signed-off-by: donghyuk <donghyuk@genome4me.com>
@vadanamu merged commit cc846c8 into vadanamu:public on Apr 9, 2026
2 checks passed