Skip to content

Create pull request#30

Open
donghyuka wants to merge 5 commits into
vadanamu:publicfrom
donghyuka:dev_cpp
Open

Create pull request#30
donghyuka wants to merge 5 commits into
vadanamu:publicfrom
donghyuka:dev_cpp

Conversation

@donghyuka

Copy link
Copy Markdown
Collaborator

[merger] Add BAM safety guards in normal mode
[bam] Apply --filter-flag in consistency
[npz] Single-pass NPZ writer to reduce mmap_lock contention

donghyuka added 5 commits May 20, 2026 15:07
When many workers run in parallel, cnpy::npz_save opened/closed the file
and rewrote the ZIP central directory 8 times per chunk, allocating
~80 MB each call. This triggered page-fault storms serialized on the
per-process mmap_lock.

NpzFileSession writes all 8 arrays in one fopen/fclose cycle and reuses
caller-owned scratch buffers across chunks, reducing page faults to the
first chunk per worker. Output is byte-compatible with
numpy.savez_compressed.

Signed-off-by: donghyuk <donghyuk@genome4me.com>
When many workers run in parallel, cnpy::npz_save opened/closed the file
and rewrote the ZIP central directory 8 times per chunk, allocating
~80 MB each call. This triggered page-fault storms serialized on the
per-process mmap_lock.

NpzFileSession writes all 8 arrays in one fopen/fclose cycle and reuses
caller-owned scratch buffers across chunks, reducing page faults to the
first chunk per worker. Output is byte-compatible with
numpy.savez_compressed.

Signed-off-by: donghyuk <donghyuk@genome4me.com>
The consistency-mode (-C) path went through BamReader::process_read,
which historically applied BAM_FUNMAP / l_qseq==0 safety checks but
ignored args.filter_flag. The normal-mode (no-C) path applied
`flag & args.filter_flag` in MergedDataWorker::convert_bam1_to_record.

The default --filter-flag of 276 (UNMAP|REV|SEC) is applied in no-C
but ignored in -C. On calls_sorted.bam this caused no-C to drop
3,127,869 reverse-strand records that -C kept.

Before this commit (-C ignoring --filter-flag):
  -C  : 543,351,974 records
  no-C: 540,224,105 records
  -> divergence of 3,127,869 reverse-strand records (REV bit 16)

After this commit (-C now honors --filter-flag):
  -C  : 540,224,105 records
  no-C: 540,224,105 records
  -> multiset-identical

Signed-off-by: donghyuk <donghyuk@genome4me.com>
MergedDataWorker::convert_bam1_to_record is the per-record entry
point of the normal-mode (no-C) path. It applies
`flag & args.filter_flag` and an mv-tag check, but unlike the
consistency-mode (-C) path's BamReader::process_read, it does NOT
guard against unmapped reads or sequence-less records (l_qseq==0,
the BAM-standard form for secondary alignments stored with seq='*').

The default `-g 276` (UNMAP|REV|SEC) happens to mask both via its
4 (UNMAP) and 256 (SECONDARY) bits, so default runs are safe.
However, any user-supplied -g value missing those bits — e.g.
-g 0, -g 4, -g 16 — lets the unsafe records through, and downstream
code crashes when dereferencing the missing reference / sequence:

  -g 0   no-C: SIGSEGV (exit 139)
  -g 4   no-C: SIGSEGV (exit 139)
  -g 16  no-C: SIGSEGV (exit 139)
  -g 256 no-C: OK (l_qseq==0 records blocked via SEC bit)
  -g 276 no-C: OK (default)

Add the same two unconditional guards that BamReader::process_read
already performs on the -C path, before the existing filter_flag
check:

Verified on pod5_8 + calls_sorted.bam:

Default --filter-flag(-g) 276:
  before this commit: 540,224,105 records  h1=cc9e76f7ceccba4d
  after  this commit: 540,224,105 records  h1=cc9e76f7ceccba4d
  -> multiset-identical, no behavioural change for default users

Non-default --filter-flag(-g) (10K subset, previously crashed):
  -g 0   before: SIGSEGV    after: 1,784,476 records (filter off)
  -g 4   before: SIGSEGV    after: 1,784,476 records (UNMAP only)
  -g 16  before: SIGSEGV    after: 1,778,406 records (REV only)

Signed-off-by: donghyuk <donghyuk@genome4me.com>
@donghyuka donghyuka requested a review from vadanamu May 25, 2026 23:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant