perf(delfi): ~80x faster via blacklist preloading and binary search#172

Open
DucoG wants to merge 1 commit into epifluidlab:main from DucoG:perf/delfi-blacklist-and-gc
DucoG commented May 1, 2026

Summary

delfi is roughly 80x faster on whole-genome inputs while producing bit-identical output. On a chr1+chr2 hg38 benchmark (4912 100kb bins, 16 worker processes), wall-clock time drops from 643s → 8s and total CPU time from 9845s → 59s.

Motivation

While running DELFI on 211 WGS samples I profiled the worker function and noticed almost all of the per-window time was spent re-reading the blacklist BED file from disk and doing per-fragment linear scans against it. Per sample the existing implementation does ~26K full re-parses of the blacklist file (one per 100kb window) and reopens the BAM and 2bit reference inside every worker call. With many samples processed in parallel on shared/NFS storage this also produces a heavy and mostly-redundant I/O load.

Changes

src/finaletoolkit/frag/_delfi.py:

  1. Preload the blacklist once and index its regions by contig as sorted numpy arrays. This eliminates ~26K redundant disk reads per sample (see the first sketch after this list).
  2. Binary-search the blacklist per window via np.searchsorted instead of linearly scanning the blacklist regions for every fragment.
  3. Share BAM and 2bit handles per worker via the multiprocessing.Pool initializer (one open per worker, reused for every window). The BAM handle is reused by passing the open pysam.AlignmentFile to frag_generator, which already supports this (see the second sketch after this list).
  4. Preload ContigGaps into worker globals instead of pickling one into every task argument.
  5. Vectorise GC counting using str.count('G') + str.count('C') in place of a Python-level sum over per-base 'G'/'C' comparisons.
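Sketched roughly (not the exact code in this PR), optimisations 1–2 look like the following. _load_blacklist and _window_blacklist are illustrative names, and the overlap query assumes the blacklist regions are non-overlapping, as in the standard ENCODE blacklist:

```python
import numpy as np

def _load_blacklist(bed_path):
    """Parse the blacklist BED once, returning {contig: (starts, stops)}
    with both arrays sorted by start coordinate."""
    regions = {}
    with open(bed_path) as bed:
        for line in bed:
            fields = line.split()
            if len(fields) < 3:
                continue
            regions.setdefault(fields[0], []).append((int(fields[1]), int(fields[2])))
    return {
        contig: tuple(np.asarray(col) for col in zip(*sorted(intervals)))
        for contig, intervals in regions.items()
    }

def _window_blacklist(blacklist, contig, window_start, window_stop):
    """Return only the blacklist regions overlapping one window, found by
    binary search rather than a scan of the whole blacklist per fragment."""
    if contig not in blacklist:
        return np.empty(0, dtype=int), np.empty(0, dtype=int)
    starts, stops = blacklist[contig]
    lo = np.searchsorted(stops, window_start, side="right")  # first region ending after window_start
    hi = np.searchsorted(starts, window_stop, side="left")   # regions starting before window_stop
    return starts[lo:hi], stops[lo:hi]
```

Each fragment is then checked only against the handful of candidate regions returned for its window instead of the full blacklist.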
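Optimisations 3–5, likewise sketched with illustrative names (_WORKER, _init_worker, _process_window) and with the real per-window fragment logic elided:

```python
import multiprocessing as mp
import pysam
import py2bit

_WORKER = {}  # populated once per worker process by the Pool initializer

def _init_worker(bam_path, twobit_path, contig_gaps):
    # one open per worker, reused for every window this worker receives
    _WORKER["bam"] = pysam.AlignmentFile(bam_path)
    _WORKER["ref"] = py2bit.open(twobit_path)
    _WORKER["gaps"] = contig_gaps  # no longer pickled into every task argument

def _gc_count(seq):
    # C-level str.count instead of a Python-level sum over per-base comparisons
    seq = seq.upper()
    return seq.count("G") + seq.count("C")

def _process_window(window):
    contig, start, stop = window
    # the real implementation fetches fragments via the shared _WORKER["bam"]
    # handle and filters them against the preloaded blacklist; elided here
    seq = _WORKER["ref"].sequence(contig, start, stop)
    return contig, start, stop, _gc_count(seq)

def _run_windows(bam_path, twobit_path, contig_gaps, windows, workers):
    with mp.Pool(workers, initializer=_init_worker,
                 initargs=(bam_path, twobit_path, contig_gaps)) as pool:
        return pool.map(_process_window, windows)
```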

tests/test_delfi.py:

  • Adds test_workers_equivalence, which asserts that delfi(workers=1) and delfi(workers=4) produce bit-identical output on every column (sketched below). This guards against any future regression that introduces parallel-only state in the worker pool.
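A hedged sketch of that test's shape, assuming delfi is importable from finaletoolkit.frag and returns a pandas DataFrame when no output file is given; the fixture arguments are placeholders, not the real paths used in tests/test_delfi.py:

```python
import pandas as pd
from finaletoolkit.frag import delfi

def test_workers_equivalence(test_bam, test_2bit, test_blacklist):
    serial = delfi(test_bam, test_2bit, test_blacklist, workers=1)
    parallel = delfi(test_bam, test_2bit, test_blacklist, workers=4)
    # check_exact=True: every column must be bit-identical, not merely close
    pd.testing.assert_frame_equal(serial, parallel, check_exact=True)
```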

CHANGELOG.md: documents the speedup and the new test under [Unreleased].

The public API (delfi(...) signature, output schema, output values) is unchanged.

Benchmarks

Hardware: 56-core x86_64, NFS-backed BAM, hg38, 16 workers, identical inputs across runs. Wall and CPU times are means over 5 runs each (σ < 1% in every cell).

| Implementation | Wall (s) | CPU (s) | Speedup (wall) |
| --- | --- | --- | --- |
| Upstream (current main) | 643 | 9845 | — |
| This PR | 8.0 | 59.3 | 80× |

Ablation (which optimisation contributes what)

Cumulative speedup as each optimisation is added on top of the previous:

| Config | Wall (s) | CPU (s) | Speedup |
| --- | --- | --- | --- |
| baseline | 643 | 9845 | — |
| + preload blacklist | 96 | 1378 | 6.7× |
| + sorted-array blacklist | 9.6 | 85 | 67× |
| + reuse BAM/ref handle | 8.5 | 65 | 76× |
| + preload contig gaps | 8.0 | 60 | 80× |
| + vectorised GC count | 8.0 | 59 | 80× |

The two dominant wins are preloading the blacklist (eliminates ~26K file reopens) and switching to a sorted-array blacklist filter (eliminates O(blacklist) work per fragment). The remaining three optimisations together are worth ~1.5s wall / ~25s CPU on this input.

Correctness

  • Existing tests/test_delfi.py::test_overall passes unchanged.
  • New tests/test_delfi.py::test_workers_equivalence verifies workers=1 and workers=4 produce identical output on every column.
  • Independently verified on a real 211-sample WGS dataset by running the upstream implementation and this PR side-by-side and diffing the output TSVs: every value of every column (contig, start, stop, arm, short, long, gc, num_frags) matches exactly.

Notes

  • I considered an alternative restructuring (one task per contig with a single pysam.fetch per contig instead of per window) hoping to reduce I/O. It was actually slower in every regime (cold and warm cache) because the existing per-window fetches use the BAI index and skip gap regions, while a per-contig fetch reads the full contig. Not pursued.
  • The same patterns (per-task BAM/refseq reopens, etc.) exist in _end_motifs, _frag_length, _coverage, _wps, and _multi_wps. Happy to send follow-up PRs that extract a small shared _pool helper and apply the same pattern there if there's appetite for it.

Commit message

The previous implementation reopened and re-parsed the blacklist BED
file, opened the BAM via pysam, and opened the 2bit reference for every
single 100kb window (~26K windows per whole-genome run). It also did a
linear blacklist scan on every fragment.

This patch:

* Parses the blacklist once and indexes regions by contig as sorted
  numpy arrays.
* Uses binary search (np.searchsorted) to filter blacklist regions per
  window, replacing the per-fragment linear scan.
* Opens one pysam.AlignmentFile and py2bit reference handle per worker
  process via a multiprocessing.Pool initializer; workers reuse those
  handles for every window they receive.
* Pre-loads per-contig ContigGaps into a worker-global dict so they are
  not pickled into every task argument.
* Replaces the per-base Python sum used to count GC bases with the
  C-level str.count.

Output is bit-identical to the previous implementation (verified on the
chr1+chr2 hg38 benchmark; see the PR description for benchmarks and the
ablation).

A new regression test (test_workers_equivalence) asserts that
delfi(workers=1) and delfi(workers=N) produce bit-identical results, so
future changes to the parallel path are caught immediately.