perf(delfi): ~80x faster via blacklist preloading and binary search#172

Open
DucoG wants to merge 1 commit into epifluidlab:main from DucoG:perf/delfi-blacklist-and-gc
DucoG commented May 1, 2026

Summary

delfi is roughly 80x faster on whole-genome inputs while producing bit-identical output. On a chr1+chr2 hg38 benchmark (4912 100kb bins, 16 worker processes), wall-clock time drops from 643s → 8s and total CPU time from 9845s → 59s.

Motivation

While running DELFI on 211 WGS samples I profiled the worker function and noticed almost all of the per-window time was spent re-reading the blacklist BED file from disk and doing per-fragment linear scans against it. Per sample the existing implementation does ~26K full re-parses of the blacklist file (one per 100kb window) and reopens the BAM and 2bit reference inside every worker call. With many samples processed in parallel on shared/NFS storage this also produces a heavy and mostly-redundant I/O load.

Changes

src/finaletoolkit/frag/_delfi.py:

  1. Preload the blacklist once and index its regions by contig as sorted numpy arrays. This eliminates ~26K redundant disk reads per sample (see the first sketch after this list).
  2. Binary-search the blacklist per window via np.searchsorted instead of linearly scanning the blacklist regions for every fragment.
  3. Share BAM and 2bit handles per worker via the multiprocessing.Pool initializer (one open per worker, reused for every window). The BAM handle is reused by passing the open pysam.AlignmentFile to frag_generator, which already supports this (see the second sketch after this list).
  4. Preload ContigGaps into worker globals instead of pickling one into every task argument.
  5. Vectorise GC counting using str.count('G') + str.count('C') in place of a Python-level sum over per-base 'G'/'C' comparisons.
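Sketched roughly (not the exact code in this PR), optimisations 1–2 look like the following. _load_blacklist and _window_blacklist are illustrative names, and the overlap query assumes the blacklist regions are non-overlapping, as in the standard ENCODE blacklist:

```python
import numpy as np

def _load_blacklist(bed_path):
    """Parse the blacklist BED once, returning {contig: (starts, stops)}
    with both arrays sorted by start coordinate."""
    regions = {}
    with open(bed_path) as bed:
        for line in bed:
            fields = line.split()
            if len(fields) < 3:
                continue
            regions.setdefault(fields[0], []).append((int(fields[1]), int(fields[2])))
    return {
        contig: tuple(np.asarray(col) for col in zip(*sorted(intervals)))
        for contig, intervals in regions.items()
    }

def _window_blacklist(blacklist, contig, window_start, window_stop):
    """Return only the blacklist regions overlapping one window, found by
    binary search rather than a scan of the whole blacklist per fragment."""
    if contig not in blacklist:
        return np.empty(0, dtype=int), np.empty(0, dtype=int)
    starts, stops = blacklist[contig]
    lo = np.searchsorted(stops, window_start, side="right")  # first region ending after window_start
    hi = np.searchsorted(starts, window_stop, side="left")   # regions starting before window_stop
    return starts[lo:hi], stops[lo:hi]
```

Each fragment is then checked only against the handful of candidate regions returned for its window instead of the full blacklist.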
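Optimisations 3–5, likewise sketched with illustrative names (_WORKER, _init_worker, _process_window) and with the real per-window fragment logic elided:

```python
import multiprocessing as mp
import pysam
import py2bit

_WORKER = {}  # populated once per worker process by the Pool initializer

def _init_worker(bam_path, twobit_path, contig_gaps):
    # one open per worker, reused for every window this worker receives
    _WORKER["bam"] = pysam.AlignmentFile(bam_path)
    _WORKER["ref"] = py2bit.open(twobit_path)
    _WORKER["gaps"] = contig_gaps  # no longer pickled into every task argument

def _gc_count(seq):
    # C-level str.count instead of a Python-level sum over per-base comparisons
    seq = seq.upper()
    return seq.count("G") + seq.count("C")

def _process_window(window):
    contig, start, stop = window
    # the real implementation fetches fragments via the shared _WORKER["bam"]
    # handle and filters them against the preloaded blacklist; elided here
    seq = _WORKER["ref"].sequence(contig, start, stop)
    return contig, start, stop, _gc_count(seq)

def _run_windows(bam_path, twobit_path, contig_gaps, windows, workers):
    with mp.Pool(workers, initializer=_init_worker,
                 initargs=(bam_path, twobit_path, contig_gaps)) as pool:
        return pool.map(_process_window, windows)
```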

tests/test_delfi.py:

  • Adds test_workers_equivalence, which asserts that delfi(workers=1) and delfi(workers=4) produce bit-identical output on every column (sketched below). This guards against any future regression that introduces parallel-only state in the worker pool.
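A hedged sketch of that test's shape, assuming delfi is importable from finaletoolkit.frag and returns a pandas DataFrame when no output file is given; the fixture arguments are placeholders, not the real paths used in tests/test_delfi.py:

```python
import pandas as pd
from finaletoolkit.frag import delfi

def test_workers_equivalence(test_bam, test_2bit, test_blacklist):
    serial = delfi(test_bam, test_2bit, test_blacklist, workers=1)
    parallel = delfi(test_bam, test_2bit, test_blacklist, workers=4)
    # check_exact=True: every column must be bit-identical, not merely close
    pd.testing.assert_frame_equal(serial, parallel, check_exact=True)
```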

CHANGELOG.md: documents the speedup and the new test under [Unreleased].

The public API (delfi(...) signature, output schema, output values) is unchanged.

Benchmarks

Hardware: 56-core x86_64, NFS-backed BAM, hg38, 16 workers, identical inputs across runs. Wall and CPU times are means over 5 runs each (σ < 1% in every cell).

| Implementation | Wall (s) | CPU (s) | Speedup (wall) |
| --- | --- | --- | --- |
| Upstream (current main) | 643 | 9845 | — |
| This PR | 8.0 | 59.3 | 80× |

Ablation (which optimisation contributes what)

Cumulative speedup as each optimisation is added on top of the previous:

| Config | Wall (s) | CPU (s) | Speedup |
| --- | --- | --- | --- |
| baseline | 643 | 9845 | — |
| + preload blacklist | 96 | 1378 | 6.7× |
| + sorted-array blacklist | 9.6 | 85 | 67× |
| + reuse BAM/ref handle | 8.5 | 65 | 76× |
| + preload contig gaps | 8.0 | 60 | 80× |
| + vectorised GC count | 8.0 | 59 | 80× |

The two dominant wins are preloading the blacklist (eliminates ~26K file reopens) and switching to a sorted-array blacklist filter (eliminates O(blacklist) work per fragment). The remaining three optimisations together are worth ~1.5s wall / ~25s CPU on this input.

Correctness

  • Existing tests/test_delfi.py::test_overall passes unchanged.
  • New tests/test_delfi.py::test_workers_equivalence verifies workers=1 and workers=4 produce identical output on every column.
  • Independently verified on a real 211-sample WGS dataset by running the upstream implementation and this PR side-by-side and diffing the output TSVs: every value of every column (contig, start, stop, arm, short, long, gc, num_frags) matches exactly.

Notes

  • I considered an alternative restructuring (one task per contig with a single pysam.fetch per contig instead of per window) hoping to reduce I/O. It was actually slower in every regime (cold and warm cache) because the existing per-window fetches use the BAI index and skip gap regions, while a per-contig fetch reads the full contig. Not pursued.
  • The same patterns (per-task BAM/refseq reopens, etc.) exist in _end_motifs, _frag_length, _coverage, _wps, and _multi_wps. Happy to send follow-up PRs that extract a small shared _pool helper and apply the same pattern there if there's appetite for it.

Commit message

The previous implementation reopened and re-parsed the blacklist BED
file, opened the BAM via pysam, and opened the 2bit reference for every
single 100kb window (~26K windows per whole-genome run). It also did a
linear blacklist scan on every fragment.

This patch:

* Parses the blacklist once and indexes regions by contig as sorted
  numpy arrays.
* Uses binary search (np.searchsorted) to filter blacklist regions per
  window, replacing the per-fragment linear scan.
* Opens one pysam.AlignmentFile and py2bit reference handle per worker
  process via a multiprocessing.Pool initializer; workers reuse those
  handles for every window they receive.
* Pre-loads per-contig ContigGaps into a worker-global dict so they are
  not pickled into every task argument.
* Replaces the per-base Python sum used to count GC bases with the
  C-level str.count.

Output is bit-identical to the previous implementation (verified on the
chr1+chr2 hg38 benchmark; see the PR description for benchmarks and the
ablation).

A new regression test (test_workers_equivalence) asserts that
delfi(workers=1) and delfi(workers=N) produce bit-identical results, so
future changes to the parallel path are caught immediately.