perf(delfi): ~80x faster via blacklist preloading and binary search #172
Open

DucoG wants to merge 1 commit into epifluidlab:main
Conversation
The previous implementation reopened and re-parsed the blacklist BED file, opened the BAM via pysam, and opened the 2bit reference for every single 100kb window (~26K windows per whole-genome run). It also did a linear blacklist scan on every fragment. This patch:

* Parses the blacklist once and indexes regions by contig as sorted numpy arrays.
* Uses binary search (np.searchsorted) to filter blacklist regions per window, replacing the per-fragment linear scan.
* Opens one pysam.AlignmentFile and py2bit reference handle per worker process via a multiprocessing.Pool initializer; workers reuse those handles for every window they receive.
* Pre-loads per-contig ContigGaps into a worker-global dict so they are not pickled into every task argument.
* Replaces the per-base Python sum used to count GC bases with the C-level str.count.

Output is bit-identical to the previous implementation (verified by ablation on chr1+chr2 with hg38; see PR description for benchmarks). A new regression test (test_workers_equivalence) asserts that delfi(workers=1) and delfi(workers=N) produce bit-identical results, so future changes to the parallel path are caught immediately.
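The sorted-array blacklist filter can be sketched as follows. All names here are illustrative, not the actual _delfi.py code, and the sketch assumes blacklist regions are non-overlapping (as in the standard ENCODE blacklists), so sorting by start also sorts the stops:

```python
# Build per-contig sorted arrays once, then answer per-fragment overlap
# queries with binary search instead of a linear scan over all regions.
import numpy as np

def index_regions(records):
    """records: iterable of (contig, start, stop) tuples, e.g. parsed once
    from the blacklist BED. Returns {contig: (starts, stops)} sorted by start."""
    by_contig = {}
    for contig, start, stop in records:
        by_contig.setdefault(contig, []).append((int(start), int(stop)))
    indexed = {}
    for contig, ivs in by_contig.items():
        ivs.sort()
        indexed[contig] = (
            np.array([s for s, _ in ivs], dtype=np.int64),
            np.array([e for _, e in ivs], dtype=np.int64),
        )
    return indexed

def in_blacklist(starts, stops, frag_start, frag_stop):
    """True if [frag_start, frag_stop) overlaps any region: O(log n) per
    fragment versus the old O(n) linear scan."""
    # The last region starting before frag_stop is the only candidate that
    # can overlap: earlier regions end even earlier (non-overlapping,
    # start-sorted intervals).
    i = np.searchsorted(starts, frag_stop, side="left")
    return bool(i > 0 and stops[i - 1] > frag_start)
```

Because the arrays are built once up front, the ~26K per-window re-parses disappear and each fragment costs a single binary search.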
Summary
delfi is roughly 80x faster on whole-genome inputs while producing bit-identical output. On a chr1+chr2 hg38 benchmark (4912 100kb bins, 16 worker processes), wall-clock time drops from 643s → 8s and total CPU time from 9845s → 59s.

Motivation
While running DELFI on 211 WGS samples I profiled the worker function and noticed almost all of the per-window time was spent re-reading the blacklist BED file from disk and doing per-fragment linear scans against it. Per sample the existing implementation does ~26K full re-parses of the blacklist file (one per 100kb window) and reopens the BAM and 2bit reference inside every worker call. With many samples processed in parallel on shared/NFS storage this also produces a heavy and mostly-redundant I/O load.
Changes
src/finaletoolkit/frag/_delfi.py:

* Parses the blacklist once into per-contig sorted arrays and filters regions per window with np.searchsorted instead of a linear scan over every fragment.
* Opens the BAM and 2bit reference in a multiprocessing.Pool initializer (one open per worker, reused for every window). The handle is reused by passing the open pysam.AlignmentFile to frag_generator, which already supports it.
* Pre-loads per-contig ContigGaps into worker globals instead of pickling one into every task argument.
* Counts GC bases with str.count('G') + str.count('C') in place of a Python sum over 'G' or 'C' comparisons.

tests/test_delfi.py:

* Adds test_workers_equivalence, which asserts delfi(workers=1) and delfi(workers=4) produce bit-identical output on every column. This guards against any future regression that introduces parallel-only state in the worker pool.

CHANGELOG.md:

* Documents the speedup and the new test under [Unreleased].

The public API (
delfi(...) signature, output schema, output values) is unchanged.

Benchmarks
Hardware: 56-core x86_64, NFS-backed BAM, hg38, 16 workers, identical inputs across runs. Wall and CPU times are means over 5 runs each (σ < 1% in every cell).
| | main | this PR |
| --- | --- | --- |
| Wall time | 643 s | 8 s |
| Total CPU time | 9845 s | 59 s |

Ablation (which optimisation contributes what)
Cumulative speedup as each optimisation is added on top of the previous:
The two dominant wins are preloading the blacklist (eliminates ~26K file reopens) and switching to a sorted-array blacklist filter (eliminates O(blacklist) work per fragment). The remaining three optimisations together are worth ~1.5s wall / ~25s CPU on this input.
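One of the remaining optimisations, the per-worker handle reuse, follows the standard multiprocessing.Pool initializer pattern. A dependency-free sketch (in the PR the initializer opens a pysam.AlignmentFile and a py2bit reference and preloads per-contig ContigGaps; a plain list stands in here, and all names are illustrative):

```python
import multiprocessing as mp

_worker_state = {}  # populated once per worker process, never pickled per task

def _init_worker(data):
    # Runs exactly once in each worker process; this is where the real code
    # would open its BAM/2bit handles and load ContigGaps.
    _worker_state["data"] = data

def _count_in_window(window):
    # Reuses the worker-global state instead of rebuilding it per window.
    start, stop = window
    return sum(start <= x < stop for x in _worker_state["data"])

def run(data, windows, workers=4):
    with mp.Pool(workers, initializer=_init_worker, initargs=(data,)) as pool:
        return pool.map(_count_in_window, windows)
```

With pysam, the initializer would instead store an open pysam.AlignmentFile in the worker-global dict, and that open handle can then be passed to frag_generator for every window, as the PR describes.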
Correctness
* tests/test_delfi.py::test_overall passes unchanged.
* tests/test_delfi.py::test_workers_equivalence verifies workers=1 and workers=4 produce identical output on every column.
* Every output column (contig, start, stop, arm, short, long, gc, num_frags) matches exactly.

Notes
* I also tried a coarser fetch strategy (one pysam.fetch per contig instead of per window) hoping to reduce I/O. It was actually slower in every regime (cold and warm cache) because the existing per-window fetches use the BAI index and skip gap regions, while a per-contig fetch reads the full contig. Not pursued.
* The same per-worker handle pattern should also apply to _end_motifs, _frag_length, _coverage, _wps, and _multi_wps. Happy to send follow-up PRs that extract a small shared _pool helper and apply the same pattern there if there's appetite for it.
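For reference, the "bit-identical on every column" guarantee in test_workers_equivalence boils down to an exact DataFrame comparison. A hypothetical sketch of that comparison (the real test lives in tests/test_delfi.py and calls delfi itself; only the comparison helper is shown here, assuming the output is a pandas DataFrame):

```python
import pandas as pd

def assert_bit_identical(serial: pd.DataFrame, parallel: pd.DataFrame) -> None:
    # e.g. serial = delfi(..., workers=1), parallel = delfi(..., workers=4).
    # check_exact=True forbids any float tolerance, so even a 1-ulp drift in
    # the gc column would fail the test.
    pd.testing.assert_frame_equal(serial, parallel, check_exact=True)
```

This catches parallel-only state immediately: any change that makes worker results depend on scheduling order shows up as a column mismatch.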