A C++23 library for reading and writing SAM/BAM files.
Built on a decade of bioinformatics experience, pbsamoa is inspired by the work of the SAM/BAM Format Specification Working Group and htslib. It focuses on pipelined I/O for maximum throughput, deterministic chunking of inputs, and flexible record filtering. The "open" reflects our commitment to developing this library in the open.
Everything in this library lives in the `PacBio::Samoa` namespace. For brevity, all examples assume:

```cpp
using namespace PacBio::Samoa;
```

pbsamoa provides two record types with different trade-offs:

- `RawRecord` owns the raw BAM bytes and decodes fields on demand. Ideal for high-throughput scanning where you only inspect a few fields per record (filtering, counting, indexing).
- `BamRecord` stores pre-decoded C++ types (`std::string`, `TagMap`, etc.). Use it when you need structured access to many fields, mutation, clipping, or writing records back out.
`BamRawReader` yields `RawRecord` references directly from decompressed BGZF blocks with no up-front decoding.

```cpp
BamRawReader reader{"input.bam"};
int mapped{0};
for (const auto& rec : reader.Records()) { mapped += rec.IsMapped(); }
std::println("Mapped reads: {}", mapped);
```

Enable parallel BGZF decompression for large files:
```cpp
BamRawReader reader{"input.bam", BamRawReaderConfig{.BgzfWorkers = 4}};
```

`BamRecordReader` wraps `BamRawReader` and decodes every record into a `BamRecord` via a background thread pool. Fields like `Sequence()` and `Tags()` are ready to use without per-access decoding cost. The raw BAM bytes are not retained.
```cpp
BamRecordReader reader{"input.bam"};
for (const auto& rec : reader.Records()) {
    if (rec.IsMapped()) {
        std::println("{}\t{}\t{}\t{}", rec.Name(), rec.RefId(), rec.Pos(), rec.MapQ());
    }
}
```

Drop large PacBio pulse tags you don't need:

```cpp
BamRecordReader reader{"input.bam", BamRecordReaderConfig{.TagFilter = DropTags{{"ip", "pw"}}}};
```

Chunking splits a BAM file into N disjoint slices by unique (rgId, zmw) pairs, so multiple processes can work on the same file without overlap. It requires a `.zmi` (new) or `.pbi` (legacy) index alongside the BAM.
```cpp
BamRawReaderConfig cfg{
    .ChunkNum = 2,    // 1-based: this process handles chunk 2
    .TotalChunks = 5, // divide ZMWs into 5 roughly-equal groups
};
BamRawReader reader{"input.bam", cfg};
for (const auto& rec : reader.Records()) {
    // only records belonging to chunk 2's ZMWs
}
```

How it works: the reader opens the ZMW index, gets the file-order list of unique ZMWs, divides them into `TotalChunks` slices with floating-point boundaries (`std::lround`), then seeks the BGZF stream to the first record of the slice and sets a record limit to the exact count. Reading stops automatically when the chunk is exhausted – no per-record index lookups after setup.
```
ZMWs: [10, 20, 30, 40, 50]        5 unique ZMWs, TotalChunks=3

chunk 1: [10, 20]  ← lround(5/3 * 0) .. lround(5/3 * 1) = indices 0..2
chunk 2: [30]      ← lround(5/3 * 1) .. lround(5/3 * 2) = indices 2..3
chunk 3: [40, 50]  ← lround(5/3 * 2) .. lround(5/3 * 3) = indices 3..5
```
Chunking and Whitelist are mutually exclusive – the constructor rejects combining both.
pbsamoa composes three pipeline layers. Each layer runs its own threads and hands data to the next via lock-free SPSC queues, so I/O, decompression, parsing, and decoding all overlap.
```
Disk
 │ IO thread – reads 32 BGZF blocks per batch
 ▼
ThreadPool (BgzfWorkers)
 │ libdeflate decompression, one batch per task
 ▼
Consumer thread – parses raw BAM records from decompressed bytes
 │
 ▼ SPSCQueue<RawRecord> [4096]
 │
BamRawReader::ReadBatch(ByteLimit)
 │ accumulates records into a single contiguous RawRecordBatch
 ▼
Producer thread – dispatches decode tasks to ThreadPool (DecodeWorkers)
 │ each record: RawRecord → BamRecord (field decode + tag filter)
 ▼ SPSCQueue<BamRecord> [4096]
 │
Caller thread – BamRecordReader::ReadRecord()
```
**Layer 1 – BGZF (`BgzfReader`).** Three concurrent stages when `BgzfWorkers > 0`: a dedicated IO thread reads compressed blocks from disk, a thread pool decompresses them with libdeflate, and a consumer thread parses the decompressed bytes into `RawRecord`s. With `BgzfWorkers = 0` everything runs synchronously on the caller thread.
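For example, forcing Layer 1 fully synchronous uses the same `BgzfWorkers` field shown earlier (configuration fragment only):

```cpp
// All BGZF reading, decompression, and record parsing happen on the
// caller thread – useful for debugging or single-core environments.
BamRawReader reader{"input.bam", BamRawReaderConfig{.BgzfWorkers = 0}};
```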
**Layer 2 – Batching (`BamRawReader`).** `ReadBatch(ByteLimit)` drains records from Layer 1 into a contiguous `RawRecordBatch` – one allocation, N extents – up to a memory budget (default 256 MiB). This amortises allocation overhead and gives Layer 3 a batch-sized unit of work.
**Layer 3 – Decoding (`BamRecordReader`).** A background producer thread calls `ReadBatch`, then fans out record-level decoding across `DecodeWorkers` (default 4) threads. Each record is independently converted from raw BAM bytes to a `BamRecord` with structured fields, applying `TagFilter` (drop/keep) during `ToOwned()`. Decoded records are pushed into the output SPSC queue for the caller.
Every queue boundary has a named stall counter (`ioStalls`, `consumerStalls`, `producerStalls`, etc.) accessible from `GetMetrics()`, so you can identify the bottleneck stage without external instrumentation.
```cpp
BamRecordReaderConfig cfg;
cfg.RawReaderConfig.BgzfWorkers = 8; // layer 1: decompression threads
cfg.BatchBudget = 4_MiB;             // layer 2: batch size
cfg.DecodeWorkers = 4;               // layer 3: decode threads
cfg.TagFilter = DropTags{{"ip", "pw"}};
BamRecordReader reader{"input.bam", cfg};
```