
pbsamoa — SAM Open Alternative

A C++23 library for reading and writing SAM/BAM files.

This library is an early draft; see the documentation for details.

Overview

Built on a decade of bioinformatics experience, pbsamoa is inspired by the work of the SAM/BAM Format Specification Working Group and htslib. It focuses on pipelined I/O for maximum throughput, deterministic chunking of inputs, and flexible record filtering. The "open" reflects our commitment to developing this library in the open.

Quick Start

Everything in this library lives in the PacBio::Samoa namespace. For brevity, all examples assume:

using namespace PacBio::Samoa;

RawRecord vs BamRecord

pbsamoa provides two record types with different trade-offs:

  • RawRecord owns the raw BAM bytes and decodes fields on demand. Ideal for high-throughput scanning where you only inspect a few fields per record (filtering, counting, indexing).
  • BamRecord stores pre-decoded C++ types (std::string, TagMap, etc.). Use it when you need structured access to many fields, or when you mutate, clip, or write records back out.

BamRawReader – zero-copy iteration

BamRawReader yields RawRecord references directly from decompressed BGZF blocks with no up-front decoding.

BamRawReader reader{"input.bam"};

int mapped{0};
for (const auto& rec : reader.Records()) { mapped += rec.IsMapped(); }

std::println("Mapped reads: {}", mapped);

Enable parallel BGZF decompression for large files:

BamRawReader reader{"input.bam", BamRawReaderConfig{.BgzfWorkers = 4}};

BamRecordReader – pre-decoded records

BamRecordReader wraps BamRawReader and decodes every record into a BamRecord via a background thread pool. Fields like Sequence() and Tags() are ready to use without per-access decoding cost; the raw decompressed bytes are discarded after decoding.

BamRecordReader reader{"input.bam"};

for (const auto& rec : reader.Records()) {
    if (rec.IsMapped()) {
        std::println("{}\t{}\t{}\t{}", rec.Name(), rec.RefId(), rec.Pos(), rec.MapQ());
    }
}

Drop large PacBio pulse tags you don't need:

BamRecordReader reader{"input.bam", BamRecordReaderConfig{.TagFilter = DropTags{{"ip", "pw"}}}};

Chunking – parallel-by-ZMW sharding

Chunking splits a BAM file into N disjoint slices by unique (rgId, zmw) pairs, so multiple processes can work on the same file without overlap. It requires a .zmi (new) or .pbi (legacy) index alongside the BAM.

BamRawReaderConfig cfg{
  .ChunkNum = 2,      // 1-based: this process handles chunk 2
  .TotalChunks = 5,   // divide ZMWs into 5 roughly-equal groups
};
BamRawReader reader{"input.bam", cfg};

for (const auto& rec : reader.Records()) {
    // only records belonging to chunk 2's ZMWs
}

How it works. The reader opens the ZMW index, gets the file-order list of unique ZMWs, divides them into TotalChunks slices with floating-point boundaries (std::lround), then seeks the BGZF stream to the first record of the slice and sets a record limit to the exact count. Reading stops automatically when the chunk is exhausted – no per-record index lookups after setup.

ZMWs:    [10, 20, 30, 40, 50]    5 unique ZMWs, TotalChunks=3
chunk 1: [10, 20]   ← lround(5/3 * 0) .. lround(5/3 * 1) = indices 0..2
chunk 2: [30]       ← lround(5/3 * 1) .. lround(5/3 * 2) = indices 2..3
chunk 3: [40, 50]   ← lround(5/3 * 2) .. lround(5/3 * 3) = indices 3..5
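The boundary math above can be sketched as a small self-contained helper. ChunkBounds is a hypothetical name for illustration, not part of the pbsamoa API:

```cpp
#include <cmath>
#include <utility>

// Chunk i (1-based) of totalChunks covers the half-open range of
// unique-ZMW indices [lround(n/total * (i-1)), lround(n/total * i)).
std::pair<long, long> ChunkBounds(long numZmws, int chunkNum, int totalChunks)
{
    const double perChunk = static_cast<double>(numZmws) / totalChunks;
    return {std::lround(perChunk * (chunkNum - 1)),
            std::lround(perChunk * chunkNum)};
}
```

For 5 unique ZMWs and TotalChunks = 3 this yields index ranges 0..2, 2..3, and 3..5, matching the sketch above.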

Chunking and Whitelist are mutually exclusive – the constructor rejects combining both.

Read Pipeline – layered stages for maximum throughput

pbsamoa composes three pipeline layers. Each layer runs its own threads and hands data to the next via lock-free SPSC queues, so I/O, decompression, parsing, and decoding all overlap.

 Disk
  │  IO thread – reads 32 BGZF blocks per batch
  ▼
 ThreadPool (BgzfWorkers)
  │  libdeflate decompression, one batch per task
  ▼
 Consumer thread – parses raw BAM records from decompressed bytes
  │
  ▼  SPSCQueue<RawRecord>  [4096]
  │
 BamRawReader::ReadBatch(ByteLimit)
  │  accumulates records into a single contiguous RawRecordBatch
  ▼
 Producer thread – dispatches decode tasks to ThreadPool (DecodeWorkers)
  │  each record: RawRecord → BamRecord (field decode + tag filter)
  ▼  SPSCQueue<BamRecord>  [4096]
  │
 Caller thread – BamRecordReader::ReadRecord()

Layer 1 – BGZF (BgzfReader). Three concurrent stages when BgzfWorkers > 0: a dedicated IO thread reads compressed blocks from disk, a thread pool decompresses them with libdeflate, and a consumer thread parses the decompressed bytes into RawRecords. With BgzfWorkers = 0 everything runs synchronously on the caller thread.

Layer 2 – Batching (BamRawReader). ReadBatch(ByteLimit) drains records from Layer 1 into a contiguous RawRecordBatch – one allocation, N extents – up to a memory budget (default 256 MiB). This amortises allocation overhead and gives Layer 3 a batch-sized unit of work.
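The layer-2 idea – one contiguous allocation holding many variable-length records, each addressed by an extent – can be illustrated with a simplified sketch. All names here are hypothetical; pbsamoa's RawRecordBatch is more elaborate:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

struct Extent { std::size_t offset; std::size_t length; };

struct Batch
{
    std::vector<std::uint8_t> bytes;  // single contiguous allocation
    std::vector<Extent> extents;      // one (offset, length) entry per record
};

// Copy records into one buffer until the byte budget is reached.
// The first record is always taken so a batch is never empty.
Batch FillBatch(const std::vector<std::vector<std::uint8_t>>& records,
                std::size_t byteLimit)
{
    Batch batch;
    for (const auto& rec : records) {
        if (batch.bytes.size() + rec.size() > byteLimit && !batch.extents.empty())
            break;  // budget exhausted; remaining records go to the next batch
        batch.extents.push_back({batch.bytes.size(), rec.size()});
        batch.bytes.insert(batch.bytes.end(), rec.begin(), rec.end());
    }
    return batch;
}
```

Downstream consumers then index individual records via the extents without any per-record allocation.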

Layer 3 – Decoding (BamRecordReader). A background producer thread calls ReadBatch, then fans out record-level decoding across DecodeWorkers (default 4) threads. Each record is independently converted from raw BAM bytes to a BamRecord with structured fields, applying TagFilter (drop/keep) during ToOwned(). Decoded records are pushed into the output SPSC queue for the caller.
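The record-level fan-out can be sketched with plain std::thread and stride partitioning. This is a simplification – pbsamoa dispatches to a shared thread pool and hands results over SPSC queues – and "decode" here is just a stand-in for the RawRecord → BamRecord conversion:

```cpp
#include <cstddef>
#include <string>
#include <thread>
#include <vector>

// Split a batch across decodeWorkers threads; each worker decodes the
// records at indices w, w + N, w + 2N, ... so no two threads touch the
// same output slot.
std::vector<std::string> DecodeBatch(const std::vector<int>& rawRecords,
                                     int decodeWorkers)
{
    std::vector<std::string> decoded(rawRecords.size());
    std::vector<std::thread> workers;
    for (int w = 0; w < decodeWorkers; ++w) {
        workers.emplace_back([&, w] {
            for (std::size_t i = w; i < rawRecords.size(); i += decodeWorkers)
                decoded[i] = "rec:" + std::to_string(rawRecords[i]);  // stand-in decode
        });
    }
    for (auto& t : workers) t.join();
    return decoded;
}
```

Because each record decodes independently, throughput scales with DecodeWorkers until an adjacent stage (BGZF decompression or the caller) becomes the bottleneck – which is what the stall counters below are for.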

Every queue boundary has a named stall counter (ioStalls, consumerStalls, producerStalls, etc.) accessible from GetMetrics(), so you can identify the bottleneck stage without external instrumentation.

BamRecordReaderConfig cfg;
cfg.RawReaderConfig.BgzfWorkers = 8;  // layer 1: decompression threads
cfg.BatchBudget = 4_MiB;              // layer 2: batch size
cfg.DecodeWorkers = 4;                // layer 3: decode threads
cfg.TagFilter = DropTags{{"ip", "pw"}};
BamRecordReader reader{"input.bam", cfg};
