POD5 Accelerator

High-performance POD5 file reader for Oxford Nanopore sequencing data with multi-threading and zero-copy optimizations.

Overview

POD5 Accelerator is a performance optimization suite that speeds up reading and processing of POD5 files generated by Oxford Nanopore sequencing platforms. It combines multi-threading, batching, and zero-copy signal access, and targets a throughput improvement of at least 40% over a standard single-threaded reader.

What's New

  • Complete Benchmarking Suite - Comprehensive performance comparison tools
  • Signal Processing Utilities - Nanopore signal preprocessing tools
  • Synthetic Data Generation - Create test datasets without real sequencing data
  • Comprehensive Testing - Full pytest test suite with >80% coverage
  • Interactive Demo - Rich CLI demonstration with visualizations

What is POD5?

POD5 is Oxford Nanopore Technologies' modern file format for storing nanopore sequencing data. Built on Apache Arrow, POD5 offers:

  • Efficient storage using columnar data structures
  • Fast random access to individual reads
  • Better compression compared to legacy FAST5 format
  • Zero-copy operations for high-performance data processing
  • Multi-file aggregation for large-scale sequencing runs

POD5 files are generated by MinKNOW, Oxford Nanopore's sequencing software, and contain raw signal data from nanopore sequencers (MinION, GridION, PromethION).

Motivation

In high-throughput nanopore sequencing workflows, I/O performance becomes a critical bottleneck when processing large datasets. Traditional single-threaded POD5 readers cannot fully utilize modern multi-core systems, especially when:

  • Processing multiple POD5 files from a single sequencing run
  • Performing real-time basecalling or signal analysis
  • Running bioinformatics pipelines on HPC clusters

This project demonstrates how multi-threading and zero-copy techniques can significantly accelerate POD5 data processing.

Performance Goals

  • 40%+ throughput improvement compared to baseline single-threaded reader
  • Memory-efficient streaming using generator patterns
  • Scalable parallel processing across multiple files
  • Zero-copy operations to minimize memory overhead

Features

Core Readers

AcceleratedPOD5Reader

  • Multi-threaded file reading using ThreadPoolExecutor
  • Batch processing with configurable batch sizes
  • Generator-based streaming for memory efficiency
  • Zero-copy signal access directly from POD5 Arrow tables
  • Performance metrics tracking (throughput, reads/sec)
  • Parallel multi-file processing

BaselinePOD5Reader

  • Single-threaded implementation for performance comparison
  • Same interface for fair benchmarking
  • Metrics tracking for baseline measurements

Benchmarking Suite (POD5Benchmark)

Comprehensive benchmarking tools for performance evaluation:

  • Automated benchmarking - Run baseline and accelerated comparisons
  • Multiple metrics - Time, throughput, memory usage
  • Thread scalability - Test performance across thread counts
  • Visualization - Generate comparison plots (bar charts, line plots)
  • Data export - Save results to CSV for analysis
  • Statistical analysis - Mean, std dev, percentage improvements
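
The timing and summary statistics listed above can be approximated with the standard library alone. The following is a minimal sketch, not the POD5Benchmark implementation; `time_runs` and `summarize` are hypothetical helper names used only for illustration.

```python
import statistics
import time


def time_runs(func, repeats=5):
    """Call func() `repeats` times and return the wall-clock durations in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        durations.append(time.perf_counter() - start)
    return durations


def summarize(durations):
    """Return the mean and sample standard deviation of a list of durations."""
    mean = statistics.mean(durations)
    # stdev needs at least two samples; report 0.0 for a single run.
    std = statistics.stdev(durations) if len(durations) > 1 else 0.0
    return {"mean": mean, "std": std, "runs": len(durations)}


if __name__ == "__main__":
    timings = time_runs(lambda: sum(range(100_000)), repeats=5)
    print(summarize(timings))
```

Repeating each measurement and reporting the spread helps separate real speedups from run-to-run noise such as disk caching.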

Signal Processing (SignalProcessor)

Optimized nanopore signal preprocessing utilities:

  • normalize_signal() - Z-score normalization for standardization
  • filter_signal() - Low-pass Butterworth filter for noise removal
  • compute_statistics() - Mean, std, min, max, median, range
  • detect_events() - Simple threshold-based event detection
  • Vectorized operations - NumPy-optimized for performance
  • In-place processing - Minimize memory copies

Synthetic Data Generation (SyntheticPOD5Generator)

Create test datasets for development and testing:

  • Realistic signals - Mimic nanopore current measurements (90-110 pA)
  • Configurable parameters - Signal length, noise level, step patterns
  • Multi-file datasets - Generate multiple POD5 files
  • Reproducible - Seeded random generation for testing
  • Fast generation - Quick dataset creation for CI/CD

⚠️ Note: Synthetic data is for testing only, not biological analysis.
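
As a rough illustration of seeded, step-shaped signals in the 90-110 pA range, the sketch below uses only the standard library. It is not the SyntheticPOD5Generator implementation: `synthetic_signal` and its parameters `noise_pa` and `step_every` are hypothetical names, and it returns a plain list rather than writing a POD5 file.

```python
import random


def synthetic_signal(length=5000, seed=None, noise_pa=1.0, step_every=500):
    """Return `length` current samples (pA) that step between levels in 90-110 pA."""
    rng = random.Random(seed)  # a private generator keeps results reproducible
    samples = []
    level = rng.uniform(90.0, 110.0)
    for i in range(length):
        # Jump to a new current level at regular intervals.
        if i > 0 and i % step_every == 0:
            level = rng.uniform(90.0, 110.0)
        samples.append(level + rng.gauss(0.0, noise_pa))
    return samples


if __name__ == "__main__":
    signal = synthetic_signal(length=2000, seed=42)
    print(min(signal), max(signal))
```

Passing the same seed yields the same signal, which keeps tests deterministic.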

Project Structure

pod5-accelerator/
├── pod5_accelerator/          # Main package
│   ├── __init__.py
│   ├── core/                  # Core implementations
│   │   ├── __init__.py
│   │   ├── accelerated_reader.py   # Multi-threaded reader
│   │   ├── baseline_reader.py      # Single-threaded reader
│   │   ├── signal_processor.py     # Signal processing utilities
│   │   └── synthetic_generator.py  # Test data generation
│   └── benchmarks/            # Benchmarking suite
│       ├── __init__.py
│       └── benchmark.py       # POD5Benchmark class
├── tests/                     # Comprehensive test suite
│   ├── __init__.py
│   └── test_readers.py        # pytest unit tests
├── data/                      # POD5 files (auto-generated)
├── results/                   # Benchmark outputs
├── main.py                    # Interactive demo script
├── setup.py                   # Package configuration
├── requirements.txt           # Dependencies
└── README.md

Installation

Prerequisites

  • Python 3.8 or higher
  • pip package manager

Install from source

git clone https://github.com/aboderinsamuel/pod5-accelerator.git
cd pod5-accelerator
pip install -r requirements.txt
pip install -e .

Optional Dependencies

For enhanced UI features:

pip install tqdm rich  # Progress bars and formatted tables

Quick Start

Run the Interactive Demo

The easiest way to see POD5 Accelerator in action:

python main.py

This will:

  1. Generate synthetic test data (if needed)
  2. Run baseline and accelerated benchmarks
  3. Display results in formatted tables
  4. Generate comparison plots
  5. Save detailed results to CSV

Command-line Options

python main.py --help                    # Show all options
python main.py --num-threads 16          # Test with 16 threads
python main.py --generate-data           # Force new test data
python main.py --data-dir ./custom_data  # Use custom data directory
python main.py --output-dir ./results    # Save results to custom directory

Usage

Basic Reader Usage

from pod5_accelerator import AcceleratedPOD5Reader

# Create reader with 8 worker threads
reader = AcceleratedPOD5Reader(num_threads=8)

# Read single file in batches
for batch in reader.read_file_batch("data/sample.pod5", batch_size=1000):
    for read_data in batch:
        read_id = read_data['read_id']
        signal = read_data['signal']  # Zero-copy numpy array
        # Process signal data...

# Get performance statistics
stats = reader.get_stats()
print(f"Processed {stats['reads_processed']} reads")
print(f"Throughput: {stats['throughput']:.2f} reads/sec")

Multi-File Processing

from pathlib import Path
from pod5_accelerator import AcceleratedPOD5Reader

# Process multiple POD5 files in parallel
reader = AcceleratedPOD5Reader(num_threads=8)
file_paths = list(Path("data/").glob("*.pod5"))

# Parallel processing across files
# (process_signal stands in for your own per-read function)
for read_data in reader.read_multiple_files(file_paths):
    process_signal(read_data['signal'])

Benchmarking

from pod5_accelerator import POD5Benchmark

# Initialize benchmark suite
benchmark = POD5Benchmark(data_dir="./data")

# Run comparative benchmark
file_paths = ["data/file1.pod5", "data/file2.pod5"]
results_df = benchmark.run_comparative_benchmark(
    file_paths=file_paths,
    thread_counts=[2, 4, 8, 16]
)

# Calculate improvements
improvements = benchmark.calculate_improvements()
print(f"Throughput improvement: {improvements['throughput_improvement']:.1f}%")

# Generate visualizations
benchmark.plot_results("results/benchmark_comparison.png")

# Save detailed results
benchmark.save_results("results/benchmark_data.csv")

Signal Processing

from pod5_accelerator import SignalProcessor

# Signal from a read yielded by a reader (see Basic Reader Usage)
signal = read_data['signal']

# Normalize signal (z-score)
normalized = SignalProcessor.normalize_signal(signal)

# Apply low-pass filter
filtered = SignalProcessor.filter_signal(normalized, cutoff_freq=0.1)

# Compute statistics
stats = SignalProcessor.compute_statistics(filtered)
print(f"Mean: {stats['mean']:.2f}, Std: {stats['std']:.2f}")

# Detect events
events = SignalProcessor.detect_events(filtered, threshold=2.5)
print(f"Detected {len(events)} events")

Generate Test Data

from pod5_accelerator import SyntheticPOD5Generator

# Create generator
generator = SyntheticPOD5Generator(seed=42)

# Generate single file
generator.save_to_pod5("test.pod5", num_reads=1000)

# Generate multi-file dataset
generator.create_test_dataset(
    output_dir="./test_data",
    num_files=5,
    reads_per_file=1000
)

Testing

Run the comprehensive test suite:

# Run all tests
pytest tests/ -v

# Run with coverage report
pytest tests/ --cov=pod5_accelerator --cov-report=html

# Run specific test class
pytest tests/test_readers.py::TestAcceleratedPOD5Reader -v

# Run tests matching pattern
pytest tests/ -k "test_performance" -v

Expected coverage: >80% across all modules

Benchmarking Results

Example performance comparison (system-dependent):

Method       Threads  Throughput (reads/sec)  Improvement
Baseline     1        8,234                   -
Accelerated  2        14,521                  +76.4%
Accelerated  4        24,188                  +193.8%
Accelerated  8        35,642                  +332.9%

Results vary based on CPU cores, storage speed, and file characteristics
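
The improvement column is the relative gain over the baseline throughput, (accelerated / baseline - 1) × 100. A short sketch of that arithmetic, using the example figures from the table (`improvement_pct` is an illustrative helper, not part of the package):

```python
def improvement_pct(baseline, accelerated):
    """Percentage throughput gain of `accelerated` over `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline throughput must be positive")
    return (accelerated / baseline - 1.0) * 100.0


if __name__ == "__main__":
    baseline = 8234  # reads/sec, single-threaded example figure
    for threads, throughput in [(2, 14521), (4, 24188), (8, 35642)]:
        gain = improvement_pct(baseline, throughput)
        print(f"{threads} threads: {gain:+.1f}%")
```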

Visualization

Benchmark results include:

  • Throughput comparison - Bar chart comparing methods
  • Thread scaling - Line plot showing scalability
  • Memory usage - Memory overhead comparison

Example output: results/benchmark_comparison.png

API Documentation

AcceleratedPOD5Reader

class AcceleratedPOD5Reader(num_threads=4)

Methods:

  • read_file_batch(file_path, batch_size=1000) - Read single file in batches
  • read_multiple_files(file_paths, batch_size=1000) - Read multiple files in parallel
  • get_stats() - Get performance statistics

POD5Benchmark

class POD5Benchmark(data_dir='./data')

Methods:

  • run_baseline_benchmark(file_path, batch_size=1000) - Benchmark baseline reader
  • run_accelerated_benchmark(file_path, num_threads=4, batch_size=1000) - Benchmark accelerated reader
  • run_comparative_benchmark(file_paths, thread_counts=[2,4,8]) - Run comprehensive comparison
  • calculate_improvements() - Compute percentage improvements
  • plot_results(output_path) - Generate comparison plots
  • save_results(output_path) - Save results to CSV

SignalProcessor

class SignalProcessor

Static Methods:

  • normalize_signal(raw_signal) - Z-score normalization
  • filter_signal(raw_signal, cutoff_freq=0.1, order=4) - Low-pass Butterworth filter
  • compute_statistics(signal_data) - Calculate mean, std, min, max, median, range
  • detect_events(signal_data, threshold=2.0) - Threshold-based event detection

SyntheticPOD5Generator

class SyntheticPOD5Generator(seed=None)

Methods:

  • generate_sample_reads(num_reads=1000, signal_length_range=(4000,6000)) - Generate read data
  • save_to_pod5(output_path, num_reads=1000) - Save to POD5 file
  • create_test_dataset(output_dir='./data', num_files=5, reads_per_file=1000) - Create multi-file dataset

Example: reading several files with a custom batch size (reader is an AcceleratedPOD5Reader instance):

file_paths = list(Path("data/").glob("*.pod5"))
all_reads = reader.read_multiple_files(file_paths, batch_size=500)

for read_data in all_reads:
    # Process reads from all files
    pass


Comparison with Baseline

from pod5_accelerator.core import BaselinePOD5Reader, AcceleratedPOD5Reader

# Baseline single-threaded reader
baseline = BaselinePOD5Reader()
for batch in baseline.read_file_batch("data/sample.pod5"):
    pass
baseline_stats = baseline.get_stats()

# Accelerated multi-threaded reader
accelerated = AcceleratedPOD5Reader(num_threads=8)
for batch in accelerated.read_file_batch("data/sample.pod5"):
    pass
accelerated_stats = accelerated.get_stats()

# Calculate improvement
improvement = (accelerated_stats['throughput'] / baseline_stats['throughput'] - 1) * 100
print(f"Throughput improvement: {improvement:.1f}%")

Optimization Techniques

1. Multi-Threading

Uses Python's ThreadPoolExecutor to parallelize file I/O across multiple POD5 files. Particularly effective when processing entire sequencing runs with hundreds of POD5 files.
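
A minimal sketch of the per-file thread-pool pattern follows. It is not the AcceleratedPOD5Reader code: `process_files` and `file_size` are hypothetical names, and `file_size` is a stand-in for whatever work is done on each POD5 file.

```python
import tempfile
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path


def file_size(path):
    """Stand-in for per-file work: read the whole file and return its byte count."""
    return len(Path(path).read_bytes())


def process_files(paths, worker=file_size, num_threads=4):
    """Run `worker` on every path in a thread pool and return {path: result}."""
    results = {}
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        futures = {pool.submit(worker, p): p for p in paths}
        # Collect results as each file finishes, not in submission order.
        for future in as_completed(futures):
            results[futures[future]] = future.result()
    return results


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        paths = []
        for i in range(4):
            p = Path(tmp) / f"file{i}.bin"
            p.write_bytes(b"\x00" * (1024 * (i + 1)))
            paths.append(p)
        print(process_files(paths))
```

Threads help most when the per-file work spends its time in file I/O or in native code that releases Python's global interpreter lock; pure-Python computation in the worker will not scale the same way.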

2. Zero-Copy Operations

Directly accesses signal data from POD5's Apache Arrow tables without intermediate copies, reducing memory allocation overhead.
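
The general idea of a zero-copy view can be shown with a plain byte buffer and NumPy. This sketch does not touch POD5 or Arrow; `as_int16_view` is a hypothetical helper, and the little-endian 16-bit layout is an assumption chosen for the example.

```python
import numpy as np


def as_int16_view(buffer):
    """Interpret `buffer` as little-endian int16 samples without copying it."""
    return np.frombuffer(buffer, dtype="<i2")


if __name__ == "__main__":
    raw = bytearray(16)            # pretend this holds 8 raw samples
    view = as_int16_view(raw)      # shares memory with `raw`
    copied = view.copy()           # independent copy for comparison

    view[0] = 7                    # writing through the view changes `raw`
    print(raw[0], copied[0])
    print(view.flags.owndata, copied.flags.owndata)
```

A view costs no extra allocation, but it stays tied to the lifetime and contents of the underlying buffer, so take a copy if the data must outlive it.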

3. Generator Pattern

Yields batches of reads instead of loading entire files into memory, enabling streaming processing of large datasets.

4. Batch Processing

Groups reads into configurable batch sizes to amortize function call overhead and improve cache efficiency.
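
Generator-based streaming and batching can be combined in a few lines. The sketch below is a generic illustration, not the package's `read_file_batch`; `iter_batches` is a hypothetical name.

```python
from itertools import islice


def iter_batches(iterable, batch_size=1000):
    """Yield lists of up to `batch_size` items without materializing the input."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


if __name__ == "__main__":
    # Only one batch is held in memory at a time.
    for batch in iter_batches(range(10), batch_size=4):
        print(batch)
```

Larger batches reduce per-call overhead but raise peak memory, so the batch size is a trade-off to tune per workload.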

Benchmarking

Run performance benchmarks:

# Compare baseline vs accelerated reader
python -m pod5_accelerator.benchmarks.runner --data-dir data/ --output results/

Expected results:

  • Baseline: ~5,000-10,000 reads/sec (single-threaded)
  • Accelerated (8 threads): ~7,000-15,000 reads/sec (40%+ improvement)

Performance depends on hardware, file size, and I/O subsystem.

Requirements

  • pod5 >= 0.2.0
  • pyarrow >= 10.0.0
  • numpy >= 1.20.0
  • pandas >= 1.3.0
  • matplotlib >= 3.4.0
  • pytest >= 7.0.0
  • psutil >= 5.9.0

Context: Oxford Nanopore Sequencing

MinKNOW Software

MinKNOW is Oxford Nanopore's device control and data acquisition software. It:

  • Controls nanopore sequencing devices
  • Collects raw signal data from nanopore channels
  • Writes data to POD5 files in real-time
  • Performs live basecalling (optional)

POD5 in the Workflow

  1. Sequencing: MinKNOW captures ionic current signals as DNA/RNA passes through nanopores
  2. Storage: Signals are written to POD5 files with metadata (sample rate, channel info)
  3. Processing: Downstream tools (like this accelerator) read POD5 files for:
    • Basecalling (converting signals to DNA sequences)
    • Quality control and filtering
    • Signal analysis and modification detection

Roadmap

  • GPU acceleration using CuPy/CUDA
  • Compressed signal processing
  • Real-time streaming from MinKNOW
  • Integration with Guppy basecaller
  • Distributed processing across nodes

Contributing

Contributions are welcome! Please:

  1. Fork the repository
  2. Create a feature branch
  3. Add tests for new functionality
  4. Submit a pull request

License

MIT License - see LICENSE file for details.

Citation

If you use this work in your research, please cite:

@software{pod5_accelerator,
  title = {POD5 Accelerator: High-Performance Reader for Oxford Nanopore Data},
  author = {Samuel Aboderin},
  year = {2026},
  url = {https://github.com/aboderinsamuel/pod5-accelerator}
}

Acknowledgments

  • Oxford Nanopore Technologies for the POD5 format specification
  • Apache Arrow project for the underlying data structures
  • The bioinformatics community for feedback and testing

Contact

For questions or issues, please open a GitHub issue or contact: aboderinseun01@gmail.com
