Pre-index BWA and samtools references for runtime optimization #68

@smagala

Description

Currently, the Denver pipeline indexes BWA and samtools references at runtime for each pipeline execution. This task implements pre-indexing to improve pipeline startup time and reduce redundant computation.

Current Behavior

  • BWA_INDEX runs for each serotype reference FASTA (lines 77-79 in workflows/denver.nf)
  • SAMTOOLS_FAIDX runs for each serotype reference FASTA (lines 79-83 in subworkflows/local/denv_serotype_analysis/main.nf)
  • Indexes are recreated on every pipeline run

Proposed Solution

  1. Create a pre-indexing script (bin/preindex_references.sh) that:

    • Iterates through all FASTA files in assets directory
    • Runs BWA index for each reference
    • Runs samtools faidx for each reference
    • Stores indexes appropriately:
      • Single-file indexes (*.fai): stored directly alongside FASTA
      • Multi-file indexes (BWA): stored in per-reference subdirectories (e.g., DENV1/DENV1.fasta + BWA index files)
  2. Update pipeline configuration:

    • Add optional params for pre-built indexes:
      • use_prebuilt_bwa_index: boolean flag (default: false)
      • use_prebuilt_fai: boolean flag (default: false)
    • When enabled, skip indexing modules and load pre-built indexes from assets
  3. Update workflows:

    • Modify denver.nf to conditionally skip BWA_INDEX
    • Modify denv_serotype_analysis/main.nf to conditionally skip SAMTOOLS_FAIDX
    • Load pre-built indexes from assets directory when flags are enabled
  4. Update nextflow_schema.json with new parameters
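
As a starting point for step 1, here is a minimal sketch of what bin/preindex_references.sh could look like. It assumes the references are *.fasta files directly under the assets directory, that bwa and samtools are on PATH, and that the BWA copy of each FASTA is made with cp. The function names are hypothetical, not existing pipeline code.

```shell
#!/usr/bin/env bash
# Sketch of bin/preindex_references.sh (layout and function names are assumptions).
set -eu

# Print where the BWA copy of a reference would live,
# e.g. assets/DENV1.fasta -> assets/DENV1/DENV1.fasta
bwa_fasta_path() {
    local fasta=$1 dir name stem
    dir=$(dirname "$fasta")
    name=$(basename "$fasta")
    stem=${name%.*}
    printf '%s/%s/%s\n' "$dir" "$stem" "$name"
}

# Index one reference: BWA files go in a per-reference subdirectory,
# the .fai goes alongside the original FASTA.
index_reference() {
    local fasta=$1 target
    target=$(bwa_fasta_path "$fasta")
    mkdir -p "$(dirname "$target")"
    cp "$fasta" "$target"
    bwa index "$target"        # writes .amb .ann .bwt .pac .sa next to $target
    samtools faidx "$fasta"    # writes $fasta.fai
}

# Index every *.fasta directly under the given assets directory.
main() {
    local assets_dir=$1 fasta
    for fasta in "$assets_dir"/*.fasta; do
        [ -e "$fasta" ] || continue   # glob matched nothing
        index_reference "$fasta"
    done
}

# Only run when an assets directory is passed on the command line.
if [ "$#" -gt 0 ]; then
    main "$@"
fi
```

Under these assumptions the script would be run once, for example as bin/preindex_references.sh assets, and the resulting index files kept with the references.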

Acceptance Criteria

  • Pre-indexing script created in bin/
  • Script successfully indexes all 6 serotype references (DENV1-4 + 2 sylvatic)
  • Configuration parameters added and validated
  • Pipeline successfully runs with pre-built indexes enabled
  • Pipeline still works with runtime indexing (default behavior)
  • Documentation updated (README, parameter descriptions)

LoE Estimate

2-3 hours

  • Script implementation: 45 minutes
  • Pipeline modifications: 60 minutes
  • Configuration/schema updates: 30 minutes
  • Testing both modes: 45 minutes

Technical Details

BWA Index Output

BWA creates multiple index files for a reference:

  • reference.fasta.amb
  • reference.fasta.ann
  • reference.fasta.bwt
  • reference.fasta.pac
  • reference.fasta.sa

Storage strategy: Create a per-reference directory (e.g., DENV1/) containing the FASTA and all five index files
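
Because the BWA index is spread over five files, a pre-flight check before skipping BWA_INDEX may be useful. The sketch below is one possible way to do it; the function name bwa_index_complete is hypothetical and the check only tests that the files exist, not that they are valid.

```shell
# Return 0 only if all five BWA index files exist for the given FASTA path.
bwa_index_complete() {
    local fasta=$1 ext
    for ext in amb ann bwt pac sa; do
        [ -f "$fasta.$ext" ] || return 1
    done
}
```

For example, bwa_index_complete assets/DENV1/DENV1.fasta would succeed only when every index file sits next to that FASTA.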

Samtools faidx Output

Samtools creates a single index file:

  • reference.fasta.fai

Storage strategy: Store directly alongside FASTA in assets directory
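
A pre-built .fai can go stale if the FASTA is edited later. Since a .fai holds one line per sequence, one cheap sanity check is to compare that line count with the number of FASTA header lines. This is a sketch only; the function name fai_matches_fasta is hypothetical and the check does not verify offsets or sequence lengths.

```shell
# Return 0 if $fasta.fai exists and has one entry per FASTA header line.
fai_matches_fasta() {
    local fasta=$1 n_seqs n_entries
    [ -f "$fasta.fai" ] || return 1
    # grep -c exits non-zero when the count is 0, so tolerate that.
    n_seqs=$(grep -c '^>' "$fasta") || true
    # Strip the padding some wc implementations add around the count.
    n_entries=$(wc -l < "$fasta.fai" | tr -d '[:space:]')
    [ "$n_seqs" -eq "$n_entries" ]
}
```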
