maragkakislab/wf-module-dorado

wf-module-dorado

Snakemake workflow module to run ONT Dorado basecalling, demultiplexing and post-processing for experiments organized as one directory per experiment.

This repository provides a reproducible Snakemake pipeline intended primarily to be consumed by other workflows as a module, although it can also run independently. The pipeline:

  • downloads the Dorado basecaller,
  • runs Dorado basecalling across experiment POD5 files,
  • optionally demultiplexes multiplexed runs,
  • converts basecalled BAMs to FASTQ,
  • runs pychopper for unstranded kits,
  • organizes outputs per-experiment and per-sample under basecall/ and samples/ respectively.
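The step list above implies a per-sample decision: demultiplexing applies only to multiplexed runs, and pychopper only to unstranded kits. A minimal Python sketch of that logic follows. The function name and the step labels are hypothetical and chosen for illustration; they are not the workflow's actual rule names.

```python
def plan_steps(stranded: bool, multiplexed: bool) -> list[str]:
    """Return the ordered processing steps for one sample.

    Step labels are illustrative only, not the workflow's rule names.
    """
    steps = ["download_dorado", "basecall"]
    if multiplexed:
        # Only multiplexed runs need demultiplexing.
        steps.append("demux")
    steps.append("bam_to_fastq")
    if not stranded:
        # Unstranded kits get an extra pychopper pass.
        steps.append("pychopper")
    return steps


print(plan_steps(stranded=False, multiplexed=False))
# ['download_dorado', 'basecall', 'bam_to_fastq', 'pychopper']
```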

Requirements

  • Snakemake (>=6 recommended)
  • A GPU-equipped machine for Dorado basecalling.

These rules require minimal resources and can be specified as localrules:

localrules: run_all, rename_final_stranded_fastq, get_dorado, demux_get_bam,
            pychopper_merge_trimmed_rescued, get_basecalled_bam_for_sample

How to use in other workflows

In the consuming workflow, add a section to the config that contains all the parameters required by this workflow's config file (config/config.yml).

In the consuming config.yml:

# Sample data
DORADO: {
  DOWNLOADS_DIR: "downloads",
  EXP_DIR: "experiments",
  BASECALL_DIR: "basecall",
  SAMPLES_DIR: "samples",
  DELETE_INTERMEDIATES: False,
  BIN_VERSION: 'dorado-1.1.1-linux-x64',
  DORADO_RESOURCES: {
    gpu: 2,
    gpu_model: "[gpua100|gpuv100x]",
  },
  DEFAULT_MODEL: 'hac',
  SAMPLE_DATA: {
    'header': ['sample_id', 'experiment_id', 'kit', 'stranded', 'barcode'],
    'data': [
        ['sample1', 'exp1', 'SQK-LSK114', 'false', ''],
    ]
  }
}
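Before handing the DORADO section to the module, a consuming workflow may want to check it. The sketch below is one possible check, not part of this repository. The set of required keys is an assumption taken from the example above (DORADO_RESOURCES and DELETE_INTERMEDIATES are treated as optional), and the function name is hypothetical.

```python
# Assumption: these keys are the required ones, inferred from the example config.
ASSUMED_REQUIRED = (
    "DOWNLOADS_DIR", "EXP_DIR", "BASECALL_DIR", "SAMPLES_DIR",
    "BIN_VERSION", "DEFAULT_MODEL", "SAMPLE_DATA",
)


def check_dorado_section(section: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means no problem found."""
    problems = [f"missing key: {k}" for k in ASSUMED_REQUIRED if k not in section]
    table = section.get("SAMPLE_DATA", {})
    header = table.get("header", [])
    for i, row in enumerate(table.get("data", [])):
        # Every data row must supply one value per header column.
        if len(row) != len(header):
            problems.append(
                f"SAMPLE_DATA row {i} has {len(row)} fields, expected {len(header)}"
            )
    return problems
```

In a Snakefile this could be called as check_dorado_section(config["DORADO"]) before the module declaration, raising an error if the returned list is not empty.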

Then in a consuming snakefile:

module dorado_basecall:
    snakefile:
        github("maragkakislab/wf-module-dorado", path="workflow/Snakefile", tag="v1.0.3")
    config:
        config["DORADO"]

use rule * from dorado_basecall as dorado_*

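# SAMPLES_DIR must be defined in the consuming Snakefile.
SAMPLES_DIR = config["DORADO"]["SAMPLES_DIR"]

rule run_all:
    input:
        # Basecalled BAM
        SAMPLES_DIR + "/sample1/basecall/calls.bam",
        # Basecalled and stranded FASTQ
        SAMPLES_DIR + "/sample1/fastq/reads.final.fastq.gz",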
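With many samples, listing each target by hand becomes tedious. The following sketch builds the two per-sample targets shown in the run_all example (basecall/calls.bam and fastq/reads.final.fastq.gz) for a list of sample IDs. The helper name is hypothetical, and the path layout is assumed to match that example.

```python
def sample_targets(samples_dir: str, sample_ids: list[str]) -> list[str]:
    """Build the per-sample BAM and FASTQ target paths for the given sample IDs."""
    targets = []
    for sid in sample_ids:
        # Basecalled BAM
        targets.append(f"{samples_dir}/{sid}/basecall/calls.bam")
        # Basecalled and stranded FASTQ
        targets.append(f"{samples_dir}/{sid}/fastq/reads.final.fastq.gz")
    return targets


print(sample_targets("samples", ["sample1"]))
# ['samples/sample1/basecall/calls.bam', 'samples/sample1/fastq/reads.final.fastq.gz']
```

Since a Snakefile is Python, the returned list can be used directly as the input of an aggregating rule.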

Configuration

Important keys found in config/config.yml (examples):

  • DOWNLOADS_DIR - where Dorado tarballs / binaries are downloaded (default downloads).
  • EXP_DIR - directory containing one subdirectory per experiment. Each experiment directory should contain POD5 files and the EXP_DIR_TRIGGER_FILE (default origin.txt) which acts as a trigger for the workflow.
  • BASECALL_DIR - where per-experiment basecalling outputs are stored.
  • SAMPLES_DIR - where per-sample outputs are placed.
  • BIN_VERSION - Dorado binary version to use.
  • DELETE_INTERMEDIATES - delete intermediate files if True (default False).
  • DORADO_RESOURCES - default GPU resources for Dorado (optional).
  • DEFAULT_MODEL, CUSTOM_MODELS - model selection used in the basecall rule.
  • SAMPLE_DATA - table describing samples; workflow/rules/common.smk parses this into samples and experiments objects. If multiple samples share the same experiment_id, the experiment is assumed to be multiplexed and each of those samples needs a unique barcode.
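To make the SAMPLE_DATA rule concrete, here is a sketch of how a header/data table could be turned into per-sample records, grouped by experiment, and checked for unique barcodes. It is an illustration under stated assumptions, not the code in workflow/rules/common.smk: the function name, the added multiplexed field, and the barcode values used in the usage example are all hypothetical.

```python
from collections import defaultdict


def parse_sample_data(table: dict) -> tuple[dict, dict]:
    """Return (samples by sample_id, sample records grouped by experiment_id)."""
    header = table["header"]
    samples = {}
    by_experiment = defaultdict(list)
    for row in table["data"]:
        rec = dict(zip(header, row))  # pair each column name with its value
        samples[rec["sample_id"]] = rec
        by_experiment[rec["experiment_id"]].append(rec)

    for exp_id, recs in by_experiment.items():
        # Several samples on one experiment means the run is multiplexed.
        multiplexed = len(recs) > 1
        barcodes = [r["barcode"] for r in recs]
        if multiplexed and ("" in barcodes or len(set(barcodes)) != len(barcodes)):
            raise ValueError(f"experiment {exp_id}: each sample needs a unique barcode")
        for r in recs:
            r["multiplexed"] = multiplexed  # hypothetical convenience field
    return samples, dict(by_experiment)


table = {
    "header": ["sample_id", "experiment_id", "kit", "stranded", "barcode"],
    "data": [
        ["s1", "exp1", "SQK-LSK114", "false", "barcode01"],
        ["s2", "exp1", "SQK-LSK114", "false", "barcode02"],
    ],
}
samples, experiments = parse_sample_data(table)
print(samples["s1"]["multiplexed"])  # True
```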
