GitHub - davide-scognamiglio/MuSA: A Nextflow pipeline for deep, reproducible annotation and ranking of clinical genomic variants

Introduction

MuSA (Multi-Source variant Annotation) is an nf-core-oriented Nextflow pipeline that provides a fully automated, end-to-end framework for variant interpretation. MuSA offers several features that distinguish it from, and in several respects improve on, existing tools:

Automated Resource Management: MuSA eliminates the manual effort and reproducibility issues inherent in standalone VEP or ANNOVAR workflows by fully automating the setup of annotation resources, including 20 curated VEP plugins and the full dbNSFP distribution.

Advanced VUS Reclassification: MuSA integrates the RENOVO machine-learning model. By applying a novel linear transformation to RENOVO scores, the pipeline shifts Variants of Uncertain Significance (VUS) toward actionable pathogenic or benign extremes, a capability not typically found in standard automated pipelines.
Dual-Output for AI Research and Clinical Review: Unlike lighter clinical reporting tools, MuSA generates deeply annotated, AI-ready MAF files containing up to 950 annotation columns, systematically organized for deep computational research. Simultaneously, it produces interactive HTML reports with HPO-matched gene panels, streamlining results for clinical teams.
Superior Clinical Utility vs. Broad Pipelines: While broad pipelines like nf-core/sarek focus on processing breadth, MuSA is dedicated to annotation completeness. It specifically addresses the "unsuitable verbosity" of default VEP outputs by focusing on MANE transcript selection and HPO-driven filtering, so that results are diagnostic-ready.

The pipeline takes as input a samplesheet referencing raw (unannotated) VCF files and outputs consolidated annotation files suitable for clinical research, reporting, or input to downstream workflows. If Human Phenotype Ontology (HPO) terms are provided for individual patients, an additional phenotype-prioritized MAF is generated using HPO-based gene panel filtering.

Default pipeline key parameters

--build
Genome build to use (default: hg38).
--input
Path to the samplesheet containing input VCF files.
--outdir
Directory where all results will be written.
--workflow
Workflow to run: setup or annotate.
--vcf_format
Format of input VCF files. Supported: sarek, multicaller, dragen, iontorrent.
--center
Optional sequencing center identifier added to output files.
--skip_bcftools
If true, skips the bcftools-based pre-processing of VCF files.
--offline
If true, no external API call will be performed.
--drop_benign
If true, all variants reported as "benign" or "likely benign" in ClinVar will be dropped from the filtered MAF file.
--max_freq
Optional maximum population frequency threshold. If null, no variant will be dropped based on frequency.
--panel
Optional panel name to be used in the last filtering step.

Genebe parameters (required when `--offline false`)

--gb_user
Genebe account username.
--gb_api_key
Genebe API key.
--http_proxy, --https_proxy
Proxy settings, only if required by your system.

VEP and plugin parameters

--n_core
Number of cores used by VEP (default: 16).
--download_vep_plugins
Download VEP plugins during the setup workflow (true/false).
--use_vep_plugins
Enable VEP plugin usage during the annotation workflow (true/false).
--data_dir
Directory containing all the data downloaded during the setup step.
--annovar_software_dir
Directory containing the annovar software folder (path/to/annovar).
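
The parameters listed above can also be collected in a file rather than typed on the command line. The following is a minimal sketch, assuming the pipeline accepts its parameters through Nextflow's standard `-params-file` option; the keys are the parameter names documented above, while every path and value (including the file name `params.yaml`) is a placeholder to adapt.

```shell
# Write the pipeline parameters to a YAML file.
# Keys are the parameter names documented above; values are placeholders.
cat > params.yaml <<'EOF'
workflow: annotate
build: hg38
input: path/to/samplesheet.csv
outdir: results
vcf_format: sarek
offline: true
n_core: 16
use_vep_plugins: true
data_dir: path/to/data
annovar_software_dir: path/to/annovar
EOF

# Nextflow reads the file through its -params-file option:
#   nextflow run main.nf -profile docker -params-file params.yaml
```

Keeping the parameters in a versioned file makes it easier to rerun an annotation with identical settings.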

Getting started

1a. Setup

Before annotating any dataset, the pipeline requires a setup step to download the minimal required databases and reference files. This ensures the pipeline can run correctly.

Licensing and data usage notice:
Users must independently obtain access to an ANNOVAR license, download the software from the official source, and install it according to its licensing terms. While we provide a link to download the dbNSFP academic database for convenience, users are solely responsible for complying with its license terms. In particular, dbNSFP academic is restricted to non-commercial use, and any usage must adhere to the conditions specified by its authors. Ensure that your use case is compliant before downloading and integrating the resource.

Run the setup workflow:

nextflow run main.nf \
    -profile <docker/singularity> \
    --workflow setup \
    --download_vep_plugins=<true/false> \
    --data_dir="../data/"

1b. Test run

Now, you can run your first test using:

nextflow run main.nf \
   -profile test,<docker/singularity> \
   --use_vep_plugins=<true/false> \
   --annovar_software_dir=<path/to/annovar> \
   --outdir <OUTDIR>

2. Using your own data

To annotate your own VCF files, prepare a samplesheet in this format:

patient	sample_type	sample_file	hpo
patient_code	tissue	path/to/file.vcf.gz	HP:code (optional)

Columns:

patient: Unique identifier for the patient
sample_type: Type of sample (blood, saliva, tissue, etc.)
sample_file: Full path to the unannotated VCF file
hpo: Optional HPO term(s) for phenotype-based filtering
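
The column layout above can be checked before launching the pipeline. This is a minimal sketch, not part of MuSA: it assumes the samplesheet is comma-separated (inferred from the `.csv` file name used in the run command), and `check_samplesheet` is a hypothetical helper. The row values, including the HPO term, are placeholders.

```shell
# Write a samplesheet with a header and one patient row (placeholder values).
# ASSUMPTION: comma-separated, inferred from the ".csv" file name.
cat > samplesheet.csv <<'EOF'
patient,sample_type,sample_file,hpo
patient_code,tissue,path/to/file.vcf.gz,HP:0000001
EOF

# check_samplesheet is a hypothetical helper, not part of MuSA.
# It verifies the header and that every row has 4 fields with non-empty
# patient, sample_type and sample_file values (hpo may be empty).
check_samplesheet() {
    awk -F',' '
        NR == 1 {
            if ($0 != "patient,sample_type,sample_file,hpo") {
                print "unexpected header: " $0; bad = 1
            }
            next
        }
        NF != 4 || $1 == "" || $2 == "" || $3 == "" {
            print "line " NR ": malformed row"; bad = 1
        }
        END { exit bad + 0 }
    ' "$1"
}

check_samplesheet samplesheet.csv && echo "samplesheet OK"
```

The helper exits with a non-zero status on the first kind of problem it can detect, so it can be chained in front of the `nextflow run` command with `&&`.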

Warning

By default, the entire pipeline runs with --offline true, which skips the Genebe and HPO API-based annotations. To use Genebe, provide --gb_user and --gb_api_key (a Genebe account and API key can be obtained for free) and run with --offline false.

nextflow run main.nf \
  -profile <docker/singularity> \
  --workflow annotate \
  --use_vep_plugins=<true/false> \
  --data_dir=<path/to/data> \
  --annovar_software_dir=<path/to/annovar> \
  --vcf_format=<sarek/multicaller/dragen/iontorrent> \
  --input <path/to/samplesheet.csv> \
  --outdir <OUTDIR>
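
The offline/online choice described in the warning above can be sketched as a small wrapper. This is an illustration under stated assumptions, not part of MuSA: `musa_annotate_cmd` is a hypothetical function, and `GB_USER` and `GB_API_KEY` are assumed environment variable names. It only prints the command so it can be reviewed first, and it omits the data and ANNOVAR directory options for brevity.

```shell
# musa_annotate_cmd is a hypothetical wrapper, not part of MuSA.
# It prints the annotate command instead of running it.
# GB_USER and GB_API_KEY are assumed environment variable names.
musa_annotate_cmd() {
    samplesheet=$1
    outdir=$2
    if [ -n "${GB_USER:-}" ] && [ -n "${GB_API_KEY:-}" ]; then
        # Both credentials present: enable the online annotations.
        online="--offline false --gb_user $GB_USER --gb_api_key $GB_API_KEY"
    else
        # Credentials missing: stay on the offline default.
        online="--offline true"
    fi
    echo "nextflow run main.nf -profile docker --workflow annotate" \
         "--input $samplesheet --outdir $outdir $online"
}

musa_annotate_cmd samplesheet.csv results
```

Falling back to `--offline true` when a credential is missing mirrors the documented default and avoids a run that fails partway through on a rejected API call.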

Pipeline output

MuSA generates two complementary outputs: comprehensive MAF files for computational analysis and interactive HTML reports for clinical interpretation.

MAF files contain up to ~950 annotation columns per variant, including population frequencies (e.g. gnomAD, TOPMed), pathogenicity predictions (e.g. REVEL, CADD, AlphaMissense), splicing scores (e.g. SpliceAI), clinical annotations (e.g. ClinVar, OMIM), gene constraint metrics, ACMG/AMP evidence, and RENOVO pathogenicity scores.
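
With this many columns, it helps to inspect the header before loading a MAF into an analysis tool. The sketch below assumes the MAF is tab-separated and may begin with `#` comment lines, as in the common MAF convention; `maf_ncols` and `maf_find_cols` are hypothetical helpers, and the file path in the usage comment is a placeholder, since the output file names are not specified here.

```shell
# maf_ncols is a hypothetical helper, not part of MuSA.
# ASSUMPTION: tab-separated MAF, optionally starting with "#" comment lines;
# the first non-comment line is treated as the header.
maf_ncols() {
    awk -F'\t' '!/^#/ { print NF; exit }' "$1"
}

# maf_find_cols lists header columns whose name matches a pattern
# (case-insensitive), each prefixed with its 1-based position.
maf_find_cols() {
    awk -F'\t' -v pat="$2" '
        !/^#/ {
            for (i = 1; i <= NF; i++)
                if (tolower($i) ~ tolower(pat)) print i "\t" $i
            exit
        }
    ' "$1"
}

# Usage (placeholder path):
#   maf_ncols path/to/output.maf
#   maf_find_cols path/to/output.maf gnomad
```

The column positions printed by `maf_find_cols` can be passed to `cut -f` to extract a narrower table for a specific analysis.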

HTML reports provide an interactive overview with: (i) a summary panel (patient metadata and variant counts), (ii) a sortable/filterable variant table with key annotations, and (iii) maftools-based visualizations (e.g. mutation distributions, oncoplots, Ti/Tv ratios).

Credits

MuSA was written by D. Scognamiglio at IRCCS Istituto Ortopedico Rizzoli, Bologna, Italy.

We thank E. Bonetti for his extensive assistance in the development of this pipeline.

Citations

An extensive list of references for the tools used by the pipeline can be found in the CITATIONS.md file.

You can cite the MuSA publication as follows: ...

