PKS-analysis-pipeline

This repository contains a Nextflow pipeline for polyketide synthase (PKS) phylogenetic analysis. The pipeline performs sequence alignment using mafft and prank and estimates phylogenetic trees using RAxML-NG (ML tree with automatic evolutionary model selection) and Beast2 (Bayesian inference).

Pipeline Overview

Preprocessing: Cleans FASTA headers (removing specific formatting artifacts).
Alignment: Sequences are aligned in parallel using:
- PRANK (probabilistic alignment)
- MAFFT (heuristic alignment)
Phylogenetics:
- RAxML-NG: Maximum likelihood tree inference with automatic model selection and adaptive tree search.
- BEAST2: Bayesian phylogenetic inference (XML configuration generated automatically).

Environment Setup

The pipeline is designed to run with Nextflow using a custom Conda/Mamba environment.

Software Versions

The analysis environment includes the following key software versions defined in environment.yml:

Python 3.12
PRANK v.170427
MAFFT v7.526
RAxML-NG v2.0-beta3 (supports automatic model selection)
BEAST2 v2.6.3

Cluster Setup

UZH Wagner HPC

Install and configure the environment using a setup bash script:

git clone https://github.com/acg-team/PKS-analysis-pipeline.git
cd PKS-analysis-pipeline/setup_env
source init_env_uzh_cluster.sh
cd ..

ZHAW HPC

Install and configure the environment using a setup bash script:

git clone https://github.com/acg-team/PKS-analysis-pipeline.git
cd PKS-analysis-pipeline/setup_env
source init_env_zhaw_cluster.sh
cd ..

Local Setup

Standard Workstation (e.g., Linux/Intel Mac)

Prerequisites: micromamba and nextflow installed.

git clone https://github.com/acg-team/PKS-analysis-pipeline.git
cd PKS-analysis-pipeline
micromamba create -f setup_env/environment.yml

Running The Analysis

Prerequisites:

micromamba and nextflow installed, pre-configured micromamba environment created but not activated.
Data for the analysis placed in the data/ folder in the repository root folder.

Running on UZH Wagner HPC

Prerequisites: replace the path to the environment in nextflow_uzh_cluster.config with the appropriate environment path, e.g.:

conda = '/home/jpecerska/micromamba/envs/polyketide_analysis'

To run the pipeline using the Slurm workload manager, run the following command:

nextflow run pipeline.nf -c nextflow_uzh_cluster.config

Running locally

Prerequisites: replace the path to the environment in nextflow_local.config with the appropriate environment path, e.g.:

conda = '/Users/pece/mamba/envs/polyketide_analysis

To run the pipeline using the local machine, run the following command:

nextflow run pipeline.nf -c nextflow_local.config

Analysis Steps Detailed

1. Data Cleaning

The pipeline performs automatic cleaning of input FASTA files:

Symbol Removal: Removes [' and '] characters from sequences. These are formatting artifacts that can cause PRANK to replace residues with unknown amino acids (X).
Header Sanitization: Ensure headers do not contain parentheses (which break RAxML).
- Example fix manually applied to PKS_AT_prot_seq.fasta:
  - Old: >NZ_SZVR01000041_M6_bis_N/A_mxmal/(unknown)/mxmal
  - New: >NZ_SZVR01000041_M6_bis_N/A_mxmal/unknown/mxmal

2. Alignment

PRANK

Command:

prank -d=${fasta} -o=${fasta.baseName}.prank -F

Settings:

-F: Trusts the inference of insertions; sites appearing as insertions are not re-aligned in later stages.

MAFFT

Command:

mafft --auto ${fasta} > ${fasta.baseName}.mafft.aln

Settings:

--auto: Automatically selects the appropriate strategy based on data size.

3. ML phylogenetics (RAxML-NG)

RAxML-NG (v2.0-beta3) is used for its support of Automatic Model Selection (MOOSE) and Adaptive Tree Search.

Command (DNA sequences):

raxml-ng-2 --msa ${alignment} --model DNA --opt-topology adaptive --prefix ${alignment.baseName}

Command (protein sequences):

raxml-ng-2 --msa ${alignment} --model AA --moose-options substitution-models=DCMut,JTT,JTT-DCMut,LG,PMB,Q.pfam,Q.yeast,VT,WAG,PROTGTR --opt-topology adaptive --prefix ${alignment.baseName}

Settings:

--model: Sets the input sequence type DNA for DNA sequences or AA for protein sequences.
--moose-options: Defines the substitution models to use in model selection (only for protein sequences, described in section Model Selection).
--opt-topology adaptive: Activates adaptive tree search based on Pythia difficulty scores (described in section Topology Optimisation).

Model Selection

The pipeline uses MOOSE to select the best-fitting evolutionary model. For DNA sequences it will go through all available substitution models (e.g. JC69, K80, GTR), for proteins it will evaluate a list of models in the table below, selected based on their applicability to the dataset (bacterial PKS):

Model Name	Reference	Included?	Comment
DCMut	Kosiol and Goldman, 2005	Yes	Improved version of PAM model
JTT	Jones et al., 1992	Yes	Generic model
JTT-DCMut	Kosiol and Goldman, 2005	Yes	Corrected Dayhoff rate matrices
LG	Le and Gascuel, 2008	Yes	Generic model
PMB	Veerassamy et al., 2003	Yes	Updated BLOSUM62 model
Q.pfam	Minh et al., 2021	Yes	Generic model derived from PFAM
VT	Muller and Vingron, 2000	Yes	Extension of Dayhoff approach
WAG	Whelan and Goldman, 2001	Yes	Generic model
PROTGTR	-	Yes	General Time Reversible (190 rate parameters)
Blosum62	Henikoff and Henikoff, 1992	No	Omitted in favor of PMB
Dayhoff	Dayhoff et al., 1978	No	Omitted in favor of JTT-DCMut
Q.yeast	Minh et al., 2021	No*	Yeast model (not applicable to bacteria)

(Note: Many specific organismal models like HIV, Flu, Mammal, Bird, Plant, Insect, and Mitochondrial models were excluded as they are not applicable to this analysis.)

Topology Optimisation

Tree search is guided by the Pythia score, a machine-learning predictor of dataset difficulty (likelihood surface ruggedness).

Easy datasets: Single likelihood peak; search converges rapidly and terminates early.
Difficult datasets: Many local optima; search finds one good topology quickly to save time.
Intermediate datasets: Requires more extensive search.

4. Bayesian Phylogenetics (Beast2)

The pipeline automates the generation of Beast2 XML configuration files and execution.

XML Generation: A Python script (scripts/beast_configuration.py) automatically generates the XML configuration file for each alignment.
- It uses a template (beast_template/protein_template.xml for protein sequences or beast_template/dna_template.xml for DNA) to ensure consistent parameters.
- It parses the FASTA alignment, sanitizes sequence IDs (replacing non-alphanumeric characters), and inserts the sequences into the XML structure.
Beast2 run: Beast2 is run on the generated XML files to perform Bayesian phylogenetic inference.

Execution:

beast -threads ${task.cpus} -overwrite ${xml}

Settings:

-threads: Enables parallelisation based on available CPU count.
-overwrite: Allows overwriting of files to prevent getting stuck on user input.

Beast2 Analysis Configuration

The pipeline uses two specific templates for Bayesian phylogenetic inference, depending on the input data type. No model selection is available, so the substitution models are fixed.

Template settings:

MCMC Chain Length: 10,000,000 generations.
Logging Frequency: Every 1,000 generations (for trees, trace logs, and screen output).
Tree Prior: Birth-Death Model.
Clock Model: Strict Clock with a fixed rate of 1.0.
Site Model: Gamma Site Model with 5 categories and invariant sites (G+I).
Specific Configurations:
- Protein Alignments (using protein_template.xml):
  - Substitution Model: WAG.
- DNA Alignments (using dna_template.xml):
  - Substitution Model: GTR (General Time Reversible) with estimated base frequencies and rate parameters.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

PKS-analysis-pipeline

Pipeline Overview

Environment Setup

Software Versions

Cluster Setup

UZH Wagner HPC

ZHAW HPC

Local Setup

Standard Workstation (e.g., Linux/Intel Mac)

Running The Analysis

Running on UZH Wagner HPC

Running locally

Analysis Steps Detailed

1. Data Cleaning

2. Alignment

PRANK

MAFFT

3. ML phylogenetics (RAxML-NG)

Model Selection

Topology Optimisation

4. Bayesian Phylogenetics (Beast2)

Beast2 Analysis Configuration

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 43 Commits
beast_template		beast_template
data		data
scripts		scripts
setup_env		setup_env
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
nextflow_local.config		nextflow_local.config
nextflow_uzh_cluster.config		nextflow_uzh_cluster.config
nextflow_zhaw_cluster.config		nextflow_zhaw_cluster.config
pipeline.nf		pipeline.nf

License

acg-team/PKS-analysis-pipeline

Folders and files

Latest commit

History

Repository files navigation

PKS-analysis-pipeline

Pipeline Overview

Environment Setup

Software Versions

Cluster Setup

UZH Wagner HPC

ZHAW HPC

Local Setup

Standard Workstation (e.g., Linux/Intel Mac)

Running The Analysis

Running on UZH Wagner HPC

Running locally

Analysis Steps Detailed

1. Data Cleaning

2. Alignment

PRANK

MAFFT

3. ML phylogenetics (RAxML-NG)

Model Selection

Topology Optimisation

4. Bayesian Phylogenetics (Beast2)

Beast2 Analysis Configuration

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages