kentsislab/proteomegenerator3 is a bioinformatics pipeline for creating sample-specific proteogenomics search databases from long-read RNAseq data. It takes a samplesheet and aligned long-read RNAseq data as input, performs guided de novo transcript assembly and ORF prediction, and then produces a protein FASTA file suitable for use with computational proteomics search platforms (e.g., FragPipe, DIA-NN).
- Pre-processing of aligned reads to create transcript read classes with bambu, which can be re-used in future analyses. Optional filtering:
  - Filtering on MAPQ and read length with samtools
- Transcript assembly, quantification, and filtering with bambu. Option to merge multiple samples into a unified transcriptome.
- ORF prediction with Transdecoder. Option to provide fusion contigs from JAFFAL.
- Formatting of ORFs into a FASTA file which can be used for computational proteomics searches with FragPipe, DIA-NN, or Spectronaut.
- MultiQC to collate package versions used
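To illustrate the optional read-filtering step, here is a minimal sketch of a MAPQ and read-length filter with samtools. The exact command the pipeline runs is not shown in this README, so treat this as an assumption about an equivalent filter; `filter_bam` and `DRY_RUN` are hypothetical names used only for this example, and the `-e` filter expression requires samtools 1.12 or later.

```shell
# Hypothetical helper, not part of the pipeline.
# Keeps alignments with MAPQ >= min_mapq and read length >= min_len.
# Usage: filter_bam <in.bam> <out.bam> [min_mapq] [min_len]
# Set DRY_RUN=1 to print the samtools command instead of running it.
filter_bam() {
    in_bam=$1
    out_bam=$2
    min_mapq=${3:-20}   # mirrors the documented mapq default
    min_len=${4:-500}   # mirrors the documented read_len default
    ${DRY_RUN:+echo} samtools view -b \
        -q "$min_mapq" \
        -e "length(seq) >= $min_len" \
        -o "$out_bam" "$in_bam"
}
```

Calling `DRY_RUN=1 filter_bam sample.bam sample.filtered.bam` prints the command without touching any files, which is a convenient way to check the thresholds before running on real data.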
Note
If you are new to Nextflow and nf-core, please refer to this page on how to set up Nextflow. Make sure to test your setup with -profile test before running the workflow on actual data. With this profile, the pipeline runs on a minimal test dataset, which takes 5-10 minutes on most modern laptops.
First, prepare a samplesheet with your input data that looks as follows:
samplesheet.csv:
sample,bam,bai,rcFile,jaffal_fasta,jaffal_table
CONTROL_REP1,AEG588A1_S1_L002_R1_001.bam,AEG588A1_S1_L002_R1_001.bam.bai,,jaffal_results.fasta,jaffal_results.csv
Each row represents a long-read RNAseq sample. The columns are as follows:
- sample: name of the sample
- bam: aligned, sorted long-read RNAseq BAM
- bai: index file for the BAM
- rcFile: read class file from bambu if you've already done some pre-processing; you can provide this and then use the --skip_preprocessing flag to speed up run time and re-analyze previous samples
- jaffal_fasta: fusion contigs which are output from JAFFAL (see description here)
- jaffal_table: fusion table which is output from JAFFAL (see description here)
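Before launching a run, it can help to sanity-check the samplesheet layout. The sketch below only verifies the header, the column count, and that each row names a sample; it is not the pipeline's own validation, and `check_samplesheet` is a hypothetical helper name. Which of the remaining columns may be left empty is not checked here, since that depends on the options you run with.

```shell
# Hypothetical helper, not part of the pipeline.
# Prints one message per problem and returns non-zero if any were found.
check_samplesheet() {
    awk -F',' '
        NR == 1 {
            # The header must match the six documented columns exactly.
            if ($0 != "sample,bam,bai,rcFile,jaffal_fasta,jaffal_table") {
                print "line 1: unexpected header: " $0
                bad = 1
            }
            next
        }
        NF != 6 {
            print "line " NR ": expected 6 columns, found " NF
            bad = 1
            next
        }
        $1 == "" {
            print "line " NR ": empty sample name"
            bad = 1
        }
        END {
            if (NR == 0) { print "empty samplesheet"; bad = 1 }
            exit bad + 0
        }
    ' "$1"
}
```

For example, `check_samplesheet samplesheet.csv && echo "samplesheet looks OK"` prints the confirmation only when no problems were reported.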
To produce the necessary files, we recommend using the nf-core/nanoseq pipeline, which will run both alignment and call fusions with JAFFAL.
Now, you can run the pipeline using:
nextflow run kentsislab/proteomegenerator3 -r 1.0.0dev \
-profile <docker/singularity/.../institute> \
--input samplesheet.csv \
--fasta <REF_GENOME> \
--gtf <REF_GTF> \
--outdir <OUTDIR>

Where REF_GENOME and REF_GTF are the reference genome and transcriptome, respectively. These can be from GENCODE or Ensembl, but should match the reference used to align the data.
Warning
Please provide pipeline parameters via the CLI or Nextflow -params-file option. Custom config files including those provided by the -c Nextflow option can be used to provide any configuration except for parameters; see docs.
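As a sketch of the -params-file route, the snippet below writes a small YAML file holding the same parameters as the command above. The file name params.yaml and all paths are placeholders chosen for this example, and writing boolean flags such as filter_reads as `true` is an assumption based on the usual Nextflow convention rather than something stated in this README.

```shell
# Write an example parameters file (all paths are placeholders).
cat > params.yaml <<'EOF'
input: samplesheet.csv
fasta: /path/to/reference_genome.fa
gtf: /path/to/reference_annotation.gtf
outdir: results
filter_reads: true
mapq: 20
read_len: 500
EOF

# Then pass it to Nextflow instead of repeating each flag on the CLI:
#   nextflow run kentsislab/proteomegenerator3 -r 1.0.0dev \
#       -profile docker \
#       -params-file params.yaml
```

Keeping parameters in a file like this makes a run easier to repeat and to share alongside the results.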
To see all optional parameters that could be used with the pipeline and their explanations, use the help menu:
nextflow run kentsislab/proteomegenerator3 -r 1.0.0dev --help

These options are set using flags. For example:
nextflow run kentsislab/proteomegenerator3 -r 1.0.0dev \
-profile <docker/singularity/.../institute> \
--input samplesheet.csv \
--fasta <REF_GENOME> \
--gtf <REF_GTF> \
--outdir <OUTDIR> \
--filter_reads

This pre-filters the BAM file on MAPQ and read length before transcript assembly is performed.
As another example, you can use the following flag to perform ORF calling on fusion contigs:
nextflow run kentsislab/proteomegenerator3 -r 1.0.0dev \
-profile <docker/singularity/.../institute> \
--input samplesheet.csv \
--fasta <REF_GENOME> \
--gtf <REF_GTF> \
--outdir <OUTDIR> \
--fusions

To run the latest version, which may not be stable, use the -r main -latest flags:
nextflow run kentsislab/proteomegenerator3 -r main -latest --help

The following options are highlighted here:
- filter_reads: use this flag to pre-filter reads using mapq and read length
- mapq: min mapq for read filtering [default: 20]
- read_len: min read length for read filtering [default: 500]
- filter_acc_reads: filter reads on accessory chromosomes; sometimes causes issues for bambu
- skip_preprocessing: use previously generated bambu read classes
- NDR: modulate bambu's novel discovery rate [default: 0.1]
- recommended_NDR: run bambu with recommended NDR (as determined by bambu's algorithm)
- single_sample: run bambu on samples individually, and skip merging of transcriptomes; if you provide a single sample or fusions, this will be automatically run
- skip_multisample: skip multisample transcript assembly (see #8)
- fusions: perform ORF predictions on fusions from JAFFAL [default: false]
- multiple_orfs: allow for multiple ORFs per transcript (this is in beta-testing)
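Several of these options can be combined in one run. The sketch below wraps such a command in a small function; `run_pg3` and `DRY_RUN` are hypothetical names for this example, the docker profile and the threshold values are arbitrary choices, and passing values as `--mapq 30` follows the usual Nextflow parameter convention rather than anything confirmed by this README.

```shell
# Hypothetical wrapper, not part of the pipeline.
# Usage: run_pg3 <REF_GENOME> <REF_GTF> <OUTDIR>
# Set DRY_RUN=1 to print the command instead of launching Nextflow.
run_pg3() {
    ${DRY_RUN:+echo} nextflow run kentsislab/proteomegenerator3 -r 1.0.0dev \
        -profile docker \
        --input samplesheet.csv \
        --fasta "$1" --gtf "$2" --outdir "$3" \
        --filter_reads --mapq 30 --read_len 1000 \
        --NDR 0.2
}
```

Here reads are pre-filtered with stricter thresholds than the defaults (MAPQ 30, length 1000) and bambu's NDR is raised from 0.1 to 0.2; run `DRY_RUN=1 run_pg3 genome.fa annotation.gtf results` first to inspect the assembled command.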
kentsislab/proteomegenerator3 was originally written by Asher Preska Steinberg.
We thank the following people for their extensive assistance in the development of this pipeline:
If you would like to contribute to this pipeline, please see the contributing guidelines.
If you use kentsislab/proteomegenerator3 for your analysis, please cite our manuscript:
End-to-end proteogenomics for discovery of cryptic and non-canonical cancer proteoforms using long-read transcriptomics and multi-dimensional proteomics
Katarzyna Kulej, Asher Preska Steinberg, Jinxin Zhang, Gabriella Casalena, Eli Havasov, Sohrab P. Shah, Andrew McPherson, Alex Kentsis.
bioRxiv. 2025 Aug 28. doi: 10.1101/2025.08.23.671943.
An extensive list of references for the tools used by the pipeline can be found in the CITATIONS.md file.
This pipeline uses code and infrastructure developed and maintained by the nf-core community, reused here under the MIT license.
The nf-core framework for community-curated bioinformatics pipelines.
Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen.
Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.