shahcompbio/orfology is a bioinformatics pipeline that ...
1. Read QC ([`FastQC`](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/))
2. Present QC for raw reads ([`MultiQC`](http://multiqc.info/))

Note: If you are new to Nextflow and nf-core, please refer to this page on how to set up Nextflow. Make sure to test your setup with -profile test before running the workflow on actual data.
First, prepare a samplesheet with your input data that looks as follows:
samplesheet.csv:
fasta,protein_table,sample,condition
U937_protein.fas,philosopher/protein.tsv,U937,AML
swissprot.fasta,,SwissProt,SwissProt
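Each row represents one of two things: a proteogenomics sample, for which the protein fasta was produced by proteomegenerator2 or proteomegenerator3 and the protein table by philosopher, or a standalone protein fasta file you would like to analyze. In the example above, the SwissProt row is of the second kind and leaves the protein_table column empty.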
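Before launching a run, it can help to sanity-check the samplesheet. The following is a minimal sketch, not part of the pipeline: the function name `check_samplesheet` is hypothetical, and the rule that `fasta`, `sample`, and `condition` must be non-empty while `protein_table` may be blank is an assumption inferred from the example rows above. The pipeline may apply its own, stricter validation.

```python
import csv
import io

# Column names taken from the example samplesheet header.
REQUIRED_COLUMNS = ["fasta", "protein_table", "sample", "condition"]


def check_samplesheet(text):
    """Return a list of problems found in samplesheet CSV text (empty if none)."""
    reader = csv.DictReader(io.StringIO(text))
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        return [f"missing column(s): {', '.join(missing)}"]

    problems = []
    # Data rows start on line 2 because line 1 is the header.
    for line_no, row in enumerate(reader, start=2):
        # Assumption: protein_table is optional, the other columns are not.
        for col in ("fasta", "sample", "condition"):
            if not (row[col] or "").strip():
                problems.append(f"line {line_no}: empty '{col}'")
    return problems


EXAMPLE = """fasta,protein_table,sample,condition
U937_protein.fas,philosopher/protein.tsv,U937,AML
swissprot.fasta,,SwissProt,SwissProt
"""

if __name__ == "__main__":
    print(check_samplesheet(EXAMPLE) or "samplesheet looks OK")
```

To check a file on disk, read it first, for example `check_samplesheet(open("samplesheet.csv").read())`.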
Now, you can run the pipeline using:
nextflow run shahcompbio/orfology \
-profile <docker/singularity/.../institute> \
--input samplesheet.csv \
--outdir <OUTDIR>

Warning: Please provide pipeline parameters via the CLI or the Nextflow -params-file option. Custom config files, including those provided by the -c Nextflow option, can be used to provide any configuration except for parameters; see docs.
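As an illustration of the -params-file route, the sketch below writes a JSON parameters file (Nextflow accepts JSON or YAML for -params-file). The keys mirror the CLI flags shown above (--input, --outdir); the helper name `write_params`, the file name `params.json`, and the `results` directory are hypothetical placeholders.

```python
import json

# Keys correspond to the pipeline's CLI flags without the leading dashes.
params = {
    "input": "samplesheet.csv",
    "outdir": "results",  # hypothetical output directory
}


def write_params(params, path="params.json"):
    """Write the parameters as JSON and return the path that was written."""
    with open(path, "w") as handle:
        json.dump(params, handle, indent=2)
    return path


if __name__ == "__main__":
    print(write_params(params))
```

The resulting file could then be passed as `-params-file params.json` in place of the individual `--input` and `--outdir` flags.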
If you are using orfology to classify, by transcriptomic origin, proteins for which you have detected peptides by mass spectrometry (with the --categorize_proteins flag), we recommend also running ORFology with the --unique_proteins flag. This flag filters your fasta files using the Indistinguishable Proteins column from the philosopher protein.tsv tables, keeping only those proteins which are uniquely distinguishable from other proteins. This ensures that each non-canonical protein has a combination of peptides which distinguishes it from the other proteins in your analysis. These output tables all start with the prefix unique. Output tables which have not been filtered for uniquely distinguishable proteins have the prefix all or all+unique (the latter are tables of merged outputs from unique and all). After this, proteins are categorized using the following conditional logic:
- SwissProt: if it is an exact sequence match for a SwissProt protein.
- Alt ORF from canonical transcript: if one of the transcripts which the ORF is predicted from has an Ensembl ID.
- ORF from alt splice transcript: if one of the transcripts is a non-canonical splice isoform.
- ORF from neogene: if it is a non-canonical transcript which did not match to a known gene.
- Uncategorized: if it doesn't fit into one of the above categories.
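The conditional logic above can be sketched as a chain of checks. This is an illustrative sketch only, not the pipeline's implementation: the `Transcript` and `Protein` classes and all of their fields are hypothetical, and it assumes that the categories are tested in the order listed, with the first match winning.

```python
from dataclasses import dataclass, field


@dataclass
class Transcript:
    # Hypothetical flags describing a transcript an ORF was predicted from.
    has_ensembl_id: bool = False
    is_noncanonical_splice: bool = False
    matches_known_gene: bool = True


@dataclass
class Protein:
    # Hypothetical record: SwissProt match status plus source transcripts.
    exact_swissprot_match: bool = False
    transcripts: list = field(default_factory=list)


def categorize(protein):
    """Assign a category, assuming checks run in the listed order."""
    if protein.exact_swissprot_match:
        return "SwissProt"
    if any(t.has_ensembl_id for t in protein.transcripts):
        return "Alt ORF from canonical transcript"
    if any(t.is_noncanonical_splice for t in protein.transcripts):
        return "ORF from alt splice transcript"
    if any(not t.matches_known_gene for t in protein.transcripts):
        return "ORF from neogene"
    return "Uncategorized"


if __name__ == "__main__":
    example = Protein(transcripts=[Transcript(is_noncanonical_splice=True)])
    print(categorize(example))
```

Under the first-match assumption, a protein that matches SwissProt exactly is labelled SwissProt even if one of its transcripts also carries an Ensembl ID.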
You can run this workflow with the following command:
nextflow run shahcompbio/orfology \
-profile <docker/singularity/.../institute> \
--input samplesheet.csv \
--categorize_proteins \
--unique_proteins \
--outdir <OUTDIR>

Key outputs here are:
1. classifyproteins/unique_proteins_merged_annotated_info_table.tsv: stratifies proteins by category and records which samples each protein appeared in.
2. blastsummary_pgtools_merged/unique_proteins_merged.tsv: contains results from the diamond blastp search, merged with the table described in 1.
shahcompbio/orfology was originally written by Asher Preska Steinberg.
We thank the following people for their extensive assistance in the development of this pipeline:
If you would like to contribute to this pipeline, please see the contributing guidelines.
An extensive list of references for the tools used by the pipeline can be found in the CITATIONS.md file.
This pipeline uses code and infrastructure developed and maintained by the nf-core community, reused here under the MIT license.
The nf-core framework for community-curated bioinformatics pipelines.
Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen.
Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.