shahcompbio/orfology

Introduction

shahcompbio/orfology is a bioinformatics pipeline that ...

1. Read QC ([`FastQC`](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/))
2. Present QC for raw reads ([`MultiQC`](http://multiqc.info/))

Usage

Note

If you are new to Nextflow and nf-core, please refer to this page on how to set up Nextflow. Make sure to test your setup with -profile test before running the workflow on actual data.

First, prepare a samplesheet with your input data that looks as follows:

samplesheet.csv:

fasta,protein_table,sample,condition
U937_protein.fas,philosopher/protein.tsv,U937,AML
swissprot.fasta,,SwissProt,SwissProt

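Each row represents one of two things: a proteogenomics sample, for which the protein fasta was produced by proteomegenerator2 or proteomegenerator3 and is paired with a protein table from philosopher, or a standalone protein fasta file you would like to analyze. For a standalone fasta, leave the protein_table field empty, as in the SwissProt row above.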
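As a rough illustration of the two row types, the following sketch parses a samplesheet and tags each row. The helper name `read_samplesheet` and the `kind` labels are hypothetical and not part of the pipeline; it also assumes that an empty protein_table field is what marks a fasta-only row, as in the example above.

```python
import csv
import io

# Column names taken from the example samplesheet header.
REQUIRED_COLUMNS = ["fasta", "protein_table", "sample", "condition"]


def read_samplesheet(handle):
    """Parse a samplesheet and tag each row by kind (hypothetical helper)."""
    reader = csv.DictReader(handle)
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"samplesheet is missing columns: {missing}")
    rows = []
    for line_no, row in enumerate(reader, start=2):
        if not row["fasta"] or not row["sample"]:
            raise ValueError(f"line {line_no}: fasta and sample are required")
        # Assumption: an empty protein_table marks a fasta-only row.
        kind = "proteogenomics" if row["protein_table"] else "fasta_only"
        rows.append({**row, "kind": kind})
    return rows


example = """fasta,protein_table,sample,condition
U937_protein.fas,philosopher/protein.tsv,U937,AML
swissprot.fasta,,SwissProt,SwissProt
"""

rows = read_samplesheet(io.StringIO(example))
for row in rows:
    print(row["sample"], row["kind"])
```

A check like this can catch a missing column or an empty fasta field before launching a run; the pipeline's own input validation may apply different or stricter rules.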

Now, you can run the pipeline using:

nextflow run shahcompbio/orfology \
   -profile <docker/singularity/.../institute> \
   --input samplesheet.csv \
   --outdir <OUTDIR>

Warning

Please provide pipeline parameters via the CLI or Nextflow -params-file option. Custom config files including those provided by the -c Nextflow option can be used to provide any configuration except for parameters; see docs.
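As one way to follow this advice, the sketch below writes a JSON parameters file that can be passed with Nextflow's -params-file option. The file name `params.json` and the output directory `results` are placeholders chosen for illustration; only the `input` and `outdir` parameter names come from the command above.

```python
import json

# Parameter names mirror the CLI flags shown above (--input, --outdir).
# The values are placeholders; substitute your own paths.
params = {
    "input": "samplesheet.csv",
    "outdir": "results",
}

with open("params.json", "w") as fh:
    json.dump(params, fh, indent=2)

# The run command would then take the file instead of individual flags.
print("nextflow run shahcompbio/orfology -profile docker -params-file params.json")
```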

Classify proteins by transcriptomic origins

If you are using orfology to classify proteins for which you have detected peptides by mass spec according to their transcriptomic origins (with the --categorize_proteins flag), we recommend also running ORFology with the --unique_proteins flag. This flag filters your fasta files using the Indistinguishable Proteins column from the philosopher protein.tsv tables, keeping only those proteins which are uniquely distinguishable from other proteins. This ensures that each non-canonical protein has a combination of peptides which distinguishes it from the other proteins in your analysis. The output tables from this filtered set all start with the prefix unique. Output tables which have not been filtered for uniquely distinguishable proteins have the prefix all or all+unique (the latter are tables of merged outputs from unique and all). After this, proteins are categorized using the following conditional logic:

  1. SwissProt if it is an exact sequence match for a SwissProt protein.
  2. Alt ORF from canonical transcript if one of the transcripts which the ORF is predicted from has an Ensembl ID.
  3. ORF from alt splice transcript if one of the transcripts is a non-canonical splice isoform.
  4. ORF from neogene if it is a non-canonical transcript which did not match to a known gene.
  5. Uncategorized if it doesn't fit into one of the above categories.
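The rules above can be sketched as a first-match-wins cascade. This is only an illustration: the `OrfEvidence` fields are hypothetical boolean flags standing in for whatever evidence the pipeline actually computes, and it assumes the rules are evaluated in the listed order, so an earlier category takes priority when several conditions hold.

```python
from dataclasses import dataclass


@dataclass
class OrfEvidence:
    """Hypothetical per-protein evidence flags; the pipeline's real inputs may differ."""
    exact_swissprot_match: bool = False
    any_transcript_has_ensembl_id: bool = False
    any_transcript_is_alt_splice: bool = False
    from_unmatched_noncanonical_transcript: bool = False


def categorize(evidence: OrfEvidence) -> str:
    """Return the first category whose condition holds (assumed priority order)."""
    if evidence.exact_swissprot_match:
        return "SwissProt"
    if evidence.any_transcript_has_ensembl_id:
        return "Alt ORF from canonical transcript"
    if evidence.any_transcript_is_alt_splice:
        return "ORF from alt splice transcript"
    if evidence.from_unmatched_noncanonical_transcript:
        return "ORF from neogene"
    return "Uncategorized"


# A protein that matches SwissProt exactly stays in the SwissProt category
# even if one of its source transcripts also has an Ensembl ID.
print(categorize(OrfEvidence(exact_swissprot_match=True,
                             any_transcript_has_ensembl_id=True)))
print(categorize(OrfEvidence(any_transcript_is_alt_splice=True)))
print(categorize(OrfEvidence()))
```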

You can run this workflow with the following command:

nextflow run shahcompbio/orfology \
   -profile <docker/singularity/.../institute> \
   --input samplesheet.csv \
   --categorize_proteins \
   --unique_proteins \
   --outdir <OUTDIR>
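To give a feel for what the --unique_proteins filtering amounts to, here is a minimal sketch, not the pipeline's implementation. It assumes that a protein counts as uniquely distinguishable when its Indistinguishable Proteins field is empty, that the identifier lives in a column named Protein, and that this identifier equals the first token of the fasta header. The function names and the toy data are hypothetical.

```python
import csv
import io


def unique_protein_ids(protein_tsv, id_column="Protein",
                       indist_column="Indistinguishable Proteins"):
    """Collect IDs whose indistinguishable-proteins field is empty (assumed rule)."""
    reader = csv.DictReader(protein_tsv, delimiter="\t")
    return {
        row[id_column]
        for row in reader
        if not (row.get(indist_column) or "").strip()
    }


def filter_fasta(fasta_lines, keep_ids):
    """Yield the fasta lines of records whose header ID is in keep_ids."""
    keep = False
    for line in fasta_lines:
        if line.startswith(">"):
            # Assumption: the record ID is the first whitespace-delimited header token.
            tokens = line[1:].split()
            keep = bool(tokens) and tokens[0] in keep_ids
        if keep:
            yield line


# Toy inputs for illustration only; real protein.tsv tables have many more columns.
toy_tsv = "Protein\tIndistinguishable Proteins\nORF_1\t\nORF_2\tORF_3\n"
toy_fasta = ">ORF_1 toy record\nMKT\n>ORF_2\nMAA\n"

ids = unique_protein_ids(io.StringIO(toy_tsv))
filtered = "".join(filter_fasta(io.StringIO(toy_fasta), ids))
print(filtered, end="")
```

In this toy input, ORF_2 lists ORF_3 as indistinguishable, so only the ORF_1 record is kept.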

Key outputs here are:

  1. classifyproteins/unique_proteins_merged_annotated_info_table.tsv: Contains the proteins stratified by category, along with information about which samples they appeared in.
  2. blastsummary_pgtools_merged/unique_proteins_merged.tsv: Contains results from the diamond blastp search, merged with the table described in 1.

Credits

shahcompbio/orfology was originally written by Asher Preska Steinberg.

We thank the following people for their extensive assistance in the development of this pipeline:

Contributions and Support

If you would like to contribute to this pipeline, please see the contributing guidelines.

Citations

An extensive list of references for the tools used by the pipeline can be found in the CITATIONS.md file.

This pipeline uses code and infrastructure developed and maintained by the nf-core community, reused here under the MIT license.

The nf-core framework for community-curated bioinformatics pipelines.

Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen.

Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.

About

protein annotation based on sequence composition
