Snakemake workflow: STAR-aligned RSEM quantification

This repository provides a Snakemake workflow focused on transcript quantification with RSEM using STAR for read alignment. The pipeline wraps the tooling required to align paired-end RNA-seq reads, estimate gene and isoform expression levels, and stage the outputs in a well-structured results tree that can be used for downstream differential expression or reporting workflows.

⚠️ Legacy note: This project started as a fork of a broader rna-seq-star-deseq2 workflow, but the current codebase now specializes in running RSEM with STAR. Any lingering references to the older repository have been removed in favor of documenting the RSEM-specific use cases covered here.

Key capabilities

  • Automated alignment and quantification – STAR performs splice-aware alignment while RSEM estimates expected counts, TPM, and FPKM for genes and isoforms.
  • Reproducible orchestration – Snakemake coordinates all steps, supports scalable execution on HPC clusters (including AWS ParallelCluster via the pcluster-slurm executor), and can cache intermediate environments for reuse.
  • Container & conda aware – The workflow is set up to leverage Singularity containers and per-rule conda environments so dependencies remain isolated and reproducible.
  • Example outputs included – Browse docs/results_tree.log and docs/resources_tree.log for sample directory structures generated by a successful run.

When to use this workflow

  • Generating expression matrices for cohorts where downstream analysis (e.g. DESeq2, sleuth, edgeR) will be performed separately.
  • Re-quantifying RNA-seq datasets against updated transcriptomes without rerunning more complex differential expression workflows.
  • Building standardized pipelines for production HPC environments where reproducibility and logging are required.

Prerequisites

A daylily-ephemeral-cluster (built on AWS ParallelCluster)

Cluster resources

Although the workflow can run on a workstation, it is primarily tuned for execution on an AWS ParallelCluster configured with the Slurm scheduler (see the snakemake-executor-plugin-pcluster-slurm). Adjust resource directives if using a different scheduler.
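If you are targeting a generic Slurm cluster rather than ParallelCluster, a minimal sketch of the substitution follows; it assumes the separate snakemake-executor-plugin-slurm package, and the partition name compute is a hypothetical placeholder:

pip install snakemake-executor-plugin-slurm

snakemake --executor slurm \
  --default-resources slurm_partition=compute runtime=7200 mem_mb=4000 \
  --jobs 8 -n   # dry-run first to verify submission parameters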

Conda

If you are using the daylily-ephemeral-cluster, conda is preinstalled and auto-activated. Otherwise, install Miniconda and initialize it (e.g. bash bin/install_miniconda).
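If that helper script is not available in your environment, a minimal manual install for Linux x86_64 looks roughly like the following (adjust the installer filename for your OS and architecture):

wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh -b -p "$HOME/miniconda3"  # -b: unattended install
"$HOME/miniconda3/bin/conda" init bash                           # wire conda into your shell
exec bash                                                        # reload the shell to activate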

Usage

Clone the repository

git clone [email protected]:Daylily-Informatics/day-rsem-star.git
cd day-rsem-star

Build the drsemstar environment

conda create -n drsemstar -c conda-forge tabulate yaml
conda activate drsemstar

pip install git+https://github.com/Daylily-Informatics/[email protected]
pip install snakemake-executor-plugin-pcluster-slurm==0.0.31
pip install snakedeploy

snakemake --version
# 9.11.4.3

Run the bundled test workflow

Running within tmux or screen is highly recommended.
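For example, to start a named session you can detach from and reattach to without killing the run:

tmux new -s drsemstar     # start the session, then launch snakemake inside it
# detach with Ctrl-b d; the workflow keeps running
tmux attach -t drsemstar  # reattach later to check on progress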

Configure cache and temporary paths

conda activate drsemstar

mkdir -p /fsx/resources/environments/containers/ubuntu/drsemstar_cache/
export SNAKEMAKE_OUTPUT_CACHE=/fsx/resources/environments/containers/ubuntu/drsemstar_cache/
export TMPDIR=/fsx/scratch/

Prepare units.tsv

cp config/units.tsv.template config/units.tsv
[[ "$(uname)" == "Darwin" ]] && sed -i "" "s|REGSUB_PWD|$PWD|g" config/units.tsv || sed -i "s|REGSUB_PWD|$PWD|g" config/units.tsv

Build conda environments

This step pre-builds per-rule environments and may take up to an hour depending on bandwidth.

snakemake --use-conda --use-singularity \
  --singularity-prefix /fsx/resources/environments/containers/ubuntu/ \
  --singularity-args "-B /tmp:/tmp -B /fsx:/fsx -B /home/$USER:/home/$USER -B $PWD/:$PWD" \
  --conda-prefix /fsx/resources/environments/containers/ubuntu/ \
  --executor pcluster-slurm \
  --default-resources slurm_partition=i128,i192 runtime=86400 mem_mb=3690 tmpdir=/fsx/scratch \
  --cache -p \
  --verbose -k \
  --max-threads 20000 \
  --restart-times 2 \
  --cores 20000 -j 14 -n \
  --conda-create-envs-only

If Snakemake caps thread usage at the head node's nproc, override it with --max-threads and --cores to match your cluster policy.

Launch the workflow

Unlike the environment-build command above, the command below omits -n and --conda-create-envs-only so the pipeline actually executes. Adjust -j and slurm_partition values to match your cluster partitions (use sinfo to enumerate available queues).

snakemake --use-conda --use-singularity \
  --singularity-prefix /fsx/resources/environments/containers/ubuntu/ \
  --singularity-args "-B /tmp:/tmp -B /fsx:/fsx -B /home/$USER:/home/$USER -B $PWD/:$PWD" \
  --conda-prefix /fsx/resources/environments/containers/ubuntu/ \
  --executor pcluster-slurm \
  --default-resources slurm_partition=i128,i192 runtime=86400 mem_mb=3690 tmpdir=/fsx/scratch \
  --cache -p \
  --verbose -k \
  --restart-times 2 \
  --max-threads 20000 \
  --cores 20000 -j 14 \
  --include-aws-benchmark-metrics

Monitor progress with watch squeue or sinfo.
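For example:

watch -n 30 'squeue -u $USER'  # refresh your job list every 30 seconds
sinfo                          # check partition and node availability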

Running with your own samples

  • Update config/units.tsv with sample FASTQ locations and any metadata used by the rules.
  • Modify config/samples.tsv to describe replicate groupings and annotation information consumed by downstream analyses.
  • Customize alignment/quantification parameters in config/config.yaml to match your reference index locations, read layout, and strand-specific settings.
  • Dry-run with -n to validate the DAG before launching the full job, as shown in the sketch below.
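A minimal dry-run sketch, assuming the drsemstar environment is active (for a fully faithful preview, reuse the launch command above with -n appended instead):

snakemake -n -p --cores 4  # print the planned jobs and shell commands without executing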

Example outputs

The docs/ directory contains logs of the resource and results directory trees from a representative run:

  • docs/results_tree.log – final quantification outputs, including gene and isoform abundance tables and STAR alignment summaries.
  • docs/resources_tree.log – cached environments, container images, and intermediate resources created during execution.

Use these files as references when verifying that your own runs are producing the expected directory layout.

Troubleshooting

  • Ensure SMK_SLURM_COMMENT is set if your organization requires Slurm --comment tags for cost allocation (see the example after this list).
  • If Singularity mounts fail, confirm the bind paths include directories housing the input FASTQs and reference indices.
  • For large read sets, tune mem_mb and runtime defaults in the Snakemake command to fit the cluster instance types available to you.
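For the first point, a sketch of setting the comment tag; the value shown is purely illustrative, so substitute whatever your organization's accounting expects:

export SMK_SLURM_COMMENT="rnaseq-quant"  # hypothetical cost-allocation tag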
