This repository provides a Snakemake workflow focused on transcript quantification with RSEM using STAR for read alignment. The pipeline wraps the tooling required to align paired-end RNA-seq reads, estimate gene and isoform expression levels, and stage the outputs in a well-structured results tree that can be used for downstream differential expression or reporting workflows.
⚠️ Legacy note: This project started as a fork of the broader `rna-seq-star-deseq2` workflow, but the current codebase now specializes in running RSEM with STAR. Any lingering references to the older repository have been removed in favor of documenting the RSEM-specific use cases covered here.
- Automated alignment and quantification – STAR performs splice-aware alignment while RSEM estimates expected counts, TPM, and FPKM for genes and isoforms (sample output columns are shown after this list).
- Reproducible orchestration – Snakemake coordinates all steps, supports scalable execution on HPC clusters (including AWS ParallelCluster via the pcluster-slurm executor), and can cache intermediate environments for reuse.
- Container & conda aware – The workflow is set up to leverage Singularity containers and per-rule conda environments so dependencies remain isolated and reproducible.
- Example outputs included – Browse `docs/results_tree.log` and `docs/resources_tree.log` for sample directory structures generated by a successful run.
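For orientation, RSEM's per-sample gene tables are tab-separated with a fixed set of columns. A quick peek might look like the following (the path is illustrative — see `docs/results_tree.log` for the layout your run actually produces):
head -n 2 results/rsem/sampleA.genes.results
# gene_id  transcript_id(s)  length  effective_length  expected_count  TPM  FPKM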
- Generating expression matrices for cohorts where downstream analysis (e.g. DESeq2, sleuth, edgeR) will be performed separately (a matrix-building sketch follows this list).
- Re-quantifying RNA-seq datasets against updated transcriptomes without rerunning more complex differential expression workflows.
- Building standardized pipelines for production HPC environments where reproducibility and logging are required.
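If all you need is a combined expected-count matrix for downstream tools, RSEM ships a helper for exactly that. A minimal sketch, assuming per-sample outputs named as below (check your actual results tree for the real paths):
rsem-generate-data-matrix \
  results/rsem/sampleA.genes.results \
  results/rsem/sampleB.genes.results > cohort.genes.counts.matrix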
- This workflow has been developed to run on an AWS ParallelCluster Slurm head node, specifically one created using https://github.com/Daylily-Informatics/daylily-ephemeral-cluster.
Although the workflow can run on a workstation, it is primarily tuned for execution on an AWS ParallelCluster configured with the Slurm scheduler (see `snakemake-executor-plugin-pcluster-slurm`). Adjust resource directives if you use a different scheduler.
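For a quick workstation trial without the Slurm executor, a minimal sketch (the core count is a placeholder — size it to your machine):
# Drop --executor and --default-resources; run everything locally.
snakemake --use-conda --use-singularity --cores 16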
If you are using the daylily-ephemeral-cluster, conda is preinstalled and auto-activated. Otherwise, install miniconda and initialize it (e.g. `bash bin/install_miniconda`).
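If `bin/install_miniconda` is unavailable, a typical manual install looks like this (assumes Linux x86_64; pick the installer matching your platform):
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh -b -p "$HOME/miniconda3"
"$HOME/miniconda3/bin/conda" init bash && exec bash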
git clone [email protected]:Daylily-Informatics/day-rsem-star.git
cd day-rsem-star
conda create -n drsemstar -c conda-forge tabulate yaml
conda activate drsemstar
pip install git+https://github.com/Daylily-Informatics/[email protected]
pip install snakemake-executor-plugin-pcluster-slurm==0.0.31
pip install snakedeploy
snakemake --version
# 9.11.4.3
Running within tmux or screen is highly recommended.
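For example, to start a named session you can re-attach to after an SSH disconnect:
tmux new -s drsemstar
# ...later, from a fresh login:
tmux attach -t drsemstar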
conda activate drsemstar
mkdir -p /fsx/resources/environments/containers/ubuntu/drsemstar_cache/
export SNAKEMAKE_OUTPUT_CACHE=/fsx/resources/environments/containers/ubuntu/drsemstar_cache/
export TMPDIR=/fsx/scratch
cp config/units.tsv.template config/units.tsv
[[ "$(uname)" == "Darwin" ]] && sed -i "" "s|REGSUB_PWD|$PWD|g" config/units.tsv || sed -i "s|REGSUB_PWD|$PWD|g" config/units.tsv
This step pre-builds per-rule environments and may take up to an hour depending on bandwidth.
snakemake --use-conda --use-singularity \
--singularity-prefix /fsx/resources/environments/containers/ubuntu/ \
--singularity-args "-B /tmp:/tmp -B /fsx:/fsx -B /home/$USER:/home/$USER -B $PWD/:$PWD" \
--conda-prefix /fsx/resources/environments/containers/ubuntu/ \
--executor pcluster-slurm \
--default-resources slurm_partition=i128,i192 runtime=86400 mem_mb=3690 tmpdir=/fsx/scratch \
--cache -p \
--verbose -k \
--max-threads 20000 \
--restart-times 2 \
--cores 20000 -j 14 -n \
--conda-create-envs-only
If Snakemake reports thread limits equal to the head node `nproc`, override with `--max-threads` and `--cores` to match your cluster policy.
Remove the `-n` flag to execute the pipeline. Adjust `-j` and `slurm_partition` values to match your cluster partitions (use `sinfo` to enumerate available queues).
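For example, a compact partition summary (partition, availability, time limit, node count, CPUs per node):
sinfo -o "%P %a %l %D %c"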
snakemake --use-conda --use-singularity \
--singularity-prefix /fsx/resources/environments/containers/ubuntu/ \
--singularity-args "-B /tmp:/tmp -B /fsx:/fsx -B /home/$USER:/home/$USER -B $PWD/:$PWD" \
--conda-prefix /fsx/resources/environments/containers/ubuntu/ \
--executor pcluster-slurm \
--default-resources slurm_partition=i128,i192 runtime=86400 mem_mb=3690 tmpdir=/fsx/scratch \
--cache -p \
--verbose -k \
--restart-times 2 \
--max-threads 20000 \
--cores 20000 -j 14 \
--include-aws-benchmark-metrics
Monitor progress with `watch squeue` or `sinfo`.
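For example, to refresh your view of the queue every 30 seconds:
watch -n 30 "squeue -u $USER"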
- Update `config/units.tsv` with sample FASTQ locations and any metadata used by the rules (a hypothetical layout is sketched after this list).
- Modify `config/samples.tsv` to describe replicate groupings and annotation information consumed by downstream analyses.
- Customize alignment/quantification parameters in `config/config.yaml` to match your reference index locations, read layout, and strand-specific settings.
- Dry-run with `-n` to validate the DAG before launching the full job.
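A hypothetical `units.tsv` sketch — the column names below are illustrative only; defer to `config/units.tsv.template` for the authoritative schema:
# sample   unit   fq1                           fq2
sample1    u1     /fsx/data/S1_R1.fastq.gz      /fsx/data/S1_R2.fastq.gz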
The `docs/` directory contains logs of the resource and results directory trees from a representative run:
- `docs/results_tree.log` – final quantification outputs, including gene and isoform abundance tables and STAR alignment summaries.
- `docs/resources_tree.log` – cached environments, container images, and intermediate resources created during execution.
Use these files as references when verifying that your own runs are producing the expected directory layout.
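One way to compare a fresh run against the reference layout (assumes the `tree` utility is installed and that outputs land under `results/`):
tree results > my_results_tree.log
diff my_results_tree.log docs/results_tree.log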
- Ensure `SMK_SLURM_COMMENT` is set if your organization requires Slurm `--comment` tags for cost allocation (see the example below).
- If Singularity mounts fail, confirm the bind paths include the directories housing the input FASTQs and reference indices.
- For large read sets, tune the `mem_mb` and `runtime` defaults in the Snakemake command to fit the cluster instance types available to you.
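A sketch of the cost-allocation export (the tag value is hypothetical — substitute your organization's convention):
export SMK_SLURM_COMMENT="project=my-rnaseq-budget"
# If inputs live outside the default binds, extend --singularity-args, e.g.:
#   --singularity-args "-B /fsx:/fsx -B /my/fastq/dir:/my/fastq/dir"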