This repository provides a Snakemake workflow focused on transcript quantification with RSEM using STAR for read alignment. The pipeline wraps the tooling required to align paired-end RNA-seq reads, estimate gene and isoform expression levels, and stage the outputs in a well-structured results tree that can be used for downstream differential expression or reporting workflows.
⚠️ Legacy note: This project started as a fork of the broader `rna-seq-star-deseq2` workflow, but the current codebase now specializes in running RSEM with STAR. Any lingering references to the older repository have been removed in favor of documenting the RSEM-specific use cases covered here.
- Automated alignment and quantification – STAR performs splice-aware alignment while RSEM estimates expected counts, TPM, and FPKM for genes and isoforms (sample output columns are shown after this list).
- Reproducible orchestration – Snakemake coordinates all steps, supports scalable execution on HPC clusters (including AWS ParallelCluster via the pcluster-slurm executor), and can cache intermediate environments for reuse.
- Container & conda aware – The workflow is set up to leverage Singularity containers and per-rule conda environments so dependencies remain isolated and reproducible.
- Example outputs included – Browse `docs/results_tree.log` and `docs/resources_tree.log` for sample directory structures generated by a successful run.
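For orientation, RSEM's per-sample gene tables are tab-separated with a fixed set of columns. A quick peek might look like the following (the path is illustrative — see `docs/results_tree.log` for the layout your run actually produces):
head -n 2 results/rsem/sampleA.genes.results
# gene_id  transcript_id(s)  length  effective_length  expected_count  TPM  FPKM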
- Generating expression matrices for cohorts where downstream analysis (e.g. DESeq2, sleuth, edgeR) will be performed separately (a matrix-building sketch follows this list).
- Re-quantifying RNA-seq datasets against updated transcriptomes without rerunning more complex differential expression workflows.
- Building standardized pipelines for production HPC environments where reproducibility and logging are required.
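If all you need is a combined expected-count matrix for downstream tools, RSEM ships a helper for exactly that. A minimal sketch, assuming per-sample outputs named as below (check your actual results tree for the real paths):
rsem-generate-data-matrix \
  results/rsem/sampleA.genes.results \
  results/rsem/sampleB.genes.results > cohort.genes.counts.matrix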
- This workflow has been developed to run on an AWS ParallelCluster Slurm head node, specifically one created using https://github.com/Daylily-Informatics/daylily-ephemeral-cluster.
Although the workflow can run on a workstation, it is primarily tuned for execution on an AWS ParallelCluster configured with the Slurm scheduler (see `snakemake-executor-plugin-pcluster-slurm`). Adjust resource directives if you use a different scheduler.
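For a quick workstation trial without the Slurm executor, a minimal sketch (the core count is a placeholder — size it to your machine):
# Drop --executor and --default-resources; run everything locally.
snakemake --use-conda --use-singularity --cores 16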
If you are using the daylily-ephemeral-cluster, conda is preinstalled and auto-activated. Otherwise, install miniconda and initialize it (e.g. `bash bin/install_miniconda`).
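If `bin/install_miniconda` is unavailable, a typical manual install looks like this (assumes Linux x86_64; pick the installer matching your platform):
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh -b -p "$HOME/miniconda3"
"$HOME/miniconda3/bin/conda" init bash && exec bash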
git clone [email protected]:Daylily-Informatics/day-rsem-star.git
cd day-rsem-star
conda create -n drsemstar -c conda-forge tabulate yaml
conda activate drsemstar
pip install git+https://github.com/Daylily-Informatics/[email protected]
pip install snakemake-executor-plugin-pcluster-slurm==0.0.31
pip install snakedeploy
snakemake --version
# 9.11.4.3
Running within tmux or screen is highly recommended.
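For example, to start a named session you can re-attach to after an SSH disconnect:
tmux new -s drsemstar
# ...later, from a fresh login:
tmux attach -t drsemstar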
conda activate drsemstar
mkdir -p /fsx/resources/environments/containers/ubuntu/drsemstar_cache/
export SNAKEMAKE_OUTPUT_CACHE=/fsx/resources/environments/containers/ubuntu/drsemstar_cache/
export TMPDIR=/fsx/scratch
cp config/units.tsv.template config/units.tsv
[[ "$(uname)" == "Darwin" ]] && sed -i "" "s|REGSUB_PWD|$PWD|g" config/units.tsv || sed -i "s|REGSUB_PWD|$PWD|g" config/units.tsv
This step pre-builds per-rule environments and may take up to an hour depending on bandwidth.
snakemake --use-conda --use-singularity \
--singularity-prefix /fsx/resources/environments/containers/ubuntu/ \
--singularity-args "-B /tmp:/tmp -B /fsx:/fsx -B /home/$USER:/home/$USER -B $PWD/:$PWD" \
--conda-prefix /fsx/resources/environments/containers/ubuntu/ \
--executor pcluster-slurm \
--default-resources slurm_partition=i128,i192 runtime=86400 mem_mb=3690 tmpdir=/fsx/scratch \
--cache -p \
--verbose -k \
--max-threads 20000 \
--restart-times 2 \
--cores 20000 -j 14 -n \
--conda-create-envs-only
If Snakemake reports thread limits equal to the head node `nproc`, override with `--max-threads` and `--cores` to match your cluster policy.
Remove the `-n` flag to execute the pipeline. Adjust `-j` and `slurm_partition` values to match your cluster partitions (use `sinfo` to enumerate available queues).
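For example, a compact partition summary (partition, availability, time limit, node count, CPUs per node):
sinfo -o "%P %a %l %D %c"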
snakemake --use-conda --use-singularity \
--singularity-prefix /fsx/resources/environments/containers/ubuntu/ \
--singularity-args "-B /tmp:/tmp -B /fsx:/fsx -B /home/$USER:/home/$USER -B $PWD/:$PWD" \
--conda-prefix /fsx/resources/environments/containers/ubuntu/ \
--executor pcluster-slurm \
--default-resources slurm_partition=i128,i192 runtime=86400 mem_mb=3690 tmpdir=/fsx/scratch \
--cache -p \
--verbose -k \
--restart-times 2 \
--max-threads 20000 \
--cores 20000 -j 14 \
--include-aws-benchmark-metrics
Monitor progress with `watch squeue` or `sinfo`.
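For example, to refresh your view of the queue every 30 seconds:
watch -n 30 "squeue -u $USER"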
- Update `config/units.tsv` with sample FASTQ locations and any metadata used by the rules (a hypothetical layout is sketched after this list).
- Modify `config/samples.tsv` to describe replicate groupings and annotation information consumed by downstream analyses.
- Customize alignment/quantification parameters in `config/config.yaml` to match your reference index locations, read layout, and strand-specific settings.
- Dry-run with `-n` to validate the DAG before launching the full job.
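A hypothetical `units.tsv` sketch — the column names below are illustrative only; defer to `config/units.tsv.template` for the authoritative schema:
# sample   unit   fq1                           fq2
sample1    u1     /fsx/data/S1_R1.fastq.gz      /fsx/data/S1_R2.fastq.gz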
The `docs/` directory contains logs of the resource and results directory trees from a representative run:
- `docs/results_tree.log` – final quantification outputs, including gene and isoform abundance tables and STAR alignment summaries.
- `docs/resources_tree.log` – cached environments, container images, and intermediate resources created during execution.
Use these files as references when verifying that your own runs are producing the expected directory layout.
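One way to compare a fresh run against the reference layout (assumes the `tree` utility is installed and that outputs land under `results/`):
tree results > my_results_tree.log
diff my_results_tree.log docs/results_tree.log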
- Ensure `SMK_SLURM_COMMENT` is set if your organization requires Slurm `--comment` tags for cost allocation (see the example below).
- If Singularity mounts fail, confirm the bind paths include the directories housing the input FASTQs and reference indices.
- For large read sets, tune the `mem_mb` and `runtime` defaults in the Snakemake command to fit the cluster instance types available to you.
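A sketch of the cost-allocation export (the tag value is hypothetical — substitute your organization's convention):
export SMK_SLURM_COMMENT="project=my-rnaseq-budget"
# If inputs live outside the default binds, extend --singularity-args, e.g.:
#   --singularity-args "-B /fsx:/fsx -B /my/fastq/dir:/my/fastq/dir"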