Skip to content

Strand specificity: STAR & DESeq2 #16

@likorneeva

Description

@likorneeva

Description

The current STAR alignment rule does not allow users to specify library strandedness, resulting in suboptimal gene quantification for stranded RNA-seq libraries. While STAR does generate strand-specific counts in columns 3 and 4 of ReadsPerGene.out.tab, the downstream DESeq2 script uses column 2 (unstranded) by default, which leads to loss of information and reduced statistical power.

Current Behavior

The STAR alignment rule in rules/star.smk generates ReadsPerGene.out.tab files with:

  • Column 2: Unstranded counts (used by DESeq2 script)
  • Column 3: Counts for reads aligning to forward strand
  • Column 4: Counts for reads aligning to reverse strand

The DESeq2 script (scripts/deseq2_v1.R) always uses column 2:

mat <- sapply(ls, function(p) {
  dt <- fread(p, skip = 4)
  setNames(dt$V2, dt$V1)  # Always uses V2 (column 2)
})

Problem: Information Loss with Unstranded Counting

When using unstranded counting on stranded libraries, significant information is lost due to ambiguous read assignment:

Example from my data comparing summary statistics:

Metric Unstranded (col 2) Forward (col 3) Reverse (col 4)
N_noFeature 3,360,145 34,042,763 3,944,564
N_ambiguous 1,653,887 33,685 638,622

Key observations:

  1. 1.65M reads marked as ambiguous in unstranded mode vs only 638K in strand-specific mode (61% reduction in ambiguity)
  2. Overlapping genes on opposite strands cause reads to be discarded entirely

Concrete example of a single gene:

Gene_ID       Unstranded  Forward  Reverse
MGI:5477018   0           151      0
MGI:3528744   770         18       752
  • MGI:5477018: Gets 0 counts in unstranded mode due to overlap with antisense gene, but 151 reads are actually present
  • MGI:3528744: Gets 770 counts in unstranded mode, but strand-specific counting shows 752 reads in column 4, indicating the true signal

Expected Behavior

The pipeline should:

  1. Allow users to specify library strandedness in the config
  2. Pass appropriate STAR parameters based on library type
  3. Use the correct column from ReadsPerGene.out.tab in the DESeq2 script

Proposed Solution

1. Add configuration option:

# config.yaml
library_strandedness: "unstranded"  
# Options: "unstranded", "forward", "reverse"
# - "unstranded": Standard RNA-seq, use column 2
# - "forward": Ligation-based stranded protocols, use column 3  
# - "reverse": dUTP/NSR/Illumina TruSeq/NEB stranded kits, use column 4

2. Update STAR alignment rule:

Add parameters to preserve strand information in BAM files (improves compatibility with other tools):

STAR --quantMode TranscriptomeSAM GeneCounts \
--outSAMstrandField intronMotif \
--outFilterIntronMotifs RemoveNoncanonical \
# ... rest of parameters

3. Update DESeq2 script:

Modify the count matrix import to use the appropriate column:

# Determine which column to use based on library type
strandedness <- snakemake@params[["library_strandedness"]]

column_map <- list(
  "unstranded" = 2, 
  "forward" = 3,
  "reverse" = 4 
)

count_column <- column_map[[strandedness]]

mat <- sapply(ls, function(p) {
  dt <- fread(p, skip = 4)
  setNames(dt[[count_column]], dt$V1)
})

This is important for:

  • Genes with antisense transcripts
  • Overlapping gene pairs on opposite strands
  • Studies focusing on strand-specific regulation

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions