Description
The current STAR alignment rule does not allow users to specify library strandedness, resulting in suboptimal gene quantification for stranded RNA-seq libraries. While STAR does generate strand-specific counts in columns 3 and 4 of ReadsPerGene.out.tab, the downstream DESeq2 script uses column 2 (unstranded) by default, which leads to loss of information and reduced statistical power.
Current Behavior
The STAR alignment rule in rules/star.smk generates ReadsPerGene.out.tab files with:
- Column 2: Unstranded counts (used by DESeq2 script)
- Column 3: Counts for reads aligning to forward strand
- Column 4: Counts for reads aligning to reverse strand
The DESeq2 script (scripts/deseq2_v1.R) always uses column 2:
mat <- sapply(ls, function(p) {
dt <- fread(p, skip = 4)
setNames(dt$V2, dt$V1) # Always uses V2 (column 2)
})
Problem: Information Loss with Unstranded Counting
When using unstranded counting on stranded libraries, significant information is lost due to ambiguous read assignment:
Example from my data comparing summary statistics:
| Metric |
Unstranded (col 2) |
Forward (col 3) |
Reverse (col 4) |
| N_noFeature |
3,360,145 |
34,042,763 |
3,944,564 |
| N_ambiguous |
1,653,887 |
33,685 |
638,622 |
Key observations:
- 1.65M reads marked as ambiguous in unstranded mode vs only 638K in strand-specific mode (61% reduction in ambiguity)
- Overlapping genes on opposite strands cause reads to be discarded entirely
Concrete example of a single gene:
Gene_ID Unstranded Forward Reverse
MGI:5477018 0 151 0
MGI:3528744 770 18 752
- MGI:5477018: Gets 0 counts in unstranded mode due to overlap with antisense gene, but 151 reads are actually present
- MGI:3528744: Gets 770 counts in unstranded mode, but strand-specific counting shows 752 reads in column 4, indicating the true signal
Expected Behavior
The pipeline should:
- Allow users to specify library strandedness in the config
- Pass appropriate STAR parameters based on library type
- Use the correct column from
ReadsPerGene.out.tab in the DESeq2 script
Proposed Solution
1. Add configuration option:
# config.yaml
library_strandedness: "unstranded"
# Options: "unstranded", "forward", "reverse"
# - "unstranded": Standard RNA-seq, use column 2
# - "forward": Ligation-based stranded protocols, use column 3
# - "reverse": dUTP/NSR/Illumina TruSeq/NEB stranded kits, use column 4
2. Update STAR alignment rule:
Add parameters to preserve strand information in BAM files (improves compatibility with other tools):
STAR --quantMode TranscriptomeSAM GeneCounts \
--outSAMstrandField intronMotif \
--outFilterIntronMotifs RemoveNoncanonical \
# ... rest of parameters
3. Update DESeq2 script:
Modify the count matrix import to use the appropriate column:
# Determine which column to use based on library type
strandedness <- snakemake@params[["library_strandedness"]]
column_map <- list(
"unstranded" = 2,
"forward" = 3,
"reverse" = 4
)
count_column <- column_map[[strandedness]]
mat <- sapply(ls, function(p) {
dt <- fread(p, skip = 4)
setNames(dt[[count_column]], dt$V1)
})
This is important for:
- Genes with antisense transcripts
- Overlapping gene pairs on opposite strands
- Studies focusing on strand-specific regulation
Description
The current STAR alignment rule does not allow users to specify library strandedness, resulting in suboptimal gene quantification for stranded RNA-seq libraries. While STAR does generate strand-specific counts in columns 3 and 4 of
ReadsPerGene.out.tab, the downstream DESeq2 script uses column 2 (unstranded) by default, which leads to loss of information and reduced statistical power.Current Behavior
The STAR alignment rule in
rules/star.smkgeneratesReadsPerGene.out.tabfiles with:The DESeq2 script (
scripts/deseq2_v1.R) always uses column 2:Problem: Information Loss with Unstranded Counting
When using unstranded counting on stranded libraries, significant information is lost due to ambiguous read assignment:
Example from my data comparing summary statistics:
Key observations:
Concrete example of a single gene:
Expected Behavior
The pipeline should:
ReadsPerGene.out.tabin the DESeq2 scriptProposed Solution
1. Add configuration option:
2. Update STAR alignment rule:
Add parameters to preserve strand information in BAM files (improves compatibility with other tools):
STAR --quantMode TranscriptomeSAM GeneCounts \ --outSAMstrandField intronMotif \ --outFilterIntronMotifs RemoveNoncanonical \ # ... rest of parameters3. Update DESeq2 script:
Modify the count matrix import to use the appropriate column:
This is important for: