Strand specificity: STAR & DESeq2

### Description
The current STAR alignment rule does not allow users to specify library strandedness, resulting in suboptimal gene quantification for stranded RNA-seq libraries. While STAR does generate strand-specific counts in columns 3 and 4 of `ReadsPerGene.out.tab`, the downstream DESeq2 script uses column 2 (unstranded) by default, which leads to loss of information and reduced statistical power.

### Current Behavior
The STAR alignment rule in `rules/star.smk` generates `ReadsPerGene.out.tab` files with:
- **Column 2**: Unstranded counts (used by DESeq2 script)
- **Column 3**: Counts for reads aligning to forward strand
- **Column 4**: Counts for reads aligning to reverse strand

The DESeq2 script (`scripts/deseq2_v1.R`) always uses column 2:
```r
mat <- sapply(ls, function(p) {
  dt <- fread(p, skip = 4)
  setNames(dt$V2, dt$V1)  # Always uses V2 (column 2)
})
```

### Problem: Information Loss with Unstranded Counting

When using unstranded counting on stranded libraries, significant information is lost due to ambiguous read assignment:

**Example from my data comparing summary statistics:**

| Metric | Unstranded (col 2) | Forward (col 3) | Reverse (col 4) |
|--------|-------------------|-----------------|-----------------|
| N_noFeature | 3,360,145 | 34,042,763 | 3,944,564 |
| N_ambiguous | 1,653,887 | 33,685 | 638,622 |

**Key observations:**
1. **1.65M reads marked as ambiguous** in unstranded mode vs only 638K in strand-specific mode (61% reduction in ambiguity)
2. Overlapping genes on opposite strands cause reads to be discarded entirely

**Concrete example of a single gene:**
```
Gene_ID       Unstranded  Forward  Reverse
MGI:5477018   0           151      0
MGI:3528744   770         18       752
```

- MGI:5477018: Gets **0 counts** in unstranded mode due to overlap with antisense gene, but 151 reads are actually present
- MGI:3528744: Gets **770 counts** in unstranded mode, but strand-specific counting shows **752 reads** in column 4, indicating the true signal

### Expected Behavior
The pipeline should:
1. Allow users to specify library strandedness in the config
2. Pass appropriate STAR parameters based on library type
3. Use the correct column from `ReadsPerGene.out.tab` in the DESeq2 script

### Proposed Solution

#### 1. Add configuration option:
```yaml
# config.yaml
library_strandedness: "unstranded"  
# Options: "unstranded", "forward", "reverse"
# - "unstranded": Standard RNA-seq, use column 2
# - "forward": Ligation-based stranded protocols, use column 3  
# - "reverse": dUTP/NSR/Illumina TruSeq/NEB stranded kits, use column 4
```

#### 2. Update STAR alignment rule:
Add parameters to preserve strand information in BAM files (improves compatibility with other tools):
```bash
STAR --quantMode TranscriptomeSAM GeneCounts \
--outSAMstrandField intronMotif \
--outFilterIntronMotifs RemoveNoncanonical \
# ... rest of parameters
```

#### 3. Update DESeq2 script:
Modify the count matrix import to use the appropriate column:
```r
# Determine which column to use based on library type
strandedness <- snakemake@params[["library_strandedness"]]

column_map <- list(
  "unstranded" = 2, 
  "forward" = 3,
  "reverse" = 4 
)

count_column <- column_map[[strandedness]]

mat <- sapply(ls, function(p) {
  dt <- fread(p, skip = 4)
  setNames(dt[[count_column]], dt$V1)
})
```

This is important for:
- Genes with antisense transcripts
- Overlapping gene pairs on opposite strands
- Studies focusing on strand-specific regulation

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Strand specificity: STAR & DESeq2 #16

Description

Current Behavior

Problem: Information Loss with Unstranded Counting

Expected Behavior

Proposed Solution

1. Add configuration option:

2. Update STAR alignment rule:

3. Update DESeq2 script:

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Metric	Unstranded (col 2)	Forward (col 3)	Reverse (col 4)
N_noFeature	3,360,145	34,042,763	3,944,564
N_ambiguous	1,653,887	33,685	638,622

Strand specificity: STAR & DESeq2 #16

Description

Description

Current Behavior

Problem: Information Loss with Unstranded Counting

Expected Behavior

Proposed Solution

1. Add configuration option:

2. Update STAR alignment rule:

3. Update DESeq2 script:

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions