Skip to content

Commit 4010d2f

Browse files
committed
feat: make --gtf and --bed mutually exclusive; GTF now runs all analyses
Allow users to provide either a GTF or BED annotation file (but not both). GTF mode runs everything (dupRadar, featureCounts, and all 7 RSeQC tools) by deriving transcript-level structure from the GTF. BED mode runs only RSeQC tools, skipping dupRadar/featureCounts with a warning. Key changes: - Extend GTF parser with Transcript struct (exon blocks + CDS range) - Add from_genes() constructors to all 5 annotation-requiring RSeQC tools - Extract shared junction/intron helpers into rseqc::common module - Pre-build all RSeQC data structures once (no more per-BAM BED re-parsing) - Fix inner_distance first-transcript-per-chromosome RSeQC bug - Update all documentation to reflect new CLI semantics
1 parent 0caec57 commit 4010d2f

File tree

21 files changed

+2062
-1083
lines changed

21 files changed

+2062
-1083
lines changed

AGENTS.md

Lines changed: 15 additions & 12 deletions
Original file line numberDiff line numberDiff line change
@@ -63,7 +63,8 @@ src/
6363
mod.rs — Re-exports output
6464
output.rs — featureCounts-format output & biotype counting
6565
rseqc/
66-
mod.rs — Re-exports all 7 RSeQC modules
66+
mod.rs — Re-exports all 7 RSeQC modules + common helpers
67+
common.rs — Shared junction/intron extraction, from_genes/from_bed builders
6768
bam_stat.rs — bam_stat.py reimplementation
6869
infer_experiment.rs — infer_experiment.py reimplementation
6970
read_duplication.rs — read_duplication.py reimplementation
@@ -83,17 +84,19 @@ in `main.rs`, no `lib.rs`. The `rna` module contains sub-modules for each tool g
8384
Inter-module access uses `crate::` paths (e.g., `use crate::rna::dupradar::counting::GeneCounts;`).
8485

8586
The CLI uses a single subcommand:
86-
- `rustqc rna <BAM>... --gtf <GTF> [--bed <BED>] [OPTIONS]`
87-
88-
This runs all analyses in a single pass: dupRadar duplicate rate analysis,
89-
featureCounts-compatible gene counting, and all 7 RSeQC-equivalent tools
90-
(bam_stat, infer_experiment, read_duplication, read_distribution,
91-
junction_annotation, junction_saturation, inner_distance).
92-
93-
The `--bed` flag is optional. If omitted, a warning is printed and the 5 tools
94-
that require a BED12 gene model (infer_experiment, read_distribution,
95-
junction_annotation, junction_saturation, inner_distance) are skipped.
96-
The 2 tools that don't need a BED file (bam_stat, read_duplication) still run.
87+
- `rustqc rna <BAM>... (--gtf <GTF> | --bed <BED>) [OPTIONS]`
88+
89+
The `--gtf` and `--bed` flags are **mutually exclusive** — provide one or the other:
90+
91+
- **`--gtf`** (recommended): Runs all analyses — dupRadar duplicate rate analysis,
92+
featureCounts-compatible gene counting, and all 7 RSeQC-equivalent tools
93+
(bam_stat, infer_experiment, read_duplication, read_distribution,
94+
junction_annotation, junction_saturation, inner_distance). The GTF parser
95+
extracts transcript-level structure (exons + CDS features) to derive all
96+
data needed by every tool.
97+
- **`--bed`**: Runs only the 7 RSeQC tools + bam_stat + read_duplication.
98+
DupRadar and featureCounts are skipped (they require gene-level annotation
99+
with biotype attributes that BED12 cannot provide).
97100

98101
Individual tools can be disabled via the YAML config file (each has an `enabled`
99102
toggle). Tool-specific parameters (e.g., `--min-intron`, `--inner-distance-lower-bound`)

CHANGELOG.md

Lines changed: 4 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -29,10 +29,10 @@ be disabled via the YAML config file.
2929
- bam_stat -- alignment statistics
3030
- infer_experiment -- library strandedness inference
3131
- read_duplication -- position-based and sequence-based duplication histograms
32-
- read_distribution -- read distribution across genomic features (requires `--bed`)
33-
- junction_annotation -- splice junction classification (requires `--bed`)
34-
- junction_saturation -- junction saturation curves (requires `--bed`)
35-
- inner_distance -- insert size estimation for paired-end reads (requires `--bed`)
32+
- read_distribution -- read distribution across genomic features
33+
- junction_annotation -- splice junction classification
34+
- junction_saturation -- junction saturation curves
35+
- inner_distance -- insert size estimation for paired-end reads
3636

3737
### General
3838

README.md

Lines changed: 11 additions & 16 deletions
Original file line numberDiff line numberDiff line change
@@ -29,7 +29,7 @@ All tools accept SAM/BAM/CRAM input and support processing multiple files in a s
2929

3030
### `rustqc rna` -- RNA-Seq quality control
3131

32-
Performs dupRadar-equivalent duplicate rate analysis, featureCounts-compatible read counting, and 7 RSeQC-equivalent QC analyses in a single pass. Given a duplicate-marked BAM, GTF annotation, and optional BED12 gene model, it computes per-gene duplication rates, fits a logistic regression model, generates diagnostic plots, produces gene-level count files with biotype summaries, and runs comprehensive QC metrics (bam_stat, infer_experiment, read_duplication, read_distribution, junction_annotation, junction_saturation, inner_distance).
32+
Performs dupRadar-equivalent duplicate rate analysis, featureCounts-compatible read counting, and 7 RSeQC-equivalent QC analyses in a single pass. Given a duplicate-marked BAM and either a GTF annotation or a BED12 gene model, it computes per-gene duplication rates, fits a logistic regression model, generates diagnostic plots, produces gene-level count files with biotype summaries, and runs comprehensive QC metrics (bam_stat, infer_experiment, read_duplication, read_distribution, junction_annotation, junction_saturation, inner_distance).
3333

3434
| Feature | dupRadar (R) | RustQC |
3535
|---------|-------------|--------|
@@ -56,7 +56,7 @@ The `rustqc rna` command also includes reimplementations of 7 popular [RSeQC](ht
5656
| junction_saturation | `junction_saturation.py` | Saturation analysis of detected splice junctions |
5757
| inner_distance | `inner_distance.py` | Inner distance distribution for paired-end reads |
5858

59-
All RSeQC tools run by default. Five of the seven tools require a BED12 gene model file (`--bed`); if `--bed` is not provided, those tools are skipped with a warning. Individual tools can be disabled via the [configuration file](#configuration).
59+
All RSeQC tools run by default when annotation is provided via `--gtf` or `--bed`. With a GTF file, all 7 tools run alongside dupRadar and featureCounts. With a BED file only, all 7 RSeQC tools run but dupRadar and featureCounts are skipped (they require a GTF). Individual tools can be disabled via the [configuration file](#configuration).
6060

6161
## Density scatter plots
6262

@@ -140,15 +140,16 @@ The binary will be at `target/release/rustqc`.
140140
### RNA duplicate rate analysis
141141

142142
```bash
143-
rustqc rna <INPUT>... --gtf <GTF> [OPTIONS]
143+
rustqc rna <INPUT>... (--gtf <GTF> | --bed <BED>) [OPTIONS]
144144
```
145145

146146
#### Required arguments
147147

148148
| Argument | Description |
149149
|----------|-------------|
150150
| `<INPUT>...` | One or more duplicate-marked alignment files (SAM/BAM/CRAM). Duplicates must be flagged (SAM flag 0x400), not removed. BAM/CRAM files should be sorted and indexed for parallel processing. |
151-
| `--gtf <GTF>` | Path to a GTF gene annotation file (e.g., from Ensembl or UCSC). |
151+
| `--gtf <GTF>` | Path to a GTF gene annotation file. Runs all analyses (dupRadar + featureCounts + all 7 RSeQC tools). Mutually exclusive with `--bed`. |
152+
| `--bed <BED>` | Path to a BED12 gene model file. Runs RSeQC tools only (dupRadar and featureCounts are skipped). Mutually exclusive with `--gtf`. |
152153

153154
#### Options
154155

@@ -166,10 +167,13 @@ rustqc rna <INPUT>... --gtf <GTF> [OPTIONS]
166167
#### Examples
167168

168169
```bash
169-
# Single BAM, single-end, unstranded
170+
# Single BAM with GTF (all analyses: dupRadar + featureCounts + all 7 RSeQC tools)
170171
rustqc rna sample.markdup.bam --gtf genes.gtf --outdir results/
171172

172-
# Single BAM, paired-end, reverse-stranded
173+
# Single BAM with BED (RSeQC tools only, no dupRadar/featureCounts)
174+
rustqc rna sample.markdup.bam --bed genes.bed --outdir results/
175+
176+
# Paired-end, reverse-stranded
173177
rustqc rna sample.markdup.bam --gtf genes.gtf --paired --stranded 2 --outdir results/
174178

175179
# CRAM input with reference FASTA
@@ -184,21 +188,12 @@ rustqc rna sample.markdup.bam --gtf gencode.gtf --paired --biotype-attribute gen
184188

185189
### RSeQC tools
186190

187-
The RSeQC tools are integrated into `rustqc rna` and run automatically. To enable the tools that require a BED12 gene model (infer_experiment, read_distribution, junction_annotation, junction_saturation, inner_distance), pass `--bed`:
188-
189-
```bash
190-
# Run everything: dupRadar + featureCounts + all 7 RSeQC tools
191-
rustqc rna sample.markdup.bam --gtf genes.gtf --bed genes.bed --outdir results/
192-
193-
# Without --bed: dupRadar + featureCounts + bam_stat + read_duplication only
194-
rustqc rna sample.markdup.bam --gtf genes.gtf --outdir results/
195-
```
191+
The 7 RSeQC tools are integrated into `rustqc rna` and run automatically with either `--gtf` or `--bed`. With a GTF file, all analyses run (dupRadar + featureCounts + all 7 RSeQC tools). With a BED file, only the 7 RSeQC tools run.
196192

197193
RSeQC-specific options:
198194

199195
| Option | Default | Description |
200196
|--------|---------|-------------|
201-
| `--bed <BED>` | none | BED12 gene model (enables 5 additional RSeQC tools) |
202197
| `-q` / `--mapq <N>` | `30` | Minimum mapping quality for RSeQC tools |
203198
| `--infer-experiment-sample-size <N>` | `200000` | Max reads to sample for strandedness inference |
204199
| `--min-intron <N>` | `50` | Minimum intron size for junction tools |

benchmark/README.md

Lines changed: 4 additions & 20 deletions
Original file line numberDiff line numberDiff line change
@@ -154,27 +154,11 @@ cargo build --release
154154
> to match the GENCODE GTF. The `--biotype-attribute gene_type` flag is needed
155155
> because GENCODE uses `gene_type` instead of the Ensembl default `gene_biotype`.
156156
157-
### 5. Run RSeQC tools (RustQC reimplementations)
157+
### 5. RSeQC tools (RustQC reimplementations)
158158

159-
The RSeQC tools are integrated into `rustqc rna` and run automatically when a
160-
`--bed` file is provided. To include all 7 RSeQC tools in the analysis, add
161-
`--bed` to the command:
162-
163-
```bash
164-
# Small — all analyses including RSeQC tools
165-
./target/release/rustqc rna benchmark/input/small/test.bam \
166-
--gtf benchmark/input/small/chr6.gtf -p --skip-dup-check \
167-
--bed benchmark/input/small/chr6.bed \
168-
-o benchmark/RustQC/small
169-
170-
# Large — all analyses including RSeQC tools
171-
./target/release/rustqc rna benchmark/input/large/GM12878_REP1.markdup.sorted.bam \
172-
--gtf benchmark/input/large/genes.gtf -p -t 10 \
173-
--bed benchmark/input/large/genes.bed \
174-
-o benchmark/RustQC/large \
175-
-c benchmark/input/large/config.yaml \
176-
--biotype-attribute gene_type
177-
```
159+
All 7 RSeQC tools are integrated into `rustqc rna` and run automatically when
160+
`--gtf` is provided (the commands in step 4 already include all RSeQC analyses).
161+
No separate `--bed` file is needed — all required data is derived from the GTF.
178162

179163
### 6. Generate RSeQC Python reference outputs
180164

docs/src/content/docs/getting-started/introduction.md

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -55,7 +55,7 @@ Seven reimplementations of [RSeQC](https://rseqc.sourceforge.net/) tools, all in
5555
| junction_saturation | Assess saturation of splice junction detection at increasing read depths |
5656
| inner_distance | Compute inner distance between paired-end read mates |
5757

58-
Five tools (infer_experiment, read_distribution, junction_annotation, junction_saturation, inner_distance) require a BED12 gene model file via `--bed`. If `--bed` is omitted, these tools are skipped with a warning while the remaining tools still run. Individual tools can be disabled via the YAML configuration file.
58+
When a GTF file is provided via `--gtf`, all 7 tools run automatically — transcript-level structure is extracted from the GTF. Alternatively, a BED12 gene model file can be provided via `--bed` (mutually exclusive with `--gtf`), which runs only the RSeQC tools. Individual tools can be disabled via the YAML configuration file.
5959

6060
## Credits
6161

docs/src/content/docs/getting-started/quickstart.md

Lines changed: 14 additions & 16 deletions
Original file line numberDiff line numberDiff line change
@@ -10,8 +10,7 @@ This guide walks you through a basic RustQC analysis from start to finish.
1010
The `rustqc rna` command runs all analyses in a single pass. It requires:
1111

1212
- A **duplicate-marked** alignment file (BAM, SAM, or CRAM). Duplicates must be flagged with SAM flag 0x400 by a tool like [Picard MarkDuplicates](https://broadinstitute.github.io/picard/), [samblaster](https://github.com/GregoryFaust/samblaster), or [sambamba](https://github.com/biod/sambamba).
13-
- A **GTF annotation** file (for dupRadar, featureCounts, and gene-level analyses).
14-
- Optionally, a **BED12 gene model** file (`--bed`) for the 5 RSeQC tools that require it (infer_experiment, read_distribution, junction_annotation, junction_saturation, inner_distance). If omitted, these tools are skipped with a warning.
13+
- Either a **GTF annotation** file (`--gtf`) or a **BED12 gene model** file (`--bed`). With a GTF, all analyses run (dupRadar, featureCounts, and all 7 RSeQC tools). With a BED file, only the 7 RSeQC tools run. The two flags are mutually exclusive.
1514

1615
## RNA-seq duplicate analysis
1716

@@ -50,13 +49,12 @@ integrated into the `rustqc rna` command. They run automatically alongside the
5049
dupRadar and featureCounts analyses:
5150

5251
```bash
53-
# Run everything: dupRadar + featureCounts + all 7 RSeQC tools
54-
rustqc rna sample.markdup.bam --gtf genes.gtf --bed genes.bed -p -o results/
55-
```
52+
# Run everything with a GTF: dupRadar + featureCounts + all 7 RSeQC tools
53+
rustqc rna sample.markdup.bam --gtf genes.gtf -p -o results/
5654

57-
The `--bed` flag provides a BED12 gene model required by 5 of the 7 tools. If
58-
omitted, those tools are skipped with a warning while bam_stat and
59-
read_duplication still run.
55+
# Or run RSeQC tools only with a BED file (no dupRadar/featureCounts)
56+
rustqc rna sample.markdup.bam --bed genes.bed -p -o results/
57+
```
6058

6159
To disable specific RSeQC tools, use a YAML config file:
6260

@@ -67,7 +65,7 @@ inner_distance:
6765
```
6866
6967
```bash
70-
rustqc rna sample.markdup.bam --gtf genes.gtf --bed genes.bed -p -c config.yaml -o results/
68+
rustqc rna sample.markdup.bam --gtf genes.gtf -p -c config.yaml -o results/
7169
```
7270

7371
See [RSeQC Outputs](/outputs/rseqc/) for details on output files.
@@ -79,32 +77,32 @@ producing its own set of output files:
7977

8078
```bash
8179
rustqc rna sample1.bam sample2.bam sample3.bam \
82-
--gtf genes.gtf --bed genes.bed -p -t 8 -o results/
80+
--gtf genes.gtf -p -t 8 -o results/
8381
```
8482

85-
The GTF and BED files are parsed once and shared across all samples. Threads
83+
The annotation file is parsed once and shared across all samples. Threads
8684
are distributed automatically among concurrent jobs.
8785

8886
## Common options
8987

9088
```bash
9189
# Stranded library (forward)
92-
rustqc rna sample.bam --gtf genes.gtf --bed genes.bed -p -s 1
90+
rustqc rna sample.bam --gtf genes.gtf -p -s 1
9391

9492
# Use 8 threads
95-
rustqc rna sample.bam --gtf genes.gtf --bed genes.bed -p -t 8
93+
rustqc rna sample.bam --gtf genes.gtf -p -t 8
9694

9795
# CRAM input with reference
98-
rustqc rna sample.cram --gtf genes.gtf --bed genes.bed -p --reference genome.fa
96+
rustqc rna sample.cram --gtf genes.gtf -p --reference genome.fa
9997

10098
# Skip duplicate-marking validation
10199
rustqc rna sample.bam --gtf genes.gtf -p --skip-dup-check
102100

103101
# Use a YAML config file
104-
rustqc rna sample.bam --gtf genes.gtf --bed genes.bed -p --config config.yaml
102+
rustqc rna sample.bam --gtf genes.gtf -p --config config.yaml
105103

106104
# Custom MAPQ cutoff for RSeQC tools
107-
rustqc rna sample.bam --gtf genes.gtf --bed genes.bed -p -q 20 -o results/
105+
rustqc rna sample.bam --gtf genes.gtf -p -q 20 -o results/
108106
```
109107

110108
## Next steps

docs/src/content/docs/outputs/rseqc.md

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -9,7 +9,7 @@ original Python implementation.
99

1010
All RSeQC tools run automatically as part of the `rustqc rna` command and use
1111
the input filename stem as a prefix for output files. For example,
12-
`rustqc rna sample.bam --gtf genes.gtf --bed genes.bed -o results/` produces
12+
`rustqc rna sample.bam --gtf genes.gtf -o results/` produces
1313
RSeQC output files like `results/sample.bam_stat.txt`.
1414

1515
## bam_stat

docs/src/content/docs/usage/cli-reference.md

Lines changed: 31 additions & 27 deletions
Original file line numberDiff line numberDiff line change
@@ -16,7 +16,7 @@ read_distribution, junction_annotation, junction_saturation, inner_distance).
1616
### Synopsis
1717

1818
```
19-
rustqc rna <INPUT>... --gtf <GTF> [--bed <BED>] [OPTIONS]
19+
rustqc rna <INPUT>... (--gtf <GTF> | --bed <BED>) [OPTIONS]
2020
```
2121

2222
### Positional arguments
@@ -31,32 +31,36 @@ producing its own set of output files. Threads are divided evenly among
3131
concurrent jobs.
3232

3333
```bash
34-
# Single file
35-
rustqc rna sample.bam --gtf genes.gtf --bed genes.bed
34+
# Single file with GTF (all analyses)
35+
rustqc rna sample.bam --gtf genes.gtf
36+
37+
# Single file with BED (RSeQC tools only; dupRadar/featureCounts skipped)
38+
rustqc rna sample.bam --bed genes.bed
3639

3740
# Multiple files
38-
rustqc rna sample1.bam sample2.bam sample3.bam --gtf genes.gtf --bed genes.bed
41+
rustqc rna sample1.bam sample2.bam sample3.bam --gtf genes.gtf
3942
```
4043

41-
### Required options
44+
### Annotation options
4245

43-
#### `--gtf <GTF>` / `-g <GTF>`
46+
Exactly one of `--gtf` or `--bed` must be provided. They are **mutually exclusive**.
4447

45-
Path to the GTF gene annotation file. Used to define gene models for read
46-
counting and expression-level calculation. The GTF must contain `exon` features
47-
with a `gene_id` attribute.
48+
#### `--gtf <GTF>` / `-g <GTF>`
4849

49-
### General options
50+
Path to a GTF gene annotation file. The GTF must contain `exon` features with a
51+
`gene_id` attribute. When a GTF is provided, **all analyses run**: dupRadar,
52+
featureCounts, and all 7 RSeQC tools. Transcript-level structure (exon blocks,
53+
CDS features) is extracted automatically and used by the RSeQC tools that
54+
previously required a separate BED file.
5055

5156
#### `-b, --bed <BED>`
5257

53-
Path to a BED12-format gene model file. Required for 5 of the 7 RSeQC tools
54-
(infer_experiment, read_distribution, junction_annotation, junction_saturation,
55-
inner_distance).
58+
Path to a BED12-format gene model file. When a BED file is provided **without**
59+
a GTF, only the 7 RSeQC tools run (plus bam_stat and read_duplication). dupRadar
60+
and featureCounts are skipped because BED files lack gene-level grouping and
61+
biotype information.
5662

57-
If omitted, a warning is printed listing the tools that will be skipped. The
58-
2 tools that do not need a BED file (bam_stat, read_duplication) still run.
59-
Individual tools can also be disabled via the [configuration file](/usage/configuration/).
63+
Individual tools can be disabled via the [configuration file](/usage/configuration/).
6064

6165
#### `-o, --outdir <DIR>`
6266

@@ -170,35 +174,35 @@ tool runs by default as part of `rustqc rna` and can be disabled via the
170174
### Examples
171175

172176
```bash
173-
# Basic paired-end analysis (all tools)
174-
rustqc rna sample.bam --gtf genes.gtf --bed genes.bed -p -o results/
177+
# Basic paired-end analysis with GTF (all tools)
178+
rustqc rna sample.bam --gtf genes.gtf -p -o results/
175179

176-
# Reverse-stranded library with 8 threads
177-
rustqc rna sample.bam --gtf genes.gtf --bed genes.bed -p -s 2 -t 8 -o results/
180+
# BED-only mode (RSeQC tools only; dupRadar/featureCounts skipped)
181+
rustqc rna sample.bam --bed genes.bed -p -o results/
178182

179-
# Without BED file (bam_stat + read_duplication only, others skipped)
180-
rustqc rna sample.bam --gtf genes.gtf -p -o results/
183+
# Reverse-stranded library with 8 threads
184+
rustqc rna sample.bam --gtf genes.gtf -p -s 2 -t 8 -o results/
181185

182186
# CRAM input with reference
183-
rustqc rna sample.cram --gtf genes.gtf --bed genes.bed -p -r genome.fa -o results/
187+
rustqc rna sample.cram --gtf genes.gtf -p -r genome.fa -o results/
184188

185189
# GENCODE GTF with explicit biotype attribute
186-
rustqc rna sample.bam --gtf gencode.v46.gtf --bed genes.bed -p \
190+
rustqc rna sample.bam --gtf gencode.v46.gtf -p \
187191
--biotype-attribute gene_type -o results/
188192

189193
# Custom junction saturation sampling range
190-
rustqc rna sample.bam --gtf genes.gtf --bed genes.bed -p \
194+
rustqc rna sample.bam --gtf genes.gtf -p \
191195
--junction-saturation-percentile-floor 10 \
192196
--junction-saturation-percentile-ceiling 100 \
193197
--junction-saturation-percentile-step 10 -o results/
194198

195199
# Custom inner distance histogram bounds
196-
rustqc rna sample.bam --gtf genes.gtf --bed genes.bed -p \
200+
rustqc rna sample.bam --gtf genes.gtf -p \
197201
--inner-distance-lower-bound -500 --inner-distance-upper-bound 500 \
198202
--inner-distance-step 10 -o results/
199203

200204
# Multiple BAM files with parallel processing
201-
rustqc rna *.bam --gtf genes.gtf --bed genes.bed -p -t 8 -o results/
205+
rustqc rna *.bam --gtf genes.gtf -p -t 8 -o results/
202206
```
203207

204208
---

0 commit comments

Comments
 (0)