Update Alignment #466
# RNA-seq Bulk Alignment Pipeline: Class Integration and Batch Processing
## Summary
This PR consolidates the previously scattered scripts behind a single entry point, `omicverse.bulk.Alignment`, covering the full **SRA → FASTQ → QC (fastp) → STAR → featureCounts** chain with **concurrency**, **idempotent skipping**, and a **consistent output layout**. It adds:
- `Alignment.fetch_metadata()`: GEO/ENA metadata retrieval and RunInfo generation
- `Alignment.prefetch()`: concurrent multi-run SRA downloads (with progress, resume, and official validation)
- `Alignment.fasterq()`: parallel `fasterq-dump` (outputs and tmp isolated per SRR)
- `Alignment.fastp()`: batch QC with cache detection and report collection
- `Alignment.star_align()`: batch STAR (automatic index selection/caching, idempotent BAMs, SRR-named symlinks)
- `Alignment.featurecounts()`: batch counting with an **automatically merged matrix** (auto-inferred GTF; returns the matrix path)
> Standardized output naming:
> - The STAR directory keeps the official name `Aligned.sortedByCoord.out.bam` and **additionally exposes** `SRR.sorted.bam` (symlink/copy) to avoid column-name collisions when merging the matrix.
> - Per-sample featureCounts tables are unified as `<counts_root>/<SRR>/<SRR>.counts.txt`; the merged matrix is `<counts_root>/matrix.<by>.csv`.
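As a sketch of this naming convention (the helper name is illustrative, not part of the PR's API), the per-sample table and merged-matrix paths can be built like this:

```python
from pathlib import Path

def counts_paths(counts_root: str, srr: str, by: str = "auto"):
    """Illustrative helper: build the documented featureCounts output paths."""
    root = Path(counts_root)
    per_sample = root / srr / f"{srr}.counts.txt"   # <counts_root>/<SRR>/<SRR>.counts.txt
    matrix = root / f"matrix.{by}.csv"              # <counts_root>/matrix.<by>.csv
    return per_sample, matrix

table, matrix = counts_paths("work/counts", "SRR123456")
print(table.as_posix())   # work/counts/SRR123456/SRR123456.counts.txt
print(matrix.as_posix())  # work/counts/matrix.auto.csv
```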
---
## Scope
- Added/updated modules (illustrative; the PR diff is authoritative):
  - `bulk/alignment.py` (core class and adapters)
  - `bulk/sra_prefetch.py` (concurrent prefetch with a minimally invasive progress-bar retrofit)
  - `bulk/sra_fasterq.py` (batch fasterq; idempotency and retries)
  - `bulk/qc_fastp.py` (batch fastp and output detection)
  - `bulk/star_step.py` (batch adapter over the existing `star_tools`; SRR symlinks)
  - `bulk/count_step.py` / `bulk/count_tools.py` (batch featureCounts and matrix merging; column names unified to SRR)
  - `bulk/tools_check.py` (utilities such as `which_or_find` and `merged_env`)
  - `bulk/__init__.py` exports `Alignment` and `AlignmentConfig`
---
## Dependencies
### System & bioinformatics tools
- **sra-tools** (provides `prefetch`, `vdb-validate`, `fasterq-dump`)
- **samtools** (BAM indexing)
- **STAR**
- **subread** (provides `featureCounts`)
- **fastp**
- Recommended: `pigz` (faster gzip), `aria2` (optional download acceleration)
### Python packages
- Required: `pandas`, `tqdm`, `numpy`, `requests`, `lxml`
- Possibly needed (depending on the metadata-fetching implementation): `biopython`
**See the `environment.yml` shipped with this PR (bioconda preferred).**
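A minimal availability check for these binaries (a sketch; the PR's own `tools_check.py` utilities such as `which_or_find` may behave differently):

```python
import shutil

REQUIRED = ["prefetch", "vdb-validate", "fasterq-dump",
            "samtools", "STAR", "featureCounts", "fastp"]

def missing_tools(tools):
    """Return the subset of tool names not found on PATH."""
    return [t for t in tools if shutil.which(t) is None]

if __name__ == "__main__":
    absent = missing_tools(REQUIRED)
    if absent:
        print("Missing tools:", ", ".join(absent))
    else:
        print("All required tools found.")
```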
---
## Outputs Layout
Default outputs after a run (rooted at `work/`):
```
work/
├── prefetch/                              # .sra files from prefetch
│   └── SRRxxxxxx/SRRxxxxxx.sra
├── fasterq/
│   └── SRRxxxxxx/
│       ├── SRRxxxxxx_1.fastq.gz
│       └── SRRxxxxxx_2.fastq.gz
├── fastp/
│   └── SRRxxxxxx/
│       ├── SRRxxxxxx_clean_1.fastq.gz
│       ├── SRRxxxxxx_clean_2.fastq.gz
│       ├── SRRxxxxxx.fastp.json
│       └── SRRxxxxxx.fastp.html
├── star/
│   └── SRRxxxxxx/
│       ├── Aligned.sortedByCoord.out.bam      # official file name (kept)
│       ├── Aligned.sortedByCoord.out.bam.bai
│       ├── SRRxxxxxx.sorted.bam               # SRR-named (symlink/copy)
│       └── SRRxxxxxx.sorted.bam.bai
└── counts/
    ├── SRRxxxxxx/SRRxxxxxx.counts.txt
    └── matrix.auto.csv                        # merged matrix (rows = gene_id, columns = SRR)
```
---
## Configuration
Environment variables:
- `NCBI_SETTINGS` (optional): path to the SRA tools configuration
- `TMPDIR` (optional): temporary directory for large files
- `FC_GTF_HINT` (optional): a GTF path hint for when the GTF cannot be inferred from the STAR index
Key `AlignmentConfig` fields (all with defaults):
- `work_root`, `prefetch_root`, `fasterq_root`, `fastp_root`, `star_index_root`, `star_align_root`, `counts_root`
- `threads` (concurrency control; balance against per-task threads)
- `memory` (passed to fasterq as `--mem`, e.g. `"8G"`)
- `gzip_fastq` (whether fasterq output is compressed)
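The field list above can be captured as a dataclass; this is a sketch under the assumption that the defaults mirror the usage example below, not the PR's actual `AlignmentConfig` definition:

```python
from dataclasses import dataclass

@dataclass
class AlignmentConfigSketch:
    """Illustrative mirror of the documented AlignmentConfig fields."""
    work_root: str = "work"
    prefetch_root: str = "work/prefetch"
    fasterq_root: str = "work/fasterq"
    fastp_root: str = "work/fastp"
    star_index_root: str = "index"
    star_align_root: str = "work/star"
    counts_root: str = "work/counts"
    threads: int = 16        # overall concurrency; balance against per-task threads
    memory: str = "8G"       # forwarded to fasterq-dump as --mem
    gzip_fastq: bool = True  # compress fasterq output

cfg = AlignmentConfigSketch(threads=8)
```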
---
## End-to-End Usage
```python
from omicverse.bulk import Alignment, AlignmentConfig

cfg = AlignmentConfig(
    work_root="work",
    prefetch_root="work/prefetch",
    fasterq_root="work/fasterq",
    fastp_root="work/fastp",
    star_index_root="index",
    star_align_root="work/star",
    counts_root="work/counts",
    threads=16,
    memory="8G",
    gzip_fastq=True,
)

aln = Alignment(cfg)
meta = aln.fetch_metadata("GSE157103")
sra_paths = aln.prefetch(meta["srr_list"], max_concurrent=4)
fq_pairs = aln.fasterq(meta["srr_list"])
qc_results = aln.fastp(fq_pairs)
pairs_for_star = [(srr, c1, c2) for (srr, c1, c2, _, _) in qc_results]
bam_triples = aln.star_align(
    pairs_for_star,
    gencode_release="v44",
    sjdb_overhang=149,
    accession_for_species=None,
    max_workers=2,
)
fc_out = aln.featurecounts(
    bam_triples,
    simple=True,
    by="auto",
    threads=8,
)
print("merged matrix:", fc_out.get("matrix"))
```
---
## Repro Checklist
- [ ] **Re-running** the same batch of SRRs: every stage reports `[SKIP]` and no outputs are regenerated
- [ ] fasterq retry works (automatically falls back to the local `.sra` input when the network is unstable)
- [ ] STAR outputs include both the **official name** and the **SRR-named symlink**, pointing at the same data
- [ ] Merged featureCounts matrix columns are **SRR** accessions, with no duplicate-column conflicts
- [ ] `counts/matrix.auto.csv` has at least as many rows as the largest per-sample gene table, with non-empty `gene_id`
---
## Concurrency and Performance Tuning
- **On a machine with N cores**: `max_workers × per-sample threads ≤ N` (discount somewhat when hyper-threading is counted)
- `prefetch`: outer concurrency (e.g. `max_concurrent=4`); each download polls progress every 0.25 s
- `fasterq-dump`: suggested `--mem 8–16G` and `threads_per_job 12–24`; be conservative with the number of concurrent samples
- `STAR`: **memory-bound**; for large genomes use 8–16 threads per sample with 1–2 concurrent samples
- `featureCounts`: moderate `-T` (8–16); I/O is the main bottleneck
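The core rule `max_workers × per-sample threads ≤ N` can be encoded directly (a sketch; the name is illustrative):

```python
import os
from typing import Optional

def plan_workers(per_sample_threads: int, total_cores: Optional[int] = None) -> int:
    """How many samples can run concurrently without oversubscribing the machine."""
    n = total_cores or os.cpu_count() or 1
    return max(1, n // per_sample_threads)

print(plan_workers(8, total_cores=16))  # 2 concurrent STAR jobs at 8 threads each
```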
---
## Backward Compatibility
- The original scripts can still be **invoked standalone**; the class methods are thin adapters and do not change the underlying logic
- STAR outputs gain an SRR-named symlink, which does not affect existing consumers and helps keep matrix column names unique
---
## Test Plan
- [ ] Local end-to-end run on a small batch (2–3 SRRs)
- [ ] Medium batch (8–12 SRRs) to exercise the concurrency parameters
- [ ] Resume after interruption (kill and restart) with consistent outputs and logs
- [ ] Manually delete one stage's outputs and verify only that stage is redone (everything else `[SKIP]`)
- [ ] With only the official STAR BAM name present, `_normalize_bam` backfills the `SRR.sorted.bam` symlink
- [ ] With `FC_GTF_HINT` pointing at a GTF, the automatic inference is bypassed
---
## Next
- [ ] Add a **CLI** (`ov-bulk align ...`) wrapping the class methods
- [ ] Support **automatic STAR index download/build** (gencode/ensembl) with cache records
- [ ] Integrate **salmon**/**kallisto** as optional lightweight quantifiers
- [ ] Add **md5/sha256** checksums and an output manifest for auditing and reproducibility
- [ ] Minimal CI on GitHub Actions: lint + unit tests + small-dataset integration test
---
## Troubleshooting
- **MergeError: duplicate columns** → fixed in `count_tools.py` by renaming count columns to the **SRR** before merging
- **fasterq exit code 3 / missing output** → unstable network or S3 timeout; retries and fallback to the local `.sra` are in place
- **STAR out of memory** → lower `threads` or `max_workers`; adjust `--limitGenomeGenerateRAM` if necessary
- **GTF not found** → pass `gtf=` explicitly or set `FC_GTF_HINT`; or make sure a GTF exists under the STAR index's parent `_cache/`
- **Permission/symlink issues** → on file systems without symlink support, the code automatically falls back to copying
---
## Pre-submission Checklist
- [ ] Code passes `flake8`/`black` (or the project's existing conventions)
- [ ] No large files committed (`.sra`/`.bam`/`.fastq.gz`, etc.)
- [ ] Documentation and example paths match the default configuration
- [ ] Changes and migration notes are clearly recorded in `CHANGELOG.md` or this PR
Add Alignment Features
# OmicVerse Enhanced Alignment Pipeline
## 🚀 New Feature Overview
The enhanced OmicVerse alignment pipeline now supports multiple input types, including:
1. **SRA data** - Original public repository data
2. **Direct FASTQ files** - User-provided FASTQ files
## ✨ Key Features
### 🔧 Unified Input Interface
- Automatic input type detection
- Unified processing across data sources
- Flexible sample ID assignment mechanism
### 🏢 FASTQ Input Support
- Automatic FASTQ file discovery
- Intelligent sample ID extraction
- Automatic pairing for paired-end sequencing
- File integrity validation
### 🔍 Enhanced Tool Checks
- Automatic detection of required software
- Installation guidance
- Optional automatic installation
### ⚙️ Flexible Configuration System
- YAML/JSON configuration files
- Multiple preset configuration templates
- Runtime parameter adjustments
### 🚀 Multiple Download Modes
- **prefetch mode**: Use the NCBI SRA Toolkit (default)
- **iseq mode**: Use the iseq tool with multi-database support, Aspera acceleration, direct gzip downloads, and more
- **iseq enhancements**: Batch downloads accept list-style inputs
## 📋 Quick Start
### 1. Basic Usage
```python
from omicverse.bulk import geo_data_preprocess, fq_data_preprocess

# Accept a path to an SRA run-list text file
result = geo_data_preprocess(input_data="./srr_list.txt")

# Accept a single SRA ID or a list of SRA IDs
data_list = ["SRR123456", "SRR123457"]
result = geo_data_preprocess(input_data=data_list)

# Accept FASTQ files directly
fastq_files = [
    "./work/fasterq/SRR12544421/SRR12544419_1.fastq.gz",
    "./work/fasterq/SRR12544421/SRR12544419_2.fastq.gz",
    "./work/fasterq/SRR12544421/SRR12544421_1.fastq.gz",
    "./work/fasterq/SRR12544421/SRR12544421_2.fastq.gz",
]
result = fq_data_preprocess(input_data=fastq_files)
```
### 2. Advanced Usage
```python
from omicverse.bulk import geo_data_preprocess, fq_data_preprocess, AlignmentConfig

cfg = AlignmentConfig(
    work_root="work",            # Root directory for analysis outputs
    threads=64,                  # CPU resources
    genome="human",              # Source organism
    download_method="prefetch",  # Download method; set to "iseq" to use iseq
    memory="128G",               # Memory allocation
    fastp_enabled=True,          # QC option
    gzip_fastq=True,             # FASTQ compression option
)
result = geo_data_preprocess(input_data=data_list, config=cfg)
result = fq_data_preprocess(input_data=fastq_files, config=cfg)
```
### 3. Download Mode Selection
#### prefetch mode (default)
Use the NCBI SRA Toolkit for downloads; suitable for most scenarios.
#### iseq mode
Use the iseq tool for downloads with additional advanced capabilities:
```python
# Custom iseq configuration
config = {
    "work_root": "work",
    "download_method": "iseq",
    "iseq_gzip": True,        # Download FASTQ as gzip
    "iseq_aspera": True,      # Enable Aspera acceleration
    "iseq_database": "ena",   # Select database: ena or sra
    "iseq_protocol": "ftp",   # Select protocol: ftp or https
    "iseq_parallel": 8,       # Parallel download count
    "iseq_threads": 16        # Processing threads
}
result = run_analysis("SRR123456", config=config)
```
#### iseq command-line examples
Command-line options supported by iseq that can be mirrored in the configuration:
```bash
# Basic download
iseq -i SRR123456

# Batch download with gzip compression
iseq -i SRR_Acc_List.txt -g

# Download gzip files with Aspera acceleration
iseq -i PRJNA211801 -a -g

# Specify database and protocol
iseq -i SRR123456 -d ena -r ftp -g

# Parallel downloads
iseq -i accession_list.txt -p 10 -g
```
### Tool Parameters
```yaml
star_params:
  gencode_release: "v44"
  sjdb_overhang: 149
fastp_params:
  qualified_quality_phred: 20
  length_required: 50
featurecounts_params:
  simple: true
  by: "gene_id"
```
## 📁 Input Formats
### SRA Data
- Single SRR accession: `"SRR123456"`
- Multiple SRR accessions: `["SRR123456", "SRR789012"]`
- GEO accession: `"GSE123456"`
### FASTQ Files
- Paired files: `["sample_R1.fastq.gz", "sample_R2.fastq.gz"]`
- Multiple pairs: `["sample1_R1.fastq.gz", "sample1_R2.fastq.gz", "sample2_R1.fastq.gz", "sample2_R2.fastq.gz"]`
## 🔍 Sample ID Handling
### Automatic Extraction
The system automatically derives sample IDs from file names:
- `sample1_R1.fastq.gz` → `sample1`
- `Tumor_001_L001_R1_001.fastq.gz` → `Tumor_001`
- `Sample_A_R1.fastq.gz` → `Sample_A`
## 🔧 Tool Requirements
### Required software
- **sra-tools**: SRA data downloads
- **STAR**: RNA-seq alignment
- **fastp**: Quality control
- **featureCounts**: Gene quantification
- **samtools**: BAM file handling
- **entrez-direct**: Metadata retrieval
### Installation check
```python
from omicverse.bulk._alignment import check_all_tools

# Check tools
results = check_all_tools()
for tool, (available, path) in results.items():
    print(f"{tool}: {'✅' if available else '❌'}")
```
### Automatic installation
```python
# Automatically install missing tools (requires a conda environment)
results = check_all_tools(auto_install=True)
```
## 📊 Output Results
### Result structure
```python
result = {
    "type": "company",  # Input type
    "fastq_input": [(sample_id, fq1_path, fq2_path), ...],
    "fastq_qc": [(sample_id, clean_fq1, clean_fq2), ...],
    "bam": [(sample_id, bam_path, index_dir), ...],
    "counts": {
        "tables": [(sample_id, count_file), ...],
        "matrix": matrix_file_path
    }
}
```
### File organization
```
work/
├── meta/
│   ├── sample_metadata.csv           # Sample metadata
│   └── sample_id_mapping.json        # ID mapping
├── prefetch/
│   ├── SRRID                         # Per-run directory
│   └── SRRID.sra                     # Raw .sra file
├── fasterq/
│   ├── SRRID
│   ├── SRRID_R1.fastq.gz             # Raw FASTQ
│   └── SRRID_R2.fastq.gz             # Raw FASTQ
├── index/
│   └── _cache                        # Auto-detected index for downloaded data
├── fastp/                            # QC results
│   ├── Sample_001/
│   │   ├── Sample_001_clean_R1.fastq.gz
│   │   └── Sample_001_clean_R2.fastq.gz
│   └── fastp_reports/
├── star/                             # Alignment results
│   ├── Sample_001/
│   │   ├── Aligned.sortedByCoord.out.bam
│   │   └── Sample_001.sorted.bam
│   └── logs/
└── counts/                           # Quantification results
    ├── Sample_001/
    │   └── Sample_001.counts.txt
    └── matrix.auto.csv               # Merged matrix
```
## 🚨 Error Handling
### Common Errors
1. **File not found**
```
Error: File not found: /path/to/sample.fastq.gz
Solution: Verify the file path is correct
```
2. **Sample ID conflict**
```
Error: Duplicate sample IDs detected
Solution: Use a different sample_prefix or specify sample IDs manually
```
### Fault tolerance
- Continue processing remaining samples when some fail
- Configurable automatic retry mechanism
- Detailed error logs
## 🔬 Advanced Features
### Custom sample IDs
```python
# Manually specify the sample ID mapping
fastq_pairs = [
    ("Patient_001_Tumor", "/path/to/tumor_R1.fastq.gz", "/path/to/tumor_R2.fastq.gz"),
    ("Patient_001_Normal", "/path/to/normal_R1.fastq.gz", "/path/to/normal_R2.fastq.gz")
]
result = pipeline.run_from_fastq(fastq_pairs)
```
### Batch processing
```python
# Batch process multiple directories
data_dirs = ["/path/to/batch1", "/path/to/batch2", "/path/to/batch3"]
for i, data_dir in enumerate(data_dirs):
    result = pipeline.run_pipeline(
        data_dir,
        input_type="company",
        sample_prefix=f"Batch{i+1}"
    )
```
### Quality control parameters
```python
config = AlignmentConfig(
    fastp_params={
        "qualified_quality_phred": 30,  # Increase quality threshold
        "length_required": 75,          # Increase minimum length
        "detect_adapter_for_pe": True   # Auto-detect adapters
    }
)
```
## 📈 Performance Optimization
### Parallel processing
```python
config = AlignmentConfig(
    threads=16,              # Increase thread count
    max_workers=4,           # Number of parallel samples
    continue_on_error=True   # Continue after errors
)
```
### Memory management
```python
config = AlignmentConfig(
    memory="32G",       # Increase memory
    retry_attempts=3,   # Retry attempts
    retry_delay=10      # Retry delay
)
```
## 🔗 Related Files
- `alignment.py` - Core pipeline class
- `pipeline_config.py` - Configuration management
- `tools_check.py` - Tool checks
## 📞 Support
If issues arise:
1. Check that tools are installed correctly
2. Review log files for detailed error information
3. Ensure input data formats are correct
4. Refer to the sample code
## 📄 Changelog
### v2.0.0
- ✅ Added direct FASTQ file support
- ✅ Added automatic input type detection
- ✅ Enhanced tool checks and installation guidance
- ✅ Added unified configuration system
- ✅ Improved error handling and logging
- ✅ Added sample ID standardization mechanism
Pull Request Overview
This PR introduces a significant enhancement to OmicVerse's bulk RNA-seq analysis pipeline, adding support for multiple input data types and download methods. The changes transform the existing alignment workflow into a flexible, unified system that can handle SRA data, direct FASTQ files, and company-provided sequencing data.
Key Changes:
- Added modular pipeline architecture with step-based processing (prefetch, fasterq, fastp, STAR, featureCounts)
- Implemented dual download strategies: traditional prefetch and iseq tool with multi-database support
- Introduced unified API functions `geo_data_preprocess()` and `fq_data_preprocess()` for streamlined data processing
Reviewed Changes
Copilot reviewed 19 out of 19 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| `tools_check.py` | Tool availability checker with auto-installation support for bioinformatics software |
| `star_tools.py` | STAR alignment utilities including index management and GEO metadata fetching |
| `star_step.py` | STAR alignment step factory with batch processing and BAM normalization |
| `sra_tools.py` | Comprehensive SRA download/conversion with prefetch, fasterq-dump, and metadata retrieval |
| `sra_prefetch.py` | SRA prefetch step with progress monitoring and file validation |
| `sra_fasterq.py` | Fasterq-dump batch processing with retry logic and compression support |
| `qc_tools.py` | Fastp QC wrapper supporting both single samples and parallel batch processing |
| `qc_fastp.py` | Fastp step factory for quality control pipeline integration |
| `pipeline_config.py` | Configuration management with YAML/JSON support and validation |
| `iseq_handler.py` | Handler for company FASTQ data with automatic sample pairing and validation |
| `geo_meta_fetcher.py` | GEO metadata extraction from SOFT format with BioProject/SRA mapping |
| `entrez_direct.py` | EDirect wrapper for SRA metadata retrieval via NCBI E-utilities |
| `data_prepare_pipline.py` | Pipeline orchestration coordinating all processing steps |
| `count_tools.py` | FeatureCounts wrapper with GTF auto-detection and matrix merging |
| `count_step.py` | FeatureCounts step factory with GTF resolution logic |
| `alignment.py` | Main Alignment class providing unified API for all input types |
| `__init__.py` (_alignment) | Package initialization exporting main classes and convenience functions |
| `README_ENHANCED.md` | Comprehensive documentation with examples and configuration guides |
| `__init__.py` (bulk) | Bulk package initialization adding alignment exports |
```python
if total_size is None and current_size > 0 and elapsed_time > 3:
    # Seed a soft upper bound (1.3x the current value)
    total_size = int(current_size * 1.3)
    pbar.total = total_size
    pbar.refresh()
# If approaching the bound, raise it to avoid showing >100%
if total_size and current_size > total_size * 0.95:
    total_size = int(current_size * 1.1)
    pbar.total = total_size
    pbar.refresh()
```
**Copilot AI** reviewed on Oct 27, 2025:
Indentation error: lines 661-670 are incorrectly indented and appear to be inside the wrong scope. These lines should be inside a loop or conditional block that updates progress, but they're currently at the module level within a try-except block. The code references undefined variables current_size and elapsed_time that don't exist in this scope.
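Pulling the soft-cap arithmetic into a pure helper would make it easy to keep inside the polling loop and to unit-test; a sketch (not the PR's code) of the two rules quoted in the snippet:

```python
def soft_cap(total_size, current_size, elapsed_time):
    """Return an updated progress-bar total for a download of unknown size.

    - After 3 s with no known total, seed a soft cap at 1.3x the current size.
    - When within 95% of the cap, raise it to 1.1x current to avoid showing >100%.
    """
    if total_size is None and current_size > 0 and elapsed_time > 3:
        total_size = int(current_size * 1.3)
    if total_size and current_size > total_size * 0.95:
        total_size = int(current_size * 1.1)
    return total_size

# inside the polling loop (tqdm bar assumed):
# pbar.total = soft_cap(pbar.total, current_size, elapsed_time)
# pbar.refresh()
```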
I am wondering if the comment in the code could be English? This will really help both LLMs and humans understand the whole code.
Sure, I am working on it. |
# OmicVerse Enhanced Alignment Pipeline
## 🚀 New Feature Overview
The enhanced OmicVerse alignment pipeline now supports multiple input types, including:
1. **SRA data** - Original public repository data
2. **Direct FASTQ files** - User-provided FASTQ files
## ✨ Key Features
### 🔧 Unified Input Interface
- Automatically detect input type
- Unified processing across data sources
- Flexible sample ID assignment mechanism
### 🏢 FASTQ Input Support
- Automatically discover FASTQ files
- Intelligent sample ID extraction
- Automatic pairing for paired-end sequencing
- File integrity validation
### 🔍 Enhanced Tool Checks
- Automatically detect required software
- Provide installation guidance
- Support automatic installation
### ⚙️ Flexible Configuration System
- YAML/JSON configuration files
- Multiple preset configuration templates
- Runtime parameter adjustments
### 🚀 Multiple Download Modes
- **prefetch mode**: Use the NCBI SRA Toolkit (default)
- **iseq mode**: Use the iseq tool with multi-database support, Aspera acceleration, direct gzip downloads, and more
- **iseq enhancements**: Batch downloads accept list-style inputs
## 📋 Quick Start
### 1. Basic Usage
```python
from omicverse.bulk import geo_data_preprocess, fq_data_preprocess

# Accept a path to an SRA run-list text file
result = geo_data_preprocess(input_data="./srr_list.txt")

# Accept a single SRA ID or a list of SRA IDs
data_list = ["SRR123456", "SRR123457"]
result = geo_data_preprocess(input_data=data_list)

# Accept FASTQ files directly
fastq_files = [
    "./work/fasterq/SRR12544421/SRR12544419_1.fastq.gz",
    "./work/fasterq/SRR12544421/SRR12544419_2.fastq.gz",
    "./work/fasterq/SRR12544421/SRR12544421_1.fastq.gz",
    "./work/fasterq/SRR12544421/SRR12544421_2.fastq.gz",
]
result = fq_data_preprocess(input_data=fastq_files)
```
### 2. Advanced Usage
```python
from omicverse.bulk import geo_data_preprocess, fq_data_preprocess, AlignmentConfig

cfg = AlignmentConfig(
    work_root="work",            # Root directory for analysis outputs
    threads=64,                  # CPU resources
    genome="human",              # Source organism
    download_method="prefetch",  # Download method; set to "iseq" to use iseq
    memory="128G",               # Memory allocation
    fastp_enabled=True,          # QC option
    gzip_fastq=True,             # FASTQ compression option
)
result = geo_data_preprocess(input_data=data_list, config=cfg)
result = fq_data_preprocess(input_data=fastq_files, config=cfg)
```
### 3. Download Mode Selection
#### prefetch mode (default)
Use the NCBI SRA Toolkit for downloads; suitable for most scenarios.
#### iseq mode
Use the iseq tool for downloads with additional advanced capabilities:
```python
# Custom iseq configuration
config = {
    "work_root": "work",
    "download_method": "iseq",
    "iseq_gzip": True,        # Download FASTQ as gzip
    "iseq_aspera": True,      # Enable Aspera acceleration
    "iseq_database": "ena",   # Select database: ena or sra
    "iseq_protocol": "ftp",   # Select protocol: ftp or https
    "iseq_parallel": 8,       # Parallel download count
    "iseq_threads": 16        # Processing threads
}
result = run_analysis("SRR123456", config=config)
```
#### iseq command-line examples
Command-line options supported by iseq that you can mirror in configuration:
```bash
# Basic download
iseq -i SRR123456
# Batch download with gzip compression
iseq -i SRR_Acc_List.txt -g
# Download gzip files with Aspera acceleration
iseq -i PRJNA211801 -a -g
# Specify database and protocol
iseq -i SRR123456 -d ena -r ftp -g
# Parallel downloads
iseq -i accession_list.txt -p 10 -g
```
### Tool Parameters
```yaml
star_params:
  gencode_release: "v44"
  sjdb_overhang: 149
fastp_params:
  qualified_quality_phred: 20
  length_required: 50
featurecounts_params:
  simple: true
  by: "gene_id"
```
## 📁 Input Formats
### SRA Data
- Single SRR accession: `"SRR123456"`
- Multiple SRR accessions: `["SRR123456", "SRR789012"]`
- GEO accession: `"GSE123456"` (this downloads all runs in the series automatically; make sure there is enough disk space before fetching an entire dataset)
### FASTQ Files
- Paired files: `["sample_R1.fastq.gz", "sample_R2.fastq.gz"]`
- Multiple pairs: `["sample1_R1.fastq.gz", "sample1_R2.fastq.gz", "sample2_R1.fastq.gz", "sample2_R2.fastq.gz"]`
## 🔍 Sample ID Handling
### Automatic Extraction
The system automatically derives sample IDs from file names:
- `sample1_R1.fastq.gz` -> `sample1`
- `Tumor_001_L001_R1_001.fastq.gz` -> `Tumor_001`
- `Sample_A_R1.fastq.gz` -> `Sample_A`
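The extraction rules above, plus paired-end grouping, can be sketched as follows (the regex is an assumption matched to these examples, not the pipeline's actual implementation):

```python
import re
from collections import defaultdict

# lane (_L001), read (_R1/_R2), and chunk (_001) suffixes, with .fastq/.fq(.gz)
SUFFIX = re.compile(r"_(?:L\d{3}_)?R([12])(?:_\d{3})?\.(?:fastq|fq)(?:\.gz)?$")

def sample_id(filename: str):
    """Strip lane/read/chunk suffixes to recover the sample ID, or None."""
    return SUFFIX.sub("", filename) if SUFFIX.search(filename) else None

def pair_fastqs(files):
    """Group files into (sample_id, R1, R2) triples, sorted by sample ID."""
    groups = defaultdict(dict)
    for f in files:
        m = SUFFIX.search(f)
        if m:
            groups[SUFFIX.sub("", f)][m.group(1)] = f
    return [(sid, g.get("1"), g.get("2")) for sid, g in sorted(groups.items())]
```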
## 🔧 Tool Requirements
### Required software
- **sra-tools**: SRA data downloads
- **STAR**: RNA-seq alignment
- **fastp**: Quality control
- **featureCounts**: Gene quantification
- **samtools**: BAM file handling
- **entrez-direct**: Metadata retrieval
### Installation check
```python
from omicverse.bulk._alignment import check_all_tools
# Check tools
results = check_all_tools()
for tool, (available, path) in results.items():
    print(f"{tool}: {'✅' if available else '❌'}")
```
### Automatic installation
```python
# Automatically install missing tools (requires a conda environment)
results = check_all_tools(auto_install=True)
```
## 📊 Output Results
### Result structure
```python
result = {
    "type": "company",  # Input type
    "fastq_input": [(sample_id, fq1_path, fq2_path), ...],
    "fastq_qc": [(sample_id, clean_fq1, clean_fq2), ...],
    "bam": [(sample_id, bam_path, index_dir), ...],
    "counts": {
        "tables": [(sample_id, count_file), ...],
        "matrix": matrix_file_path
    }
}
```
### File organization
```
work/
├── meta/
│ ├── sample_metadata.csv # Sample metadata
│ └── sample_id_mapping.json # ID mapping
├── prefetch/
│   ├── SRRID                         # Per-run directory
│   └── SRRID.sra                     # Raw .sra file
├── fasterq/
│   ├── SRRID
│   ├── SRRID_R1.fastq.gz             # Raw FASTQ
│   └── SRRID_R2.fastq.gz             # Raw FASTQ
├── index/
│   └── _cache                        # Auto-detected index for downloaded data
├── fastp/ # QC results
│ ├── Sample_001/
│ │ ├── Sample_001_clean_R1.fastq.gz
│ │ └── Sample_001_clean_R2.fastq.gz
│ └── fastp_reports/
├── star/ # Alignment results
│ ├── Sample_001/
│ │ ├── Aligned.sortedByCoord.out.bam
│ │ └── Sample_001.sorted.bam
│ └── logs/
└── counts/ # Quantification results
├── Sample_001/
│ └── Sample_001.counts.txt
└── matrix.auto.csv # Combined matrix
```
## 🚨 Error Handling
### Common Errors
1. **File not found**
```
Error: File not found: /path/to/sample.fastq.gz
Solution: Verify the file path is correct
```
2. **Sample ID conflict**
```
Error: Duplicate sample IDs detected
Solution: Use a different sample_prefix or specify sample IDs manually
```
### Fault tolerance
- Continue processing remaining samples when some fail
- Configurable automatic retry mechanism
- Detailed error logs
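A retry wrapper in the spirit of the configurable mechanism above (a sketch; the `retry_attempts`/`retry_delay` semantics are assumed from the config names used later in this README):

```python
import time

def with_retries(fn, *args, retry_attempts=3, retry_delay=0.0, **kwargs):
    """Call fn, retrying on exception, up to retry_attempts total tries."""
    last = None
    for attempt in range(1, retry_attempts + 1):
        try:
            return fn(*args, **kwargs)
        except Exception as exc:  # broad on purpose: this is a sketch
            last = exc
            if attempt < retry_attempts:
                time.sleep(retry_delay)
    raise last
```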
## 🔬 Advanced Features
### Custom sample IDs
```python
# Manually specify sample ID mapping
fastq_pairs = [
    ("Patient_001_Tumor", "/path/to/tumor_R1.fastq.gz", "/path/to/tumor_R2.fastq.gz"),
    ("Patient_001_Normal", "/path/to/normal_R1.fastq.gz", "/path/to/normal_R2.fastq.gz")
]
result = pipeline.run_from_fastq(fastq_pairs)
```
### Batch processing
```python
# Batch process multiple directories
data_dirs = ["/path/to/batch1", "/path/to/batch2", "/path/to/batch3"]
for i, data_dir in enumerate(data_dirs):
    result = pipeline.run_pipeline(
        data_dir,
        input_type="company",
        sample_prefix=f"Batch{i+1}"
    )
```
### Quality control parameters
```python
config = AlignmentConfig(
    fastp_params={
        "qualified_quality_phred": 30,  # Increase quality threshold
        "length_required": 75,          # Increase minimum length
        "detect_adapter_for_pe": True   # Auto-detect adapters
    }
)
```
## 📈 Performance Optimization
### Parallel processing
```python
config = AlignmentConfig(
    threads=16,              # Increase thread count
    max_workers=4,           # Number of parallel samples
    continue_on_error=True   # Continue after errors
)
```
### Memory management
```python
config = AlignmentConfig(
    memory="32G",       # Increase memory
    retry_attempts=3,   # Retry attempts
    retry_delay=10      # Retry delay
)
```
## 🔗 Related Files
- `alignment.py` - Core pipeline class
- `pipeline_config.py` - Configuration management
- `tools_check.py` - Tool checks
## 📞 Support
If issues arise:
1. Check that tools are installed correctly
2. Review log files for detailed error information
3. Ensure input data formats are correct
4. Refer to the sample code
## 📄 Changelog
### v2.0.0
- ✅ Added direct FASTQ file support
- ✅ Added automatic input type detection
- ✅ Enhanced tool checks and installation guidance
- ✅ Added unified configuration system
- ✅ Improved error handling and logging
- ✅ Added sample ID standardization mechanism
Co-Authored-By: HendricksJudy <[email protected]>
Co-Authored-By: Zehua Zeng <[email protected]>
Co-Authored-By: Claude <[email protected]>
Thanks a lot