Update Alignment #466
# RNA-seq Bulk Alignment Pipeline: Class Integration and Batch Processing
## Summary
This PR consolidates the previously scattered scripts behind a single entry point, `omicverse.bulk.Alignment`, covering the full **SRA → FASTQ → QC (fastp) → STAR → featureCounts** chain with **concurrency**, **idempotent skipping**, and a **consistent output layout**. It adds:
- `Alignment.fetch_metadata()`: GEO/ENA metadata retrieval and RunInfo generation
- `Alignment.prefetch()`: concurrent multi-run SRA downloads (with progress, resume, and official validation)
- `Alignment.fasterq()`: parallel `fasterq-dump` (outputs and tmp isolated per SRR)
- `Alignment.fastp()`: batch QC with cache detection and report collection
- `Alignment.star_align()`: batch STAR (automatic index selection/caching, idempotent BAMs, SRR-named symlinks)
- `Alignment.featurecounts()`: batch counting with an **automatically merged matrix** (auto-inferred GTF; returns the matrix path)
> Standardized output naming:
> - The STAR directory keeps the official name `Aligned.sortedByCoord.out.bam` and **additionally exposes** `SRR.sorted.bam` (symlink/copy) to avoid column-name collisions when merging the matrix.
> - Per-sample featureCounts tables are unified as `<counts_root>/<SRR>/<SRR>.counts.txt`; the merged matrix is `<counts_root>/matrix.<by>.csv`.
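As a sketch of this naming convention (the helper name is illustrative, not part of the PR's API), the per-sample table and merged-matrix paths can be built like this:

```python
from pathlib import Path

def counts_paths(counts_root: str, srr: str, by: str = "auto"):
    """Illustrative helper: build the documented featureCounts output paths."""
    root = Path(counts_root)
    per_sample = root / srr / f"{srr}.counts.txt"   # <counts_root>/<SRR>/<SRR>.counts.txt
    matrix = root / f"matrix.{by}.csv"              # <counts_root>/matrix.<by>.csv
    return per_sample, matrix

table, matrix = counts_paths("work/counts", "SRR123456")
print(table.as_posix())   # work/counts/SRR123456/SRR123456.counts.txt
print(matrix.as_posix())  # work/counts/matrix.auto.csv
```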
---
## Scope
- Added/updated modules (illustrative; the PR diff is authoritative):
  - `bulk/alignment.py` (core class and adapters)
  - `bulk/sra_prefetch.py` (concurrent prefetch with a minimally invasive progress-bar retrofit)
  - `bulk/sra_fasterq.py` (batch fasterq; idempotency and retries)
  - `bulk/qc_fastp.py` (batch fastp and output detection)
  - `bulk/star_step.py` (batch adapter over the existing `star_tools`; SRR symlinks)
  - `bulk/count_step.py` / `bulk/count_tools.py` (batch featureCounts and matrix merging; column names unified to SRR)
  - `bulk/tools_check.py` (utilities such as `which_or_find` and `merged_env`)
  - `bulk/__init__.py` exports `Alignment` and `AlignmentConfig`
---
## Dependencies
### System & bioinformatics tools
- **sra-tools** (provides `prefetch`, `vdb-validate`, `fasterq-dump`)
- **samtools** (BAM indexing)
- **STAR**
- **subread** (provides `featureCounts`)
- **fastp**
- Recommended: `pigz` (faster gzip), `aria2` (optional download acceleration)
### Python packages
- Required: `pandas`, `tqdm`, `numpy`, `requests`, `lxml`
- Possibly needed (depending on the metadata-fetching implementation): `biopython`
**See the `environment.yml` shipped with this PR (bioconda preferred).**
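A minimal availability check for these binaries (a sketch; the PR's own `tools_check.py` utilities such as `which_or_find` may behave differently):

```python
import shutil

REQUIRED = ["prefetch", "vdb-validate", "fasterq-dump",
            "samtools", "STAR", "featureCounts", "fastp"]

def missing_tools(tools):
    """Return the subset of tool names not found on PATH."""
    return [t for t in tools if shutil.which(t) is None]

if __name__ == "__main__":
    absent = missing_tools(REQUIRED)
    if absent:
        print("Missing tools:", ", ".join(absent))
    else:
        print("All required tools found.")
```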
---
## Outputs Layout
Default outputs after a run (rooted at `work/`):
```
work/
├── prefetch/                              # .sra files from prefetch
│   └── SRRxxxxxx/SRRxxxxxx.sra
├── fasterq/
│   └── SRRxxxxxx/
│       ├── SRRxxxxxx_1.fastq.gz
│       └── SRRxxxxxx_2.fastq.gz
├── fastp/
│   └── SRRxxxxxx/
│       ├── SRRxxxxxx_clean_1.fastq.gz
│       ├── SRRxxxxxx_clean_2.fastq.gz
│       ├── SRRxxxxxx.fastp.json
│       └── SRRxxxxxx.fastp.html
├── star/
│   └── SRRxxxxxx/
│       ├── Aligned.sortedByCoord.out.bam      # official file name (kept)
│       ├── Aligned.sortedByCoord.out.bam.bai
│       ├── SRRxxxxxx.sorted.bam               # SRR-named (symlink/copy)
│       └── SRRxxxxxx.sorted.bam.bai
└── counts/
    ├── SRRxxxxxx/SRRxxxxxx.counts.txt
    └── matrix.auto.csv                        # merged matrix (rows = gene_id, columns = SRR)
```
---
## Configuration
Environment variables:
- `NCBI_SETTINGS` (optional): path to the SRA tools configuration
- `TMPDIR` (optional): temporary directory for large files
- `FC_GTF_HINT` (optional): a GTF path hint for when the GTF cannot be inferred from the STAR index
Key `AlignmentConfig` fields (all with defaults):
- `work_root`, `prefetch_root`, `fasterq_root`, `fastp_root`, `star_index_root`, `star_align_root`, `counts_root`
- `threads` (concurrency control; balance against per-task threads)
- `memory` (passed to fasterq as `--mem`, e.g. `"8G"`)
- `gzip_fastq` (whether fasterq output is compressed)
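The field list above can be captured as a dataclass; this is a sketch under the assumption that the defaults mirror the usage example below, not the PR's actual `AlignmentConfig` definition:

```python
from dataclasses import dataclass

@dataclass
class AlignmentConfigSketch:
    """Illustrative mirror of the documented AlignmentConfig fields."""
    work_root: str = "work"
    prefetch_root: str = "work/prefetch"
    fasterq_root: str = "work/fasterq"
    fastp_root: str = "work/fastp"
    star_index_root: str = "index"
    star_align_root: str = "work/star"
    counts_root: str = "work/counts"
    threads: int = 16        # overall concurrency; balance against per-task threads
    memory: str = "8G"       # forwarded to fasterq-dump as --mem
    gzip_fastq: bool = True  # compress fasterq output

cfg = AlignmentConfigSketch(threads=8)
```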
---
## End-to-End Usage
```python
from omicverse.bulk import Alignment, AlignmentConfig

cfg = AlignmentConfig(
    work_root="work",
    prefetch_root="work/prefetch",
    fasterq_root="work/fasterq",
    fastp_root="work/fastp",
    star_index_root="index",
    star_align_root="work/star",
    counts_root="work/counts",
    threads=16,
    memory="8G",
    gzip_fastq=True,
)

aln = Alignment(cfg)
meta = aln.fetch_metadata("GSE157103")
sra_paths = aln.prefetch(meta["srr_list"], max_concurrent=4)
fq_pairs = aln.fasterq(meta["srr_list"])
qc_results = aln.fastp(fq_pairs)
pairs_for_star = [(srr, c1, c2) for (srr, c1, c2, _, _) in qc_results]
bam_triples = aln.star_align(
    pairs_for_star,
    gencode_release="v44",
    sjdb_overhang=149,
    accession_for_species=None,
    max_workers=2,
)
fc_out = aln.featurecounts(
    bam_triples,
    simple=True,
    by="auto",
    threads=8,
)
print("merged matrix:", fc_out.get("matrix"))
```
---
## Repro Checklist
- [ ] **Re-running** the same batch of SRRs: every stage reports `[SKIP]` and no outputs are regenerated
- [ ] fasterq retry works (automatically falls back to the local `.sra` input when the network is unstable)
- [ ] STAR outputs include both the **official name** and the **SRR-named symlink**, pointing at the same data
- [ ] Merged featureCounts matrix columns are **SRR** accessions, with no duplicate-column conflicts
- [ ] `counts/matrix.auto.csv` has at least as many rows as the largest per-sample gene table, with non-empty `gene_id`
---
## Concurrency and Performance Tuning
- **On a machine with N cores**: `max_workers × per-sample threads ≤ N` (discount somewhat when hyper-threading is counted)
- `prefetch`: outer concurrency (e.g. `max_concurrent=4`); each download polls progress every 0.25 s
- `fasterq-dump`: suggested `--mem 8–16G` and `threads_per_job 12–24`; be conservative with the number of concurrent samples
- `STAR`: **memory-bound**; for large genomes use 8–16 threads per sample with 1–2 concurrent samples
- `featureCounts`: moderate `-T` (8–16); I/O is the main bottleneck
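The core rule `max_workers × per-sample threads ≤ N` can be encoded directly (a sketch; the name is illustrative):

```python
import os
from typing import Optional

def plan_workers(per_sample_threads: int, total_cores: Optional[int] = None) -> int:
    """How many samples can run concurrently without oversubscribing the machine."""
    n = total_cores or os.cpu_count() or 1
    return max(1, n // per_sample_threads)

print(plan_workers(8, total_cores=16))  # 2 concurrent STAR jobs at 8 threads each
```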
---
## Backward Compatibility
- The original scripts can still be **invoked standalone**; the class methods are thin adapters and do not change the underlying logic
- STAR outputs gain an SRR-named symlink, which does not affect existing consumers and helps keep matrix column names unique
---
## Test Plan
- [ ] Local end-to-end run on a small batch (2–3 SRRs)
- [ ] Medium batch (8–12 SRRs) to exercise the concurrency parameters
- [ ] Resume after interruption (kill and restart) with consistent outputs and logs
- [ ] Manually delete one stage's outputs and verify only that stage is redone (everything else `[SKIP]`)
- [ ] With only the official STAR BAM name present, `_normalize_bam` backfills the `SRR.sorted.bam` symlink
- [ ] With `FC_GTF_HINT` pointing at a GTF, the automatic inference is bypassed
---
## Next
- [ ] Add a **CLI** (`ov-bulk align ...`) wrapping the class methods
- [ ] Support **automatic STAR index download/build** (gencode/ensembl) with cache records
- [ ] Integrate **salmon**/**kallisto** as optional lightweight quantifiers
- [ ] Add **md5/sha256** checksums and an output manifest for auditing and reproducibility
- [ ] Minimal CI on GitHub Actions: lint + unit tests + small-dataset integration test
---
## Troubleshooting
- **MergeError: duplicate columns** → fixed in `count_tools.py` by renaming count columns to the **SRR** before merging
- **fasterq exit code 3 / missing output** → unstable network or S3 timeout; retries and fallback to the local `.sra` are in place
- **STAR out of memory** → lower `threads` or `max_workers`; adjust `--limitGenomeGenerateRAM` if necessary
- **GTF not found** → pass `gtf=` explicitly or set `FC_GTF_HINT`; or make sure a GTF exists under the STAR index's parent `_cache/`
- **Permission/symlink issues** → on file systems without symlink support, the code automatically falls back to copying
---
## Pre-submission Checklist
- [ ] Code passes `flake8`/`black` (or the project's existing conventions)
- [ ] No large files committed (`.sra`/`.bam`/`.fastq.gz`, etc.)
- [ ] Documentation and example paths match the default configuration
- [ ] Changes and migration notes are clearly recorded in `CHANGELOG.md` or this PR
Add Alignment Features
# OmicVerse Enhanced Alignment Pipeline
## 🚀 New Feature Overview
The enhanced OmicVerse alignment pipeline now supports multiple input types, including:
1. **SRA data** - Original public repository data
2. **Direct FASTQ files** - User-provided FASTQ files
## ✨ Key Features
### 🔧 Unified Input Interface
- Automatic input type detection
- Unified processing across data sources
- Flexible sample ID assignment mechanism
### 🏢 FASTQ Input Support
- Automatic FASTQ file discovery
- Intelligent sample ID extraction
- Automatic pairing for paired-end sequencing
- File integrity validation
### 🔍 Enhanced Tool Checks
- Automatic detection of required software
- Installation guidance
- Optional automatic installation
### ⚙️ Flexible Configuration System
- YAML/JSON configuration files
- Multiple preset configuration templates
- Runtime parameter adjustments
### 🚀 Multiple Download Modes
- **prefetch mode**: Use the NCBI SRA Toolkit (default)
- **iseq mode**: Use the iseq tool with multi-database support, Aspera acceleration, direct gzip downloads, and more
- **iseq enhancements**: Batch downloads accept list-style inputs
## 📋 Quick Start
### 1. Basic Usage
```python
from omicverse.bulk import geo_data_preprocess, fq_data_preprocess

# Accept a path to an SRA run-list text file
result = geo_data_preprocess(input_data="./srr_list.txt")

# Accept a single SRA ID or a list of SRA IDs
data_list = ["SRR123456", "SRR123457"]
result = geo_data_preprocess(input_data=data_list)

# Accept FASTQ files directly
fastq_files = [
    "./work/fasterq/SRR12544421/SRR12544419_1.fastq.gz",
    "./work/fasterq/SRR12544421/SRR12544419_2.fastq.gz",
    "./work/fasterq/SRR12544421/SRR12544421_1.fastq.gz",
    "./work/fasterq/SRR12544421/SRR12544421_2.fastq.gz",
]
result = fq_data_preprocess(input_data=fastq_files)
```
### 2. Advanced Usage
```python
from omicverse.bulk import geo_data_preprocess, fq_data_preprocess, AlignmentConfig

cfg = AlignmentConfig(
    work_root="work",            # Root directory for analysis outputs
    threads=64,                  # CPU resources
    genome="human",              # Source organism
    download_method="prefetch",  # Download method; set to "iseq" to use iseq
    memory="128G",               # Memory allocation
    fastp_enabled=True,          # QC option
    gzip_fastq=True,             # FASTQ compression option
)
result = geo_data_preprocess(input_data=data_list, config=cfg)
result = fq_data_preprocess(input_data=fastq_files, config=cfg)
```
### 3. Download Mode Selection
#### prefetch mode (default)
Use the NCBI SRA Toolkit for downloads; suitable for most scenarios.
#### iseq mode
Use the iseq tool for downloads with additional advanced capabilities:
```python
# Custom iseq configuration
config = {
    "work_root": "work",
    "download_method": "iseq",
    "iseq_gzip": True,        # Download FASTQ as gzip
    "iseq_aspera": True,      # Enable Aspera acceleration
    "iseq_database": "ena",   # Select database: ena or sra
    "iseq_protocol": "ftp",   # Select protocol: ftp or https
    "iseq_parallel": 8,       # Parallel download count
    "iseq_threads": 16        # Processing threads
}
result = run_analysis("SRR123456", config=config)
```
#### iseq command-line examples
Command-line options supported by iseq that can be mirrored in the configuration:
```bash
# Basic download
iseq -i SRR123456

# Batch download with gzip compression
iseq -i SRR_Acc_List.txt -g

# Download gzip files with Aspera acceleration
iseq -i PRJNA211801 -a -g

# Specify database and protocol
iseq -i SRR123456 -d ena -r ftp -g

# Parallel downloads
iseq -i accession_list.txt -p 10 -g
```
### Tool Parameters
```yaml
star_params:
  gencode_release: "v44"
  sjdb_overhang: 149
fastp_params:
  qualified_quality_phred: 20
  length_required: 50
featurecounts_params:
  simple: true
  by: "gene_id"
```
## 📁 Input Formats
### SRA Data
- Single SRR accession: `"SRR123456"`
- Multiple SRR accessions: `["SRR123456", "SRR789012"]`
- GEO accession: `"GSE123456"`
### FASTQ Files
- Paired files: `["sample_R1.fastq.gz", "sample_R2.fastq.gz"]`
- Multiple pairs: `["sample1_R1.fastq.gz", "sample1_R2.fastq.gz", "sample2_R1.fastq.gz", "sample2_R2.fastq.gz"]`
## 🔍 Sample ID Handling
### Automatic Extraction
The system automatically derives sample IDs from file names:
- `sample1_R1.fastq.gz` → `sample1`
- `Tumor_001_L001_R1_001.fastq.gz` → `Tumor_001`
- `Sample_A_R1.fastq.gz` → `Sample_A`
## 🔧 Tool Requirements
### Required software
- **sra-tools**: SRA data downloads
- **STAR**: RNA-seq alignment
- **fastp**: Quality control
- **featureCounts**: Gene quantification
- **samtools**: BAM file handling
- **entrez-direct**: Metadata retrieval
### Installation check
```python
from omicverse.bulk._alignment import check_all_tools

# Check tools
results = check_all_tools()
for tool, (available, path) in results.items():
    print(f"{tool}: {'✅' if available else '❌'}")
```
### Automatic installation
```python
# Automatically install missing tools (requires a conda environment)
results = check_all_tools(auto_install=True)
```
## 📊 Output Results
### Result structure
```python
result = {
    "type": "company",  # Input type
    "fastq_input": [(sample_id, fq1_path, fq2_path), ...],
    "fastq_qc": [(sample_id, clean_fq1, clean_fq2), ...],
    "bam": [(sample_id, bam_path, index_dir), ...],
    "counts": {
        "tables": [(sample_id, count_file), ...],
        "matrix": matrix_file_path
    }
}
```
### File organization
```
work/
├── meta/
│   ├── sample_metadata.csv           # Sample metadata
│   └── sample_id_mapping.json        # ID mapping
├── prefetch/
│   ├── SRRID                         # Per-run directory
│   └── SRRID.sra                     # Raw .sra file
├── fasterq/
│   ├── SRRID
│   ├── SRRID_R1.fastq.gz             # Raw FASTQ
│   └── SRRID_R2.fastq.gz             # Raw FASTQ
├── index/
│   └── _cache                        # Auto-detected index for downloaded data
├── fastp/                            # QC results
│   ├── Sample_001/
│   │   ├── Sample_001_clean_R1.fastq.gz
│   │   └── Sample_001_clean_R2.fastq.gz
│   └── fastp_reports/
├── star/                             # Alignment results
│   ├── Sample_001/
│   │   ├── Aligned.sortedByCoord.out.bam
│   │   └── Sample_001.sorted.bam
│   └── logs/
└── counts/                           # Quantification results
    ├── Sample_001/
    │   └── Sample_001.counts.txt
    └── matrix.auto.csv               # Merged matrix
```
## 🚨 Error Handling
### Common Errors
1. **File not found**
```
Error: File not found: /path/to/sample.fastq.gz
Solution: Verify the file path is correct
```
2. **Sample ID conflict**
```
Error: Duplicate sample IDs detected
Solution: Use a different sample_prefix or specify sample IDs manually
```
### Fault tolerance
- Continue processing remaining samples when some fail
- Configurable automatic retry mechanism
- Detailed error logs
## 🔬 Advanced Features
### Custom sample IDs
```python
# Manually specify the sample ID mapping
fastq_pairs = [
    ("Patient_001_Tumor", "/path/to/tumor_R1.fastq.gz", "/path/to/tumor_R2.fastq.gz"),
    ("Patient_001_Normal", "/path/to/normal_R1.fastq.gz", "/path/to/normal_R2.fastq.gz")
]
result = pipeline.run_from_fastq(fastq_pairs)
```
### Batch processing
```python
# Batch process multiple directories
data_dirs = ["/path/to/batch1", "/path/to/batch2", "/path/to/batch3"]
for i, data_dir in enumerate(data_dirs):
    result = pipeline.run_pipeline(
        data_dir,
        input_type="company",
        sample_prefix=f"Batch{i+1}"
    )
```
### Quality control parameters
```python
config = AlignmentConfig(
    fastp_params={
        "qualified_quality_phred": 30,  # Increase quality threshold
        "length_required": 75,          # Increase minimum length
        "detect_adapter_for_pe": True   # Auto-detect adapters
    }
)
```
## 📈 Performance Optimization
### Parallel processing
```python
config = AlignmentConfig(
    threads=16,              # Increase thread count
    max_workers=4,           # Number of parallel samples
    continue_on_error=True   # Continue after errors
)
```
### Memory management
```python
config = AlignmentConfig(
    memory="32G",       # Increase memory
    retry_attempts=3,   # Retry attempts
    retry_delay=10      # Retry delay
)
```
## 🔗 Related Files
- `alignment.py` - Core pipeline class
- `pipeline_config.py` - Configuration management
- `tools_check.py` - Tool checks
## 📞 Support
If issues arise:
1. Check that tools are installed correctly
2. Review log files for detailed error information
3. Ensure input data formats are correct
4. Refer to the sample code
## 📄 Changelog
### v2.0.0
- ✅ Added direct FASTQ file support
- ✅ Added automatic input type detection
- ✅ Enhanced tool checks and installation guidance
- ✅ Added unified configuration system
- ✅ Improved error handling and logging
- ✅ Added sample ID standardization mechanism
Pull Request Overview
This PR introduces a significant enhancement to OmicVerse's bulk RNA-seq analysis pipeline, adding support for multiple input data types and download methods. The changes transform the existing alignment workflow into a flexible, unified system that can handle SRA data, direct FASTQ files, and company-provided sequencing data.
Key Changes:
- Added modular pipeline architecture with step-based processing (prefetch, fasterq, fastp, STAR, featureCounts)
- Implemented dual download strategies: traditional prefetch and iseq tool with multi-database support
- Introduced unified API functions `geo_data_preprocess()` and `fq_data_preprocess()` for streamlined data processing
Reviewed Changes
Copilot reviewed 19 out of 19 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| `tools_check.py` | Tool availability checker with auto-installation support for bioinformatics software |
| `star_tools.py` | STAR alignment utilities including index management and GEO metadata fetching |
| `star_step.py` | STAR alignment step factory with batch processing and BAM normalization |
| `sra_tools.py` | Comprehensive SRA download/conversion with prefetch, fasterq-dump, and metadata retrieval |
| `sra_prefetch.py` | SRA prefetch step with progress monitoring and file validation |
| `sra_fasterq.py` | Fasterq-dump batch processing with retry logic and compression support |
| `qc_tools.py` | Fastp QC wrapper supporting both single samples and parallel batch processing |
| `qc_fastp.py` | Fastp step factory for quality control pipeline integration |
| `pipeline_config.py` | Configuration management with YAML/JSON support and validation |
| `iseq_handler.py` | Handler for company FASTQ data with automatic sample pairing and validation |
| `geo_meta_fetcher.py` | GEO metadata extraction from SOFT format with BioProject/SRA mapping |
| `entrez_direct.py` | EDirect wrapper for SRA metadata retrieval via NCBI E-utilities |
| `data_prepare_pipline.py` | Pipeline orchestration coordinating all processing steps |
| `count_tools.py` | FeatureCounts wrapper with GTF auto-detection and matrix merging |
| `count_step.py` | FeatureCounts step factory with GTF resolution logic |
| `alignment.py` | Main Alignment class providing unified API for all input types |
| `__init__.py` (_alignment) | Package initialization exporting main classes and convenience functions |
| `README_ENHANCED.md` | Comprehensive documentation with examples and configuration guides |
| `__init__.py` (bulk) | Bulk package initialization adding alignment exports |
```python
if total_size is None and current_size > 0 and elapsed_time > 3:
    # Seed a soft upper bound (1.3x the current value)
    total_size = int(current_size * 1.3)
    pbar.total = total_size
    pbar.refresh()
# If approaching the bound, raise it to avoid showing >100%
if total_size and current_size > total_size * 0.95:
    total_size = int(current_size * 1.1)
    pbar.total = total_size
    pbar.refresh()
```
**Copilot AI** reviewed on Oct 27, 2025:
Indentation error: lines 661-670 are incorrectly indented and appear to be inside the wrong scope. These lines should be inside a loop or conditional block that updates progress, but they're currently at the module level within a try-except block. The code references undefined variables current_size and elapsed_time that don't exist in this scope.
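Pulling the soft-cap arithmetic into a pure helper would make it easy to keep inside the polling loop and to unit-test; a sketch (not the PR's code) of the two rules quoted in the snippet:

```python
def soft_cap(total_size, current_size, elapsed_time):
    """Return an updated progress-bar total for a download of unknown size.

    - After 3 s with no known total, seed a soft cap at 1.3x the current size.
    - When within 95% of the cap, raise it to 1.1x current to avoid showing >100%.
    """
    if total_size is None and current_size > 0 and elapsed_time > 3:
        total_size = int(current_size * 1.3)
    if total_size and current_size > total_size * 0.95:
        total_size = int(current_size * 1.1)
    return total_size

# inside the polling loop (tqdm bar assumed):
# pbar.total = soft_cap(pbar.total, current_size, elapsed_time)
# pbar.refresh()
```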
I am wondering if the comment in the code could be English? This will really help both LLMs and humans understand the whole code.
Sure, I am working on it. |
# OmicVerse Enhanced Alignment Pipeline
## 🚀 New Feature Overview
The enhanced OmicVerse alignment pipeline now supports multiple input types, including:
1. **SRA data** - Original public repository data
2. **Direct FASTQ files** - User-provided FASTQ files
## ✨ Key Features
### 🔧 Unified Input Interface
- Automatically detect input type
- Unified processing across data sources
- Flexible sample ID assignment mechanism
### 🏢 FASTQ Input Support
- Automatically discover FASTQ files
- Intelligent sample ID extraction
- Automatic pairing for paired-end sequencing
- File integrity validation
### 🔍 Enhanced Tool Checks
- Automatically detect required software
- Provide installation guidance
- Support automatic installation
### ⚙️ Flexible Configuration System
- YAML/JSON configuration files
- Multiple preset configuration templates
- Runtime parameter adjustments
### 🚀 Multiple Download Modes
- **prefetch mode**: Use the NCBI SRA Toolkit (default)
- **iseq mode**: Use the iseq tool with multi-database support, Aspera acceleration, direct gzip downloads, and more
- **iseq enhancements**: Batch downloads accept list-style inputs
## 📋 Quick Start
### 1. Basic Usage
```python
from omicverse.bulk import geo_data_preprocess, fq_data_preprocess

# Accept a path to an SRA run-list text file
result = geo_data_preprocess(input_data="./srr_list.txt")

# Accept a single SRA ID or a list of SRA IDs
data_list = ["SRR123456", "SRR123457"]
result = geo_data_preprocess(input_data=data_list)

# Accept FASTQ files directly
fastq_files = [
    "./work/fasterq/SRR12544421/SRR12544419_1.fastq.gz",
    "./work/fasterq/SRR12544421/SRR12544419_2.fastq.gz",
    "./work/fasterq/SRR12544421/SRR12544421_1.fastq.gz",
    "./work/fasterq/SRR12544421/SRR12544421_2.fastq.gz",
]
result = fq_data_preprocess(input_data=fastq_files)
```
### 2. Advanced Usage
```python
from omicverse.bulk import geo_data_preprocess, fq_data_preprocess, AlignmentConfig

cfg = AlignmentConfig(
    work_root="work",            # Root directory for analysis outputs
    threads=64,                  # CPU resources
    genome="human",              # Source organism
    download_method="prefetch",  # Download method; set to "iseq" to use iseq
    memory="128G",               # Memory allocation
    fastp_enabled=True,          # QC option
    gzip_fastq=True,             # FASTQ compression option
)
result = geo_data_preprocess(input_data=data_list, config=cfg)
result = fq_data_preprocess(input_data=fastq_files, config=cfg)
```
### 3. Download Mode Selection
#### prefetch mode (default)
Use the NCBI SRA Toolkit for downloads; suitable for most scenarios.
#### iseq mode
Use the iseq tool for downloads with additional advanced capabilities:
```python
# Custom iseq configuration
config = {
    "work_root": "work",
    "download_method": "iseq",
    "iseq_gzip": True,        # Download FASTQ as gzip
    "iseq_aspera": True,      # Enable Aspera acceleration
    "iseq_database": "ena",   # Select database: ena or sra
    "iseq_protocol": "ftp",   # Select protocol: ftp or https
    "iseq_parallel": 8,       # Parallel download count
    "iseq_threads": 16        # Processing threads
}
result = run_analysis("SRR123456", config=config)
```
#### iseq command-line examples
Command-line options supported by iseq that you can mirror in configuration:
```bash
# Basic download
iseq -i SRR123456
# Batch download with gzip compression
iseq -i SRR_Acc_List.txt -g
# Download gzip files with Aspera acceleration
iseq -i PRJNA211801 -a -g
# Specify database and protocol
iseq -i SRR123456 -d ena -r ftp -g
# Parallel downloads
iseq -i accession_list.txt -p 10 -g
```
### Tool Parameters
```yaml
star_params:
  gencode_release: "v44"
  sjdb_overhang: 149
fastp_params:
  qualified_quality_phred: 20
  length_required: 50
featurecounts_params:
  simple: true
  by: "gene_id"
```
## 📁 Input Formats
### SRA Data
- Single SRR accession: `"SRR123456"`
- Multiple SRR accessions: `["SRR123456", "SRR789012"]`
- GEO accession: `"GSE123456"` (this downloads all runs in the series automatically; make sure there is enough disk space before fetching an entire dataset)
### FASTQ Files
- Paired files: `["sample_R1.fastq.gz", "sample_R2.fastq.gz"]`
- Multiple pairs: `["sample1_R1.fastq.gz", "sample1_R2.fastq.gz", "sample2_R1.fastq.gz", "sample2_R2.fastq.gz"]`
## 🔍 Sample ID Handling
### Automatic Extraction
The system automatically derives sample IDs from file names:
- `sample1_R1.fastq.gz` -> `sample1`
- `Tumor_001_L001_R1_001.fastq.gz` -> `Tumor_001`
- `Sample_A_R1.fastq.gz` -> `Sample_A`
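The extraction rules above, plus paired-end grouping, can be sketched as follows (the regex is an assumption matched to these examples, not the pipeline's actual implementation):

```python
import re
from collections import defaultdict

# lane (_L001), read (_R1/_R2), and chunk (_001) suffixes, with .fastq/.fq(.gz)
SUFFIX = re.compile(r"_(?:L\d{3}_)?R([12])(?:_\d{3})?\.(?:fastq|fq)(?:\.gz)?$")

def sample_id(filename: str):
    """Strip lane/read/chunk suffixes to recover the sample ID, or None."""
    return SUFFIX.sub("", filename) if SUFFIX.search(filename) else None

def pair_fastqs(files):
    """Group files into (sample_id, R1, R2) triples, sorted by sample ID."""
    groups = defaultdict(dict)
    for f in files:
        m = SUFFIX.search(f)
        if m:
            groups[SUFFIX.sub("", f)][m.group(1)] = f
    return [(sid, g.get("1"), g.get("2")) for sid, g in sorted(groups.items())]
```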
## 🔧 Tool Requirements
### Required software
- **sra-tools**: SRA data downloads
- **STAR**: RNA-seq alignment
- **fastp**: Quality control
- **featureCounts**: Gene quantification
- **samtools**: BAM file handling
- **entrez-direct**: Metadata retrieval
### Installation check
```python
from omicverse.bulk._alignment import check_all_tools
# Check tools
results = check_all_tools()
for tool, (available, path) in results.items():
    print(f"{tool}: {'✅' if available else '❌'}")
```
### Automatic installation
```python
# Automatically install missing tools (requires a conda environment)
results = check_all_tools(auto_install=True)
```
## 📊 Output Results
### Result structure
```python
result = {
    "type": "company",  # Input type
    "fastq_input": [(sample_id, fq1_path, fq2_path), ...],
    "fastq_qc": [(sample_id, clean_fq1, clean_fq2), ...],
    "bam": [(sample_id, bam_path, index_dir), ...],
    "counts": {
        "tables": [(sample_id, count_file), ...],
        "matrix": matrix_file_path
    }
}
```
### File organization
```
work/
├── meta/
│ ├── sample_metadata.csv # Sample metadata
│ └── sample_id_mapping.json # ID mapping
├── prefetch/
│   ├── SRRID                         # Per-run directory
│   └── SRRID.sra                     # Raw .sra file
├── fasterq/
│   ├── SRRID
│   ├── SRRID_R1.fastq.gz             # Raw FASTQ
│   └── SRRID_R2.fastq.gz             # Raw FASTQ
├── index/
│   └── _cache                        # Auto-detected index for downloaded data
├── fastp/ # QC results
│ ├── Sample_001/
│ │ ├── Sample_001_clean_R1.fastq.gz
│ │ └── Sample_001_clean_R2.fastq.gz
│ └── fastp_reports/
├── star/ # Alignment results
│ ├── Sample_001/
│ │ ├── Aligned.sortedByCoord.out.bam
│ │ └── Sample_001.sorted.bam
│ └── logs/
└── counts/ # Quantification results
├── Sample_001/
│ └── Sample_001.counts.txt
└── matrix.auto.csv # Combined matrix
```
## 🚨 Error Handling
### Common Errors
1. **File not found**
```
Error: File not found: /path/to/sample.fastq.gz
Solution: Verify the file path is correct
```
2. **Sample ID conflict**
```
Error: Duplicate sample IDs detected
Solution: Use a different sample_prefix or specify sample IDs manually
```
### Fault tolerance
- Continue processing remaining samples when some fail
- Configurable automatic retry mechanism
- Detailed error logs
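A retry wrapper in the spirit of the configurable mechanism above (a sketch; the `retry_attempts`/`retry_delay` semantics are assumed from the config names used later in this README):

```python
import time

def with_retries(fn, *args, retry_attempts=3, retry_delay=0.0, **kwargs):
    """Call fn, retrying on exception, up to retry_attempts total tries."""
    last = None
    for attempt in range(1, retry_attempts + 1):
        try:
            return fn(*args, **kwargs)
        except Exception as exc:  # broad on purpose: this is a sketch
            last = exc
            if attempt < retry_attempts:
                time.sleep(retry_delay)
    raise last
```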
## 🔬 Advanced Features
### Custom sample IDs
```python
# Manually specify sample ID mapping
fastq_pairs = [
    ("Patient_001_Tumor", "/path/to/tumor_R1.fastq.gz", "/path/to/tumor_R2.fastq.gz"),
    ("Patient_001_Normal", "/path/to/normal_R1.fastq.gz", "/path/to/normal_R2.fastq.gz")
]
result = pipeline.run_from_fastq(fastq_pairs)
```
### Batch processing
```python
# Batch process multiple directories
data_dirs = ["/path/to/batch1", "/path/to/batch2", "/path/to/batch3"]
for i, data_dir in enumerate(data_dirs):
    result = pipeline.run_pipeline(
        data_dir,
        input_type="company",
        sample_prefix=f"Batch{i+1}"
    )
```
### Quality control parameters
```python
config = AlignmentConfig(
    fastp_params={
        "qualified_quality_phred": 30,  # Increase quality threshold
        "length_required": 75,          # Increase minimum length
        "detect_adapter_for_pe": True   # Auto-detect adapters
    }
)
```
## 📈 Performance Optimization
### Parallel processing
```python
config = AlignmentConfig(
    threads=16,              # Increase thread count
    max_workers=4,           # Number of parallel samples
    continue_on_error=True   # Continue after errors
)
```
### Memory management
```python
config = AlignmentConfig(
    memory="32G",       # Increase memory
    retry_attempts=3,   # Retry attempts
    retry_delay=10      # Retry delay
)
```
## 🔗 Related Files
- `alignment.py` - Core pipeline class
- `pipeline_config.py` - Configuration management
- `tools_check.py` - Tool checks
## 📞 Support
If issues arise:
1. Check that tools are installed correctly
2. Review log files for detailed error information
3. Ensure input data formats are correct
4. Refer to the sample code
## 📄 Changelog
### v2.0.0
- ✅ Added direct FASTQ file support
- ✅ Added automatic input type detection
- ✅ Enhanced tool checks and installation guidance
- ✅ Added unified configuration system
- ✅ Improved error handling and logging
- ✅ Added sample ID standardization mechanism
Co-Authored-By: HendricksJudy <[email protected]>
Co-Authored-By: Zehua Zeng <[email protected]>
Co-Authored-By: Claude <[email protected]>
Thanks a lot