pyflow-ChIPseq: Modern ChIP-seq Analysis Pipeline

A production-ready Snakemake pipeline for comprehensive ChIP-seq data analysis, from raw FASTQ files to peaks, bigWigs, and super-enhancers.

📖 Table of Contents

Why pyflow-ChIPseq?
What Does It Do?
Pipeline Overview
How to Use
Documentation
Citation
Support

🎯 Why pyflow-ChIPseq?

The Problem

ChIP-seq data analysis requires coordinating dozens of bioinformatics tools with precise parameter settings. Manual analysis is:

Error-prone: Easy to miss steps or use inconsistent parameters
Time-consuming: Repeated copy-pasting of commands for each sample
Not reproducible: Hard to track what was done and share methods
Difficult to scale: Processing 10+ samples manually is tedious

The Solution

pyflow-ChIPseq automates the entire ChIP-seq workflow with:

✅ Reproducibility: Conda environments ensure identical software versions
✅ Scalability: Process 1 or 100 samples with the same command
✅ Automation: One command runs QC → alignment → peaks → visualization
✅ Best practices: Built-in quality control, duplicate removal, and normalization
✅ Cluster support: Seamlessly scales from laptops to HPC clusters
✅ Modern tools: Python 3, MACS3, updated samtools/deepTools

Who Is It For?

Bioinformaticians processing ChIP-seq data routinely
Researchers who want reproducible analysis without manual steps
Core facilities standardizing ChIP-seq analysis across projects
Labs sharing methods and ensuring consistency

🔬 What Does It Do?

Inputs

FASTQ files: Single-end or paired-end ChIP-seq reads
Reference genome: FASTA file (human, mouse, or custom)
Metadata file (meta.txt): Tab-delimited file describing your samples

Metadata File Format (`meta.txt`)

The metadata file is a tab-delimited file with 4 columns that describes your samples:

For Single-End Data:

sample_name	fastq_name	factor	reads
MOLM-14_DMSO1_5	SRR2518123.fastq.gz	BRD4	R1
MOLM-14_DMSO1_5	SRR2518124.fastq.gz	Input	R1
MOLM-14_DMSO2_6	SRR2518125.fastq.gz	BRD4	R1
MOLM-14_DMSO2_6	SRR2518126.fastq.gz	Input	R1

For Paired-End Data:

sample_name	fastq_name	factor	reads
Sample1	Sample1_Input_R1.fastq.gz	Input	R1
Sample1	Sample1_Input_R2.fastq.gz	Input	R2
Sample1	Sample1_H3K27ac_R1.fastq.gz	H3K27ac	R1
Sample1	Sample1_H3K27ac_R2.fastq.gz	H3K27ac	R2
Sample2	Sample2_Input_R1.fastq.gz	Input	R1
Sample2	Sample2_Input_R2.fastq.gz	Input	R2
Sample2	Sample2_H3K4me3_R1.fastq.gz	H3K4me3	R1
Sample2	Sample2_H3K4me3_R2.fastq.gz	H3K4me3	R2

Column Descriptions:

sample_name: Biological sample identifier (can have multiple factors per sample)
fastq_name: FASTQ file name (must match actual files in your directory)
factor: ChIP target (e.g., BRD4, H3K27ac, Input, IgG)
reads: R1 for forward reads (use R1 for every single-end file); R2 for reverse reads (paired-end only)

Important Notes:

Multiple factors can belong to the same sample_name
One factor must match the control setting in config.yaml (e.g., "Input")
Output files will be named: {sample_name}_{factor} (e.g., MOLM-14_DMSO1_5_BRD4)
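
The rules above can be checked before the pipeline is run. The following is a minimal sketch and not part of the repository: `check_meta` and `REQUIRED_COLUMNS` are hypothetical names, and the script assumes the header row uses exactly the four column names shown above.

```python
import csv
import io

# Column names taken from the meta.txt examples above.
REQUIRED_COLUMNS = ["sample_name", "fastq_name", "factor", "reads"]


def check_meta(text, control="Input"):
    """Validate meta.txt content and return the expected output names."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    if reader.fieldnames != REQUIRED_COLUMNS:
        raise ValueError(
            f"expected columns {REQUIRED_COLUMNS}, got {reader.fieldnames}"
        )
    rows = list(reader)
    for row in rows:
        if row["reads"] not in ("R1", "R2"):
            raise ValueError(f"reads must be R1 or R2: {row}")
    # One factor must match the control setting in config.yaml.
    if control not in {row["factor"] for row in rows}:
        raise ValueError(f"no row has factor == {control!r}")
    # Output files are named {sample_name}_{factor}.
    return sorted({f"{row['sample_name']}_{row['factor']}" for row in rows})


example = (
    "sample_name\tfastq_name\tfactor\treads\n"
    "MOLM-14_DMSO1_5\tSRR2518123.fastq.gz\tBRD4\tR1\n"
    "MOLM-14_DMSO1_5\tSRR2518124.fastq.gz\tInput\tR1\n"
)
print(check_meta(example))
```

To check a real file, pass its content, for example `check_meta(open("meta.txt").read())`.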

Outputs

| Output | Description | Use Case |
|--------|-------------|----------|
| Peaks | Narrow & broad peaks (MACS3) | Identify binding sites, histone marks |
| BigWigs | RPKM-normalized & input-subtracted | Genome browser visualization (IGV, UCSC) |
| BAM files | Aligned, deduplicated, downsampled | Further custom analysis |
| QC Reports | MultiQC summary + phantompeakqual | Assess experiment quality |
| Super-enhancers | ROSE2 calls (optional) | Identify regulatory elements |
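
Peak files can be filtered further downstream, for instance with a stricter q-value cutoff than the one used at call time. The sketch below is illustrative only: `filter_narrowpeak` is a hypothetical helper, and it relies on the standard narrowPeak layout, in which column 9 stores -log10(q-value).

```python
import math


def filter_narrowpeak(lines, q_cutoff=0.05):
    """Keep narrowPeak records whose q-value is below q_cutoff."""
    min_neg_log10_q = -math.log10(q_cutoff)
    kept = []
    for line in lines:
        # Skip blank lines and optional header lines.
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        fields = line.rstrip("\n").split("\t")
        neg_log10_q = float(fields[8])  # column 9: -log10(q-value)
        if neg_log10_q > min_neg_log10_q:
            kept.append((fields[0], int(fields[1]), int(fields[2]), neg_log10_q))
    return kept


# Two made-up records, used only to exercise the function.
demo = [
    "chr1\t100\t400\tpeak_1\t250\t.\t8.1\t12.3\t9.7\t150\n",
    "chr1\t900\t1200\tpeak_2\t20\t.\t1.9\t1.6\t0.8\t120\n",
]
print(filter_narrowpeak(demo))
```

With the default cutoff, only the first demo record passes, because 0.8 is below -log10(0.05) ≈ 1.3.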

Key Features

Core Analysis

Quality control: FastQC + MultiQC comprehensive reports
Alignment: BWA mem (long reads) or BWA aln (short reads)
Duplicate handling: Removes PCR duplicates with samblaster
Normalization: Downsampling to target read depth (default 50M)
Peak calling: MACS3 narrow peaks (TFs) and broad peaks (histone marks)
Visualization: RPKM-normalized and input-subtracted bigWigs
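
For readers unfamiliar with RPKM, the arithmetic behind the normalized tracks can be sketched as follows. This is a worked illustration, not the pipeline's code: `rpkm` and `input_subtracted` are hypothetical names, and treating input subtraction as a plain difference of normalized signals is an assumption.

```python
def rpkm(reads_in_bin, total_mapped_reads, bin_length_bp):
    """Reads per kilobase per million mapped reads for a single bin."""
    if total_mapped_reads <= 0 or bin_length_bp <= 0:
        raise ValueError("total_mapped_reads and bin_length_bp must be positive")
    millions = total_mapped_reads / 1e6
    kilobases = bin_length_bp / 1e3
    return reads_in_bin / (millions * kilobases)


def input_subtracted(chip_rpkm, input_rpkm):
    """Assumed definition: ChIP signal minus input signal for the same bin."""
    return chip_rpkm - input_rpkm


# Hypothetical numbers: 25 ChIP reads and 5 input reads in a 50 bp bin,
# each library with 50 million mapped reads.
chip = rpkm(25, 50_000_000, 50)
control = rpkm(5, 50_000_000, 50)
print(chip, control, input_subtracted(chip, control))
```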

Advanced Features

ChIP quality metrics: Phantompeakqualtools NSC/RSC scores
Super-enhancer calling: ROSE2 (Python 3 port)
Chromatin states: ChromHMM support (optional)
Flexible configuration: Extensive parameters in config.yaml
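
The NSC and RSC scores are ratios of strand cross-correlation values. The sketch below shows the usual definitions; `nsc_rsc` is a hypothetical helper, the input values are made up, and the thresholds mentioned afterwards are commonly cited ENCODE guidelines rather than settings of this pipeline.

```python
def nsc_rsc(cc_fragment, cc_phantom, cc_min):
    """Compute (NSC, RSC) from three cross-correlation values.

    cc_fragment: cross-correlation at the estimated fragment length
    cc_phantom:  cross-correlation at the read length ("phantom" peak)
    cc_min:      minimum cross-correlation over all tested shifts
    """
    if cc_min <= 0 or cc_phantom == cc_min:
        raise ValueError("cc_min must be positive and differ from cc_phantom")
    nsc = cc_fragment / cc_min
    rsc = (cc_fragment - cc_min) / (cc_phantom - cc_min)
    return nsc, rsc


nsc, rsc = nsc_rsc(cc_fragment=0.30, cc_phantom=0.25, cc_min=0.20)
print(f"NSC={nsc:.2f} RSC={rsc:.2f}")
```

As a rule of thumb, NSC below about 1.05 or RSC below about 0.8 suggests weak enrichment, although broad histone marks often score lower than transcription factors.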

Technical Excellence

Conda integration: Automatic dependency management
SLURM support: Built-in cluster execution profiles
Parallel processing: Efficiently uses multiple cores
Smart caching: Reruns only the steps whose outputs are missing or out of date
Comprehensive logging: All commands and outputs tracked

📊 Pipeline Overview

Processing Steps

1. Input Processing
   - Merge technical replicates
   - Quality control with FastQC
2. Alignment (split for debugging)
   - BWA alignment (mem or aln based on read length)
   - Duplicate removal with samblaster
   - BAM sorting and indexing
3. Quality Assessment
   - Flagstat metrics
   - Phantompeakqualtools (NSC, RSC, fragment length)
   - MultiQC aggregation
4. Normalization
   - Downsample to target read depth
   - Normalize across samples
5. Peak Calling
   - MACS3 narrow peaks (q < 0.05)
   - MACS3 broad peaks (q < 0.1)
6. Visualization
   - RPKM-normalized bigWigs
   - Input-subtracted bigWigs
7. Advanced Analysis (optional)
   - ROSE2 super-enhancers
   - ChromHMM chromatin states
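
The downsampling step in the list above can be pictured as keeping a random fraction of reads so that every library ends up near the same depth. The following sketch shows only that arithmetic. It assumes fraction-based subsampling, which may differ from what the Snakefile does, and `downsample_fraction` is a hypothetical name.

```python
def downsample_fraction(total_reads, target_reads=50_000_000):
    """Fraction of reads to keep so the library has about target_reads reads.

    Libraries already at or below the target are kept whole (fraction 1.0).
    """
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return min(1.0, target_reads / total_reads)


# Hypothetical read counts for two libraries.
for name, n_reads in [("deep_library", 80_000_000), ("shallow_library", 30_000_000)]:
    print(name, downsample_fraction(n_reads))
```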

🚀 How to Use

Quick Start (5 minutes)

See QUICKSTART.md for a complete working example with public data.

# 1. Clone the repository
git clone https://github.com/crazyhottommy/pyflow-ChIPseq.git
cd pyflow-ChIPseq
git checkout modernize-2025

# 2. Prepare your metadata file
# See QUICKSTART.md for format

# 3. Generate samples.json
python sample2json.py /path/to/fastq/ meta.txt

# 4. Configure pipeline
nano config.yaml  # Edit paths and parameters

# 5. Run!
snakemake --use-conda --cores 8

Full Installation

See INSTALL.md for:

Installing Snakemake and dependencies
Setting up reference genomes
Configuring SLURM clusters
Troubleshooting common issues

Configuration

Key settings in config.yaml:

# Data type
from_fastq: True          # Start from FASTQ (not BAM)
paired_end: False         # Single-end or paired-end
long_reads: True          # True for reads longer than ~70 bp (bwa mem); False for shorter reads (bwa aln)

# Reference genome
ref_fa: /path/to/genome.fa
macs_g: mm                # mm (mouse) or hs (human)

# Analysis parameters
control: 'Input'          # Control sample name in metadata
downsample: True          # Normalize read depth
target_reads: 50000000    # Target reads after downsampling

# Optional analyses
run_phantompeakqual: True # ChIP quality metrics
chromHMM: False           # Chromatin state modeling
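
If the read length of a dataset is unknown, the value for `long_reads` can be estimated from the FASTQ file itself. The sketch below is a hypothetical helper, not part of the repository. It takes the median length of the first reads and compares it with the 70 bp threshold noted in the config comments.

```python
import gzip
import itertools
import statistics


def median_read_length(lines, max_reads=10_000):
    """Median sequence length over the first max_reads FASTQ records.

    `lines` is any iterable of FASTQ lines; each record spans 4 lines and
    the sequence is the 2nd line of the record.
    """
    sequences = itertools.islice(lines, 1, 4 * max_reads, 4)
    lengths = [len(seq.strip()) for seq in sequences]
    if not lengths:
        raise ValueError("no reads found")
    return statistics.median(lengths)


def suggest_long_reads(fastq_gz, threshold=70):
    """Suggest True for long_reads when the median read exceeds threshold bp."""
    with gzip.open(fastq_gz, "rt") as handle:
        return median_read_length(handle) > threshold
```

For example, `suggest_long_reads("SRR2518123.fastq.gz")` would return the suggested setting for that file.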

Running on Different Systems

Local execution

snakemake --use-conda --cores 8

SLURM cluster

snakemake --profile profiles/slurm

Dry run (see what will be executed)

snakemake -n --use-conda

Output Directory Structure

pyflow-ChIPseq/
├── 00log/              # All log files
├── 01seq/              # Merged FASTQ files
├── 02fqc/              # FastQC reports
├── 03aln/              # Aligned BAM files
├── 04aln_downsample/   # Downsampled BAMs
├── 05phantompeakqual/  # ChIP quality metrics
├── 06bigwig_inputSubtract/  # Input-subtracted bigWigs
├── 07bigwig/           # RPKM-normalized bigWigs
├── 08peak_macs3/       # Narrow peaks
├── 09peak_macs3/       # Broad peaks
├── 10multiQC/          # Quality summary (START HERE!)
└── 11superEnhancer/    # Super-enhancer calls (optional)

Pro tip: Start by reviewing 10multiQC/multiQC_log.html for a comprehensive quality overview!
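
After a run, a quick check that the expected directories exist can catch a partially completed workflow. This is a minimal sketch built on the directory names listed above. `missing_outputs` is a hypothetical helper, and which directories count as optional is an assumption based on the config flags.

```python
from pathlib import Path

# Directory names copied from the output listing above.
REQUIRED_DIRS = [
    "00log", "01seq", "02fqc", "03aln", "04aln_downsample",
    "06bigwig_inputSubtract", "07bigwig", "08peak_macs3",
    "09peak_macs3", "10multiQC",
]
# Assumed optional: produced only when the matching analysis is enabled.
OPTIONAL_DIRS = ["05phantompeakqual", "11superEnhancer"]


def missing_outputs(workdir="."):
    """Return the required output directories that are absent from workdir."""
    root = Path(workdir)
    return [name for name in REQUIRED_DIRS if not (root / name).is_dir()]


if __name__ == "__main__":
    missing = missing_outputs(".")
    print("all required outputs present" if not missing else f"missing: {missing}")
```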

📚 Documentation

| Document | Description |
|----------|-------------|
| QUICKSTART.md | 5-minute tutorial with example data |
| INSTALL.md | Complete installation guide |
| MODERNIZATION.md | What's new in 2025 version |
| CLAUDE.md | Repository architecture |
| README_original.md | Historical documentation |

📖 Citation

Pipeline Publication

If you use this pipeline, please cite:

Terranova, C., Tang, M., Orouji, E., Maitituoheti, M., Raman, A., Amin, S., et al.
An Integrated Platform for Genome-wide Mapping of Chromatin States Using
High-throughput ChIP-sequencing in Tumor Tissues.
J. Vis. Exp. (134), e56972, doi:10.3791/56972 (2018).

Key Tools

This pipeline builds on excellent open-source tools:

Snakemake: Mölder et al., 2021
MACS3: Zhang et al., 2008
deepTools: Ramírez et al., 2016
BWA: Li & Durbin, 2009
samtools: Danecek et al., 2021

💡 Key Improvements (2025 Modernization)

This branch (modernize-2025) represents a complete overhaul:

Python 3 Migration

✅ MACS3 instead of MACS1/MACS2
✅ ROSE2 (Python 3 port) for super-enhancers
✅ All scripts Python 3 compatible

Modern Tooling

✅ Snakemake v7+ with profiles
✅ Conda environments for reproducibility
✅ Updated tools: samtools 1.19+, deepTools 3.5+

Better UX

✅ Simplified configuration
✅ Better error messages
✅ Comprehensive documentation
✅ Split alignment steps for easier debugging

Performance

✅ Parallel execution optimized
✅ Smart temp file cleanup
✅ Configurable threading

See MODERNIZATION.md for complete details.

🛠️ Troubleshooting

Common Issues

| Problem | Solution |
|---------|----------|
| `samples.json not found` | Run `python sample2json.py /path/to/fastq meta.txt` |
| Reference genome not indexed | Run `bwa index genome.fa` |
| Memory errors | Increase `mem_mb` in Snakefile resources |
| Conda is slow | Use `--conda-frontend mamba` |
| ROSE2 fails | Check genome name is uppercase: HG38, MM10 |

See INSTALL.md for detailed troubleshooting.
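
To confirm whether the reference genome has been indexed, one can look for the files that `bwa index` writes next to the FASTA. The sketch below is a hypothetical helper, not part of the repository. It assumes the classic BWA index suffixes (not those of bwa-mem2).

```python
from pathlib import Path

# Files created by `bwa index genome.fa` alongside the FASTA.
BWA_INDEX_SUFFIXES = (".amb", ".ann", ".bwt", ".pac", ".sa")


def missing_bwa_index(ref_fa):
    """Return the BWA index files that do not exist for the given FASTA path."""
    return [
        ref_fa + suffix
        for suffix in BWA_INDEX_SUFFIXES
        if not Path(ref_fa + suffix).is_file()
    ]


if __name__ == "__main__":
    # Placeholder path; use the ref_fa value from config.yaml.
    missing = missing_bwa_index("/path/to/genome.fa")
    if missing:
        print("index incomplete, run: bwa index /path/to/genome.fa")
```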

🤝 Support

Issues: GitHub Issues
Snakemake docs: https://snakemake.readthedocs.io/
Email: Contact the author through GitHub

📋 Requirements

Operating System: Linux or macOS
Memory: 32 GB RAM minimum (for alignment)
Storage: ~50-100 GB per sample
CPU: Multi-core recommended (8+ cores)

Software (auto-installed via conda):

Python ≥ 3.8
Snakemake ≥ 7.0
BWA, samtools, MACS3, deepTools, etc.

🗺️ Roadmap

Future improvements:

Support for additional peak callers (HOMER, SICER)
Automatic peak quality filtering
Differential binding analysis (DiffBind)
Motif enrichment integration
Docker/Singularity containers
Cloud execution (AWS, Google Cloud)

📜 License

This project is open source. See repository for license details.

🙏 Acknowledgments

Originally developed by Ming Tang (@crazyhottommy) at MD Anderson Cancer Center.

Modernized in 2025 with contributions from the bioinformatics community.

Special thanks to the developers of Snakemake and all the tools integrated in this pipeline.

Ready to analyze your ChIP-seq data? → Start with QUICKSTART.md 🚀
