A production-ready Snakemake pipeline for comprehensive ChIP-seq data analysis, from raw FASTQ files to peaks, bigWigs, and super-enhancers.
ChIP-seq data analysis requires coordinating dozens of bioinformatics tools with precise parameter settings. Manual analysis is:
- Error-prone: Easy to miss steps or use inconsistent parameters
- Time-consuming: Repeated copy-pasting of commands for each sample
- Not reproducible: Hard to track what was done and share methods
- Difficult to scale: Processing 10+ samples manually is tedious
pyflow-ChIPseq automates the entire ChIP-seq workflow with:
- Reproducibility: Conda environments ensure identical software versions
- Scalability: Process 1 or 100 samples with the same command
- Automation: One command runs QC → alignment → peaks → visualization
- Best practices: Built-in quality control, duplicate removal, and normalization
- Cluster support: Seamlessly scales from laptops to HPC clusters
- Modern tools: Python 3, MACS3, updated samtools/deepTools
Intended users:
- Bioinformaticians processing ChIP-seq data routinely
- Researchers who want reproducible analysis without manual steps
- Core facilities standardizing ChIP-seq analysis across projects
- Labs sharing methods and ensuring consistency
Required inputs:
- FASTQ files: Single-end or paired-end ChIP-seq reads
- Reference genome: FASTA file (human, mouse, or custom)
- Metadata file (meta.txt): Tab-delimited file describing your samples
The metadata file is a tab-delimited file with 4 columns that describes your samples:
For Single-End Data:
sample_name fastq_name factor reads
MOLM-14_DMSO1_5 SRR2518123.fastq.gz BRD4 R1
MOLM-14_DMSO1_5 SRR2518124.fastq.gz Input R1
MOLM-14_DMSO2_6 SRR2518125.fastq.gz BRD4 R1
MOLM-14_DMSO2_6 SRR2518126.fastq.gz Input R1
For Paired-End Data:
sample_name fastq_name factor reads
Sample1 Sample1_Input_R1.fastq.gz Input R1
Sample1 Sample1_Input_R2.fastq.gz Input R2
Sample1 Sample1_H3K27ac_R1.fastq.gz H3K27ac R1
Sample1 Sample1_H3K27ac_R2.fastq.gz H3K27ac R2
Sample2 Sample2_Input_R1.fastq.gz Input R1
Sample2 Sample2_Input_R2.fastq.gz Input R2
Sample2 Sample2_H3K4me3_R1.fastq.gz H3K4me3 R1
Sample2 Sample2_H3K4me3_R2.fastq.gz H3K4me3 R2
Column Descriptions:
- sample_name: Biological sample identifier (can have multiple factors per sample)
- fastq_name: FASTQ file name (must match actual files in your directory)
- factor: ChIP target (e.g., BRD4, H3K27ac, Input, IgG)
- reads: R1 for forward reads, R2 for reverse reads (paired-end only)
Important Notes:
- Multiple factors can belong to the same sample_name
- One factor must match the control setting in config.yaml (e.g., "Input")
- Output files will be named: {sample_name}_{factor} (e.g., MOLM-14_DMSO1_5_BRD4)
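The rules above are easy to check before launching a run. The following is a minimal sketch, not part of the pipeline: the function name `check_metadata` is hypothetical, and it assumes the header row uses exactly the four column names shown above and that the control factor is "Input". It reports rows whose reads value is not R1 or R2, flags a missing control factor, and lists the `{sample_name}_{factor}` output names.

```python
import csv
import io

REQUIRED_COLUMNS = ["sample_name", "fastq_name", "factor", "reads"]


def check_metadata(text, control="Input"):
    """Return (output_names, problems) for the contents of a meta.txt file.

    Hypothetical helper for illustration; the pipeline may validate differently.
    """
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    problems = []
    if reader.fieldnames != REQUIRED_COLUMNS:
        problems.append(f"unexpected header: {reader.fieldnames}")
        return [], problems

    names = []
    factors = set()
    for line_no, row in enumerate(reader, start=2):
        if row["reads"] not in ("R1", "R2"):
            problems.append(f"line {line_no}: reads must be R1 or R2")
        factors.add(row["factor"])
        # Output naming convention described above: {sample_name}_{factor}
        name = f"{row['sample_name']}_{row['factor']}"
        if name not in names:
            names.append(name)

    if control not in factors:
        problems.append(f"no row has factor {control!r}")
    return names, problems


example = (
    "sample_name\tfastq_name\tfactor\treads\n"
    "MOLM-14_DMSO1_5\tSRR2518123.fastq.gz\tBRD4\tR1\n"
    "MOLM-14_DMSO1_5\tSRR2518124.fastq.gz\tInput\tR1\n"
)
names, problems = check_metadata(example)
print(names)     # ['MOLM-14_DMSO1_5_BRD4', 'MOLM-14_DMSO1_5_Input']
print(problems)  # []
```

To check a real file, pass `open("meta.txt").read()` as `text` and set `control` to the value used in config.yaml.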
| Output | Description | Use Case |
|---|---|---|
| Peaks | Narrow & broad peaks (MACS3) | Identify binding sites, histone marks |
| BigWigs | RPKM-normalized & input-subtracted | Genome browser visualization (IGV, UCSC) |
| BAM files | Aligned, deduplicated, downsampled | Further custom analysis |
| QC Reports | MultiQC summary + phantompeakqual | Assess experiment quality |
| Super-enhancers | ROSE2 calls (optional) | Identify regulatory elements |
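For readers unfamiliar with the RPKM normalization mentioned in the table, here is a small worked example. It assumes the standard per-bin definition (reads in the bin divided by the bin length in kilobases and the total mapped reads in millions); the function name and the numbers are illustrative only, and the exact bin size and options the pipeline uses are not shown here.

```python
def rpkm(reads_in_bin, bin_length_bp, total_mapped_reads):
    """Reads per kilobase of bin per million mapped reads."""
    if bin_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("bin length and total mapped reads must be positive")
    kilobases = bin_length_bp / 1_000
    millions_mapped = total_mapped_reads / 1_000_000
    return reads_in_bin / (kilobases * millions_mapped)


# Hypothetical numbers: 25 reads in a 50 bp bin, 50 million mapped reads.
print(round(rpkm(25, 50, 50_000_000), 3))
```

Because the value is scaled by sequencing depth, tracks from samples with different read counts can be compared on the same axis in a genome browser.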
- Quality control: FastQC + MultiQC comprehensive reports
- Alignment: BWA mem (long reads) or BWA aln (short reads)
- Duplicate handling: Removes PCR duplicates with samblaster
- Normalization: Downsampling to target read depth (default 50M)
- Peak calling: MACS3 narrow peaks (TFs) and broad peaks (histone marks)
- Visualization: RPKM-normalized and input-subtracted bigWigs
- ChIP quality metrics: Phantompeakqualtools NSC/RSC scores
- Super-enhancer calling: ROSE2 (Python 3 port)
- Chromatin states: ChromHMM support (optional)
- Flexible configuration: Extensive parameters in config.yaml
- Conda integration: Automatic dependency management
- SLURM support: Built-in cluster execution profiles
- Parallel processing: Efficiently uses multiple cores
- Smart caching: Only reruns changed steps
- Comprehensive logging: All commands and outputs tracked
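The NSC and RSC scores listed above are both derived from the strand cross-correlation profile. The sketch below shows the usual definitions, assuming the profile is available as a mapping from strand shift (bp) to correlation. The function name and the example values are hypothetical, and phantompeakqualtools computes these scores itself, so this is only meant to make the numbers in the QC report easier to read.

```python
def nsc_rsc(cross_correlation, fragment_shift, read_length_shift):
    """Compute (NSC, RSC) from a strand cross-correlation profile.

    cross_correlation maps strand shift in bp to a correlation value.
    fragment_shift is the shift at the fragment-length peak.
    read_length_shift is the shift at the read-length ("phantom") peak.
    """
    background = min(cross_correlation.values())
    fragment_peak = cross_correlation[fragment_shift]
    phantom_peak = cross_correlation[read_length_shift]
    if background <= 0 or phantom_peak <= background:
        raise ValueError("profile does not support an NSC/RSC calculation")
    # NSC: fragment-length peak relative to the background minimum.
    nsc = fragment_peak / background
    # RSC: fragment-length peak relative to the phantom peak, both
    # measured above the background minimum.
    rsc = (fragment_peak - background) / (phantom_peak - background)
    return nsc, rsc


# Hypothetical profile with a phantom peak at 50 bp and a fragment peak at 200 bp.
profile = {0: 0.20, 50: 0.24, 200: 0.30, 500: 0.20}
nsc, rsc = nsc_rsc(profile, fragment_shift=200, read_length_shift=50)
print(round(nsc, 2), round(rsc, 2))
```

Higher values of both scores indicate stronger enrichment relative to background.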
- Input Processing
  - Merge technical replicates
  - Quality control with FastQC
- Alignment (split for debugging)
  - BWA alignment (mem or aln based on read length)
  - Duplicate removal with samblaster
  - BAM sorting and indexing
- Quality Assessment
  - Flagstat metrics
  - Phantompeakqualtools (NSC, RSC, fragment length)
  - MultiQC aggregation
- Normalization
  - Downsample to target read depth
  - Normalize across samples
- Peak Calling
  - MACS3 narrow peaks (q < 0.05)
  - MACS3 broad peaks (q < 0.1)
- Visualization
  - RPKM-normalized bigWigs
  - Input-subtracted bigWigs
- Advanced Analysis (optional)
  - ROSE2 super-enhancers
  - ChromHMM chromatin states
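The normalization step in the workflow above comes down to simple arithmetic: keep a fraction of reads so that each sample ends up near the target depth. This is a sketch of that calculation only. The function name is hypothetical, the 50 million default mirrors the documented target, and the assumption that samples already below the target are left untouched may not match how the pipeline handles them.

```python
def downsample_fraction(total_reads, target_reads=50_000_000):
    """Fraction of reads to keep so a sample lands at the target depth.

    Assumption: samples at or below the target are kept whole (fraction 1.0).
    """
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return min(1.0, target_reads / total_reads)


print(downsample_fraction(80_000_000))  # 0.625
print(downsample_fraction(30_000_000))  # 1.0
```

Bringing ChIP and control samples to a comparable depth keeps peak calls and signal tracks from being driven by sequencing depth alone.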
See QUICKSTART.md for a complete working example with public data.
# 1. Clone the repository
git clone https://github.com/crazyhottommy/pyflow-ChIPseq.git
cd pyflow-ChIPseq
git checkout modernize-2025
# 2. Prepare your metadata file
# See QUICKSTART.md for format
# 3. Generate samples.json
python sample2json.py /path/to/fastq/ meta.txt
# 4. Configure pipeline
nano config.yaml # Edit paths and parameters
# 5. Run!
snakemake --use-conda --cores 8
See INSTALL.md for:
- Installing Snakemake and dependencies
- Setting up reference genomes
- Configuring SLURM clusters
- Troubleshooting common issues
Key settings in config.yaml:
# Data type
from_fastq: True # Start from FASTQ (not BAM)
paired_end: False # Single-end or paired-end
long_reads: True # >70bp uses bwa mem, <70bp uses bwa aln
# Reference genome
ref_fa: /path/to/genome.fa
macs_g: mm # mm (mouse) or hs (human)
# Analysis parameters
control: 'Input' # Control sample name in metadata
downsample: True # Normalize read depth
target_reads: 50000000 # Target reads after downsampling
# Optional analyses
run_phantompeakqual: True # ChIP quality metrics
chromHMM: False # Chromatin state modeling
# Run locally
snakemake --use-conda --cores 8
# Run on a SLURM cluster
snakemake --profile profiles/slurm
# Dry run (show what would be executed)
snakemake -n --use-conda
pyflow-ChIPseq/
├── 00log/ # All log files
├── 01seq/ # Merged FASTQ files
├── 02fqc/ # FastQC reports
├── 03aln/ # Aligned BAM files
├── 04aln_downsample/ # Downsampled BAMs
├── 05phantompeakqual/ # ChIP quality metrics
├── 06bigwig_inputSubtract/ # Input-subtracted bigWigs
├── 07bigwig/ # RPKM-normalized bigWigs
├── 08peak_macs3/ # Narrow peaks
├── 09peak_macs3/ # Broad peaks
├── 10multiQC/ # Quality summary (START HERE!)
└── 11superEnhancer/ # Super-enhancer calls (optional)
Pro tip: Start by reviewing 10multiQC/multiQC_log.html for a comprehensive quality overview!
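The long_reads setting in config.yaml depends on read length (roughly 70 bp is the switch between bwa mem and bwa aln). If you are unsure which applies to your data, a quick look at the first few thousand reads is usually enough. The sketch below is a hypothetical helper, not part of the pipeline; the function names, the 10,000-read sample size, and treating exactly 70 bp as short are all assumptions.

```python
import gzip
import statistics


def median_read_length(fastq_lines, max_reads=10_000):
    """Median sequence length over the first max_reads FASTQ records."""
    lengths = []
    for index, line in enumerate(fastq_lines):
        # In a four-line FASTQ record, the sequence is the second line.
        if index % 4 == 1:
            lengths.append(len(line.strip()))
            if len(lengths) >= max_reads:
                break
    if not lengths:
        raise ValueError("no reads found")
    return statistics.median(lengths)


def suggest_long_reads(fastq_gz_path, threshold=70):
    """Suggest a value for long_reads based on a gzipped FASTQ file."""
    with gzip.open(fastq_gz_path, "rt") as handle:
        return median_read_length(handle) > threshold
```

For example, `suggest_long_reads("SRR2518123.fastq.gz")` returning False would point to `long_reads: False`.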
| Document | Description |
|---|---|
| QUICKSTART.md | 5-minute tutorial with example data |
| INSTALL.md | Complete installation guide |
| MODERNIZATION.md | What's new in 2025 version |
| CLAUDE.md | Repository architecture |
| README_original.md | Historical documentation |
If you use this pipeline, please cite:
Terranova, C., Tang, M., Orouji, E., Maitituoheti, M., Raman, A., Amin, S., et al.
An Integrated Platform for Genome-wide Mapping of Chromatin States Using
High-throughput ChIP-sequencing in Tumor Tissues.
J. Vis. Exp. (134), e56972, doi:10.3791/56972 (2018).
This pipeline builds on excellent open-source tools:
- Snakemake: Mölder et al., 2021
- MACS3: Zhang et al., 2008
- deepTools: Ramírez et al., 2016
- BWA: Li & Durbin, 2009
- samtools: Danecek et al., 2021
This branch (modernize-2025) represents a complete overhaul:
- MACS3 instead of MACS1/MACS2
- ROSE2 (Python 3 port) for super-enhancers
- All scripts Python 3 compatible
- Snakemake v7+ with profiles
- Conda environments for reproducibility
- Updated tools: samtools 1.19+, deepTools 3.5+
- Simplified configuration
- Better error messages
- Comprehensive documentation
- Split alignment steps for easier debugging
- Parallel execution optimized
- Smart temp file cleanup
- Configurable threading
See MODERNIZATION.md for complete details.
| Problem | Solution |
|---|---|
| samples.json not found | Run python sample2json.py /path/to/fastq meta.txt |
| Reference genome not indexed | Run bwa index genome.fa |
| Memory errors | Increase mem_mb in Snakefile resources |
| Conda is slow | Use --conda-frontend mamba |
| ROSE2 fails | Check genome name is uppercase: HG38, MM10 |
See INSTALL.md for detailed troubleshooting.
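The "Reference genome not indexed" problem can be checked before starting a run. Running bwa index on a FASTA file writes five companion files next to it (.amb, .ann, .bwt, .pac, .sa), so a missing index shows up as missing files. The helper below is a hypothetical sketch, not part of the pipeline, and it assumes the classic bwa index layout rather than bwa-mem2.

```python
from pathlib import Path

# Files written next to the FASTA by `bwa index genome.fa`.
BWA_INDEX_SUFFIXES = (".amb", ".ann", ".bwt", ".pac", ".sa")


def missing_bwa_index_files(ref_fa):
    """Return the expected BWA index files that do not exist for ref_fa."""
    return [
        ref_fa + suffix
        for suffix in BWA_INDEX_SUFFIXES
        if not Path(ref_fa + suffix).is_file()
    ]


missing = missing_bwa_index_files("/path/to/genome.fa")
if missing:
    print("Index incomplete, run: bwa index /path/to/genome.fa")
```

Pass the same path you set as ref_fa in config.yaml.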
- Issues: GitHub Issues
- Snakemake docs: https://snakemake.readthedocs.io/
- Email: Contact the author through GitHub
- Operating System: Linux or macOS
- Memory: 32 GB RAM minimum (for alignment)
- Storage: ~50-100 GB per sample
- CPU: Multi-core recommended (8+ cores)
Software (auto-installed via conda):
- Python ≥ 3.8
- Snakemake ≥ 7.0
- BWA, samtools, MACS3, deepTools, etc.
Future improvements:
- Support for additional peak callers (HOMER, SICER)
- Automatic peak quality filtering
- Differential binding analysis (DiffBind)
- Motif enrichment integration
- Docker/Singularity containers
- Cloud execution (AWS, Google Cloud)
This project is open source. See repository for license details.
Originally developed by Ming Tang (@crazyhottommy) at MD Anderson Cancer Center.
Modernized in 2025 with contributions from the bioinformatics community.
Special thanks to the developers of Snakemake and all the tools integrated in this pipeline.
Ready to analyze your ChIP-seq data? → Start with QUICKSTART.md
