crazyhottommy/pyflow-ChIPseq

pyflow-ChIPseq: Modern ChIP-seq Analysis Pipeline

A production-ready Snakemake pipeline for comprehensive ChIP-seq data analysis, from raw FASTQ files to peaks, bigWigs, and super-enhancers.


🎯 Why pyflow-ChIPseq?

The Problem

ChIP-seq data analysis requires coordinating dozens of bioinformatics tools with precise parameter settings. Manual analysis is:

  • Error-prone: Easy to miss steps or use inconsistent parameters
  • Time-consuming: Repeated copy-pasting of commands for each sample
  • Not reproducible: Hard to track what was done and share methods
  • Difficult to scale: Processing 10+ samples manually is tedious

The Solution

pyflow-ChIPseq automates the entire ChIP-seq workflow with:

  • ✅ Reproducibility: Conda environments ensure identical software versions
  • ✅ Scalability: Process 1 or 100 samples with the same command
  • ✅ Automation: One command runs QC → alignment → peaks → visualization
  • ✅ Best practices: Built-in quality control, duplicate removal, and normalization
  • ✅ Cluster support: Scales from laptops to HPC clusters
  • ✅ Modern tools: Python 3, MACS3, updated samtools/deepTools

Who Is It For?

  • Bioinformaticians processing ChIP-seq data routinely
  • Researchers who want reproducible analysis without manual steps
  • Core facilities standardizing ChIP-seq analysis across projects
  • Labs sharing methods and ensuring consistency

🔬 What Does It Do?

Inputs

  • FASTQ files: Single-end or paired-end ChIP-seq reads
  • Reference genome: FASTA file (human, mouse, or custom)
  • Metadata file (meta.txt): Tab-delimited file describing your samples

Metadata File Format (meta.txt)

The metadata file is a tab-delimited file with 4 columns that describes your samples:

For Single-End Data:

sample_name	fastq_name	factor	reads
MOLM-14_DMSO1_5	SRR2518123.fastq.gz	BRD4	R1
MOLM-14_DMSO1_5	SRR2518124.fastq.gz	Input	R1
MOLM-14_DMSO2_6	SRR2518125.fastq.gz	BRD4	R1
MOLM-14_DMSO2_6	SRR2518126.fastq.gz	Input	R1

For Paired-End Data:

sample_name	fastq_name	factor	reads
Sample1	Sample1_Input_R1.fastq.gz	Input	R1
Sample1	Sample1_Input_R2.fastq.gz	Input	R2
Sample1	Sample1_H3K27ac_R1.fastq.gz	H3K27ac	R1
Sample1	Sample1_H3K27ac_R2.fastq.gz	H3K27ac	R2
Sample2	Sample2_Input_R1.fastq.gz	Input	R1
Sample2	Sample2_Input_R2.fastq.gz	Input	R2
Sample2	Sample2_H3K4me3_R1.fastq.gz	H3K4me3	R1
Sample2	Sample2_H3K4me3_R2.fastq.gz	H3K4me3	R2

Column Descriptions:

  • sample_name: Biological sample identifier (can have multiple factors per sample)
  • fastq_name: FASTQ file name (must match actual files in your directory)
  • factor: ChIP target (e.g., BRD4, H3K27ac, Input, IgG)
  • reads: R1 for forward reads, R2 for reverse reads (R2 rows appear only in paired-end data)

Important Notes:

  • Multiple factors can belong to the same sample_name
  • One factor must match the control setting in config.yaml (e.g., "Input")
  • Output files will be named: {sample_name}_{factor} (e.g., MOLM-14_DMSO1_5_BRD4)
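The rules above lend themselves to a quick pre-flight check before launching the pipeline. The sketch below is not part of the repository: `check_metadata` is a hypothetical helper, and it assumes that every `sample_name` needs its own control row, which the notes above suggest but do not state outright.

```python
import csv
import io
from collections import defaultdict

EXPECTED_COLUMNS = ["sample_name", "fastq_name", "factor", "reads"]


def check_metadata(handle, control="Input"):
    """Return a list of problems found in a meta.txt-style table."""
    reader = csv.DictReader(handle, delimiter="\t")
    if reader.fieldnames != EXPECTED_COLUMNS:
        return [f"unexpected header: {reader.fieldnames}"]

    problems = []
    factors = defaultdict(set)  # sample_name -> factors seen
    mates = defaultdict(set)    # (sample_name, factor) -> read labels seen
    for row in reader:
        if row["reads"] not in ("R1", "R2"):
            problems.append(f"{row['fastq_name']}: reads must be R1 or R2")
        factors[row["sample_name"]].add(row["factor"])
        mates[(row["sample_name"], row["factor"])].add(row["reads"])

    # Assumption: each sample carries its own control (e.g. Input) row.
    for sample, seen in sorted(factors.items()):
        if control not in seen:
            problems.append(f"{sample}: no '{control}' row")
    # An R2 file without its R1 mate cannot be aligned as a pair.
    for (sample, factor), labels in sorted(mates.items()):
        if labels == {"R2"}:
            problems.append(f"{sample}_{factor}: R2 without R1")
    return problems


good = (
    "sample_name\tfastq_name\tfactor\treads\n"
    "S1\tS1_BRD4.fastq.gz\tBRD4\tR1\n"
    "S1\tS1_Input.fastq.gz\tInput\tR1\n"
)
print(check_metadata(io.StringIO(good)))
```

An empty list means the table passed these checks; it does not confirm that the FASTQ files themselves exist on disk.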

Outputs

| Output | Description | Use Case |
|---|---|---|
| Peaks | Narrow & broad peaks (MACS3) | Identify binding sites, histone marks |
| BigWigs | RPKM-normalized & input-subtracted | Genome browser visualization (IGV, UCSC) |
| BAM files | Aligned, deduplicated, downsampled | Further custom analysis |
| QC Reports | MultiQC summary + phantompeakqual | Assess experiment quality |
| Super-enhancers | ROSE2 calls (optional) | Identify regulatory elements |
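For downstream custom analysis, the peak files can be read with a few lines of Python. This is a minimal sketch, assuming the standard narrowPeak layout (ten tab-separated columns, with column 9 holding -log10 of the q-value and column 10 the summit offset from the peak start). `filter_narrowpeak` and the example records are illustrative, not pipeline output.

```python
import math


def filter_narrowpeak(lines, max_q=0.05):
    """Yield peaks whose q-value is below max_q from narrowPeak-formatted lines."""
    min_neg_log10_q = -math.log10(max_q)
    for line in lines:
        # Skip blank lines and optional track/browser headers.
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        fields = line.rstrip("\n").split("\t")
        start = int(fields[1])
        neg_log10_q = float(fields[8])
        if neg_log10_q > min_neg_log10_q:
            yield {
                "chrom": fields[0],
                "start": start,
                "end": int(fields[2]),
                "name": fields[3],
                "q_value": 10 ** -neg_log10_q,
                "summit": start + int(fields[9]),  # absolute summit position
            }


# Two made-up records: the first passes q < 0.05, the second does not.
example = [
    "chr1\t1000\t1500\tpeak_1\t250\t.\t12.5\t30.1\t27.4\t210\n",
    "chr1\t9000\t9300\tpeak_2\t12\t.\t2.1\t1.4\t0.6\t95\n",
]
for peak in filter_narrowpeak(example):
    print(peak["name"], peak["chrom"], peak["summit"])
```

To use it on a real file, pass an open file handle in place of the `example` list.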

Key Features

Core Analysis

  • Quality control: FastQC + MultiQC comprehensive reports
  • Alignment: BWA mem (long reads) or BWA aln (short reads)
  • Duplicate handling: Removes PCR duplicates with samblaster
  • Normalization: Downsampling to target read depth (default 50M)
  • Peak calling: MACS3 narrow peaks (TFs) and broad peaks (histone marks)
  • Visualization: RPKM-normalized and input-subtracted bigWigs
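The arithmetic behind downsampling to a target depth is simple enough to sketch. The example below is an illustration rather than the pipeline's own rule: it assumes random subsampling by a fixed fraction and formats that fraction in the `SEED.FRACTION` form accepted by `samtools view -s`. The function names and the seed value are hypothetical.

```python
def downsample_fraction(total_reads, target_reads=50_000_000):
    """Fraction of reads to keep so that roughly target_reads remain."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    # Libraries already at or below the target are kept whole.
    return min(1.0, target_reads / total_reads)


def samtools_subsample_arg(total_reads, target_reads=50_000_000, seed=42):
    """Build a SEED.FRACTION string, or None when no downsampling is needed."""
    fraction = round(downsample_fraction(total_reads, target_reads), 4)
    if fraction >= 1.0:
        return None
    decimals = f"{fraction:.4f}"[1:]  # ".6250" for a fraction of 0.625
    return f"{seed}{decimals}"


print(downsample_fraction(80_000_000))
print(samtools_subsample_arg(80_000_000))
print(samtools_subsample_arg(30_000_000))
```

Keeping shallow libraries whole, instead of failing, is a design choice made here for illustration; check the Snakefile for how the pipeline treats samples below the target.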

Advanced Features

  • ChIP quality metrics: Phantompeakqualtools NSC/RSC scores
  • Super-enhancer calling: ROSE2 (Python 3 port)
  • Chromatin states: ChromHMM support (optional)
  • Flexible configuration: Extensive parameters in config.yaml
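For readers unfamiliar with the NSC and RSC scores mentioned above, both are ratios derived from the strand cross-correlation profile. The sketch below uses the usual definitions and the commonly quoted ENCODE guideline thresholds (NSC above 1.05, RSC above 0.8). The thresholds and function names are assumptions for illustration and do not come from this repository.

```python
def nsc_rsc(cc_fragment, cc_phantom, cc_min):
    """Compute (NSC, RSC) from three cross-correlation values.

    cc_fragment: correlation at the estimated fragment length
    cc_phantom:  correlation at the read length (the "phantom" peak)
    cc_min:      minimum correlation across all strand shifts
    """
    if cc_min <= 0 or cc_phantom == cc_min:
        raise ValueError("cross-correlation values do not allow a ratio")
    nsc = cc_fragment / cc_min
    rsc = (cc_fragment - cc_min) / (cc_phantom - cc_min)
    return nsc, rsc


def looks_enriched(nsc, rsc, min_nsc=1.05, min_rsc=0.8):
    """Apply guideline thresholds; treat the result as a hint, not a verdict."""
    return nsc > min_nsc and rsc > min_rsc


nsc, rsc = nsc_rsc(cc_fragment=0.30, cc_phantom=0.25, cc_min=0.20)
print(round(nsc, 2), round(rsc, 2), looks_enriched(nsc, rsc))
```

Broad histone marks often score lower than sharp transcription-factor peaks, so a low score is a prompt to inspect the sample rather than an automatic failure.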

Technical Excellence

  • Conda integration: Automatic dependency management
  • SLURM support: Built-in cluster execution profiles
  • Parallel processing: Efficiently uses multiple cores
  • Smart caching: Only reruns changed steps
  • Comprehensive logging: All commands and outputs tracked

📊 Pipeline Overview

[Figure: pipeline workflow diagram]

Processing Steps

  1. Input Processing

    • Merge technical replicates
    • Quality control with FastQC
  2. Alignment (split for debugging)

    • BWA alignment (mem or aln based on read length)
    • Duplicate removal with samblaster
    • BAM sorting and indexing
  3. Quality Assessment

    • Flagstat metrics
    • Phantompeakqualtools (NSC, RSC, fragment length)
    • MultiQC aggregation
  4. Normalization

    • Downsample to target read depth
    • Normalize across samples
  5. Peak Calling

    • MACS3 narrow peaks (q < 0.05)
    • MACS3 broad peaks (q < 0.1)
  6. Visualization

    • RPKM-normalized bigWigs
    • Input-subtracted bigWigs
  7. Advanced Analysis (optional)

    • ROSE2 super-enhancers
    • ChromHMM chromatin states
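The normalization in step 6 can be made concrete with a short calculation. This sketch uses the per-bin RPKM definition documented for deepTools (reads in the bin divided by mapped reads in millions times bin length in kilobases) and treats input subtraction as a plain per-bin difference. The function names are hypothetical, and the pipeline's exact deepTools options may differ.

```python
def rpkm(reads_in_bin, total_mapped_reads, bin_size_bp):
    """Reads per kilobase per million mapped reads for a single bin."""
    if total_mapped_reads <= 0 or bin_size_bp <= 0:
        raise ValueError("total_mapped_reads and bin_size_bp must be positive")
    return reads_in_bin / ((total_mapped_reads / 1e6) * (bin_size_bp / 1e3))


def input_subtracted(chip_counts, input_counts, chip_total, input_total, bin_size_bp):
    """Per-bin ChIP RPKM minus input RPKM (negative values are kept as-is)."""
    return [
        rpkm(c, chip_total, bin_size_bp) - rpkm(i, input_total, bin_size_bp)
        for c, i in zip(chip_counts, input_counts)
    ]


# 100 reads in a 50 bp bin of a 50M-read library
print(rpkm(100, 50_000_000, 50))
print(input_subtracted([100, 10], [20, 10], 50_000_000, 50_000_000, 50))
```

Because both libraries are scaled to reads per million before subtraction, the difference stays meaningful even when ChIP and input depths are not identical.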

🚀 How to Use

Quick Start (5 minutes)

See QUICKSTART.md for a complete working example with public data.

# 1. Clone the repository
git clone https://github.com/crazyhottommy/pyflow-ChIPseq.git
cd pyflow-ChIPseq
git checkout modernize-2025

# 2. Prepare your metadata file
# See QUICKSTART.md for format

# 3. Generate samples.json
python sample2json.py /path/to/fastq/ meta.txt

# 4. Configure pipeline
nano config.yaml  # Edit paths and parameters

# 5. Run!
snakemake --use-conda --cores 8
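Step 3 turns the metadata table into a JSON file that the workflow reads. The sketch below shows one plausible shape for that mapping (sample, then factor, then a list of FASTQ paths). It is an assumption for illustration only: the real `sample2json.py` may discover files differently or emit a different structure, and `build_samples` is a hypothetical name.

```python
import json
import os


def build_samples(rows, fastq_dir):
    """Group (sample_name, fastq_name, factor, reads) rows into a nested dict."""
    samples = {}
    # Sort so that R1 precedes R2 within each sample/factor pair.
    ordered = sorted(rows, key=lambda r: (r[0], r[2], r[3], r[1]))
    for sample_name, fastq_name, factor, _reads in ordered:
        path = os.path.join(fastq_dir, fastq_name)
        samples.setdefault(sample_name, {}).setdefault(factor, []).append(path)
    return samples


rows = [
    ("Sample1", "Sample1_H3K27ac_R2.fastq.gz", "H3K27ac", "R2"),
    ("Sample1", "Sample1_H3K27ac_R1.fastq.gz", "H3K27ac", "R1"),
    ("Sample1", "Sample1_Input_R1.fastq.gz", "Input", "R1"),
]
print(json.dumps(build_samples(rows, "/path/to/fastq"), indent=2))
```

Inspecting the generated `samples.json` against a structure like this is a quick way to catch a misnamed FASTQ file before alignment starts.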

Full Installation

See INSTALL.md for:

  • Installing Snakemake and dependencies
  • Setting up reference genomes
  • Configuring SLURM clusters
  • Troubleshooting common issues

Configuration

Key settings in config.yaml:

# Data type
from_fastq: True          # Start from FASTQ (not BAM)
paired_end: False         # Single-end or paired-end
long_reads: True          # >70bp uses bwa mem, <70bp uses bwa aln

# Reference genome
ref_fa: /path/to/genome.fa
macs_g: mm                # mm (mouse) or hs (human)

# Analysis parameters
control: 'Input'          # Control sample name in metadata
downsample: True          # Normalize read depth
target_reads: 50000000    # Target reads after downsampling

# Optional analyses
run_phantompeakqual: True # ChIP quality metrics
chromHMM: False           # Chromatin state modeling
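If the read length of a dataset is unknown, the `long_reads` setting can be chosen by sampling the first reads of a FASTQ file. The sketch below is a hypothetical helper, not part of the pipeline. It uses the 70 bp cut-off from the comment above and, as an assumption, treats reads of exactly 70 bp as short, since the comment does not say which side they fall on.

```python
import itertools


def mean_read_length(fastq_lines, max_reads=1000):
    """Average sequence length over the first max_reads FASTQ records."""
    # In a four-line FASTQ record the sequence is the second line.
    sequences = itertools.islice(fastq_lines, 1, max_reads * 4, 4)
    lengths = [len(seq.strip()) for seq in sequences]
    if not lengths:
        raise ValueError("no reads found")
    return sum(lengths) / len(lengths)


def suggest_long_reads(read_length, threshold=70):
    """True suggests long_reads: True (bwa mem); False suggests bwa aln."""
    return read_length > threshold


record = ["@read1\n", "ACGT" * 25 + "\n", "+\n", "I" * 100 + "\n"]
length = mean_read_length(record)
print(length, suggest_long_reads(length))
```

For a gzipped file, the same function can be fed from `gzip.open(path, "rt")`.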

Running on Different Systems

Local execution

snakemake --use-conda --cores 8

SLURM cluster

snakemake --profile profiles/slurm

Dry run (see what will be executed)

snakemake -n --use-conda

Output Directory Structure

pyflow-ChIPseq/
├── 00log/              # All log files
├── 01seq/              # Merged FASTQ files
├── 02fqc/              # FastQC reports
├── 03aln/              # Aligned BAM files
├── 04aln_downsample/   # Downsampled BAMs
├── 05phantompeakqual/  # ChIP quality metrics
├── 06bigwig_inputSubtract/  # Input-subtracted bigWigs
├── 07bigwig/           # RPKM-normalized bigWigs
├── 08peak_macs3/       # Narrow peaks
├── 09peak_macs3/       # Broad peaks
├── 10multiQC/          # Quality summary (START HERE!)
└── 11superEnhancer/    # Super-enhancer calls (optional)

Pro tip: Start by reviewing 10multiQC/multiQC_log.html for a comprehensive quality overview!
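After a run, a quick check that the expected directories exist can flag a step that silently produced nothing. This is a minimal sketch using the directory names listed above. `missing_outputs` is a hypothetical helper, and which directories count as optional is an assumption that depends on your config.yaml.

```python
import tempfile
from pathlib import Path

EXPECTED_DIRS = [
    "00log", "01seq", "02fqc", "03aln", "04aln_downsample",
    "05phantompeakqual", "06bigwig_inputSubtract", "07bigwig",
    "08peak_macs3", "09peak_macs3", "10multiQC", "11superEnhancer",
]


def missing_outputs(root, optional=("11superEnhancer",)):
    """List expected output directories that are absent under root."""
    root = Path(root)
    return [
        name for name in EXPECTED_DIRS
        if name not in optional and not (root / name).is_dir()
    ]


# Demonstration on a throwaway directory holding only two of the outputs.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("00log", "10multiQC"):
        (Path(tmp) / name).mkdir()
    print(missing_outputs(tmp))
```

Pointing `missing_outputs` at the real working directory returns an empty list once every non-optional step has written its folder.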


📚 Documentation

| Document | Description |
|---|---|
| QUICKSTART.md | 5-minute tutorial with example data |
| INSTALL.md | Complete installation guide |
| MODERNIZATION.md | What's new in 2025 version |
| CLAUDE.md | Repository architecture |
| README_original.md | Historical documentation |

📖 Citation

Pipeline Publication

If you use this pipeline, please cite:

Terranova, C., Tang, M., Orouji, E., Maitituoheti, M., Raman, A., Amin, S., et al.
An Integrated Platform for Genome-wide Mapping of Chromatin States Using
High-throughput ChIP-sequencing in Tumor Tissues.
J. Vis. Exp. (134), e56972, doi:10.3791/56972 (2018).

Key Tools

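This pipeline builds on open-source tools, including Snakemake, FastQC, MultiQC, BWA, samblaster, samtools, MACS3, deepTools, phantompeakqualtools, ROSE2, and ChromHMM.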


💡 Key Improvements (2025 Modernization)

This branch (modernize-2025) represents a complete overhaul:

Python 3 Migration

  • ✅ MACS3 instead of MACS1/MACS2
  • ✅ ROSE2 (Python 3 port) for super-enhancers
  • ✅ All scripts Python 3 compatible

Modern Tooling

  • ✅ Snakemake v7+ with profiles
  • ✅ Conda environments for reproducibility
  • ✅ Updated tools: samtools 1.19+, deepTools 3.5+

Better UX

  • ✅ Simplified configuration
  • ✅ Better error messages
  • ✅ Comprehensive documentation
  • ✅ Split alignment steps for easier debugging

Performance

  • ✅ Parallel execution optimized
  • ✅ Smart temp file cleanup
  • ✅ Configurable threading

See MODERNIZATION.md for complete details.


🛠️ Troubleshooting

Common Issues

| Problem | Solution |
|---|---|
| samples.json not found | Run python sample2json.py /path/to/fastq meta.txt |
| Reference genome not indexed | Run bwa index genome.fa |
| Memory errors | Increase mem_mb in Snakefile resources |
| Conda is slow | Use --conda-frontend mamba |
| ROSE2 fails | Check genome name is uppercase: HG38, MM10 |
See INSTALL.md for detailed troubleshooting.


🀝 Support


📋 Requirements

  • Operating System: Linux or macOS
  • Memory: 32 GB RAM minimum (for alignment)
  • Storage: ~50-100 GB per sample
  • CPU: Multi-core recommended (8+ cores)

Software (auto-installed via conda):

  • Python ≥ 3.8
  • Snakemake ≥ 7.0
  • BWA, samtools, MACS3, deepTools, etc.

🗺️ Roadmap

Future improvements:

  • Support for additional peak callers (HOMER, SICER)
  • Automatic peak quality filtering
  • Differential binding analysis (DiffBind)
  • Motif enrichment integration
  • Docker/Singularity containers
  • Cloud execution (AWS, Google Cloud)

📜 License

This project is open source. See repository for license details.


🙏 Acknowledgments

Originally developed by Ming Tang (@crazyhottommy) at MD Anderson Cancer Center.

Modernized in 2025 with contributions from the bioinformatics community.

Special thanks to the developers of Snakemake and all the tools integrated in this pipeline.


Ready to analyze your ChIP-seq data? → Start with QUICKSTART.md 🚀
