This file provides guidance to WARP (warp.dev) when working with code in this repository.
This repository contains code for "Machine learning reveals 3D regulatory mechanisms for height-associated haplotypes", a research project that applies machine learning to analyze 3D genome structure in relation to genetic variants associated with human height.
The project is organized into three main components:
- LDSC Enrichment/
  - Purpose: Performs linkage disequilibrium score regression (LDSC) enrichment analysis
  - Key Components:
    - make_ldsc_data/: Scripts for generating LDSC-compatible annotations and GWAS data
    - make_annot.R: Creates annotation files from variant data
    - munge_gwas.R: Processes GWAS summary statistics for multiple populations
    - gwas_raw/: Contains GWAS summary statistics for different populations (AFR, AMR, EAS, EUR, SAS)
    - akita_variants/: Variant data from Akita model predictions
- Visualize HiC/
  - Purpose: Creates publication-ready visualizations of Hi-C contact maps
  - Key Components:
    - visualization_hic.ipynb: Jupyter notebook with specialized functions for Hi-C visualization
      - Includes diamond-rotated contact maps with genomic annotations
      - Functions for loading predictions, adding genomic tracks, and comparing contact maps
- Figures/: Contains publication figures and supplementary materials
- ASHG Abstract/ and ASHG Presentation/: Conference materials
- README.md: Basic project description
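
The diamond-rotated contact maps mentioned for the visualization notebook can be illustrated with a small coordinate transform. This is a minimal sketch and not the notebook's actual code: the helper name `diamond_mesh` is hypothetical, and it assumes a square map whose bins are indexed from 0. Each bin corner (i, j) is mapped to a horizontal genomic position (i + j) / 2 and a height (j - i) / 2, so the main diagonal lies flat along the x-axis.

```python
import numpy as np


def diamond_mesh(n_bins):
    """Corner coordinates for drawing an n_bins x n_bins map rotated by 45 degrees.

    Hypothetical helper for illustration; returns two (n_bins + 1, n_bins + 1)
    arrays suitable as the X and Y arguments of matplotlib's pcolormesh.
    """
    edges = np.arange(n_bins + 1)
    i, j = np.meshgrid(edges, edges, indexing="ij")
    x = (i + j) / 2.0  # position along the genome, in bins
    y = (j - i) / 2.0  # half the separation between the two bins
    return x, y


# Made-up 3 x 3 matrix standing in for a predicted contact map.
contact_map = np.arange(9, dtype=float).reshape(3, 3)
x, y = diamond_mesh(contact_map.shape[0])

# With matplotlib available, the rotated map could then be drawn with, e.g.:
#   fig, ax = plt.subplots()
#   ax.pcolormesh(x, y, contact_map)
#   ax.set_ylim(bottom=0)  # keep only the upper triangle
```

Computing the corner grid separately from the plotting call keeps the transform easy to check by hand before any figure is rendered.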
# Generate LDSC annotations (run from LDSC Enrichment/make_ldsc_data/)
Rscript make_annot.R
# Process GWAS data for multiple populations
Rscript munge_gwas.R
# Launch Jupyter notebook for Hi-C visualization
jupyter notebook "Visualize HiC/visualization_hic.ipynb"
# Auto-sync scripts are provided for quick commits
./auto_sync.sh # On Unix systems
auto_sync.bat # On Windows

Key functions in visualization_hic.ipynb:
- from_upper_triu(): Converts upper triangular vector to symmetric matrix
- load_individual_map(): Loads Hi-C contact maps from prediction files
- annotations(): Adds genomic annotations (genes, CTCF, conservation)
- map_comparison_no_delta_with_annotations(): Creates side-by-side contact map comparisons with genomic tracks
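
To make the upper-triangular conversion concrete, here is a minimal sketch of how a flattened upper triangle can be expanded into a symmetric matrix. It is not the notebook's implementation: the name `from_upper_triu_sketch` and the parameters `matrix_len` and `num_diags` are assumptions, as is the convention that the diagonals closest to the main diagonal are omitted from the vector and shown as NaN.

```python
import numpy as np


def from_upper_triu_sketch(vector, matrix_len, num_diags):
    """Rebuild a symmetric matrix from its flattened upper triangle.

    Assumption: `vector` holds the entries at or above diagonal offset
    `num_diags` in row-major order; closer diagonals are set to NaN.
    """
    rows, cols = np.triu_indices(matrix_len, k=num_diags)
    if len(vector) != len(rows):
        raise ValueError(
            f"expected {len(rows)} values for this shape, got {len(vector)}"
        )
    matrix = np.zeros((matrix_len, matrix_len))
    matrix[rows, cols] = vector
    # Mirror the strictly upper part so the main diagonal is not doubled.
    matrix = matrix + np.triu(matrix, k=1).T
    # Mask the diagonals that the vector does not cover.
    idx = np.arange(matrix_len)
    offsets = np.abs(idx[:, None] - idx[None, :])
    matrix[offsets < num_diags] = np.nan
    return matrix


# Made-up values: a 4 x 4 map with the two central diagonals omitted.
example = from_upper_triu_sketch(np.array([1.0, 2.0, 3.0]), matrix_len=4, num_diags=2)
```

Masking the omitted diagonals with NaN rather than zero lets a plotting routine leave them blank instead of drawing them as zero contact.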
- Variant Data: Located in akita_variants/variants_in_divergent_windows_10AF.txt
  - Format: chr, position, ref_allele, alt_allele, window_position
- GWAS Data: Population-specific files in gwas_raw/
  - Format: Standard GWAS summary statistics with hg38 coordinates
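
The five-column variant format can be parsed with the standard library alone. The sketch below is illustrative only: it assumes the file is tab-delimited with no header row and that `position` and `window_position` are integers, none of which is confirmed here. The `Variant` type and `read_variants` helper are hypothetical names, and the example rows are made up.

```python
import csv
import io
from typing import NamedTuple


class Variant(NamedTuple):
    chrom: str
    position: int
    ref_allele: str
    alt_allele: str
    window_position: int


def read_variants(handle, delimiter="\t"):
    """Parse rows of chr, position, ref_allele, alt_allele, window_position.

    Assumptions: tab-delimited text, no header row, integer positions.
    """
    variants = []
    for row in csv.reader(handle, delimiter=delimiter):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        chrom, pos, ref, alt, window = row[:5]
        variants.append(Variant(chrom, int(pos), ref, alt, int(window)))
    return variants


# Made-up rows standing in for the real variant file.
example_file = io.StringIO("chr1\t1000\tA\tG\t512\nchr2\t2500\tC\tT\t1024\n")
variants = read_variants(example_file)
```

For the real file, the same function could be called on `open(path, newline="")` once the delimiter and header assumptions have been checked against the data.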
- Input: Height GWAS variants and Hi-C predictions from Akita model
- Processing: LDSC enrichment analysis and Hi-C visualization
- Output: Publication figures and enrichment statistics
- R: For LDSC data processing and statistical analysis
- Python/Jupyter: For Hi-C visualization with matplotlib, numpy, pandas
- External Tools: LDSC software suite (referenced but not included)
- Raw data files are organized by analysis type and population
- Visualization code is self-contained in Jupyter notebooks
- Auto-sync scripts maintain Git synchronization
The visualization notebook expects specific file paths for annotation data:
- Gene annotations: grch38_gene_annotations.bed
- Conservation data: phastConsElements100way_hg38.bed
- CTCF binding sites: ctcf_full_merged_hg38.bed
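
Since all three annotation files are in BED format, selecting the records that fall inside a plotted window follows the same pattern for each. The following is a rough sketch rather than the notebook's loader: `read_bed_overlaps` is a hypothetical helper, the files are assumed to be tab-delimited, and the example records are made up. BED intervals are 0-based and half-open, so a record ending exactly at the window start does not overlap it.

```python
import io


def read_bed_overlaps(handle, chrom, start, end):
    """Return (chrom, start, end, name) for BED records overlapping [start, end).

    Assumption: tab-delimited BED with at least three columns; the name
    column is optional and defaults to an empty string.
    """
    hits = []
    for line in handle:
        line = line.strip()
        if not line or line.startswith(("#", "track", "browser")):
            continue  # skip blank lines, comments, and BED header lines
        fields = line.split("\t")
        rec_chrom, rec_start, rec_end = fields[0], int(fields[1]), int(fields[2])
        name = fields[3] if len(fields) > 3 else ""
        # Half-open intervals overlap when each starts before the other ends.
        if rec_chrom == chrom and rec_start < end and rec_end > start:
            hits.append((rec_chrom, rec_start, rec_end, name))
    return hits


# Made-up records standing in for an annotation file.
bed_text = "chr1\t100\t200\tGENE_A\nchr1\t500\t900\tGENE_B\nchr2\t100\t200\tGENE_C\n"
overlapping = read_bed_overlaps(io.StringIO(bed_text), "chr1", 150, 600)
touching = read_bed_overlaps(io.StringIO(bed_text), "chr1", 200, 500)
```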
- GWAS files are expected in standard format with hg38 coordinates
- Variant annotations are derived from Akita model predictions
- Population-specific analysis is supported for AFR, AMR, EAS, EUR, and SAS populations
The project includes automated sync scripts that perform standard Git operations:
- Stage all changes (git add .)
- Commit with automatic message
- Pull with unrelated histories flag
- Push to main branch
This setup facilitates rapid iteration during active research phases.
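
The four sync steps can be sketched as a command list. This is not the content of auto_sync.sh or auto_sync.bat, which is not shown here: the commit message text and the remote name `origin` are assumptions, and `build_sync_commands` and `run_sync` are hypothetical helpers. The only flag relied on is git's `--allow-unrelated-histories`, which is taken to be the "unrelated histories flag" the list refers to.

```python
import subprocess


def build_sync_commands(message="Auto-sync", remote="origin", branch="main"):
    """Return the git commands for one sync pass, in execution order.

    Assumptions: the commit message and remote name are placeholders.
    """
    return [
        ["git", "add", "."],
        ["git", "commit", "-m", message],
        ["git", "pull", remote, branch, "--allow-unrelated-histories"],
        ["git", "push", remote, branch],
    ]


def run_sync(dry_run=True, **kwargs):
    """Print the commands (dry run) or execute them, stopping at the first failure."""
    commands = build_sync_commands(**kwargs)
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            # Note: `git commit` exits non-zero when there is nothing to commit,
            # which check=True turns into an exception.
            subprocess.run(cmd, check=True)
    return commands
```

Calling `run_sync()` with the default `dry_run=True` only prints the commands, which is a safe way to review them before running anything against the repository.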