This repository presents a mutation-to-chromatin matching framework for studying origin-like molecular states in hepatocellular carcinoma. Passenger mutation profiles are converted into genomic tracks and compared with reference chromatin maps to test whether regional mutation patterns retain signals of earlier cellular organisation. A pan-cell-type benchmark evaluates track construction, bin size and scoring choices, identifying an exponential-decay, 0.5-Mb, residualised Spearman configuration as the strongest performer. The selected approach is then applied to TCGA-LIHC tumours using FOXA2-associated hepatocyte references, followed by clinical, expression, pathway and null analyses to assess interpretability and distinguish subtle reference-aligned chromatin similarities from definitive cell-of-origin claims in liver cancer.
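To make the selected configuration concrete, here is a minimal sketch of one way an exponential-decay, 0.5-Mb, residualised Spearman score could be computed. It is an illustration under stated assumptions, not the repository's implementation: the function names, the decay scale, the single covariate and the toy inputs are all hypothetical, and "residualised" is read here as removing a linear covariate effect from both tracks before rank correlation.

```python
import math

BIN_SIZE = 500_000  # 0.5 Mb, the bin size named in the text


def decay_track(positions, n_bins, bin_size=BIN_SIZE, scale=BIN_SIZE):
    """Per-bin signal: each mutation adds exp(-distance / scale) at every bin centre.

    The decay scale is an assumption; the text does not state one.
    """
    centres = [(i + 0.5) * bin_size for i in range(n_bins)]
    return [sum(math.exp(-abs(c - p) / scale) for p in positions) for c in centres]


def ranks(values):
    """1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            out[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return out


def pearson(x, y):
    """Pearson correlation; both inputs must have non-zero variance."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def spearman(x, y):
    """Spearman correlation as the Pearson correlation of ranks."""
    return pearson(ranks(x), ranks(y))


def residualise(y, covariate):
    """Residuals of y after a least-squares fit on one non-constant covariate."""
    mx, my = sum(covariate) / len(covariate), sum(y) / len(y)
    sxx = sum((c - mx) ** 2 for c in covariate)
    slope = sum((c - mx) * (v - my) for c, v in zip(covariate, y)) / sxx
    return [v - (my + slope * (c - mx)) for c, v in zip(covariate, y)]


def residualised_spearman(mutation_track, chromatin_track, covariate):
    """Spearman correlation of the two tracks after removing the covariate."""
    return spearman(
        residualise(mutation_track, covariate),
        residualise(chromatin_track, covariate),
    )


# Toy inputs (hypothetical): six 0.5-Mb bins on one chromosome.
toy_mutations = [250_000, 300_000, 1_250_000, 2_750_000, 2_800_000, 2_900_000]
toy_accessibility = [0.9, 0.7, 0.8, 0.4, 0.3, 0.1]
toy_covariate = [0.41, 0.44, 0.39, 0.47, 0.43, 0.40]  # e.g. a per-bin GC fraction

toy_track = decay_track(toy_mutations, n_bins=6)
toy_score = residualised_spearman(toy_track, toy_accessibility, toy_covariate)
```

The score lies in [-1, 1]. Its sign and magnitude depend on the chosen covariate and decay scale, which is why the benchmark compares configurations rather than fixing them in advance.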
Use the unified Streamlit results app to inspect the thesis figures, supporting tables, and interactive track visualisation.
conda activate mut-epi-origin
streamlit run tools/results_app.py
Create the environment:
conda env create -f environment.yml
conda activate mut-epi-origin
pip install -r requirements.txt
If conda activate fails, run conda init once and restart your shell.
The thesis-facing results are organised under outputs/thesis/ and mirror the
Results section figure order. To regenerate the active thesis outputs, run the
section scripts from the repository root after activating the environment:
conda activate mut-epi-origin
bash scripts/01_pan_celltype_benchmark/reproduce_results.sh
bash scripts/02_foxa2_epigenome_orientation/reproduce_results.sh
bash scripts/03_hepatocyte_clinical_associations/reproduce_results.sh
bash scripts/04_differential_expression/reproduce_results.sh
bash scripts/05_null_bootstrap_validation/reproduce_results.sh
The null bootstrap script accepts optional arguments:
bash scripts/05_null_bootstrap_validation/reproduce_results.sh 10
where the argument is the number of bootstrap replicates.
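To show why the replicate count matters, the sketch below builds a simple shuffle null and turns it into an empirical p-value. It is only an illustration: the repository's bootstrap_shuffle_null.py may randomise mutations differently, and every name here (shuffle_null, empirical_p_value, the covariance score) is hypothetical.

```python
import random


def covariance(x, y):
    """A simple stand-in score; any function of two tracks could be used."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)


def shuffle_null(mutation_track, chromatin_track, score, n_replicates=10, seed=0):
    """Score the chromatin track against bin-shuffled copies of the mutation track."""
    rng = random.Random(seed)  # fixed seed so replicates are reproducible
    shuffled = list(mutation_track)
    null_scores = []
    for _ in range(n_replicates):
        rng.shuffle(shuffled)
        null_scores.append(score(shuffled, chromatin_track))
    return null_scores


def empirical_p_value(observed, null_scores):
    """One-sided p-value with the +1 correction, so it is never exactly zero."""
    at_least = sum(1 for s in null_scores if s >= observed)
    return (at_least + 1) / (len(null_scores) + 1)


# Toy tracks (hypothetical values).
mutations = [5.0, 4.0, 4.5, 1.0, 0.5, 0.2]
accessibility = [0.9, 0.7, 0.8, 0.4, 0.3, 0.1]

observed = covariance(mutations, accessibility)
null = shuffle_null(mutations, accessibility, covariance, n_replicates=10)
p_value = empirical_p_value(observed, null)
```

With 10 replicates the smallest p-value this estimator can return is 1/11, so a small replicate count is suited to a quick check rather than a final estimate.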
outputs/thesis/05_null_bootstrap_validation/reproduce_results.sh is a
lightweight wrapper around the root-level null-bootstrap reproducer. Executable
analysis code lives under scripts/; thesis folders contain notebooks, curated
data, figures, and wrappers only.
For a compact list of script entrypoints, see
scripts/README.md. For the curated thesis output index,
see outputs/thesis/index.md.
Set machine-specific roots in:
config/data_paths.json
Required key:
wgs_tcga25_root
Example:
{
"wgs_tcga25_root": "data/raw/WGS_TCGA25"
}
Use an absolute path if your local WGS_TCGA25 lives outside this repository.
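As an illustration of how a script might consume this file, the following sketch loads the config, checks for the required key, and turns a relative root into an absolute path. The function name load_wgs_root is hypothetical, and resolving relative paths against the repository root is an assumption based on the example above, not a documented behaviour of the pipeline.

```python
import json
from pathlib import Path


def load_wgs_root(config_path="config/data_paths.json", repo_root="."):
    """Return the WGS_TCGA25 root from the config as an absolute path."""
    with open(config_path, encoding="utf-8") as handle:
        config = json.load(handle)
    if "wgs_tcga25_root" not in config:
        raise KeyError("config is missing the required key 'wgs_tcga25_root'")
    root = Path(config["wgs_tcga25_root"])
    if not root.is_absolute():
        # Assumption: relative roots are interpreted from the repository root.
        root = Path(repo_root).resolve() / root
    return root
```

Called from the repository root as load_wgs_root(), this would return the absolute form of data/raw/WGS_TCGA25 for the example config.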
The current pipeline keeps and uses only:
- data/raw/mutations/filtered_mutations.bed
- data/raw/mutations/ICGC_WGS_Feb20_mutations.LIHC_LIRI.bed
- data/raw/mutations/lihc_snv_mutation_table.tsv
The five thesis sections are reproduced by the scripts/<section>/reproduce_results.sh entrypoints listed above. The scripts assume that the shared raw and processed inputs are already present. If you need to rebuild those inputs, use the data-build steps below first.
- Build FOXA2 hepatocyte ATAC pseudo-bulk tracks:
Rscript scripts/99_data_build/make_atac_pseudobulk.R
This requires bedGraphToBigWig, liftOver, bedtools, sort, gzip, and gunzip in PATH, and the multiome object at data/raw/multiome/GSE281574_Liver_Multiome_Seurat_GEO.rds.
- Build the LIHC metadata table:
python scripts/99_data_build/build_master_metadata.py
This writes data/derived/master_metadata.csv.
- Transfer LIHC VCFs:
bash scripts/99_data_build/transfer_lihc_vcfs.sh --test
bash scripts/99_data_build/transfer_lihc_vcfs.sh
The transfer step uses config/data_paths.json for the local WGS root and writes the LIHC VCF manifest under data/derived/manifests/.
- Build the LIHC SNV table:
python scripts/99_data_build/build_snv_mutation_table.py
This writes data/raw/mutations/lihc_snv_mutation_table.tsv.
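Before running the ATAC pseudo-bulk build, it can help to confirm that its prerequisites are in place. The sketch below is a hypothetical pre-flight check, not part of the repository: it looks up the listed command-line tools on PATH with shutil.which and checks that the multiome object exists at the path given above.

```python
import shutil
from pathlib import Path

# Tools and input named in the pseudo-bulk build step.
REQUIRED_TOOLS = ["bedGraphToBigWig", "liftOver", "bedtools", "sort", "gzip", "gunzip"]
MULTIOME_OBJECT = "data/raw/multiome/GSE281574_Liver_Multiome_Seurat_GEO.rds"


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


def preflight(tools=REQUIRED_TOOLS, multiome_object=MULTIOME_OBJECT):
    """Return a list of human-readable problems; an empty list means ready."""
    problems = [f"tool not on PATH: {tool}" for tool in missing_tools(tools)]
    if not Path(multiome_object).is_file():
        problems.append(f"missing input: {multiome_object}")
    return problems
```

Run from the repository root, preflight() returns an empty list when everything is available, and otherwise names each missing tool or input.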
- Figure 1: pan-cell-type benchmark.
- Figure 2: TCGA-LIHC cohort orientation and FOXA2 hepatocyte references.
- Figure 3: FOXA2 clinical association analyses.
- Figure 4: differential expression and pathway analysis.
- Figure 5: mutation-randomised null benchmark.
The thesis scripts call these core tools:
- python -m scripts.grid_search.cli for mutation-to-chromatin scoring.
- scripts/01_pan_celltype_benchmark/validate_state_scores.py for label and score validation.
- scripts/04_differential_expression/run_differential_expression_by_inferred_labels.R for DESeq2 RNA-seq differential expression.
- scripts/04_differential_expression/run_limma_by_inferred_labels.R for limma-voom RNA-seq differential expression.
- scripts/04_differential_expression/run_fgsea_from_de.R for pathway enrichment from ranked DE results.
- scripts/05_null_bootstrap_validation/bootstrap_shuffle_null.py for the mutation-randomised null benchmark.
For the full script layout, see scripts/README.md.
- Project focus is TCGA-LIHC for the FOXA2 downstream analyses.
- Metadata source of truth is data/derived/master_metadata.csv.
- Fibrosis source of truth is the clinical Ishak field from clinical.tsv after case-level aggregation.
- HBV/HCV harmonisation uses data/raw/annotations/mmc1.xlsx consensus calls first, then fallback fields.
- Obesity class is derived from BMI using WHO categories.
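The BMI-to-obesity-class rule can be sketched as follows, using the standard WHO adult cut-offs (18.5, 25, 30, 35 and 40 kg/m²). The function names and label strings are hypothetical; the labels actually written to master_metadata.csv may differ.

```python
import math


def bmi_from_measurements(weight_kg, height_cm):
    """Body mass index in kg/m^2."""
    height_m = height_cm / 100
    return weight_kg / height_m ** 2


def who_bmi_class(bmi):
    """Map a BMI value to a WHO adult category (label strings are illustrative)."""
    if bmi is None or math.isnan(bmi):
        return "unknown"
    if bmi < 18.5:
        return "underweight"
    if bmi < 25:
        return "normal"
    if bmi < 30:
        return "overweight"
    if bmi < 35:
        return "obesity_class_1"
    if bmi < 40:
        return "obesity_class_2"
    return "obesity_class_3"
```

Missing or non-numeric BMI values map to "unknown" here, so they are kept out of the weight categories rather than silently assigned to one.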
For a full, practical guide to the mutation-vs-accessibility grid search runner (inputs, outputs, configuration modes, explicit setups, and resume workflow), see: