Description
Hello,
Thank you for this fantastic tool! However, I ran into poor training performance when training a model on Hi-C data for a specific cell type: the resulting model's Pearson correlation is only about 0.06, and I'm unsure where the issue lies. I suspect it may be related to Hi-C-specific parameters or preprocessing steps. Below are the details of my setup:
Preprocessing code:
! /basenji/bin/akita_data.py --sample 0.1 \
-g /basenji/data/hg38_gaps_binsize2048_numconseq10.bed \
-l 1048576 --crop 65536 --local -k 1 -o /basenji/data/1m --as_obsexp \
-p 16 -t .1 -v .1 -w 2048 --snap 2048 \
/basenji/data/hg38.ml.fa /basenji/data/HiC_cools.txt
Model parameters (params_tutorial.json):
{
"train": {
"batch_size": 2,
"optimizer": "sgd",
"learning_rate": 0.0065,
"momentum": 0.99575,
"loss": "mse",
"patience": 50,
"clip_norm": 10.0
},
"model": {
"seq_length": 1048576,
"target_length": 512,
"target_crop": 32,
"diagonal_offset": 2,
"augment_rc": true,
"augment_shift": 11,
"activation": "relu",
"norm_type": "batch",
"bn_momentum": 0.9265,
"trunk": [
{"name": "conv_block", "filters": 96, "kernel_size": 11, "pool_size": 2},
{"name": "conv_tower", "filters_init": 96, "filters_mult": 1.0, "kernel_size": 5, "pool_size": 2, "repeat": 10},
{"name": "dilated_residual", "filters": 48, "rate_mult": 1.75, "repeat": 8, "dropout": 0.4},
{"name": "conv_block", "filters": 64, "kernel_size": 5}
],
"head_hic": [
{"name": "one_to_two", "operation": "mean"},
{"name": "concat_dist_2d"},
{"name": "conv_block_2d", "filters": 48, "kernel_size": 3},
{"name": "symmetrize_2d"},
{"name": "dilated_residual_2d", "filters": 24, "kernel_size": 3, "rate_mult": 1.75, "repeat": 6, "dropout": 0.1},
{"name": "cropping_2d", "cropping": 32},
{"name": "upper_tri", "diagonal_offset": 2},
{"name": "final", "units": 1, "activation": "linear"}
]
}
}
Training code:
! akita_train.py -k -o ./data/1m/train_out/ ./data/1m/params_tutorial.json ./data/1m/
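If I understand the flags correctly, the windowing arithmetic between my preprocessing flags and the model parameters is self-consistent; here is a quick sanity check in plain Python (variable names are mine, numbers taken from the commands and JSON above):

```python
# Numbers from the akita_data.py flags and params_tutorial.json above.
seq_length = 1048576   # -l and "seq_length"
bin_size = 2048        # -w (matches the .cool bin size)
crop_bp = 65536        # --crop

print(seq_length // bin_size)  # 512 -> matches "target_length"
print(crop_bp // bin_size)     # 32  -> matches "target_crop" and "cropping"
```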
Additionally, my .cool file is binned at 2048 bp and iteratively corrected (ICE) using cooler.
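To rule out an input problem, I checked the bin size and the ICE weights with the cooler Python API (a minimal sketch; the file path is a placeholder):

```python
import cooler

c = cooler.Cooler("/basenji/data/my_celltype_2048.cool")  # placeholder path

print(c.binsize)                  # expect 2048
bins = c.bins()[:5]               # first few bins as a pandas DataFrame
print("weight" in bins.columns)   # True if the file carries ICE balancing weights
```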
I noticed the following comment in your original paper:
"To focus on locus-specific patterns and mitigate the impact of sparse sampling present in even the currently highest-resolution Hi-C maps, we adaptively coarse-grain, normalize for the distance-dependent decrease in contact frequency, take a natural log, clip to (−2,2), linearly interpolate missing bins and convolve with a small 2D Gaussian filter (sigma, 1 and width, 5). The first to third steps use cooltools functions (https://github.com/mirnylab/cooltools). Interpolation of low-coverage bins filtered out in typical Hi-C pipelines was crucial for learning with log(observed/expected) Hi-C targets, greatly outperforming replacing these bins with zeros."
Question:
Are these preprocessing steps (adaptive coarse-graining, distance normalization, log transformation, clipping, interpolation, and Gaussian filtering) already included in akita_data.py? If not, could you provide guidance on how to incorporate them or suggest other potential causes for the poor performance?
I greatly appreciate your help and look forward to your response!