Description
Hi,
I am trying to do a study similar to the Enformer study for my final thesis, and to do so I have downloaded 4505 ENCODE tracks.
When I ran the basenji_data.py script, I encountered the following warning numerous times:
/mnt2/fscratch/users/ac_aux/vguirado/preprocess/bin/basenji_data_read.py:307: RuntimeWarning: overflow encountered in cast
cov = self.cov_open.values(chrm, start, end, numpy=True).astype('float16')
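For context, the warning means that any coverage value above the float16 maximum (about 65504) becomes inf during that cast. A minimal sketch reproducing it outside of basenji:

import numpy as np

# Coverage from pyBigWig comes back as float32/float64; casting to float16
# overflows for any value above np.finfo(np.float16).max (~65504) and the
# overflowed entries become inf.
cov = np.array([10.0, 500.0, 70000.0, 1e6])
cov16 = cov.astype('float16')          # emits "overflow encountered in cast"
print(np.finfo(np.float16).max)        # largest representable float16 value
print(cov16)                           # the last two entries print as inf
print(np.isinf(cov16).sum())           # 2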
The submission script:
#SBATCH --job-name=preprocess
#SBATCH --time=0-30:0
#SBATCH --mem=50G
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16
##SBATCH --ntasks-per-node=1
#SBATCH --constraint=cal
...more sbatch things...
time python bin/basenji_data.py -s .9 -g data/hg38_gaps.bed -l 196608 --local -o data/human -p 128 -v .1 -w 128 data/genome.fa data/human_data.txt
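Before re-running, this is the kind of sanity check I can do on the input bigWigs to see which tracks actually exceed the float16 range (a rough sketch; I'm assuming data/human_data.txt is a tab-separated targets file with a 'file' column pointing to the bigWig paths, so adjust the column name if yours is laid out differently):

import numpy as np
import pandas as pd
import pyBigWig

# Flag any bigWig whose header maximum exceeds the float16 range, since those
# values will overflow to inf when basenji_data_read.py casts to float16.
FLOAT16_MAX = float(np.finfo(np.float16).max)  # ~65504

targets = pd.read_csv('data/human_data.txt', sep='\t')
for path in targets['file']:
    bw = pyBigWig.open(path)
    max_val = bw.header()['maxVal']  # whole-file maximum stored in the header
    if max_val > FLOAT16_MAX:
        print(f'{path}: max value {max_val:.1f} overflows float16')
    bw.close()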
I would like to know whether this could affect the generation of the TFRecords, because during training I am seeing extremely strange behavior, as I show below:
This is from restoring a checkpoint at epoch 80 and training 50 more epochs until 130 (first graph), restoring the checkpoint from epoch 130 and training until 180 (second), and from 180 until 230 (third). Here I am using a small subsample, but the same happens with the whole dataset (and a worse loss).
Apparently my training code is fine, because I have tried loading the public Enformer checkpoint and modifying the output head to train on the same subset, and there I do get results. That is, I keep the already-trained trunk and add a single linear layer on top.
But when training from scratch, also including 1019 mouse tracks, the model is not able to learn anything: the R^2 values are 0 or negative no matter how many steps I train.
So it occurs to me that the problem lies in the generation of the TFRecords, but the only warning I found was the one above.
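For reference, this is the kind of check I'm planning to run on the generated TFRecords to see whether any targets ended up as inf or NaN (a rough sketch; the file pattern, feature keys, dtype and compression are assumptions based on the basenji version I'm running, so adjust them if yours differs):

import glob
import numpy as np
import tensorflow as tf

# Count records whose decoded targets contain inf/NaN. I'm assuming each
# record stores 'sequence' and 'target' as serialized bytes with float16
# targets and ZLIB compression.
feature_spec = {
    'sequence': tf.io.FixedLenFeature([], tf.string),
    'target': tf.io.FixedLenFeature([], tf.string),
}

bad_records = 0
for tfr in glob.glob('data/human/tfrecords/*.tfr'):
    for raw in tf.data.TFRecordDataset(tfr, compression_type='ZLIB'):
        parsed = tf.io.parse_single_example(raw, feature_spec)
        targets_arr = tf.io.decode_raw(parsed['target'], tf.float16).numpy()
        if not np.isfinite(targets_arr).all():
            bad_records += 1
print(f'records with inf/NaN targets: {bad_records}')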
Thank you for your time.
