Description
Hey,
What are the memory requirements to train this model? I am providing 187 GB of RAM, and it fails after:
INFO:tensorflow:Saving checkpoints for 0 into summary/knee_l1/model.ckpt.
At this point the memory usage jumps from 4 GB to more than 187 GB, and the job gets killed because it runs out of memory.
I am just running the model based on the train_all.sh command, where I have decreased the batch size from 2 to 1 and the number of iteration steps from 10000 to only 10:
python3 recon_train.py \
    --shape_y 320 --shape_z 256 \
    --num_channels 8 --num_maps 1 \
    --batch_size 1 \
    --model_dir summary/knee_l1 \
    --loss_l1 1 \
    --max_steps 10 \
    --device $device
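For scale, here is a rough sketch of how much memory a single input example should need with these flags. It assumes one example is a `shape_y` x `shape_z` x `num_channels` array of complex64 values (8 bytes each); that layout and the helper name `example_size_bytes` are my assumptions for illustration, not something taken from recon_train.py.

```python
def example_size_bytes(shape_y, shape_z, num_channels, bytes_per_element=8):
    """Bytes needed for one example, assuming a dense array of
    shape_y * shape_z * num_channels elements.

    bytes_per_element=8 corresponds to complex64 (an assumption; the
    actual dtype used by the training script may differ).
    """
    num_elements = shape_y * shape_z * num_channels
    return num_elements * bytes_per_element


# Values taken from the command-line flags above.
size = example_size_bytes(shape_y=320, shape_z=256, num_channels=8)
print(f"{size / 2**20:.1f} MiB per example")
```

Under these assumptions a batch of one example is only a few MiB, so the batch size alone does not seem to explain a jump to more than 187 GB.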
Can you please help me? I am unable to train the model. I am providing 1 GPU with 16 GB of memory. Is this model designed to run on multiple nodes and CPUs?
Thank you.