
torch.distributed.launch on eight 40G A100, CUDA out of memory. #26

@zhengbiqing

Description


I run:

```shell
export CUDA_VISIBLE_DEVICES='0,1,2,3,4,5,6,7'
task=gene
datadir=data/$task
outdir=runs/$task/GPT2
name=gene0913
checkpoint=/root/siton-glusterfs-eaxtsxdfs/xts/data/BioMedLM
python -m torch.distributed.launch --nproc_per_node=8 --nnodes=1 --node_rank=0 --use_env run_seqcls_gpt.py \
    --tokenizer_name $checkpoint --model_name_or_path $checkpoint \
    --train_file $datadir/train.json --validation_file $datadir/dev.json --test_file $datadir/test.json \
    --do_train --do_eval --do_predict \
    --per_device_train_batch_size 1 --per_device_eval_batch_size 1 --gradient_accumulation_steps 1 \
    --learning_rate 2e-6 --warmup_ratio 0.5 --num_train_epochs 5 --max_seq_length 32 \
    --logging_steps 1 --save_strategy no --evaluation_strategy no \
    --output_dir $outdir --overwrite_output_dir --bf16 --seed 1000 --run_name $name
```

but I still get CUDA out of memory.
Does anyone know how many GPUs are needed to fine-tune the seqcls task?
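For scale, here is a rough back-of-the-envelope estimate of the per-GPU memory this setup needs. It assumes BioMedLM is the 2.7B-parameter GPT-2 checkpoint, that fp32 master weights and gradients are kept alongside two fp32 AdamW state tensors per parameter (the usual mixed-precision setup), and that plain DDP via `torch.distributed.launch` replicates all of this on every GPU with no sharding; treat the numbers as an illustration, not a measurement.

```python
# Back-of-the-envelope per-GPU memory for full fine-tuning with AdamW.
# Assumptions (for illustration): 2.7e9 parameters, fp32 master
# weights and gradients, two fp32 AdamW state tensors per parameter
# (exp_avg, exp_avg_sq), and no optimizer-state sharding under DDP.
params = 2.7e9
bytes_per_param = 4 + 4 + 8   # weights + grads + AdamW states

gib = params * bytes_per_param / 2**30
print(f"~{gib:.0f} GiB per GPU before activations")  # → ~40 GiB
```

Under these assumptions the optimizer alone roughly fills a 40 GB A100 before any activations or CUDA context are allocated, which would explain an OOM even at batch size 1; sharding the optimizer state (e.g. DeepSpeed ZeRO or FSDP) or enabling gradient checkpointing are the usual remedies.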
