
Training time shows 18 hr on 16 * H100 #111

@kartikjain-42

Description

We are currently replicating your training pipeline for the Qwen-32B model using Supervised Fine-Tuning (SFT), as described in the repository and the associated paper. However, we are seeing a significant discrepancy from the reported training time.

According to the paper, the training duration is approximately 26 minutes using 16×H100 GPUs. We have set up our environment to match this configuration (also using 16×H100 GPUs), but the training is taking more than 18 hours to complete.

We are running the train/sft_multinode.sh script with only minor updates for our pod settings; here is the script:

uid="$(date +%Y%m%d_%H%M%S)"
base_model="Qwen/Qwen2.5-32B-Instruct" # meta-llama/Llama-3.1-70B-Instruct
lr=1e-5
min_lr=0
epochs=5
micro_batch_size=1 # If 2 nodes with 8 gpus each, batch_size will be 16
push_to_hub=true
gradient_accumulation_steps=1
max_steps=-1
gpu_count=$(nvidia-smi -L | wc -l)

torchrun \
    --nnodes ${NUM_NODES}:${NUM_NODES} \
    --nproc-per-node ${gpu_count} \
    --node_rank=$NODE_RANK \
    --master_addr=$MASTER_ADDR \
    --master_port=$MASTER_PORT \
    train/sft.py \
    --per_device_train_batch_size=${micro_batch_size} \
    --per_device_eval_batch_size=${micro_batch_size} \
    --gradient_accumulation_steps=${gradient_accumulation_steps} \
    --train_file_path="simplescaling/s1K_tokenized" \
    --block_size=32768 \
    --model_name=${base_model} \
    --warmup_ratio=0.05 \
    --fsdp="full_shard auto_wrap" \
    --fsdp_config="train/scripts/fsdp_config_qwen.json" \
    --bf16=True \
    --eval_strategy="steps" \
    --eval_steps=50 \
    --logging_steps=1 \
    --save_steps=100 \
    --lr_scheduler_type cosine \
    --learning_rate ${lr} \
    --weight_decay 1e-4 \
    --adam_beta1 0.9 \
    --adam_beta2 0.95 \
    --output_dir="kj42/s1_replicate_${uid}" \
    --hub_model_id="kj42/s1_replicate_${uid}" \
    --push_to_hub=True \
    --hub_always_push=True \
    --num_train_epochs ${epochs} \
    --save_only_model=True \
    --gradient_checkpointing=True

Additionally, we attempted to enable flash_attention2, but observed no noticeable improvement in training time.
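For reference, the change we tried is roughly the following (a minimal sketch, not the repository's actual sft.py: it assumes the model is loaded through transformers' AutoModelForCausalLM, and only the model name and bf16 dtype are taken from the script above):

import torch
from transformers import AutoModelForCausalLM

# Sketch only: enable FlashAttention-2 at model load time. This requires the
# flash-attn package to be installed and is intended for fp16/bf16 models.
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-32B-Instruct",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)

If sft.py already exposes a supported flag for this, please point us to it.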

We also noticed that in your sft_multinode.sh script, the fsdp_config is being referenced from the path "train/scripts/fsdp_config_qwen.json". However, there is no scripts folder in the repository. We have been using "train/fsdp_config_qwen.json" instead. Could you please confirm if this is the correct path, or if we are missing any files?
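For clarity, this is the kind of check we used to confirm that only train/fsdp_config_qwen.json exists in our checkout (illustrative sketch; the transformer_layer_cls_to_wrap key is our assumption about what the config contains):

import json
from pathlib import Path

# Check both candidate locations of the FSDP config and print what they wrap.
for candidate in ("train/scripts/fsdp_config_qwen.json", "train/fsdp_config_qwen.json"):
    path = Path(candidate)
    if path.exists():
        cfg = json.loads(path.read_text())
        # Key name is a guess at the config's contents; adjust if the file differs.
        print(candidate, "->", cfg.get("transformer_layer_cls_to_wrap"))
    else:
        print(candidate, "-> not found")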

Lastly, we would like to confirm whether you are using any additional optimization frameworks such as DeepSpeed or others that may not be explicitly mentioned in the documentation. Any clarification on this would be greatly appreciated.
