
Training time shows 18 hr on 16 * H100 #111

@kartikjain-42

Description

We are currently replicating your training pipeline for the Qwen-32B model using Supervised Fine-Tuning (SFT), as described in the repository and the associated paper. However, we are seeing a significant discrepancy from the reported training time.

According to the paper, the training duration is approximately 26 minutes using 16×H100 GPUs. We have set up our environment to match this configuration (also using 16×H100 GPUs), but the training is taking more than 18 hours to complete.

We are running the train/sft_multinode.sh script with only minor updates for our pod settings; here is the script:

uid="$(date +%Y%m%d_%H%M%S)"
base_model="Qwen/Qwen2.5-32B-Instruct" # meta-llama/Llama-3.1-70B-Instruct
lr=1e-5
min_lr=0
epochs=5
micro_batch_size=1 # If 2 nodes with 8 gpus each, batch_size will be 16
push_to_hub=true
gradient_accumulation_steps=1
max_steps=-1
gpu_count=$(nvidia-smi -L | wc -l)

torchrun \
    --nnodes ${NUM_NODES}:${NUM_NODES} \
    --nproc-per-node ${gpu_count} \
    --node_rank=$NODE_RANK \
    --master_addr=$MASTER_ADDR \
    --master_port=$MASTER_PORT \
    train/sft.py \
    --per_device_train_batch_size=${micro_batch_size} \
    --per_device_eval_batch_size=${micro_batch_size} \
    --gradient_accumulation_steps=${gradient_accumulation_steps} \
    --train_file_path="simplescaling/s1K_tokenized" \
    --block_size=32768 \
    --model_name=${base_model} \
    --warmup_ratio=0.05 \
    --fsdp="full_shard auto_wrap" \
    --fsdp_config="train/scripts/fsdp_config_qwen.json" \
    --bf16=True \
    --eval_strategy="steps" \
    --eval_steps=50 \
    --logging_steps=1 \
    --save_steps=100 \
    --lr_scheduler_type cosine \
    --learning_rate ${lr} \
    --weight_decay 1e-4 \
    --adam_beta1 0.9 \
    --adam_beta2 0.95 \
    --output_dir="kj42/s1_replicate_${uid}" \
    --hub_model_id="kj42/s1_replicate_${uid}" \
    --push_to_hub=True \
    --hub_always_push=True \
    --num_train_epochs ${epochs} \
    --save_only_model=True \
    --gradient_checkpointing=True

Additionally, we attempted to enable flash_attention2, but observed no noticeable improvement in training time.
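For reference, the change we tried is roughly the following (a minimal sketch, not the repository's actual sft.py: it assumes the model is loaded through transformers' AutoModelForCausalLM, and only the model name and bf16 dtype are taken from the script above):

import torch
from transformers import AutoModelForCausalLM

# Sketch only: enable FlashAttention-2 at model load time. This requires the
# flash-attn package to be installed and is intended for fp16/bf16 models.
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-32B-Instruct",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)

If sft.py already exposes a supported flag for this, please point us to it.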

We also noticed that in your sft_multinode.sh script, the fsdp_config is being referenced from the path "train/scripts/fsdp_config_qwen.json". However, there is no scripts folder in the repository. We have been using "train/fsdp_config_qwen.json" instead. Could you please confirm if this is the correct path, or if we are missing any files?
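For clarity, this is the kind of check we used to confirm that only train/fsdp_config_qwen.json exists in our checkout (illustrative sketch; the transformer_layer_cls_to_wrap key is our assumption about what the config contains):

import json
from pathlib import Path

# Check both candidate locations of the FSDP config and print what they wrap.
for candidate in ("train/scripts/fsdp_config_qwen.json", "train/fsdp_config_qwen.json"):
    path = Path(candidate)
    if path.exists():
        cfg = json.loads(path.read_text())
        # Key name is a guess at the config's contents; adjust if the file differs.
        print(candidate, "->", cfg.get("transformer_layer_cls_to_wrap"))
    else:
        print(candidate, "-> not found")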

Lastly, we would like to confirm whether you are using any additional optimization frameworks such as DeepSpeed or others that may not be explicitly mentioned in the documentation. Any clarification on this would be greatly appreciated.
