Commit d8a675b

refine training scripts (#95)
1 parent 8123cf2 commit d8a675b

8 files changed: +250, -31 lines

pretrain/installers/v4-upstream-megatron-abci/README.md

Lines changed: 1 addition & 1 deletion
@@ -5,7 +5,7 @@
 Run the following commands on ABCI 3.0 to build an environment at `<env_install_path>`
 
 ```bash
-cd pretrain/installers/v5-megatron-abci/
+cd pretrain/installers/v4-upstream-megatron-abci/
 bash run_setup.sh <env_install_path>
 ```

pretrain/scripts/v4-upstream-training-template/README.md

Lines changed: 7 additions & 7 deletions
@@ -13,7 +13,7 @@ Training scripts for LLM-jp v5 using Megatron-LM on ABCI 3.0
 
 ```bash
 cd $EXP_DIR
-git clone git@github.com:llm-jp/scripts.git
+git clone https://github.com/llm-jp/scripts.git
 ```
 
 Next, use [pretrain/installers/v5-megatron-abci](../../installers/v5-megatron-abci/README.md) to build the environment at `$EXP_DIR/env`.
@@ -47,20 +47,22 @@ cp -r scripts/pretrain/task_template/ $EXP_DIR/tasks/$TASK_NAME
 
 ```bash
 cd $EXP_DIR/scripts/pretrain/$TRAINING_SCRIPT_DIR/
-bash run_train.sh <RESERVATION_ID> <EXPERIMENT_ID> <EXPERIMENT_DIR> <TASK_NAME> <WANDB_PROJECT> <NUM_NODES>
+bash run_train.sh <GROUP_ID> <RESERVATION_ID> <JOB_NAME> <EXPERIMENT_DIR> <TASK_NAME> <WANDB_PROJECT> <NUM_NODES> <WALLTIME>
 
 # Example:
-bash run_train.sh R0123456789 0123 /path/to/0123_experiment task_name 0123_experiment 32
+bash run_train.sh gcg51557 R0123456789 0123_pretrain /path/to/0123_experiment task_name 0123_experiment 32 720:00:00
 ```
 
 Specify the following arguments on the CLI:
 
+- `<GROUP_ID>`: ABCI group ID
 - `<RESERVATION_ID>`: ABCI reservation queue ID
-- `<EXPERIMENT_ID>`: Experiment identifier (e.g. `0123`)
-- `<EXPERIMENT_DIR>`: Path to the experiment directory (e.g. `/home/ach17726fj/experiments/0123_experiment`)
+- `<JOB_NAME>`: Job name (e.g., `0123_pretrain`)
+- `<EXPERIMENT_DIR>`: Path to the experiment directory (e.g. `/path/to/0123_experiment`)
 - `<TASK_NAME>`: Task directory name (e.g. `task_name`)
 - `<WANDB_PROJECT>`: Project name to record to WandB (e.g. `0123_experiment`)
 - `<NUM_NODES>`: Number of nodes to use (e.g. `32`)
+- `<WALLTIME>`: Time limit for the job (e.g., `720:00:00`)
 
 ### Training Configuration
 
@@ -70,5 +72,3 @@ Specify the following arguments on the CLI:
   - Defines the arguments passed to Megatron-LM's `pretrain_gpt.py` as variables in this file
 - `train_data.sh`: Script that defines the training data paths, the number of tokens to use, etc.
   - Defines the value passed to Megatron-LM's `--train-data` argument in the `$TRAIN_DATA_PATH` variable in this file
-- `train_iters.txt`: File that defines the number of training iterations
-  - Contains only the number of iterations to train, and nothing else
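Since the template no longer documents `train_iters.txt`, the task-local inputs are `params.sh` and `train_data.sh`. For orientation, here is a minimal sketch of what a task's `train_data.sh` might look like, assuming the usual Megatron-LM weighted-dataset convention; the dataset paths and mixing weights below are purely hypothetical:

```bash
#!/bin/bash
# Hypothetical example of a task-local train_data.sh.
# The training scripts only require that this file define TRAIN_DATA_PATH;
# the "<weight> <prefix>" pairs follow the usual Megatron-LM blended-dataset
# convention and will differ per task.
TRAIN_DATA_PATH=""
TRAIN_DATA_PATH="${TRAIN_DATA_PATH} 0.7 /path/to/data/ja_corpus_text_document"
TRAIN_DATA_PATH="${TRAIN_DATA_PATH} 0.3 /path/to/data/en_corpus_text_document"
```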
Lines changed: 153 additions & 0 deletions (new file)

#!/bin/bash

# Predefined variables:
# * EXPERIMENT_DIR: Experiment directory
# * TASK_NAME: Name of the task
# * ITER: Target iteration number
# * TOKENIZER_DIR: Directory of the tokenizer model

cd ${PBS_O_WORKDIR}

TASK_DIR=${EXPERIMENT_DIR}/tasks/${TASK_NAME}
JOB_ID=${PBS_JOBID%%.*}

mkdir -p ${TASK_DIR}/logs
LOGFILE=${TASK_DIR}/logs/convert-${JOB_ID}.out
ERRFILE=${TASK_DIR}/logs/convert-${JOB_ID}.err
exec > ${LOGFILE} 2> ${ERRFILE}

set -eu -o pipefail

ENV_DIR=${EXPERIMENT_DIR}/env
SCRIPT_DIR=${EXPERIMENT_DIR}/scripts

# Load common environment variables
source ${ENV_DIR}/scripts/environment.sh

# Load modules
source /etc/profile.d/modules.sh
module load cuda/${PRETRAIN_CUDA_VERSION}/${PRETRAIN_CUDA_VERSION}.${PRETRAIN_CUDA_VERSION_PATCH}
module load cudnn/${PRETRAIN_CUDNN_VERSION}/${PRETRAIN_CUDNN_VERSION_WITH_PATCH}
module load hpcx/${PRETRAIN_HPCX_VERSION}
module load nccl/${PRETRAIN_NCCL_VERSION}/${PRETRAIN_NCCL_VERSION_WITH_PATCH}
# For logging
module list

# Load Python venv
source ${ENV_DIR}/venv/bin/activate

## Debug/logging flags
export LOGLEVEL=INFO
export NCCL_DEBUG=WARN
export NCCL_DEBUG_SUBSYS=WARN
export PYTHONFAULTHANDLER=1
export CUDA_DEVICE_MAX_CONNECTIONS=1
export CUDA_LAUNCH_BLOCKING=0
export CUDNN_LOGDEST_DBG=stderr
export CUDNN_LOGERR_DBG=1

export MASTER_ADDR=$(head -n 1 ${PBS_NODEFILE} | hostname -f)
export MASTER_PORT=$((10000 + RANDOM % 1000))
echo "hostname: ${MASTER_ADDR}"

ITER_NAME=iter_$(printf %07d ${ITER})  # iter_0123456

MEGATRON_PATH=${ENV_DIR}/src/Megatron-LM
OUTPUT_DIR=${TASK_DIR}/checkpoints_hf/${ITER_NAME}

# Setup working directory
TEMP_DIR=$(mktemp -d "${TASK_DIR}/tmp_converter_${JOB_ID}_XXXXXX")
echo "TEMP_DIR=${TEMP_DIR}"
function rm_tempdir {
    if [ -e ${TEMP_DIR} ]; then
        echo "Removing temporary directory: ${TEMP_DIR}"
        rm -rf ${TEMP_DIR}
        echo "Done removing"
    fi
}
trap rm_tempdir EXIT
trap 'trap - EXIT; rm_tempdir; exit 1' INT PIPE TERM

########
# Step 1: Convert `torch_dist` format to `torch`
# This process requires to launch the trainer script with the same parallelism configs.
########
echo "Start converting: torch_dist --> torch"

# Prepare source model at specific iteration
mkdir ${TEMP_DIR}/torch_dist
echo ${ITER} > ${TEMP_DIR}/torch_dist/latest_checkpointed_iteration.txt
ln -s ${TASK_DIR}/checkpoints/${ITER_NAME} ${TEMP_DIR}/torch_dist/${ITER_NAME}

# Training data: TRAIN_DATA_PATH
source ${TASK_DIR}/train_data.sh

# Synthesize all model params: ALL_PARAMS
# Requires TRAIN_ITERS and TRAIN_DATA_PATH
source ${TASK_DIR}/params.sh

# Add params for model conversion
ALL_PARAMS+=(
    --load ${TEMP_DIR}/torch_dist
    --ckpt-convert-format torch
    --ckpt-convert-save ${TEMP_DIR}
)

echo "ALL_PARAMS: ${ALL_PARAMS[@]}"

NUM_NODES=$(wc -l < ${PBS_NODEFILE})
NUM_GPUS_PER_NODE=8
NUM_GPUS=$((${NUM_NODES} * ${NUM_GPUS_PER_NODE}))
echo "nnodes: ${NUM_NODES}; ngpus: ${NUM_GPUS}"
echo NUM_NODES=${NUM_NODES}
echo NUM_GPUS_PER_NODE=${NUM_GPUS_PER_NODE}
echo NUM_GPUS=${NUM_GPUS}

# Launch trainer script to convert the checkpoint
mpirun \
    --display-allocation \
    --report-bindings \
    --oversubscribe \
    -np ${NUM_GPUS} \
    --npernode ${NUM_GPUS_PER_NODE} \
    -bind-to none \
    -map-by slot \
    python \
        ${MEGATRON_PATH}/pretrain_gpt.py \
        ${ALL_PARAMS[@]}

echo "Files created by the Step 1:"
find ${TEMP_DIR}/torch | sort

########
# Step 2: Convert `torch` to `Hugging Face`
########

echo "Start converting: torch --> hf"

python ${MEGATRON_PATH}/tools/checkpoint/convert.py \
    --model-type GPT \
    --loader mcore \
    --saver llmjp4_hf \
    --load-dir ${TEMP_DIR}/torch \
    --save-dir ${OUTPUT_DIR} \
    --hf-tokenizer-path ${TOKENIZER_DIR} \
    --save-dtype bfloat16 \
    --loader-transformer-impl transformer_engine \
    --megatron-path ${MEGATRON_PATH}

echo "Files created by the Step 2:"
find ${OUTPUT_DIR} | sort

########
# Step 3: Replace tokenizer model
########

echo "Start replacing tokenizer"

cp ${TOKENIZER_DIR}/* ${OUTPUT_DIR}

echo "Final model files:"
find ${OUTPUT_DIR} | sort

echo "Done processing"
Lines changed: 53 additions & 0 deletions (new file)

#!/bin/bash

set -eu -o pipefail

if [ $# -ne 6 ]; then
    >&2 echo "Usage: $0 <RESERVATION_ID> <EXPERIMENT_ID> <EXPERIMENT_DIR> <TASK_NAME> <TOKENIZER_DIR> <NUM_NODES>"
    >&2 echo "Example: $0 R0123456789 0123 /path/to/0123_experiment task_name /path/to/tokenizer 1"
    exit 1
fi

# NOTE(odashi):
# Some variables are not used, but maintained for compatibility with training script.
RESERVATION_ID=$1; shift
EXPERIMENT_ID=$1; shift
EXPERIMENT_DIR=$1; shift
TASK_NAME=$1; shift
TOKENIZER_DIR=$1; shift
NUM_NODES=$1; shift

# This directory
SCRIPT_ROOT=$( cd -- "$( dirname -- "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )

TASK_DIR=${EXPERIMENT_DIR}/tasks/${TASK_NAME}
LAST_ITER=$(cat ${TASK_DIR}/checkpoints/latest_checkpointed_iteration.txt)

dependency=()

for iter in $(seq 1000 1000 ${LAST_ITER}); do
    if [ ! -e ${TASK_DIR}/checkpoints/iter_$(printf '%07d' ${iter}) ]; then
        #echo "Skip iter=${iter}: Source model does not exist."
        continue
    fi
    if [ -e ${TASK_DIR}/checkpoints_hf/iter_$(printf '%07d' ${iter})/tokenizer.json ]; then
        #echo "Skip iter=${iter}: Converted model already exists."
        continue
    fi

    # NOTE(odashi): RTYPE=rt_HG doesn't work for 8B models.
    job_id=$(qsub \
        ${dependency[@]} \
        -P gcg51557 \
        -q ${RESERVATION_ID} \
        -N ${EXPERIMENT_ID}_convert \
        -l select=${NUM_NODES},walltime=6:00:00 \
        -v RTYPE=rt_HF,EXPERIMENT_DIR=${EXPERIMENT_DIR},TASK_NAME=${TASK_NAME},ITER=${iter},TOKENIZER_DIR=${TOKENIZER_DIR} \
        -o /dev/null \
        -e /dev/null \
        -m n \
        ${SCRIPT_ROOT}/qsub_convert.sh
    )
    echo "Submitted iter=${iter}: job_id=${job_id}"
    #dependency=(-W depend=afterany:${job_id})
done
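By default the loop submits one conversion job per saved iteration, so the jobs may run concurrently within the reservation. The commented-out `dependency` assignment shows how they could instead be chained; re-enabling it (a hypothetical variant, assuming the scheduler accepts PBS-style `-W depend`) makes each submission wait for the previous conversion to finish:

```bash
    echo "Submitted iter=${iter}: job_id=${job_id}"
    # Serialize conversions: the next qsub call will carry this dependency.
    dependency=(-W depend=afterany:${job_id})
```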

pretrain/scripts/v4-upstream-training-template/pretrain/qsub_train.sh

Lines changed: 25 additions & 10 deletions
@@ -5,6 +5,8 @@
 # * TASK_NAME: Name of the task
 # * WANDB_PROJECT: W&B project name
 
+set -eu -o pipefail
+
 cd ${PBS_O_WORKDIR}
 
 TASK_DIR=${EXPERIMENT_DIR}/tasks/${TASK_NAME}
@@ -15,8 +17,6 @@ LOGFILE=${TASK_DIR}/logs/pretrain-${JOB_ID}.out
 ERRFILE=${TASK_DIR}/logs/pretrain-${JOB_ID}.err
 exec > ${LOGFILE} 2> ${ERRFILE}
 
-set -eu -o pipefail
-
 ENV_DIR=${EXPERIMENT_DIR}/env
 SCRIPT_DIR=${EXPERIMENT_DIR}/scripts
 
@@ -55,21 +55,19 @@ echo "hostname: ${MASTER_ADDR}"
 NUM_NODES=$(wc -l < ${PBS_NODEFILE})
 NUM_GPUS_PER_NODE=8
 NUM_GPUS=$((${NUM_NODES} * ${NUM_GPUS_PER_NODE}))
-echo "nnodes: ${NUM_NODES}; ngpus: ${NUM_GPUS}"
 echo NUM_NODES=${NUM_NODES}
 echo NUM_GPUS_PER_NODE=${NUM_GPUS_PER_NODE}
 echo NUM_GPUS=${NUM_GPUS}
 
+# For logging
+echo "PBS_NODEFILE:"
 cat ${PBS_NODEFILE}
 
-# Training steps
-TRAIN_ITERS=$(cat ${TASK_DIR}/train_iters.txt)
-
-# Training data: TRAIN_DATA_PATH
+# Load training data: TRAIN_DATA_PATH
 source ${TASK_DIR}/train_data.sh
 
-# Synthesize all model params: ALL_PARAMS
-# Requires TRAIN_ITERS and TRAIN_DATA_PATH
+# Load model params: ALL_PARAMS
+# Requires TRAIN_DATA_PATH
 source ${TASK_DIR}/params.sh
 
 # Add logging params
@@ -82,15 +80,30 @@ ALL_PARAMS+=(
 )
 
 # Add Checkpointing params
+BASE_CHECKPOINT_DIR=${TASK_DIR}/base_checkpoints
 TASK_CHECKPOINT_DIR=${TASK_DIR}/checkpoints
+
+if [ -e ${TASK_CHECKPOINT_DIR}/latest_checkpointed_iteration.txt ]; then
+    echo "Resume from the last checkpoint in this task"
+    LOAD_DIR=${TASK_CHECKPOINT_DIR}
+elif [ -e ${BASE_CHECKPOINT_DIR}/latest_checkpointed_iteration.txt ]; then
+    echo "Start from the base checkpoint"
+    LOAD_DIR=${BASE_CHECKPOINT_DIR}
+else
+    echo "Start from scratch"
+    LOAD_DIR=${TASK_CHECKPOINT_DIR}
+fi
+
 ALL_PARAMS+=(
-    --load ${TASK_CHECKPOINT_DIR}
+    --load ${LOAD_DIR}
     --save ${TASK_CHECKPOINT_DIR}
     --save-interval 1000
 )
 
+# For logging
 echo "ALL_PARAMS: ${ALL_PARAMS[@]}"
 
+echo "Start training..."
 mpirun \
     --display-allocation \
     --report-bindings \
@@ -102,3 +115,5 @@ mpirun \
 python \
     ${ENV_DIR}/src/Megatron-LM/pretrain_gpt.py \
     ${ALL_PARAMS[@]}
+
+echo "Training completed successfully."

pretrain/scripts/v4-upstream-training-template/pretrain/run_train.sh

Lines changed: 9 additions & 10 deletions
@@ -2,31 +2,30 @@
 
 set -eu -o pipefail
 
-if [ $# -ne 6 ]; then
-    >&2 echo "Usage: $0 <RESERVATION_ID> <EXPERIMENT_ID> <EXPERIMENT_DIR> <TASK_NAME> <WANDB_PROJECT> <NUM_NODES>"
-    >&2 echo "Example: $0 R0123456789 0123 /path/to/0123_experiment task_name 0123_experiment 32"
+if [ $# -ne 8 ]; then
+    >&2 echo "Usage: $0 <GROUP_ID> <RESERVATION_ID> <JOB_NAME> <EXPERIMENT_DIR> <TASK_NAME> <WANDB_PROJECT> <NUM_NODES> <WALLTIME>"
+    >&2 echo "Example: $0 gcg51557 R0123456789 0123 /path/to/0123_experiment task_name 0123_experiment 32 720:00:00"
     exit 1
 fi
 
+GROUP_ID=$1; shift
 RESERVATION_ID=$1; shift
-EXPERIMENT_ID=$1; shift
+JOB_NAME=$1; shift
 EXPERIMENT_DIR=$1; shift
 TASK_NAME=$1; shift
 WANDB_PROJECT=$1; shift
 NUM_NODES=$1; shift
+WALLTIME=$1; shift
 
 # This directory
 SCRIPT_ROOT=$( cd -- "$( dirname -- "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )
 
-WALLTIME=720:00:00 # 30 days
-# WALLTIME=01:00:00 # 1 hour
-
 qsub \
-    -P gcg51557 \
+    -P ${GROUP_ID} \
     -q ${RESERVATION_ID} \
-    -N ${EXPERIMENT_ID}_pretrain \
+    -N ${JOB_NAME} \
     -l select=${NUM_NODES},walltime=${WALLTIME} \
-    -v RTYPE=rt_HF,EXPERIMENT_DIR=${EXPERIMENT_DIR},TASK_NAME=${TASK_NAME},WANDB_PROJECT=${WANDB_PROJECT} \
+    -v RTYPE=rt_HF,USE_SSH=1,EXPERIMENT_DIR=${EXPERIMENT_DIR},TASK_NAME=${TASK_NAME},WANDB_PROJECT=${WANDB_PROJECT} \
     -o /dev/null \
     -e /dev/null \
    -m n \
pretrain/scripts/v4-upstream-training-template/task_template/params.sh

Lines changed: 2 additions & 2 deletions
@@ -47,9 +47,9 @@ ALL_PARAMS+=(
 
 # Scheduler
 ALL_PARAMS+=(
-    --train-iters ${TRAIN_ITERS}
+    --train-iters 100000
     --lr-warmup-iters 2000
-    --lr-decay-iters ${TRAIN_ITERS}
+    --lr-decay-iters 100000
     --lr-decay-style cosine
     --eval-interval 999999999
     --eval-iters 0
pretrain/scripts/v4-upstream-training-template/task_template/train_iters.txt

Lines changed: 0 additions & 1 deletion
This file was deleted.
