
Commit fa2a255

Elijah MacCarthy authored and secondspass committed
added multi-node tensorflow
1 parent 0e6a32b commit fa2a255

7 files changed: 148 additions and 2 deletions

frontier/sample_apps/jax/README.md

Lines changed: 1 addition & 1 deletion
@@ -9,4 +9,4 @@ The MNIST test has already been downloaded and located in the examples folder he
Proceed to submit the job with:
```
sbatch submit.sbatch
```

frontier/sample_apps/tensorflow/README.md

Lines changed: 1 addition & 1 deletion
@@ -8,4 +8,4 @@ apptainer pull tensorflow_latest.sif docker://rocm/tensorflow:latest
Submit the job with:
```
sbatch submit.sbatch
```
Lines changed: 12 additions & 0 deletions
@@ -0,0 +1,12 @@
# TensorFlow MNIST example

There are two common approaches to distributed training with data parallelism: (1) synchronous training, in which training steps are synchronized across workers and replicas, and (2) asynchronous training, in which training steps are not strictly synchronized.

Multi-node TensorFlow training is performed using multi-worker distributed training, which requires the `TF_CONFIG` environment variable to be set for training on multiple nodes. For more on `TF_CONFIG` and distributed training, refer to the official TensorFlow tutorial from which this example was borrowed: [Tensorflow/tutorial_distributed_training](https://www.tensorflow.org/tutorials/distribute/multi_worker_with_keras#multi-worker_configuration)

This example uses the `tf.distribute.MultiWorkerMirroredStrategy` API for multi-node distributed TensorFlow training. The `TF_CONFIG` environment variable is set in the `submit.sbatch` script with two workers.

Submit the job with:
```
sbatch submit.sbatch
```
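
The synchronous mode described in the README above can be illustrated with a toy example. This is a minimal pure-Python sketch, not what TensorFlow does internally: two simulated workers each compute a gradient on their own data shard, the gradients are averaged (standing in for the all-reduce step), and every replica applies the same update. The function names, the data, and the learning rate are all made up for illustration.

```python
def local_gradient(w, shard):
    # Gradient of the mean squared error for the model y = w * x,
    # computed on one worker's shard only.
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)


def synchronous_step(w, shards, lr):
    # Every worker computes a gradient on its own shard.
    grads = [local_gradient(w, shard) for shard in shards]
    # The gradients are averaged across workers (the all-reduce step).
    avg = sum(grads) / len(grads)
    # Every replica applies the same update, so the weights stay identical.
    return w - lr * avg


# Toy data generated from y = 3 * x, split evenly across two workers.
data = [(x, 3.0 * x) for x in (1.0, 2.0, 3.0, 4.0)]
shards = [data[:2], data[2:]]

w = 0.0
for _ in range(200):
    w = synchronous_step(w, shards, lr=0.01)

print(round(w, 3))  # w converges toward the true slope of 3.0
```

Because the shards are the same size, the averaged gradient equals the gradient over the full batch, which is why synchronous data parallelism behaves like single-worker training with a larger batch.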
Lines changed: 21 additions & 0 deletions
@@ -0,0 +1,21 @@
import os
import json

import tensorflow as tf
import mnist_setup

per_worker_batch_size = 64
#tf_config = json.loads(os.environ['TF_CONFIG'])
num_workers = 2 #len(tf_config['cluster']['worker'])

strategy = tf.distribute.MultiWorkerMirroredStrategy()

global_batch_size = per_worker_batch_size * num_workers
multi_worker_dataset = mnist_setup.mnist_dataset(global_batch_size)

with strategy.scope():
    # Model building/compiling need to be within `strategy.scope()`.
    multi_worker_model = mnist_setup.build_and_compile_cnn_model()


multi_worker_model.fit(multi_worker_dataset, epochs=3, steps_per_epoch=70)
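
The script above hard-codes `num_workers = 2` and leaves the `TF_CONFIG` lookup commented out. As a hedged sketch of how the worker count could instead be read from `TF_CONFIG` when it is set, the following uses a hypothetical helper `worker_count` and placeholder host names; neither is part of the commit.

```python
import json
import os


def worker_count(default=1):
    """Return the number of workers listed in TF_CONFIG, or `default` if unset."""
    raw = os.environ.get("TF_CONFIG")
    if not raw:
        return default
    cluster = json.loads(raw).get("cluster", {})
    return len(cluster.get("worker", [])) or default


# Hypothetical two-node cluster; the host names are placeholders.
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {"worker": ["node-a:12345", "node-b:12345"]},
    "task": {"type": "worker", "index": 0},
})

per_worker_batch_size = 64
num_workers = worker_count()
# The global batch is split evenly across workers by the strategy.
global_batch_size = per_worker_batch_size * num_workers
print(num_workers, global_batch_size)
```

With this approach the global batch size follows the node count requested from the scheduler, so the script would not need editing when the job is resized.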
Lines changed: 30 additions & 0 deletions
@@ -0,0 +1,30 @@
import os
import tensorflow as tf
import numpy as np

def mnist_dataset(batch_size):
    (x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
    # The `x` arrays are in uint8 and have values in the [0, 255] range.
    # You need to convert them to float32 with values in the [0, 1] range.
    x_train = x_train / np.float32(255)
    y_train = y_train.astype(np.int64)
    train_dataset = tf.data.Dataset.from_tensor_slices(
        (x_train, y_train)).shuffle(60000).repeat().batch(batch_size)
    return train_dataset

def build_and_compile_cnn_model():
    model = tf.keras.Sequential([
        tf.keras.layers.InputLayer(input_shape=(28, 28)),
        tf.keras.layers.Reshape(target_shape=(28, 28, 1)),
        tf.keras.layers.Conv2D(32, 3, activation='relu'),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(128, activation='relu'),
        tf.keras.layers.Dense(10)
    ])
    model.compile(
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        optimizer=tf.keras.optimizers.SGD(learning_rate=0.001),
        metrics=['accuracy'])
    return model
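
The model above ends in `Dense(10)` with no activation, so it emits raw logits, and the loss is built with `from_logits=True` so that the softmax is applied inside the loss. The per-example quantity can be sketched in pure Python as follows. This illustrates the standard formula, not Keras's actual implementation, and the function name is hypothetical.

```python
import math


def sparse_ce_from_logits(logits, label):
    """Cross-entropy for one example: -log(softmax(logits)[label])."""
    m = max(logits)  # subtract the max before exponentiating for stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


# Ten classes, as in MNIST. Uniform logits give a loss of log(10).
uniform = [0.0] * 10
loss_uniform = sparse_ce_from_logits(uniform, 3)

# A large logit on the correct class drives the loss toward zero.
confident = [0.0] * 10
confident[3] = 10.0
loss_confident = sparse_ce_from_logits(confident, 3)

print(loss_uniform, loss_confident)
```

The labels stay as integer class indices (hence "sparse"), which matches the `y_train.astype(np.int64)` conversion in `mnist_dataset`.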
Lines changed: 7 additions & 0 deletions
@@ -0,0 +1,7 @@
#!/bin/bash

#source /sw/frontier/python/3.10/miniforge3/23.11.0/bin/activate #/opt/miniforge/bin/activate
#python -c 'import tensorflow' 2> /dev/null && echo ‘Success’ || echo ‘Failure’

python -W ignore -u ./main.py #mnist_setup.py #multinode_olcf.py 2000 10 --master_addr=$MASTER_ADDR --master_port=3442
Lines changed: 76 additions & 0 deletions
@@ -0,0 +1,76 @@
#!/bin/bash
#SBATCH -A stf016
#SBATCH -J ddp_test
#SBATCH -o logs/frontier_apptainer_mltests.%j
#SBATCH -e logs/frontier_apptainer_mltests.%j
#SBATCH -t 00:30:00
#SBATCH -p batch
#SBATCH -N 2

# Only necessary if submitting like: sbatch --export=NONE ... (recommended)
# Do NOT include this line when submitting without --export=NONE
#unset SLURM_EXPORT_ENV

# Load modules

module load cray-mpich-abi/8.1.31
module load craype-accel-amd-gfx90a
module load rocm/5.7.1
module load miniforge3

read -ra arr <<< ${ips}

export NCCL_SOCKET_IFNAME=hsn0

export MASTER_ADDR=$(getent hosts $(scontrol show hostnames $SLURM_NODELIST | head -n1) | awk '{ print $1 }')
#export MASTER_ADDR=$(hostname -i)
echo "MASTER_ADDR=" $MASTER_ADDR
export MASTER_PORT=3442
export NCCL_SOCKET_IFNAME=hsn0
export GLOO_SOCKET_IFNAME=hsn0
# Needed to bypass MIOpen, Disk I/O Errors
export MIOPEN_USER_DB_PATH="/tmp/my-miopen-cache"
export MIOPEN_CUSTOM_CACHE_DIR=${MIOPEN_USER_DB_PATH}
rm -rf ${MIOPEN_USER_DB_PATH}
mkdir -p ${MIOPEN_USER_DB_PATH}

#export NCCL_IB_DISABLE=1
##export NCCL_DEBUG=INFO
#export TF_CPP_MIN_LOG_LEVEL=0
##export GRPC_VERBOSITY=debug
#export GRPC_TRACE=all
#export GRPC_ENABLE_FORK_SUPPORT=true
#export TF_FORCE_GPU_ALLOW_GROWTH=true
#export GRPC_ARG_ENABLE_IPV4_ONLY=true
#
#rm -rf ${MIOPEN_USER_DB_PATH}
#mkdir -p ${MIOPEN_USER_DB_PATH}
#
hosts=$(scontrol show hostnames $SLURM_JOB_NODELIST)
hosts_array=($hosts)
#
## Setup TF_CONFIG for each worker

TF_CONFIG=$(cat <<EOF
{
  "cluster": {
    "worker": ["${hosts_array[0]}:12345", "${hosts_array[1]}:23456"]
  },
  "task": {"type": "worker", "index": $rank}
}
EOF
)
#
#export TF_CONFIG="$TF_CONFIG"
#echo TF_CONFIG="$TF_CONFIG"
#
export MPICH_GPU_SUPPORT_ENABLED=1
export BINDS=/usr/share/libdrm,/var/spool/slurmd,/opt/cray,${PWD}
export APPTAINERENV_LD_LIBRARY_PATH="/opt/cray/pe/mpich/8.1.31/ofi/crayclang/17.0/lib-abi-mpich:/opt/cray/pe/mpich/8.1.31/gtl/lib:/opt/rocm-5.7.1/lib:$CRAY_LD_LIBRARY_PATH:$LD_LIBRARY_PATH:/opt/cray/pe/lib64"
export APPTAINER_CONTAINLIBS="/usr/lib64/libcxi.so.1,/usr/lib64/libjson-c.so.3,/lib64/libtinfo.so.6,/usr/lib64/libnl-3.so.200"

set -ex
# Run script
#
srun -N2 -n2 --gpus=16 --gpu-bind=closest apptainer exec --workdir `pwd` --rocm --bind $BINDS tensorflow_latest.sif ./pyrun.sh
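
In the batch script above, the `TF_CONFIG` heredoc refers to `$rank`, which is not assigned anywhere in the script as shown, and the `export TF_CONFIG` line is commented out. One possible alternative, sketched here under stated assumptions, is for each task to build its own `TF_CONFIG` in Python before the strategy is created. The helpers `build_tf_config` and `slurm_rank`, the single shared `base_port`, and the placeholder host names are all hypothetical. The sketch also assumes that the `SLURM_PROCID` value `srun` sets for each task lines up with the order of the host list.

```python
import json
import os


def build_tf_config(hosts, rank, base_port=12345):
    """Return the TF_CONFIG JSON string for worker `rank` in `hosts`."""
    if not 0 <= rank < len(hosts):
        raise ValueError(f"rank {rank} out of range for {len(hosts)} hosts")
    return json.dumps({
        "cluster": {"worker": [f"{host}:{base_port}" for host in hosts]},
        "task": {"type": "worker", "index": rank},
    })


def slurm_rank(default=0):
    """Task index from SLURM_PROCID, or `default` outside a Slurm job step."""
    return int(os.environ.get("SLURM_PROCID", default))


# Placeholder host names. In a job they could come from
# `scontrol show hostnames $SLURM_JOB_NODELIST`, as in the script above.
hosts = ["node-a", "node-b"]

# Each worker sees the same cluster description and a different task index.
for rank in range(len(hosts)):
    print(build_tf_config(hosts, rank))
```

Inside a job step, a task would call something like `os.environ["TF_CONFIG"] = build_tf_config(hosts, slurm_rank())` before constructing `tf.distribute.MultiWorkerMirroredStrategy()`.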
