Commit 321669f

Merge pull request #1802 from oracle-devrel/LLM_tunning
DeepSpeed LLM Training tunning
2 parents c61fc86 + b9a2a13 commit 321669f

File tree

8 files changed: +327 -0 lines changed
LICENSE
Lines changed: 35 additions & 0 deletions
@@ -0,0 +1,35 @@
Copyright (c) 2025 Oracle and/or its affiliates.

The Universal Permissive License (UPL), Version 1.0

Subject to the condition set forth below, permission is hereby granted to any
person obtaining a copy of this software, associated documentation and/or data
(collectively the "Software"), free of charge and under any and all copyright
rights in the Software, and any and all patent rights owned or freely
licensable by each licensor hereunder covering either (i) the unmodified
Software as contributed to or provided by such licensor, or (ii) the Larger
Works (as defined below), to deal in both

(a) the Software, and
(b) any piece of software and/or hardware listed in the lrgrwrks.txt file if
one is included with the Software (each a "Larger Work" to which the Software
is contributed by such licensors),

without restriction, including without limitation the rights to copy, create
derivative works of, display, perform, and distribute the Software and make,
use, sell, offer for sale, import, export, have made, and have sold the
Software and the Larger Work(s), and to sublicense the foregoing rights on
either these or other terms.

This license is subject to the following condition:
The above copyright notice and either this complete permission notice or at
a minimum a reference to the UPL must be included in all copies or
substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
README.md
Lines changed: 99 additions & 0 deletions
@@ -0,0 +1,99 @@
# Overview

This repository provides a step-by-step deployment of DeepSpeed training for Large Language Models (LLMs) on Oracle Cloud Infrastructure (OCI), using H100 GPU clusters with RDMA and SLURM.

This setup includes a tuned DeepSpeed configuration (`tuned_ds_config.json`) that provides up to **13% performance improvement** over standard configurations.

Reviewed: 06.06.2025

# When to use this asset?

Use this asset when you need to:
- Train large-scale language models on OCI with H100 hardware.
- Utilize RDMA-enabled SLURM clusters for distributed multi-node DeepSpeed training.
- Achieve improved throughput via custom-tuned DeepSpeed JSON configs.

# How to use this asset?
- Deploy the OCI HPC stack with multiple H100 instances.
- Improve training performance by applying the tuned DeepSpeed configuration to your LLM training job.

## Prerequisites & Docs

### Prerequisites

* An OCI tenancy with H100 GPU quota (shape: BM.GPU.H100.8).
* A [Huggingface](https://huggingface.co/) account with a valid Auth Token (see the login sketch below).
* SSH access to the deployed head node of your SLURM cluster.
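If the training script pulls the model or tokenizer from the Hugging Face Hub, the token must be available on the cluster nodes. A minimal sketch, assuming the `huggingface_hub` CLI is installed into the same Python environment used for training:

```bash
# Install the Hugging Face Hub CLI (assumption: same venv as the training job)
pip install -U "huggingface_hub[cli]"

# Authenticate non-interactively; substitute your own access token
export HF_TOKEN=<your-token>
huggingface-cli login --token "$HF_TOKEN"
```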
### Documentation & Resources

* [DeepSpeed Documentation](https://www.deepspeed.ai/docs/)
* [TinyLlama Model (HF)](https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0)
* [Mistral LLMs](https://mistral.ai/technology/#models)
* [OCI HPC Stack](https://github.com/oracle-quickstart/oci-hpc)

## Model Training Workflow
- Please refer to files/README.md for more details.

### Instance Configuration

The deployment uses a cluster of `BM.GPU.H100.8` bare metal instances, provisioned with cluster networking and RDMA.

The DeepSpeed job is submitted via SLURM using the `run_deepspeed.slurm` script. The environment includes a shared OCI File Storage (FSS) file system mounted on all nodes; a quick sanity check is sketched below.
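Before submitting jobs, it is worth confirming that SLURM sees every node and that the shared file system is mounted cluster-wide. A minimal sketch; the `/nfs/cluster` mount point is an assumption, so substitute your FSS export path:

```bash
# List cluster nodes and their state as seen by SLURM
sinfo -N -l

# Run one check per node to verify the shared mount exists everywhere
# (assumption: the FSS/NFS share is mounted at /nfs/cluster)
srun --nodes=4 --ntasks-per-node=1 -l findmnt /nfs/cluster
```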
### DeepSpeed Tuned Configuration

The `tuned_ds_config.json` applies the following optimizations:
- Switched from fp16 to bf16 (optimal for H100)
- Enabled `overlap_comm` and `contiguous_gradients`, and increased bucket sizes
- Used `gradient_accumulation_steps=8` to balance memory use and throughput
- Tweaked `aio` settings for better I/O performance during training
- Removed optimizer/parameter offloading to fully utilize GPU RAM

These optimizations are benchmarked to deliver up to **13% faster training throughput** on OCI H100 clusters.
### Launch Training Job

Submit your training job using SLURM:

```bash
sbatch $HOME/scripts/run_deepspeed.slurm
```
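Once submitted, the job can be tracked with standard SLURM commands; the log file name below assumes the SLURM default (`slurm-<jobid>.out`), so adjust it if the job script sets `--output`:

```bash
# Show your queued and running jobs
squeue -u "$USER"

# Inspect a specific job's allocation and state (replace <jobid>)
scontrol show job <jobid>

# Follow the training log
tail -f slurm-<jobid>.out
```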
The job script uses:
- `train.py`: your LLM training script
- `tuned_ds_config.json`: the DeepSpeed configuration file
- Local datasets and a Hugging Face model/tokenizer
### Example curl Test (after model fine-tuning)

To query the trained model via an OpenAI-compatible API:

```bash
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "your-model-name",
    "prompt": "A GPU is a",
    "max_tokens": 128,
    "temperature": 0.7
  }'
```
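The asset does not prescribe a serving stack for this step; one common choice is vLLM, which exposes exactly this OpenAI-compatible endpoint. A minimal sketch, assuming vLLM is installed and the fine-tuned weights were saved in Hugging Face format under `$HOME/output`:

```bash
# Serve the fine-tuned checkpoint on port 8000
# (assumptions: vLLM installed, model directory is $HOME/output)
python -m vllm.entrypoints.openai.api_server \
  --model "$HOME/output" \
  --port 8000
```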
## Notes

To train larger models like Mixtral or Mistral 7B on H100, make sure to:
- Scale the number of nodes appropriately (see the override sketch below)
- Use quantization or tensor parallelism when needed
- Ensure models and datasets fit into GPU memory with DeepSpeed ZeRO optimization
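The node count can be changed at submission time without editing the job script, since `sbatch` command-line options override `#SBATCH` directives:

```bash
# Run the same job on 8 nodes instead of the 4 set in the script
sbatch --nodes=8 $HOME/scripts/run_deepspeed.slurm
```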
# Acknowledgments

- **Author** - Deepak Soni (GPU Black Belt)

# License

Copyright (c) 2025 Oracle and/or its affiliates.

Licensed under the Universal Permissive License (UPL), Version 1.0.

See [LICENSE](https://github.com/oracle-devrel/technology-engineering/blob/main/LICENSE) for more details.
files/README.md
Lines changed: 56 additions & 0 deletions
@@ -0,0 +1,56 @@
# DeepSpeed LLM Training on OCI H100 SLURM Cluster

This repository automates deployment of a multi-node SLURM cluster with RDMA-enabled H100 GPUs on OCI for training large language models using DeepSpeed.

## Tuned Configuration

We developed a custom-tuned `tuned_ds_config.json` tailored for:
- Multi-node training
- An RDMA-aware NCCL backend
- H100’s bfloat16-optimized tensor cores
- DeepSpeed ZeRO Stage 2 with communication overlap

The `tuned_ds_config.json` includes:
- Switched from fp16 to bf16 (optimal for H100)
- Enabled `overlap_comm` and `contiguous_gradients`, and increased bucket sizes
- Used `gradient_accumulation_steps=8` to balance memory use and throughput
- Tweaked `aio` settings for better I/O performance during training
- Removed optimizer/parameter offloading to fully utilize GPU RAM

This configuration delivers up to **13% more training throughput** versus default settings on OCI H100 infrastructure.
## Results with the Updated Configuration

- Training throughput improved by ~13%
- GPU utilization increased more consistently across all 8 nodes
- Communication latency reduced on the RDMA fabric
- No stability or memory issues observed with ZeRO Stage 2

## 📂 Contents

- `scripts/tuned_ds_config.json` – optimized DeepSpeed configuration
- `scripts/run_deepspeed.slurm` – SLURM job script
- `README.md` – usage overview and tuning explanation
## Usage

1. Deploy the SLURM H100 cluster on OCI.
2. SSH to the master node.
3. Submit the job:

```bash
sbatch run_deepspeed.slurm
```

Model output and logs will be written to `$HOME/output`.
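To spot-check GPU utilization while the job runs, one option is to query `nvidia-smi` on every node of the running allocation (a sketch; replace `<jobid>` with the ID reported by `sbatch`):

```bash
# One task per node, printing per-GPU utilization and memory in CSV form;
# --overlap lets the step share resources with the running training step
srun --jobid=<jobid> --overlap --ntasks-per-node=1 -l \
  nvidia-smi --query-gpu=index,utilization.gpu,memory.used --format=csv
```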
## Conclusion

- NCCL tuning alone isn’t always sufficient: framework-level configuration (DeepSpeed) must align with the hardware.
- H100 GPUs benefit significantly from bfloat16 and increased communication overlap.
- ZeRO Stage 2 provided a solid balance of memory efficiency and speed. ZeRO-3 is reserved for future scaling.
- System-aware configuration (bucket sizes, threading, and memory layout) is essential for reaching peak performance.

## Next Steps

- Benchmark with ZeRO Stage 3 for models approaching GPU memory limits (a starting point is sketched below).
- Test pipeline parallelism on >16-node jobs.
- Evaluate DeepSpeed 0.13+ features such as NVMe offloading and optimizer fusion on upcoming jobs.
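For the ZeRO Stage 3 benchmark, a low-effort starting point is to derive a Stage 3 variant of the tuned config and compare runs side by side. A sketch; the output file name is hypothetical:

```bash
# Derive a ZeRO Stage 3 variant of the tuned config
python - <<'EOF'
import json

with open("tuned_ds_config.json") as f:
    cfg = json.load(f)

# Stage 3 additionally partitions the model parameters across ranks
cfg["zero_optimization"]["stage"] = 3

with open("tuned_ds_config_zero3.json", "w") as f:
    json.dump(cfg, f, indent=2)
EOF
```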
exec_torchrun.sh
Lines changed: 58 additions & 0 deletions
@@ -0,0 +1,58 @@
#!/bin/bash

set -ex

# Activate the Python environment that provides torch/DeepSpeed
source myenv/bin/activate

## NCCL parameter configuration based on the OCI H100 GPU instance deployment
export NCCL_TIMEOUT=1800

export NCCL_IGNORE_CPU_AFFINITY=1
export OMPI_MCA_coll_hcol_enable=0
export NCCL_CROSS_NIC=2
export NCCL_SOCKET_NTHREADS=16
export NCCL_DEBUG=WARN   # valid NCCL levels: VERSION, WARN, INFO, TRACE
export NCCL_CUMEM_ENABLE=0
export NCCL_IB_SPLIT_DATA_ON_QPS=0
export NCCL_IB_QPS_PER_CONNECTION=16
export NCCL_IB_GID_INDEX=3
# RDMA NICs used for NCCL traffic on the BM.GPU.H100.8 shape
export NCCL_IB_HCA="mlx5_0,mlx5_1,mlx5_3,mlx5_4,mlx5_5,mlx5_6,mlx5_7,mlx5_8,mlx5_9,mlx5_10,mlx5_12,mlx5_13,mlx5_14,mlx5_15,mlx5_16,mlx5_17"
export NCCL_IB_TC=41
export NCCL_IB_SL=0
export NCCL_IB_TIMEOUT=22
export HCOLL_ENABLE_MCAST_ALL=0
export UCX_TLS=tcp
export UCX_NET_DEVICES=eth0
export RX_QUEUE_LEN=8192
export NCCL_SOCKET_IFNAME=eth0

export OMP_NUM_THREADS=16   # ideally the number of CPU cores / number of GPUs per node

export GPUS_PER_NODE=8

# Rendezvous settings: rank 0 lives on the first node of the allocation
MASTER_NODE=$(scontrol show hostname | head -n 1)
export MASTER_ADDR=$(scontrol show node=$MASTER_NODE | awk -F= '/NodeAddr=/{print $2}' | awk '{print $1}')
# With one SLURM task per node, the task count equals the node count
export NNODES=$SLURM_NTASKS
export NODE_RANK=$SLURM_NODEID
export MASTER_PORT=9001
export WORLD_SIZE_JOB=$SLURM_NTASKS
export DISTRIBUTED_ARGS="--nproc_per_node $GPUS_PER_NODE --nnodes $NNODES --node_rank $NODE_RANK --master_addr $MASTER_ADDR --master_port $MASTER_PORT "

# Launch one training process per GPU on this node; torchrun performs the
# cross-node rendezvous via MASTER_ADDR/MASTER_PORT
torchrun $DISTRIBUTED_ARGS \
    train.py \
    --model_config tuned_ds_config.json \
    --tokenizer_name TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
    --dataset_mixer data_mixer.json \
    --dataset_name mix \
    --dataset_type local \
    --dataset_packed \
    --batch_size 12 \
    --gradient_checkpointing \
    --max_train_steps 1000000 \
    --val_after_steps 10000 \
    --num_warmup_steps 10000 \
    --learning_rate 1e-4 \
    --num_gpus_node $GPUS_PER_NODE \
    --gradient_clipping 1 \
    --gradient_accumulation_steps 2 \
    --dataset_cache "./hf-cache"
run_deepspeed.slurm
Lines changed: 6 additions & 0 deletions
@@ -0,0 +1,6 @@
#!/bin/bash

#SBATCH --nodes=4
#SBATCH --job-name=deepspeed-performance-test
#SBATCH --exclusive

# One task per node (the SLURM default when only --nodes is set), so each node
# runs exec_torchrun.sh once and torchrun spawns the eight per-GPU workers locally
srun -l exec_torchrun.sh
tuned_ds_config.json
Lines changed: 73 additions & 0 deletions
@@ -0,0 +1,73 @@
{
  "train_batch_size": 2048,
  "train_micro_batch_size_per_gpu": 32,
  "gradient_accumulation_steps": 8,
  "steps_per_print": 100,
  "wall_clock_breakdown": false,

  "bf16": {
    "enabled": true
  },

  "optimizer": {
    "type": "AdamW",
    "params": {
      "lr": 1e-4,
      "betas": [0.9, 0.999],
      "eps": 1e-8,
      "weight_decay": 0.01
    }
  },

  "scheduler": {
    "type": "WarmupLR",
    "params": {
      "warmup_min_lr": 0,
      "warmup_max_lr": 1e-4,
      "warmup_num_steps": 10000
    }
  },

  "zero_optimization": {
    "stage": 2,
    "allgather_partitions": true,
    "allgather_bucket_size": 5e8,
    "reduce_scatter": true,
    "reduce_bucket_size": 5e8,
    "overlap_comm": true,
    "contiguous_gradients": true
  },

  "gradient_clipping": 1.0,

  "activation_checkpointing": {
    "partition_activations": true,
    "contiguous_memory_optimization": true,
    "cpu_checkpointing": false,
    "number_checkpoints": null
  },

  "aio": {
    "block_size": 1048576,
    "queue_depth": 16,
    "single_submit": false,
    "overlap_events": true
  },

  "flops_profiler": {
    "enabled": false,
    "profile_step": 10,
    "module_depth": -1,
    "top_modules": 1,
    "detailed": true,
    "output_file": null
  },

  "elasticity": {
    "enabled": false
  },

  "gradient_accumulation_plugin": {
    "enabled": true
  }
}
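A consistency note on the batch arithmetic, which DeepSpeed checks at startup: `train_batch_size` must equal `train_micro_batch_size_per_gpu` × `gradient_accumulation_steps` × the data-parallel world size. The worked check below shows this file as written matches a single 8-GPU node; the 4-node job above is consistent if the launcher's `--gradient_accumulation_steps 2` overrides the JSON value:

```bash
# train_batch_size = micro_batch_per_gpu * grad_accum_steps * world_size
python -c 'print(32 * 8 * 8)'    # 2048 -> one BM.GPU.H100.8 node with accum=8
python -c 'print(32 * 2 * 32)'   # 2048 -> four nodes (32 GPUs) with accum=2
```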
