
Commit ac0e184

AWS accelerators trn1_inf support
Signed-off-by: maheedhar reddy chappidi <[email protected]>
1 parent f70c41c commit ac0e184

# AWS accelerators support on Ray

## Background
*[Feel free to skip this section if you are already familiar with AI accelerators, AWS trn1 EC2 instances and NeuronCores]*

An AI accelerator is a dedicated processor designed to accelerate machine-learning (ML) computations. These are
specialized hardware designed to improve performance, reduce latency and reduce the cost of deploying ML-based applications.

In late 2022, AWS announced general availability of Trn1 EC2 instances, which are powered by AWS Trainium accelerators.
The AWS Trainium accelerator is an AI accelerator purpose-built for high-performance deep learning (DL) training of
generative AI models, including large language models (LLMs) and latent diffusion models [1].

Each Trainium accelerator (aka NeuronDevice) includes two second-generation NeuronCores (NeuronCore-v2) and is designed
to speed up model training. NeuronCore-v2 is a fully independent, heterogeneous compute unit with multiple engines
(Tensor/Vector/Scalar/... engines) and on-chip memory for maximizing data locality [2].
The Inferentia2 accelerator (inf2), which also provides NeuronCores, is designed to speed up inference.

## Summary

[Phase-1] Currently, Ray supports a limited set of accelerators (only NVIDIA hardware) as GPUs, which does not include
AWS Trainium/Inferentia2 accelerators.

[Phase-2] Also, Ray Train only supports PyTorch but not PyTorch/XLA (Accelerated Linear Algebra), which is the connector
between the PyTorch deep-learning framework and TPUs/NeuronCores.
Without these, customers can neither use AWS Trainium/Inferentia2 accelerators on a Ray cluster by default nor use them
for distributed training with Ray Train.

## Stewardship

### Required Reviewers
@scv119 @matthewdeng

### Shepherd of the Proposal (should be a senior committer)
@scv119 @matthewdeng

## Design and Architecture
### Phase1

On Ray node initialization, each Raylet represents its resource configuration with pre-defined resources
(CPU, GPU, object_store_memory, ...) and custom resources, which resolve to resource specifications.
These node specifications are advertised to the Ray scheduler, which uses them for work assignment.
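
As a concrete illustration of the resource model this relies on, here is a minimal sketch using existing Ray APIs; the `trn1_neuron_cores` custom-resource name is purely hypothetical:

```python
import ray

# Declare the node's pre-defined and custom resources explicitly; without
# overrides, the Raylet auto-detects CPUs/GPUs on start-up.
ray.init(num_cpus=8, num_gpus=2, resources={"trn1_neuron_cores": 4})


@ray.remote(resources={"trn1_neuron_cores": 1})
def train_shard():
    # The scheduler only places this task on a node that advertises the
    # "trn1_neuron_cores" custom resource with capacity available.
    return "scheduled on a node with a free NeuronCore"


print(ray.get(train_shard.remote()))
```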

Unlike distributed tasks, GPUs do not have Python interpreters. Instead of sending Python lambdas, high-level
tools like Torch and TensorFlow generate or call native GPU/accelerator code. CUDA and the Neuron SDK are examples of
low-level libraries for interacting with GPUs/accelerators.

Currently, Ray supports/detects only NVIDIA accelerators. We make the appropriate changes to make AWS accelerators
visible using the Neuron Runtime/Neuron SDK:

```text
On node initialization:
    if assigned_gpus:
        check NEURON_RT_VISIBLE_CORES <= assigned_gpus
    else:
        auto_detect_number_of_neuron_cores and claim as GPU
    Gather GPU type information if possible

On Raylet:
    Reserve the neuron_cores to the Raylet/WorkerNode by assigning the number
    of neuron-cores based on the assigned GPUs
    // For example, on a 32 neuron-core machine (i-1), if we initialize
    // the cluster with num_gpu=4, we would reserve neuron-cores [0, 1, 2, 3]
    // for the Raylet/WorkerNode

Lastly, add support for these accelerator_types in resources
and the auto-scaling NodeProvisioner
```
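
To make the detection step concrete, here is a minimal Python sketch of how a node could discover NeuronCores. It is not the actual Ray implementation; it assumes the Neuron SDK's `neuron-ls` CLI is installed, and the `nc_count` JSON field name is an assumption for illustration:

```python
import json
import os
import subprocess
from typing import Optional


def detect_neuron_core_count() -> Optional[int]:
    """Best-effort NeuronCore detection at node start-up (illustrative only)."""
    # If the operator already restricted visibility, honor that setting.
    visible = os.environ.get("NEURON_RT_VISIBLE_CORES")
    if visible:
        # NEURON_RT_VISIBLE_CORES accepts a comma-separated list or ranges, e.g. "0-3".
        cores = []
        for part in visible.split(","):
            if "-" in part:
                start, end = part.split("-")
                cores.extend(range(int(start), int(end) + 1))
            else:
                cores.append(int(part))
        return len(cores)

    # Otherwise, ask the Neuron SDK which devices are attached. The exact JSON
    # schema (a list of devices with an "nc_count" field) is assumed here.
    try:
        out = subprocess.check_output(["neuron-ls", "--json-output"])
        devices = json.loads(out)
        return sum(device.get("nc_count", 0) for device in devices)
    except (OSError, subprocess.CalledProcessError, ValueError, TypeError, AttributeError):
        return None  # No Neuron runtime on this node.
```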

### Phase2
Ray Train automatically sets up the distributed process group and provides utility methods to prepare your model and
data for distributed training. Ray Train provides TorchTrainer for data-parallel PyTorch training, which follows the
SPMD (single program, multiple data) paradigm. Each trainer/deep-learning framework is backed by a Backend, which
is used for distributed communication between workers/actors.

TorchBackend is the communication backend for TorchTrainer, and it supports a limited set of backends (nccl, gloo)
today. In order to support NeuronCores, we would use the PyTorch/XLA framework and configure the backend to XLA.
This also requires additional configuration of torch-elastic (now called torchrun) environment variables
so that the XLA devices can be detected, as sketched below.
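
For context, "configuring torch-elastic (torchrun) environment variables" means populating the standard rendezvous variables on every worker before the XLA process group is created. A hedged sketch follows; the exact set Ray Train would set, and the Neuron core-pinning line, are assumptions:

```python
import os


def configure_torch_elastic_env(rank: int, local_rank: int, world_size: int,
                                master_addr: str, master_port: int = 29500) -> None:
    # Standard torchrun/torch-elastic rendezvous variables that PyTorch/XLA
    # reads when initializing its distributed runtime.
    os.environ["MASTER_ADDR"] = master_addr
    os.environ["MASTER_PORT"] = str(master_port)
    os.environ["RANK"] = str(rank)
    os.environ["LOCAL_RANK"] = str(local_rank)
    os.environ["WORLD_SIZE"] = str(world_size)
    # Assumed for illustration: pin this worker to the NeuronCore assigned by the Raylet.
    os.environ.setdefault("NEURON_RT_VISIBLE_CORES", str(local_rank))
```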

```text
class _TorchBackend(Backend):
    def on_start():
        # support xla backend
        # Configure master env of xla device related to torchrun/torch-elastic
    def on_shutdown():
        # cleanup NeuronCore cache if needed
    def on_training_start():
        # configure rank/world_size/node_rank based on xla device


# Usage
trainer = TorchTrainer(
    train_func,
    train_loop_config=config,
    scaling_config=ScalingConfig(num_workers=2, use_gpu=True, resources_per_worker={"num_gpu": 1})
    ...
)
```
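
To illustrate what a user's `train_func` could look like once the XLA backend is in place, here is a hedged sketch built on the public PyTorch/XLA API (`torch_xla.core.xla_model`); it is illustrative, not the final Ray Train integration:

```python
import torch
import torch.nn as nn
import torch_xla.core.xla_model as xm  # PyTorch/XLA


def train_func(config):
    # Resolve the XLA device (a NeuronCore on trn1/inf2, a TPU core elsewhere)
    # owned by this Ray Train worker.
    device = xm.xla_device()
    model = nn.Linear(16, 1).to(device)
    optimizer = torch.optim.SGD(model.parameters(), lr=config.get("lr", 1e-3))
    loss_fn = nn.MSELoss()

    for _ in range(config.get("epochs", 2)):
        x = torch.randn(32, 16).to(device)
        y = torch.randn(32, 1).to(device)
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        # The XLA analogue of optimizer.step(): reduces gradients across
        # workers and steps the optimizer on the XLA device.
        xm.optimizer_step(optimizer)
```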

### Should this change be within `ray` or outside?
1. For auto-detection, the changes are within Ray Core.
2. For XLA backend support, the changes are within Ray Train.

### Compatibility
1. Able to auto-detect neuron cores as well as any existing accelerators:
```text
2023-07-27 22:48:08,742 INFO worker.py:1621 -- Started a local Ray instance.
{'node:__internal_head__': 1.0, 'CPU': 8.0, 'memory': 18270223566.0, 'object_store_memory': 9135111782.0, 'node:172.31.55.43': 1.0, 'GPU': 2.0}
(GPUActor pid=345602) ray.get_gpu_ids(): [0]
(GPUActor pid=345602) rt_visible_cores: 0
(GPUActor pid=345602) {'logits': tensor([[-1.4126, -1.9890, -1.3332, -0.2176, 3.9735, -0.6969, 1.8381]])}
(use_gpu pid=345710) ray.get_gpu_ids(): [1]
```
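
For reference, output of this shape can be reproduced with a small script along the following lines (the actor/task names mirror the log above; the model call that prints `logits` is omitted):

```python
import os

import ray


@ray.remote(num_gpus=1)
class GPUActor:
    def ping(self):
        print("ray.get_gpu_ids():", ray.get_gpu_ids())
        print("rt_visible_cores:", os.environ.get("NEURON_RT_VISIBLE_CORES"))


@ray.remote(num_gpus=1)
def use_gpu():
    print("ray.get_gpu_ids():", ray.get_gpu_ids())


ray.init()
print(ray.cluster_resources())
actor = GPUActor.remote()
ray.get(actor.ping.remote())
ray.get(use_gpu.remote())
```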

### Deprecation, Migration Plan
Not required.

### Test Plan and Acceptance Criteria
1. Add unit-test coverage for [Phase-1](#Phase1) auto-detection.
2. Manual testing using a real EC2 trn1 instance to validate the behavior.

### Future implementation
Add support for other deep-learning trainers (TensorFlow, ...) in Ray Train as [Neuron SDK](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/index.html) support follows.

### Related Issues
* [33504](https://github.com/ray-project/ray/issues/33504)
* [33586](https://github.com/ray-project/ray/issues/33586)
