# AWS accelerators support on Ray

## Background
*[Feel free to skip this section if you are already familiar with AI accelerators, AWS Trn1 EC2 instances, and NeuronCores.]*

An AI accelerator is a dedicated processor designed to accelerate machine-learning (ML) computations. Accelerators are
specialized hardware designed to improve performance, reduce latency, and reduce the cost of deploying ML-based applications.

In late 2022, AWS announced the general availability of Trn1 EC2 instances, which are powered by AWS Trainium accelerators.
The AWS Trainium accelerator is an AI accelerator purpose-built for high-performance deep-learning (DL) training of
generative AI models, including large language models (LLMs) and latent diffusion models [1].

Each Trainium accelerator (aka NeuronDevice) includes two second-generation NeuronCores (NeuronCore-v2)
and is designed to speed up model training. NeuronCore-v2 is a fully independent, heterogeneous compute unit
with multiple engines (Tensor/Vector/Scalar/... engines) and on-chip memory for maximizing data locality [2].
Similarly, the Inferentia2 accelerator (available on Inf2 instances, which also expose NeuronCores) is designed to speed up inference.

## Summary

[Phase-1] Currently, Ray supports only a limited set of accelerators (NVIDIA hardware) for GPUs, which does not include
AWS Trainium/Inferentia2 accelerators.

[Phase-2] Also, Ray Train supports PyTorch but not PyTorch/XLA (Accelerated Linear Algebra), the connector
between the PyTorch deep-learning framework and XLA devices such as TPUs and NeuronCores.
Without these changes, customers can neither use AWS Trainium/Inferentia2 accelerators on a Ray cluster by default nor use them for
distributed training on Ray Train.

## Stewardship

### Required Reviewers
@scv119 @matthewdeng

### Shepherd of the Proposal (should be a senior committer)
@scv119 @matthewdeng


## Design and Architecture
### Phase1

On Ray node initialization, each Raylet represents its resource configuration with pre-defined resources
(CPU, GPU, object_store_memory, ...) and custom resources, which resolve to a resource specification.
These node specifications are advertised to the Ray scheduler, which uses them for work assignment.
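For illustration, a node's resource specification can also be declared explicitly at startup; in the sketch below, `neuron_cores` is a hypothetical custom-resource name used only for illustration, not an existing Ray convention:

```python
import ray

# Pre-defined resources (CPU/GPU) plus a custom resource; the Raylet
# advertises this specification to the scheduler for work assignment.
# "neuron_cores" is a hypothetical custom-resource name.
ray.init(num_cpus=8, num_gpus=4, resources={"neuron_cores": 32})
print(ray.cluster_resources())
```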

Unlike distributed tasks, GPUs do not run Python interpreters. Instead of sending Python lambdas, high-level
tools like Torch and TensorFlow generate or call native GPU/accelerator code. CUDA and the Neuron SDK are examples of
low-level libraries for interacting with GPUs/accelerators.

Currently, Ray supports/detects only NVIDIA accelerators. We propose the changes needed to make AWS accelerators visible
using the Neuron Runtime/Neuron SDK.

```text
On node initialization:
if assigned_gpus:
    check NEURON_RT_VISIBLE_CORES <= assigned_gpus
else:
    auto_detect_number_of_neuron_cores and claim as GPU
Gather GPU type information if possible

On Raylet:
Reserve the neuron_cores to the Raylet/WorkerNode by assigning the number
of neuron-cores based on the assigned GPUs
// For example, on a 32-neuron-core machine (i-1), if we initialize
// the cluster with num_gpu=4, we would reserve neuron-cores [0, 1, 2, 3]
// for the Raylet/WorkerNode

Lastly, add support for these accelerator_type on resources
and the auto-scaling NodeProvisioner
```
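A minimal sketch of the auto-detection step, assuming the Neuron SDK's `neuron-ls` tool is installed on the node; the `--json-output` flag and the per-device `nc_count` field are assumptions made for illustration, not confirmed interfaces:

```python
import json
import subprocess

def autodetect_num_neuron_cores() -> int:
    """Count NeuronCores via the Neuron SDK's neuron-ls tool.

    Returns 0 when neuron-ls is unavailable, i.e. on non-Neuron nodes.
    """
    try:
        # Assumed: one JSON record per NeuronDevice, each reporting its
        # core count (two NeuronCore-v2 per Trainium device on Trn1).
        out = subprocess.run(
            ["neuron-ls", "--json-output"], capture_output=True, check=True
        )
        return sum(dev.get("nc_count", 0) for dev in json.loads(out.stdout))
    except (FileNotFoundError, subprocess.CalledProcessError, json.JSONDecodeError):
        return 0
```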

### Phase2
Ray Train automatically sets up the distributed process group and provides utility methods to prepare your model and data
for distributed training. Ray Train supports TorchTrainer for data-parallel PyTorch training, which follows the
SPMD (single program, multiple data) paradigm. Each trainer/deep-learning framework is backed by a Backend that
handles distributed communication between workers/actors.
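For context, a trivial training function using these utilities (standard Ray Train API; the model, data, and hyperparameters are placeholders):

```python
import torch
import ray.train.torch

def train_func(config):
    model = torch.nn.Linear(10, 1)
    # prepare_model moves the model to the worker's assigned device and
    # wraps it for data-parallel (SPMD) training.
    model = ray.train.torch.prepare_model(model)
    optimizer = torch.optim.SGD(model.parameters(), lr=config["lr"])
    for _ in range(config["epochs"]):
        optimizer.zero_grad()
        loss = model(torch.randn(16, 10)).square().mean()
        loss.backward()
        optimizer.step()
```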

TorchBackend is the communication backend for TorchTrainer, and it supports a limited set of backends (nccl, gloo) today.
In order to support NeuronCores, we would use the PyTorch/XLA framework and configure the backend to xla.
This also requires additional configuration of torch-elastic (now called torchrun) environment variables
so that the XLA devices are detected.

```text
class _TorchBackend(Backend):
    def on_start():
        # support xla backend
        # configure master env of the xla device related to torchrun/torch-elastic
    def on_shutdown():
        # clean up NeuronCore cache if needed
    def on_training_start():
        # configure rank/world_size/node_rank based on the xla device


# Usage
trainer = TorchTrainer(
    train_func,
    train_loop_config=config,
    scaling_config=ScalingConfig(num_workers=2, use_gpu=True, resources_per_worker={"GPU": 1}),
    ...
)
```
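A minimal sketch of the per-worker environment setup such a backend would perform; the variable names follow torchrun/torch-elastic conventions, and the helper itself is hypothetical:

```python
import os

def _set_xla_torchrun_env(master_addr: str, master_port: int,
                          rank: int, local_rank: int, world_size: int) -> None:
    # torchrun/torch-elastic style process-group variables that the XLA
    # runtime reads when wiring up distributed workers.
    os.environ["MASTER_ADDR"] = master_addr
    os.environ["MASTER_PORT"] = str(master_port)
    os.environ["RANK"] = str(rank)
    os.environ["LOCAL_RANK"] = str(local_rank)
    os.environ["WORLD_SIZE"] = str(world_size)

# Inside train_func, the process group can then target XLA:
#   import torch.distributed as dist
#   import torch_xla.distributed.xla_backend  # registers the "xla" backend
#   dist.init_process_group(backend="xla")
```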

### Should this change be within `ray` or outside?
1. For auto-detection, the changes are within Ray Core
2. For XLA backend support, the changes are within Ray Train

### Compatibility
1. Able to auto-detect NeuronCores as well as any existing accelerators (sample output and a sketch of the producing actor below)
```text
2023-07-27 22:48:08,742 INFO worker.py:1621 -- Started a local Ray instance.
{'node:__internal_head__': 1.0, 'CPU': 8.0, 'memory': 18270223566.0, 'object_store_memory': 9135111782.0, 'node:172.31.55.43': 1.0, 'GPU': 2.0}
(GPUActor pid=345602) ray.get_gpu_ids(): [0]
(GPUActor pid=345602) rt_visible_cores: 0
(GPUActor pid=345602) {'logits': tensor([[-1.4126, -1.9890, -1.3332, -0.2176, 3.9735, -0.6969, 1.8381]])}
(use_gpu pid=345710) ray.get_gpu_ids(): [1]
```
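For reference, a minimal sketch of the kind of actor that produces the `GPUActor` lines above (the model call behind the `logits` line is omitted):

```python
import os
import ray

ray.init()  # NeuronCores are auto-detected and claimed as GPU resources
print(ray.cluster_resources())

@ray.remote(num_gpus=1)
class GPUActor:
    def ping(self):
        # The cores assigned to this actor show up via ray.get_gpu_ids(),
        # and NEURON_RT_VISIBLE_CORES scopes the Neuron Runtime to them.
        print("ray.get_gpu_ids():", ray.get_gpu_ids())
        print("rt_visible_cores:", os.environ.get("NEURON_RT_VISIBLE_CORES"))

ray.get(GPUActor.remote().ping.remote())
```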

### Deprecation, Migration Plan
Not required

### Test Plan and Acceptance Criteria
1. Add unit-test coverage for [Phase-1](#phase1) auto-detection
2. Manual testing using a real EC2 Trn1 instance to validate the behavior

### Future implementation
Add support for other deep-learning trainers (TensorFlow, ...) on Ray Train as [Neuron SDK](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/index.html) support follows.

### Related Issues
* [33504](https://github.com/ray-project/ray/issues/33504)
* [33586](https://github.com/ray-project/ray/issues/33586)