
Commit 18b448f

aws-bowencc authored and hannanjgaws committed
add release notes and update README for 0.50 release
SIM: https://i.amazon.com/kaena-18274 cr: https://code.amazon.com/reviews/CR-95668318
1 parent bc829fc commit 18b448f

2 files changed: +4 -2 lines changed


README.md

Lines changed: 3 additions & 2 deletions
@@ -129,11 +129,11 @@ of NeuronCores participating in sharded matrix multiply operations) for
 Neuron-optimized transformer decoder models.

 1. The number of attention heads needs to be divisible by the
-   tensor-parallelism degree. (Not apply to GPT2, OPT and BLOOM, with 1-axis padding).
+   tensor-parallelism degree. (Note: this limitation only applies to NeoX/GPTJ; it will be removed in a future release.)
 2. The total data size of model weights and key-value caches needs to be
    smaller than 16 GB times the tensor-parallelism degree.
 3. Currently, the Neuron runtime supports tensor-parallelism degrees 1,
-   2, 8, 16 and 32 on Trn1 and supports tensor-parallelism degrees 1, 2, 4,
+   2, 8, 16, and 32 on Trn1/Trn1n and supports tensor-parallelism degrees 1, 2, 4,
    8, and 24 on Inf2.

 Some examples:
@@ -419,6 +419,7 @@ for running HuggingFace `facebook/opt-13b` autoregressive sampling on a trn1.2xl
 - [OPT](https://huggingface.co/docs/transformers/model_doc/opt)
 - [GPT-Neox [Experimental]](https://huggingface.co/docs/transformers/model_doc/gpt_neox)
 - [Bloom [Experimental]](https://huggingface.co/docs/transformers/model_doc/bloom)
+- [LLaMA [Experimental]](https://huggingface.co/docs/transformers/main/model_doc/llama)

 # Upcoming features

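
The constraints hunk above boils down to three checks on `tp_degree`. The sketch below spells them out in plain Python; `check_tp_degree`, its argument names, and the model-type handling are hypothetical illustrations, not part of the transformers-neuronx API.

```python
# Hypothetical helper (not part of transformers-neuronx): sanity-check a candidate
# tensor-parallelism degree against the three constraints in the README hunk above.

def check_tp_degree(tp_degree, n_attention_heads, weights_and_kv_cache_bytes,
                    instance_family="trn1", model_type="gptneox"):
    """Return a list of violated constraints (an empty list means tp_degree looks valid)."""
    # Constraint 3: degrees the Neuron runtime supports per instance family
    # (per the updated README line, the Trn1 set also covers Trn1n).
    supported = {
        "trn1": {1, 2, 8, 16, 32},
        "inf2": {1, 2, 4, 8, 24},
    }
    problems = []

    # Constraint 1: attention heads divisible by tp_degree.
    # Per this commit, GPT2/OPT/BLOOM are exempt thanks to 1-axis padding.
    padded_models = {"gpt2", "opt", "bloom"}
    if model_type not in padded_models and n_attention_heads % tp_degree != 0:
        problems.append("attention heads are not divisible by tp_degree")

    # Constraint 2: weights + key-value caches must fit in 16 GB times tp_degree.
    if weights_and_kv_cache_bytes > 16 * 1024**3 * tp_degree:
        problems.append("weights + KV caches exceed 16 GB x tp_degree")

    # Constraint 3: only certain degrees are supported by the runtime.
    if tp_degree not in supported[instance_family]:
        problems.append(f"tp_degree {tp_degree} is not supported on {instance_family}")

    return problems
```

For a rough sense of constraint 2: `facebook/opt-13b` in fp16 holds about 26 GB of weights, so `tp_degree=1` is already ruled out before the KV caches are counted, while `tp_degree=2` (a 32 GB budget) can work.
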

releasenotes.md

Lines changed: 1 addition & 0 deletions
@@ -6,6 +6,7 @@ Date: 2023-07-03

 - [Experimental] Added support for GPT-NeoX models.
 - [Experimental] Added support for BLOOM models.
+- [Experimental] Added support for LLaMA models.
 - Added support for more flexible tensor-parallel configurations to GPT2, OPT, and BLOOM. Previously, we had two constraints on `tp_degree`: 1) The attention heads needs to be evenly divisible by `tp_degree` 2) The `tp_degree` needs to satisfy the runtime topologies constraint for collective communication (i.e Allreduce). For more details on supported topologies, see: [Tensor-parallelism support](README.md#tensor-parallelism-support) and https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/arch/neuron-features/collective-communication.html. We now get rid of 1) by using 1-axis padding.
 - Added multi-query / multi-group attention support for GPT2.

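
The "more flexible tensor-parallel configurations" bullet above relies on 1-axis padding of the attention-head axis. The snippet below is only a rough illustration of that idea under assumed shapes (the helper name and the example dimensions are made up, not the library's actual code); the real implementation also has to pad the output projection consistently so the extra heads contribute nothing.

```python
# Rough illustration of 1-axis padding: zero-pad the attention-head axis
# so the padded head count divides the tensor-parallelism degree evenly.
import torch

def pad_heads_for_tp(weight, n_heads, head_dim, tp_degree):
    """Zero-pad a [hidden, n_heads * head_dim] attention weight along the head axis."""
    remainder = n_heads % tp_degree
    if remainder == 0:
        return weight, n_heads
    extra_heads = tp_degree - remainder
    hidden = weight.shape[0]
    padding = torch.zeros(hidden, extra_heads * head_dim, dtype=weight.dtype)
    return torch.cat([weight, padding], dim=1), n_heads + extra_heads

# Example: 20 heads of size 64 with tp_degree=8 are padded to 24 heads (3 per shard).
w = torch.randn(1280, 20 * 64)
w_padded, padded_heads = pad_heads_for_tp(w, n_heads=20, head_dim=64, tp_degree=8)
print(padded_heads, tuple(w_padded.shape))  # 24 (1280, 1536)
```
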
