
Commit 18b448f

aws-bowencc authored and hannanjgaws committed
add release notes and update README for 0.50 release
SIM: https://i.amazon.com/kaena-18274 cr: https://code.amazon.com/reviews/CR-95668318
1 parent bc829fc commit 18b448f

2 files changed: +4 -2 lines changed


README.md

Lines changed: 3 additions & 2 deletions
@@ -129,11 +129,11 @@ of NeuronCores participating in sharded matrix multiply operations) for
 Neuron-optimized transformer decoder models.

 1. The number of attention heads needs to be divisible by the
-   tensor-parallelism degree. (Not apply to GPT2, OPT and BLOOM, with 1-axis padding).
+   tensor-parallelism degree. (Note: this limitation only applies to NeoX/GPTJ; it will be removed in a future release.)
 2. The total data size of model weights and key-value caches needs to be
    smaller than 16 GB times the tensor-parallelism degree.
 3. Currently, the Neuron runtime supports tensor-parallelism degrees 1,
-   2, 8, 16 and 32 on Trn1 and supports tensor-parallelism degrees 1, 2, 4,
+   2, 8, 16, and 32 on Trn1/Trn1n and supports tensor-parallelism degrees 1, 2, 4,
    8, and 24 on Inf2.

 Some examples:
@@ -419,6 +419,7 @@ for running HuggingFace `facebook/opt-13b` autoregressive sampling on a trn1.2xl
 - [OPT](https://huggingface.co/docs/transformers/model_doc/opt)
 - [GPT-Neox [Experimental]](https://huggingface.co/docs/transformers/model_doc/gpt_neox)
 - [Bloom [Experimental]](https://huggingface.co/docs/transformers/model_doc/bloom)
+- [LLaMA [Experimental]](https://huggingface.co/docs/transformers/main/model_doc/llama)

 # Upcoming features

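
The constraints hunk above boils down to three checks on `tp_degree`. The sketch below spells them out in plain Python; `check_tp_degree`, its argument names, and the model-type handling are hypothetical illustrations, not part of the transformers-neuronx API.

```python
# Hypothetical helper (not part of transformers-neuronx): sanity-check a candidate
# tensor-parallelism degree against the three constraints in the README hunk above.

def check_tp_degree(tp_degree, n_attention_heads, weights_and_kv_cache_bytes,
                    instance_family="trn1", model_type="gptneox"):
    """Return a list of violated constraints (an empty list means tp_degree looks valid)."""
    # Constraint 3: degrees the Neuron runtime supports per instance family
    # (per the updated README line, the Trn1 set also covers Trn1n).
    supported = {
        "trn1": {1, 2, 8, 16, 32},
        "inf2": {1, 2, 4, 8, 24},
    }
    problems = []

    # Constraint 1: attention heads divisible by tp_degree.
    # Per this commit, GPT2/OPT/BLOOM are exempt thanks to 1-axis padding.
    padded_models = {"gpt2", "opt", "bloom"}
    if model_type not in padded_models and n_attention_heads % tp_degree != 0:
        problems.append("attention heads are not divisible by tp_degree")

    # Constraint 2: weights + key-value caches must fit in 16 GB times tp_degree.
    if weights_and_kv_cache_bytes > 16 * 1024**3 * tp_degree:
        problems.append("weights + KV caches exceed 16 GB x tp_degree")

    # Constraint 3: only certain degrees are supported by the runtime.
    if tp_degree not in supported[instance_family]:
        problems.append(f"tp_degree {tp_degree} is not supported on {instance_family}")

    return problems
```

For a rough sense of constraint 2: `facebook/opt-13b` in fp16 holds about 26 GB of weights, so `tp_degree=1` is already ruled out before the KV caches are counted, while `tp_degree=2` (a 32 GB budget) can work.
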

releasenotes.md

Lines changed: 1 addition & 0 deletions
@@ -6,6 +6,7 @@ Date: 2023-07-03

 - [Experimental] Added support for GPT-NeoX models.
 - [Experimental] Added support for BLOOM models.
+- [Experimental] Added support for LLaMA models.
 - Added support for more flexible tensor-parallel configurations to GPT2, OPT, and BLOOM. Previously, we had two constraints on `tp_degree`: 1) The attention heads needs to be evenly divisible by `tp_degree` 2) The `tp_degree` needs to satisfy the runtime topologies constraint for collective communication (i.e Allreduce). For more details on supported topologies, see: [Tensor-parallelism support](README.md#tensor-parallelism-support) and https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/arch/neuron-features/collective-communication.html. We now get rid of 1) by using 1-axis padding.
 - Added multi-query / multi-group attention support for GPT2.

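
The "more flexible tensor-parallel configurations" bullet above relies on 1-axis padding of the attention-head axis. The snippet below is only a rough illustration of that idea under assumed shapes (the helper name and the example dimensions are made up, not the library's actual code); the real implementation also has to pad the output projection consistently so the extra heads contribute nothing.

```python
# Rough illustration of 1-axis padding: zero-pad the attention-head axis
# so the padded head count divides the tensor-parallelism degree evenly.
import torch

def pad_heads_for_tp(weight, n_heads, head_dim, tp_degree):
    """Zero-pad a [hidden, n_heads * head_dim] attention weight along the head axis."""
    remainder = n_heads % tp_degree
    if remainder == 0:
        return weight, n_heads
    extra_heads = tp_degree - remainder
    hidden = weight.shape[0]
    padding = torch.zeros(hidden, extra_heads * head_dim, dtype=weight.dtype)
    return torch.cat([weight, padding], dim=1), n_heads + extra_heads

# Example: 20 heads of size 64 with tp_degree=8 are padded to 24 heads (3 per shard).
w = torch.randn(1280, 20 * 64)
w_padded, padded_heads = pad_heads_for_tp(w, n_heads=20, head_dim=64, tp_degree=8)
print(padded_heads, tuple(w_padded.shape))  # 24 (1280, 1536)
```
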
