
Commit 998f2f7

ZailiWang, jingxu10, chunyuan-w, and mzyczyns authored
Wangzl/r21100 doc additional update (#2346)
* .rst link fix
* bug fix in docstring
* misc link and wording fixes
* LLM model -> LLM
* correction and adding a note for disabling weights_prepack in ipex.optimize for TorchDynamo mode
* updating version in intro.rst && refine 'device' param doc for optimize_transformers
* symbol correction
* adding v2.1.100 release notes
* remove duplicated link
* Fix running Docker container as the root user
* revert device argument desc in optimize_transformers() docstring; solve indent problem in LLM README.md; fix image links in performance_tuning/tuning_guide.md

Co-authored-by: Jing Xu <[email protected]>
Co-authored-by: chunyuan-w <[email protected]>
Co-authored-by: Mikolaj Zyczynski <[email protected]>
1 parent b47faf5 commit 998f2f7

File tree: 12 files changed, +50 / -33 lines changed

docs/index.rst

Lines changed: 1 addition & 1 deletion
@@ -10,7 +10,7 @@ Optimizations take advantage of Intel® Advanced Vector Extensions 512 (Intel®
 Moreover, Intel® Extension for PyTorch* provides easy GPU acceleration for Intel discrete GPUs through the PyTorch* ``xpu`` device.
 
 In the current technological landscape, Generative AI (GenAI) workloads and models have gained widespread attention and popularity. Large Language Models (LLMs) have emerged as the dominant models driving these GenAI applications. Starting from 2.1.0, specific optimizations for certain
-LLM models are introduced in the Intel® Extension for PyTorch*. For more information on LLM optimizations, refer to the `Large Language Models (LLM) <tutorials/llm.rst>`_ section.
+LLMs are introduced in the Intel® Extension for PyTorch*. For more information on LLM optimizations, refer to the `Large Language Models (LLM) <tutorials/llm.html>`_ section.
 
 The extension can be loaded as a Python module for Python programs or linked as a C++ library for C++ programs. In Python scripts, users can enable it dynamically by importing ``intel_extension_for_pytorch``.
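A minimal sketch of the dynamic enablement mentioned in this hunk (the toy model below is illustrative, not from this commit):

```python
# Minimal sketch: the extension is enabled simply by importing it, after
# which ipex.optimize() applies CPU optimizations to an eager-mode model.
import torch
import intel_extension_for_pytorch as ipex  # dynamic enablement via import

model = torch.nn.Sequential(torch.nn.Linear(64, 64), torch.nn.ReLU()).eval()
data = torch.randn(1, 64)

model = ipex.optimize(model)  # FP32 inference path by default

with torch.no_grad():
    model(data)
```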

docs/tutorials/examples.md

Lines changed: 2 additions & 0 deletions
@@ -177,6 +177,8 @@ We recommend using Intel® Extension for PyTorch\* with [TorchScript](https://py
 [//]: # (marker_inf_bert_dynamo_fp32)
 [//]: # (marker_inf_bert_dynamo_fp32)
 
+*Note:* In TorchDynamo mode, since the native PyTorch operators like `aten::convolution` and `aten::linear` are well supported and optimized in `ipex` backend, we need to disable weights prepacking by setting `weights_prepack=False` in `ipex.optimize()`.
+
 #### BFloat16
 
 The `optimize` function works for both Float32 and BFloat16 data type. For BFloat16 data type, set the `dtype` parameter to `torch.bfloat16`.
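To illustrate the *Note* added in this hunk, a minimal TorchDynamo-mode inference sketch might look as follows; the BERT model and dummy input are illustrative (mirroring the `marker_inf_bert_dynamo_fp32` example), not new API:

```python
# Sketch of TorchDynamo-mode inference: weights prepacking is disabled because
# aten::convolution / aten::linear are already handled by the "ipex" backend.
import torch
import intel_extension_for_pytorch as ipex
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-uncased").eval()
data = torch.randint(model.config.vocab_size, size=[1, 128])

model = ipex.optimize(model, weights_prepack=False)
model = torch.compile(model, backend="ipex")

with torch.no_grad():
    model(data)
```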

docs/tutorials/introduction.rst

Lines changed: 1 addition & 1 deletion
@@ -16,7 +16,7 @@ the `Large Language Models (LLM) <llm.html>`_ section.
 
 Get Started
 -----------
-- `Installation <../../../index.html#installation?platform=cpu&version=v2.1.0%2Bcpu>`_
+- `Installation <../../../index.html#installation?platform=cpu&version=v2.1.100%2Bcpu>`_
 - `Quick Start <getting_started.md>`_
 - `Examples <examples.md>`_

docs/tutorials/llm.rst

Lines changed: 5 additions & 5 deletions
@@ -13,7 +13,7 @@ These LLM-specific optimizations can be automatically applied with a single fron
 
    llm/llm_optimize_transformers
 
-Supported Models
+Optimized Models
 ----------------
 
 .. list-table::
@@ -61,9 +61,9 @@ Supported Models
 
 \*\* For GPT-NEOX/FALCON/OPT models, the accuracy recipes of static quantization INT8 are not ready, thus, they will be skipped in our coverage.
 
-*Note*: The above verified models (including other models in the same model family, like "codellama/CodeLlama-7b-hf" from LLAMA family) are well supported with all optimizations like indirect access KV cache, fused ROPE, and prepacked TPP Linear (fp32/bf16). For other LLM model families, we are working in progress to cover those optimizations, which will expand the model list above.
+*Note*: The above verified models (including other models in the same model family, like "codellama/CodeLlama-7b-hf" from LLAMA family) are well optimized with all approaches like indirect access KV cache, fused ROPE, and prepacked TPP Linear (fp32/bf16). For other LLM families, we are working in progress to cover those optimizations, which will expand the model list above.
 
-Check `LLM best known practice <https://github.com/intel/intel-extension-for-pytorch/tree/v2.1.0%2Bcpu/examples/cpu/inference/python/llm>`_ for instructions to install/setup environment and example scripts..
+Check `LLM best known practice <../../examples/cpu/inference/python/llm>`_ for instructions to install/setup environment and example scripts..
 
 Demos
 -----
@@ -137,12 +137,12 @@ The section below provides a brief introduction to LLM optimization methodologie
 Linear Operator Optimization
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-Linear operator is the most obvious hotspot in LLMs inference. There are three backend to speedup linear GEMM kernels in Intel® Extension for PyTorch*. They are oneDNN, Tensor Processing Primitives (TPP), which are used by `Fast BERT feature <./fast_bert.md>`_, and customized linear kernels for weight only quantization. All of them use specific block format to utilize hardware resources in a highly efficient way.
+Linear operator is the most obvious hotspot in LLMs inference. There are three backend to speedup linear GEMM kernels in Intel® Extension for PyTorch*. They are oneDNN, Tensor Processing Primitives (TPP), which are used by `Fast BERT feature <./features/fast_bert.md>`_, and customized linear kernels for weight only quantization. All of them use specific block format to utilize hardware resources in a highly efficient way.
 
 Low Precision Data Types
 ~~~~~~~~~~~~~~~~~~~~~~~~
 
-While Generative AI (GenAI) workloads and models are getting more and more popular, large language models (LLM) used in these workloads are getting more and more parameters. The increasing size of LLM models enhances workload accuracies; however, it also leads to significantly heavier computations and places higher requirements to the underlying hardware. Given that, quantization becomes a more important methodology for inference workloads.
+While Generative AI (GenAI) workloads and models are getting more and more popular, LLMs used in these workloads are getting more and more parameters. The increasing size of LLMs enhances workload accuracies; however, it also leads to significantly heavier computations and places higher requirements to the underlying hardware. Given that, quantization becomes a more important methodology for inference workloads.
 
 Quantization with shorter data types benefits from its nature to improve memory IO throughputs and amount of computations on CPU. Moreover, shorter data types make it possible to keep more data in CPU cache, thus reducing memory access occurrences. Comparing to cache access, memory access is much more time costing. Specifically from computation perspective, AVX-512 Vector Neural Network Instructions (VNNI) instruction set shipped with the 2nd Generation Intel® Xeon® Scalable Processors and newer, as well as Intel® Advanced Matrix Extensions (Intel® AMX) instruction set shipped with the 4th Generation Intel® Xeon® Scalable Processors, provide instruction level accelerations to INT8 computations.
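As a concrete companion to the methodology described above, a minimal BF16 generation sketch built around the `ipex.optimize_transformers()` frontend referenced in this file might look as follows; the model id, prompt, and generation settings are illustrative assumptions, and the validated flow is in the LLM example scripts linked above:

```python
# Hedged sketch (not from this commit): BF16 LLM generation with the
# transformers frontend API. Model id and settings are illustrative only.
import torch
import intel_extension_for_pytorch as ipex
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "EleutherAI/gpt-j-6b"  # one of the verified model families
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16).eval()

# Apply the LLM-specific optimizations (indirect access KV cache, fused ROPE,
# prepacked TPP Linear) through the single frontend call.
model = ipex.optimize_transformers(model, dtype=torch.bfloat16)

inputs = tokenizer("What is Intel Extension for PyTorch?", return_tensors="pt")
with torch.no_grad(), torch.cpu.amp.autocast(dtype=torch.bfloat16):
    output = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```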

docs/tutorials/performance_tuning/tuning_guide.md

Lines changed: 2 additions & 2 deletions
@@ -37,7 +37,7 @@ On the Intel® Xeon® Scalable Processors with Intel® C620 Series Chipsets, (fo
 
 <div align="center">
 
-![Block Diagram of the Intel® Xeon® processor Scalable family microarchitecture](https://software.intel.com/content/dam/develop/external/us/en/images/xeon-processor-scalable-family-tech-overview-fig03-737410.png)
+![Block Diagram of the Intel® Xeon® processor Scalable family microarchitecture](../../../images/performance_tuning_guide/block_diagram_xeon_architecture.png)
 
 Figure 1: Block Diagram of the Intel® Xeon® processor Scalable family microarchitecture.
@@ -47,7 +47,7 @@ Usually, a CPU chip is called a socket. A typical two-socket configuration is il
 
 <div align="center">
 
-![Typical two-socket configuration](https://software.intel.com/content/dam/develop/external/us/en/images/xeon-processor-scalable-family-tech-overview-fig06-737410.png)
+![Typical two-socket configuration](../../../images/performance_tuning_guide/two_socket_config.png)
 
 Figure 2: Typical two-socket configuration.

docs/tutorials/releases.md

Lines changed: 16 additions & 0 deletions
@@ -1,6 +1,22 @@
 Releases
 =============
 
+## 2.1.100
+
+### Highlights
+
+- Improved the performance of BF16 LLM generation inference: [#2253](https://github.com/intel/intel-extension-for-pytorch/commit/99aa54f757de6c7d98f704edc6f8a83650fb1541) [#2251](https://github.com/intel/intel-extension-for-pytorch/commit/1d5e83d85c3aaf7c00323d7cb4019b40849dd2ed) [#2236](https://github.com/intel/intel-extension-for-pytorch/commit/be349962f3362f8afde4f083ec04d335245992bb) [#2278](https://github.com/intel/intel-extension-for-pytorch/commit/066c3bff417df084fa8e1d48375c0e1404320e95)
+
+- Added the optimization for Codegen: [#2257](https://github.com/intel/intel-extension-for-pytorch/commit/7c598e42e5b7899f284616c05c6896bf9d8bd2b8)
+
+- Provided the dockerfile and updated the related doc to improve the UX for LLM users: [#2229](https://github.com/intel/intel-extension-for-pytorch/commit/11484c3ebad9f868d0179a46de3d1330d9011822) [#2195](https://github.com/intel/intel-extension-for-pytorch/commit/0cd25021952bddcf5a364da45dfbefd4a0c77af4) [#2299](https://github.com/intel/intel-extension-for-pytorch/commit/76a42e516a68539752a3a8ab9aeb814d28c44cf8) [#2315](https://github.com/intel/intel-extension-for-pytorch/commit/4091bb5c0bf5f3c3ce5fbece291b44159a7fbf5c) [#2283](https://github.com/intel/intel-extension-for-pytorch/commit/e5ed8270d4d89bf68757f967676db57292c71920)
+
+- Improved the accuracy of the quantization path of LLMs: [#2280](https://github.com/intel/intel-extension-for-pytorch/commit/abc4c4e160cec3c792f5316e358173b8722a786e) [#2292](https://github.com/intel/intel-extension-for-pytorch/commit/4e212e41affa2ed07ffaf57bf10e9781113bc101) [#2275](https://github.com/intel/intel-extension-for-pytorch/commit/ed5957eb3b6190ad0be728656674f0a2a3b89158) [#2319](https://github.com/intel/intel-extension-for-pytorch/commit/1dae69de39408bc0ad245f4914d5f60e008a6eb3)
+
+- Misc fix and enhancement: [#2198](https://github.com/intel/intel-extension-for-pytorch/commit/ed1deccb86403e12e895227045d558117c5ea0fe) [#2264](https://github.com/intel/intel-extension-for-pytorch/commit/5dedcd6eb7bbf70dc92f0c20962fb2340e42e76f) [#2290](https://github.com/intel/intel-extension-for-pytorch/commit/c6e46cecd899317acfd2bd2a44a3f17b3cc1ce69)
+
+**Full Changelog**: https://github.com/intel/intel-extension-for-pytorch/compare/v2.1.0+cpu...v2.1.100+cpu
+
 ## 2.1.0
 
 ### Highlights

examples/cpu/inference/python/bert_torchdynamo_mode_inference_bf16.py

Lines changed: 1 addition & 1 deletion
@@ -12,7 +12,7 @@
 # Experimental Feature
 #################### code changes #################### # noqa F401
 import intel_extension_for_pytorch as ipex
-model = ipex.optimize(model, dtype=torch.bfloat16)
+model = ipex.optimize(model, dtype=torch.bfloat16, weights_prepack=False)
 model = torch.compile(model, backend="ipex")
 ###################################################### # noqa F401

examples/cpu/inference/python/llm/README.md

Lines changed: 7 additions & 8 deletions
@@ -1,7 +1,6 @@
 # Text Generation
 
-We provide the inference benchmarking scripts for large language models text generation.<br/>
-Support large language model families, including GPT-J, LLaMA, GPT-Neox, OPT, Falcon, CodeGen.<br/>
+We provide the inference benchmarking scripts for large language models (LLMs) text generation, by which several popular models in LLM family, including GPT-J, LLaMA, GPT-Neox, OPT, Falcon, CodeGen, are optimized.<br/>
 The scripts include both single instance and distributed (DeepSpeed) use cases.<br/>
 The scripts cover model generation inference with low precions cases for different models with best perf and accuracy (bf16 AMP,static quantization and weight only quantization).<br/>
@@ -20,15 +19,15 @@ The scripts cover model generation inference with low precions cases for differe
 
 \*\* For GPT-NEOX/FALCON/OPT/CodeGen models, the accuracy recipes of static quantization INT8 are not ready thus they will be skipped in our coverage.
 
-*Note:* The above verified models (including other models in the same model family, like "codellama/CodeLlama-7b-hf" from LLAMA family) are well supported with all optimizations like indirect access KV cache, fused ROPE, and prepacked Linear (fp32/bf16). For other LLM model families, we are working in progress to cover those optimizations, which will expand the model list above.
+*Note:* The above verified models (including other models in the same model family, like "codellama/CodeLlama-7b-hf" from LLAMA family) are well supported with all optimizations like indirect access KV cache, fused ROPE, and prepacked Linear (fp32/bf16). For other LLM families, we are working in progress to cover those optimizations, which will expand the model list above.
 
 # Models to be Optimized
 
-We are working on the optimizations of a wider range of popular LLM models. Models like BLOOM, ChatGLM2/ChatGLM3, T5, BaiChuan/BaiChuan2, StarCoder and CodeLlama are to be optimized in the next release, and more models like Dolly2, MPT, QWen, Mistral, etc. are on the way.
+We are working on the optimizations of a wider range of popular LLMs. Models like BLOOM, ChatGLM2/ChatGLM3, T5, BaiChuan/BaiChuan2, StarCoder and CodeLlama are to be optimized in the next release, and more models like Dolly2, MPT, QWen, Mistral, etc. are on the way.
 
 # Environment Setup
 
-1. Get the Intel® Extension for PyTorch\* source code
+1\. Get the Intel® Extension for PyTorch\* source code
 
 ```bash
 git clone https://github.com/intel/intel-extension-for-pytorch.git
@@ -38,7 +37,7 @@ git submodule sync
 git submodule update --init --recursive
 ```
 
-2.a. It is highly recommended to build a Docker container from the provided `Dockerfile`.
+2\.a. It is highly recommended to build a Docker container from the provided `Dockerfile`.
 
 ```bash
 # Build an image with the provided Dockerfile by compiling Intel® Extension for PyTorch\* from source
@@ -54,7 +53,7 @@ docker run --rm -it --privileged ipex-llm:2.1.100 bash
 cd llm
 ```
 
-2.b. Alternatively, you can take advantage of a provided environment configuration script to setup an environment without using a docker container.
+2\.b. Alternatively, you can take advantage of a provided environment configuration script to setup an environment without using a docker container.
 
 ```bash
 # GCC 12.3 is required. Installation can be taken care of by the environment configuration script.
@@ -67,7 +66,7 @@ cd examples/cpu/inference/python/llm
 bash ./tools/env_setup.sh
 ```
 
-3. Once an environment is configured with either method above, set necessary environment variables with an environment variables activation script and download the sample `prompt.json`.
+3\. Once an environment is configured with either method above, set necessary environment variables with an environment variables activation script and download the sample `prompt.json`.
 
 ```bash
 # Activate environment variables

examples/cpu/serving/triton/Dockerfile

Lines changed: 14 additions & 14 deletions
@@ -3,20 +3,20 @@
 
 FROM nvcr.io/nvidia/tritonserver:23.10-py3
 
-COPY requirements.txt requirements.txt
-RUN apt-get update && \
-    apt-get install --no-install-recommends -y numactl \
-    google-perftools \
-    python3.9 && \
-    ln -s /usr/bin/python3.9 /usr/bin/python && \
-    apt-get clean
+RUN useradd -m tritonuser && \
+    echo 'tritonuser ALL=(ALL) NOPASSWD: ALL' >> /etc/sudoers
 
-RUN python3 -m pip --no-cache-dir install -U --upgrade pip && \
+USER tritonuser
+WORKDIR /home/tritonuser
+
+COPY --chown=tritonuser:tritonuser requirements.txt requirements.txt
+ENV PATH="/home/tritonuser/.local/bin":${PATH}
+RUN python3 -m pip --no-cache-dir install -U --upgrade pip && \
     python3 -m pip --no-cache-dir install -U -r requirements.txt
 
-ENV LD_PRELOAD="/usr/local/lib/libiomp5.so:/usr/lib/x86_64-linux-gnu/libtcmalloc.so":${LD_PRELOAD}
-ENV KMP_BLOCKTIME=1
-ENV KMP_SETTINGS=1
-ENV KMP_AFFINITY=granularity=fine,compact,1,0
-ENV DNNL_PRIMITIVE_CACHE_CAPACITY=1024
-ENV TOKENIZERS_PARALLELISM=true
+ENV LD_PRELOAD="/home/tritonuser/.local/lib/libiomp5.so:/usr/lib/x86_64-linux-gnu/libtcmalloc.so":${LD_PRELOAD} \
+    KMP_BLOCKTIME=1 \
+    KMP_SETTINGS=1 \
+    KMP_AFFINITY=granularity=fine,compact,1,0 \
+    DNNL_PRIMITIVE_CACHE_CAPACITY=1024 \
+    TOKENIZERS_PARALLELISM=true

examples/cpu/serving/triton/start.sh

Lines changed: 1 addition & 1 deletion
@@ -40,7 +40,7 @@ start_host() {
 
 # Run Triton Host Server for specified model
 CORE_NUMBER=$(lscpu | grep 'Core(s) per socket:' | awk '{print $4}')
-docker run -it --read-only --rm -e OMP_NUM_THREADS=$CORE_NUMBER --privileged --shm-size=1g -p${1}:8000 -p8001:8001 -p8002:8002 --tmpfs /tmp:rw,noexec,nosuid,size=1g --tmpfs /root/.cache/:rw,noexec,nosuid,size=4g -v$(pwd)/backend:/models --name ai_inference_host ai_inference:v1 numactl -C 0-"$((CORE_NUMBER - 1))" -m 0 tritonserver --model-repository=/models --log-verbose 1 --log-error 1 --log-info 1
+docker run -it --rm -e OMP_NUM_THREADS=$CORE_NUMBER --cpuset-cpus 0-"$((CORE_NUMBER - 1))" --shm-size=1g -p${1}:8000 -p8001:8001 -p8002:8002 --tmpfs /tmp:rw,noexec,nosuid,size=1g -v$(pwd)/backend:/models:rw --name ai_inference_host ai_inference:v1 tritonserver --model-repository=/models --log-verbose 1 --log-error 1 --log-info 1
 }
 
 start_client() {
