
Commit 998f2f7

ZailiWang, jingxu10, chunyuan-w, and mzyczyns authored
Wangzl/r21100 doc additional update (#2346)
* .rst link fix
* bug fix in docstring
* misc link and wording fixes
* LLM model -> LLM
* correction and adding a note for disabling weights_prepack in ipex.optimize for TorchDynamo mode
* updating version in intro.rst && refine 'device' param doc for optimize_transformers
* symbol correction
* adding v2.1.100 release notes
* remove duplicated link
* Fix running Docker container as the root user
* revert device argument desc in optimize_transformers() docstring; solve indent problem in LLM README.md; fix image links in performance_tuning/tuning_guide.md

Co-authored-by: Jing Xu <[email protected]>
Co-authored-by: chunyuan-w <[email protected]>
Co-authored-by: Mikolaj Zyczynski <[email protected]>
1 parent b47faf5 commit 998f2f7

File tree: 12 files changed, +50 / -33 lines changed

docs/index.rst

Lines changed: 1 addition & 1 deletion
@@ -10,7 +10,7 @@ Optimizations take advantage of Intel® Advanced Vector Extensions 512 (Intel®
 Moreover, Intel® Extension for PyTorch* provides easy GPU acceleration for Intel discrete GPUs through the PyTorch* ``xpu`` device.
 
 In the current technological landscape, Generative AI (GenAI) workloads and models have gained widespread attention and popularity. Large Language Models (LLMs) have emerged as the dominant models driving these GenAI applications. Starting from 2.1.0, specific optimizations for certain
-LLM models are introduced in the Intel® Extension for PyTorch*. For more information on LLM optimizations, refer to the `Large Language Models (LLM) <tutorials/llm.rst>`_ section.
+LLMs are introduced in the Intel® Extension for PyTorch*. For more information on LLM optimizations, refer to the `Large Language Models (LLM) <tutorials/llm.html>`_ section.
 
 The extension can be loaded as a Python module for Python programs or linked as a C++ library for C++ programs. In Python scripts, users can enable it dynamically by importing ``intel_extension_for_pytorch``.
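A minimal sketch of the dynamic enablement mentioned in this hunk (the toy model below is illustrative, not from this commit):

```python
# Minimal sketch: the extension is enabled simply by importing it, after
# which ipex.optimize() applies CPU optimizations to an eager-mode model.
import torch
import intel_extension_for_pytorch as ipex  # dynamic enablement via import

model = torch.nn.Sequential(torch.nn.Linear(64, 64), torch.nn.ReLU()).eval()
data = torch.randn(1, 64)

model = ipex.optimize(model)  # FP32 inference path by default

with torch.no_grad():
    model(data)
```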

docs/tutorials/examples.md

Lines changed: 2 additions & 0 deletions
@@ -177,6 +177,8 @@ We recommend using Intel® Extension for PyTorch\* with [TorchScript](https://py
 [//]: # (marker_inf_bert_dynamo_fp32)
 [//]: # (marker_inf_bert_dynamo_fp32)
 
+*Note:* In TorchDynamo mode, since the native PyTorch operators like `aten::convolution` and `aten::linear` are well supported and optimized in `ipex` backend, we need to disable weights prepacking by setting `weights_prepack=False` in `ipex.optimize()`.
+
 #### BFloat16
 
 The `optimize` function works for both Float32 and BFloat16 data type. For BFloat16 data type, set the `dtype` parameter to `torch.bfloat16`.
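To illustrate the *Note* added in this hunk, a minimal TorchDynamo-mode inference sketch might look as follows; the BERT model and dummy input are illustrative (mirroring the `marker_inf_bert_dynamo_fp32` example), not new API:

```python
# Sketch of TorchDynamo-mode inference: weights prepacking is disabled because
# aten::convolution / aten::linear are already handled by the "ipex" backend.
import torch
import intel_extension_for_pytorch as ipex
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-uncased").eval()
data = torch.randint(model.config.vocab_size, size=[1, 128])

model = ipex.optimize(model, weights_prepack=False)
model = torch.compile(model, backend="ipex")

with torch.no_grad():
    model(data)
```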

docs/tutorials/introduction.rst

Lines changed: 1 addition & 1 deletion
@@ -16,7 +16,7 @@ the `Large Language Models (LLM) <llm.html>`_ section.
 
 Get Started
 -----------
-- `Installation <../../../index.html#installation?platform=cpu&version=v2.1.0%2Bcpu>`_
+- `Installation <../../../index.html#installation?platform=cpu&version=v2.1.100%2Bcpu>`_
 - `Quick Start <getting_started.md>`_
 - `Examples <examples.md>`_

docs/tutorials/llm.rst

Lines changed: 5 additions & 5 deletions
@@ -13,7 +13,7 @@ These LLM-specific optimizations can be automatically applied with a single fron
 
    llm/llm_optimize_transformers
 
-Supported Models
+Optimized Models
 ----------------
 
 .. list-table::
@@ -61,9 +61,9 @@ Supported Models
 
 \*\* For GPT-NEOX/FALCON/OPT models, the accuracy recipes of static quantization INT8 are not ready, thus, they will be skipped in our coverage.
 
-*Note*: The above verified models (including other models in the same model family, like "codellama/CodeLlama-7b-hf" from LLAMA family) are well supported with all optimizations like indirect access KV cache, fused ROPE, and prepacked TPP Linear (fp32/bf16). For other LLM model families, we are working in progress to cover those optimizations, which will expand the model list above.
+*Note*: The above verified models (including other models in the same model family, like "codellama/CodeLlama-7b-hf" from LLAMA family) are well optimized with all approaches like indirect access KV cache, fused ROPE, and prepacked TPP Linear (fp32/bf16). For other LLM families, we are working in progress to cover those optimizations, which will expand the model list above.
 
-Check `LLM best known practice <https://github.com/intel/intel-extension-for-pytorch/tree/v2.1.0%2Bcpu/examples/cpu/inference/python/llm>`_ for instructions to install/setup environment and example scripts..
+Check `LLM best known practice <../../examples/cpu/inference/python/llm>`_ for instructions to install/setup environment and example scripts..
 
 Demos
 -----
@@ -137,12 +137,12 @@ The section below provides a brief introduction to LLM optimization methodologie
 Linear Operator Optimization
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-Linear operator is the most obvious hotspot in LLMs inference. There are three backend to speedup linear GEMM kernels in Intel® Extension for PyTorch*. They are oneDNN, Tensor Processing Primitives (TPP), which are used by `Fast BERT feature <./fast_bert.md>`_, and customized linear kernels for weight only quantization. All of them use specific block format to utilize hardware resources in a highly efficient way.
+Linear operator is the most obvious hotspot in LLMs inference. There are three backend to speedup linear GEMM kernels in Intel® Extension for PyTorch*. They are oneDNN, Tensor Processing Primitives (TPP), which are used by `Fast BERT feature <./features/fast_bert.md>`_, and customized linear kernels for weight only quantization. All of them use specific block format to utilize hardware resources in a highly efficient way.
 
 Low Precision Data Types
 ~~~~~~~~~~~~~~~~~~~~~~~~
 
-While Generative AI (GenAI) workloads and models are getting more and more popular, large language models (LLM) used in these workloads are getting more and more parameters. The increasing size of LLM models enhances workload accuracies; however, it also leads to significantly heavier computations and places higher requirements to the underlying hardware. Given that, quantization becomes a more important methodology for inference workloads.
+While Generative AI (GenAI) workloads and models are getting more and more popular, LLMs used in these workloads are getting more and more parameters. The increasing size of LLMs enhances workload accuracies; however, it also leads to significantly heavier computations and places higher requirements to the underlying hardware. Given that, quantization becomes a more important methodology for inference workloads.
 
 Quantization with shorter data types benefits from its nature to improve memory IO throughputs and amount of computations on CPU. Moreover, shorter data types make it possible to keep more data in CPU cache, thus reducing memory access occurrences. Comparing to cache access, memory access is much more time costing. Specifically from computation perspective, AVX-512 Vector Neural Network Instructions (VNNI) instruction set shipped with the 2nd Generation Intel® Xeon® Scalable Processors and newer, as well as Intel® Advanced Matrix Extensions (Intel® AMX) instruction set shipped with the 4th Generation Intel® Xeon® Scalable Processors, provide instruction level accelerations to INT8 computations.
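As a concrete companion to the methodology described above, a minimal BF16 generation sketch built around the `ipex.optimize_transformers()` frontend referenced in this file might look as follows; the model id, prompt, and generation settings are illustrative assumptions, and the validated flow is in the LLM example scripts linked above:

```python
# Hedged sketch (not from this commit): BF16 LLM generation with the
# transformers frontend API. Model id and settings are illustrative only.
import torch
import intel_extension_for_pytorch as ipex
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "EleutherAI/gpt-j-6b"  # one of the verified model families
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16).eval()

# Apply the LLM-specific optimizations (indirect access KV cache, fused ROPE,
# prepacked TPP Linear) through the single frontend call.
model = ipex.optimize_transformers(model, dtype=torch.bfloat16)

inputs = tokenizer("What is Intel Extension for PyTorch?", return_tensors="pt")
with torch.no_grad(), torch.cpu.amp.autocast(dtype=torch.bfloat16):
    output = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```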

docs/tutorials/performance_tuning/tuning_guide.md

Lines changed: 2 additions & 2 deletions
@@ -37,7 +37,7 @@ On the Intel® Xeon® Scalable Processors with Intel® C620 Series Chipsets, (fo
 
 <div align="center">
 
-![Block Diagram of the Intel® Xeon® processor Scalable family microarchitecture](https://software.intel.com/content/dam/develop/external/us/en/images/xeon-processor-scalable-family-tech-overview-fig03-737410.png)
+![Block Diagram of the Intel® Xeon® processor Scalable family microarchitecture](../../../images/performance_tuning_guide/block_diagram_xeon_architecture.png)
 
 Figure 1: Block Diagram of the Intel® Xeon® processor Scalable family microarchitecture.
@@ -47,7 +47,7 @@ Usually, a CPU chip is called a socket. A typical two-socket configuration is il
 
 <div align="center">
 
-![Typical two-socket configuration](https://software.intel.com/content/dam/develop/external/us/en/images/xeon-processor-scalable-family-tech-overview-fig06-737410.png)
+![Typical two-socket configuration](../../../images/performance_tuning_guide/two_socket_config.png)
 
 Figure 2: Typical two-socket configuration.

docs/tutorials/releases.md

Lines changed: 16 additions & 0 deletions
@@ -1,6 +1,22 @@
 Releases
 =============
 
+## 2.1.100
+
+### Highlights
+
+- Improved the performance of BF16 LLM generation inference: [#2253](https://github.com/intel/intel-extension-for-pytorch/commit/99aa54f757de6c7d98f704edc6f8a83650fb1541) [#2251](https://github.com/intel/intel-extension-for-pytorch/commit/1d5e83d85c3aaf7c00323d7cb4019b40849dd2ed) [#2236](https://github.com/intel/intel-extension-for-pytorch/commit/be349962f3362f8afde4f083ec04d335245992bb) [#2278](https://github.com/intel/intel-extension-for-pytorch/commit/066c3bff417df084fa8e1d48375c0e1404320e95)
+
+- Added the optimization for Codegen: [#2257](https://github.com/intel/intel-extension-for-pytorch/commit/7c598e42e5b7899f284616c05c6896bf9d8bd2b8)
+
+- Provided the dockerfile and updated the related doc to improve the UX for LLM users: [#2229](https://github.com/intel/intel-extension-for-pytorch/commit/11484c3ebad9f868d0179a46de3d1330d9011822) [#2195](https://github.com/intel/intel-extension-for-pytorch/commit/0cd25021952bddcf5a364da45dfbefd4a0c77af4) [#2299](https://github.com/intel/intel-extension-for-pytorch/commit/76a42e516a68539752a3a8ab9aeb814d28c44cf8) [#2315](https://github.com/intel/intel-extension-for-pytorch/commit/4091bb5c0bf5f3c3ce5fbece291b44159a7fbf5c) [#2283](https://github.com/intel/intel-extension-for-pytorch/commit/e5ed8270d4d89bf68757f967676db57292c71920)
+
+- Improved the accuracy of the quantization path of LLMs: [#2280](https://github.com/intel/intel-extension-for-pytorch/commit/abc4c4e160cec3c792f5316e358173b8722a786e) [#2292](https://github.com/intel/intel-extension-for-pytorch/commit/4e212e41affa2ed07ffaf57bf10e9781113bc101) [#2275](https://github.com/intel/intel-extension-for-pytorch/commit/ed5957eb3b6190ad0be728656674f0a2a3b89158) [#2319](https://github.com/intel/intel-extension-for-pytorch/commit/1dae69de39408bc0ad245f4914d5f60e008a6eb3)
+
+- Misc fix and enhancement: [#2198](https://github.com/intel/intel-extension-for-pytorch/commit/ed1deccb86403e12e895227045d558117c5ea0fe) [#2264](https://github.com/intel/intel-extension-for-pytorch/commit/5dedcd6eb7bbf70dc92f0c20962fb2340e42e76f) [#2290](https://github.com/intel/intel-extension-for-pytorch/commit/c6e46cecd899317acfd2bd2a44a3f17b3cc1ce69)
+
+**Full Changelog**: https://github.com/intel/intel-extension-for-pytorch/compare/v2.1.0+cpu...v2.1.100+cpu
+
 ## 2.1.0
 
 ### Highlights

examples/cpu/inference/python/bert_torchdynamo_mode_inference_bf16.py

Lines changed: 1 addition & 1 deletion
@@ -12,7 +12,7 @@
 # Experimental Feature
 #################### code changes #################### # noqa F401
 import intel_extension_for_pytorch as ipex
-model = ipex.optimize(model, dtype=torch.bfloat16)
+model = ipex.optimize(model, dtype=torch.bfloat16, weights_prepack=False)
 model = torch.compile(model, backend="ipex")
 ###################################################### # noqa F401

examples/cpu/inference/python/llm/README.md

Lines changed: 7 additions & 8 deletions
@@ -1,7 +1,6 @@
 # Text Generation
 
-We provide the inference benchmarking scripts for large language models text generation.<br/>
-Support large language model families, including GPT-J, LLaMA, GPT-Neox, OPT, Falcon, CodeGen.<br/>
+We provide the inference benchmarking scripts for large language models (LLMs) text generation, by which several popular models in LLM family, including GPT-J, LLaMA, GPT-Neox, OPT, Falcon, CodeGen, are optimized.<br/>
 The scripts include both single instance and distributed (DeepSpeed) use cases.<br/>
 The scripts cover model generation inference with low precions cases for different models with best perf and accuracy (bf16 AMP,static quantization and weight only quantization).<br/>
@@ -20,15 +19,15 @@ The scripts cover model generation inference with low precions cases for differe
 
 \*\* For GPT-NEOX/FALCON/OPT/CodeGen models, the accuracy recipes of static quantization INT8 are not ready thus they will be skipped in our coverage.
 
-*Note:* The above verified models (including other models in the same model family, like "codellama/CodeLlama-7b-hf" from LLAMA family) are well supported with all optimizations like indirect access KV cache, fused ROPE, and prepacked Linear (fp32/bf16). For other LLM model families, we are working in progress to cover those optimizations, which will expand the model list above.
+*Note:* The above verified models (including other models in the same model family, like "codellama/CodeLlama-7b-hf" from LLAMA family) are well supported with all optimizations like indirect access KV cache, fused ROPE, and prepacked Linear (fp32/bf16). For other LLM families, we are working in progress to cover those optimizations, which will expand the model list above.
 
 # Models to be Optimized
 
-We are working on the optimizations of a wider range of popular LLM models. Models like BLOOM, ChatGLM2/ChatGLM3, T5, BaiChuan/BaiChuan2, StarCoder and CodeLlama are to be optimized in the next release, and more models like Dolly2, MPT, QWen, Mistral, etc. are on the way.
+We are working on the optimizations of a wider range of popular LLMs. Models like BLOOM, ChatGLM2/ChatGLM3, T5, BaiChuan/BaiChuan2, StarCoder and CodeLlama are to be optimized in the next release, and more models like Dolly2, MPT, QWen, Mistral, etc. are on the way.
 
 # Environment Setup
 
-1. Get the Intel® Extension for PyTorch\* source code
+1\. Get the Intel® Extension for PyTorch\* source code
 
 ```bash
 git clone https://github.com/intel/intel-extension-for-pytorch.git
@@ -38,7 +37,7 @@ git submodule sync
 git submodule update --init --recursive
 ```
 
-2.a. It is highly recommended to build a Docker container from the provided `Dockerfile`.
+2\.a. It is highly recommended to build a Docker container from the provided `Dockerfile`.
 
 ```bash
 # Build an image with the provided Dockerfile by compiling Intel® Extension for PyTorch\* from source
@@ -54,7 +53,7 @@ docker run --rm -it --privileged ipex-llm:2.1.100 bash
 cd llm
 ```
 
-2.b. Alternatively, you can take advantage of a provided environment configuration script to setup an environment without using a docker container.
+2\.b. Alternatively, you can take advantage of a provided environment configuration script to setup an environment without using a docker container.
 
 ```bash
 # GCC 12.3 is required. Installation can be taken care of by the environment configuration script.
@@ -67,7 +66,7 @@ cd examples/cpu/inference/python/llm
 bash ./tools/env_setup.sh
 ```
 
-3. Once an environment is configured with either method above, set necessary environment variables with an environment variables activation script and download the sample `prompt.json`.
+3\. Once an environment is configured with either method above, set necessary environment variables with an environment variables activation script and download the sample `prompt.json`.
 
 ```bash
 # Activate environment variables

examples/cpu/serving/triton/Dockerfile

Lines changed: 14 additions & 14 deletions
@@ -3,20 +3,20 @@
 
 FROM nvcr.io/nvidia/tritonserver:23.10-py3
 
-COPY requirements.txt requirements.txt
-RUN apt-get update && \
-    apt-get install --no-install-recommends -y numactl \
-    google-perftools \
-    python3.9 && \
-    ln -s /usr/bin/python3.9 /usr/bin/python && \
-    apt-get clean
+RUN useradd -m tritonuser && \
+    echo 'tritonuser ALL=(ALL) NOPASSWD: ALL' >> /etc/sudoers
 
-RUN python3 -m pip --no-cache-dir install -U --upgrade pip && \
+USER tritonuser
+WORKDIR /home/tritonuser
+
+COPY --chown=tritonuser:tritonuser requirements.txt requirements.txt
+ENV PATH="/home/tritonuser/.local/bin":${PATH}
+RUN python3 -m pip --no-cache-dir install -U --upgrade pip && \
     python3 -m pip --no-cache-dir install -U -r requirements.txt
 
-ENV LD_PRELOAD="/usr/local/lib/libiomp5.so:/usr/lib/x86_64-linux-gnu/libtcmalloc.so":${LD_PRELOAD}
-ENV KMP_BLOCKTIME=1
-ENV KMP_SETTINGS=1
-ENV KMP_AFFINITY=granularity=fine,compact,1,0
-ENV DNNL_PRIMITIVE_CACHE_CAPACITY=1024
-ENV TOKENIZERS_PARALLELISM=true
+ENV LD_PRELOAD="/home/tritonuser/.local/lib/libiomp5.so:/usr/lib/x86_64-linux-gnu/libtcmalloc.so":${LD_PRELOAD} \
+    KMP_BLOCKTIME=1 \
+    KMP_SETTINGS=1 \
+    KMP_AFFINITY=granularity=fine,compact,1,0 \
+    DNNL_PRIMITIVE_CACHE_CAPACITY=1024 \
+    TOKENIZERS_PARALLELISM=true

examples/cpu/serving/triton/start.sh

Lines changed: 1 addition & 1 deletion
@@ -40,7 +40,7 @@ start_host() {
 
 # Run Triton Host Server for specified model
 CORE_NUMBER=$(lscpu | grep 'Core(s) per socket:' | awk '{print $4}')
-docker run -it --read-only --rm -e OMP_NUM_THREADS=$CORE_NUMBER --privileged --shm-size=1g -p${1}:8000 -p8001:8001 -p8002:8002 --tmpfs /tmp:rw,noexec,nosuid,size=1g --tmpfs /root/.cache/:rw,noexec,nosuid,size=4g -v$(pwd)/backend:/models --name ai_inference_host ai_inference:v1 numactl -C 0-"$((CORE_NUMBER - 1))" -m 0 tritonserver --model-repository=/models --log-verbose 1 --log-error 1 --log-info 1
+docker run -it --rm -e OMP_NUM_THREADS=$CORE_NUMBER --cpuset-cpus 0-"$((CORE_NUMBER - 1))" --shm-size=1g -p${1}:8000 -p8001:8001 -p8002:8002 --tmpfs /tmp:rw,noexec,nosuid,size=1g -v$(pwd)/backend:/models:rw --name ai_inference_host ai_inference:v1 tritonserver --model-repository=/models --log-verbose 1 --log-error 1 --log-info 1
 }
 
 start_client() {
