* .rst link fix
* bug fix in docstring
* misc link and wording fixes
* LLM model -> LLM
* correction and adding a note for disabling weights_prepack in ipex.optimize for TorchDynamo mode.
* updating version in intro.rst && refine 'device' param doc for optimize_transformers
* symbol correction
* adding v2.1.100 release notes
* remove duplicated link
* Fix running Docker container as the root user
* revert device argument desc in optimize_transformers() docstring; solve indent problem in LLM README.md; fix image links in performance_tuning/tuning_guide.md
---------
Co-authored-by: Jing Xu <[email protected]>
Co-authored-by: chunyuan-w <[email protected]>
Co-authored-by: Mikolaj Zyczynski <[email protected]>
docs/index.rst: 1 addition & 1 deletion
@@ -10,7 +10,7 @@ Optimizations take advantage of Intel® Advanced Vector Extensions 512 (Intel®
Moreover, Intel® Extension for PyTorch* provides easy GPU acceleration for Intel discrete GPUs through the PyTorch* ``xpu`` device.
In the current technological landscape, Generative AI (GenAI) workloads and models have gained widespread attention and popularity. Large Language Models (LLMs) have emerged as the dominant models driving these GenAI applications. Starting from 2.1.0, specific optimizations for certain
-LLM models are introduced in the Intel® Extension for PyTorch*. For more information on LLM optimizations, refer to the `Large Language Models (LLM) <tutorials/llm.rst>`_ section.
+LLMs are introduced in the Intel® Extension for PyTorch*. For more information on LLM optimizations, refer to the `Large Language Models (LLM) <tutorials/llm.html>`_ section.
The extension can be loaded as a Python module for Python programs or linked as a C++ library for C++ programs. In Python scripts, users can enable it dynamically by importing ``intel_extension_for_pytorch``.
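As an editor-added illustration of the sentence above (not part of this diff): enabling the extension in a Python script is just an import plus an `optimize` call. The toy model below is only a stand-in for demonstration.

```python
import torch
import torch.nn as nn
# Importing the module is what "enables" the extension for the Python program.
import intel_extension_for_pytorch as ipex

# A small stand-in model; any eval-mode torch.nn.Module works the same way.
model = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 10)).eval()

# Apply the CPU optimizations before running inference.
model = ipex.optimize(model)

with torch.no_grad():
    print(model(torch.randn(1, 64)).shape)
```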
docs/tutorials/examples.md: 2 additions & 0 deletions
@@ -177,6 +177,8 @@ We recommend using Intel® Extension for PyTorch\* with [TorchScript](https://py
[//]: #(marker_inf_bert_dynamo_fp32)
[//]: #(marker_inf_bert_dynamo_fp32)
+*Note:* In TorchDynamo mode, since the native PyTorch operators like `aten::convolution` and `aten::linear` are well supported and optimized in `ipex` backend, we need to disable weights prepacking by setting `weights_prepack=False` in `ipex.optimize()`.
+
#### BFloat16
The `optimize` function works for both Float32 and BFloat16 data type. For BFloat16 data type, set the `dtype` parameter to `torch.bfloat16`.
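The note added above and the BFloat16 paragraph can be combined into one short sketch. This is an editor-added example, not part of the diff: the toy model and tensor shapes are arbitrary, and `backend="ipex"` is assumed to be the TorchDynamo backend name the extension registers with `torch.compile`.

```python
import torch
import torch.nn as nn
import intel_extension_for_pytorch as ipex

model = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 10)).eval()

# TorchDynamo mode: keep plain aten ops by disabling weight prepacking,
# then hand the model to torch.compile with the ipex backend.
dynamo_model = ipex.optimize(model, weights_prepack=False)
dynamo_model = torch.compile(dynamo_model, backend="ipex")

with torch.no_grad():
    _ = dynamo_model(torch.randn(1, 64))

# BFloat16 variant: pass dtype=torch.bfloat16 to optimize() and run under CPU autocast.
bf16_model = ipex.optimize(model, dtype=torch.bfloat16)
with torch.no_grad(), torch.cpu.amp.autocast():
    _ = bf16_model(torch.randn(1, 64))
```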
docs/tutorials/llm.rst: 5 additions & 5 deletions
@@ -13,7 +13,7 @@ These LLM-specific optimizations can be automatically applied with a single fron
llm/llm_optimize_transformers
-Supported Models
+Optimized Models
----------------
.. list-table::
@@ -61,9 +61,9 @@ Supported Models
\*\* For GPT-NEOX/FALCON/OPT models, the accuracy recipes of static quantization INT8 are not ready, thus, they will be skipped in our coverage.
-*Note*: The above verified models (including other models in the same model family, like "codellama/CodeLlama-7b-hf" from LLAMA family) are well supported with all optimizations like indirect access KV cache, fused ROPE, and prepacked TPP Linear (fp32/bf16). For other LLM model families, we are working in progress to cover those optimizations, which will expand the model list above.
+*Note*: The above verified models (including other models in the same model family, like "codellama/CodeLlama-7b-hf" from LLAMA family) are well optimized with all approaches like indirect access KV cache, fused ROPE, and prepacked TPP Linear (fp32/bf16). For other LLM families, we are working in progress to cover those optimizations, which will expand the model list above.

-Check `LLM best known practice <https://github.com/intel/intel-extension-for-pytorch/tree/v2.1.0%2Bcpu/examples/cpu/inference/python/llm>`_ for instructions to install/setup environment and example scripts..
+Check `LLM best known practice <../../examples/cpu/inference/python/llm>`_ for instructions to install/setup environment and example scripts..
Demos
-----
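The body of the Demos section is outside this hunk. Purely as an editor-added sketch of the single frontend API this page refers to (`optimize_transformers`), BF16 generation might look roughly like the following; the model name, dtype, and generation arguments are assumptions rather than content from the diff, and the exact signature should be checked against the installed 2.1.x release.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
import intel_extension_for_pytorch as ipex

# Any of the verified model families could be used here; GPT-J is only an example.
model_id = "EleutherAI/gpt-j-6b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16).eval()

# Single frontend call that applies the LLM-specific optimizations described on this
# page (indirect access KV cache, fused ROPE, prepacked TPP Linear, ...).
model = ipex.optimize_transformers(model, dtype=torch.bfloat16)

inputs = tokenizer("What is Intel Extension for PyTorch?", return_tensors="pt")
with torch.no_grad(), torch.cpu.amp.autocast():
    output = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```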
@@ -137,12 +137,12 @@ The section below provides a brief introduction to LLM optimization methodologie
Linear Operator Optimization
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-Linear operator is the most obvious hotspot in LLMs inference. There are three backend to speedup linear GEMM kernels in Intel® Extension for PyTorch*. They are oneDNN, Tensor Processing Primitives (TPP), which are used by `Fast BERT feature <./fast_bert.md>`_, and customized linear kernels for weight only quantization. All of them use specific block format to utilize hardware resources in a highly efficient way.
+Linear operator is the most obvious hotspot in LLMs inference. There are three backend to speedup linear GEMM kernels in Intel® Extension for PyTorch*. They are oneDNN, Tensor Processing Primitives (TPP), which are used by `Fast BERT feature <./features/fast_bert.md>`_, and customized linear kernels for weight only quantization. All of them use specific block format to utilize hardware resources in a highly efficient way.
Low Precision Data Types
~~~~~~~~~~~~~~~~~~~~~~~~
-While Generative AI (GenAI) workloads and models are getting more and more popular, large language models (LLM) used in these workloads are getting more and more parameters. The increasing size of LLM models enhances workload accuracies; however, it also leads to significantly heavier computations and places higher requirements to the underlying hardware. Given that, quantization becomes a more important methodology for inference workloads.
+While Generative AI (GenAI) workloads and models are getting more and more popular, LLMs used in these workloads are getting more and more parameters. The increasing size of LLMs enhances workload accuracies; however, it also leads to significantly heavier computations and places higher requirements to the underlying hardware. Given that, quantization becomes a more important methodology for inference workloads.
Quantization with shorter data types benefits from its nature to improve memory IO throughputs and amount of computations on CPU. Moreover, shorter data types make it possible to keep more data in CPU cache, thus reducing memory access occurrences. Comparing to cache access, memory access is much more time costing. Specifically from computation perspective, AVX-512 Vector Neural Network Instructions (VNNI) instruction set shipped with the 2nd Generation Intel® Xeon® Scalable Processors and newer, as well as Intel® Advanced Matrix Extensions (Intel® AMX) instruction set shipped with the 4th Generation Intel® Xeon® Scalable Processors, provide instruction level accelerations to INT8 computations.
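To make the quantization discussion above concrete, here is an editor-added sketch of the static INT8 flow with the extension's quantization helpers on a toy model; the `default_static_qconfig_mapping` / `prepare` / `convert` names follow the 2.x `ipex.quantization` API and should be treated as an assumption to verify against your installed version.

```python
import torch
import torch.nn as nn
import intel_extension_for_pytorch as ipex
from intel_extension_for_pytorch.quantization import prepare, convert

model = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 10)).eval()
example_inputs = torch.randn(4, 64)

# Insert observers, calibrate on representative data, then convert so the quantized
# GEMMs can use VNNI/AMX instructions on supported Xeon processors.
qconfig_mapping = ipex.quantization.default_static_qconfig_mapping
prepared = prepare(model, qconfig_mapping, example_inputs=example_inputs, inplace=False)

with torch.no_grad():
    for _ in range(10):                  # calibration loop; real data goes here
        prepared(torch.randn(4, 64))
    quantized = convert(prepared)
    # Trace and freeze to get a deployable INT8 TorchScript model.
    traced = torch.jit.trace(quantized, example_inputs)
    traced = torch.jit.freeze(traced)
    _ = traced(example_inputs)
```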
docs/tutorials/performance_tuning/tuning_guide.md: 2 additions & 2 deletions
@@ -37,7 +37,7 @@ On the Intel® Xeon® Scalable Processors with Intel® C620 Series Chipsets, (fo
<div align="center">
-
+
Figure 1: Block Diagram of the Intel® Xeon® processor Scalable family microarchitecture.
@@ -47,7 +47,7 @@ Usually, a CPU chip is called a socket. A typical two-socket configuration is il
docs/tutorials/releases.md: 16 additions & 0 deletions
@@ -1,6 +1,22 @@
Releases
=============
+## 2.1.100
+
+### Highlights
+
+- Improved the performance of BF16 LLM generation inference: [#2253](https://github.com/intel/intel-extension-for-pytorch/commit/99aa54f757de6c7d98f704edc6f8a83650fb1541) [#2251](https://github.com/intel/intel-extension-for-pytorch/commit/1d5e83d85c3aaf7c00323d7cb4019b40849dd2ed) [#2236](https://github.com/intel/intel-extension-for-pytorch/commit/be349962f3362f8afde4f083ec04d335245992bb) [#2278](https://github.com/intel/intel-extension-for-pytorch/commit/066c3bff417df084fa8e1d48375c0e1404320e95)
+
+- Added the optimization for Codegen: [#2257](https://github.com/intel/intel-extension-for-pytorch/commit/7c598e42e5b7899f284616c05c6896bf9d8bd2b8)
+
+- Provided the dockerfile and updated the related doc to improve the UX for LLM users: [#2229](https://github.com/intel/intel-extension-for-pytorch/commit/11484c3ebad9f868d0179a46de3d1330d9011822) [#2195](https://github.com/intel/intel-extension-for-pytorch/commit/0cd25021952bddcf5a364da45dfbefd4a0c77af4) [#2299](https://github.com/intel/intel-extension-for-pytorch/commit/76a42e516a68539752a3a8ab9aeb814d28c44cf8) [#2315](https://github.com/intel/intel-extension-for-pytorch/commit/4091bb5c0bf5f3c3ce5fbece291b44159a7fbf5c) [#2283](https://github.com/intel/intel-extension-for-pytorch/commit/e5ed8270d4d89bf68757f967676db57292c71920)
+
+- Improved the accuracy of the quantization path of LLMs: [#2280](https://github.com/intel/intel-extension-for-pytorch/commit/abc4c4e160cec3c792f5316e358173b8722a786e) [#2292](https://github.com/intel/intel-extension-for-pytorch/commit/4e212e41affa2ed07ffaf57bf10e9781113bc101) [#2275](https://github.com/intel/intel-extension-for-pytorch/commit/ed5957eb3b6190ad0be728656674f0a2a3b89158) [#2319](https://github.com/intel/intel-extension-for-pytorch/commit/1dae69de39408bc0ad245f4914d5f60e008a6eb3)
+
+- Misc fix and enhancement: [#2198](https://github.com/intel/intel-extension-for-pytorch/commit/ed1deccb86403e12e895227045d558117c5ea0fe) [#2264](https://github.com/intel/intel-extension-for-pytorch/commit/5dedcd6eb7bbf70dc92f0c20962fb2340e42e76f) [#2290](https://github.com/intel/intel-extension-for-pytorch/commit/c6e46cecd899317acfd2bd2a44a3f17b3cc1ce69)
examples/cpu/inference/python/llm/README.md: 7 additions & 8 deletions
@@ -1,7 +1,6 @@
# Text Generation
-We provide the inference benchmarking scripts for large language models text generation.<br/>
-Support large language model families, including GPT-J, LLaMA, GPT-Neox, OPT, Falcon, CodeGen.<br/>
+We provide the inference benchmarking scripts for large language models (LLMs) text generation, by which several popular models in LLM family, including GPT-J, LLaMA, GPT-Neox, OPT, Falcon, CodeGen, are optimized.<br/>
The scripts include both single instance and distributed (DeepSpeed) use cases.<br/>
The scripts cover model generation inference with low precions cases for different models with best perf and accuracy (bf16 AMP,static quantization and weight only quantization).<br/>
@@ -20,15 +19,15 @@ The scripts cover model generation inference with low precions cases for differe
\*\* For GPT-NEOX/FALCON/OPT/CodeGen models, the accuracy recipes of static quantization INT8 are not ready thus they will be skipped in our coverage.
-*Note:* The above verified models (including other models in the same model family, like "codellama/CodeLlama-7b-hf" from LLAMA family) are well supported with all optimizations like indirect access KV cache, fused ROPE, and prepacked Linear (fp32/bf16). For other LLM model families, we are working in progress to cover those optimizations, which will expand the model list above.
+*Note:* The above verified models (including other models in the same model family, like "codellama/CodeLlama-7b-hf" from LLAMA family) are well supported with all optimizations like indirect access KV cache, fused ROPE, and prepacked Linear (fp32/bf16). For other LLM families, we are working in progress to cover those optimizations, which will expand the model list above.
# Models to be Optimized
-We are working on the optimizations of a wider range of popular LLM models. Models like BLOOM, ChatGLM2/ChatGLM3, T5, BaiChuan/BaiChuan2, StarCoder and CodeLlama are to be optimized in the next release, and more models like Dolly2, MPT, QWen, Mistral, etc. are on the way.
+We are working on the optimizations of a wider range of popular LLMs. Models like BLOOM, ChatGLM2/ChatGLM3, T5, BaiChuan/BaiChuan2, StarCoder and CodeLlama are to be optimized in the next release, and more models like Dolly2, MPT, QWen, Mistral, etc. are on the way.
# Environment Setup
-1. Get the Intel® Extension for PyTorch\* source code
+1\. Get the Intel® Extension for PyTorch\* source code

-2.b. Alternatively, you can take advantage of a provided environment configuration script to setup an environment without using a docker container.
+2\.b. Alternatively, you can take advantage of a provided environment configuration script to setup an environment without using a docker container.
```bash
# GCC 12.3 is required. Installation can be taken care of by the environment configuration script.
@@ -67,7 +66,7 @@ cd examples/cpu/inference/python/llm
bash ./tools/env_setup.sh
```
-3. Once an environment is configured with either method above, set necessary environment variables with an environment variables activation script and download the sample `prompt.json`.
+3\. Once an environment is configured with either method above, set necessary environment variables with an environment variables activation script and download the sample `prompt.json`.