
Commit 2aee05c

Merge branch 'master' of https://github.com/openvinotoolkit/openvino into mock_npuw_compiled_model

2 parents fc7a534 + e0ab225

17 files changed (+764 -399 lines)
Lines changed: 28 additions & 0 deletions
@@ -0,0 +1,28 @@
# Asynchronous Compilation for Dynamic Models

## Motivation

When the input shape of any layer changes, a new static kernel must be compiled because the previously built kernel is no longer valid. If inference requests must wait for this compilation to complete, latency increases by the kernel compile time, degrading overall throughput. Asynchronous kernel compilation solves this problem by decoupling kernel compilation from inference execution.

## Overall Workflow

<!-- flowchart TD
A[Start Network Loading] -- > B(Build dynamic kernels)
B -- > C[Start Inferencing]
C -- > D{Is Input Shape Changed?
or Is current impl dynamic?}
D -- > |Yes| G{Does this primitive have a cached impl?}
G -- > |Yes| I(Load pre-built impl from impl cache)
I -- > F
G -- > |No| H(Trigger a new kernel compilation task
Load dynamic kernel from the impl cache)
D -- > |No| F(Execution)
H -- > F -->

<img src="async_compilation.PNG" alt="async compilation overall workflow" width=500>

The diagram above illustrates the overall async compilation flow. During network loading, dynamic kernels are selected and stored in the impl cache. At inference time, if the input shape of a primitive changes or its current implementation is dynamic, the runtime checks the impl cache for a pre-built implementation matching the new shape. On a cache hit, the pre-built implementation is used directly. On a cache miss, a new static kernel compilation task is dispatched in the background, and the dynamic kernel handles execution in the meantime to avoid stalling inference.
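The cache-or-compile decision described above can be sketched with standard C++ only. `ShapeKey`, `Impl`, and `ImplCache` are invented names for illustration; the actual plugin keys its in-memory cache by `kernel_impl_params` and stores `primitive_impl` objects:

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <future>
#include <map>
#include <thread>
#include <vector>

// Hypothetical stand-ins for the plugin's cache key and implementation types.
using ShapeKey = std::vector<int64_t>;

struct Impl {
    bool is_dynamic;  // true for the shape-agnostic fallback kernel
};

class ImplCache {
public:
    // Cache hit: return the pre-built static impl. Cache miss: dispatch a
    // background compilation (at most once per shape) and return the dynamic
    // fallback so inference is not stalled.
    const Impl& get_or_compile_async(const ShapeKey& shape) {
        drain_finished();  // harvest any compilations that completed
        auto hit = static_impls_.find(shape);
        if (hit != static_impls_.end())
            return hit->second;
        if (pending_.find(shape) == pending_.end()) {
            pending_.emplace(shape, std::async(std::launch::async, [] {
                return Impl{/*is_dynamic=*/false};  // compile a static kernel
            }));
        }
        return dynamic_impl_;  // execute with the dynamic kernel meanwhile
    }

private:
    // Move finished compilation results into the static impl cache.
    void drain_finished() {
        for (auto it = pending_.begin(); it != pending_.end();) {
            if (it->second.wait_for(std::chrono::seconds(0)) == std::future_status::ready) {
                static_impls_.emplace(it->first, it->second.get());
                it = pending_.erase(it);
            } else {
                ++it;
            }
        }
    }

    Impl dynamic_impl_{/*is_dynamic=*/true};
    std::map<ShapeKey, Impl> static_impls_;
    std::map<ShapeKey, std::future<Impl>> pending_;
};
```

Here `std::async` stands in for the plugin's compilation task executor; the real implementation also synchronizes cache access across threads.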

## Prioritized Asynchronous Kernel Compilation

Kernel compilation tasks for new input shapes run in threads separate from the inference thread. Without prioritization, less critical kernels could be compiled before performance-critical ones, reducing overall throughput. To address this, async compilation is restricted to four performance-critical primitives: convolution, fully-connected, GEMM, and softmax.
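That restriction can be expressed as a simple predicate over the primitive type. The strings below are placeholders for illustration, not the plugin's actual type identifiers:

```cpp
#include <cassert>
#include <set>
#include <string>

// Illustrative predicate: only these primitive types get async compilation
// tasks, mirroring the restriction described above.
inline bool is_async_compile_target(const std::string& primitive_type) {
    static const std::set<std::string> kTargets = {
        "convolution", "fully_connected", "gemm", "softmax"};
    return kTargets.count(primitive_type) != 0;
}
```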
Lines changed: 61 additions & 0 deletions
@@ -0,0 +1,61 @@
# Preprocessing for dynamic shape execution

As explained in the basic flow of primitive execution for dynamic shape in [Overall flow for dynamic shape](overall_flow.md), several preprocessing steps are performed before setting arguments to the kernel and executing the selected impl.

* `update_shape` - when the input shape changes, calculate the output shape and perform shape inference so that the shape is propagated to the next node.
* `update_impl` - depending on the changed shape, `primitive_impl` is retrieved from the in-memory cache or a new impl is selected.
* `realloc_if_needed` - allocates new output memory if necessary.

The following describes some representative preprocessing steps for dynamic shape execution.

## primitive_inst::update_shape
### Dynamic shape inference
To support dynamic shape in the GPU plugin, `cldnn::layout` uses `ov::PartialShape` to express shape. The existing `cldnn::tensor` does not support dynamic shape and is limited in rank, whereas `ov::PartialShape` supports both static and dynamic dimensions with no rank limitation. When creating a `cldnn::primitive` from an `ov::op`, the `ov::PartialShape` that the `ov::op` already holds is used directly.

> **Note**: In the existing static shape execution flow of the GPU plugin, the shape of an `ov::op` may be transformed into `ov::tensor` and used, so `cldnn::primitive` creation is separated from the dynamic shape execution flow. When building a `cldnn::program`, if at least one node is dynamic, the `ov::intel_gpu::allow_new_shape_infer` property is set [(link)](https://github.com/openvinotoolkit/openvino/blob/eea49f3c9e6bba5463460fdc126c2df38a4a5215/src/plugins/intel_gpu/src/plugin/program_builder.cpp#L139), and this property separates static shape and dynamic shape execution during `cldnn::primitive` creation. The two paths will be unified once the GPU plugin fully supports dynamic shape.

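As a rough illustration of what `ov::PartialShape` provides over the fixed-rank static `cldnn::tensor`, the toy types below model a shape whose dimensions may be static or dynamic. This is a conceptual stand-in only; the real API is `ov::PartialShape` built from `ov::Dimension`, which also supports bounded intervals:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy model of partial-shape semantics: every dimension is either a static
// extent or dynamic (-1), and the rank is unrestricted.
struct Dim {
    int64_t value;  // -1 marks a dynamic dimension
    bool is_dynamic() const { return value < 0; }
};

struct MiniPartialShape {
    std::vector<Dim> dims;
    // A shape is static only if every dimension is static.
    bool is_static() const {
        for (const auto& d : dims)
            if (d.is_dynamic())
                return false;
        return true;
    }
    std::size_t rank() const { return dims.size(); }
};
```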
When the input shape of the model changes, the input shape of the current primitive is updated by checking whether it has changed, the output shape is calculated from the input shape, and this shape is propagated to the next primitive during the shape inference stage.
The details of how `primitive_inst::update_shape` performs shape inference when executing a primitive in the GPU plugin for dynamic shape are as follows:

1. In the basic primitive execution flow, a runtime optimization stage (i.e. `primitive_inst::do_runtime_in_place_concat` [(link)](https://github.com/openvinotoolkit/openvino/blob/eea49f3c9e6bba5463460fdc126c2df38a4a5215/src/plugins/intel_gpu/src/graph/primitive_inst.cpp#L720)) runs before `update_shape()`. If `update_shape()` has already been executed there by another primitive, `update_shape_done_by_other` is set to TRUE, and `update_shape()` is therefore skipped. [(link)](https://github.com/openvinotoolkit/openvino/blob/eea49f3c9e6bba5463460fdc126c2df38a4a5215/src/plugins/intel_gpu/src/graph/primitive_inst.cpp#L247)
2. First, the output layouts of `kernel_impl_params` from the dependencies of `primitive_inst` are compared with the input layouts of `kernel_impl_params` of the current primitive. If they differ, the changed shapes are written to the input layouts of `kernel_impl_params`. [(link)](https://github.com/openvinotoolkit/openvino/blob/eea49f3c9e6bba5463460fdc126c2df38a4a5215/src/plugins/intel_gpu/src/graph/primitive_inst.cpp#L254)
3. `_shape_changed` is set to TRUE if the input shape has changed. [(link)](https://github.com/openvinotoolkit/openvino/blob/eea49f3c9e6bba5463460fdc126c2df38a4a5215/src/plugins/intel_gpu/src/graph/primitive_inst.cpp#L266)
4. If the current node is `shape_of` and the input shape has not changed, `_shape_changed` is reset to FALSE and `update_shape()` is skipped. [(link)](https://github.com/openvinotoolkit/openvino/blob/eea49f3c9e6bba5463460fdc126c2df38a4a5215/src/plugins/intel_gpu/src/graph/primitive_inst.cpp#L270)
5. If the current node is part of a *`shape_of` subgraph*, check the *dependent `shape_of` primitives* and skip `update_shape()` if the shape has not changed. [(link)](https://github.com/openvinotoolkit/openvino/blob/eea49f3c9e6bba5463460fdc126c2df38a4a5215/src/plugins/intel_gpu/src/graph/primitive_inst.cpp#L276)
6. `update_shape()` is skipped if any of the following conditions hold: the input shape has not changed, the node generates dynamic output (e.g. `Nonzero`, `Unique`), or the output layouts of `kernel_impl_params` are already static. [(link)](https://github.com/openvinotoolkit/openvino/blob/eea49f3c9e6bba5463460fdc126c2df38a4a5215/src/plugins/intel_gpu/src/graph/primitive_inst.cpp#L313)
7. In static shape execution, data for additional inputs that determine the output shape are set as attributes when creating `cldnn::primitive`. In dynamic shape execution, if that data is stored in the output memory of a preceding node, execution waits until those dependent nodes complete. To determine which input nodes have memory dependencies, most `program_node`s define `get_shape_infer_dependencies()`. The dependency information (index and memory for each dependent input node) is collected from the current node, stored in a `map`, and the corresponding primitive events are added to an event list to await completion. Finally, the populated map is saved in `memory_deps` of `kernel_impl_params`. [(link)](https://github.com/openvinotoolkit/openvino/blob/eea49f3c9e6bba5463460fdc126c2df38a4a5215/src/plugins/intel_gpu/src/graph/primitive_inst.cpp#L319)
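The dependency collection in step 7 can be sketched roughly as follows. The types and names are simplified stand-ins, not the plugin's actual classes, and the event waiting is omitted:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <set>
#include <vector>

// Stand-in for a dependency's output memory, e.g. a shape tensor produced
// upstream whose *contents* feed shape inference.
struct Memory {
    std::vector<int64_t> data;
};

struct Node {
    std::vector<const Memory*> dep_outputs;     // output memory of each dependency
    std::set<std::size_t> shape_infer_dependencies;  // cf. get_shape_infer_dependencies()
};

// Collect the memory of every input whose contents are needed for shape
// inference, keyed by input index, like kernel_impl_params::memory_deps.
std::map<std::size_t, const Memory*> collect_memory_deps(const Node& node) {
    std::map<std::size_t, const Memory*> memory_deps;
    for (std::size_t idx : node.shape_infer_dependencies) {
        if (idx < node.dep_outputs.size())
            memory_deps.emplace(idx, node.dep_outputs[idx]);
    }
    return memory_deps;
}
```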
8. There are two APIs for output shape calculation on `program_node`: `calc_output_layout()` for static shape execution and `calc_output_layouts()` for dynamic shape execution. In this step, `calc_output_layouts()` is called; it invokes the `shape_infer()` API of `ov::op` with the updated input layouts from `kernel_impl_params`, the primitive's attributes, and `memory_deps`, and returns the output layouts as a vector. The newly calculated output layout is then written back to `output_layouts` in `kernel_impl_params`. [(link)](https://github.com/openvinotoolkit/openvino/blob/eea49f3c9e6bba5463460fdc126c2df38a4a5215/src/plugins/intel_gpu/src/graph/primitive_inst.cpp#L366)
```cpp
struct program_node {
    ...
public:
    layout calc_output_layout() const;
    std::vector<layout> calc_output_layouts() const;
};
```
9. If there is a fused operation in `kernel_impl_params`, the output layout of its descriptor is also updated with the `ov::PartialShape` of the updated output layout. [(link)](https://github.com/openvinotoolkit/openvino/blob/eea49f3c9e6bba5463460fdc126c2df38a4a5215/src/plugins/intel_gpu/src/graph/primitive_inst.cpp#L379)

## primitive_inst::update_weights
If `primitive_impl` is created or updated through `update_impl()` and it belongs to a weightable node (e.g. `convolution`, `deconvolution`, `fc`), the weights should be reordered to the layout required by the kernel as needed. The following describes the processes performed in `update_weights()`.

1. If the impl is nullptr or the current node is not a weightable node, `update_weights()` is skipped. [(link)](https://github.com/openvinotoolkit/openvino/blob/eea49f3c9e6bba5463460fdc126c2df38a4a5215/src/plugins/intel_gpu/src/graph/primitive_inst.cpp#L1168)
2. Create *reorder kernel params* (i.e. `kernel_impl_params` for weights reorder) from the `WeightsReorderParams` of `primitive_inst`. [(link)](https://github.com/openvinotoolkit/openvino/blob/eea49f3c9e6bba5463460fdc126c2df38a4a5215/src/plugins/intel_gpu/src/graph/primitive_inst.cpp#L1172)
3. If weights reorder is not necessary but the weights were previously reordered, an incorrect memory buffer would be used, so the *reordered weights cache* is reset to the original weights memory layout. [(link)](https://github.com/openvinotoolkit/openvino/blob/eea49f3c9e6bba5463460fdc126c2df38a4a5215/src/plugins/intel_gpu/src/graph/primitive_inst.cpp#L1181)
4. If weights reorder is necessary, update the weights layout of `kernel_impl_params` to the output layout of the *reorder kernel params*. This is the expected layout. [(link)](https://github.com/openvinotoolkit/openvino/blob/eea49f3c9e6bba5463460fdc126c2df38a4a5215/src/plugins/intel_gpu/src/graph/primitive_inst.cpp#L1186)
    - If the expected layout hits the *reordered weights cache*, it is reused.
    - If the expected layout is compatible with the original layout, the original weights memory is reinterpreted and added to the *reordered weights cache* without reordering.
    - If the expected layout misses the *reordered weights cache*, retrieve a cached reorder impl from the *implementations cache* using the *reorder kernel params*, or create a new reorder impl through `WeightsReordersFactory`, set the compiled kernel on it, and add the impl to the *implementations cache*. Then check whether the weights memory in the *reordered weights cache* can be reused; if so, reuse it, otherwise allocate a new buffer and update the *reordered weights cache* accordingly. Finally, use `kernel_arguments_data()` to set the kernel arguments on the reorder impl and execute the kernel.
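The layout decision in step 4 above can be sketched as a small decision function. Layouts are modeled here as plain strings and compatibility as equality, which is a deliberate simplification of the plugin's `cldnn::layout` comparison; the enum and function names are illustrative:

```cpp
#include <cassert>
#include <map>
#include <string>

enum class WeightsAction { ReuseCached, ReinterpretOriginal, RunReorder };

// Decide how to obtain weights in the expected layout: reuse a cached
// reordered copy, reinterpret the original memory, or run a reorder kernel.
WeightsAction choose_weights_action(const std::string& expected_layout,
                                    const std::string& original_layout,
                                    const std::map<std::string, int>& reordered_cache) {
    if (reordered_cache.count(expected_layout))
        return WeightsAction::ReuseCached;          // cache hit: reuse as-is
    if (expected_layout == original_layout)
        return WeightsAction::ReinterpretOriginal;  // compatible: no reorder kernel
    return WeightsAction::RunReorder;               // fetch/compile and run a reorder
}
```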

## primitive_inst::realloc_if_needed
In static shape execution, output memory is allocated when `primitive_inst` is created, but in dynamic shape execution output memory is allocated before the kernel arguments are set and the kernel is executed. The following describes the processes performed in `realloc_if_needed()`.

1. If the current node is `concat` with one user, `can_be_optimized()` is TRUE, and `allocation_done_by_other` is FALSE (i.e. not yet allocated by another node), execute the `concat`'s `realloc_if_needed()` and set `allocation_done_by_other` to TRUE. The `concat`'s output memory is then used as the output memory of the current node, and `realloc_if_needed()` is skipped. [(link)](https://github.com/openvinotoolkit/openvino/blob/eea49f3c9e6bba5463460fdc126c2df38a4a5215/src/plugins/intel_gpu/src/graph/primitive_inst.cpp#L390)
2. For better performance, if a *fake aligned shape* is used when executing the kernel (e.g. `fully_connected`), the input and output shapes of `kernel_impl_params` are updated accordingly. A more detailed explanation will be added as a separate section later (TBD). [(link)](https://github.com/openvinotoolkit/openvino/blob/eea49f3c9e6bba5463460fdc126c2df38a4a5215/src/plugins/intel_gpu/src/graph/primitive_inst.cpp#L403)
3. If the node is `input_layout`, `realloc_if_needed()` is skipped because it is assumed to always use external memory. [(link)](https://github.com/openvinotoolkit/openvino/blob/eea49f3c9e6bba5463460fdc126c2df38a4a5215/src/plugins/intel_gpu/src/graph/primitive_inst.cpp#L408)
4. Check whether output memory is already allocated and the requested buffer size does not exceed the current buffer size, and store the result in `can_reuse_buffer`. [(link)](https://github.com/openvinotoolkit/openvino/blob/eea49f3c9e6bba5463460fdc126c2df38a4a5215/src/plugins/intel_gpu/src/graph/primitive_inst.cpp#L421)
5. If the current node is `concat` and both `can_be_optimized()` and `allocation_done_by_other` are TRUE, `realloc_if_needed()` is skipped. [(link)](https://github.com/openvinotoolkit/openvino/blob/eea49f3c9e6bba5463460fdc126c2df38a4a5215/src/plugins/intel_gpu/src/graph/primitive_inst.cpp#L424)
6. `ShapePredictor` predicts a preallocation shape from the current shape and data type, and updates the output layout shape of `kernel_impl_params` accordingly. A more detailed explanation will be added as a separate section later (TBD). [(link)](https://github.com/openvinotoolkit/openvino/blob/eea49f3c9e6bba5463460fdc126c2df38a4a5215/src/plugins/intel_gpu/src/graph/primitive_inst.cpp#L429)
7. If `can_reuse_buffer` is TRUE, the `reused` flag of the output memory is set to TRUE and the output memory is updated with a reinterpreted buffer. [(link)](https://github.com/openvinotoolkit/openvino/blob/eea49f3c9e6bba5463460fdc126c2df38a4a5215/src/plugins/intel_gpu/src/graph/primitive_inst.cpp#L439)
8. If `can_reuse_buffer` is FALSE, reallocate with `allocate_outputs()` to set the output memory and update `max_output_layout_size`. [(link)](https://github.com/openvinotoolkit/openvino/blob/eea49f3c9e6bba5463460fdc126c2df38a4a5215/src/plugins/intel_gpu/src/graph/primitive_inst.cpp#L448)
9. Get the internal buffer layouts from the current `primitive_impl`. [(link)](https://github.com/openvinotoolkit/openvino/blob/eea49f3c9e6bba5463460fdc126c2df38a4a5215/src/plugins/intel_gpu/src/graph/primitive_inst.cpp#L458)
    - If the previously allocated intermediate memory can be reused, it is updated with a reinterpreted buffer.
    - If it cannot be reused, allocate a new buffer through `allocate_internal_buffer()` to replace the existing intermediate memory or add a new one.
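The buffer-reuse logic of steps 4, 7, and 8 can be sketched as follows. `OutputBuffer` and the function names are illustrative stand-ins for the plugin's memory objects:

```cpp
#include <cassert>
#include <cstddef>

struct OutputBuffer {
    std::size_t capacity = 0;  // bytes currently allocated (0 = not allocated)
};

// An existing buffer is reused only if it is already allocated and at least
// as large as the newly requested size.
bool can_reuse_buffer(const OutputBuffer& buf, std::size_t required_bytes) {
    return buf.capacity != 0 && required_bytes <= buf.capacity;
}

// Returns the capacity after the call; reallocation stands in for
// allocate_outputs(), and reuse stands in for reinterpreting the buffer.
std::size_t realloc_if_needed(OutputBuffer& buf, std::size_t required_bytes) {
    if (!can_reuse_buffer(buf, required_bytes))
        buf.capacity = required_bytes;
    return buf.capacity;
}
```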

src/plugins/intel_gpu/docs/gpu_plugin_driver_troubleshooting.md

Lines changed: 28 additions & 8 deletions
@@ -22,16 +22,16 @@ Number of devices 1
 Device OpenCL C Version OpenCL C 3.0
 Device Type GPU
 ```
-## 1. Make sure that you have GPU on your system
+## Make sure that you have GPU on your system
 
 Some Intel® CPUs might not have integrated GPU, so if you want to run OpenVINO on iGPU, go to [ark.intel website](https://ark.intel.com/) and make sure that your CPU has it.
 
-## 2. Make sure that OpenCL® Runtime is installed
+## Make sure that OpenCL® Runtime is installed
 
 OpenCL runtime is a part of the GPU driver on Windows, but on Linux it should be installed separately. For the installation tips, refer to [OpenVINO docs](https://docs.openvino.ai/2026/get-started/install-openvino/install-openvino-linux.html) and [OpenCL Compute Runtime docs](https://github.com/intel/compute-runtime/tree/master/opencl/doc).
 To get the support of Intel® Iris® Xe MAX Graphics with Linux, follow the [driver installation guide](https://dgpu-docs.intel.com/devices/iris-xe-max-graphics/index.html)
 
-## 3. Make sure that user has all required permissions to work with GPU device
+## Make sure that user has all required permissions to work with GPU device
 
 Add the current Linux user to the `video` and `render` group:
 ```
@@ -40,26 +40,46 @@ sudo usermod -a -G render "$(whoami)"
 ```
 Note: The required group depends on the Linux distribution. Adding to both `video` and `render` is a safe option.
 
-## 4. Make sure that iGPU is enabled
+## Make sure that iGPU is enabled
 
 ```
 $ cat /sys/devices/pci0000\:00/0000\:00\:02.0/enable
 1
 ```
 
-## 5. Make sure that "/etc/OpenCL/vendors/intel.icd" contains proper paths to the OpenCL driver
+## Make sure that "/etc/OpenCL/vendors/intel.icd" contains proper paths to the OpenCL driver
 
 ```
 $ cat /etc/OpenCL/vendors/intel.icd
 /usr/lib/x86_64-linux-gnu/intel-opencl/libigdrcl.so
 ```
 Note: path to the runtime lib may vary in different driver versions
 
-## 6. Use LD_DEBUG=libs to trace loaded libraries
+## On Linux, make sure your KMD (kernel-mode driver) is loaded
+
+On Xe2+ platforms, the KMD name is `xe`. On earlier platforms it was `i915`.
+
+Check that the required module is properly loaded:
+```
+$ lsmod | grep -w -e ^xe -e ^i915
+xe                   2723840  0
+i915                 4288512  16
+```
+
+## On Linux, make sure your UMD (user-mode driver) is up-to-date
+
+An old UMD may not work properly on newer HW, so make sure the UMD is up-to-date.
+In the example below, 25.31 means year 2025, week 31.
+```
+$ dpkg -l | grep intel-opencl-icd
+ii  intel-opencl-icd  25.31.34666.3-0  amd64  Intel graphics compute runtime for OpenCL
+```
+
+## Use LD_DEBUG=libs to trace loaded libraries
 
 For more details, see the [OpenCL on Linux](https://github.com/bashbaug/OpenCLPapers/blob/markdown/OpenCLOnLinux.md)
 
-## 7. If you are using dGPU with XMX, ensure that HW_MATMUL feature is recognized
+## If you are using dGPU with XMX, ensure that HW_MATMUL feature is recognized
 
 OpenVINO contains *hello_query_device* sample application: [link](https://docs.openvino.ai/2026/get-started/learn-openvino/openvino-samples/hello-query-device.html)
6585

@@ -71,7 +91,7 @@ $ ./hello_query_device.py
 [ INFO ] OPTIMIZATION_CAPABILITIES: FP32, BIN, FP16, INT8, GPU_HW_MATMUL, GPU_USM_MEMORY
 ```
 
-## 8. If you have errors with OpenCL headers in application build
+## If you have errors with OpenCL headers in application build
 OpenCL headers should be installed in your system to build application using OpenCL objects. OpenVINO source code distribution contains OpenCL headers thirdparty/ocl/cl_headers. Alternatively you can
 install them from [OpenCL Git](https://github.com/KhronosGroup/OpenCL-Headers). To ensure compatibility, make sure that the installed version of OpenCL headers had been released before the OpenVINO version you are using.

src/plugins/intel_npu/src/plugin/npuw/embedding_infer_request.cpp

Lines changed: 1 addition & 1 deletion
@@ -1,4 +1,4 @@
-// Copyright (C) 2025 Intel Corporation
+// Copyright (C) 2018-2026 Intel Corporation
 // SPDX-License-Identifier: Apache-2.0
 //

src/plugins/intel_npu/src/plugin/npuw/embedding_infer_request.hpp

Lines changed: 1 addition & 1 deletion
@@ -1,4 +1,4 @@
-// Copyright (C) 2025 Intel Corporation
+// Copyright (C) 2018-2026 Intel Corporation
 // SPDX-License-Identifier: Apache-2.0
 //

src/plugins/intel_npu/src/plugin/npuw/embedding_model_utils.cpp

Lines changed: 4 additions & 3 deletions
@@ -370,9 +370,10 @@ class ReConstructEmbeddingModel : public ov::pass::ModelPass {
 
 } // namespace
 
-void ov::npuw::util::prepare_text_embedding_model(std::shared_ptr<ov::Model> model, uint32_t seq_len_dim) {
+bool ov::npuw::util::PrepareTextEmbeddingModel::run_on_model(const std::shared_ptr<ov::Model>& model) {
     ov::pass::Manager manager("prepare-embedding");
     manager.set_per_pass_validation(true);
-    manager.register_pass<ReConstructEmbeddingModel>(seq_len_dim);
-    manager.run_passes(model);
+    manager.register_pass<ReConstructEmbeddingModel>(m_seq_len_dim);
+
+    return manager.run_passes(model);
 }
Lines changed: 11 additions & 2 deletions
@@ -1,4 +1,4 @@
-// Copyright (C) 2025 Intel Corporation
+// Copyright (C) 2018-2026 Intel Corporation
 // SPDX-License-Identifier: Apache-2.0
 //

@@ -8,6 +8,15 @@
 
 namespace ov::npuw::util {
 
-void prepare_text_embedding_model(std::shared_ptr<ov::Model> model, uint32_t seq_len_dim);
+class PrepareTextEmbeddingModel : public ov::pass::ModelPass {
+    uint32_t m_seq_len_dim;
+
+public:
+    OPENVINO_MODEL_PASS_RTTI("ov::npuw::PrepareTextEmbeddingModel");
+
+    explicit PrepareTextEmbeddingModel(uint32_t seq_len_dim) : m_seq_len_dim(seq_len_dim) {}
+
+    bool run_on_model(const std::shared_ptr<ov::Model>& model) override;
+};
 
 } // namespace ov::npuw::util
