
Commit 2aee05c

Merge branch 'master' of https://github.com/openvinotoolkit/openvino into mock_npuw_compiled_model

2 parents fc7a534 + e0ab225

17 files changed (+764 -399 lines)
Lines changed: 28 additions & 0 deletions
@@ -0,0 +1,28 @@
# Asynchronous Compilation for Dynamic Models

## Motivation

When the input shape of any layer changes, a new static kernel must be compiled because the previously built kernel is no longer valid. If inference requests must wait for this compilation to complete, latency increases by the kernel compile time, degrading overall throughput. Asynchronous kernel compilation solves this problem by decoupling kernel compilation from inference execution.

## Overall Workflow

<!-- flowchart TD
A[Start Network Loading] -- > B(Build dynamic kernels)
B -- > C[Start Inferencing]
C -- > D{Is Input Shape Changed?
or Is current impl dynamic?}
D -- > |Yes| G{Does this primitive have a cached impl?}
G -- > |Yes| I(Load pre-built impl from impl cache)
I -- > F
G -- > |No| H(Trigger a new kernel compilation task
Load dynamic kernel from the impl cache)
D -- > |No| F(Execution)
H -- > F -->

<img src="async_compilation.PNG" alt="async compilation overall workflow" width=500>

The diagram above illustrates the overall async compilation flow. During network loading, dynamic kernels are selected and stored in the impl cache. At inference time, if the input shape of a primitive changes or its current implementation is dynamic, the runtime checks the impl cache for a pre-built implementation matching the new shape. On a cache hit, the pre-built implementation is used directly. On a cache miss, a new static kernel compilation task is dispatched in the background, and the dynamic kernel handles execution in the meantime to avoid stalling inference.
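The cache-or-compile decision described above can be sketched with standard C++ only. `ShapeKey`, `Impl`, and `ImplCache` are invented names for illustration; the actual plugin keys its in-memory cache by `kernel_impl_params` and stores `primitive_impl` objects:

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <future>
#include <map>
#include <thread>
#include <vector>

// Hypothetical stand-ins for the plugin's cache key and implementation types.
using ShapeKey = std::vector<int64_t>;

struct Impl {
    bool is_dynamic;  // true for the shape-agnostic fallback kernel
};

class ImplCache {
public:
    // Cache hit: return the pre-built static impl. Cache miss: dispatch a
    // background compilation (at most once per shape) and return the dynamic
    // fallback so inference is not stalled.
    const Impl& get_or_compile_async(const ShapeKey& shape) {
        drain_finished();  // harvest any compilations that completed
        auto hit = static_impls_.find(shape);
        if (hit != static_impls_.end())
            return hit->second;
        if (pending_.find(shape) == pending_.end()) {
            pending_.emplace(shape, std::async(std::launch::async, [] {
                return Impl{/*is_dynamic=*/false};  // compile a static kernel
            }));
        }
        return dynamic_impl_;  // execute with the dynamic kernel meanwhile
    }

private:
    // Move finished compilation results into the static impl cache.
    void drain_finished() {
        for (auto it = pending_.begin(); it != pending_.end();) {
            if (it->second.wait_for(std::chrono::seconds(0)) == std::future_status::ready) {
                static_impls_.emplace(it->first, it->second.get());
                it = pending_.erase(it);
            } else {
                ++it;
            }
        }
    }

    Impl dynamic_impl_{/*is_dynamic=*/true};
    std::map<ShapeKey, Impl> static_impls_;
    std::map<ShapeKey, std::future<Impl>> pending_;
};
```

Here `std::async` stands in for the plugin's compilation task executor; the real implementation also synchronizes cache access across threads.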

## Prioritized Asynchronous Kernel Compilation

Kernel compilation tasks for new input shapes run in threads separate from the inference thread. Without prioritization, less critical kernels could be compiled before performance-critical ones, reducing overall throughput. To address this, async compilation is restricted to four performance-critical primitives: convolution, fully-connected, GEMM, and softmax.
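That restriction can be expressed as a simple predicate over the primitive type. The strings below are placeholders for illustration, not the plugin's actual type identifiers:

```cpp
#include <cassert>
#include <set>
#include <string>

// Illustrative predicate: only these primitive types get async compilation
// tasks, mirroring the restriction described above.
inline bool is_async_compile_target(const std::string& primitive_type) {
    static const std::set<std::string> kTargets = {
        "convolution", "fully_connected", "gemm", "softmax"};
    return kTargets.count(primitive_type) != 0;
}
```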
Lines changed: 61 additions & 0 deletions
@@ -0,0 +1,61 @@
# Preprocessing for dynamic shape execution

As explained in the basic flow of primitive execution for dynamic shape in [Overall flow for dynamic shape](overall_flow.md), several preprocessing steps are performed before setting arguments to the kernel and executing the selected impl.

* `update_shape` - when the input shape changes, calculate the output shape and perform shape inference so that the shape is propagated to the next node.
* `update_impl` - depending on the changed shape, `primitive_impl` is retrieved from the in-memory cache or a new impl is selected.
* `realloc_if_needed` - allocates new output memory if necessary.

The following describes some representative preprocessing steps for dynamic shape execution.

## primitive_inst::update_shape
### Dynamic shape inference
To support dynamic shape in the GPU plugin, `cldnn::layout` uses `ov::PartialShape` to express shape. The existing `cldnn::tensor` does not support dynamic shape and is limited in rank, whereas `ov::PartialShape` supports both static and dynamic dimensions with no rank limitation. When creating a `cldnn::primitive` from an `ov::op`, the `ov::PartialShape` that the `ov::op` already holds is used directly.

> **Note**: In the existing static shape execution flow of the GPU plugin, the shape of an `ov::op` may be transformed into `ov::tensor` and used, so `cldnn::primitive` creation is separated from the dynamic shape execution flow. When building a `cldnn::program`, if at least one node is dynamic, the `ov::intel_gpu::allow_new_shape_infer` property is set [(link)](https://github.com/openvinotoolkit/openvino/blob/eea49f3c9e6bba5463460fdc126c2df38a4a5215/src/plugins/intel_gpu/src/plugin/program_builder.cpp#L139), and this property separates static shape and dynamic shape execution during `cldnn::primitive` creation. The two paths will be unified once the GPU plugin fully supports dynamic shape.

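As a rough illustration of what `ov::PartialShape` provides over the fixed-rank static `cldnn::tensor`, the toy types below model a shape whose dimensions may be static or dynamic. This is a conceptual stand-in only; the real API is `ov::PartialShape` built from `ov::Dimension`, which also supports bounded intervals:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy model of partial-shape semantics: every dimension is either a static
// extent or dynamic (-1), and the rank is unrestricted.
struct Dim {
    int64_t value;  // -1 marks a dynamic dimension
    bool is_dynamic() const { return value < 0; }
};

struct MiniPartialShape {
    std::vector<Dim> dims;
    // A shape is static only if every dimension is static.
    bool is_static() const {
        for (const auto& d : dims)
            if (d.is_dynamic())
                return false;
        return true;
    }
    std::size_t rank() const { return dims.size(); }
};
```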
When the input shape of the model changes, the input shape of the current primitive is updated by checking whether it has changed, the output shape is calculated from the input shape, and this shape is propagated to the next primitive during the shape inference stage.
The details of how `primitive_inst::update_shape` performs shape inference when executing a primitive in the GPU plugin for dynamic shape are as follows:

1. In the basic primitive execution flow, a runtime optimization stage (i.e. `primitive_inst::do_runtime_in_place_concat` [(link)](https://github.com/openvinotoolkit/openvino/blob/eea49f3c9e6bba5463460fdc126c2df38a4a5215/src/plugins/intel_gpu/src/graph/primitive_inst.cpp#L720)) runs before `update_shape()`. If `update_shape()` has already been executed there by another primitive, `update_shape_done_by_other` is set to TRUE, and `update_shape()` is therefore skipped. [(link)](https://github.com/openvinotoolkit/openvino/blob/eea49f3c9e6bba5463460fdc126c2df38a4a5215/src/plugins/intel_gpu/src/graph/primitive_inst.cpp#L247)
2. First, the output layouts of `kernel_impl_params` from the dependencies of `primitive_inst` are compared with the input layouts of `kernel_impl_params` of the current primitive. If they differ, the changed shapes are written to the input layouts of `kernel_impl_params`. [(link)](https://github.com/openvinotoolkit/openvino/blob/eea49f3c9e6bba5463460fdc126c2df38a4a5215/src/plugins/intel_gpu/src/graph/primitive_inst.cpp#L254)
3. `_shape_changed` is set to TRUE if the input shape has changed. [(link)](https://github.com/openvinotoolkit/openvino/blob/eea49f3c9e6bba5463460fdc126c2df38a4a5215/src/plugins/intel_gpu/src/graph/primitive_inst.cpp#L266)
4. If the current node is `shape_of` and the input shape has not changed, `_shape_changed` is reset to FALSE and `update_shape()` is skipped. [(link)](https://github.com/openvinotoolkit/openvino/blob/eea49f3c9e6bba5463460fdc126c2df38a4a5215/src/plugins/intel_gpu/src/graph/primitive_inst.cpp#L270)
5. If the current node is part of a *`shape_of` subgraph*, check the *dependent `shape_of` primitives* and skip `update_shape()` if the shape has not changed. [(link)](https://github.com/openvinotoolkit/openvino/blob/eea49f3c9e6bba5463460fdc126c2df38a4a5215/src/plugins/intel_gpu/src/graph/primitive_inst.cpp#L276)
6. `update_shape()` is skipped if any of the following conditions hold: the input shape has not changed, the node generates dynamic output (e.g. `Nonzero`, `Unique`), or the output layouts of `kernel_impl_params` are already static. [(link)](https://github.com/openvinotoolkit/openvino/blob/eea49f3c9e6bba5463460fdc126c2df38a4a5215/src/plugins/intel_gpu/src/graph/primitive_inst.cpp#L313)
7. In static shape execution, data for additional inputs that determine the output shape are set as attributes when creating `cldnn::primitive`. In dynamic shape execution, if that data is stored in the output memory of a preceding node, execution waits until those dependent nodes complete. To determine which input nodes have memory dependencies, most `program_node`s define `get_shape_infer_dependencies()`. The dependency information (index and memory for each dependent input node) is collected from the current node, stored in a `map`, and the corresponding primitive events are added to an event list to await completion. Finally, the populated map is saved in `memory_deps` of `kernel_impl_params`. [(link)](https://github.com/openvinotoolkit/openvino/blob/eea49f3c9e6bba5463460fdc126c2df38a4a5215/src/plugins/intel_gpu/src/graph/primitive_inst.cpp#L319)
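The dependency collection in step 7 can be sketched roughly as follows. The types and names are simplified stand-ins, not the plugin's actual classes, and the event waiting is omitted:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <set>
#include <vector>

// Stand-in for a dependency's output memory, e.g. a shape tensor produced
// upstream whose *contents* feed shape inference.
struct Memory {
    std::vector<int64_t> data;
};

struct Node {
    std::vector<const Memory*> dep_outputs;     // output memory of each dependency
    std::set<std::size_t> shape_infer_dependencies;  // cf. get_shape_infer_dependencies()
};

// Collect the memory of every input whose contents are needed for shape
// inference, keyed by input index, like kernel_impl_params::memory_deps.
std::map<std::size_t, const Memory*> collect_memory_deps(const Node& node) {
    std::map<std::size_t, const Memory*> memory_deps;
    for (std::size_t idx : node.shape_infer_dependencies) {
        if (idx < node.dep_outputs.size())
            memory_deps.emplace(idx, node.dep_outputs[idx]);
    }
    return memory_deps;
}
```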
8. There are two APIs for output shape calculation on `program_node`: `calc_output_layout()` for static shape execution and `calc_output_layouts()` for dynamic shape execution. In this step, `calc_output_layouts()` is called; it invokes the `shape_infer()` API of `ov::op` with the updated input layouts from `kernel_impl_params`, the primitive's attributes, and `memory_deps`, and returns the output layouts as a vector. The newly calculated output layout is then written back to `output_layouts` in `kernel_impl_params`. [(link)](https://github.com/openvinotoolkit/openvino/blob/eea49f3c9e6bba5463460fdc126c2df38a4a5215/src/plugins/intel_gpu/src/graph/primitive_inst.cpp#L366)
```cpp
struct program_node {
    ...
public:
    layout calc_output_layout() const;
    std::vector<layout> calc_output_layouts() const;
};
```
9. If there is a fused operation in `kernel_impl_params`, the output layout of its descriptor is also updated with the `ov::PartialShape` of the updated output layout. [(link)](https://github.com/openvinotoolkit/openvino/blob/eea49f3c9e6bba5463460fdc126c2df38a4a5215/src/plugins/intel_gpu/src/graph/primitive_inst.cpp#L379)

## primitive_inst::update_weights
If `primitive_impl` is created or updated through `update_impl()` and it belongs to a weightable node (e.g. `convolution`, `deconvolution`, `fc`), the weights should be reordered to the layout required by the kernel as needed. The following describes the processes performed in `update_weights()`.

1. If the impl is nullptr or the current node is not a weightable node, `update_weights()` is skipped. [(link)](https://github.com/openvinotoolkit/openvino/blob/eea49f3c9e6bba5463460fdc126c2df38a4a5215/src/plugins/intel_gpu/src/graph/primitive_inst.cpp#L1168)
2. Create *reorder kernel params* (i.e. `kernel_impl_params` for weights reorder) from the `WeightsReorderParams` of `primitive_inst`. [(link)](https://github.com/openvinotoolkit/openvino/blob/eea49f3c9e6bba5463460fdc126c2df38a4a5215/src/plugins/intel_gpu/src/graph/primitive_inst.cpp#L1172)
3. If weights reorder is not necessary but the weights were previously reordered, an incorrect memory buffer would be used, so the *reordered weights cache* is reset to the original weights memory layout. [(link)](https://github.com/openvinotoolkit/openvino/blob/eea49f3c9e6bba5463460fdc126c2df38a4a5215/src/plugins/intel_gpu/src/graph/primitive_inst.cpp#L1181)
4. If weights reorder is necessary, update the weights layout of `kernel_impl_params` to the output layout of the *reorder kernel params*. This is the expected layout. [(link)](https://github.com/openvinotoolkit/openvino/blob/eea49f3c9e6bba5463460fdc126c2df38a4a5215/src/plugins/intel_gpu/src/graph/primitive_inst.cpp#L1186)
    - If the expected layout hits the *reordered weights cache*, it is reused.
    - If the expected layout is compatible with the original layout, the original weights memory is reinterpreted and added to the *reordered weights cache* without reordering.
    - If the expected layout misses the *reordered weights cache*, retrieve a cached reorder impl from the *implementations cache* using the *reorder kernel params*, or create a new reorder impl through `WeightsReordersFactory`, set the compiled kernel on it, and add the impl to the *implementations cache*. Then check whether the weights memory in the *reordered weights cache* can be reused; if so, reuse it, otherwise allocate a new buffer and update the *reordered weights cache* accordingly. Finally, use `kernel_arguments_data()` to set the kernel arguments on the reorder impl and execute the kernel.
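The layout decision in step 4 above can be sketched as a small decision function. Layouts are modeled here as plain strings and compatibility as equality, which is a deliberate simplification of the plugin's `cldnn::layout` comparison; the enum and function names are illustrative:

```cpp
#include <cassert>
#include <map>
#include <string>

enum class WeightsAction { ReuseCached, ReinterpretOriginal, RunReorder };

// Decide how to obtain weights in the expected layout: reuse a cached
// reordered copy, reinterpret the original memory, or run a reorder kernel.
WeightsAction choose_weights_action(const std::string& expected_layout,
                                    const std::string& original_layout,
                                    const std::map<std::string, int>& reordered_cache) {
    if (reordered_cache.count(expected_layout))
        return WeightsAction::ReuseCached;          // cache hit: reuse as-is
    if (expected_layout == original_layout)
        return WeightsAction::ReinterpretOriginal;  // compatible: no reorder kernel
    return WeightsAction::RunReorder;               // fetch/compile and run a reorder
}
```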

## primitive_inst::realloc_if_needed
In static shape execution, output memory is allocated when `primitive_inst` is created, but in dynamic shape execution output memory is allocated before the kernel arguments are set and the kernel is executed. The following describes the processes performed in `realloc_if_needed()`.

1. If the current node is `concat` with one user, `can_be_optimized()` is TRUE, and `allocation_done_by_other` is FALSE (i.e. not yet allocated by another node), execute the `concat`'s `realloc_if_needed()` and set `allocation_done_by_other` to TRUE. The `concat`'s output memory is then used as the output memory of the current node, and `realloc_if_needed()` is skipped. [(link)](https://github.com/openvinotoolkit/openvino/blob/eea49f3c9e6bba5463460fdc126c2df38a4a5215/src/plugins/intel_gpu/src/graph/primitive_inst.cpp#L390)
2. For better performance, if a *fake aligned shape* is used when executing the kernel (e.g. `fully_connected`), the input and output shapes of `kernel_impl_params` are updated accordingly. A more detailed explanation will be added as a separate section later (TBD). [(link)](https://github.com/openvinotoolkit/openvino/blob/eea49f3c9e6bba5463460fdc126c2df38a4a5215/src/plugins/intel_gpu/src/graph/primitive_inst.cpp#L403)
3. If the node is `input_layout`, `realloc_if_needed()` is skipped because it is assumed to always use external memory. [(link)](https://github.com/openvinotoolkit/openvino/blob/eea49f3c9e6bba5463460fdc126c2df38a4a5215/src/plugins/intel_gpu/src/graph/primitive_inst.cpp#L408)
4. Check whether output memory is already allocated and the requested buffer size does not exceed the current buffer size, and store the result in `can_reuse_buffer`. [(link)](https://github.com/openvinotoolkit/openvino/blob/eea49f3c9e6bba5463460fdc126c2df38a4a5215/src/plugins/intel_gpu/src/graph/primitive_inst.cpp#L421)
5. If the current node is `concat` and both `can_be_optimized()` and `allocation_done_by_other` are TRUE, `realloc_if_needed()` is skipped. [(link)](https://github.com/openvinotoolkit/openvino/blob/eea49f3c9e6bba5463460fdc126c2df38a4a5215/src/plugins/intel_gpu/src/graph/primitive_inst.cpp#L424)
6. `ShapePredictor` predicts a preallocation shape from the current shape and data type, and updates the output layout shape of `kernel_impl_params` accordingly. A more detailed explanation will be added as a separate section later (TBD). [(link)](https://github.com/openvinotoolkit/openvino/blob/eea49f3c9e6bba5463460fdc126c2df38a4a5215/src/plugins/intel_gpu/src/graph/primitive_inst.cpp#L429)
7. If `can_reuse_buffer` is TRUE, the `reused` flag of the output memory is set to TRUE and the output memory is updated with a reinterpreted buffer. [(link)](https://github.com/openvinotoolkit/openvino/blob/eea49f3c9e6bba5463460fdc126c2df38a4a5215/src/plugins/intel_gpu/src/graph/primitive_inst.cpp#L439)
8. If `can_reuse_buffer` is FALSE, reallocate with `allocate_outputs()` to set the output memory and update `max_output_layout_size`. [(link)](https://github.com/openvinotoolkit/openvino/blob/eea49f3c9e6bba5463460fdc126c2df38a4a5215/src/plugins/intel_gpu/src/graph/primitive_inst.cpp#L448)
9. Get the internal buffer layouts from the current `primitive_impl`. [(link)](https://github.com/openvinotoolkit/openvino/blob/eea49f3c9e6bba5463460fdc126c2df38a4a5215/src/plugins/intel_gpu/src/graph/primitive_inst.cpp#L458)
    - If the previously allocated intermediate memory can be reused, it is updated with a reinterpreted buffer.
    - If it cannot be reused, allocate a new buffer through `allocate_internal_buffer()` to replace the existing intermediate memory or add a new one.
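The buffer-reuse logic of steps 4, 7, and 8 can be sketched as follows. `OutputBuffer` and the function names are illustrative stand-ins for the plugin's memory objects:

```cpp
#include <cassert>
#include <cstddef>

struct OutputBuffer {
    std::size_t capacity = 0;  // bytes currently allocated (0 = not allocated)
};

// An existing buffer is reused only if it is already allocated and at least
// as large as the newly requested size.
bool can_reuse_buffer(const OutputBuffer& buf, std::size_t required_bytes) {
    return buf.capacity != 0 && required_bytes <= buf.capacity;
}

// Returns the capacity after the call; reallocation stands in for
// allocate_outputs(), and reuse stands in for reinterpreting the buffer.
std::size_t realloc_if_needed(OutputBuffer& buf, std::size_t required_bytes) {
    if (!can_reuse_buffer(buf, required_bytes))
        buf.capacity = required_bytes;
    return buf.capacity;
}
```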

src/plugins/intel_gpu/docs/gpu_plugin_driver_troubleshooting.md

Lines changed: 28 additions & 8 deletions
@@ -22,16 +22,16 @@ Number of devices 1
 Device OpenCL C Version OpenCL C 3.0
 Device Type GPU
 ```
-## 1. Make sure that you have GPU on your system
+## Make sure that you have GPU on your system
 
 Some Intel® CPUs might not have integrated GPU, so if you want to run OpenVINO on iGPU, go to [ark.intel website](https://ark.intel.com/) and make sure that your CPU has it.
 
-## 2. Make sure that OpenCL® Runtime is installed
+## Make sure that OpenCL® Runtime is installed
 
 OpenCL runtime is a part of the GPU driver on Windows, but on Linux it should be installed separately. For the installation tips, refer to [OpenVINO docs](https://docs.openvino.ai/2026/get-started/install-openvino/install-openvino-linux.html) and [OpenCL Compute Runtime docs](https://github.com/intel/compute-runtime/tree/master/opencl/doc).
 To get the support of Intel® Iris® Xe MAX Graphics with Linux, follow the [driver installation guide](https://dgpu-docs.intel.com/devices/iris-xe-max-graphics/index.html)
 
-## 3. Make sure that user has all required permissions to work with GPU device
+## Make sure that user has all required permissions to work with GPU device
 
 Add the current Linux user to the `video` and `render` group:
 ```
@@ -40,26 +40,46 @@ sudo usermod -a -G render "$(whoami)"
 ```
 Note: The required group depends on the Linux distribution. Adding to both `video` and `render` is a safe option.
 
-## 4. Make sure that iGPU is enabled
+## Make sure that iGPU is enabled
 
 ```
 $ cat /sys/devices/pci0000\:00/0000\:00\:02.0/enable
 1
 ```
 
-## 5. Make sure that "/etc/OpenCL/vendors/intel.icd" contains proper paths to the OpenCL driver
+## Make sure that "/etc/OpenCL/vendors/intel.icd" contains proper paths to the OpenCL driver
 
 ```
 $ cat /etc/OpenCL/vendors/intel.icd
 /usr/lib/x86_64-linux-gnu/intel-opencl/libigdrcl.so
 ```
 Note: path to the runtime lib may vary in different driver versions
 
-## 6. Use LD_DEBUG=libs to trace loaded libraries
+## On Linux, make sure your KMD (kernel-mode driver) is loaded
+
+On Xe2+ platforms, the KMD name is `xe`. On earlier platforms it was `i915`.
+
+Check that the required module is properly loaded:
+```
+$ lsmod | grep -w -e ^xe -e ^i915
+xe                   2723840  0
+i915                 4288512  16
+```
+
+## On Linux, make sure your UMD (user-mode driver) is up-to-date
+
+An old UMD may not work properly on newer HW, so make sure the UMD is up-to-date.
+In the example below, 25.31 means year 2025, week 31.
+```
+$ dpkg -l | grep intel-opencl-icd
+ii  intel-opencl-icd  25.31.34666.3-0  amd64  Intel graphics compute runtime for OpenCL
+```
+
+## Use LD_DEBUG=libs to trace loaded libraries
 
 For more details, see the [OpenCL on Linux](https://github.com/bashbaug/OpenCLPapers/blob/markdown/OpenCLOnLinux.md)
 
-## 7. If you are using dGPU with XMX, ensure that HW_MATMUL feature is recognized
+## If you are using dGPU with XMX, ensure that HW_MATMUL feature is recognized
 
 OpenVINO contains *hello_query_device* sample application: [link](https://docs.openvino.ai/2026/get-started/learn-openvino/openvino-samples/hello-query-device.html)
6585

@@ -71,7 +91,7 @@ $ ./hello_query_device.py
 [ INFO ] OPTIMIZATION_CAPABILITIES: FP32, BIN, FP16, INT8, GPU_HW_MATMUL, GPU_USM_MEMORY
 ```
 
-## 8. If you have errors with OpenCL headers in application build
+## If you have errors with OpenCL headers in application build
 OpenCL headers should be installed in your system to build application using OpenCL objects. OpenVINO source code distribution contains OpenCL headers thirdparty/ocl/cl_headers. Alternatively you can
 install them from [OpenCL Git](https://github.com/KhronosGroup/OpenCL-Headers). To ensure compatibility, make sure that the installed version of OpenCL headers had been released before the OpenVINO version you are using.

src/plugins/intel_npu/src/plugin/npuw/embedding_infer_request.cpp

Lines changed: 1 addition & 1 deletion
@@ -1,4 +1,4 @@
-// Copyright (C) 2025 Intel Corporation
+// Copyright (C) 2018-2026 Intel Corporation
 // SPDX-License-Identifier: Apache-2.0
 //

src/plugins/intel_npu/src/plugin/npuw/embedding_infer_request.hpp

Lines changed: 1 addition & 1 deletion
@@ -1,4 +1,4 @@
-// Copyright (C) 2025 Intel Corporation
+// Copyright (C) 2018-2026 Intel Corporation
 // SPDX-License-Identifier: Apache-2.0
 //

src/plugins/intel_npu/src/plugin/npuw/embedding_model_utils.cpp

Lines changed: 4 additions & 3 deletions
@@ -370,9 +370,10 @@ class ReConstructEmbeddingModel : public ov::pass::ModelPass {
 
 } // namespace
 
-void ov::npuw::util::prepare_text_embedding_model(std::shared_ptr<ov::Model> model, uint32_t seq_len_dim) {
+bool ov::npuw::util::PrepareTextEmbeddingModel::run_on_model(const std::shared_ptr<ov::Model>& model) {
     ov::pass::Manager manager("prepare-embedding");
     manager.set_per_pass_validation(true);
-    manager.register_pass<ReConstructEmbeddingModel>(seq_len_dim);
-    manager.run_passes(model);
+    manager.register_pass<ReConstructEmbeddingModel>(m_seq_len_dim);
+
+    return manager.run_passes(model);
 }
Lines changed: 11 additions & 2 deletions
@@ -1,4 +1,4 @@
-// Copyright (C) 2025 Intel Corporation
+// Copyright (C) 2018-2026 Intel Corporation
 // SPDX-License-Identifier: Apache-2.0
 //

@@ -8,6 +8,15 @@
 
 namespace ov::npuw::util {
 
-void prepare_text_embedding_model(std::shared_ptr<ov::Model> model, uint32_t seq_len_dim);
+class PrepareTextEmbeddingModel : public ov::pass::ModelPass {
+    uint32_t m_seq_len_dim;
+
+public:
+    OPENVINO_MODEL_PASS_RTTI("ov::npuw::PrepareTextEmbeddingModel");
+
+    explicit PrepareTextEmbeddingModel(uint32_t seq_len_dim) : m_seq_len_dim(seq_len_dim) {}
+
+    bool run_on_model(const std::shared_ptr<ov::Model>& model) override;
+};
 
 } // namespace ov::npuw::util
