[GPU] Add bf16 data type support for models traced in bfloat16#34714

Open
jpatrickiles-dev wants to merge 1 commit into openvinotoolkit:master from jpatrickiles-dev:fix/gpu-bf16-op-support
Conversation

@jpatrickiles-dev

Title: [GPU] Add bf16 data type support for Slice, VariadicSplit and other ops

Details:

Models exported in bfloat16 (e.g. Qwen3.5 GatedDeltaNet hybrid models) fail at runtime on Intel Arc Xe-LPG and other GPU devices with:
No layout format available for variadicsplit/slice, impl_type: any (format: bfyx, data_type: bf16)
The GPU plugin's kernel selectors and OCL impl registrations lacked bf16 in their supported data types for multiple op types.
Root cause: 12 kernel selectors and 11 OCL impl registrations were missing Datatype::BF16 / data_types::bf16. Additionally, the ConvertPrecision(bf16→f16) pass ran too early in the transformations pipeline, leaving residual bf16 tensors that had no layout implementation.

Architecture context:

Qwen3.5 models use a hybrid GatedDeltaNet + full attention architecture where the GatedDeltaNet layers are traced in bfloat16. This is the first widely-used model family to expose this gap in the GPU plugin's bf16 op coverage. Previously, most models either used f16/f32 throughout or only used bf16 in matmul ops (which were already supported via XMX). The GatedDeltaNet's projection splitting and conv state operations introduce bf16 Slice, VariadicSplit, and related ops that had no registered layout format.

Fix:

Added Datatype::BF16 to kernel selectors for: slice, strided_slice, eltwise, activation, concatenation, reduce, gather, select, convolution, gemm
Added data_types::bf16 to OCL impl registrations for: slice, strided_slice, crop, eltwise, activation, concatenation, reduce, gather, select, gemm, convolution
Added a final bf16→f16 ConvertPrecision cleanup pass at the end of transformations_pipeline.cpp to catch any bf16 that survives earlier passes

Tickets:

N/A

Tested on:

Intel Arc Xe-LPG (Meteor Lake), kernel 6.19.4, Ubuntu 24.04, OpenVINO 2026.1. Validated with Qwen3.5-0.8B, 4B, and 9B INT4 models on GPU. No regression on existing Qwen3-8B INT4 workloads.

AI Assistance:

Yes. Claude was used for root-cause analysis and fix development. Human validation: rebuilt the GPU plugin locally and verified that Qwen3.5 0.8B/4B/9B all produce coherent output on Intel Arc Xe-LPG.

Models traced in bf16 (e.g., Qwen3.5 from transformers 5.x) fail on GPU
devices that support f16 but not bf16 (Intel Arc Xe-LPG / Meteor Lake).
The GPU plugin's ConvertPrecision pass maps bf16→f16 but
KeepConstantsPrecisionAndAddConverts preserves bf16 constants feeding
MatMul, causing bf16 to persist in the compiled graph.

Two-part fix:

1. Add bf16 (Datatype::BF16 / data_types::bf16) to kernel selectors and
   OCL impl registrations for: slice, strided_slice, crop (variadic_split),
   eltwise (multiply/add/divide), activation (swish/sigmoid/sqrt),
   concatenation, reduce, gather, select, convolution, gemm.
   This enables the GPU to handle any bf16 tensors that survive precision
   conversion passes.

2. Add a final bf16→f16 ConvertPrecision cleanup pass at the end of the
   GPU transformation pipeline. This catches bf16 that survives earlier
   passes due to KeepConstantsPrecision and store_original_precision
   interactions, converting it to f16 which the GPU natively supports.
   The cleanup pass uses convert_input_output_precision=true to ensure
   complete bf16 elimination.

Tested: Qwen3.5-0.8B INT4 model now runs correctly on Intel Arc Xe-LPG
(Meteor Lake iGPU) producing coherent text output.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@jpatrickiles-dev jpatrickiles-dev requested review from a team as code owners March 16, 2026 02:20
@github-actions github-actions bot added the category: GPU OpenVINO GPU plugin label Mar 16, 2026
@sys-openvino-ci sys-openvino-ci added the ExternalPR External contributor label Mar 16, 2026