[GPU] Add bf16 data type support for models traced in bfloat16 #34714
Open
jpatrickiles-dev wants to merge 1 commit into openvinotoolkit:master from
Conversation
Models traced in bf16 (e.g., Qwen3.5 from transformers 5.x) fail on GPU devices that support f16 but not bf16 (Intel Arc Xe-LPG / Meteor Lake). The GPU plugin's ConvertPrecision pass maps bf16→f16, but KeepConstantsPrecisionAndAddConverts preserves bf16 constants feeding MatMul, causing bf16 to persist in the compiled graph.

Two-part fix:

1. Add bf16 (Datatype::BF16 / data_types::bf16) to the kernel selectors and OCL impl registrations for: slice, strided_slice, crop (variadic_split), eltwise (multiply/add/divide), activation (swish/sigmoid/sqrt), concatenation, reduce, gather, select, convolution, gemm. This enables the GPU to handle any bf16 tensors that survive the precision conversion passes.

2. Add a final bf16→f16 ConvertPrecision cleanup pass at the end of the GPU transformation pipeline. This catches bf16 that survives earlier passes due to KeepConstantsPrecision and store_original_precision interactions, converting it to f16, which the GPU natively supports. The cleanup pass uses convert_input_output_precision=true to ensure complete bf16 elimination.

Tested: the Qwen3.5-0.8B INT4 model now runs correctly on Intel Arc Xe-LPG (Meteor Lake iGPU), producing coherent text output.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Title: [GPU] Add bf16 data type support for Slice, VariadicSplit and other ops
Details:
Models exported in bfloat16 (e.g. Qwen3.5 GatedDeltaNet hybrid models) fail at runtime on Intel Arc Xe-LPG and other GPU devices without native bf16 support:
No layout format available for variadicsplit/slice, impl_type: any (format: bfyx, data_type: bf16)
The GPU plugin's kernel selectors and OCL impl registrations lacked bf16 in their supported data types for multiple op types.
Root cause: 12 kernel selectors and 11 OCL impl registrations were missing Datatype::BF16 / data_types::bf16. Additionally, the ConvertPrecision(bf16→f16) pass ran too early in the transformations pipeline, leaving residual bf16 tensors that had no layout implementation.
Architecture context:
Qwen3.5 models use a hybrid GatedDeltaNet + full attention architecture where the GatedDeltaNet layers are traced in bfloat16. This is the first widely-used model family to expose this gap in the GPU plugin's bf16 op coverage. Previously, most models either used f16/f32 throughout or only used bf16 in matmul ops (which were already supported via XMX). The GatedDeltaNet's projection splitting and conv state operations introduce bf16 Slice, VariadicSplit, and related ops that had no registered layout format.
Fix:
Added Datatype::BF16 to kernel selectors for: slice, strided_slice, eltwise, activation, concatenation, reduce, gather, select, convolution, gemm
Added data_types::bf16 to OCL impl registrations for: slice, strided_slice, crop, eltwise, activation, concatenation, reduce, gather, select, gemm, convolution
Added a final bf16→f16 ConvertPrecision cleanup pass at the end of transformations_pipeline.cpp to catch any bf16 that survives earlier passes
Tickets:
N/A
Tested on:
Intel Arc Xe-LPG (Meteor Lake), kernel 6.19.4, Ubuntu 24.04, OpenVINO 2026.1. Validated with Qwen3.5-0.8B, 4B, and 9B INT4 models on GPU. No regression on existing Qwen3-8B INT4 workloads.
AI Assistance:
Yes. Claude was used for root-cause analysis and fix development. Human validation: rebuilt the GPU plugin locally and verified that Qwen3.5 0.8B/4B/9B all produce coherent output on Intel Arc Xe-LPG.