
benchdnn: add per-type NaN-safe mask to gpu_fill_random #4851

Draft
kwieloch-intel wants to merge 2 commits into uxlfoundation:main from kwieloch-intel:philox-mask

Conversation

@kwieloch-intel (Contributor) commented Mar 18, 2026

This PR modifies the logic in the fill_random.cl kernel to use custom per-type masks instead of a single universal 0xEEEEEEEEu mask.

JIRA: MFDNN-14789


Problem

The existing gpu_fill_random kernel applied a hardcoded universal 0xEEEEEEEE mask to the Philox PRNG output. This mask only allowed even nibbles, compressing FP16, for example, from $65536$ to $4096$ unique values and severely limiting the dynamic range of the generated data.

Limitations of "sub_group_block_write"

Philox PRNG always generates a vector of $4$ uint values. The kernel uses intel_sub_group_block_write_uc16, which transposes bytes across work-items: byte rnd[c] from work-item j lands at address base + c*16 + j. Each FP element in memory is therefore assembled from bytes at the same rnd[] index across adjacent work-items, so the NaN-safe mask must be byte-uniform (the same pattern in every byte). A non-uniform mask like 0xFBFFFBFF would leave some bytes at 0xFF, providing no masking and allowing NaN/Inf values. For this reason, at least $1$ bit must be reserved in every byte; for FP16, for example, that leaves a limit of $2^{14} = 16384$ unique values even though, in theory, reserving just $1$ bit per element would suffice.


Proposed Solution

The direct vstore4-based approach has been restored. Expanding the range of generated samples is possible by switching from a static mask to a dynamic one determined by the buffer data type.

  • Compute a per-type mask that clears the exponent LSB of every element packed in the 32-bit word, preventing an all-ones exponent (NaN/Inf).
  • Pass that custom per-type mask as a kernel argument (uint mask).
  • Add dnnl_data_type_t dt parameter to dnnl_impl_gpu_fill_random so the caller provides the data type explicitly.
  • Restore initial vstore4-based approach.

Simple Direct masks

switch (dt) {
    case data_type::f32: return 0xFF7FFFFFu;
    case data_type::f16: return 0xFBFFFBFFu;
    case data_type::bf16: return 0xFF7FFF7Fu;
    case data_type::f8_e5m2: return 0xFBFBFBFBu;
    case data_type::f8_e4m3: return 0xF7F7F7F7u;
    case data_type::e8m0: return 0xFEFEFEFEu;
    case data_type::f64: return 0xFFEFFFFFu;
    default: return 0xFFFFFFFFu;
}

Modified Files

  • fill_random.cl — added uint mask kernel parameter, applied to PRNG output. Logic restored to vstore4.
  • fill_random.cpp — added nan_safe_mask() function and a dt parameter in the API.
  • dnnl_memory.cpp — pass dt(buffer_index) to fill_random.

Results

For one million generated samples:

| Data Type | Before (unique samples) | After (unique samples) | Change | Limit ($2^{20}$ samples) |
|---|---|---|---|---|
| dnnl_f32 | 1,016,560 | 1,048,318 | ×1.03 ↑ | 1,048,576 |
| dnnl_f16 | 4,096 | 32,768 | ×8 ↑ | 32,768 |
| dnnl_f8_e5m2 | 64 | 128 | ×2 ↑ | 128 |

@github-actions bot added the platform:gpu-intel (Codeowner: @oneapi-src/onednn-gpu-intel) and component:tests (Codeowner: @oneapi-src/onednn-arch) labels on Mar 18, 2026.
@echeresh (Contributor) commented:
> Limitations of "sub_group_block_write" [quoted from the PR description above]

@kwieloch-intel I suggest reworking the kernel to drop the intel_sub_group_block_write_uc16 usage. Due to its byte-granularity behavior, NaN/infinity masking 1) becomes non-trivial and 2) we have to exclude "normal" numbers just because of that specific byte-step limitation.

Let's simplify this to handle >= 8 contiguous bytes per item (to cover types up to fp64), e.g. like in the snippet I shared: #4699 (comment)

This will allow us to construct the mask in a more direct way, clearing just one bit per FP number. This is a simpler approach and gives us a wider range to cover.

@kwieloch-intel (Contributor, Author) commented Mar 19, 2026

> @kwieloch-intel I suggest reworking the kernel to drop the intel_sub_group_block_write_uc16 usage. [quoted from the previous comment]

@echeresh I’ve restored the original kernel code using vstore. It’s very similar to the one we had in PR #4699 before the switch to intel_sub_group_block_write_uc16. I have also simplified the logic for generating masks. We are now maximizing randomness by blocking one bit per data type rather than per byte. This increases the number of unique values generated for 16-bit floating-point types to $32768$, which is the theoretical limit.

@echeresh (Contributor) commented:
@kwieloch-intel Thanks, this looks good to me!
