
benchdnn: add per-type NaN-safe mask to gpu_fill_random #4851

Draft
kwieloch-intel wants to merge 2 commits into uxlfoundation:main from kwieloch-intel:philox-mask

Conversation

@kwieloch-intel (Contributor) commented Mar 18, 2026

This PR modifies the logic in the fill_random.cl kernel to use custom per-type masks instead of a single universal 0xEEEEEEEEu mask.

JIRA: MFDNN-14789


Problem

The existing gpu_fill_random kernel applied a hardcoded universal 0xEEEEEEEE mask to the Philox PRNG output. This mask only allowed even nibbles, compressing FP16, for example, from $65536$ to $4096$ unique values and severely limiting the dynamic range of the generated data.

Limitations of "sub_group_block_write"

Philox PRNG always generates a vector of $4$ uint values. The kernel uses intel_sub_group_block_write_uc16, which transposes bytes across work-items: byte rnd[c] from work-item j lands at address base + c*16 + j. Each FP element in memory is therefore assembled from bytes at the same rnd[] index across adjacent work-items, so the NaN-safe mask must be byte-uniform (the same pattern in every byte). A non-uniform mask like 0xFBFFFBFF would leave some bytes at 0xFF, providing no masking and allowing NaN/Inf values. For this reason, at least $1$ bit must be reserved in every byte; for FP16, for example, that leaves a limit of $2^{14} = 16384$ unique values even though, in theory, reserving just $1$ bit per element would suffice.


Proposed Solution

The direct vstore4-based approach has been restored. Expanding the range of generated samples is possible by switching from a static mask to a dynamic one determined by the buffer data type.

  • Compute a per-type mask that clears the exponent LSB of every element packed in the 32-bit word, preventing an all-ones exponent (NaN/Inf).
  • Pass that custom per-type mask as a kernel argument (uint mask).
  • Add dnnl_data_type_t dt parameter to dnnl_impl_gpu_fill_random so the caller provides the data type explicitly.
  • Restore initial vstore4-based approach.

Simple Direct masks

switch (dt) {
    case data_type::f32: return 0xFF7FFFFFu;
    case data_type::f16: return 0xFBFFFBFFu;
    case data_type::bf16: return 0xFF7FFF7Fu;
    case data_type::f8_e5m2: return 0xFBFBFBFBu;
    case data_type::f8_e4m3: return 0xF7F7F7F7u;
    case data_type::e8m0: return 0xFEFEFEFEu;
    case data_type::f64: return 0xFFEFFFFFu;
    default: return 0xFFFFFFFFu;
}

Modified Files

  • fill_random.cl — added uint mask kernel parameter, applied to PRNG output. Logic restored to vstore4.
  • fill_random.cpp — added nan_safe_mask() function and a dt parameter in the API.
  • dnnl_memory.cpp — pass dt(buffer_index) to fill_random.

Results

For one million generated samples:

| Data Type | Before (unique samples) | After (unique samples) | Change | Limit ($2^{20}$ samples) |
|---|---|---|---|---|
| dnnl_f32 | 1,016,560 | 1,048,318 | ×1.03 ↑ | 1,048,576 |
| dnnl_f16 | 4,096 | 32,768 | ×8 ↑ | 32,768 |
| dnnl_f8_e5m2 | 64 | 128 | ×2 ↑ | 128 |

@github-actions bot added the platform:gpu-intel (Codeowner: @oneapi-src/onednn-gpu-intel) and component:tests (Codeowner: @oneapi-src/onednn-arch) labels on Mar 18, 2026.
@echeresh (Contributor) commented:
> Limitations of "sub_group_block_write" [quoted from the PR description above]

@kwieloch-intel I suggest reworking the kernel to drop the intel_sub_group_block_write_uc16 usage. Due to its byte-granularity behavior, NaN/infinity masking 1) becomes non-trivial and 2) we have to exclude "normal" numbers just because of that specific byte-step limitation.

Let's simplify this to handle >= 8 contiguous bytes per item (to cover types up to fp64), e.g. like in the snippet I shared: #4699 (comment)

This will allow us to construct the mask in a more direct way, clearing just one bit per FP number. This is a simpler approach and gives us a wider range to cover.

@kwieloch-intel (Contributor, Author) commented Mar 19, 2026

> @kwieloch-intel I suggest reworking the kernel to drop the intel_sub_group_block_write_uc16 usage. [quoted from the previous comment]

@echeresh I’ve restored the original kernel code using vstore. It’s very similar to the one we had in PR #4699 before the switch to intel_sub_group_block_write_uc16. I have also simplified the logic for generating masks. We are now maximizing randomness by blocking one bit per data type rather than per byte. This increases the number of unique values generated for 16-bit floating-point types to $32768$, which is the theoretical limit.

@echeresh (Contributor) commented:
@kwieloch-intel Thanks, this looks good to me!
