benchdnn: add per-type NaN-safe mask to gpu_fill_random #4851
kwieloch-intel wants to merge 2 commits into uxlfoundation:main
@kwieloch-intel I suggest reworking the kernel to drop [...]. Let's simplify this to handle >= 8 contiguous bytes per item (to cover types up to fp64), e.g. like in the snippet I shared: #4699 (comment). This will allow us to construct the [...]
@echeresh I’ve restored the original kernel code using [...]
@kwieloch-intel Thanks, this looks good to me!
This PR modifies the logic in the `fill_random.cl` kernel to use custom per-type masks instead of a single universal `0xEEEEEEEEu` mask.
JIRA: MFDNN-14789
Problem
The existing `gpu_fill_random` kernel applied a hardcoded universal mask, `0xEEEEEEEE`, to the Philox PRNG output. This mask only allowed even nibbles, compressing e.g. FP16 from $65536$ to $4096$ unique values and severely limiting the dynamic range of generated data.
Limitations of "sub_group_block_write"
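A useful yardstick for this discussion: a mask with k set bits inside an element lets through exactly 2^k distinct bit patterns. The sketch below is a minimal host-side check of the FP16 figure quoted for `0xEEEEEEEE`. The helper name `unique_patterns` is hypothetical and is not part of oneDNN or benchdnn.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not part of oneDNN): number of distinct bit patterns
 * that survive when `mask` is ANDed onto an element of `elem_bits` bits. */
uint64_t unique_patterns(uint32_t mask, unsigned elem_bits) {
    assert(elem_bits >= 1 && elem_bits <= 32);
    unsigned set_bits = 0;
    for (unsigned b = 0; b < elem_bits; ++b)
        set_bits += (mask >> b) & 1u; /* count mask bits inside the element */
    return (uint64_t)1 << set_bits;   /* each kept bit doubles the count */
}
```

For a 16-bit element, `unique_patterns(0xEEEEEEEEu, 16)` evaluates to 4096, against 65536 for an all-ones mask, because each `0xE` nibble keeps only 3 of its 4 bits.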
The Philox PRNG always generates a vector of $4$ `uint` values. The kernel uses `intel_sub_group_block_write_uc16`, which transposes bytes across work-items: byte `rnd[c]` from work-item `j` lands at address `base + c*16 + j`. Each FP element in memory is therefore assembled from bytes at the same `rnd[]` index across adjacent work-items. This means the NaN-safe mask must be byte-uniform (an identical pattern in every byte). A non-uniform mask like `0xFBFFFBFF` would leave some bytes at `0xFF`, which provides no masking and allows NaN/Inf values. For this reason, at least $1$ bit must be reserved in every byte. For the FP16 type, for example, this leaves a limit of $2^{14} = 16384$ unique values even though, in theory, reserving just $1$ bit per element would suffice in this case.
Proposed Solution
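The byte-uniformity requirement from the block-write discussion can be checked with a small host-side model. This is only a sketch: it assumes a sub-group of 16 work-items, little-endian byte order, and exactly the layout stated in the description (byte `c` of work-item `j` lands at `base + c*16 + j`). All function names are hypothetical, and the model does not call any OpenCL API.

```c
#include <assert.h>
#include <stdint.h>

enum { SG_SIZE = 16, BYTES_PER_ITEM = 16 }; /* assumed: 16 work-items, 4 uints each */

/* Models the described layout: byte c of work-item j goes to out[c * 16 + j]. */
void transposed_write(uint8_t out[SG_SIZE * BYTES_PER_ITEM],
                      uint32_t rnd[SG_SIZE][4], uint32_t mask) {
    assert(out && rnd);
    for (int j = 0; j < SG_SIZE; ++j) {
        for (int c = 0; c < BYTES_PER_ITEM; ++c) {
            uint32_t masked = rnd[j][c / 4] & mask;
            /* little-endian byte c of the work-item's 16-byte vector */
            out[c * SG_SIZE + j] = (uint8_t)(masked >> (8 * (c % 4)));
        }
    }
}

/* Counts FP16 elements whose 5-bit exponent is all ones (NaN or Inf). */
int count_f16_nan_inf(const uint8_t *buf, int n_bytes) {
    int bad = 0;
    for (int k = 0; k + 1 < n_bytes; k += 2) {
        uint16_t h = (uint16_t)(buf[k] | (buf[k + 1] << 8));
        if (((h >> 10) & 0x1Fu) == 0x1Fu) ++bad;
    }
    return bad;
}

/* Worst case: every PRNG bit is 1, so only the mask can prevent NaN/Inf. */
int worst_case_nan_inf(uint32_t mask) {
    uint32_t rnd[SG_SIZE][4];
    uint8_t out[SG_SIZE * BYTES_PER_ITEM];
    for (int j = 0; j < SG_SIZE; ++j)
        for (int i = 0; i < 4; ++i) rnd[j][i] = 0xFFFFFFFFu;
    transposed_write(out, rnd, mask);
    return count_f16_nan_inf(out, SG_SIZE * BYTES_PER_ITEM);
}
```

In this model, `worst_case_nan_inf(0xFBFFFBFFu)` reports NaN/Inf elements because every even byte index `c` carries the unmasked `0xFF` into both bytes of an element, while `worst_case_nan_inf(0xFBFBFBFBu)` reports none.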
The direct `vstore4`-based approach has been restored. The range of generated samples can be expanded by switching from a static mask to a dynamic one determined by the buffer data type.
- [...] NaN/Inf).
- [...] `uint mask`).
- [...] `dnnl_data_type_t dt` parameter to `dnnl_impl_gpu_fill_random` so the caller provides the data type explicitly.
- [...] `vstore4`-based approach.
Simple Direct masks
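As an illustration of what a per-type mask could look like once elements are stored directly, the sketch below clears the lowest exponent bit of every element packed into one `uint`, so no element can end up with an all-ones exponent. The mask values, the enum, and the name `nan_safe_mask_sketch` are assumptions made for this example; they are not taken from the PR's actual `nan_safe_mask()` implementation.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for the dnnl_data_type_t values named in this PR. */
typedef enum { SKETCH_F32, SKETCH_F16, SKETCH_F8_E5M2 } sketch_dt_t;

/* Assumed masks: clear one exponent bit per element packed in a 32-bit word. */
uint32_t nan_safe_mask_sketch(sketch_dt_t dt) {
    switch (dt) {
        case SKETCH_F32: return 0xFF7FFFFFu;     /* exponent bits 30..23, clear bit 23 */
        case SKETCH_F16: return 0xFBFFFBFFu;     /* 2 elements, exponent bits 14..10, clear bit 10 */
        case SKETCH_F8_E5M2: return 0xFBFBFBFBu; /* 4 elements, exponent bits 6..2, clear bit 2 */
    }
    assert(!"unsupported data type in this sketch");
    return 0;
}

/* An FP16 value is finite when its 5-bit exponent is not all ones. */
int f16_is_finite(uint16_t h) {
    return ((h >> 10) & 0x1Fu) != 0x1Fu;
}
```

With one reserved bit per element, FP16 keeps 15 free bits (32768 bit patterns), compared with 4096 under the old universal mask.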
Modified Files
- `uint mask` kernel parameter, applied to PRNG output. Logic restored to `vstore4`.
- `nan_safe_mask()` function, `dt` parameter in API.
- `dt(buffer_index)` to fill_random.
Results
For one million generated samples:
dnnl_f32, dnnl_f16, dnnl_f8_e5m2