Description
Regular sub-group reductions that do not take layouts into account can lead to subpar performance on PVC. This kind of workflow arises when a reduction follows a matrix multiplication, or operates on a tensor with the same layout as a matrix multiplication output (DPAS layout). #2907 was the final PR attempting to fix this at the Triton level. A parallel approach was pursued in IGC, but it remained subpar: it required moving data around after the reduction and also used considerably more operations for the reduction itself.
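To illustrate why the layout matters, here is a minimal numpy sketch (not Triton or IGC code; lane counts and shapes are illustrative assumptions) of a tile distributed across SIMD lanes in a DPAS-like layout. Reducing along the axis each lane already holds privately needs no cross-lane data movement, while a layout-oblivious reduction along the other axis forces cross-lane exchanges:

```python
import numpy as np

# Illustrative assumptions: SIMD16 sub-group, 8 elements per lane.
LANES = 16
ELEMS_PER_LANE = 8

# tile[lane, e]: element e held privately in `lane`'s registers.
tile = np.arange(LANES * ELEMS_PER_LANE, dtype=np.float32).reshape(LANES, ELEMS_PER_LANE)

# Layout-aware reduction: the reduction axis is the per-lane axis,
# so each lane sums its own registers -- no cross-lane traffic.
per_lane = tile.sum(axis=1)    # one independent sum per lane

# Layout-oblivious reduction: reducing across lanes requires
# shuffles/data movement, which is what the IGC approach paid for.
cross_lane = tile.sum(axis=0)  # combines values held by different lanes
```

Both reductions are of course numerically valid; the difference on hardware is only in how much cross-lane communication each one implies.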
Running FlashAttention with the SIMD reduction does not currently give good performance: per my investigation, spilling is much higher in that case. This should not happen, as the algorithm should not increase register pressure, so it may be related to suboptimal instruction scheduling.
Reducing the DModel dimension to just 16, so that no spilling takes place, leads to better performance and overall better codegen with the SIMD reduction compared to the baseline reduction approach. This suggests the SIMD reduction will give better performance (as well as being more general), since it acts at a higher level.
Now, to take full advantage of the optimization, there are two possible paths:
- Improve instruction scheduling in the backend
- Explore splitting tensors across warps along the DModel (reduction) dimension. This may also alleviate register pressure and avoid spilling while still exploiting the SIMD reduction
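The second path can be sketched as a split-K-style reduction. This is a hypothetical numpy model (warp count and tile sizes are assumptions, not taken from the actual kernels): each warp reduces only its slice of the DModel dimension, keeping fewer values live at once, and the per-warp partials are combined in a final cross-warp step:

```python
import numpy as np

# Illustrative assumptions: 4 warps, a 32 x 64 tile being row-reduced.
WARPS = 4
ROWS, DMODEL = 32, 64
rng = np.random.default_rng(0)
x = rng.random((ROWS, DMODEL)).astype(np.float32)

# Each warp reduces only DMODEL // WARPS columns -> fewer live registers
# per warp, so less pressure and (hopefully) no spilling.
chunk = DMODEL // WARPS
partials = np.stack(
    [x[:, w * chunk:(w + 1) * chunk].sum(axis=1) for w in range(WARPS)]
)

# Final combine across warps (on hardware this would go through
# shared local memory or a cross-warp reduction primitive).
result = partials.sum(axis=0)

# The split reduction matches the monolithic one.
assert np.allclose(result, x.sum(axis=1), rtol=1e-5)
```

The trade-off is the extra cross-warp combine step, but that cost is paid once per tile, while the register-pressure relief applies to the whole per-warp reduction.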