-
Notifications
You must be signed in to change notification settings - Fork 79
SWDEV-546808 optimize match_any() with dpp wave_rot1 #183
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: amd-staging
Are you sure you want to change the base?
Conversation
retval = 1; | ||
for (int i = 1; i < static_cast<int>(warpSize); i++) { | ||
if constexpr (sizeof(value) == 8) | ||
dill_.ill = __builtin_amdgcn_mov_dpp(dill_.ill, 0x134, 0xf, 0xf, 0); //wave_rol1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does the comment "wave_rol1" mean? Also, the magic constants need comments like (/* foo= */ 0x134)
. Or even better would be const unsigned int meaningfulName = 0x134
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is the dpp_ctrl. I tried documenting here:
https://medium.com/@hashem.hashemi/amds-dpp-operations-for-super-fast-reduction-29f9aa888376
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But that's just an external personal log. Having meaningful names (with comments if necessary) at the top of this file will be much better. Could be const integers, or could be an enum class.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
More comments added in new check-in. Let me know if it makes it clearer.
dill_.i[0] = __builtin_amdgcn_mov_dpp(dill_.i[0], 0x134, 0xf, 0xf, 0); //wave_rol1 | ||
retval |= ((decltype(mask))(dill_.val == value)) << i; | ||
} | ||
int rotv = __lane_id(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does rotv
mean? More comments will be useful for future reference, like what is the overall strategy here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Its the rotate amount. I added comment to code.
retval = 1; | ||
for (int i = 1; i < static_cast<int>(warpSize); i++) { | ||
if constexpr (sizeof(value) == 8) | ||
dill_.ill = __builtin_amdgcn_mov_dpp(dill_.ill, 0x134, 0xf, 0xf, 0); //wave_rol1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I need to read up on this, but just to be sure, do these _dpp
intrinsics work with both wave64 and wave32?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see why it wouldn't work for wave32. That is why I put the rotate amount as static_cast(warpSize).
} | ||
} | ||
} else { | ||
union dill { unsigned int i[2]; unsigned long long ill; decltype(value) val; } dill_ = { .val = value }; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we push this union declaration into the if constexpr
, then we will only need one integer member with the correct width.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doesn't match_any() need to handle int and long?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I was kinda assuming the compiler will optimize out the upper bits since the whole function is a template anyway. But this works too.
* SWDEV-460151 - add gfx1201 to amd-staging clr * SWDEV-460151 - removed pal macro --------- Co-authored-by: Jimbo Xie <[email protected]>
Associated JIRA ticket number/Github issue number
Fixes SWDEV-546808
What type of PR is this? (check all applicable)
What were the changes?
match_any() showing suboptimal perf. This change converts it to a sequence of wave_rot1 dpp ops.
Why are these changes needed?
Prior solution uses inefficient firstlane in loop.
Updated CHANGELOG?
Added/Updated documentation?
Additional Checks