Skip to content

SWDEV-546808 optimize match_any() with dpp wave_rot1 #183

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 8 commits into
base: amd-staging
Choose a base branch
from

Conversation

amd-hhashemi
Copy link

@amd-hhashemi amd-hhashemi commented Jul 29, 2025

Associated JIRA ticket number/Github issue number

Fixes SWDEV-546808

What type of PR is this? (check all applicable)

  • Refactor
  • Feature
  • Bug Fix
  • Optimization
  • Documentation Update
  • Continuous Integration

What were the changes?

match_any() showing suboptimal perf. This change converts it to a sequence of wave_rot1 dpp ops.

Why are these changes needed?

Prior solution uses inefficient firstlane in loop.

Updated CHANGELOG?

  • Yes
  • No, Does not apply to this PR.

Added/Updated documentation?

  • Yes
  • No, Does not apply to this PR.

Additional Checks

  • I have added tests relevant to the introduced functionality, and the unit tests are passing locally.
  • Any dependent changes have been merged.

retval = 1;
for (int i = 1; i < static_cast<int>(warpSize); i++) {
if constexpr (sizeof(value) == 8)
dill_.ill = __builtin_amdgcn_mov_dpp(dill_.ill, 0x134, 0xf, 0xf, 0); //wave_rol1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does the comment "wave_rol1" mean? Also, the magic constants need comments like (/* foo= */ 0x134). Or even better would be const unsigned int meaningfulName = 0x134.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But that's just an external personal log. Having meaningful names (with comments if necessary) at the top of this file will be much better. Could be const integers, or could be an enum class.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More comments added in new check-in. Let me know if it makes it clearer.

dill_.i[0] = __builtin_amdgcn_mov_dpp(dill_.i[0], 0x134, 0xf, 0xf, 0); //wave_rol1
retval |= ((decltype(mask))(dill_.val == value)) << i;
}
int rotv = __lane_id();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does rotv mean? More comments will be useful for future reference, like what is the overall strategy here.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Its the rotate amount. I added comment to code.

retval = 1;
for (int i = 1; i < static_cast<int>(warpSize); i++) {
if constexpr (sizeof(value) == 8)
dill_.ill = __builtin_amdgcn_mov_dpp(dill_.ill, 0x134, 0xf, 0xf, 0); //wave_rol1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I need to read up on this, but just to be sure, do these _dpp intrinsics work with both wave64 and wave32?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see why it wouldn't work for wave32. That is why I put the rotate amount as static_cast(warpSize).

}
}
} else {
union dill { unsigned int i[2]; unsigned long long ill; decltype(value) val; } dill_ = { .val = value };
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we push this union declaration into the if constexpr, then we will only need one integer member with the correct width.

Copy link
Author

@amd-hhashemi amd-hhashemi Aug 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't match_any() need to handle int and long?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I was kinda assuming the compiler will optimize out the upper bits since the whole function is a template anyway. But this works too.

yxsamliu pushed a commit to yxsamliu/clr that referenced this pull request Aug 5, 2025
* SWDEV-460151 - add gfx1201 to amd-staging clr

* SWDEV-460151 - removed pal macro

---------

Co-authored-by: Jimbo Xie <[email protected]>
@amd-hhashemi amd-hhashemi changed the title optimize match_any() with dpp wave_rot1 SWDEV-546808 optimize match_any() with dpp wave_rot1 Aug 9, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants