Skip to content

ROCM-20519 - stabilize DeviceSynchronize functional race with scoped …#4250

Open
jaydeeppatel1111 wants to merge 1 commit intodevelopfrom
ROCM-20519
Open

ROCM-20519 - stabilize DeviceSynchronize functional race with scoped …#4250
jaydeeppatel1111 wants to merge 1 commit intodevelopfrom
ROCM-20519

Conversation

@jaydeeppatel1111
Copy link
Contributor

…copy size.

Motivation

Unit_hipDeviceSynchronize_Functional's pre-sync assertion races on fast GPUs
when transfers are tiny. Use a dedicated copy size for that test only (default
1 GiB) so Nullstream stays on small _SIZE.

Technical Details

Add HIP_TEST_DEVICE_SYNCHRONIZE_FUNCTIONAL_COPY_MB to tune or shrink on
memory-limited CI. Refresh comments to describe H2D/kernel/D2H ordering.

Issue observed on MI300a 4xGPUs.

JIRA ID

ROCM-20519

Test Plan

NA

Test Result

NA

Submission Checklist

@jaydeeppatel1111 jaydeeppatel1111 requested a review from a team as a code owner March 20, 2026 14:13
Copilot AI review requested due to automatic review settings March 20, 2026 14:13
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adjusts Unit_hipDeviceSynchronize_Functional to reduce flakiness from a pre-sync host read racing with fast GPU execution by introducing a dedicated (configurable) transfer size for that test.

Changes:

  • Added functionalDeviceSynchronizeCopyBytes() to select a larger per-test copy size, configurable via HIP_TEST_DEVICE_SYNCHRONIZE_FUNCTIONAL_COPY_MB.
  • Switched the functional test’s host/device allocations and async copies from _SIZE to the dedicated copyBytes.
  • Updated comments to better describe the intended H2D/kernel/D2H ordering and the rationale for the larger transfer.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +42 to +45
// 1 GiB per buffer (four allocations in this test ≈ 4 GiB total) — a practical default
// for fast links; use a smaller HIP_TEST_DEVICE_SYNCHRONIZE_FUNCTIONAL_COPY_MB on
// memory-constrained runners if the pre-sync check is not required there.
return 1024ULL * 1024 * 1024;
Copy link

Copilot AI Mar 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The default functionalDeviceSynchronizeCopyBytes() return value (1 GiB) makes this test allocate ~2 GiB pinned host memory + ~2 GiB device memory and perform ~4 GiB of H2D/D2H traffic. That is likely to fail (hipErrorOutOfMemory / memlock limits) or significantly slow down/timing-flake on common CI runners. Consider choosing a smaller default and/or deriving the size from available memory (e.g., via hipMemGetInfo) and skipping the test when the requested allocation can’t be satisfied, with the env var used to increase when needed on fast GPUs.

Suggested change
// 1 GiB per buffer (four allocations in this test ≈ 4 GiB total) — a practical default
// for fast links; use a smaller HIP_TEST_DEVICE_SYNCHRONIZE_FUNCTIONAL_COPY_MB on
// memory-constrained runners if the pre-sync check is not required there.
return 1024ULL * 1024 * 1024;
// 64 MiB per buffer by default (four allocations in this test ≈ 256 MiB total).
// Use HIP_TEST_DEVICE_SYNCHRONIZE_FUNCTIONAL_COPY_MB to increase this on very fast GPUs
// if the pre-sync check still races, or to further reduce it on memory-constrained runners.
return 64ULL * 1024 * 1024;

Copilot uses AI. Check for mistakes.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@copilot , Below check fails on some of ASICs which makes it racy and increasing _SIZE makes the kernel and memcopies to take longer and hence check passes.
REQUIRE(NUM_ITERS != A[NUM_STREAMS - 1][0] - 1);

Comment on lines +155 to 160
// This check assumes per-stream work is still in flight (typically still in a large H2D
// or the long kernel) when the host reads A[NUM_STREAMS-1][0]. Very fast GPUs or
// HIP_LAUNCH_BLOCKING=true can still fail it; increase copy size via
// HIP_TEST_DEVICE_SYNCHRONIZE_FUNCTIONAL_COPY_MB or drop the assertion if needed.

REQUIRE(NUM_ITERS != A[NUM_STREAMS - 1][0] - 1);
Copy link

Copilot AI Mar 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The pre-sync REQUIRE(NUM_ITERS != A[...]) is still inherently racy because it reads the host buffer without any synchronization while a D2H copy may eventually write to that same memory. Even with a larger copy size, this can still be nondeterministic across devices/settings (and the comment already notes it may fail). To make the test stable, consider removing the pre-sync assertion entirely (keeping only the post-hipDeviceSynchronize check), or replacing it with a deterministic readiness check (e.g., hipStreamQuery/event query) rather than a host read race.

Copilot uses AI. Check for mistakes.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@copilot open a new pull request to apply changes based on this feedback

Copy link
Contributor

Copilot AI commented Mar 20, 2026

@jaydeeppatel1111 I've opened a new pull request, #4256, to work on those changes. Once the pull request is ready, I'll request review from you.

Copy link
Contributor

Copilot AI commented Mar 20, 2026

@jaydeeppatel1111 I've opened a new pull request, #4257, to work on those changes. Once the pull request is ready, I'll request review from you.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants