ROCM-20519 - stabilize DeviceSynchronize functional race with scoped …#4250
ROCM-20519 - stabilize DeviceSynchronize functional race with scoped …#4250jaydeeppatel1111 wants to merge 1 commit intodevelopfrom
Conversation
There was a problem hiding this comment.
Pull request overview
Adjusts Unit_hipDeviceSynchronize_Functional to reduce flakiness from a pre-sync host read racing with fast GPU execution by introducing a dedicated (configurable) transfer size for that test.
Changes:
- Added
functionalDeviceSynchronizeCopyBytes()to select a larger per-test copy size, configurable viaHIP_TEST_DEVICE_SYNCHRONIZE_FUNCTIONAL_COPY_MB. - Switched the functional test’s host/device allocations and async copies from
_SIZEto the dedicatedcopyBytes. - Updated comments to better describe the intended H2D/kernel/D2H ordering and the rationale for the larger transfer.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| // 1 GiB per buffer (four allocations in this test ≈ 4 GiB total) — a practical default | ||
| // for fast links; use a smaller HIP_TEST_DEVICE_SYNCHRONIZE_FUNCTIONAL_COPY_MB on | ||
| // memory-constrained runners if the pre-sync check is not required there. | ||
| return 1024ULL * 1024 * 1024; |
There was a problem hiding this comment.
The default functionalDeviceSynchronizeCopyBytes() return value (1 GiB) makes this test allocate ~2 GiB pinned host memory + ~2 GiB device memory and perform ~4 GiB of H2D/D2H traffic. That is likely to fail (hipErrorOutOfMemory / memlock limits) or significantly slow down/timing-flake on common CI runners. Consider choosing a smaller default and/or deriving the size from available memory (e.g., via hipMemGetInfo) and skipping the test when the requested allocation can’t be satisfied, with the env var used to increase when needed on fast GPUs.
| // 1 GiB per buffer (four allocations in this test ≈ 4 GiB total) — a practical default | |
| // for fast links; use a smaller HIP_TEST_DEVICE_SYNCHRONIZE_FUNCTIONAL_COPY_MB on | |
| // memory-constrained runners if the pre-sync check is not required there. | |
| return 1024ULL * 1024 * 1024; | |
| // 64 MiB per buffer by default (four allocations in this test ≈ 256 MiB total). | |
| // Use HIP_TEST_DEVICE_SYNCHRONIZE_FUNCTIONAL_COPY_MB to increase this on very fast GPUs | |
| // if the pre-sync check still races, or to further reduce it on memory-constrained runners. | |
| return 64ULL * 1024 * 1024; |
There was a problem hiding this comment.
@copilot , Below check fails on some of ASICs which makes it racy and increasing _SIZE makes the kernel and memcopies to take longer and hence check passes.
REQUIRE(NUM_ITERS != A[NUM_STREAMS - 1][0] - 1);
| // This check assumes per-stream work is still in flight (typically still in a large H2D | ||
| // or the long kernel) when the host reads A[NUM_STREAMS-1][0]. Very fast GPUs or | ||
| // HIP_LAUNCH_BLOCKING=true can still fail it; increase copy size via | ||
| // HIP_TEST_DEVICE_SYNCHRONIZE_FUNCTIONAL_COPY_MB or drop the assertion if needed. | ||
|
|
||
| REQUIRE(NUM_ITERS != A[NUM_STREAMS - 1][0] - 1); |
There was a problem hiding this comment.
The pre-sync REQUIRE(NUM_ITERS != A[...]) is still inherently racy because it reads the host buffer without any synchronization while a D2H copy may eventually write to that same memory. Even with a larger copy size, this can still be nondeterministic across devices/settings (and the comment already notes it may fail). To make the test stable, consider removing the pre-sync assertion entirely (keeping only the post-hipDeviceSynchronize check), or replacing it with a deterministic readiness check (e.g., hipStreamQuery/event query) rather than a host read race.
There was a problem hiding this comment.
@copilot open a new pull request to apply changes based on this feedback
|
@jaydeeppatel1111 I've opened a new pull request, #4256, to work on those changes. Once the pull request is ready, I'll request review from you. |
|
@jaydeeppatel1111 I've opened a new pull request, #4257, to work on those changes. Once the pull request is ready, I'll request review from you. |
…copy size.
Motivation
Unit_hipDeviceSynchronize_Functional's pre-sync assertion races on fast GPUs
when transfers are tiny. Use a dedicated copy size for that test only (default
1 GiB) so Nullstream stays on small _SIZE.
Technical Details
Add HIP_TEST_DEVICE_SYNCHRONIZE_FUNCTIONAL_COPY_MB to tune or shrink on
memory-limited CI. Refresh comments to describe H2D/kernel/D2H ordering.
Issue observed on MI300a 4xGPUs.
JIRA ID
ROCM-20519
Test Plan
NA
Test Result
NA
Submission Checklist