rocr: Optimize SDMA multi-producer hot path for reduced contention and overhead by saleelk · Pull Request #3264 · ROCm/rocm-systems

saleelk · 2026-02-13T20:02:40Z

Motivation

Reduce the time taken by the SDMA multi-producer hot path.

Technical Details

Reduce PadRingToEnd lock contention: unlock reservation_lock_ around UpdateWriteAndDoorbellRegister spin-wait so other producers can reserve while NOP padding commits.
Use mwaitx for ordered-commit spin in UpdateWriteAndDoorbellRegister instead of sched_yield(), with fallback when mwaitx is unavailable.
Cache immutable flags and MMIO pointers at init.
Replace os::YieldThread with _mm_pause in CAS retry loops.
Eliminate redundant atomic Load in AcquireWriteAddress.
Stack-allocate single-packet command buffers in SubmitLinearCopyCommand and SubmitLinearFillCommand to avoid per-call heap allocation.
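The CAS retry item above can be illustrated with a minimal sketch. It is not the ROCr code: `CpuRelax`, `Reserve`, and `RunProducers` are hypothetical names, and the ring is reduced to a single atomic write index. It shows a spin hint in the retry loop in place of a scheduler yield, and relies on the fact that `compare_exchange_weak` refreshes the expected value on failure, so the loop needs no separate reload.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>
#include <vector>
#if defined(__x86_64__) || defined(_M_X64)
#include <immintrin.h>
#endif

// Hypothetical helper: a short CPU-level pause for spin loops.
inline void CpuRelax() {
#if defined(__x86_64__) || defined(_M_X64)
  _mm_pause();                // tells the core this is a spin-wait
#else
  std::this_thread::yield();  // portable fallback on non-x86 targets
#endif
}

// Hypothetical reservation: claims `size` bytes and returns the start offset.
inline uint64_t Reserve(std::atomic<uint64_t>& write_index, uint64_t size) {
  assert(size > 0);
  uint64_t cur = write_index.load(std::memory_order_relaxed);
  // On failure compare_exchange_weak stores the current value into `cur`,
  // so the loop body does not need to load the index again.
  while (!write_index.compare_exchange_weak(cur, cur + size,
                                            std::memory_order_acq_rel,
                                            std::memory_order_relaxed)) {
    CpuRelax();
  }
  return cur;
}

// Demo: several producers reserve concurrently; returns the final index.
inline uint64_t RunProducers(int producers, int reservations, uint64_t size) {
  std::atomic<uint64_t> write_index{0};
  std::vector<std::thread> threads;
  for (int p = 0; p < producers; ++p) {
    threads.emplace_back([&write_index, reservations, size] {
      for (int i = 0; i < reservations; ++i) Reserve(write_index, size);
    });
  }
  for (auto& t : threads) t.join();
  return write_index.load();
}
```

Because every successful CAS advances the index by exactly `size`, the final index equals producers × reservations × size regardless of interleaving.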

JIRA ID

Test Plan

Test Result

Submission Checklist

Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

Copilot

Pull request overview

This pull request optimizes the SDMA (System DMA) multi-producer hot path in the ROCr runtime to reduce lock contention and overhead. The changes focus on performance improvements in the command submission pipeline, which is critical for efficient GPU command dispatch.

Changes:

Reduced lock contention by unlocking during blocking operations in PadRingToEnd, allowing other producers to reserve space while one commits
Replaced busy-waiting mechanisms with more efficient CPU instructions (_mm_pause for CAS retries, mwaitx for ordered-commit waits)
Cached immutable configuration flags and MMIO pointers at initialization to avoid repeated pointer chasing in hot paths
Optimized stack allocation for common single-packet command submissions to avoid heap allocation overhead
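The stack-allocation item in the list above can be sketched as follows. This is an illustration under assumptions, not the code from the PR: `Packet`, `Submit`, and `BuildAndSubmit` are hypothetical, and the real SDMA packet layouts are not shown on this page. The idea is to keep the common single-packet case in a local variable and fall back to a heap-backed container only when more than one packet is needed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical fixed-size packet used only for this sketch.
struct Packet {
  uint32_t words[8];
};

// Hypothetical sink standing in for the ring-buffer copy.
// Returns the number of bytes that would be written.
inline std::size_t Submit(const Packet* packets, std::size_t count) {
  assert(count == 0 || packets != nullptr);
  return count * sizeof(Packet);
}

// Builds `count` packets and submits them. Sets *used_heap to true only
// when the multi-packet (heap) path was taken.
inline std::size_t BuildAndSubmit(std::size_t count, bool* used_heap) {
  if (count == 1) {
    Packet single{};        // lives on the stack, no allocator call
    single.words[0] = 1;
    *used_heap = false;
    return Submit(&single, 1);
  }
  std::vector<Packet> many(count);  // heap allocation for the general case
  for (std::size_t i = 0; i < count; ++i) {
    many[i].words[0] = static_cast<uint32_t>(i + 1);
  }
  *used_heap = true;
  return Submit(many.data(), many.size());
}
```

The fast path trades a branch for the allocator call, which is usually worthwhile when single-packet submissions dominate.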

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File	Description
projects/rocr-runtime/runtime/hsa-runtime/core/runtime/amd_blit_sdma.cpp	Implements all hot path optimizations: caching, lock management improvements, efficient wait mechanisms, and stack allocation for single-packet commands
projects/rocr-runtime/runtime/hsa-runtime/core/inc/amd_blit_sdma.h	Adds cached member variables and updates function signatures to support lock management optimization


Copilot · 2026-02-13T20:09:28Z

projects/rocr-runtime/runtime/hsa-runtime/core/runtime/amd_blit_sdma.cpp

+    // Use mwaitx to efficiently monitor cached_commit_index_ instead of
+    // burning CPU cycles.
+    if (core::g_use_mwaitx) {
+      timer::DoMwaitx(reinterpret_cast<int64_t*>(&cached_commit_index_), 60000, true);


Type mismatch in DoMwaitx call: cached_commit_index_ is uint64_t but DoMwaitx expects int64_t*. While this works in practice because mwaitx just monitors the memory address, for type safety and consistency with other signal wait paths in the codebase (which use int64_t), consider either changing cached_commit_index_ to int64_t or using a cast through void* to make the type conversion more explicit.

Suggested change

timer::DoMwaitx(reinterpret_cast<int64_t*>(&cached_commit_index_), 60000, true);

timer::DoMwaitx(static_cast<int64_t*>(static_cast<void*>(&cached_commit_index_)), 60000, true);
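For context on the two spellings above: in standard C++, a `reinterpret_cast` between object pointer types is defined as the `static_cast` through `void*`, so both forms yield the same pointer, and reading a `uint64_t` object through an `int64_t` lvalue is permitted by the aliasing rules for corresponding signed and unsigned types. The sketch below checks this with a hypothetical `PeekValue` standing in for a wait primitive that takes `int64_t*`; it does not call `timer::DoMwaitx`.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for an API that accepts int64_t*;
// it only reads the monitored value.
inline int64_t PeekValue(const int64_t* addr) {
  assert(addr != nullptr);
  return *addr;
}

// Returns true when both cast spellings give the same pointer and value.
inline bool CastsAgree(uint64_t& value) {
  int64_t* a = reinterpret_cast<int64_t*>(&value);
  int64_t* b = static_cast<int64_t*>(static_cast<void*>(&value));
  return a == b && PeekValue(a) == PeekValue(b);
}
```

The choice between the two is therefore a matter of style rather than behavior; changing the member to `int64_t` is the only option that removes the cast entirely.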

Copilot · 2026-02-13T20:09:29Z

projects/rocr-runtime/runtime/hsa-runtime/core/inc/amd_blit_sdma.h

+  char* AcquireWriteAddress(uint32_t cmd_size, uint64_t& curr_index,
+                            std::unique_lock<std::mutex>& lock);


Function signature was updated to include a lock parameter, but the documentation comment (lines 150-162) was not updated to document this new parameter. Add documentation for the lock parameter explaining that it's a reference to the reservation lock that may be temporarily released and reacquired during the call.
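A minimal sketch of what such a documented lock parameter could look like, assuming a simplified wait in place of the real commit logic. `WaitUnlocked` and `DemoOtherProducerProgresses` are hypothetical and are not part of ROCr; the point is the contract that the lock is held on entry, may be dropped during the wait, and is held again on return.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>
#include <thread>

/// Hypothetical helper, not the ROCr implementation.
/// @param lock  Reservation lock, held by the caller on entry. It is
///              released while waiting and reacquired before returning.
/// @param ready Flag set by another thread once the wait may end.
/// @return true if the lock is held again on return.
inline bool WaitUnlocked(std::unique_lock<std::mutex>& lock,
                         const std::atomic<bool>& ready) {
  assert(lock.owns_lock());
  lock.unlock();  // let other producers reserve while this thread waits
  while (!ready.load(std::memory_order_acquire)) {
    std::this_thread::yield();
  }
  lock.lock();    // restore the caller's invariant
  return lock.owns_lock();
}

// Demo: a second thread can take the mutex while the first one waits.
inline bool DemoOtherProducerProgresses() {
  std::mutex m;
  std::atomic<bool> ready{false};
  std::unique_lock<std::mutex> lock(m);
  std::thread other([&m, &ready] {
    std::lock_guard<std::mutex> guard(m);  // succeeds once the waiter unlocks
    ready.store(true, std::memory_order_release);
  });
  bool held = WaitUnlocked(lock, ready);
  other.join();
  return held;
}
```

Passing the `std::unique_lock` by reference makes the temporary release visible in the signature, which is why callers need the parameter documented: any state read under the lock before the call may be stale afterwards.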


saleelk requested a review from lmoriche February 13, 2026 20:02

saleelk requested review from cfreeamd, dayatsin-amd and kentrussell as code owners February 13, 2026 20:02

Copilot AI review requested due to automatic review settings February 13, 2026 20:02

github-actions bot added the project: rocr-runtime label Feb 13, 2026

Copilot started reviewing on behalf of saleelk February 13, 2026 20:03

Copilot AI reviewed Feb 13, 2026


systems-assistant bot added the organization: ROCm label Feb 13, 2026

saleelk force-pushed the users/saleelk/sdmaHotPath branch 2 times, most recently from 3f36f5f to 6c4a49d Compare February 14, 2026 22:24

saleelk force-pushed the users/saleelk/sdmaHotPath branch from 6c4a49d to a26afe8 Compare February 14, 2026 22:43

saleelk wants to merge 1 commit into develop from users/saleelk/sdmaHotPath
