rocr: Optimize SDMA multi-producer hot path for reduced contention and overhead#3264

Open
saleelk wants to merge 1 commit into develop from users/saleelk/sdmaHotPath

saleelk commented Feb 13, 2026

Motivation

Reduce the time taken by the SDMA multi-producer hot path.

Technical Details

  • Reduce PadRingToEnd lock contention: unlock reservation_lock_ around UpdateWriteAndDoorbellRegister spin-wait so other producers can reserve while NOP padding commits.
  • Use mwaitx for ordered-commit spin in UpdateWriteAndDoorbellRegister instead of sched_yield(), with fallback when mwaitx is unavailable.
  • Cache immutable flags and MMIO pointers at init
  • Replace os::YieldThread with _mm_pause in CAS retry loops
  • Eliminate redundant atomic Load in AcquireWriteAddress
  • Stack-allocate single-packet command buffers in SubmitLinearCopyCommand and SubmitLinearFillCommand to avoid per-call heap allocation.
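
The first bullet's unlock-around-wait pattern can be sketched as follows. This is a minimal toy model, not the ROCr code: `ToyRing`, `Submit`, `CommitInOrder`, and `RunProducers` are hypothetical names, and `std::this_thread::yield()` stands in for the mwaitx-based wait.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <mutex>
#include <thread>
#include <vector>

// Toy stand-in for the ring bookkeeping; every name here is hypothetical.
class ToyRing {
 public:
  // Reserves `size` units under the lock, then commits them in reservation order.
  uint64_t Submit(uint64_t size) {
    std::unique_lock<std::mutex> lock(reservation_lock_);
    const uint64_t start = reserved_;
    reserved_ += size;
    CommitInOrder(start, size, lock);
    assert(lock.owns_lock());  // the helper hands the lock back before returning
    return start;
  }

  uint64_t committed() const { return committed_.load(std::memory_order_acquire); }

 private:
  // Drops the reservation lock around the ordered-commit wait so other
  // producers can keep reserving, then reacquires it before returning.
  void CommitInOrder(uint64_t start, uint64_t size,
                     std::unique_lock<std::mutex>& lock) {
    lock.unlock();
    while (committed_.load(std::memory_order_acquire) != start) {
      std::this_thread::yield();  // placeholder for the real wait primitive
    }
    committed_.store(start + size, std::memory_order_release);
    lock.lock();
  }

  std::mutex reservation_lock_;
  uint64_t reserved_ = 0;               // guarded by reservation_lock_
  std::atomic<uint64_t> committed_{0};  // advances in reservation order
};

// Runs `threads` producers, each submitting `per_thread` one-unit packets,
// and returns the final commit index.
uint64_t RunProducers(int threads, int per_thread) {
  ToyRing ring;
  std::vector<std::thread> pool;
  for (int t = 0; t < threads; ++t) {
    pool.emplace_back([&ring, per_thread] {
      for (int i = 0; i < per_thread; ++i) ring.Submit(1);
    });
  }
  for (auto& th : pool) th.join();
  return ring.committed();
}
```

Because no producer waits for its turn while holding the lock, a slow committer delays only the producers ordered behind it, not new reservations.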

JIRA ID

Test Plan

Test Result

Submission Checklist

Copilot AI left a comment

Pull request overview

This pull request optimizes the SDMA (System DMA) multi-producer hot path in the ROCr runtime to reduce lock contention and overhead. The changes focus on performance improvements in the command submission pipeline, which is critical for efficient GPU command dispatch.

Changes:

  • Reduced lock contention by unlocking during blocking operations in PadRingToEnd, allowing other producers to reserve space while one commits
  • Replaced busy-waiting mechanisms with more efficient CPU instructions (_mm_pause for CAS retries, mwaitx for ordered-commit waits)
  • Cached immutable configuration flags and MMIO pointers at initialization to avoid repeated pointer chasing in hot paths
  • Stack-allocated the command buffers for common single-packet submissions to avoid heap allocation overhead
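
For readers unfamiliar with the `_mm_pause` change, here is a minimal sketch of a CAS retry loop that issues a pause hint instead of yielding to the scheduler. `CpuRelax` and `ReserveWithCas` are hypothetical helpers, not functions from the PR, and the non-x86 fallback is an assumption added for portability.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#if defined(__x86_64__) || defined(__i386__) || defined(_M_X64) || defined(_M_IX86)
#include <immintrin.h>
#define TOY_HAS_PAUSE 1
#else
#include <thread>
#define TOY_HAS_PAUSE 0
#endif

// Spin-wait hint: stays on the core instead of entering the scheduler.
inline void CpuRelax() {
#if TOY_HAS_PAUSE
  _mm_pause();
#else
  std::this_thread::yield();  // assumed fallback on non-x86 targets
#endif
}

// Atomically advances `index` by `size` and returns the value it had before.
inline uint64_t ReserveWithCas(std::atomic<uint64_t>& index, uint64_t size) {
  uint64_t expected = index.load(std::memory_order_relaxed);
  while (!index.compare_exchange_weak(expected, expected + size,
                                      std::memory_order_acq_rel,
                                      std::memory_order_relaxed)) {
    // A failed CAS already wrote the current value into `expected`,
    // so no separate reload is needed before retrying.
    CpuRelax();
  }
  return expected;
}
```

A pause hint suits short retry windows like this one, where the competing thread is expected to finish within a few cycles; a scheduler yield costs a system call on each failed attempt.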

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| projects/rocr-runtime/runtime/hsa-runtime/core/runtime/amd_blit_sdma.cpp | Implements all hot path optimizations: caching, lock management improvements, efficient wait mechanisms, and stack allocation for single-packet commands |
| projects/rocr-runtime/runtime/hsa-runtime/core/inc/amd_blit_sdma.h | Adds cached member variables and updates function signatures to support lock management optimization |
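
The stack-allocation change for single-packet commands might look roughly like the sketch below. The packet layout, the `kMaxPerPacket` limit, and `BuildCopy` are all assumptions made for illustration; the real SDMA packet formats and submit functions differ.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical fixed-size packet, for illustration only.
struct ToyCopyPacket {
  uint64_t src;
  uint64_t dst;
  uint32_t bytes;
  uint32_t reserved;
};

constexpr size_t kMaxPerPacket = 1u << 20;  // assumed per-packet byte limit

// Builds the packets for one copy, passes them to `sink(ptr, count)`, and
// returns the packet count. A copy that fits in one packet is built on the
// stack; only multi-packet copies touch the heap.
template <typename Sink>
size_t BuildCopy(uint64_t src, uint64_t dst, size_t bytes, Sink&& sink) {
  if (bytes == 0) return 0;
  const size_t count = (bytes + kMaxPerPacket - 1) / kMaxPerPacket;
  if (count == 1) {
    ToyCopyPacket pkt{src, dst, static_cast<uint32_t>(bytes), 0};  // stack buffer
    sink(&pkt, 1);
    return 1;
  }
  std::vector<ToyCopyPacket> pkts(count);  // heap path for large copies
  for (size_t i = 0; i < count; ++i) {
    const size_t offset = i * kMaxPerPacket;
    const size_t chunk = std::min(kMaxPerPacket, bytes - offset);
    pkts[i] = ToyCopyPacket{src + offset, dst + offset,
                            static_cast<uint32_t>(chunk), 0};
  }
  sink(pkts.data(), count);
  return count;
}
```

The sink must consume the packets before it returns, since the stack buffer does not outlive the call.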

// Use mwaitx to efficiently monitor cached_commit_index_ instead of
// burning CPU cycles.
if (core::g_use_mwaitx) {
  timer::DoMwaitx(reinterpret_cast<int64_t*>(&cached_commit_index_), 60000, true);

Copilot AI Feb 13, 2026

Type mismatch in DoMwaitx call: cached_commit_index_ is uint64_t but DoMwaitx expects int64_t*. While this works in practice because mwaitx just monitors the memory address, for type safety and consistency with other signal wait paths in the codebase (which use int64_t), consider either changing cached_commit_index_ to int64_t or using a cast through void* to make the type conversion more explicit.

Suggested change
- timer::DoMwaitx(reinterpret_cast<int64_t*>(&cached_commit_index_), 60000, true);
+ timer::DoMwaitx(static_cast<int64_t*>(static_cast<void*>(&cached_commit_index_)), 60000, true);
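
As a side note on the two spellings: since C++11, a `reinterpret_cast` between object pointer types is specified as the two-step `static_cast` through `void*`, so both forms yield the same pointer and the choice is stylistic. The small sketch below illustrates this with hypothetical helpers; it assumes a plain `uint64_t` member, as the comment describes.

```cpp
#include <cassert>
#include <cstdint>

// Signed view of an unsigned 64-bit slot, spelled as in the suggestion.
inline int64_t* AsSignedViaVoid(uint64_t* p) {
  return static_cast<int64_t*>(static_cast<void*>(p));
}

// The same conversion, spelled as in the original line.
inline int64_t* AsSignedViaReinterpret(uint64_t* p) {
  return reinterpret_cast<int64_t*>(p);
}
```

Reading the value through the `int64_t*` is also permitted by the aliasing rules on common platforms, because `int64_t` and `uint64_t` are the signed and unsigned variants of the same underlying type.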

Comment on lines +164 to +165
char* AcquireWriteAddress(uint32_t cmd_size, uint64_t& curr_index,
                          std::unique_lock<std::mutex>& lock);

Copilot AI Feb 13, 2026

Function signature was updated to include a lock parameter, but the documentation comment (lines 150-162) was not updated to document this new parameter. Add documentation for the lock parameter explaining that it's a reference to the reservation lock that may be temporarily released and reacquired during the call.
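
One possible shape for such a parameter comment is sketched below on a toy function. The existing header comment is not shown in this conversation, so the wording, the Doxygen-style tags, `ToyState`, and `AcquireSlot` are all hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <mutex>

// Hypothetical bookkeeping protected by the reservation lock.
struct ToyState {
  uint64_t next_index = 0;
};

/// Reserves `cmd_size` bytes and reports where the reservation starts.
///
/// @param state      Bookkeeping guarded by `lock`.
/// @param cmd_size   Number of bytes to reserve.
/// @param curr_index [out] Start index of the reservation.
/// @param lock       Reservation lock. Must be held on entry. It may be
///                   released and reacquired during the call, so callers
///                   must not assume guarded state is unchanged afterwards.
///                   It is held again on return.
inline void AcquireSlot(ToyState& state, uint32_t cmd_size, uint64_t& curr_index,
                        std::unique_lock<std::mutex>& lock) {
  assert(lock.owns_lock());
  lock.unlock();  // stands in for a wait that must not block other producers
  lock.lock();
  // Guarded state may have changed while unlocked, so read it only now.
  curr_index = state.next_index;
  state.next_index += cmd_size;
}
```

Stating the hold-on-entry and hold-on-return contract explicitly matters here because a lock passed by reference hides the temporary release from the call site.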

saleelk force-pushed the users/saleelk/sdmaHotPath branch 2 times, most recently from 3f36f5f to 6c4a49d Compare February 14, 2026 22:24
…d overhead

- Reduce PadRingToEnd lock contention: unlock reservation_lock_ around
  UpdateWriteAndDoorbellRegister spin-wait so other producers can reserve
  while NOP padding commits.
- Use mwaitx for ordered-commit spin in UpdateWriteAndDoorbellRegister
  instead of sched_yield(), with fallback when mwaitx is unavailable.
- Cache immutable flags and MMIO pointers at init
- Replace os::YieldThread with _mm_pause in CAS retry loops
- Eliminate redundant atomic Load in AcquireWriteAddress
- Stack-allocate single-packet command buffers in SubmitLinearCopyCommand
  and SubmitLinearFillCommand to avoid per-call heap allocation.
saleelk force-pushed the users/saleelk/sdmaHotPath branch from 6c4a49d to a26afe8 Compare February 14, 2026 22:43