rocr: Optimize SDMA multi-producer hot path for reduced contention and overhead#3264
rocr: Optimize SDMA multi-producer hot path for reduced contention and overhead#3264
Conversation
There was a problem hiding this comment.
Pull request overview
This pull request optimizes the SDMA (System DMA) multi-producer hot path in the ROCr runtime to reduce lock contention and overhead. The changes focus on performance improvements in the command submission pipeline, which is critical for efficient GPU command dispatch.
Changes:
- Reduced lock contention by unlocking during blocking operations in PadRingToEnd, allowing other producers to reserve space while one commits
- Replaced busy-waiting mechanisms with more efficient CPU instructions (_mm_pause for CAS retries, mwaitx for ordered-commit waits)
- Cached immutable configuration flags and MMIO pointers at initialization to avoid repeated pointer chasing in hot paths
- Optimized stack allocation for common single-packet command submissions to avoid heap allocation overhead
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| projects/rocr-runtime/runtime/hsa-runtime/core/runtime/amd_blit_sdma.cpp | Implements all hot path optimizations: caching, lock management improvements, efficient wait mechanisms, and stack allocation for single-packet commands |
| projects/rocr-runtime/runtime/hsa-runtime/core/inc/amd_blit_sdma.h | Adds cached member variables and updates function signatures to support lock management optimization |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| // Use mwaitx to efficiently monitor cached_commit_index_ instead of | ||
| // burning CPU cycles. | ||
| if (core::g_use_mwaitx) { | ||
| timer::DoMwaitx(reinterpret_cast<int64_t*>(&cached_commit_index_), 60000, true); |
There was a problem hiding this comment.
Type mismatch in DoMwaitx call: cached_commit_index_ is uint64_t but DoMwaitx expects int64_t*. While this works in practice because mwaitx just monitors the memory address, for type safety and consistency with other signal wait paths in the codebase (which use int64_t), consider either changing cached_commit_index_ to int64_t or using a cast through void* to make the type conversion more explicit.
| timer::DoMwaitx(reinterpret_cast<int64_t*>(&cached_commit_index_), 60000, true); | |
| timer::DoMwaitx(static_cast<int64_t*>(static_cast<void*>(&cached_commit_index_)), 60000, true); |
| char* AcquireWriteAddress(uint32_t cmd_size, uint64_t& curr_index, | ||
| std::unique_lock<std::mutex>& lock); |
There was a problem hiding this comment.
Function signature was updated to include a lock parameter, but the documentation comment (lines 150-162) was not updated to document this new parameter. Add documentation for the lock parameter explaining that it's a reference to the reservation lock that may be temporarily released and reacquired during the call.
3f36f5f to
6c4a49d
Compare
…d overhead - Reduce PadRingToEnd lock contention: unlock reservation_lock_ around UpdateWriteAndDoorbellRegister spin-wait so other producers can reserve while NOP padding commits. - Use mwaitx for ordered-commit spin in UpdateWriteAndDoorbellRegister instead of sched_yield(), with fallback when mwaitx is unavailable. - Cache immutable flags and MMIO pointers at init - Replace os::YieldThread with _mm_pause in CAS retry loops - Eliminate redundant atomic Load in AcquireWriteAddress - Stack-allocate single-packet command buffers in SubmitLinearCopyCommand and SubmitLinearFillCommand to avoid per-call heap allocation.
6c4a49d to
a26afe8
Compare
Motivation
Time taken by hot path
Technical Details
JIRA ID
Test Plan
Test Result
Submission Checklist