Strategy for running MFC out-of-core on NVIDIA Grace-Hopper using Unified Memory #972
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #972      +/-   ##
==========================================
+ Coverage   40.91%   40.93%   +0.01%
==========================================
  Files          70       70
  Lines       20288    20288
  Branches     2517     2517
==========================================
+ Hits         8301     8305       +4
+ Misses      10450    10447       -3
+ Partials     1537     1536       -1
Approve to run benchmark
User description
This PR builds on top of the work done in #9, and aims to bring to the MFC `master` branch the zero-copy out-of-core approach that relies on `cudaMallocManaged` and `pinned` CPU memory allocations. This strategy works around some issues with unified memory and will be cleaned up as soon as these are resolved. The use of `cudaMallocManaged` allows the use of 2MB pages for the GPU allocations, which leads to fewer TLB misses and improves performance compared to the 64KB pages of `malloc` when configured without huge pages. The use of `pinned` host allocations allows locking some buffers in host memory and directly accessing them from GPU code via NVLink-C2C at peak host memory bandwidth. To ensure that MPI communications follow the fast GPUDirect paths also for unified memory, we use OpenACC `capture` on the send and receive buffers in order to switch to separate memory for these buffers, i.e. to allocate them using `cudaMalloc`. It is also important to note that we implement a series of rearranged timestep updates for the Runge-Kutta schemes that substantially improve the locality and hence performance of the out-of-core approach. All of the above are crucial for good performance.

The out-of-core implementation is highly configurable, allowing the control of the memory placement of certain arrays through the following case file parameters:
- `nv_uvm_out_of_core`: Enable/disable the out-of-core approach. This parameter essentially controls the placement of `q_cons_ts(2)`, which can be either on the GPU via `cudaMallocManaged`, or on the CPU via `cudaMallocHost`.
- `nv_uvm_igr_temps_on_gpu`: Set the number of IGR temporaries to keep in GPU memory. The rest will stay in CPU memory and will be directly accessed from there.
- `nv_uvm_pref_gpu`: Enable/disable the `@:PREFER_GPU` macro, which implements some explicit CUDA memory hints for improving performance. These can be summarized as follows: (i) set preferred location GPU to resist migrations, (ii) set accessed by CPU to prefer direct mappings over faulting, and (iii) prefetch to GPU to populate memory pages on the GPU in a very efficient way before first-touch.

This PR will also:
- `3D_IGR_TaylorGreenVortex_nvidia`.
- `fastmath` option to improve performance of mathy GPU kernels.

I used the `3D_IGR_TaylorGreenVortex_nvidia` testcase on the ALPS supercomputer. The code was tested with NVHPC 25.1 as well as the latest NVHPC nightly build.
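
As a minimal sketch of how the three case file parameters described above might be set: MFC case files are Python scripts that print a JSON dictionary, and the fragment below assumes that logical parameters take the strings "T"/"F" and that `nv_uvm_igr_temps_on_gpu` takes an integer. The values are purely illustrative and are not recommendations from this PR; a real case file needs many more entries.

```python
import json

# Hypothetical values for illustration only; this is not a complete case file.
uvm_params = {
    # Assumption: "T" enables out-of-core, i.e. q_cons_ts(2) lives in pinned CPU memory.
    "nv_uvm_out_of_core": "T",
    # Number of IGR temporaries kept in GPU memory; the rest stay in CPU memory.
    "nv_uvm_igr_temps_on_gpu": 2,
    # Assumption: "T" enables the @:PREFER_GPU memory hints.
    "nv_uvm_pref_gpu": "T",
}

# MFC reads the case as a JSON dictionary printed to standard output.
print(json.dumps(uvm_params))
```

Disabling `nv_uvm_out_of_core` would, per the description above, place `q_cons_ts(2)` on the GPU via `cudaMallocManaged` instead.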
PR Type
Enhancement
Description
Implement out-of-core strategy for NVIDIA Grace-Hopper using Unified Memory
Allow controlling memory placement of certain arrays
Introduce pinned memory pools for CPU-side allocations
Modify time-stepping algorithm for improved locality in out-of-core updates and unified memory compatibility
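
To make the placement control concrete, here is a small illustrative model (not the PR's Fortran implementation) of how a setting like `nv_uvm_igr_temps_on_gpu` could split a set of temporaries between GPU memory and pinned host memory. The function name `place_igr_temps`, the placement labels, and the rule that the first temporaries go to the GPU are all assumptions made for this sketch.

```python
def place_igr_temps(num_temps, temps_on_gpu):
    """Return one placement label per temporary.

    Hypothetical rule: the first `temps_on_gpu` temporaries are kept in
    GPU memory, the remainder in pinned host memory.
    """
    if num_temps < 0 or temps_on_gpu < 0:
        raise ValueError("counts must be non-negative")
    # Never place more temporaries on the GPU than actually exist.
    n_gpu = min(temps_on_gpu, num_temps)
    return ["gpu"] * n_gpu + ["pinned_host"] * (num_temps - n_gpu)


# Example: three temporaries, two of them kept on the GPU.
print(place_igr_temps(num_temps=3, temps_on_gpu=2))
```

Setting the count to zero in this model leaves every temporary in pinned host memory, which corresponds to the fully out-of-core end of the trade-off between GPU memory footprint and access speed.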
File Walkthrough
5 files
Add PREFER_GPU macro for memory placement
Conditional MPI buffer allocation for unified memory
Apply GPU preference to grid variables
Implement pinned memory pools for IGR temporaries
Add out-of-core time stepping with pinned memory
1 files
Add Taylor-Green vortex test case configuration
1 files
Add home directory path helper method
5 files
Add GPU and CPU binding script for Santis
Add NVIDIA Nsight profiling wrapper script
Add Santis supercomputer job template with UVM settings
Update NVHPC compiler flags for unified memory
Add Santis module configuration