Strategy for running MFC out-of-core on NVIDIA Grace-Hopper using Unified Memory #972


Merged: 27 commits merged into MFlowCode:master from nvidia on Aug 14, 2025

Conversation

Contributor

@ntselepidis ntselepidis commented Aug 1, 2025

User description

This PR builds on top of the work done in #9 and aims to bring to the MFC master branch the zero-copy out-of-core approach that relies on cudaMallocManaged and pinned CPU memory allocations. This strategy works around some issues with unified memory and will be cleaned up as soon as these are resolved. Using cudaMallocManaged gives the GPU allocations 2 MB pages, which leads to fewer TLB misses and better performance than the 64 KB pages that malloc produces when the system is configured without huge pages. Pinned host allocations let us lock selected buffers in host memory and access them directly from GPU code via NVLink-C2C at peak host memory bandwidth. To ensure that MPI communication follows the fast GPUDirect paths under unified memory as well, we use an OpenACC capture on the send and receive buffers to place them in separate device memory, i.e. to allocate them with cudaMalloc. We also implement a series of rearranged time-step updates for the Runge-Kutta schemes that substantially improve locality and, hence, the performance of the out-of-core approach. All of the above are crucial for good performance.
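
To make the zero-copy idea concrete, here is a minimal CUDA Fortran sketch; it is not MFC code, uses none of MFC's macros, and assumes NVHPC with CUDA Fortran plus a managed/unified-memory GPU configuration (e.g. nvfortran -cuda -acc -gpu=mem:unified) on Grace-Hopper. A managed array is GPU-resident while a pinned array stays locked in host memory, and a single OpenACC kernel touches both; the OpenACC capture trick for the MPI buffers is not shown.

! Minimal sketch (not MFC code): zero-copy access to pinned host memory from a
! GPU kernel, assuming nvfortran -cuda -acc -gpu=mem:unified on Grace-Hopper.
program zero_copy_sketch
    use cudafor
    implicit none
    integer, parameter :: nx = 1024
    real, managed, allocatable :: a(:) ! GPU-resident, allocated via cudaMallocManaged
    real, pinned, allocatable :: b(:)  ! locked in host memory, read by the GPU over NVLink-C2C
    integer :: i

    allocate (a(nx)) ! managed allocation, eligible for 2 MB pages on the GPU
    allocate (b(nx)) ! pinned host allocation
    b = 1.0

    !$acc parallel loop
    do i = 1, nx
        a(i) = 2.0*b(i) ! device code dereferences the pinned host buffer directly
    end do

    print *, 'a(1) =', a(1)
    deallocate (a, b)
end program zero_copy_sketch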

The out-of-core implementation is highly configurable, allowing control over the memory placement of certain arrays through the following case file parameters:

  • nv_uvm_out_of_core: Enable/disable the out-of-core approach. This parameter essentially controls the placement of q_cons_ts(2), which can be either on the GPU via cudaMallocManaged or on the CPU via cudaMallocHost (a minimal placement sketch follows this list).
  • nv_uvm_igr_temps_on_gpu: Set the number of IGR temporaries to keep in GPU memory. The rest stay in CPU memory and are accessed directly from there.
  • nv_uvm_pref_gpu: Enable/disable the @:PREFER_GPU macro, which applies explicit CUDA memory hints to improve performance. These can be summarized as follows: (i) set the preferred location to the GPU to resist migrations, (ii) set accessed-by CPU to prefer direct mappings over faulting, and (iii) prefetch to the GPU to populate memory pages efficiently before first touch.
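
As a rough illustration of what the nv_uvm_out_of_core switch amounts to, here is a hypothetical sketch; the names and structure are illustrative only and do not reflect MFC's actual allocation path, which goes through its ALLOCATE/PREFER_GPU macros and pinned pools. The flag simply decides whether a stage buffer is a managed, GPU-resident allocation or a pinned host allocation that GPU kernels then access in place.

! Hypothetical sketch of the placement decision behind nv_uvm_out_of_core
! (requires nvfortran -cuda for the managed and pinned attributes).
module placement_sketch
    use cudafor
    implicit none
    real, managed, allocatable, target :: buf_gpu(:) ! cudaMallocManaged under the hood
    real, pinned, allocatable, target :: buf_host(:) ! pinned host memory, cudaMallocHost-style
    real, pointer :: buf(:) => null()                ! the buffer the solver actually uses
contains
    subroutine allocate_stage_buffer(out_of_core, nx)
        logical, intent(in) :: out_of_core
        integer, intent(in) :: nx
        if (out_of_core) then
            allocate (buf_host(nx))
            buf => buf_host ! stays in CPU memory; GPU code accesses it over NVLink-C2C
        else
            allocate (buf_gpu(nx))
            buf => buf_gpu ! kept resident in GPU memory
        end if
    end subroutine allocate_stage_buffer
end module placement_sketch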

This PR will also:

  • Add an example test case called 3D_IGR_TaylorGreenVortex_nvidia.
  • Add helper scripts for binding, running, and profiling on Santis/ALPS.
  • Add a fastmath option to improve the performance of math-heavy GPU kernels.
  • Fix a bug in the 2nd-order TVD RK scheme that was introduced by a merge.

I used the 3D_IGR_TaylorGreenVortex_nvidia test case on the ALPS supercomputer.
The code was tested with NVHPC 25.1 as well as the latest NVHPC nightly build.


PR Type

Enhancement


Description

  • Implement out-of-core strategy for NVIDIA Grace-Hopper using Unified Memory

  • Allow controlling memory placement of certain arrays

  • Introduce pinned memory pools for CPU-side allocations (a minimal sketch of the pattern follows this list)

  • Modify time-stepping algorithm for improved locality in out-of-core updates and unified memory compatibility
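
For reviewers unfamiliar with the pool-plus-pointer pattern behind the pinned CPU-side allocations, here is a minimal Fortran sketch with hypothetical names; MFC's real version lives in m_igr.fpp and is quoted in the review comments below. A pinned host pool is allocated once with default bounds, and a bounds-remapped pointer gives it the index range the solver expects.

! Minimal sketch of a pinned host pool with a bounds-remapped pointer
! (hypothetical names; requires nvfortran -cuda for the pinned attribute).
module pinned_pool_sketch
    implicit none
    real, pinned, allocatable, target :: pool_host(:, :, :) ! locked in host memory
    real, pointer :: work(:, :, :) => null()
contains
    subroutine setup_pool(ibeg, iend, jbeg, jend, kbeg, kend)
        integer, intent(in) :: ibeg, iend, jbeg, jend, kbeg, kend
        ! The pool itself uses default 1-based bounds ...
        allocate (pool_host(iend - ibeg + 1, jend - jbeg + 1, kend - kbeg + 1))
        ! ... while the bounds-remapped pointer carries the solver's index ranges
        ! and refers directly to the pinned pool's storage.
        work(ibeg:iend, jbeg:jend, kbeg:kend) => pool_host
    end subroutine setup_pool
end module pinned_pool_sketch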


Diagram Walkthrough

flowchart LR
  A["Environment Variables"] --> B["Memory Placement Control"]
  B --> C["GPU Unified Memory"]
  B --> D["CPU Pinned Memory"]
  C --> E["Time Stepping Algorithm"]
  D --> E
  E --> F["Out-of-Core Computation"]

File Walkthrough

Relevant files
Enhancement (5 files)
  macros.fpp: Add PREFER_GPU macro for memory placement (+47/-0)
  m_mpi_common.fpp: Conditional MPI buffer allocation for unified memory (+8/-0)
  m_global_parameters.fpp: Apply GPU preference to grid variables (+9/-0)
  m_igr.fpp: Implement pinned memory pools for IGR temporaries (+105/-0)
  m_time_steppers.fpp: Add out-of-core time stepping with pinned memory (+124/-6)
Tests (1 file)
  case.py: Add Taylor-Green vortex test case configuration (+101/-0)
Miscellaneous (1 file)
  build.py: Add home directory path helper method (+3/-0)
Configuration changes (5 files)
  bind.sh: Add GPU and CPU binding script for Santis (+24/-0)
  nsys.sh: Add NVIDIA Nsight profiling wrapper script (+24/-0)
  santis.mako: Add Santis supercomputer job template with UVM settings (+85/-0)
  CMakeLists.txt: Update NVHPC compiler flags for unified memory (+3/-3)
  modules: Add Santis module configuration (+3/-0)

@ntselepidis ntselepidis changed the title from "Modified Strategy for Running MFC out-of-core on NVIDIA Grace-Hopper using Unified Memory" to "Modified strategy for running MFC out-of-core on NVIDIA Grace-Hopper using Unified Memory" on Aug 1, 2025
@ntselepidis ntselepidis marked this pull request as ready for review on August 2, 2025 19:52
@ntselepidis ntselepidis requested review from a team and sbryngelson as code owners on August 2, 2025 19:52

qodo-merge-pro bot commented Aug 2, 2025

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

🎫 Ticket compliance analysis ❌

9 - Not compliant

Non-compliant requirements:

• Implement ZFP-like compression of halo regions sent over MPI
• Improve performance through compression and decompression of MPI halo regions
• Use Lawrence Livermore National Laboratory's ZFP library as candidate solution

⏱️ Estimated effort to review: 4 🔵🔵🔵🔵⚪
🧪 PR contains tests
🔒 No security concerns identified
⚡ Recommended focus areas for review

Code Duplication

The time-stepping algorithm contains significant code duplication between unified memory and non-unified memory paths. The same mathematical operations are implemented twice with only minor differences in variable assignments.

#if !defined(__NVCOMPILER_GPU_UNIFIED_MEM)
        $:GPU_PARALLEL_LOOP(collapse=4)
        do i = 1, sys_size
            do l = 0, p
                do k = 0, n
                    do j = 0, m
                        q_cons_ts(2)%vf(i)%sf(j, k, l) = &
                            q_cons_ts(1)%vf(i)%sf(j, k, l) &
                            + dt*rhs_vf(i)%sf(j, k, l)
                    end do
                end do
            end do
        end do
        dest = 2 ! result in q_cons_ts(2)%vf
#else
        $:GPU_PARALLEL_LOOP(collapse=4)
        do i = 1, sys_size
            do l = 0, p
                do k = 0, n
                    do j = 0, m
                        q_cons_ts(2)%vf(i)%sf(j, k, l) = &
                            q_cons_ts(1)%vf(i)%sf(j, k, l)
                        q_cons_ts(1)%vf(i)%sf(j, k, l) = &
                            q_cons_ts(1)%vf(i)%sf(j, k, l) &
                            + dt*rhs_vf(i)%sf(j, k, l)
                    end do
                end do
            end do
        end do
        dest = 1 ! result in q_cons_ts(1)%vf
#endif

        !Evolve pb and mv for non-polytropic qbmm
        if (qbmm .and. (.not. polytropic)) then
            $:GPU_PARALLEL_LOOP(collapse=5)
            do i = 1, nb
                do l = 0, p
                    do k = 0, n
                        do j = 0, m
                            do q = 1, nnode
                                pb_ts(2)%sf(j, k, l, q, i) = &
                                    pb_ts(1)%sf(j, k, l, q, i) &
                                    + dt*rhs_pb(j, k, l, q, i)
                            end do
                        end do
                    end do
                end do
            end do
        end if

        if (qbmm .and. (.not. polytropic)) then
            $:GPU_PARALLEL_LOOP(collapse=5)
            do i = 1, nb
                do l = 0, p
                    do k = 0, n
                        do j = 0, m
                            do q = 1, nnode
                                mv_ts(2)%sf(j, k, l, q, i) = &
                                    mv_ts(1)%sf(j, k, l, q, i) &
                                    + dt*rhs_mv(j, k, l, q, i)
                            end do
                        end do
                    end do
                end do
            end do
        end if

        if (bodyForces) call s_apply_bodyforces(q_cons_ts(2)%vf, q_prim_vf, rhs_vf, dt)

        if (grid_geometry == 3) call s_apply_fourier_filter(q_cons_ts(2)%vf)

        if (model_eqns == 3 .and. (.not. relax)) then
            call s_pressure_relaxation_procedure(q_cons_ts(2)%vf)
        end if

        if (adv_n) call s_comp_alpha_from_n(q_cons_ts(2)%vf)

        if (ib) then
            if (qbmm .and. .not. polytropic) then
                call s_ibm_correct_state(q_cons_ts(2)%vf, q_prim_vf, pb_ts(2)%sf, mv_ts(2)%sf)
            else
                call s_ibm_correct_state(q_cons_ts(2)%vf, q_prim_vf)
            end if
        end if

        ! Stage 2 of 3

        call s_compute_rhs(q_cons_ts(dest)%vf, q_T_sf, q_prim_vf, bc_type, rhs_vf, pb_ts(2)%sf, rhs_pb, mv_ts(2)%sf, rhs_mv, t_step, time_avg, 2)

        if (bubbles_lagrange .and. .not. adap_dt) call s_update_lagrange_tdv_rk(stage=2)

#if  !defined(__NVCOMPILER_GPU_UNIFIED_MEM)
        $:GPU_PARALLEL_LOOP(collapse=4)
        do i = 1, sys_size
            do l = 0, p
                do k = 0, n
                    do j = 0, m
                        q_cons_ts(2)%vf(i)%sf(j, k, l) = &
                            (3._wp*q_cons_ts(1)%vf(i)%sf(j, k, l) &
                             + q_cons_ts(2)%vf(i)%sf(j, k, l) &
                             + dt*rhs_vf(i)%sf(j, k, l))/4._wp
                    end do
                end do
            end do
        end do
        dest = 2 ! result in q_cons_ts(2)%vf
#else
        $:GPU_PARALLEL_LOOP(collapse=4)
        do i = 1, sys_size
            do l = 0, p
                do k = 0, n
                    do j = 0, m
                        q_cons_ts(1)%vf(i)%sf(j, k, l) = &
                            (3._wp*q_cons_ts(2)%vf(i)%sf(j, k, l) &
                             + q_cons_ts(1)%vf(i)%sf(j, k, l) &
                             + dt*rhs_vf(i)%sf(j, k, l))/4._wp
                    end do
                end do
            end do
        end do
        dest = 1 ! result in q_cons_ts(1)%vf
#endif

        if (qbmm .and. (.not. polytropic)) then
            $:GPU_PARALLEL_LOOP(collapse=5)
            do i = 1, nb
                do l = 0, p
                    do k = 0, n
                        do j = 0, m
                            do q = 1, nnode
                                pb_ts(2)%sf(j, k, l, q, i) = &
                                    (3._wp*pb_ts(1)%sf(j, k, l, q, i) &
                                     + pb_ts(2)%sf(j, k, l, q, i) &
                                     + dt*rhs_pb(j, k, l, q, i))/4._wp
                            end do
                        end do
                    end do
                end do
            end do
        end if

        if (qbmm .and. (.not. polytropic)) then
            $:GPU_PARALLEL_LOOP(collapse=5)
            do i = 1, nb
                do l = 0, p
                    do k = 0, n
                        do j = 0, m
                            do q = 1, nnode
                                mv_ts(2)%sf(j, k, l, q, i) = &
                                    (3._wp*mv_ts(1)%sf(j, k, l, q, i) &
                                     + mv_ts(2)%sf(j, k, l, q, i) &
                                     + dt*rhs_mv(j, k, l, q, i))/4._wp
                            end do
                        end do
                    end do
                end do
            end do
        end if

        if (bodyForces) call s_apply_bodyforces(q_cons_ts(2)%vf, q_prim_vf, rhs_vf, dt/4._wp)

        if (grid_geometry == 3) call s_apply_fourier_filter(q_cons_ts(2)%vf)

        if (model_eqns == 3 .and. (.not. relax)) then
            call s_pressure_relaxation_procedure(q_cons_ts(2)%vf)
        end if

        if (adv_n) call s_comp_alpha_from_n(q_cons_ts(2)%vf)

        if (ib) then
            if (qbmm .and. .not. polytropic) then
                call s_ibm_correct_state(q_cons_ts(2)%vf, q_prim_vf, pb_ts(2)%sf, mv_ts(2)%sf)
            else
                call s_ibm_correct_state(q_cons_ts(2)%vf, q_prim_vf)
            end if
        end if

        ! Stage 3 of 3
        call s_compute_rhs(q_cons_ts(dest)%vf, q_T_sf, q_prim_vf, bc_type, rhs_vf, pb_ts(2)%sf, rhs_pb, mv_ts(2)%sf, rhs_mv, t_step, time_avg, 3)

        if (bubbles_lagrange .and. .not. adap_dt) call s_update_lagrange_tdv_rk(stage=3)

#if !defined(__NVCOMPILER_GPU_UNIFIED_MEM)
        $:GPU_PARALLEL_LOOP(collapse=4)
        do i = 1, sys_size
            do l = 0, p
                do k = 0, n
                    do j = 0, m
                        q_cons_ts(1)%vf(i)%sf(j, k, l) = &
                            (q_cons_ts(1)%vf(i)%sf(j, k, l) &
                             + 2._wp*q_cons_ts(2)%vf(i)%sf(j, k, l) &
                             + 2._wp*dt*rhs_vf(i)%sf(j, k, l))/3._wp
                    end do
                end do
            end do
        end do
        dest = 1 ! result in q_cons_ts(1)%vf
#else
        $:GPU_PARALLEL_LOOP(collapse=4)
        do i = 1, sys_size
            do l = 0, p
                do k = 0, n
                    do j = 0, m
                        q_cons_ts(1)%vf(i)%sf(j, k, l) = &
                            (q_cons_ts(2)%vf(i)%sf(j, k, l) &
                             + 2._wp*q_cons_ts(1)%vf(i)%sf(j, k, l) &
                             + 2._wp*dt*rhs_vf(i)%sf(j, k, l))/3._wp
                    end do
                end do
            end do
        end do
        dest = 1 ! result in q_cons_ts(1)%vf
#endif
Error Handling

The PREFER_GPU macro writes CUDA error messages to stdout but continues execution even when CUDA operations fail. This could lead to silent failures and incorrect behavior.

istat = cudaMemAdvise( c_devloc(${arg}$), SIZEOF(${arg}$), cudaMemAdviseSetPreferredLocation, 0 )
if (istat /= cudaSuccess) then
    write(*,"('Error code: ',I0, ': ')") istat
    write(*,*) cudaGetErrorString(istat)
endif
! set accessed by CPU
istat = cudaMemAdvise( c_devloc(${arg}$), SIZEOF(${arg}$), cudaMemAdviseSetAccessedBy, cudaCpuDeviceId )
if (istat /= cudaSuccess) then
    write(*,"('Error code: ',I0, ': ')") istat
    write(*,*) cudaGetErrorString(istat)
endif
! prefetch to GPU - physically populate memory pages
istat = cudaMemPrefetchAsync( c_devloc(${arg}$), SIZEOF(${arg}$), 0, 0 )
if (istat /= cudaSuccess) then
    write(*,"('Error code: ',I0, ': ')") istat
    write(*,*) cudaGetErrorString(istat)
endif
Memory Management

Complex conditional memory allocation logic with pointers and pinned memory pools could lead to memory leaks or dangling pointers if not properly managed during error conditions.

        if ( temp_on_gpu(1) == 1 ) then
            @:ALLOCATE(jac(idwbuff(1)%beg:idwbuff(1)%end, &
                idwbuff(2)%beg:idwbuff(2)%end, &
                idwbuff(3)%beg:idwbuff(3)%end))
            @:PREFER_GPU(jac)
        else
            !print*, 'jac on CPU'
            allocate(pool_host1(idwbuff(1)%beg:idwbuff(1)%end, &
                idwbuff(2)%beg:idwbuff(2)%end, &
                idwbuff(3)%beg:idwbuff(3)%end))

            jac(idwbuff(1)%beg:idwbuff(1)%end, &
                idwbuff(2)%beg:idwbuff(2)%end, &
                idwbuff(3)%beg:idwbuff(3)%end) => pool_host1(:,:,:)
        end if

        if ( temp_on_gpu(2) == 1 ) then
            @:ALLOCATE(jac_rhs(-1:m,-1:n,-1:p))
            @:PREFER_GPU(jac_rhs)
        else
            !print*, 'jac_rhs on CPU'
            allocate(pool_host2(-1:m,-1:n,-1:p))

            jac_rhs(-1:m,-1:n,-1:p) => pool_host2(:,:,:)
        end if

        if (igr_iter_solver == 1) then ! Jacobi iteration
            if ( temp_on_gpu(3) == 1 ) then
                @:ALLOCATE(jac_old(idwbuff(1)%beg:idwbuff(1)%end, &
                    idwbuff(2)%beg:idwbuff(2)%end, &
                    idwbuff(3)%beg:idwbuff(3)%end))
                @:PREFER_GPU(jac_old)
            else
                !print*, 'jac_old on CPU'
                allocate(pool_host3(idwbuff(1)%beg:idwbuff(1)%end, &
                    idwbuff(2)%beg:idwbuff(2)%end, &
                    idwbuff(3)%beg:idwbuff(3)%end))

                jac_old(idwbuff(1)%beg:idwbuff(1)%end, &
                    idwbuff(2)%beg:idwbuff(2)%end, &
                    idwbuff(3)%beg:idwbuff(3)%end) => pool_host3(:,:,:)
            end if
        end if
#endif


qodo-merge-pro bot commented Aug 2, 2025

PR Code Suggestions ✨

Latest suggestions up to fb50e90

Category: Incremental [*]
Synchronize after prefetch

Ensure device synchronization after prefetch to avoid race conditions where
subsequent kernels access unmigrated pages; add a stream/device synchronize
following the last successful cudaMemPrefetchAsync to guarantee prefetch
completion before use.

src/common/include/macros.fpp [40-56]

 istat = cudaMemAdvise(c_devloc(${arg}$), SIZEOF(${arg}$), cudaMemAdviseSetPreferredLocation, 0)
 if (istat /= cudaSuccess) then
     write (*, "('Error code: ',I0, ': ')") istat
-    !write(*,*) cudaGetErrorString(istat)
 end if
-! set accessed by CPU
 istat = cudaMemAdvise(c_devloc(${arg}$), SIZEOF(${arg}$), cudaMemAdviseSetAccessedBy, cudaCpuDeviceId)
 if (istat /= cudaSuccess) then
     write (*, "('Error code: ',I0, ': ')") istat
-    !write(*,*) cudaGetErrorString(istat)
 end if
-! prefetch to GPU - physically populate memory pages
 istat = cudaMemPrefetchAsync(c_devloc(${arg}$), SIZEOF(${arg}$), 0, 0)
 if (istat /= cudaSuccess) then
     write (*, "('Error code: ',I0, ': ')") istat
-    !write(*,*) cudaGetErrorString(istat)
+end if
+istat = cudaDeviceSynchronize()
+if (istat /= cudaSuccess) then
+    write (*, "('Error code: ',I0, ': ')") istat
 end if
Suggestion importance[1-10]: 7


Why: The suggestion correctly identifies a potential race condition by adding cudaDeviceSynchronize() after an asynchronous prefetch, which is good practice to ensure data is on the GPU before use.

Impact: Medium

Previous suggestions

Suggestions up to commit 4065c02
Category: Possible issue
Add proper error handling termination

The CUDA error handling should include proper error recovery or program
termination. Currently, errors are only logged but execution continues, which
could lead to undefined behavior or silent failures in GPU memory management.

src/common/include/macros.fpp [38-42]

 istat = cudaMemAdvise( c_devloc(${arg}$), SIZEOF(${arg}$), cudaMemAdviseSetPreferredLocation, 0 )
 if (istat /= cudaSuccess) then
     write(*,"('Error code: ',I0, ': ')") istat
     write(*,*) cudaGetErrorString(istat)
+    stop 'CUDA memory advise failed'
 endif
Suggestion importance[1-10]: 7


Why: The suggestion correctly identifies that a failure in cudaMemAdvise is only logged, and adds a stop statement, which improves the code's robustness by preventing execution with potentially incorrect memory configurations.

Impact: Medium
Category: General
Simplify pointer assignment for safety

The bounds-spec pointer assignment repeats the bounds of pool_host1 even though pool_host1 is allocated with exactly those bounds. The redundant bounds specification invites a mismatch if the pointer's bounds and the allocated bounds ever diverge.

src/simulation/m_igr.fpp [153-155]

-jac(idwbuff(1)%beg:idwbuff(1)%end, &
-    idwbuff(2)%beg:idwbuff(2)%end, &
-    idwbuff(3)%beg:idwbuff(3)%end) => pool_host1(:,:,:)
+jac => pool_host1
Suggestion importance[1-10]: 6


Why: The suggestion correctly points out that the explicit slicing in the pointer assignment is redundant and can be simplified to jac => pool_host1, which is safer and more readable as it relies on the declared bounds.

Impact: Low


codecov bot commented Aug 5, 2025

Codecov Report

❌ Patch coverage is 90.00000% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 40.93%. Comparing base (5e8f116) to head (fb50e90).
⚠️ Report is 1 commit behind head on master.

Files with missing lines: src/simulation/m_time_steppers.fpp (patch coverage 75.00%, 1 line missing ⚠️)
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #972      +/-   ##
==========================================
+ Coverage   40.91%   40.93%   +0.01%     
==========================================
  Files          70       70              
  Lines       20288    20288              
  Branches     2517     2517              
==========================================
+ Hits         8301     8305       +4     
+ Misses      10450    10447       -3     
+ Partials     1537     1536       -1     

@ntselepidis ntselepidis changed the title from "Modified strategy for running MFC out-of-core on NVIDIA Grace-Hopper using Unified Memory" to "Strategy for running MFC out-of-core on NVIDIA Grace-Hopper using Unified Memory" on Aug 7, 2025
@ntselepidis ntselepidis force-pushed the nvidia branch 3 times, most recently from e47036b to 8fef22d, on August 8, 2025 17:19
Collaborator
@wilfonba wilfonba left a comment

Approve to run benchmark

Collaborator
@wilfonba wilfonba left a comment

Approve to run benchmark

@sbryngelson sbryngelson merged commit f8830e5 into MFlowCode:master on Aug 14, 2025
37 checks passed