Improve GPU-aware section in the docs #927

Merged
lcw merged 5 commits into master from lr/update-doc on Jan 13, 2026

@luraess luraess commented Dec 17, 2025

Adds information to the docs, as discussed in #924.

luraess changed the title from "Improve GPU-aware section" to "Improve GPU-aware section in the docs" on Dec 17, 2025

> comm_loc = MPI.Comm_split_type(comm, MPI.COMM_TYPE_SHARED, rank)
> rank_loc = MPI.Comm_rank(comm_loc)
> ```
> If using (2), one can use the default device, but make sure to handle device visibility in the scheduler; for SLURM on Cray systems, this can mostly be achieved using `--gpus-per-task=1`.

Doesn't '--gpus-per-task' for SLURM prevent the use of GPU Peer2Peer IPC mechanisms (https://cpe.ext.hpe.com/docs/24.03/mpt/mpich/intro_mpi.html) which would have a negative impact on performance?

Member

Yeah that's what I also remember, but perhaps Nvidia has finally fixed this?

Nope, not as far as I can tell.

Contributor Author

Thanks for pointing this out. I can make the text more generic.
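
For readers following this thread, the local-rank approach from the quoted diff could be sketched roughly as below. The helper name `select_device` is hypothetical and not part of MPI.jl or CUDA.jl. The MPI and CUDA calls appear only as comments so the snippet runs without a GPU or an MPI launcher. The sketch assumes CUDA.jl's 0-based device indexing and wraps around if there are more ranks per node than visible devices.

```julia
# Hypothetical helper (not part of MPI.jl or CUDA.jl): map a node-local MPI rank
# to a 0-based GPU index, wrapping around if ranks outnumber visible devices.
function select_device(rank_loc::Integer, ndevices::Integer)
    ndevices > 0 || throw(ArgumentError("no GPU visible to this process"))
    rank_loc >= 0 || throw(ArgumentError("local rank must be non-negative"))
    return mod(rank_loc, ndevices)
end

# Intended use inside an MPI program (assumes MPI.jl and CUDA.jl are loaded
# and MPI.Init() has been called):
#
#   comm     = MPI.COMM_WORLD
#   rank     = MPI.Comm_rank(comm)
#   comm_loc = MPI.Comm_split_type(comm, MPI.COMM_TYPE_SHARED, rank)
#   rank_loc = MPI.Comm_rank(comm_loc)
#   CUDA.device!(select_device(rank_loc, length(CUDA.devices())))
```

In this variant every rank still sees all devices on its node, so it does not rely on the scheduler restricting device visibility, which is the concern raised above about `--gpus-per-task`.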

Comment on lines +85 to +87
Successfully running [alltoall_test_cuda.jl](https://gist.github.com/luraess/0063e90cb08eb2208b7fe204bbd90ed2)
should confirm that your MPI implementation has CUDA support enabled. Moreover, successfully running
[alltoall_test_cuda_multigpu.jl](https://gist.github.com/luraess/ed93cc09ba04fe16f63b4219c1811566) should confirm

Member

Can we move the files into this repository?

Contributor Author

Yes, we can

Contributor Author

Where should we put them?

Comment on lines +103 to +108
!!! note "Preloads"
On Cray machines, you may need to ensure the following preloads to be set in the preferences:
```
preloads = ["libmpi_gtl_hsa.so"]
preloads_env_switch = "MPICH_GPU_SUPPORT_ENABLED"
```

Member

This is also true for CUDA.

preloads_env_switch = "MPICH_GPU_SUPPORT_ENABLED"
```

!!! note "Multiple GPUs per node"

Member

Suggested change
- !!! note "Multiple GPUs per node"
+ ### "Multiple GPUs per node"

Since the text is not just about ROCm?

luraess commented Jan 6, 2026

I'd be happy if someone could review the naming and location of the added files, and the updated text 🙏

lcw commented Jan 13, 2026

This looks good to me. It seems all issues were addressed. Thanks!

lcw merged commit 7dabd91 into master on Jan 13, 2026
1 of 4 checks passed
lcw deleted the lr/update-doc branch on January 13, 2026 21:23
@giordano
Member

This PR broke the docs build:

┌ Error: invalid local link/image: path pointing to a file outside of build directory in docs/src/usage.md
│   link =
│    @ast MarkdownAST.Link("../examples/alltoall_test_cuda.jl", "") do
│      MarkdownAST.Text("alltoall")
│      MarkdownAST.Text("_")
│      MarkdownAST.Text("test")
│      MarkdownAST.Text("_")
│      MarkdownAST.Text("cuda.jl")
│    end
│    
└ @ Documenter ~/.julia/packages/Documenter/xvqbW/src/utilities/utilities.jl:47
┌ Error: invalid local link/image: path pointing to a file outside of build directory in docs/src/usage.md
│   link =
│    @ast MarkdownAST.Link("../examples/alltoall_test_cuda_multigpu.jl", "") do
│      MarkdownAST.Text("alltoall")
│      MarkdownAST.Text("_")
│      MarkdownAST.Text("test")
│      MarkdownAST.Text("_")
│      MarkdownAST.Text("cuda")
│      MarkdownAST.Text("_")
│      MarkdownAST.Text("multigpu.jl")
│    end
│    
└ @ Documenter ~/.julia/packages/Documenter/xvqbW/src/utilities/utilities.jl:47
┌ Error: invalid local link/image: path pointing to a file outside of build directory in docs/src/usage.md
│   link =
│    @ast MarkdownAST.Link("../examples/alltoall_test_rocm.jl", "") do
│      MarkdownAST.Text("alltoall")
│      MarkdownAST.Text("_")
│      MarkdownAST.Text("test")
│      MarkdownAST.Text("_")
│      MarkdownAST.Text("rocm.jl")
│    end
│    
└ @ Documenter ~/.julia/packages/Documenter/xvqbW/src/utilities/utilities.jl:47
┌ Error: invalid local link/image: path pointing to a file outside of build directory in docs/src/usage.md
│   link =
│    @ast MarkdownAST.Link("../examples/alltoall_test_rocm_multigpu.jl", "") do
│      MarkdownAST.Text("alltoall")
│      MarkdownAST.Text("_")
│      MarkdownAST.Text("test")
│      MarkdownAST.Text("_")
│      MarkdownAST.Text("rocm")
│      MarkdownAST.Text("_")
│      MarkdownAST.Text("multigpu.jl")
│    end
│    
└ @ Documenter ~/.julia/packages/Documenter/xvqbW/src/utilities/utilities.jl:47
[ Info: CheckDocument: running document checks.

luraess commented Jan 26, 2026

Yeah, this was essentially my open question in "review added file naming and location": I suspected it might not work like the other examples, but was unsure whether to put these examples in another folder. Any suggestions, @giordano?
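
One possible way out, sketched here as an assumption rather than the fix that was eventually adopted, is to stage the example scripts under `docs/src` before `makedocs` runs, so that the links in `usage.md` resolve inside the build directory. Linking to the files' absolute GitHub URLs would be another option. The function name `stage_examples` and the `docs/src/examples` target directory are hypothetical.

```julia
# Hypothetical helper for docs/make.jl: copy example scripts into the docs
# source tree so that local links to them stay inside the build directory.
function stage_examples(src_dir::AbstractString, dst_dir::AbstractString,
                        names::AbstractVector{<:AbstractString})
    mkpath(dst_dir)
    staged = String[]
    for name in names
        src = joinpath(src_dir, name)
        isfile(src) || error("example file not found: $src")
        dst = joinpath(dst_dir, name)
        cp(src, dst; force=true)  # overwrite stale copies from earlier builds
        push!(staged, dst)
    end
    return staged
end

# Intended use before calling makedocs (paths are assumptions):
#
#   stage_examples(joinpath(@__DIR__, "examples"),
#                  joinpath(@__DIR__, "src", "examples"),
#                  ["alltoall_test_cuda.jl", "alltoall_test_cuda_multigpu.jl",
#                   "alltoall_test_rocm.jl", "alltoall_test_rocm_multigpu.jl"])
```

With this layout, the links in `docs/src/usage.md` would point to `examples/<file>` instead of `../examples/<file>`.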
