-
Notifications
You must be signed in to change notification settings - Fork 87
Add GPU & large ARM runners, and setup-palace-ci action #578
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: gbozzola/hypre3
Are you sure you want to change the base?
Conversation
f5c241d to
c48c530
Compare
| } | ||
|
|
||
| TEST_CASE("2D libCEED Interpolators", "[libCEED][Interpolator][Serial][Parallel][GPU]") | ||
| TEST_CASE("2D libCEED Interpolators", "[libCEED][Interpolator][Serial][Parallel]") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed because of #516
| end | ||
|
|
||
| abstol = 2.0e-12 | ||
| abstol = 1.0e-11 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't remember which case led me to change this, but I found it necessary
c48c530 to
6c7fe66
Compare
| strumpack_packages = ["ParMETIS", "METIS", "LAPACK", "BLAS", "MPI", "MPI_Fortran"] | ||
| if self.spec.satisfies("+openmp"): | ||
| strumpack_packages.append("OpenMP") | ||
| args.append(self.define("STRUMPACK_REQUIRED_PACKAGES", ";".join(strumpack_packages))) | ||
| scalapack_libs = self.spec["scalapack"].libs | ||
| fortran_libs = "" | ||
| if "gfortran" in self.compiler.fc: | ||
| fortran_libs = "gfortran" | ||
| elif "ifort" in self.compiler.fc or "ifx" in self.compiler.fc: | ||
| fortran_libs = "ifport;ifcore" | ||
| # For other compilers (flang, etc.), don't add extra libs | ||
| strumpack_libs = str(scalapack_libs).replace(" ", ";") | ||
| if fortran_libs: | ||
| strumpack_libs += ";" + fortran_libs |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this entire section might away with #550
hughcars
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One small question about usage of the cache, but generally LGTM.
6c7fe66 to
f724e1b
Compare
f724e1b to
656cce1
Compare
a0ff316 to
9330759
Compare
|
I updated the PR with the following changes:
More importantly, I changed from a
For some builds, CI is now broken down in two steps: build (e.g., To allow this workflow, I extended the runtest script to take command line arguments and overwrite the config files with the passed arguments. Environment variables are still accepted (but get overwritten if the corresponding command line argument is passed). I updated the documentation for this. I also reorganized the workflows so that they are in repo-local actions. One of the undersired side effects of this is that the GitHub action log is not as neatly broken down in steps that show the individual time (which is very useful to optimize the runtime). I checked if there was a way around this, and could not find one. In the future, I will add back timing information. I also tried adding macOS in the matrix, but found lots of problems:
|
This commit runs the spack workflow on larger/GPU-accelerated runners. I added a full test matrix, but I could not get everything to work. In particular, the static builds have several linking errors, which indicates some issue in the build system. I tried quick fixes, but I could not get the builds to work, so I am leaving this for future work.
At the end, we pay the cost of compiling them only once (with the buildcache)
9330759 to
65dcf6f
Compare
hughcars
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One small question about how to disable failing tests in future if using matrices, i.e. one entry in a 3x3 matrix for example. But generally this looks good to me.
| # Intel | ||
| echo "deb [signed-by=/usr/share/keyrings/oneapi-archive-keyring.gpg] https://apt.repos.intel.com/oneapi all main" | sudo tee /etc/apt/sources.list.d/oneAPI.list | ||
| wget -O- https://apt.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB \ | ||
| | gpg --dearmor | sudo tee /usr/share/keyrings/oneapi-archive-keyring.gpg > /dev/null |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can these be placed within the Setup actions straight after? Or is there value in having this broken out? I imagine they're fairly quick so the granularity is probably helpful even if it adds them on runs without these enabled.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added this separately to have only one apt update. apt update pull from the internet and takes O(10) seconds. I moved it out for etiquette (I don't want to make too many server calls if I can avoid that) and shave off that tiny amount of time. This is a micro-optimization and I can move this to the steps below if you think that would be better.
| - name: Install Spack | ||
| uses: actions/checkout@v6 | ||
| with: | ||
| repository: spack/spack | ||
| path: spack | ||
| ref: ${{ inputs.spack-version }} | ||
|
|
||
| - name: Setup Spack environment | ||
| shell: bash | ||
| run: | | ||
| echo "$(realpath spack/bin)" >> "$GITHUB_PATH" | ||
| echo "SPACK_DISABLE_LOCAL_CONFIG=1" >> "$GITHUB_ENV" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this done for the cmake builds too? Again, probably not that big a deal as I bet it's very fast.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently, this file is not used for the cmake builds. We might need a couple of small tweaks for that.
| test-x64-gcc: | ||
| needs: [filter, build-x64-gcc] | ||
| if: success() | ||
| strategy: | ||
| fail-fast: false | ||
| matrix: | ||
| solver: [MUMPS, SuperLU, STRUMPACK, Default] | ||
| runs-on: [self-hosted, x64, 4xlarge] | ||
| steps: | ||
| - uses: actions/checkout@v6 | ||
| - uses: ./.github/actions/run-regression-tests | ||
| with: | ||
| toolchain: gcc | ||
| solver: ${{ matrix.solver }} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With this matrix approach, if we have a failing test for one entry in the tensor product that we want to disable, how would we do that? I.e. only the STRUMPACK run on 4xlarge.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The action defines three separate build-xxxxxx jobs and three corresponding test-xxxxxx jobs. The test jobs define a matrix (all of which running on 4xlarge). If you want to disable a specific combination, you can simply comment that out. For example, if you want to disable STRUMPACK for openmp, we would just have
test-x64-gcc-openmp:
needs: [filter, build-x64-gcc-openmp]
if: success()
strategy:
fail-fast: false
matrix:
include:
- solver: MUMPS
- solver: SuperLU
# - solver: STRUMPACK
- solver: Default
runs-on: [self-hosted, x64, 4xlarge]
steps:
- uses: actions/checkout@v6
- uses: ./.github/actions/run-regression-tests
with:
toolchain: gcc
variant: "+openmp+int64~shared"
solver: ${{ matrix.solver }}| # Find Umpire and Camp (needed when CUDA/HIP is enabled) | ||
| if(PALACE_WITH_CUDA OR PALACE_WITH_HIP) | ||
| find_package(umpire REQUIRED CONFIG) | ||
| find_package(camp REQUIRED CONFIG) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does camp do? I've not heard of it before.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be honest, I don't quite know. It is used by umpire: https://camp.readthedocs.io/en/latest/sphinx/user_guide/using_camp.html#camp-used-in-umpire
| cuda_variant = f"+cuda cuda_arch={arch}" | ||
|
|
||
| # TODO: Remove me after blt > 0.7.1 is released | ||
| depends_on(f"blt@develop") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can this be modeled as a conflict for 0.7.1 and below? I.e. I think that will always pull develop if the conflict covers the most recent version, and then this will not require updating again (I think we're doing this for libceed for instance, depending on a version that doesn't exist yet).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, we can just say depends_on("[email protected]:").
| depends_on("libxsmm+debug", when="build_type=Debug") | ||
| depends_on("libceed+libxsmm", when="@0.14:") | ||
| # NOTE: libxsmm builds on MacOS have linker issues | ||
| # https://github.com/libxsmm/libxsmm/issues/883 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Once the 0.15.0 spack recipe PR is in, I think you should open another one with these changes in addition. This seems like a really good upgrade of the general capability and robustness of the spack release.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We change other stuff in #550, but yes, I agree we should update the spack recipe.
| #### Command Line Arguments | ||
|
|
||
| - `PALACE_TEST`: Path to *Palace* executable and optional arguments (default: "`palace`") | ||
| - `NUM_PROC_TEST`: Number of MPI processes (default: number of physical cores) | ||
| - `OMP_NUM_THREADS`: Number of OpenMP threads (default: 1) | ||
| - `TEST_CASES`: Space-separated list of test cases to run (default: all examples) | ||
| The test runner supports command line arguments for configuration. Each argument can also be set via environment variables as fallbacks. | ||
|
|
||
| **Key Options:** | ||
|
|
||
| - `--palace-test`: Path to *Palace* executable and optional arguments (default: "`palace`") | ||
| - `--num-proc-test`: Number of MPI processes (default: number of physical cores) | ||
| - `--test-cases`: Space-separated list of test cases to run (default: all examples) | ||
|
|
||
| Run `julia --project runtests.jl --help` to see all available options with descriptions and defaults. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A much nicer approach
This commit runs the spack workflow on larger/GPU-accelerated runners.
Updated description: #578 (comment)
OLD description, new description in #578 (comment)
I added a full test matrix, but I could not get everything to work (despite fixing several bugs). Some of this work conflicts with #550, so I will address more of that there. I tried quick fixes, but I could not get the builds to work, so I am leaving this for future work.
To be noted:
libxsmmdoes not supportflang, so I am using clang+gfortran flang is not supported libxsmm/libxsmm#996cylinder/floquet (periodic)test (maybe OpenMP)palace-develop)CPW (lumped ports, adaptive)takes 12 minutes on GPU. I don't know if this is another instance of Improve GPU performance for adaptive sweep wave ports #375There is this annoying bug with oneAPI and spack where the version of
ifxis incorrectly reported, breaking everything. I asked about this on slack, and I was told that I should manually merge the two compiler entries in the package definition (as a workaround). I added a step to take care of this, but I hope we can remove it soon.Good news that (after a week of fixing bugs), lots of variants are working, including OpenMP cases, cases with mixed compilers, GPU+LLVM, newer dependencies (e.g., eigen 5), et cetera.
The progression I see is:
(#576 at any time)