Kernel splitting ihel1/2/3: helicity streams, color sum kernel, color sum BLAS #1049
base: master
Conversation
…plains, and does not build
… and Device version: put it back in CPPProcess for now
gCPPProcess.cu(689): warning #20091-D: a __constant__ variable "mg5amcGpu::cNGoodHel" cannot be directly read in a host function
gCPPProcess.cu(691): warning #20091-D: a __constant__ variable "mg5amcGpu::cGoodHel" cannot be directly read in a host function
/cvmfs/sft.cern.ch/lcg/releases/binutils/2.37-4177a/x86_64-centos7/bin/ld: ../../lib/libmg5amc_gg_ttx_cuda.so: undefined reference to `__device_builtin_variable_blockIdx'
/cvmfs/sft.cern.ch/lcg/releases/binutils/2.37-4177a/x86_64-centos7/bin/ld: ../../lib/libmg5amc_gg_ttx_cuda.so: undefined reference to `__device_builtin_variable_blockDim'
/cvmfs/sft.cern.ch/lcg/releases/binutils/2.37-4177a/x86_64-centos7/bin/ld: ../../lib/libmg5amc_gg_ttx_cuda.so: undefined reference to `__device_builtin_variable_threadIdx'
… to move goodhel to host
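For context on the two warnings above: host code cannot read a __constant__ variable directly, which is why the good-helicity bookkeeping is moved to the host. A minimal sketch of the pattern (not the actual gCPPProcess.cu code; the helper name is hypothetical):

#include <cassert>
#include <cuda_runtime.h>

__constant__ int cNGoodHel; // device-side copy of the number of good helicities

// A host function may not read cNGoodHel directly (warning #20091-D);
// instead, copy the value back to the host with cudaMemcpyFromSymbol.
int nGoodHelOnHost()
{
  int nGoodHel = 0;
  cudaError_t err = cudaMemcpyFromSymbol( &nGoodHel, cNGoodHel, sizeof( int ) );
  assert( err == cudaSuccess );
  return nGoodHel;
}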
Confirm a 30% difference between ihel_hack EvtsPerSec[MECalcOnly] (3a) = ( 9.992406e+07 ) sec^-1 and hack EvtsPerSec[MECalcOnly] (3a) = ( 1.340965e+08 ) sec^-1
./tput/teeThroughputX.sh -eemumu -ggtt -ggttg -ggttgg -ggttggg
…res, some faster, some slower
…tests on rd90 after tuning the scripts
Note that peak performance of ggttgg decreases by approximately 10% in classic tests (but the peak performance remains visible in scaling tests)
No difference otherwise
./tmad/teeMadX.sh -ggttgg -dmf -hip
STARTED AT Sat Sep 20 10:20:48 AM CEST 2025
ENDED AT Sat Sep 20 10:23:19 AM CEST 2025
…- 144 tput and tmad logs
git checkout origin/hack_ihel3 $(git ls-tree --name-only HEAD tput/log* tmad/log*)
…- codegen logs for all processes
git checkout origin/hack_ihel3 $(git ls-tree --name-only HEAD */CODEGEN*txt)
…- generated code except gg_tt.mad (will merge and fix conflicts only in CODEGEN, gg_tt.mad and tput scripts)
git checkout origin/hack_ihel3 $(git ls-tree -r --name-only HEAD *.sa *.mad | grep -v ^gg_tt.mad | \egrep '(MatrixElementKernels.cc|CPPProcess.cc)')
Fix conflicts:
1) merge HIP fixes for gpuStream (hack_ihel2_sep25) and new BLAS features (hack_ihel3)
- epochX/cudacpp/CODEGEN/PLUGIN/CUDACPP_SA_OUTPUT/madgraph/iolibs/template_files/gpu/process_function_definitions.inc
- epochX/cudacpp/gg_tt.mad/SubProcesses/P1_gg_ttx/CPPProcess.cc
2) merge helicity stream features (changed in hack_ihel3) and disabling of fpeEnable (master)
- epochX/cudacpp/CODEGEN/PLUGIN/CUDACPP_SA_OUTPUT/madgraph/iolibs/template_files/gpu/MatrixElementKernels.cc
- epochX/cudacpp/gg_tt.mad/SubProcesses/MatrixElementKernels.cc
(Note: I would personally prefer to keep FPEs enabled, or at least add an env variable to enable FPEs)
3) merge new scaling (hack_ihel2_sep25) and BLAS (hack_ihel3) tput tests, add 6 more BLAS scaling tests
- epochX/cudacpp/tput/allTees.sh
- epochX/cudacpp/tput/teeThroughputX.sh
- epochX/cudacpp/tput/throughputX.sh
…t cublas/hipblas headers with #ifndef MGONGPU_HAS_NO_BLAS
… #ifndef MGONGPU_HAS_NO_BLAS
… gpuBlasHandle_t as void in noBLAS builds
…llptr to keep the same API in noBLAS builds
…dColorMatrix() for noBLAS HIP builds
…or hipblas.h in include/hipblas
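The commits above describe the noBLAS build guards. A minimal sketch of that guarding pattern, assuming the MGONGPU_HAS_NO_BLAS macro and gpuBlasHandle_t typedef named in the commit messages (the surrounding header logic is illustrative, not the generated code):

#ifndef MGONGPU_HAS_NO_BLAS
#ifdef __HIPCC__
#include <hipblas/hipblas.h> // note: some ROCm installations ship hipblas.h in include/hipblas
typedef hipblasHandle_t gpuBlasHandle_t;
#else
#include <cublas_v2.h>
typedef cublasHandle_t gpuBlasHandle_t;
#endif
#else
// noBLAS builds: keep the same API surface, with the handle degenerating to void
// so that callers can simply pass a nullptr handle pointer.
typedef void gpuBlasHandle_t;
#endif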
…scaling) tput tests on LUMI - all ok
(This commit adds the 6 new blas/scaling logs)
With respect to the last LUMI logs for the 'hack_ihel2_sep25' codebase (commit 7b12a5c):
- hip and c++ results are ~unchanged in ggtt/ggttgg (within 10% fluctuations) for blas builds with blas disabled at runtime
(for hip, results are unchanged both at small grids and for peak performance at large grids)
(many results seem lower, but as mentioned there are large fluctuations on LUMI)
Comparing results with blas disabled (blasOff) and enabled (blasOn) at runtime on LUMI:
- hip scaling and peak performance are significantly worse for blasOn with respect to blasOff
(for ggtt, blas is a factor ~100 worse both at small grids and at large grids)
(for ggttgg, blas is a factor ~100 worse at small grids and a factor ~5 worse for peak performance at large grids)
…/scaling) tput tests on itscrd90 - all ok
(This commit adds the 6 new blas/scaling logs)
With respect to the last itscrd90 logs for the 'hack_ihel2_sep25' codebase (commit 7b12a5c):
- cuda and c++ results are unchanged across ggtt/ggttggg (within 1%) for blas builds with blas disabled at runtime
(for cuda, results are unchanged both at small grids and for peak performance at large grids)
Comparing results with blas disabled (blasOff) and enabled (blasOn) at runtime on a V100:
- cuda scaling and peak performance are significantly worse for blasOn with respect to blasOff
(for ggtt, blas is a factor 6-8 worse both at small grids and at large grids)
(for ggttgg, blas is a factor 6-8 worse at small grids and around 10% worse for peak performance at large grids)
STARTED AT Sat Sep 20 11:46:40 PM CEST 2025
./tput/teeThroughputX.sh -dmf -hrd -makej -eemumu -ggtt -ggttg -ggttgg -gqttq -ggttggg -makeclean
ENDED(1) AT Sun Sep 21 02:16:49 AM CEST 2025 [Status=0]
./tput/teeThroughputX.sh -dmf -makej -eemumu -ggtt -ggttg -ggttgg -gqttq -ggttggg -scaling
ENDED(1-scaling) AT Sun Sep 21 02:27:47 AM CEST 2025 [Status=0]
./tput/teeThroughputX.sh -ggtt -ggttgg -dmf -blasOn
ENDED(2) AT Sun Sep 21 02:32:04 AM CEST 2025 [Status=0]
./tput/teeThroughputX.sh -ggtt -ggttgg -dmf -blasOn -scaling
ENDED(2-scaling) AT Sun Sep 21 02:36:29 AM CEST 2025 [Status=0]
./tput/teeThroughputX.sh -d_f -hrd -makej -eemumu -ggtt -ggttgg -inlonly -makeclean
ENDED(3) AT Sun Sep 21 02:55:26 AM CEST 2025 [Status=0]
./tput/teeThroughputX.sh -makej -eemumu -ggtt -ggttg -gqttq -ggttgg -ggttggg -d_f -bridge -makeclean
ENDED(4) AT Sun Sep 21 03:06:23 AM CEST 2025 [Status=0]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -d_f -rmbhst
ENDED(5) AT Sun Sep 21 03:09:46 AM CEST 2025 [Status=0]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -d_f -curhst
ENDED(6) AT Sun Sep 21 03:13:11 AM CEST 2025 [Status=0]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -d_f -common
ENDED(7) AT Sun Sep 21 03:16:41 AM CEST 2025 [Status=0]
./tput/teeThroughputX.sh -ggtt -ggttgg -dmf -noBlas -makeclean
ENDED(8) AT Sun Sep 21 03:29:44 AM CEST 2025 [Status=0]
./tput/teeThroughputX.sh -dmf -hrd -makej -susyggtt -susyggt1t1 -smeftggtttt -heftggbb -makeclean
ENDED(9) AT Sun Sep 21 03:52:43 AM CEST 2025 [Status=0]
No errors found in logs
No FPEs or '{ }' found in logs
No aborts found in logs
./tput/logs_ggttggg_mad/log_ggttggg_mad_d_inl0_hrd0.scaling:check_cuda.exe: Assertion `code == gpuSuccess' failed.
./tput/logs_ggttggg_mad/log_ggttggg_mad_f_inl0_hrd0.scaling:check_cuda.exe: Assertion `code == gpuSuccess' failed.
./tput/logs_ggttggg_mad/log_ggttggg_mad_m_inl0_hrd0.scaling:check_cuda.exe: Assertion `code == gpuSuccess' failed.
With respect to the last itscrd90 logs for the 'hack_ihel2_sep25' codebase (commit 7b12a5c):
- cuda and c++ results are unchanged across ggtt/ggttggg (within 1%) for blas builds with blas disabled at runtime
STARTED AT Sun Sep 21 03:52:43 AM CEST 2025
(SM tests) ENDED(1) AT Sun Sep 21 04:44:34 AM CEST 2025 [Status=0]
(BSM tests) ENDED(1) AT Sun Sep 21 04:48:33 AM CEST 2025 [Status=0]
12 /data/avalassi/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_eemumu_mad/log_eemumu_mad_d_inl0_hrd0.txt
12 /data/avalassi/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_eemumu_mad/log_eemumu_mad_f_inl0_hrd0.txt
12 /data/avalassi/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_eemumu_mad/log_eemumu_mad_m_inl0_hrd0.txt
12 /data/avalassi/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_ggttggg_mad/log_ggttggg_mad_d_inl0_hrd0.txt
12 /data/avalassi/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_ggttggg_mad/log_ggttggg_mad_f_inl0_hrd0.txt
12 /data/avalassi/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_ggttggg_mad/log_ggttggg_mad_m_inl0_hrd0.txt
12 /data/avalassi/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_ggttgg_mad/log_ggttgg_mad_d_inl0_hrd0.txt
12 /data/avalassi/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_ggttgg_mad/log_ggttgg_mad_f_inl0_hrd0.txt
12 /data/avalassi/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_ggttgg_mad/log_ggttgg_mad_m_inl0_hrd0.txt
12 /data/avalassi/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_ggttg_mad/log_ggttg_mad_d_inl0_hrd0.txt
12 /data/avalassi/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_ggttg_mad/log_ggttg_mad_f_inl0_hrd0.txt
12 /data/avalassi/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_ggttg_mad/log_ggttg_mad_m_inl0_hrd0.txt
12 /data/avalassi/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_ggtt_mad/log_ggtt_mad_d_inl0_hrd0.txt
12 /data/avalassi/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_ggtt_mad/log_ggtt_mad_f_inl0_hrd0.txt
12 /data/avalassi/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_ggtt_mad/log_ggtt_mad_m_inl0_hrd0.txt
12 /data/avalassi/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_gqttq_mad/log_gqttq_mad_d_inl0_hrd0.txt
12 /data/avalassi/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_gqttq_mad/log_gqttq_mad_f_inl0_hrd0.txt
12 /data/avalassi/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_gqttq_mad/log_gqttq_mad_m_inl0_hrd0.txt
12 /data/avalassi/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_heftggbb_mad/log_heftggbb_mad_d_inl0_hrd0.txt
1 /data/avalassi/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_heftggbb_mad/log_heftggbb_mad_f_inl0_hrd0.txt
12 /data/avalassi/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_heftggbb_mad/log_heftggbb_mad_m_inl0_hrd0.txt
12 /data/avalassi/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_smeftggtttt_mad/log_smeftggtttt_mad_d_inl0_hrd0.txt
12 /data/avalassi/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_smeftggtttt_mad/log_smeftggtttt_mad_f_inl0_hrd0.txt
12 /data/avalassi/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_smeftggtttt_mad/log_smeftggtttt_mad_m_inl0_hrd0.txt
12 /data/avalassi/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_susyggt1t1_mad/log_susyggt1t1_mad_d_inl0_hrd0.txt
12 /data/avalassi/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_susyggt1t1_mad/log_susyggt1t1_mad_f_inl0_hrd0.txt
12 /data/avalassi/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_susyggt1t1_mad/log_susyggt1t1_mad_m_inl0_hrd0.txt
12 /data/avalassi/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_susyggtt_mad/log_susyggtt_mad_d_inl0_hrd0.txt
12 /data/avalassi/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_susyggtt_mad/log_susyggtt_mad_f_inl0_hrd0.txt
12 /data/avalassi/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_susyggtt_mad/log_susyggtt_mad_m_inl0_hrd0.txt
tmad/logs_heftggbb_mad/log_heftggbb_mad_f_inl0_hrd0.txt:ERROR! events.lhe.cpp.1 and events.lhe.ref.1 differ!
No asserts found in logs
No segmentation fault found in logs
…_global__ INLINE
CPPProcess.cc(235): warning #20050-D: inline qualifier ignored for "__global__" function
Remark: The warnings can be suppressed with "-diag-suppress <warning-number>"
…cesses, remove INLINE from __global__ to fix build warnings
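For reference, a minimal sketch of the warning fixed by the commit above (the INLINE macro definition and the kernel name are assumptions for illustration, not the generated code):

#define INLINE inline __attribute__( ( always_inline ) )

// Before: '__global__ INLINE void kernel(...)' triggers warning #20050-D,
// because the inline qualifier is meaningless for a kernel entry point launched from the host.
// After: dropping INLINE from __global__ functions silences the warning with no behaviour change.
__global__ void computeColorSum( /* kernel arguments omitted; hypothetical kernel name */ ) {}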
Thanks a lot Andrea,
The conflict was not relevant (a comment), so I just fixed it (I guess, at least). So I will start the review!
Thanks a lot,
Olivier
Hi Andrea,
One point in this MR is worrying me, but to be honest it sounds disconnected from your changes. Let me check that specific point (i.e. the format of the color matrix) before approving.
There are also two points of detail that it should be fine not to include in your PR.
Thanks,
Olivier
{
  for( int icol = 0; icol < ncolor; icol++ )
    for( int jcol = 0; jcol < ncolor; jcol++ )
      value[icol * ncolor + jcol] = colorMatrix[icol][jcol] / colorDenom[icol];
Point to check:
The main version of MG5 moves from a full matrix to an upper-triangular representation
(and to a single denominator for the full matrix).
So there is a risk that the color_matrix_lines templating does not return the correct format for this information.
From a "hardware optimization" point of view, would it not be better to keep the integer matrix colorMatrix (either upper-triangular or the full matrix) rather than the float matrix "value" (the format here is irrelevant since it is flattened)? Especially if colorDenom is a single integer that we can normalise the output of the computation with?
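To make the two layouts concrete, a minimal sketch (hypothetical helper names; the 2x2 numbers are the familiar gg_ttx color matrix, used only as an example):

constexpr int ncolor = 2; // e.g. 2 color flows in gg_ttx
static constexpr int colorMatrix[ncolor][ncolor] = { { 16, -2 }, { -2, 16 } }; // integer numerators
static constexpr int colorDenom[ncolor] = { 3, 3 }; // per-row denominators (current format)

// (a) The snippet above: bake the division into a flattened floating-point matrix.
void fillFlattenedValue( double* value )
{
  for( int icol = 0; icol < ncolor; icol++ )
    for( int jcol = 0; jcol < ncolor; jcol++ )
      value[icol * ncolor + jcol] = colorMatrix[icol][jcol] / (double)colorDenom[icol];
}

// (b) The alternative sketched in the comment above: keep the integers (the matrix is symmetric,
// so the upper triangle suffices) and normalise once with a single common denominator after the
// jamp2 contraction, instead of baking the division into every matrix element.
constexpr int nupper = ncolor * ( ncolor + 1 ) / 2;
static constexpr int colorMatrixUpper[nupper] = { 16, -2, 16 }; // row-major upper triangle
static constexpr int colorDenomSingle = 3; // single denominator for the whole matrix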
{
  // nprocesses == 2 may happen for "mirror processes" such as P0_uux_ttx within pp_tt012j (see PR #754)
  constexpr int nprocesses = %(nproc)i;
  constexpr int nprocesses = 1;
constexpr int nprocesses = 1;
constexpr int nprocesses = %(nproc)i;
Is there a reason to revert this change? If yes, why keep the assert?
constexpr int nprocesses = 1;
static_assert( nprocesses == 1 || nprocesses == 2, "Assume nprocesses == 1 or 2" );
constexpr int process_id = %(proc_id)i; // code generation source: %(proc_id_source)s
constexpr int process_id = 1; // code generation source: standalone_cudacpp
constexpr int process_id = 1; // code generation source: standalone_cudacpp
constexpr int process_id = %(proc_id)i; // code generation source: %(proc_id_source)s
A similar comment applies here, although it sounds less impactful (given the assert); still, it would be better to keep the dynamic version rather than blindly assuming that the value is 1.
template = open(pjoin(self.template_path,'gpu','color_sum.cc'),'r').read()
replace_dict = {}
# Extract color matrix again (this was also in get_matrix_single_process called within get_all_sigmaKin_lines)
replace_dict['color_matrix_lines'] = self.get_color_matrix_lines(self.matrix_elements[0])
This comment is to pinpoint the previous point of attention:
this is where we need to check whether the format is still correct.
(If it is, we likely want to create an issue to update it; if not, then there is no choice but to fix it.)
Hi @oliviermattelaer, as discussed recently and as per my presentation at the MG5AMC meeting last Friday.
This is the PR for my kernel splitting changes that I recommend merging:
I have prepared a paper, which will shortly be on arXiv, with all the details.
Until yesterday it would have been possible to merge this automatically, as I had merged the latest upstream into my developments. Yesterday some new changes were merged (for tREX, I think), so this will need some massaging. I can do that later on, or let me know how you want to proceed.
Thanks, Andrea