Conversation

@valassi commented Oct 6, 2025

Hi @oliviermattelaer, as discussed recently and as per my presentation at the MG5AMC meeting last Friday.

This is the PR for my kernel splitting changes that I recommend merging:

  • ihel1: helicity streams
  • ihel2: color sum kernels
  • ihel3: color sum in BLAS (see the sketch right after this list)
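
For readers new to the third item, here is a minimal sketch of the idea behind "color sum in BLAS": for each helicity, the matrix element is ME = Re( jamp† · CF · jamp ), and over a batch of events this reduces to real matrix products that a BLAS library (cuBLAS/hipBLAS) can execute. The function and layouts below are hypothetical illustrations, not the actual CPPProcess/color_sum.cc implementation.

// Illustrative only: batched color sum written as matrix products (a naive loop stands in
// for what would be GEMM calls on the GPU; names and layouts are hypothetical).
// jampRe/jampIm: [ncolor x nevt] real/imaginary parts of the color amplitudes
// cf:            [ncolor x ncolor] real color matrix (denominators already folded in)
// me:            [nevt] output matrix elements
void colorSumBatch( int ncolor, int nevt,
                    const double* jampRe, const double* jampIm,
                    const double* cf, double* me )
{
  for( int ievt = 0; ievt < nevt; ievt++ )
  {
    double sum = 0;
    for( int icol = 0; icol < ncolor; icol++ )
    {
      double tRe = 0, tIm = 0;
      for( int jcol = 0; jcol < ncolor; jcol++ )
      {
        tRe += cf[icol * ncolor + jcol] * jampRe[jcol * nevt + ievt];
        tIm += cf[icol * ncolor + jcol] * jampIm[jcol * nevt + ievt];
      }
      sum += jampRe[icol * nevt + ievt] * tRe + jampIm[icol * nevt + ievt] * tIm; // Re( jamp_i^* (CF*jamp)_i )
    }
    me[ievt] = sum;
  }
}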

I have prepared a paper, which will shortly be on arXiv, with all the details.

Until yesterday, it would have been possible to merge this automatically, as I had merged the latest upstream into my developments. Yesterday some new changes were merged (for tREX, I think), so this will need some massaging. I can do that later on, or let me know how you want to proceed.

Thanks, Andrea

oliviermattelaer and others added 30 commits September 26, 2022 18:35
… and Device version: put it back in CPPProcess for now
gCPPProcess.cu(689): warning #20091-D: a __constant__ variable "mg5amcGpu::cNGoodHel" cannot be directly read in a host function
gCPPProcess.cu(691): warning #20091-D: a __constant__ variable "mg5amcGpu::cGoodHel" cannot be directly read in a host function
/cvmfs/sft.cern.ch/lcg/releases/binutils/2.37-4177a/x86_64-centos7/bin/ld: ../../lib/libmg5amc_gg_ttx_cuda.so: undefined reference to `__device_builtin_variable_blockIdx'
/cvmfs/sft.cern.ch/lcg/releases/binutils/2.37-4177a/x86_64-centos7/bin/ld: ../../lib/libmg5amc_gg_ttx_cuda.so: undefined reference to `__device_builtin_variable_blockDim'
/cvmfs/sft.cern.ch/lcg/releases/binutils/2.37-4177a/x86_64-centos7/bin/ld: ../../lib/libmg5amc_gg_ttx_cuda.so: undefined reference to `__device_builtin_variable_threadIdx'
Confirm a 30% difference between ihel_hack
EvtsPerSec[MECalcOnly] (3a) = ( 9.992406e+07                 )  sec^-1
and hack
EvtsPerSec[MECalcOnly] (3a) = ( 1.340965e+08                 )  sec^-1
./tput/teeThroughputX.sh -eemumu -ggtt -ggttg -ggttgg -ggttggg
…tests on rd90 after tuning the scripts

Note that peak performance of ggttgg decreases by approximately 10% in classic tests
(but the peak performance remains visible in scaling tests)
No difference otherwise

./tmad/teeMadX.sh -ggttgg -dmf -hip

STARTED AT Sat Sep 20 10:20:48 AM CEST 2025
ENDED   AT Sat Sep 20 10:23:19 AM CEST 2025
…- 144 tput and tmad logs

git checkout origin/hack_ihel3 $(git ls-tree --name-only HEAD tput/log* tmad/log*)
…- codegen logs for all processes

git checkout origin/hack_ihel3 $(git ls-tree --name-only HEAD */CODEGEN*txt)
…- generated code except gg_tt.mad

(Will merge and fix conflicts only in CODEGEN, gg_tt.mad and tput scripts)

git checkout origin/hack_ihel3 $(git ls-tree -r --name-only HEAD *.sa *.mad \
  | grep -v ^gg_tt.mad | \egrep '(MatrixElementKernels.cc|CPPProcess.cc)')
Fix conflicts:

1) merge HIP fixes for gpuStream (hack_ihel2_sep25) and new BLAS features (hack_ihel3)
- epochX/cudacpp/CODEGEN/PLUGIN/CUDACPP_SA_OUTPUT/madgraph/iolibs/template_files/gpu/process_function_definitions.inc
- epochX/cudacpp/gg_tt.mad/SubProcesses/P1_gg_ttx/CPPProcess.cc

2) merge helicity stream features (changed in hack_ihel3) and disabling of fpeEnable (master)
- epochX/cudacpp/CODEGEN/PLUGIN/CUDACPP_SA_OUTPUT/madgraph/iolibs/template_files/gpu/MatrixElementKernels.cc
- epochX/cudacpp/gg_tt.mad/SubProcesses/MatrixElementKernels.cc
(Note: I would personally prefer to keep FPEs enabled or at least add an env variable to enable FPEs)

3) merge new scaling (hack_ihel2_sep25) and BLAS (hack_ihel3) tput tests, add 6 more BLAS scaling tests
- epochX/cudacpp/tput/allTees.sh
- epochX/cudacpp/tput/teeThroughputX.sh
- epochX/cudacpp/tput/throughputX.sh
…t cublas/hipblas headers with #ifndef MGONGPU_HAS_NO_BLAS
…scaling) tput tests on LUMI - all ok

(This commit adds the 6 new blas/scaling logs)

With respect to the last LUMI logs for the 'hack_ihel2_sep25' codebase (commit 7b12a5c):
- hip and c++ results are unchanged in ggtt/ggttgg (within 10% fluctuations) for blas builds with blas disabled at runtime
  (for hip, results are unchanged both at small grids and for peak performance at large grids)

Comparing results with blas disabled (blasOff) and enabled (blasOn) at runtime on LUMI:
- hip scaling and peak performance are significantly worse for blasOn with respect to blasOff
  (for ggtt, blas is a factor ~100 worse both at small grids and large grids)
  (for ggttgg, blas is a factor ~100 worse at small grids and a factor ~5 worse for peak performance at large grids)
With respect to the last LUMI logs for the 'hack_ihel2_sep25' codebase (commit 7b12a5c):
- hip and c++ results are ~unchanged in ggtt/ggttgg (within 10% fluctuations) for blas builds with blas disabled at runtime
  (many results seem lower, but as mentioned there are large fluctuations on LUMI)
…ep25/itscrd90 logs

Revert "[hack_ihel3_sep25] rerun 30 tmad tests on LUMI - all ok"
This reverts commit ac04c54.

Revert "[hack_ihel3_sep25] rerun 132 (96 + 12 blas + 18 scaling + 6 new blas/scaling) tput tests on LUMI - all ok"
This reverts commit b56251e.
…/scaling) tput tests on itscrd90 - all ok

(This commit adds the 6 new blas/scaling logs)

With respect to the last itscrd90 logs for the 'hack_ihel2_sep25' codebase (commit 7b12a5c):
- cuda and c++ results are unchanged across ggtt/ggttggg (within 1%) for blas builds with blas disabled at runtime
  (for cuda, results are unchanged both at small grids and for peak performance at large grids)

Comparing results with blas disabled (blasOff) and enabled (blasOn) at runtime on a V100:
- cuda scaling and peak performance are significantly worse for blasOn with respect to blasOff
  (for ggtt, blas is a factor 6-8 worse both at small grids and large grids)
  (for ggttgg, blas is a factor 6-8 worse at small grids and around 10% worse for peak performance at large grids)

STARTED  AT Sat Sep 20 11:46:40 PM CEST 2025
./tput/teeThroughputX.sh -dmf -hrd -makej -eemumu -ggtt -ggttg -ggttgg -gqttq -ggttggg -makeclean
ENDED(1) AT Sun Sep 21 02:16:49 AM CEST 2025 [Status=0]
./tput/teeThroughputX.sh -dmf -makej -eemumu -ggtt -ggttg -ggttgg -gqttq -ggttggg -scaling
ENDED(1-scaling) AT Sun Sep 21 02:27:47 AM CEST 2025 [Status=0]
./tput/teeThroughputX.sh -ggtt -ggttgg -dmf -blasOn
ENDED(2) AT Sun Sep 21 02:32:04 AM CEST 2025 [Status=0]
./tput/teeThroughputX.sh -ggtt -ggttgg -dmf -blasOn -scaling
ENDED(2-scaling) AT Sun Sep 21 02:36:29 AM CEST 2025 [Status=0]
./tput/teeThroughputX.sh -d_f -hrd -makej -eemumu -ggtt -ggttgg -inlonly -makeclean
ENDED(3) AT Sun Sep 21 02:55:26 AM CEST 2025 [Status=0]
./tput/teeThroughputX.sh -makej -eemumu -ggtt -ggttg -gqttq -ggttgg -ggttggg -d_f -bridge -makeclean
ENDED(4) AT Sun Sep 21 03:06:23 AM CEST 2025 [Status=0]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -d_f -rmbhst
ENDED(5) AT Sun Sep 21 03:09:46 AM CEST 2025 [Status=0]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -d_f -curhst
ENDED(6) AT Sun Sep 21 03:13:11 AM CEST 2025 [Status=0]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -d_f -common
ENDED(7) AT Sun Sep 21 03:16:41 AM CEST 2025 [Status=0]
./tput/teeThroughputX.sh -ggtt -ggttgg -dmf -noBlas -makeclean
ENDED(8) AT Sun Sep 21 03:29:44 AM CEST 2025 [Status=0]
./tput/teeThroughputX.sh -dmf -hrd -makej -susyggtt -susyggt1t1 -smeftggtttt -heftggbb -makeclean
ENDED(9) AT Sun Sep 21 03:52:43 AM CEST 2025 [Status=0]

No errors found in logs

No FPEs or '{ }' found in logs

No aborts found in logs

./tput/logs_ggttggg_mad/log_ggttggg_mad_d_inl0_hrd0.scaling:check_cuda.exe: Assertion `code == gpuSuccess' failed.
./tput/logs_ggttggg_mad/log_ggttggg_mad_f_inl0_hrd0.scaling:check_cuda.exe: Assertion `code == gpuSuccess' failed.
./tput/logs_ggttggg_mad/log_ggttggg_mad_m_inl0_hrd0.scaling:check_cuda.exe: Assertion `code == gpuSuccess' failed.
With respect to the last itscrd90 logs for the 'hack_ihel2_sep25' codebase (commit 7b12a5c):
- cuda and c++ results are unchanged across ggtt/ggttggg (within 1%) for blas builds with blas disabled at runtime

STARTED  AT Sun Sep 21 03:52:43 AM CEST 2025
(SM tests)
ENDED(1) AT Sun Sep 21 04:44:34 AM CEST 2025 [Status=0]
(BSM tests)
ENDED(1) AT Sun Sep 21 04:48:33 AM CEST 2025 [Status=0]

12 /data/avalassi/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_eemumu_mad/log_eemumu_mad_d_inl0_hrd0.txt
12 /data/avalassi/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_eemumu_mad/log_eemumu_mad_f_inl0_hrd0.txt
12 /data/avalassi/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_eemumu_mad/log_eemumu_mad_m_inl0_hrd0.txt
12 /data/avalassi/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_ggttggg_mad/log_ggttggg_mad_d_inl0_hrd0.txt
12 /data/avalassi/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_ggttggg_mad/log_ggttggg_mad_f_inl0_hrd0.txt
12 /data/avalassi/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_ggttggg_mad/log_ggttggg_mad_m_inl0_hrd0.txt
12 /data/avalassi/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_ggttgg_mad/log_ggttgg_mad_d_inl0_hrd0.txt
12 /data/avalassi/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_ggttgg_mad/log_ggttgg_mad_f_inl0_hrd0.txt
12 /data/avalassi/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_ggttgg_mad/log_ggttgg_mad_m_inl0_hrd0.txt
12 /data/avalassi/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_ggttg_mad/log_ggttg_mad_d_inl0_hrd0.txt
12 /data/avalassi/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_ggttg_mad/log_ggttg_mad_f_inl0_hrd0.txt
12 /data/avalassi/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_ggttg_mad/log_ggttg_mad_m_inl0_hrd0.txt
12 /data/avalassi/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_ggtt_mad/log_ggtt_mad_d_inl0_hrd0.txt
12 /data/avalassi/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_ggtt_mad/log_ggtt_mad_f_inl0_hrd0.txt
12 /data/avalassi/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_ggtt_mad/log_ggtt_mad_m_inl0_hrd0.txt
12 /data/avalassi/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_gqttq_mad/log_gqttq_mad_d_inl0_hrd0.txt
12 /data/avalassi/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_gqttq_mad/log_gqttq_mad_f_inl0_hrd0.txt
12 /data/avalassi/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_gqttq_mad/log_gqttq_mad_m_inl0_hrd0.txt
12 /data/avalassi/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_heftggbb_mad/log_heftggbb_mad_d_inl0_hrd0.txt
1 /data/avalassi/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_heftggbb_mad/log_heftggbb_mad_f_inl0_hrd0.txt
12 /data/avalassi/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_heftggbb_mad/log_heftggbb_mad_m_inl0_hrd0.txt
12 /data/avalassi/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_smeftggtttt_mad/log_smeftggtttt_mad_d_inl0_hrd0.txt
12 /data/avalassi/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_smeftggtttt_mad/log_smeftggtttt_mad_f_inl0_hrd0.txt
12 /data/avalassi/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_smeftggtttt_mad/log_smeftggtttt_mad_m_inl0_hrd0.txt
12 /data/avalassi/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_susyggt1t1_mad/log_susyggt1t1_mad_d_inl0_hrd0.txt
12 /data/avalassi/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_susyggt1t1_mad/log_susyggt1t1_mad_f_inl0_hrd0.txt
12 /data/avalassi/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_susyggt1t1_mad/log_susyggt1t1_mad_m_inl0_hrd0.txt
12 /data/avalassi/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_susyggtt_mad/log_susyggtt_mad_d_inl0_hrd0.txt
12 /data/avalassi/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_susyggtt_mad/log_susyggtt_mad_f_inl0_hrd0.txt
12 /data/avalassi/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_susyggtt_mad/log_susyggtt_mad_m_inl0_hrd0.txt

tmad/logs_heftggbb_mad/log_heftggbb_mad_f_inl0_hrd0.txt:ERROR! events.lhe.cpp.1 and events.lhe.ref.1 differ!

No asserts found in logs

No segmentation fault found in logs
…_global__ INLINE

CPPProcess.cc(235): warning #20050-D: inline qualifier ignored for "__global__" function
Remark: The warnings can be suppressed with "-diag-suppress <warning-number>"
…cesses, remove INLINE from __global__ to fix build warnings
@oliviermattelaer

Thanks a lot, Andrea,

The conflict was not relevant (a comment), so I just fixed it (I guess, at least).

So I will start the review!

Thanks a lot,

Olivier

@oliviermattelaer left a comment

Hi Andrea,

One point is worrying me in this MR, but to be honest it sounds disconnected from your changes. Let me check that specific point (i.e. the format of the color matrix) before approving.

There are also two points of detail that should be fine to leave out of your PR.

Thanks,

Olivier

{
  for( int icol = 0; icol < ncolor; icol++ )
    for( int jcol = 0; jcol < ncolor; jcol++ )
      value[icol * ncolor + jcol] = colorMatrix[icol][jcol] / colorDenom[icol];

Point to check:
The main version of MG5 moves from a full matrix to an upper-triangular representation
(and to a single denominator for the full matrix).

So here there is a risk that the color_matrix_lines templating is not returning the correct format for this information.

From a "hardware optimization" point of view, would it not be better to keep the integer matrix colorMatrix (either upper-triangular or the full matrix) rather than the float matrix "value" (the format here is irrelevant since it is flattened)? Especially if colorDenom is a single integer with which we can normalise the output of the computation?
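
A minimal sketch of what the suggestion above could look like, assuming a symmetric integer color matrix stored as its upper triangle plus one common integer denominator. All names here (ColorMatrixUpper, cfUpper, denom) are illustrative, not the actual MG5AMC/cudacpp data structures.

#include <utility>
#include <vector>

struct ColorMatrixUpper
{
  int ncolor;               // number of color flows
  std::vector<int> cfUpper; // upper triangle, row-major: ncolor*(ncolor+1)/2 integer entries
  int denom;                // single common denominator for the whole matrix

  // accessor exploiting the symmetry cf[i][j] == cf[j][i]
  int operator()( int icol, int jcol ) const
  {
    if( icol > jcol ) std::swap( icol, jcol );
    return cfUpper[icol * ncolor - icol * ( icol - 1 ) / 2 + ( jcol - icol )];
  }
};

// The color sum could then accumulate with integer matrix entries and divide by denom only once,
// e.g. me += jampRe[icol] * cf( icol, jcol ) * jampRe[jcol]; ... me /= cf.denom;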

{
// nprocesses == 2 may happen for "mirror processes" such as P0_uux_ttx within pp_tt012j (see PR #754)
constexpr int nprocesses = %(nproc)i;
constexpr int nprocesses = 1;

Suggested change
constexpr int nprocesses = 1;
constexpr int nprocesses = %(nproc)i;

Is there a reason to revert this change? If yes, why keep the assert?

constexpr int nprocesses = 1;
static_assert( nprocesses == 1 || nprocesses == 2, "Assume nprocesses == 1 or 2" );
constexpr int process_id = %(proc_id)i; // code generation source: %(proc_id_source)s
constexpr int process_id = 1; // code generation source: standalone_cudacpp

Suggested change
constexpr int process_id = 1; // code generation source: standalone_cudacpp
constexpr int process_id = %(proc_id)i; // code generation source: %(proc_id_source)s

A similar comment, but here it sounds less impactful (given the assert); still, it is better to have the dynamical version than to blindly assume what the value is.
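
For context, a purely illustrative expansion of what the templated lines would produce for a mirror process; the values below are made up, the real ones come from the %(nproc)i / %(proc_id)i / %(proc_id_source)s substitutions at code-generation time.

// Illustrative expansion only (values are hypothetical, not taken from an actual generated process)
constexpr int nprocesses = 2; // e.g. a mirror process such as P0_uux_ttx within pp_tt012j (see PR #754)
static_assert( nprocesses == 1 || nprocesses == 2, "Assume nprocesses == 1 or 2" );
constexpr int process_id = 1; // code generation source: madevent (hypothetical value)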

template = open(pjoin(self.template_path,'gpu','color_sum.cc'),'r').read()
replace_dict = {}
# Extract color matrix again (this was also in get_matrix_single_process called within get_all_sigmaKin_lines)
replace_dict['color_matrix_lines'] = self.get_color_matrix_lines(self.matrix_elements[0])

This comment pinpoints the previous point of attention:
this is where we need to check whether the format is still correct.
(If it is, we likely want to create an issue to update it; if not, there is no choice other than fixing it.)
