Conversation


@valassi commented Oct 6, 2025

Hi @oliviermattelaer, as discussed recently and as per my presentation at the MG5AMC meeting last Friday.

This is the PR for my kernel splitting changes that I recommend NOT merging:

  • ihel4: Feynman diagram kernels

I have prepared a paper, which will shortly appear on arXiv, with all the details.

I file this for the record and as a fully functional proof of concept that can be used as the basis for further developments (I will probably try one small thing in addition).

Note, like the previous PR #1049, until yesterday it would have been possible to merge this automatically, as I had merged the latest upstream into my developments. Yesterday some new changes were merged (for tREX I think), so this now has conflicts. I suggest that this should eventually be closed as not merged, but I will in any case port this to the level of the other one, when the tREX conflicts are solved.

Thanks
Andrea

…city per kernel - many failures in some processes

STARTED  AT Sun Nov  3 08:23:00 PM CET 2024
./tput/teeThroughputX.sh -dmf -hrd -makej -eemumu -ggtt -ggttg -ggttgg -gqttq -ggttggg -makeclean
ENDED(1) AT Sun Nov  3 10:54:35 PM CET 2024 [Status=0]
./tput/teeThroughputX.sh -d_f -hrd -makej -eemumu -ggtt -ggttgg -inlonly -makeclean
ENDED(2) AT Sun Nov  3 11:14:25 PM CET 2024 [Status=0]
./tput/teeThroughputX.sh -makej -eemumu -ggtt -ggttg -gqttq -ggttgg -ggttggg -d_f -bridge -makeclean
ENDED(3) AT Sun Nov  3 11:23:09 PM CET 2024 [Status=0]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -d_f -rmbhst
ENDED(4) AT Sun Nov  3 11:25:50 PM CET 2024 [Status=0]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -d_f -curhst
ENDED(5) AT Sun Nov  3 11:28:28 PM CET 2024 [Status=0]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -d_f -common
ENDED(6) AT Sun Nov  3 11:31:11 PM CET 2024 [Status=0]
./tput/teeThroughputX.sh -dmf -hrd -makej -susyggtt -susyggt1t1 -smeftggtttt -heftggbb -makeclean
ENDED(7) AT Mon Nov  4 12:01:02 AM CET 2024 [Status=0]

./tput/logs_eemumu_mad/log_eemumu_mad_d_inl0_hrd0_bridge.txt: 2 FAILED TESTS
./tput/logs_eemumu_mad/log_eemumu_mad_d_inl0_hrd0_common.txt: 2 FAILED TESTS
./tput/logs_eemumu_mad/log_eemumu_mad_d_inl0_hrd0_curhst.txt: 2 FAILED TESTS
./tput/logs_eemumu_mad/log_eemumu_mad_d_inl0_hrd0_rmbhst.txt: 2 FAILED TESTS
./tput/logs_eemumu_mad/log_eemumu_mad_d_inl0_hrd0.txt: 2 FAILED TESTS
./tput/logs_eemumu_mad/log_eemumu_mad_d_inl0_hrd1.txt: 2 FAILED TESTS
./tput/logs_eemumu_mad/log_eemumu_mad_d_inl1_hrd0.txt: 2 FAILED TESTS
./tput/logs_eemumu_mad/log_eemumu_mad_d_inl1_hrd1.txt: 2 FAILED TESTS
./tput/logs_eemumu_mad/log_eemumu_mad_f_inl0_hrd0_bridge.txt: 2 FAILED TESTS
./tput/logs_eemumu_mad/log_eemumu_mad_f_inl0_hrd0_common.txt: 2 FAILED TESTS
./tput/logs_eemumu_mad/log_eemumu_mad_f_inl0_hrd0_curhst.txt: 2 FAILED TESTS
./tput/logs_eemumu_mad/log_eemumu_mad_f_inl0_hrd0_rmbhst.txt: 2 FAILED TESTS
./tput/logs_eemumu_mad/log_eemumu_mad_f_inl0_hrd0.txt: 2 FAILED TESTS
./tput/logs_eemumu_mad/log_eemumu_mad_f_inl0_hrd1.txt: 2 FAILED TESTS
./tput/logs_eemumu_mad/log_eemumu_mad_f_inl1_hrd0.txt: 2 FAILED TESTS
./tput/logs_eemumu_mad/log_eemumu_mad_f_inl1_hrd1.txt: 2 FAILED TESTS
./tput/logs_eemumu_mad/log_eemumu_mad_m_inl0_hrd0.txt: 2 FAILED TESTS
./tput/logs_eemumu_mad/log_eemumu_mad_m_inl0_hrd1.txt: 2 FAILED TESTS
./tput/logs_ggttggg_mad/log_ggttggg_mad_d_inl0_hrd0_bridge.txt: 2 FAILED TESTS
./tput/logs_ggttggg_mad/log_ggttggg_mad_d_inl0_hrd0.txt: 2 FAILED TESTS
./tput/logs_ggttggg_mad/log_ggttggg_mad_d_inl0_hrd1.txt: 2 FAILED TESTS
./tput/logs_ggttggg_mad/log_ggttggg_mad_f_inl0_hrd0_bridge.txt: 2 FAILED TESTS
./tput/logs_ggttggg_mad/log_ggttggg_mad_f_inl0_hrd0.txt: 2 FAILED TESTS
./tput/logs_ggttggg_mad/log_ggttggg_mad_f_inl0_hrd1.txt: 2 FAILED TESTS
./tput/logs_ggttggg_mad/log_ggttggg_mad_m_inl0_hrd0.txt: 2 FAILED TESTS
./tput/logs_ggttggg_mad/log_ggttggg_mad_m_inl0_hrd1.txt: 2 FAILED TESTS
./tput/logs_ggttgg_mad/log_ggttgg_mad_d_inl0_hrd0_bridge.txt: 2 FAILED TESTS
./tput/logs_ggttgg_mad/log_ggttgg_mad_d_inl0_hrd0_common.txt: 2 FAILED TESTS
./tput/logs_ggttgg_mad/log_ggttgg_mad_d_inl0_hrd0_curhst.txt: 2 FAILED TESTS
./tput/logs_ggttgg_mad/log_ggttgg_mad_d_inl0_hrd0_rmbhst.txt: 2 FAILED TESTS
./tput/logs_ggttgg_mad/log_ggttgg_mad_d_inl0_hrd0.txt: 2 FAILED TESTS
./tput/logs_ggttgg_mad/log_ggttgg_mad_d_inl0_hrd1.txt: 2 FAILED TESTS
./tput/logs_ggttgg_mad/log_ggttgg_mad_d_inl1_hrd0.txt: 2 FAILED TESTS
./tput/logs_ggttgg_mad/log_ggttgg_mad_d_inl1_hrd1.txt: 2 FAILED TESTS
./tput/logs_ggttgg_mad/log_ggttgg_mad_f_inl0_hrd0_bridge.txt: 2 FAILED TESTS
./tput/logs_ggttgg_mad/log_ggttgg_mad_f_inl0_hrd0_common.txt: 2 FAILED TESTS
./tput/logs_ggttgg_mad/log_ggttgg_mad_f_inl0_hrd0_curhst.txt: 2 FAILED TESTS
./tput/logs_ggttgg_mad/log_ggttgg_mad_f_inl0_hrd0_rmbhst.txt: 2 FAILED TESTS
./tput/logs_ggttgg_mad/log_ggttgg_mad_f_inl0_hrd0.txt: 2 FAILED TESTS
./tput/logs_ggttgg_mad/log_ggttgg_mad_f_inl0_hrd1.txt: 2 FAILED TESTS
./tput/logs_ggttgg_mad/log_ggttgg_mad_f_inl1_hrd0.txt: 2 FAILED TESTS
./tput/logs_ggttgg_mad/log_ggttgg_mad_f_inl1_hrd1.txt: 2 FAILED TESTS
./tput/logs_ggttgg_mad/log_ggttgg_mad_m_inl0_hrd0.txt: 2 FAILED TESTS
./tput/logs_ggttgg_mad/log_ggttgg_mad_m_inl0_hrd1.txt: 2 FAILED TESTS
./tput/logs_gqttq_mad/log_gqttq_mad_d_inl0_hrd0_bridge.txt: 2 FAILED TESTS
./tput/logs_gqttq_mad/log_gqttq_mad_d_inl0_hrd0.txt: 2 FAILED TESTS
./tput/logs_gqttq_mad/log_gqttq_mad_d_inl0_hrd1.txt: 2 FAILED TESTS
./tput/logs_gqttq_mad/log_gqttq_mad_f_inl0_hrd0_bridge.txt: 2 FAILED TESTS
./tput/logs_gqttq_mad/log_gqttq_mad_f_inl0_hrd0.txt: 2 FAILED TESTS
./tput/logs_gqttq_mad/log_gqttq_mad_f_inl0_hrd1.txt: 2 FAILED TESTS
./tput/logs_gqttq_mad/log_gqttq_mad_m_inl0_hrd0.txt: 2 FAILED TESTS
./tput/logs_gqttq_mad/log_gqttq_mad_m_inl0_hrd1.txt: 2 FAILED TESTS
./tput/logs_smeftggtttt_mad/log_smeftggtttt_mad_d_inl0_hrd0.txt: 2 FAILED TESTS
./tput/logs_smeftggtttt_mad/log_smeftggtttt_mad_d_inl0_hrd1.txt: 2 FAILED TESTS
./tput/logs_smeftggtttt_mad/log_smeftggtttt_mad_f_inl0_hrd0.txt: 2 FAILED TESTS
./tput/logs_smeftggtttt_mad/log_smeftggtttt_mad_f_inl0_hrd1.txt: 2 FAILED TESTS
./tput/logs_smeftggtttt_mad/log_smeftggtttt_mad_m_inl0_hrd0.txt: 2 FAILED TESTS
./tput/logs_smeftggtttt_mad/log_smeftggtttt_mad_m_inl0_hrd1.txt: 2 FAILED TESTS
…nel - many failures in various processes

STARTED  AT Mon Nov  4 12:01:02 AM CET 2024
(SM tests)
ENDED(1) AT Mon Nov  4 01:14:56 AM CET 2024 [Status=0]
(BSM tests)
ENDED(1) AT Mon Nov  4 01:20:26 AM CET 2024 [Status=0]

0 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_eemumu_mad/log_eemumu_mad_d_inl0_hrd0.txt
0 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_eemumu_mad/log_eemumu_mad_f_inl0_hrd0.txt
0 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_eemumu_mad/log_eemumu_mad_m_inl0_hrd0.txt
0 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttggg_mad/log_ggttggg_mad_d_inl0_hrd0.txt
0 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttggg_mad/log_ggttggg_mad_f_inl0_hrd0.txt
0 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttggg_mad/log_ggttggg_mad_m_inl0_hrd0.txt
0 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttgg_mad/log_ggttgg_mad_d_inl0_hrd0.txt
0 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttgg_mad/log_ggttgg_mad_f_inl0_hrd0.txt
0 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttgg_mad/log_ggttgg_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttg_mad/log_ggttg_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttg_mad/log_ggttg_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttg_mad/log_ggttg_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggtt_mad/log_ggtt_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggtt_mad/log_ggtt_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggtt_mad/log_ggtt_mad_m_inl0_hrd0.txt
0 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_gqttq_mad/log_gqttq_mad_d_inl0_hrd0.txt
0 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_gqttq_mad/log_gqttq_mad_f_inl0_hrd0.txt
0 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_gqttq_mad/log_gqttq_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_heftggbb_mad/log_heftggbb_mad_d_inl0_hrd0.txt
1 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_heftggbb_mad/log_heftggbb_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_heftggbb_mad/log_heftggbb_mad_m_inl0_hrd0.txt
0 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_smeftggtttt_mad/log_smeftggtttt_mad_d_inl0_hrd0.txt
0 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_smeftggtttt_mad/log_smeftggtttt_mad_f_inl0_hrd0.txt
0 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_smeftggtttt_mad/log_smeftggtttt_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggt1t1_mad/log_susyggt1t1_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggt1t1_mad/log_susyggt1t1_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggt1t1_mad/log_susyggt1t1_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggtt_mad/log_susyggtt_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggtt_mad/log_susyggtt_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggtt_mad/log_susyggtt_mad_m_inl0_hrd0.txt
…hat are only needed for multichannel (fix build warnings)
… (#ifdef out variables that are only needed for multichannel)
…ad after fixing build warning for color selection
… to 4 (as in upstream/master) instead of 256 - runTest now passes

This is the result of careful debugging comparing results to upstream/master using printf.

The question is, why is CODEGEN now giving 256 instead of 4?...
…ators instead of the hardcoded 256 - this fixes ee_mumu runTest
…ass DeviceAccessJamp2 for decoding allJamp2s buffers
…city per kernel - finally all ok!

STARTED  AT Tue Nov  5 07:29:55 AM CET 2024
./tput/teeThroughputX.sh -dmf -hrd -makej -eemumu -ggtt -ggttg -ggttgg -gqttq -ggttggg -makeclean
ENDED(1) AT Tue Nov  5 10:07:22 AM CET 2024 [Status=0]
./tput/teeThroughputX.sh -d_f -hrd -makej -eemumu -ggtt -ggttgg -inlonly -makeclean
ENDED(2) AT Tue Nov  5 10:27:27 AM CET 2024 [Status=0]
./tput/teeThroughputX.sh -makej -eemumu -ggtt -ggttg -gqttq -ggttgg -ggttggg -d_f -bridge -makeclean
ENDED(3) AT Tue Nov  5 10:36:27 AM CET 2024 [Status=0]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -d_f -rmbhst
ENDED(4) AT Tue Nov  5 10:39:13 AM CET 2024 [Status=0]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -d_f -curhst
ENDED(5) AT Tue Nov  5 10:41:58 AM CET 2024 [Status=0]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -d_f -common
ENDED(6) AT Tue Nov  5 10:44:47 AM CET 2024 [Status=0]
./tput/teeThroughputX.sh -dmf -hrd -makej -susyggtt -susyggt1t1 -smeftggtttt -heftggbb -makeclean
ENDED(7) AT Tue Nov  5 11:11:29 AM CET 2024 [Status=0]

No errors found in logs

No FPEs or '{ }' found in logs

(Note1): there seems to be a performance advantage, but not as big as I was expecting

(Note2): the profiling scripts must be fixed, now sigmaKin is no longer a kernel!

git diff 6be9482 --no-ext-diff tput/logs_ggttggg_mad/log_ggttggg_mad_m_inl0_hrd0.txt

 runNcu /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttggg.mad/SubProcesses/P1_gg_ttxggg/build.cuda_m_inl0_hrd0/check_cuda.exe -p 1 256 1
-==PROF== Profiling "sigmaKin": launch__registers_per_thread 255
-==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 100%
+==WARNING== No kernels were profiled.
+   launch__registers_per_thread N/A
+   sm__sass_average_branch_targets_threads_uniform.pct N/A
 .........................................................................
 runExe /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttggg.mad/SubProcesses/P1_gg_ttxggg/build.cuda_m_inl0_hrd0/check_cuda.exe -p 64 256 1 OMP=
 INFO: The following Floating Point Exceptions will cause SIGFPE program aborts: FE_DIVBYZERO, FE_INVALID, FE_OVERFLOW
 Process                     = SIGMA_SM_GG_TTXGGG_CUDA [nvcc 12.0.140 (gcc 11.3.1)] [inlineHel=0] [hardcodePARAM=0]
 Workflow summary            = CUD:MIX+THX:CURDEV+RMBDEV+MESDEV/none+NAVBRK
 FP precision                = MIXED (NaN/abnormal=0, zero=0)
-EvtsPerSec[Rmb+ME]     (23) = ( 1.108221e+04                 )  sec^-1
-EvtsPerSec[MatrixElems] (3) = ( 1.108518e+04                 )  sec^-1
-EvtsPerSec[MECalcOnly] (3a) = ( 1.108553e+04                 )  sec^-1
+EvtsPerSec[Rmb+ME]     (23) = ( 1.292099e+04                 )  sec^-1
+EvtsPerSec[MatrixElems] (3) = ( 1.292537e+04                 )  sec^-1
+EvtsPerSec[MECalcOnly] (3a) = ( 1.292566e+04                 )  sec^-1
 MeanMatrixElemValue         = ( 1.856249e-04 +- 8.329951e-05 )  GeV^-6
-TOTAL       :     3.432158 sec
+TOTAL       :     2.037956 sec
 INFO: No Floating Point Exceptions have been reported
-    11,402,912,009      cycles                           #    3.032 GHz
-    24,689,535,297      instructions                     #    2.17  insn per cycle
-       3.818442336 seconds time elapsed
+     6,757,906,381      cycles                           #    2.881 GHz
+    14,685,293,303      instructions                     #    2.17  insn per cycle
+       2.402242543 seconds time elapsed
…itscrd90 with one helicity per kernel - finally also all ok!

STARTED  AT Tue Nov  5 11:11:29 AM CET 2024
(SM tests)
ENDED(1) AT Tue Nov  5 03:09:08 PM CET 2024 [Status=0]
(BSM tests)
ENDED(1) AT Tue Nov  5 03:19:40 PM CET 2024 [Status=0]

24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_eemumu_mad/log_eemumu_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_eemumu_mad/log_eemumu_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_eemumu_mad/log_eemumu_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttggg_mad/log_ggttggg_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttggg_mad/log_ggttggg_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttggg_mad/log_ggttggg_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttgg_mad/log_ggttgg_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttgg_mad/log_ggttgg_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttgg_mad/log_ggttgg_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttg_mad/log_ggttg_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttg_mad/log_ggttg_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttg_mad/log_ggttg_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggtt_mad/log_ggtt_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggtt_mad/log_ggtt_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggtt_mad/log_ggtt_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_gqttq_mad/log_gqttq_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_gqttq_mad/log_gqttq_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_gqttq_mad/log_gqttq_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_heftggbb_mad/log_heftggbb_mad_d_inl0_hrd0.txt
1 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_heftggbb_mad/log_heftggbb_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_heftggbb_mad/log_heftggbb_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_smeftggtttt_mad/log_smeftggtttt_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_smeftggtttt_mad/log_smeftggtttt_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_smeftggtttt_mad/log_smeftggtttt_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggt1t1_mad/log_susyggt1t1_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggt1t1_mad/log_susyggt1t1_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggt1t1_mad/log_susyggt1t1_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggtt_mad/log_susyggtt_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggtt_mad/log_susyggtt_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggtt_mad/log_susyggtt_mad_m_inl0_hrd0.txt

(Note1): there seems to be a performance advantage, but this remains to be better understood
What is interesting is the behaviour for fewer events (8192 rather than 81920)
However this may be due to a different accounting of helicity filtering rather than to a real speedup...
What is strange is that this appears for ggtt but not for ggttggg
TODO: should try to reduce below 8192 and see what this gives...?

git diff 6be9482 --no-ext-diff tmad/logs_ggtt_mad/log_ggtt_mad_m_inl0_hrd0.txt

@@ -534,10 +534,10 @@ DEBUG: MEK processed 8192 events across 3 channels { 1 : 8192 }
  [XSECTION] ChannelId = 1
  [XSECTION] Cross section = 47.14 [47.138611963547788] fbridge_mode=1
  [UNWEIGHT] Wrote 1618 events (found 1623 events)
- [COUNTERS] PROGRAM TOTAL          :    0.8403s
- [COUNTERS] Fortran Overhead ( 0 ) :    0.8366s
- [COUNTERS] CudaCpp MEs      ( 2 ) :    0.0030s for     8192 events => throughput is 2.75E+06 events/s
- [COUNTERS] CudaCpp HEL      ( 3 ) :    0.0007s
+ [COUNTERS] PROGRAM TOTAL          :    0.8671s
+ [COUNTERS] Fortran Overhead ( 0 ) :    0.8626s
+ [COUNTERS] CudaCpp MEs      ( 2 ) :    0.0009s for     8192 events => throughput is 8.99E+06 events/s
+ [COUNTERS] CudaCpp HEL      ( 3 ) :    0.0036s

 *** (3-cuda) Compare MADEVENT_CUDA x1 xsec to MADEVENT_FORTRAN xsec ***

@@ -569,10 +569,10 @@ DEBUG: MEK processed 81920 events across 3 channels { 1 : 81920 }
  [XSECTION] ChannelId = 1
  [XSECTION] Cross section = 47.14 [47.144596232269095] fbridge_mode=1
  [UNWEIGHT] Wrote 1613 events (found 1618 events)
- [COUNTERS] PROGRAM TOTAL          :    1.9861s
- [COUNTERS] Fortran Overhead ( 0 ) :    1.9767s
- [COUNTERS] CudaCpp MEs      ( 2 ) :    0.0087s for    81920 events => throughput is 9.38E+06 events/s
- [COUNTERS] CudaCpp HEL      ( 3 ) :    0.0007s
+ [COUNTERS] PROGRAM TOTAL          :    2.0740s
+ [COUNTERS] Fortran Overhead ( 0 ) :    2.0624s
+ [COUNTERS] CudaCpp MEs      ( 2 ) :    0.0082s for    81920 events => throughput is 1.00E+07 events/s
+ [COUNTERS] CudaCpp HEL      ( 3 ) :    0.0034s

(Note2): as a better check of performance speedups, I have run this test on gg_ttgg.mad
There is a speedup, but not huge - most likely it is necessary to move to cuda streams

for b in 1 2 4 8 16 32 64 128 256 512 1024; do \
  ./build.cuda_m_inl0_hrd0/check_cuda.exe -p $b 256 1 | \grep 'EvtsPerSec\[MECalcOnly\]' |\
  awk -vb=$b '{printf "%s %4d %3d\n", $5, b, 256}'; done

(Step 0)
In upstream/master before moving to one helicity per thread
8.174664e+03    1 256
1.646385e+04    2 256
3.278964e+04    4 256
6.259986e+04    8 256
1.200837e+05   16 256
2.157126e+05   32 256
3.311354e+05   64 256
3.529729e+05  128 256
3.688460e+05  256 256
4.013773e+05  512 256
4.139693e+05 1024 256

(Step 1a)
After moving to one helicity per thread, without using cuda streams yet
1.434033e+04    1 256
2.851371e+04    2 256
5.646722e+04    4 256
7.650805e+04    8 256
1.422132e+05   16 256
2.452440e+05   32 256
3.680172e+05   64 256
3.658933e+05  128 256
3.631490e+05  256 256
3.921660e+05  512 256
4.038302e+05 1024 256
…ng cuda streams

Repeat the manual tests to check the performance advantage on gg_ttgg.mad

for b in 1 2 4 8 16 32 64 128 256 512 1024; do \
  ./build.cuda_m_inl0_hrd0/check_cuda.exe -p $b 256 1 | \grep 'EvtsPerSec\[MECalcOnly\]' |\
  awk -vb=$b '{printf "%s %4d %3d\n", $5, b, 256}'; done

(Step 1a)
Previously, after moving to one helicity per thread, without using cuda streams yet
1.434033e+04    1 256
2.851371e+04    2 256
5.646722e+04    4 256
7.650805e+04    8 256
1.422132e+05   16 256
2.452440e+05   32 256
3.680172e+05   64 256
3.658933e+05  128 256
3.631490e+05  256 256
3.921660e+05  512 256
4.038302e+05 1024 256

In this commit, after moving to cuda streams:
1.385174e+04    1 256
2.748146e+04    2 256
5.451755e+04    4 256
7.480672e+04    8 256
1.394517e+05   16 256
2.408220e+05   32 256
3.646268e+05   64 256
3.645053e+05  128 256
3.621090e+05  256 256
3.956870e+05  512 256
4.033960e+05 1024 256

So it seems that there is no obvious performance advantage yet - there is still something to be fixed
…defer memcpy after helicity loop) - BUT runTest fails

The following results for gg_ttgg.mad show the performance speedup from the use of streams

(Step 0)
In upstream/master before moving to one helicity per thread
8.174664e+03    1 256
1.646385e+04    2 256
3.278964e+04    4 256
6.259986e+04    8 256
1.200837e+05   16 256
2.157126e+05   32 256
3.311354e+05   64 256
3.529729e+05  128 256
3.688460e+05  256 256
4.013773e+05  512 256
4.139693e+05 1024 256

(Step 1a)
After moving to one helicity per thread, but before moving to cuda streams
1.434033e+04    1 256
2.851371e+04    2 256
5.646722e+04    4 256
7.650805e+04    8 256
1.422132e+05   16 256
2.452440e+05   32 256
3.680172e+05   64 256
3.658933e+05  128 256
3.631490e+05  256 256
3.921660e+05  512 256
4.038302e+05 1024 256

(current commit)
In the current commit after adding and fixing cuda stream parallelism (defer memcpy after the helicity loop on kernels)
2.502965e+05    1 256
3.191079e+05    2 256
3.340419e+05    4 256
3.697397e+05    8 256
3.897092e+05   16 256
3.912757e+05   32 256
3.917159e+05   64 256
3.974842e+05  128 256
4.029707e+05  256 256
4.103303e+05  512 256
4.068449e+05 1024 256

This is a huge performance boost, but the implementation needs functional fixes...
(A posteriori: the issue here was that numerators/denominators would need an atomicAdd, or a different implementation)
…efunctions as the kernel to be profiled with nsight
…ormance and streams functionality!

All runTests now succeed for cuda and c++ with/without multichannel

The issue was that numerators, denominators, jamps were sums over helicities, but '+=' is not thread safe
- For numerators and denominators, this has been fixed by using a superbuffer with one event buffer per helicity
- For jamp2s, this has been solved using atomicAdd

The overall performance boost is impressive.
Using the same test in gg_ttgg.mad as done previously:

for b in 1 2 4 8 16 32 64 128 256 512 1024; do \
  ./build.cuda_m_inl0_hrd0/check_cuda.exe -p $b 256 1 | \grep 'EvtsPerSec\[MECalcOnly\]' |\
  awk -vb=$b '{printf "%s %4d %3d\n", $5, b, 256}'; done

(Step 0)
In upstream/master before moving to one helicity per thread
8.174664e+03    1 256
1.646385e+04    2 256
3.278964e+04    4 256
6.259986e+04    8 256
1.200837e+05   16 256
2.157126e+05   32 256
3.311354e+05   64 256
3.529729e+05  128 256
3.688460e+05  256 256
4.013773e+05  512 256
4.139693e+05 1024 256

(Step 1a)
After moving to one helicity per thread, but before moving to cuda streams
1.434033e+04    1 256
2.851371e+04    2 256
5.646722e+04    4 256
7.650805e+04    8 256
1.422132e+05   16 256
2.452440e+05   32 256
3.680172e+05   64 256
3.658933e+05  128 256
3.631490e+05  256 256
3.921660e+05  512 256
4.038302e+05 1024 256

(Step 1b - this commit)
2.731214e+05    1 256
3.591143e+05    2 256
3.542609e+05    4 256
3.840352e+05    8 256
3.978597e+05   16 256
3.979454e+05   32 256
3.961940e+05   64 256
4.054739e+05  128 256
4.048128e+05  256 256
4.168482e+05  512 256
4.132319e+05 1024 256
…d functionality with cuda streams (end of part 1b)
…city per kernel and cuda stream - all ok!

STARTED  AT Wed Nov  6 10:54:42 AM CET 2024
./tput/teeThroughputX.sh -dmf -hrd -makej -eemumu -ggtt -ggttg -ggttgg -gqttq -ggttggg -makeclean
ENDED(1) AT Wed Nov  6 01:50:58 PM CET 2024 [Status=0]
./tput/teeThroughputX.sh -d_f -hrd -makej -eemumu -ggtt -ggttgg -inlonly -makeclean
ENDED(2) AT Wed Nov  6 02:11:58 PM CET 2024 [Status=0]
./tput/teeThroughputX.sh -makej -eemumu -ggtt -ggttg -gqttq -ggttgg -ggttggg -d_f -bridge -makeclean
ENDED(3) AT Wed Nov  6 02:21:49 PM CET 2024 [Status=0]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -d_f -rmbhst
ENDED(4) AT Wed Nov  6 02:24:52 PM CET 2024 [Status=0]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -d_f -curhst
ENDED(5) AT Wed Nov  6 02:27:52 PM CET 2024 [Status=0]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -d_f -common
ENDED(6) AT Wed Nov  6 02:30:58 PM CET 2024 [Status=0]
./tput/teeThroughputX.sh -dmf -hrd -makej -susyggtt -susyggt1t1 -smeftggtttt -heftggbb -makeclean
ENDED(7) AT Wed Nov  6 02:59:52 PM CET 2024 [Status=0]
No errors found in logs
No FPEs or '{ }' found in logs

Note the impressive performance improvement in the test '-p 1 256 2' with gg_ttggg (grid 256!):

 Process                     = SIGMA_SM_GG_TTXGGG_CUDA [nvcc 12.0.140 (gcc 11.3.1)] [inlineHel=0] [hardcodePARAM=0]
 Workflow summary            = CUD:MIX+THX:CURDEV+RMBDEV+MESDEV/none+NAVBRK
 FP precision                = MIXED (NaN/abnormal=0, zero=0)
-EvtsPerSec[Rmb+ME]     (23) = ( 5.036815e+02                 )  sec^-1
-EvtsPerSec[MatrixElems] (3) = ( 5.037459e+02                 )  sec^-1
-EvtsPerSec[MECalcOnly] (3a) = ( 5.037582e+02                 )  sec^-1
+EvtsPerSec[Rmb+ME]     (23) = ( 1.273555e+04                 )  sec^-1
+EvtsPerSec[MatrixElems] (3) = ( 1.277447e+04                 )  sec^-1
+EvtsPerSec[MECalcOnly] (3a) = ( 1.278128e+04                 )  sec^-1
 MeanMatrixElemValue         = ( 1.187066e-05 +- 9.825548e-06 )  GeV^-6
-TOTAL       :     1.772749 sec
+TOTAL       :     0.829704 sec

Essentially this is now enough to reach the maximum throughput, which was previously at '-p 64 256 1' (grid 8192):

 Process                     = SIGMA_SM_GG_TTXGGG_CUDA [nvcc 12.0.140 (gcc 11.3.1)] [inlineHel=0] [hardcodePARAM=0]
 Workflow summary            = CUD:MIX+THX:CURDEV+RMBDEV+MESDEV/none+NAVBRK
 FP precision                = MIXED (NaN/abnormal=0, zero=0)
-EvtsPerSec[Rmb+ME]     (23) = ( 1.292099e+04                 )  sec^-1
-EvtsPerSec[MatrixElems] (3) = ( 1.292537e+04                 )  sec^-1
-EvtsPerSec[MECalcOnly] (3a) = ( 1.292566e+04                 )  sec^-1
+EvtsPerSec[Rmb+ME]     (23) = ( 1.366211e+04                 )  sec^-1
+EvtsPerSec[MatrixElems] (3) = ( 1.366633e+04                 )  sec^-1
+EvtsPerSec[MECalcOnly] (3a) = ( 1.366663e+04                 )  sec^-1
 MeanMatrixElemValue         = ( 1.856249e-04 +- 8.329951e-05 )  GeV^-6
-TOTAL       :     2.037956 sec
+TOTAL       :     1.956331 sec
…crd90 with one helicity per kernel and streams

STARTED  AT Wed Nov  6 02:59:52 PM CET 2024
(SM tests)
ENDED(1) AT Wed Nov  6 06:56:35 PM CET 2024 [Status=0]
(BSM tests)
ENDED(1) AT Wed Nov  6 07:07:01 PM CET 2024 [Status=0]

24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_eemumu_mad/log_eemumu_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_eemumu_mad/log_eemumu_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_eemumu_mad/log_eemumu_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttggg_mad/log_ggttggg_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttggg_mad/log_ggttggg_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttggg_mad/log_ggttggg_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttgg_mad/log_ggttgg_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttgg_mad/log_ggttgg_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttgg_mad/log_ggttgg_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttg_mad/log_ggttg_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttg_mad/log_ggttg_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttg_mad/log_ggttg_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggtt_mad/log_ggtt_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggtt_mad/log_ggtt_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggtt_mad/log_ggtt_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_gqttq_mad/log_gqttq_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_gqttq_mad/log_gqttq_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_gqttq_mad/log_gqttq_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_heftggbb_mad/log_heftggbb_mad_d_inl0_hrd0.txt
1 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_heftggbb_mad/log_heftggbb_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_heftggbb_mad/log_heftggbb_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_smeftggtttt_mad/log_smeftggtttt_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_smeftggtttt_mad/log_smeftggtttt_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_smeftggtttt_mad/log_smeftggtttt_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggt1t1_mad/log_susyggt1t1_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggt1t1_mad/log_susyggt1t1_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggt1t1_mad/log_susyggt1t1_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggtt_mad/log_susyggtt_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggtt_mad/log_susyggtt_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggtt_mad/log_susyggtt_mad_m_inl0_hrd0.txt

Note: the performance improvement with streams is nice but not really shown here,
because I am using a fixed grid of 8192, while I should reduce that.

Note that the two tests with 8192 and 81920 events are not meant to reach max throughput (8192 is enough).
Instead they are meant to make the initialization negligible, to better understand the Amdahl's law behaviour.

See for instance here for gg_ttggg:

 *** (3-cuda) EXECUTE MADEVENT_CUDA x1 (create events.lhe) ***
 --------------------
@@ -532,16 +532,16 @@ DEBUG: MEK processed 8192 events across 1240 channels { 1 : 8192 }
  [XSECTION] MultiChannel = TRUE
  [XSECTION] Configuration = 1
  [XSECTION] ChannelId = 1
- [XSECTION] Cross section = 2.357e-07 [2.3572561518129471E-007] fbridge_mode=1
+ [XSECTION] Cross section = 2.357e-07 [2.3572561518129449E-007] fbridge_mode=1
  [UNWEIGHT] Wrote 18 events (found 285 events)
- [COUNTERS] PROGRAM TOTAL          :    2.0818s
- [COUNTERS] Fortran Overhead ( 0 ) :    1.0200s
- [COUNTERS] CudaCpp MEs      ( 2 ) :    0.7805s for     8192 events => throughput is 1.05E+04 events/s
- [COUNTERS] CudaCpp HEL      ( 3 ) :    0.2812s
+ [COUNTERS] PROGRAM TOTAL          :    1.9215s
+ [COUNTERS] Fortran Overhead ( 0 ) :    1.0238s
+ [COUNTERS] CudaCpp MEs      ( 2 ) :    0.6150s for     8192 events => throughput is 1.33E+04 events/s
+ [COUNTERS] CudaCpp HEL      ( 3 ) :    0.2827s

 *** (3-cuda) Compare MADEVENT_CUDA x1 events.lhe to MADEVENT_FORTRAN events.lhe reference (including colors and helicities)>
@@ -567,16 +567,16 @@ DEBUG: MEK processed 81920 events across 1240 channels { 1 : 81920 }
  [XSECTION] MultiChannel = TRUE
  [XSECTION] Configuration = 1
  [XSECTION] ChannelId = 1
- [XSECTION] Cross section = 2.284e-07 [2.2842713109538129E-007] fbridge_mode=1
+ [XSECTION] Cross section = 2.284e-07 [2.2842713109538103E-007] fbridge_mode=1
  [UNWEIGHT] Wrote 380 events (found 1707 events)
- [COUNTERS] PROGRAM TOTAL          :   13.1212s
- [COUNTERS] Fortran Overhead ( 0 ) :    5.0520s
- [COUNTERS] CudaCpp MEs      ( 2 ) :    7.7872s for    81920 events => throughput is 1.05E+04 events/s
- [COUNTERS] CudaCpp HEL      ( 3 ) :    0.2820s
+ [COUNTERS] PROGRAM TOTAL          :   11.3058s
+ [COUNTERS] Fortran Overhead ( 0 ) :    4.9192s
+ [COUNTERS] CudaCpp MEs      ( 2 ) :    6.1024s for    81920 events => throughput is 1.34E+04 events/s
+ [COUNTERS] CudaCpp HEL      ( 3 ) :    0.2842s
…PI calls (not yet used inside calculate_wavefunctions)

Also rename J_ACCESS as J2_ACCESS (this is for jamp2 not for jamps)
…ssJamps.h (a simpler accessor is in CPPProcess.cc)
…s into calculate_jamps (Feynman diagrams) and color_sum

This completes part 2a of kernel splitting.

On my usual ggttgg test, this gives another small improvement, though nothing impressive

for b in 1 2 4 8 16 32 64 128 256 512 1024; do \
 ./build.cuda_m_inl0_hrd0/check_cuda.exe -p $b 256 1 | \grep 'EvtsPerSec\[MECalcOnly\]' |\
 awk -vb=$b '{printf "%s %4d %3d\n", $5, b, 256}'; done
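The three-column scan output (throughput, blocks, threads) produced by the loop above can be reduced to a speedup-vs-baseline column with a short awk filter; a sketch using two illustrative lines in that format:

```shell
# Normalize each scan line's throughput to that of the first (smallest-grid) line.
# Input format: <EvtsPerSec> <blocks> <threads>, as produced by the scan loop.
printf '8.174664e+03 1 256\n4.139693e+05 1024 256\n' | \
  awk 'NR==1 { base = $1 } { printf "%4d %3d x%.1f\n", $2, $3, $1 / base }'
```

The same filter can be appended to the scan pipeline itself to print relative scaling directly.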

(Step 0)
In upstream/master before moving to one helicity per thread
8.174664e+03    1 256
1.646385e+04    2 256
3.278964e+04    4 256
6.259986e+04    8 256
1.200837e+05   16 256
2.157126e+05   32 256
3.311354e+05   64 256
3.529729e+05  128 256
3.688460e+05  256 256
4.013773e+05  512 256
4.139693e+05 1024 256

(Step 1a)
After moving to one helicity per thread, but before moving to cuda streams
1.434033e+04    1 256
2.851371e+04    2 256
5.646722e+04    4 256
7.650805e+04    8 256
1.422132e+05   16 256
2.452440e+05   32 256
3.680172e+05   64 256
3.658933e+05  128 256
3.631490e+05  256 256
3.921660e+05  512 256
4.038302e+05 1024 256

(Step 1b)
After moving to one helicity per thread, with one helicity per cuda stream
2.731214e+05    1 256
3.591143e+05    2 256
3.542609e+05    4 256
3.840352e+05    8 256
3.978597e+05   16 256
3.979454e+05   32 256
3.961940e+05   64 256
4.054739e+05  128 256
4.048128e+05  256 256
4.168482e+05  512 256
4.132319e+05 1024 256

(Step 2a - this commit)
2.957141e+05    1 256
3.666159e+05    2 256
3.878858e+05    4 256
4.266927e+05    8 256
4.459400e+05   16 256
4.447514e+05   32 256
4.454484e+05   64 256
4.442835e+05  128 256
4.519324e+05  256 256
4.573049e+05  512 256
4.575413e+05 1024 256
…_global__ INLINE

CPPProcess.cc(235): warning #20050-D: inline qualifier ignored for "__global__" function
Remark: The warnings can be suppressed with "-diag-suppress <warning-number>"
…cesses, remove INLINE from __global__ to fix build warnings
…__ INLINE in HELINL=1 mode

Remark: The warnings can be suppressed with "-diag-suppress <warning-number>"
diagrams.h(80): warning #20050-D: inline qualifier ignored for "__global__" function
CPPProcess.cc(456): warning #20050-D: inline qualifier ignored for "__global__" function
…arate color_sum files (ease the merge with hack_ihel3)
…- 132 tput and tmad logs

git checkout origin/hack_ihel4 $(git ls-tree --name-only HEAD tput/log* tmad/log*)
…- codegen logs for all processes

git checkout origin/hack_ihel4 $(git ls-tree --name-only HEAD */CODEGEN*txt)
…- generated code except gg_tt.mad

git checkout origin/hack_ihel4 $(git ls-tree -r --name-only HEAD *.sa *.mad \
  | grep -v ^gg_tt.mad | \egrep '(CPPProcess|MatrixElementKernels).(h|cc)')
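The pattern used above (restore a set of tracked paths from another ref, with the path list built by `git ls-tree`) can be demonstrated in isolation; a self-contained sketch in a throwaway repository, where the branch name `other` and the file `log.txt` are purely illustrative:

```shell
# Demo of: git checkout <ref> $(git ls-tree --name-only HEAD <patterns>)
# i.e. overwrite the working-tree copies of tracked paths with the versions on <ref>.
set -e
tmp=$(mktemp -d)
cd "$tmp"
git init -q .
echo v1 > log.txt
git add log.txt
git -c user.email=a@b -c user.name=a commit -q -m "add log"
git branch other                  # 'other' points at the v1 version
echo v2 > log.txt
git add log.txt
git -c user.email=a@b -c user.name=a commit -q -m "update log"
# restore log.txt as it exists on 'other'; the path list comes from ls-tree
git checkout other -- $(git ls-tree --name-only HEAD 'log*')
cat log.txt
```

This is handy when selectively carrying logs or generated files across branches without a full merge.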
This merges the 'diagram kernel splitting' (hack_ihel4) and cublas+master (hack_ihel3_sep25) functionalities

Fix conflicts:
	epochX/cudacpp/CODEGEN/PLUGIN/CUDACPP_SA_OUTPUT/madgraph/iolibs/template_files/gpu/MatrixElementKernels.cc
	epochX/cudacpp/CODEGEN/PLUGIN/CUDACPP_SA_OUTPUT/madgraph/iolibs/template_files/gpu/MatrixElementKernels.h
	epochX/cudacpp/CODEGEN/PLUGIN/CUDACPP_SA_OUTPUT/madgraph/iolibs/template_files/gpu/process_function_definitions.inc
	epochX/cudacpp/CODEGEN/PLUGIN/CUDACPP_SA_OUTPUT/madgraph/iolibs/template_files/gpu/process_h.inc
	epochX/cudacpp/CODEGEN/PLUGIN/CUDACPP_SA_OUTPUT/madgraph/iolibs/template_files/gpu/process_matrix.inc
	epochX/cudacpp/CODEGEN/PLUGIN/CUDACPP_SA_OUTPUT/model_handling.py
	epochX/cudacpp/CODEGEN/PLUGIN/CUDACPP_SA_OUTPUT/output.py
	epochX/cudacpp/gg_tt.mad/SubProcesses/MatrixElementKernels.cc
	epochX/cudacpp/gg_tt.mad/SubProcesses/MatrixElementKernels.h
	epochX/cudacpp/gg_tt.mad/SubProcesses/P1_gg_ttx/CPPProcess.cc
	epochX/cudacpp/gg_tt.mad/SubProcesses/P1_gg_ttx/CPPProcess.h
	epochX/cudacpp/gg_tt.mad/SubProcesses/P1_gg_ttx/color_sum.cc
	epochX/cudacpp/gg_tt.mad/SubProcesses/color_sum.h
	epochX/cudacpp/tmad/allTees.sh
	epochX/cudacpp/tput/allTees.sh
	epochX/cudacpp/tput/throughputX.sh

Also modify epochX/cudacpp/CODEGEN/PLUGIN/CUDACPP_SA_OUTPUT/madgraph/iolibs/template_files/gpu/color_sum.h
(as done in a previous incorrect merge on which this one is based)
…tggg and smeft_ggtttt to avoid out-of-memory asserts on itscrd90
…rgs2) for all processes: no longer needed and may abort
) codebase on itgold91

STARTED AT Tue Sep 23 07:11:21 PM CEST 2025
./tput/teeThroughputX.sh -dmf -makej -makeclean -cpponly -ggtt -ggttg -ggttgg -ggttggg
ENDED   AT Tue Sep 23 07:21:31 PM CEST 2025 [Status=0]
…ack_ihel_sep25 (9f802a9) codebase on itgold91

STARTED AT Tue Sep 23 07:27:31 PM CEST 2025
./tput/teeThroughputX.sh -dmf -makej -makeclean -cpponly -ggtt -ggttg -ggttgg -ggttggg
ENDED   AT Tue Sep 23 07:42:57 PM CEST 2025 [Status=0]
…hack_ihel2_sep25 (7b12a5c) codebase on itgold91

STARTED AT Tue Sep 23 07:53:41 PM CEST 2025
./tput/teeThroughputX.sh -dmf -makej -makeclean -cpponly -ggtt -ggttg -ggttgg -ggttggg
ENDED   AT Tue Sep 23 08:09:00 PM CEST 2025 [Status=0]
…hack_ihel3_sep25 (f98c217) codebase on itgold91

STARTED AT Tue Sep 23 ~08:20 PM CEST 2025
./tput/teeThroughputX.sh -dmf -makej -makeclean -cpponly -ggtt -ggttg -ggttgg -ggttggg
ENDED   AT Tue Sep 23 ~08:33 PM CEST 2025 [Status=0]
…hack_ihel4_sep25 (3e3f200) codebase on itgold91

STARTED AT Wed Sep 24 07:36:25 AM CEST 2025
./tput/teeThroughputX.sh -dmf -makej -makeclean -cpponly -ggtt -ggttg -ggttgg -ggttggg
ENDED   AT Wed Sep 24 07:43:16 AM CEST 2025 [Status=0]
…o hack_ihel3_sep25/itscrd90 logs

git checkout 3e3f200 tput/logs_ggtt*_mad/log_ggtt*_mad_*_inl0_hrd0.txt
…ing) tput tests on LUMI

(after tuning scripts and fixing issues in mg5amcnlo submodule)

With respect to the last LUMI logs for the 'hack_ihel3_sep25' codebase (commit ac04c54):
1) With blas disabled at runtime
- hip/256 throughput is a factor 3 worse at large grids and not very different at small grids
- hip/32 throughput is not very different either at large or small grids
- hip peak throughput at large grids was in any case higher with 256 tpb rather than 32 tpb
- c++ throughputs are ~30% slower
2) With blas enabled at runtime
- hip throughputs show similar trends as without blas (they are worse in ihel4 than in ihel3)
- in any case the blasOn results are a factor 10-100 worse than blasOff results

So, overall ihel4 is much worse than ihel3 on AMD GPUs with HIP (and also worse on AMD CPUs)

STARTED  AT Wed 24 Sep 2025 11:26:42 AM EEST
./tput/teeThroughputX.sh -dmf -hrd -makej -eemumu -ggtt -ggttg -ggttgg -gqttq -ggttggg -makeclean  -nocuda
ENDED(1) AT Wed 24 Sep 2025 11:59:25 AM EEST [Status=0]
./tput/teeThroughputX.sh -dmf -makej -eemumu -ggtt -ggttg -ggttgg -gqttq -ggttggg -scaling  -nocuda
ENDED(1-scaling) AT Wed 24 Sep 2025 12:06:45 PM EEST [Status=0]
./tput/teeThroughputX.sh -ggtt -ggttgg -dmf -blasOn  -nocuda
ENDED(2) AT Wed 24 Sep 2025 12:09:57 PM EEST [Status=0]
./tput/teeThroughputX.sh -ggtt -ggttgg -dmf -blasOn -scaling  -nocuda
ENDED(2-scaling) AT Wed 24 Sep 2025 12:20:22 PM EEST [Status=0]
./tput/teeThroughputX.sh -d_f -hrd -makej -eemumu -ggtt -ggttgg -inlonly -makeclean  -nocuda
ENDED(3) AT Wed 24 Sep 2025 12:30:38 PM EEST [Status=0]
./tput/teeThroughputX.sh -makej -eemumu -ggtt -ggttg -gqttq -ggttgg -ggttggg -d_f -bridge -makeclean  -nocuda
ENDED(4) AT Wed 24 Sep 2025 12:40:30 PM EEST [Status=0]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -d_f -rmbhst  -nocuda
ENDED(5) AT Wed 24 Sep 2025 12:42:27 PM EEST [Status=0]
SKIP './tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -d_f -common  -nocuda'
ENDED(6) AT Wed 24 Sep 2025 12:42:27 PM EEST [Status=0]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -d_f -common  -nocuda
ENDED(7) AT Wed 24 Sep 2025 12:44:23 PM EEST [Status=0]
./tput/teeThroughputX.sh -ggtt -ggttgg -dmf -noBlas -makeclean  -nocuda
ENDED(8) AT Wed 24 Sep 2025 12:49:53 PM EEST [Status=0]
./tput/teeThroughputX.sh -dmf -hrd -makej -susyggtt -susyggt1t1 -smeftggtttt -heftggbb -makeclean  -nocuda
ENDED(9) AT Wed 24 Sep 2025 01:07:25 PM EEST [Status=0]

No errors found in logs

No FPEs or '{ }' found in logs

./tput/logs_ggttgg_mad/log_ggttgg_mad_d_inl0_hrd0_curhst.txt: P1_gg_ttxgg/build.cuda_d_inl0_hrd0/check_cuda.exe: Aborted
./tput/logs_ggttgg_mad/log_ggttgg_mad_f_inl0_hrd0_curhst.txt: P1_gg_ttxgg/build.cuda_f_inl0_hrd0/check_cuda.exe: Aborted

./tput/logs_ggttgg_mad/log_ggttgg_mad_d_inl0_hrd0_blasOn.scaling:check_hip.exe: ./Assertion `code == gpuSuccess' failed.
./tput/logs_ggttgg_mad/log_ggttgg_mad_d_inl0_hrd0.scaling:check_hip.exe: ./Assertion `code == gpuSuccess' failed.
./tput/logs_ggttgg_mad/log_ggttgg_mad_f_inl0_hrd0_blasOn.scaling:check_hip.exe: ./Assertion `code == gpuSuccess' failed.
./tput/logs_ggttgg_mad/log_ggttgg_mad_f_inl0_hrd0.scaling:check_hip.exe: ./Assertion `code == gpuSuccess' failed.
./tput/logs_ggttgg_mad/log_ggttgg_mad_m_inl0_hrd0_blasOn.scaling:check_hip.exe: ./Assertion `code == gpuSuccess' failed.
./tput/logs_ggttgg_mad/log_ggttgg_mad_m_inl0_hrd0.scaling:check_hip.exe: ./Assertion `code == gpuSuccess' failed.
(after tuning scripts and fixing issues in mg5amcnlo submodule)

With respect to the last LUMI logs for the 'hack_ihel3_sep25' codebase (commit ac04c54):
- hip throughputs are a factor ~2 worse in ggttgg

STARTED  AT Wed 24 Sep 2025 01:07:26 PM EEST
(SM tests)
ENDED(1) AT Wed 24 Sep 2025 01:50:35 PM EEST [Status=0]
(BSM tests)
ENDED(1) AT Wed 24 Sep 2025 01:53:53 PM EEST [Status=0]

8 /users/valassia/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_eemumu_mad/log_eemumu_mad_d_inl0_hrd0.txt
8 /users/valassia/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_eemumu_mad/log_eemumu_mad_f_inl0_hrd0.txt
8 /users/valassia/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_eemumu_mad/log_eemumu_mad_m_inl0_hrd0.txt
6 /users/valassia/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_ggttggg_mad/log_ggttggg_mad_d_inl0_hrd0.txt
6 /users/valassia/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_ggttggg_mad/log_ggttggg_mad_f_inl0_hrd0.txt
6 /users/valassia/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_ggttggg_mad/log_ggttggg_mad_m_inl0_hrd0.txt
8 /users/valassia/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_ggttgg_mad/log_ggttgg_mad_d_inl0_hrd0.txt
8 /users/valassia/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_ggttgg_mad/log_ggttgg_mad_f_inl0_hrd0.txt
8 /users/valassia/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_ggttgg_mad/log_ggttgg_mad_m_inl0_hrd0.txt
8 /users/valassia/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_ggttg_mad/log_ggttg_mad_d_inl0_hrd0.txt
8 /users/valassia/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_ggttg_mad/log_ggttg_mad_f_inl0_hrd0.txt
8 /users/valassia/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_ggttg_mad/log_ggttg_mad_m_inl0_hrd0.txt
8 /users/valassia/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_ggtt_mad/log_ggtt_mad_d_inl0_hrd0.txt
8 /users/valassia/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_ggtt_mad/log_ggtt_mad_f_inl0_hrd0.txt
8 /users/valassia/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_ggtt_mad/log_ggtt_mad_m_inl0_hrd0.txt
8 /users/valassia/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_gqttq_mad/log_gqttq_mad_d_inl0_hrd0.txt
8 /users/valassia/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_gqttq_mad/log_gqttq_mad_f_inl0_hrd0.txt
8 /users/valassia/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_gqttq_mad/log_gqttq_mad_m_inl0_hrd0.txt
8 /users/valassia/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_heftggbb_mad/log_heftggbb_mad_d_inl0_hrd0.txt
1 /users/valassia/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_heftggbb_mad/log_heftggbb_mad_f_inl0_hrd0.txt
8 /users/valassia/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_heftggbb_mad/log_heftggbb_mad_m_inl0_hrd0.txt
8 /users/valassia/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_smeftggtttt_mad/log_smeftggtttt_mad_d_inl0_hrd0.txt
8 /users/valassia/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_smeftggtttt_mad/log_smeftggtttt_mad_f_inl0_hrd0.txt
8 /users/valassia/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_smeftggtttt_mad/log_smeftggtttt_mad_m_inl0_hrd0.txt
8 /users/valassia/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_susyggt1t1_mad/log_susyggt1t1_mad_d_inl0_hrd0.txt
8 /users/valassia/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_susyggt1t1_mad/log_susyggt1t1_mad_f_inl0_hrd0.txt
8 /users/valassia/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_susyggt1t1_mad/log_susyggt1t1_mad_m_inl0_hrd0.txt
8 /users/valassia/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_susyggtt_mad/log_susyggtt_mad_d_inl0_hrd0.txt
8 /users/valassia/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_susyggtt_mad/log_susyggtt_mad_f_inl0_hrd0.txt
8 /users/valassia/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_susyggtt_mad/log_susyggtt_mad_m_inl0_hrd0.txt

tmad/logs_heftggbb_mad/log_heftggbb_mad_f_inl0_hrd0.txt:ERROR! events.lhe.cpp.1 and events.lhe.ref.1 differ!

No asserts found in logs

No segmentation fault found in logs
…ck_ihel3_sep25/itscrd90 logs

Revert "[hack_ihel4_sep25] rerun 30 tmad tests on LUMI"
This reverts commit 6dbe816.

Revert "[hack_ihel4_sep25] rerun 132 (96 + 12 blas + 18 scaling + 6 blas/scaling) tput tests on LUMI"
This reverts commit b752ca3.
…ling) tput tests on itscrd90

(after tuning scripts and fixing issues in mg5amcnlo submodule)

With respect to the last itscrd90 logs for the 'hack_ihel3_sep25' codebase (commit 10c3e3b):
1) With blas disabled at runtime
- gpu throughput scaling is worse (it picks up at larger grids)
- gpu throughput is a factor ~100 worse at small grids and ~4 worse at large grids for ggttggg
- gpu throughput is a factor ~10 worse at small grids and ~10 worse at large grids for ggttg
- gpu throughput is a factor ~2 worse at small grids and ~10 worse at large grids for ggtt
- c++ throughputs are also 30% worse for ggttggg (and somewhat worse also for simpler processes)
2) With blas enabled at runtime
- gpu throughput is also much worse at small grids and large grids for ggttgg and ggtt
- strangely, now blasOn and blasOff results are essentially the same (jamps is so bad that blas does not matter)

So, overall ihel4 is much worse than ihel3 on NVidia GPUs with CUDA (and also worse on Intel CPUs)

STARTED  AT Wed Sep 24 07:34:30 AM CEST 2025
./tput/teeThroughputX.sh -dmf -hrd -makej -eemumu -ggtt -ggttg -ggttgg -gqttq -ggttggg -makeclean
ENDED(1) AT Wed Sep 24 08:15:46 AM CEST 2025 [Status=0]
./tput/teeThroughputX.sh -dmf -makej -eemumu -ggtt -ggttg -ggttgg -gqttq -ggttggg -scaling
ENDED(1-scaling) AT Wed Sep 24 08:31:02 AM CEST 2025 [Status=0]
./tput/teeThroughputX.sh -ggtt -ggttgg -dmf -blasOn
ENDED(2) AT Wed Sep 24 08:36:19 AM CEST 2025 [Status=0]
./tput/teeThroughputX.sh -ggtt -ggttgg -dmf -blasOn -scaling
ENDED(2-scaling) AT Wed Sep 24 08:41:35 AM CEST 2025 [Status=0]
./tput/teeThroughputX.sh -d_f -hrd -makej -eemumu -ggtt -ggttgg -inlonly -makeclean
ENDED(3) AT Wed Sep 24 08:51:23 AM CEST 2025 [Status=0]
./tput/teeThroughputX.sh -makej -eemumu -ggtt -ggttg -gqttq -ggttgg -ggttggg -d_f -bridge -makeclean
ENDED(4) AT Wed Sep 24 09:05:21 AM CEST 2025 [Status=0]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -d_f -rmbhst
ENDED(5) AT Wed Sep 24 09:09:16 AM CEST 2025 [Status=0]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -d_f -curhst
ENDED(6) AT Wed Sep 24 09:13:15 AM CEST 2025 [Status=0]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -d_f -common
ENDED(7) AT Wed Sep 24 09:17:20 AM CEST 2025 [Status=0]
./tput/teeThroughputX.sh -ggtt -ggttgg -dmf -noBlas -makeclean
ENDED(8) AT Wed Sep 24 09:24:32 AM CEST 2025 [Status=0]
./tput/teeThroughputX.sh -dmf -hrd -makej -susyggtt -susyggt1t1 -smeftggtttt -heftggbb -makeclean
ENDED(9) AT Wed Sep 24 09:39:46 AM CEST 2025 [Status=0]

No errors found in logs

No FPEs or '{ }' found in logs

No aborts found in logs

./tput/logs_ggttggg_mad/log_ggttggg_mad_d_inl0_hrd0.scaling:check_cuda.exe: Assertion `code == gpuSuccess' failed.
./tput/logs_ggttggg_mad/log_ggttggg_mad_f_inl0_hrd0.scaling:check_cuda.exe: Assertion `code == gpuSuccess' failed.
./tput/logs_ggttggg_mad/log_ggttggg_mad_m_inl0_hrd0.scaling:check_cuda.exe: Assertion `code == gpuSuccess' failed.
./tput/logs_ggttgg_mad/log_ggttgg_mad_d_inl0_hrd0_blasOn.scaling:check_cuda.exe: Assertion `code == gpuSuccess' failed.
./tput/logs_ggttgg_mad/log_ggttgg_mad_d_inl0_hrd0.scaling:check_cuda.exe: Assertion `code == gpuSuccess' failed.
./tput/logs_ggttgg_mad/log_ggttgg_mad_f_inl0_hrd0_blasOn.scaling:check_cuda.exe: Assertion `code == gpuSuccess' failed.
./tput/logs_ggttgg_mad/log_ggttgg_mad_f_inl0_hrd0.scaling:check_cuda.exe: Assertion `code == gpuSuccess' failed.
./tput/logs_ggttgg_mad/log_ggttgg_mad_m_inl0_hrd0_blasOn.scaling:check_cuda.exe: Assertion `code == gpuSuccess' failed.
./tput/logs_ggttgg_mad/log_ggttgg_mad_m_inl0_hrd0.scaling:check_cuda.exe: Assertion `code == gpuSuccess' failed.
…s on itscrd90

(after tuning scripts and fixing issues in mg5amcnlo submodule)

With respect to the last itscrd90 logs for the 'hack_ihel3_sep25' codebase (commit 10c3e3b):
(Note: blas is disabled at runtime in tmad tests)
- gpu throughputs are a factor ~4 slower in ggttggg and ~2 slower in ggtt
- c++ is around 30% slower in ggttggg and somewhat slower in simpler processes

So, overall ihel4 is much worse than ihel3 on NVidia GPUs with CUDA (and also worse on Intel CPUs)

STARTED  AT Wed Sep 24 09:39:46 AM CEST 2025
(SM tests)
ENDED(1) AT Wed Sep 24 10:44:58 AM CEST 2025 [Status=0]
(BSM tests)
ENDED(1) AT Wed Sep 24 10:48:59 AM CEST 2025 [Status=0]

12 /data/avalassi/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_eemumu_mad/log_eemumu_mad_d_inl0_hrd0.txt
12 /data/avalassi/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_eemumu_mad/log_eemumu_mad_f_inl0_hrd0.txt
12 /data/avalassi/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_eemumu_mad/log_eemumu_mad_m_inl0_hrd0.txt
12 /data/avalassi/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_ggttggg_mad/log_ggttggg_mad_d_inl0_hrd0.txt
12 /data/avalassi/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_ggttggg_mad/log_ggttggg_mad_f_inl0_hrd0.txt
12 /data/avalassi/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_ggttggg_mad/log_ggttggg_mad_m_inl0_hrd0.txt
12 /data/avalassi/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_ggttgg_mad/log_ggttgg_mad_d_inl0_hrd0.txt
12 /data/avalassi/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_ggttgg_mad/log_ggttgg_mad_f_inl0_hrd0.txt
12 /data/avalassi/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_ggttgg_mad/log_ggttgg_mad_m_inl0_hrd0.txt
12 /data/avalassi/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_ggttg_mad/log_ggttg_mad_d_inl0_hrd0.txt
12 /data/avalassi/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_ggttg_mad/log_ggttg_mad_f_inl0_hrd0.txt
12 /data/avalassi/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_ggttg_mad/log_ggttg_mad_m_inl0_hrd0.txt
12 /data/avalassi/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_ggtt_mad/log_ggtt_mad_d_inl0_hrd0.txt
12 /data/avalassi/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_ggtt_mad/log_ggtt_mad_f_inl0_hrd0.txt
12 /data/avalassi/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_ggtt_mad/log_ggtt_mad_m_inl0_hrd0.txt
12 /data/avalassi/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_gqttq_mad/log_gqttq_mad_d_inl0_hrd0.txt
12 /data/avalassi/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_gqttq_mad/log_gqttq_mad_f_inl0_hrd0.txt
12 /data/avalassi/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_gqttq_mad/log_gqttq_mad_m_inl0_hrd0.txt
12 /data/avalassi/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_heftggbb_mad/log_heftggbb_mad_d_inl0_hrd0.txt
1 /data/avalassi/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_heftggbb_mad/log_heftggbb_mad_f_inl0_hrd0.txt
12 /data/avalassi/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_heftggbb_mad/log_heftggbb_mad_m_inl0_hrd0.txt
12 /data/avalassi/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_smeftggtttt_mad/log_smeftggtttt_mad_d_inl0_hrd0.txt
12 /data/avalassi/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_smeftggtttt_mad/log_smeftggtttt_mad_f_inl0_hrd0.txt
12 /data/avalassi/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_smeftggtttt_mad/log_smeftggtttt_mad_m_inl0_hrd0.txt
12 /data/avalassi/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_susyggt1t1_mad/log_susyggt1t1_mad_d_inl0_hrd0.txt
12 /data/avalassi/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_susyggt1t1_mad/log_susyggt1t1_mad_f_inl0_hrd0.txt
12 /data/avalassi/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_susyggt1t1_mad/log_susyggt1t1_mad_m_inl0_hrd0.txt
12 /data/avalassi/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_susyggtt_mad/log_susyggtt_mad_d_inl0_hrd0.txt
12 /data/avalassi/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_susyggtt_mad/log_susyggtt_mad_f_inl0_hrd0.txt
12 /data/avalassi/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_susyggtt_mad/log_susyggtt_mad_m_inl0_hrd0.txt

tmad/logs_heftggbb_mad/log_heftggbb_mad_f_inl0_hrd0.txt:ERROR! events.lhe.cpp.1 and events.lhe.ref.1 differ!

No asserts found in logs

No segmentation fault found in logs
@valassi valassi requested a review from a team as a code owner October 6, 2025 17:46
@valassi valassi marked this pull request as draft October 6, 2025 17:46