Conversation


@valassi commented Oct 6, 2025

Hi @oliviermattelaer, as discussed recently and as per my presentation at the MG5AMC meeting last Friday.

This is the PR for my kernel splitting changes that I recommend NOT merging:

  • ihel4: Feynman diagram kernels

I have prepared a paper, which will shortly appear on arXiv, with all the details.

I file this for the record and as a fully functional proof of concept that can be used as the basis for further developments (I will probably try one small thing in addition).

Note, like the previous PR #1049, until yesterday it would have been possible to merge this automatically, as I had merged the latest upstream into my developments. Yesterday some new changes were merged (for tREX I think), so this now has conflicts. I suggest that this should eventually be closed as not merged, but I will in any case port this to the level of the other one, when the tREX conflicts are solved.

Thanks
Andrea

…city per kernel - many failures in some processes

STARTED  AT Sun Nov  3 08:23:00 PM CET 2024
./tput/teeThroughputX.sh -dmf -hrd -makej -eemumu -ggtt -ggttg -ggttgg -gqttq -ggttggg -makeclean
ENDED(1) AT Sun Nov  3 10:54:35 PM CET 2024 [Status=0]
./tput/teeThroughputX.sh -d_f -hrd -makej -eemumu -ggtt -ggttgg -inlonly -makeclean
ENDED(2) AT Sun Nov  3 11:14:25 PM CET 2024 [Status=0]
./tput/teeThroughputX.sh -makej -eemumu -ggtt -ggttg -gqttq -ggttgg -ggttggg -d_f -bridge -makeclean
ENDED(3) AT Sun Nov  3 11:23:09 PM CET 2024 [Status=0]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -d_f -rmbhst
ENDED(4) AT Sun Nov  3 11:25:50 PM CET 2024 [Status=0]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -d_f -curhst
ENDED(5) AT Sun Nov  3 11:28:28 PM CET 2024 [Status=0]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -d_f -common
ENDED(6) AT Sun Nov  3 11:31:11 PM CET 2024 [Status=0]
./tput/teeThroughputX.sh -dmf -hrd -makej -susyggtt -susyggt1t1 -smeftggtttt -heftggbb -makeclean
ENDED(7) AT Mon Nov  4 12:01:02 AM CET 2024 [Status=0]

./tput/logs_eemumu_mad/log_eemumu_mad_d_inl0_hrd0_bridge.txt: 2 FAILED TESTS
./tput/logs_eemumu_mad/log_eemumu_mad_d_inl0_hrd0_common.txt: 2 FAILED TESTS
./tput/logs_eemumu_mad/log_eemumu_mad_d_inl0_hrd0_curhst.txt: 2 FAILED TESTS
./tput/logs_eemumu_mad/log_eemumu_mad_d_inl0_hrd0_rmbhst.txt: 2 FAILED TESTS
./tput/logs_eemumu_mad/log_eemumu_mad_d_inl0_hrd0.txt: 2 FAILED TESTS
./tput/logs_eemumu_mad/log_eemumu_mad_d_inl0_hrd1.txt: 2 FAILED TESTS
./tput/logs_eemumu_mad/log_eemumu_mad_d_inl1_hrd0.txt: 2 FAILED TESTS
./tput/logs_eemumu_mad/log_eemumu_mad_d_inl1_hrd1.txt: 2 FAILED TESTS
./tput/logs_eemumu_mad/log_eemumu_mad_f_inl0_hrd0_bridge.txt: 2 FAILED TESTS
./tput/logs_eemumu_mad/log_eemumu_mad_f_inl0_hrd0_common.txt: 2 FAILED TESTS
./tput/logs_eemumu_mad/log_eemumu_mad_f_inl0_hrd0_curhst.txt: 2 FAILED TESTS
./tput/logs_eemumu_mad/log_eemumu_mad_f_inl0_hrd0_rmbhst.txt: 2 FAILED TESTS
./tput/logs_eemumu_mad/log_eemumu_mad_f_inl0_hrd0.txt: 2 FAILED TESTS
./tput/logs_eemumu_mad/log_eemumu_mad_f_inl0_hrd1.txt: 2 FAILED TESTS
./tput/logs_eemumu_mad/log_eemumu_mad_f_inl1_hrd0.txt: 2 FAILED TESTS
./tput/logs_eemumu_mad/log_eemumu_mad_f_inl1_hrd1.txt: 2 FAILED TESTS
./tput/logs_eemumu_mad/log_eemumu_mad_m_inl0_hrd0.txt: 2 FAILED TESTS
./tput/logs_eemumu_mad/log_eemumu_mad_m_inl0_hrd1.txt: 2 FAILED TESTS
./tput/logs_ggttggg_mad/log_ggttggg_mad_d_inl0_hrd0_bridge.txt: 2 FAILED TESTS
./tput/logs_ggttggg_mad/log_ggttggg_mad_d_inl0_hrd0.txt: 2 FAILED TESTS
./tput/logs_ggttggg_mad/log_ggttggg_mad_d_inl0_hrd1.txt: 2 FAILED TESTS
./tput/logs_ggttggg_mad/log_ggttggg_mad_f_inl0_hrd0_bridge.txt: 2 FAILED TESTS
./tput/logs_ggttggg_mad/log_ggttggg_mad_f_inl0_hrd0.txt: 2 FAILED TESTS
./tput/logs_ggttggg_mad/log_ggttggg_mad_f_inl0_hrd1.txt: 2 FAILED TESTS
./tput/logs_ggttggg_mad/log_ggttggg_mad_m_inl0_hrd0.txt: 2 FAILED TESTS
./tput/logs_ggttggg_mad/log_ggttggg_mad_m_inl0_hrd1.txt: 2 FAILED TESTS
./tput/logs_ggttgg_mad/log_ggttgg_mad_d_inl0_hrd0_bridge.txt: 2 FAILED TESTS
./tput/logs_ggttgg_mad/log_ggttgg_mad_d_inl0_hrd0_common.txt: 2 FAILED TESTS
./tput/logs_ggttgg_mad/log_ggttgg_mad_d_inl0_hrd0_curhst.txt: 2 FAILED TESTS
./tput/logs_ggttgg_mad/log_ggttgg_mad_d_inl0_hrd0_rmbhst.txt: 2 FAILED TESTS
./tput/logs_ggttgg_mad/log_ggttgg_mad_d_inl0_hrd0.txt: 2 FAILED TESTS
./tput/logs_ggttgg_mad/log_ggttgg_mad_d_inl0_hrd1.txt: 2 FAILED TESTS
./tput/logs_ggttgg_mad/log_ggttgg_mad_d_inl1_hrd0.txt: 2 FAILED TESTS
./tput/logs_ggttgg_mad/log_ggttgg_mad_d_inl1_hrd1.txt: 2 FAILED TESTS
./tput/logs_ggttgg_mad/log_ggttgg_mad_f_inl0_hrd0_bridge.txt: 2 FAILED TESTS
./tput/logs_ggttgg_mad/log_ggttgg_mad_f_inl0_hrd0_common.txt: 2 FAILED TESTS
./tput/logs_ggttgg_mad/log_ggttgg_mad_f_inl0_hrd0_curhst.txt: 2 FAILED TESTS
./tput/logs_ggttgg_mad/log_ggttgg_mad_f_inl0_hrd0_rmbhst.txt: 2 FAILED TESTS
./tput/logs_ggttgg_mad/log_ggttgg_mad_f_inl0_hrd0.txt: 2 FAILED TESTS
./tput/logs_ggttgg_mad/log_ggttgg_mad_f_inl0_hrd1.txt: 2 FAILED TESTS
./tput/logs_ggttgg_mad/log_ggttgg_mad_f_inl1_hrd0.txt: 2 FAILED TESTS
./tput/logs_ggttgg_mad/log_ggttgg_mad_f_inl1_hrd1.txt: 2 FAILED TESTS
./tput/logs_ggttgg_mad/log_ggttgg_mad_m_inl0_hrd0.txt: 2 FAILED TESTS
./tput/logs_ggttgg_mad/log_ggttgg_mad_m_inl0_hrd1.txt: 2 FAILED TESTS
./tput/logs_gqttq_mad/log_gqttq_mad_d_inl0_hrd0_bridge.txt: 2 FAILED TESTS
./tput/logs_gqttq_mad/log_gqttq_mad_d_inl0_hrd0.txt: 2 FAILED TESTS
./tput/logs_gqttq_mad/log_gqttq_mad_d_inl0_hrd1.txt: 2 FAILED TESTS
./tput/logs_gqttq_mad/log_gqttq_mad_f_inl0_hrd0_bridge.txt: 2 FAILED TESTS
./tput/logs_gqttq_mad/log_gqttq_mad_f_inl0_hrd0.txt: 2 FAILED TESTS
./tput/logs_gqttq_mad/log_gqttq_mad_f_inl0_hrd1.txt: 2 FAILED TESTS
./tput/logs_gqttq_mad/log_gqttq_mad_m_inl0_hrd0.txt: 2 FAILED TESTS
./tput/logs_gqttq_mad/log_gqttq_mad_m_inl0_hrd1.txt: 2 FAILED TESTS
./tput/logs_smeftggtttt_mad/log_smeftggtttt_mad_d_inl0_hrd0.txt: 2 FAILED TESTS
./tput/logs_smeftggtttt_mad/log_smeftggtttt_mad_d_inl0_hrd1.txt: 2 FAILED TESTS
./tput/logs_smeftggtttt_mad/log_smeftggtttt_mad_f_inl0_hrd0.txt: 2 FAILED TESTS
./tput/logs_smeftggtttt_mad/log_smeftggtttt_mad_f_inl0_hrd1.txt: 2 FAILED TESTS
./tput/logs_smeftggtttt_mad/log_smeftggtttt_mad_m_inl0_hrd0.txt: 2 FAILED TESTS
./tput/logs_smeftggtttt_mad/log_smeftggtttt_mad_m_inl0_hrd1.txt: 2 FAILED TESTS
…nel - many failures in various processes

STARTED  AT Mon Nov  4 12:01:02 AM CET 2024
(SM tests)
ENDED(1) AT Mon Nov  4 01:14:56 AM CET 2024 [Status=0]
(BSM tests)
ENDED(1) AT Mon Nov  4 01:20:26 AM CET 2024 [Status=0]

0 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_eemumu_mad/log_eemumu_mad_d_inl0_hrd0.txt
0 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_eemumu_mad/log_eemumu_mad_f_inl0_hrd0.txt
0 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_eemumu_mad/log_eemumu_mad_m_inl0_hrd0.txt
0 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttggg_mad/log_ggttggg_mad_d_inl0_hrd0.txt
0 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttggg_mad/log_ggttggg_mad_f_inl0_hrd0.txt
0 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttggg_mad/log_ggttggg_mad_m_inl0_hrd0.txt
0 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttgg_mad/log_ggttgg_mad_d_inl0_hrd0.txt
0 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttgg_mad/log_ggttgg_mad_f_inl0_hrd0.txt
0 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttgg_mad/log_ggttgg_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttg_mad/log_ggttg_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttg_mad/log_ggttg_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttg_mad/log_ggttg_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggtt_mad/log_ggtt_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggtt_mad/log_ggtt_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggtt_mad/log_ggtt_mad_m_inl0_hrd0.txt
0 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_gqttq_mad/log_gqttq_mad_d_inl0_hrd0.txt
0 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_gqttq_mad/log_gqttq_mad_f_inl0_hrd0.txt
0 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_gqttq_mad/log_gqttq_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_heftggbb_mad/log_heftggbb_mad_d_inl0_hrd0.txt
1 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_heftggbb_mad/log_heftggbb_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_heftggbb_mad/log_heftggbb_mad_m_inl0_hrd0.txt
0 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_smeftggtttt_mad/log_smeftggtttt_mad_d_inl0_hrd0.txt
0 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_smeftggtttt_mad/log_smeftggtttt_mad_f_inl0_hrd0.txt
0 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_smeftggtttt_mad/log_smeftggtttt_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggt1t1_mad/log_susyggt1t1_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggt1t1_mad/log_susyggt1t1_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggt1t1_mad/log_susyggt1t1_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggtt_mad/log_susyggtt_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggtt_mad/log_susyggtt_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggtt_mad/log_susyggtt_mad_m_inl0_hrd0.txt
…hat are only needed for multichannel (fix build warnings)
… (#ifdef out variables that are only needed for multichannel)
…ad after fixing build warning for color selection
… to 4 (as in upstream/master) instead of 256 - runTest now passes

This is the result of careful debugging comparing results to upstream/master using printf.

The question is, why is CODEGEN now giving 256 instead of 4?...
…ators instead of the hardcoded 256 - this fixes ee_mumu runTest
…ass DeviceAccessJamp2 for decoding allJamp2s buffers
…city per kernel - finally all ok!

STARTED  AT Tue Nov  5 07:29:55 AM CET 2024
./tput/teeThroughputX.sh -dmf -hrd -makej -eemumu -ggtt -ggttg -ggttgg -gqttq -ggttggg -makeclean
ENDED(1) AT Tue Nov  5 10:07:22 AM CET 2024 [Status=0]
./tput/teeThroughputX.sh -d_f -hrd -makej -eemumu -ggtt -ggttgg -inlonly -makeclean
ENDED(2) AT Tue Nov  5 10:27:27 AM CET 2024 [Status=0]
./tput/teeThroughputX.sh -makej -eemumu -ggtt -ggttg -gqttq -ggttgg -ggttggg -d_f -bridge -makeclean
ENDED(3) AT Tue Nov  5 10:36:27 AM CET 2024 [Status=0]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -d_f -rmbhst
ENDED(4) AT Tue Nov  5 10:39:13 AM CET 2024 [Status=0]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -d_f -curhst
ENDED(5) AT Tue Nov  5 10:41:58 AM CET 2024 [Status=0]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -d_f -common
ENDED(6) AT Tue Nov  5 10:44:47 AM CET 2024 [Status=0]
./tput/teeThroughputX.sh -dmf -hrd -makej -susyggtt -susyggt1t1 -smeftggtttt -heftggbb -makeclean
ENDED(7) AT Tue Nov  5 11:11:29 AM CET 2024 [Status=0]

No errors found in logs

No FPEs or '{ }' found in logs

(Note1): there seems to be a performance advantage, but not as big as I was expecting

(Note2): the profiling scripts must be fixed, now sigmaKin is no longer a kernel!

git diff 6be9482 --no-ext-diff tput/logs_ggttggg_mad/log_ggttggg_mad_m_inl0_hrd0.txt

 runNcu /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttggg.mad/SubProcesses/P1_gg_ttxggg/build.cuda_m_inl0_hrd0/check_cuda.exe -p 1 256 1
-==PROF== Profiling "sigmaKin": launch__registers_per_thread 255
-==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 100%
+==WARNING== No kernels were profiled.
+   launch__registers_per_thread N/A
+   sm__sass_average_branch_targets_threads_uniform.pct N/A
 .........................................................................
 runExe /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttggg.mad/SubProcesses/P1_gg_ttxggg/build.cuda_m_inl0_hrd0/check_cuda.exe -p 64 256 1 OMP=
 INFO: The following Floating Point Exceptions will cause SIGFPE program aborts: FE_DIVBYZERO, FE_INVALID, FE_OVERFLOW
 Process                     = SIGMA_SM_GG_TTXGGG_CUDA [nvcc 12.0.140 (gcc 11.3.1)] [inlineHel=0] [hardcodePARAM=0]
 Workflow summary            = CUD:MIX+THX:CURDEV+RMBDEV+MESDEV/none+NAVBRK
 FP precision                = MIXED (NaN/abnormal=0, zero=0)
-EvtsPerSec[Rmb+ME]     (23) = ( 1.108221e+04                 )  sec^-1
-EvtsPerSec[MatrixElems] (3) = ( 1.108518e+04                 )  sec^-1
-EvtsPerSec[MECalcOnly] (3a) = ( 1.108553e+04                 )  sec^-1
+EvtsPerSec[Rmb+ME]     (23) = ( 1.292099e+04                 )  sec^-1
+EvtsPerSec[MatrixElems] (3) = ( 1.292537e+04                 )  sec^-1
+EvtsPerSec[MECalcOnly] (3a) = ( 1.292566e+04                 )  sec^-1
 MeanMatrixElemValue         = ( 1.856249e-04 +- 8.329951e-05 )  GeV^-6
-TOTAL       :     3.432158 sec
+TOTAL       :     2.037956 sec
 INFO: No Floating Point Exceptions have been reported
-    11,402,912,009      cycles                           #    3.032 GHz
-    24,689,535,297      instructions                     #    2.17  insn per cycle
-       3.818442336 seconds time elapsed
+     6,757,906,381      cycles                           #    2.881 GHz
+    14,685,293,303      instructions                     #    2.17  insn per cycle
+       2.402242543 seconds time elapsed
…itscrd90 with one helicity per kernel - finally also all ok!

STARTED  AT Tue Nov  5 11:11:29 AM CET 2024
(SM tests)
ENDED(1) AT Tue Nov  5 03:09:08 PM CET 2024 [Status=0]
(BSM tests)
ENDED(1) AT Tue Nov  5 03:19:40 PM CET 2024 [Status=0]

24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_eemumu_mad/log_eemumu_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_eemumu_mad/log_eemumu_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_eemumu_mad/log_eemumu_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttggg_mad/log_ggttggg_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttggg_mad/log_ggttggg_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttggg_mad/log_ggttggg_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttgg_mad/log_ggttgg_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttgg_mad/log_ggttgg_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttgg_mad/log_ggttgg_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttg_mad/log_ggttg_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttg_mad/log_ggttg_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttg_mad/log_ggttg_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggtt_mad/log_ggtt_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggtt_mad/log_ggtt_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggtt_mad/log_ggtt_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_gqttq_mad/log_gqttq_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_gqttq_mad/log_gqttq_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_gqttq_mad/log_gqttq_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_heftggbb_mad/log_heftggbb_mad_d_inl0_hrd0.txt
1 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_heftggbb_mad/log_heftggbb_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_heftggbb_mad/log_heftggbb_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_smeftggtttt_mad/log_smeftggtttt_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_smeftggtttt_mad/log_smeftggtttt_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_smeftggtttt_mad/log_smeftggtttt_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggt1t1_mad/log_susyggt1t1_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggt1t1_mad/log_susyggt1t1_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggt1t1_mad/log_susyggt1t1_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggtt_mad/log_susyggtt_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggtt_mad/log_susyggtt_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggtt_mad/log_susyggtt_mad_m_inl0_hrd0.txt

(Note1): there seems to be a performance advantage, but this remains to be better understood
What is interesting is the behaviour for fewer events (8192 rather than 81920)
However this may be due to a different accounting of helicity filtering rather than to a real speedup...
What is strange is that this appears for ggtt but not for ggttggg
TODO: should try to reduce below 8192 and see what this gives...?

git diff 6be9482 --no-ext-diff tmad/logs_ggtt_mad/log_ggtt_mad_m_inl0_hrd0.txt

@@ -534,10 +534,10 @@ DEBUG: MEK processed 8192 events across 3 channels { 1 : 8192 }
  [XSECTION] ChannelId = 1
  [XSECTION] Cross section = 47.14 [47.138611963547788] fbridge_mode=1
  [UNWEIGHT] Wrote 1618 events (found 1623 events)
- [COUNTERS] PROGRAM TOTAL          :    0.8403s
- [COUNTERS] Fortran Overhead ( 0 ) :    0.8366s
- [COUNTERS] CudaCpp MEs      ( 2 ) :    0.0030s for     8192 events => throughput is 2.75E+06 events/s
- [COUNTERS] CudaCpp HEL      ( 3 ) :    0.0007s
+ [COUNTERS] PROGRAM TOTAL          :    0.8671s
+ [COUNTERS] Fortran Overhead ( 0 ) :    0.8626s
+ [COUNTERS] CudaCpp MEs      ( 2 ) :    0.0009s for     8192 events => throughput is 8.99E+06 events/s
+ [COUNTERS] CudaCpp HEL      ( 3 ) :    0.0036s

 *** (3-cuda) Compare MADEVENT_CUDA x1 xsec to MADEVENT_FORTRAN xsec ***

@@ -569,10 +569,10 @@ DEBUG: MEK processed 81920 events across 3 channels { 1 : 81920 }
  [XSECTION] ChannelId = 1
  [XSECTION] Cross section = 47.14 [47.144596232269095] fbridge_mode=1
  [UNWEIGHT] Wrote 1613 events (found 1618 events)
- [COUNTERS] PROGRAM TOTAL          :    1.9861s
- [COUNTERS] Fortran Overhead ( 0 ) :    1.9767s
- [COUNTERS] CudaCpp MEs      ( 2 ) :    0.0087s for    81920 events => throughput is 9.38E+06 events/s
- [COUNTERS] CudaCpp HEL      ( 3 ) :    0.0007s
+ [COUNTERS] PROGRAM TOTAL          :    2.0740s
+ [COUNTERS] Fortran Overhead ( 0 ) :    2.0624s
+ [COUNTERS] CudaCpp MEs      ( 2 ) :    0.0082s for    81920 events => throughput is 1.00E+07 events/s
+ [COUNTERS] CudaCpp HEL      ( 3 ) :    0.0034s

(Note2): as a better check of performance speedups, I have run this test on gg_ttgg.mad
There is a speedup, but not huge - most likely it is necessary to move to cuda streams

for b in 1 2 4 8 16 32 64 128 256 512 1024; do \
  ./build.cuda_m_inl0_hrd0/check_cuda.exe -p $b 256 1 | \grep 'EvtsPerSec\[MECalcOnly\]' |\
  awk -vb=$b '{printf "%s %4d %3d\n", $5, b, 256}'; done

(Step 0)
In upstream/master before moving to one helicity per thread
8.174664e+03    1 256
1.646385e+04    2 256
3.278964e+04    4 256
6.259986e+04    8 256
1.200837e+05   16 256
2.157126e+05   32 256
3.311354e+05   64 256
3.529729e+05  128 256
3.688460e+05  256 256
4.013773e+05  512 256
4.139693e+05 1024 256

(Step 1a)
After moving to one helicity per thread, without using cuda streams yet
1.434033e+04    1 256
2.851371e+04    2 256
5.646722e+04    4 256
7.650805e+04    8 256
1.422132e+05   16 256
2.452440e+05   32 256
3.680172e+05   64 256
3.658933e+05  128 256
3.631490e+05  256 256
3.921660e+05  512 256
4.038302e+05 1024 256
…ng cuda streams

Repeat the manual tests to check the performance advantage on gg_ttgg.mad

for b in 1 2 4 8 16 32 64 128 256 512 1024; do \
  ./build.cuda_m_inl0_hrd0/check_cuda.exe -p $b 256 1 | \grep 'EvtsPerSec\[MECalcOnly\]' |\
  awk -vb=$b '{printf "%s %4d %3d\n", $5, b, 256}'; done

(Step 1a)
Previously, after moving to one helicity per thread, without using cuda streams yet
1.434033e+04    1 256
2.851371e+04    2 256
5.646722e+04    4 256
7.650805e+04    8 256
1.422132e+05   16 256
2.452440e+05   32 256
3.680172e+05   64 256
3.658933e+05  128 256
3.631490e+05  256 256
3.921660e+05  512 256
4.038302e+05 1024 256

In this commit, after moving to cuda streams:
1.385174e+04    1 256
2.748146e+04    2 256
5.451755e+04    4 256
7.480672e+04    8 256
1.394517e+05   16 256
2.408220e+05   32 256
3.646268e+05   64 256
3.645053e+05  128 256
3.621090e+05  256 256
3.956870e+05  512 256
4.033960e+05 1024 256

So it seems that there is no obvious performance advantage yet - there is still something to be fixed
…defer memcpy after helicity loop) - BUT runTest fails

The following results for gg_ttgg.mad show the performance speedup from the use of streams

(Step 0)
In upstream/master before moving to one helicity per thread
8.174664e+03    1 256
1.646385e+04    2 256
3.278964e+04    4 256
6.259986e+04    8 256
1.200837e+05   16 256
2.157126e+05   32 256
3.311354e+05   64 256
3.529729e+05  128 256
3.688460e+05  256 256
4.013773e+05  512 256
4.139693e+05 1024 256

(Step 1a)
After moving to one helicity per thread, but before moving to cuda streams
1.434033e+04    1 256
2.851371e+04    2 256
5.646722e+04    4 256
7.650805e+04    8 256
1.422132e+05   16 256
2.452440e+05   32 256
3.680172e+05   64 256
3.658933e+05  128 256
3.631490e+05  256 256
3.921660e+05  512 256
4.038302e+05 1024 256

(current commit)
In the current commit after adding and fixing cuda stream parallelism (defer memcpy after the helicity loop on kernels)
2.502965e+05    1 256
3.191079e+05    2 256
3.340419e+05    4 256
3.697397e+05    8 256
3.897092e+05   16 256
3.912757e+05   32 256
3.917159e+05   64 256
3.974842e+05  128 256
4.029707e+05  256 256
4.103303e+05  512 256
4.068449e+05 1024 256

This is a huge performance boost, but the implementation needs functional fixes...
(A posteriori: the issue here was that numerators/denominators would need an atomicAdd, or a different implementation)
…efunctions as the kernel to be profiled with nsight
…ormance and streams functionality!

All runTests now succeed for cuda and c++ with/without multichannel

The issue was that numerators, denominators, jamps were sums over helicities, but '+=' is not thread safe
- For numerators and denominators, this has been fixed by using a superbuffer with one event buffer per helicity
- For jamp2s, this has been solved using atomicAdd

The overall performance boost is impressive.
Using the same test in gg_ttgg.mad as done previously:

for b in 1 2 4 8 16 32 64 128 256 512 1024; do \
  ./build.cuda_m_inl0_hrd0/check_cuda.exe -p $b 256 1 | \grep 'EvtsPerSec\[MECalcOnly\]' |\
  awk -vb=$b '{printf "%s %4d %3d\n", $5, b, 256}'; done

(Step 0)
In upstream/master before moving to one helicity per thread
8.174664e+03    1 256
1.646385e+04    2 256
3.278964e+04    4 256
6.259986e+04    8 256
1.200837e+05   16 256
2.157126e+05   32 256
3.311354e+05   64 256
3.529729e+05  128 256
3.688460e+05  256 256
4.013773e+05  512 256
4.139693e+05 1024 256

(Step 1a)
After moving to one helicity per thread, but before moving to cuda streams
1.434033e+04    1 256
2.851371e+04    2 256
5.646722e+04    4 256
7.650805e+04    8 256
1.422132e+05   16 256
2.452440e+05   32 256
3.680172e+05   64 256
3.658933e+05  128 256
3.631490e+05  256 256
3.921660e+05  512 256
4.038302e+05 1024 256

(Step 1b - this commit)
2.731214e+05    1 256
3.591143e+05    2 256
3.542609e+05    4 256
3.840352e+05    8 256
3.978597e+05   16 256
3.979454e+05   32 256
3.961940e+05   64 256
4.054739e+05  128 256
4.048128e+05  256 256
4.168482e+05  512 256
4.132319e+05 1024 256
…d functionality with cuda streams (end of part 1b)
…city per kernel and cuda stream - all ok!

STARTED  AT Wed Nov  6 10:54:42 AM CET 2024
./tput/teeThroughputX.sh -dmf -hrd -makej -eemumu -ggtt -ggttg -ggttgg -gqttq -ggttggg -makeclean
ENDED(1) AT Wed Nov  6 01:50:58 PM CET 2024 [Status=0]
./tput/teeThroughputX.sh -d_f -hrd -makej -eemumu -ggtt -ggttgg -inlonly -makeclean
ENDED(2) AT Wed Nov  6 02:11:58 PM CET 2024 [Status=0]
./tput/teeThroughputX.sh -makej -eemumu -ggtt -ggttg -gqttq -ggttgg -ggttggg -d_f -bridge -makeclean
ENDED(3) AT Wed Nov  6 02:21:49 PM CET 2024 [Status=0]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -d_f -rmbhst
ENDED(4) AT Wed Nov  6 02:24:52 PM CET 2024 [Status=0]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -d_f -curhst
ENDED(5) AT Wed Nov  6 02:27:52 PM CET 2024 [Status=0]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -d_f -common
ENDED(6) AT Wed Nov  6 02:30:58 PM CET 2024 [Status=0]
./tput/teeThroughputX.sh -dmf -hrd -makej -susyggtt -susyggt1t1 -smeftggtttt -heftggbb -makeclean
ENDED(7) AT Wed Nov  6 02:59:52 PM CET 2024 [Status=0]
No errors found in logs
No FPEs or '{ }' found in logs

Note the impressive performance improvement in the test '-p 1 256 2' with gg_ttggg (grid 256!):

 Process                     = SIGMA_SM_GG_TTXGGG_CUDA [nvcc 12.0.140 (gcc 11.3.1)] [inlineHel=0] [hardcodePARAM=0]
 Workflow summary            = CUD:MIX+THX:CURDEV+RMBDEV+MESDEV/none+NAVBRK
 FP precision                = MIXED (NaN/abnormal=0, zero=0)
-EvtsPerSec[Rmb+ME]     (23) = ( 5.036815e+02                 )  sec^-1
-EvtsPerSec[MatrixElems] (3) = ( 5.037459e+02                 )  sec^-1
-EvtsPerSec[MECalcOnly] (3a) = ( 5.037582e+02                 )  sec^-1
+EvtsPerSec[Rmb+ME]     (23) = ( 1.273555e+04                 )  sec^-1
+EvtsPerSec[MatrixElems] (3) = ( 1.277447e+04                 )  sec^-1
+EvtsPerSec[MECalcOnly] (3a) = ( 1.278128e+04                 )  sec^-1
 MeanMatrixElemValue         = ( 1.187066e-05 +- 9.825548e-06 )  GeV^-6
-TOTAL       :     1.772749 sec
+TOTAL       :     0.829704 sec

Essentially this is now enough to reach the maximum throughput, which was previously at '-p 64 256 1' (grid 8192):

 Process                     = SIGMA_SM_GG_TTXGGG_CUDA [nvcc 12.0.140 (gcc 11.3.1)] [inlineHel=0] [hardcodePARAM=0]
 Workflow summary            = CUD:MIX+THX:CURDEV+RMBDEV+MESDEV/none+NAVBRK
 FP precision                = MIXED (NaN/abnormal=0, zero=0)
-EvtsPerSec[Rmb+ME]     (23) = ( 1.292099e+04                 )  sec^-1
-EvtsPerSec[MatrixElems] (3) = ( 1.292537e+04                 )  sec^-1
-EvtsPerSec[MECalcOnly] (3a) = ( 1.292566e+04                 )  sec^-1
+EvtsPerSec[Rmb+ME]     (23) = ( 1.366211e+04                 )  sec^-1
+EvtsPerSec[MatrixElems] (3) = ( 1.366633e+04                 )  sec^-1
+EvtsPerSec[MECalcOnly] (3a) = ( 1.366663e+04                 )  sec^-1
 MeanMatrixElemValue         = ( 1.856249e-04 +- 8.329951e-05 )  GeV^-6
-TOTAL       :     2.037956 sec
+TOTAL       :     1.956331 sec
…crd90 with one helicity per kernel and streams

STARTED  AT Wed Nov  6 02:59:52 PM CET 2024
(SM tests)
ENDED(1) AT Wed Nov  6 06:56:35 PM CET 2024 [Status=0]
(BSM tests)
ENDED(1) AT Wed Nov  6 07:07:01 PM CET 2024 [Status=0]

24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_eemumu_mad/log_eemumu_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_eemumu_mad/log_eemumu_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_eemumu_mad/log_eemumu_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttggg_mad/log_ggttggg_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttggg_mad/log_ggttggg_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttggg_mad/log_ggttggg_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttgg_mad/log_ggttgg_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttgg_mad/log_ggttgg_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttgg_mad/log_ggttgg_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttg_mad/log_ggttg_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttg_mad/log_ggttg_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttg_mad/log_ggttg_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggtt_mad/log_ggtt_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggtt_mad/log_ggtt_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggtt_mad/log_ggtt_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_gqttq_mad/log_gqttq_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_gqttq_mad/log_gqttq_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_gqttq_mad/log_gqttq_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_heftggbb_mad/log_heftggbb_mad_d_inl0_hrd0.txt
1 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_heftggbb_mad/log_heftggbb_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_heftggbb_mad/log_heftggbb_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_smeftggtttt_mad/log_smeftggtttt_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_smeftggtttt_mad/log_smeftggtttt_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_smeftggtttt_mad/log_smeftggtttt_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggt1t1_mad/log_susyggt1t1_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggt1t1_mad/log_susyggt1t1_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggt1t1_mad/log_susyggt1t1_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggtt_mad/log_susyggtt_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggtt_mad/log_susyggtt_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggtt_mad/log_susyggtt_mad_m_inl0_hrd0.txt

Note: the performance improvement with streams is nice but not really shown here,
because I am using a fixed grid of 8192, while I should reduce that.

Note that the two tests with 8192 and 81920 events are not meant to reach max throughput (8192 is enough).
Instead they are meant to make the initialization negligible, to better understand the Amdahl's law behaviour.

See for instance here for gg_ttggg:

 *** (3-cuda) EXECUTE MADEVENT_CUDA x1 (create events.lhe) ***
 --------------------
@@ -532,16 +532,16 @@ DEBUG: MEK processed 8192 events across 1240 channels { 1 : 8192 }
  [XSECTION] MultiChannel = TRUE
  [XSECTION] Configuration = 1
  [XSECTION] ChannelId = 1
- [XSECTION] Cross section = 2.357e-07 [2.3572561518129471E-007] fbridge_mode=1
+ [XSECTION] Cross section = 2.357e-07 [2.3572561518129449E-007] fbridge_mode=1
  [UNWEIGHT] Wrote 18 events (found 285 events)
- [COUNTERS] PROGRAM TOTAL          :    2.0818s
- [COUNTERS] Fortran Overhead ( 0 ) :    1.0200s
- [COUNTERS] CudaCpp MEs      ( 2 ) :    0.7805s for     8192 events => throughput is 1.05E+04 events/s
- [COUNTERS] CudaCpp HEL      ( 3 ) :    0.2812s
+ [COUNTERS] PROGRAM TOTAL          :    1.9215s
+ [COUNTERS] Fortran Overhead ( 0 ) :    1.0238s
+ [COUNTERS] CudaCpp MEs      ( 2 ) :    0.6150s for     8192 events => throughput is 1.33E+04 events/s
+ [COUNTERS] CudaCpp HEL      ( 3 ) :    0.2827s

 *** (3-cuda) Compare MADEVENT_CUDA x1 events.lhe to MADEVENT_FORTRAN events.lhe reference (including colors and helicities)>
@@ -567,16 +567,16 @@ DEBUG: MEK processed 81920 events across 1240 channels { 1 : 81920 }
  [XSECTION] MultiChannel = TRUE
  [XSECTION] Configuration = 1
  [XSECTION] ChannelId = 1
- [XSECTION] Cross section = 2.284e-07 [2.2842713109538129E-007] fbridge_mode=1
+ [XSECTION] Cross section = 2.284e-07 [2.2842713109538103E-007] fbridge_mode=1
  [UNWEIGHT] Wrote 380 events (found 1707 events)
- [COUNTERS] PROGRAM TOTAL          :   13.1212s
- [COUNTERS] Fortran Overhead ( 0 ) :    5.0520s
- [COUNTERS] CudaCpp MEs      ( 2 ) :    7.7872s for    81920 events => throughput is 1.05E+04 events/s
- [COUNTERS] CudaCpp HEL      ( 3 ) :    0.2820s
+ [COUNTERS] PROGRAM TOTAL          :   11.3058s
+ [COUNTERS] Fortran Overhead ( 0 ) :    4.9192s
+ [COUNTERS] CudaCpp MEs      ( 2 ) :    6.1024s for    81920 events => throughput is 1.34E+04 events/s
+ [COUNTERS] CudaCpp HEL      ( 3 ) :    0.2842s
…PI calls (not yet used inside calculate_wavefunctions)

Also rename J_ACCESS as J2_ACCESS (this is for jamp2 not for jamps)
…ssJamps.h (a simpler accessor is in CPPProcess.cc)
…s into calculate_jamps (Feynman diagrams) and color_sum

This completes part 2a of kernel splitting.

On my usual ggttgg test, this gives another small improvement, though nothing impressive

for b in 1 2 4 8 16 32 64 128 256 512 1024; do \
 ./build.cuda_m_inl0_hrd0/check_cuda.exe -p $b 256 1 | \grep 'EvtsPerSec\[MECalcOnly\]' |\
 awk -vb=$b '{printf "%s %4d %3d\n", $5, b, 256}'; done
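The three-column scan output (throughput, blocks, threads) produced by the loop above can be reduced to a speedup-vs-baseline column with a short awk filter; a sketch using two illustrative lines in that format:

```shell
# Normalize each scan line's throughput to that of the first (smallest-grid) line.
# Input format: <EvtsPerSec> <blocks> <threads>, as produced by the scan loop.
printf '8.174664e+03 1 256\n4.139693e+05 1024 256\n' | \
  awk 'NR==1 { base = $1 } { printf "%4d %3d x%.1f\n", $2, $3, $1 / base }'
```

The same filter can be appended to the scan pipeline itself to print relative scaling directly.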

(Step 0)
In upstream/master before moving to one helicity per thread
8.174664e+03    1 256
1.646385e+04    2 256
3.278964e+04    4 256
6.259986e+04    8 256
1.200837e+05   16 256
2.157126e+05   32 256
3.311354e+05   64 256
3.529729e+05  128 256
3.688460e+05  256 256
4.013773e+05  512 256
4.139693e+05 1024 256

(Step 1a)
After moving to one helicity per thread, but before moving to cuda streams
1.434033e+04    1 256
2.851371e+04    2 256
5.646722e+04    4 256
7.650805e+04    8 256
1.422132e+05   16 256
2.452440e+05   32 256
3.680172e+05   64 256
3.658933e+05  128 256
3.631490e+05  256 256
3.921660e+05  512 256
4.038302e+05 1024 256

(Step 1b)
After moving to one helicity per thread, with one helicity per cuda stream
2.731214e+05    1 256
3.591143e+05    2 256
3.542609e+05    4 256
3.840352e+05    8 256
3.978597e+05   16 256
3.979454e+05   32 256
3.961940e+05   64 256
4.054739e+05  128 256
4.048128e+05  256 256
4.168482e+05  512 256
4.132319e+05 1024 256

(Step 2a - this commit)
2.957141e+05    1 256
3.666159e+05    2 256
3.878858e+05    4 256
4.266927e+05    8 256
4.459400e+05   16 256
4.447514e+05   32 256
4.454484e+05   64 256
4.442835e+05  128 256
4.519324e+05  256 256
4.573049e+05  512 256
4.575413e+05 1024 256
…_global__ INLINE

CPPProcess.cc(235): warning #20050-D: inline qualifier ignored for "__global__" function
Remark: The warnings can be suppressed with "-diag-suppress <warning-number>"
…cesses, remove INLINE from __global__ to fix build warnings
…__ INLINE in HELINL=1 mode

Remark: The warnings can be suppressed with "-diag-suppress <warning-number>"
diagrams.h(80): warning #20050-D: inline qualifier ignored for "__global__" function
CPPProcess.cc(456): warning #20050-D: inline qualifier ignored for "__global__" function
…arate color_sum files (ease the merge with hack_ihel3)
…- 132 tput and tmad logs

git checkout origin/hack_ihel4 $(git ls-tree --name-only HEAD tput/log* tmad/log*)
…- codegen logs for all processes

git checkout origin/hack_ihel4 $(git ls-tree --name-only HEAD */CODEGEN*txt)
…- generated code except gg_tt.mad

git checkout origin/hack_ihel4 $(git ls-tree -r --name-only HEAD *.sa *.mad \
  | grep -v ^gg_tt.mad | \egrep '(CPPProcess|MatrixElementKernels).(h|cc)')
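The pattern used above (restore a set of tracked paths from another ref, with the path list built by `git ls-tree`) can be demonstrated in isolation; a self-contained sketch in a throwaway repository, where the branch name `other` and the file `log.txt` are purely illustrative:

```shell
# Demo of: git checkout <ref> $(git ls-tree --name-only HEAD <patterns>)
# i.e. overwrite the working-tree copies of tracked paths with the versions on <ref>.
set -e
tmp=$(mktemp -d)
cd "$tmp"
git init -q .
echo v1 > log.txt
git add log.txt
git -c user.email=a@b -c user.name=a commit -q -m "add log"
git branch other                  # 'other' points at the v1 version
echo v2 > log.txt
git add log.txt
git -c user.email=a@b -c user.name=a commit -q -m "update log"
# restore log.txt as it exists on 'other'; the path list comes from ls-tree
git checkout other -- $(git ls-tree --name-only HEAD 'log*')
cat log.txt
```

This is handy when selectively carrying logs or generated files across branches without a full merge.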
This merges the 'diagram kernel splitting' (hack_ihel4) and cublas+master (hack_ihel3_sep25) functionalities

Fix conflicts:
	epochX/cudacpp/CODEGEN/PLUGIN/CUDACPP_SA_OUTPUT/madgraph/iolibs/template_files/gpu/MatrixElementKernels.cc
	epochX/cudacpp/CODEGEN/PLUGIN/CUDACPP_SA_OUTPUT/madgraph/iolibs/template_files/gpu/MatrixElementKernels.h
	epochX/cudacpp/CODEGEN/PLUGIN/CUDACPP_SA_OUTPUT/madgraph/iolibs/template_files/gpu/process_function_definitions.inc
	epochX/cudacpp/CODEGEN/PLUGIN/CUDACPP_SA_OUTPUT/madgraph/iolibs/template_files/gpu/process_h.inc
	epochX/cudacpp/CODEGEN/PLUGIN/CUDACPP_SA_OUTPUT/madgraph/iolibs/template_files/gpu/process_matrix.inc
	epochX/cudacpp/CODEGEN/PLUGIN/CUDACPP_SA_OUTPUT/model_handling.py
	epochX/cudacpp/CODEGEN/PLUGIN/CUDACPP_SA_OUTPUT/output.py
	epochX/cudacpp/gg_tt.mad/SubProcesses/MatrixElementKernels.cc
	epochX/cudacpp/gg_tt.mad/SubProcesses/MatrixElementKernels.h
	epochX/cudacpp/gg_tt.mad/SubProcesses/P1_gg_ttx/CPPProcess.cc
	epochX/cudacpp/gg_tt.mad/SubProcesses/P1_gg_ttx/CPPProcess.h
	epochX/cudacpp/gg_tt.mad/SubProcesses/P1_gg_ttx/color_sum.cc
	epochX/cudacpp/gg_tt.mad/SubProcesses/color_sum.h
	epochX/cudacpp/tmad/allTees.sh
	epochX/cudacpp/tput/allTees.sh
	epochX/cudacpp/tput/throughputX.sh

Also modify epochX/cudacpp/CODEGEN/PLUGIN/CUDACPP_SA_OUTPUT/madgraph/iolibs/template_files/gpu/color_sum.h
(as done in a previous incorrect merge on which this one is based)
…tggg and smeft_ggtttt to avoid out-of-memory asserts on itscrd90
…rgs2) for all processes: no longer needed and may abort
) codebase on itgold91

STARTED AT Tue Sep 23 07:11:21 PM CEST 2025
./tput/teeThroughputX.sh -dmf -makej -makeclean -cpponly -ggtt -ggttg -ggttgg -ggttggg
ENDED   AT Tue Sep 23 07:21:31 PM CEST 2025 [Status=0]
…ack_ihel_sep25 (9f802a9) codebase on itgold91

STARTED AT Tue Sep 23 07:27:31 PM CEST 2025
./tput/teeThroughputX.sh -dmf -makej -makeclean -cpponly -ggtt -ggttg -ggttgg -ggttggg
ENDED   AT Tue Sep 23 07:42:57 PM CEST 2025 [Status=0]
…hack_ihel2_sep25 (7b12a5c) codebase on itgold91

STARTED AT Tue Sep 23 07:53:41 PM CEST 2025
./tput/teeThroughputX.sh -dmf -makej -makeclean -cpponly -ggtt -ggttg -ggttgg -ggttggg
ENDED   AT Tue Sep 23 08:09:00 PM CEST 2025 [Status=0]
…hack_ihel3_sep25 (f98c217) codebase on itgold91

STARTED AT Tue Sep 23 ~08:20 PM CEST 2025
./tput/teeThroughputX.sh -dmf -makej -makeclean -cpponly -ggtt -ggttg -ggttgg -ggttggg
ENDED   AT Tue Sep 23 ~08:33 PM CEST 2025 [Status=0]
…hack_ihel4_sep25 (3e3f200) codebase on itgold91

STARTED AT Wed Sep 24 07:36:25 AM CEST 2025
./tput/teeThroughputX.sh -dmf -makej -makeclean -cpponly -ggtt -ggttg -ggttgg -ggttggg
ENDED   AT Wed Sep 24 07:43:16 AM CEST 2025 [Status=0]
…o hack_ihel3_sep25/itscrd90 logs

git checkout 3e3f200 tput/logs_ggtt*_mad/log_ggtt*_mad_*_inl0_hrd0.txt
…ing) tput tests on LUMI

(after tuning scripts and fixing issues in mg5amcnlo submodule)

With respect to the last LUMI logs for the 'hack_ihel3_sep25' codebase (commit ac04c54):
1) With blas disabled at runtime
- hip/256 throughput is a factor 3 worse at large grids and not very different at small grids
- hip/32 throughput is not very different either at large or small grids
- hip peak throughput at large grids was in any case higher with 256 tpb rather than 32 tpb
- c++ throughputs are ~30% slower
2) With blas enabled at runtime
- hip throughputs show similar trends as without blas (they are worse in ihel4 than in ihel3)
- in any case the blasOn results are a factor 10-100 worse than blasOff results

So, overall ihel4 is much worse than ihel3 on AMD GPUs with HIP (and also worse on AMD CPUs)

STARTED  AT Wed 24 Sep 2025 11:26:42 AM EEST
./tput/teeThroughputX.sh -dmf -hrd -makej -eemumu -ggtt -ggttg -ggttgg -gqttq -ggttggg -makeclean  -nocuda
ENDED(1) AT Wed 24 Sep 2025 11:59:25 AM EEST [Status=0]
./tput/teeThroughputX.sh -dmf -makej -eemumu -ggtt -ggttg -ggttgg -gqttq -ggttggg -scaling  -nocuda
ENDED(1-scaling) AT Wed 24 Sep 2025 12:06:45 PM EEST [Status=0]
./tput/teeThroughputX.sh -ggtt -ggttgg -dmf -blasOn  -nocuda
ENDED(2) AT Wed 24 Sep 2025 12:09:57 PM EEST [Status=0]
./tput/teeThroughputX.sh -ggtt -ggttgg -dmf -blasOn -scaling  -nocuda
ENDED(2-scaling) AT Wed 24 Sep 2025 12:20:22 PM EEST [Status=0]
./tput/teeThroughputX.sh -d_f -hrd -makej -eemumu -ggtt -ggttgg -inlonly -makeclean  -nocuda
ENDED(3) AT Wed 24 Sep 2025 12:30:38 PM EEST [Status=0]
./tput/teeThroughputX.sh -makej -eemumu -ggtt -ggttg -gqttq -ggttgg -ggttggg -d_f -bridge -makeclean  -nocuda
ENDED(4) AT Wed 24 Sep 2025 12:40:30 PM EEST [Status=0]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -d_f -rmbhst  -nocuda
ENDED(5) AT Wed 24 Sep 2025 12:42:27 PM EEST [Status=0]
SKIP './tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -d_f -common  -nocuda'
ENDED(6) AT Wed 24 Sep 2025 12:42:27 PM EEST [Status=0]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -d_f -common  -nocuda
ENDED(7) AT Wed 24 Sep 2025 12:44:23 PM EEST [Status=0]
./tput/teeThroughputX.sh -ggtt -ggttgg -dmf -noBlas -makeclean  -nocuda
ENDED(8) AT Wed 24 Sep 2025 12:49:53 PM EEST [Status=0]
./tput/teeThroughputX.sh -dmf -hrd -makej -susyggtt -susyggt1t1 -smeftggtttt -heftggbb -makeclean  -nocuda
ENDED(9) AT Wed 24 Sep 2025 01:07:25 PM EEST [Status=0]

No errors found in logs

No FPEs or '{ }' found in logs

./tput/logs_ggttgg_mad/log_ggttgg_mad_d_inl0_hrd0_curhst.txt: P1_gg_ttxgg/build.cuda_d_inl0_hrd0/check_cuda.exe: Aborted
./tput/logs_ggttgg_mad/log_ggttgg_mad_f_inl0_hrd0_curhst.txt: P1_gg_ttxgg/build.cuda_f_inl0_hrd0/check_cuda.exe: Aborted

./tput/logs_ggttgg_mad/log_ggttgg_mad_d_inl0_hrd0_blasOn.scaling:check_hip.exe: ./Assertion `code == gpuSuccess' failed.
./tput/logs_ggttgg_mad/log_ggttgg_mad_d_inl0_hrd0.scaling:check_hip.exe: ./Assertion `code == gpuSuccess' failed.
./tput/logs_ggttgg_mad/log_ggttgg_mad_f_inl0_hrd0_blasOn.scaling:check_hip.exe: ./Assertion `code == gpuSuccess' failed.
./tput/logs_ggttgg_mad/log_ggttgg_mad_f_inl0_hrd0.scaling:check_hip.exe: ./Assertion `code == gpuSuccess' failed.
./tput/logs_ggttgg_mad/log_ggttgg_mad_m_inl0_hrd0_blasOn.scaling:check_hip.exe: ./Assertion `code == gpuSuccess' failed.
./tput/logs_ggttgg_mad/log_ggttgg_mad_m_inl0_hrd0.scaling:check_hip.exe: ./Assertion `code == gpuSuccess' failed.
(after tuning scripts and fixing issues in mg5amcnlo submodule)

With respect to the last LUMI logs for the 'hack_ihel3_sep25' codebase (commit ac04c54):
- hip throughputs are a factor ~2 worse in ggttgg

STARTED  AT Wed 24 Sep 2025 01:07:26 PM EEST
(SM tests)
ENDED(1) AT Wed 24 Sep 2025 01:50:35 PM EEST [Status=0]
(BSM tests)
ENDED(1) AT Wed 24 Sep 2025 01:53:53 PM EEST [Status=0]

8 /users/valassia/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_eemumu_mad/log_eemumu_mad_d_inl0_hrd0.txt
8 /users/valassia/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_eemumu_mad/log_eemumu_mad_f_inl0_hrd0.txt
8 /users/valassia/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_eemumu_mad/log_eemumu_mad_m_inl0_hrd0.txt
6 /users/valassia/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_ggttggg_mad/log_ggttggg_mad_d_inl0_hrd0.txt
6 /users/valassia/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_ggttggg_mad/log_ggttggg_mad_f_inl0_hrd0.txt
6 /users/valassia/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_ggttggg_mad/log_ggttggg_mad_m_inl0_hrd0.txt
8 /users/valassia/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_ggttgg_mad/log_ggttgg_mad_d_inl0_hrd0.txt
8 /users/valassia/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_ggttgg_mad/log_ggttgg_mad_f_inl0_hrd0.txt
8 /users/valassia/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_ggttgg_mad/log_ggttgg_mad_m_inl0_hrd0.txt
8 /users/valassia/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_ggttg_mad/log_ggttg_mad_d_inl0_hrd0.txt
8 /users/valassia/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_ggttg_mad/log_ggttg_mad_f_inl0_hrd0.txt
8 /users/valassia/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_ggttg_mad/log_ggttg_mad_m_inl0_hrd0.txt
8 /users/valassia/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_ggtt_mad/log_ggtt_mad_d_inl0_hrd0.txt
8 /users/valassia/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_ggtt_mad/log_ggtt_mad_f_inl0_hrd0.txt
8 /users/valassia/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_ggtt_mad/log_ggtt_mad_m_inl0_hrd0.txt
8 /users/valassia/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_gqttq_mad/log_gqttq_mad_d_inl0_hrd0.txt
8 /users/valassia/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_gqttq_mad/log_gqttq_mad_f_inl0_hrd0.txt
8 /users/valassia/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_gqttq_mad/log_gqttq_mad_m_inl0_hrd0.txt
8 /users/valassia/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_heftggbb_mad/log_heftggbb_mad_d_inl0_hrd0.txt
1 /users/valassia/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_heftggbb_mad/log_heftggbb_mad_f_inl0_hrd0.txt
8 /users/valassia/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_heftggbb_mad/log_heftggbb_mad_m_inl0_hrd0.txt
8 /users/valassia/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_smeftggtttt_mad/log_smeftggtttt_mad_d_inl0_hrd0.txt
8 /users/valassia/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_smeftggtttt_mad/log_smeftggtttt_mad_f_inl0_hrd0.txt
8 /users/valassia/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_smeftggtttt_mad/log_smeftggtttt_mad_m_inl0_hrd0.txt
8 /users/valassia/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_susyggt1t1_mad/log_susyggt1t1_mad_d_inl0_hrd0.txt
8 /users/valassia/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_susyggt1t1_mad/log_susyggt1t1_mad_f_inl0_hrd0.txt
8 /users/valassia/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_susyggt1t1_mad/log_susyggt1t1_mad_m_inl0_hrd0.txt
8 /users/valassia/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_susyggtt_mad/log_susyggtt_mad_d_inl0_hrd0.txt
8 /users/valassia/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_susyggtt_mad/log_susyggtt_mad_f_inl0_hrd0.txt
8 /users/valassia/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_susyggtt_mad/log_susyggtt_mad_m_inl0_hrd0.txt

tmad/logs_heftggbb_mad/log_heftggbb_mad_f_inl0_hrd0.txt:ERROR! events.lhe.cpp.1 and events.lhe.ref.1 differ!

No asserts found in logs

No segmentation fault found in logs
…ck_ihel3_sep25/itscrd90 logs

Revert "[hack_ihel4_sep25] rerun 30 tmad tests on LUMI"
This reverts commit 6dbe816.

Revert "[hack_ihel4_sep25] rerun 132 (96 + 12 blas + 18 scaling + 6 blas/scaling) tput tests on LUMI"
This reverts commit b752ca3.
…ling) tput tests on itscrd90

(after tuning scripts and fixing issues in mg5amcnlo submodule)

With respect to the last itscrd90 logs for the 'hack_ihel3_sep25' codebase (commit 10c3e3b):
1) With blas disabled at runtime
- gpu throughput scaling is worse (it picks up at larger grids)
- gpu throughput is a factor ~100 worse at small grids and ~4 worse at large grids for ggttggg
- gpu throughput is a factor ~10 worse at small grids and ~10 worse at large grids for ggttg
- gpu throughput is a factor ~2 worse at small grids and ~10 worse at large grids for ggtt
- c++ throughputs are also 30% worse for ggttggg (and somewhat worse also for simpler processes)
2) With blas enabled at runtime
- gpu throughput is also much worse at small grids and large grids for ggttgg and ggtt
- strangely, now blasOn and blasOff results are essentially the same (jamps is so bad that blas does not matter)

So, overall ihel4 is much worse than ihel3 on NVidia GPUs with CUDA (and also worse on Intel CPUs)

STARTED  AT Wed Sep 24 07:34:30 AM CEST 2025
./tput/teeThroughputX.sh -dmf -hrd -makej -eemumu -ggtt -ggttg -ggttgg -gqttq -ggttggg -makeclean
ENDED(1) AT Wed Sep 24 08:15:46 AM CEST 2025 [Status=0]
./tput/teeThroughputX.sh -dmf -makej -eemumu -ggtt -ggttg -ggttgg -gqttq -ggttggg -scaling
ENDED(1-scaling) AT Wed Sep 24 08:31:02 AM CEST 2025 [Status=0]
./tput/teeThroughputX.sh -ggtt -ggttgg -dmf -blasOn
ENDED(2) AT Wed Sep 24 08:36:19 AM CEST 2025 [Status=0]
./tput/teeThroughputX.sh -ggtt -ggttgg -dmf -blasOn -scaling
ENDED(2-scaling) AT Wed Sep 24 08:41:35 AM CEST 2025 [Status=0]
./tput/teeThroughputX.sh -d_f -hrd -makej -eemumu -ggtt -ggttgg -inlonly -makeclean
ENDED(3) AT Wed Sep 24 08:51:23 AM CEST 2025 [Status=0]
./tput/teeThroughputX.sh -makej -eemumu -ggtt -ggttg -gqttq -ggttgg -ggttggg -d_f -bridge -makeclean
ENDED(4) AT Wed Sep 24 09:05:21 AM CEST 2025 [Status=0]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -d_f -rmbhst
ENDED(5) AT Wed Sep 24 09:09:16 AM CEST 2025 [Status=0]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -d_f -curhst
ENDED(6) AT Wed Sep 24 09:13:15 AM CEST 2025 [Status=0]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -d_f -common
ENDED(7) AT Wed Sep 24 09:17:20 AM CEST 2025 [Status=0]
./tput/teeThroughputX.sh -ggtt -ggttgg -dmf -noBlas -makeclean
ENDED(8) AT Wed Sep 24 09:24:32 AM CEST 2025 [Status=0]
./tput/teeThroughputX.sh -dmf -hrd -makej -susyggtt -susyggt1t1 -smeftggtttt -heftggbb -makeclean
ENDED(9) AT Wed Sep 24 09:39:46 AM CEST 2025 [Status=0]

No errors found in logs

No FPEs or '{ }' found in logs

No aborts found in logs

./tput/logs_ggttggg_mad/log_ggttggg_mad_d_inl0_hrd0.scaling:check_cuda.exe: Assertion `code == gpuSuccess' failed.
./tput/logs_ggttggg_mad/log_ggttggg_mad_f_inl0_hrd0.scaling:check_cuda.exe: Assertion `code == gpuSuccess' failed.
./tput/logs_ggttggg_mad/log_ggttggg_mad_m_inl0_hrd0.scaling:check_cuda.exe: Assertion `code == gpuSuccess' failed.
./tput/logs_ggttgg_mad/log_ggttgg_mad_d_inl0_hrd0_blasOn.scaling:check_cuda.exe: Assertion `code == gpuSuccess' failed.
./tput/logs_ggttgg_mad/log_ggttgg_mad_d_inl0_hrd0.scaling:check_cuda.exe: Assertion `code == gpuSuccess' failed.
./tput/logs_ggttgg_mad/log_ggttgg_mad_f_inl0_hrd0_blasOn.scaling:check_cuda.exe: Assertion `code == gpuSuccess' failed.
./tput/logs_ggttgg_mad/log_ggttgg_mad_f_inl0_hrd0.scaling:check_cuda.exe: Assertion `code == gpuSuccess' failed.
./tput/logs_ggttgg_mad/log_ggttgg_mad_m_inl0_hrd0_blasOn.scaling:check_cuda.exe: Assertion `code == gpuSuccess' failed.
./tput/logs_ggttgg_mad/log_ggttgg_mad_m_inl0_hrd0.scaling:check_cuda.exe: Assertion `code == gpuSuccess' failed.
…s on itscrd90

(after tuning scripts and fixing issues in mg5amcnlo submodule)

With respect to the last itscrd90 logs for the 'hack_ihel3_sep25' codebase (commit 10c3e3b):
(Note: blas is disabled at runtime in tmad tests)
- gpu throughputs are a factor ~4 slower in ggttggg and ~2 slower in ggtt
- c++ is around 30% slower in ggttggg and somewhat slower in simpler processes

So, overall ihel4 is much worse than ihel3 on NVidia GPUs with CUDA (and also worse on Intel CPUs)

STARTED  AT Wed Sep 24 09:39:46 AM CEST 2025
(SM tests)
ENDED(1) AT Wed Sep 24 10:44:58 AM CEST 2025 [Status=0]
(BSM tests)
ENDED(1) AT Wed Sep 24 10:48:59 AM CEST 2025 [Status=0]

12 /data/avalassi/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_eemumu_mad/log_eemumu_mad_d_inl0_hrd0.txt
12 /data/avalassi/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_eemumu_mad/log_eemumu_mad_f_inl0_hrd0.txt
12 /data/avalassi/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_eemumu_mad/log_eemumu_mad_m_inl0_hrd0.txt
12 /data/avalassi/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_ggttggg_mad/log_ggttggg_mad_d_inl0_hrd0.txt
12 /data/avalassi/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_ggttggg_mad/log_ggttggg_mad_f_inl0_hrd0.txt
12 /data/avalassi/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_ggttggg_mad/log_ggttggg_mad_m_inl0_hrd0.txt
12 /data/avalassi/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_ggttgg_mad/log_ggttgg_mad_d_inl0_hrd0.txt
12 /data/avalassi/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_ggttgg_mad/log_ggttgg_mad_f_inl0_hrd0.txt
12 /data/avalassi/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_ggttgg_mad/log_ggttgg_mad_m_inl0_hrd0.txt
12 /data/avalassi/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_ggttg_mad/log_ggttg_mad_d_inl0_hrd0.txt
12 /data/avalassi/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_ggttg_mad/log_ggttg_mad_f_inl0_hrd0.txt
12 /data/avalassi/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_ggttg_mad/log_ggttg_mad_m_inl0_hrd0.txt
12 /data/avalassi/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_ggtt_mad/log_ggtt_mad_d_inl0_hrd0.txt
12 /data/avalassi/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_ggtt_mad/log_ggtt_mad_f_inl0_hrd0.txt
12 /data/avalassi/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_ggtt_mad/log_ggtt_mad_m_inl0_hrd0.txt
12 /data/avalassi/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_gqttq_mad/log_gqttq_mad_d_inl0_hrd0.txt
12 /data/avalassi/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_gqttq_mad/log_gqttq_mad_f_inl0_hrd0.txt
12 /data/avalassi/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_gqttq_mad/log_gqttq_mad_m_inl0_hrd0.txt
12 /data/avalassi/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_heftggbb_mad/log_heftggbb_mad_d_inl0_hrd0.txt
1 /data/avalassi/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_heftggbb_mad/log_heftggbb_mad_f_inl0_hrd0.txt
12 /data/avalassi/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_heftggbb_mad/log_heftggbb_mad_m_inl0_hrd0.txt
12 /data/avalassi/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_smeftggtttt_mad/log_smeftggtttt_mad_d_inl0_hrd0.txt
12 /data/avalassi/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_smeftggtttt_mad/log_smeftggtttt_mad_f_inl0_hrd0.txt
12 /data/avalassi/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_smeftggtttt_mad/log_smeftggtttt_mad_m_inl0_hrd0.txt
12 /data/avalassi/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_susyggt1t1_mad/log_susyggt1t1_mad_d_inl0_hrd0.txt
12 /data/avalassi/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_susyggt1t1_mad/log_susyggt1t1_mad_f_inl0_hrd0.txt
12 /data/avalassi/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_susyggt1t1_mad/log_susyggt1t1_mad_m_inl0_hrd0.txt
12 /data/avalassi/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_susyggtt_mad/log_susyggtt_mad_d_inl0_hrd0.txt
12 /data/avalassi/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_susyggtt_mad/log_susyggtt_mad_f_inl0_hrd0.txt
12 /data/avalassi/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_susyggtt_mad/log_susyggtt_mad_m_inl0_hrd0.txt

tmad/logs_heftggbb_mad/log_heftggbb_mad_f_inl0_hrd0.txt:ERROR! events.lhe.cpp.1 and events.lhe.ref.1 differ!

No asserts found in logs

No segmentation fault found in logs
@valassi valassi requested a review from a team as a code owner October 6, 2025 17:46
@valassi valassi marked this pull request as draft October 6, 2025 17:46