Conversation

@valassi commented Oct 6, 2025

Hi @oliviermattelaer, as discussed recently and as per my presentation at the MG5AMC meeting last Friday.

This is the PR for my kernel splitting changes that I recommend merging:

  • ihel1: helicity streams
  • ihel2: color sum kernels
  • ihel3: color sum in BLAS (see the sketch right after this list)
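
For readers new to the third item, here is a minimal sketch of the idea behind "color sum in BLAS": for each helicity, the matrix element is ME = Re( jamp† · CF · jamp ), and over a batch of events this reduces to real matrix products that a BLAS library (cuBLAS/hipBLAS) can execute. The function and layouts below are hypothetical illustrations, not the actual CPPProcess/color_sum.cc implementation.

// Illustrative only: batched color sum written as matrix products (a naive loop stands in
// for what would be GEMM calls on the GPU; names and layouts are hypothetical).
// jampRe/jampIm: [ncolor x nevt] real/imaginary parts of the color amplitudes
// cf:            [ncolor x ncolor] real color matrix (denominators already folded in)
// me:            [nevt] output matrix elements
void colorSumBatch( int ncolor, int nevt,
                    const double* jampRe, const double* jampIm,
                    const double* cf, double* me )
{
  for( int ievt = 0; ievt < nevt; ievt++ )
  {
    double sum = 0;
    for( int icol = 0; icol < ncolor; icol++ )
    {
      double tRe = 0, tIm = 0;
      for( int jcol = 0; jcol < ncolor; jcol++ )
      {
        tRe += cf[icol * ncolor + jcol] * jampRe[jcol * nevt + ievt];
        tIm += cf[icol * ncolor + jcol] * jampIm[jcol * nevt + ievt];
      }
      sum += jampRe[icol * nevt + ievt] * tRe + jampIm[icol * nevt + ievt] * tIm; // Re( jamp_i^* (CF*jamp)_i )
    }
    me[ievt] = sum;
  }
}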

I have prepared a paper, which will shortly be on arXiv, with all the details.

Until yesterday, it would have been possible to merge this automatically, as I had merged the latest upstream into my developments. Yesterday some new changes were merged (for tREX, I think), so this will need some massaging. I can do that later on, or let me know how you want to proceed.

Thanks, Andrea

oliviermattelaer and others added 30 commits September 26, 2022 18:35
… and Device version: put it back in CPPProcess for now
gCPPProcess.cu(689): warning #20091-D: a __constant__ variable "mg5amcGpu::cNGoodHel" cannot be directly read in a host function
gCPPProcess.cu(691): warning #20091-D: a __constant__ variable "mg5amcGpu::cGoodHel" cannot be directly read in a host function
/cvmfs/sft.cern.ch/lcg/releases/binutils/2.37-4177a/x86_64-centos7/bin/ld: ../../lib/libmg5amc_gg_ttx_cuda.so: undefined reference to `__device_builtin_variable_blockIdx'
/cvmfs/sft.cern.ch/lcg/releases/binutils/2.37-4177a/x86_64-centos7/bin/ld: ../../lib/libmg5amc_gg_ttx_cuda.so: undefined reference to `__device_builtin_variable_blockDim'
/cvmfs/sft.cern.ch/lcg/releases/binutils/2.37-4177a/x86_64-centos7/bin/ld: ../../lib/libmg5amc_gg_ttx_cuda.so: undefined reference to `__device_builtin_variable_threadIdx'
Confirm a 30% difference between ihel_hack
EvtsPerSec[MECalcOnly] (3a) = ( 9.992406e+07                 )  sec^-1
and hack
EvtsPerSec[MECalcOnly] (3a) = ( 1.340965e+08                 )  sec^-1
./tput/teeThroughputX.sh -eemumu -ggtt -ggttg -ggttgg -ggttggg
…tests on rd90 after tuning the scripts

Note that peak performance of ggttgg decreases by approximately 10% in classic tests
(but the peak performance remains visible in scaling tests)
No difference otherwise

./tmad/teeMadX.sh -ggttgg -dmf -hip

STARTED AT Sat Sep 20 10:20:48 AM CEST 2025
ENDED   AT Sat Sep 20 10:23:19 AM CEST 2025
…- 144 tput and tmad logs

git checkout origin/hack_ihel3 $(git ls-tree --name-only HEAD tput/log* tmad/log*)
…- codegen logs for all processes

git checkout origin/hack_ihel3 $(git ls-tree --name-only HEAD */CODEGEN*txt)
…- generated code except gg_tt.mad

(Will merge and fix conflicts only in CODEGEN, gg_tt.mad and tput scripts)

git checkout origin/hack_ihel3 $(git ls-tree -r --name-only HEAD *.sa *.mad \
  | grep -v ^gg_tt.mad | \egrep '(MatrixElementKernels.cc|CPPProcess.cc)')
Fix conflicts:

1) merge HIP fixes for gpuStream (hack_ihel2_sep25) and new BLAS features (hack_ihel3)
- epochX/cudacpp/CODEGEN/PLUGIN/CUDACPP_SA_OUTPUT/madgraph/iolibs/template_files/gpu/process_function_definitions.inc
- epochX/cudacpp/gg_tt.mad/SubProcesses/P1_gg_ttx/CPPProcess.cc

2) merge helicity stream features (changed in hack_ihel3) and disabling of fpeEnable (master)
- epochX/cudacpp/CODEGEN/PLUGIN/CUDACPP_SA_OUTPUT/madgraph/iolibs/template_files/gpu/MatrixElementKernels.cc
- epochX/cudacpp/gg_tt.mad/SubProcesses/MatrixElementKernels.cc
(Note: I would personally prefer to keep FPEs enabled or at least add an env variable to enable FPEs)

3) merge new scaling (hack_ihel2_sep25) and BLAS (hack_ihel3) tput tests, add 6 more BLAS scaling tests
- epochX/cudacpp/tput/allTees.sh
- epochX/cudacpp/tput/teeThroughputX.sh
- epochX/cudacpp/tput/throughputX.sh
…t cublas/hipblas headers with #ifndef MGONGPU_HAS_NO_BLAS
…scaling) tput tests on LUMI - all ok

(This commit adds the 6 new blas/scaling logs)

With respect to the last LUMI logs for the 'hack_ihel2_sep25' codebase (commit 7b12a5c):
- hip and c++ results are unchanged in ggtt/ggttgg (within 10% fluctuations) for blas builds with blas disabled at runtime
  (for hip, results are unchanged both at small grids and for peak performance at large grids)

Comparing results with blas disabled (blasOff) and enabled (blasOn) at runtime on LUMI:
- hip scaling and peak performance are significantly worse for blasOn with respect to blasOff
  (for ggtt, blas is a factor ~100 worse both at small grids and large grids)
  (for ggttgg, blas is a factor ~100 worse at small grids and a factor ~5 worse for peak performance at large grids)
With respect to the last LUMI logs for the 'hack_ihel2_sep25' codebase (commit 7b12a5c):
- hip and c++ results are ~unchanged in ggtt/ggttgg (within 10% fluctuations) for blas builds with blas disabled at runtime
  (many results seem lower, but as mentioned there are large fluctuations on LUMI)
…ep25/itscrd90 logs

Revert "[hack_ihel3_sep25] rerun 30 tmad tests on LUMI - all ok"
This reverts commit ac04c54.

Revert "[hack_ihel3_sep25] rerun 132 (96 + 12 blas + 18 scaling + 6 new blas/scaling) tput tests on LUMI - all ok"
This reverts commit b56251e.
…/scaling) tput tests on itscrd90 - all ok

(This commit adds the 6 new blas/scaling logs)

With respect to the last itscrd90 logs for the 'hack_ihel2_sep25' codebase (commit 7b12a5c):
- cuda and c++ results are unchanged across ggtt/ggttggg (within 1%) for blas builds with blas disabled at runtime
  (for cuda, results are unchanged both at small grids and for peak performance at large grids)

Comparing results with blas disabled (blasOff) and enabled (blasOn) at runtime on a V100:
- cuda scaling and peak performance are significantly worse for blasOn with respect to blasOff
  (for ggtt, blas is a factor 6-8 worse both at small grids and large grids)
  (for ggttgg, blas is a factor 6-8 worse at small grids and around 10% worse for peak performance at large grids)

STARTED  AT Sat Sep 20 11:46:40 PM CEST 2025
./tput/teeThroughputX.sh -dmf -hrd -makej -eemumu -ggtt -ggttg -ggttgg -gqttq -ggttggg -makeclean
ENDED(1) AT Sun Sep 21 02:16:49 AM CEST 2025 [Status=0]
./tput/teeThroughputX.sh -dmf -makej -eemumu -ggtt -ggttg -ggttgg -gqttq -ggttggg -scaling
ENDED(1-scaling) AT Sun Sep 21 02:27:47 AM CEST 2025 [Status=0]
./tput/teeThroughputX.sh -ggtt -ggttgg -dmf -blasOn
ENDED(2) AT Sun Sep 21 02:32:04 AM CEST 2025 [Status=0]
./tput/teeThroughputX.sh -ggtt -ggttgg -dmf -blasOn -scaling
ENDED(2-scaling) AT Sun Sep 21 02:36:29 AM CEST 2025 [Status=0]
./tput/teeThroughputX.sh -d_f -hrd -makej -eemumu -ggtt -ggttgg -inlonly -makeclean
ENDED(3) AT Sun Sep 21 02:55:26 AM CEST 2025 [Status=0]
./tput/teeThroughputX.sh -makej -eemumu -ggtt -ggttg -gqttq -ggttgg -ggttggg -d_f -bridge -makeclean
ENDED(4) AT Sun Sep 21 03:06:23 AM CEST 2025 [Status=0]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -d_f -rmbhst
ENDED(5) AT Sun Sep 21 03:09:46 AM CEST 2025 [Status=0]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -d_f -curhst
ENDED(6) AT Sun Sep 21 03:13:11 AM CEST 2025 [Status=0]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -d_f -common
ENDED(7) AT Sun Sep 21 03:16:41 AM CEST 2025 [Status=0]
./tput/teeThroughputX.sh -ggtt -ggttgg -dmf -noBlas -makeclean
ENDED(8) AT Sun Sep 21 03:29:44 AM CEST 2025 [Status=0]
./tput/teeThroughputX.sh -dmf -hrd -makej -susyggtt -susyggt1t1 -smeftggtttt -heftggbb -makeclean
ENDED(9) AT Sun Sep 21 03:52:43 AM CEST 2025 [Status=0]

No errors found in logs

No FPEs or '{ }' found in logs

No aborts found in logs

./tput/logs_ggttggg_mad/log_ggttggg_mad_d_inl0_hrd0.scaling:check_cuda.exe: Assertion `code == gpuSuccess' failed.
./tput/logs_ggttggg_mad/log_ggttggg_mad_f_inl0_hrd0.scaling:check_cuda.exe: Assertion `code == gpuSuccess' failed.
./tput/logs_ggttggg_mad/log_ggttggg_mad_m_inl0_hrd0.scaling:check_cuda.exe: Assertion `code == gpuSuccess' failed.
With respect to the last itscrd90 logs for the 'hack_ihel2_sep25' codebase (commit 7b12a5c):
- cuda and c++ results are unchanged across ggtt/ggttggg (within 1%) for blas builds with blas disabled at runtime

STARTED  AT Sun Sep 21 03:52:43 AM CEST 2025
(SM tests)
ENDED(1) AT Sun Sep 21 04:44:34 AM CEST 2025 [Status=0]
(BSM tests)
ENDED(1) AT Sun Sep 21 04:48:33 AM CEST 2025 [Status=0]

12 /data/avalassi/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_eemumu_mad/log_eemumu_mad_d_inl0_hrd0.txt
12 /data/avalassi/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_eemumu_mad/log_eemumu_mad_f_inl0_hrd0.txt
12 /data/avalassi/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_eemumu_mad/log_eemumu_mad_m_inl0_hrd0.txt
12 /data/avalassi/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_ggttggg_mad/log_ggttggg_mad_d_inl0_hrd0.txt
12 /data/avalassi/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_ggttggg_mad/log_ggttggg_mad_f_inl0_hrd0.txt
12 /data/avalassi/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_ggttggg_mad/log_ggttggg_mad_m_inl0_hrd0.txt
12 /data/avalassi/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_ggttgg_mad/log_ggttgg_mad_d_inl0_hrd0.txt
12 /data/avalassi/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_ggttgg_mad/log_ggttgg_mad_f_inl0_hrd0.txt
12 /data/avalassi/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_ggttgg_mad/log_ggttgg_mad_m_inl0_hrd0.txt
12 /data/avalassi/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_ggttg_mad/log_ggttg_mad_d_inl0_hrd0.txt
12 /data/avalassi/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_ggttg_mad/log_ggttg_mad_f_inl0_hrd0.txt
12 /data/avalassi/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_ggttg_mad/log_ggttg_mad_m_inl0_hrd0.txt
12 /data/avalassi/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_ggtt_mad/log_ggtt_mad_d_inl0_hrd0.txt
12 /data/avalassi/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_ggtt_mad/log_ggtt_mad_f_inl0_hrd0.txt
12 /data/avalassi/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_ggtt_mad/log_ggtt_mad_m_inl0_hrd0.txt
12 /data/avalassi/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_gqttq_mad/log_gqttq_mad_d_inl0_hrd0.txt
12 /data/avalassi/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_gqttq_mad/log_gqttq_mad_f_inl0_hrd0.txt
12 /data/avalassi/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_gqttq_mad/log_gqttq_mad_m_inl0_hrd0.txt
12 /data/avalassi/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_heftggbb_mad/log_heftggbb_mad_d_inl0_hrd0.txt
1 /data/avalassi/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_heftggbb_mad/log_heftggbb_mad_f_inl0_hrd0.txt
12 /data/avalassi/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_heftggbb_mad/log_heftggbb_mad_m_inl0_hrd0.txt
12 /data/avalassi/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_smeftggtttt_mad/log_smeftggtttt_mad_d_inl0_hrd0.txt
12 /data/avalassi/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_smeftggtttt_mad/log_smeftggtttt_mad_f_inl0_hrd0.txt
12 /data/avalassi/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_smeftggtttt_mad/log_smeftggtttt_mad_m_inl0_hrd0.txt
12 /data/avalassi/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_susyggt1t1_mad/log_susyggt1t1_mad_d_inl0_hrd0.txt
12 /data/avalassi/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_susyggt1t1_mad/log_susyggt1t1_mad_f_inl0_hrd0.txt
12 /data/avalassi/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_susyggt1t1_mad/log_susyggt1t1_mad_m_inl0_hrd0.txt
12 /data/avalassi/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_susyggtt_mad/log_susyggtt_mad_d_inl0_hrd0.txt
12 /data/avalassi/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_susyggtt_mad/log_susyggtt_mad_f_inl0_hrd0.txt
12 /data/avalassi/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/logs_susyggtt_mad/log_susyggtt_mad_m_inl0_hrd0.txt

tmad/logs_heftggbb_mad/log_heftggbb_mad_f_inl0_hrd0.txt:ERROR! events.lhe.cpp.1 and events.lhe.ref.1 differ!

No asserts found in logs

No segmentation fault found in logs
…_global__ INLINE

CPPProcess.cc(235): warning #20050-D: inline qualifier ignored for "__global__" function
Remark: The warnings can be suppressed with "-diag-suppress <warning-number>"
…cesses, remove INLINE from __global__ to fix build warnings
@oliviermattelaer

Thanks a lot, Andrea,

The conflict was not relevant (a comment), so I just fixed it (I guess, at least).

So I will start the review!

Thanks a lot,

Olivier

@oliviermattelaer left a comment

Hi Andrea,

One point is worrying me in this MR, but to be honest it sounds disconnected from your changes. Let me check that specific point (i.e. the format of the color matrix) before approving.

There are also two points of detail that should be fine to leave out of your PR.

Thanks,

Olivier

{
  for( int icol = 0; icol < ncolor; icol++ )
    for( int jcol = 0; jcol < ncolor; jcol++ )
      value[icol * ncolor + jcol] = colorMatrix[icol][jcol] / colorDenom[icol];

Point to check:
The main version of MG5 moves from a full matrix to an upper-triangular representation
(and to a single denominator for the full matrix).

So here there is a risk that the color_matrix_lines templating is not returning the correct format for this information.

From a "hardware optimization" point of view, would it not be better to keep the integer matrix colorMatrix (either upper-triangular or the full matrix) rather than the float matrix "value" (the format here is irrelevant since it is flattened)? Especially if colorDenom is a single integer with which we can normalise the output of the computation?
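
A minimal sketch of what the suggestion above could look like, assuming a symmetric integer color matrix stored as its upper triangle plus one common integer denominator. All names here (ColorMatrixUpper, cfUpper, denom) are illustrative, not the actual MG5AMC/cudacpp data structures.

#include <utility>
#include <vector>

struct ColorMatrixUpper
{
  int ncolor;               // number of color flows
  std::vector<int> cfUpper; // upper triangle, row-major: ncolor*(ncolor+1)/2 integer entries
  int denom;                // single common denominator for the whole matrix

  // accessor exploiting the symmetry cf[i][j] == cf[j][i]
  int operator()( int icol, int jcol ) const
  {
    if( icol > jcol ) std::swap( icol, jcol );
    return cfUpper[icol * ncolor - icol * ( icol - 1 ) / 2 + ( jcol - icol )];
  }
};

// The color sum could then accumulate with integer matrix entries and divide by denom only once,
// e.g. me += jampRe[icol] * cf( icol, jcol ) * jampRe[jcol]; ... me /= cf.denom;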

{
// nprocesses == 2 may happen for "mirror processes" such as P0_uux_ttx within pp_tt012j (see PR #754)
constexpr int nprocesses = %(nproc)i;
constexpr int nprocesses = 1;

Suggested change
constexpr int nprocesses = 1;
constexpr int nprocesses = %(nproc)i;

Is there a reason to revert this change? If yes, why keep the assert?

constexpr int nprocesses = 1;
static_assert( nprocesses == 1 || nprocesses == 2, "Assume nprocesses == 1 or 2" );
constexpr int process_id = %(proc_id)i; // code generation source: %(proc_id_source)s
constexpr int process_id = 1; // code generation source: standalone_cudacpp

Suggested change
constexpr int process_id = 1; // code generation source: standalone_cudacpp
constexpr int process_id = %(proc_id)i; // code generation source: %(proc_id_source)s

A similar comment, but here it sounds less impactful (given the assert); still, it is better to have the dynamical version than to blindly assume what the value is.
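
For context, a purely illustrative expansion of what the templated lines would produce for a mirror process; the values below are made up, the real ones come from the %(nproc)i / %(proc_id)i / %(proc_id_source)s substitutions at code-generation time.

// Illustrative expansion only (values are hypothetical, not taken from an actual generated process)
constexpr int nprocesses = 2; // e.g. a mirror process such as P0_uux_ttx within pp_tt012j (see PR #754)
static_assert( nprocesses == 1 || nprocesses == 2, "Assume nprocesses == 1 or 2" );
constexpr int process_id = 1; // code generation source: madevent (hypothetical value)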

template = open(pjoin(self.template_path,'gpu','color_sum.cc'),'r').read()
replace_dict = {}
# Extract color matrix again (this was also in get_matrix_single_process called within get_all_sigmaKin_lines)
replace_dict['color_matrix_lines'] = self.get_color_matrix_lines(self.matrix_elements[0])

This comment pinpoints the previous point of attention:
this is where we need to check whether the format is still correct.
(If it is, we likely want to create an issue to update it; if not, there is no choice other than fixing it.)
