[CI] Add GHA job to test downstream GB-25 #1197

giordano · 2025-07-29T17:38:39Z

No description provided.

giordano · 2025-07-29T18:11:10Z

ERROR: /root/.bazel/external/xla/xla/BUILD:1237:11: Compiling xla/cpu_function_runtime.cc failed: absolute path inclusion(s) found in rule '@@xla//xla:cpu_function_runtime':
the source file 'xla/cpu_function_runtime.cc' includes the following non-builtin files with absolute paths (if these are builtin files, make sure these paths are in your toolchain):
  '/root/.bazel/external/eigen_archive/Eigen/src/Core/InternalHeaderCheck.h'
  '/root/.bazel/external/eigen_archive/Eigen/src/Core/InternalHeaderCheck.h'

Wat

wsmoses · 2025-07-29T18:12:13Z

er, wat

GleasonK · 2025-07-29T18:53:57Z

I wonder if it has to do with using #include <Eigen/Core> vs #include "Eigen/Core"; looks like we have both patterns floating around: https://github.com/search?q=repo%3Aopenxla%2Fxla%20Eigen%2FCore&type=code

The workaround in a few other similar bugs was adding those paths to the system include dirs, but that's not ideal.
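
To get a rough count of how mixed the two include styles are in a local checkout, something like the following could be used. This is only a sketch: `count_eigen_include_styles` and `scan_tree` are hypothetical helpers (not part of XLA's tooling), and the file suffixes are an assumption.

```python
import re
from collections import Counter
from pathlib import Path

# Matches `#include <Eigen/...>` (angle form) and `#include "Eigen/..."` (quoted form).
EIGEN_INCLUDE = re.compile(r'^\s*#\s*include\s*([<"])Eigen/[^>"]+[>"]', re.MULTILINE)


def count_eigen_include_styles(text: str) -> Counter:
    """Count angle-bracket vs. quoted Eigen includes in one source text."""
    counts = Counter()
    for match in EIGEN_INCLUDE.finditer(text):
        counts["angle" if match.group(1) == "<" else "quote"] += 1
    return counts


def scan_tree(root: str, suffixes=(".h", ".cc")) -> Counter:
    """Sum the counts over all files under `root` with the given suffixes."""
    total = Counter()
    for path in Path(root).rglob("*"):
        if path.is_file() and path.suffix in suffixes:
            total += count_eigen_include_styles(path.read_text(errors="ignore"))
    return total
```

Calling `scan_tree` on the root of a checkout returns a `Counter` with `"angle"` and `"quote"` keys. It only reports which style is used where and says nothing about whether the style is actually the cause of the absolute-path error.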

wsmoses · 2025-07-29T21:44:57Z

if we use clang we get:


1 warning generated.
INFO: From Compiling upb/mini_table/extension_registry.c:
clang: warning: argument unused during compilation: '--cuda-path=external/cuda_nvcc' [-Wunused-command-line-argument]
ERROR: /root/.bazel/external/boringssl/BUILD:133:11: Compiling err_data.c failed: (Exit 1): process-wrapper failed: error executing CppCompile command 
  (cd /root/.bazel/sandbox/processwrapper-sandbox/358/execroot/__main__ && \
  exec env - \
    GRPC_BAZEL_RUNTIME=1 \
    JULIA=/__w/_tool/julia/1.11.6/x64/bin/julia \
    PATH=/__w/.cache/bazel/bazelisk/downloads/sha256/c97f02133adce63f0c28678ac1f21d65fa8255c80429b588aeeba8a1fac6202b/bin:/__w/_tool/julia/1.11.6/x64/bin:/__w/_tool/bazelisk/1.x/x64:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin \
    PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=upb \
    *** \
    TMPDIR=/tmp \
  /__w/.cache/bazel/bazel/_bazel_root/install/81618c1cfcf8a55fe29d247a9003bce4/process-wrapper '--timeout=0' '--kill_delay=15' '--stats=/root/.bazel/sandbox/processwrapper-sandbox/358/stats.out' external/local_config_cuda/crosstool/clang/bin/crosstool_wrapper_driver_is_not_gcc -MD -MF bazel-out/k8-opt/bin/external/boringssl/_objs/crypto/err_data.pic.d '-frandom-seed=bazel-out/k8-opt/bin/external/boringssl/_objs/crypto/err_data.pic.o' -iquote external/boringssl -iquote bazel-out/k8-opt/bin/external/boringssl -isystem external/boringssl/src/include -isystem bazel-out/k8-opt/bin/external/boringssl/src/include -Wno-builtin-macro-redefined '-D__DATE__="redacted"' '-D__TIMESTAMP__="redacted"' '-D__TIME__="redacted"' -fPIC -U_FORTIFY_SOURCE '-D_FORTIFY_SOURCE=1' -fstack-protector -Wall -fno-omit-frame-pointer -no-canonical-prefixes -DNDEBUG -g0 -O2 -ffunction-sections -fdata-sections '--cuda-path=external/cuda_nvcc' -DGRPC_BAZEL_BUILD -DBORINGSSL_IMPLEMENTATION -Wa,--noexecstack -Wall -Werror '-Wformat=2' -Wsign-compare -Wmissing-field-initializers -Wwrite-strings -Wshadow -fno-common '-D_XOPEN_SOURCE=700' '-std=c11' -Wmissing-prototypes -Wold-style-definition -Wstrict-prototypes -c external/boringssl/err_data.c -o bazel-out/k8-opt/bin/external/boringssl/_objs/crypto/err_data.pic.o)
clang: error: argument unused during compilation: '--cuda-path=external/cuda_nvcc' [-Werror,-Wunused-command-line-argument]
[3,353 / 16,619] [Prepa] action 'SolibSymlink _solib_local/_U_A_Acuda_Ucublas_S_S_CcublasLt_Ushared_Ulibrary___Ulib/libcublasLt.so.12'
[3,354 / 16,619] checking cached actions
Target //:libReactantExtra.so failed to build
INFO: Elapsed time: 274.757s, Critical Path: 4.72s
INFO: 3354 processes: 8 disk cache hit, 3035 internal, 311 processwrapper-sandbox.

giordano · 2025-07-29T21:47:52Z

Why -Werror, why.

giordano · 2025-07-30T22:06:25Z

I think this is basically ready now, but the question is what to do with it. It takes 45 minutes just to recompile xla/libreactant every time (the bazel cache doesn't seem to be very effective here), pushing the total runtime to over 70 minutes, which is not exactly a quick turnaround. And we have only two of these runners, so the queue would get backed up very quickly when more than one pull request is being worked on at the same time.
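
The queue concern can be put into numbers with a back-of-the-envelope sketch. It uses the roughly 70-minute job and the two runners mentioned above; the assumptions that all jobs are queued at the same moment and that each runner handles one job at a time are illustrative only, and `minutes_until_last_job_done` is a hypothetical helper.

```python
import math


def minutes_until_last_job_done(jobs: int, runners: int = 2, job_minutes: int = 70) -> int:
    """Wall-clock minutes until the last of `jobs` simultaneously queued jobs finishes."""
    if jobs <= 0:
        return 0
    # Jobs run in "waves" of at most `runners` jobs each.
    waves = math.ceil(jobs / runners)
    return waves * job_minutes
```

Under these assumptions, three jobs queued at once already mean the last one finishes after 140 minutes, and five jobs push that to 210 minutes.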

wsmoses · 2025-07-30T22:23:07Z

Can we add a new matrix of XLA commits, defaulting to the empty string, with an optional hash? This would be exceptionally useful for ablation tests (including the comms).

We should be able to fix the cache issue shortly, so I'm also fine with temporarily using more resources for now.

wsmoses · 2025-07-30T22:24:06Z

also if it speeds things up we can elect to do the non super verbose xla dump

giordano · 2025-07-30T22:32:36Z

> also if it speeds things up we can elect to do the non super verbose xla dump

That I've already done, it was timing out with --xla_dump_hlo_pass_re=.* 😄
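
For reference, toggling between the two dump levels could look like the sketch below. It assumes the flags reach XLA through the `XLA_FLAGS` environment variable and that the workflow dumps into a directory; `build_xla_flags` and the `xla_dump` directory name are hypothetical and not taken from the workflow file.

```python
def build_xla_flags(dump_dir: str, dump_all_passes: bool = False) -> str:
    """Build an XLA_FLAGS string; the per-pass dump is opt-in because it is slow."""
    flags = [f"--xla_dump_to={dump_dir}"]
    if dump_all_passes:
        # Dumps HLO after every pass; this is the setting that timed out in CI.
        flags.append("--xla_dump_hlo_pass_re=.*")
    return " ".join(flags)
```

The returned string would then be exported as `XLA_FLAGS` before launching the simulation, with `dump_all_passes=True` reserved for runs where the extra time is acceptable.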

wsmoses · 2025-07-30T22:33:41Z

Ah fair.

In any case, let's set up the xla commit part, and ablate the comm pr

giordano · 2025-07-30T22:49:38Z

Is 6965da3 (#1197) what you were thinking for XLA? You would change 0123456789abcdef0123456789abcdef01234567 as you need, besides uncommenting that line.

wsmoses · 2025-07-30T22:53:29Z

.github/workflows/test-gb-25.yml

          sed -i.bak 's/ENZYMEXLA_COMMIT = ".*"/ENZYMEXLA_COMMIT = "${{ github.sha }}"/' ReactantExtra/WORKSPACE

+          # Modify XLA commit
+          # sed -E -i.bak -e 's/(# )?XLA_COMMIT = ".*"/XLA_COMMIT = "0123456789abcdef0123456789abcdef01234567"/' -e 's/(# )?XLA_SHA256 = ""/XLA_SHA256 = ""/' ReactantExtra/WORKSPACE


Two things:

1. we also need to comment out or delete the load of the xla commit from Jax
2. we should match and hash

Ah jk I see you do 2

We should also add a series of targets, like

Xla_hash:
  - ""
  - "abcd..."

to the GitHub Actions yml, and then have it use the hash if non-empty.

Uhm, not sure how you mean exactly. Do you have an example?

Ok, done in 09919a3 (#1197)
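
The "use the hash only if non-empty" logic discussed in this thread can be sketched outside of sed as follows. This is an illustration, not the code from 09919a3: `pin_xla_commit` is a hypothetical helper, and it assumes the `XLA_COMMIT` and `XLA_SHA256` assignments start at the beginning of a line in `ReactantExtra/WORKSPACE`. Unlike the sed expression above, the patterns are anchored to the line start, because an unanchored `XLA_COMMIT = ".*"` also matches as a substring of the `ENZYMEXLA_COMMIT = "..."` line.

```python
import re


def pin_xla_commit(workspace_text: str, xla_commit: str) -> str:
    """Return WORKSPACE text with XLA_COMMIT pinned, or unchanged for an empty hash."""
    if not xla_commit:
        # Empty matrix entry: keep whatever XLA version the WORKSPACE already uses.
        return workspace_text
    text = re.sub(
        r'^(# )?XLA_COMMIT = ".*"',
        f'XLA_COMMIT = "{xla_commit}"',
        workspace_text,
        flags=re.MULTILINE,
    )
    # Uncomment the (empty) checksum line so it is consistent with the new commit.
    return re.sub(r'^(# )?XLA_SHA256 = ""', 'XLA_SHA256 = ""', text, flags=re.MULTILINE)
```

With an empty string the text passes through untouched, which matches the idea of a matrix whose default entry is `""`.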

giordano · 2025-07-30T23:41:45Z

Note that we also upload the profile traces (example), which show that NCCL communication is still a sizable fraction of the whole runtime (almost 18% in this trace)

wsmoses · 2025-07-30T23:50:46Z

Let's see how the default compares against openxla/xla#29448 [both in terms of overall runtime and also # of all-gathers / all-reduces].

cc @felixwqp @frgossen

wsmoses · 2025-07-30T23:51:13Z

[in a follow up we should also ablate the impact of https://github.com/EnzymeAD/Reactant.jl/pull/1496]

wsmoses · 2025-07-31T03:18:29Z

Okay, all-gathers aren't in the optimized code, which is good; seemingly just all-reduces to go.
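
Counting the collectives in a dumped module can be automated with a small script. The sketch below assumes the dump is HLO text in which an op appears as `all-gather(`, `all-reduce(`, or its async `-start(` form; `count_collectives` is a hypothetical helper, and the exact dump layout may differ.

```python
import re
from collections import Counter

# Counts sync ops and async "-start" ops; "-done" is skipped to avoid double counting.
COLLECTIVE_OP = re.compile(r"\b(all-gather|all-reduce)(?:-start)?\(")


def count_collectives(hlo_text: str) -> Counter:
    """Count all-gather and all-reduce ops in HLO text."""
    return Counter(match.group(1) for match in COLLECTIVE_OP.finditer(hlo_text))
```

Running this on the optimized module from each matrix entry gives a quick way to compare the number of all-gathers and all-reduces between XLA commits.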

giordano added the github_actions label Jul 29, 2025

giordano force-pushed the mg/gb-25 branch 4 times, most recently from 7064a04 to f4b49e9 Compare July 29, 2025 17:53

giordano force-pushed the mg/gb-25 branch from 223d8dc to f7cbf92 Compare July 29, 2025 18:27

giordano and others added 7 commits July 29, 2025 21:05

[CI] Add GHA job to test downstream GB-25

4048dc9

[CI] Temporarily disable repository cache

2907d3e

Revert "[CI] Temporarily disable repository cache"

6cd7378

This reverts commit f7cbf92.

[CI] Upload build directory for debugging

7c8ab16

Update test-gb-25.yml

11ea4c1

Update test-gb-25.yml

076d3b8

Use tmate for debugging

0ec4c8e

giordano force-pushed the mg/gb-25 branch from cd96c94 to 0ec4c8e Compare July 29, 2025 20:06

wsmoses added 4 commits July 29, 2025 15:05

Update test-gb-25.yml

e6fcb7f

Update test-gb-25.yml

10290d8

Update test-gb-25.yml

3a58f96

Update test-gb-25.yml

16bb5c5

[CI] Change Enzyme-JAX commit used for compiling libReactantExtra

622cb49

wsmoses marked this pull request as ready for review July 29, 2025 22:56

wsmoses and others added 5 commits July 29, 2025 16:57

Update test-gb-25.yml

853e142

[CI] Set clang as compiler

e652e15

[CI] Restore bazel cache

9f4cbb2

[CI] Run tmate only on failure (will eventually remove it)

89bf7af

[CI] Checkout GB-25 repo

8270a24

giordano and others added 2 commits July 30, 2025 09:31

Update test-gb-25.yml

e87bb97

[CI] Run GB-25 simulation

5c091dc

giordano force-pushed the mg/gb-25 branch from dd5b3d3 to 5c091dc Compare July 30, 2025 10:54

giordano added 3 commits July 30, 2025 15:20

[CI] Add timeouts

bbacbd6

[CI] Disable earlyoom

c57c507

[CI] Upload artifacts

e3d81e5

giordano force-pushed the mg/gb-25 branch from b06d290 to e3d81e5 Compare July 30, 2025 15:34

giordano and others added 2 commits July 30, 2025 16:39

[CI] Also dump all XLA passes

7d8adee

Longer timeouts

1be36f4

[CI] Add code (commented out) to change XLA commit

6965da3

wsmoses reviewed Jul 30, 2025

[CI] Also remove loading of XLA from JAX

cf9f1a2

[CI] Add XLA, GB-25, and Reactant commits as matrix elements

09919a3

wsmoses and others added 2 commits July 31, 2025 00:51

Update test-gb-25.yml

6d10a77

[CI] Fix paths of artifacts to upload

4cfd560

giordano force-pushed the mg/gb-25 branch from bea2eb0 to 4cfd560 Compare July 30, 2025 23:51

Update test-gb-25.yml

f1afb2e

wsmoses merged commit af5f295 into main Jul 31, 2025
2 of 4 checks passed

wsmoses deleted the mg/gb-25 branch July 31, 2025 05:18
