Skip to content

Conversation

giordano
Copy link
Member

No description provided.

@giordano giordano added the github_actions Pull requests that update GitHub Actions code label Jul 29, 2025
@giordano giordano force-pushed the mg/gb-25 branch 4 times, most recently from 7064a04 to f4b49e9 Compare July 29, 2025 17:53
@giordano
Copy link
Member Author

ERROR: /root/.bazel/external/xla/xla/BUILD:1237:11: Compiling xla/cpu_function_runtime.cc failed: absolute path inclusion(s) found in rule '@@xla//xla:cpu_function_runtime':
the source file 'xla/cpu_function_runtime.cc' includes the following non-builtin files with absolute paths (if these are builtin files, make sure these paths are in your toolchain):
  '/root/.bazel/external/eigen_archive/Eigen/src/Core/InternalHeaderCheck.h'
  '/root/.bazel/external/eigen_archive/Eigen/src/Core/InternalHeaderCheck.h'

Wat

@wsmoses
Copy link
Member

wsmoses commented Jul 29, 2025

er, wat

@GleasonK
Copy link
Contributor

I wonder if it has to do with using include <Eigen/Core> vs include "Eigen/Core", looks like we have both patterns floating around: https://github.com/search?q=repo%3Aopenxla%2Fxla%20Eigen%2FCore&type=code

Workaround in a few other similar bugs was adding those paths to system include dirs.. but that's not ideal

@wsmoses
Copy link
Member

wsmoses commented Jul 29, 2025

if we use clang we get:


1 warning generated.
INFO: From Compiling upb/mini_table/extension_registry.c:
clang: warning: argument unused during compilation: '--cuda-path=external/cuda_nvcc' [-Wunused-command-line-argument]
ERROR: /root/.bazel/external/boringssl/BUILD:133:11: Compiling err_data.c failed: (Exit 1): process-wrapper failed: error executing CppCompile command 
  (cd /root/.bazel/sandbox/processwrapper-sandbox/358/execroot/__main__ && \
  exec env - \
    GRPC_BAZEL_RUNTIME=1 \
    JULIA=/__w/_tool/julia/1.11.6/x64/bin/julia \
    PATH=/__w/.cache/bazel/bazelisk/downloads/sha256/c97f02133adce63f0c28678ac1f21d65fa8255c80429b588aeeba8a1fac6202b/bin:/__w/_tool/julia/1.11.6/x64/bin:/__w/_tool/bazelisk/1.x/x64:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin \
    PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=upb \
    *** \
    TMPDIR=/tmp \
  /__w/.cache/bazel/bazel/_bazel_root/install/81618c1cfcf8a55fe29d247a9003bce4/process-wrapper '--timeout=0' '--kill_delay=15' '--stats=/root/.bazel/sandbox/processwrapper-sandbox/358/stats.out' external/local_config_cuda/crosstool/clang/bin/crosstool_wrapper_driver_is_not_gcc -MD -MF bazel-out/k8-opt/bin/external/boringssl/_objs/crypto/err_data.pic.d '-frandom-seed=bazel-out/k8-opt/bin/external/boringssl/_objs/crypto/err_data.pic.o' -iquote external/boringssl -iquote bazel-out/k8-opt/bin/external/boringssl -isystem external/boringssl/src/include -isystem bazel-out/k8-opt/bin/external/boringssl/src/include -Wno-builtin-macro-redefined '-D__DATE__="redacted"' '-D__TIMESTAMP__="redacted"' '-D__TIME__="redacted"' -fPIC -U_FORTIFY_SOURCE '-D_FORTIFY_SOURCE=1' -fstack-protector -Wall -fno-omit-frame-pointer -no-canonical-prefixes -DNDEBUG -g0 -O2 -ffunction-sections -fdata-sections '--cuda-path=external/cuda_nvcc' -DGRPC_BAZEL_BUILD -DBORINGSSL_IMPLEMENTATION -Wa,--noexecstack -Wall -Werror '-Wformat=2' -Wsign-compare -Wmissing-field-initializers -Wwrite-strings -Wshadow -fno-common '-D_XOPEN_SOURCE=700' '-std=c11' -Wmissing-prototypes -Wold-style-definition -Wstrict-prototypes -c external/boringssl/err_data.c -o bazel-out/k8-opt/bin/external/boringssl/_objs/crypto/err_data.pic.o)
clang: error: argument unused during compilation: '--cuda-path=external/cuda_nvcc' [-Werror,-Wunused-command-line-argument]
[3,353 / 16,619] [Prepa] action 'SolibSymlink _solib_local/_U_A_Acuda_Ucublas_S_S_CcublasLt_Ushared_Ulibrary___Ulib/libcublasLt.so.12'
[3,354 / 16,619] checking cached actions
Target //:libReactantExtra.so failed to build
INFO: Elapsed time: 274.757s, Critical Path: 4.72s
INFO: 3354 processes: 8 disk cache hit, 3035 internal, 311 processwrapper-sandbox.

@giordano
Copy link
Member Author

Why -Werror, why.

@wsmoses wsmoses marked this pull request as ready for review July 29, 2025 22:56
@giordano
Copy link
Member Author

giordano commented Jul 30, 2025

I think this is basically ready now, but the question is what to do with it? It takes 45 minutes only to recompile xla/libreactant every time (bazel cache doesn't seem to be very effective here), pushing total runtime to over 70 minutes, not exactly a quick turnaround. And we have only two runners of these, the queue would get backed up very quickly when there are more than one pull request being worked on at the same time.

@wsmoses
Copy link
Member

wsmoses commented Jul 30, 2025

can we add a new matrix of XLA commits, default empty string, with an optional hash. This would be exceptionally useful for ablation tests (including the comms).

We should be able to fix the cache issue shortly so I'm also fine for now temporary using more resources

@wsmoses
Copy link
Member

wsmoses commented Jul 30, 2025

also if it speeds things up we can elect to do the non super verbose xla dump

@giordano
Copy link
Member Author

also if it speeds things up we can elect to do the non super verbose xla dump

That I've already done, it was timing out with --xla_dump_hlo_pass_re=.* 😄

@wsmoses
Copy link
Member

wsmoses commented Jul 30, 2025

Ah fair.

In any case, let's set up the xla commit part, and ablate the comm pr

@giordano
Copy link
Member Author

giordano commented Jul 30, 2025

Is 6965da3 (#1197) what you were thinking for XLA? You would change 0123456789abcdef0123456789abcdef01234567 as you need, besides uncommenting that line.

sed -i.bak 's/ENZYMEXLA_COMMIT = ".*"/ENZYMEXLA_COMMIT = "${{ github.sha }}"/' ReactantExtra/WORKSPACE
# Modify XLA commit
# sed -E -i.bak -e 's/(# )?XLA_COMMIT = ".*"/XLA_COMMIT = "0123456789abcdef0123456789abcdef01234567"/' -e 's/(# )?XLA_SHA256 = ""/XLA_SHA256 = ""/' ReactantExtra/WORKSPACE
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two things:

  1. we also need to comment out or delete the load of the xla commit from Jax
  2. we should match and hash

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah jk I see you do 2

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should also add a series of targets, like
Xla_hash:

  • ""
  • "abcd..."

To the github actions yml, and the have it use the hash if non empty

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Uhm, not sure how you mean exactly. Do you have an example?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, done in 09919a3 (#1197)

@giordano

This comment was marked as resolved.

@giordano
Copy link
Member Author

giordano commented Jul 30, 2025

Note that we also upload the profile traces (example), which show that NCCL communication is still a sizable fraction of the whole runtime (almost 18% in this trace)
image

@wsmoses
Copy link
Member

wsmoses commented Jul 30, 2025

lets see how the default compares against openxla/xla#29448 [both in terms of overall runtime and also # of all-gathers / all-reduces].

cc @felixwqp @frgossen

@wsmoses
Copy link
Member

wsmoses commented Jul 30, 2025

[in a follow up we should also ablate the impact of https://github.com/EnzymeAD/Reactant.jl/pull/1496]

@wsmoses
Copy link
Member

wsmoses commented Jul 31, 2025

okay all gathers arent in the optimized code which is good, seemingly just all-reduces to go

@wsmoses wsmoses merged commit af5f295 into main Jul 31, 2025
2 of 4 checks passed
@wsmoses wsmoses deleted the mg/gb-25 branch July 31, 2025 05:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

github_actions Pull requests that update GitHub Actions code

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants