[CI] Add GHA job to test downstream GB-25 #1197
Force-pushed from 7064a04 to f4b49e9
Wat
er, wat
I wonder if it has to do with using [...]. A workaround in a few other similar bugs was adding those paths to the system include dirs, but that's not ideal.
This reverts commit f7cbf92.
if we use clang we get:
Why -Werror, why.
I think this is basically ready now, but the question is what to do with it. It takes 45 minutes just to recompile xla/libreactant every time (the bazel cache doesn't seem to be very effective here), pushing the total runtime to over 70 minutes, which is not exactly a quick turnaround. And we have only two of these runners, so the queue would get backed up very quickly when more than one pull request is being worked on at the same time.
Can we add a new matrix of XLA commits, defaulting to an empty string, with an optional hash? This would be exceptionally useful for ablation tests (including the comms). We should be able to fix the cache issue shortly, so I'm also fine with temporarily using more resources for now.
Also, if it speeds things up, we can elect to do the non-super-verbose xla dump.
That I've already done, it was timing out with
Ah fair. In any case, let's set up the xla commit part, and ablate the comm pr
Is
.github/workflows/test-gb-25.yml
Outdated
sed -i.bak 's/ENZYMEXLA_COMMIT = ".*"/ENZYMEXLA_COMMIT = "${{ github.sha }}"/' ReactantExtra/WORKSPACE
# Modify XLA commit
# sed -E -i.bak -e 's/(# )?XLA_COMMIT = ".*"/XLA_COMMIT = "0123456789abcdef0123456789abcdef01234567"/' -e 's/(# )?XLA_SHA256 = ""/XLA_SHA256 = ""/' ReactantExtra/WORKSPACE
Two things:
- we also need to comment out or delete the load of the xla commit from Jax
- we should match and hash
Ah jk I see you do 2
We should also add a series of targets, like
Xla_hash:
- ""
- "abcd..."
to the GitHub Actions yml, and then have it use the hash if non-empty
Uhm, not sure how you mean exactly. Do you have an example?
Ok, done in 09919a3
(#1197)
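A minimal sketch of the "use the hash if non-empty" logic discussed in this thread, assuming the workflow passes the matrix entry to a shell step as a plain string. The function name `apply_xla_commit` and the exact shape of the `XLA_COMMIT` line in the WORKSPACE file are assumptions for illustration, not taken from the actual workflow; only the general form of the `sed` expression follows the one quoted in the diff.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: pin XLA to a given commit in a WORKSPACE-style file.
# $1 = path to the file, $2 = commit hash (empty string means "keep the default").
apply_xla_commit() {
    local workspace="$1"
    local commit="$2"
    if [ -z "${commit}" ]; then
        echo "No XLA commit override requested, leaving ${workspace} untouched"
        return 0
    fi
    # Uncomment the line if it carries a "# " prefix and replace the hash.
    # The leading ^ anchor keeps the pattern from also matching the tail of
    # an ENZYMEXLA_COMMIT = "..." line.
    sed -E -i.bak -e "s/^(# )?XLA_COMMIT = \".*\"/XLA_COMMIT = \"${commit}\"/" "${workspace}"
}
```

In a workflow step this could be called as `apply_xla_commit ReactantExtra/WORKSPACE "$XLA_HASH"`, where `XLA_HASH` is a hypothetical environment variable filled from the matrix entry. An empty entry then exercises the default XLA commit, and a non-empty one pins the given hash.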
Note that we also upload the profile traces (example), which show that NCCL communication is still a sizable fraction of the whole runtime (almost 18% in this trace).
Let's see how the default compares against openxla/xla#29448 [both in terms of overall runtime and also # of all-gathers / all-reduces].
[In a follow-up we should also ablate the impact of https://github.com/EnzymeAD/Reactant.jl/pull/1496]
Okay, all-gathers aren't in the optimized code, which is good; seemingly just all-reduces to go.
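One rough way to compare the number of all-gathers and all-reduces between two builds is to count the collective ops in the dumped HLO text. The sketch below assumes the dump directory contains `*.txt` files in which each collective appears as `all-gather(` or `all-reduce(`, with the async form written as `all-reduce-start(`. The function name `count_collectives` and the file glob are hypothetical, and which dump files correspond to the optimized module depends on the dump layout.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: count occurrences of a collective op in HLO text dumps.
# $1 = dump directory, $2 = op name, e.g. "all-reduce" or "all-gather".
count_collectives() {
    local dump_dir="$1"
    local op="$2"
    # Match "op(" and "op-start(" but not "op-done(", so each async pair is
    # counted once. grep -o prints one line per match; "|| true" keeps the
    # pipeline alive when grep finds nothing and exits with status 1.
    { grep -rhoE --include='*.txt' "${op}(-start)?\(" "${dump_dir}" || true; } | wc -l | tr -d ' '
}
```

Running `count_collectives <dump-dir> all-reduce` for the default build and again for the build with the pinned XLA commit gives two numbers to put next to the overall runtimes. If the dump directory holds both pre- and post-optimization modules, point the function at only the optimized files, otherwise the counts mix both.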
No description provided.