Profiling Infrastructure #354

Snektron · 2025-09-09T20:33:29Z

Description

This PR adds infrastructure for handling profiling data. Any data placed in the profile_data/ directory on the GitHub runner is exported to the user via Discord when EvalResult.profile_result is set. Discord attachments cannot handle the size of these artifacts (typically up to ~35 MB), so I've opted to present a direct link to the GitHub artifact instead. To this end, I've modified the GH launcher to provide an 'index' of artifacts, each of which can then either be downloaded by the bot or presented as a download link.

I've also fixed some minor bugs in run_eval.py related to fetching ROCm system info, and added extra runtime info to SystemInfo (useful elsewhere in the evaluation process once the actual profiling is added).

Some caveats:

Currently, users can export any file as a download by profiling and writing to profile_data/. I think the only way around that would be to launch the profiling process with higher privileges and then drop those privileges in eval.py. Lmk if you guys want that.
Profiling artifacts aren't protected: anyone with a GH account can download them, possibly giving away information about a solution. They shouldn't contain kernel source, just kernel names + run order.

Extracted from #339


github-actions · 2025-09-09T20:34:24Z

Coverage report

File	Statements	Missing	Coverage	Coverage (new stmts)	Lines missing
src/libkernelbot
report.py					60
Project Total

This report was generated by python-coverage-comment-action

A new ProfileResult type is added to run_eval and is returned as part of the EvalResult type. Among other fields, it contains the `download_url` field, which the user should use to download profiling data. Note that the actual public download link may not be known in run_eval.py. In that case, the intention is that the launcher fixes up `download_url` before returning the results to libkernelbot.
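The fix-up step could look roughly like the sketch below. Only the `download_url` field is taken from this commit message; the `profiler` field, the helper name `fix_up_download_url`, and the use of a frozen dataclass are assumptions made for illustration, not the actual definitions in run_eval.py.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class ProfileResult:
    # Hypothetical field: which profiler produced the data.
    profiler: str
    # Link the user follows to fetch the data; may be unknown in run_eval.py.
    download_url: Optional[str] = None


def fix_up_download_url(result: ProfileResult, public_url: str) -> ProfileResult:
    """Return a copy of `result` whose download_url is set to `public_url`.

    Hypothetical launcher-side helper: run_eval.py leaves the field empty,
    and the launcher fills it in once the public link is known.
    """
    return replace(result, download_url=public_url)


if __name__ == "__main__":
    pending = ProfileResult(profiler="rocprof")
    fixed = fix_up_download_url(pending, "https://example.invalid/artifact")
    print(fixed.download_url)
```

Returning a modified copy instead of mutating in place keeps the result produced by run_eval.py unchanged, which makes it easy to tell whether a launcher has already patched the link.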

The new function `GitHubRun.get_artifact_index` returns a dict of the artifacts available from the run. For each artifact, it returns the GitHub API URL and the public download URL. The latter is not available directly from the GitHub API, but it can easily be constructed from the data available in the workflow result. `download_artifacts` is replaced by a function that downloads a specific artifact rather than all of them. Additionally, the function no longer writes to a temp file when downloading the artifact; the result of the download request can be piped directly into zipfile using BytesIO.
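A rough sketch of both ideas, under stated assumptions: the names `ArtifactLinks`, `build_artifact_index` and `unzip_in_memory` are hypothetical, the input is assumed to mirror the `artifacts` array of the GitHub REST response (with `id`, `name` and `archive_download_url` keys), and the public URL pattern is an assumption about how GitHub lays out artifact links rather than something taken from this PR's code.

```python
import io
import zipfile
from dataclasses import dataclass


@dataclass(frozen=True)
class ArtifactLinks:
    api_url: str     # REST endpoint that serves the zipped artifact
    public_url: str  # browser link, assembled from repo, run id and artifact id


def build_artifact_index(repo: str, run_id: int, artifacts: list[dict]) -> dict[str, ArtifactLinks]:
    """Map each artifact name to its API URL and public download URL."""
    index = {}
    for art in artifacts:
        # Assumed URL layout; the REST API does not return this link itself.
        public = f"https://github.com/{repo}/actions/runs/{run_id}/artifacts/{art['id']}"
        index[art["name"]] = ArtifactLinks(art["archive_download_url"], public)
    return index


def unzip_in_memory(payload: bytes) -> dict[str, bytes]:
    """Unpack a zip archive held in memory, without writing a temp file."""
    with zipfile.ZipFile(io.BytesIO(payload)) as archive:
        return {name: archive.read(name) for name in archive.namelist()}
```

In this sketch the bytes passed to `unzip_in_memory` would be the body of the HTTP response for a single artifact, so only the artifact that is actually needed gets downloaded and unpacked.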

The idea is that run_eval.py places profiling data in the profile_data/ directory, which is then automatically exported to the user. This is done by uploading that directory as the 'profile-data' artifact, then fetching its public download link and returning that as ProfileResult.download_url.

msaroufim · 2025-09-10T02:48:38Z

The current caveats seem fine to me; the retention policy means you don't have much time to be abusive. Feel free to merge.

Simplify run_single_evaluation

b7e8c8c

This de-duplicates some duplicated code paths. This makes it easier to patch profiling calls into the function later on.

Snektron force-pushed the profiling-infra branch from f7fc61e to b7e8c8c on September 9, 2025 20:38

Snektron marked this pull request as draft September 9, 2025 20:39

Snektron force-pushed the profiling-infra branch from f5c02ef to e5007f6 on September 9, 2025 20:54

add gpu runtime info to SystemInfo

f3dd42e

This way we can tell whether we are using CUDA or ROCm later on. This also fixes the ROCm fallback path.

Snektron force-pushed the profiling-infra branch from e5007f6 to f3dd42e on September 9, 2025 20:55

Snektron added 4 commits September 9, 2025 23:00

add 'link' report result type

c4c6b41

This will be used to communicate external download links such as profiling results.

Snektron marked this pull request as ready for review September 9, 2025 21:02

Snektron mentioned this pull request Sep 9, 2025

Initial AMD Profiling #339

Merged

6 tasks

Snektron requested review from ngc92, msaroufim and S1ro1 and removed request for ngc92 and msaroufim September 9, 2025 21:13

msaroufim approved these changes Sep 10, 2025

Snektron merged commit 31a047f into main Sep 10, 2025
5 checks passed
