Conversation
I think I really like this one: likewhatevs#24 (comment) -- it links to bootlin and has more tasteful emoji and URL use.
force-pushed from 9829c80 to 3f03e85
I fed a handful more PRs through this CI setup on my fork here:

Runtime for the non-AI-review jobs (i.e. the blocking merge queue stuff) is down to 6 minutes, and that libbpf-cargo fix should take maybe 2 minutes off that, plus remove a fan-out bottleneck in the merge queue. This was supposed to be simpler than it ended up being, but getting runtimes acceptable (not 30 minutes for the blocking stuff) without an external cache, given that libbpf issue, required all the caching. That said, it's still less code with additional capabilities, and it's all shell or GitHub CI YAML. It looks like my fix for libbpf might be OK after some iteration, so times will improve.

WRT the example PRs, all I've seen so far seem informative, but a good few are still in flight. The way the AI review works, it takes ~30 minutes from when a PR is opened to when it comments, and that is serial. This sounds bad, but looking at historic data, I'd guess the most anyone would ever have to wait is an hour or two. Note that AI review is non-blocking, so maybe that's fine. There is a lot of noise on the AI/perf box (I set some probably-should-be-weekly index/update cron jobs to run hourly), so it'll be interesting to see what the AI makes of that when interpreting changes in the context of performance.

This is a picture of what the pipeline does (the GitHub UI doesn't render fan-out with matrices well, I think):

I also updated the underlying tooling such that I think we can do everything we do on x86 on arm by just adding another matrix variable (presuming things aren't unusably slow; GitHub runners don't have KVM on arm).
ci grew in complexity over time. this commit simplifies it while adding some capabilities we have wanted (proper veristat support, claude, automated performance testing and analysis) in the process. also cleans up a lint issue for a linter that wasn't running (for green signal).

Signed-off-by: Pat Somaru <patso@likewhatevs.io>
force-pushed from 3f03e85 to d4f8c95

CI grew in complexity over time. This commit simplifies it while adding some capabilities we have wanted (proper veristat support, AI review, automated performance testing and analysis) in the process.
In the process of doing this, I made a couple of tools we've needed for a while now:
I figure once I know how folks feel about them, I'll publish them on the GitHub Marketplace (for the action) and crates.io (for the cargo plugin).
Anyway, on to the main points of this refactor:
simple
Install stuff with apt, cache with the GitHub Actions cache where needed, and keep scripts minimal (only one, I think, for the benchmark stuff). Pretty much everything is in GitHub Actions YAML, and there are only two of those files.
veristat
TL;DR -- put rodata dumps in a directory named `veristat` in your scheduler's directory and CI will run X kernels through verification using all Y dumps present. layered's instruction count goes from 20k to 200k with this with --run-example.
longer story
About a year ago I tested a layered binary in a VM on a desktop I got just to do that, and learned then that VM-based testing is, well, not the whole story when it comes to verifying BPF programs.
In that particular instance, the VM's default config was something that layered detected as "smt disabled", and the "smt enabled" (i.e. the common-if-not-universal) case always failed to verify.
That cargo veristat plugin is for this case; it contains the glue logic necessary to feed rodata dumps from bpftool into veristat.
This, paired with VM-based testing (to change the verifier being run), enables what in my head has been a near gold standard for knowing if something will verify. I'm sure we'll find cases where this will not work, but I think it's a step in the right direction.
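For a sense of the flow the plugin glues together, here is a minimal sketch under stated assumptions: the map name, paths, and variable names are made up for illustration, and it assumes a veristat build that supports overriding .rodata globals via `--set-global-vars`/`-G`.

```sh
# 1. on a real machine running the scheduler, capture the .rodata config
#    layered actually saw (map name and output path are hypothetical):
sudo bpftool -j map dump name layered.rodata \
    > scheds/rust/scx_layered/veristat/smt_enabled.json

# 2. in CI, replay verification with those globals applied; cargo veristat
#    generates the overrides from the dump, roughly equivalent to:
sudo veristat scx_layered.bpf.o \
    -G "smt_enabled = 1" \
    -G "nr_possible_cpus = 64"
```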
terse/categorized output
On the CI job logs page, we now get things like this:

and this:
https://github.com/likewhatevs/scx/actions/runs/22052475511/attempts/1#summary-63713324076 (cargo veristat does some loop detection, so infinite logs are made small enough to render in UIs and to work with in logs).
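The actual loop detection in the plugin is presumably smarter than this, but the shape of the idea is roughly a run-length collapse, sketched here with made-up file names:

```sh
# not the plugin's real implementation, just the underlying idea:
# runs of identical lines (e.g. a verifier spinning through a loop)
# collapse into a single line prefixed with a repeat count, so the log
# stays small enough to render and grep through.
uniq -c verifier.log > verifier.collapsed.log
```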
AI
we haz ai: likewhatevs#13 (comment)
benchmarks
When scheduler code (not dependency code) is edited, a script will run a handful of benchmarks and rsched against the modified scheduler (a rough sketch of the trigger logic is below). On its own, this is ehh. But we haz ai!, and the AI knows to grep through kernel sources and a reasonably fresh index of all lore emails from all mailing lists, and to ingest all the data output above, before providing analyses like this.
When I set out to do this, the prior paragraph was a hunch, but I actually kind of saw things working in that manner across these 3 PRs, which was kinda cool.
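A minimal sketch of the "only when scheduler code changed" trigger mentioned above; the paths and script name are assumptions, not the repo's actual layout:

```sh
# only run the benchmark script when scheduler sources changed on this PR,
# not when only dependency/lockfile churn landed:
base="origin/${GITHUB_BASE_REF:-main}"
if git diff --name-only "$base"...HEAD | grep -qE '^scheds/'; then
  ./ci/run-benchmarks.sh "$SCHEDULER"   # hypothetical script name and argument
fi
```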
note -- neither ai nor benchmarks are blocking
downsides
I'm not sure how long getting that AI feedback is going to take, queueing-wise, until libbpf/libbpf-rs#1336 is merged.
TL;DR on that PR: builds get 4x faster (probably closer to 6-8x; I did not test this without the "limit parallelism to prevent OOMs" fix in place) with 10x less RAM, enabling better resource allocation of the CI stuff such that I can be more confident there will be no queueing for comments/feedback as folks iterate (I think).
No periodic tests against for-next for now. PRs are run against a battery of kernels, including for-next, etc., but I think that libbpf-rs commit or something like it needs to be merged for it to be tenable (wrt/ compute) for me to enable those.
I had to do some log policing wrt/ signal-to-noise. More or less: print when things fail and/or warn, and do not print positive-progress info (i.e. "compiled" or "test passed"); the usual pattern for this is sketched below. Our logs were in the megabytes, with most of the information not particularly actionable. This makes them more information dense.
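The suppression itself is just the usual capture-and-dump-on-failure idiom; the build command here is only an example:

```sh
# capture a step's output and only surface it when the step fails,
# so green runs stay quiet and red runs keep the full log:
log="$(mktemp)"
if ! cargo build --release >"$log" 2>&1; then
    cat "$log"
    exit 1
fi
```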
I'm opening this PR now instead of iterating more because I think making this work best requires that folks use it and we see where it falls over.