
rework ci #3330

Open
likewhatevs wants to merge 1 commit into sched-ext:main from likewhatevs:ci-debug

Conversation

likewhatevs (Contributor) commented Feb 16, 2026

CI grew in complexity over time. This commit simplifies it while adding some capabilities we have wanted for a while: proper veristat support, AI review, and automated performance testing and analysis.

In the process, I made a couple of tools we've needed for a while now (a GitHub Action and a cargo veristat plugin, both described below).

Once I know how folks feel about them, I figure I'll publish them on the GitHub Marketplace (for the action) and crates.io (for the cargo plugin).

Anyway, on to the main points of this refactor:

simple

Install dependencies with apt, cache with the GitHub Actions cache where needed, and keep scripts minimal (only one, I think, for the benchmark stuff). Pretty much everything lives in GitHub Actions YAML, and there are only two of those files.
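For a sense of the shape, here is a minimal sketch of that style of workflow; the package names, job layout, and step contents are assumptions for illustration, not the literal contents of either workflow file in this PR:

```yaml
# Illustrative sketch only: package names, job layout, and step contents are
# assumptions, not the literal contents of the two workflow files in this PR.
name: ci
on: [pull_request]

jobs:
  build-and-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # dependencies come straight from apt, no custom toolchain images or wrapper scripts
      - run: sudo apt-get update && sudo apt-get install -y clang llvm libelf-dev
      - run: cargo build --release
```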

veristat

TL;DR -- put rodata dumps in a directory named veristat inside your scheduler's directory, and CI will run X kernels through verification using all Y dumps present. layered's verified instruction count goes from 20k to 200k with this when using --run-example.

longer story

About a year ago I tested a layered binary in a VM on a desktop I got just for that purpose, and learned that VM-based testing is, well, not the whole story when it comes to verifying BPF programs.

In that particular instance, the VM's default config was something layered detected as "SMT disabled", while the "SMT enabled" path (i.e. the common, if not universal, case) always failed to verify.

That cargo veristat plugin is for this case; it contains the glue logic needed to feed rodata dumps from bpftool into veristat.

Paired with VM-based testing (to vary the verifier being run), this enables what in my head has been a near gold standard for knowing whether something will verify. I'm sure we'll find cases where it doesn't hold, but I think it's a step in the right direction.
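A rough sketch of the underlying glue, not the cargo veristat plugin's actual interface (which isn't published yet): dump the scheduler's .rodata with bpftool while a real run has set it up, then replay those values when verifying the object. Map name, variable name, paths, and the assumption that veristat is new enough to support -G/--set-global-vars are all illustrative.

```yaml
# Sketch of the glue logic, not the cargo veristat plugin's interface.
# Map/variable names and paths are illustrative; assumes a veristat build
# that supports -G/--set-global-vars.
- name: verify a scheduler against captured rodata
  run: |
    # capture the .rodata values a real run of the scheduler sets before load
    sudo bpftool map dump name layered.rodata > scheds/rust/scx_layered/veristat/smt-on.txt
    # replay one of those values while verifying the object file
    veristat -G "smt_enabled = 1" scheds/rust/scx_layered/src/bpf/main.bpf.o
```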

terse/categorized output

On the CI job logs page, we now get things like this:
[screenshot: categorized CI job output]
and this:
https://github.com/likewhatevs/scx/actions/runs/22052475511/attempts/1#summary-63713324076 (cargo veristat does some loop detection, so effectively infinite logs are shrunk enough to render in UIs and remain workable as logs).

AI

we haz ai: likewhatevs#13 (comment)

benchmarks

When scheduler code (not dependency code) is edited, a script runs a handful of benchmarks and rsched against the modified scheduler. On its own, this is ehh. But we haz ai!, and the AI knows to grep through kernel sources and a reasonably fresh index of all lore emails from all mailing lists, and to ingest all the data output above, before providing analyses like this.
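As a sketch of the trigger (paths, script name, and matrix variable are hypothetical, not the exact ones in this PR), something along these lines decides whether the benchmark step runs at all:

```yaml
# Hypothetical sketch of the trigger: only run benchmarks when scheduler code
# (not vendored/dependency code) changed. Paths and script name are assumptions.
- name: run benchmarks if scheduler code changed
  run: |
    if git diff --name-only "origin/${GITHUB_BASE_REF}...HEAD" | grep -qE '^scheds/'; then
      ./ci/run-benchmarks.sh "${{ matrix.scheduler }}"
    else
      echo "no scheduler code changed, skipping benchmarks"
    fi
```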

When I set out to do this, the above was a hunch, but I actually saw things working roughly that way across these PRs, which was kind of cool:

  1. No benchmark tools (and I also targeted the wrong branch with the PR); AI has some comments but, I mean, the code looks good, why not: p2dq: prefer idle cores over prev_cpu in select_cpu fast path likewhatevs/scx#14
  2. AI has some benchmark tools; pretty confident no: p2dq: prefer idle cores over prev_cpu in select_cpu fast path likewhatevs/scx#15
  3. AI has the full benchmark suite with rsched outputs while the benchmarks are running -- it spots the intentional bug: p2dq: prefer idle cores over prev_cpu in select_cpu fast path likewhatevs/scx#20
  4. After clarifying the importance of rsched and improving logging: p2dq: prefer idle cores over prev_cpu in select_cpu fast path likewhatevs/scx#21

Note -- neither AI review nor benchmarks are blocking.

downsides

I'm not sure how long getting that AI feedback is going to take with respect to queueing until libbpf/libbpf-rs#1336 is merged.

TL;DR on that PR: builds get 4x faster (probably closer to 6-8x; I did not test this without the "limit parallelism to prevent OOMs" fix in place) with 10x less RAM, enabling better resource allocation for the CI, such that I can be more confident there will be no queueing for comments/feedback as folks iterate (I think).

No periodic tests against for-next for now. PRs are run against a battery of kernels, including for-next, etc., but I think that libbpf-rs commit (or something like it) needs to be merged before it's tenable (compute-wise) for me to enable those.

I had to do some log policing for signal-to-noise. More or less: print when things fail or warn, and do not print positive progress info (e.g. "compiled" or "test passed"). Our logs were in the megabytes, and most of that information was not particularly actionable; this makes them more information-dense.
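The general pattern (not the exact script from this PR) looks something like this: capture a step's output, stay quiet on success, dump everything on failure, and surface warnings either way.

```yaml
# General pattern, not the exact script in this PR: stay quiet on success,
# dump full output on failure, and surface warnings either way.
- name: build (quiet on success)
  run: |
    if ! out=$(cargo build --release 2>&1); then
      printf '%s\n' "$out"
      exit 1
    fi
    printf '%s\n' "$out" | grep -i 'warning' || true
```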

I'm opening this PR now instead of iterating more because I think making this work well requires folks using it so we can see where it falls over.

likewhatevs commented Feb 17, 2026

I think I really like this one: likewhatevs#24 (comment) -- it links to Bootlin and has more tasteful emoji and URL use.

likewhatevs marked this pull request as draft February 17, 2026 17:34
likewhatevs force-pushed the ci-debug branch 13 times, most recently from 9829c80 to 3f03e85, February 18, 2026 04:05
likewhatevs requested a review from hodgesds February 18, 2026 04:47
likewhatevs (Contributor, Author) commented:

I fed a handful more PRs through this CI setup on my fork here:

https://github.com/likewhatevs/scx/pulls?q=is%3Apr+created%3A2026-02-18T00%3A46%3A34..2026-02-18T04%3A46%3A34

Runtime for the non-AI-review jobs (i.e. the blocking merge-queue stuff) is down to 6 minutes, and that libbpf-cargo change should take maybe 2 more minutes off that, plus remove a fan-out bottleneck in the merge queue.

This was supposed to be simpler than it ended up being, but getting runtimes acceptable (not 30 minutes for the blocking stuff) without an external cache, given that libbpf issue, required all the caching. That said, it's still less code with additional capabilities, and it's all shell or GitHub CI YAML.
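For flavor, the caching in question is the built-in actions/cache keyed on lockfiles; the paths and keys below are assumptions, not the exact ones used in the workflows:

```yaml
# Illustrative only: GitHub's built-in cache standing in for an external cache.
# Paths and keys are assumptions, not the exact ones used in this PR.
- uses: actions/cache@v4
  with:
    path: |
      ~/.cargo/registry
      target
    key: cargo-${{ runner.os }}-${{ hashFiles('**/Cargo.lock') }}
    restore-keys: |
      cargo-${{ runner.os }}-
```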

It looks like my fix for libbpf might be OK after some iteration, so times will improve.

Regarding the example PRs, all the reviews I've seen so far seem informative, but a good few are still in flight. The way the AI review works, it takes ~30 minutes from when a PR is opened to when it comments, and that is serial. This sounds bad, but looking at historic data, I'd guess the most anyone would ever have to wait is an hour or two. Note that AI review is non-blocking, so maybe that's fine.

There is a lot of noise on the AI/perf box (I set some probably-should-be-weekly index/update cron jobs to hourly), so it'll be interesting to see what the AI makes of that when interpreting changes in the context of performance.

This is a picture of what the pipeline does (the GitHub UI doesn't render fan-out with matrices well, I think):
[screenshot: pipeline graph, 2026-02-18]

I also updated the underlying tooling such that, I think, we can do everything we do on x86 on arm by just adding another matrix variable (presuming things aren't unusably slow; GitHub runners don't have KVM on arm).
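As a hedged sketch of what "just adding another matrix variable" could look like (the matrix dimensions, runner labels, and kernel names here are assumptions, not the PR's actual matrix):

```yaml
# Hedged sketch: matrix dimensions, runner labels, and kernel list are
# assumptions, not the PR's actual matrix.
test:
  strategy:
    matrix:
      arch: [x86_64, aarch64]
      kernel: [stable, for-next]
  runs-on: ${{ matrix.arch == 'aarch64' && 'ubuntu-24.04-arm' || 'ubuntu-latest' }}
  steps:
    - run: echo "build and test ${{ matrix.kernel }} on ${{ matrix.arch }}"
```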

likewhatevs marked this pull request as ready for review February 18, 2026 05:22
ci grew in complexity over time. this commit simplifies it while adding
some capabilities we have wanted (proper veristat support, claude,
automated performance testing and analysis) in the process.

also clean up a lint issue for a linter that wasn't running or something
(for green signal).

Signed-off-by: Pat Somaru <patso@likewhatevs.io>