[herd] Decouple analysis results from CLI output #1833

Open

fsestini wants to merge 8 commits into herd:master from fsestini:herd-test-results-simple-iter

Conversation

@fsestini (Collaborator)

This PR is a step toward making herd7 reusable from OCaml code, as outlined in #1782. It primarily addresses point 2 of that issue, namely:

Gradually decouple outputs from stdout/files, so callers can consume results directly as OCaml values instead of scraping CLI output or temporary files.

The main change is to split the functionality of the Top_herd module into:

  • a result-producing core, which returns structured OCaml values; and
  • a CLI layer, which preserves the existing stdout/dot-file behaviour.

The user-facing herd7 CLI is intended to remain unchanged.

What Changed

Before this PR, Top_herd mixed three responsibilities in the same control flow:

  • generating candidate executions and computing analysis results;
  • formatting analysis results and selected executions for output; and
  • managing CLI output resources such as stdout dot blocks, output directories, temporary dot files, and viewers.

This PR separates those responsibilities. Top_herd.Make(...).run is now the result-producing part: it generates executions, checks them against the model, and returns structured OCaml values rather than printing directly. Top_herd.Printer is the formatting layer: it knows how to turn those structured values into strings. The new Cli module implements the remaining CLI-specific resource management: directories, temporary files, channels, etc.

The intended split is that the first two pieces can be reused from library-style callers (which is why both are accessed through Top_herd), while file and directory management stays in a CLI-specific place and won't be exposed.

One detail of this split is that selected executions are returned through an iterator-shaped value, rather than as a list. This is to preserve the previous memory behaviour where executions are processed one by one and not retained in memory.
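As a rough sketch of that shape (the name push_iter and this definition are mine, not code from the PR):

    (* Sketch only: an iterator-shaped value. The producer calls the
       callback once per selected execution, so each execution can be
       discarded as soon as the callback returns; nothing requires the
       producer to accumulate all executions in a list. *)
    type ('a, 'r) push_iter = ('a -> unit) -> 'r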

Note for reviewers: most of this refactor consists of taking pieces of the old Top_herd module and moving them to more specific places. The diff might therefore look noisier than the actual conceptual change, as most of the code that is shown being added/removed in the diff is simply being moved around with little to no change. Overall, I tried my best to keep the diff as tight as possible.

Concretely, this PR proposes the following changes:

New module Top_herd.TestResult

The main types introduced here are stats and execution. stats represents the overall results of a litmus test simulation, i.e. the final states, candidate counts, witnesses, flags, etc. execution carries data for a single execution.
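As a hedged sketch of what these types might look like (the field names below are illustrative guesses, not the PR's actual definitions):

    (* Hypothetical shapes for Top_herd.TestResult; the real records in
       the PR may differ. *)
    type stats = {
      final_states : string list; (* observed final states, pretty-printed *)
      n_candidates : int;         (* candidate executions examined *)
      n_witnesses : int;          (* executions satisfying the condition *)
      flags : string list;        (* flagged properties raised by the model *)
    }

    type execution = {
      index : int;                (* position in the selection order *)
      dot : string option;        (* rendered graph, when one was requested *)
    }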

Top_herd.Make(M).run now takes a test : M.S.test as input and returns a triple:

  • an M.S.test, which is an updated version of the input;
  • an M.S.event_structure list of event structures (pre-solver); and
  • an iterator of type (execution -> unit) -> stats, which visits executions and produces a stats summary at the end.

The iterator follows the existing CLI selection policy: it visits the executions selected by options such as show and nshow, not every candidate execution. This keeps the new API close to the behaviour currently exposed by the CLI. A more fine-grained iterator over candidate executions can be added later if needed.
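Under those assumptions, a library caller might consume run's result roughly as follows (the functor argument's signature and the handle_execution callback are guesses based on this description, not the PR's exact API):

    (* A sketch only: consuming the triple described above. *)
    module Consume (M : XXXMem.S) = struct
      let run_test ~handle_execution test =
        let module T = Top_herd.Make (M) in
        let updated_test, event_structures, iter = T.run test in
        Printf.printf "pre-solver event structures: %d\n"
          (List.length event_structures);
        (* the iterator visits each selected execution in turn, then
           returns the aggregate stats *)
        let stats = iter handle_execution in
        (updated_test, stats)
    end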

Updated ParseTest/RunTest

The "run" functions in ParseTest/RunTest have been updated so parsing and running return structured outcomes, rather than print out those outcomes themselves.

Outcomes are returned as a first-class module. This seems necessary because the type of analysis outcomes is indexed by the semantics S : SemExtra.S, and the input test determines which SemExtra.S/XXXMem.S instance herd7 constructs. Since the exact semantics module is only known after parsing and dispatching on the test's architecture and variants, it has to be part of the returned value somehow.
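A sketch of the general technique (the module and value names here are illustrative; the PR's actual signatures may differ):

    (* Packaging a result whose type depends on the semantics module as a
       first-class module, so the semantics instance chosen at parse time
       travels with the outcome. *)
    module type OUTCOME = sig
      module S : SemExtra.S
      val test : S.test
      val stats : stats  (* or an S-indexed result type *)
    end

    type outcome = (module OUTCOME)

    (* A caller unpacks the module to recover the semantics instance: *)
    let inspect (o : outcome) =
      let module O = (val o) in
      ignore O.test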

New module Cli

The new Cli module contains the parts of the old Top_herd implementation that are specific to the CLI. Most of this code is intentionally copied almost directly from the old Top_herd module.

Changed how showsome is computed within ParseTest

Before this refactor, showsome was determined inside ParseTest partly from the outputdir parameter. Since ParseTest is now moving toward a more generic API, a very CLI-specific parameter like outputdir no longer seemed appropriate. The new collect_graph_data parameter for ParseTest is an attempt to preserve the old behaviour through a less CLI-specific knob. Having said this, I still find it a bit awkward to use a module parameter to control optimization behaviour, so it would be great if we can later find a way to remove it while preserving the optimizations as much as possible.
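For instance, the CLI side could preserve the old behaviour by deriving the new parameter from outputdir when instantiating ParseTest, mirroring the condition that previously lived inside it (a sketch; the actual wiring in the PR may differ):

    (* derive collect_graph_data from the CLI's outputdir setting *)
    let collect_graph_data =
      match Conf.outputdir with
      | PrettyConf.StdoutOutput | PrettyConf.Outputdir _ -> true
      | _ -> false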

Tests

Added cram tests under herd/tests/other/output.t to ensure that CLI behaviours touched by the refactor are preserved.

@fsestini self-assigned this May 12, 2026
@fsestini force-pushed the herd-test-results-simple-iter branch from c2da892 to b969e28 on May 12, 2026 at 12:01
@TiberiuBucur (Contributor) left a comment:

I haven't gotten around to reviewing everything yet, but so far I like the implementation; I've left two minor comments. In addition, I'm having trouble running make install after the project builds (using make): apparently _build/default/herdtools7.install is missing.

Comment thread herd/parseTest.ml
Comment on lines -170 to -175
-  (* START NOTWWW *)
-  (* Interval timer will be stopped just before output, see top_herd *)
-  Itimer.start name TopConf.timeout ;
-  (* END NOTWWW *)
-  let start_time = Sys.time () in
-  Misc.input_protect (do_from_file start_time env name) name
@TiberiuBucur (Contributor):
Is there any concern in dropping the timing data recording?

@fsestini (Collaborator, Author):

Time recording is still performed as before; it has just been moved out of ParseTest and into the new Cli module.

The rationale is that I think timing is a property of a particular client's consumption of the result, not of the result itself. Since ParseTest is now moving towards exposing a reusable library API whose results are consumed through an iterator in various ways (in the CLI, concurrently, in tests, etc.), the library itself should not impose any particular timing policy on callers.
In herd7, the CLI is the component that knows which operations need to be timed, so timing is now handled there. Other consumers of this API can set up their own timing policy, or not measure time at all (as we do in our test suite).
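For example, a caller could time the consumption of the iterator itself (a minimal sketch assuming the iterator-shaped result described above; Sys.time is the only real API used here):

    (* CPU-time measurement around iterator consumption, analogous to what
       the old ParseTest code did with Sys.time *)
    let timed_stats iter =
      let start_time = Sys.time () in
      let stats = iter (fun _execution -> ()) in
      let elapsed = Sys.time () -. start_time in
      (stats, elapsed)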

Having said this, you've reminded me that I should make the time parameter of Top_herd.Printer.pp_stats optional, in case callers want to print out run stats without the 'Time' bit!

Comment thread herd/parseTest.ml
Comment on lines -99 to +108
-  begin match Conf.outputdir with
-  | PrettyConf.StdoutOutput | PrettyConf.Outputdir _ -> true
-  | _ -> false
-  end || Misc.is_some Conf.PC.view || Conf.variant Variant.MemTag
+  Conf.collect_graph_data
+  || Misc.is_some Conf.PC.view || Conf.variant Variant.MemTag
@TiberiuBucur (Contributor):

I am personally fine with having this flag, but I was curious if you measured a change in performance with and without the showsome flag guard for the code that builds the show member of the state in interpreter.ml. How big of an optimisation are we looking at, compared to recording every relation and filtering at the end?

@fsestini (Collaborator, Author):

Thanks, this is a good point. The short answer is that I have not looked deeply into this yet, but I agree it is a question worth asking.

> compared to recording every relation and filtering at the end?

I am interpreting this as: what would happen if the interpreter always constructed the show state, and we only decided later which parts of it to use.

One thing worth noting is that show is currently represented lazily via Lazy.t. So, even with showsome = true, we are not necessarily eagerly computing every shown relation while the interpreter runs. What we may still be doing is allocating thunks needed to compute the shown relations at a later stage. So my current understanding is that the likely cost of setting showsome = true unconditionally is less "eagerly compute every relation" and more "allocate and retain more show-related data, possibly increasing memory footprint and GC pressure".
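To illustrate the distinction (a self-contained toy example, not herd7 code):

    let compute_shown_relations () = [ ("rf", (0, 1)) ]  (* stand-in data *)
    let show = lazy (compute_shown_relations ())
    (* building [show] only allocates a thunk; the list is not built yet *)
    let relations = Lazy.force show  (* the computation runs here, on demand *)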

I did a rough comparison on this test:

AArch64 wait-flag1
{
   0:X0=x; 1:X0=x;
   1:X2=y;
}
   P0          | P1                ;
   MOV W1,#1   |L0:                ;
               | MOV W3,#1         ;
               | LDADD W3,WZR,[X2] ;
   STR W1,[X0] | LDR W1,[X0]       ;
               | CBZ W1,L0         ;
exists(1:X1=1)

with

$ /usr/bin/time -l ./_build/default/herd/herd.exe -set-libdir ./herd/libdir ./wait-flag1.litmus -unroll 3

For this particular test, I did not see a measurable difference in runtime or peak memory between the normal build and a build with showsome = true unconditionally. On average:

Standard:

        5.74 real         3.52 user         1.20 sys
            17907712  maximum resident set size
            11616664  peak memory footprint

With showsome = true:

        5.71 real         3.53 user         1.19 sys
            17874944  maximum resident set size
            11600304  peak memory footprint

That said, this is not a definitive benchmark. I chose this test because it is heavier on herd7 than others, but it might nevertheless not stress the show machinery particularly well. (@HadrienRenaud I wonder if you might know of any litmus tests that would make a good benchmark for this.) I also tried time make test and did not see a visible performance difference there either.

Bottom line: I don't feel I have enough evidence either way right now. My reason for keeping the flag is to keep the refactor conservative w.r.t. runtime behaviour, but I agree this is worth measuring more systematically at some point.
