Allow emitting output directories as plain Directory messages #258

EdSchouten · 2023-05-26T11:46:13Z

As part of #257 we're discussing adding support for storing directories in Git's format. This means OutputDirectory.tree_digest will no longer point to an actual recursive tree object (like REv2 Tree). Instead of doing that, I would like to investigate whether we can add native support for storing output directories in decomposed form.

lucasmeijer · 2023-10-31T18:10:25Z

This is exactly what I'm looking for. I'm trying to use reapi in a situation where one action produces an output directory that is consumed by a followup action. the actual build client never needs to see any of these bytes. today, I have to after execution retrieve the Tree object from the CAS, and then manually upload all embedded Directory messages. It would make my scenario much easier if the ActionResult can return directly the digest of the Directory of the output directory.

sluongng · 2023-11-15T07:24:46Z

Listening to yesterday discussion, I am a bit curious on this PR:

I know that the current intention is to have the field output_directory_format included in the Command message, which in turn used to compute the cache key. The intention so that the newer clients which going to use this field will NOT be sharing the cache with the existing clients (built without support for this field).

My question is: is there a world where, after this PR merged, we DO want to share the cache result between the older clients and the newer clients?

The diff in cached result, in this case, is more of a metadata: how the output is being presented on the API layer, and not actually functional differences (same exit code, same output files, dirs, symlinks etc…). So separating the cache result here seems wasteful? Would it be better if we could move this to a different place that would not affect the cache key computation and enable cache sharing between clients with and without this new field?

juergbi · 2023-11-15T07:58:11Z

My question is: is there a world where, after this PR merged, we DO want to share the cache result between the older clients and the newer clients?

Is your concern about caching across different clients or across different versions of the same client?

I'm not concerned about the former as I expect it to be very rare for different clients to construct identical Action messages even without this new option.

The latter may be a more practical concern. This would reduce the cache hit rate as long as there are active users of different versions of the same client. It may be acceptable as it typically only affects a transitional period, though. And it's very similar to the transition to output_paths or the move of the platform properties from the Command message to the Action message.

For remote execution, the server could mitigate this by setting both Tree and Directory digests in the result and storing a second action cache entry (for the digest of the Action without the new option). For remote caching, the new version of the client could perform this mitigation.

EdSchouten · 2023-11-15T08:32:19Z

@sluongng Though I agree with your feedback in spirit, there are some aspects to it that make that hard to achieve.

For example, consider the case where a client asks for an action to be executed in such a way that it only yields Directory messages. Then another client shows up and executes the same action, but only expecting Trees. Should the scheduler then still perform in-flight deduplication between the two? (In the case of Buildbarn it should not.)

In this case @lucasmeijer is intending to use this feature for a completely different kind of client (not Bazel), and he's also not expecting to have cache hits between the two. I therefore think the most conservative approach here is to just have it part of Action/Command.

sluongng · 2023-11-15T08:52:41Z

Should the scheduler then still perform in-flight deduplication between the two? (In the case of Buildbarn it should not.)

Could you please elaborate a bit on why we do not want to have cache deduplication when the actions are identical?

And even if we do desire cache separation between 2 types of clients, we already have other mechanisms in place for that right? (i.e. instance_name, Action's salt, etc...)

I will brainstorm this with an alternative approach:

Instead of adding this field to Command message, we add it to something like ResultsCachePolicy (or another new field) in ExecuteRequest. This means that the action_digest is not affected, the cache key will remain the same. The server side will still receive this "presentation option" from the client and could opt to modify/decorate the ActionResult accordingly.

With this approach, if the newer client still wishes to have a separate cache result from other tools, it could always use a unique Action.salt (i.e. the tool's name + version) to force the cache partition.

Perhaps I am missing something?

EdSchouten · 2023-11-15T10:00:29Z

Should the scheduler then still perform in-flight deduplication between the two? (In the case of Buildbarn it should not.)

Could you please elaborate a bit on why we do not want to have cache deduplication when the actions are identical?

At least in the case of Buildbarn, the worker is the one uploading the Tree objects into the CAS. This logic is also going to be extended to upload the Directory messages. Now if in-flight deduplication occurs, it may be the case that the worker will end up making the wrong choice. It will only upload the output directories in one format, and subsequently return a single ActionResult to the scheduler, which the scheduler sends back to both clients.

I think we shouldn't treat this differently from output_paths vs output_{files,directories}. Sure, in a perfect world you'd treat them as identical actions having a single ActionResult, but that's not how we want to do it in practice, because it makes implementations more complex.

bergsieker · 2023-12-12T04:03:26Z

One concern I have: what about Actions that produce a large output tree? The current flow requires calling GetTree, which is a streaming API can has reasonable affordances for dealing with large trees (streamed responses, to fit within the max response size, resumable downloads, etc.) We'd either need to duplicate that sort of functionality here or figure out a way to fall back to the tree digest when the response gets too large, neither of which I'm wild about.

As an alternative, I wonder if we could add an optional "inlined directories" field that could be used to inline the results of simple cases like single-node "trees." I think that would likely solve Lucas' use-case of wanting to avoid unnecessary calls to GetTree, but I'm not sure that it solves Ed's concerns about wanting to support Git directories as first-class citizens.

EdSchouten · 2023-12-13T11:51:14Z

As discussed during the monthly meeting, @bergsieker's concerns shouldn't be an issue. Only the digest of the root directory is stored in ActionResult. This extension should be fairly light-weight to implement for servers, as all they need to do as part of GetActionResult() is walk over the Tree objects (as they do right now) and validate that all individual Directory messages are also present in the CAS.

With regards to supporting Git directories as first-class citizens: this is something that is out of scope for this specific change, but as far as I know not interfering with it. Lucas is interested in using the existing Directory message.

I've rebased this PR to address merge conflicts in the .pb.go.

It looks like the current ones are out of sync.

As part of bazelbuild#257 we're discussing adding support for storing directories in Git's format. This means OutputDirectory.tree_digest will no longer point to an actual recursive tree object (like REv2 Tree). Instead of doing that, I would like to investigate whether we can add native support for storing output directories in decomposed form.

This brings in a newer version of REv2 that includes bazelbuild/remote-apis#258.

This brings in bazelbuild/remote-apis#258, which we need to add support for emitting output directories as plain Directory messages.

bazelbuild/remote-apis#258 added support to REv2 for storing output directories in the form of individual Directory messages, as opposed to using Trees. This change adds partial support for it. Namely, we always upload Trees, but additionally upload separate Directory messages if requested by the client. Furthermore, we add a configuration option to force enable it. Some users may desire this for the reasons documented in bb_worker.proto.

This brings in bb-storage's and bb-remote-execution's support for bazelbuild/remote-apis#258.

EdSchouten · 2023-12-22T11:08:49Z

Support for this has now been added to Buildbarn:

Thanks for merging this, @bergsieker! @lucasmeijer, be sure to upgrade to the latest Buildbarn container images and give this a try.

lucasmeijer · 2023-12-22T19:24:33Z

Amazing, will try right after Xmas break

Now that bazelbuild/remote-apis#258 has landed, it's also possible for individual REv2 Directory messages to be outputs of build actions. This means that we should no longer call them "input directories".

EdSchouten requested a review from bergsieker as a code owner May 26, 2023 11:46

EdSchouten mentioned this pull request May 26, 2023

add GITSHA1 value to supported digest functions #257

Open

EdSchouten force-pushed the eschouten/20230526-output-directories branch from 0c93b51 to 3ef3fc0 Compare June 13, 2023 14:05

EdSchouten force-pushed the eschouten/20230526-output-directories branch from 3ef3fc0 to a28d04e Compare July 5, 2023 11:22

sluongng mentioned this pull request Nov 23, 2023

REv3: Reduce asymmetry between O(n) output files and O(1) output directories #281

Open

EdSchouten added 2 commits December 13, 2023 12:55

Regenerate all .pb.go files

1329808

It looks like the current ones are out of sync.

EdSchouten force-pushed the eschouten/20230526-output-directories branch from a28d04e to 5d15896 Compare December 13, 2023 11:56

bergsieker approved these changes Dec 21, 2023

View reviewed changes

bergsieker merged commit d20ae8b into bazelbuild:main Dec 21, 2023

EdSchouten deleted the eschouten/20230526-output-directories branch December 21, 2023 16:01

EdSchouten added a commit to buildbarn/bb-storage that referenced this pull request Dec 22, 2023

Upgrade Go dependencies

0dded70

This brings in a newer version of REv2 that includes bazelbuild/remote-apis#258.

EdSchouten added a commit to buildbarn/bb-remote-execution that referenced this pull request Dec 22, 2023

Upgrade all Go dependencies

186be9d

This brings in bazelbuild/remote-apis#258, which we need to add support for emitting output directories as plain Directory messages.

EdSchouten added a commit to buildbarn/bb-clientd that referenced this pull request Dec 22, 2023

Upgrade all Go dependencies

d33e581

This brings in bb-storage's and bb-remote-execution's support for bazelbuild/remote-apis#258.

EdSchouten mentioned this pull request Jan 27, 2024

AC: Treat Directories as blobs buildbarn/bb-remote-asset#34

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Allow emitting output directories as plain Directory messages #258

Allow emitting output directories as plain Directory messages #258

Uh oh!

EdSchouten commented May 26, 2023

Uh oh!

lucasmeijer commented Oct 31, 2023

Uh oh!

sluongng commented Nov 15, 2023

Uh oh!

juergbi commented Nov 15, 2023

Uh oh!

EdSchouten commented Nov 15, 2023

Uh oh!

sluongng commented Nov 15, 2023

Uh oh!

EdSchouten commented Nov 15, 2023

Uh oh!

bergsieker commented Dec 12, 2023

Uh oh!

EdSchouten commented Dec 13, 2023 •

edited

Loading

Uh oh!

EdSchouten commented Dec 22, 2023

Uh oh!

lucasmeijer commented Dec 22, 2023

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

Allow emitting output directories as plain Directory messages #258

Allow emitting output directories as plain Directory messages #258

Uh oh!

Conversation

EdSchouten commented May 26, 2023

Uh oh!

lucasmeijer commented Oct 31, 2023

Uh oh!

sluongng commented Nov 15, 2023

Uh oh!

juergbi commented Nov 15, 2023

Uh oh!

EdSchouten commented Nov 15, 2023

Uh oh!

sluongng commented Nov 15, 2023

Uh oh!

EdSchouten commented Nov 15, 2023

Uh oh!

bergsieker commented Dec 12, 2023

Uh oh!

EdSchouten commented Dec 13, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

EdSchouten commented Dec 22, 2023

Uh oh!

lucasmeijer commented Dec 22, 2023

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

EdSchouten commented Dec 13, 2023 •

edited

Loading