Skip to content

Conversation

@EdSchouten
Copy link
Collaborator

As part of #257 we're discussing adding support for storing directories in Git's format. This means OutputDirectory.tree_digest will no longer point to an actual recursive tree object (like REv2 Tree). Instead of doing that, I would like to investigate whether we can add native support for storing output directories in decomposed form.

@EdSchouten EdSchouten requested a review from bergsieker as a code owner May 26, 2023 11:46
@EdSchouten EdSchouten force-pushed the eschouten/20230526-output-directories branch from 0c93b51 to 3ef3fc0 Compare June 13, 2023 14:05
@EdSchouten EdSchouten force-pushed the eschouten/20230526-output-directories branch from 3ef3fc0 to a28d04e Compare July 5, 2023 11:22
@lucasmeijer
Copy link

This is exactly what I'm looking for. I'm trying to use reapi in a situation where one action produces an output directory that is consumed by a followup action. the actual build client never needs to see any of these bytes. today, I have to after execution retrieve the Tree object from the CAS, and then manually upload all embedded Directory messages. It would make my scenario much easier if the ActionResult can return directly the digest of the Directory of the output directory.

@sluongng
Copy link
Collaborator

Listening to yesterday discussion, I am a bit curious on this PR:

I know that the current intention is to have the field output_directory_format included in the Command message, which in turn used to compute the cache key. The intention so that the newer clients which going to use this field will NOT be sharing the cache with the existing clients (built without support for this field).

My question is: is there a world where, after this PR merged, we DO want to share the cache result between the older clients and the newer clients?

The diff in cached result, in this case, is more of a metadata: how the output is being presented on the API layer, and not actually functional differences (same exit code, same output files, dirs, symlinks etc…). So separating the cache result here seems wasteful? Would it be better if we could move this to a different place that would not affect the cache key computation and enable cache sharing between clients with and without this new field?

@juergbi
Copy link
Contributor

juergbi commented Nov 15, 2023

My question is: is there a world where, after this PR merged, we DO want to share the cache result between the older clients and the newer clients?

Is your concern about caching across different clients or across different versions of the same client?

I'm not concerned about the former as I expect it to be very rare for different clients to construct identical Action messages even without this new option.

The latter may be a more practical concern. This would reduce the cache hit rate as long as there are active users of different versions of the same client. It may be acceptable as it typically only affects a transitional period, though. And it's very similar to the transition to output_paths or the move of the platform properties from the Command message to the Action message.

For remote execution, the server could mitigate this by setting both Tree and Directory digests in the result and storing a second action cache entry (for the digest of the Action without the new option). For remote caching, the new version of the client could perform this mitigation.

@EdSchouten
Copy link
Collaborator Author

@sluongng Though I agree with your feedback in spirit, there are some aspects to it that make that hard to achieve.

For example, consider the case where a client asks for an action to be executed in such a way that it only yields Directory messages. Then another client shows up and executes the same action, but only expecting Trees. Should the scheduler then still perform in-flight deduplication between the two? (In the case of Buildbarn it should not.)

In this case @lucasmeijer is intending to use this feature for a completely different kind of client (not Bazel), and he's also not expecting to have cache hits between the two. I therefore think the most conservative approach here is to just have it part of Action/Command.

@sluongng
Copy link
Collaborator

Should the scheduler then still perform in-flight deduplication between the two? (In the case of Buildbarn it should not.)

Could you please elaborate a bit on why we do not want to have cache deduplication when the actions are identical?

And even if we do desire cache separation between 2 types of clients, we already have other mechanisms in place for that right? (i.e. instance_name, Action's salt, etc...)


I will brainstorm this with an alternative approach:

Instead of adding this field to Command message, we add it to something like ResultsCachePolicy (or another new field) in ExecuteRequest. This means that the action_digest is not affected, the cache key will remain the same. The server side will still receive this "presentation option" from the client and could opt to modify/decorate the ActionResult accordingly.

With this approach, if the newer client still wishes to have a separate cache result from other tools, it could always use a unique Action.salt (i.e. the tool's name + version) to force the cache partition.

Perhaps I am missing something?

@EdSchouten
Copy link
Collaborator Author

Should the scheduler then still perform in-flight deduplication between the two? (In the case of Buildbarn it should not.)

Could you please elaborate a bit on why we do not want to have cache deduplication when the actions are identical?

At least in the case of Buildbarn, the worker is the one uploading the Tree objects into the CAS. This logic is also going to be extended to upload the Directory messages. Now if in-flight deduplication occurs, it may be the case that the worker will end up making the wrong choice. It will only upload the output directories in one format, and subsequently return a single ActionResult to the scheduler, which the scheduler sends back to both clients.

I think we shouldn't treat this differently from output_paths vs output_{files,directories}. Sure, in a perfect world you'd treat them as identical actions having a single ActionResult, but that's not how we want to do it in practice, because it makes implementations more complex.

@bergsieker
Copy link
Contributor

One concern I have: what about Actions that produce a large output tree? The current flow requires calling GetTree, which is a streaming API can has reasonable affordances for dealing with large trees (streamed responses, to fit within the max response size, resumable downloads, etc.) We'd either need to duplicate that sort of functionality here or figure out a way to fall back to the tree digest when the response gets too large, neither of which I'm wild about.

As an alternative, I wonder if we could add an optional "inlined directories" field that could be used to inline the results of simple cases like single-node "trees." I think that would likely solve Lucas' use-case of wanting to avoid unnecessary calls to GetTree, but I'm not sure that it solves Ed's concerns about wanting to support Git directories as first-class citizens.

@EdSchouten
Copy link
Collaborator Author

EdSchouten commented Dec 13, 2023

As discussed during the monthly meeting, @bergsieker's concerns shouldn't be an issue. Only the digest of the root directory is stored in ActionResult. This extension should be fairly light-weight to implement for servers, as all they need to do as part of GetActionResult() is walk over the Tree objects (as they do right now) and validate that all individual Directory messages are also present in the CAS.

With regards to supporting Git directories as first-class citizens: this is something that is out of scope for this specific change, but as far as I know not interfering with it. Lucas is interested in using the existing Directory message.

I've rebased this PR to address merge conflicts in the .pb.go.

It looks like the current ones are out of sync.
As part of bazelbuild#257 we're discussing adding support for storing directories
in Git's format. This means OutputDirectory.tree_digest will no longer
point to an actual recursive tree object (like REv2 Tree). Instead of
doing that, I would like to investigate whether we can add native
support for storing output directories in decomposed form.
@EdSchouten EdSchouten force-pushed the eschouten/20230526-output-directories branch from a28d04e to 5d15896 Compare December 13, 2023 11:56
@bergsieker bergsieker merged commit d20ae8b into bazelbuild:main Dec 21, 2023
@EdSchouten EdSchouten deleted the eschouten/20230526-output-directories branch December 21, 2023 16:01
EdSchouten added a commit to buildbarn/bb-storage that referenced this pull request Dec 22, 2023
This brings in a newer version of REv2 that includes
bazelbuild/remote-apis#258.
EdSchouten added a commit to buildbarn/bb-remote-execution that referenced this pull request Dec 22, 2023
This brings in bazelbuild/remote-apis#258, which
we need to add support for emitting output directories as plain
Directory messages.
EdSchouten added a commit to buildbarn/bb-remote-execution that referenced this pull request Dec 22, 2023
bazelbuild/remote-apis#258 added support to REv2
for storing output directories in the form of individual Directory
messages, as opposed to using Trees. This change adds partial support
for it. Namely, we always upload Trees, but additionally upload separate
Directory messages if requested by the client.

Furthermore, we add a configuration option to force enable it. Some
users may desire this for the reasons documented in bb_worker.proto.
EdSchouten added a commit to buildbarn/bb-clientd that referenced this pull request Dec 22, 2023
This brings in bb-storage's and bb-remote-execution's support for
bazelbuild/remote-apis#258.
@EdSchouten
Copy link
Collaborator Author

Support for this has now been added to Buildbarn:

Thanks for merging this, @bergsieker! @lucasmeijer, be sure to upgrade to the latest Buildbarn container images and give this a try.

@lucasmeijer
Copy link

Amazing, will try right after Xmas break

EdSchouten added a commit to buildbarn/bb-browser that referenced this pull request Feb 28, 2024
Now that bazelbuild/remote-apis#258 has landed,
it's also possible for individual REv2 Directory messages to be outputs
of build actions. This means that we should no longer call them "input
directories".
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants