Allow emitting output directories as plain Directory messages #258
|
This is exactly what I'm looking for. I'm trying to use reapi in a situation where one action produces an output directory that is consumed by a follow-up action. The actual build client never needs to see any of these bytes. Today, after execution, I have to retrieve the Tree object from the CAS and then manually upload all embedded Directory messages. It would make my scenario much easier if the ActionResult could directly return the Directory digest of the output directory. |
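The manual workaround described above can be sketched roughly as follows. This is a minimal illustration, not client code for any real REv2 library: `Digest`, `Tree`, and `missing_directories` are hypothetical stand-ins, the CAS is modeled as a plain dict, and Directory messages are treated as opaque, already-serialized bytes.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Digest:
    """Stand-in for an REv2 Digest: content hash plus size."""
    hash: str
    size_bytes: int


def digest_of(blob: bytes) -> Digest:
    # Assumption: SHA-256 is the digest function in use.
    return Digest(hashlib.sha256(blob).hexdigest(), len(blob))


@dataclass
class Tree:
    """Stand-in for an REv2 Tree: a serialized root Directory plus all
    serialized descendant Directory messages."""
    root: bytes
    children: list = field(default_factory=list)


def missing_directories(tree: Tree, cas: dict):
    """Return the root Directory digest and every embedded Directory blob
    that is not yet present in the CAS (modeled here as a dict)."""
    missing = {}
    for blob in [tree.root, *tree.children]:
        digest = digest_of(blob)
        if digest not in cas:
            missing[digest] = blob
    return digest_of(tree.root), missing
```

A client would fetch the Tree, call `missing_directories`, upload the returned blobs, and then pass the root digest as the input root of the follow-up action. With the change proposed in this PR, that round trip through the client would no longer be needed.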
|
Listening to yesterday's discussion, I am a bit curious about this PR: I know that the current intention is to have the field My question is: is there a world where, after this PR is merged, we DO want to share the cache result between the older clients and the newer clients? The difference in the cached result, in this case, is more a matter of metadata: how the output is presented at the API layer, not an actual functional difference (same exit code, same output files, dirs, symlinks, etc.). So separating the cache result here seems wasteful. Would it be better if we could move this to a different place that would not affect the cache key computation and enable cache sharing between clients with and without this new field? |
Is your concern about caching across different clients or across different versions of the same client? I'm not concerned about the former as I expect it to be very rare for different clients to construct identical Action messages even without this new option. The latter may be a more practical concern. This would reduce the cache hit rate as long as there are active users of different versions of the same client. It may be acceptable as it typically only affects a transitional period, though. And it's very similar to the transition to For remote execution, the server could mitigate this by setting both Tree and Directory digests in the result and storing a second action cache entry (for the digest of the Action without the new option). For remote caching, the new version of the client could perform this mitigation. |
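The mitigation suggested here (store the result under a second action cache key that omits the new option) could look something like the sketch below. Everything in it is an assumption made for illustration: the Action is modeled as a plain dict hashed via JSON rather than a serialized protobuf, and `emit_directories` is a hypothetical name for the new option.

```python
import hashlib
import json


def action_key(action: dict) -> str:
    """Stand-in for the Action digest: hash of a canonical encoding."""
    encoded = json.dumps(action, sort_keys=True).encode()
    return hashlib.sha256(encoded).hexdigest()


def store_result(action_cache: dict, action: dict, result: dict,
                 option: str = "emit_directories") -> None:
    """Store the result under the Action's own key and, if the new option
    is set, also under the key of the same Action without that option."""
    action_cache[action_key(action)] = result
    if option in action:
        legacy_action = {k: v for k, v in action.items() if k != option}
        action_cache[action_key(legacy_action)] = result
```

For the second entry to be useful to older clients, the stored result has to carry the Tree digest as well as the Directory digest, as the comment above points out.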
|
@sluongng Though I agree with your feedback in spirit, there are some aspects to it that make that hard to achieve. For example, consider the case where a client asks for an action to be executed in such a way that it only yields Directory messages. Then another client shows up and executes the same action, but only expecting Trees. Should the scheduler then still perform in-flight deduplication between the two? (In the case of Buildbarn it should not.) In this case @lucasmeijer is intending to use this feature for a completely different kind of client (not Bazel), and he's also not expecting to have cache hits between the two. I therefore think the most conservative approach here is to just have it part of Action/Command. |
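To make the in-flight deduplication point concrete, here is a toy scheduler that deduplicates purely on the Action digest. It is a hypothetical sketch and not how Buildbarn is implemented. If the requested output format is part of the Action (or its Command), two requests that differ only in format have different digests and are never merged.

```python
class Scheduler:
    """Toy scheduler that merges concurrent requests for the same Action."""

    def __init__(self):
        self.in_flight = {}   # action digest -> list of waiting clients
        self.executions = 0   # number of executions actually started

    def enqueue(self, action_digest: str, client: str) -> bool:
        """Return True if a new execution was started, False if the
        request was attached to one that is already in flight."""
        waiters = self.in_flight.get(action_digest)
        if waiters is not None:
            waiters.append(client)
            return False
        self.in_flight[action_digest] = [client]
        self.executions += 1
        return True
```

If the format were instead kept outside the digest, the second client would be attached to the first execution and receive whatever format that execution happened to produce.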
Could you please elaborate a bit on why we do not want to have cache deduplication when the actions are identical? And even if we do desire cache separation between 2 types of clients, we already have other mechanisms in place for that right? (i.e. instance_name, Action's salt, etc...) I will brainstorm this with an alternative approach: Instead of adding this field to With this approach, if the newer client still wishes to have a separate cache result from other tools, it could always use a unique Perhaps I am missing something? |
At least in the case of Buildbarn, the worker is the one uploading the Tree objects into the CAS. This logic is also going to be extended to upload the Directory messages. Now if in-flight deduplication occurs, it may be the case that the worker will end up making the wrong choice. It will only upload the output directories in one format, and subsequently return a single ActionResult to the scheduler, which the scheduler sends back to both clients. I think we shouldn't treat this differently from |
|
One concern I have: what about Actions that produce a large output tree? The current flow requires calling GetTree, which is a streaming API that has reasonable affordances for dealing with large trees (streamed responses to fit within the max response size, resumable downloads, etc.). We'd either need to duplicate that sort of functionality here or figure out a way to fall back to the tree digest when the response gets too large, neither of which I'm wild about. As an alternative, I wonder if we could add an optional "inlined directories" field that could be used to inline the results of simple cases like single-node "trees." I think that would likely solve Lucas' use-case of wanting to avoid unnecessary calls to GetTree, but I'm not sure that it solves Ed's concerns about wanting to support Git directories as first-class citizens. |
|
As discussed during the monthly meeting, @bergsieker's concerns shouldn't be an issue. Only the digest of the root directory is stored in ActionResult. This extension should be fairly light-weight to implement for servers, as all they need to do as part of GetActionResult() is walk over the Tree objects (as they do right now) and validate that all individual Directory messages are also present in the CAS. With regard to supporting Git directories as first-class citizens: this is out of scope for this specific change, but as far as I know this change does not interfere with it. Lucas is interested in using the existing Directory message. I've rebased this PR to address merge conflicts in the .pb.go. |
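A rough sketch of the server-side check described above, under simplifying assumptions: Tree objects are modeled as lists of serialized Directory blobs keyed by their tree digest, the CAS is a set of hashes, and `wants_directories` is a hypothetical flag marking output directories that were requested in Directory form.

```python
import hashlib


def sha256_hex(blob: bytes) -> str:
    return hashlib.sha256(blob).hexdigest()


def result_is_complete(result: dict, trees: dict, cas_hashes: set) -> bool:
    """Walk the Tree of every output directory and, where Directory
    messages were requested, check that each one exists in the CAS."""
    for out_dir in result["output_directories"]:
        tree = trees.get(out_dir["tree_digest"])
        if tree is None:
            return False  # the Tree itself is missing
        if out_dir.get("wants_directories"):
            if any(sha256_hex(d) not in cas_hashes for d in tree):
                return False  # at least one Directory blob is missing
    return True
```

A server would presumably treat an incomplete result as a cache miss, in the same way it handles any other missing output blob.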
It looks like the current ones are out of sync.
As part of bazelbuild#257 we're discussing adding support for storing directories in Git's format. This means OutputDirectory.tree_digest will no longer point to an actual recursive tree object (like REv2 Tree). Instead of doing that, I would like to investigate whether we can add native support for storing output directories in decomposed form.
This brings in a newer version of REv2 that includes bazelbuild/remote-apis#258.
This brings in bazelbuild/remote-apis#258, which we need to add support for emitting output directories as plain Directory messages.
bazelbuild/remote-apis#258 added support to REv2 for storing output directories in the form of individual Directory messages, as opposed to using Trees. This change adds partial support for it. Namely, we always upload Trees, but additionally upload separate Directory messages if requested by the client. Furthermore, we add a configuration option to force enable it. Some users may desire this for the reasons documented in bb_worker.proto.
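The upload policy described in this commit message can be summarized in a few lines. This is only an illustrative sketch: the function and parameter names are hypothetical and do not correspond to the actual option in bb_worker.proto.

```python
def blobs_to_upload(tree_blob: bytes, directory_blobs: list,
                    client_requested: bool, force_enabled: bool) -> list:
    """Always upload the Tree; additionally upload the individual
    Directory messages if the client asked for them or if the worker is
    configured to always do so."""
    uploads = [tree_blob]
    if client_requested or force_enabled:
        uploads.extend(directory_blobs)
    return uploads
```

Because the Tree is always uploaded, clients that only understand Trees keep working regardless of how the worker is configured.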
This brings in bb-storage's and bb-remote-execution's support for bazelbuild/remote-apis#258.
|
Support for this has now been added to Buildbarn:
Thanks for merging this, @bergsieker! @lucasmeijer, be sure to upgrade to the latest Buildbarn container images and give this a try. |
|
Amazing, will try right after Xmas break |
Now that bazelbuild/remote-apis#258 has landed, it's also possible for individual REv2 Directory messages to be outputs of build actions. This means that we should no longer call them "input directories".