Skip to content

Conversation

Brian-Perkins
Copy link
Contributor

1. Tie lifetime of the device to the vmbus parent a. Remove existing channel used to control lifetime.
2. If the vmbus relay parent is dropped, there will be no opportunity to close the GPADL associated with the special ring buffer memory pages. In this case, note the error and leak the device. Trying to change the VTL permissions on these memory pages while the host still has them open will result in an error.
3. Allow intercepted devices to be revoked and reoffered from the host.

CP of #2117

    1. Tie lifetime of the device to the vmbus parent
        a. Remove existing channel used to control lifetime.
    2. If the vmbus relay parent is dropped, there will be no opportunity to
       close the GPADL associated with the special ring buffer memory pages.
       In this case, note the error and leak the device. Trying to change the
       VTL permissions on these memory pages while the host still has them open
       will result in an error.
    3. Allow intercepted devices to be revoked and reoffered from the host.
@Copilot Copilot AI review requested due to automatic review settings October 10, 2025 17:10
@Brian-Perkins Brian-Perkins requested review from a team as code owners October 10, 2025 17:10
@github-actions github-actions bot added the release_2505 Targets the release/2505 branch. label Oct 10, 2025
Copy link
Contributor

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR refactors the shutdown handling of VMBus relay intercept devices by tying their lifetime to the VMBus parent instead of using separate channel-based lifetime control. The main purpose is to handle device cleanup more gracefully when the VMBus relay parent is dropped, particularly when GPADL teardown fails.

  • Removes the manual lifetime control channel and integrates device lifecycle with the VMBus parent
  • Adds proper error handling for GPADL teardown failures with memory leak detection
  • Enables intercepted devices to be properly revoked and reoffered by the host

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

File Description
vm/devices/vmbus/vmbus_relay_intercept_device/src/lib.rs Main refactor removing channel-based lifetime control, adding GPADL teardown error handling, and improving device revoke logic
vm/devices/vmbus/vmbus_relay/src/lib.rs Reorders channel existence check to allow intercepted devices to be reoffered
openhcl/underhill_core/src/worker.rs Removes manual tracking of intercept device lifetime senders
openhcl/underhill_core/src/dispatch/mod.rs Removes unused _vmbus_intercept_devices field from LoadedVm struct

Comment on lines +190 to +194
// The VTL pages were not freed. This can occur if an
// error is hit that drops the vmbus parent tasks. Just
// pend here and let the outer error cause the VM to
// exit.
pending::<()>().await;
Copy link

Copilot AI Oct 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using pending::<()>().await to intentionally hang the task is an unusual pattern. Consider adding a more explicit comment explaining why this infinite wait is the desired behavior, or using a more conventional approach like returning an error that would cause the VM to exit.

Suggested change
// The VTL pages were not freed. This can occur if an
// error is hit that drops the vmbus parent tasks. Just
// pend here and let the outer error cause the VM to
// exit.
pending::<()>().await;
// The VTL pages were not freed. This is an unrecoverable error.
// Return an error to trigger VM exit in a controlled manner.
anyhow::bail!("VTL pages were not freed; unrecoverable error in vmbus relay device");

Copilot uses AI. Check for mistakes.

/// Responds to the channel being revoked by the host.
async fn handle_revoke(&mut self, state: &mut SimpleVmbusClientDeviceTaskState) {
let Some(offer) = state.offer.take() else {
let Some(offer) = state.offer.as_ref() else {
Copy link

Copilot AI Oct 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The change from state.offer.take() to state.offer.as_ref() means the offer is no longer consumed here, but it's still taken at line 538. This could lead to unexpected behavior if handle_revoke is called multiple times, as subsequent calls would still find an offer present.

Copilot uses AI. Check for mistakes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

release_2505 Targets the release/2505 branch.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants