Conversation

Contributor

@mattkur mattkur commented Oct 4, 2025

wip: memory/nvme: keep track of whether any given memory block is allocated from
a memory store with persistent allocations or not. This is useful for the
NVMe driver to know whether it is in a state where keepalive can safely be done.

Sending out this change now to get some early feedback on whether I'm on the right
track. I also made some drive-by comment changes that helped me improve my
understanding of the code. I can pull those out into a separate PR once
I get eyes on them.

TODO:

  • Send out an initial RFC PR, especially on the per-memory-block persistence.
  • Fixup the handling in the nvme driver and nvme manager stacks (I left some todos there already)
  • Something smells wrong here; it's awfully complicated to handle save/restore,
    decide when it's enabled, etc. etc. There are lots of shared assumptions
    across the stack. I'd like to figure out a way to make this more elegant
    and easier to maintain.

@mattkur mattkur requested review from a team, chris-oo and gurasinghMS October 4, 2025 23:04
@mattkur mattkur requested review from a team as code owners October 4, 2025 23:04
@Copilot Copilot AI review requested due to automatic review settings October 4, 2025 23:04
Contributor

@Copilot Copilot AI left a comment

Pull Request Overview

This PR introduces persistence tracking for memory blocks in OpenVMM/OpenHCL to support NVMe keepalive functionality during save/restore operations. The change adds a persistent field to MemoryBlock to indicate whether the underlying memory allocation can survive OpenHCL servicing events.

Key changes:

  • Added persistence tracking to the MemoryBlock API with a new boolean field
  • Updated memory allocation code paths to specify persistence characteristics
  • Enhanced NVMe driver to check memory persistence and handle save/restore scenarios appropriately
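
As a rough illustration (not the actual diff in this PR), the shape of that change might look like the sketch below; only the boolean field and the persistent() accessor are taken from the description above, the rest is assumed.

```rust
// Hypothetical sketch of the MemoryBlock persistence tracking described above.
pub struct MemoryBlock {
    // ... existing fields (mapping, pfns, etc.) elided ...
    /// Whether the backing allocation survives an OpenHCL servicing event.
    persistent: bool,
}

impl MemoryBlock {
    /// True if this block was allocated from a memory store with persistent
    /// allocations (e.g. the page pool), so it can be kept alive across servicing.
    pub fn persistent(&self) -> bool {
        self.persistent
    }
}
```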

Reviewed Changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 2 comments.

Summary per file:

  • vm/devices/user_driver/src/memory.rs: Core MemoryBlock API updated with persistence field and accessor method
  • vm/page_pool_alloc/src/lib.rs: Page pool allocations marked as persistent with explanatory comment
  • vm/devices/user_driver/src/lockmem.rs: Locked memory allocations marked as non-persistent with improved documentation
  • vm/devices/storage/disk_nvme/nvme_driver/src/queue_pair.rs: NVMe queue pair enhanced with persistence checking and save/restore handling
  • vm/devices/storage/disk_nvme/nvme_driver/src/driver.rs: Added TODO comments for memory persistence validation during save operations
  • openhcl/underhill_core/src/nvme_manager/manager.rs: Added TODO comments for save state checking
  • openhcl/underhill_core/src/nvme_manager/device.rs: Enhanced NVMe spawner with warnings for non-optimal configurations
  • openhcl/underhill_core/src/worker.rs: Updated NVMe spawner configuration with warning flags
  • openhcl/underhill_core/src/dispatch/mod.rs: Added documentation about NVMe save/restore behavior
  • openhcl/openhcl_dma_manager/src/lib.rs: Improved documentation for allocation visibility and persistence
  • openhcl/lower_vtl_permissions_guard/src/lib.rs: Updated to preserve persistence information when wrapping memory blocks

pub async fn save(&mut self) -> anyhow::Result<NvmeDriverSavedState> {
// todo: interrogate all queues to see if any are with non-persistent memory
// that's been allocated. If so, then fail the save (and make sure the stack above handles)
// that gracefully. By failing thye save, we should leave the driver
Copilot AI Oct 4, 2025

Corrected spelling of 'thye' to 'the'.

Suggested change
// that gracefully. By failing thye save, we should leave the driver
// that gracefully. By failing the save, we should leave the driver


Comment on lines +204 to +210
let dma_client = device.dma_client();
let mem = dma_client
.allocate_dma_buffer(total_size)
.context("failed to allocate memory for queues")?;
if !mem.persistent() {
tracing::warn!(qid, "allocated non-persistent memory for queue pair");
}
Contributor Author

After looking through what it would take to safely handle fallback at save time, I think a better approach is to fail the QueuePair create if we can't allocate a persistent memory block. The NVMe driver allocates the QP in two spots:

  1. Initial device start, where it first allocates an admin queue and then one single IO issuer, and
  2. Later at runtime, when IO comes in on a new CPU.

If anything in #1 fails, then we've made a serious mistake. We can fail VM start and hopefully catch that in QA. If we did want to fall back, then the nice thing is that path is on the initial driver create path, so it's relatively easy to propagate the lack-of-save-restore state up the stack.

If #2 fails, then we just don't create the IO issuer. The driver already supports this fallback, since it's plausible that the underlying device doesn't have enough resources to support this new IO issuer (e.g. not enough interrupt vectors).

In such a degraded state, we would get worse IO performance. I think that is less impactful than long blips during servicing.
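
A rough sketch of that alternative is below (the require_persistent flag is hypothetical; it stands in for however the keepalive intent gets plumbed down to QueuePair creation):

```rust
// Sketch only: fail queue pair creation instead of just warning when keepalive
// needs persistent memory but the DMA client handed back a non-persistent block.
let mem = dma_client
    .allocate_dma_buffer(total_size)
    .context("failed to allocate memory for queues")?;
if require_persistent && !mem.persistent() {
    anyhow::bail!("queue pair {qid}: non-persistent memory, cannot support keepalive");
}
```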

@chris-oo , @gurasinghMS interested in your thoughts.

Contributor

I'm still relatively new to this area, so I'm currently reviewing the documentation and code to build a better understanding.

A little clarification on the approach: I'm guessing we'll check whether the keep_alive flag is set, along with the persistent-memory availability check, and only then fail VM start. This is to avoid failing VM start during servicing when the VM is migrated to an older host where the keepalive feature is disabled.

Contributor Author

Oh, good call. You're saying: if OpenVMM is booting up and ...

  1. ... there is saved memory, and
  2. ... keepalive is not set

then we should fail (since things conflict).

Did I get that right?

Contributor

@SrinivasShekar SrinivasShekar Oct 16, 2025

My bad, I missed replying here after our discussion :)
The current proposed design is: if there is saved memory and keep-alive is not set (which is the case on an older host), the VM will not receive the saved state from the host and will re-initialize NVMe as though it's a first boot.

But with the current implementation, the VM would fail to start when we can't allocate persistent memory, even on an older host. Thanks for the change in the recent diff.

Based on our discussion, we've also created a task to reset the device when running on an old host that has saved memory.

Comment on lines +754 to +755
// DEVNOTE: A subtlety here is that the act of saving the NVMe state also causes the driver
// to enter a state where subsequent teardown operations will noop. There is a STRONG
Contributor

Is there a mechanism to leave this state if/when the save or restore fails after the NVMe state has already been persisted?

Contributor Author

This is a good point. I don't think so. I think the only way that this can work would be for save() to be safe in one way or another: either actually save and set up for keepalive, or don't do anything such that the servicing operation will power off the device (and return no save state).

Contributor

Is it possible for the save/restore to be aborted after a successful LoadedVm::save call? My intuition was yes, but maybe OpenHCL relies on being externally reset (e.g. by the host) in that case?

Contributor Author

That I'm not sure about.

Member

Is it possible for us to support a mode for NVMe where "save" is nondestructive and "resume" can be called after save? Or is that hard to reason about?

In the past, that's the stance we've taken. But with more complicated devices, it becomes difficult.

Contributor

I think having the save operation put the NVMe driver into a "persisted" state is fundamental to any sort of device keepalive we want (NVMe, MANA, anything else). But "persisting" doesn't have to be "destructive", if there's a way for the VM worker to un-persist the driver for failure recovery.

Without this un-persistence operation, does NVMe persistence become a point-of-no-return for the save operation? That feels odd, but I don't know whether that's ok or not.

Contributor

I think the only way that this can work would be for save() to be safe in one way or another: either actually save and set up for keepalive, or don't do anything such that the servicing operation will power off the device (and return no save state).

I agree that the NVMe driver's save operation has to be safe one way or another, but I think the part that's harder to reason about is how to handle failure after a successful NVMe save.

// Do not allow any more processing after save completed.
break;
}
Req::QueryPersistence(rpc) => rpc.complete(self.memory_persistent),
Contributor

@gurasinghMS gurasinghMS Oct 8, 2025

This self.memory_persistent is tracking the state of memory that was issued to other structs. Could we instead add an is_persistent() function to the SubmissionQueue and CompletionQueue structs that returns self.mem.persistent()? Then here we could return rpc.complete(self.sq.is_persistent() && self.cq.is_persistent()). That would ensure data freshness, imo.
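
Roughly like this (sketch only; assuming both queue structs hold their MemoryBlock in a mem field, as the suggestion above implies):

```rust
// Sketch of the suggestion: each queue reports its own memory persistence,
// so the RPC answer is derived from the live queues rather than a cached flag.
impl SubmissionQueue {
    pub fn is_persistent(&self) -> bool {
        self.mem.persistent()
    }
}

impl CompletionQueue {
    pub fn is_persistent(&self) -> bool {
        self.mem.persistent()
    }
}
```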

// that's been allocated. If so, then fail the save (and make sure the stack above handles)
// that gracefully. By failing thye save, we should leave the driver
// in a state that is still running.
//
Contributor

I am wondering what the repercussions of failing a save would be. Right now, invoking save also makes two other key changes:

  • Stops processing at the QueueHandler level.
  • Sets the value of keepalive to true (regardless of what it was before).

For the first point, not stopping queue processing seems fine. But for the second point, maybe we should explicitly set self.nvme_keepalive = false. This value is accessed in shutdown and prevents proper cleanup if it is true. Right now we start the driver off with nvme_keepalive=false, but it could have been updated via update_servicing_flags at some point during its lifetime.
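
Roughly like this (sketch only; save_result and the surrounding code are hypothetical, just to illustrate resetting the flag):

```rust
// Sketch: if the save fails, explicitly clear the keepalive flag so the normal
// shutdown path still tears the device down.
if let Err(err) = save_result {
    tracing::warn!(error = ?err, "NVMe save failed; clearing nvme_keepalive");
    self.nvme_keepalive = false;
}
```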

Contributor Author

Thanks for the help! Yeah, I think this is what we'd have to do.

/// Send an RPC call to save NVMe worker data.
pub async fn save(&self) -> Option<NvmeManagerSavedState> {
match self.sender.call(Request::Save, ()).await {
// todo: check if we get save state here
Contributor

Not sure I follow: Why do we want to check for the saved state here?

Contributor Author

That was my attempt at checking for a failure path.
