Conversation

Contributor

@mattkur mattkur commented Oct 4, 2025

wip: memory/nvme: keep track of whether any given memory block is allocated from
a memory store with persistent allocations or not. This is useful for the
NVMe driver to know whether it is in a state where keepalive can safely be done.

Sending out this change now to get some early feedback on whether I'm on the right
track. I also made some drive-by comment changes that helped me improve my
understanding of the code. I can pull those out into a separate PR once
I get eyes on them.

TODO:

  • Send out an initial RFC PR, especially on the per-memory-block persistence.
  • Fixup the handling in the nvme driver and nvme manager stacks (I left some todos there already)
  • Something smells wrong here; it's awfully complicated to handle save/restore,
    decide when it's enabled, etc. etc. There are lots of shared assumptions
    across the stack. I'd like to figure out a way to make this more elegant
    and easier to maintain.

@mattkur mattkur requested review from a team, chris-oo and gurasinghMS October 4, 2025 23:04
@mattkur mattkur requested review from a team as code owners October 4, 2025 23:04
@Copilot Copilot AI review requested due to automatic review settings October 4, 2025 23:04
Contributor

@Copilot Copilot AI left a comment

Pull Request Overview

This PR introduces persistence tracking for memory blocks in OpenVMM/OpenHCL to support NVMe keepalive functionality during save/restore operations. The change adds a persistent field to MemoryBlock to indicate whether the underlying memory allocation can survive OpenHCL servicing events.

Key changes:

  • Added persistence tracking to the MemoryBlock API with a new boolean field
  • Updated memory allocation code paths to specify persistence characteristics
  • Enhanced NVMe driver to check memory persistence and handle save/restore scenarios appropriately
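
As a rough illustration (not the actual diff in this PR), the shape of that change might look like the sketch below; only the boolean field and the persistent() accessor are taken from the description above, the rest is assumed.

```rust
// Hypothetical sketch of the MemoryBlock persistence tracking described above.
pub struct MemoryBlock {
    // ... existing fields (mapping, pfns, etc.) elided ...
    /// Whether the backing allocation survives an OpenHCL servicing event.
    persistent: bool,
}

impl MemoryBlock {
    /// True if this block was allocated from a memory store with persistent
    /// allocations (e.g. the page pool), so it can be kept alive across servicing.
    pub fn persistent(&self) -> bool {
        self.persistent
    }
}
```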

Reviewed Changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 2 comments.

Summary per file:

  • vm/devices/user_driver/src/memory.rs: Core MemoryBlock API updated with persistence field and accessor method
  • vm/page_pool_alloc/src/lib.rs: Page pool allocations marked as persistent with explanatory comment
  • vm/devices/user_driver/src/lockmem.rs: Locked memory allocations marked as non-persistent with improved documentation
  • vm/devices/storage/disk_nvme/nvme_driver/src/queue_pair.rs: NVMe queue pair enhanced with persistence checking and save/restore handling
  • vm/devices/storage/disk_nvme/nvme_driver/src/driver.rs: Added TODO comments for memory persistence validation during save operations
  • openhcl/underhill_core/src/nvme_manager/manager.rs: Added TODO comments for save state checking
  • openhcl/underhill_core/src/nvme_manager/device.rs: Enhanced NVMe spawner with warnings for non-optimal configurations
  • openhcl/underhill_core/src/worker.rs: Updated NVMe spawner configuration with warning flags
  • openhcl/underhill_core/src/dispatch/mod.rs: Added documentation about NVMe save/restore behavior
  • openhcl/openhcl_dma_manager/src/lib.rs: Improved documentation for allocation visibility and persistence
  • openhcl/lower_vtl_permissions_guard/src/lib.rs: Updated to preserve persistence information when wrapping memory blocks

pub async fn save(&mut self) -> anyhow::Result<NvmeDriverSavedState> {
// todo: interrogate all queues to see if any are with non-persistent memory
// that's been allocated. If so, then fail the save (and make sure the stack above handles)
// that gracefully. By failing thye save, we should leave the driver
Copilot AI Oct 4, 2025

Corrected spelling of 'thye' to 'the'.

Suggested change
// that gracefully. By failing thye save, we should leave the driver
// that gracefully. By failing the save, we should leave the driver


Comment on lines +204 to +210
let dma_client = device.dma_client();
let mem = dma_client
.allocate_dma_buffer(total_size)
.context("failed to allocate memory for queues")?;
if !mem.persistent() {
tracing::warn!(qid, "allocated non-persistent memory for queue pair");
}
Contributor Author

After looking through what it would take to safely handle fallback at save time, I think a better approach is to fail the QueuePair create if we can't allocate a persistent memory block. The NVMe driver allocates the QP in two spots:

  1. Initial device start, where it first allocates an admin queue and then one single IO issuer, and
  2. Later at runtime, when IO comes in on a new CPU.

If anything in #1 fails, then we've made a serious mistake. We can fail VM start and hopefully catch that in QA. If we did want to fall back, then the nice thing is that path is on the initial driver create path, so it's relatively easy to propagate the lack-of-save-restore state up the stack.

If #2 fails, then we just don't create the IO issuer. The driver already supports this fallback, since it's plausible that the underlying device doesn't have enough resources to support this new IO issuer (e.g. not enough interrupt vectors).

In such a degraded state, we would get worse IO performance. I think that is less impactful than long blips during servicing.
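
A rough sketch of that alternative is below (the require_persistent flag is hypothetical; it stands in for however the keepalive intent gets plumbed down to QueuePair creation):

```rust
// Sketch only: fail queue pair creation instead of just warning when keepalive
// needs persistent memory but the DMA client handed back a non-persistent block.
let mem = dma_client
    .allocate_dma_buffer(total_size)
    .context("failed to allocate memory for queues")?;
if require_persistent && !mem.persistent() {
    anyhow::bail!("queue pair {qid}: non-persistent memory, cannot support keepalive");
}
```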

@chris-oo , @gurasinghMS interested in your thoughts.

Contributor

I'm still relatively new to this area, so I'm currently reviewing the documentation and code to build a better understanding.

A little clarification on the approach: I'm guessing we'll check whether the keep_alive flag is set, along with the persistent-memory availability check, and only then fail VM start. This is to avoid failing VM start during servicing when the VM is migrated to an older host where the keepalive feature is disabled.

Contributor Author

Oh, good call. You're saying: if OpenVMM is booting up and ...

  1. ... there is saved memory, and
  2. ... keepalive is not set

then we should fail (since things conflict).

Did I get that right?

Contributor

@SrinivasShekar SrinivasShekar Oct 16, 2025

My bad, I missed replying here after our discussion :)
The current proposed design is: if there is saved memory and keep-alive is not set (which is the case on an older host), the VM will not receive the saved state from the host and will re-initialize NVMe as though it's a first boot.

But with the current implementation, the VM would fail to start when we can't allocate persistent memory, even on an older host. Thanks for the change in the recent diff.

Based on our discussion, we've also created a task to reset the device when running on an old host that has saved memory.

Comment on lines +754 to +755
// DEVNOTE: A subtlety here is that the act of saving the NVMe state also causes the driver
// to enter a state where subsequent teardown operations will noop. There is a STRONG
Contributor

Is there a mechanism to leave this state if/when the save or restore fails after the NVMe state has already been persisted?

Contributor Author

This is a good point. I don't think so. I think the only way that this can work would be for save() to be safe in one way or another: either actually save and set up for keepalive, or don't do anything such that the servicing operation will power off the device (and return no save state).

Contributor

Is it possible for the save/restore to be aborted after a successful LoadedVm::save call? My intuition was yes, but maybe OpenHCL relies on being externally reset (e.g. by the host) in that case?

Contributor Author

That I'm not sure about.

Member

Is it possible for us to support a mode for NVMe where "save" is nondestructive and "resume" can be called after save? Or is that hard to reason about?

In the past, that's the stance we've taken. But with more complicated devices, it becomes difficult.

Contributor

I think having the save operation put the NVMe driver into a "persisted" state is fundamental to any sort of device keepalive we want (NVMe, MANA, anything else). But "persisting" doesn't have to be "destructive", if there's a way for the VM worker to un-persist the driver for failure recovery.

Without this un-persistence operation, does NVMe persistence become a point-of-no-return for the save operation? That feels odd, but I don't know whether that's ok or not.

Contributor

I think the only way that this can work would be for save() to be safe in one way or another: either actually save and set up for keepalive, or don't do anything such that the servicing operation will power off the device (and return no save state).

I agree that the NVMe driver's save operation has to be safe one way or another, but I think the part that's harder to reason about is how to handle failure after a successful NVMe save.

// Do not allow any more processing after save completed.
break;
}
Req::QueryPersistence(rpc) => rpc.complete(self.memory_persistent),
Contributor

@gurasinghMS gurasinghMS Oct 8, 2025

This self.memory_persistent is tracking the state of memory that was issued to other structs. Could we instead add an is_persistent() function to the SubmissionQueue and CompletionQueue structs that returns self.mem.persistent()? Then here we could return rpc.complete(self.sq.is_persistent() && self.cq.is_persistent()). That would ensure data freshness, imo.
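
Roughly like this (sketch only; assuming both queue structs hold their MemoryBlock in a mem field, as the suggestion above implies):

```rust
// Sketch of the suggestion: each queue reports its own memory persistence,
// so the RPC answer is derived from the live queues rather than a cached flag.
impl SubmissionQueue {
    pub fn is_persistent(&self) -> bool {
        self.mem.persistent()
    }
}

impl CompletionQueue {
    pub fn is_persistent(&self) -> bool {
        self.mem.persistent()
    }
}
```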

// that's been allocated. If so, then fail the save (and make sure the stack above handles)
// that gracefully. By failing thye save, we should leave the driver
// in a state that is still running.
//
Contributor

I am wondering what the repercussions of failing a save would be. Right now, invoking save also makes two other key changes:

  • Stops processing at the QueueHandler level.
  • Sets the value of keepalive to true (regardless of what it was before).

For the first point, not stopping queue processing seems fine. But for the second point, maybe we should explicitly set self.nvme_keepalive = false. This value is accessed in shutdown and prevents proper cleanup if it is true. Right now we start the driver off with nvme_keepalive=false, but it could have been updated via update_servicing_flags at some point during its lifetime.
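
Roughly like this (sketch only; save_result and the surrounding code are hypothetical, just to illustrate resetting the flag):

```rust
// Sketch: if the save fails, explicitly clear the keepalive flag so the normal
// shutdown path still tears the device down.
if let Err(err) = save_result {
    tracing::warn!(error = ?err, "NVMe save failed; clearing nvme_keepalive");
    self.nvme_keepalive = false;
}
```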

Contributor Author

Thanks for the help! Yeah, I think this is what we'd have to do.

/// Send an RPC call to save NVMe worker data.
pub async fn save(&self) -> Option<NvmeManagerSavedState> {
match self.sender.call(Request::Save, ()).await {
// todo: check if we get save state here
Contributor

Not sure I follow: Why do we want to check for the saved state here?

Contributor Author

That was my attempt at checking for a failure path.
