Skip to content

Conversation

gurasinghMS
Copy link
Contributor

@gurasinghMS gurasinghMS commented Aug 13, 2025

Adding optional FLR support to the NvmeFaultController. Fixes: #1719. Description in the words of copilot:

Problem

When testing with the NVMe emulator, VFIO didn't allow OpenHCL to modify the reset_method file because the emulator didn't advertise FLR support. This prevented proper device reset functionality in virtualized environments.

Solution

Added a minimal PCI Express capability implementation that:

  1. Advertises FLR support - Implements Device Capabilities register with FLR bit (bit 29) set
  2. Handles FLR requests - Processes writes to Device Control register FLR trigger bit (bit 15)
  3. Performs device reset - Resets NVMe controller state when FLR is initiated
  4. Maintains compatibility - FLR support is optional and disabled by default

Key Components

  • PciExpressCapability: New capability struct implementing PCI Express Device Capabilities, Device Control, and Device Status registers
  • FlrHandler trait: Interface for handling FLR events with atomic signaling mechanism
  • Configuration options: Added flr_support flag to NvmeControllerCaps and NvmeControllerHandle
  • Reset integration: FLR triggers existing controller reset logic safely

Testing

Added comprehensive unit tests that verify:

  • FLR capability is properly advertised when enabled
  • No PCI Express capability appears when FLR is disabled
  • FLR trigger mechanism works with correct self-clearing behavior
  • All existing functionality remains unaffected

Usage

let caps = NvmeControllerCaps {
    msix_count: 64,
    max_io_queues: 64, 
    subsystem_id: Guid::new_random(),
    flr_support: true,  // Enable FLR support
};

Disclaimer: This is a reworked PR authored by copilot with logic fixes to how the FLR is completed and fixes to thedefinition of the DeviceConfiguration struct in PCIe capability + Moved the support to the fault emulator instead of the main one (for now).

@gurasinghMS
Copy link
Contributor Author

gurasinghMS commented Aug 13, 2025

Spec digarams (From PCIe Base Spec 6.4) for reference:

image image

@gurasinghMS gurasinghMS marked this pull request as ready for review August 13, 2025 23:30
@Copilot Copilot AI review requested due to automatic review settings August 13, 2025 23:30
@gurasinghMS gurasinghMS requested review from a team as code owners August 13, 2025 23:30
Copy link
Contributor

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR adds optional Function Level Reset (FLR) support to the NVMe fault controller for testing purposes. This enables VFIO to modify the reset_method file by advertising FLR capability through PCI Express configuration space.

Key changes:

  • Implements PCI Express capability with FLR support in the fault controller
  • Adds comprehensive unit tests for FLR functionality
  • Updates configuration structures to include FLR support option

Reviewed Changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
vm/devices/storage/nvme_test/src/tests/flr_tests.rs New comprehensive test suite for FLR capability advertising and triggering
vm/devices/storage/nvme_test/src/pci.rs Main implementation adding PCI Express capability with FLR handler and reset logic
vm/devices/pci/pci_core/src/capabilities/pci_express.rs New PCI Express capability implementation with FLR support and extensive tests
vm/devices/pci/pci_core/src/spec.rs PCI Express specification definitions for capability registers
vm/devices/storage/nvme_test/src/lib.rs Updated fault configuration to use Arc for shared ownership
vm/devices/storage/nvme_resources/src/lib.rs Added flr_support field to controller handle
vm/devices/storage/nvme_test/src/resolver.rs Updated resolver to pass through flr_support configuration
vm/devices/storage/nvme_test/src/tests.rs Added flr_tests module
vm/devices/storage/nvme_test/src/tests/controller_tests.rs Updated fault configuration usage and added flr_support field
vm/devices/storage/disk_nvme/nvme_driver/src/tests.rs Updated fault configuration usage and added flr_support field
vm/devices/pci/pci_core/src/capabilities/mod.rs Exposed new PCI Express capability types

@gurasinghMS
Copy link
Contributor Author

Some Open question for this PR:

  • Does this NVME Emulator count as a "memory based controller". In the Nvme spec, it specifies that "memory based controllers" should not be resetting the Admin Queues during controller reset (i.e. when cc.en() bit goes from 1 -> 0. Currently the emulator is clearing the Admin queues upon controller reset.
  • One big way that the FLR differs from a regular controller reset (in my understanding) is it should reset the device regardless of the controller state. A normal reset by setting cc.en = 0 should only be done when csts.rdy = 1 - If not, controller panics. I found 2 options while implementing this behaviour -- Option 1: call self.workers.reset().await but since we should be servicing the FLR right after the cfg_write, we can't await (cfg_write isn't async). -- Option 2: Just recreate the NvmeWorker. This has futures in it but it seems like the drop implementation for Task just removes the task so this should seemingly be OK to do.

@gurasinghMS gurasinghMS changed the title [nvme_test] Advertise optional flr support nvme_test: Advertise optional flr support Aug 14, 2025
Copy link
Contributor

@mattkur mattkur left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good. Thanks for taking the time to work on this!

Comment on lines 68 to 73
/// Creates a new PCI Express capability with FLR support.
///
/// # Arguments
/// * `flr_supported` - Whether Function Level Reset is supported
/// * `flr_handler` - Optional handler to be called when FLR is initiated
pub fn new(flr_supported: bool, flr_handler: Option<Arc<dyn FlrHandler>>) -> Self {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How do you envision that callers would use this if they specify flr_supported: true but flr_handler: None. I'm not sure that makes semantic sense. What do you think? (I'm happy to be wrong here, I'm still wrapping my head around the nuance of FLR behavior).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it make sense to even allow this behavior? It seems like even if you want a no-op, shouldn't you be forced to specify a handler?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Theoretically I think requiring FLR handler at all times makes sense to me. However, my holdout is it would add unnecessary bloat to tests down the road (specially if we expect more PCIe capabilities to be supported in the future). What if we get rid of the bool altogether and just interpret Some(FlrHandler) to mean that flr is supported?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using above approach

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I like Some(FlrHandler) here.

Comment on lines 121 to 101
PciExpressCapabilityHeader::PCIE_CAPS => {
// PCIe Capabilities Register (16 bits) + Next Pointer (8 bits) + Capability ID (8 bits)
// For basic endpoint: Version=2, Device/Port Type=0 (PCI Express Endpoint)
let pcie_caps: u16 = 0x0002; // Version 2, Device/Port Type 0
(pcie_caps as u32) << 16 | (0x00 << 8) | CapabilityId::PCI_EXPRESS.0 as u32
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not critical, but I'd prefer to see you define the appropriate struct type for this as well. We're already doing that for the device caps, etc.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think this could be a follow up change to this? Since the PR has been sitting for a while I am trying to scope it to the mvp. Always happy to send out follow up changes!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, yes. That can be a follow up.

Comment on lines 68 to 73
/// Creates a new PCI Express capability with FLR support.
///
/// # Arguments
/// * `flr_supported` - Whether Function Level Reset is supported
/// * `flr_handler` - Optional handler to be called when FLR is initiated
pub fn new(flr_supported: bool, flr_handler: Option<Arc<dyn FlrHandler>>) -> Self {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it make sense to even allow this behavior? It seems like even if you want a no-op, shouldn't you be forced to specify a handler?

Copy link

@gurasinghMS gurasinghMS force-pushed the advertise_flr_support branch from d30046e to ece7dac Compare August 23, 2025 23:27
@gurasinghMS gurasinghMS requested a review from a team as a code owner August 23, 2025 23:27
Copy link

gurasinghMS added a commit that referenced this pull request Aug 25, 2025
…1913)

This is a support PR being made to reduce size of #1858. Added
`NvmeWorkersContext` that stores the input for `NvmeWorkers`. This will
subsequently be used to create new `NvmeWorkers` when FLR is issued to
the test controller.
gurasinghMS and others added 14 commits August 25, 2025 14:47
- Add PCI Express capability ID to pci_core spec
- Implement PciExpressCapability with FLR support in pci_core
- Add FlrHandler trait for handling FLR events
- Add flr_support flag to NvmeControllerCaps and NvmeControllerHandle
- Wire FLR capability to NvmeController when enabled
- Implement FLR reset logic using atomic flag mechanism
- Add comprehensive unit tests for FLR functionality
- Update nvme_resources to include FLR configuration option

Co-authored-by: mattkur <[email protected]>
@gurasinghMS gurasinghMS force-pushed the advertise_flr_support branch from 2919774 to 8c2141e Compare August 25, 2025 22:02
Copy link
Contributor

@mattkur mattkur left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall this LGTM. We discussed offline that you're going to split this PR into one for the core capability and another for the NVMe handling of it. Looking at a smaller nvme-only PR may yield some additional comments, but I think we're good to go here. Thanks for your work on this Guramrit!

Co-authored-by: Matt LaFayette (Kurjanowicz) <[email protected]>
@gurasinghMS gurasinghMS changed the title nvme_test: Advertise optional flr support wip: Advertise optional flr support Aug 28, 2025
@gurasinghMS gurasinghMS marked this pull request as draft August 28, 2025 17:27
gurasinghMS added a commit that referenced this pull request Aug 29, 2025
…th some basic functionality (#1930)

Adds the PCI_Express capability option for devices to advertise PCIe
support. This change will also us to add FLR support to the nvme_test
crate for further perf testing.

PciExpressCapability: New capability struct implementing PCI Express
Device Capabilities, Device Control, and Device Status registers

Spec digarams (From PCIe Base Spec 6.4) for reference:
<img width="1481" height="991" alt="image"
src="https://github.com/user-attachments/assets/a6f49e62-af03-4573-ab60-17637154aadc"
/>
<img width="1489" height="919" alt="image"
src="https://github.com/user-attachments/assets/21f6996c-eb8b-4fb6-b505-a113c37d9257"
/>

Being made as a helper PR to #1858 since that was getting a little too
large/complicated

---------

Co-authored-by: Copilot <[email protected]>
Co-authored-by: Matt LaFayette (Kurjanowicz) <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

nvme: nvme emulator does not advertise FLR support
4 participants