
@mresvanis

VEP Metadata

Tracking issue: #115
SIG label: /sig compute

What this PR does

When guestMappingPassthrough is set, any PCIe GPUs and host devices with NUMA affinity information will be placed into a PCI controller hierarchy matching their NUMA node affinity (i.e. not directly on the default PCI root). Devices without any NUMA alignment will be assigned to the default PCI root. Any error during this process will result in all devices falling back to assignment under the default PCI root.
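
For illustration, a minimal VMI sketch of the configuration this change targets, assuming the existing NUMA passthrough prerequisites (dedicated CPU placement and hugepages); the GPU resource name below is a hypothetical placeholder:

```yaml
# Sketch only: guestMappingPassthrough plus a passed-through GPU.
# The deviceName is a hypothetical placeholder resource name.
apiVersion: kubevirt.io/v1
kind: VirtualMachineInstance
metadata:
  name: gpu-numa-example
spec:
  domain:
    cpu:
      cores: 8
      dedicatedCpuPlacement: true
      numa:
        guestMappingPassthrough: {}   # guest NUMA topology mirrors the host mapping
    memory:
      hugepages:
        pageSize: 1Gi
    devices:
      gpus:
        - name: gpu0
          deviceName: vendor.example.com/SOME_GPU
    resources:
      requests:
        memory: 8Gi
```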

Special notes for your reviewer

@kubevirt-bot added the dco-signoff: yes label on Oct 29, 2025
@kubevirt-bot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign xpivarc for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@mresvanis marked this pull request as draft on October 29, 2025 16:01
@kubevirt-bot added the do-not-merge/work-in-progress label on Oct 29, 2025
@mresvanis force-pushed the pcie-numa-topology-awareness branch from 6cf5187 to d0beae6 on October 29, 2025 16:07
@kubevirt-bot added the dco-signoff: no label and removed the dco-signoff: yes label on Oct 29, 2025
@mresvanis force-pushed the pcie-numa-topology-awareness branch from 39b5fa4 to 3ad6491 on October 29, 2025 16:30
@kubevirt-bot added the dco-signoff: yes label and removed the dco-signoff: no label on Oct 29, 2025
@mresvanis force-pushed the pcie-numa-topology-awareness branch from 3ad6491 to 7d3a11f on October 29, 2025 16:33
Co-authored-by: Fan Zhang <[email protected]>
Co-authored-by: Fabien Dupont <[email protected]>
Signed-off-by: Fan Zhang <[email protected]>
Signed-off-by: Fabien Dupont <[email protected]>
Signed-off-by: Michail Resvanis <[email protected]>
@mresvanis force-pushed the pcie-numa-topology-awareness branch from 7d3a11f to 50c38fc on October 29, 2025 16:33
@xpivarc marked this pull request as ready for review on November 11, 2025 14:40
@kubevirt-bot removed the do-not-merge/work-in-progress label on Nov 11, 2025
## Goals

- Add a new feature gate `PCINUMAAwareTopology`.
- When the aforementioned feature gate and `guestMappingPassthrough` are enabled, mirror PCIe device NUMA affinity from the host in the guest PCI topology.
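
For reference, a sketch (not taken from the VEP itself) of how the proposed feature gate would be enabled through the usual KubeVirt CR developerConfiguration.featureGates list:

```yaml
# Sketch: enabling the proposed feature gate in the KubeVirt CR.
apiVersion: kubevirt.io/v1
kind: KubeVirt
metadata:
  name: kubevirt
  namespace: kubevirt
spec:
  configuration:
    developerConfiguration:
      featureGates:
        - PCINUMAAwareTopology
```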
@lyarwood (Member)
As discussed on the VEP call I wonder if it would be nicer to introduce a new policy instead of changing the behaviour of the existing guestMappingPassthrough policy when the FG is enabled? That way we don't change the behaviour of existing users when the FG is enabled or eventually graduates.


Thanks @lyarwood for your feedback!
Our current design is based on the understanding that PCIe NUMA passthrough is inherently coupled with CPU NUMA passthrough. In practice, enabling PCIe NUMA passthrough without aligning the guest’s CPU topology is not feasible. Therefore, the workflow we designed assumes that:

  • If users want to enable PCIe NUMA awareness, they must explicitly enable the feature gate and set guestMappingPassthrough.
  • If users only want CPU NUMA passthrough, the current behavior continues to work as expected, as long as the feature gate remains disabled.

That said, you bring up an excellent point: when the feature gate graduates, enabling it could inadvertently alter behavior for users already relying on guestMappingPassthrough. Introducing an explicit API field early on is a more robust and future-proof solution, as it avoids implicit behavior shifts.

I agree with your suggestion. We can introduce a dedicated policy field to clearly express PCIe NUMA passthrough intent. One possible design might look like this:

spec:
  domain:
    devices:
      pciNumaAwareTopology: true

@lyarwood (Member), Nov 11, 2025
There are a couple of options here.

First, I wouldn't use a bool to represent this; using a struct lets us extend it in the future:

spec:
  domain:
    cpu:
      numa:
        guestMappingPassthrough: {}
    devices:
      numa:
        pciNumaAwareTopology: {}

or we could just provide an extra policy under the existing cpu numa configurable for the PCI topology:

spec:
  domain:
    cpu:
      numa:
        guestMappingPassthrough: {}
        pciNumaAwareTopology: {}

or provide an extended policy that handles both (eww, not a fan of this):

spec:
  domain:
    cpu:
      numa:
        guestMappingPassthroughWithNumaAwarePCITopology: {}

or we ignore this and just document the fact that when enabled/graduated this FG will alter the behaviour of guestMappingPassthrough and potentially the PCI topology of the guest.

@vladikr (Member)
@lyarwood, I'm wondering if you could explain why we would want another tunable?
Are there users who don't want to expose the NUMA affinity of the passed-through devices?
So far, we have been avoiding constructing PCI controllers, but with this VEP, we have a path forward.
I think that constructing these controller hierarchies for NUMA passthrough is a first step, but later we will need to find a way to advertise the passed devices' NUMA affinity along with dedicated CPUs as well; hopefully with DRA integration this will be easier.

If we must have a new API field, then I would vote for:

spec:
  domain:
    cpu:
      numa:
        guestMappingPassthrough: {}
        pciNumaAwareTopology: {}

However, I don't know if it's needed.

@lyarwood (Member)
@vladikr It's about the guest ABI: ultimately, to implement this we are going to rewire the PCI device topology, right? So without an additional configurable, all users of guestMappingPassthrough will see this enabled in their guests once the FG graduates or is enabled. That's why I'm suggesting another configurable.
