
@mresvanis

VEP Metadata

Tracking issue: #115
SIG label: /sig compute

What this PR does

When guestMappingPassthrough is set, any PCIe GPUs and host devices with NUMA affinity information will be placed into a PCI controller hierarchy matching their NUMA node affinity (i.e. not directly on the default PCI root). Devices without any NUMA alignment will be assigned to the default PCI root. Any error during this process will result in all devices falling back to assignment under the default PCI root.
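
For illustration, a minimal VMI sketch of the configuration this change targets, assuming the existing NUMA passthrough prerequisites (dedicated CPU placement and hugepages); the GPU resource name below is a hypothetical placeholder:

```yaml
# Sketch only: guestMappingPassthrough plus a passed-through GPU.
# The deviceName is a hypothetical placeholder resource name.
apiVersion: kubevirt.io/v1
kind: VirtualMachineInstance
metadata:
  name: gpu-numa-example
spec:
  domain:
    cpu:
      cores: 8
      dedicatedCpuPlacement: true
      numa:
        guestMappingPassthrough: {}   # guest NUMA topology mirrors the host mapping
    memory:
      hugepages:
        pageSize: 1Gi
    devices:
      gpus:
        - name: gpu0
          deviceName: vendor.example.com/SOME_GPU
    resources:
      requests:
        memory: 8Gi
```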

Special notes for your reviewer

@kubevirt-bot added the dco-signoff: yes label on Oct 29, 2025
@kubevirt-bot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign xpivarc for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@mresvanis marked this pull request as draft on October 29, 2025 16:01
@kubevirt-bot added the do-not-merge/work-in-progress label on Oct 29, 2025
@mresvanis force-pushed the pcie-numa-topology-awareness branch from 6cf5187 to d0beae6 on October 29, 2025 16:07
@kubevirt-bot added the dco-signoff: no label and removed the dco-signoff: yes label on Oct 29, 2025
@mresvanis force-pushed the pcie-numa-topology-awareness branch from 39b5fa4 to 3ad6491 on October 29, 2025 16:30
@kubevirt-bot added the dco-signoff: yes label and removed the dco-signoff: no label on Oct 29, 2025
@mresvanis force-pushed the pcie-numa-topology-awareness branch from 3ad6491 to 7d3a11f on October 29, 2025 16:33
Co-authored-by: Fan Zhang <[email protected]>
Co-authored-by: Fabien Dupont <[email protected]>
Signed-off-by: Fan Zhang <[email protected]>
Signed-off-by: Fabien Dupont <[email protected]>
Signed-off-by: Michail Resvanis <[email protected]>
@mresvanis force-pushed the pcie-numa-topology-awareness branch from 7d3a11f to 50c38fc on October 29, 2025 16:33
@xpivarc marked this pull request as ready for review on November 11, 2025 14:40
@kubevirt-bot removed the do-not-merge/work-in-progress label on Nov 11, 2025
## Goals

- Add a new feature gate `PCINUMAAwareTopology`.
- When the aforementioned feature gate and `guestMappingPassthrough` are enabled, mirror PCIe device NUMA affinity from the host in the guest PCI topology.
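
For reference, a sketch (not taken from the VEP itself) of how the proposed feature gate would be enabled through the usual KubeVirt CR developerConfiguration.featureGates list:

```yaml
# Sketch: enabling the proposed feature gate in the KubeVirt CR.
apiVersion: kubevirt.io/v1
kind: KubeVirt
metadata:
  name: kubevirt
  namespace: kubevirt
spec:
  configuration:
    developerConfiguration:
      featureGates:
        - PCINUMAAwareTopology
```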
@lyarwood (Member)
As discussed on the VEP call I wonder if it would be nicer to introduce a new policy instead of changing the behaviour of the existing guestMappingPassthrough policy when the FG is enabled? That way we don't change the behaviour of existing users when the FG is enabled or eventually graduates.


Thanks @lyarwood for your feedback!
Our current design is based on the understanding that PCIe NUMA passthrough is inherently coupled with CPU NUMA passthrough. In practice, enabling PCIe NUMA passthrough without aligning the guest’s CPU topology is not feasible. Therefore, the workflow we designed assumes that:

  • If users want to enable PCIe NUMA awareness, they must explicitly enable the feature gate and set guestMappingPassthrough.
  • If users only want CPU NUMA passthrough, the current behavior continues to work as expected, as long as the feature gate remains disabled.

That said, you bring up an excellent point: when the feature gate graduates, enabling it could inadvertently alter behavior for users already relying on guestMappingPassthrough. Introducing an explicit API field early on is a more robust and future-proof solution, as it avoids implicit behavior shifts.

I agree with your suggestion. We can introduce a dedicated policy field to clearly express PCIe NUMA passthrough intent. One possible design might look like this:

spec:
  domain:
    devices:
      pciNumaAwareTopology: true

@lyarwood (Member), Nov 11, 2025
There are a couple of options here.

First, I wouldn't use a bool to represent this; using a struct lets us extend it in the future:

spec:
  domain:
    cpu:
      numa:
        guestMappingPassthrough: {}
    devices:
      numa:
        pciNumaAwareTopology: {}

or we could just provide an extra policy under the existing cpu numa configurable for the PCI topology:

spec:
  domain:
    cpu:
      numa:
        guestMappingPassthrough: {}
        pciNumaAwareTopology: {}

or provide an extended policy that handles both (eww, not a fan of this):

spec:
  domain:
    cpu:
      numa:
        guestMappingPassthroughWithNumaAwarePCITopology: {}

or we ignore this and just document the fact that when enabled/graduated this FG will alter the behaviour of guestMappingPassthrough and potentially the PCI topology of the guest.

@vladikr (Member)
@lyarwood, I'm wondering if you could explain why we would want another tunable?
Are there users who don't want to expose the NUMA affinity of the passed-through devices?
So far, we have been avoiding constructing PCI controllers, but with this VEP, we have a path forward.
I think that constructing these controller hierarchies for NUMA passthrough is a first step, but later we will need to find a way to advertise the passed devices' NUMA affinity along with dedicated CPUs as well; hopefully with DRA integration this will be easier.

If we must have a new API field, then I would vote for:

spec:
  domain:
    cpu:
      numa:
        guestMappingPassthrough: {}
        pciNumaAwareTopology: {}

However, I don't know if it's needed.

@lyarwood (Member)
@vladikr It's about the guest ABI: ultimately, to implement this we are going to rewire the PCI device topology, right? So without an additional configurable, all users of guestMappingPassthrough will see this enabled in their guests once the FG graduates or is enabled. That's why I'm suggesting another configurable.
