VEP 115: Add PCIe NUMA topology awareness VEP #116
base: main
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files.
Co-authored-by: Fan Zhang <[email protected]>
Co-authored-by: Fabien Dupont <[email protected]>
Signed-off-by: Fan Zhang <[email protected]>
Signed-off-by: Fabien Dupont <[email protected]>
Signed-off-by: Michail Resvanis <[email protected]>
## Goals

- Add a new feature gate `PCINUMAAwareTopology`.
- When the aforementioned feature gate and `guestMappingPassthrough` are enabled, mirror PCIe
As discussed on the VEP call I wonder if it would be nicer to introduce a new policy instead of changing the behaviour of the existing guestMappingPassthrough policy when the FG is enabled? That way we don't change the behaviour of existing users when the FG is enabled or eventually graduates.
Thanks @lyarwood for your feedback!
Our current design is based on the understanding that PCIe NUMA passthrough is inherently coupled with CPU NUMA passthrough. In practice, enabling PCIe NUMA passthrough without aligning the guest’s CPU topology is not feasible. Therefore, the workflow we designed assumes that:
- If users want to enable PCIe NUMA awareness, they must explicitly enable the feature gate and set `guestMappingPassthrough: true` (see the sketch below).
- If users only want CPU NUMA passthrough, the current behavior continues to work as expected, as long as the feature gate remains disabled.
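For concreteness, a minimal sketch of that combination, assuming the `PCINUMAAwareTopology` feature gate name proposed in this VEP and the existing `spec.domain.cpu.numa.guestMappingPassthrough` field (an empty struct rather than a boolean in the current API). Note that `guestMappingPassthrough` also expects dedicated CPUs and hugepages-backed memory, omitted here for brevity; the VMI name is hypothetical.

```yaml
# KubeVirt CR: enable the proposed feature gate cluster-wide
apiVersion: kubevirt.io/v1
kind: KubeVirt
metadata:
  name: kubevirt
  namespace: kubevirt
spec:
  configuration:
    developerConfiguration:
      featureGates:
        - PCINUMAAwareTopology
---
# VMI: opt in to CPU NUMA passthrough via the existing policy
apiVersion: kubevirt.io/v1
kind: VirtualMachineInstance
metadata:
  name: numa-passthrough-example   # hypothetical name
spec:
  domain:
    cpu:
      numa:
        guestMappingPassthrough: {}
```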
That said, you bring up an excellent point: when the feature gate graduates, enabling it could inadvertently alter behavior for users already relying on guestMappingPassthrough. Introducing an explicit API field early on is a more robust and future-proof solution, as it avoids implicit behavior shifts.
I agree with your suggestion. We can introduce a dedicated policy field to clearly express PCIe NUMA passthrough intent. One possible design might look like this:
```yaml
spec:
  domain:
    devices:
      pciNumaAwareTopology: true
```
There are a couple of options here.
Firstly I wouldn't use a bool to represent this, using a struct lets us extend this in the future:

```yaml
spec:
  domain:
    cpu:
      numa:
        guestMappingPassthrough: {}
    devices:
      numa:
        pciNumaAwareTopology: {}
```

or we could just provide an extra policy under the existing cpu numa configurable for the PCI topology:

```yaml
spec:
  domain:
    cpu:
      numa:
        guestMappingPassthrough: {}
        pciNumaAwareTopology: {}
```

or provide an extended policy that handles both (eww, not a fan of this):

```yaml
spec:
  domain:
    cpu:
      numa:
        guestMappingPassthroughWithNumaAwarePCITopology: {}
```

or we ignore this and just document the fact that when enabled/graduated this FG will alter the behaviour of guestMappingPassthrough and potentially the PCI topology of the guest.
@lyarwood, I'm wondering if you could explain why we would want another tunable?
Are there users who don't want to expose the NUMA affinity of the passed-through devices?
So far, we have been avoiding constructing PCI controllers, but with this VEP, we have a path forward.
I think that constructing these controller hierarchies for NUMA passthrough is a first step, but later we will need to find a way to advertise the passed devices' NUMA affinity with dedicated CPUs as well; hopefully, with DRA integration, this will be easier.
if we must have a new api field, then I would vote for
```yaml
spec:
  domain:
    cpu:
      numa:
        guestMappingPassthrough: {}
        pciNumaAwareTopology: {}
```
However, I don't know if it's needed.
@vladikr guest ABI: ultimately, to implement this we are going to rewire the PCI device topology, right? So without an additional configurable, all users of guestMappingPassthrough will see this enabled in their guests once the FG graduates or is enabled. That's why I'm suggesting another configurable.
Signed-off-by: Michail Resvanis <[email protected]>
VEP Metadata
Tracking issue: #115
SIG label: /sig compute
What this PR does
When `guestMappingPassthrough` is set, any PCIe GPU and host devices with NUMA affinity information will be placed in a PCI controller hierarchy matching their NUMA node affinity (i.e. not directly on the default PCI root). Devices that don't have any NUMA alignment will be assigned to the default PCI root. Any error during this process will result in all devices falling back to their assignment under the default PCI root.

Special notes for your reviewer
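As an illustration of the behaviour described above, here is a minimal VMI sketch. The VMI name and the GPU/host device resource names are hypothetical, and the dedicated CPU and hugepages settings are the ones `guestMappingPassthrough` already expects; the placement comments reflect the intended behaviour with the feature gate enabled.

```yaml
apiVersion: kubevirt.io/v1
kind: VirtualMachineInstance
metadata:
  name: pcie-numa-example            # hypothetical name
spec:
  domain:
    cpu:
      cores: 8
      dedicatedCpuPlacement: true
      numa:
        guestMappingPassthrough: {}  # triggers the NUMA-aware placement described above
    memory:
      hugepages:
        pageSize: 1Gi
    resources:
      requests:
        memory: 16Gi
    devices:
      gpus:
        - name: gpu0
          deviceName: nvidia.com/SOME_GPU    # hypothetical resource; if its host NUMA node is known,
                                             # it is placed under a PCI controller hierarchy for that node
      hostDevices:
        - name: nic0
          deviceName: vendor.com/SOME_NIC    # hypothetical resource; without NUMA affinity information
                                             # it stays on the default PCI root
```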