<!-- File: doc/content/toolstack/features/NUMA/lazy-reclaim.md (256 additions) -->
---
title: "Lazy memory reclaim"
weight: 10
categories:
- NUMA
---
## Xen host memory scrubbing

Xen does not immediately reclaim deallocated memory.
Instead, Xen has a host memory scrubber that runs lazily in
the background to reclaim recently deallocated memory.

Thus, there is no guarantee that Xen has finished scrubbing
when `xenopsd` is asked to build a domain.

## Waiting for enough free host memory

> [!info]
> When `xenopsd` starts to build a VM and the reclaimed host-wide
> memory is not yet sufficient, its
> [build_pre](https://github.com/xapi-project/xen-api/blob/073373ff/ocaml/xenopsd/xc/domain.ml#L899-L964)
> function (which also runs during VM restore and VM migration)
> [polls](https://github.com/xapi-project/xen-api/blob/073373ff2abfa386025f2b1eee7131520df76be9/ocaml/xenopsd/xc/domain.ml#L904)
> Xen [until enough host-wide memory](https://github.com/xapi-project/xen-api/blob/073373ff2abfa386025f2b1eee7131520df76be9/ocaml/xenopsd/xc/domain.ml#L236-L272)
> has been reclaimed. See the
> [walk-through of Domain.build](../../../xenopsd/walkthroughs/VM.build/Domain.build.md#build_pre-prepare-building-the-vm)
> in `xenopsd` for more context:

```ml
let build_pre ~xc ~xs ~vcpus ~memory ~has_hard_affinity domid =
  let open Memory in
  let uuid = get_uuid ~xc domid in
  debug "VM = %s; domid = %d; waiting for %Ld MiB of free host memory"
    (Uuidx.to_string uuid) domid memory.required_host_free_mib ;
  (* CA-39743: Wait, if necessary, for the Xen scrubber to catch up. *)
  if
    not (wait_xen_free_mem ~xc (Memory.kib_of_mib memory.required_host_free_mib))
  then (
    error "VM = %s; domid = %d; Failed waiting for Xen to free %Ld MiB"
      (Uuidx.to_string uuid) domid memory.required_host_free_mib ;
    raise (Not_enough_memory (Memory.bytes_of_mib memory.required_host_free_mib))
  ) ;
```

This is the implementation of the polling function:

```ml
let wait_xen_free_mem ~xc ?(maximum_wait_time_seconds = 64) required_memory_kib
    : bool =
  let open Memory in
  let rec wait accumulated_wait_time_seconds =
    let host_info = Xenctrl.physinfo xc in
    let free_memory_kib =
      kib_of_pages (Int64.of_nativeint host_info.Xenctrl.free_pages)
    in
    let scrub_memory_kib =
      kib_of_pages (Int64.of_nativeint host_info.Xenctrl.scrub_pages)
    in
    (* At exponentially increasing intervals, write *)
    (* a debug message saying how long we've waited: *)
    if is_power_of_2 accumulated_wait_time_seconds then
      debug
        "Waited %i second(s) for memory to become available: %Ld KiB free, %Ld \
         KiB scrub, %Ld KiB required"
        accumulated_wait_time_seconds free_memory_kib scrub_memory_kib
        required_memory_kib ;
    if
      free_memory_kib >= required_memory_kib
      (* We already have enough memory. *)
    then
      true
    else if scrub_memory_kib = 0L (* We'll never have enough memory. *) then
      false
    else if
      accumulated_wait_time_seconds >= maximum_wait_time_seconds
      (* We've waited long enough. *)
    then
      false
    else (
      Thread.delay 1.0 ;
      wait (accumulated_wait_time_seconds + 1)
    )
  in
  wait 0
```

## Waiting for enough free memory on NUMA nodes

To address the same situation not host-wide but per NUMA node, the
build, restore, and migrate code paths for domains on NUMA machines
need a similar algorithm.

This should be done directly before the NUMA placement algorithm
runs, or even as part of an improvement for it:

The NUMA placement algorithm calls the `numainfo` hypercall to
obtain a table of NUMA nodes with the available memory on each
node and the distance matrix between the NUMA nodes as the basis
for the NUMA placement decision for the VM.
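The shape of that data can be pictured with a small sketch. The record
fields and the `best_single_node` helper below are illustrative
assumptions, not the actual Xenctrl bindings:

```ml
(* Illustrative sketch of the data the placement algorithm works from.
   Field and function names are hypothetical, not the real bindings. *)
type numa_info = {
    free_memory_kib : int64 array (* free memory per NUMA node, in KiB *)
  ; distances : int array array (* node-to-node distance matrix *)
}

(* Pick the node with the most free memory that can hold the VM,
   or return None when no single node is large enough. *)
let best_single_node info required_kib =
  let best = ref None in
  Array.iteri
    (fun node free ->
      if free >= required_kib then (
        match !best with
        | Some (_, best_free) when best_free >= free -> ()
        | _ -> best := Some (node, free)
      )
    )
    info.free_memory_kib ;
  Option.map fst !best
```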

If the reported free memory of the host is lower than expected at
that moment, this may indicate that some memory has not been
scrubbed yet. Another indication is the amount of free memory
increasing between two checks.

Also, if other domains are in the process of being shut down,
or if a shutdown recently occurred, Xen is likely scrubbing in
the background.

For cases where the NUMA placement returns no NUMA node affinity
for the new domain, the smallest possible change would be to
simply re-run the NUMA placement algorithm.

A trivial first step would be to retry once if the initial NUMA
placement of a VM failed, and to abort retrying if the available
memory did not change since the initial failed attempt.


The host-wide polling shown above gives up when Xen reports that no
pages are left to scrub. For the NUMA memory poll, the previous
results could likewise be kept and compared against the new results
to detect when no further progress is being made.

Besides, the same polling timeout as for host-wide memory
could be used.
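A per-node variant of `wait_xen_free_mem` could follow the same shape.
This is a sketch under assumptions: `free_kib_per_node` is a
hypothetical query (the real code would read the `numainfo` data), and
`sleep` stands in for `Thread.delay`. It aborts when a poll shows no
progress, as suggested above:

```ml
(* Sketch of per-NUMA-node polling, modelled on wait_xen_free_mem.
   [free_kib_per_node] and [sleep] are hypothetical parameters. *)
let wait_numa_free_mem ?(max_wait_seconds = 64) ~sleep ~free_kib_per_node
    ~node required_kib : bool =
  let rec wait elapsed previous_free =
    let free = free_kib_per_node node in
    if free >= required_kib then
      true (* enough memory on this node *)
    else if free = previous_free then
      false (* no progress since the last poll: stop waiting *)
    else if elapsed >= max_wait_seconds then
      false (* waited long enough *)
    else (
      sleep () ;
      wait (elapsed + 1) free
    )
  in
  wait 0 (-1L)
```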

## An example scenario

This is an example scenario where not waiting for memory scrubbing
in a NUMA-aware way could fragment the VM across many NUMA nodes.

In this example, a relatively large VM is rebooted on a fictional
machine with 4 NUMA nodes of 25 GB each (sized for layout reasons):

```mermaid
%%{init: {"packet": {"bitsPerRow": 25, "rowHeight": 38}} }%%
packet-beta
0-18: "Memory used by other VMs"
19-24: "free: 6 GB"
25-44: "VM before restart: 20 GB"
45-49: "free: 5 GB"
50-69: "Memory used by other VMs"
70-74: "free: 5 GB"
75-94: "Memory used by other VMs"
95-99: "free: 5 GB"
```

The VM is destroyed:

```mermaid
%%{init: {"packet": {"bitsPerRow": 25, "rowHeight": 38}} }%%
packet-beta
0-18: "Memory used by other VMs"
19-24: "free: 6 GB"
25-44: "VM memory to be reclaimed, but not yet scrubbed"
45-49: "free: 5 GB"
50-69: "Memory used by other VMs"
70-74: "free: 5 GB"
75-94: "Memory used by other VMs"
95-99: "free: 5 GB"
```

NUMA placement runs, and sees that no NUMA node has enough memory
for the VM. Therefore:
1. NUMA placement does not return a NUMA placement solution.
2. As a result, vCPU soft-pinning is not set up.
3. As a result, the domain does not get a NUMA node affinity.
4. When `xenguest` allocates the VM's memory, Xen falls back to
round-robin memory allocation across all NUMA nodes.

Even if Xen has already scrubbed the memory by the time the
NUMA placement function returns, the decision not to select
a NUMA placement has already been made, and the domain is
built in this way:

```mermaid
%%{init: {"packet": {"bitsPerRow": 25, "rowHeight": 38}} }%%
packet-beta
0-18: "Memory used by other VMs"
19-23: "VM: 5 GB"
24-24: ""
25-44: "scrubbed/reclaimed free memory: 20 GB"
45-49: "VM: 5 GB"
50-69: "Memory used by other VMs"
70-74: "VM: 5 GB"
75-94: "Memory used by other VMs"
95-99: "VM: 5 GB"
```

Provided that the reclaimed 20 GB is not partially allocated to
other VMs in the meantime, the 20 GB of NUMA-node memory becomes
available for the VM again once scrubbing and memory reclaim
are complete.

When the 20 GB VM is rebooted while that memory is still available,
the rebooted VM might become NUMA-affine to the 2nd NUMA node again.
Of course, this unpredictability is what we need to fix.

## Starting VMs when not enough reclaim is possible

When no NUMA node has enough memory to run a new VM,
waiting will not help.

In that case, it might still be good to inform the caller that a
perfect NUMA placement could not be achieved.

However, if a CPU socket with multiple NUMA nodes at a very low
inter-node distance has enough free memory, that can serve as
a fallback with a relatively low performance impact.

In the end, it is up to the caller what to do in such situations:
whether to start the VM anyway despite it not being NUMA-aligned,
and whether to report the expected performance degradation of the VM.
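One way to surface this decision to the caller is a result type that
distinguishes placement quality. The constructor names below are a
hypothetical sketch, not part of the current `xenopsd` API:

```ml
(* Hypothetical result type for a NUMA placement attempt. *)
type placement =
  | Exact of int list (* fits on these node(s): best case *)
  | Same_socket of int list (* spans low-distance nodes on one socket *)
  | No_affinity (* no placement found: round-robin fallback *)

(* The caller can then decide and log accordingly. *)
let describe = function
  | Exact nodes ->
      Printf.sprintf "NUMA placement on node(s) %s"
        (String.concat ", " (List.map string_of_int nodes))
  | Same_socket nodes ->
      Printf.sprintf
        "same-socket fallback on node(s) %s: minor performance impact"
        (String.concat ", " (List.map string_of_int nodes))
  | No_affinity ->
      "no NUMA affinity: expect remote memory accesses"
```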

### Example scenario when not waiting for free NUMA node memory

Note: This uses round numbers for easy checking and is purely theoretical:

| Node | RAM (GB) | Used (GB) | Free (GB) |
| ----:| --------:| ---------:| ---------:|
|    1 |       50 |        35 |        15 |
|    2 |       50 |        45 |         5 |
|    3 |       50 |        35 |        15 |
|    4 |       50 |        35 |        15 |
|  all |      200 |       150 |        50 |

Action: A 45 GB VM on Node 2 is shut down and started again.
1. When the new `VM.start` runs, the 45 GB may not have been scrubbed yet.
2. The free memory check still finds 50 GB free, enough to start the VM.
3. NUMA placement picks one of the other nodes, as they have more free memory.
4. For example, assume it picks Node 1 and sets the node affinity to it.
5. The Xen buddy allocator runs out of 1 GB superpages on Node 1 after
   having exhausted the 15 GB of free memory on it.
6. This leaves 30 GB to be allocated elsewhere.
7. Meanwhile, some memory might have been scrubbed and reclaimed on Node 2.
8. The Xen buddy allocator then falls back to allocating in a round-robin
   fashion from the other NUMA nodes, assume 10 GB on each of the other three.

New memory situation after the restart:

| Node | RAM (GB) | Used (GB) | Free (GB) | Dom1 (GB) |
| ----:| --------:| ---------:| ---------:| ---------:|
|    1 |       50 |        50 |         0 |        15 |
|    2 |       50 |        10 |        40 |        10 |
|    3 |       50 |        45 |         5 |        10 |
|    4 |       50 |        45 |         5 |        10 |
|  all |      200 |       150 |        50 |        45 |

Thus, a single VM restart may cause the VM's memory to be spread over
all NUMA nodes. As a result, most memory accesses would be remote.

`xenguest` populates the guest memory during the build step.

But as `VM.build` micro-ops run in parallel, this can happen:
The free memory reported by Xen may not yet reflect memory that will
be allocated by other concurrently running `VM.build` micro-ops when
the `xenguest` processes started by them populate the VM memory.
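One conceivable mitigation, sketched here, is for the toolstack to
track in-flight memory reservations and subtract them from Xen's
reported free memory. This is illustrative only, not what `xenopsd`
currently does; a real implementation would also guard the table with
a mutex, since micro-ops run on multiple pool threads:

```ml
(* Illustrative sketch: subtract in-flight reservations from Xen's
   reported free memory, so that concurrent VM.build micro-ops do
   not all count the same free pages. Not the current xenopsd code. *)
let reservations : (int, int64) Hashtbl.t = Hashtbl.create 16

let reserve ~domid kib = Hashtbl.replace reservations domid kib

let release ~domid = Hashtbl.remove reservations domid

(* Free memory still available after accounting for reservations. *)
let effective_free_kib reported_free_kib =
  let reserved =
    Hashtbl.fold (fun _ kib acc -> Int64.add acc kib) reservations 0L
  in
  Int64.sub reported_free_kib reserved
```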

<!-- File: doc/content/toolstack/features/NUMA/node-fallbacks.md (78 additions) -->
---
title: VM.build Use neighbouring NUMA nodes
mermaid:
force: true
theme: forest
mermaidInitialize: '{ "theme": "base" }'
---

{{%include "topologies/2x2.md" %}}

This shows that the distance to remote memory on the same socket
is far lower than the distance to the other socket, which is twice
the local memory distance.

As a result, using remote memory on the same socket increases the
distance by only about one tenth over local memory, whereas using
the other socket doubles it.


Hence, for VM builds where there might not be enough memory on one
NUMA node, using same-socket memory would have only a fraction of
the performance impact that remote-socket memory would have.
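With illustrative distance values for such a topology (assumed here
for the arithmetic only: local access 10, remote node on the same
socket 11, other socket 20; the real values come from the included
topology table), the relative penalties work out as:

```ml
(* Illustrative distance values, assumed for this arithmetic only:
   local access 10, remote node on the same socket 11, other socket 20. *)
let local = 10. and same_socket = 11. and other_socket = 20.

(* Extra cost over purely local access, as a fraction of local cost. *)
let penalty remote = (remote -. local) /. local

let () =
  Printf.printf "same-socket: +%.0f%%, other-socket: +%.0f%%\n"
    (100. *. penalty same_socket)
    (100. *. penalty other_socket)
```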

At the same time, if the memory is (roughly) equally spread over two
NUMA nodes on the same socket, it could make sense to move the
vCPU affinity between the two NUMA nodes depending on their CPU load.

In the simplest case, the vCPU affinity could be set to e.g. two
NUMA nodes on the same socket (specified as having low distance),
which would cause Xen to allocate the memory from both NUMA nodes.
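Selecting such a pair of nodes could be sketched as a scan of the
distance matrix for the two distinct nodes with the lowest inter-node
distance. The `closest_pair` helper below is a hypothetical sketch,
not the current placement code:

```ml
(* Find the pair of distinct NUMA nodes with the lowest inter-node
   distance, e.g. two nodes on the same socket. Sketch only. *)
let closest_pair (distances : int array array) : (int * int) option =
  let n = Array.length distances in
  let best = ref None in
  for i = 0 to n - 1 do
    for j = i + 1 to n - 1 do
      let d = distances.(i).(j) in
      match !best with
      | Some (_, _, bd) when bd <= d -> ()
      | _ -> best := Some (i, j, d)
    done
  done ;
  Option.map (fun (i, j, _) -> (i, j)) !best
```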

If this is not done, Xen would allocate the memory from a single
node only, even when a neighbouring low-distance node has free memory.


## Example: free memory per NUMA node

| Node | RAM (GB) | Used (GB) | Free (GB) |
| ----:| --------:| ---------:| ---------:|
|    1 |       50 |        35 |        15 |
|    2 |       50 |        45 |         5 |
|    3 |       50 |        35 |        15 |
|    4 |       50 |        35 |        15 |
|  all |      200 |       150 |        50 |


<!-- File: doc/content/toolstack/features/NUMA/parallel-VM.build.md (52 additions) -->
---
title: "Parallel VM build"
categories:
- NUMA
weight: 50
mermaid:
force: true
---

## Introduction

When the `xenopsd` server receives a `VM.start` request, it:
1. splits the request into micro-ops, and
2. dispatches the micro-ops into one queue per VM.

When `VM.start` requests arrive faster than the thread pool
finishes them, the thread pool runs multiple
micro-ops for different VMs in parallel. This includes the
`VM.build` micro-op, which does NUMA placement and VM memory allocation.

The [Xenopsd architecture](xenopsd/architecture/_index) and the
[walkthrough of VM.start](VM.start) provide more details.

This walkthrough dives deeper into the `VM_create` and `VM_build` micro-ops
and focusses on how memory is allocated for different VMs in parallel,
with respect to the NUMA placement of the starting VMs.

## Architecture

This diagram shows the [architecture](../../../xenopsd/architecture/_index) of Xenopsd:

At the top of the diagram, two client RPCs have been sent:
One to start a VM and the other to fetch the latest events.
The `Xenops_server` module splits them into "micro-ops" (labelled "μ op" here).
These micro-ops are enqueued in queues, one queue per VM. The thread pool pulls
from the VM queues and runs the micro-ops:

![Inside xenopsd](../../../../xenopsd/architecture/xenopsd.svg)
<center><figcaption><i>Image 1: Xenopsd architecture</i></figcaption></center>
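The per-VM queueing described above can be modelled with a toy sketch
(illustrative only; the real implementation lives in `Xenops_server`
and uses a worker thread pool):

```ml
(* Toy model of xenopsd's per-VM queues: micro-ops for the same VM
   stay ordered, while queues of different VMs can be served in
   parallel by the thread pool. Illustrative only. *)
let queues : (string, string Queue.t) Hashtbl.t = Hashtbl.create 16

let enqueue ~vm op =
  let q =
    match Hashtbl.find_opt queues vm with
    | Some q -> q
    | None ->
        let q = Queue.create () in
        Hashtbl.add queues vm q ; q
  in
  Queue.push op q

(* A pool worker takes the next micro-op of one VM, if any. *)
let next ~vm =
  match Hashtbl.find_opt queues vm with
  | Some q when not (Queue.is_empty q) -> Some (Queue.pop q)
  | _ -> None
```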

Overview of the micro-ops for creating a new VM:

- `VM.create`: create an empty Xen domain in the hypervisor and the Xenstore.
- `VM.build`: build the Xen domain: allocate guest memory and load the firmware and `hvmloader`.
- Several micro-ops: attach devices and launch the device model.
- `VM.unpause`: unpause the domain.

## Flowchart: Parallel VM start

When multiple `VM.start` requests run concurrently, an example could look like this:

{{% include "snippets/vm-build-parallel" %}}