diff --git a/doc/content/toolstack/features/NUMA/index.md b/doc/content/toolstack/features/NUMA/_index.md
similarity index 100%
rename from doc/content/toolstack/features/NUMA/index.md
rename to doc/content/toolstack/features/NUMA/_index.md
diff --git a/doc/content/toolstack/features/NUMA/lazy-reclaim.md b/doc/content/toolstack/features/NUMA/lazy-reclaim.md
new file mode 100644
index 00000000000..fb12deb8207
--- /dev/null
+++ b/doc/content/toolstack/features/NUMA/lazy-reclaim.md
@@ -0,0 +1,256 @@
---
title: "Lazy memory reclaim"
weight: 10
categories:
  - NUMA
---
## Xen host memory scrubbing

Xen does not immediately reclaim deallocated memory.
Instead, a host memory scrubber runs lazily in the
background to reclaim recently deallocated memory.

Thus, there is no guarantee that Xen has finished scrubbing
by the time `xenopsd` is asked to build a domain.

## Waiting for enough free host memory

> [!info]
> When `xenopsd` starts to build a VM and not enough host-wide
> memory has been reclaimed yet, its
> [build_pre](https://github.com/xapi-project/xen-api/blob/073373ff/ocaml/xenopsd/xc/domain.ml#L899-L964)
> function (also run during VM restore and VM migration)
> [polls](https://github.com/xapi-project/xen-api/blob/073373ff2abfa386025f2b1eee7131520df76be9/ocaml/xenopsd/xc/domain.ml#L904)
> Xen [until enough host-wide memory](https://github.com/xapi-project/xen-api/blob/073373ff2abfa386025f2b1eee7131520df76be9/ocaml/xenopsd/xc/domain.ml#L236-L272)
> has been reclaimed.
> See the
> [walk-through of Domain.build](../../../xenopsd/walkthroughs/VM.build/Domain.build.md#build_pre-prepare-building-the-vm)
> of `xenopsd` for more context:

```ml
let build_pre ~xc ~xs ~vcpus ~memory ~has_hard_affinity domid =
  let open Memory in
  let uuid = get_uuid ~xc domid in
  debug "VM = %s; domid = %d; waiting for %Ld MiB of free host memory"
    (Uuidx.to_string uuid) domid memory.required_host_free_mib ;
  (* CA-39743: Wait, if necessary, for the Xen scrubber to catch up. *)
  if
    not (wait_xen_free_mem ~xc (Memory.kib_of_mib memory.required_host_free_mib))
  then (
    error "VM = %s; domid = %d; Failed waiting for Xen to free %Ld MiB"
      (Uuidx.to_string uuid) domid memory.required_host_free_mib ;
    raise (Not_enough_memory (Memory.bytes_of_mib memory.required_host_free_mib))
  ) ;
```

This is the implementation of the polling function:

```ml
let wait_xen_free_mem ~xc ?(maximum_wait_time_seconds = 64) required_memory_kib
    : bool =
  let open Memory in
  let rec wait accumulated_wait_time_seconds =
    let host_info = Xenctrl.physinfo xc in
    let free_memory_kib =
      kib_of_pages (Int64.of_nativeint host_info.Xenctrl.free_pages)
    in
    let scrub_memory_kib =
      kib_of_pages (Int64.of_nativeint host_info.Xenctrl.scrub_pages)
    in
    (* At exponentially increasing intervals, write *)
    (* a debug message saying how long we've waited: *)
    if is_power_of_2 accumulated_wait_time_seconds then
      debug
        "Waited %i second(s) for memory to become available: %Ld KiB free, %Ld \
         KiB scrub, %Ld KiB required"
        accumulated_wait_time_seconds free_memory_kib scrub_memory_kib
        required_memory_kib ;
    if
      free_memory_kib >= required_memory_kib
      (* We already have enough memory. *)
    then
      true
    else if scrub_memory_kib = 0L (* We'll never have enough memory. *) then
      false
    else if
      accumulated_wait_time_seconds >= maximum_wait_time_seconds
      (* We've waited long enough. *)
    then
      false
    else (
      Thread.delay 1.0 ;
      wait (accumulated_wait_time_seconds + 1)
    )
  in
  wait 0
```

## Waiting for enough free memory on NUMA nodes

To address the same situation not just host-wide but per NUMA node,
the build, restore and migrate code paths for domains on NUMA
machines need a similar algorithm.

It should run directly before the NUMA placement algorithm,
or even become part of an improvement to it:

The NUMA placement algorithm calls the `numainfo` hypercall to
obtain a table of NUMA nodes with the available memory on each
node, plus the distance matrix between the NUMA nodes, as the
basis for the NUMA placement decision for the VM.

If the reported free memory of the host is lower than expected
at that moment, this might indicate that some memory has not
been scrubbed yet. Another indication is the amount of free
memory increasing between two checks.

Also, if other domains are in the process of being shut down,
or if a shutdown recently occurred, Xen is likely scrubbing in
the background.

For cases where the NUMA placement returns no NUMA node affinity
for the new domain, the smallest possible change would be to
simply re-run the NUMA placement algorithm.

A trivial first step would be to retry once if the initial
NUMA placement of a VM failed, and to abort retrying if the
available memory did not change since the initial failed attempt.

The host-wide polling shown above aborts when no memory remains
to be scrubbed (`scrub_pages` is 0). For the per-NUMA-node poll,
the results of the previous poll could likewise be kept and
compared to the new results, aborting when no progress is made.

Besides, the same polling timeout as for host-wide memory
could be used.
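The proposal above can be sketched as a per-node variant of `wait_xen_free_mem`. This is hypothetical code, not an existing binding: `free_mem_per_node` stands in for an assumed helper built on the `numainfo` hypercall, and comparing successive polls replaces the `scrub_pages = 0` check, since per-node scrub counts are not reported:

```ml
(* Hypothetical sketch: [free_mem_per_node ()] is an assumed helper
   (e.g. built on the [numainfo] hypercall) returning the free memory
   in KiB of each NUMA node. *)
let wait_numa_free_mem ?(maximum_wait_time_seconds = 64) ~free_mem_per_node
    required_memory_kib : bool =
  let rec wait previous accumulated_wait_time_seconds =
    let per_node = free_mem_per_node () in
    if List.exists (fun free -> free >= required_memory_kib) per_node then
      true (* Some node can satisfy the request: run NUMA placement now. *)
    else if
      per_node = previous (* No reclaim progress since the previous poll. *)
      || accumulated_wait_time_seconds >= maximum_wait_time_seconds
    then
      false
    else (
      Thread.delay 1.0 ;
      wait per_node (accumulated_wait_time_seconds + 1)
    )
  in
  wait [] 0
```

As with the host-wide poll, the timeout bounds the total wait, and a caller could fall back to host-wide placement when this returns `false`.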

## An example scenario

This is an example scenario where not waiting for memory scrubbing
in a NUMA-aware way could fragment the VM across many NUMA nodes:

In this example, a relatively large VM is rebooted:

Fictional machine with 4 NUMA nodes, 25 GB each (for layout reasons):

```mermaid
%%{init: {"packet": {"bitsPerRow": 25, "rowHeight": 38}} }%%
packet-beta
  0-18: "Memory used by other VMs"
  19-24: "free: 6 GB"
  25-44: "VM before restart: 20 GB"
  45-49: "free: 5 GB"
  50-69: "Memory used by other VMs"
  70-74: "free: 5 GB"
  75-94: "Memory used by other VMs"
  95-99: "free: 5 GB"
```

VM is destroyed:

```mermaid
%%{init: {"packet": {"bitsPerRow": 25, "rowHeight": 38}} }%%
packet-beta
  0-18: "Memory used by other VMs"
  19-24: "free: 6 GB"
  25-44: "VM memory to be reclaimed, but not yet scrubbed"
  45-49: "free: 5 GB"
  50-69: "Memory used by other VMs"
  70-74: "free: 5 GB"
  75-94: "Memory used by other VMs"
  95-99: "free: 5 GB"
```

NUMA placement runs and sees that no NUMA node has enough memory
for the VM. Therefore:
1. NUMA placement does not return a NUMA placement solution.
2. As a result, vCPU soft pinning is not set up.
3. As a result, the domain does not get a NUMA node affinity.
4. When `xenguest` allocates the VM's memory, Xen falls back to
   round-robin memory allocation across all NUMA nodes.

Even if Xen has already scrubbed the memory by the time the
NUMA placement function returns, the decision not to select
a NUMA placement has already been made and the domain is
built in this way:

```mermaid
%%{init: {"packet": {"bitsPerRow": 25, "rowHeight": 38}} }%%
packet-beta
  0-18: "Memory used by other VMs"
  19-23: "VM: 5 GB"
  24-24: ""
  25-44: "scrubbed/reclaimed free memory: 20 GB"
  45-49: "VM: 5 GB"
  50-69: "Memory used by other VMs"
  70-74: "VM: 5 GB"
  75-94: "Memory used by other VMs"
  95-99: "VM: 5 GB"
```

Provided the reclaimed 20 GB of memory is not partially allocated
to other VMs in the meantime, the 20 GB of NUMA-node memory is
available for the VM again once scrubbing and memory reclaim are
complete.

When the 20 GB VM is rebooted and the memory is still available,
the rebooted VM might become NUMA-affine to the 2nd NUMA node
again. Of course, this unpredictability is what we need to fix.

## Starting VMs when not enough reclaim is possible

When no NUMA node has enough memory to run a new VM,
waiting will not help.

Still, it might be good to inform the caller that perfect NUMA
placement could not be achieved.

However, if a CPU socket with multiple NUMA nodes with a very low
internode distance has enough free memory, that could be seen as
a fallback with a relatively low performance impact.

In the end, in such situations, it is up to the caller whether to
start the VM anyway despite it not being NUMA-aligned, and whether
to report the expected performance degradation of the VM.

### Example scenario when not waiting for free NUMA node memory

Note: This uses round numbers for easy checking and is purely theoretical:

| Node | RAM | used | free |
| ----:| ---:| ----:| ----:|
| 1    | 50  | 35   | 15   |
| 2    | 50  | 45   | 5    |
| 3    | 50  | 35   | 15   |
| 4    | 50  | 35   | 15   |
| all  | 200 | 150  | 50   |

Action: A 45 GB VM on Node 2 is shut down and started again.
1. When the new `VM.start` runs, the 45 GB may not have been scrubbed yet.
2. The free memory check still finds 50 GB free, enough to start the VM.
3. NUMA placement picks one of the other nodes as they have more free memory.
4. For example, assume it picks Node 1 and sets the node affinity to it.
5. The Xen buddy allocator will run out of 1 GB superpages on Node 1 after
   having exhausted the 15 GB of free memory on it.
6. This leaves 30 GB to be allocated elsewhere.
7. Meanwhile, some memory might have been scrubbed and reclaimed on Node 2.
8. The Xen buddy allocator then falls back to allocating in a round-robin
   fashion from the other NUMA nodes, assume 10 GB from each of the other 3 nodes.

New memory situation after the restart:

| Node | RAM | used | free | Dom1 |
| ----:| ---:| ----:| ----:| ----:|
| 1    | 50  | 50   | 0    | 15   |
| 2    | 50  | 10   | 40   | 10   |
| 3    | 50  | 45   | 5    | 10   |
| 4    | 50  | 45   | 5    | 10   |
| all  | 200 | 150  | 50   | 45   |

Thus, a single VM restart may cause the VM's memory to be spread over
all NUMA nodes. As a result, most memory accesses would be remote.

`xenguest` populates the guest memory during the build step.

But as `VM.build` micro-ops run in parallel, this can happen:
the free memory reported by Xen may not yet reflect memory that will
be allocated by other concurrently running `VM.build` micro-ops when
the `xenguest` processes started by them populate the VM memory.
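The spill-over in steps 5–8 above can be sketched with a toy model. This is illustrative only, not xenopsd code: Xen's buddy allocator is far more complex, node indices 0–3 correspond to Nodes 1–4 of the tables, and we assume 10 GB of the shut-down VM's memory on Node 2 has already been scrubbed by the time the allocator spills over:

```ml
(* Toy model of the fallback: exhaust the node-affine node first, then
   spill over round-robin to the others.  [free] holds free GiB per node.
   Assumes the request fits in the total free memory. *)
let allocate ~free ~preferred request =
  let take = Array.make (Array.length free) 0 in
  let remaining = ref request in
  let grab node amount =
    let got = min amount (min free.(node) !remaining) in
    take.(node) <- take.(node) + got ;
    free.(node) <- free.(node) - got ;
    remaining := !remaining - got
  in
  (* First exhaust the preferred (node-affine) node... *)
  grab preferred free.(preferred) ;
  (* ...then spill over round-robin, 1 GiB at a time, to the other nodes. *)
  while !remaining > 0 do
    Array.iteri
      (fun node free_gb ->
        if node <> preferred && free_gb > 0 && !remaining > 0 then grab node 1)
      free
  done ;
  take
```

With `allocate ~free:[|15; 15; 15; 15|] ~preferred:0 45`, the preferred node contributes 15 GB and each other node 10 GB, matching the `Dom1` column of the table above.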
diff --git a/doc/content/toolstack/features/NUMA/node-fallbacks.md b/doc/content/toolstack/features/NUMA/node-fallbacks.md
new file mode 100644
index 00000000000..036fda29e9a
--- /dev/null
+++ b/doc/content/toolstack/features/NUMA/node-fallbacks.md
@@ -0,0 +1,78 @@
---
title: "VM.build: Use neighbouring NUMA nodes"
mermaid:
  force: true
  theme: forest
  mermaidInitialize: '{ "theme": "base" }'
---

{{% include "topologies/2x2.md" %}}

This shows that the distance to remote memory on the same socket
is far lower than the distance to the other socket, which is twice
the local memory distance.

As a result, using remote memory on the same socket increases the
distance by only a tenth (10 → 11), whereas using the other socket
more than doubles it (10 → 21).

Hence, for VM builds where one NUMA node might not have enough
memory, using same-socket memory would incur only about half of the
distance that remote-socket memory would (11 vs 21).

At the same time, if the memory is (roughly) equally spread over two
NUMA nodes on the same socket, it could make sense to move the
vCPU affinity between the two NUMA nodes depending on their CPU load.

In the simplest case, the vCPU affinity could be set to e.g. two
NUMA nodes on the same socket (specified as having low distance),
which would cause Xen to allocate the memory from both NUMA nodes.
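In numbers, using the normalized distances from the matrix above (these are SLIT-style ratios, not measured latencies), the relative overheads work out as a back-of-the-envelope calculation:

```ml
(* Normalized SLIT distances from the 2x2 topology above. *)
let local = 10
let same_socket = 11
let other_socket = 21

(* Extra distance relative to local memory, in percent. *)
let overhead_pct remote = 100 * (remote - local) / local

let () =
  Printf.printf "same socket: +%d%%, other socket: +%d%%\n"
    (overhead_pct same_socket) (overhead_pct other_socket)
  (* prints: same socket: +10%, other socket: +110% *)
```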

If this is not done, the allocation may instead fall back to
round-robin allocation across all NUMA nodes, including the
remote socket.

## Example: free memory per NUMA node

| Node | RAM | used | free |
| ----:| ---:| ----:| ----:|
| 1    | 50  | 35   | 15   |
| 2    | 50  | 45   | 5    |
| 3    | 50  | 35   | 15   |
| 4    | 50  | 35   | 15   |
| all  | 200 | 150  | 50   |
\ No newline at end of file
diff --git a/doc/content/toolstack/features/NUMA/parallel-VM.build.md b/doc/content/toolstack/features/NUMA/parallel-VM.build.md
new file mode 100644
index 00000000000..fe08dce2102
--- /dev/null
+++ b/doc/content/toolstack/features/NUMA/parallel-VM.build.md
@@ -0,0 +1,52 @@
---
title: "Parallel VM build"
categories:
  - NUMA
weight: 50
mermaid:
  force: true
---

## Introduction

When the `xenopsd` server receives a `VM.start` request, it:
1. splits the request into micro-ops and
2. dispatches the micro-ops into one queue per VM.

When `VM.start` requests arrive faster than the thread pool
finishes them, the thread pool will run multiple
micro-ops for different VMs in parallel. This includes the
`VM.build` micro-op that does NUMA placement and VM memory allocation.

The [Xenopsd architecture](xenopsd/architecture/_index) and the
[walkthrough of VM.start](VM.start) provide more details.

This walkthrough dives deeper into the `VM_create` and `VM_build` micro-ops
and focusses on allocating memory for different VMs in
parallel with respect to the NUMA placement of the starting VMs.

## Architecture

This diagram shows the [architecture](../../../xenopsd/architecture/_index) of Xenopsd:

At the top of the diagram, two client RPCs have been sent:
one to start a VM and the other to fetch the latest events.
The `Xenops_server` module splits them into "micro-ops" (labelled "μ op" here).
These micro-ops are enqueued in queues, one queue per VM. The thread pool pulls
from the VM queues and runs the micro-ops:

![Inside xenopsd](../../../../xenopsd/architecture/xenopsd.svg)
Image 1: Xenopsd architecture

Overview of the micro-ops for creating a new VM:

- `VM.create`: create an empty Xen domain in the hypervisor and the Xenstore
- `VM.build`: build a Xen domain: allocate guest memory and load the firmware and `hvmloader`
- Several micro-ops to attach devices and launch the device model.
- `VM.unpause`: unpause the domain

## Flowchart: Parallel VM start

When multiple `VM.start` operations run concurrently, an example could look like this:

{{% include "snippets/vm-build-parallel" %}}
diff --git a/doc/content/toolstack/features/NUMA/parallel-boot.md b/doc/content/toolstack/features/NUMA/parallel-boot.md
new file mode 100644
index 00000000000..1de2a6339fc
--- /dev/null
+++ b/doc/content/toolstack/features/NUMA/parallel-boot.md
@@ -0,0 +1,13 @@
---
title: "Parallel VM.build"
---

Summary of the xenopsd architecture when running micro-ops
in parallel:

See: [VM.build: Architecture](architecture.md)

Running multiple `VM.build` micro-ops in parallel can
run into two kinds of race conditions for NUMA placement:

- [Lazy scrubbing by the Xen hypervisor](lazy-reclaim.md)
- Memory allocated by concurrently running `VM.build` micro-ops
  that is not yet reflected in the free memory reported by Xen
\ No newline at end of file
diff --git a/doc/content/toolstack/features/NUMA/topologies/2x2.md b/doc/content/toolstack/features/NUMA/topologies/2x2.md
new file mode 100644
index 00000000000..d90510aba73
--- /dev/null
+++ b/doc/content/toolstack/features/NUMA/topologies/2x2.md
@@ -0,0 +1,61 @@
---
title: 2 sockets, 4 nodes
description: NUMA topology with 2 sockets, 4 nodes
---

### Example NUMA topology with 2 sockets, 4 nodes

A topology with 2 sockets and 4 nodes results
in this NUMA memory distance matrix:

|node| 0| 2| 1| 3|
|---:|-:|-:|-:|-:|
| 0  |10|21|11|21|
| 2  |21|10|21|11|
| 1  |11|21|10|21|
| 3  |21|11|21|10|

The distance values in this matrix describe, in a normalized way,
how large the distance from a NUMA node's CPU to the memory of
another node is:

- 10: This is (by convention) the distance to the local memory of the NUMA node
- 11: Relative to 10, the
distance to the remote memory on the same socket
- 21: Relative to 10, the distance to the remote memory on the other socket

This NUMA distance matrix could be visualized using this block diagram:

{{< mermaid >}}
block-beta
columns 3
  %% 1st row, left column
  block columns 1
    Mem0[/"Memory of Node 0"/]
    Dist0<["Distance: 10"]>(up)
    Node0{{"CPU of Node 0"}}
  end
  %% 1st row, middle column
  space
  %% 1st row, right column
  block columns 1
    Mem2[/"Memory of Node 2"/]
    Dist2<["Distance: 10"]>(up)
    Node2{{"CPU of Node 2"}}
  end
  %% 2nd row
  Socket_1<["Distance: 11"]>(y)
  x<["Distance: 21"]>(x)
  Socket_2<["Distance: 11"]>(y)
  %% 3rd row
  block columns 1
    Node1{{"CPU of Node 1"}}
    Dist1<["Distance: 10"]>(down)
    Mem1[/"Memory of Node 1"/]
  end
  space
  block columns 1
    Node3{{"CPU of Node 3"}}
    Dist3<["Distance: 10"]>(down)
    Mem3[/"Memory of Node 3"/]
  end
{{< /mermaid >}}
diff --git a/doc/content/toolstack/features/NUMA/topologies/_index.md b/doc/content/toolstack/features/NUMA/topologies/_index.md
new file mode 100644
index 00000000000..6cbffff0390
--- /dev/null
+++ b/doc/content/toolstack/features/NUMA/topologies/_index.md
@@ -0,0 +1,6 @@
+++
title = "NUMA topologies"
weight = 20
+++

{{% children description=true %}}