<!-- File: doc/content/toolstack/features/NUMA/lazy-reclaim.md (256 additions) -->
---
title: "Lazy memory reclaim"
weight: 10
categories:
- NUMA
---
## Xen host memory scrubbing

Xen does not immediately reclaim deallocated memory.
Instead, Xen has a host memory scrubber that runs lazily in
the background to reclaim recently deallocated memory.

Thus, there is no guarantee that Xen has finished scrubbing
when `xenopsd` is asked to build a domain.

## Waiting for enough free host memory

> [!info]
> When `xenopsd` starts to build a VM and the reclaimed host-wide
> memory is not yet sufficient, its
> [build_pre](https://github.com/xapi-project/xen-api/blob/073373ff/ocaml/xenopsd/xc/domain.ml#L899-L964)
> function (which also runs during VM restore and VM migration)
> [polls](https://github.com/xapi-project/xen-api/blob/073373ff2abfa386025f2b1eee7131520df76be9/ocaml/xenopsd/xc/domain.ml#L904)
> Xen [until enough host-wide memory](https://github.com/xapi-project/xen-api/blob/073373ff2abfa386025f2b1eee7131520df76be9/ocaml/xenopsd/xc/domain.ml#L236-L272)
> has been reclaimed. See the
> [walk-through of Domain.build](../../../xenopsd/walkthroughs/VM.build/Domain.build.md#build_pre-prepare-building-the-vm)
> in `xenopsd` for more context:

```ml
let build_pre ~xc ~xs ~vcpus ~memory ~has_hard_affinity domid =
  let open Memory in
  let uuid = get_uuid ~xc domid in
  debug "VM = %s; domid = %d; waiting for %Ld MiB of free host memory"
    (Uuidx.to_string uuid) domid memory.required_host_free_mib ;
  (* CA-39743: Wait, if necessary, for the Xen scrubber to catch up. *)
  if
    not (wait_xen_free_mem ~xc (Memory.kib_of_mib memory.required_host_free_mib))
  then (
    error "VM = %s; domid = %d; Failed waiting for Xen to free %Ld MiB"
      (Uuidx.to_string uuid) domid memory.required_host_free_mib ;
    raise (Not_enough_memory (Memory.bytes_of_mib memory.required_host_free_mib))
  ) ;
```

This is the implementation of the polling function:

```ml
let wait_xen_free_mem ~xc ?(maximum_wait_time_seconds = 64) required_memory_kib
    : bool =
  let open Memory in
  let rec wait accumulated_wait_time_seconds =
    let host_info = Xenctrl.physinfo xc in
    let free_memory_kib =
      kib_of_pages (Int64.of_nativeint host_info.Xenctrl.free_pages)
    in
    let scrub_memory_kib =
      kib_of_pages (Int64.of_nativeint host_info.Xenctrl.scrub_pages)
    in
    (* At exponentially increasing intervals, write *)
    (* a debug message saying how long we've waited: *)
    if is_power_of_2 accumulated_wait_time_seconds then
      debug
        "Waited %i second(s) for memory to become available: %Ld KiB free, %Ld \
         KiB scrub, %Ld KiB required"
        accumulated_wait_time_seconds free_memory_kib scrub_memory_kib
        required_memory_kib ;
    if
      free_memory_kib >= required_memory_kib
      (* We already have enough memory. *)
    then
      true
    else if scrub_memory_kib = 0L (* We'll never have enough memory. *) then
      false
    else if
      accumulated_wait_time_seconds >= maximum_wait_time_seconds
      (* We've waited long enough. *)
    then
      false
    else (
      Thread.delay 1.0 ;
      wait (accumulated_wait_time_seconds + 1)
    )
  in
  wait 0
```

## Waiting for enough free memory on NUMA nodes

To address the same situation not host-wide but per NUMA node, the
build, restore, and migrate code paths for domains on NUMA machines
need a similar algorithm.

This should be done directly before the NUMA placement algorithm
runs, or even as part of an improvement for it:

The NUMA placement algorithm calls the `numainfo` hypercall to
obtain a table of NUMA nodes with the available memory on each
node and the distance matrix between the NUMA nodes as the basis
for the NUMA placement decision for the VM.
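The shape of that data can be pictured with a small sketch. The record
fields and the `best_single_node` helper below are illustrative
assumptions, not the actual Xenctrl bindings:

```ml
(* Illustrative sketch of the data the placement algorithm works from.
   Field and function names are hypothetical, not the real bindings. *)
type numa_info = {
    free_memory_kib : int64 array (* free memory per NUMA node, in KiB *)
  ; distances : int array array (* node-to-node distance matrix *)
}

(* Pick the node with the most free memory that can hold the VM,
   or return None when no single node is large enough. *)
let best_single_node info required_kib =
  let best = ref None in
  Array.iteri
    (fun node free ->
      if free >= required_kib then (
        match !best with
        | Some (_, best_free) when best_free >= free -> ()
        | _ -> best := Some (node, free)
      )
    )
    info.free_memory_kib ;
  Option.map fst !best
```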

If the reported free memory of the host is lower than expected at
that moment, this may indicate that some memory has not been
scrubbed yet. Another indication is the amount of free memory
increasing between two checks.

Also, if other domains are in the process of being shut down,
or if a shutdown recently occurred, Xen is likely scrubbing in
the background.

For cases where the NUMA placement returns no NUMA node affinity
for the new domain, the smallest possible change would be to
simply re-run the NUMA placement algorithm.

A trivial first step would be to retry once if the initial NUMA
placement of a VM failed, and to abort retrying if the available
memory did not change since the initial failed attempt.


The host-wide polling shown above gives up when Xen reports that no
pages are left to scrub. For the NUMA memory poll, the previous
results could likewise be kept and compared against the new results
to detect when no further progress is being made.

Besides, the same polling timeout as for host-wide memory
could be used.
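A per-node variant of `wait_xen_free_mem` could follow the same shape.
This is a sketch under assumptions: `free_kib_per_node` is a
hypothetical query (the real code would read the `numainfo` data), and
`sleep` stands in for `Thread.delay`. It aborts when a poll shows no
progress, as suggested above:

```ml
(* Sketch of per-NUMA-node polling, modelled on wait_xen_free_mem.
   [free_kib_per_node] and [sleep] are hypothetical parameters. *)
let wait_numa_free_mem ?(max_wait_seconds = 64) ~sleep ~free_kib_per_node
    ~node required_kib : bool =
  let rec wait elapsed previous_free =
    let free = free_kib_per_node node in
    if free >= required_kib then
      true (* enough memory on this node *)
    else if free = previous_free then
      false (* no progress since the last poll: stop waiting *)
    else if elapsed >= max_wait_seconds then
      false (* waited long enough *)
    else (
      sleep () ;
      wait (elapsed + 1) free
    )
  in
  wait 0 (-1L)
```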

## An example scenario

This is an example scenario where not waiting for memory scrubbing
in a NUMA-aware way could fragment the VM across many NUMA nodes.

In this example, a relatively large VM is rebooted on a fictional
machine with 4 NUMA nodes of 25 GB each (sized for layout reasons):

```mermaid
%%{init: {"packet": {"bitsPerRow": 25, "rowHeight": 38}} }%%
packet-beta
0-18: "Memory used by other VMs"
19-24: "free: 6 GB"
25-44: "VM before restart: 20 GB"
45-49: "free: 5 GB"
50-69: "Memory used by other VMs"
70-74: "free: 5 GB"
75-94: "Memory used by other VMs"
95-99: "free: 5 GB"
```

The VM is destroyed:

```mermaid
%%{init: {"packet": {"bitsPerRow": 25, "rowHeight": 38}} }%%
packet-beta
0-18: "Memory used by other VMs"
19-24: "free: 6 GB"
25-44: "VM memory to be reclaimed, but not yet scrubbed"
45-49: "free: 5 GB"
50-69: "Memory used by other VMs"
70-74: "free: 5 GB"
75-94: "Memory used by other VMs"
95-99: "free: 5 GB"
```

NUMA placement runs, and sees that no NUMA node has enough memory
for the VM. Therefore:
1. NUMA placement does not return a NUMA placement solution.
2. As a result, vCPU soft-pinning is not set up.
3. As a result, the domain does not get a NUMA node affinity.
4. When `xenguest` allocates the VM's memory, Xen falls back to
round-robin memory allocation across all NUMA nodes.

Even if Xen has already scrubbed the memory by the time the
NUMA placement function returns, the decision not to select
a NUMA placement has already been made, and the domain is
built in this way:

```mermaid
%%{init: {"packet": {"bitsPerRow": 25, "rowHeight": 38}} }%%
packet-beta
0-18: "Memory used by other VMs"
19-23: "VM: 5 GB"
24-24: ""
25-44: "scrubbed/reclaimed free memory: 20 GB"
45-49: "VM: 5 GB"
50-69: "Memory used by other VMs"
70-74: "VM: 5 GB"
75-94: "Memory used by other VMs"
95-99: "VM: 5 GB"
```

Provided that the reclaimed 20 GB is not partially allocated to
other VMs in the meantime, the 20 GB of NUMA-node memory becomes
available for the VM again once scrubbing and memory reclaim
are complete.

When the 20 GB VM is rebooted while that memory is still available,
the rebooted VM might become NUMA-affine to the 2nd NUMA node again.
Of course, this unpredictability is what we need to fix.

## Starting VMs when not enough reclaim is possible

When no NUMA node has enough memory to run a new VM,
waiting will not help.

In that case, it might still be good to inform the caller that a
perfect NUMA placement could not be achieved.

However, if a CPU socket with multiple NUMA nodes at a very low
inter-node distance has enough free memory, that can serve as
a fallback with a relatively low performance impact.

In the end, it is up to the caller what to do in such situations:
whether to start the VM anyway despite it not being NUMA-aligned,
and whether to report the expected performance degradation of the VM.
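One way to surface this decision to the caller is a result type that
distinguishes placement quality. The constructor names below are a
hypothetical sketch, not part of the current `xenopsd` API:

```ml
(* Hypothetical result type for a NUMA placement attempt. *)
type placement =
  | Exact of int list (* fits on these node(s): best case *)
  | Same_socket of int list (* spans low-distance nodes on one socket *)
  | No_affinity (* no placement found: round-robin fallback *)

(* The caller can then decide and log accordingly. *)
let describe = function
  | Exact nodes ->
      Printf.sprintf "NUMA placement on node(s) %s"
        (String.concat ", " (List.map string_of_int nodes))
  | Same_socket nodes ->
      Printf.sprintf
        "same-socket fallback on node(s) %s: minor performance impact"
        (String.concat ", " (List.map string_of_int nodes))
  | No_affinity ->
      "no NUMA affinity: expect remote memory accesses"
```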

### Example scenario when not waiting for free NUMA node memory

Note: This uses round numbers for easy checking and is purely theoretical:

| Node | RAM (GB) | Used (GB) | Free (GB) |
| ----:| --------:| ---------:| ---------:|
|    1 |       50 |        35 |        15 |
|    2 |       50 |        45 |         5 |
|    3 |       50 |        35 |        15 |
|    4 |       50 |        35 |        15 |
|  all |      200 |       150 |        50 |

Action: A 45 GB VM on Node 2 is shut down and started again.
1. When the new `VM.start` runs, the 45 GB may not have been scrubbed yet.
2. The free memory check still finds 50 GB free, enough to start the VM.
3. NUMA placement picks one of the other nodes, as they have more free memory.
4. For example, assume it picks Node 1 and sets the node affinity to it.
5. The Xen buddy allocator runs out of 1 GB superpages on Node 1 after
   having exhausted the 15 GB of free memory on it.
6. This leaves 30 GB to be allocated elsewhere.
7. Meanwhile, some memory might have been scrubbed and reclaimed on Node 2.
8. The Xen buddy allocator then falls back to allocating in a round-robin
   fashion from the other NUMA nodes, assume 10 GB on each of the other three.

New memory situation after the restart:

| Node | RAM (GB) | Used (GB) | Free (GB) | Dom1 (GB) |
| ----:| --------:| ---------:| ---------:| ---------:|
|    1 |       50 |        50 |         0 |        15 |
|    2 |       50 |        10 |        40 |        10 |
|    3 |       50 |        45 |         5 |        10 |
|    4 |       50 |        45 |         5 |        10 |
|  all |      200 |       150 |        50 |        45 |

Thus, a single VM restart may cause the VM's memory to be spread over
all NUMA nodes. As a result, most memory accesses would be remote.

`xenguest` populates the guest memory during the build step.

But as `VM.build` micro-ops run in parallel, this can happen:
The free memory reported by Xen may not yet reflect memory that will
be allocated by other concurrently running `VM.build` micro-ops when
the `xenguest` processes started by them populate the VM memory.
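One conceivable mitigation, sketched here, is for the toolstack to
track in-flight memory reservations and subtract them from Xen's
reported free memory. This is illustrative only, not what `xenopsd`
currently does; a real implementation would also guard the table with
a mutex, since micro-ops run on multiple pool threads:

```ml
(* Illustrative sketch: subtract in-flight reservations from Xen's
   reported free memory, so that concurrent VM.build micro-ops do
   not all count the same free pages. Not the current xenopsd code. *)
let reservations : (int, int64) Hashtbl.t = Hashtbl.create 16

let reserve ~domid kib = Hashtbl.replace reservations domid kib

let release ~domid = Hashtbl.remove reservations domid

(* Free memory still available after accounting for reservations. *)
let effective_free_kib reported_free_kib =
  let reserved =
    Hashtbl.fold (fun _ kib acc -> Int64.add acc kib) reservations 0L
  in
  Int64.sub reported_free_kib reserved
```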

<!-- File: doc/content/toolstack/features/NUMA/node-fallbacks.md (78 additions) -->
---
title: VM.build Use neighbouring NUMA nodes
mermaid:
force: true
theme: forest
mermaidInitialize: '{ "theme": "base" }'
---

{{%include "topologies/2x2.md" %}}

This shows that the distance to remote memory on the same socket
is far lower than the distance to the other socket, which is twice
the local memory distance.

As a result, using remote memory on the same socket increases the
distance by only about one tenth over local memory, whereas using
the other socket doubles it.


Hence, for VM builds where there might not be enough memory on one
NUMA node, using same-socket memory would have only a fraction of
the performance impact that remote-socket memory would have.
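With illustrative distance values for such a topology (assumed here
for the arithmetic only: local access 10, remote node on the same
socket 11, other socket 20; the real values come from the included
topology table), the relative penalties work out as:

```ml
(* Illustrative distance values, assumed for this arithmetic only:
   local access 10, remote node on the same socket 11, other socket 20. *)
let local = 10. and same_socket = 11. and other_socket = 20.

(* Extra cost over purely local access, as a fraction of local cost. *)
let penalty remote = (remote -. local) /. local

let () =
  Printf.printf "same-socket: +%.0f%%, other-socket: +%.0f%%\n"
    (100. *. penalty same_socket)
    (100. *. penalty other_socket)
```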

At the same time, if the memory is (roughly) equally spread over two
NUMA nodes on the same socket, it could make sense to move the
vCPU affinity between the two NUMA nodes depending on their CPU load.

In the simplest case, the vCPU affinity could be set to e.g. two
NUMA nodes on the same socket (specified as having low distance),
which would cause Xen to allocate the memory from both NUMA nodes.
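Selecting such a pair of nodes could be sketched as a scan of the
distance matrix for the two distinct nodes with the lowest inter-node
distance. The `closest_pair` helper below is a hypothetical sketch,
not the current placement code:

```ml
(* Find the pair of distinct NUMA nodes with the lowest inter-node
   distance, e.g. two nodes on the same socket. Sketch only. *)
let closest_pair (distances : int array array) : (int * int) option =
  let n = Array.length distances in
  let best = ref None in
  for i = 0 to n - 1 do
    for j = i + 1 to n - 1 do
      let d = distances.(i).(j) in
      match !best with
      | Some (_, _, bd) when bd <= d -> ()
      | _ -> best := Some (i, j, d)
    done
  done ;
  Option.map (fun (i, j, _) -> (i, j)) !best
```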

If this is not done, Xen would allocate the memory from a single
node only, even when a neighbouring low-distance node has free memory.


## Example: free memory per NUMA node

| Node | RAM (GB) | Used (GB) | Free (GB) |
| ----:| --------:| ---------:| ---------:|
|    1 |       50 |        35 |        15 |
|    2 |       50 |        45 |         5 |
|    3 |       50 |        35 |        15 |
|    4 |       50 |        35 |        15 |
|  all |      200 |       150 |        50 |


<!-- File: doc/content/toolstack/features/NUMA/parallel-VM.build.md (52 additions) -->
---
title: "Parallel VM build"
categories:
- NUMA
weight: 50
mermaid:
force: true
---

## Introduction

When the `xenopsd` server receives a `VM.start` request, it:
1. splits the request into micro-ops, and
2. dispatches the micro-ops into one queue per VM.

When `VM.start` requests arrive faster than the thread pool
finishes them, the thread pool runs multiple
micro-ops for different VMs in parallel. This includes the
`VM.build` micro-op, which does NUMA placement and VM memory allocation.

The [Xenopsd architecture](xenopsd/architecture/_index) and the
[walkthrough of VM.start](VM.start) provide more details.

This walkthrough dives deeper into the `VM_create` and `VM_build` micro-ops
and focusses on how memory is allocated for different VMs in parallel,
with respect to the NUMA placement of the starting VMs.

## Architecture

This diagram shows the [architecture](../../../xenopsd/architecture/_index) of Xenopsd:

At the top of the diagram, two client RPCs have been sent:
One to start a VM and the other to fetch the latest events.
The `Xenops_server` module splits them into "micro-ops" (labelled "μ op" here).
These micro-ops are enqueued in queues, one queue per VM. The thread pool pulls
from the VM queues and runs the micro-ops:

![Inside xenopsd](../../../../xenopsd/architecture/xenopsd.svg)
<center><figcaption><i>Image 1: Xenopsd architecture</i></figcaption></center>
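The per-VM queueing described above can be modelled with a toy sketch
(illustrative only; the real implementation lives in `Xenops_server`
and uses a worker thread pool):

```ml
(* Toy model of xenopsd's per-VM queues: micro-ops for the same VM
   stay ordered, while queues of different VMs can be served in
   parallel by the thread pool. Illustrative only. *)
let queues : (string, string Queue.t) Hashtbl.t = Hashtbl.create 16

let enqueue ~vm op =
  let q =
    match Hashtbl.find_opt queues vm with
    | Some q -> q
    | None ->
        let q = Queue.create () in
        Hashtbl.add queues vm q ; q
  in
  Queue.push op q

(* A pool worker takes the next micro-op of one VM, if any. *)
let next ~vm =
  match Hashtbl.find_opt queues vm with
  | Some q when not (Queue.is_empty q) -> Some (Queue.pop q)
  | _ -> None
```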

Overview of the micro-ops for creating a new VM:

- `VM.create`: create an empty Xen domain in the hypervisor and the Xenstore.
- `VM.build`: build the Xen domain: allocate guest memory and load the firmware and `hvmloader`.
- Several micro-ops: attach devices and launch the device model.
- `VM.unpause`: unpause the domain.

## Flowchart: Parallel VM start

When multiple `VM.start` requests run concurrently, an example could look like this:

{{% include "snippets/vm-build-parallel" %}}