diff --git a/doc/content/toolstack/features/NUMA/index.md b/doc/content/toolstack/features/NUMA/_index.md
similarity index 100%
rename from doc/content/toolstack/features/NUMA/index.md
rename to doc/content/toolstack/features/NUMA/_index.md
diff --git a/doc/content/toolstack/features/NUMA/lazy-reclaim.md b/doc/content/toolstack/features/NUMA/lazy-reclaim.md
new file mode 100644
index 00000000000..fb12deb8207
--- /dev/null
+++ b/doc/content/toolstack/features/NUMA/lazy-reclaim.md
@@ -0,0 +1,256 @@
---
title: "Lazy memory reclaim"
weight: 10
categories:
  - NUMA
---
## Xen host memory scrubbing

Xen does not immediately reclaim deallocated memory.
Instead, Xen has a host memory scrubber that runs lazily in
the background to reclaim recently deallocated memory.

Thus, there is no guarantee that Xen has finished scrubbing
when `xenopsd` is asked to build a domain.

## Waiting for enough free host memory

> [!info]
> When `xenopsd` starts to build a VM and the reclaimed
> host-wide memory is not yet sufficient, its
> [build_pre](https://github.com/xapi-project/xen-api/blob/073373ff/ocaml/xenopsd/xc/domain.ml#L899-L964)
> (also part of VM restore / VM migration)
> [polls](https://github.com/xapi-project/xen-api/blob/073373ff2abfa386025f2b1eee7131520df76be9/ocaml/xenopsd/xc/domain.ml#L904)
> Xen [until enough host-wide memory](https://github.com/xapi-project/xen-api/blob/073373ff2abfa386025f2b1eee7131520df76be9/ocaml/xenopsd/xc/domain.ml#L236-L272)
> has been reclaimed.
> See the
> [walk-through of Domain.build](../../../xenopsd/walkthroughs/VM.build/Domain.build.md#build_pre-prepare-building-the-vm)
> of `xenopsd` for more context:

```ml
let build_pre ~xc ~xs ~vcpus ~memory ~has_hard_affinity domid =
  let open Memory in
  let uuid = get_uuid ~xc domid in
  debug "VM = %s; domid = %d; waiting for %Ld MiB of free host memory"
    (Uuidx.to_string uuid) domid memory.required_host_free_mib ;
  (* CA-39743: Wait, if necessary, for the Xen scrubber to catch up. *)
  if
    not (wait_xen_free_mem ~xc (Memory.kib_of_mib memory.required_host_free_mib))
  then (
    error "VM = %s; domid = %d; Failed waiting for Xen to free %Ld MiB"
      (Uuidx.to_string uuid) domid memory.required_host_free_mib ;
    raise (Not_enough_memory (Memory.bytes_of_mib memory.required_host_free_mib))
  ) ;
```

This is the implementation of the polling function:

```ml
let wait_xen_free_mem ~xc ?(maximum_wait_time_seconds = 64) required_memory_kib
    : bool =
  let open Memory in
  let rec wait accumulated_wait_time_seconds =
    let host_info = Xenctrl.physinfo xc in
    let free_memory_kib =
      kib_of_pages (Int64.of_nativeint host_info.Xenctrl.free_pages)
    in
    let scrub_memory_kib =
      kib_of_pages (Int64.of_nativeint host_info.Xenctrl.scrub_pages)
    in
    (* At exponentially increasing intervals, write *)
    (* a debug message saying how long we've waited: *)
    if is_power_of_2 accumulated_wait_time_seconds then
      debug
        "Waited %i second(s) for memory to become available: %Ld KiB free, %Ld \
         KiB scrub, %Ld KiB required"
        accumulated_wait_time_seconds free_memory_kib scrub_memory_kib
        required_memory_kib ;
    if
      free_memory_kib >= required_memory_kib
      (* We already have enough memory. *)
    then
      true
    else if scrub_memory_kib = 0L (* We'll never have enough memory. *) then
      false
    else if
      accumulated_wait_time_seconds >= maximum_wait_time_seconds
      (* We've waited long enough.
*)
    then
      false
    else (
      Thread.delay 1.0 ;
      wait (accumulated_wait_time_seconds + 1)
    )
  in
  wait 0
```

## Waiting for enough free memory on NUMA nodes

To address the same situation not host-wide but per NUMA node,
the build, restore and migrate paths for domains on NUMA machines
need a similar algorithm.

This should be done directly before the NUMA placement algorithm
runs, or even as part of an improved version of it:

The NUMA placement algorithm calls the `numainfo` hypercall to
obtain a table of NUMA nodes with the available memory on each
node and the distance matrix between the NUMA nodes as the basis
for the NUMA placement decision for the VM.

If the reported free memory of the host is lower than expected
at that moment, this may indicate that some memory has not been
scrubbed yet. Another indication is the amount of free memory
increasing between two checks.

Also, if other domains are in the process of being shut down,
or if a shutdown recently occurred, Xen is likely scrubbing in
the background.

For cases where the NUMA placement returns no NUMA node affinity
for the new domain, the smallest possible change would be to
simply re-run the NUMA placement algorithm.

A trivial first step would be to retry once if the initial
NUMA placement of a VM failed, and to abort retrying if the
available memory did not change since the failed attempt.

The system-wide polling above gives up when no memory remains to
be scrubbed (`scrub_pages = 0`). For the NUMA memory poll, the
previous per-node results could likewise be kept and compared with
the new results to detect when no further progress is being made.

In addition, the same polling timeout as for system-wide memory
could be used.
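The retry-with-progress-check idea above can be sketched as a pure decision function. This is a hypothetical sketch, not xenopsd code: it assumes the caller reads per-node `(free, scrub)` KiB values from Xen's `numainfo` on each poll and keeps the previous reading; all names are illustrative.

```ml
(* Hypothetical sketch of a per-NUMA-node poll decision (not xenopsd code).
   [previous] and [current] are per-node (free_kib, scrub_kib) pairs from
   two consecutive numainfo readings. *)
type decision =
  | Node_found of int  (* index of a node with enough free memory *)
  | Keep_waiting       (* the readings changed: scrubbing makes progress *)
  | Give_up            (* nothing changed since the previous poll *)

let poll_decision ~required_kib ~previous ~current =
  let candidate = ref None in
  Array.iteri
    (fun i (free_kib, _scrub_kib) ->
      if !candidate = None && free_kib >= required_kib then candidate := Some i)
    current ;
  match !candidate with
  | Some i -> Node_found i
  | None -> if current = previous then Give_up else Keep_waiting
```

As with `wait_xen_free_mem`, the caller would sleep between polls and give up once the same overall timeout is exceeded.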
## An example scenario

This is an example scenario where not waiting for memory scrubbing
in a NUMA-aware way could fragment the VM across many NUMA nodes.

In this example, a relatively large VM is rebooted.

Fictional machine with 4 NUMA nodes, 25 GB each (for layout reasons):

```mermaid
%%{init: {"packet": {"bitsPerRow": 25, "rowHeight": 38}} }%%
packet-beta
  0-18: "Memory used by other VMs"
  19-24: "free: 6 GB"
  25-44: "VM before restart: 20 GB"
  45-49: "free: 5 GB"
  50-69: "Memory used by other VMs"
  70-74: "free: 5 GB"
  75-94: "Memory used by other VMs"
  95-99: "free: 5 GB"
```

VM is destroyed:

```mermaid
%%{init: {"packet": {"bitsPerRow": 25, "rowHeight": 38}} }%%
packet-beta
  0-18: "Memory used by other VMs"
  19-24: "free: 6 GB"
  25-44: "VM memory to be reclaimed, but not yet scrubbed"
  45-49: "free: 5 GB"
  50-69: "Memory used by other VMs"
  70-74: "free: 5 GB"
  75-94: "Memory used by other VMs"
  95-99: "free: 5 GB"
```

NUMA placement runs, and sees that no NUMA node has enough memory
for the VM. Therefore:
1. NUMA placement does not return a NUMA placement solution.
2. As a result, vCPU soft pinning is not set up.
3. As a result, the domain does not get a NUMA node affinity.
4. When `xenguest` allocates the VM's memory, Xen falls back to
   round-robin memory allocation across all NUMA nodes.
Even if Xen has already scrubbed the memory by the time the
NUMA placement function returns, the decision not to select
a NUMA placement has already been made, and the domain is
built in this way:

```mermaid
%%{init: {"packet": {"bitsPerRow": 25, "rowHeight": 38}} }%%
packet-beta
  0-18: "Memory used by other VMs"
  19-23: "VM: 5 GB"
  24-24: ""
  25-44: "scrubbed/reclaimed free memory: 20 GB"
  45-49: "VM: 5 GB"
  50-69: "Memory used by other VMs"
  70-74: "VM: 5 GB"
  75-94: "Memory used by other VMs"
  95-99: "VM: 5 GB"
```

Provided the reclaimed 20 GB of memory is not partially allocated
to other VMs in the meantime, the 20 GB of NUMA-node memory is
available for the VM again once scrubbing and memory reclaim are
complete.

When the 20 GB VM is rebooted and the memory is still available,
the rebooted VM might become NUMA-affine to the 2nd NUMA node
again. Of course, this unpredictability is what we need to fix.

## Starting VMs when not enough reclaim is possible

When no NUMA node has enough memory to run a new VM, waiting
will not help.

Still, it might be good to inform the caller that perfect NUMA
placement could not be achieved.

However, if a CPU socket with multiple NUMA nodes with a very low
internode distance has enough free memory, that could be used as
a fallback with relatively low performance impact.

In the end, what to do in such situations depends on the caller:
start the VM anyway despite it not being NUMA-aligned, or report
the expected performance degradation of the VM instead.

### Example scenario when not waiting for free NUMA node memory

Note: This uses round numbers for easy checking and is purely theoretical.

| Node | RAM | used | free |
| ----:| ---:| ----:| ----:|
| 1 | 50 | 35 | 15 |
| 2 | 50 | 45 | 5 |
| 3 | 50 | 35 | 15 |
| 4 | 50 | 35 | 15 |
| all | 200 | 150 | 50 |

Action: A 45 GB VM on Node 2 is shut down and started again.
1.
When the new `VM.start` runs, the 45 GB may not have been scrubbed yet.
2. The free memory check still finds 50 GB free, enough to start the VM.
3. NUMA placement picks one of the other nodes as they have more free memory.
4. For example, assume it picks node 1 and sets the node affinity to it.
5. The Xen buddy allocator will run out of 1 GB superpages on node 1 after
   having exhausted the 15 GB of free memory on it.
6. This leaves 30 GB to be allocated elsewhere.
7. Meanwhile, some memory might have been scrubbed and reclaimed on Node 2.
8. The Xen buddy allocator then falls back to allocating in a round-robin
   fashion from the other NUMA nodes, assume 10 GB on each of the 3 nodes.

New memory situation after the restart:

| Node | RAM | used | free | Dom1 |
| ----:| ---:| ----:| ----:| ----:|
| 1 | 50 | 50 | 0 | 15 |
| 2 | 50 | 10 | 40 | 10 |
| 3 | 50 | 45 | 5 | 10 |
| 4 | 50 | 45 | 5 | 10 |
| all | 200 | 150 | 50 | 45 |

Thus, a single VM restart may cause the VM's memory to be spread over
all NUMA nodes. As a result, most memory accesses would be remote.

`xenguest` populates the guest memory as part of the build step.

But as `VM.build` micro-ops run in parallel, this can happen:
the free memory reported by Xen may not yet reflect memory that will
be allocated by other concurrently running `VM.build` micro-ops when
the `xenguest` processes started by them populate the VM memory.
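One way to mitigate this race would be to track the memory claimed by in-flight `VM.build` micro-ops and subtract it from the free memory Xen reports. The sketch below is a hypothetical illustration, not the xenopsd implementation, and assumes each build knows its planned allocation up front; with OCaml 4, `Mutex` requires the `threads` library.

```ml
(* Hypothetical sketch: account for memory already claimed by in-flight
   VM.build micro-ops so concurrent placements do not all trust the same
   "free" number reported by Xen. Not the actual xenopsd implementation. *)
let claimed_kib = ref 0L

let lock = Mutex.create ()

(* Register a claim for the duration of [f], releasing it even on error. *)
let with_claim kib f =
  Mutex.lock lock ;
  claimed_kib := Int64.add !claimed_kib kib ;
  Mutex.unlock lock ;
  Fun.protect f ~finally:(fun () ->
      Mutex.lock lock ;
      claimed_kib := Int64.sub !claimed_kib kib ;
      Mutex.unlock lock
  )

(* Free memory as seen by a new placement: Xen's report minus claims. *)
let effective_free_kib ~reported_free_kib =
  Mutex.lock lock ;
  let c = !claimed_kib in
  Mutex.unlock lock ;
  Int64.sub reported_free_kib c
```

Xen itself offers a related host-side mechanism with `XENMEM_claim_pages`, which reserves unallocated pages for a domain before they are populated.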
diff --git a/doc/content/toolstack/features/NUMA/node-fallbacks.md b/doc/content/toolstack/features/NUMA/node-fallbacks.md
new file mode 100644
index 00000000000..036fda29e9a
--- /dev/null
+++ b/doc/content/toolstack/features/NUMA/node-fallbacks.md
@@ -0,0 +1,78 @@
---
title: "VM.build: Use neighbouring NUMA nodes"
mermaid:
  force: true
  theme: forest
  mermaidInitialize: '{ "theme": "base" }'
---

{{%include "topologies/2x2.md" %}}

This shows that the distance to remote memory on the same socket
is far lower than the distance to the other socket, which is twice
the local memory distance.

As a result, using remote memory on the same socket increases the
distance by only about a tenth of the local distance, whereas
using the other socket doubles it.

Hence, for VM builds where one NUMA node might not have enough
free memory, using same-socket memory would have only about 50%
of the performance impact that remote-socket memory would have.

At the same time, if the memory is (roughly) equally spread over
two NUMA nodes on the same socket, it could make sense to move the
vCPU affinity between the two NUMA nodes depending on their CPU load.

In the simplest case, the vCPU affinity could be set to e.g. two
NUMA nodes on the same socket (specified as having low distance),
which would cause Xen to allocate the memory from both NUMA nodes.
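Selecting such a same-socket fallback set could work directly from the `numainfo` distance matrix. The sketch below is illustrative only (hypothetical function and parameter names): it returns every node whose distance from the candidate node stays below a threshold, which for common topologies yields the node itself plus its same-socket neighbours.

```ml
(* Hypothetical sketch: given the NUMA distance matrix from numainfo
   (distances.(a).(b) = distance from node a to node b), return all
   nodes within [max_distance] of [node], e.g. the node itself plus
   its same-socket neighbours. *)
let fallback_nodes ~distances ~node ~max_distance =
  distances.(node)
  |> Array.to_list
  |> List.mapi (fun peer d -> (peer, d))
  |> List.filter (fun (_, d) -> d <= max_distance)
  |> List.map fst
```

Assuming the usual convention of a local distance of 10, a same-socket distance of 11 and a remote-socket distance of 21 (as in the 2x2 topology above), `fallback_nodes ~node:0 ~max_distance:11` would select nodes 0 and 1.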
If this is not done,

##
| Node | RAM | used | free |
| ----:| ---:| ----:| ----:|
| 1 | 50 | 35 | 15 |
| 2 | 50 | 45 | 5 |
| 3 | 50 | 35 | 15 |
| 4 | 50 | 35 | 15 |
| all | 200 | 150 | 50 |
\ No newline at end of file
diff --git a/doc/content/toolstack/features/NUMA/parallel-VM.build.md b/doc/content/toolstack/features/NUMA/parallel-VM.build.md
new file mode 100644
index 00000000000..fe08dce2102
--- /dev/null
+++ b/doc/content/toolstack/features/NUMA/parallel-VM.build.md
@@ -0,0 +1,52 @@
---
title: "Parallel VM build"
categories:
  - NUMA
weight: 50
mermaid:
  force: true
---

## Introduction

When the `xenopsd` server receives a `VM.start` request, it:
1. splits the request into micro-ops and
2. dispatches the micro-ops into one queue per VM.

When `VM.start` requests arrive faster than the thread pool
finishes them, the thread pool will run multiple
micro-ops for different VMs in parallel. This includes the
`VM.build` micro-op that does NUMA placement and VM memory allocation.

The [Xenopsd architecture](xenopsd/architecture/_index) and the
[walkthrough of VM.start](VM.start) provide more details.

This walkthrough dives deeper into the `VM_create` and `VM_build`
micro-ops and focuses on allocating memory for different VMs in
parallel with respect to the NUMA placement of the starting VMs.

## Architecture

This diagram shows the [architecture](../../../xenopsd/architecture/_index) of Xenopsd:

At the top of the diagram, two client RPCs have been sent:
one to start a VM and the other to fetch the latest events.
The `Xenops_server` module splits them into "micro-ops" (labelled "μ op" here).
These micro-ops are enqueued in queues, one queue per VM. The thread pool pulls
from the VM queues and runs the micro-ops:
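The queue-per-VM dispatch described above can be illustrated with a small sketch. This is a simplified sequential simulation with hypothetical names, not the `Xenops_server` code: micro-ops for one VM stay ordered in that VM's FIFO, while a round-robin drain interleaves micro-ops of different VMs, much as the worker pool would.

```ml
(* Hypothetical sketch: one FIFO of micro-op names per VM. Draining in
   round-robin order preserves per-VM ordering while interleaving the
   micro-ops of different VMs. Not the actual Xenops_server scheduler. *)
let enqueue_all ops =
  (* [ops] : (vm_id, micro_op) list in arrival order. *)
  let queues = Hashtbl.create 8 in
  let vm_order = ref [] in
  List.iter
    (fun (vm, op) ->
      let q =
        match Hashtbl.find_opt queues vm with
        | Some q -> q
        | None ->
            let q = Queue.create () in
            Hashtbl.replace queues vm q ;
            vm_order := vm :: !vm_order ;
            q
      in
      Queue.add op q)
    ops ;
  (List.rev !vm_order, queues)

(* Take one micro-op per VM queue per round until all queues are empty. *)
let drain_round_robin (vm_order, queues) =
  let out = ref [] in
  let rec go () =
    let progress = ref false in
    List.iter
      (fun vm ->
        let q = Hashtbl.find queues vm in
        if not (Queue.is_empty q) then (
          out := (vm, Queue.pop q) :: !out ;
          progress := true
        ))
      vm_order ;
    if !progress then go ()
  in
  go () ;
  List.rev !out
```

In this model, enqueuing `VM_create` and `VM_build` for one VM and `VM_create` for another yields a trace where both `VM_create` micro-ops run before the first VM's `VM_build`, yet each VM's own micro-ops remain in order.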