erfrimod
Contributor

Linux netvsc sends an OID to stop receiving packets on vmbus channel close. Example scenarios: hibernation and MTU change. Prior to opening a new channel and processing the packets, netvsc checks that there are no pending packets. If there are, netvsc logs an error and is unable to recover. We observe the error `hv_netvsc eth0: Ring buffer not empty after closing rndis` in the guest syslog.

Modify netvsp to handle the OID and stop processing RX traffic. This allows netvsc to successfully close and re-open the vmbus channel, even under heavy incoming traffic.
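
A minimal sketch of the idea, not the actual netvsp code. It assumes the OID in question is the RNDIS `OID_GEN_CURRENT_PACKET_FILTER` (0x0001010E) being set to zero; the description above does not name the OID, so treat that as an assumption. `RxGate` and its methods are hypothetical names used only for illustration.

```rust
/// RNDIS OID for the current packet filter. Assumption: this is the OID
/// the guest sets to zero when it wants RX traffic to stop.
pub const OID_GEN_CURRENT_PACKET_FILTER: u32 = 0x0001_010E;

/// Hypothetical per-NIC receive gate.
#[derive(Debug, Default)]
pub struct RxGate {
    /// Last packet filter set by the guest; zero means "deliver nothing".
    packet_filter: u32,
}

impl RxGate {
    /// Handle an RNDIS set request. Returns true if the OID was recognized.
    pub fn handle_set_oid(&mut self, oid: u32, value: u32) -> bool {
        match oid {
            OID_GEN_CURRENT_PACKET_FILTER => {
                self.packet_filter = value;
                true
            }
            _ => false,
        }
    }

    /// Whether incoming packets may be placed on the guest's ring.
    pub fn rx_enabled(&self) -> bool {
        self.packet_filter != 0
    }

    /// Copy packets onto the (simulated) guest ring only while RX is enabled.
    /// Returns the number of packets delivered; the rest are dropped.
    pub fn deliver(&self, incoming: &[Vec<u8>], ring: &mut Vec<Vec<u8>>) -> usize {
        if !self.rx_enabled() {
            return 0;
        }
        ring.extend(incoming.iter().cloned());
        incoming.len()
    }
}
```

Once the filter is zero nothing new lands on the ring, so the guest's "ring buffer empty" check can pass even while the backend keeps receiving traffic.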

Cherry pick of #1873

benhillis and others added 30 commits May 16, 2025 21:53
Update mirroring logic to mirror the 2505 release branch.

Co-authored-by: Ben Hillis <[email protected]>
This branch will be on 1.86.0 for the foreseeable future. As per our
internal support policy this is an even version.
…#1403)

We support these regs on all backends, so no reason not to.

Cherry-pick of microsoft#1308
…t#1387) (microsoft#1393)

The added tests use [loom](https://docs.rs/loom/0.7.2/loom/) to ensure
that all possible order of operations work correctly. This required
tweaking the orderings we use.

Cherry-pick of microsoft#1387
We are trying to write a non-JSON-formatted value to track that an
env-var-sourced variable is secret. Fix this by writing `null`. Also add
in some diagnostics and improve the in-memory variable representation to
avoid so many allocations.

Co-authored-by: John Starks <[email protected]>
…1402) (microsoft#1406)

After long discussions we have decided to flip the default of our
tracing filter, and to allow untagged tracing statements by default. We
believe that the risks and costs of being unable to debug incidents in
production are too high, and that we can manually scrub our tracing
statements to ensure that no sensitive information is leaked.

Cherry-pick of microsoft#1402.

Part of microsoft#852.
…icrosoft#1383) (microsoft#1431)

When guest memory page protections are changed (e.g., pages are
transitioned between shared and private), we need to flush concurrent
accesses to those pages by the paravisor before updating the page state
in hardware. Otherwise, faults or cross-VTL data leaks may occur.

Add this synchronization as cheaply as we can: add a simple RCU
(Read-Copy-Update) mechanism that allows threads accessing guest memory
to cheaply synchronize with threads mutating the page access bitmaps.
Use the membarrier() syscall on Linux to allow readers to operate
without memory barriers, shifting the expense to the (infrequent)
bitmap update paths.

Only enable this mechanism in OpenHCL, since other environments do not
rely on bitmap-based guest memory access controls.
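
A simplified stand-in for the synchronization pattern described above, not the actual implementation. All names here (`GuardedBitmap`, `read`, `update_and_synchronize`) are hypothetical. The sketch uses `SeqCst` atomics on the reader side, which is exactly the cost the real design avoids via membarrier(); it only illustrates the "publish, then wait for in-flight readers" shape.

```rust
use std::sync::atomic::{AtomicBool, AtomicU64, Ordering};

/// Hypothetical 64-page access bitmap with one reader slot per thread.
pub struct GuardedBitmap {
    bits: AtomicU64,
    /// One flag per reader thread; true while that reader is inside a
    /// read-side critical section.
    readers: Vec<AtomicBool>,
}

impl GuardedBitmap {
    pub fn new(initial: u64, reader_count: usize) -> Self {
        Self {
            bits: AtomicU64::new(initial),
            readers: (0..reader_count).map(|_| AtomicBool::new(false)).collect(),
        }
    }

    /// Run `f` on a snapshot of the bitmap inside a read-side critical
    /// section. Each reader slot must be used by one thread at a time.
    pub fn read<R>(&self, reader: usize, f: impl FnOnce(u64) -> R) -> R {
        // SeqCst stands in for the barrier that membarrier() would let
        // the reader skip.
        self.readers[reader].store(true, Ordering::SeqCst);
        let result = f(self.bits.load(Ordering::SeqCst));
        self.readers[reader].store(false, Ordering::SeqCst);
        result
    }

    /// Publish a new bitmap, then wait until every reader that might still
    /// be acting on the old value has left its critical section.
    pub fn update_and_synchronize(&self, new_bits: u64) {
        self.bits.store(new_bits, Ordering::SeqCst);
        for reader in &self.readers {
            while reader.load(Ordering::SeqCst) {
                std::hint::spin_loop();
            }
        }
    }
}
```

After `update_and_synchronize` returns, no reader can still be accessing memory based on the old bitmap, so the caller may go on to change the page state in hardware.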

Cherry-pick of microsoft#1383

Co-authored-by: John Starks <[email protected]>
…dresses (microsoft#1340) (microsoft#1434)

On TDX, we see some cases where the guest attempts to access an address
with an incorrect shared bit. It's unclear if this is an issue in
OpenHCL, the guest, or the host, but fix OpenHCL crashing with an
emulation failure due to a `GuestMemory` access failure, and instead
inject a machine check into the guest.

For addresses outside of mmio and ram, continue to emulate but log that
the guest did something strange. In the future, we may also inject a
machine check on that path.

Tested via a uefi app that attempts to access a shared page at a private
gpa (see
https://github.com/chris-oo/openvmm/blob/uefi-tmk-write-to-shared/tmk/tmk_launch/src/main.rs)
with the following crash:

```
[kmsg]: [3.058450] virt_mshv_vtl::processor::tdx: WARN  guest accessed inaccessible gpa, injecting MC gpa=0x666d9000 is_shared=false
[kmsg]: [3.061137] virt_mshv_vtl::processor: WARN  Guest has reported system crash crash=VtlCrash { vp_index: VpIndex(0), last_vtl: Vtl0, control: GuestCrashCtl { pre_os_id: 0, no_crash_dump: false, crash_message: true, crash_notify: true }, parameters: [12, 0, 0, 6790a820, 4d7] }
[kmsg]: [3.061758] virt_mshv_vtl::processor:
WARN  Guest has reported a system crash message "!!!! X64 Exception Type - 12(#MC - Machine-Check)  CPU Apic ID - 00000000 !!!!<5c>r<5c>nRIP  - 00000000666DB030, CS  - 0000000000000038, RFLAGS - 0000000000010202<5c>r<5c>nRAX  - 00000000666D9000, RCX - 0000000000000042, RDX - 3333333333333333<5c>r<5c>nRBX  - 0000000066E00018, RSP - 0000000033D99150, RBP - 0000000033D99190<5c>r<5c>nRSI  - 0000000066E54718, RDI - 0000000033DB8160<5c>r<5c>nR8   - 0000000000000000, R9  - 00000000666ED508, R10 - 00000000666ED5EE<5c>r<5c>nR11  - 000000000000000A, R12 - 0000000000000000, R13 - 0000000000000000<5c>r<5c>nR14  - 0000000033D99AA0, R15 - 0000000033D99A98<5c>r<5c>nDS   - 0000000000000030, ES  - 0000000000000030, FS  - 0000000000000030<5c>r<5c>nGS   - 0000000000000030, SS  - 0000000000000030<5c>r<5c>nCR0  - 0000000080010073, CR2 - 0000000000000000, CR3 - 0000000033801000<5c>r<5c>nCR4  - 0000000000000668, CR8 - 0000000000000000<5c>r<5c>nDR0  - 0000000000000000, DR1 - 0000000000000000, DR2 - 0000000000000000<5c>r<5c>nDR3  -     
```

microsoft#1426 tracks implementing this for SNP.

Backport of microsoft#1340
…rosoft#1441)

The register page is not valid until it has been mapped and the
hypervisor says it's valid. However, the in-memory contents of the page
before it is mapped could be non-zero, especially across a servicing
operation. This can cause the VMM to think the register page contents
are valid, causing register corruption.

Fix this by explicitly clearing the valid flag in the register page
during sidecar startup.

Also zero a few more potentially stale structures to avoid potential
bugs.

Cherry-pick of microsoft#1440

Co-authored-by: John Starks <[email protected]>
) (microsoft#1448)

Update the kernel to the latest available version, which contains these
sidecar fixes:
microsoft/OHCL-Linux-Kernel#82
microsoft/OHCL-Linux-Kernel#83

Cherry-pick of microsoft#1442

Co-authored-by: Ben Hillis <[email protected]>
This change updates ms-tpm-20-ref-rs to a version that includes a TPM
backing store size fix.
…oft#1465)

Allow the memory backing to provide an error kind, which the virt
backends will later use to determine whether to attempt emulation,
inject a machine fault, resume the VP, or terminate the VM.

Cherry-pick of microsoft#1430

Co-authored-by: John Starks <[email protected]>
…icrosoft#1458)

Filter panic messages (printed to /dev/ttyprintk) out to a separate
target and raise their effective trace level from verbose to critical.

Cherry-pick of microsoft#1455

Co-authored-by: John Starks <[email protected]>
…ng branch (microsoft#1466)

Now that we are in ask mode for the upcoming release, this change
updates the mirroring workflow to submit OSS changes to a staging branch
instead of directly to the release branch. This means that we will need
to periodically merge staging into release (after getting approval).
This is the same flow we did for the 2411 release.

Co-authored-by: Ben Hillis <[email protected]>
…crosoft#1408) (microsoft#1460)

Also includes some drive-by cleanups where I happened to see them.

Areas I did not audit because they are not relevant to CVMs:
- trace and debug level statements
- Non-CVM workers (debug & VNC)
- Test only code (petri, vmm_tests, tmk*)
- ARM-specific code
- Non-CVM virt backends (including virt_mshv_vtl/mshv)
- Host-only code (openvmm, GED, igvmfilegen, etc)
- Gen 1 devices (vga, chipset, etc)
- VirtIO

Areas that still need auditing by owners and area experts:
- Mesh (@jstarks)
- Networking (vm/devices/net/* & underhill_core/netvsp) (networking
team)
- Storage (vm/devices/storage/* & underhill_core/nvme_manager) (storage
team)
- VMBus (vm/devices/vmbus/*) (@SvenGroot)
- VMGS (vm/vmgs/*) (@tjones60)

Part of microsoft#852

Cherry-pick of microsoft#1408
…ft#1470)

The Linux kernel serializes CPU hotplug. If multiple sidecar VPs need to
be onlined into OpenVMM simultaneously, they will all stop running the
guest while associated Linux threads call into the Linux kernel to
online the CPU (which will block on the CPU hotplug lock or whatever).

This means the average blackout time for a VP that's onlined early in
boot is linear in the number of early-onlined VPs. And thanks to typical
device configurations, this is usually linear in the total number of
VPs. This is a performance problem.

To avoid this, explicitly serialize VP online _before_ the target VP is
stopped. This allows the VP to continue running the guest until it
reaches the front of the online queue. This reduces the average blackout
time to just the time to online one CPU, meaning this solution should
scale to any number of VPs.
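
To make the scaling argument concrete, here is a small sketch of the arithmetic. It assumes all VPs request onlining at the same instant and that each online operation takes the same time; the function names are hypothetical.

```rust
/// Average blackout when every VP stops running the guest first and then
/// waits on the serialized hotplug path. The k-th VP in line (1-based) is
/// stopped for k * online_ms.
pub fn avg_blackout_stop_then_wait(vp_count: u64, online_ms: u64) -> f64 {
    if vp_count == 0 {
        return 0.0;
    }
    let total: u64 = (1..=vp_count).map(|k| k * online_ms).sum();
    total as f64 / vp_count as f64
}

/// Average blackout when each VP keeps running the guest until it reaches
/// the front of the online queue: every VP is stopped for one online only.
pub fn avg_blackout_queue_then_stop(vp_count: u64, online_ms: u64) -> f64 {
    if vp_count == 0 {
        0.0
    } else {
        online_ms as f64
    }
}
```

Under these assumptions the first function grows as `(vp_count + 1) / 2 * online_ms`, while the second stays constant at `online_ms`.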

Cherry-pick of microsoft#1443

Co-authored-by: John Starks <[email protected]>
…1477)

The current size check is failing in the release/2505 branch because
it's currently comparing against main and the branches have diverged.

Co-authored-by: Ben Hillis <[email protected]>
…icrosoft#1474)

…wide (microsoft#1370)

In preparation for VTL 1 memory support for CVMs, make the
shared/encrypted bitmap tracking available on a partition-level, rather
than in the GuestMemoryMapping (which ends up being per-VTL). Also
includes some refactoring to isolate out the bitmap logic so that it can
be reused for vtl protection bitmaps.

Tested: SNP +/- guest vsm boots
…1462) (microsoft#1468)

tokio-rs/tracing#2519 can cause the tracing
crate to mistakenly drop logs emitted after calling the `enabled!`
macro. Today we only call that macro in two places; this PR removes one
of them, the second is coming in another PR.

We currently are using our dynamic tracing filtering to filter out what
system-level messages get sent to the host, in addition to its normal
purposes. After much investigation and thought I've come to the
conclusion that there is no good way to work around this bug while
maintaining dynamic configuration. So instead just statically code these
levels. Realistically in terms of what these messages can help us
diagnose this is almost certainly fine.

Cherry-pick of microsoft#1462
…icrosoft#1484)

WHP has a bug around partition scrub on AMD nested hosts which makes
servicing tests flaky. Skip them for now.

This is a targeted PR to just make these tests not flaky. I'd like to
instead rework how we decide what petri tests to run based on host
capabilities, but take this stopgap first.

Cherry pick of microsoft#1480
…rosoft#1486)

This removes functionality added in microsoft@7278a20 to avoid hitting
tokio-rs/tracing#2519. While the functionality is nice to have, it is
not so important as to be worth potentially dropping events, and there
is no performant way to implement it that can avoid this bug.

Then ban the tracing::enabled macro codebase-wide.

Cherry-pick of microsoft#1469
Changes flowey config for mu_msvm to use v25.1.3 release.

mu_msvm release here:
https://github.com/microsoft/mu_msvm/releases/tag/v25.1.3
(This is a backport PR)

This PR adds EFI Diagnostics, which is a service used to parse UEFI
diagnostics data from an in-memory buffer and send it to our tracing
facilities.

The UEFI firmware will write the GPA of the advanced logger buffer to an
I/O port intercept called `SET_EFI_DIAGNOSTICS_GPA`.

The diagnostics service is responsible for reading guest memory at the
specified GPA and parsing the data. This gets triggered when the UEFI
firmware writes to an I/O port intercept called
`PROCESS_EFI_DIAGNOSTICS`.

The `PROCESS_EFI_DIAGNOSTICS` UefiCommand gets triggered by the
following conditions:
- UEFI encounters a failure (guest driven via `PROCESS_EFI_DIAGNOSTICS`)
- UEFI fails to boot any device (guest driven via
`PROCESS_EFI_DIAGNOSTICS`)
- UEFI reaches exit boot services

The simplest way to test this is to run:
```
cargo run -- --uefi
```
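
The two-step flow described above (record the GPA, then process on demand) can be sketched roughly as follows. This is an illustration under assumptions, not the service's real API: guest memory is modeled as a byte slice indexed by GPA, the buffer length is passed in by the caller, and the advanced logger format is not parsed at all. The type and method names are hypothetical; only the two command names come from the description.

```rust
/// Commands the firmware can issue via I/O port writes.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum UefiCommand {
    /// SET_EFI_DIAGNOSTICS_GPA: record where the log buffer lives.
    SetEfiDiagnosticsGpa(u32),
    /// PROCESS_EFI_DIAGNOSTICS: read the buffer at the recorded GPA.
    ProcessEfiDiagnostics,
}

#[derive(Debug, PartialEq, Eq)]
pub enum DiagnosticsError {
    /// Processing was requested before any GPA was set.
    GpaNotSet,
    /// The recorded GPA plus length falls outside guest memory.
    OutOfRange,
}

/// Hypothetical diagnostics service state.
#[derive(Debug, Default)]
pub struct DiagnosticsService {
    gpa: Option<u32>,
}

impl DiagnosticsService {
    /// Handle one command. Returns the raw buffer bytes when processing is
    /// requested, and `None` when only the GPA was recorded.
    pub fn handle(
        &mut self,
        cmd: UefiCommand,
        guest_memory: &[u8],
        buffer_len: usize,
    ) -> Result<Option<Vec<u8>>, DiagnosticsError> {
        match cmd {
            UefiCommand::SetEfiDiagnosticsGpa(gpa) => {
                self.gpa = Some(gpa);
                Ok(None)
            }
            UefiCommand::ProcessEfiDiagnostics => {
                let start = self.gpa.ok_or(DiagnosticsError::GpaNotSet)? as usize;
                let end = start
                    .checked_add(buffer_len)
                    .ok_or(DiagnosticsError::OutOfRange)?;
                let bytes = guest_memory
                    .get(start..end)
                    .ok_or(DiagnosticsError::OutOfRange)?;
                Ok(Some(bytes.to_vec()))
            }
        }
    }
}
```

Because the GPA is guest-controlled, the sketch bounds-checks it before every read instead of trusting the value recorded earlier.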
microsoft#1487)

When VTL 2 accesses VTL 0 memory on behalf of VTL 0, it needs to be able
to check whether VTL 1 has restricted access to the memory. This change
introduces tracking of VTL 1 permissions using bitmaps and adds some of
the enforcement of these permissions.

Tested:
SNP +/- guest VSM boots
TVM and TDX VMs also boot
)

Mitigate TPM corruption caused by previous VMs that had a 16K TPM NVRAM
reported as 32K and committed bad state to the vTPM NVRAM.

This involves the following:

For every 16K TPM NVRAM, walk the dynamic section and truncate the last
header if it points to data past the end.

Additionally, run the following mitigation steps for 16K NVRAM:
1. Check for a 4K bytes AK cert nv index.
   1. This VM needs to be mitigated.
   2. Undefine the 4K AK cert to save space.
   3. Attempt to write a 1 byte mitigation platform marker, which can fail.
   4. Attempt to write a just-sized platform ak cert.
2. Else, check for a mitigated marker or no platform cert.
   1. Log that this vm is mitigated, and if an ak cert is present or not.
3. Else, check for an owner cert.
   1. Log that this VM is in the expected state.
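
The decision steps above can be sketched as a pure classification function. This is an illustration only: `NvState`, `Action`, and `classify` are hypothetical names, the "ak cert present" flag is assumed to refer to the platform AK cert, and the behavior when none of the three checks match is not specified above, so the sketch reports it as `Unspecified`.

```rust
const NVRAM_16K: usize = 16 * 1024;
const AK_CERT_4K: usize = 4 * 1024;

/// Hypothetical summary of what was found in the vTPM NVRAM.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct NvState {
    pub nvram_size: usize,
    /// Size of the platform AK cert NV index, if one is defined.
    pub platform_ak_cert_size: Option<usize>,
    pub has_mitigation_marker: bool,
    pub has_owner_cert: bool,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Action {
    /// Not a 16K NVRAM; these steps do not apply.
    NotApplicable,
    /// Step 1: undefine the 4K cert, write the marker, write a just-sized cert.
    Mitigate,
    /// Step 2: already mitigated (or no platform cert); log only.
    LogMitigated { ak_cert_present: bool },
    /// Step 3: owner cert found; log that the VM is in the expected state.
    LogExpectedState,
    /// None of the checks matched (not covered by the description).
    Unspecified,
}

pub fn classify(state: &NvState) -> Action {
    if state.nvram_size != NVRAM_16K {
        return Action::NotApplicable;
    }
    match state.platform_ak_cert_size {
        Some(AK_CERT_4K) => Action::Mitigate,
        cert if state.has_mitigation_marker || cert.is_none() => Action::LogMitigated {
            ak_cert_present: cert.is_some(),
        },
        _ if state.has_owner_cert => Action::LogExpectedState,
        _ => Action::Unspecified,
    }
}
```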

Co-authored-by: Chris Oo <[email protected]>
…rosoft#1483) (microsoft#1516)

Uses RCU implementation to synchronize reads and writes of the VTL 1
permission bitmaps.

Tested:
SNP +/- guest vsm boots
microsoft#1536)

…mulation (microsoft#1513)

Adds two guest memory objects, backed by kernel/usermode execute VTL 1
permission bitmaps (for cvms), to be used on the emulation path to
enforce VTL 1 protections when accessing instructions during instruction
emulation.

Tested:
SNP +/- guest vsm boots
TVM boots
…crosoft#1535) (microsoft#1553)

Nobody calls it, and they should be going through uh_mem instead
anyways. We probably need to explore unifying traits between virt and
uh_mem in the future, but that can wait.

Cherry-pick of microsoft#1535
chris-oo and others added 22 commits July 31, 2025 13:38
Update nextest to the latest release. This appears to fix some bugs in
nextest around detecting leaks on Windows, which were causing test
failures.

Cherry pick of microsoft#1790.

Fixes microsoft#1782
…microsoft#1797)

On AArch64, the Performance Monitor Unit (PMU) is supposed to be
supported by every platform. Add this information to the vm's topology,
and correctly report a configured value in the MADT via the GICC
structure. Onboard a test to verify that Linux sees the correct
interrupts.

Hyper-V and WHP support a hardcoded value of 0x17, so for now hardcode
that value on those platforms. A follow up change will correctly report
this value via a `pmu` device tree node, but take this more minimal
change to backport to the release/2505 branch.

Although macOS also supports this interrupt with the same value of 0x17,
enabling that did not cause Linux to work as expected, so more
investigation there is needed.

This fixes xperf on Windows and perf on Linux which rely on this being
present.

microsoft#1775 

Backport of microsoft#1776
Probably got accidentally added during a merge conflict.
…onnecting (microsoft#1817)

This change fixes an issue where, if all channels are already reset when
a disconnect happens, the server would not invoke
`Notifier::modify_connection`. This means that the state such as the
interrupt page and monitor pages is not reset, and in the case of
OpenHCL the relay is not notified of the disconnect (which can leave
host state intact, including monitor pages if MNF is handled by the
host).

This caused an issue where Linux would occasionally crash during resume
from hibernate. When resuming, Linux makes two connections, first to
read the memory image, and then to resume normal operations, both using
MNF. When the first connection unloads, the overlay pages for the
monitor pages were not removed until the reconnect, leading to memory
corruption when Linux proceeds to use these pages as normal memory.

This change also adds some tests ensuring the notifier is invoked for an
unload with open channels, without open channels, and a forced
disconnect when a new InitiateContact message is received.

Cherry-picked from microsoft#1809

Co-authored-by: Copilot <[email protected]>
) (microsoft#1773)

On CVM platforms, this self test results in logs to verify that the
various bitmaps are preventing accesses as expected. Log that we're doing
this self test.

Cherry pick of microsoft#1772
…icrosoft#1810)

Backport of: microsoft#1755

This PR focuses on allowing EfiDiagnostics to force flush through
InspectMut.

We will print EfiDiagnostics to our tracing facilities **ONLY IF WE
WRITE** to the `process_diagnostics` field in `UefiDevice`.

To trigger this, use inspect like so:
```
openvmm> inspect -u 1 vm/uefi/process_diagnostics
```
…t#1835)

This change cherry-picks microsoft#1829,
and its dependent change microsoft#1815.

---------

Co-authored-by: Copilot <[email protected]>
…soft#1838)

This should help us catch bad memory setups earlier.

Note: we're still debugging what causes failures here, but the sooner we
can catch them the better.

Also includes some additional tracing.

Cherry-pick of microsoft#1828
…1839)

If two ranges in a guest's memory layout share a bitmap backing page,
then during bitmap initialization one of the range's bitmap state will
be incorrectly zeroed. This causes bitmap checks to unexpectedly fail.

Fix this by not re-zeroing bitmap pages during initialization.
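
A rough sketch of the failure mode and the fix, using hypothetical names (`PageBitmap`, `init_range`) and a `HashMap` in place of real backing pages. A backing page is zeroed only when it is first allocated, so initializing a second range that shares the page leaves the first range's bits intact.

```rust
use std::collections::HashMap;

const PAGE_SIZE: usize = 4096;
/// Guest pages tracked by one bitmap backing page (one bit per guest page).
const PFNS_PER_BITMAP_PAGE: u64 = (PAGE_SIZE as u64) * 8;

/// Hypothetical bitmap whose backing pages are allocated on demand.
pub struct PageBitmap {
    backing: HashMap<u64, Vec<u8>>,
}

impl PageBitmap {
    pub fn new() -> Self {
        Self {
            backing: HashMap::new(),
        }
    }

    /// Set the bits for guest pages in `start_pfn..end_pfn` to `value`.
    /// A backing page is zero-filled only if it does not exist yet; an
    /// existing page is never re-zeroed.
    pub fn init_range(&mut self, start_pfn: u64, end_pfn: u64, value: bool) {
        for pfn in start_pfn..end_pfn {
            let page = self
                .backing
                .entry(pfn / PFNS_PER_BITMAP_PAGE)
                .or_insert_with(|| vec![0u8; PAGE_SIZE]);
            let bit = (pfn % PFNS_PER_BITMAP_PAGE) as usize;
            let mask = 1u8 << (bit % 8);
            if value {
                page[bit / 8] |= mask;
            } else {
                page[bit / 8] &= !mask;
            }
        }
    }

    /// Read the bit for one guest page; untracked pages read as false.
    pub fn get(&self, pfn: u64) -> bool {
        let bit = (pfn % PFNS_PER_BITMAP_PAGE) as usize;
        self.backing
            .get(&(pfn / PFNS_PER_BITMAP_PAGE))
            .map_or(false, |page| (page[bit / 8] & (1u8 << (bit % 8))) != 0)
    }
}
```

With the buggy behavior (zeroing the backing page on every `init_range` call), the bits set for the first range would be lost once the second range was initialized.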

Cherry-pick of microsoft#1830

Co-authored-by: John Starks <[email protected]>
…1847)

There is a vulnerability in OpenHCL's VMGS key-rolling code that allows
the host to cause the VMGS to be encrypted with a host-controlled key.
unwrap_and_rotate_keys now returns a pair of egress keys: one that may
have been used to previously encrypt the VMGS and can only be used for
that purpose; and a second key, always derived anew, that can safely be
used to re-encrypt the VMGS.

CVE-2025-53781

Co-authored-by: Copilot <[email protected]>
Cherry-pick into release/2505 for
microsoft#1831

Co-authored-by: Jenna Goddard <[email protected]>
Cherry pick to release/2505

### underhill_core: factory for nvme devices + tests for nvme_manager
(microsoft#1787)

Add unit tests for the existing code in `nvme_manager.rs`. This requires
a minor refactoring: push the code to create NVMe drivers into a factory
that the tests can then mock.

These basic tests already highlight the performance problems seen in
production: the GetNamespace path is not concurrent.

This is one part of the broader work effort.

### underhill_core: nvme_manager: make it multithreaded microsoft#1763

This PR addresses serialization in the existing underhill_core:
nvme_manager. This serialization proves to be the bottleneck when
performing a runtime servicing operation with multiple NVMe devices.

This change leverages mesh to create a two-level hierarchy:
- The existing `NvmeManager` API surface is the top level. The idea is
that this keeps track of the NVMe devices that are in some state of
being created, and
- A new `NvmeDriverManager` that manages the lifecycle of a single NVMe
device.

Most NVMe devices have one namespace, but our cloud scale scenario
requires supporting multiple namespaces per NVMe device. It's okay to
serialize multiple calls to the same device, since the most expensive
portion is loading the driver.
This release is almost ready; it is time to disable debug support in the
builds that will end up in prod.
microsoft#1866)

Allow the host to specify OpenHCL features and encryption policy for the
VMGS.
A malicious admin can evict the AK from their VM's vTPM and replace it
with their own key. At boot, Azure will load that key from the VMGS and
then sign an AKCert with that key, allowing the admin to spoof KeyGuard
and CVM attestation.

CVM: This change mirrors changes in the legacy HCL: Regenerate the AK at
boot from the TPM seeds, instead of loading it from VMGS. This ensures
that the original AKCert is always present in the vTPM.

TVM: OpenHCL currently cannot regenerate the AK for a TVM, because the
original AK (provisioned by the vtpmservice) contains an auth policy;
OpenHCL does not implement that policy creation. As an alternative, when
OpenHCL boots, it will check the attributes on the AK that it loads from
VMGS. If the attributes are wrong (indicating a possibly malicious key),
it will not make any calls to renew the AKCert.

CVE-2025-49707

---------

Co-authored-by: Ben Hillis <[email protected]>
Co-authored-by: Copilot <[email protected]>
…icrosoft#1875)

This reverts commit b768183. This
apparently broke vmbus relay on tdx. We're not sure why yet, but revert
it for now to unblock RIs.
…crosoft#1878) (microsoft#1879)

This fixes VMBus Relay on TDX without the hw debug bit.

Clean cherry-pick of microsoft#1878
…) (microsoft#1894)

When we are sharing a page we remove all VTL 0 permissions to that page.
Later on, when we re-private the page, we were failing to reset these
permissions, which led to failures when the guest tried to use pages it
should have had access to. This code is a bit confusing due to
conflating private/shared with VTL 1 access permissions. Add a bunch of
comments, and fix the reset.

Cherry-pick of microsoft#1891
…ft#1900)

Continue processing events while waiting for guest response from
shutdown request.
Properly return errors from shutdown requests.

CP from microsoft#1895

Co-authored-by: Brian Perkins <[email protected]>
Linux netvsc sends an OID to stop receiving packets on vmbus channel
close. Example scenarios: hibernation and MTU change. Prior to opening a
new channel and processing the packets, netvsc checks that there are no
pending packets. If there are, netvsc logs an error and is unable to
recover. We observe the error: `hv_netvsc eth0: Ring buffer not empty
after closing rndis` in the guest syslog.

Modify netvsp to handle the OID and stop processing RX traffic. This
allows netvsc to successfully close and re-open the vmbus channel, even
under heavy incoming traffic.

---------

Co-authored-by: Sunil Muthuswamy <[email protected]>
@Copilot Copilot AI review requested due to automatic review settings August 27, 2025 22:37
@erfrimod erfrimod requested review from a team as code owners August 27, 2025 22:37
@erfrimod erfrimod closed this Aug 27, 2025
@github-actions github-actions bot added the Guide label Aug 27, 2025