Feature Request
Standard virtio-pmem backends in Firecracker are typically backed by local files. However, in high-density container environments (e.g., using Nydus or EROFS-based lazy loading), image data is often stored in remote registries and fetched on demand.
Currently, lazy loading in Firecracker requires either virtio-block (which lacks DAX and double-buffers data in the Host and Guest page caches) or a complex vhost-user-fs setup. This feature request proposes an asynchronous, on-demand FD-passing backend for virtio-pmem built on userfaultfd. It lets MicroVMs start before the full image is available locally while sharing the Host's page cache without copying, which fits Firecracker's performance and security goals.
Describe the desired solution
We propose a new UffdBackend for the virtio-pmem device that leverages userfaultfd (UFFD) and Unix Domain Socket (UDS) file descriptor passing.
Architecture
- UFFD Registration: The VMM creates an anonymous RO mapping for the virtio-pmem address space and registers it with userfaultfd.
- NBD-style Control Plane: The VMM communicates with an external image service (e.g., nydusd) via UDS using a lightweight, asynchronous protocol inspired by NBD.
  - PROBE: Queries the service for existing local chunks at boot to perform initial mappings.
  - FETCH: Triggered by a UFFD page fault; the VMM sends a request with (pos, len) and returns to the event loop.
- FD-Passing & Remapping: The service replies using SCM_RIGHTS to pass File Descriptors. The VMM receiver thread then performs an mmap(MAP_FIXED) to map the blob-backed FD into the Guest's address space.
- Zero-Copy Execution: Once mapped, the Guest accesses the data via DAX (e.g., EROFS + DAX). This bypasses the VMM entirely for subsequent reads, sharing the Host Page Cache directly.
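The FD-passing and remapping step can be illustrated with a minimal sketch. It is written in Python for brevity and is not the prototype's code: the 24-byte range descriptor layout and the helper names (`send_range`, `recv_range`, `demo`) are assumptions made for this example, and Python's `mmap` module cannot request `MAP_FIXED`, so the kernel chooses the mapping address here instead of the VMM placing it at the Guest address.

```python
import mmap
import os
import socket
import struct
import tempfile

# Hypothetical range descriptor sent next to the FD: off, len, dev_off (u64 each).
RANGE = struct.Struct("!QQQ")


def send_range(sock, fd, off, length, dev_off):
    """Service side: pass one blob FD (via SCM_RIGHTS) plus the range it backs."""
    socket.send_fds(sock, [RANGE.pack(off, length, dev_off)], [fd])


def recv_range(sock):
    """VMM side: receive the FD and its range descriptor."""
    data, fds, _flags, _addr = socket.recv_fds(sock, RANGE.size, 1)
    off, length, dev_off = RANGE.unpack(data)
    return fds[0], off, length, dev_off


def demo():
    page = mmap.PAGESIZE
    vmm, service = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    with tempfile.TemporaryFile() as blob:
        # Two pages of blob data; the second page backs device offset 0.
        blob.write(b"A" * page + b"B" * page)
        blob.flush()
        send_range(service, blob.fileno(), page, page, 0)

    fd, off, length, dev_off = recv_range(vmm)
    # A real VMM would mmap with MAP_FIXED at guest_base + dev_off.
    with mmap.mmap(fd, length, flags=mmap.MAP_SHARED,
                   prot=mmap.PROT_READ, offset=off) as view:
        first = view[:4]
    os.close(fd)
    vmm.close()
    service.close()
    return first, dev_off
```

The received FD stays valid after the sender closes its own copy, which is what lets the VMM map the blob without ever reading its contents.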
Protocol Definition
- Request: magic (u32), type (u32), handle (u64), pos (u64), len (u32)
- Reply: magic (u32), code (u32), handle (u64), dev_sz (u64), ranges_count (u32)
- Control Message: [fd, off, len, dev_off] (passed via SCM_RIGHTS)
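As a rough illustration of the header layout, the sketch below packs a request and parses a reply. Several details are assumptions, since the proposal does not state them: network byte order with no padding (as in NBD), the magic values, and the numeric codes for PROBE and FETCH.

```python
import struct

# Assumed layout: network byte order, no padding, fields in the listed order.
REQUEST = struct.Struct("!IIQQI")  # magic, type, handle, pos, len
REPLY = struct.Struct("!IIQQI")    # magic, code, handle, dev_sz, ranges_count

# Hypothetical constants; the proposal does not specify them.
REQ_MAGIC = 0x504D454D
REP_MAGIC = 0x504D4552
TYPE_PROBE = 0
TYPE_FETCH = 1


def encode_request(req_type, handle, pos, length):
    """Serialize one request header (28 bytes under the assumed layout)."""
    return REQUEST.pack(REQ_MAGIC, req_type, handle, pos, length)


def decode_reply(buf):
    """Parse one reply header and reject an unexpected magic."""
    magic, code, handle, dev_sz, ranges_count = REPLY.unpack(buf)
    if magic != REP_MAGIC:
        raise ValueError(f"bad reply magic: {magic:#x}")
    return {
        "code": code,
        "handle": handle,
        "dev_sz": dev_sz,
        "ranges_count": ranges_count,
    }
```

Because len is a u32, a single request can cover at most 4 GiB minus one byte; the handle lets the VMM match asynchronous replies to outstanding requests.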
Describe possible alternatives
1. Virtio-fs with DAX
While virtio-fs is a common solution, it is significantly more complex to implement and audit for Firecracker. It requires a stateful FUSE control plane and a dynamic DAX window management protocol (SetupMapping/RemoveMapping). Our virtio-pmem approach treats the image as a static, flat address space, keeping the VMM implementation stateless and minimizing the attack surface.
2. Standard virtio-block
virtio-block lacks DAX support. Block I/O that misses the Guest page cache goes through the virtqueue and the VMM, with the VM-exits that entails, and the data ends up in both the Host and Guest page caches ("double-buffering"), which significantly increases memory pressure in high-density deployments.
3. UFFDIO_COPY
The VMM could fetch data into a userspace buffer and use UFFDIO_COPY. However, this introduces an unnecessary memcpy and higher CPU overhead compared to the proposed mmap + FD-passing approach, which natively shares the page cache.
How do you work around not having this feature?
Currently, we have to pre-fetch the entire container image before VM startup to use virtio-pmem, which leads to high "Time To First Byte" (TTFB) and wastes local storage/memory if only a fraction of the image is actually accessed by the Guest.
Additional context
1. Prototype Status
We have already implemented a functional prototype of this mechanism. The VMM is capable of:
- Handling userfaultfd events in a dedicated thread.
- Communicating with a modified nydusd via the NBD-style UDS protocol.
- Performing mmap(MAP_FIXED) with FD passing and successfully waking up the Guest vCPU.
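One piece of bookkeeping the fault-handling thread needs is turning a faulting offset into a FETCH request without sending duplicates for a chunk that is already in flight. The sketch below shows one way to do that; the `FaultTracker` class, the 1 MiB `CHUNK_SIZE`, and the chunk-aligned fetch policy are assumptions for illustration, not a description of the prototype.

```python
CHUNK_SIZE = 1 << 20  # hypothetical fetch granularity; a multiple of the page size


class FaultTracker:
    """Tracks which chunks of the pmem region are requested or mapped."""

    def __init__(self, dev_size):
        self.dev_size = dev_size
        self.next_handle = 1
        self.in_flight = {}  # chunk start -> handle
        self.handles = {}    # handle -> chunk start
        self.mapped = set()  # chunk starts already backed by an FD mapping

    def on_fault(self, fault_off):
        """Return (handle, pos, len) to send, or None if nothing needs sending."""
        if not 0 <= fault_off < self.dev_size:
            raise ValueError("fault outside the pmem region")
        pos = fault_off - fault_off % CHUNK_SIZE
        if pos in self.mapped or pos in self.in_flight:
            return None
        length = min(CHUNK_SIZE, self.dev_size - pos)
        handle = self.next_handle
        self.next_handle += 1
        self.in_flight[pos] = handle
        self.handles[handle] = pos
        return handle, pos, length

    def on_reply(self, handle):
        """Mark the chunk for this handle as mapped and return its start offset."""
        pos = self.handles.pop(handle)
        del self.in_flight[pos]
        self.mapped.add(pos)
        return pos
```

A second vCPU faulting on the same chunk gets `None` from `on_fault` and simply waits until the mapping installed for the first request resolves its fault.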
2. Failure Handling & Resilience
A critical concern for this feature is the stability of the UDS connection. Our implementation includes:
- Auto-reconnect: The VMM can transparently re-establish the connection to the image service (e.g., if nydusd restarts).
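A reconnect loop of this kind might look like the following sketch. The function name, the retry count, and the exponential backoff parameters are assumptions chosen for the example; the proposal does not describe the actual retry policy.

```python
import socket
import time


def connect_with_retry(path, attempts=5, base_delay=0.05, max_delay=1.0):
    """Connect to a Unix socket, retrying with exponential backoff."""
    delay = base_delay
    for attempt in range(1, attempts + 1):
        sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        try:
            sock.connect(path)
            return sock
        except (FileNotFoundError, ConnectionRefusedError):
            # The socket path is missing or nobody is listening yet.
            sock.close()
            if attempt == attempts:
                raise
            time.sleep(delay)
            delay = min(delay * 2, max_delay)
```

After a successful reconnect, any requests that were outstanding on the old connection would presumably need to be re-sent, since their replies are lost with it.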
3. Use Case: Large-scale Serverless & AI Inference
In serverless environments, MicroVMs are short-lived but require instant access to large base images (often several GBs for AI models or thick containers). This feature enables:
- Shared Host Page Cache: Multiple MicroVMs using the same base image will map the same physical pages on the Host, significantly reducing the memory footprint (RSS) compared to virtio-block.
- Reduced Cold Start Latency: By only fetching the metadata and the entry-point code chunks, the effective "Ready" time is decoupled from the total image size.
4. Comparison with Standard UFFD usage
Unlike the traditional UFFDIO_COPY approach used in post-copy migration, this "FD-passing" approach is uniquely suited for read-only DAX devices. It treats the VMM as a coordinator rather than a data buffer, which is consistent with Firecracker's goal of being a "thin" VMM.