[Feature Request] Asynchronous On-demand FD-passing for virtio-pmem via userfaultfd

# Feature Request

`virtio-pmem` devices in Firecracker are typically backed by local files. However, in high-density container environments (e.g., using Nydus or EROFS-based lazy loading), image data is often stored in remote registries and fetched on demand.

Currently, achieving lazy loading in Firecracker requires either `virtio-block` (which lacks DAX and double-buffers data in the Host and Guest page caches) or a complex `vhost-user-fs` setup. This feature request proposes an **asynchronous, on-demand FD-passing backend for virtio-pmem** using `userfaultfd`. It lets MicroVMs start before the full image is available locally while sharing the Host's page cache without an extra copy, which aligns with Firecracker's performance and security philosophy.

## Describe the desired solution

We propose a new `UffdBackend` for the `virtio-pmem` device that leverages `userfaultfd` (UFFD) and Unix Domain Socket (UDS) file descriptor passing.

### Architecture
1. **UFFD Registration**: The VMM creates an anonymous read-only mapping for the `virtio-pmem` address space and registers it with `userfaultfd`.
2. **NBD-style Control Plane**: The VMM communicates with an external image service (e.g., `nydusd`) via UDS using a lightweight, asynchronous protocol inspired by NBD.
   - **`PROBE`**: Queries the service for existing local chunks at boot to perform initial mappings.
   - **`FETCH`**: Triggered by a UFFD page fault; the VMM sends a request with `(pos, len)` and returns to the event loop.
3. **FD-Passing & Remapping**: The service replies using `SCM_RIGHTS` to pass File Descriptors. The VMM receiver thread then performs an `mmap(MAP_FIXED)` to map the blob-backed FD into the Guest's address space.
4. **Zero-Copy Execution**: Once mapped, the Guest accesses the data via DAX (e.g., EROFS + DAX). This bypasses the VMM entirely for subsequent reads, sharing the **Host Page Cache** directly.
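
Step 3 can be illustrated with a minimal sketch of the FD hand-off. It is not the prototype's code: the 24-byte descriptor layout (`off`, `len`, `dev_off` as little-endian `u64`) and the helper names are assumptions made for illustration. Python's `mmap` module cannot request `MAP_FIXED`, so the sketch maps the received FD at an arbitrary address, where the VMM would instead map it at the Guest address derived from `dev_off`.

```python
import mmap
import os
import socket
import struct
import tempfile

# Assumed wire layout for one range descriptor: off, len, dev_off (u64 each).
RANGE = struct.Struct("<QQQ")

def send_range(sock, fd, off, length, dev_off):
    """Service side: send one range descriptor with the blob FD attached via SCM_RIGHTS."""
    socket.send_fds(sock, [RANGE.pack(off, length, dev_off)], [fd])

def recv_range(sock):
    """VMM side: receive one range descriptor and the FD that came with it."""
    data, fds, _flags, _addr = socket.recv_fds(sock, RANGE.size, 1)
    off, length, dev_off = RANGE.unpack(data)
    return fds[0], off, length, dev_off

def demo():
    page = mmap.ALLOCATIONGRANULARITY
    with tempfile.TemporaryFile() as blob:
        # A two-page "blob": the second page stands in for a fetched chunk.
        blob.write(b"A" * page + b"B" * page)
        blob.flush()
        vmm, service = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
        with vmm, service:
            send_range(service, blob.fileno(), page, page, 0)
            fd, off, length, dev_off = recv_range(vmm)
    try:
        # The received FD is an independent descriptor, so it stays valid
        # after the sender's file object is closed. A real VMM would use
        # mmap(MAP_FIXED) at guest_base + dev_off instead of this mapping.
        with mmap.mmap(fd, length, access=mmap.ACCESS_READ, offset=off) as view:
            return dev_off, bytes(view[:4])
    finally:
        os.close(fd)
```

Because the mapping is file-backed and read-only, the pages it exposes are the Host page cache pages of the blob, which is the property the design relies on.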

### Protocol Definition
* **Request**: `magic (u32), type (u32), handle (u64), pos (u64), len (u32)`
* **Reply**: `magic (u32), code (u32), handle (u64), dev_sz (u64), ranges_count (u32)` 
* **Control Message**: `[fd, off, len, dev_off]` (passed via `SCM_RIGHTS`)
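
As a rough sketch of how these headers could be encoded, the snippet below packs and parses them with fixed-width fields. The proposal does not state byte order, padding, the magic value, or the numeric type codes, so little-endian packing without padding, `MAGIC`, `TYPE_PROBE`, and `TYPE_FETCH` are all placeholders chosen for illustration.

```python
import struct

# Assumptions (not specified in the proposal): little-endian, no padding,
# and the placeholder constants below.
MAGIC = 0x55464450      # hypothetical magic value
TYPE_PROBE = 0          # hypothetical type codes
TYPE_FETCH = 1

REQUEST = struct.Struct("<IIQQI")  # magic, type, handle, pos, len
REPLY = struct.Struct("<IIQQI")    # magic, code, handle, dev_sz, ranges_count

def pack_request(req_type, handle, pos, length):
    return REQUEST.pack(MAGIC, req_type, handle, pos, length)

def unpack_request(buf):
    magic, req_type, handle, pos, length = REQUEST.unpack(buf)
    if magic != MAGIC:
        raise ValueError("bad request magic")
    return req_type, handle, pos, length

def pack_reply(code, handle, dev_sz, ranges_count):
    return REPLY.pack(MAGIC, code, handle, dev_sz, ranges_count)

def unpack_reply(buf):
    magic, code, handle, dev_sz, ranges_count = REPLY.unpack(buf)
    if magic != MAGIC:
        raise ValueError("bad reply magic")
    return code, handle, dev_sz, ranges_count
```

With this layout both headers are 28 bytes, and `ranges_count` tells the receiver how many range descriptors (each with an accompanying FD) to expect after the reply header.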

## Describe possible alternatives

### 1. Virtio-fs with DAX
While `virtio-fs` is a common solution, it is significantly more complex to implement and audit for Firecracker. It requires a stateful FUSE control plane and a dynamic DAX window management protocol (SetupMapping/RemoveMapping). Our `virtio-pmem` approach treats the image as a **static, flat address space**, keeping the VMM implementation stateless and minimizing the attack surface.

### 2. Standard virtio-block
`virtio-block` lacks DAX support. Every block access requires a VM-exit and results in "double-buffering" (data exists in both the Host and Guest page caches), which significantly increases memory pressure in high-density deployments.

### 3. UFFDIO_COPY
The VMM could fetch data into a userspace buffer and use `UFFDIO_COPY`. However, this introduces an unnecessary `memcpy` and higher CPU overhead compared to the proposed `mmap` + FD-passing approach, which natively shares the page cache.

## How do you work around not having this feature?
Currently, we have to pre-fetch the entire container image before VM startup to use `virtio-pmem`, which leads to high "Time To First Byte" (TTFB) and wastes local storage/memory if only a fraction of the image is actually accessed by the Guest.

## Additional context

### 1. Prototype Status
We have already implemented a functional prototype of this mechanism. The VMM is capable of:
- Handling `userfaultfd` events in a dedicated thread.
- Communicating with a modified `nydusd` via the NBD-style UDS protocol.
- Performing `mmap(MAP_FIXED)` with FD passing and successfully waking up the Guest vCPU.
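
The bookkeeping behind the fault-handling thread might look like the sketch below. It is an assumption about how in-flight `FETCH` requests could be tracked, not a description of the prototype: the `InFlight` class, the chunk granularity, and the use of `handle` to match replies to requests are all hypothetical.

```python
CHUNK = 1 << 20  # hypothetical fetch granularity (1 MiB)

class InFlight:
    """Tracks outstanding FETCH requests so repeated faults are not re-sent."""

    def __init__(self, chunk=CHUNK):
        self.chunk = chunk
        self.next_handle = 1
        self.pending = {}  # handle -> (pos, len)
        self.by_pos = {}   # pos -> handle

    def on_fault(self, dev_off):
        """Return (handle, pos, len) for a new FETCH, or None if that
        chunk is already being fetched."""
        pos = dev_off - dev_off % self.chunk  # align down to the chunk start
        if pos in self.by_pos:
            return None
        handle = self.next_handle
        self.next_handle += 1
        self.pending[handle] = (pos, self.chunk)
        self.by_pos[pos] = handle
        return handle, pos, self.chunk

    def on_reply(self, handle):
        """Retire a request once its reply arrives; returns the (pos, len)
        range that can now be mapped and woken."""
        pos, length = self.pending.pop(handle)
        del self.by_pos[pos]
        return pos, length
```

Deduplicating by chunk keeps several vCPUs faulting on the same region from issuing redundant requests while the first one is still outstanding.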

### 2. Failure Handling & Resilience
A critical concern for this feature is the stability of the UDS connection. Our implementation includes:
- **Auto-reconnect**: The VMM can transparently re-establish the connection to the image service (e.g., if `nydusd` restarts).
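
A reconnect loop of this kind could be sketched as follows. The retry count, the doubling backoff, and the function name `connect_with_retry` are illustrative assumptions; the issue does not describe how the prototype schedules its reconnect attempts or what it does with requests that were in flight when the connection dropped.

```python
import os
import socket
import tempfile
import time

def connect_with_retry(path, attempts=5, delay=0.05):
    """Try to connect to a Unix socket, doubling the delay between attempts."""
    last_err = None
    for attempt in range(attempts):
        sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        try:
            sock.connect(path)
            return sock
        except OSError as err:
            sock.close()
            last_err = err
            if attempt + 1 < attempts:
                time.sleep(delay)
                delay *= 2
    raise ConnectionError(f"could not reach image service at {path}") from last_err

def demo():
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "image.sock")
        # No listener yet: every attempt fails and the helper gives up.
        try:
            connect_with_retry(path, attempts=2, delay=0.001)
            missing_failed = False
        except ConnectionError:
            missing_failed = True
        # Once the service is listening again, the same call succeeds.
        with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as listener:
            listener.bind(path)
            listener.listen(1)
            with connect_with_retry(path) as conn:
                connected = conn.fileno() >= 0
    return missing_failed, connected
```

After a successful reconnect, a VMM would presumably also need to re-send any `FETCH` requests that never received a reply, since Guest vCPUs remain blocked on those faults.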

### 3. Use Case: Large-scale Serverless & AI Inference
In serverless environments, MicroVMs are short-lived but require instant access to large base images (often several GBs for AI models or thick containers). This feature enables:
- **Shared Host Page Cache**: Multiple MicroVMs using the same base image will map the same physical pages on the Host, significantly reducing the memory footprint (RSS) compared to `virtio-block`.
- **Reduced Cold Start Latency**: By only fetching the metadata and the entry-point code chunks, the effective "Ready" time is decoupled from the total image size.

### 4. Comparison with Standard UFFD usage
Unlike the traditional `UFFDIO_COPY` approach used in post-copy migration, this "FD-passing" approach is well suited to read-only DAX devices. It treats the VMM as a coordinator rather than a data buffer, which is consistent with Firecracker's goal of being a "thin" VMM.

## Checks

- [ ] Have you searched the Firecracker Issues database for similar requests?
- [ ] Have you read all the existing relevant Firecracker documentation?
- [ ] Have you read and understood Firecracker's core tenets?
