[Feature Request] Asynchronous On-demand FD-passing for virtio-pmem via userfaultfd #5740

@joy-allen

Feature Request

Standard virtio-pmem backends in Firecracker are typically backed by local files. However, in high-density container environments (e.g., using Nydus or EROFS-based lazy loading), image data is often stored in remote registries and fetched on-demand.

Currently, achieving lazy loading in Firecracker requires either virtio-block (which lacks DAX and causes memory-heavy double-buffering) or a complex vhost-user-fs setup. This feature request proposes an asynchronous, on-demand FD-passing backend for virtio-pmem using userfaultfd. It would let MicroVMs start without waiting for the full image to be fetched, while sharing the Host's page cache without an extra copy, which fits Firecracker's performance and security philosophy.

Describe the desired solution

We propose a new UffdBackend for the virtio-pmem device that leverages userfaultfd (UFFD) and Unix Domain Socket (UDS) file descriptor passing.

Architecture

  1. UFFD Registration: The VMM creates an anonymous RO mapping for the virtio-pmem address space and registers it with userfaultfd.
  2. NBD-style Control Plane: The VMM communicates with an external image service (e.g., nydusd) via UDS using a lightweight, asynchronous protocol inspired by NBD.
    • PROBE: Queries the service for existing local chunks at boot to perform initial mappings.
    • FETCH: Triggered by a UFFD page fault; the VMM sends a request with (pos, len) and returns to the event loop.
  3. FD-Passing & Remapping: The service replies using SCM_RIGHTS to pass File Descriptors. The VMM receiver thread then performs an mmap(MAP_FIXED) to map the blob-backed FD into the Guest's address space.
  4. Zero-Copy Execution: Once mapped, the Guest accesses the data via DAX (e.g., EROFS + DAX). This bypasses the VMM entirely for subsequent reads, sharing the Host Page Cache directly.
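The FETCH step above implies translating a faulting address into a (pos, len) pair. The following is a minimal sketch of that translation, not the prototype's actual code. The fetch granularity `chunk_size`, the 4 KiB page size, and the names `FetchRange` and `fault_to_fetch` are assumptions made for illustration.

```rust
/// Page size assumed by this sketch (4 KiB).
const PAGE_SIZE: u64 = 4096;

/// A (pos, len) pair as carried by a FETCH request.
#[derive(Debug, PartialEq, Eq)]
pub struct FetchRange {
    pub pos: u64,
    pub len: u32,
}

/// Translate a faulting host virtual address into a chunk-aligned FETCH range.
///
/// `region_base` and `region_len` describe the UFFD-registered mapping.
/// `chunk_size` is a hypothetical fetch granularity and must be a non-zero
/// multiple of the page size. Returns `None` for faults outside the region.
pub fn fault_to_fetch(
    fault_addr: u64,
    region_base: u64,
    region_len: u64,
    chunk_size: u64,
) -> Option<FetchRange> {
    if chunk_size == 0 || chunk_size % PAGE_SIZE != 0 {
        return None;
    }
    // Offset of the fault inside the pmem address space.
    let offset = fault_addr.checked_sub(region_base)?;
    if offset >= region_len {
        return None;
    }
    // Round down to the start of the containing chunk.
    let pos = offset - offset % chunk_size;
    // Clamp the last chunk to the end of the device.
    let len = chunk_size.min(region_len - pos);
    if len > u32::MAX as u64 {
        return None; // does not fit the u32 `len` field of the request
    }
    Some(FetchRange { pos, len: len as u32 })
}
```

Rounding down to a chunk boundary means several faults inside the same chunk map to the same request, so the VMM could deduplicate in-flight FETCHes by `pos`.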

Protocol Definition

  • Request: magic (u32), type (u32), handle (u64), pos (u64), len (u32)
  • Reply: magic (u32), code (u32), handle (u64), dev_sz (u64), ranges_count (u32)
  • Control Message: [fd, off, len, dev_off] (passed via SCM_RIGHTS)
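To make the request layout concrete, here is a sketch of how the Request header could be serialized. The field order and widths come from the definition above. The little-endian byte order, the absence of padding (28 bytes total), the magic value, and the numeric codes for PROBE and FETCH are all assumptions, since the proposal does not specify them.

```rust
/// Hypothetical magic value; the proposal does not define one.
pub const REQUEST_MAGIC: u32 = 0x5046_4452;
/// magic (4) + type (4) + handle (8) + pos (8) + len (4), assuming no padding.
pub const REQUEST_LEN: usize = 28;

/// Request types named in the proposal; the numeric codes are assumptions.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RequestType {
    Probe = 0,
    Fetch = 1,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Request {
    pub kind: RequestType,
    pub handle: u64,
    pub pos: u64,
    pub len: u32,
}

fn u32_at(buf: &[u8], i: usize) -> u32 {
    let mut b = [0u8; 4];
    b.copy_from_slice(&buf[i..i + 4]);
    u32::from_le_bytes(b)
}

fn u64_at(buf: &[u8], i: usize) -> u64 {
    let mut b = [0u8; 8];
    b.copy_from_slice(&buf[i..i + 8]);
    u64::from_le_bytes(b)
}

impl Request {
    /// Serialize into a fixed-size little-endian buffer.
    pub fn encode(&self) -> [u8; REQUEST_LEN] {
        let mut buf = [0u8; REQUEST_LEN];
        buf[0..4].copy_from_slice(&REQUEST_MAGIC.to_le_bytes());
        buf[4..8].copy_from_slice(&(self.kind as u32).to_le_bytes());
        buf[8..16].copy_from_slice(&self.handle.to_le_bytes());
        buf[16..24].copy_from_slice(&self.pos.to_le_bytes());
        buf[24..28].copy_from_slice(&self.len.to_le_bytes());
        buf
    }

    /// Parse a buffer, rejecting wrong sizes, bad magic, and unknown types.
    pub fn decode(buf: &[u8]) -> Option<Request> {
        if buf.len() != REQUEST_LEN || u32_at(buf, 0) != REQUEST_MAGIC {
            return None;
        }
        let kind = match u32_at(buf, 4) {
            0 => RequestType::Probe,
            1 => RequestType::Fetch,
            _ => return None,
        };
        Some(Request {
            kind,
            handle: u64_at(buf, 8),
            pos: u64_at(buf, 16),
            len: u32_at(buf, 24),
        })
    }
}
```

The Reply header has the same size under these assumptions and could be handled the same way. The `handle` field lets the receiver match an asynchronous reply to its request, as in NBD.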

Describe possible alternatives

1. Virtio-fs with DAX

While virtio-fs is a common solution, it is significantly more complex to implement and audit for Firecracker. It requires a stateful FUSE control plane and a dynamic DAX window management protocol (SetupMapping/RemoveMapping). Our virtio-pmem approach treats the image as a static, flat address space, keeping the VMM implementation stateless and minimizing the attack surface.

2. Standard virtio-block

virtio-block lacks DAX support. Every block access requires a VM-exit and results in "double-buffering" (data exists in both the Host and Guest page caches), which significantly increases memory pressure in high-density deployments.

3. UFFDIO_COPY

The VMM could fetch data into a userspace buffer and use UFFDIO_COPY. However, this introduces an unnecessary memcpy and higher CPU overhead compared to the proposed mmap + FD-passing approach, which natively shares the page cache.

How do you work around not having this feature?

Currently, we have to pre-fetch the entire container image before VM startup to use virtio-pmem, which leads to high "Time To First Byte" (TTFB) and wastes local storage/memory if only a fraction of the image is actually accessed by the Guest.

Additional context

1. Prototype Status

We have already implemented a functional prototype of this mechanism. The VMM is capable of:

  • Handling userfaultfd events in a dedicated thread.
  • Communicating with a modified nydusd via the NBD-style UDS protocol.
  • Performing mmap(MAP_FIXED) with FD passing and successfully waking up the Guest vCPU.
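Because mmap(MAP_FIXED) silently replaces whatever is already mapped at the target address, the VMM would presumably want to validate each [off, len, dev_off] range it receives before remapping. The sketch below shows one possible set of checks. The `RangeDesc` layout, the error type, and the assumption that off, len, and dev_off travel in the regular message payload (SCM_RIGHTS ancillary data carries only the file descriptors themselves) are illustrative, not taken from the prototype.

```rust
/// Alignment required for both the file offset and the target address (4 KiB assumed).
const MAP_ALIGN: u64 = 4096;

/// One range from a reply: map `len` bytes of the passed FD, starting at file
/// offset `off`, at device offset `dev_off`. Hypothetical layout.
#[derive(Debug, Clone, Copy)]
pub struct RangeDesc {
    pub off: u64,
    pub len: u64,
    pub dev_off: u64,
}

#[derive(Debug, PartialEq, Eq)]
pub enum RangeError {
    Empty,
    Misaligned,
    OutOfBounds,
}

/// Check that a range can be remapped without touching memory outside the
/// pmem region of size `dev_sz`.
pub fn validate_range(r: &RangeDesc, dev_sz: u64) -> Result<(), RangeError> {
    if r.len == 0 {
        return Err(RangeError::Empty);
    }
    // mmap requires a page-aligned file offset, and MAP_FIXED requires a
    // page-aligned target address.
    if r.off % MAP_ALIGN != 0 || r.dev_off % MAP_ALIGN != 0 {
        return Err(RangeError::Misaligned);
    }
    // Reject ranges that overflow or extend past the end of the device.
    let end = r.dev_off.checked_add(r.len).ok_or(RangeError::OutOfBounds)?;
    if end > dev_sz {
        return Err(RangeError::OutOfBounds);
    }
    // The file-side range must not overflow either.
    r.off.checked_add(r.len).ok_or(RangeError::OutOfBounds)?;
    Ok(())
}
```

Only after a range passes these checks would the receiver thread compute the target address as the region base plus `dev_off` and issue the mapping.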

2. Failure Handling & Resilience

A critical concern for this feature is the stability of the UDS connection. Our implementation includes:

  • Auto-reconnect: The VMM can transparently re-establish the connection to the image service (e.g., if nydusd restarts).
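As an illustration of the auto-reconnect behaviour, the following sketch retries a Unix Domain Socket connection with capped exponential backoff. It is not the prototype's code: the function name, the retry policy, and its parameters are assumptions.

```rust
use std::io;
use std::os::unix::net::UnixStream;
use std::path::Path;
use std::thread::sleep;
use std::time::Duration;

/// Try to connect to the image service socket at `path`, doubling the delay
/// between attempts up to `max_delay`. Returns the last connection error if
/// all `max_attempts` attempts fail.
pub fn connect_with_backoff(
    path: &Path,
    max_attempts: u32,
    initial_delay: Duration,
    max_delay: Duration,
) -> io::Result<UnixStream> {
    let mut delay = initial_delay;
    let mut last_err = io::Error::new(io::ErrorKind::InvalidInput, "max_attempts must be > 0");
    for attempt in 1..=max_attempts {
        match UnixStream::connect(path) {
            Ok(stream) => return Ok(stream),
            Err(e) => last_err = e,
        }
        // Do not sleep after the final failed attempt.
        if attempt < max_attempts {
            sleep(delay);
            delay = delay.saturating_mul(2).min(max_delay);
        }
    }
    Err(last_err)
}
```

Reconnecting restores only the transport. Any FETCH requests that were in flight when the connection dropped would still need to be re-sent, otherwise the vCPUs blocked on those faults would never be woken.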

3. Use Case: Large-scale Serverless & AI Inference

In serverless environments, MicroVMs are short-lived but require instant access to large base images (often several GBs for AI models or thick containers). This feature enables:

  • Shared Host Page Cache: Multiple MicroVMs using the same base image will map the same physical pages on the Host, significantly reducing the memory footprint (RSS) compared to virtio-block.
  • Reduced Cold Start Latency: By only fetching the metadata and the entry-point code chunks, the effective "Ready" time is decoupled from the total image size.

4. Comparison with Standard UFFD usage

Unlike the traditional UFFDIO_COPY approach used in post-copy migration, this "FD-passing" approach is well suited to read-only DAX devices. It treats the VMM as a coordinator rather than a data buffer, which is consistent with Firecracker's goal of being a "thin" VMM.

Checks

  • Have you searched the Firecracker Issues database for similar requests?
  • Have you read all the existing relevant Firecracker documentation?
  • Have you read and understood Firecracker's core tenets?

Metadata

Assignees

No one assigned

    Labels

    Status: Awaiting author (indicates that an issue or pull request requires author action)
