Skip to content

Add support for IOMMUFD IOCTLs#833

Open
benlwalker wants to merge 2 commits intonutanix:masterfrom
benlwalker:iommufd
Open

Add support for IOMMUFD IOCTLs#833
benlwalker wants to merge 2 commits intonutanix:masterfrom
benlwalker:iommufd

Conversation

@benlwalker
Copy link
Copy Markdown
Contributor

Linux added a new uapi called IOMMUFD for managing the IOMMU. In recent Linux releases, many of the vfio ioctls are now just legacy wrappers around this new interface. However, the new interface has a couple of important features that I'd like to use with vfio-user. Namely:

  1. Support for allowing the server to allocate the IOVA. This is important because the server can return the true IOVA to the client directly, which can be used as an actual DMA address by hardware. If the client puts this address into a command, they don't have to rely on the server process to intercept the command and translate it.
  2. Support for registering memory described by a dma-buf (e.g. memory on another PCI device)

I've tried to make this backward compatible by adding feature negotiation. Only if both client and server report support will it be used.

Add lspci.expected.out.5 to accommodate output format differences in
newer versions of the lspci utility (e.g., on Fedora 42). The test
already supported multiple expected outputs for different lspci
versions; this adds another variant.

Signed-off-by: Ben Walker <ben@nvidia.com>
} __attribute__((packed));

/* VFIO_USER_IOMMUFD_BIND */
struct vfio_user_iommufd_bind {
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think in the real iommufd flow you'd be selecting the device to bind to here. But in vfio-user the device is implied by the domain socket we connected to. So all this is really doing is giving us this devid value to use later (and we don't really need it later, because there's only one device). I debated whether to keep this to make the flow stay the same as when using the real iommufd, or if I should just drop it and simplify.

uint32_t __reserved;
uint64_t user_va; /* Userspace virtual address (for tracking) */
uint64_t length;
uint64_t iova; /* Output: server-allocated IOVA */
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the real iommufd there's an extra flag defined that tells the system whether the client specifies the iova here or whether the server fills it in. I've only implemented the mode where the server fills it in. There's a few considerations here:

  • If the client chooses the iova, then it's almost certainly not going to be the real DMA address that needs to be sent to the hardware. The system will assume that the server process intercepts and translates all commands.
  • If the server chooses the iova then we can return the real DMA address and the server doesn't need to intercept anything (so in theory, clients could get more direct access to the NVMe queues some day, if we gain additional security features in other areas). But the client needs to translate from its own vaddr to the server returned iova on every I/O.

The biggest challenge comes in knowing when the server should do a translation. By only allowing server-choose mode with iommufd, then I know if we're in iommufd mode the server doesn't have to translate. If I allow both modes like in the real iommufd ioctl, I don't have a great way to figure out if the server should translate an address or not.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think protocol wise we should have the option, but explicitly only support server-allocated for now - and we should have those specific features in negotiation too (we'd not say we support client allocation for these)

Add support for iommufd-based memory management to the vfio-user
protocol as an alternative to the legacy VFIO DMA_MAP/UNMAP model.

Key changes:
- Add capability negotiation for iommufd support in VFIO_USER_VERSION
- Implement new protocol commands for iommufd operations:
  * VFIO_USER_IOMMUFD_ALLOC_IOAS - Allocate I/O Address Space
  * VFIO_USER_IOMMUFD_DESTROY_IOAS - Destroy I/O Address Space
  * VFIO_USER_IOMMUFD_BIND - Bind device to iommufd context
  * VFIO_USER_IOMMUFD_ATTACH_PT - Attach device to IOAS
  * VFIO_USER_IOMMUFD_DETACH_PT - Detach device from IOAS
  * VFIO_USER_IOMMUFD_IOAS_MAP - Map memory region (server-allocated IOVA)
  * VFIO_USER_IOMMUFD_IOAS_UNMAP - Unmap memory region
- Add server-side iommufd controller implementation (lib/iommufd.c)
- Update client sample to support iommufd capability negotiation
- Fix build dependencies in samples and tests

The iommufd model uses server-controlled IOVA allocation, where the
server performs mmap() and returns the mapped virtual address as the
IOVA. This simplifies address management and avoids the need for
separate IOVA allocators.

When both client and server advertise iommufd support during version
negotiation, iommufd mode is enabled for the connection. Otherwise,
the connection falls back to legacy VFIO DMA_MAP/UNMAP.

Signed-off-by: Ben Walker <ben@nvidia.com>
Copy link
Copy Markdown
Collaborator

@tmakatos tmakatos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the patch, just to be clear there's no interaction with /dev/iommufd but this add iommufd-like functionality, right? I need to familiarise myself first with iommufd before I can meaningfully review this.

@benlwalker
Copy link
Copy Markdown
Contributor Author

Thanks for the patch, just to be clear there's no interaction with /dev/iommufd but this add iommufd-like functionality, right? I need to familiarise myself first with iommufd before I can meaningfully review this.

Right - just like this library doesn't interact with vfio but uses the same ioctls. Portions of the functionality covered previously by vfio in Linux have moved to this new iommufd API (and those parts of vfio are now just a thin wrapper around iommufd) and I wanted the new fancy features in iommufd to be available in vfio-user.

@tmakatos
Copy link
Copy Markdown
Collaborator

tmakatos commented Feb 6, 2026

Thanks for the patch, just to be clear there's no interaction with /dev/iommufd but this add iommufd-like functionality, right? I need to familiarise myself first with iommufd before I can meaningfully review this.

Right - just like this library doesn't interact with vfio but uses the same ioctls. Portions of the functionality covered previously by vfio in Linux have moved to this new iommufd API (and those parts of vfio are now just a thin wrapper around iommufd) and I wanted the new fancy features in iommufd to be available in vfio-user.

Makes sense, I think using IOMMUFD in the protocol specification is consistent with what we've been doing so far. Not sure about whether it should be called iommufd internally though, I'll defer this to @jlevon (who has more bikesheddy inclinations).
I suppose you're going to post the protocol changes to qemu-devel?
Also, do you have any relevant QEMU patches to share?

Copy link
Copy Markdown
Collaborator

@tmakatos tmakatos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@benlwalker this is a very superficial, partial review, I haven't had a chance to read about iommufd just yet.

/*
* vfio-user protocol commands
*
* Commands 1-18 are part of the base protocol. Commands 19+ implement
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Migration read/write isn't supported in practise if the device isn't migrateable, so not sure we need to state it here.

*/
typedef struct iommufd_ioas {
uint32_t ioas_id;
TAILQ_HEAD(, iommufd_mapping) mappings;
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So an IOAS has multiple IOVA mappings? Just checking whether a list is suitable, how many such mappings do we generally expect?

typedef struct iommufd_controller {
struct vfu_ctx *vfu_ctx;
bool enabled; /* iommufd mode negotiated */
bool device_bound; /* Device bound to iommufd */
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since there's no /dev/iommufd, what does device_bound mean in practise?

} iommufd_controller_t;

/*
* Create a new iommufd controller.
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose returns NULL on error and sets errno?
(Here and wherever a pointer is returned?)

/*
* Destroy an IOAS and all its mappings.
*/
int
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

-1 on error and sets errno? (elsewher as well?)

return ERROR_INT(ERANGE) - 1; /* Need at least 1 SG entry */
}

/* Use INT_MAX as special region index to mark iommufd mapping */
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't it cleaner to use a dedicated field for that?

for (i = 0; i < cnt; i++, sg++) {
if (sg->region == INT_MAX) {
/* This is an iommufd mapping - IOVA equals the server's virtual address */
if (vfu_ctx->iommufd == NULL || !iommufd_is_enabled(vfu_ctx->iommufd)) {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That should be an assert right?

iov[i].iov_len = sg->length;
} else {
/* Legacy DMA mapping - use existing dma_sgl_get logic */
if (sg->region >= vfu_ctx->dma->nregions) {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

return ERROR_INT(EINVAL);
}

dma_memory_region_t *region = &vfu_ctx->dma->regions[sg->region];
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to check why dma_sgl_get went away.

return ret;
}

/* Return a dummy device ID (not used in vfio-user) */
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAIK vfio-user doesn't use device IDs at all, check.

}

iommufd->device_bound = true;
iommufd->bound_ioas_id = 0;
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we use a define like INVALID_IOAS_ID instead of the zero here?

/* Use INT_MAX as special region index to mark iommufd mapping */
/* In iommufd mode, IOVA equals the server's virtual address, so we can use it directly */
sgl[0].dma_addr = dma_addr;
sgl[0].region = INT_MAX;
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a more specific define than INT_MAX for marking these please.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this will change once we merge the btree...

}

int
iommufd_bind_device(iommufd_controller_t *iommufd)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not seeing any advantage to us having a bind primitive in our protocol, so I do think we can drop it.

return 0;
}

int
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what does attach/detach actually do for vfio-user servers? The only difference I could see is that we check the bound ioas first in iommufd_find_mapping() - but we already have the mapping_hint there.

What am I missing?

mapping->vaddr = vaddr;
mapping->iova = (uint64_t)(uintptr_t)(vaddr + offset_in_page);
mapping->length = length;
mapping->user_va = user_va;
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we just store this and don't do anything with it - this is the clients VA right? The docs could clarify this, and also give more detail as to why we even accept it in the message never mind store it?


/* The IOVA is the server's vaddr (offset-adjusted) */
mapping->vaddr = vaddr;
mapping->iova = (uint64_t)(uintptr_t)(vaddr + offset_in_page);
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the comment/explanation above struct vfu_dma_info needs updating, as it's still stating iova space is physical

VFIO_USER_MIG_DATA_READ = 17,
VFIO_USER_MIG_DATA_WRITE = 18,
/*
* iommufd protocol extension
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we should call this iommufd (but it's fine if we reference it for inspiration). There's really no equivalent thing to "iommufd" at all here, so it's not a useful name for this set of APIs.

I'd suggest instead (depending on disposition of attach/detach, see below) VFIO_USER_IOAS_*

* the legacy VFIO_USER_DMA_MAP/UNMAP commands, mirroring the Linux
* kernel's iommufd API.
*/
VFIO_USER_IOMMUFD_ALLOC_IOAS = 19,
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wouldn't ask it of you, but we will need to update the protocol definition (which lives in qemu now). Let's make sure to file an issue at least.

VFIO_USER_IOMMUFD_ATTACH_PT = 25,
VFIO_USER_IOMMUFD_DETACH_PT = 26,
VFIO_USER_MAX,
};
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

any chance I could talk you into updating https://github.com/nutanix/libvfio-user/blob/master/docs/memory-mapping.md with a section on these APIs, and at least a brief summary of what an "ioas" is, how it relates to the existing libvfio-user APIs, the IOVA==SVA requirement here?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also in particular, what would we actually want multiple IOAS for in vfio-user

if (vfu_ctx->iommufd != NULL && iommufd_is_enabled(vfu_ctx->iommufd)) {
/* Validate that the IOVA is registered (for security) */
mapping = iommufd_find_mapping(vfu_ctx->iommufd, (uint64_t)(uintptr_t)dma_addr, len);
if (mapping != NULL) {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why would we fall back to the non-iommufd implementation, if we can't find a suitable mapping here?

Also, could you expand more about the specific "security" check you're doing here? (I think it's to prevent an incoming client request from arbitrarily reading for any VA in the server process, but it's not explicit here.)

$LSPCI | lspci -vv -F /dev/stdin >lspci.out

for i in 1 2 3 4; do
for i in 1 2 3 4 5; do
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you mind opening a separate PR for your lspci commit, so we can get it merged?

@@ -197,7 +197,8 @@ recv_version(int sock, int *server_max_fds, size_t *server_max_data_xfer_size,
}

ret = tran_parse_version_json(json_str, server_max_fds,
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we will definitely want python tests for the new APIs once they're firmed up

}
/* Enable iommufd mode if both client and server support it */
if (iommufd_supported && vfu_ctx->iommufd != NULL) {
iommufd_enable(vfu_ctx->iommufd);
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you could instead just have ->iommufd only allocated when iommufd is enabled, and != NULL comparisons to decide if it's supported

static int
vfu_reset_ctx(vfu_ctx_t *vfu_ctx, int reason);

/* iommufd command handlers */
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit, could you just put the handlers before handle_request() and not forward declare here?

/* Check iommufd mappings first if iommufd is enabled */
if (vfu_ctx->iommufd != NULL && iommufd_is_enabled(vfu_ctx->iommufd)) {
/* Validate that the IOVA is registered (for security) */
mapping = iommufd_find_mapping(vfu_ctx->iommufd, (uint64_t)(uintptr_t)dma_addr, len);
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

general nit: lots of long lines, we generally keep to 80 chars

return dma_sg_is_mappable(vfu_ctx->dma, sg);
}

/* iommufd command handler implementations */
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will review all these once disposition of bind+attach is resolved

*/

/*
* iommufd support for libvfio-user
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could do with expanding - and explaining the relationship to dma.c

struct vfio_user_iommufd_ioas_unmap {
uint32_t argsz;
uint32_t flags;
#define VFIO_USER_IOMMUFD_UNMAP_ALL (1 << 0)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this isn't handled - and handle_iommufd_ioas_unmap() doesn't validate the flags at all

* kernel's iommufd API.
*/
VFIO_USER_IOMMUFD_ALLOC_IOAS = 19,
VFIO_USER_IOMMUFD_DESTROY_IOAS = 20,
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

need relevant command_needs_quiesce() otherwise client code race badly with multi-threaded code using sgls

return 0;
}

/* Determine page size and round mmap parameters */
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lacking the fd_get_blocksize() behaviour of the legacy controller?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants