Add supplementary groups support and POSIX permission fixes#220
Open
ejc3 wants to merge 18 commits intocloud-hypervisor:masterfrom
Open
Add supplementary groups support and POSIX permission fixes#220ejc3 wants to merge 18 commits intocloud-hypervisor:masterfrom
ejc3 wants to merge 18 commits intocloud-hypervisor:masterfrom
Conversation
set_creds() uses setfsuid()/setfsgid() to switch credentials, but this breaks /proc/self/fd/ access which fuse-backend-rs uses internally via readlinkat to resolve inode paths. fuse-pipe handles credential switching in its own wrapper layer via CredentialsGuard, so these calls are not needed in the passthrough layer. Removed set_creds() from: - mkdir (line 433) - create (lines 565, 587) - mknod (line 934) - symlink (line 999)
…tching The previous implementation used setresuid/setresgid which change the real/effective UID/GID. This breaks access to /proc/self/fd/ because after switching to a non-root user, the process can no longer access files owned by root (euid). This change replaces the scoped_cred! macro with ScopedCreds which uses setfsuid/setfsgid instead. These syscalls ONLY affect filesystem access checks and do NOT change the real/effective UID, so /proc/self/fd/ access is preserved. Key advantages of setfsuid/setfsgid: - Inherently per-thread (no glibc signal synchronization) - Only affects filesystem permission checks - Process retains root privileges for /proc access - Designed specifically for file servers serving multiple users Also restores set_creds() calls to mkdir, create, mknod, and symlink so that created files are owned by the calling user, not root.
These operations need to run as the calling user for proper permission checking: - truncate requires write permission - utimensat requires write permission or ownership chown and chmod continue to run as root (correct behavior).
Operations that modify filesystem state need to run with the caller's credentials so the kernel performs proper permission checks. Without this, operations like rmdir/unlink/rename succeed even when the caller doesn't have write permission to the parent directory. Added set_creds() to: - rmdir: permission check for directory removal - unlink: permission check for file removal - link: permission check for hard link creation - rename: permission check for renaming - open: permission check for file access - opendir: permission check for directory access This fixes 780+ pjdfstest failures where operations were bypassing POSIX permission checks.
chmod requires the caller to be the file owner or root. Without switching credentials, chmod always succeeds when running as root, bypassing POSIX permission checks.
chown requires the caller to be root to change owner, or file owner to change group. Without switching credentials, chown always succeeds when running as root, bypassing POSIX permission checks.
Without this, lookup always runs as root and bypasses directory search (execute) permission checks. This caused chmod/truncate to succeed even when the user doesn't have search permission on a parent directory component.
Link fix: - Remove set_creds() from link() function because linkat with AT_EMPTY_PATH requires CAP_DAC_READ_SEARCH capability which is lost when switching to non-root credentials - Permission checks are handled by the kernel via default_permissions mount option before the LINK request reaches the FUSE server Ftruncate fix: - When a file handle is provided (ftruncate on an open fd), don't switch credentials or re-check permissions - the permission was validated when the fd was opened - This allows ftruncate to succeed on an O_RDWR fd even if the file mode is 0, which is correct POSIX behavior SUID/SGID clearing fix: - Detect when the kernel is clearing SUID/SGID bits on write by non-owner - In this case, don't switch credentials because the operation is kernel- initiated and the writing user doesn't have chmod permission - This allows POSIX-compliant SUID/SGID clearing on write Also adds debug logging for setattr and write operations to aid debugging.
FUSE protocol only passes uid and primary gid in requests, not supplementary groups. With default_permissions mount option, the kernel checks permissions but doesn't consider the caller's supplementary groups. This commit implements the "gocryptfs workaround" (see rfjakob/gocryptfs): 1. Read caller's supplementary groups from /proc/<pid>/status 2. Call setgroups() to adopt those groups before the filesystem operation 3. Restore original groups when ScopedCreds is dropped IMPORTANT: We use raw SYS_setgroups/SYS_getgroups syscalls instead of the libc wrappers because glibc's NPTL signals ALL threads to change credentials for POSIX compliance. The raw kernel syscalls are per-thread, which is required for a multi-threaded FUSE server where each thread handles different users concurrently. See: https://man7.org/linux/man-pages/man7/nptl.7.html See: https://github.com/rfjakob/gocryptfs (uses same approach) This fixes 108 pjdfstest chown failures where operations like: -u 65534 -g 65532,65531 chown file -1 65531 were returning EPERM because the kernel only saw primary group 65532, not supplementary group 65531. Changes: - mod.rs: Add setgroups_thread(), getgroups_thread() using raw syscalls - mod.rs: Add parse_proc_groups(), get_current_groups() helper functions - mod.rs: Extend ScopedCreds to store and restore original_groups - mod.rs: Update set_creds() to accept pid parameter - sync_io.rs: Update all set_creds() calls to pass ctx.pid Result: All 8789 pjdfstest tests now pass on FUSE filesystem.
When the Context struct has supplementary_groups set, ScopedCreds now uses those groups directly instead of reading from /proc/<pid>/status. This is essential for remote filesystems (like fuse-pipe over vsock) where the PID from the FUSE request exists in the guest VM but not on the host server, making /proc/<pid>/status inaccessible. Changes: - Add supplementary_groups field to Context struct (api/filesystem/mod.rs) - Change Context from Copy to Clone (Vec isn't Copy) - Modify ScopedCreds::new() to accept optional pre-read groups - Add set_creds_from_context() helper function - Update sync_io.rs to use set_creds_from_context() for all operations This works with fuse-pipe's wire protocol which reads groups from /proc/<pid>/status on the client side and forwards them to the server.
Simplify credential handling by removing the parse_proc_groups() function that read from /proc/<pid>/status. This was a fallback for local FUSE mounts, but we control both client and server in fuse-pipe, so groups are always forwarded through the wire protocol. Changes: - Remove parse_proc_groups() function (37 lines) - Remove pid parameter from ScopedCreds::new() - Simplify set_creds() to just take (uid, gid) for upstream API compatibility - set_creds_from_context() calls ScopedCreds::new directly All 8789 pjdfstest POSIX compliance tests pass with this change.
- Add copy_file_range method to FileSystem trait with default ENOSYS - Implement copy_file_range in PassthroughFs using libc syscall - Add test for copy_file_range with partial copy and offset support This enables efficient server-side copy operations on filesystems that support it (e.g., btrfs reflinks). When the underlying filesystem supports copy_file_range, copies can be instant O(1) operations instead of O(n) read+write.
Add remap_file_range to the FileSystem trait to enable reflink operations through FUSE. This maps to the kernel's .remap_file_range VFS callback, which handles FICLONE and FICLONERANGE ioctls. Changes: - Add remap_file_range method to FileSystem trait (default: ENOSYS) - Implement remap_file_range in PassthroughFs using FICLONE/FICLONERANGE - Add Arc<FS> delegations for copy_file_range and remap_file_range The Arc<FS> blanket implementation was missing delegations for these methods, causing operations through Arc<PassthroughFs> to incorrectly return ENOSYS instead of calling the actual implementation. On CoW filesystems (btrfs, xfs with reflink), this enables instant file cloning where the copy shares physical storage until modified. Tested: cargo test (unit tests pass) E2E: ejc3/fcvm#21 - cp --reflink=always through FUSE → btrfs - filefrag confirms shared extents (true reflinks)
Fixes: - Add named constants for FICLONE/FICLONERANGE ioctl numbers - Fix return value for whole-file clone (len=0): get actual size via fstat instead of returning 0, which the kernel interprets as "0 bytes cloned" The kernel expects the number of bytes actually remapped. For FICLONE (whole file), we now query the source file size after successful clone.
When writeback cache is enabled, O_WRONLY is promoted to O_RDWR so the kernel can issue read requests for page cache operations. However, this fails with EACCES for files that have write-only permission (e.g., mode 0277). Fix: Try promoted flags first, fall back to original flags on EACCES. If the kernel later sends a READ request, it will correctly fail since the file is genuinely write-only.
Contributor
|
It looks pretty useful. Taking a deeper look. Thanks for the contribution! |
bergwolf
reviewed
Feb 26, 2026
Contributor
bergwolf
left a comment
There was a problem hiding this comment.
overall looks good. Please also aquash your commits so that each of them makes sense. Thanks!
| ctx: &Context, | ||
| inode: Self::Inode, | ||
| handle: Self::Handle, | ||
| offset: i64, |
Contributor
There was a problem hiding this comment.
Why do you need this signed offset? It is not part of the posix lseek definition.
| let LseekIn { | ||
| fh, offset, whence, .. | ||
| } = ctx.r.read_obj().map_err(Error::DecodeMessage)?; | ||
| let offset_signed = offset as i64; |
Contributor
There was a problem hiding this comment.
LseekIn is passing offset: u64. It is wrong to cast it to i64 that can cause overflow.
| /// | ||
| /// See: https://man7.org/linux/man-pages/man7/nptl.7.html | ||
| /// See: https://github.com/rfjakob/gocryptfs (uses same approach) | ||
| unsafe fn setgroups_thread(groups: &[libc::gid_t]) -> libc::c_long { |
Contributor
There was a problem hiding this comment.
Please wrap it to a safe function rather than letting the caller set another unsafe block.
| /// Thread-local getgroups using raw syscall. | ||
| /// | ||
| /// Like setgroups_thread, this uses the raw syscall to get per-thread groups. | ||
| unsafe fn getgroups_thread(size: libc::c_int, list: *mut libc::gid_t) -> libc::c_long { |
| } | ||
|
|
||
| // Now change uid | ||
| let original_fsuid = unsafe { libc::setfsuid(uid) } as libc::uid_t; |
Contributor
There was a problem hiding this comment.
Please create wrappers for setfsuid and setfsgid to avoid spreading unsafe keyword in multiple places.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
This PR adds supplementary groups support and fixes several POSIX permission issues in the passthrough filesystem. These changes enable passing all 8789 pjdfstest POSIX compliance tests.
Use Case
These changes are used in fuse-pipe, a high-performance FUSE client/server that forwards filesystem operations over Unix sockets or vsock. fuse-pipe uses fuse-backend-rs's passthrough filesystem to serve files to Firecracker microVMs.
The supplementary groups support is essential because fuse-pipe forwards FUSE requests from a guest VM over vsock - the server cannot read
/proc/<pid>/statussince the PID exists in the guest, not the host. Groups must be forwarded through the wire protocol.Changes
Supplementary Groups Support
supplementary_groupsfield toContextstruct for remote filesystemsScopedCredsthat uses rawSYS_setgroups/SYS_getgroupssyscalls (per-thread, not NPTL broadcast)Permission Fixes
set_creds()calls to operations requiring permission checks: lookup, chmod, chown, truncate, utimes, rmdir, unlink, rename, open, opendirCAP_DAC_READ_SEARCH)Other
setresuid/setresgidwithsetfsuid/setfsgidfor credential switching (preserves/proc/self/fd/access)Testing
All 8789 pjdfstest POSIX compliance tests pass with these changes. CI runs the full test suite: https://github.com/ejc3/firepod/actions
Breaking Changes
Contextstruct changed fromCopytoClone(due toVec<u32>for groups)