Skip to content

Add supplementary groups support and POSIX permission fixes#220

Open
ejc3 wants to merge 18 commits intocloud-hypervisor:masterfrom
ejc3:master
Open

Add supplementary groups support and POSIX permission fixes#220
ejc3 wants to merge 18 commits intocloud-hypervisor:masterfrom
ejc3:master

Conversation

@ejc3
Copy link

@ejc3 ejc3 commented Dec 19, 2025

Summary

This PR adds supplementary groups support and fixes several POSIX permission issues in the passthrough filesystem. These changes enable passing all 8789 pjdfstest POSIX compliance tests.

Use Case

These changes are used in fuse-pipe, a high-performance FUSE client/server that forwards filesystem operations over Unix sockets or vsock. fuse-pipe uses fuse-backend-rs's passthrough filesystem to serve files to Firecracker microVMs.

The supplementary groups support is essential because fuse-pipe forwards FUSE requests from a guest VM over vsock - the server cannot read /proc/<pid>/status since the PID exists in the guest, not the host. Groups must be forwarded through the wire protocol.

Changes

Supplementary Groups Support

  • Add supplementary_groups field to Context struct for remote filesystems
  • Implement ScopedCreds that uses raw SYS_setgroups/SYS_getgroups syscalls (per-thread, not NPTL broadcast)
  • Support forwarding groups via wire protocol for remote FUSE (e.g., over vsock)

Permission Fixes

  • Add set_creds() calls to operations requiring permission checks: lookup, chmod, chown, truncate, utimes, rmdir, unlink, rename, open, opendir
  • Fix link operation to not switch credentials (requires CAP_DAC_READ_SEARCH)
  • Fix ftruncate to skip permission re-check when file handle is provided
  • Fix SUID/SGID clearing on write by non-owner

Other

  • Replace setresuid/setresgid with setfsuid/setfsgid for credential switching (preserves /proc/self/fd/ access)
  • Add signed lseek hook for backward-compatible negative offsets
  • Format debug logging statements

Testing

All 8789 pjdfstest POSIX compliance tests pass with these changes. CI runs the full test suite: https://github.com/ejc3/firepod/actions

Breaking Changes

  • Context struct changed from Copy to Clone (due to Vec<u32> for groups)

ejc3 added 13 commits December 3, 2025 02:13
set_creds() uses setfsuid()/setfsgid() to switch credentials, but this
breaks /proc/self/fd/ access which fuse-backend-rs uses internally via
readlinkat to resolve inode paths.

fuse-pipe handles credential switching in its own wrapper layer via
CredentialsGuard, so these calls are not needed in the passthrough layer.

Removed set_creds() from:
- mkdir (line 433)
- create (lines 565, 587)
- mknod (line 934)
- symlink (line 999)
…tching

The previous implementation used setresuid/setresgid which change the
real/effective UID/GID. This breaks access to /proc/self/fd/ because
after switching to a non-root user, the process can no longer access
files owned by root (euid).

This change replaces the scoped_cred! macro with ScopedCreds which uses
setfsuid/setfsgid instead. These syscalls ONLY affect filesystem access
checks and do NOT change the real/effective UID, so /proc/self/fd/
access is preserved.

Key advantages of setfsuid/setfsgid:
- Inherently per-thread (no glibc signal synchronization)
- Only affects filesystem permission checks
- Process retains root privileges for /proc access
- Designed specifically for file servers serving multiple users

Also restores set_creds() calls to mkdir, create, mknod, and symlink
so that created files are owned by the calling user, not root.
These operations need to run as the calling user for proper permission
checking:
- truncate requires write permission
- utimensat requires write permission or ownership

chown and chmod continue to run as root (correct behavior).
Operations that modify filesystem state need to run with the caller's
credentials so the kernel performs proper permission checks. Without
this, operations like rmdir/unlink/rename succeed even when the caller
doesn't have write permission to the parent directory.

Added set_creds() to:
- rmdir: permission check for directory removal
- unlink: permission check for file removal
- link: permission check for hard link creation
- rename: permission check for renaming
- open: permission check for file access
- opendir: permission check for directory access

This fixes 780+ pjdfstest failures where operations were bypassing
POSIX permission checks.
chmod requires the caller to be the file owner or root. Without
switching credentials, chmod always succeeds when running as root,
bypassing POSIX permission checks.
chown requires the caller to be root to change owner, or file owner
to change group. Without switching credentials, chown always succeeds
when running as root, bypassing POSIX permission checks.
Without this, lookup always runs as root and bypasses directory
search (execute) permission checks. This caused chmod/truncate
to succeed even when the user doesn't have search permission on
a parent directory component.
Link fix:
- Remove set_creds() from link() function because linkat with AT_EMPTY_PATH
  requires CAP_DAC_READ_SEARCH capability which is lost when switching to
  non-root credentials
- Permission checks are handled by the kernel via default_permissions mount
  option before the LINK request reaches the FUSE server

Ftruncate fix:
- When a file handle is provided (ftruncate on an open fd), don't switch
  credentials or re-check permissions - the permission was validated when
  the fd was opened
- This allows ftruncate to succeed on an O_RDWR fd even if the file mode
  is 0, which is correct POSIX behavior

SUID/SGID clearing fix:
- Detect when the kernel is clearing SUID/SGID bits on write by non-owner
- In this case, don't switch credentials because the operation is kernel-
  initiated and the writing user doesn't have chmod permission
- This allows POSIX-compliant SUID/SGID clearing on write

Also adds debug logging for setattr and write operations to aid debugging.
FUSE protocol only passes uid and primary gid in requests, not supplementary
groups. With default_permissions mount option, the kernel checks permissions
but doesn't consider the caller's supplementary groups.

This commit implements the "gocryptfs workaround" (see rfjakob/gocryptfs):
1. Read caller's supplementary groups from /proc/<pid>/status
2. Call setgroups() to adopt those groups before the filesystem operation
3. Restore original groups when ScopedCreds is dropped

IMPORTANT: We use raw SYS_setgroups/SYS_getgroups syscalls instead of the
libc wrappers because glibc's NPTL signals ALL threads to change credentials
for POSIX compliance. The raw kernel syscalls are per-thread, which is
required for a multi-threaded FUSE server where each thread handles
different users concurrently.

See: https://man7.org/linux/man-pages/man7/nptl.7.html
See: https://github.com/rfjakob/gocryptfs (uses same approach)

This fixes 108 pjdfstest chown failures where operations like:
  -u 65534 -g 65532,65531 chown file -1 65531
were returning EPERM because the kernel only saw primary group 65532,
not supplementary group 65531.

Changes:
- mod.rs: Add setgroups_thread(), getgroups_thread() using raw syscalls
- mod.rs: Add parse_proc_groups(), get_current_groups() helper functions
- mod.rs: Extend ScopedCreds to store and restore original_groups
- mod.rs: Update set_creds() to accept pid parameter
- sync_io.rs: Update all set_creds() calls to pass ctx.pid

Result: All 8789 pjdfstest tests now pass on FUSE filesystem.
When the Context struct has supplementary_groups set, ScopedCreds now
uses those groups directly instead of reading from /proc/<pid>/status.

This is essential for remote filesystems (like fuse-pipe over vsock)
where the PID from the FUSE request exists in the guest VM but not on
the host server, making /proc/<pid>/status inaccessible.

Changes:
- Add supplementary_groups field to Context struct (api/filesystem/mod.rs)
- Change Context from Copy to Clone (Vec isn't Copy)
- Modify ScopedCreds::new() to accept optional pre-read groups
- Add set_creds_from_context() helper function
- Update sync_io.rs to use set_creds_from_context() for all operations

This works with fuse-pipe's wire protocol which reads groups from
/proc/<pid>/status on the client side and forwards them to the server.
Simplify credential handling by removing the parse_proc_groups() function
that read from /proc/<pid>/status. This was a fallback for local FUSE mounts,
but we control both client and server in fuse-pipe, so groups are always
forwarded through the wire protocol.

Changes:
- Remove parse_proc_groups() function (37 lines)
- Remove pid parameter from ScopedCreds::new()
- Simplify set_creds() to just take (uid, gid) for upstream API compatibility
- set_creds_from_context() calls ScopedCreds::new directly

All 8789 pjdfstest POSIX compliance tests pass with this change.
@ejc3 ejc3 marked this pull request as ready for review December 19, 2025 05:12
ejc3 added 5 commits December 26, 2025 02:34
- Add copy_file_range method to FileSystem trait with default ENOSYS
- Implement copy_file_range in PassthroughFs using libc syscall
- Add test for copy_file_range with partial copy and offset support

This enables efficient server-side copy operations on filesystems that
support it (e.g., btrfs reflinks). When the underlying filesystem
supports copy_file_range, copies can be instant O(1) operations
instead of O(n) read+write.
Add remap_file_range to the FileSystem trait to enable reflink
operations through FUSE. This maps to the kernel's .remap_file_range
VFS callback, which handles FICLONE and FICLONERANGE ioctls.

Changes:
- Add remap_file_range method to FileSystem trait (default: ENOSYS)
- Implement remap_file_range in PassthroughFs using FICLONE/FICLONERANGE
- Add Arc<FS> delegations for copy_file_range and remap_file_range

The Arc<FS> blanket implementation was missing delegations for these
methods, causing operations through Arc<PassthroughFs> to incorrectly
return ENOSYS instead of calling the actual implementation.

On CoW filesystems (btrfs, xfs with reflink), this enables instant
file cloning where the copy shares physical storage until modified.

Tested:
  cargo test (unit tests pass)
  E2E: ejc3/fcvm#21
    - cp --reflink=always through FUSE → btrfs
    - filefrag confirms shared extents (true reflinks)
Fixes:
- Add named constants for FICLONE/FICLONERANGE ioctl numbers
- Fix return value for whole-file clone (len=0): get actual size via fstat
  instead of returning 0, which the kernel interprets as "0 bytes cloned"

The kernel expects the number of bytes actually remapped. For FICLONE
(whole file), we now query the source file size after successful clone.
When writeback cache is enabled, O_WRONLY is promoted to O_RDWR so the
kernel can issue read requests for page cache operations. However, this
fails with EACCES for files that have write-only permission (e.g., mode
0277).

Fix: Try promoted flags first, fall back to original flags on EACCES.
If the kernel later sends a READ request, it will correctly fail since
the file is genuinely write-only.
@bergwolf bergwolf closed this Feb 26, 2026
@bergwolf bergwolf reopened this Feb 26, 2026
@bergwolf
Copy link
Contributor

It looks pretty useful. Taking a deeper look. Thanks for the contribution!

Copy link
Contributor

@bergwolf bergwolf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

overall looks good. Please also aquash your commits so that each of them makes sense. Thanks!

ctx: &Context,
inode: Self::Inode,
handle: Self::Handle,
offset: i64,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you need this signed offset? It is not part of the posix lseek definition.

let LseekIn {
fh, offset, whence, ..
} = ctx.r.read_obj().map_err(Error::DecodeMessage)?;
let offset_signed = offset as i64;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LseekIn is passing offset: u64. It is wrong to cast it to i64 that can cause overflow.

///
/// See: https://man7.org/linux/man-pages/man7/nptl.7.html
/// See: https://github.com/rfjakob/gocryptfs (uses same approach)
unsafe fn setgroups_thread(groups: &[libc::gid_t]) -> libc::c_long {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please wrap it to a safe function rather than letting the caller set another unsafe block.

/// Thread-local getgroups using raw syscall.
///
/// Like setgroups_thread, this uses the raw syscall to get per-thread groups.
unsafe fn getgroups_thread(size: libc::c_int, list: *mut libc::gid_t) -> libc::c_long {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto here.

}

// Now change uid
let original_fsuid = unsafe { libc::setfsuid(uid) } as libc::uid_t;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please create wrappers for setfsuid and setfsgid to avoid spreading unsafe keyword in multiple places.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants