rmacklem
Contributor

Motivation and Context

Support for the pathconf name _PC_CLONE_BLKSIZE is needed so that
the NFSv4.2 server can reply correctly when the clone_blksize attribute
is requested. It also allows applications running locally on the system
to determine if block cloning is available via the copy_file_range(2)
system call.
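
A minimal sketch of how a local application might query this value, assuming a FreeBSD system whose headers define _PC_CLONE_BLKSIZE. The helper name query_clone_blksize is hypothetical and not part of this change; on platforms that lack the pathconf name it reports -1 so that the caller can fall back to an ordinary copy.

```c
#include <errno.h>
#include <unistd.h>

/*
 * Hypothetical helper: return the clone block size for the file system
 * holding `path`, 0 if block cloning is not supported there, or -1 if
 * the value cannot be determined.
 */
long
query_clone_blksize(const char *path)
{
#ifdef _PC_CLONE_BLKSIZE
	errno = 0;
	return (pathconf(path, _PC_CLONE_BLKSIZE));
#else
	/* This platform's headers do not define the pathconf name. */
	(void)path;
	errno = EINVAL;
	return (-1);
#endif
}
```

A result greater than zero suggests that copy_file_range(2) may be able to clone blocks on that file system; zero or -1 means the caller should expect a regular copy.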

Description

FreeBSD now has a pathconf name called _PC_CLONE_BLKSIZE which
is the block size supported for block cloning for the file system.
Since ZFS's block size varies per file, return the largest size possible,
which is zfs_max_recordsize, or zero if block cloning is not supported.

How Has This Been Tested?

Tested in a system based on a recent FreeBSD main, which supports
the _PC_CLONE_BLKSIZE pathconf name and also uses it for requests
for the NFSv4.2 clone_blksize attribute.
(The FreeBSD NFSv4.2 client actually "cheats" and clips the value of
"clone_blksize" it receives to 128Kbytes, so that cloning works for
most files. It accepts that there will be a rare failure when a file has
a larger record size. However, I believe returning 128Kbytes would
not conform to RFC7862.)
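
The clipping described above can be modeled roughly as follows. This is an illustrative sketch only, not the actual FreeBSD NFS client code; the function and macro names are hypothetical, and the 128 Kbyte cap is taken from the description above.

```c
#include <stdint.h>

/* Assumed cap taken from the description above: 128 Kbytes. */
#define	CLIENT_CLONE_BLKSIZE_CAP	((uint64_t)128 * 1024)

/*
 * Hypothetical model of the client-side clipping.  A value of zero
 * (cloning not supported) passes through unchanged.
 */
static uint64_t
clip_clone_blksize(uint64_t server_value)
{
	if (server_value > CLIENT_CLONE_BLKSIZE_CAP)
		return (CLIENT_CLONE_BLKSIZE_CAP);
	return (server_value);
}
```

With this model, a server reply of 16 Mbytes is treated as 128 Kbytes on the client, which is why a file with a larger record size can still fail to clone.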

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • [x] New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Quality assurance (non-breaking change which makes the code more robust against bugs)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

  • [x] My code follows the OpenZFS code style requirements.
  • I have updated the documentation accordingly.
  • [x] I have read the contributing document.
  • I have added tests to cover my changes.
  • I have run the ZFS Test Suite with this change applied.
  • [x] All commit messages are properly formatted and contain Signed-off-by.

        if (zfs_bclone_enabled &&
            spa_feature_is_enabled(dmu_objset_spa(zfsvfs->z_os),
            SPA_FEATURE_BLOCK_CLONING))
                *ap->a_retval = zfs_max_recordsize;
Member

I can't say that I like zfs_max_recordsize, since it does not represent the pool but rather a system capability. In most cases it will be 16MB, which is rarely reached. At the same time, on i386, pools imported from another system might have blocks bigger than its zfs_max_recordsize. I think you could tell whether the dataset ever had any block above 128K by checking dsl_dataset_feature_is_active(ds, SPA_FEATURE_LARGE_BLOCKS). But otherwise I'd say that the recordsize property is the best guess we have.

@behlendorf added the "Status: Code Review Needed" (Ready for review and testing) and "Status: Revision Needed" (Changes are required for the PR to be accepted) labels Aug 22, 2025
FreeBSD now has a pathconf name called _PC_CLONE_BLKSIZE
which is the block size supported for block cloning for
the file system.  Since ZFS's block size varies per file,
return the largest size likely to be used, or zero if block
cloning is not supported.

Signed-off-by: Rick Macklem <[email protected]>
@github-actions bot removed the "Status: Revision Needed" (Changes are required for the PR to be accepted) label Aug 23, 2025
@rmacklem
Contributor Author

rmacklem commented Aug 23, 2025 via email

@behlendorf
Contributor

Since the vnode is available here, it seems like we should return the actual record size for the file or, if it's not yet set, fall back to the maximum allowed size. Something like this should work (untested).

#ifdef _PC_CLONE_BLKSIZE
        case _PC_CLONE_BLKSIZE: {
                /* Braces give the declarations below their own scope. */
                zfsvfs_t *zfsvfs = (zfsvfs_t *)ap->a_vp->v_mount->mnt_data;
                spa_t *spa = dmu_objset_spa(zfsvfs->z_os);

                if (zfs_bclone_enabled &&
                    spa_feature_is_enabled(spa, SPA_FEATURE_BLOCK_CLONING)) {
                        /* Prefer the file's own block size. */
                        *ap->a_retval = VTOZ(ap->a_vp)->z_blksz;
                        /* Not set yet: fall back to the dataset maximum. */
                        if (*ap->a_retval == 0)
                                *ap->a_retval = zfsvfs->z_max_blksz;
                } else {
                        /* Block cloning is unavailable. */
                        *ap->a_retval = 0;
                }
                return (0);
        }
#endif

@robn
Member

robn commented Aug 27, 2025

Drive-by comment from vacation, so not thought hard, but...

Seems like this is strongly adjacent to zfs_get_direct_alignment() (#16972), and indeed, these pathconf vars seem to be aimed at similar areas as the statx vars.

Is there opportunity for a more uniform or shared approach?

@rmacklem
Contributor Author

rmacklem commented Aug 27, 2025 via email

@rmacklem
Contributor Author

rmacklem commented Aug 27, 2025 via email

@rmacklem
Contributor Author

rmacklem commented Aug 27, 2025 via email

@behlendorf
Contributor

Thanks for linking to that thread, it was helpful to have some more context, even if I still don't have a solid understanding of how NFSv4.2 is going to use this value! Let me add a little more detail about what values we could return, and maybe that will help.

For filesystems with the SPA_FEATURE_LARGE_BLOCKS feature active, the maximum block size a file may be using is SPA_MAXBLOCKSIZE. This is limited by the on-disk format and is unlikely to ever change. If the feature is not active, then the maximum block size possible will be SPA_OLD_MAXBLOCKSIZE.
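
As a rough illustration of that upper bound, the sketch below uses stand-in constants that mirror the 128K and 16M limits discussed in this thread. The macro and function names are hypothetical; in-kernel code would consult the real feature state (for example via dsl_dataset_feature_is_active(), as suggested earlier) rather than take a boolean argument.

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-ins for the limits discussed above: 128K and 16M. */
#define	SKETCH_OLD_MAXBLOCKSIZE	(1ULL << 17)
#define	SKETCH_MAXBLOCKSIZE	(1ULL << 24)

/*
 * Hypothetical helper: the largest block size any file could be using,
 * given whether the large blocks feature is active for the dataset.
 */
static uint64_t
max_possible_blksz(bool large_blocks_active)
{
	return (large_blocks_active ?
	    SKETCH_MAXBLOCKSIZE : SKETCH_OLD_MAXBLOCKSIZE);
}
```

This gives a safe upper bound, not the block size any particular file actually uses.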

The thing is, the recordsize property is probably the best guess as to the most common block size in the filesystem. There's really no guarantee there either, though, because that default block size may have been changed after the filesystem was created (that's allowed). Or, if the pool contains a large number of small files, they may all have a block size smaller than that default.

For real workloads I'm not sure which of those values would work best.

@rmacklem
Contributor Author

rmacklem commented Aug 28, 2025 via email

@rmacklem
Contributor Author

rmacklem commented Aug 28, 2025 via email

@behlendorf
Contributor

Oh, I think using zfs_max_recordsize is a little more correct than SPA_MAXBLOCKSIZE because the setting for zfs_max_recordsize is smaller on 32-bit systems.

The wrinkle with using zfs_max_recordsize is that existing pools with 16M blocks on-disk can be imported and used on 32-bit systems. Performance may (will) suffer for existing files written with 16M blocks, but they can be accessed. In this case the bclone alignment restriction will still be 16M for these files.
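
To make the alignment point concrete, here is a simplified sketch of an alignment check. The function name is hypothetical, and the check deliberately ignores any special handling of a final partial block at end of file, so treat it as an illustration rather than the rule ZFS actually enforces.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical check: true if the range [off, off + len) starts and
 * ends on a multiple of blksz.  A blksz of 0 means cloning is not
 * supported, so nothing is considered aligned.
 */
static bool
clone_range_is_aligned(uint64_t off, uint64_t len, uint64_t blksz)
{
	if (blksz == 0 || len == 0)
		return (false);
	return (off % blksz == 0 && len % blksz == 0);
}
```

Under this model, a range at offset 128K with length 128K is aligned for a file using 128K blocks but not for a file written with 16M blocks, which is the case described above.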

Labels
Status: Code Review Needed Ready for review and testing