zfs_vnops_os.c: Add support for the _PC_CLONE_BLKSIZE name #17645

rmacklem · 2025-08-15T21:40:08Z

Motivation and Context

Support for the pathconf name _PC_CLONE_BLKSIZE is needed so that
the NFSv4.2 server can reply correctly when the clone_blksize attribute
is requested. It also allows applications running locally on the system
to determine if block cloning is available via the copy_file_range(2)
system call.

Description

FreeBSD now has a pathconf name called _PC_CLONE_BLKSIZE which
is the block size supported for block cloning for the file system.
Since ZFS's block size varies per file, return the largest size possible,
which is zfs_max_recordsize, or zero if block cloning is not supported.

How Has This Been Tested?

Tested in a system based on a recent FreeBSD main, which supports
the _PC_CLONE_BLKSIZE pathconf name and also uses it for requests
for the NFSv4.2 clone_blksize attribute.
(The FreeBSD NFSv4.2 client actually "cheats" and clips the value of
"clone_blksize" it receives to 128Kbytes, so that cloning works for
most files. It accepts that there will be a rare failure when a file has
a larger record size. However, I believe returning 128Kbytes would
not conform to RFC7862.)

Types of changes

Bug fix (non-breaking change which fixes an issue)
[x ] New feature (non-breaking change which adds functionality)
Performance enhancement (non-breaking change which improves efficiency)
Code cleanup (non-breaking change which makes code smaller or more readable)
Quality assurance (non-breaking change which makes the code more robust against bugs)
Breaking change (fix or feature that would cause existing functionality to change)
Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
Documentation (a change to man pages or other documentation)

Checklist:

[x ] My code follows the OpenZFS code style requirements.
I have updated the documentation accordingly.
[x ] I have read the contributing document.
I have added tests to cover my changes.
I have run the ZFS Test Suite with this change applied.
[x ] All commit messages are properly formatted and contain Signed-off-by.

amotin · 2025-08-17T15:45:37Z

module/os/freebsd/zfs/zfs_vnops_os.c

+		if (zfs_bclone_enabled &&
+		    spa_feature_is_enabled(dmu_objset_spa(zfsvfs->z_os),
+		    SPA_FEATURE_BLOCK_CLONING))
+			*ap->a_retval = zfs_max_recordsize;


I can't say that I like zfs_max_recordsize, since it does not represent the pool, but instead more of a system capabilities. In most cases it will be 16MB, which is rarely reached. Same time on i386 pools imported from another system might have blocks bigger than its zfs_max_recordsize. I think you could say whether dataset ever had any block above 128K by checking dsl_dataset_feature_is_active(ds, SPA_FEATURE_LARGE_BLOCKS). But otherwise I'd say that recordsize property is the best guess we have.

FreeBSD now has a pathconf name called _PC_CLONE_BLKSIZE which is the block size supported for block cloning for the file system. Since ZFS's block size varies per file, return the largest size likely to be used, or zero if block cloning is not supported. Signed-off-by: Rick Macklem <[email protected]>

rmacklem · 2025-08-23T01:17:33Z

On Sun, Aug 17, 2025 at 8:46 AM Alexander Motin ***@***.***> wrote: CAUTION: This email originated from outside of the University of Guelph. Do not click links or open attachments unless you recognize the sender and know the content is safe. If in doubt, forward suspicious emails to ***@***.*** @amotin commented on this pull request. ________________________________ In module/os/freebsd/zfs/zfs_vnops_os.c: > @@ -5772,6 +5775,17 @@ zfs_freebsd_pathconf(struct vop_pathconf_args *ap) case _PC_HAS_HIDDENSYSTEM: *ap->a_retval = 1; return (0); +#endif +#ifdef _PC_CLONE_BLKSIZE + case _PC_CLONE_BLKSIZE: + zfsvfs = (zfsvfs_t *)ap->a_vp->v_mount->mnt_data; + if (zfs_bclone_enabled && + spa_feature_is_enabled(dmu_objset_spa(zfsvfs->z_os), + SPA_FEATURE_BLOCK_CLONING)) + *ap->a_retval = zfs_max_recordsize; I can't say that I like zfs_max_recordsize, since it does not represent the pool, but instead more of a system capabilities. In most cases it will be 16MB, which is rarely reached. Same time on i386 pools imported from another system might have blocks bigger than its zfs_max_recordsize. I think you could say whether dataset ever had any block above 128K by checking dsl_dataset_feature_is_active(ds, SPA_FEATURE_LARGE_BLOCKS). But otherwise I'd say that recordsize property is the best guess we have.

I have modified the patch to use f_iosize if SPA_FEATURE_LARGE_BLOCKS is not active. I think this should be ok, since the Linux NFSv4.2 client will attempt the Clone when the alignment works, but will accept a reply of EINVAL if it does not work. rick

…

— Reply to this email directly, view it on GitHub, or unsubscribe. You are receiving this because you authored the thread.Message ID: ***@***.***>

behlendorf · 2025-08-27T00:08:36Z

Since it the vnode is available here it seems like we should return the actual recordsize size for the file, or if it's not yet set fallback to the maximum allowed size. Something like this should work (untested).

#ifdef _PC_CLONE_BLKSIZE
        case _PC_CLONE_BLKSIZE:
                zfsvfs_t *zfsvfs = (zfsvfs_t *)ap->a_vp->v_mount->mnt_data;
                spa_t *spa = dmu_objset_spa(zfsvfs->z_os);

                if (zfs_bclone_enabled &&
                    spa_feature_is_enabled(spa, SPA_FEATURE_BLOCK_CLONING)) {
                        *ap->a_retval = VTOZ(ap->a_vp)->z_blksz;
                        if (*ap->a_retval == 0)
                                *ap->a_retval = zfsvfs->z_max_blksz;
                } else {
                        *ap->a_retval = 0;
                }
                return (0);
#endif

robn · 2025-08-27T00:20:38Z

Drive-by comment from vacation, so not thought hard, but...

Seems like this is strongly adjacent to zfs_get_direct_alignment() (#16972), and indeed, these pathconf vars seems to be aimed at similar areas as the statx vars.

Is there opportunity for a more uniform or shared approach?

rmacklem · 2025-08-27T03:58:30Z

On Tue, Aug 26, 2025 at 5:09 PM Brian Behlendorf ***@***.***> wrote: CAUTION: This email originated from outside of the University of Guelph. Do not click links or open attachments unless you recognize the sender and know the content is safe. If in doubt, forward suspicious emails to ***@***.*** *behlendorf* left a comment (openzfs/zfs#17645) Since it the vnode is available here it seems like we should return the actual recordsize size for the file, or if it's not yet set fallback to the maximum allowed size. Something like this should work (untested). #ifdef _PC_CLONE_BLKSIZE case _PC_CLONE_BLKSIZE: zfsvfs_t *zfsvfs = (zfsvfs_t *)ap->a_vp->v_mount->mnt_data; spa_t *spa = dmu_objset_spa(zfsvfs->z_os); if (zfs_bclone_enabled && spa_feature_is_enabled(spa, SPA_FEATURE_BLOCK_CLONING)) { *ap->a_retval = VTOZ(ap->a_vp)->z_blksz; if (*ap->a_retval == 0) *ap->a_retval = zfsvfs->z_max_blksz; } else { *ap->a_retval = 0; } return (0);#endif That was actually my first version of the patch.

The problem is that my main use for _PC_CLONE_BLKSIZE is the NFSv4.2 attribute called clone_blksize and it seems to be defined as "per file system" and not "per file". (It actually turns out that RFC7862 does not define which it is, but the collective opinion on ***@***.*** was that it is "per file system".) You can read the thread here, if it interests you. https://mailarchive.ietf.org/arch/msg/nfsv4/brGjm5_oYKVrwlMwF5utgAgEfPU/ The extant Linux NFSv4.2 client handles clone_blksize per file system. It checks to see if the offsets and len are clone_blksize aligned and then attempts a Clone operation if they are. If that operation fails with EINVAL, the outcome is the same as if the offsets/len are not clone_blksize aligned, so the only negative effect of making it too small is an RPC RTT. (Making it different for each file will only result in the Linux client using the value it gets for the first file object, which is likely to be a directory.) rick —

…

Reply to this email directly, view it on GitHub, or unsubscribe. You are receiving this because you authored the thread.Message ID: ***@***.***>

rmacklem · 2025-08-27T04:02:34Z

On Tue, Aug 26, 2025 at 5:21 PM Rob Norris ***@***.***> wrote: CAUTION: This email originated from outside of the University of Guelph. Do not click links or open attachments unless you recognize the sender and know the content is safe. If in doubt, forward suspicious emails to ***@***.*** *robn* left a comment (openzfs/zfs#17645) Drive-by comment from vacation, so not thought hard, but... Seems like this is strongly adjacent to zfs_get_direct_alignment() (#16972), and indeed, these pathconf vars seems to be aimed at similar areas as the statx vars. Is there opportunity for a more uniform or shared approach?

For FreeBSD, making a revision to statfs() is a messy bother and the collective opinion is that adding pathconf names is the way to go. For the case of named attributes (Solaris style extended attributes), Solaris used pathconf names as well. For this case, I am not aware of any other system exposing this value to userland. (Linux uses FICLONE and FICLONERANGE ioctl()s and maybe copy_file_range(2), but I am not aware of a clone_blksize being returned to userland.) rick —

…

Reply to this email directly, view it on GitHub, or unsubscribe. You are receiving this because you authored the thread.Message ID: ***@***.***>

rmacklem · 2025-08-27T04:10:08Z

On Tue, Aug 26, 2025 at 9:02 PM Rick Macklem ***@***.***> wrote: On Tue, Aug 26, 2025 at 5:21 PM Rob Norris ***@***.***> wrote: > CAUTION: This email originated from outside of the University of Guelph. > Do not click links or open attachments unless you recognize the sender and > know the content is safe. If in doubt, forward suspicious emails to > ***@***.*** > > *robn* left a comment (openzfs/zfs#17645) > > Drive-by comment from vacation, so not thought hard, but... > > Seems like this is strongly adjacent to zfs_get_direct_alignment() ( > #16972), and indeed, these pathconf vars seems to be aimed at similar > areas as the statx vars. > > Is there opportunity for a more uniform or shared approach? > For FreeBSD, making a revision to statfs() is a messy bother and the collective opinion is that adding pathconf names is the way to go. For the case of named attributes (Solaris style extended attributes), Solaris used pathconf names as well. For this case, I am not aware of any other system exposing this value to userland. (Linux uses FICLONE and FICLONERANGE ioctl()s and maybe copy_file_range(2), but I am not aware of a clone_blksize being returned to userland.)

Oh, I should add the caveat that I am not a Linux guy, so the above could easily be incorrect. rick

…

rick — > Reply to this email directly, view it on GitHub, or unsubscribe. > You are receiving this because you authored the thread.Message ID: > ***@***.***> >

behlendorf · 2025-08-27T22:21:25Z

Thanks for linking to that thread, it was helpful to have some more context. Even if i still don't have a solid understanding of how NFS4.2 is going to use this value! Let me add a little more detail about what values we could return and maybe that will help.

For filesystems with the SPA_FEATURE_LARGE_BLOCKS feature active the maximum block size a file may be using is SPA_MAXBLOCKSIZE. This is limited by on the on-disk format and is unlikely to ever change. If it's not set, then the maximum block size possible will be SPA_OLD_MAXBLOCKSIZE.

The thing is the recordsize property is probably the best guess as to the most common block size in the filesystem. There's really no guarantee there either though because that default block size what changed after the filesystem was created (that's allowed). Or if the pool contains a large number of small files they may all have a block size smaller than that default.

For real workloads I'm not sure which of those values would work best.

rmacklem · 2025-08-28T00:45:43Z

On Wed, Aug 27, 2025 at 3:21 PM Brian Behlendorf ***@***.***> wrote: CAUTION: This email originated from outside of the University of Guelph. Do not click links or open attachments unless you recognize the sender and know the content is safe. If in doubt, forward suspicious emails to ***@***.*** *behlendorf* left a comment (openzfs/zfs#17645) <#17645 (comment)> Thanks for linking to that thread, it was helpful to have some more context. Even if i still don't have a solid understanding of how NFS4.2 is going to use this value! Let me add a little more detail about what values we could return and maybe that will help. For filesystems with the SPA_FEATURE_LARGE_BLOCKS feature active the maximum block size a file may be using is SPA_MAXBLOCKSIZE. This is limited by on the on-disk format and is unlikely to ever change. If it's not set, then the maximum block size possible will be SPA_OLD_MAXBLOCKSIZE .

Right now, the patch returns the same values as SPA_MAXBLOCKSIZE or SPA_OLD_MAXBLOCKSIZE. It just happens to get them from different places. I'll change the patch, since using these constants makes it a little easier to understand. As for what works best, I don't really know either, but I think returning SPA_MAXBLOCKSIZE or SPA_OLD_MAXBLOCKSIZE is as good as anything else. Since recordsize can be smaller than this, I think it is safer to return the largest block alignment needed. The thing is the recordsize property is probably the best guess as to the

most common block size in the filesystem. There's really no guarantee there either though because that default block size what changed after the filesystem was created (that's allowed). Or if the pool contains a large number of small files they may all have a block size smaller than that default.

On FreeBSD, the default recordsize is 128K and "man zfsprops" recommends against making it smaller for most cases, so I think 128K (assuming the SPA_FEATURE_LARGE_BLOCKS is not active) seems the best bet, since it probably should be the largest value for the file system. rick For real workloads I'm not sure which of those values would work best.

…

— Reply to this email directly, view it on GitHub <#17645 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/APNAL2TXS7J6WARAGUYNMAD3PYVPVAVCNFSM6AAAAACEAO6IV6VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZTEMRZHEYTIOBTGI> . You are receiving this because you authored the thread.Message ID: ***@***.***>

rmacklem · 2025-08-28T00:57:07Z

On Wed, Aug 27, 2025 at 3:21 PM Brian Behlendorf ***@***.***> wrote: CAUTION: This email originated from outside of the University of Guelph. Do not click links or open attachments unless you recognize the sender and know the content is safe. If in doubt, forward suspicious emails to ***@***.*** *behlendorf* left a comment (openzfs/zfs#17645) <#17645 (comment)> Thanks for linking to that thread, it was helpful to have some more context. Even if i still don't have a solid understanding of how NFS4.2 is going to use this value! Let me add a little more detail about what values we could return and maybe that will help. For filesystems with the SPA_FEATURE_LARGE_BLOCKS feature active the maximum block size a file may be using is SPA_MAXBLOCKSIZE. This is limited by on the on-disk format and is unlikely to ever change. If it's not set, then the maximum block size possible will be SPA_OLD_MAXBLOCKSIZE .

Oh, I think using zfs_max_recordsize is a little more correct than SPA_MAXBLOCKSIZE because the setting for zfs_max_recordsize is smaller for 32bits. (Although FreeBSD is dropping 32bit support, so it hardly matters.) For 64bits, I'm fairly sure zfs_max_recordsize is set to SPA_MAXBLOCKSIZE. rick The thing is the recordsize property is probably the best guess as to the

…

most common block size in the filesystem. There's really no guarantee there either though because that default block size what changed after the filesystem was created (that's allowed). Or if the pool contains a large number of small files they may all have a block size smaller than that default. For real workloads I'm not sure which of those values would work best. — Reply to this email directly, view it on GitHub <#17645 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/APNAL2TXS7J6WARAGUYNMAD3PYVPVAVCNFSM6AAAAACEAO6IV6VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZTEMRZHEYTIOBTGI> . You are receiving this because you authored the thread.Message ID: ***@***.***>

behlendorf · 2025-08-28T22:39:03Z

Oh, I think using zfs_max_recordsize is a little more correct than SPA_MAXBLOCKSIZE because the setting for zfs_max_recordsize is smaller for 32bits.

The wrinkle with using zfs_max_recordsize is that existing pools with 16M blocks on-disk can be imported and used on 32-bit systems. Performance may (will) suffer for existing files written with 16M blocks, but they can be accessed. In this case the bclone alignment restriction will still be 16M for these files.

amotin reviewed Aug 17, 2025

View reviewed changes

behlendorf added Status: Code Review Needed Ready for review and testing Status: Revision Needed Changes are required for the PR to be accepted labels Aug 22, 2025

rmacklem force-pushed the master branch from b900b11 to 4d7db87 Compare August 23, 2025 01:09

github-actions bot removed the Status: Revision Needed Changes are required for the PR to be accepted label Aug 23, 2025

tonyhutter approved these changes Aug 26, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

zfs_vnops_os.c: Add support for the _PC_CLONE_BLKSIZE name #17645

zfs_vnops_os.c: Add support for the _PC_CLONE_BLKSIZE name #17645

Uh oh!

rmacklem commented Aug 15, 2025

Uh oh!

amotin Aug 17, 2025

Uh oh!

rmacklem commented Aug 23, 2025 via email

Uh oh!

behlendorf commented Aug 27, 2025

Uh oh!

robn commented Aug 27, 2025

Uh oh!

rmacklem commented Aug 27, 2025 via email

Uh oh!

rmacklem commented Aug 27, 2025 via email

Uh oh!

rmacklem commented Aug 27, 2025 via email

Uh oh!

behlendorf commented Aug 27, 2025

Uh oh!

rmacklem commented Aug 28, 2025 via email

Uh oh!

rmacklem commented Aug 28, 2025 via email

Uh oh!

behlendorf commented Aug 28, 2025

Uh oh!

Uh oh!

zfs_vnops_os.c: Add support for the _PC_CLONE_BLKSIZE name #17645

Are you sure you want to change the base?

zfs_vnops_os.c: Add support for the _PC_CLONE_BLKSIZE name #17645

Uh oh!

Conversation

rmacklem commented Aug 15, 2025

Motivation and Context

Description

How Has This Been Tested?

Types of changes

Checklist:

Uh oh!

amotin Aug 17, 2025

Choose a reason for hiding this comment

Uh oh!

rmacklem commented Aug 23, 2025 via email

Uh oh!

behlendorf commented Aug 27, 2025

Uh oh!

robn commented Aug 27, 2025

Uh oh!

rmacklem commented Aug 27, 2025 via email

Uh oh!

rmacklem commented Aug 27, 2025 via email

Uh oh!

rmacklem commented Aug 27, 2025 via email

Uh oh!

behlendorf commented Aug 27, 2025

Uh oh!

rmacklem commented Aug 28, 2025 via email

Uh oh!

rmacklem commented Aug 28, 2025 via email

Uh oh!

behlendorf commented Aug 28, 2025

Uh oh!

Uh oh!