-
Notifications
You must be signed in to change notification settings - Fork 1.9k
zfs_vnops_os.c: Add support for the _PC_CLONE_BLKSIZE name #17645
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
module/os/freebsd/zfs/zfs_vnops_os.c
Outdated
if (zfs_bclone_enabled && | ||
spa_feature_is_enabled(dmu_objset_spa(zfsvfs->z_os), | ||
SPA_FEATURE_BLOCK_CLONING)) | ||
*ap->a_retval = zfs_max_recordsize; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can't say that I like zfs_max_recordsize
, since it does not represent the pool, but instead more of a system capabilities. In most cases it will be 16MB, which is rarely reached. Same time on i386 pools imported from another system might have blocks bigger than its zfs_max_recordsize
. I think you could say whether dataset ever had any block above 128K by checking dsl_dataset_feature_is_active(ds, SPA_FEATURE_LARGE_BLOCKS)
. But otherwise I'd say that recordsize property is the best guess we have.
FreeBSD now has a pathconf name called _PC_CLONE_BLKSIZE which is the block size supported for block cloning for the file system. Since ZFS's block size varies per file, return the largest size likely to be used, or zero if block cloning is not supported. Signed-off-by: Rick Macklem <[email protected]>
On Sun, Aug 17, 2025 at 8:46 AM Alexander Motin ***@***.***> wrote:
CAUTION: This email originated from outside of the University of Guelph. Do not click links or open attachments unless you recognize the sender and know the content is safe. If in doubt, forward suspicious emails to ***@***.***
@amotin commented on this pull request.
________________________________
In module/os/freebsd/zfs/zfs_vnops_os.c:
> @@ -5772,6 +5775,17 @@ zfs_freebsd_pathconf(struct vop_pathconf_args *ap)
case _PC_HAS_HIDDENSYSTEM:
*ap->a_retval = 1;
return (0);
+#endif
+#ifdef _PC_CLONE_BLKSIZE
+ case _PC_CLONE_BLKSIZE:
+ zfsvfs = (zfsvfs_t *)ap->a_vp->v_mount->mnt_data;
+ if (zfs_bclone_enabled &&
+ spa_feature_is_enabled(dmu_objset_spa(zfsvfs->z_os),
+ SPA_FEATURE_BLOCK_CLONING))
+ *ap->a_retval = zfs_max_recordsize;
I can't say that I like zfs_max_recordsize, since it does not represent the pool, but instead more of a system capabilities. In most cases it will be 16MB, which is rarely reached. Same time on i386 pools imported from another system might have blocks bigger than its zfs_max_recordsize. I think you could say whether dataset ever had any block above 128K by checking dsl_dataset_feature_is_active(ds, SPA_FEATURE_LARGE_BLOCKS). But otherwise I'd say that recordsize property is the best guess we have.
I have modified the patch to use f_iosize if SPA_FEATURE_LARGE_BLOCKS
is not active.
I think this should be ok, since the Linux NFSv4.2 client will attempt the
Clone when the alignment works, but will accept a reply of EINVAL if
it does not work.
rick
…
—
Reply to this email directly, view it on GitHub, or unsubscribe.
You are receiving this because you authored the thread.Message ID: ***@***.***>
|
Since it the vnode is available here it seems like we should return the actual recordsize size for the file, or if it's not yet set fallback to the maximum allowed size. Something like this should work (untested). #ifdef _PC_CLONE_BLKSIZE
case _PC_CLONE_BLKSIZE:
zfsvfs_t *zfsvfs = (zfsvfs_t *)ap->a_vp->v_mount->mnt_data;
spa_t *spa = dmu_objset_spa(zfsvfs->z_os);
if (zfs_bclone_enabled &&
spa_feature_is_enabled(spa, SPA_FEATURE_BLOCK_CLONING)) {
*ap->a_retval = VTOZ(ap->a_vp)->z_blksz;
if (*ap->a_retval == 0)
*ap->a_retval = zfsvfs->z_max_blksz;
} else {
*ap->a_retval = 0;
}
return (0);
#endif |
Drive-by comment from vacation, so not thought hard, but... Seems like this is strongly adjacent to Is there opportunity for a more uniform or shared approach? |
On Tue, Aug 26, 2025 at 5:09 PM Brian Behlendorf ***@***.***> wrote:
CAUTION: This email originated from outside of the University of Guelph.
Do not click links or open attachments unless you recognize the sender and
know the content is safe. If in doubt, forward suspicious emails to
***@***.***
*behlendorf* left a comment (openzfs/zfs#17645)
Since it the vnode is available here it seems like we should return the
actual recordsize size for the file, or if it's not yet set fallback to the
maximum allowed size. Something like this should work (untested).
#ifdef _PC_CLONE_BLKSIZE
case _PC_CLONE_BLKSIZE:
zfsvfs_t *zfsvfs = (zfsvfs_t *)ap->a_vp->v_mount->mnt_data;
spa_t *spa = dmu_objset_spa(zfsvfs->z_os);
if (zfs_bclone_enabled &&
spa_feature_is_enabled(spa, SPA_FEATURE_BLOCK_CLONING)) {
*ap->a_retval = VTOZ(ap->a_vp)->z_blksz;
if (*ap->a_retval == 0)
*ap->a_retval = zfsvfs->z_max_blksz;
} else {
*ap->a_retval = 0;
}
return (0);#endif
That was actually my first version of the patch.
The problem is that my main use for _PC_CLONE_BLKSIZE
is the NFSv4.2 attribute called clone_blksize and it
seems to be defined as "per file system" and not
"per file". (It actually turns out that RFC7862
does not define which it is, but the collective
opinion on ***@***.*** was that it is
"per file system".)
You can read the thread here, if it interests you.
https://mailarchive.ietf.org/arch/msg/nfsv4/brGjm5_oYKVrwlMwF5utgAgEfPU/
The extant Linux NFSv4.2 client handles clone_blksize
per file system. It checks to see if the
offsets and len are clone_blksize aligned and then
attempts a Clone operation if they are. If that operation
fails with EINVAL, the outcome is the same as if the
offsets/len are not clone_blksize aligned, so the only
negative effect of making it too small is an RPC RTT.
(Making it different for each file will only result in
the Linux client using the value it gets for the first
file object, which is likely to be a directory.)
rick
—
… Reply to this email directly, view it on GitHub, or unsubscribe.
You are receiving this because you authored the thread.Message ID:
***@***.***>
|
On Tue, Aug 26, 2025 at 5:21 PM Rob Norris ***@***.***> wrote:
CAUTION: This email originated from outside of the University of Guelph.
Do not click links or open attachments unless you recognize the sender and
know the content is safe. If in doubt, forward suspicious emails to
***@***.***
*robn* left a comment (openzfs/zfs#17645)
Drive-by comment from vacation, so not thought hard, but...
Seems like this is strongly adjacent to zfs_get_direct_alignment() (#16972),
and indeed, these pathconf vars seems to be aimed at similar areas as the
statx vars.
Is there opportunity for a more uniform or shared approach?
For FreeBSD, making a revision to statfs() is a messy bother
and the collective opinion is that adding pathconf names is
the way to go.
For the case of named attributes (Solaris style extended attributes),
Solaris used pathconf names as well.
For this case, I am not aware of any other system exposing this
value to userland. (Linux uses FICLONE and FICLONERANGE ioctl()s
and maybe copy_file_range(2), but I am not aware of a clone_blksize
being returned to userland.)
rick
—
… Reply to this email directly, view it on GitHub, or unsubscribe.
You are receiving this because you authored the thread.Message ID:
***@***.***>
|
On Tue, Aug 26, 2025 at 9:02 PM Rick Macklem ***@***.***> wrote:
On Tue, Aug 26, 2025 at 5:21 PM Rob Norris ***@***.***>
wrote:
> CAUTION: This email originated from outside of the University of Guelph.
> Do not click links or open attachments unless you recognize the sender and
> know the content is safe. If in doubt, forward suspicious emails to
> ***@***.***
>
> *robn* left a comment (openzfs/zfs#17645)
>
> Drive-by comment from vacation, so not thought hard, but...
>
> Seems like this is strongly adjacent to zfs_get_direct_alignment() (
> #16972), and indeed, these pathconf vars seems to be aimed at similar
> areas as the statx vars.
>
> Is there opportunity for a more uniform or shared approach?
>
For FreeBSD, making a revision to statfs() is a messy bother
and the collective opinion is that adding pathconf names is
the way to go.
For the case of named attributes (Solaris style extended attributes),
Solaris used pathconf names as well.
For this case, I am not aware of any other system exposing this
value to userland. (Linux uses FICLONE and FICLONERANGE ioctl()s
and maybe copy_file_range(2), but I am not aware of a clone_blksize
being returned to userland.)
Oh, I should add the caveat that I am not a Linux guy, so the above
could easily be incorrect.
rick
… rick
—
> Reply to this email directly, view it on GitHub, or unsubscribe.
> You are receiving this because you authored the thread.Message ID:
> ***@***.***>
>
|
Thanks for linking to that thread, it was helpful to have some more context. Even if i still don't have a solid understanding of how NFS4.2 is going to use this value! Let me add a little more detail about what values we could return and maybe that will help. For filesystems with the The thing is the For real workloads I'm not sure which of those values would work best. |
On Wed, Aug 27, 2025 at 3:21 PM Brian Behlendorf ***@***.***> wrote:
CAUTION: This email originated from outside of the University of Guelph.
Do not click links or open attachments unless you recognize the sender and
know the content is safe. If in doubt, forward suspicious emails to
***@***.***
*behlendorf* left a comment (openzfs/zfs#17645)
<#17645 (comment)>
Thanks for linking to that thread, it was helpful to have some more
context. Even if i still don't have a solid understanding of how NFS4.2 is
going to use this value! Let me add a little more detail about what values
we could return and maybe that will help.
For filesystems with the SPA_FEATURE_LARGE_BLOCKS feature active the
maximum block size a file may be using is SPA_MAXBLOCKSIZE. This is
limited by on the on-disk format and is unlikely to ever change. If it's
not set, then the maximum block size possible will be SPA_OLD_MAXBLOCKSIZE
.
Right now, the patch returns the same values as SPA_MAXBLOCKSIZE or
SPA_OLD_MAXBLOCKSIZE.
It just happens to get them from different places.
I'll change the patch, since using these constants makes it a little
easier to understand.
As for what works best, I don't really know either, but I think returning
SPA_MAXBLOCKSIZE or SPA_OLD_MAXBLOCKSIZE is as good as anything else.
Since recordsize can be smaller than this, I think it is safer to return
the largest block alignment needed.
The thing is the recordsize property is probably the best guess as to the
most common block size in the filesystem. There's really no guarantee there
either though because that default block size what changed after the
filesystem was created (that's allowed). Or if the pool contains a large
number of small files they may all have a block size smaller than that
default.
On FreeBSD, the default recordsize is 128K and "man zfsprops" recommends
against making it smaller for most cases, so I think 128K (assuming the
SPA_FEATURE_LARGE_BLOCKS is not active) seems the best bet, since it
probably should be the largest value for the file system.
rick
For real workloads I'm not sure which of those values would work best.
…
—
Reply to this email directly, view it on GitHub
<#17645 (comment)>, or
unsubscribe
<https://github.com/notifications/unsubscribe-auth/APNAL2TXS7J6WARAGUYNMAD3PYVPVAVCNFSM6AAAAACEAO6IV6VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZTEMRZHEYTIOBTGI>
.
You are receiving this because you authored the thread.Message ID:
***@***.***>
|
On Wed, Aug 27, 2025 at 3:21 PM Brian Behlendorf ***@***.***> wrote:
CAUTION: This email originated from outside of the University of Guelph.
Do not click links or open attachments unless you recognize the sender and
know the content is safe. If in doubt, forward suspicious emails to
***@***.***
*behlendorf* left a comment (openzfs/zfs#17645)
<#17645 (comment)>
Thanks for linking to that thread, it was helpful to have some more
context. Even if i still don't have a solid understanding of how NFS4.2 is
going to use this value! Let me add a little more detail about what values
we could return and maybe that will help.
For filesystems with the SPA_FEATURE_LARGE_BLOCKS feature active the
maximum block size a file may be using is SPA_MAXBLOCKSIZE. This is
limited by on the on-disk format and is unlikely to ever change. If it's
not set, then the maximum block size possible will be SPA_OLD_MAXBLOCKSIZE
.
Oh, I think using zfs_max_recordsize is a little more correct than
SPA_MAXBLOCKSIZE
because the setting for zfs_max_recordsize is smaller for 32bits.
(Although FreeBSD is dropping 32bit support, so it hardly matters.)
For 64bits, I'm fairly sure zfs_max_recordsize is set to SPA_MAXBLOCKSIZE.
rick
The thing is the recordsize property is probably the best guess as to the
… most common block size in the filesystem. There's really no guarantee there
either though because that default block size what changed after the
filesystem was created (that's allowed). Or if the pool contains a large
number of small files they may all have a block size smaller than that
default.
For real workloads I'm not sure which of those values would work best.
—
Reply to this email directly, view it on GitHub
<#17645 (comment)>, or
unsubscribe
<https://github.com/notifications/unsubscribe-auth/APNAL2TXS7J6WARAGUYNMAD3PYVPVAVCNFSM6AAAAACEAO6IV6VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZTEMRZHEYTIOBTGI>
.
You are receiving this because you authored the thread.Message ID:
***@***.***>
|
The wrinkle with using |
Motivation and Context
Support for the pathconf name _PC_CLONE_BLKSIZE is needed so that
the NFSv4.2 server can reply correctly when the clone_blksize attribute
is requested. It also allows applications running locally on the system
to determine if block cloning is available via the copy_file_range(2)
system call.
Description
FreeBSD now has a pathconf name called _PC_CLONE_BLKSIZE which
is the block size supported for block cloning for the file system.
Since ZFS's block size varies per file, return the largest size possible,
which is zfs_max_recordsize, or zero if block cloning is not supported.
How Has This Been Tested?
Tested in a system based on a recent FreeBSD main, which supports
the _PC_CLONE_BLKSIZE pathconf name and also uses it for requests
for the NFSv4.2 clone_blksize attribute.
(The FreeBSD NFSv4.2 client actually "cheats" and clips the value of
"clone_blksize" it receives to 128Kbytes, so that cloning works for
most files. It accepts that there will be a rare failure when a file has
a larger record size. However, I believe returning 128Kbytes would
not conform to RFC7862.)
Types of changes
Checklist:
Signed-off-by
.