-
Notifications
You must be signed in to change notification settings - Fork 2.2k
runc exec: use CLONE_INTO_CGROUP #4812
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
aa873c8
to
115aa1f
Compare
This comment was marked as resolved.
This comment was marked as resolved.
This comment was marked as resolved.
This comment was marked as resolved.
This comment was marked as resolved.
This comment was marked as resolved.
467d16c
to
dfcf22a
Compare
I notice that all the failures occurred in rootless container tests. This might be related to: runc/libcontainer/process_linux.go Line 205 in dfcf22a
However, you mentioned we're seeing an ENOENT error here, so that may not be the cause. |
dfcf22a
to
6095b61
Compare
@kolyshkin Wait, I thought we always communicated with systemd when using Is this just for our testing, or are users actually using this? Because we will need to fix that if we have users on systemd-based systems using cgroups directly without transient units... |
When you use I'm pretty sure it has been that way from the very beginning. One other thing is, when using systemd, we configure everything via systemd and then use fs/fs2 driver to write to cgroup directly. This is also how things have always been. One reason for that is we did not care much to translate OCI spec into systemd settings, which is now mostly fixed. Another reason is, systemd doesn't support all per-cgroup settings that the kernel has (so some of those can't be expressed as systemd unit properties). |
The thing is, while the comment says "EBUSY", the actual code doesn't check for particular error, going for this fallback on any error (including ENOENT). My guess is, with systemd driver we actually need |
6e3bf36
to
9489925
Compare
Apparently, we are also not placing rootless container exec's into the proper cgroup (which is still possible when using cgroup v2 systemd driver, but we'd need to use Tacking it in #4822 |
I'm aware of the mixed fs/systemd setup, my confusion was more that systemd very strictly expects to be told about stuff in cgroupv2 because cgroupv2 was designed around a global management process -- it has been a long time since I reviewed the first cgroupv2 patches for runc, but I remember that being a thing I was worried about at the time. The exec issue seems bad too... |
9489925
to
f96e179
Compare
b56a52b
to
1652210
Compare
Guess what, this is no longer happening. Based on cgroup name ( |
1652210
to
58ed84c
Compare
58ed84c
to
99977d2
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm unsure if retrying only when specific errors are returned is enough to not screw it up in older kernels.
Also, this will mean in old kernels we will always fail the first time calling clone. I guess it's fine?
1b6d405
to
d1d3712
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks! One nit about a debug line, but worst case we can add it later.
// Rootless has no direct access to cgroup. | ||
return true | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I fear we might nor retry in a case we should and then the container will never start for the user and be a big regression. Can we log as debug what was the error before returning false?
So in case that happens, we can ask them to run with debug and check the error they get and we just handle that case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just rewrote the code to always emit the warning, and realized that if we won't retry without cgroupfd, when the original error from syscall.StartProcess will be returned to the caller (and then logged by runc).
I'm still keeping the new version, let me know if you like it slightly better.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To reiterate, with the previous code, the following scenarios are possible:
- Start with CLONE_INTO_CGROUP succeeds. Everything's fine.
- Start with CLONE_INTO_CGROUP fails, and we retry without it. In this case we're not very interested in the particular error, but it is still printed (in debug logs);
- Start with CLONE_INTO_CGROUP fails, and we don't retry. In this case, the error from clone2 with CLONE_INTO_CGROUP (or any other error from
os.StartProcess
) is returned and runc prints it with the error log level:
runc/libcontainer/container_linux.go
Lines 347 to 348 in 77ead42
if err := parent.start(); err != nil { | |
return fmt.Errorf("unable to start container process: %w", err) |
In other words, we either retry without cgroupfd, or print the cgroupfd error loud and clear, so your concern is not feasible.
With the new version (just pushed), runc produces the following output:
- When CLONE_INTO_CGROUP is used:
logrus.Debugf("using CLONE_INTO_CGROUP %q", cgroup)
- When CLONE_INTO_CGROUP fails:
logrus.Debugf("exec with CLONE_INTO_CGROUP failed: %v", err)
- When exec fails:
return fmt.Errorf("unable to start container process: %w", err)
The difference from the old version is (2) is printed in case of any error, and the same error may be printed by (3) if we don't retry without cgroupfd.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kolyshkin thanks! Now with the debug log I feel more comfortable.
My concern is possible, though. Maybe I didn't express myself correctly.
We only retry without the cgroupfd if this function returns true. That is based on the manpage and the possible errors that are returned. It wouldn't be the first time a manpage is not up to date (i.e. clone_into_cgroup can fail with some other error too), nor the first time more errors than what we think possible are returned.
If this fails with some other error than the ones handled here and it would succeed when retrying without cgroupfd, then we will have a bug in that case (because we won't return true, as it's not handled).
This is the case I worry about.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kolyshkin what if we just always retry on failure?
As other have pointed out, there were more errors we needed to handle. If we always retry on failure without the cgroupfd, then it will:
- Either work on retry, which is great
- Or not work, in which case we just did one more try in the failing case (I wouldn't care that much about the failure to execute the container)
Then for sure we can not screw it up if the kernel in the future starts returning some other error or some red hat backport changes the error returned or whatever.
What do you think?
d1d3712
to
8175772
Compare
@cyphar @lifubang @AkihiroSuda PTAL |
libcontainer/process_linux.go
Outdated
logrus.Debugf("exec with CLONE_INTO_CGROUP failed: %v", err) | ||
|
||
switch { | ||
// Either clone3(CLONE_INTO_CGROUP) is not supported (ENOSYS), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven't had a chance to test this myself, but could you confirm whether it returns ENOSYS
or EINVAL
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AFAIK, ENOSYS
means the kernel doesn't implement the clone3
syscall.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, maybe we should fall back on EINVAL
as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You actually want to check for E2BIG
, which is the error you'll get from an extensible struct syscall when you try to use an unsupported field. But including EINVAL
wouldn't hurt either (you would get that in the very unlikely scenario that the cgroup fd is 0
on a pre-CLONE_INTO_CGROUP
kernel).
(I also find the ErrUnsupported
handling by syscall.Errno
too magical and would prefer an explicit list of errnos, but I'm probably a minority opinion there).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AFAIK, ENOSYS means the kernel doesn't implement the clone3 syscall.
You are right, my comment was not entirely correct; fixed.
So, maybe we should fall back on EINVAL as well?
Yes, CLONE_INTO_CGROUP
support was introduced in commit ef2c41cf38a7, Feb 5 2020, which is half a year later than the clone3
itself (commit 7f192e3cd316, May 25 2019). Since Go stdlib always uses the latest version of clone_args struct, indeed the kernel will return EINVAL.
Fixed, thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kolyshkin You need to check for E2BIG
-- the error path in copy_struct_from_user
necessarily comes before the flag checking code. But having both is better.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cyphar yep, if the "extra" fields are not zero, it returns E2BIG.
In kernel v5.3: https://github.com/torvalds/linux/blame/v5.3/kernel/fork.c#L2546-L2559
In kernels v5.4 to v5.7 (i.e. after commit torvalds/linux@f14c234), the same check is in copy_struct_from_user`.
So, do we need to check for EINVAL? Probably not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ffdcf6e
to
9b80e9f
Compare
// No clone3 syscall (kernels < v5.3). | ||
case errors.Is(err, unix.ENOSYS): | ||
return true | ||
// No CLONE_INTO_CGROUP flag support (kernels v5.3 to v5.7). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To clarify why clone3
returns E2BIG
rather than EINVAL
for kernels which don't support CLONE_INTO_CGROUP
, I know this is because the kernel checked the oversized structures first, it might be helpful to include a link to the relevant kernel source (kernel/fork.c#L2525-L2536). This provides direct context for readers who are puzzled by this specific error code choice.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
torvalds/linux@f5a1a53 or https://github.com/torvalds/linux/blob/v5.4/include/linux/uaccess.h#L237 are better links -- the error is coming from copy_struct_from_user
. The bit you linked to is for something else.
(This is a generic pattern for all extensible-struct syscalls by the way. bpf(2)
, openat2(2)
, clone3(2)
, and a few other interfaces have the same behaviour. I gave a talk about this in 2020.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I looked it all up when checking whether E2BIG is the correct error to check for, and found out this actually depends on the kernel version.
From my memory (might be wrong here):
- in kernel v5.3,
clone3
is already available but there is nocopy_struct_from_user
(yet clone3 returns E2BIG if the clone_args is larger than expected and the tail has non-zero values). - from v5.4 to v5.7, the above pattern is generalized in
copy_struct_from_user
by torvalds/linux@f5a1a53
I don't think the code reader need all those details and references though, but I've included a short explanation in the commit message so it can be found via git blame.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added the same text to PR description.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(Not directly related to all this, but I remember then looking into in-kernel checkpoint-restore code (which is part of OpenVZ/Virtuozzo kernel, later reimplemented mostly in userspace, and thus CRIU was born), I saw ANK (who wrote most of in-kernel C/R single-handedly) was using structures with zero padding at the end. Not sure if he checked if the padding was actually zero. Also, it might not be the C/R code, but some of the COW layered filesystem (or block device) stuff we had).
It makes sense to make runc exec benefit from clone2(CLONE_INTO_CGROUP), if it is available. Since it requires a recent kernel and might not work, implement a fallback to older way of joining the cgroup. Based on work done in - https://go-review.googlesource.com/c/go/+/417695 - coreos/go-systemd#458 - opencontainers/cgroups#26 - opencontainers#4822 Regarding E2BIG check in shouldRetryWithoutCgroupFD. The clone3 syscall first appeared in kernel v5.3 via commit [1], which added a check that if the size of clone_args structure passed from the userspace is larger than known to kernel, and the "unknown" part contains non-zero values, E2BIG is returned. A similar check was already used in other similar scenarios at the time, and later in kernel v5.4, this was generalized by patch series [2]. [1]: torvalds/linux@7f192e3 [2]: https://lore.kernel.org/all/[email protected]/#r Signed-off-by: Kir Kolyshkin <[email protected]>
9b80e9f
to
190a165
Compare
return nil, fmt.Errorf("bad sub cgroup path: %s", sub) | ||
} | ||
|
||
fd, err := os.OpenFile(cgroup, unix.O_PATH|unix.O_DIRECTORY|unix.O_CLOEXEC, 0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would prefer we use openat2
on the cached cgroupfs fd rather than having this double-check logic -- do we not have access to it anymore with the cgroup package split?
Requires (and currently includes) PR #4822; draft until that one is merged.It makes sense to make runc exec benefit from
clone2(CLONE_INTO_CGROUP)
, whenavailable. Since it requires a recent kernel and might not work, implement a fallback.
Based on:
Regarding E2BIG check in
shouldRetryWithoutCgroupFD
. Theclone3
syscallfirst appeared in kernel v5.3 via commit torvalds/linux@7f192e3, which added
a check that if the size of
clone_args
structure passed from the userspaceis larger than known to kernel, and the "unknown" part contains non-zero
values,
E2BIG
is returned. A similar check was already used in other similarscenarios at the time, and later in kernel v5.4, this was generalized by
patch series https://lore.kernel.org/all/[email protected]/#r
Closes: #4782.