oci: Isolate cgroups under the current hierarchy #6343
Conversation
When buildkitd is run in a privileged container on Kubernetes, the `/sys/fs/cgroup` mount will be that of the host, which allows buildkitd to remove itself from the cgroup hierarchy managed by Kubernetes (cgroupfs). When buildkitd's worker creates cgroups outside of the externally managed hierarchy, resource accounting is incorrect and resource limits are not enforced. This can lead to OOM and other CPU contention issues on nodes.

Introduce a new `isolateCgroups` configuration option for the OCI worker. When set, all cgroups are created beneath the cgroup hierarchy of the buildkitd process.

Signed-off-by: Dan Duvall <[email protected]>
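For context, a minimal sketch of how the option might be enabled in `buildkitd.toml`, assuming it lands under the existing `[worker.oci]` table (the exact key placement in this change may differ):

```console
$ cat >> /etc/buildkit/buildkitd.toml <<'EOF'
# Hypothetical placement of the new option under the OCI worker table.
[worker.oci]
  isolateCgroups = true
EOF
$ buildkitd --config /etc/buildkit/buildkitd.toml
```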
db7ca6c to
007d58c
Compare
@tonistiigi and @crazy-max I'm still trying to debug the test failures, but what do you think of this change in general?
tonistiigi left a comment
What's the behavior of this being enabled in our regular privileged container setup? If we require a different setup, can't we just detect what we need without requiring config from the user?
Note that there is also some special setup in the entrypoint of the container. Are you running that in your env?
The behavior only differs when buildkitd is run without a cgroup namespace and where cgroup v2 is used. For example, if I run:

```console
$ docker run -d --name buildkitd --privileged moby/buildkit:latest
34a267a4bb8fda59fdbb1870af4cba105b4fad5f0bcfa8f9d0018cf74eb8a64b
$ sudo lsns -t cgroup -p $(pgrep buildkitd)
        NS TYPE   NPROCS     PID USER COMMAND
4026532685 cgroup      1 3369108 root buildkitd
$ docker exec -it buildkitd cat /proc/1/cgroup
0::/init
$ cat /proc/$(pgrep buildkitd)/cgroup
0::/system.slice/docker-34a267a4bb8fda59fdbb1870af4cba105b4fad5f0bcfa8f9d0018cf74eb8a64b.scope/init
```
However, on Kubernetes where cgroup v2 is used and no cgroup namespace is created for privileged containers (the default behavior according to the cgroup v2 KEP, which was a big surprise to me), there will be a difference in the overall cgroup hierarchy depending on whether the option is enabled.
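A quick way to check whether a container was given its own cgroup namespace is to compare namespace inodes; a sketch using standard tooling (not part of this change), with illustrative inode values and a placeholder container ID:

```console
# On the node: the cgroup namespace of the host's init process.
root@k8s-node:/# readlink /proc/1/ns/cgroup
cgroup:[4026531835]
# Inside the privileged pod: a matching inode means no cgroup namespace
# was created for the container.
root@k8s-node:/# crictl exec <container-id> readlink /proc/1/ns/cgroup
cgroup:[4026531835]
```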
Introduce a new entrypoint script for the Linux image that, if cgroup v2 is in use, creates a new cgroup and mount namespace for buildkitd using `unshare` and remounts `/sys/fs/cgroup` to restrict its view of the unified cgroup hierarchy. This ensures its `init` cgroup and all OCI worker managed cgroups are kept beneath the root cgroup of the initial entrypoint process.
When buildkitd is run in a managed environment like Kubernetes without
its own cgroup namespace (the default behavior of privileged pods in
Kubernetes where cgroup v2 is in use; see [cgroup v2 KEP][kep]), the OCI
worker will spawn processes in cgroups that are outside of the cgroup
hierarchy that was created for the buildkitd container, leading to
incorrect resource accounting and enforcement, which in turn can cause
OOM errors and CPU contention on the node.
Example behavior without this change:
```console
root@k8s-node:/# cat /proc/$(pgrep -n buildkitd)/cgroup
0::/init
root@k8s-node:/# cat /proc/$(pgrep -n some-build-process)/cgroup
0::/buildkit/{runc-container-id}
```
Example behavior with this change:
```console
root@k8s-node:/# cat /proc/$(pgrep -n buildkitd)/cgroup
0::/kubepods/burstable/pod{pod-id}/{container-id}/init
root@k8s-node:/# cat /proc/$(pgrep -n some-build-process)/cgroup
0::/kubepods/burstable/pod{pod-id}/{container-id}/buildkit/{runc-container-id}
```
Note this was developed as an alternative approach to moby#6343
[kep]: https://github.com/kubernetes/enhancements/tree/6d3210f7dd5d547c8f7f6a33af6a09eb45193cd7/keps/sig-node/2254-cgroup-v2#cgroup-namespace
Signed-off-by: Dan Duvall <[email protected]>
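A minimal sketch of the kind of entrypoint logic described above, assuming util-linux `unshare` is available in the image; this illustrates the technique and is not the exact script from this change:

```sh
#!/bin/sh
set -eu

# Only isolate when the unified (v2) hierarchy is mounted at /sys/fs/cgroup
# and we have not already re-exec'd ourselves.
if [ -f /sys/fs/cgroup/cgroup.controllers ] && [ -z "${CGROUP_ISOLATED:-}" ]; then
	# Re-exec inside new cgroup and mount namespaces; the new cgroup
	# namespace is rooted at this process's current cgroup.
	export CGROUP_ISOLATED=1
	exec unshare --cgroup --mount "$0" "$@"
fi

if [ -n "${CGROUP_ISOLATED:-}" ]; then
	# Inside the new mount namespace, mount a fresh cgroup2 instance over
	# /sys/fs/cgroup so that only the subtree rooted at the new cgroup
	# namespace is visible to buildkitd and its children.
	mount -t cgroup2 cgroup2 /sys/fs/cgroup
fi

exec buildkitd "$@"
```

Since util-linux `unshare` defaults to private mount propagation when it creates a mount namespace, the remount does not propagate back to the host.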
When buildkitd is run in a privileged container on Kubernetes, the `/sys/fs/cgroup` mount may be that of the host (depending on the cgroup driver and ultimately on whether the buildkitd container is started within a cgroup namespace), which allows buildkitd to remove itself from its assigned cgroup hierarchy. When buildkitd's worker creates cgroups outside of the cgroup hierarchy managed by Kubernetes, resource accounting is incorrect and resource limits are not enforced. This can lead to OOM and other CPU contention issues on nodes.

Introduce a new `isolateCgroups` configuration option for the OCI worker. When set, all cgroups are created beneath the cgroup hierarchy of the buildkitd process.