marxarelli commented Nov 7, 2025

When buildkitd is run in a privileged container on Kubernetes, the `/sys/fs/cgroup` mount may be that of the host (depending on the cgroup driver and ultimately on whether the buildkitd container is started within a cgroup namespace), which allows buildkitd to remove itself from its assigned cgroup hierarchy. When buildkitd's worker creates cgroups outside of the cgroup hierarchy managed by Kubernetes, resource accounting is incorrect and resource limits are not enforced. This can lead to OOM kills and CPU contention issues on nodes.

Introduce a new `isolateCgroups` configuration option for the OCI worker. If set, all cgroups are created beneath the cgroup hierarchy of the buildkitd process.
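
For illustration, here is a rough sketch of how this might be enabled, either via buildkitd.toml or via the corresponding flag (the exact TOML key placement under `[worker.oci]` is an assumption here, not taken from the final implementation):

```console
# hypothetical sketch; the TOML key placement under [worker.oci] is an assumption
$ cat /etc/buildkit/buildkitd.toml
[worker.oci]
  isolateCgroups = true
# or, equivalently, via the new flag
$ buildkitd --oci-isolate-cgroups
```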

When buildkitd is run in a privileged container on Kubernetes, the
`/sys/fs/cgroup` mount will be that of the host which allows buildkitd
to remove itself from the cgroup hierarchy managed by Kubernetes
(cgroupfs). When buildkitd's worker creates cgroups outside of the
externally managed hierarchy, resource accounting is incorrect and
resource limits are not enforced. This can lead to OOM and other CPU
contention issues on nodes.

Introduce a new `isolateCgroups` configuration for the OCI worker. If
set, all cgroups are created beneath the cgroup hierarchy of the
buildkitd process.

Signed-off-by: Dan Duvall <[email protected]>
marxarelli force-pushed the review/isolate-cgroups branch from db7ca6c to 007d58c on November 13, 2025 18:24
@marxarelli

@tonistiigi and @crazy-max I'm still trying to debug the test failures, but what do you think of this change in general?

tonistiigi (Member) left a comment

What's the behavior of this being enabled in our regular privileged container setup? If we require a different setup, can't we just detect what we need without requiring config from the user?

Note that there is also some special setup in the entrypoint of the container. Are you running that in your env?

marxarelli commented Nov 14, 2025

What's the behavior of this being enabled in our regular privileged container setup?

The behavior only differs when buildkitd is run without a cgroup namespace and cgroup v2 is in use.

For example, if I run moby/buildkit:latest via Docker Engine, this feature makes no difference because the container is created with a new cgroup namespace and so the root of the cgroup2 mountpoint is the cgroup that buildkitd was spawned under originally.

$ docker run -d --name buildkitd --privileged moby/buildkit:latest
34a267a4bb8fda59fdbb1870af4cba105b4fad5f0bcfa8f9d0018cf74eb8a64b
$ sudo lsns -t cgroup -p $(pgrep buildkitd)
        NS TYPE   NPROCS     PID USER COMMAND
4026532685 cgroup      1 3369108 root buildkitd
$ docker exec -it buildkitd cat /proc/1/cgroup
0::/init
$ cat /proc/$(pgrep buildkitd)/cgroup
0::/system.slice/docker-34a267a4bb8fda59fdbb1870af4cba105b4fad5f0bcfa8f9d0018cf74eb8a64b.scope/init

However, on Kubernetes, where cgroup v2 is used and no cgroup namespace is created for privileged containers (the default behavior according to the cgroup v2 KEP, which was a big surprise to me), there will be a difference in the overall cgroup hierarchy depending on whether the `--oci-isolate-cgroups` flag is used.

Without `--oci-isolate-cgroups`

When running an image built from this change without `--oci-isolate-cgroups`, buildkitd will move itself into an `/init` cgroup which will be at the root of the host's (node's) cgroup hierarchy, since the buildkitd container doesn't have its own cgroup namespace.

$ kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: buildkitd
spec:
  containers:
    - name: buildkitd
      image: docker.io/marxarelli/buildkit:isolate-cgroups
      securityContext:
        privileged: true
pod/buildkitd created
$ kubectl exec -it buildkitd -- cat /proc/1/cgroup
0::/init
$ kubectl exec -it buildkitd -- ls /sys/fs/cgroup
buildkit                       io.cost.model
cgroup.controllers             io.cost.qos
cgroup.max.depth               io.pressure
cgroup.max.descendants         io.stat
cgroup.pressure                kubepods
cgroup.procs                   memory.numa_stat
cgroup.stat                    memory.pressure
cgroup.subtree_control         memory.reclaim
cgroup.threads                 memory.stat
cpu.pressure                   misc.capacity
cpu.stat                       proc-sys-fs-binfmt_misc.mount
cpuset.cpus.effective          sys-fs-fuse-connections.mount
cpuset.mems.effective          sys-kernel-config.mount
dev-hugepages.mount            sys-kernel-debug.mount
dev-mqueue.mount               sys-kernel-tracing.mount
init                           system.slice
init.scope                     user.slice

# on the node where the pod was scheduled
root@pool-1zycrmzjx-p0zdz:/# cat /proc/$(pgrep buildkitd)/cgroup
0::/init
root@pool-1zycrmzjx-p0zdz:/# lsns -t cgroup -p $(pgrep buildkitd)
        NS TYPE   NPROCS PID USER COMMAND
4026531835 cgroup    115   1 root /sbin/init

With `--oci-isolate-cgroups`

When running the same image and pod setup but with `--oci-isolate-cgroups`, buildkitd will create its `init` cgroup (and scope all build container cgroups) under the hierarchy of its originally assigned cgroup.

$ kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: buildkitd
spec:
  containers:
    - name: buildkitd
      image: docker.io/marxarelli/buildkit:isolate-cgroups
      args:
        - --oci-isolate-cgroups
      securityContext:
        privileged: true
pod/buildkitd created
$ kubectl exec -it buildkitd -- cat /proc/1/cgroup
0::/kubepods/besteffort/pod9387ff78-ffef-4466-878e-27cfb0b3d4b3/65a857219c342c13a7f5c40a57b2d8c8b0c0dc1f8a2e221c9bda420f65dbb981/init

# on the node
root@pool-1zycrmzjx-p0zdz:/# cat /proc/$(pgrep buildkitd)/cgroup
0::/kubepods/besteffort/pod9387ff78-ffef-4466-878e-27cfb0b3d4b3/65a857219c342c13a7f5c40a57b2d8c8b0c0dc1f8a2e221c9bda420f65dbb981/init
root@pool-1zycrmzjx-p0zdz:/# lsns -t cgroup -p $(pgrep buildkitd)
        NS TYPE   NPROCS PID USER COMMAND
4026531835 cgroup    109   1 root /sbin/init

If we require a different setup, can't we just detect what we need without requiring config from the user?

That sounds reasonable to me. Perhaps a good enough heuristic would be to simply enable cgroup isolation whenever cgroup v2 is detected?
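
The detection criterion itself is simple; from inside the buildkitd container, something along these lines (a sketch of the check, not the actual implementation) distinguishes a unified cgroup v2 mount:

```console
# cgroup v2 (unified hierarchy) exposes cgroup.controllers at the mount root
$ test -f /sys/fs/cgroup/cgroup.controllers && echo cgroup2
cgroup2
# equivalently, the filesystem type of /sys/fs/cgroup is cgroup2fs
$ stat -fc %T /sys/fs/cgroup
cgroup2fs
```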

Alternative implementation using `unshare(CLONE_NEWCGROUP)`

I also experimented with an alternative approach today which is to have buildkitd create its own cgroup namespace and to remount /sys/fs/cgroup prior to creating its /init cgroup and running the OCI worker. This effectively accomplishes the same degree of isolation but requires far less complexity in the OCI worker/executor code. I'll prepare a different PR for this approach so you can compare.
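
Roughly, the wrapper amounts to something like the following before buildkitd starts (a sketch using util-linux `unshare`; the actual PR may wire this up differently):

```console
# enter new cgroup and mount namespaces, then remount /sys/fs/cgroup so the
# visible cgroup2 root is the cgroup buildkitd was originally assigned to
$ unshare --cgroup --mount sh -c '
    umount /sys/fs/cgroup &&
    mount -t cgroup2 none /sys/fs/cgroup &&
    exec buildkitd'
```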

marxarelli added a commit to marxarelli/buildkit that referenced this pull request Nov 17, 2025
If cgroup v2 is in use, use `unshare` to create a new cgroup and mount
namespace for buildkitd. Remount `/sys/fs/cgroup` to restrict its view
of the unified cgroup v2 hierarchy which will ensure its `init` cgroup
and all OCI worker managed cgroups are kept beneath the root cgroup of
the initial entrypoint process.

When buildkitd is run in a managed environment like Kubernetes without
its own cgroup namespace (the default behavior of privileged pods in
Kubernetes where cgroup v2 is in use; see [cgroup v2 KEP][kep]), the OCI
worker will spawn processes in cgroups that are outside of the cgroup
hierarchy that was created for the buildkitd container, leading to
incorrect resource accounting and enforcement which in turn can cause
OOM errors and CPU contention on the node.

Example behavior without cgroup namespace:

```console
root@k8s-node:/# cat /proc/$(pgrep -n buildkitd)/cgroup
0::/init
root@k8s-node:/# cat /proc/$(pgrep -n some-build-process)/cgroup
0::/buildkit/{runc-container-id}
```

Example behavior with cgroup namespace:

```console
root@k8s-node:/# cat /proc/$(pgrep -n buildkitd)/cgroup
0::/kubepods/burstable/pod{pod-id}/{container-id}/init
root@k8s-node:/# cat /proc/$(pgrep -n some-build-process)/cgroup
0::/kubepods/burstable/pod{pod-id}/{container-id}/buildkit/{runc-container-id}
```

Note this was developed as an alternative approach to moby#6343

[kep]: https://github.com/kubernetes/enhancements/tree/6d3210f7dd5d547c8f7f6a33af6a09eb45193cd7/keps/sig-node/2254-cgroup-v2#cgroup-namespace

Signed-off-by: Dan Duvall <[email protected]>
marxarelli added a commit to marxarelli/buildkit that referenced this pull request Nov 17, 2025
Introduce a new entrypoint script for the Linux image that, if cgroup v2
is in use, creates a new cgroup and mount namespace for buildkitd within
a new entrypoint using `unshare` and remounts `/sys/fs/cgroup` to
restrict its view of the unified cgroup hierarchy. This will ensure its
`init` cgroup and all OCI worker managed cgroups are kept beneath the
root cgroup of the initial entrypoint process.

When buildkitd is run in a managed environment like Kubernetes without
its own cgroup namespace (the default behavior of privileged pods in
Kubernetes where cgroup v2 is in use; see [cgroup v2 KEP][kep]), the OCI
worker will spawn processes in cgroups that are outside of the cgroup
hierarchy that was created for the buildkitd container, leading to
incorrect resource accounting and enforcement which in turn can cause
OOM errors and CPU contention on the node.

Example behavior without this change:

```console
root@k8s-node:/# cat /proc/$(pgrep -n buildkitd)/cgroup
0::/init
root@k8s-node:/# cat /proc/$(pgrep -n some-build-process)/cgroup
0::/buildkit/{runc-container-id}
```

Example behavior with this change:

```console
root@k8s-node:/# cat /proc/$(pgrep -n buildkitd)/cgroup
0::/kubepods/burstable/pod{pod-id}/{container-id}/init
root@k8s-node:/# cat /proc/$(pgrep -n some-build-process)/cgroup
0::/kubepods/burstable/pod{pod-id}/{container-id}/buildkit/{runc-container-id}
```

Note this was developed as an alternative approach to moby#6343

[kep]: https://github.com/kubernetes/enhancements/tree/6d3210f7dd5d547c8f7f6a33af6a09eb45193cd7/keps/sig-node/2254-cgroup-v2#cgroup-namespace

Signed-off-by: Dan Duvall <[email protected]>