marxarelli commented Nov 7, 2025

When buildkitd is run in a privileged container on Kubernetes, the `/sys/fs/cgroup` mount may be that of the host (depending on the cgroup driver and ultimately on whether the buildkitd container is started within a cgroup namespace), which allows buildkitd to remove itself from its assigned cgroup hierarchy. When buildkitd's worker creates cgroups outside of the cgroup hierarchy managed by Kubernetes, resource accounting is incorrect and resource limits are not enforced. This can lead to OOM kills and CPU contention issues on nodes.

Introduce a new `isolateCgroups` configuration option for the OCI worker. If set, all cgroups are created beneath the cgroup hierarchy of the buildkitd process.
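
For illustration, here is a rough sketch of how this might be enabled, either via buildkitd.toml or via the corresponding flag (the exact TOML key placement under `[worker.oci]` is an assumption here, not taken from the final implementation):

```console
# hypothetical sketch; the TOML key placement under [worker.oci] is an assumption
$ cat /etc/buildkit/buildkitd.toml
[worker.oci]
  isolateCgroups = true
# or, equivalently, via the new flag
$ buildkitd --oci-isolate-cgroups
```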

When buildkitd is run in a privileged container on Kubernetes, the
`/sys/fs/cgroup` mount will be that of the host which allows buildkitd
to remove itself from the cgroup hierarchy managed by Kubernetes
(cgroupfs). When buildkitd's worker creates cgroups outside of the
externally managed hierarchy, resource accounting is incorrect and
resource limits are not enforced. This can lead to OOM and other CPU
contention issues on nodes.

Introduce a new `isolateCgroups` configuration for the OCI worker. If
set, all cgroups are created beneath the cgroup hierarchy of the
buildkitd process.

Signed-off-by: Dan Duvall <[email protected]>
marxarelli force-pushed the review/isolate-cgroups branch from db7ca6c to 007d58c on November 13, 2025 18:24
@marxarelli

@tonistiigi and @crazy-max I'm still trying to debug the test failures, but what do you think of this change in general?

tonistiigi (Member) left a comment

What's the behavior of this being enabled in our regular privileged container setup? If we require a different setup, can't we just detect what we need without requiring config from the user?

Note that there is also some special setup in the entrypoint of the container. Are you running that in your env?

marxarelli commented Nov 14, 2025

What's the behavior of this being enabled in our regular privileged container setup?

The behavior only differs when buildkitd is run without a cgroup namespace and cgroup v2 is in use.

For example, if I run moby/buildkit:latest via Docker Engine, this feature makes no difference because the container is created with a new cgroup namespace and so the root of the cgroup2 mountpoint is the cgroup that buildkitd was spawned under originally.

$ docker run -d --name buildkitd --privileged moby/buildkit:latest
34a267a4bb8fda59fdbb1870af4cba105b4fad5f0bcfa8f9d0018cf74eb8a64b
$ sudo lsns -t cgroup -p $(pgrep buildkitd)
        NS TYPE   NPROCS     PID USER COMMAND
4026532685 cgroup      1 3369108 root buildkitd
$ docker exec -it buildkitd cat /proc/1/cgroup
0::/init
$ cat /proc/$(pgrep buildkitd)/cgroup
0::/system.slice/docker-34a267a4bb8fda59fdbb1870af4cba105b4fad5f0bcfa8f9d0018cf74eb8a64b.scope/init

However, on Kubernetes, where cgroup v2 is used and no cgroup namespace is created for privileged containers (the default behavior according to the cgroup v2 KEP, which was a big surprise to me), there will be a difference in the overall cgroup hierarchy depending on whether the `--oci-isolate-cgroups` flag is used.

Without `--oci-isolate-cgroups`

When running an image built from this change without `--oci-isolate-cgroups`, buildkitd will move itself into an `/init` cgroup which will be at the root of the host's (node's) cgroup hierarchy, since the buildkitd container doesn't have its own cgroup namespace.

$ kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: buildkitd
spec:
  containers:
    - name: buildkitd
      image: docker.io/marxarelli/buildkit:isolate-cgroups
      securityContext:
        privileged: true
pod/buildkitd created
$ kubectl exec -it buildkitd -- cat /proc/1/cgroup
0::/init
$ kubectl exec -it buildkitd -- ls /sys/fs/cgroup
buildkit                       io.cost.model
cgroup.controllers             io.cost.qos
cgroup.max.depth               io.pressure
cgroup.max.descendants         io.stat
cgroup.pressure                kubepods
cgroup.procs                   memory.numa_stat
cgroup.stat                    memory.pressure
cgroup.subtree_control         memory.reclaim
cgroup.threads                 memory.stat
cpu.pressure                   misc.capacity
cpu.stat                       proc-sys-fs-binfmt_misc.mount
cpuset.cpus.effective          sys-fs-fuse-connections.mount
cpuset.mems.effective          sys-kernel-config.mount
dev-hugepages.mount            sys-kernel-debug.mount
dev-mqueue.mount               sys-kernel-tracing.mount
init                           system.slice
init.scope                     user.slice

# on the node where the pod was scheduled
root@pool-1zycrmzjx-p0zdz:/# cat /proc/$(pgrep buildkitd)/cgroup
0::/init
root@pool-1zycrmzjx-p0zdz:/# lsns -t cgroup -p $(pgrep buildkitd)
        NS TYPE   NPROCS PID USER COMMAND
4026531835 cgroup    115   1 root /sbin/init

With `--oci-isolate-cgroups`

When running the same image and pod setup but with `--oci-isolate-cgroups`, buildkitd will create its `init` cgroup (and scope all build container cgroups) under the hierarchy of its originally assigned cgroup.

$ kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: buildkitd
spec:
  containers:
    - name: buildkitd
      image: docker.io/marxarelli/buildkit:isolate-cgroups
      args:
        - --oci-isolate-cgroups
      securityContext:
        privileged: true
pod/buildkitd created
$ kubectl exec -it buildkitd -- cat /proc/1/cgroup
0::/kubepods/besteffort/pod9387ff78-ffef-4466-878e-27cfb0b3d4b3/65a857219c342c13a7f5c40a57b2d8c8b0c0dc1f8a2e221c9bda420f65dbb981/init

# on the node
root@pool-1zycrmzjx-p0zdz:/# cat /proc/$(pgrep buildkitd)/cgroup
0::/kubepods/besteffort/pod9387ff78-ffef-4466-878e-27cfb0b3d4b3/65a857219c342c13a7f5c40a57b2d8c8b0c0dc1f8a2e221c9bda420f65dbb981/init
root@pool-1zycrmzjx-p0zdz:/# lsns -t cgroup -p $(pgrep buildkitd)
        NS TYPE   NPROCS PID USER COMMAND
4026531835 cgroup    109   1 root /sbin/init

If we require a different setup, can't we just detect what we need without requiring config from the user?

That sounds reasonable to me. Perhaps a good enough heuristic would be to simply enable cgroup isolation whenever cgroup v2 is detected?
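
The detection criterion itself is simple; from inside the buildkitd container, something along these lines (a sketch of the check, not the actual implementation) distinguishes a unified cgroup v2 mount:

```console
# cgroup v2 (unified hierarchy) exposes cgroup.controllers at the mount root
$ test -f /sys/fs/cgroup/cgroup.controllers && echo cgroup2
cgroup2
# equivalently, the filesystem type of /sys/fs/cgroup is cgroup2fs
$ stat -fc %T /sys/fs/cgroup
cgroup2fs
```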

Alternative implementation using `unshare(CLONE_NEWCGROUP)`

I also experimented with an alternative approach today which is to have buildkitd create its own cgroup namespace and to remount /sys/fs/cgroup prior to creating its /init cgroup and running the OCI worker. This effectively accomplishes the same degree of isolation but requires far less complexity in the OCI worker/executor code. I'll prepare a different PR for this approach so you can compare.
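
Roughly, the wrapper amounts to something like the following before buildkitd starts (a sketch using util-linux `unshare`; the actual PR may wire this up differently):

```console
# enter new cgroup and mount namespaces, then remount /sys/fs/cgroup so the
# visible cgroup2 root is the cgroup buildkitd was originally assigned to
$ unshare --cgroup --mount sh -c '
    umount /sys/fs/cgroup &&
    mount -t cgroup2 none /sys/fs/cgroup &&
    exec buildkitd'
```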

marxarelli added a commit to marxarelli/buildkit that referenced this pull request Nov 17, 2025
If cgroup v2 is in use, use `unshare` to create a new cgroup and mount
namespace for buildkitd. Remount `/sys/fs/cgroup` to restrict its view
of the unified cgroup v2 hierarchy which will ensure its `init` cgroup
and all OCI worker managed cgroups are kept beneath the root cgroup of
the initial entrypoint process.

When buildkitd is run in a managed environment like Kubernetes without
its own cgroup namespace (the default behavior of privileged pods in
Kubernetes where cgroup v2 is in use; see [cgroup v2 KEP][kep]), the OCI
worker will spawn processes in cgroups that are outside of the cgroup
hierarchy that was created for the buildkitd container, leading to
incorrect resource accounting and enforcement which in turn can cause
OOM errors and CPU contention on the node.

Example behavior without cgroup namespace:

```console
root@k8s-node:/# cat /proc/$(pgrep -n buildkitd)/cgroup
0::/init
root@k8s-node:/# cat /proc/$(pgrep -n some-build-process)/cgroup
0::/buildkit/{runc-container-id}
```

Example behavior with cgroup namespace:

```console
root@k8s-node:/# cat /proc/$(pgrep -n buildkitd)/cgroup
0::/kubepods/burstable/pod{pod-id}/{container-id}/init
root@k8s-node:/# cat /proc/$(pgrep -n some-build-process)/cgroup
0::/kubepods/burstable/pod{pod-id}/{container-id}/buildkit/{runc-container-id}
```

Note this was developed as an alternative approach to moby#6343

[kep]: https://github.com/kubernetes/enhancements/tree/6d3210f7dd5d547c8f7f6a33af6a09eb45193cd7/keps/sig-node/2254-cgroup-v2#cgroup-namespace

Signed-off-by: Dan Duvall <[email protected]>
marxarelli added a commit to marxarelli/buildkit that referenced this pull request Nov 17, 2025
Introduce a new entrypoint script for the Linux image that, if cgroup v2
is in use, creates a new cgroup and mount namespace for buildkitd within
a new entrypoint using `unshare` and remounts `/sys/fs/cgroup` to
restrict its view of the unified cgroup hierarchy. This will ensure its
`init` cgroup and all OCI worker managed cgroups are kept beneath the
root cgroup of the initial entrypoint process.

When buildkitd is run in a managed environment like Kubernetes without
its own cgroup namespace (the default behavior of privileged pods in
Kubernetes where cgroup v2 is in use; see [cgroup v2 KEP][kep]), the OCI
worker will spawn processes in cgroups that are outside of the cgroup
hierarchy that was created for the buildkitd container, leading to
incorrect resource accounting and enforcement which in turn can cause
OOM errors and CPU contention on the node.

Example behavior without this change:

```console
root@k8s-node:/# cat /proc/$(pgrep -n buildkitd)/cgroup
0::/init
root@k8s-node:/# cat /proc/$(pgrep -n some-build-process)/cgroup
0::/buildkit/{runc-container-id}
```

Example behavior with this change:

```console
root@k8s-node:/# cat /proc/$(pgrep -n buildkitd)/cgroup
0::/kubepods/burstable/pod{pod-id}/{container-id}/init
root@k8s-node:/# cat /proc/$(pgrep -n some-build-process)/cgroup
0::/kubepods/burstable/pod{pod-id}/{container-id}/buildkit/{runc-container-id}
```

Note this was developed as an alternative approach to moby#6343

[kep]: https://github.com/kubernetes/enhancements/tree/6d3210f7dd5d547c8f7f6a33af6a09eb45193cd7/keps/sig-node/2254-cgroup-v2#cgroup-namespace

Signed-off-by: Dan Duvall <[email protected]>