fix: cpu affinity #4815

Open
wants to merge 1 commit into base: main

1 change: 1 addition & 0 deletions .github/workflows/test.yml
@@ -169,6 +169,7 @@ jobs:

- name: integration test (systemd driver)
run: |
sudo taskset -pc 0-1 1
Member

Should be moved to the bats script

# Delegate all cgroup v2 controllers to rootless user via --systemd-cgroup.
# The default (since systemd v252) is "pids memory cpu".
sudo mkdir -p /etc/systemd/system/user@.service.d
44 changes: 44 additions & 0 deletions libcontainer/process_linux.go
@@ -198,6 +198,17 @@ func (p *setnsProcess) setFinalCPUAffinity() error {
return nil
}

func (p *setnsProcess) hasExecCPUAffinity() bool {
	aff := p.config.CPUAffinity
	if aff == nil {
		return false
	}
	if aff.Initial != nil || aff.Final != nil {
		return true
	}
	return false
}

func (p *setnsProcess) start() (retErr error) {
defer p.comm.closeParent()

@@ -258,6 +269,13 @@ func (p *setnsProcess) start() (retErr error) {
if err := p.setFinalCPUAffinity(); err != nil {
return err
}

if !p.hasExecCPUAffinity() {
if err := resetAffinityMask(p.pid()); err != nil {
return err
}
}

if p.intelRdtPath != "" {
// if Intel RDT "resource control" filesystem path exists
_, err := os.Stat(p.intelRdtPath)
@@ -615,6 +633,11 @@ func (p *initProcess) start() (retErr error) {
return fmt.Errorf("unable to apply cgroup configuration: %w", err)
}
}

if err := resetAffinityMask(p.pid()); err != nil {
return err
}

if p.intelRdtManager != nil {
if err := p.intelRdtManager.Apply(p.pid()); err != nil {
return fmt.Errorf("unable to apply Intel RDT configuration: %w", err)
@@ -981,3 +1004,24 @@ func (p *Process) InitializeIO(rootuid, rootgid int) (i *IO, err error) {
}
return i, nil
}

// resetAffinityMask resets the inherited CPU affinity of pid to all CPUs.
// Old kernels do that automatically, but new kernels remember the affinity
// that was set before the cgroup move. This is undesirable, because the
// process inherits the systemd affinity when it should really move to the
// container's CPUs.
// We can't use runtime.NumCPU() to count CPUs here because it calls
// sched_getaffinity: if systemd sets CPUAffinity, runtime.NumCPU() does not
// return the real CPU count.
func resetAffinityMask(pid int) error {
	cpus, err := utils.SystemCPUCores()
	if err != nil {
		return err
	}
	cpuset := unix.CPUSet{}
	for i := 0; i < int(cpus); i++ {
		cpuset.Set(i)
	}
Member

Was this tested with nested containers?
cpus here can be different from the number of the available cpus

Member

Indeed. I think the code in this PR will only work if you use lxcfs with nested containers (which nobody does with runc).

I suspect that we would instead need to parse /proc/self/cgroup and then look at the CPU set in /sys/fs/cgroup/cpuset.cpus.effective (but we would also need to check any parent cgroups if cpuset is not in cgroup.controllers).
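
As a rough illustration of the parsing step suggested above, here is a minimal sketch that expands a cpuset list such as the one found in cpuset.cpus.effective into CPU indices. The name parseCPUList and its error messages are hypothetical and not part of runc; locating the right cgroup directory and walking parent cgroups is left out.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseCPUList expands a kernel cpuset list such as "0-3,8" into CPU indices.
// The function name and error wording are illustrative, not runc APIs.
func parseCPUList(s string) ([]int, error) {
	s = strings.TrimSpace(s)
	if s == "" {
		return nil, nil
	}
	var cpus []int
	for _, part := range strings.Split(s, ",") {
		lo, hi, isRange := strings.Cut(part, "-")
		first, err := strconv.Atoi(lo)
		if err != nil {
			return nil, fmt.Errorf("invalid cpu %q: %w", lo, err)
		}
		last := first
		if isRange {
			if last, err = strconv.Atoi(hi); err != nil {
				return nil, fmt.Errorf("invalid cpu %q: %w", hi, err)
			}
		}
		if last < first {
			return nil, fmt.Errorf("invalid range %q", part)
		}
		for c := first; c <= last; c++ {
			cpus = append(cpus, c)
		}
	}
	return cpus, nil
}

func main() {
	// In practice the input would be the contents of cpuset.cpus.effective
	// for the relevant cgroup; a literal keeps the sketch self-contained.
	cpus, err := parseCPUList("0-3,8\n")
	if err != nil {
		panic(err)
	}
	fmt.Println(cpus)
}
```

The resulting indices could then be fed to a CPU set one by one, instead of assuming the set is always 0 to N-1.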

Member

Alternatively, would just passing nil as described in MarSik@e6ce3af just work?

I believe this is trying to take advantage of the EINVAL error fallback of __sched_setaffinity (which does reset the affinity back to the cpuset if there is no overlap between the cpuset and the requested affinity) but I'm not sure it actually works. My reading of __set_cpus_allowed_ptr gives me the impression that this shouldn't work, but the linked commit claims this resolves this issue?

Member

@cyphar, Aug 18, 2025

Oh actually, sched_setaffinity silently clamps the cpuset you give based on the cpuset for the task. So this is fine.

However, I would like to know if nil works just as well -- less code is better.

In fact, it might be even simpler to just generate a set of 8192 CPUs and get the kernel to clamp it for us? The kernel automatically clamps the size of cpumask to nr_cpu_ids internally so even if you give a really large number they will happily ignore it.

EDIT: Testing this, it seems golang.org/x/sys/unix will silently truncate the cpuset to 1024 CPUs. They have a hardcoded limit of _CPU_SETSIZE.
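
To make the clamping idea above concrete, here is a toy model in plain Go. It is not kernel code and does not use unix.CPUSet; the names cpuMask, fullMask and clamp are made up for this sketch, and maxCPUs only mirrors the 1024-CPU limit mentioned above. The sketch requests every representable CPU and intersects that request with the CPUs a task is allowed to use.

```go
package main

import (
	"fmt"
	"math/bits"
)

// maxCPUs mirrors the 1024-CPU limit mentioned above; the constant name is
// an assumption of this sketch.
const maxCPUs = 1024

// cpuMask is a plain bitmask with one bit per CPU index.
type cpuMask [maxCPUs / 64]uint64

func (m *cpuMask) set(cpu int) { m[cpu/64] |= uint64(1) << uint(cpu%64) }

func (m *cpuMask) isSet(cpu int) bool {
	return m[cpu/64]&(uint64(1)<<uint(cpu%64)) != 0
}

// count returns the number of CPUs present in the mask.
func (m *cpuMask) count() int {
	n := 0
	for _, w := range m {
		n += bits.OnesCount64(w)
	}
	return n
}

// fullMask requests every CPU the mask can represent.
func fullMask() cpuMask {
	var m cpuMask
	for i := range m {
		m[i] = ^uint64(0)
	}
	return m
}

// clamp models the behaviour described above: the effective affinity is the
// intersection of the requested mask and the CPUs the task may use.
func clamp(requested, allowed cpuMask) cpuMask {
	var out cpuMask
	for i := range out {
		out[i] = requested[i] & allowed[i]
	}
	return out
}

func main() {
	// Pretend the task's cpuset only allows CPUs 0 and 1.
	var allowed cpuMask
	allowed.set(0)
	allowed.set(1)

	effective := clamp(fullMask(), allowed)
	fmt.Println(effective.count())
}
```

If the kernel behaves as described, an oversized request would need no knowledge of the real CPU count on the caller's side, which is the appeal of the "less code" variant.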

	if err := unix.SchedSetaffinity(pid, &cpuset); err != nil {
		return fmt.Errorf("error resetting pid %d affinity: %w", pid, err)
	}
	return nil
}
31 changes: 31 additions & 0 deletions libcontainer/utils/utils.go
@@ -1,7 +1,9 @@
package utils

import (
"bufio"
"encoding/json"
"fmt"
"io"
"os"
"path/filepath"
@@ -113,3 +115,32 @@ func Annotations(labels []string) (bundle string, userAnnotations map[string]str
}
return
}

// SystemCPUCores returns the number of CPUs on the system, counted from
// the per-CPU lines in /proc/stat.
func SystemCPUCores() (cpuNum uint32, _ error) {
	f, err := os.Open("/proc/stat")
	if err != nil {
		return 0, err
	}
	defer f.Close()
	return readSystemCPU(f)
}

func readSystemCPU(r io.Reader) (cpuNum uint32, _ error) {
	reader := bufio.NewReader(r)
	for {
		line, err := reader.ReadString('\n')
		if err != nil {
			return 0, fmt.Errorf("error scanning /proc/stat file: %w", err)
		}
		// Only count lines that start with cpuN, where N is the CPU number.
		if line[:3] != "cpu" {
Member

I think the comment that all cpu* lines are at the beginning belongs here.

Member

Why did you mark this as solved? Am I missing something?

			break
		}
		if '0' <= line[3] && line[3] <= '9' {
Member

Here we need a comment explaining that there is a "cpu" line that we should ignore and only count cpu lines followed by a number.

Member

idem

			cpuNum++
		}
	}
	return cpuNum, nil
}
46 changes: 46 additions & 0 deletions libcontainer/utils/utils_test.go
@@ -2,6 +2,7 @@ package utils

import (
"bytes"
"os"
"testing"

"golang.org/x/sys/unix"
@@ -137,3 +138,48 @@ func TestStripRoot(t *testing.T) {
}
}
}

func TestSystemCPUCores(t *testing.T) {
t.Run("MultiCore", func(t *testing.T) {
content := `cpu 5263854 3354 5436110 61362568 22532 728994 208644 796742 0 0
cpu0 720149 490 674391 7571042 4601 103938 42990 109735 0 0
cpu1 595284 389 676327 7761080 2405 77856 25882 95566 0 0
cpu2 727310 508 693322 7562543 3426 102842 28396 105651 0 0
cpu3 601561 304 685817 7751082 2064 80219 17547 92322 0 0
cpu4 713033 504 669261 7586506 2850 105624 39150 106688 0 0
cpu5 595065 328 683341 7761812 2065 77750 17827 91675 0 0
cpu6 720528 458 676161 7595093 3007 101744 21132 103530 0 0
cpu7 590922 371 677486 7773406 2111 79018 15716 91570 0 0
intr 1997458243 37 333 0 0 0 0 3 0 1 0 0 0 183 0 0 90125 0 0 0 0 0 0 0 0 0 458484 0 361539 0 0 0 256 0 1956792 15 0 918260 6 1450411 256422 0 49025 195 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
ctxt 2640704037
btime 1752714561
processes 5253419
procs_running 2
procs_blocked 0
softirq 580996229 23 230614056 282 2160733 45109 0 40037 116656548 0 231479441
`
tmpfile, err := os.CreateTemp("", "stat")
if err != nil {
t.Fatal(err)
}
defer os.Remove(tmpfile.Name())

if _, err := tmpfile.WriteString(content); err != nil {
t.Fatal(err)
}
if err := tmpfile.Close(); err != nil {
t.Fatal(err)
}
f, err := os.Open(tmpfile.Name())
if err != nil {
t.Fatal(err)
}
count, err := readSystemCPU(f)
if err != nil {
t.Errorf("unexpected error: %v", err)
}
if count != 8 {
t.Errorf("expected 8 cores, got %d", count)
}
})
}
17 changes: 17 additions & 0 deletions tests/integration/cpu_affinity.bats
@@ -99,3 +99,20 @@ function cpus_to_mask() {
[[ "$output" == *"nsexec"*": affinity: $mask"* ]]
[[ "$output" == *"Cpus_allowed_list: $final"* ]] # Mind the literal tab.
}

@test "runc exec [CPU affinity set from config.json]" {
update_config '.process.args = [ "/bin/grep", "-F", "Cpus_allowed_list", "/proc/self/status"]'
cpus=$(grep -c "^processor" /proc/cpuinfo)
cpus_minus_one=$((cpus - 1))
runc run ct1
[ "$status" -eq 0 ]
last_col=$(echo "$output" | awk '{print $NF}')
[[ "$last_col" == *"0-$cpus_minus_one"* ]] # Mind the literal tab.
update_config '.process.args = ["/bin/sleep", "100"]'
runc run -d --console-socket "$CONSOLE_SOCKET" ct2
[ "$status" -eq 0 ]
runc exec ct2 grep -F "Cpus_allowed_list:" /proc/self/status
[ "$status" -eq 0 ]
last_col=$(echo "$output" | awk '{print $NF}')
[[ "$last_col" == *"0-$cpus_minus_one"* ]] # Mind the literal tab.
}