@multics69 commented Nov 10, 2025

This patch set adds cpu.max support [1] for SCX schedulers.
It introduces a new library for CPU bandwidth control and integrates it with LAVD.
The series consists of four major parts:

  1. cpu.max library implementation
  2. ATQ/rbtree modifications required by the library
  3. LAVD integration for cpu.max
  4. Utility scripts for debugging and testing

1. Overview of the cpu.max Feature

The cpu.max interface controls CPU bandwidth for non-root cgroups using three parameters:

  • $QUOTA
  • $PERIOD
  • $BURST

A cgroup may consume up to $QUOTA of CPU time in each $PERIOD.
Setting $QUOTA to max removes the limit.
At the end of each period, unused quota is discarded unless $BURST is specified,
in which case up to $BURST of unused quota carries over into later periods.
For example, with quota = 20 ms, period = 100 ms, and burst = 10 ms, a cgroup
that used only 15 ms in one period may consume up to 25 ms in the next.


2. Design Overview

Our implementation follows the cgroup v2 cpu.max semantics but differs from the in-kernel CFS version for efficiency and simplicity.

(1) Interpreting Quota and Period as CPU Utilization

We treat quota / period as CPU utilization rather than time-based replenishment.

For example, if a cgroup has:

quota = 20000 µs
period = 200000 µs

it may use up to 10% of a single CPU.
All cgroups are replenished by a single BPF timer, avoiding one timer per cgroup.
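
As a minimal illustration of this interpretation (the helper name and the
(u64)-1 "max" sentinel below are assumptions of this sketch, not part of the
library):

/* Sketch: interpret quota/period as a utilization cap in permille. */
static u64 cgrp_util_permille(u64 quota_us, u64 period_us)
{
        if (!period_us || quota_us == (u64)-1)  /* "max": no limit */
                return 1000;                    /* 100% of one CPU */
        /* quota = 20000 us, period = 200000 us -> 100 permille = 10% */
        return (quota_us * 1000) / period_us;
}

Because only the ratio matters, one periodic BPF timer can replenish every
cgroup's budget at once instead of arming a timer per cgroup period.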

(2) Eventual Quota Enforcement via Task Admission Control

To reduce overhead in the critical path, we enforce quota eventually rather than immediately.

  • In CFS, quota is checked when selecting a task.
  • In SCX, we check it on enqueue (task admission control).

Tasks from throttled cgroups are deferred to a backlog queue instead of a DSQ.
When the BPF timer replenishes the quota, deferred tasks are moved back to
their DSQs, adding zero overhead to ops.dispatch().
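
A hedged sketch of this admission-control flow, using the library calls
summarized in section 3 below (the ops name, the lookup_task_ctx() helper,
and SHARED_DSQ are illustrative, not LAVD's actual code):

void BPF_STRUCT_OPS(sched_enqueue, struct task_struct *p, u64 enq_flags)
{
        struct task_ctx *taskc = lookup_task_ctx(p);    /* scheduler-defined */
        struct cgroup *cgrp = scx_bpf_task_cgroup(p);

        if (scx_cgroup_bw_throttled(cgrp)) {
                /* Cgroup is out of budget: park the task in the backlog
                 * queue instead of a DSQ. taskc is an arena pointer
                 * passed as u64. */
                scx_cgroup_bw_put_aside(p, (u64)taskc, p->scx.dsq_vtime, cgrp);
                bpf_cgroup_release(cgrp);
                return;
        }
        bpf_cgroup_release(cgrp);

        scx_bpf_dsq_insert(p, SHARED_DSQ, SCX_SLICE_DFL, enq_flags);
}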


3. API for BPF Schedulers

BPF schedulers interact with the library using the following API.

Initialization

  • scx_cgroup_bw_lib_init() → called in ops.init()
  • scx_cgroup_bw_init() / scx_cgroup_bw_exit() → called in ops.cgroup_init() / ops.cgroup_exit()
  • scx_cgroup_bw_set() → called when cpu.max parameters change
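
A minimal sketch of this wiring (cgroup_bw_config and the op names are
illustrative, and the ops.cgroup_set_bandwidth hook is assumed to be where
cpu.max changes arrive):

static struct scx_cgroup_bw_config cgroup_bw_config;    /* illustrative */

s32 BPF_STRUCT_OPS_SLEEPABLE(sched_init)
{
        /* Bring the bandwidth library up once, at scheduler load. */
        return scx_cgroup_bw_lib_init(&cgroup_bw_config);
}

s32 BPF_STRUCT_OPS_SLEEPABLE(sched_cgroup_init, struct cgroup *cgrp,
                             struct scx_cgroup_init_args *args)
{
        return scx_cgroup_bw_init(cgrp, args);
}

void BPF_STRUCT_OPS(sched_cgroup_exit, struct cgroup *cgrp)
{
        scx_cgroup_bw_exit(cgrp);
}

/* Assumed hook for cpu.max changes (ops.cgroup_set_bandwidth). */
void BPF_STRUCT_OPS(sched_cgroup_set_bandwidth, struct cgroup *cgrp,
                    u64 period_us, u64 quota_us, u64 burst_us)
{
        scx_cgroup_bw_set(cgrp, period_us, quota_us, burst_us);
}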

Runtime Checks

Before enqueueing (ops.enqueue()) or selecting a CPU (ops.select_cpu()), call
scx_cgroup_bw_throttled() to check whether the cgroup is throttled.

If throttled:

  • Defer task using scx_cgroup_bw_put_aside()
  • When replenished, call scx_cgroup_bw_reenqueue() (from ops.dispatch())

Re-enqueued tasks use scheduler-defined policies (e.g., time slice, vtime, DSQ).
A callback must be registered with:

REGISTER_SCX_CGROUP_BW_ENQUEUE_CB(eqcb)

where:

int eqcb(u64 pid);

If a throttled task is dequeued before being unthrottled (e.g., due to sched_setaffinity(2)),
the scheduler should call scx_cgroup_bw_cancel() to maintain proper state.
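
Putting these pieces together, a hedged sketch of the callback, the
dispatch-side drain, and the dequeue-side cancellation (the function names,
SHARED_DSQ, and lookup_task_ctx() are illustrative):

/* Re-enqueue one previously throttled task once its cgroup's quota has
 * been replenished; the library invokes this registered callback. The
 * DSQ/slice policy here is up to the scheduler. */
static int sched_bw_enqueue_cb(u64 pid)
{
        struct task_struct *p = bpf_task_from_pid((s32)pid);

        if (!p)
                return -ESRCH;
        scx_bpf_dsq_insert(p, SHARED_DSQ, SCX_SLICE_DFL, 0);
        bpf_task_release(p);
        return 0;
}
REGISTER_SCX_CGROUP_BW_ENQUEUE_CB(sched_bw_enqueue_cb);

void BPF_STRUCT_OPS(sched_dispatch, s32 cpu, struct task_struct *prev)
{
        /* Bottom half of the replenish timer: move any tasks whose
         * cgroups were just unthrottled back to their DSQs. */
        scx_cgroup_bw_reenqueue();

        /* ... regular dispatch logic ... */
}

void BPF_STRUCT_OPS(sched_dequeue, struct task_struct *p, u64 deq_flags)
{
        struct task_ctx *taskc = lookup_task_ctx(p);    /* scheduler-defined */

        /* The task may leave scx (e.g., sched_setaffinity(2)) while
         * still parked in the backlog; drop its throttling state. */
        scx_cgroup_bw_cancel((u64)taskc);
}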

Reporting the Used Time

The scheduler must report each task's consumed execution time periodically (ops.tick()) and when the task stops (ops.stopping()) using scx_cgroup_bw_consume().
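
A minimal sketch of the reporting path (the last_reported_runtime bookkeeping
field and lookup_task_ctx() are assumptions of this sketch):

static void report_consumed(struct task_struct *p)
{
        struct task_ctx *taskc = lookup_task_ctx(p);    /* scheduler-defined */
        struct cgroup *cgrp = scx_bpf_task_cgroup(p);
        u64 now = p->se.sum_exec_runtime;

        /* Charge only the runtime accrued since the last report. */
        scx_cgroup_bw_consume(cgrp, now - taskc->last_reported_runtime);
        taskc->last_reported_runtime = now;
        bpf_cgroup_release(cgrp);
}

void BPF_STRUCT_OPS(sched_tick, struct task_struct *p)
{
        report_consumed(p);
}

void BPF_STRUCT_OPS(sched_stopping, struct task_struct *p, bool runnable)
{
        report_consumed(p);
}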

API Summary

See scheds/include/lib/cgroup.h for detailed documentation.

int scx_cgroup_bw_lib_init(struct scx_cgroup_bw_config *config);
int scx_cgroup_bw_init(struct cgroup *cgrp, struct scx_cgroup_init_args *args);
int scx_cgroup_bw_exit(struct cgroup *cgrp);
int scx_cgroup_bw_set(struct cgroup *cgrp, u64 period_us, u64 quota_us, u64 burst_us);
int scx_cgroup_bw_throttled(struct cgroup *cgrp);
int scx_cgroup_bw_consume(struct cgroup *cgrp, u64 consumed_ns);
int scx_cgroup_bw_put_aside(struct task_struct *p, u64 taskc, u64 vtime, struct cgroup *cgrp);
int scx_cgroup_bw_reenqueue(void);
int scx_cgroup_bw_cancel(u64 taskc);
#define REGISTER_SCX_CGROUP_BW_ENQUEUE_CB(eqcb)

4. Prerequisites for BPF Schedulers

Schedulers must allocate their per-task context (e.g., struct task_ctx in LAVD) in the BPF arena,
as the library internally uses ATQ (Arena-based Task Queue) for managing throttled tasks.

Each task context must begin with struct scx_task_common (defined in scheds/include/lib/atq.h)
or reserve equivalent space (struct atq_ctx in LAVD).
This structure embeds a red-black tree node and related state required by ATQ.
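
A hedged sketch of the required layout (only the placement of
scx_task_common is prescribed; the include path and every other field are
illustrative):

#include <lib/atq.h>    /* struct scx_task_common */

/* Per-task context, allocated in the BPF arena. scx_task_common must
 * be the first member: it embeds the red-black tree node and state
 * that ATQ uses to park throttled tasks. */
struct task_ctx {
        struct scx_task_common common;          /* must come first */

        /* Scheduler-specific fields follow (illustrative). */
        u64                    vtime;
        u64                    last_reported_runtime;
};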


5. Patch Structure

(1) ATQ / Rbtree Changes

  • lib/atq: factor out task insertion into scx_atq_insert_node
  • lib/rbtree: add noalloc/nofree variants of the API
  • lib/rbtree: turn rbtree_insert_mode from a per-insert into a per-tree attribute
  • lib/rbtree: adjust inlining to pass verification
  • lib/atq: only use embedded rbnodes on scx_atq_insert_*()
  • lib/rbtree: remove RB_ALLOC check out of rb_insert codepath and update selftests
  • lib/selftests: add selftests for embedded rbtree node
  • lib/rbtree: initialize node color to red for embedded nodes
  • include/lib: expose rb_integrity_check as public API
  • lib/rbtree: wipe the values of ->left and ->right in embedded rbnode_t instances before use
  • selftests/atq: expand and fix tests
  • lib/selftests: exclude cgroup_bw.bpf.c from selftests
  • lib/selftests: hardcode Clang version to be the system one
  • include: fix userspace-side spinlock definitions
  • lib/atq: change signature to require scx_task_common
  • lib/atq: add atq_remove_node call
  • include/atq: add ATQ lock/unlock calls
  • lib/atq: add unlocked insert ATQ calls

(2) cpu.max Library Implementation

  • lib: cgroup_bw: Add skeleton for CPU bandwidth control.
  • lib: cgroup_bw: Implement scx_cgroup_bw_lib_init().
  • lib: cgroup_bw: Implement scx_cgroup_bw_init().
  • lib: cgroup_bw: Implement scx_cgroup_bw_set().
  • lib: cgroup_bw: Implement scx_cgroup_bw_exit().
  • lib: cgroup_bw: Implement scx_cgroup_bw_throttled().
  • lib: cgroup_bw: Implement scx_cgroup_bw_consume().
  • lib: cgroup_bw: Implement scx_cgroup_bw_put_aside().
  • lib: cgroup_bw: Implement replenish timer and cbw_reenqueue_cgroup().
  • lib: cgroup_bw: Implement scx_cgroup_bw_cancel().

(3) LAVD Integration

  • scx_lavd: Initial integration with CPU bandwidth control.
  • scx_lavd: Move task_ctx to the BPF arena.
  • scx_lavd: Support cpu.max at enqueue-like paths.
  • scx_lavd: Implement ops.dequeue() for throttle cancellation.

(4) Scripts

  • scripts: Add a script to get a cgroup path from its ID.
  • scripts: Add a script to set up cgroups for basic testing of cpu.max.

6. Reference

[1] Cgroup v2 Documentation — CPU controller: https://docs.kernel.org/admin-guide/cgroup-v2.html


Signed-off-by: Changwoo Min <[email protected]>
Signed-off-by: Emil Tsalapatis <[email protected]>


@etsal left a comment


Short note: This feature will temporarily break schedulers that use the old ATQ API, notably p2dq. There will be a subsequent patch that fixes this issue.

CC @hodgesds


@htejun commented Nov 10, 2025

Let's land this one after the November release is cut.

etsal and others added 34 commits November 13, 2025
Define key data structures (scx_cgroup_ctx, scx_cgroup_llc_ctx)
to support CPU bandwidth control (cpu.max) in cgroup v2. In addition,
add an API skeleton for BPF schedulers.

Signed-off-by: Changwoo Min <[email protected]>
Signed-off-by: Emil Tsalapatis <[email protected]>
Integrate CPU bandwidth control with the scx_lavd scheduler.

The library is initialized (scx_cgroup_bw_lib_init) when the scheduler
is initialized. Also, ops.cgroup_init(), ops.cgroup_exit(), and
ops.cgroup_move() are implemented; scx_cgroup_bw_reenqueue() is called
at ops.dispatch(). A new option, `--enable-cpu-bw`, is added to enable
the feature. Finally, __nr_cpu_ids is replaced with nr_cpu_ids, which is
defined in the scx library.

Signed-off-by: Changwoo Min <[email protected]>
In order to use cpu.max, task_ctx must live in the BPF arena because
the cpu.max library internally uses ATQ to manage throttled tasks.
To this end, move task_ctx to the BPF arena.

More specifically, this includes:
  - rename struct task_ctx to task_ctx
  - move pid to main task_ctx
  - move task_ctx to arenas
  - mark functions taking a task context with __arg_arena

Signed-off-by: Emil Tsalapatis <[email protected]>
Signed-off-by: Changwoo Min <[email protected]>
scx_cgroup_bw_lib_init() initializes the configuration and the replenish timer.

Signed-off-by: Changwoo Min <[email protected]>
When a cgroup is initialized, the cgroup's context and its LLC contexts
are initialized, and its parent becomes a non-leaf cgroup. Unless the
parent is threaded, it can no longer have tasks, so we delete the
parent's LLC contexts.

Signed-off-by: Changwoo Min <[email protected]>
Signed-off-by: Emil Tsalapatis <[email protected]>
When a cgroup's bandwidth is updated, we must also update the nquota_lb
of all its descendants.

Signed-off-by: Changwoo Min <[email protected]>
Destroy the cgroup context and its LLC contexts, and drain and free the
BTQs associated with the LLC contexts.

Signed-off-by: Changwoo Min <[email protected]>
We test whether a cgroup is throttled in a bottom-up manner: first at
the LLC level, then at the cgroup level, and finally at the subroot
cgroup level. Before requesting budget from the subroot cgroup, update
this cgroup's runtime_total_sloppy to avoid spending beyond its upper
bound (nquota_ub). We traverse the cgroup hierarchy in post order
(left, right, then root), checking each cgroup's level to efficiently
sum the runtime_total of all its descendants.

Signed-off-by: Changwoo Min <[email protected]>
After executing a task, we update runtime_total and compensate
budget_remaining_before by comparing the planned vs. actual time usage.

Signed-off-by: Changwoo Min <[email protected]>
When a cgroup is throttled (i.e., scx_cgroup_bw_reserve() returns -EAGAIN),
a task that is in the ops.enqueue() path should be put aside to the BTQ of
its associated LLC context. When the cgroup becomes unthrottled again, the
registered enqueue_cb() will be called to re-enqueue the task for execution.

Signed-off-by: Changwoo Min <[email protected]>
Signed-off-by: Emil Tsalapatis <[email protected]>
We replenish each cgroup's budget every 100 ms (nperiod). Only subroot
cgroups (level == 1) distribute budget to their descendants, and we
replenish budget only for subroot cgroups with a limited quota. Unused
budget can accumulate up to the specified burst; overused budget is
charged back over subsequent intervals.

The replenish timer is split into two parts: the top half and the bottom
half. The top half -- the actual BPF timer function (replenish_timerfn)
-- runs the essential, critical part, such as refilling the time budget.
On the other hand, the bottom half -- scx_cgroup_bw_reenqueue() -- runs
on a BPF scheduler's ops.dispatch() and requeues the backlogged tasks to
proper DSQs.

Signed-off-by: Changwoo Min <[email protected]>
Signed-off-by: Emil Tsalapatis <[email protected]>
Tasks may be dequeued from the BPF side by the scx core during system calls
like sched_setaffinity(2). In that case, we must cancel any
throttling-related ATQ insert operations for the task:
  - We must avoid double inserts caused by the dequeued task being
    re-enqueued and throttled again while still in an ATQ.
  - We want to remove tasks that are no longer in scx from throttling.
    While inserting non-scx tasks into a DSQ is a no-op, we would like
    our accounting to be as accurate as possible.

Signed-off-by: Emil Tsalapatis <[email protected]>
Signed-off-by: Changwoo Min <[email protected]>
We update the ops.enqueue(), enqueue-callback, ops.dispatch(), and
ops.tick() paths.

Under CPU bandwidth control using cpu.max, we first check whether the
task's cgroup is throttled. If not, we proceed as usual; otherwise, we
set the task aside for later execution.

Also, we report how much time was actually consumed at ops.tick() and
ops.stopping().

Note that we do not throttle the scheduler process itself to guarantee
forward progress.

Signed-off-by: Changwoo Min <[email protected]>
A throttled task may be dequeued by the sched_ext core for various reasons,
including sched_setaffinity(2), task migration, and so on. Such a task will
eventually be enqueued again, corrupting the internal red-black tree state
of the ATQ it was parked in. To avoid this, we remove the dequeuing task
from its ATQ, if it is in one.

Signed-off-by: Emil Tsalapatis <[email protected]>
Signed-off-by: Changwoo Min <[email protected]>
Add a utility script, cgpath.sh, that takes a cgroup ID as a
command-line argument and returns the full path of the cgroup.
The cgroup ID is the inode number of the cgroup. This is for easy
debugging of the cpu.max support.

Signed-off-by: Changwoo Min <[email protected]>
@multics69 merged commit b7f394c into sched-ext:main Nov 13, 2025
2 of 4 checks passed
@JakeHillion commented

> Short note: This feature will temporarily break schedulers that use the old ATQ API, notably p2dq. There will be a subsequent patch that fixes this issue.

This is kind of a pain; CI is failing on main because of this. Can we please make sure to land the fixes today?

@sirlucjan commented

This is the fix: #3038 @JakeHillion
