@multics69 commented Nov 10, 2025

This patch set adds cpu.max support [1] for SCX schedulers.
It introduces a new library for CPU bandwidth control and integrates it with LAVD.
The series consists of four major parts:

  1. cpu.max library implementation
  2. ATQ/rbtree modifications required by the library
  3. LAVD integration for cpu.max
  4. Utility scripts for debugging and testing

1. Overview of the cpu.max Feature

The cpu.max interface controls CPU bandwidth for non-root cgroups using three parameters:

  • $QUOTA
  • $PERIOD
  • $BURST

A cgroup may consume up to $QUOTA of CPU time in each $PERIOD.
Setting $QUOTA to max removes the limit.
At the end of each period, unused quota is discarded unless $BURST is specified,
in which case up to $BURST of unused quota carries over into later periods.
For example, with quota = 20 ms, period = 100 ms, and burst = 10 ms, a cgroup
that used only 15 ms in one period may consume up to 25 ms in the next.


2. Design Overview

Our implementation follows the cgroup v2 cpu.max semantics but differs from the in-kernel CFS version for efficiency and simplicity.

(1) Interpreting Quota and Period as CPU Utilization

We treat quota / period as CPU utilization rather than time-based replenishment.

For example, if a cgroup has:

quota = 20000 µs
period = 200000 µs

it may use up to 10% of a single CPU.
All cgroups are replenished by a single BPF timer, avoiding one timer per cgroup.
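
As a minimal illustration of this interpretation (the helper name and the
(u64)-1 "max" sentinel below are assumptions of this sketch, not part of the
library):

/* Sketch: interpret quota/period as a utilization cap in permille. */
static u64 cgrp_util_permille(u64 quota_us, u64 period_us)
{
        if (!period_us || quota_us == (u64)-1)  /* "max": no limit */
                return 1000;                    /* 100% of one CPU */
        /* quota = 20000 us, period = 200000 us -> 100 permille = 10% */
        return (quota_us * 1000) / period_us;
}

Because only the ratio matters, one periodic BPF timer can replenish every
cgroup's budget at once instead of arming a timer per cgroup period.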

(2) Eventual Quota Enforcement via Task Admission Control

To reduce overhead in the critical path, we enforce quota eventually rather than immediately.

  • In CFS, quota is checked when selecting a task.
  • In SCX, we check it on enqueue (task admission control).

Tasks from throttled cgroups are deferred to a backlog queue instead of a DSQ.
When the BPF timer replenishes the quota, deferred tasks are moved back to
their DSQs, adding zero overhead to ops.dispatch().
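
A hedged sketch of this admission-control flow, using the library calls
summarized in section 3 below (the ops name, the lookup_task_ctx() helper,
and SHARED_DSQ are illustrative, not LAVD's actual code):

void BPF_STRUCT_OPS(sched_enqueue, struct task_struct *p, u64 enq_flags)
{
        struct task_ctx *taskc = lookup_task_ctx(p);    /* scheduler-defined */
        struct cgroup *cgrp = scx_bpf_task_cgroup(p);

        if (scx_cgroup_bw_throttled(cgrp)) {
                /* Cgroup is out of budget: park the task in the backlog
                 * queue instead of a DSQ. taskc is an arena pointer
                 * passed as u64. */
                scx_cgroup_bw_put_aside(p, (u64)taskc, p->scx.dsq_vtime, cgrp);
                bpf_cgroup_release(cgrp);
                return;
        }
        bpf_cgroup_release(cgrp);

        scx_bpf_dsq_insert(p, SHARED_DSQ, SCX_SLICE_DFL, enq_flags);
}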


3. API for BPF Schedulers

BPF schedulers interact with the library using the following API.

Initialization

  • scx_cgroup_bw_lib_init() → called in ops.init()
  • scx_cgroup_bw_init() / scx_cgroup_bw_exit() → called in ops.cgroup_init() / ops.cgroup_exit()
  • scx_cgroup_bw_set() → called when cpu.max parameters change
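
A minimal sketch of this wiring (cgroup_bw_config and the op names are
illustrative, and the ops.cgroup_set_bandwidth hook is assumed to be where
cpu.max changes arrive):

static struct scx_cgroup_bw_config cgroup_bw_config;    /* illustrative */

s32 BPF_STRUCT_OPS_SLEEPABLE(sched_init)
{
        /* Bring the bandwidth library up once, at scheduler load. */
        return scx_cgroup_bw_lib_init(&cgroup_bw_config);
}

s32 BPF_STRUCT_OPS_SLEEPABLE(sched_cgroup_init, struct cgroup *cgrp,
                             struct scx_cgroup_init_args *args)
{
        return scx_cgroup_bw_init(cgrp, args);
}

void BPF_STRUCT_OPS(sched_cgroup_exit, struct cgroup *cgrp)
{
        scx_cgroup_bw_exit(cgrp);
}

/* Assumed hook for cpu.max changes (ops.cgroup_set_bandwidth). */
void BPF_STRUCT_OPS(sched_cgroup_set_bandwidth, struct cgroup *cgrp,
                    u64 period_us, u64 quota_us, u64 burst_us)
{
        scx_cgroup_bw_set(cgrp, period_us, quota_us, burst_us);
}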

Runtime Checks

Before enqueueing (ops.enqueue()) or selecting a CPU (ops.select_cpu()), call
scx_cgroup_bw_throttled() to check whether the cgroup is throttled.

If throttled:

  • Defer task using scx_cgroup_bw_put_aside()
  • When replenished, call scx_cgroup_bw_reenqueue() (from ops.dispatch())

Re-enqueued tasks use scheduler-defined policies (e.g., time slice, vtime, DSQ).
A callback must be registered with:

REGISTER_SCX_CGROUP_BW_ENQUEUE_CB(eqcb)

where:

int eqcb(u64 pid);

If a throttled task is dequeued before being unthrottled (e.g., due to sched_setaffinity(2)),
the scheduler should call scx_cgroup_bw_cancel() to maintain proper state.
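
Putting these pieces together, a hedged sketch of the callback, the
dispatch-side drain, and the dequeue-side cancellation (the function names,
SHARED_DSQ, and lookup_task_ctx() are illustrative):

/* Re-enqueue one previously throttled task once its cgroup's quota has
 * been replenished; the library invokes this registered callback. The
 * DSQ/slice policy here is up to the scheduler. */
static int sched_bw_enqueue_cb(u64 pid)
{
        struct task_struct *p = bpf_task_from_pid((s32)pid);

        if (!p)
                return -ESRCH;
        scx_bpf_dsq_insert(p, SHARED_DSQ, SCX_SLICE_DFL, 0);
        bpf_task_release(p);
        return 0;
}
REGISTER_SCX_CGROUP_BW_ENQUEUE_CB(sched_bw_enqueue_cb);

void BPF_STRUCT_OPS(sched_dispatch, s32 cpu, struct task_struct *prev)
{
        /* Bottom half of the replenish timer: move any tasks whose
         * cgroups were just unthrottled back to their DSQs. */
        scx_cgroup_bw_reenqueue();

        /* ... regular dispatch logic ... */
}

void BPF_STRUCT_OPS(sched_dequeue, struct task_struct *p, u64 deq_flags)
{
        struct task_ctx *taskc = lookup_task_ctx(p);    /* scheduler-defined */

        /* The task may leave scx (e.g., sched_setaffinity(2)) while
         * still parked in the backlog; drop its throttling state. */
        scx_cgroup_bw_cancel((u64)taskc);
}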

Reporting the Used Time

The scheduler must report each task's consumed execution time periodically (ops.tick()) and when the task stops (ops.stopping()) using scx_cgroup_bw_consume().
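
A minimal sketch of the reporting path (the last_reported_runtime bookkeeping
field and lookup_task_ctx() are assumptions of this sketch):

static void report_consumed(struct task_struct *p)
{
        struct task_ctx *taskc = lookup_task_ctx(p);    /* scheduler-defined */
        struct cgroup *cgrp = scx_bpf_task_cgroup(p);
        u64 now = p->se.sum_exec_runtime;

        /* Charge only the runtime accrued since the last report. */
        scx_cgroup_bw_consume(cgrp, now - taskc->last_reported_runtime);
        taskc->last_reported_runtime = now;
        bpf_cgroup_release(cgrp);
}

void BPF_STRUCT_OPS(sched_tick, struct task_struct *p)
{
        report_consumed(p);
}

void BPF_STRUCT_OPS(sched_stopping, struct task_struct *p, bool runnable)
{
        report_consumed(p);
}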

API Summary

See scheds/include/lib/cgroup.h for detailed documentation.

int scx_cgroup_bw_lib_init(struct scx_cgroup_bw_config *config);
int scx_cgroup_bw_init(struct cgroup *cgrp, struct scx_cgroup_init_args *args);
int scx_cgroup_bw_exit(struct cgroup *cgrp);
int scx_cgroup_bw_set(struct cgroup *cgrp, u64 period_us, u64 quota_us, u64 burst_us);
int scx_cgroup_bw_throttled(struct cgroup *cgrp);
int scx_cgroup_bw_consume(struct cgroup *cgrp, u64 consumed_ns);
int scx_cgroup_bw_put_aside(struct task_struct *p, u64 taskc, u64 vtime, struct cgroup *cgrp);
int scx_cgroup_bw_reenqueue(void);
int scx_cgroup_bw_cancel(u64 taskc);
#define REGISTER_SCX_CGROUP_BW_ENQUEUE_CB(eqcb)

4. Prerequisites for BPF Schedulers

Schedulers must allocate their per-task context (e.g., struct task_ctx in LAVD) in the BPF arena,
as the library internally uses ATQ (Arena-based Task Queue) for managing throttled tasks.

Each task context must begin with struct scx_task_common (defined in scheds/include/lib/atq.h)
or reserve equivalent space (struct atq_ctx in LAVD).
This structure embeds a red-black tree node and related state required by ATQ.
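
A hedged sketch of the required layout (only the placement of
scx_task_common is prescribed; the include path and every other field are
illustrative):

#include <lib/atq.h>    /* struct scx_task_common */

/* Per-task context, allocated in the BPF arena. scx_task_common must
 * be the first member: it embeds the red-black tree node and state
 * that ATQ uses to park throttled tasks. */
struct task_ctx {
        struct scx_task_common common;          /* must come first */

        /* Scheduler-specific fields follow (illustrative). */
        u64                    vtime;
        u64                    last_reported_runtime;
};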


5. Patch Structure

(1) ATQ / Rbtree Changes

  • lib/atq: factor out task insertion into scx_atq_insert_node
  • lib/rbtree: add noalloc/nofree variants of the API
  • lib/rbtree: turn rbtree_insert_mode from a per-insert into a per-tree attribute
  • lib/rbtree: adjust inlining to pass verification
  • lib/atq: only use embedded rbnodes on scx_atq_insert_*()
  • lib/rbtree: remove RB_ALLOC check out of rb_insert codepath and update selftests
  • lib/selftests: add selftests for embedded rbtree node
  • lib/rbtree: initialize node color to red for embedded nodes
  • include/lib: expose rb_integrity_check as public API
  • lib/rbtree: wipe the values of ->left and ->right in embedded rbnode_t instances before use
  • selftests/atq: expand and fix tests
  • lib/selftests: exclude cgroup_bw.bpf.c from selftests
  • lib/selftests: hardcode Clang version to be the system one
  • include: fix userspace-side spinlock definitions
  • lib/atq: change signature to require scx_task_common
  • lib/atq: add atq_remove_node call
  • include/atq: add ATQ lock/unlock calls
  • lib/atq: add unlocked insert ATQ calls

(2) cpu.max Library Implementation

  • lib: cgroup_bw: Add skeleton for CPU bandwidth control.
  • lib: cgroup_bw: Implement scx_cgroup_bw_lib_init().
  • lib: cgroup_bw: Implement scx_cgroup_bw_init().
  • lib: cgroup_bw: Implement scx_cgroup_bw_set().
  • lib: cgroup_bw: Implement scx_cgroup_bw_exit().
  • lib: cgroup_bw: Implement scx_cgroup_bw_throttled().
  • lib: cgroup_bw: Implement scx_cgroup_bw_consume().
  • lib: cgroup_bw: Implement scx_cgroup_bw_put_aside().
  • lib: cgroup_bw: Implement replenish timer and cbw_reenqueue_cgroup().
  • lib: cgroup_bw: Implement scx_cgroup_bw_cancel().

(3) LAVD Integration

  • scx_lavd: Initial integration with CPU bandwidth control.
  • scx_lavd: Move task_ctx to the BPF arena.
  • scx_lavd: Support cpu.max at enqueue-like paths.
  • scx_lavd: Implement ops.dequeue() for throttle cancellation.

(4) Scripts

  • scripts: Add a script to get a cgroup path from its ID.
  • scripts: Add a script to set up cgroups for basic testing of cpu.max.

6. Reference

[1] Cgroup v2 Documentation — CPU controller: https://docs.kernel.org/admin-guide/cgroup-v2.html


Signed-off-by: Changwoo Min <[email protected]>
Signed-off-by: Emil Tsalapatis <[email protected]>


@etsal left a comment


Short note: This feature will temporarily break schedulers that use the old ATQ API, notably p2dq. There will be a subsequent patch that fixes this issue.

CC @hodgesds


@htejun commented Nov 10, 2025

Let's land this one after the November release is cut.

etsal and others added 34 commits November 13, 2025
Define key data structures (scx_cgroup_ctx, scx_cgroup_llc_ctx)
to support CPU bandwidth control (cpu.max) in cgroup v2. In addition,
add an API skeleton for BPF schedulers.

Signed-off-by: Changwoo Min <[email protected]>
Signed-off-by: Emil Tsalapatis <[email protected]>
Integrate CPU bandwidth control with the scx_lavd scheduler.

The library is initialized (scx_cgroup_bw_lib_init) when the scheduler
is initialized. Also, ops.cgroup_init(), ops.cgroup_exit(), and
ops.cgroup_move() are implemented; scx_cgroup_bw_reenqueue() is called
at ops.dispatch(). A new option, `--enable-cpu-bw`, is added to enable
the feature. Finally, __nr_cpu_ids is replaced with nr_cpu_ids, which is
defined in the scx library.

Signed-off-by: Changwoo Min <[email protected]>
In order to use cpu.max, task_ctx must live in the BPF arena because
the cpu.max library internally uses ATQ to manage throttled tasks.
To this end, move task_ctx to the BPF arena.

More specifically, this includes:
  - rename struct task_ctx to task_ctx
  - move pid to main task_ctx
  - move task_ctx to arenas
  - mark functions taking a task context with __arg_arena

Signed-off-by: Emil Tsalapatis <[email protected]>
Signed-off-by: Changwoo Min <[email protected]>
scx_cgroup_bw_lib_init() initializes the configuration and the replenish timer.

Signed-off-by: Changwoo Min <[email protected]>
When a cgroup is initialized, the cgroup's context and its LLC contexts
are initialized, and its parent becomes a non-leaf cgroup. Unless the
parent is threaded, it can no longer have tasks, so we delete the
parent's LLC contexts.

Signed-off-by: Changwoo Min <[email protected]>
Signed-off-by: Emil Tsalapatis <[email protected]>
When a cgroup's bandwidth is updated, we must also update the nquota_lb
of all its descendants.

Signed-off-by: Changwoo Min <[email protected]>
Destroy the cgroup context and its LLC contexts, and drain and free the
BTQs associated with the LLC contexts.

Signed-off-by: Changwoo Min <[email protected]>
We test whether a cgroup is throttled in a bottom-up manner: first at
the LLC level, then at the cgroup level, and finally at the subroot
cgroup level. Before requesting budget from the subroot cgroup, update
this cgroup's runtime_total_sloppy to avoid spending beyond its upper
bound (nquota_ub). We traverse the cgroup hierarchy in post order
(left, right, then root), checking each cgroup's level to efficiently
sum the runtime_total of all its descendants.

Signed-off-by: Changwoo Min <[email protected]>
After executing a task, we update runtime_total and compensate
budget_remaining_before by comparing the planned vs. actual time usage.

Signed-off-by: Changwoo Min <[email protected]>
When a cgroup is throttled (i.e., scx_cgroup_bw_reserve() returns -EAGAIN),
a task that is in the ops.enqueue() path should be put aside to the BTQ of
its associated LLC context. When the cgroup becomes unthrottled again, the
registered enqueue_cb() will be called to re-enqueue the task for execution.

Signed-off-by: Changwoo Min <[email protected]>
Signed-off-by: Emil Tsalapatis <[email protected]>
We replenish each cgroup's budget every 100 ms (nperiod). Only subroot
cgroups (level == 1) distribute budget to their descendants, and we
replenish budget only for subroot cgroups with a limited quota. Unused
budget can accumulate up to the specified burst; overused budget is
charged back over subsequent intervals.

The replenish timer is split into two parts: the top half and the bottom
half. The top half -- the actual BPF timer function (replenish_timerfn)
-- runs the essential, critical part, such as refilling the time budget.
On the other hand, the bottom half -- scx_cgroup_bw_reenqueue() -- runs
on a BPF scheduler's ops.dispatch() and requeues the backlogged tasks to
proper DSQs.

Signed-off-by: Changwoo Min <[email protected]>
Signed-off-by: Emil Tsalapatis <[email protected]>
Tasks may be dequeued from the BPF side by the scx core during system calls
like sched_setaffinity(2). In that case, we must cancel any
throttling-related ATQ insert operations for the task:
  - We must avoid double inserts caused by the dequeued task being
    re-enqueued and throttled again while still in an ATQ.
  - We want to remove tasks that are no longer in scx from throttling.
    While inserting non-scx tasks into a DSQ is a no-op, we would like
    our accounting to be as accurate as possible.

Signed-off-by: Emil Tsalapatis <[email protected]>
Signed-off-by: Changwoo Min <[email protected]>
We update the ops.enqueue(), enqueue-callback, ops.dispatch(), and
ops.tick() paths.

Under CPU bandwidth control using cpu.max, we first check whether the
task's cgroup is throttled. If not, we proceed as usual; otherwise, we
set the task aside for later execution.

Also, we report how much time was actually consumed at ops.tick() and
ops.stopping().

Note that we do not throttle the scheduler process itself to guarantee
forward progress.

Signed-off-by: Changwoo Min <[email protected]>
A throttled task may be dequeued by the sched_ext core for various reasons,
including sched_setaffinity(2), task migration, and so on. Such a task will
eventually be enqueued again, corrupting the internal red-black tree state
of the ATQ it was parked in. To avoid this, we remove the dequeuing task
from its ATQ, if it is in one.

Signed-off-by: Emil Tsalapatis <[email protected]>
Signed-off-by: Changwoo Min <[email protected]>
Add a utility script, cgpath.sh, that takes a cgroup ID as a
command-line argument and returns the full path of the cgroup.
The cgroup ID is the inode number of the cgroup. This is for easy
debugging of the cpu.max support.

Signed-off-by: Changwoo Min <[email protected]>
@multics69 merged commit b7f394c into sched-ext:main Nov 13, 2025
2 of 4 checks passed
@JakeHillion commented

> Short note: This feature will temporarily break schedulers that use the old ATQ API, notably p2dq. There will be a subsequent patch that fixes this issue.

This is kind of a pain; CI is failing on main because of this. Can we please make sure to land the fixes today?

@sirlucjan commented

This is the fix: #3038 @JakeHillion
