Add cpu.max Support for the SCX Scheduler
#3026
Conversation
Short note: This feature will temporarily break schedulers that use the old ATQ API, notably p2dq. There will be a subsequent patch that fixes this issue.
CC @hodgesds
Let's land this one after the November release is cut.
Define key data structures (scx_cgroup_ctx, scx_cgroup_llc_ctx) to support CPU bandwidth control (cpu.max) in cgroup v2. In addition, add an API skeleton for BPF schedulers. Signed-off-by: Changwoo Min <[email protected]> Signed-off-by: Emil Tsalapatis <[email protected]>
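The commit messages below reference several fields of these structures (nquota_lb, nquota_ub, runtime_total, runtime_total_sloppy, budget_remaining_before). As a rough, hypothetical sketch assembled only from those mentions, not the actual definitions in the library:

```c
/*
 * Hypothetical sketch only, assembled from field names mentioned in
 * the commit messages below; the real definitions in
 * scheds/include/lib/cgroup.h differ.
 */
struct scx_cgroup_ctx {
	u64 nquota_lb;			/* effective quota (lower bound) */
	u64 nquota_ub;			/* spending upper bound */
	u64 runtime_total;		/* runtime consumed this period */
	u64 runtime_total_sloppy;	/* lazily aggregated descendant runtime */
	u64 budget_remaining_before;	/* planned budget, compensated later */
};

struct scx_cgroup_llc_ctx {
	void *btq;	/* backlog (throttled-task) queue; type hypothetical */
};
```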
Integrate the CPU bandwidth control with the scx_lavd scheduler. The library is initialized (scx_cgroup_bw_lib_init) when the scheduler is initialized. Also, ops.cgroup_init(), ops.cgroup_exit(), and ops.cgroup_move() are implemented, and scx_cgroup_bw_reenqueue() is called at ops.dispatch(). A new option, `--enable-cpu-bw`, is added to enable the feature. Finally, replace __nr_cpu_ids with nr_cpu_ids defined in the scx library. Signed-off-by: Changwoo Min <[email protected]>
In order to use cpu.max, task_ctx should be in the BPF arena because the cpu.max library internally uses ATQ to manage throttled tasks. To this end, let's move the task_ctx to the BPF arena. More specifically, this includes:
- rename struct task_ctx to task_ctx
- move pid to the main task_ctx
- move task_ctx to arenas
- mark functions taking a task context with __arg_arena
Signed-off-by: Emil Tsalapatis <[email protected]> Signed-off-by: Changwoo Min <[email protected]>
scx_cgroup_bw_lib_init() first initializes the config and the replenish timer. Signed-off-by: Changwoo Min <[email protected]>
When a cgroup is initialized, the cgroup's context and its LLC contexts are initialized. Also, its parent now becomes non-leaf. If its parent is not threaded, it cannot have tasks, so we delete its LLC contexts. Signed-off-by: Changwoo Min <[email protected]> Signed-off-by: Emil Tsalapatis <[email protected]>
When a cgroup's bandwidth is updated, we should update the nquota_lb of all its descendants too. Signed-off-by: Changwoo Min <[email protected]>
Destroy the cgroup context and its LLC contexts, and drain and free the BTQs associated with the LLC contexts. Signed-off-by: Changwoo Min <[email protected]>
We test whether a cgroup is throttled in a bottom-up manner: first at the LLC level, then at the cgroup level, and finally at the subroot cgroup level. Before requesting budget from the subroot cgroup, update this cgroup's runtime_total_sloppy to avoid spending budget beyond its upper bound (nquota_ub). We traverse the cgroup hierarchy in post-order (left, right, then root) and check each cgroup's level to efficiently sum the runtime_total of all its descendants. Signed-off-by: Changwoo Min <[email protected]>
After executing a task, we update runtime_total and compensate for budget_remaining_before by comparing the planned vs. actual time usage. Signed-off-by: Changwoo Min <[email protected]>
When a cgroup is throttled (i.e., scx_cgroup_bw_reserve() returns -EAGAIN), a task that is in the ops.enqueue() path should be put aside into the BTQ of its associated LLC context. When the cgroup becomes unthrottled again, the registered enqueue_cb() will be called to re-enqueue the task for execution. Signed-off-by: Changwoo Min <[email protected]> Signed-off-by: Emil Tsalapatis <[email protected]>
We replenish each cgroup's budget every 100ms (nperiod). Only subroot cgroups (level == 1) distribute the budget to their descendants; we only replenish the budget of a subroot cgroup with a limited quota. Underused budget can accumulate up to the specified burst. On the other hand, overused budget will be charged over subsequent intervals. The replenish timer is split into two parts: the top half and the bottom half. The top half -- the actual BPF timer function (replenish_timerfn) -- runs the essential, critical work, such as refilling the time budget. The bottom half -- scx_cgroup_bw_reenqueue() -- runs in a BPF scheduler's ops.dispatch() and requeues the backlogged tasks to the proper DSQs. Signed-off-by: Changwoo Min <[email protected]> Signed-off-by: Emil Tsalapatis <[email protected]>
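As a minimal sketch of the top-half pattern described here (replenish_timerfn() is named in this commit; the body, map context, and re-arm details shown are assumptions):

```c
#include <scx/common.bpf.h>

/*
 * Sketch only: the real replenish_timerfn() logic is more involved.
 */
static int replenish_timerfn(void *map, int *key, struct bpf_timer *timer)
{
	/* Top half: refill budgets for subroot cgroups (level == 1),
	 * carrying over unused budget up to the configured burst. */

	/* Re-arm for the next period (nperiod, 100ms). */
	bpf_timer_start(timer, 100ULL * 1000 * 1000, 0);
	return 0;
}

/* Bottom half: the scheduler calls scx_cgroup_bw_reenqueue() from its
 * ops.dispatch() to move backlogged tasks back to their DSQs. */
```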
Tasks may be dequeued from the BPF side by the scx core during system calls like sched_setaffinity(2). In that case, we must cancel any throttling-related ATQ insert operations for the task:
- We must avoid double inserts caused by the dequeued task being re-enqueued and throttled again while still in an ATQ.
- We want to remove tasks that are no longer in scx from throttling. While inserting non-scx tasks into a DSQ is a no-op, we would like our accounting to be as accurate as possible.
Signed-off-by: Emil Tsalapatis <[email protected]> Signed-off-by: Changwoo Min <[email protected]>
We update the ops.enqueue(), enqueue callback, ops.dispatch(), and ops.tick() paths. Under CPU bandwidth control using cpu.max, we should first check whether the cgroup is throttled. If not, we proceed as usual. Otherwise, we should set the task aside for later execution. Also, we report how much time was actually consumed at ops.tick() and ops.stopping(). Note that we do not throttle the scheduler process itself, to guarantee forward progress. Signed-off-by: Changwoo Min <[email protected]>
A throttled task may be dequeued by the sched_ext core for various reasons, including sched_setaffinity(2), task migration, and so on. Eventually, such a task will be enqueued again, breaking the internal red-black tree state of the ATQ it was on. To avoid this, we should remove the dequeued task from the ATQ if it is in one. Signed-off-by: Emil Tsalapatis <[email protected]> Signed-off-by: Changwoo Min <[email protected]>
Add a utility script, cgpath.sh, that takes a cgroup ID as a command-line argument and returns the full path of the cgroup. The cgroup ID is the inode number of the cgroup. This is for easy debugging of the cpu.max support. Signed-off-by: Changwoo Min <[email protected]>
This is kind of a pain; CI is failing on main because of this. Can we please make sure to land the fixes today?
This is the fix: #3038 @JakeHillion
This patch set adds `cpu.max` support [1] to the SCX scheduler. It introduces a new library for CPU bandwidth control and integrates it with LAVD.

The series consists of four major parts:

- ATQ / rbtree changes
- `cpu.max` library implementation
- LAVD integration
- Scripts

1. Overview of the cpu.max Feature

The `cpu.max` interface controls CPU bandwidth for non-root cgroups using three parameters: `$QUOTA`, `$PERIOD`, and `$BURST`. A cgroup may consume up to `$QUOTA` runtime in each `$PERIOD`. Using `max` for `$QUOTA` removes the limit. At the end of each period, unused time is discarded unless `$BURST` is specified, allowing carryover up to `$BURST`.

2. Design Overview
Our implementation follows the cgroup v2 `cpu.max` semantics but differs from the in-kernel CFS version for efficiency and simplicity.

(1) Interpreting Quota and Period as CPU Utilization

We treat `quota / period` as CPU utilization rather than time-based replenishment. For example, a cgroup whose quota is one tenth of its period may use up to 10% of a single CPU.
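As an illustration of this interpretation, a hypothetical helper (not library code) can decide over-use by comparing the two ratios without division:

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Illustration only: under the utilization interpretation, a cgroup is
 * over its bandwidth when runtime/window exceeds quota/period.
 */
static bool over_bandwidth(uint64_t runtime_ns, uint64_t window_ns,
			   uint64_t quota_ns, uint64_t period_ns)
{
	if (!quota_ns)	/* treat 0 as "max": no limit configured */
		return false;

	/* runtime/window > quota/period, rearranged to avoid division */
	return runtime_ns * period_ns > window_ns * quota_ns;
}
```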
All cgroups are replenished by a single BPF timer, avoiding one timer per cgroup.
(2) Eventual Quota Enforcement via Task Admission Control

To reduce overhead in the critical path, we enforce quota eventually rather than immediately. Tasks from throttled cgroups are deferred to a backlog queue instead of a DSQ. When quota is replenished by the BPF timer, deferred tasks are moved back to their DSQ, ensuring zero additional overhead in `ops.dispatch()`.
3. API for BPF Schedulers
BPF schedulers interact with the library using the following API.
Initialization
- `scx_cgroup_bw_lib_init()` → called in `ops.init()`
- `scx_cgroup_bw_init()` / `scx_cgroup_bw_exit()` → called in `ops.cgroup_init()` / `ops.cgroup_exit()`
- `scx_cgroup_bw_set()` → called when `cpu.max` parameters change
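A minimal sketch of this wiring, assuming a scheduler named `sched` and guessing the `scx_cgroup_bw_*()` argument lists (see `scheds/include/lib/cgroup.h` for the real signatures):

```c
#include <scx/common.bpf.h>

s32 BPF_STRUCT_OPS_SLEEPABLE(sched_init)
{
	/* One-time library setup: config and the replenish timer. */
	return scx_cgroup_bw_lib_init();
}

s32 BPF_STRUCT_OPS_SLEEPABLE(sched_cgroup_init, struct cgroup *cgrp,
			     struct scx_cgroup_init_args *args)
{
	return scx_cgroup_bw_init(cgrp);
}

void BPF_STRUCT_OPS(sched_cgroup_exit, struct cgroup *cgrp)
{
	scx_cgroup_bw_exit(cgrp);
}

/* scx_cgroup_bw_set() would be called analogously from the callback
 * invoked when a cgroup's cpu.max parameters change. */
```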
Runtime Checks

Before enqueueing (`ops.enqueue()`) or selecting a CPU (`ops.select_cpu()`), call `scx_cgroup_bw_throttled()` to check if the cgroup is throttled.

If throttled:

- put the task aside with `scx_cgroup_bw_put_aside()`
- re-enqueue it later with `scx_cgroup_bw_reenqueue()` (from `ops.dispatch()`)

Re-enqueued tasks use scheduler-defined policies (e.g., time slice, vtime, DSQ).
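A minimal sketch of these two paths, assuming a `SHARED_DSQ` and guessed argument lists for the `scx_cgroup_bw_*()` calls:

```c
#include <scx/common.bpf.h>

void BPF_STRUCT_OPS(sched_enqueue, struct task_struct *p, u64 enq_flags)
{
	struct cgroup *cgrp = scx_bpf_task_cgroup(p);
	bool throttled = scx_cgroup_bw_throttled(cgrp);

	bpf_cgroup_release(cgrp);
	if (throttled) {
		/* Defer to the backlog queue instead of a DSQ. */
		scx_cgroup_bw_put_aside(p);
		return;
	}
	scx_bpf_dsq_insert(p, SHARED_DSQ, SCX_SLICE_DFL, enq_flags);
}

void BPF_STRUCT_OPS(sched_dispatch, s32 cpu, struct task_struct *prev)
{
	/* Bottom half of the replenish timer: requeue backlogged tasks. */
	scx_cgroup_bw_reenqueue();
	scx_bpf_dsq_move_to_local(SHARED_DSQ);
}
```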
A callback must be registered so the library can hand throttled tasks back to the scheduler: when a cgroup becomes unthrottled, the registered `enqueue_cb()` is invoked to re-enqueue each deferred task.
If a throttled task is dequeued before being unthrottled (e.g., due to `sched_setaffinity(2)`), the scheduler should call `scx_cgroup_bw_cancel()` to maintain proper state.
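A minimal sketch, again with an assumed argument list:

```c
#include <scx/common.bpf.h>

void BPF_STRUCT_OPS(sched_dequeue, struct task_struct *p, u64 deq_flags)
{
	/* Drop the task from its backlog ATQ if it was put aside. */
	scx_cgroup_bw_cancel(p);
}
```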
Reporting the Used Time

A task must report its consumed execution time periodically (in `ops.tick()`) and when it stops (in `ops.stopping()`) using `scx_cgroup_bw_consume()`.
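A minimal sketch of the reporting path; `lookup_task_ctx()` and the `last_runtime` field are hypothetical, as is the `scx_cgroup_bw_consume()` argument list:

```c
#include <scx/common.bpf.h>

void BPF_STRUCT_OPS(sched_stopping, struct task_struct *p, bool runnable)
{
	struct task_ctx *taskc = lookup_task_ctx(p);	/* hypothetical helper */
	u64 used = p->se.sum_exec_runtime - taskc->last_runtime;

	/* Report the runtime consumed since the last report. */
	scx_cgroup_bw_consume(p, used);
	taskc->last_runtime = p->se.sum_exec_runtime;
}
```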
API Summary

See `scheds/include/lib/cgroup.h` for detailed documentation.

4. Prerequisites for BPF Schedulers
Schedulers must allocate their per-task context (e.g., `struct task_ctx` in LAVD) in the BPF arena, as the library internally uses ATQ (Arena-based Task Queue) for managing throttled tasks.
Each task context must begin with `struct scx_task_common` (defined in `scheds/include/lib/atq.h`) or reserve equivalent space (`struct atq_ctx` in LAVD). This structure embeds a red-black tree node and related state required by ATQ.
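A minimal sketch of that layout (the fields after the common header are hypothetical scheduler state):

```c
#include <lib/atq.h>	/* struct scx_task_common */

/*
 * The embedded scx_task_common must be the first member so ATQ can
 * treat a pointer to task_ctx as a queue node.
 */
struct task_ctx {
	struct scx_task_common common;	/* must come first */

	/* scheduler-private state follows (names hypothetical) */
	u64 vtime;
	u32 llc_id;
};
```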
5. Patch Structure
(1) ATQ / Rbtree Changes
(2) `cpu.max` Library Implementation
(3) LAVD Integration
(4) Scripts
6. Reference
[1] Cgroup v2 Documentation — CPU controller: https://docs.kernel.org/admin-guide/cgroup-v2.html
Signed-off-by: Changwoo Min [email protected]
Signed-off-by: Emil Tsalapatis [email protected]