
scx_lavd: Revise load-balancing conditions #3300

Open
bboymimi wants to merge 4 commits into sched-ext:main from bboymimi:scx_lavd/revise-lb-conditions

Conversation

@bboymimi commented Feb 8, 2026

LAVD's periodic load balancer runs unconditionally, even when the system is lightly loaded and balancing provides no benefit. This wastes CPU cycles and can cause unnecessary task migrations that hurt cache locality.

This series makes load balancing more selective by skipping it when the system is idle enough, and by simplifying the dispatch path at very low utilization where deadline ordering adds overhead without benefit. Task stealing at ops.dispatch() and task migration at ops.select_cpu() remain active for work conservation and handling bursty workloads.

Two new tunable command-line options are introduced (the BPF-side skip check is sketched after the list):

  • --lb-low-util-pct (default 25): skip periodic LB below this utilization percentage
  • --lb-local-dsq-util-pct (default 10): bypass deadline scheduling and use FIFO local DSQ below this utilization percentage
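
A minimal sketch of the skip check on the BPF side, assembled from the diff hunks quoted in the review below; the surrounding code and exact placement are simplified:

    /* In plan_x_cpdom_migration(): skip periodic load balancing when the
     * system is lightly loaded; task stealing and select_cpu() migration
     * elsewhere keep the scheduler work-conserving. */
    if (lb_low_util_pct > 0 &&
        sys_stat.avg_util < ((u64)lb_low_util_pct << LAVD_SHIFT) / 100)
            goto reset_and_skip_lb;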

In the current implementation, plan_x_cpdom_migration() uses cur_util_sum,
which fluctuates significantly between sampling periods, when
mig_delta_pct is not set. Since periodic load balancing aims for stable,
long-term balance and shouldn't react to momentary spikes, let's use
avg_util_sum for the scaled load calculation.

Signed-off-by: Gavin Guo <gavinguo@igalia.com>
Extract the stealer/stealee reset logic in plan_x_cpdom_migration() into
a shared reset_and_skip_lb goto label at the end of the function. This
is a pure refactoring with no behavior change, preparing shared code for
subsequent patches that add additional early-exit conditions for
periodic load balancing.

Signed-off-by: Gavin Guo <gavinguo@igalia.com>
Add a tunable --lb-low-util-pct parameter (default: 25) that skips
periodic load balancing when average system utilization is below the
specified percentage. At low utilization, there is plenty of idle
capacity across LLC domains, making cross-domain balancing unnecessary.

The threshold is calculated as: avg_util < (lb_low_util_pct << 10) / 100.

Set to 0 to disable, or 100 to always skip periodic LB.

Signed-off-by: Gavin Guo <gavinguo@igalia.com>
When system utilization is very low, most CPUs are idle,
deadline-based ordering adds overhead without benefit, and FIFO
ordering on the local DSQ is sufficient. This patch adds a tunable
threshold (--lb-local-dsq-util-pct, default 10%) below which
tasks are dispatched directly to the local DSQ rather than using
vtime-based scheduling.

Set to 0 to disable, 100 to always bypass deadline scheduling.

Signed-off-by: Gavin Guo <gavinguo@igalia.com>
Contributor

@multics69 left a comment

Thanks @bboymimi ! Overall, it looks good to me. I left a few comments to improve readability.

 * LLC domains is unnecessary since there is plenty of idle capacity.
 */
if (lb_low_util_pct > 0 &&
    sys_stat.avg_util < ((u64)lb_low_util_pct << LAVD_SHIFT) / 100)
Contributor

Please use the p2s() macro for ((u64)lb_low_util_pct << LAVD_SHIFT) / 100).

Instead of calculating the same computation again and again, can we pre-calculate the value when initializing the scheduler (on the rust side)?

Author

Yes, I will prepare another version to pre-calculate the value on the Rust side.
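
For illustration, a sketch of how the pre-calculated variant could look on the BPF side; the global name lb_low_util_thresh is hypothetical and would be filled in by the Rust loader at init (e.g. as (pct << LAVD_SHIFT) / 100):

    /* Hypothetical threshold, computed once at initialization. */
    const volatile u64 lb_low_util_thresh;

    /* The periodic LB path then reduces to a single comparison. */
    if (lb_low_util_thresh && sys_stat.avg_util < lb_low_util_thresh)
            goto reset_and_skip_lb;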

-	if (is_idle && !queued_on_cpu(cpuc)) {
+	if ((is_idle && !queued_on_cpu(cpuc)) ||
+	    (lb_local_dsq_util_pct > 0 &&
+	     sys_stat.avg_util < ((u64)lb_local_dsq_util_pct << LAVD_SHIFT) / 100)) {
Contributor

Ditto. Please use p2s().

Instead of calculating the same computation again and again, can we pre-calculate the value when initializing the scheduler (on the rust side)?

Author

Will submit another revision.

/// scheduling. When set to a non-zero value, tasks are dispatched directly
/// to the local DSQ (FIFO) instead of using deadline-based ordering when
/// average system utilization is below this percentage. The threshold is
/// calculated as: avg_util < (lb-local-dsq-util-pct << 10) / 100.
Contributor

Adding avg_util < (lb-local-dsq-util-pct << 10) / 100 to the help message is too much implementation detail. Please keep the description high-level, from a user's point of view.

Author

Will provide a user-friendly version.

/// Low utilization threshold percentage (0-100) for periodic load balancing.
/// When set to a non-zero value, periodic load balancing is skipped when
/// average system utilization is below this percentage. The threshold is
/// calculated as: avg_util < (lb-low-util-pct << 10) / 100.
Contributor

avg_util < (lb-low-util-pct << 10) / 100 is too low-level for the help output. Please keep it high-level for users.

Author

Acked. Will provide another user-friendly version.


 	/*
-	 * Use avg_util_sum when mig_delta_pct is set, otherwise use cur_util_sum.
+	 * Use avg_util_sum for stable load balancing decisions.
Contributor

Please also update the description of the mig-delta-pct option:

    /// Migration delta threshold percentage (0-100). When set to a non-zero value,
    /// uses average utilization for threshold calculation instead of current
    /// utilization, and the threshold is calculated as: avg_load * (mig-delta-pct / 100).

Author

Thanks for the review. Will do.

-		util = (cpdomc->avg_util_sum << LAVD_SHIFT) / cpdomc->nr_active_cpus;
-	else
-		util = (cpdomc->cur_util_sum << LAVD_SHIFT) / cpdomc->nr_active_cpus;
+	util = (cpdomc->avg_util_sum << LAVD_SHIFT) / cpdomc->nr_active_cpus;
Contributor

When I was experimenting with this earlier, I recall that avg_util_sum might've converged a bit too slowly. I think we might need some better functions to accumulate/decay avg_util_sum in a more nuanced way. Do you have any benchmark results that could highlight the impact of this change?

Author

Great point, David. The avg_util_sum is calculated from avg_util using
calc_asym_avg(), which applies different decay rates depending on the
direction of change: it increases fast and decreases slowly (updated every 10ms):

Load increasing 0% to 100% (decay=2):

__calc_avg(new_val=target, old_val=current, 2)

Step-by-step (target=1024, representing 100% in LAVD scale):
10ms: 512 (50.0%)
20ms: 896 (87.5%)
30ms: 992 (96.9%)
40ms: 1016 (99.2%)
50ms: 1022 (99.8%)

Load decreasing 100% to 0% (decay=3):

__calc_avg(old_val=current, new_val=target, 3)

Step-by-step:
10ms: 896 (87.5%)
20ms: 784 (76.6%)
30ms: 686 (67.0%)
50ms: 525 (51.3%)
100ms: 269 (26.3%)
150ms: 156 (15.2%)
200ms: 71 ( 6.9%)

So upward convergence reaches ~97% of the target within 30ms, while downward
convergence takes ~50ms just to roughly halve.
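
For reference, a small sketch of the asymmetric averaging described above, assuming __calc_avg(a, b, shift) weights its first argument by (1 - 1/2^shift) and the second by 1/2^shift; the real LAVD helpers may differ in detail:

    typedef unsigned long long u64;  /* as in the BPF code */

    static u64 __calc_avg(u64 a, u64 b, u64 shift)
    {
            return (a - (a >> shift)) + (b >> shift);
    }

    static u64 calc_asym_avg(u64 old_avg, u64 new_val)
    {
            /* Rising load: weight the new sample at 3/4, fast ramp-up. */
            if (new_val >= old_avg)
                    return __calc_avg(new_val, old_avg, 2);
            /* Falling load: weight the old average at 7/8, slow decay. */
            return __calc_avg(old_avg, new_val, 3);
    }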

I was thinking of two pieces of information that would be helpful:

  1. Latency tolerance:
    Could you share the acceptable reaction time or latency for load-balancer
    decisions in these scenarios? If utilization shifts, how many ms of lag are
    tolerable before the load-balance state should reflect the new reality? This
    determines how quickly the stealer/stealee roles are reclassified, and
    therefore how slow a load-balancing decision is acceptable.

  2. Benchmarking methodology:
    When you observed the slow convergence earlier, what workload or test scenario
    were you using to reproduce it? Or should we just use DCPerf to simulate the
    environment?

Currently, I don't have benchmark results yet. With a latency-tolerance
target, we could, for example, use a symmetric decay=2 EWMA, a more
aggressive decay=1, or design a more nuanced approach.

Contributor

Thanks for doing the detailed analysis Gavin!

  1. The tolerance for this is very dependent on the workload. I don't have an exact answer for you here.
  2. It was for internal work similar to the Django bench; it would be interesting to tune this and see its effect on the benchmarks. I'd recommend turning forced stealing off for experimentation, since that makes load balancing completely unpredictable.

 	 * to enable vtime comparison across DSQs during dispatch.
 	 */
-	if (is_idle && !queued_on_cpu(cpuc)) {
+	if ((is_idle && !queued_on_cpu(cpuc)) ||
Contributor

I recall that at low utilization, most tasks get dispatched via direct_dispatch on the select_cpu path anyway (it's been a while since I instrumented this, so my view might be outdated). I'm curious how often this actually gets exercised at low util.

Author

Good observation. You're right that tasks get directly dispatched to the local DSQ in select_cpu when an idle CPU with an empty DSQ is found, which bypasses enqueue entirely. The main benefit of the lb_local_dsq_util_pct bypass is avoiding the overhead of vtime-based deadline ordering when it provides no scheduling advantage. At low utilization, maintaining deadline order via scx_bpf_dsq_insert_vtime() adds cost (vtime computation, ordered insertion) without improving scheduling efficiency; dispatching FIFO via scx_bpf_dsq_insert() to the local DSQ is cheaper and sufficient in this condition.
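
A rough sketch of the two dispatch paths being compared (the condition comes from the diff above; slice_ns, dsq_id, and enq_flags are placeholders, and p->scx.dsq_vtime stands in for LAVD's computed deadline):

    if ((is_idle && !queued_on_cpu(cpuc)) ||
        (lb_local_dsq_util_pct > 0 &&
         sys_stat.avg_util < ((u64)lb_local_dsq_util_pct << LAVD_SHIFT) / 100)) {
            /* Low utilization: cheap FIFO insertion into the local DSQ. */
            scx_bpf_dsq_insert(p, SCX_DSQ_LOCAL, slice_ns, enq_flags);
    } else {
            /* Otherwise keep deadline (vtime) ordering on the shared DSQ. */
            scx_bpf_dsq_insert_vtime(p, dsq_id, slice_ns, p->scx.dsq_vtime,
                                     enq_flags);
    }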

I think you're right that we should add instrumentation to measure how often the direct-dispatch path triggers and how much benefit the strategy brings. If the numbers show it's negligible in practice, we could drop the patch or adjust the default threshold. While I run the tracing, I'd appreciate suggestions on which measurements you would expect for the evaluation.

Since I'm currently on Lunar New Year vacation, my time for work is limited. Sorry for any inconvenience.

Contributor

No rush, I suspect this will be fairly negligible, but any optimization is nice. Overall, I appreciate you looking into this.
