scx_lavd: Revise load-balancing conditions #3300
bboymimi wants to merge 4 commits into sched-ext:main
Conversation
In the current implementation, plan_x_cpdom_migration() uses cur_util_sum, which fluctuates significantly between sampling periods, when mig_delta_pct is not set. Since periodic load balancing aims for stable, long-term balance and shouldn't chase momentary spikes, let's use avg_util_sum for the scaled load calculation. Signed-off-by: Gavin Guo <gavinguo@igalia.com>
Extract the stealer/stealee reset logic in plan_x_cpdom_migration() into a shared reset_and_skip_lb goto label at the end of the function. This is a pure refactoring with no behavior change, preparing shared code for subsequent patches that will add additional early-exit conditions for periodic load balancing. Signed-off-by: Gavin Guo <gavinguo@igalia.com>
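As a rough sketch of the resulting shape (field names, the domain array, and the no_balancing_needed() helper are illustrative stand-ins, not the exact scx_lavd code):

```c
#include <stdbool.h>

/* Minimal stand-ins so the sketch is self-contained; the real structures are richer. */
#define LAVD_CPDOM_MAX_NR 64

struct cpdom_ctx {
	bool is_stealer;	/* this domain may steal from others */
	bool is_stealee;	/* others may steal from this domain */
};

static struct cpdom_ctx cpdom_ctxs[LAVD_CPDOM_MAX_NR];
static int nr_cpdoms;

static bool no_balancing_needed(void)
{
	return false;		/* placeholder for the early-exit checks added later */
}

static void plan_x_cpdom_migration(void)
{
	/* Early-exit conditions added by the subsequent patches jump here. */
	if (no_balancing_needed())
		goto reset_and_skip_lb;

	/* ... normal stealer/stealee planning runs here ... */
	return;

reset_and_skip_lb:
	/* Shared cleanup: no stale stealer/stealee roles survive a skipped round. */
	for (int i = 0; i < nr_cpdoms; i++) {
		cpdom_ctxs[i].is_stealer = false;
		cpdom_ctxs[i].is_stealee = false;
	}
}
```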
Add a tunable --lb-low-util-pct parameter (default: 25) that skips periodic load balancing when average system utilization is below the specified percentage. At low utilization, there is plenty of idle capacity across LLC domains, making cross-domain balancing unnecessary. The threshold is calculated as: avg_util < (lb_low_util_pct << 10) / 100. Set to 0 to disable, or 100 to always skip periodic LB. Signed-off-by: Gavin Guo <gavinguo@igalia.com>
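Concretely, with LAVD_SHIFT = 10 (a 1024-based scale), the default of 25 gives a cutoff of (25 << 10) / 100 = 256, so periodic LB is skipped while avg_util stays below 256 out of 1024.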
When system utilization is very low, most CPUs are idle, deadline-based ordering adds overhead without benefit, and FIFO ordering on the local DSQ is sufficient. This patch adds a tunable threshold (--lb-local-dsq-util-pct, default 10%) below which tasks are dispatched directly to the local DSQ rather than using vtime-based scheduling. Set to 0 to disable, 100 to always bypass deadline scheduling. Signed-off-by: Gavin Guo <gavinguo@igalia.com>
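A minimal sketch of that dispatch decision, assuming the usual scx BPF program context (vmlinux.h, scx/common.bpf.h) and that the slice, vtime, and pre-scaled cutoff are computed elsewhere; this is illustrative, not the exact patch:

```c
/* Choose FIFO vs. vtime insertion depending on average system utilization. */
static void queue_task(struct task_struct *p, u64 dsq_id, u64 slice, u64 vtime,
		       u64 enq_flags, u64 avg_util, u64 local_dsq_util_cutoff)
{
	if (local_dsq_util_cutoff > 0 && avg_util < local_dsq_util_cutoff) {
		/*
		 * Very low system utilization: deadline ordering buys little,
		 * so enqueue the task FIFO on the CPU's local DSQ.
		 */
		scx_bpf_dsq_insert(p, SCX_DSQ_LOCAL, slice, enq_flags);
		return;
	}

	/* Otherwise keep deadline (vtime) ordering on the shared DSQ. */
	scx_bpf_dsq_insert_vtime(p, dsq_id, slice, vtime, enq_flags);
}
```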
 * LLC domains is unnecessary since there is plenty of idle capacity.
 */
if (lb_low_util_pct > 0 &&
    sys_stat.avg_util < ((u64)lb_low_util_pct << LAVD_SHIFT) / 100)
Please use the p2s() macro for ((u64)lb_low_util_pct << LAVD_SHIFT) / 100.
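For reference, a minimal sketch of how the condition would read with such a macro; the definition of p2s() is assumed here to be exactly the expression it replaces (LAVD_SHIFT being 10, per the help text elsewhere in the PR):

```c
#include <stdbool.h>
typedef unsigned long long u64;

#define LAVD_SHIFT	10
/* Assumed semantics of p2s(): percentage -> LAVD fixed-point scale. */
#define p2s(pct)	(((u64)(pct) << LAVD_SHIFT) / 100)

/* With that, the new check from the diff above becomes: */
static bool should_skip_periodic_lb(u64 avg_util, u64 lb_low_util_pct)
{
	return lb_low_util_pct > 0 && avg_util < p2s(lb_low_util_pct);
}
```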
Instead of doing the same computation again and again, can we pre-calculate the value when initializing the scheduler (on the Rust side)?
Yes, I will prepare another version to pre-calculate the value on the Rust side.
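One possible shape for that, sketched from the BPF side (the variable and helper names are hypothetical): the Rust loader would compute the scaled cutoff once at startup and write it into rodata, so the hot path only compares against a constant. Assumes the usual BPF program context where u64/bool come from vmlinux.h:

```c
/*
 * Set once by the userspace loader at init, e.g. to
 * (lb_low_util_pct << LAVD_SHIFT) / 100; zero disables the early exit.
 */
const volatile u64 lb_low_util_cutoff;

static bool below_lb_low_util(u64 avg_util)
{
	/* No shift/division on the hot path, just a compare. */
	return lb_low_util_cutoff > 0 && avg_util < lb_low_util_cutoff;
}
```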
if (is_idle && !queued_on_cpu(cpuc)) {
if ((is_idle && !queued_on_cpu(cpuc)) ||
    (lb_local_dsq_util_pct > 0 &&
     sys_stat.avg_util < ((u64)lb_local_dsq_util_pct << LAVD_SHIFT) / 100)) {
Ditto. Please use p2s().
Instead of doing the same computation again and again, can we pre-calculate the value when initializing the scheduler (on the Rust side)?
/// scheduling. When set to a non-zero value, tasks are dispatched directly
/// to the local DSQ (FIFO) instead of using deadline-based ordering when
/// average system utilization is below this percentage. The threshold is
/// calculated as: avg_util < (lb-local-dsq-util-pct << 10) / 100.
Adding avg_util < (lb-local-dsq-util-pct << 10) / 100 to the help message is too much of an implementation detail. Please keep the description high-level, from a user's point of view.
Will provide a user-friendly version.
/// Low utilization threshold percentage (0-100) for periodic load balancing.
/// When set to a non-zero value, periodic load balancing is skipped when
/// average system utilization is below this percentage. The threshold is
/// calculated as: avg_util < (lb-low-util-pct << 10) / 100.
avg_util < (lb-low-util-pct << 10) / 100 is too low-level for the help output. Please keep it high-level for users.
Acked. Will provide another user-friendly version.
/*
 * Use avg_util_sum when mig_delta_pct is set, otherwise use cur_util_sum.
 * Use avg_util_sum for stable load balancing decisions.
Please also update the description of the mig-delta-pct option:
/// Migration delta threshold percentage (0-100). When set to a non-zero value,
/// uses average utilization for threshold calculation instead of current
/// utilization, and the threshold is calculated as: avg_load * (mig-delta-pct / 100).
Thanks for the review. Will do.
	util = (cpdomc->avg_util_sum << LAVD_SHIFT) / cpdomc->nr_active_cpus;
else
	util = (cpdomc->cur_util_sum << LAVD_SHIFT) / cpdomc->nr_active_cpus;
util = (cpdomc->avg_util_sum << LAVD_SHIFT) / cpdomc->nr_active_cpus;
When I was experimenting with this earlier, I recall that avg_util_sum might've converged a bit too slowly. I think we might need some better functions to accumulate/decay avg_util_sum in a more nuanced way. Do you have any benchmark results that could highlight the impact of this change?
Great point, David. The avg_util_sum is calculated from avg_util using
calc_asym_avg(), which applies different decay rates depending on the
direction of change: it rises fast and decays slowly (updated every 10ms):
Load increasing 0% to 100% (decay=2):
__calc_avg(new_val=target, old_val=current, 2)
Step-by-step (target=1024, representing 100% in LAVD scale):
10ms: 512 (50.0%)
20ms: 896 (87.5%)
30ms: 992 (96.9%)
40ms: 1016 (99.2%)
50ms: 1022 (99.8%)
Load decreasing 100% to 0% (decay=3):
__calc_avg(old_val=current, new_val=target, 3)
Step-by-step:
10ms: 896 (87.5%)
20ms: 784 (76.6%)
30ms: 686 (67.0%)
50ms: 525 (51.3%)
100ms: 269 (26.3%)
150ms: 156 (15.2%)
200ms: 71 ( 6.9%)
So upward convergence reaches ~97% of the target within 30ms, while downward
convergence takes ~50ms just to roughly halve.
I was thinking two pieces of information would be helpful:
- Latency tolerance: could you share what reaction time or latency is acceptable for load-balancer decisions in these scenarios? If utilization shifts, how many milliseconds of lag are tolerable before the load-balance state should reflect the new reality? This determines how quickly the stealer/stealee role classification reacts, and therefore how slow a load-balance decision we can afford.
- Benchmarking methodology: when you observed the slow convergence earlier, what workload or test scenario were you using to reproduce it? Or should we just use DCPerf to simulate the environment?
Currently, I don't have benchmark results yet. With a latency-tolerance target, we could, for example, use a symmetric decay=2 EWMA, a more aggressive decay=1, or design a more nuanced approach.
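For reference, a plain symmetric fixed-point EWMA would look like the sketch below. This is illustrative only, not lavd's __calc_avg()/calc_asym_avg(); note, though, that the decreasing-side numbers above (1024 -> 896 -> 784 -> 686 -> ...) correspond to shift=3 in this form.

```c
typedef unsigned long long u64;

/*
 * Symmetric fixed-point EWMA: each update moves avg toward the new sample by
 * 1/2^shift of the gap. shift=1 halves the error every period; shift=3 keeps
 * 7/8 of the history per period (slower to converge, but smoother).
 */
static u64 ewma(u64 avg, u64 sample, unsigned int shift)
{
	return avg - (avg >> shift) + (sample >> shift);
}
```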
Thanks for doing the detailed analysis, Gavin!
- The tolerance for this is very dependent on the workload. I don't have an exact answer for you here.
- It was for an internal workload similar to Django bench; it would be interesting to tune this and see its effect on the benchmarks. I'd recommend turning forced stealing off for experimentation, since that makes load balancing completely unpredictable.
 * to enable vtime comparison across DSQs during dispatch.
 */
if (is_idle && !queued_on_cpu(cpuc)) {
if ((is_idle && !queued_on_cpu(cpuc)) ||
I recall that at low utilization, most tasks get dispatched via direct_dispatch on the select_cpu path anyway (it's been a while since I instrumented this, so my view of this might be outdated). I'm curious how often this actually gets exercised at low util.
Good observation. You're right that tasks get directly dispatched to the local DSQ in select_cpu when an idle CPU with an empty DSQ is found, which bypasses enqueue entirely. The main benefit of the lb_local_dsq_util_pct bypass is to avoid the overhead of vtime-based deadline ordering when it provides no scheduling advantage. At low utilization, maintaining deadline order via scx_bpf_dsq_insert_vtime() adds cost (vtime computation, ordered insertion) without improving scheduling efficiency; dispatching FIFO via scx_bpf_dsq_insert() to the local DSQ is cheaper and sufficient in this condition.
I think you're right that we should add some instrumentation to measure how often the direct-dispatch path triggers and how much benefit the strategy brings. If the numbers show it's negligible in practice, we could drop the patch or adjust the default threshold. While I work on tracing the impact, I'd appreciate any suggestions on which measurements you'd expect to see for the evaluation.
I'm currently on Lunar New Year vacation, so my time for this is limited. Sorry for any inconvenience.
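As a rough idea of what that instrumentation could look like (counter and helper names are hypothetical, not lavd's existing stats; assumes the BPF program context where u64/bool come from vmlinux.h, and __sync_fetch_and_add compiles to an atomic add):

```c
/* Count how often the low-util FIFO path fires versus the vtime path. */
static u64 nr_fifo_local_inserts;
static u64 nr_vtime_inserts;

static void account_insert_path(bool took_fifo_path)
{
	if (took_fifo_path)
		__sync_fetch_and_add(&nr_fifo_local_inserts, 1);
	else
		__sync_fetch_and_add(&nr_vtime_inserts, 1);
}
```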
No rush, I suspect this will be fairly negligible, but any optimization is nice. Overall, I appreciate you looking into this.
LAVD's periodic load balancer runs unconditionally, even when the system is lightly loaded and balancing provides no benefit. This wastes CPU cycles and can cause unnecessary task migrations that hurt cache locality.
This series makes load balancing more selective by skipping it when the system is idle enough, and by simplifying the dispatch path at very low utilization where deadline ordering adds overhead without benefit. Task stealing at ops.dispatch() and task migration at ops.select_cpu() remain active for work conservation and handling bursty workloads.
Two new tunable command-line options are introduced:
- --lb-low-util-pct (default 25): skip periodic LB below this utilization percentage
- --lb-local-dsq-util-pct (default 10): bypass deadline scheduling and use the FIFO local DSQ below this utilization percentage