L2ARC: Add depth cap and write budget fairness for persistent markers#18289

Open
ixhamza wants to merge 5 commits into openzfs:master from truenas:NAS-139817
ixhamza commented Mar 5, 2026

Motivation and Context

Follow-up to #18093. With persistent markers, scan positions drift indefinitely toward the head of the ARC eviction lists. Buffers there will stay in ARC for a while, so writing them to L2ARC adds little value because ARC already serves them. The tail is where buffers are closest to eviction and benefit most from L2ARC. Additionally, when eviction outpaces L2ARC write throughput, metadata passes run first and can fill the entire write budget every cycle, starving data passes of buffers that could have produced L2ARC hits.

Description

  • Even sublist headroom distribution: Divide headroom equally across sublists with round-robin visitation to prevent any single sublist from dominating the write budget.
  • Lazy sublist reset flags: Signal marker resets via per-sublist boolean flags instead of direct manipulation, consumed at scan start and end. Decouples reset signaling from active scans.
  • Scan-based depth cap: Track cumulative bytes scanned per pass. Reset markers to tail when scanning exceeds l2arc_ext_headroom_pct (default 25%) of ARC state size. Keeps markers in the tail zone where L2ARC adds the most value. Set to 0 to disable.
  • Write budget fairness: After l2arc_meta_cycles (default 2) consecutive cycles where metadata fills the budget, skip metadata for one cycle to let data run. Only triggers when data states have buffers to write. Set to 0 to disable.
  • Man page updates: Remove stale "inclusive caching" terminology and document new tunables.

How Has This Been Tested?

  • CI Testing
  • Manual unit tests.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Quality assurance (non-breaking change which makes the code more robust against bugs)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

ixhamza added 5 commits March 5, 2026 20:15
The dynamic headroom redistribution formula gave later sublists
progressively larger scanning budgets, and random sublist selection
caused uneven coverage across sublists. For depth cap to work
effectively, each sublist should be equally and fairly treated.
Use equal per-sublist headroom (headroom / num_sublists) for even
distribution and deterministic round-robin selection for fair
coverage across cycles.

Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
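
To illustrate the commit message above, here is a minimal sketch of equal per-sublist headroom with round-robin selection. It is not the code from this PR: the function names `sublist_headroom` and `next_sublist` and the cursor variable are hypothetical, and only the `headroom / num_sublists` split comes from the commit message.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: every sublist gets the same share of the
 * scan headroom, regardless of the order in which it is visited.
 */
uint64_t
sublist_headroom(uint64_t headroom, unsigned num_sublists)
{
	assert(num_sublists > 0);
	return (headroom / num_sublists);
}

/*
 * Hypothetical round-robin cursor: returns the sublist to visit now
 * and advances the cursor so the next call picks the following one.
 */
unsigned
next_sublist(unsigned *cursor, unsigned num_sublists)
{
	assert(cursor != 0 && num_sublists > 0);
	unsigned idx = *cursor % num_sublists;
	*cursor = (idx + 1) % num_sublists;
	return (idx);
}
```

Because the cursor is deterministic, every sublist is visited once per `num_sublists` calls, which is the fairness property the depth cap relies on.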
Replace direct marker-to-tail manipulation with per-sublist boolean
flags consumed lazily by feed threads.  Each scanning thread resets its
own marker when it sees the flag, rather than having another thread
manipulate the marker directly.

Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
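
The lazy reset described above could look roughly like the sketch below. The structure and function names are hypothetical, the marker is modelled as a plain offset from the tail, and the use of C11 atomics is an assumption made for the example, not something the commit message states.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-sublist scan state. */
struct sublist_scan {
	atomic_bool reset_pending;	/* set by any thread requesting a reset */
	size_t marker_pos;		/* 0 = tail, grows toward the head */
};

/* Signal a reset without touching the marker itself. */
void
request_reset(struct sublist_scan *s)
{
	assert(s != NULL);
	atomic_store(&s->reset_pending, true);
}

/* Called only by the scanning thread: consume the flag, reset own marker. */
bool
consume_reset(struct sublist_scan *s)
{
	assert(s != NULL);
	if (atomic_exchange(&s->reset_pending, false)) {
		s->marker_pos = 0;
		return (true);
	}
	return (false);
}

/* One scan step: the flag is checked at scan start and again at scan end. */
size_t
scan_sublist(struct sublist_scan *s, size_t nbufs)
{
	(void) consume_reset(s);	/* reset requested while idle */
	s->marker_pos += nbufs;		/* stand-in for walking nbufs buffers */
	(void) consume_reset(s);	/* reset requested during this scan */
	return (s->marker_pos);
}
```

The point of the design is that only the owner of a marker ever moves it; other threads merely set a flag.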
With persistent markers and inclusive scanning, the marker traverses the
entire ARC state across many feed cycles, writing buffers far from the
tail that may no longer be relevant.

Track cumulative bytes scanned per pass in l2arc_ext_scanned. When scans
reach l2arc_ext_headroom_pct (default 25%) of the ARC state size, reset
the pass markers to the tail via lazy reset flags.  This keeps markers
focused on the tail zone where buffers closest to eviction have the most
value for L2ARC.

Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
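
As a rough illustration of the depth cap, the sketch below accumulates scanned bytes per pass and reports when the markers should be reset. The struct, the function name and its parameters are hypothetical stand-ins for l2arc_ext_scanned and l2arc_ext_headroom_pct, and treating "reaching" the cap as `>=` is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-pass accounting (stand-in for l2arc_ext_scanned). */
struct pass_scan {
	uint64_t scanned;	/* cumulative bytes scanned in this pass */
};

/*
 * Account for `bytes` just scanned. Returns true when the cumulative
 * total has reached headroom_pct percent of the ARC state size, in
 * which case the counter restarts and the caller should flag the pass
 * markers for a reset to the tail. headroom_pct == 0 disables the cap.
 */
bool
depth_cap_account(struct pass_scan *p, uint64_t bytes, uint64_t state_size,
    uint64_t headroom_pct)
{
	assert(p != NULL);
	if (headroom_pct == 0)
		return (false);

	p->scanned += bytes;
	/* Divide first to avoid overflow; rounds the cap down slightly. */
	uint64_t cap = state_size / 100 * headroom_pct;
	if (p->scanned >= cap) {
		p->scanned = 0;
		return (true);
	}
	return (false);
}
```

With a state size of 1000 bytes and the default of 25%, the cap in this sketch is 250 bytes of cumulative scanning before the markers return to the tail.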
Under heavy metadata load, metadata passes can monopolize the write
budget every cycle while data passes get nothing written. Track
consecutive monopolized cycles per device in l2ad_meta_cycles. After
l2arc_meta_cycles (default 2) consecutive cycles where metadata fills
the write budget, skip metadata for one cycle to let data run.  Reset
the counter when nothing is written.

Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
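
The fairness rule above can be sketched as a small per-device counter. All names here are hypothetical stand-ins for l2ad_meta_cycles and l2arc_meta_cycles. Resetting the counter when metadata did not fill the budget is an assumption drawn from the word "consecutive"; the commit message only states the reset when nothing is written.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-device state (stand-in for l2ad_meta_cycles). */
struct dev_state {
	uint64_t meta_cycles;	/* consecutive cycles monopolized by metadata */
};

/*
 * Decide at the start of a feed cycle whether to skip metadata passes.
 * limit == 0 disables the mechanism; it also never triggers when the
 * data states have nothing to write.
 */
bool
should_skip_metadata(struct dev_state *d, uint64_t limit,
    bool data_has_buffers)
{
	assert(d != NULL);
	if (limit == 0 || !data_has_buffers)
		return (false);
	if (d->meta_cycles >= limit) {
		d->meta_cycles = 0;	/* skip for exactly one cycle */
		return (true);
	}
	return (false);
}

/* Update the counter once a feed cycle has finished writing. */
void
end_of_cycle(struct dev_state *d, uint64_t meta_written,
    uint64_t total_written, uint64_t budget)
{
	assert(d != NULL);
	if (total_written == 0)
		d->meta_cycles = 0;	/* nothing written: reset */
	else if (meta_written >= budget)
		d->meta_cycles++;	/* metadata filled the budget */
	else
		d->meta_cycles = 0;	/* assumption: streak broken */
}
```

With the default limit of 2, metadata can fill the budget for two cycles in a row, after which one cycle is reserved for data passes.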
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>