Let BackgroundProcessor drive HTLC forwarding #3891
Conversation
👋 Thanks for assigning @joostjager as a reviewer!

Does this in any way prevent users from disabling delays or batching, assuming that's what they want?

On the contrary actually: it effectively reduces the (mean and min) forwarding delay quite a bit, which we can allow as we're gonna add larger receiver-side delays in the next step. And, while it gets rid of the event, users are still free to call it manually.
Force-pushed from ceb3335 to 9ba691c.
Isn't it the case that without the event, as a user you are forced to "poll" for forwards, making extra delays unavoidable?

LDK always processes HTLCs in batches (note that

Polling may be cheap, but forcing users to poll when there is an event mechanism available, is that really the right choice? Perhaps the event is beneficial for testing, debugging, and monitoring too?

The event never featured any information, so it is not helpful for debugging or 'informational' purposes. Plus, it means at least 1-2 more rounds of

But at least the event could wake up the background processor, whereas now nothing is waking it up for forwards and the user is forced to call into the channel manager at a high frequency? Not sure if there is a lighter way to wake up the BP without persistence involved. Also, if you have to call into the channel manager always anyway, aren't there more events/notifiers that can be dropped?

I may have missed this deciding moment. If the assertions were useless to begin with, no problem dropping them of course. I can imagine though that at some point, a peek into the pending HTLC state is still required to not reduce the coverage of the tests?

Again, the default behavior we had intended to switch to for quite some time is to introduce batching intervals (especially given that the current event-based approach was essentially broken/racy). This is what is implemented here. If users want to bend the recommended/default approach they are free to do so, but I don't think it makes sense to keep all the legacy codepaths, including persistence overhead, around if they're not used anymore.

I don't think this is generally the case, no. The 'assertion' that is mainly dropped is 'we generated an event'; everything else remains the same.
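As a hedged sketch of the batching flow under discussion: instead of reacting to `PendingHTLCsForwardable` events, a background loop periodically drains whatever is pending after a randomized batch delay. The `Node` type below is a toy stand-in; only the method names mirror the PR.

```rust
// Toy stand-in for the node; only the method names mirror the PR.
struct Node {
    pending_htlcs: usize,
}

impl Node {
    fn needs_pending_htlc_processing(&self) -> bool {
        self.pending_htlcs > 0
    }

    fn process_pending_htlc_forwards(&mut self) {
        // One batching round handles all currently-pending HTLCs at once.
        self.pending_htlcs = 0;
    }
}

fn bp_iteration(node: &mut Node) {
    // In the real BackgroundProcessor, a randomized batch delay elapses
    // before this check runs.
    if node.needs_pending_htlc_processing() {
        node.process_pending_htlc_forwards();
    }
}

fn main() {
    let mut node = Node { pending_htlcs: 3 };
    bp_iteration(&mut node);
    assert!(!node.needs_pending_htlc_processing());
    println!("ok");
}
```

This illustrates why subsequent events could never shorten the effective interval: a single round drains everything pending at that moment.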
9ba691c
to
b38c19e
Compare
This doesn't rule out a notification when there's something to forward, to at least not keep spinning when there's nothing to do?
Force-pushed from c1a0b35 to d35c944.
Force-pushed from d35c944 to c21aeab.
Finished for now with the test refactoring post-dropping

✅ Added second reviewer: @valentinewallace

🔔 1st Reminder Hey @valentinewallace! This PR has been waiting for your review.
Force-pushed from c21aeab to e2ad6ca.
Force-pushed from 4c14904 to bd030c0.
@@ -1365,6 +1340,9 @@ pub fn do_test<Out: Output>(data: &[u8], underlying_out: Out, anchors: bool) {
			},
		}
	}
	while nodes[$node].needs_pending_htlc_processing() {
Macro name `process_events` no longer accurate?
It still is mostly processing events?
Yes, so name not accurate? Non-blocking.
IMO it's close enough? Or would you prefer to rename to `mostly_process_events`? 😛
if process_twice {
	// We expect that further processing steps became necessary, e.g., because we have to
	// process the failure, or retry a payment.
	assert!(node.node.needs_pending_htlc_processing());
Should this second call be handled inside `process_pending_htlc_forwards`, so that it is always ready when it returns?
Clarified offline. This is to give more control over forwarding / response delays (in prod).
@@ -3411,11 +3336,11 @@ pub fn do_pass_along_path<'a, 'b, 'c>(args: PassAlongPathArgs) -> Option<Event>

if is_last_hop && is_probe {
	commitment_signed_dance!(node, prev_node, payment_event.commitment_msg, true, true);
	expect_pending_htlcs_forwardable!(node);
	node.node.process_pending_htlc_forwards();
No expect here anymore? And below
Clarified offline. This is used in various tests, some without anything pending.
$node.node.process_pending_htlc_forwards();
	}
}};

pub fn expect_pending_htlc_processing(node: &Node<'_, '_, '_>, process_twice: bool) {
My opinion remains that more could have been done to facilitate review, but will review the commit as is. It's mostly test changes, so arguably less critical.
possiblyrandom::getpossiblyrandom(&mut random_bytes);

let index = usize::from_be_bytes(random_bytes) % FWD_DELAYS_MILLIS.len();
*FWD_DELAYS_MILLIS.get(index).unwrap_or(&FALLBACK_DELAY)
I'd really just unwrap here, as it is so clear that this can never happen. Also, if it happens, you probably want to know.
Okay.
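The point of the review comment is that the `unwrap_or` fallback is dead code: the index is reduced modulo the table length, so direct indexing can never go out of bounds. A hypothetical sketch (the table values here are illustrative, not LDK's actual delay table):

```rust
// Illustrative delay table; not LDK's actual values.
const FWD_DELAYS_MILLIS: [u16; 8] = [21, 29, 36, 43, 51, 60, 73, 95];

fn sample_delay_millis(random_bytes: [u8; 8]) -> u16 {
    // Reducing modulo the length guarantees `index < FWD_DELAYS_MILLIS.len()`.
    let index = (u64::from_be_bytes(random_bytes) as usize) % FWD_DELAYS_MILLIS.len();
    // Plain indexing is infallible here, so a fallback would be dead code.
    FWD_DELAYS_MILLIS[index]
}

fn main() {
    // Any byte pattern maps to some valid table entry.
    assert!(FWD_DELAYS_MILLIS.contains(&sample_delay_millis([0u8; 8])));
    assert!(FWD_DELAYS_MILLIS.contains(&sample_delay_millis([0xff; 8])));
    println!("ok");
}
```

And if indexing ever did panic, it would signal a genuine logic bug worth surfacing, which is the reviewer's second argument for dropping the fallback.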
Force-pushed from bd030c0 to e84bd8d.
Addressed all pending feedback.
Force-pushed from e84bd8d to 69b807d.
Let me know if I can squash commits.
@@ -3955,15 +3955,15 @@ fn test_threaded_payment_retries() {
	}
}

// We give the node some time before we process messages and check the added monitors.
std::thread::sleep(Duration::from_secs(1));
Flake potential, as you mentioned yourself already, but I suppose it is pre-existing...
const USIZE_LEN: usize = core::mem::size_of::<usize>();
let mut random_bytes = [0u8; USIZE_LEN];
possiblyrandom::getpossiblyrandom(&mut random_bytes);

let index = usize::from_be_bytes(random_bytes) % FWD_DELAYS_MILLIS.len();
*FWD_DELAYS_MILLIS.get(index).unwrap_or(&FALLBACK_DELAY)
FWD_DELAYS_MILLIS[index]
The confidence this radiates. It's simply amazing 😂
Previously, we'd require the user to manually call `process_pending_htlc_forwards` as part of `PendingHTLCsForwardable` event handling. Here, we rather move this responsibility to `BackgroundProcessor`, which simplifies the flow and allows us to implement reasonable forwarding delays on our side rather than delegating to users' implementations.

Note this also introduces batching rounds rather than calling `process_pending_htlc_forwards` individually for each `PendingHTLCsForwardable` event, which had been unintuitive anyway, as subsequent `PendingHTLCsForwardable` events could lead to overlapping batch intervals, resulting in the shortest timespan 'winning' every time, as `process_pending_htlc_forwards` would of course handle all pending HTLCs at once.
Now that we have `BackgroundProcessor` drive the batch forwarding of HTLCs, we implement random sampling of batch delays from a log-normal distribution with a mean of 50ms.
.. as `forward_htlcs` now does the same thing
.. as `fail_htlcs_backwards_internal` now does the same thing
We move the code into the `optionally_notify` closure, but maintain the behavior for now. In the next step, we'll use this to make sure we only repersist when necessary.
We skip repersisting `ChannelManager` when nothing is actually processed.
We add a reentrancy guard to disallow entering `process_pending_htlc_forwards` multiple times. This makes sure that we'd skip any additional processing calls if a prior round/batch of processing is still underway.
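A minimal sketch of such a reentrancy guard, using an `AtomicBool` whose `swap` atomically tests and sets the flag; the struct and field names are illustrative stand-ins, not the actual `ChannelManager` internals:

```rust
use std::sync::atomic::{AtomicBool, Ordering};

// Illustrative stand-in for the manager holding the guard flag.
struct Forwarder {
    pending_htlc_processing: AtomicBool,
}

impl Forwarder {
    fn new() -> Self {
        Forwarder { pending_htlc_processing: AtomicBool::new(false) }
    }

    /// Returns `false` if another processing round is already underway.
    fn process_pending_htlc_forwards(&self) -> bool {
        // `swap` returns the previous value: if it was already `true`, a
        // concurrent call holds the guard, so we skip this round entirely.
        if self.pending_htlc_processing.swap(true, Ordering::AcqRel) {
            return false;
        }
        // ... actual batch processing would happen here ...
        self.pending_htlc_processing.store(false, Ordering::Release);
        true
    }
}

fn main() {
    let fwd = Forwarder::new();
    assert!(fwd.process_pending_htlc_forwards()); // guard was free
    // Simulate a round still in flight:
    fwd.pending_htlc_processing.store(true, Ordering::Release);
    assert!(!fwd.process_pending_htlc_forwards()); // skipped
    println!("ok");
}
```

The atomic test-and-set makes the skip race-free without taking a lock around the whole processing round.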
Force-pushed from 69b807d to 6985f53.
Got the go-ahead out-of-band, so squashed fixups.
LGTM. Nice improvement/simplification.
What I definitely don't like is that the delay is now hard-coded and non-configurable. It can't be disabled or changed without modifying the code. To me, it doesn't feel right to force a privacy feature that impacts UX onto users.
That said, it seems most support the change, and we can always add a disable switch if it's requested.
Gonna go ahead and land this so it doesn't need rebase, but there's a few things that probably merit a followup and further discussion.
}

pub(crate) fn get(&self) -> Duration {
	Duration::from_millis(self.next_batch_delay_millis as u64)
Shouldn't we also draw to randomize the us count as well? Otherwise in some cases I imagine you'll be able to see corresponding message forwards are a round number of milliseconds since the incoming message (plus some relatively-fixed number of processing us).
Shouldn't we also draw to randomize the us count as well?
Sorry, I have a hard time parsing what you're asking. Mind rephrasing this? What does 'randomize the millisecond count' mean?
We currently only randomize the millisecond count, I think we should probably also randomize the microsecond count (sorry, I realize "us" was ambiguous).
Ah, thank you for the clarification.
Otherwise in some cases I imagine you'll be able to see corresponding message forwards are a round number of milliseconds since the incoming message (plus some relatively-fixed number of processing us).

For one, I think message processing and the actual network delay should introduce 'sufficient' jitter to account for up to 1ms. And, since any observer can't know what value we drew from the distribution, we don't gain anything from more than 1ms jitter? In any case, I don't imagine microsecond resolution for the delay is adding much here, at least not to the degree that it's worth the effort?
For one, I think message processing and the actual network delay should introduce 'sufficient' jitter to account for up to 1ms

I doubt it? 1ms is a longggggg time. Network delay maybe in some cases, but not if it's all within a datacenter (e.g. all AWS) or you have a network-level observer.
In any case, I don't imagine microsecond resolution for the delay is adding much here, at least not to the degree that it's worth the effort?
I mean part of the reason I mention it is because it seems absolutely trivial to just draw twice and add it to the `Duration`, since we already have that and aren't returning a `u64` anymore.
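The "draw twice" suggestion can be sketched as follows; the two sampled values are parameters here, standing in for whatever RNG draws the real code would make:

```rust
use std::time::Duration;

// Sketch of the suggestion: draw an independent sub-millisecond jitter and
// add it to the millisecond delay. `Duration` addition makes this trivial.
fn batch_delay(delay_millis: u64, jitter_draw: u64) -> Duration {
    // Reduce the second draw to the sub-millisecond range [0, 1000) us.
    Duration::from_millis(delay_millis) + Duration::from_micros(jitter_draw % 1_000)
}

fn main() {
    // 50ms plus 345us of sub-millisecond jitter.
    assert_eq!(batch_delay(50, 2_345), Duration::from_micros(50_345));
    println!("ok");
}
```

Since `get` already returns a `Duration` rather than a `u64`, summing the two components needs no further plumbing, which is the reviewer's point about the change being trivial.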
@@ -336,6 +349,9 @@ macro_rules! define_run_body {
	let mut have_pruned = false;
	let mut have_decayed_scorer = false;

	let mut cur_batch_delay = $batch_delay.get();
The old logic in channelmanager.rs delayed the first forward by two seconds to give us a chance to get connected to our peers first, which seems like something we may want to duplicate here.
Mhh, that startup would always be sub-2-seconds seems like a pretty flaky assumption to begin with? I'm not convinced hardcoding such a value makes sense. Maybe it would be more robust to have some kind of callback that users need to call once setup is done and they want to start forwarding?
I suppose we could detect when we have enough peers connected and then start forwarding at that point? It's kinda hard to figure out though, because we want to know that we've connected to all the peers that we're going to connect to, but we're not waiting around to connect to peers that are offline...
I suppose we could detect when we have enough peers connected and then start forwarding at that point? It's kinda hard to figure out though, because we want to know that we've connected to all the peers that we're going to connect to, but we're not waiting around to connect to peers that are offline...
Right, that's why I imagine it would need to be a callback made by the user once they think they made a decent effort reconnecting.
Or, maybe there is another proposal here:
For the receiver-side delay I so far considered delaying the receiving HTLCs, e.g., by adding a `pending_receive_htlcs` map to park them until ready, or by adding a 'remaining-rounds-until-ready-for-processing' time-to-live field on the `PendingHTLCRouting::{Receive, ReceiveKeysend}` variants.

However, we could also consider making 'delay processing by X rounds' a more general feature of `forward_htlcs`, and on startup simply use this with, say, `X = 10`?

I think for the receiver-side delay work the former approach would be simpler, but maybe it's worth solving both issues at once?
Right, that's why I imagine it would need to be a callback made by the user once they think they made a decent effort reconnecting.
Yea, but also bleh. Yet more stuff to wire up :/.
on startup simply use this with, say, X = 10?
Yea, I mean that's another way to get the same thing. I'm not sure we should prefer that logic live in `ChannelManager` rather than in the BP (and if it complexifies the `ChannelManager` implementation more, definitely not), but I don't feel super strongly there.
I do kinda wonder if, since we're here, we shouldn't at least try to look at how many channels we have with connected peers. Even a naive heuristic like "sleep 2 seconds or until all `ChannelDetails` show connected, or 1 second after half of all `ChannelDetails` show connected" is likely to be pretty good in most cases, and certainly better than any fixed number.
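That naive heuristic can be encoded in a few lines. In this illustrative sketch the `connected`/`total` counts stand in for scanning `ChannelDetails`; nothing here is LDK's actual API:

```rust
use std::time::Duration;

// Heuristic from the discussion: start forwarding once all channels' peers
// are connected, or 1 second after at least half are, or after 2 seconds
// regardless. `connected`/`total` are stand-ins for scanning ChannelDetails.
fn may_start_forwarding(elapsed: Duration, connected: usize, total: usize) -> bool {
    if total == 0 || connected == total {
        return true; // nothing to wait for, or everyone is already connected
    }
    if 2 * connected >= total && elapsed >= Duration::from_secs(1) {
        return true; // at least half connected, and 1s has passed
    }
    elapsed >= Duration::from_secs(2) // hard 2s cap either way
}

fn main() {
    assert!(may_start_forwarding(Duration::ZERO, 4, 4)); // all connected
    assert!(!may_start_forwarding(Duration::from_millis(500), 2, 4)); // half, too early
    assert!(may_start_forwarding(Duration::from_secs(1), 2, 4)); // half + 1s
    assert!(may_start_forwarding(Duration::from_secs(2), 0, 4)); // hard timeout
    println!("ok");
}
```

The appeal over a fixed startup delay is that well-connected nodes start forwarding almost immediately, while the 2-second cap still bounds the wait when peers stay offline.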
Closes #3768.
Closes #1101.
Previously, we'd require the user to manually call `process_pending_htlc_forwards` as part of `PendingHTLCsForwardable` event handling. Here, we rather move this responsibility to `BackgroundProcessor`, which simplifies the flow and allows us to implement reasonable forwarding delays on our side rather than delegating to users' implementations.

Note this also introduces batching rounds rather than calling `process_pending_htlc_forwards` individually for each `PendingHTLCsForwardable` event, which had been unintuitive anyway, as subsequent `PendingHTLCsForwardable` events could lead to overlapping batch intervals, resulting in the shortest timespan 'winning' every time, as `process_pending_htlc_forwards` would of course handle all pending HTLCs at once.

To this end, we implement random sampling of batch delays from a log-normal distribution with a mean of 50ms and drop the `PendingHTLCsForwardable` event.

Draft for now as I'm still cleaning up the code base as part of the final commit dropping `PendingHTLCsForwardable`.