Let BackgroundProcessor drive HTLC forwarding #3891
Conversation
👋 Thanks for assigning @joostjager as a reviewer!

Does this in any way prevent users from disabling delays or batching, assuming that's what they want?

On the contrary actually: it effectively reduces the (mean and min) forwarding delay quite a bit, which we can allow as we're gonna add larger receiver-side delays in the next step. And, while it gets rid of the event, users are still free to call it manually.
Force-pushed from ceb3335 to 9ba691c.
Isn't it the case that without the event, as a user you are forced to "poll" for forwards, making extra delays unavoidable?

LDK always processes HTLCs in batches (note that

Polling may be cheap, but forcing users to poll when there is an event mechanism available, is that really the right choice? Perhaps the event is beneficial for testing, debugging, and monitoring too?

The event never featured any information, so it is not helpful for debugging or 'informational' purposes. Plus, it means at least 1-2 more rounds of

But at least the event could wake up the background processor, whereas now nothing is waking it up for forwards and the user is forced to call into the channel manager at a high frequency? Not sure if there is a lighter way to wake up the BP without persistence involved. Also, if you have to call into the channel manager always anyway, aren't there more events/notifiers that can be dropped?

I may have missed this deciding moment. If the assertions were useless to begin with, no problem dropping them of course. I can imagine though that at some point, a peek into the pending HTLC state is still required to not reduce the coverage of the tests?

Again, the default behavior we had intended to switch to for quite some time is to introduce batching intervals (especially given that the current event-based approach was essentially broken/racy). This is what is implemented here. If users want to bend the recommended/default approach they are free to do so, but I don't think it makes sense to keep all the legacy codepaths, including persistence overhead, around if they're not used anymore.

I don't think this is generally the case, no. The 'assertion' that is mainly dropped is 'we generated an event'; everything else remains the same.
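As a hedged sketch of the batching flow under discussion: instead of reacting to `PendingHTLCsForwardable` events, a background loop periodically drains whatever is pending after a randomized batch delay. The `Node` type below is a toy stand-in; only the method names mirror the PR.

```rust
// Toy stand-in for the node; only the method names mirror the PR.
struct Node {
    pending_htlcs: usize,
}

impl Node {
    fn needs_pending_htlc_processing(&self) -> bool {
        self.pending_htlcs > 0
    }

    fn process_pending_htlc_forwards(&mut self) {
        // One batching round handles all currently-pending HTLCs at once.
        self.pending_htlcs = 0;
    }
}

fn bp_iteration(node: &mut Node) {
    // In the real BackgroundProcessor, a randomized batch delay elapses
    // before this check runs.
    if node.needs_pending_htlc_processing() {
        node.process_pending_htlc_forwards();
    }
}

fn main() {
    let mut node = Node { pending_htlcs: 3 };
    bp_iteration(&mut node);
    assert!(!node.needs_pending_htlc_processing());
    println!("ok");
}
```

This illustrates why subsequent events could never shorten the effective interval: a single round drains everything pending at that moment.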
9ba691c
to
b38c19e
Compare
This doesn't rule out a notification when there's something to forward, to at least not keep spinning when there's nothing to do?
Force-pushed from c1a0b35 to d35c944.
Force-pushed from d35c944 to c21aeab.
Finished for now with the test refactoring post-dropping

✅ Added second reviewer: @valentinewallace

🔔 1st Reminder Hey @valentinewallace! This PR has been waiting for your review.
Force-pushed from c21aeab to e2ad6ca.
Force-pushed from 4c14904 to bd030c0.
@@ -1365,6 +1340,9 @@ pub fn do_test<Out: Output>(data: &[u8], underlying_out: Out, anchors: bool) {
			},
		}
	}
	while nodes[$node].needs_pending_htlc_processing() {
Macro name `process_events` no longer accurate?
It still is mostly processing events?
Yes, so name not accurate? Non-blocking.
IMO it's close enough? Or would you prefer to rename to `mostly_process_events`? 😛
if process_twice {
	// We expect that further processing steps became necessary, e.g., because we have to
	// process the failure, or retry a payment.
	assert!(node.node.needs_pending_htlc_processing());
Should this second call be handled inside `process_pending_htlc_forwards`, so that it is always ready when it returns?
Clarified offline. This is to give more control over forwarding / response delays (in prod).
@@ -3411,11 +3336,11 @@ pub fn do_pass_along_path<'a, 'b, 'c>(args: PassAlongPathArgs) -> Option<Event>

if is_last_hop && is_probe {
	commitment_signed_dance!(node, prev_node, payment_event.commitment_msg, true, true);
	expect_pending_htlcs_forwardable!(node);
	node.node.process_pending_htlc_forwards();
No expect here anymore? And below
Clarified offline. This is used in various tests, some without anything pending.
$node.node.process_pending_htlc_forwards();
	}
}};

pub fn expect_pending_htlc_processing(node: &Node<'_, '_, '_>, process_twice: bool) {
My opinion remains that more could have been done to facilitate review, but will review the commit as is. It's mostly test changes, so arguably less critical.
possiblyrandom::getpossiblyrandom(&mut random_bytes);

let index = usize::from_be_bytes(random_bytes) % FWD_DELAYS_MILLIS.len();
*FWD_DELAYS_MILLIS.get(index).unwrap_or(&FALLBACK_DELAY)
I'd really just unwrap here, as it is so clear that this can never happen. Also, if it happens, you probably want to know.
Okay.
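The point of the review comment is that the `unwrap_or` fallback is dead code: the index is reduced modulo the table length, so direct indexing can never go out of bounds. A hypothetical sketch (the table values here are illustrative, not LDK's actual delay table):

```rust
// Illustrative delay table; not LDK's actual values.
const FWD_DELAYS_MILLIS: [u16; 8] = [21, 29, 36, 43, 51, 60, 73, 95];

fn sample_delay_millis(random_bytes: [u8; 8]) -> u16 {
    // Reducing modulo the length guarantees `index < FWD_DELAYS_MILLIS.len()`.
    let index = (u64::from_be_bytes(random_bytes) as usize) % FWD_DELAYS_MILLIS.len();
    // Plain indexing is infallible here, so a fallback would be dead code.
    FWD_DELAYS_MILLIS[index]
}

fn main() {
    // Any byte pattern maps to some valid table entry.
    assert!(FWD_DELAYS_MILLIS.contains(&sample_delay_millis([0u8; 8])));
    assert!(FWD_DELAYS_MILLIS.contains(&sample_delay_millis([0xff; 8])));
    println!("ok");
}
```

And if indexing ever did panic, it would signal a genuine logic bug worth surfacing, which is the reviewer's second argument for dropping the fallback.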
Force-pushed from bd030c0 to e84bd8d.
Addressed all pending feedback.
Force-pushed from e84bd8d to 69b807d.
Let me know if I can squash commits.
@@ -3955,15 +3955,15 @@ fn test_threaded_payment_retries() {
	}
}

// We give the node some time before we process messages and check the added monitors.
std::thread::sleep(Duration::from_secs(1));
Flake potential, as you mentioned yourself already, but I suppose it is pre-existing...
const USIZE_LEN: usize = core::mem::size_of::<usize>();
let mut random_bytes = [0u8; USIZE_LEN];
possiblyrandom::getpossiblyrandom(&mut random_bytes);

let index = usize::from_be_bytes(random_bytes) % FWD_DELAYS_MILLIS.len();
*FWD_DELAYS_MILLIS.get(index).unwrap_or(&FALLBACK_DELAY)
FWD_DELAYS_MILLIS[index]
The confidence this radiates. It's simply amazing 😂
Previously, we'd require the user to manually call `process_pending_htlc_forwards` as part of `PendingHTLCsForwardable` event handling. Here, we rather move this responsibility to `BackgroundProcessor`, which simplifies the flow and allows us to implement reasonable forwarding delays on our side rather than delegating to users' implementations.

Note this also introduces batching rounds rather than calling `process_pending_htlc_forwards` individually for each `PendingHTLCsForwardable` event, which had been unintuitive anyway, as subsequent `PendingHTLCsForwardable` events could lead to overlapping batch intervals, resulting in the shortest timespan 'winning' every time, as `process_pending_htlc_forwards` would of course handle all pending HTLCs at once.
Now that we have `BackgroundProcessor` drive the batch forwarding of HTLCs, we implement random sampling of batch delays from a log-normal distribution with a mean of 50ms.
.. as `forward_htlcs` now does the same thing
.. as `fail_htlcs_backwards_internal` now does the same thing
We move the code into the `optionally_notify` closure, but maintain the behavior for now. In the next step, we'll use this to make sure we only repersist when necessary.
We skip repersisting `ChannelManager` when nothing is actually processed.
We add a reentrancy guard to disallow entering `process_pending_htlc_forwards` multiple times. This makes sure that we'd skip any additional processing calls if a prior round/batch of processing is still underway.
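A minimal sketch of such a reentrancy guard, using an `AtomicBool` whose `swap` atomically tests and sets the flag; the struct and field names are illustrative stand-ins, not the actual `ChannelManager` internals:

```rust
use std::sync::atomic::{AtomicBool, Ordering};

// Illustrative stand-in for the manager holding the guard flag.
struct Forwarder {
    pending_htlc_processing: AtomicBool,
}

impl Forwarder {
    fn new() -> Self {
        Forwarder { pending_htlc_processing: AtomicBool::new(false) }
    }

    /// Returns `false` if another processing round is already underway.
    fn process_pending_htlc_forwards(&self) -> bool {
        // `swap` returns the previous value: if it was already `true`, a
        // concurrent call holds the guard, so we skip this round entirely.
        if self.pending_htlc_processing.swap(true, Ordering::AcqRel) {
            return false;
        }
        // ... actual batch processing would happen here ...
        self.pending_htlc_processing.store(false, Ordering::Release);
        true
    }
}

fn main() {
    let fwd = Forwarder::new();
    assert!(fwd.process_pending_htlc_forwards()); // guard was free
    // Simulate a round still in flight:
    fwd.pending_htlc_processing.store(true, Ordering::Release);
    assert!(!fwd.process_pending_htlc_forwards()); // skipped
    println!("ok");
}
```

The atomic test-and-set makes the skip race-free without taking a lock around the whole processing round.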
Force-pushed from 69b807d to 6985f53.
Got the go-ahead out-of-band, so squashed fixups.
LGTM. Nice improvement/simplification.
What I definitely don't like is that the delay is now hard-coded and non-configurable. It can't be disabled or changed without modifying the code. To me, it doesn't feel right to force a privacy feature that impacts UX onto users.
That said, it seems most support the change, and we can always add a disable switch if it's requested.
Gonna go ahead and land this so it doesn't need rebase, but there's a few things that probably merit a followup and further discussion.
}

pub(crate) fn get(&self) -> Duration {
	Duration::from_millis(self.next_batch_delay_millis as u64)
Shouldn't we also draw to randomize the us count as well? Otherwise in some cases I imagine you'll be able to see corresponding message forwards are a round number of milliseconds since the incoming message (plus some relatively-fixed number of processing us).
Shouldn't we also draw to randomize the us count as well?
Sorry, I have a hard time parsing what you're asking. Mind rephrasing this? What does 'randomize the millisecond count' mean?
We currently only randomize the millisecond count, I think we should probably also randomize the microsecond count (sorry, I realize "us" was ambiguous).
Ah, thank you for the clarification.
Otherwise in some cases I imagine you'll be able to see corresponding message forwards are a round number of milliseconds since the incoming message (plus some relatively-fixed number of processing us).

For one, I think message processing and the actual network delay should introduce 'sufficient' jitter to account for up to 1ms. And, since any observer can't know what value we drew from the distribution, we don't gain anything from more than 1ms jitter? In any case, I don't imagine microsecond resolution for the delay is adding much here, at least not to the degree that it's worth the effort?
For one, I think message processing and the actual network delay should introduce 'sufficient' jitter to account for up to 1ms

I doubt it? 1ms is a longggggg time. Network delay maybe in some cases, but not if it's all within a datacenter (e.g. all AWS) or you have a network-level observer.
In any case, I don't imagine microsecond resolution for the delay is adding much here, at least not to the degree that it's worth the effort?
I mean part of the reason I mention it is because it seems absolutely trivial to just draw twice and add it to the `Duration`, since we already have that and aren't returning a `u64` anymore.
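The "draw twice" suggestion can be sketched as follows; the two sampled values are parameters here, standing in for whatever RNG draws the real code would make:

```rust
use std::time::Duration;

// Sketch of the suggestion: draw an independent sub-millisecond jitter and
// add it to the millisecond delay. `Duration` addition makes this trivial.
fn batch_delay(delay_millis: u64, jitter_draw: u64) -> Duration {
    // Reduce the second draw to the sub-millisecond range [0, 1000) us.
    Duration::from_millis(delay_millis) + Duration::from_micros(jitter_draw % 1_000)
}

fn main() {
    // 50ms plus 345us of sub-millisecond jitter.
    assert_eq!(batch_delay(50, 2_345), Duration::from_micros(50_345));
    println!("ok");
}
```

Since `get` already returns a `Duration` rather than a `u64`, summing the two components needs no further plumbing, which is the reviewer's point about the change being trivial.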
@@ -336,6 +349,9 @@ macro_rules! define_run_body {
	let mut have_pruned = false;
	let mut have_decayed_scorer = false;

	let mut cur_batch_delay = $batch_delay.get();
The old logic in channelmanager.rs delayed the first forward by two seconds to give us a chance to get connected to our peers first, which seems like something we may want to duplicate here.
Mhh, that startup would always be sub-2-seconds seems like a pretty flaky assumption to begin with? I'm not convinced hardcoding such a value makes sense. Maybe it would be more robust to have some kind of callback that users need to call once setup is done and they want to start forwarding?
I suppose we could detect when we have enough peers connected and then start forwarding at that point? It's kinda hard to figure out though, because we want to know that we've connected to all the peers that we're going to connect to, but we're not waiting around to connect to peers that are offline...
I suppose we could detect when we have enough peers connected and then start forwarding at that point? It's kinda hard to figure out though, because we want to know that we've connected to all the peers that we're going to connect to, but we're not waiting around to connect to peers that are offline...
Right, that's why I imagine it would need to be a callback made by the user once they think they made a decent effort reconnecting.
Or, maybe there is another proposal here:
For the receiver-side delay I so far considered delaying the receiving HTLCs, e.g., by adding a `pending_receive_htlcs` map to park them until ready, or by adding a 'remaining-rounds-until-ready-for-processing' time-to-live field on the `PendingHTLCRouting::{Receive, ReceiveKeysend}` variants.

However, we could also consider making 'delay processing by X rounds' a more general feature of `forward_htlcs`, and on startup simply use this with, say, `X = 10`?

I think for the receiver-side delay work the former approach would be simpler, but maybe it's worth solving both issues at once?
Right, that's why I imagine it would need to be a callback made by the user once they think they made a decent effort reconnecting.
Yea, but also bleh. Yet more stuff to wire up :/.
on startup simply use this with, say, X = 10?
Yea, I mean that's another way to get the same thing. I'm not sure we should prefer that logic live in `ChannelManager` rather than in the BP (and if it complexifies the `ChannelManager` implementation more, definitely not), but I don't feel super strongly there.
I do kinda wonder if, since we're here, we shouldn't at least try to look at how many channels we have with connected peers. Even a naive heuristic like "sleep 2 seconds or until all `ChannelDetails` show connected, or 1 second after half of all `ChannelDetails` show connected" is likely to be pretty good in most cases, and certainly better than any fixed number.
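That naive heuristic can be encoded in a few lines. In this illustrative sketch the `connected`/`total` counts stand in for scanning `ChannelDetails`; nothing here is LDK's actual API:

```rust
use std::time::Duration;

// Heuristic from the discussion: start forwarding once all channels' peers
// are connected, or 1 second after at least half are, or after 2 seconds
// regardless. `connected`/`total` are stand-ins for scanning ChannelDetails.
fn may_start_forwarding(elapsed: Duration, connected: usize, total: usize) -> bool {
    if total == 0 || connected == total {
        return true; // nothing to wait for, or everyone is already connected
    }
    if 2 * connected >= total && elapsed >= Duration::from_secs(1) {
        return true; // at least half connected, and 1s has passed
    }
    elapsed >= Duration::from_secs(2) // hard 2s cap either way
}

fn main() {
    assert!(may_start_forwarding(Duration::ZERO, 4, 4)); // all connected
    assert!(!may_start_forwarding(Duration::from_millis(500), 2, 4)); // half, too early
    assert!(may_start_forwarding(Duration::from_secs(1), 2, 4)); // half + 1s
    assert!(may_start_forwarding(Duration::from_secs(2), 0, 4)); // hard timeout
    println!("ok");
}
```

The appeal over a fixed startup delay is that well-connected nodes start forwarding almost immediately, while the 2-second cap still bounds the wait when peers stay offline.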
Closes #3768.
Closes #1101.
Previously, we'd require the user to manually call `process_pending_htlc_forwards` as part of `PendingHTLCsForwardable` event handling. Here, we rather move this responsibility to `BackgroundProcessor`, which simplifies the flow and allows us to implement reasonable forwarding delays on our side rather than delegating to users' implementations.

Note this also introduces batching rounds rather than calling `process_pending_htlc_forwards` individually for each `PendingHTLCsForwardable` event, which had been unintuitive anyway, as subsequent `PendingHTLCsForwardable` events could lead to overlapping batch intervals, resulting in the shortest timespan 'winning' every time, as `process_pending_htlc_forwards` would of course handle all pending HTLCs at once.

To this end, we implement random sampling of batch delays from a log-normal distribution with a mean of 50ms and drop the `PendingHTLCsForwardable` event.

Draft for now as I'm still cleaning up the code base as part of the final commit dropping `PendingHTLCsForwardable`.