peer+lnd: add new CLI option to control if we D/C on slow pongs #9801
Conversation
Force-pushed from eeb7897 to 8c23404.
tACK, this PR improves connection reliability with my peers. Some slight weirdness that I figured would be good to document:

- Pong size did not match expected size:
- Back-to-back repeated Pongs (remote is CLN):
I think the flag could be useful!
I propose adding a Connection-Failure-Threshold (N) which only disconnects after several failed ping/pongs in a row, rather than just adding a connection flag. wdyt?
I had also considered making it a sort of peer-level healthcheck, to inherit that threshold logic, but instead went in this direction as I started to second-guess the design rationale for disconnecting in the first place. I think if we add a threshold flag, then we'd also want a flag to tune the timeout value. When I started to run this on my node (even w/ the super prio queue), I noticed some nodes that were just persistently slow in replying. Ultimately, slow nodes do affect payment latency end-to-end. There's a credible design direction here where we start to factor this in at the first-hop level, but then also have the link sample the ping RTT of a peer and decide if the link is even eligible to send based on that.
the last commit does not change existing behaviour afaict
Force-pushed from 3c75628 to f0fcffa.
The latest version inverts the original PR: the default behavior stays, but users now have an option to turn off disconnecting on slow pongs.
Force-pushed from 1f498d3 to d6d25a9.
Pushed up a fix for the sample config and also made the behavior of `return` vs. `continue` in the ping manager more clear.
With that, LGTM 🎉
just one comment - otherwise lgtm!
In this commit, we add a new CLI option to control if we D/C on slow pongs or not. Due to the existence of head-of-the-line blocking at various levels of abstraction (app buffer, slow processing, TCP kernel buffers, etc), if there's a flurry of gossip messages (eg: 1K channel updates), then even with a reasonable processing latency, a peer may still not read our ping in time. To give users another option, we add a flag that allows users to disable this behavior. The default remains.
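For illustration, the new option could be set from the config file along these lines. This is a hedged sketch of an `lnd.conf` entry: the flag spelling is an assumption derived from the `NoDisconnectOnPongFailure` field referenced later in this review and may differ from the merged spelling:

```ini
[Application Options]
; Assumed spelling (from the NoDisconnectOnPongFailure field): when
; true, a peer that is slow to answer pings is only logged about,
; not disconnected. The default (false) keeps the disconnect behavior.
no-disconnect-on-pong-failure=true
```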
LGTM
```go
// If NoDisconnectOnPongFailure is true, we don't
// disconnect. Otherwise (if it's false, the default),
// we disconnect.
```
Nit: this comment can be removed; it just describes the code.
```go
}

// getLastRTT safely retrieves the last known RTT, returning 0 if none exists.
func (m *PingManager) getLastRTT() time.Duration {
```
Nit: use `fn.Option` here as well, similar to `pendingPingWait` below?
```go
		close(pingSent)
	})
},
OnPongFailure: func(err error,
```
Nit: maybe this should also be protected by a `sync.Once`, since it is closing a channel?
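The concern here is that closing an already-closed channel panics in Go, so a callback that may fire more than once needs a `sync.Once` guard. A minimal sketch, assuming a hypothetical `safeSignal` wrapper (not the PR's actual types):

```go
package main

import (
	"fmt"
	"sync"
)

// safeSignal wraps a signal channel so multiple callers can mark
// completion without panicking on a double close.
type safeSignal struct {
	once sync.Once
	ch   chan struct{}
}

func newSafeSignal() *safeSignal {
	return &safeSignal{ch: make(chan struct{})}
}

// Close is safe to call any number of times; only the first call
// actually closes the underlying channel.
func (s *safeSignal) Close() {
	s.once.Do(func() { close(s.ch) })
}

func main() {
	sig := newSafeSignal()
	// Simulate the failure callback firing twice; without the
	// sync.Once guard, the second close would panic.
	sig.Close()
	sig.Close()
	_, open := <-sig.ch
	fmt.Println(open) // false: the channel is closed
}
```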
In this commit, we add a new CLI option to control if we D/C on slow pongs or not. Due to the existence of head-of-the-line blocking at various levels of abstraction (app buffer, slow processing, TCP kernel buffers, etc), if there's a flurry of gossip messages (eg: 1K channel updates), then even with a reasonable processing latency, a peer may still not read our ping in time.
To combat this, we change the default behavior to just log slow pongs, and add a new CLI option to re-enable the old disconnect behavior.
Along the way, we also add some more enhanced logging, so we can tell when the last successful ping was, and also the deadline reached.