-
Notifications
You must be signed in to change notification settings - Fork 2.2k
peer: only send pings if the connection is idle for one minute #9805
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
This commit changes the ping/pong mechanism used for liveness check. Previously we will always fire a ping every minute, even we know for sure the connection is alive as we are busy processing other messages like gossip updates. We now only fire a ping if we haven't received any messages from the peer after one minute.
Important Review skippedAuto reviews are limited to specific labels. 🏷️ Labels to auto review (1)
Please check the settings in the CodeRabbit UI or the You can disable this status message by setting the Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
SupportNeed help? Create a ticket on our support page for assistance with any issues or questions. Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
225f290
to
744feab
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually there might be a slight problem here...
Currently writeHandler
will force a disconnect if no message has been successfully sent to the peer in 5 minutes. Prior to this change, pings would be sent every 1 minute, thus preventing this disconnect. Now it is possible that the peer sends a gossip message often enough to prevent the pings, and then gets disconnected as a result.
One idea would be to do an explicit ping check at that point before D/C in the |
Here's a patch that implements that idea above: Roasbeef@120375e Before we would: "disconnect if we haven't written a message in 5 minutes". Now we'll: "force a ping to be sent if we haven't written a message in 5 minutes". Returning to @morehouse's scenario above, instead of never disconnect, we'll send a ping, then if that doesn't receive a timely response, we'll disconnect. |
Really nice idea with the combination of the patch +1 on this approach. |
Similar to how we delay sending pings on successful reads, we now delay sending pings if we know the connection is up and alive for writes.
Remove this `idleTimer` as the `readNextMessage` will time out after 5 seconds anyway, which means it's unlikely the `idleTimer` will fire due to a congestion in the read pipe of the connection. When a timeout happens, this error is not caught and will break the loop and jump directly to the disconnection and this commit doesn't change that behaivor. This commit will change the behaivor that, if any of the message processing takes longer than 5 minute, it will not disconnect but wait for that process to be finished. Since all the processing are async, this should never happen.
This commit removes the `idleTimer` used in the `writeHandler`. When the connection is idle for one minute, a ping will be written to the connection, which means the `idleTimer` won't be fired. If retrying the write fails due to timeout, we will disconnect the peer.
select { | ||
case outMsg := <-p.sendQueue: | ||
// Reset the timeout. | ||
flushTimer.Reset(writeFlushTimeout) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given we are using go1.23, there's no need to call Stop
anymore, reference.
Nice catch. I think the same goes for the
Thanks for the patch! In addition I think we need to have a similar check in the reader too. Tried a bit locally and realized we could avoid interfering the ping/pong logic. Basically I think we could just remove the
Thus we rely solely on the ping/pong mechanism to check for liveness, and can get rid of the Another approach is to share the idle timer between the reader and writer, which can make sure one busy read won't make the write time out, see the commit here yyforyongyu@1ab9447 |
I really like this approach, but I'm afraid there's a DoS vulnerability hidden in here due to the way LND queues outgoing messages. Current situationOutgoing messages travel through a series of chans and queues before being sent via TCP to the peer:
Once messages have been removed from Processing of messages through the Accidental DoS protectionA peer can purposely refuse to ACK on the TCP layer, thereby causing messages to back up in the outgoing The However the ping-pong mechanism provides a much stronger (accidental) defense against this attack. Pings are sent regularly every ~1 minute with a random Situation after this PRAfter this PR, pings will only be sent after 1 minute of no messages being sent or received successfully. So if an attacker regularly sends an arbitrary message to the victim, the attacker will never actually have to respond to a ping. That means the attacker can refuse to ACK at the TCP layer (except once every 5 minutes), causing the victim's memory use to grow unbounded. Path forwardReally we need to have a discussion about proper outgoing message handling, including things like queue limits, dropping messages, forcing disconnection, etc. If we can mitigate potential DoS that way, then this PR would be great. IIUC, we don't urgently need this PR since the ping timeouts have been largely mitigated by #9804. For now I think we should leave the current ping-pong mechanism as is, to protect against the DoS attack described above, so we don't hold up the 0.19 release on a more robust solution. |
@morehouse Thanks for the detailed analysis! Yes unfortunately the remote peer can hold the ACK since TCP is full-duplex so they can keep sending us msgs without ACKing ours. Will put this PR on hold and revisit for 0.20, also have some ideas about how to better design this area, will create a doc to kick off the discussions. |
@yyforyongyu, remember to re-request review from reviewers when ready |
!lightninglabs-deploy mute |
Based on the specs, the
ping
is used for keeping the liveness of the connection. We perform this check every minute, regardless of whether the connection is live or not.This PR now changes the behavior such that we only perform this check when the connection is idle - if we are busy receiving gossip messages over the wire, we know for sure the connection is alive and can delay sending the pings.