peer: only send pings if the connection is idle for one minute #9805

yyforyongyu · 2025-05-13T14:10:02Z

Based on the specs, the ping is used for keeping the liveness of the connection. We perform this check every minute, regardless of whether the connection is live or not.

This PR now changes the behavior such that we only perform this check when the connection is idle - if we are busy receiving gossip messages over the wire, we know for sure the connection is alive and can delay sending the pings.

This commit changes the ping/pong mechanism used for liveness check. Previously we will always fire a ping every minute, even we know for sure the connection is alive as we are busy processing other messages like gossip updates. We now only fire a ping if we haven't received any messages from the peer after one minute.

coderabbitai · 2025-05-13T14:10:09Z

Important

Review skipped

Auto reviews are limited to specific labels.

🏷️ Labels to auto review (1)

llm-review

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

Review comments: Directly reply to a review comment made by CodeRabbit. Example:
- I pushed a fix in commit <commit_id>, please review it.
- Explain this complex logic.
- Open a follow-up GitHub issue for this discussion.
Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
- @coderabbitai explain this code block.
- @coderabbitai modularize this function.
PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
- @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
- @coderabbitai read src/utils.ts and explain its main purpose.
- @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
- @coderabbitai help me debug CodeRabbit configuration file.

Support

Need help? Create a ticket on our support page for assistance with any issues or questions.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

@coderabbitai pause to pause the reviews on a PR.
@coderabbitai resume to resume the paused reviews.
@coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
@coderabbitai full review to do a full review from scratch and review all the files again.
@coderabbitai summary to regenerate the summary of the PR.
@coderabbitai generate docstrings to generate docstrings for this PR.
@coderabbitai generate sequence diagram to generate a sequence diagram of the changes in this PR.
@coderabbitai resolve resolve all the CodeRabbit review comments.
@coderabbitai configuration to show the current CodeRabbit configuration for the repository.
@coderabbitai help to get help.

Other keywords and placeholders

Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (`.coderabbit.yaml`)

You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
Please see the configuration documentation for more information.
If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

Visit our Documentation for detailed information on how to use CodeRabbit.
Join our Discord Community to get help, request features, and share feedback.
Follow us on X/Twitter for updates and announcements.

morehouse

LGTM

morehouse

Actually there might be a slight problem here...

Currently writeHandler will force a disconnect if no message has been successfully sent to the peer in 5 minutes. Prior to this change, pings would be sent every 1 minute, thus preventing this disconnect. Now it is possible that the peer sends a gossip message often enough to prevent the pings, and then gets disconnected as a result.

Roasbeef · 2025-05-13T22:25:54Z

Prior to this change, pings would be sent every 1 minute, thus preventing this disconnect. Now it is possible that the peer sends a gossip message often enough to prevent the pings, and then gets disconnected as a result.

One idea would be to do an explicit ping check at that point before D/C in the writeHandler. If it works (pong received), we continue onwards, otherwise, we disconnect.

Roasbeef · 2025-05-13T23:21:18Z

Here's a patch that implements that idea above: Roasbeef@120375e

Before we would: "disconnect if we haven't written a message in 5 minutes".

Now we'll: "force a ping to be sent if we haven't written a message in 5 minutes".

Returning to @morehouse's scenario above, instead of never disconnect, we'll send a ping, then if that doesn't receive a timely response, we'll disconnect.

ziggie1984 · 2025-05-14T09:41:14Z

Really nice idea with the combination of the patch +1 on this approach.

Similar to how we delay sending pings on successful reads, we now delay sending pings if we know the connection is up and alive for writes.

Remove this `idleTimer` as the `readNextMessage` will time out after 5 seconds anyway, which means it's unlikely the `idleTimer` will fire due to a congestion in the read pipe of the connection. When a timeout happens, this error is not caught and will break the loop and jump directly to the disconnection and this commit doesn't change that behaivor. This commit will change the behaivor that, if any of the message processing takes longer than 5 minute, it will not disconnect but wait for that process to be finished. Since all the processing are async, this should never happen.

This commit removes the `idleTimer` used in the `writeHandler`. When the connection is idle for one minute, a ping will be written to the connection, which means the `idleTimer` won't be fired. If retrying the write fails due to timeout, we will disconnect the peer.

yyforyongyu · 2025-05-14T13:02:00Z

peer/brontide.go

 		select {
 		case outMsg := <-p.sendQueue:
+			// Reset the timeout.
+			flushTimer.Reset(writeFlushTimeout)


Given we are using go1.23, there's no need to call Stop anymore, reference.

yyforyongyu · 2025-05-14T13:06:27Z

Currently writeHandler will force a disconnect if no message has been successfully sent to the peer in 5 minutes.

Nice catch. I think the same goes for the readHandler, if the writer is busy and the ping to be sent is queued at the very bottom, it could cause the reader to time out too.

Here's a patch that implements that idea above

Thanks for the patch! In addition I think we need to have a similar check in the reader too. Tried a bit locally and realized we could avoid interfering the ping/pong logic. Basically I think we could just remove the idleTimer completely since we always fire a ping if the connection is idle, then we only need to catch read/write timeout error and optionally close the conn there. So it goes like this,

If there's only read but no write, no ping is sent, and write won't time out.
If there's only write but no read, no ping is sent, and read won't time out.
When the connection is idle without any read or write, a ping/pong will be sent/received.
- if sending ping times out, we disconnect in writeHandler.
- if we haven't received a pong under 30 seconds, we disconnect, which could happen if,
  - the peer doesn't respond with a pong, we disconnect in pingHandler, or,
  - if reading pong times out, we disconnect in readHandler.

Thus we rely solely on the ping/pong mechanism to check for liveness, and can get rid of the idleTimer.

Another approach is to share the idle timer between the reader and writer, which can make sure one busy read won't make the write time out, see the commit here yyforyongyu@1ab9447

morehouse · 2025-05-14T16:29:51Z

Thus we rely solely on the ping/pong mechanism to check for liveness, and can get rid of the idleTimer.

I really like this approach, but I'm afraid there's a DoS vulnerability hidden in here due to the way LND queues outgoing messages.

Current situation

Outgoing messages travel through a series of chans and queues before being sent via TCP to the peer:

Brontide.outgoingQueue (unbuffered)
priorityMsgs or lazyMsgs in Brontide.queueHandler (unlimited list size)
Brontide.sendQueue (unbuffered)

Once messages have been removed from outgoingQueue, they are buffered in either priorityMsgs or lazyMsgs, both of which have unbounded size. One by one, messages are removed from priorityMsgs then lazyMsgs and sent through the sendQueue.

Processing of messages through the sendQueue is synchronous -- writeHandler won't process a new message until the current message is successfully sent via TCP and ACKed by the peer. If the peer doesn't ACK on the TCP layer, writeHandler will retry sending for up to 5 minutes (idleTimeout) before disconnecting the peer.

Accidental DoS protection

A peer can purposely refuse to ACK on the TCP layer, thereby causing messages to back up in the outgoing priorityMsgs and lazyMsgs buffers. If this was the full story, memory use would grow unbounded and it would be trivial for any peer to DoS LND this way.

The idleTimeout provides only a tiny defense against such an attack, requiring the attacker to ACK one message every 5 minutes to avoid being disconnected. As long as messages are getting queued at a faster rate than this, memory use will still continue to increase unbounded.

However the ping-pong mechanism provides a much stronger (accidental) defense against this attack. Pings are sent regularly every ~1 minute with a random NumPongBytes specified in the message. The attacker must actually receive these pings to know the proper pong length to send in response to prevent disconnection after 30 seconds. If the attacker is refusing to ACK at the TCP layer, the ping message will never actually be sent to the attacker, and the attacker will be disconnected.

Situation after this PR

After this PR, pings will only be sent after 1 minute of no messages being sent or received successfully. So if an attacker regularly sends an arbitrary message to the victim, the attacker will never actually have to respond to a ping. That means the attacker can refuse to ACK at the TCP layer (except once every 5 minutes), causing the victim's memory use to grow unbounded.

Path forward

Really we need to have a discussion about proper outgoing message handling, including things like queue limits, dropping messages, forcing disconnection, etc. If we can mitigate potential DoS that way, then this PR would be great.

IIUC, we don't urgently need this PR since the ping timeouts have been largely mitigated by #9804. For now I think we should leave the current ping-pong mechanism as is, to protect against the DoS attack described above, so we don't hold up the 0.19 release on a more robust solution.

yyforyongyu · 2025-05-14T19:53:52Z

@morehouse Thanks for the detailed analysis! Yes unfortunately the remote peer can hold the ACK since TCP is full-duplex so they can keep sending us msgs without ACKing ours. Will put this PR on hold and revisit for 0.20, also have some ideas about how to better design this area, will create a doc to kick off the discussions.

lightninglabs-deploy · 2025-05-28T21:29:52Z

@yyforyongyu, remember to re-request review from reviewers when ready

yyforyongyu · 2025-05-29T07:32:07Z

!lightninglabs-deploy mute

yyforyongyu self-assigned this May 13, 2025

yyforyongyu added p2p Code related to the peer-to-peer behaviour size/micro small bug fix or feature, less than 15 mins of review, less than 250 labels May 13, 2025

docs: update release notes

744feab

yyforyongyu force-pushed the optional-ping branch from 225f290 to 744feab Compare May 13, 2025 14:10

saubyk added this to the v0.19.0 milestone May 13, 2025

morehouse approved these changes May 13, 2025

View reviewed changes

morehouse requested changes May 13, 2025

View reviewed changes

Roasbeef mentioned this pull request May 13, 2025

peer+lnd: add new CLI option to control if we D/C on slow pongs #9801

Merged

yyforyongyu added 3 commits May 14, 2025 18:59

peer: delay pings on successful writes

63b2f30

Similar to how we delay sending pings on successful reads, we now delay sending pings if we know the connection is up and alive for writes.

yyforyongyu commented May 14, 2025

View reviewed changes

saubyk modified the milestones: v0.19.0, v0.20.0 May 14, 2025

yyforyongyu modified the milestones: v0.20.0, v0.21.0 Sep 16, 2025

peer: only send pings if the connection is idle for one minute #9805

Are you sure you want to change the base?

peer: only send pings if the connection is idle for one minute #9805

Uh oh!

Conversation

yyforyongyu commented May 13, 2025

Uh oh!

coderabbitai bot commented May 13, 2025

Review skipped

Chat

Support

CodeRabbit Commands (Invoked using PR comments)

Other keywords and placeholders

CodeRabbit Configuration File (.coderabbit.yaml)

Documentation and Community

Uh oh!

morehouse left a comment

Choose a reason for hiding this comment

Uh oh!

morehouse left a comment

Choose a reason for hiding this comment

Uh oh!

Roasbeef commented May 13, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Roasbeef commented May 13, 2025

Uh oh!

ziggie1984 commented May 14, 2025

Uh oh!

yyforyongyu May 14, 2025

Choose a reason for hiding this comment

Uh oh!

yyforyongyu commented May 14, 2025

Uh oh!

morehouse commented May 14, 2025

Current situation

Accidental DoS protection

Situation after this PR

Path forward

Uh oh!

yyforyongyu commented May 14, 2025

Uh oh!

lightninglabs-deploy commented May 28, 2025

Uh oh!

yyforyongyu commented May 29, 2025

Uh oh!

Uh oh!

CodeRabbit Configuration File (`.coderabbit.yaml`)

Roasbeef commented May 13, 2025 •

edited

Loading