Skip to content

Conversation

yyforyongyu
Copy link
Member

Based on the specs, the ping is used for keeping the liveness of the connection. We perform this check every minute, regardless of whether the connection is live or not.

This PR now changes the behavior such that we only perform this check when the connection is idle - if we are busy receiving gossip messages over the wire, we know for sure the connection is alive and can delay sending the pings.

This commit changes the ping/pong mechanism used for liveness check.
Previously we will always fire a ping every minute, even we know for
sure the connection is alive as we are busy processing other messages
like gossip updates. We now only fire a ping if we haven't received any
messages from the peer after one minute.
@yyforyongyu yyforyongyu self-assigned this May 13, 2025
@yyforyongyu yyforyongyu added p2p Code related to the peer-to-peer behaviour size/micro small bug fix or feature, less than 15 mins of review, less than 250 labels May 13, 2025
Copy link
Contributor

coderabbitai bot commented May 13, 2025

Important

Review skipped

Auto reviews are limited to specific labels.

🏷️ Labels to auto review (1)
  • llm-review

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Explain this complex logic.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai explain this code block.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and explain its main purpose.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Support

Need help? Create a ticket on our support page for assistance with any issues or questions.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR.
  • @coderabbitai generate sequence diagram to generate a sequence diagram of the changes in this PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

@saubyk saubyk added this to the v0.19.0 milestone May 13, 2025
Copy link
Collaborator

@morehouse morehouse left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Collaborator

@morehouse morehouse left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually there might be a slight problem here...

Currently writeHandler will force a disconnect if no message has been successfully sent to the peer in 5 minutes. Prior to this change, pings would be sent every 1 minute, thus preventing this disconnect. Now it is possible that the peer sends a gossip message often enough to prevent the pings, and then gets disconnected as a result.

@Roasbeef
Copy link
Member

Roasbeef commented May 13, 2025

Prior to this change, pings would be sent every 1 minute, thus preventing this disconnect. Now it is possible that the peer sends a gossip message often enough to prevent the pings, and then gets disconnected as a result.

One idea would be to do an explicit ping check at that point before D/C in the writeHandler. If it works (pong received), we continue onwards, otherwise, we disconnect.

@Roasbeef
Copy link
Member

Here's a patch that implements that idea above: Roasbeef@120375e

Before we would: "disconnect if we haven't written a message in 5 minutes".

Now we'll: "force a ping to be sent if we haven't written a message in 5 minutes".

Returning to @morehouse's scenario above, instead of never disconnect, we'll send a ping, then if that doesn't receive a timely response, we'll disconnect.

@ziggie1984
Copy link
Collaborator

Really nice idea with the combination of the patch +1 on this approach.

Similar to how we delay sending pings on successful reads, we now delay
sending pings if we know the connection is up and alive for writes.
Remove this `idleTimer` as the `readNextMessage` will time out after 5
seconds anyway, which means it's unlikely the `idleTimer` will fire due
to a congestion in the read pipe of the connection. When a timeout
happens, this error is not caught and will break the loop and jump
directly to the disconnection and this commit doesn't change that
behaivor.

This commit will change the behaivor that, if any of the message
processing takes longer than 5 minute, it will not disconnect but wait
for that process to be finished. Since all the processing are async,
this should never happen.
This commit removes the `idleTimer` used in the `writeHandler`. When the
connection is idle for one minute, a ping will be written to the
connection, which means the `idleTimer` won't be fired. If retrying the
write fails due to timeout, we will disconnect the peer.
select {
case outMsg := <-p.sendQueue:
// Reset the timeout.
flushTimer.Reset(writeFlushTimeout)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given we are using go1.23, there's no need to call Stop anymore, reference.

@yyforyongyu
Copy link
Member Author

Currently writeHandler will force a disconnect if no message has been successfully sent to the peer in 5 minutes.

Nice catch. I think the same goes for the readHandler, if the writer is busy and the ping to be sent is queued at the very bottom, it could cause the reader to time out too.

Here's a patch that implements that idea above

Thanks for the patch! In addition I think we need to have a similar check in the reader too. Tried a bit locally and realized we could avoid interfering the ping/pong logic. Basically I think we could just remove the idleTimer completely since we always fire a ping if the connection is idle, then we only need to catch read/write timeout error and optionally close the conn there. So it goes like this,

  • If there's only read but no write, no ping is sent, and write won't time out.
  • If there's only write but no read, no ping is sent, and read won't time out.
  • When the connection is idle without any read or write, a ping/pong will be sent/received.
    • if sending ping times out, we disconnect in writeHandler.
    • if we haven't received a pong under 30 seconds, we disconnect, which could happen if,
      • the peer doesn't respond with a pong, we disconnect in pingHandler, or,
      • if reading pong times out, we disconnect in readHandler.

Thus we rely solely on the ping/pong mechanism to check for liveness, and can get rid of the idleTimer.

Another approach is to share the idle timer between the reader and writer, which can make sure one busy read won't make the write time out, see the commit here yyforyongyu@1ab9447

@morehouse
Copy link
Collaborator

Thus we rely solely on the ping/pong mechanism to check for liveness, and can get rid of the idleTimer.

I really like this approach, but I'm afraid there's a DoS vulnerability hidden in here due to the way LND queues outgoing messages.

Current situation

Outgoing messages travel through a series of chans and queues before being sent via TCP to the peer:

  1. Brontide.outgoingQueue (unbuffered)
  2. priorityMsgs or lazyMsgs in Brontide.queueHandler (unlimited list size)
  3. Brontide.sendQueue (unbuffered)

Once messages have been removed from outgoingQueue, they are buffered in either priorityMsgs or lazyMsgs, both of which have unbounded size. One by one, messages are removed from priorityMsgs then lazyMsgs and sent through the sendQueue.

Processing of messages through the sendQueue is synchronous -- writeHandler won't process a new message until the current message is successfully sent via TCP and ACKed by the peer. If the peer doesn't ACK on the TCP layer, writeHandler will retry sending for up to 5 minutes (idleTimeout) before disconnecting the peer.

Accidental DoS protection

A peer can purposely refuse to ACK on the TCP layer, thereby causing messages to back up in the outgoing priorityMsgs and lazyMsgs buffers. If this was the full story, memory use would grow unbounded and it would be trivial for any peer to DoS LND this way.

The idleTimeout provides only a tiny defense against such an attack, requiring the attacker to ACK one message every 5 minutes to avoid being disconnected. As long as messages are getting queued at a faster rate than this, memory use will still continue to increase unbounded.

However the ping-pong mechanism provides a much stronger (accidental) defense against this attack. Pings are sent regularly every ~1 minute with a random NumPongBytes specified in the message. The attacker must actually receive these pings to know the proper pong length to send in response to prevent disconnection after 30 seconds. If the attacker is refusing to ACK at the TCP layer, the ping message will never actually be sent to the attacker, and the attacker will be disconnected.

Situation after this PR

After this PR, pings will only be sent after 1 minute of no messages being sent or received successfully. So if an attacker regularly sends an arbitrary message to the victim, the attacker will never actually have to respond to a ping. That means the attacker can refuse to ACK at the TCP layer (except once every 5 minutes), causing the victim's memory use to grow unbounded.

Path forward

Really we need to have a discussion about proper outgoing message handling, including things like queue limits, dropping messages, forcing disconnection, etc. If we can mitigate potential DoS that way, then this PR would be great.

IIUC, we don't urgently need this PR since the ping timeouts have been largely mitigated by #9804. For now I think we should leave the current ping-pong mechanism as is, to protect against the DoS attack described above, so we don't hold up the 0.19 release on a more robust solution.

@yyforyongyu
Copy link
Member Author

@morehouse Thanks for the detailed analysis! Yes unfortunately the remote peer can hold the ACK since TCP is full-duplex so they can keep sending us msgs without ACKing ours. Will put this PR on hold and revisit for 0.20, also have some ideas about how to better design this area, will create a doc to kick off the discussions.

@saubyk saubyk modified the milestones: v0.19.0, v0.20.0 May 14, 2025
@lightninglabs-deploy
Copy link

@yyforyongyu, remember to re-request review from reviewers when ready

@yyforyongyu
Copy link
Member Author

!lightninglabs-deploy mute

@yyforyongyu yyforyongyu modified the milestones: v0.20.0, v0.21.0 Sep 16, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
p2p Code related to the peer-to-peer behaviour size/micro small bug fix or feature, less than 15 mins of review, less than 250
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants