Member

@Roasbeef Roasbeef commented May 9, 2025

In this commit, we add a new CLI option to control whether we disconnect on slow pongs. Due to head-of-line blocking at various levels of abstraction (app buffer, slow processing, TCP kernel buffers, etc.), if there's a flurry of gossip messages (e.g., 1K channel updates), then even with reasonable processing latency, a peer may still not read our ping in time.

To combat this, we change the default behavior to only log slow pongs, and add a new CLI option to re-enable the old behavior.

Along the way, we also add more detailed logging, so we can tell when the last successful ping was, as well as the deadline that was reached.

Contributor

coderabbitai bot commented May 9, 2025

Important

Review skipped

Auto reviews are limited to specific labels.

🏷️ Labels to auto review (1)
  • llm-review

@Roasbeef Roasbeef force-pushed the pong-relax branch 2 times, most recently from eeb7897 to 8c23404 Compare May 9, 2025 23:53
@djkazic
Contributor

djkazic commented May 10, 2025

tACK, this PR improves connection reliability with my peers compared to just 0.19.0-beta-rc4.

Some slight weirdness that I figured would be good to document:

Pong size did not match expected size:

2025-05-10 12:15:51.125 [WRN] PEER: Peer(03f311f6bb92443e93ee75af7b8cabe0b8bd49f7eaa2757086124c1a75cbe2bfcd): pong response failure for 03f311f6bb92443e93ee75af7b8cabe0b8bd49f7eaa2757086124c1a75cbe2bfcd@96.230.252.205:9734: pong response does not match expected size. Expected: 1486, Got: 788. Time waited for this pong: 28.401370887s. Last successful RTT: 209.450171ms. -- not disconnecting due to config
2025-05-10 13:04:19.667 [WRN] PEER: Peer(0206dfc7fa5f56f3ce752c6621b1d03223c11b16cffb05835b67481c12c6d87031): pong response failure for 0206dfc7fa5f56f3ce752c6621b1d03223c11b16cffb05835b67481c12c6d87031@96.230.252.205:16079: pong response does not match expected size. Expected: 1402, Got: 1846. Time waited for this pong: 23.744974397s. Last successful RTT: 2.675905ms. -- not disconnecting due to config
2025-05-10 13:55:26.524 [WRN] PEER: Peer(021d2436cab847373a4212bf6d754ead5304f5d0791479643893a837b295f3441c): pong response failure for 021d2436cab847373a4212bf6d754ead5304f5d0791479643893a837b295f3441c@10.21.21.11:56890: pong response does not match expected size. Expected: 1164, Got: 3340. Time waited for this pong: 1.071593118s. Last successful RTT: 882.700042ms. -- not disconnecting due to config

Back-to-back repeated Pongs (remote is CLN):

2025-05-10 14:59:23.851 [DBG] PEER: Peer(03f311f6bb92443e93ee75af7b8cabe0b8bd49f7eaa2757086124c1a75cbe2bfcd): Received Pong(len(pong_bytes)=2146) from 03f311f6bb92443e93ee75af7b8cabe0b8bd49f7eaa2757086124c1a75cbe2bfcd@96.230.252.205:9734
2025-05-10 14:59:23.851 [DBG] PEER: Peer(03f311f6bb92443e93ee75af7b8cabe0b8bd49f7eaa2757086124c1a75cbe2bfcd): Received Pong(len(pong_bytes)=1267) from 03f311f6bb92443e93ee75af7b8cabe0b8bd49f7eaa2757086124c1a75cbe2bfcd@96.230.252.205:9734
2025-05-10 14:59:23.851 [DBG] PEER: Peer(03f311f6bb92443e93ee75af7b8cabe0b8bd49f7eaa2757086124c1a75cbe2bfcd): Received Pong(len(pong_bytes)=1052) from 03f311f6bb92443e93ee75af7b8cabe0b8bd49f7eaa2757086124c1a75cbe2bfcd@96.230.252.205:9734
2025-05-10 14:59:23.851 [WRN] PEER: Peer(03f311f6bb92443e93ee75af7b8cabe0b8bd49f7eaa2757086124c1a75cbe2bfcd): pong response failure for 03f311f6bb92443e93ee75af7b8cabe0b8bd49f7eaa2757086124c1a75cbe2bfcd@96.230.252.205:9734: pong response does not match expected size. Expected: 3013, Got: 2146. Time waited for this pong: 1.127449045s. Last successful RTT: 209.450171ms. -- not disconnecting due to config

Collaborator

@guggero guggero left a comment

I think the flag could be useful!

@saubyk saubyk added this to the v0.19.0 milestone May 12, 2025
@ziggie1984 ziggie1984 self-requested a review May 12, 2025 09:08
@ziggie1984
Collaborator

I propose adding a Connection-Failure-Threshold (N) that only disconnects after several failed ping/pongs in a row, rather than just adding a connection flag. WDYT?
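
To make the proposal concrete, here is a minimal sketch of what a consecutive-failure threshold could look like. Everything in it (the `failureTracker` type, its methods, and the threshold value) is hypothetical and not taken from lnd; it only illustrates "disconnect after N failures in a row".

```go
package main

import "fmt"

// failureTracker is a hypothetical helper that counts consecutive pong
// failures. It is an illustration only, not part of lnd.
type failureTracker struct {
	threshold   int // consecutive failures needed before disconnecting
	consecutive int // current failure streak
}

func newFailureTracker(threshold int) *failureTracker {
	return &failureTracker{threshold: threshold}
}

// recordFailure notes one failed ping/pong round and returns true once the
// streak has reached the threshold.
func (f *failureTracker) recordFailure() bool {
	f.consecutive++
	return f.consecutive >= f.threshold
}

// recordSuccess resets the streak after a valid pong.
func (f *failureTracker) recordSuccess() {
	f.consecutive = 0
}

func main() {
	t := newFailureTracker(3)

	fmt.Println(t.recordFailure()) // streak of 1
	fmt.Println(t.recordFailure()) // streak of 2
	t.recordSuccess()              // a good pong resets the streak
	fmt.Println(t.recordFailure()) // streak of 1 again
	fmt.Println(t.recordFailure()) // streak of 2
	fmt.Println(t.recordFailure()) // streak of 3, threshold reached
}
```

With a threshold of 1 this degenerates to disconnecting on the first failure, so a single knob of this kind could in principle express both behaviors.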

@Roasbeef
Member Author

I propose adding a Connection-Failure-Threshold (N) that only disconnects after several failed ping/pongs in a row, rather than just adding a connection flag. WDYT?

I had also considered making it a sort of peer-level health check, to inherit that threshold logic, but instead went in this direction as I started to second-guess the design rationale for disconnecting in the first place. I think if we add a threshold flag, then we'd also want to add a flag to tune the timeout value.

When I started to run this on my node (even w/ the super prio queue), I noticed some nodes that were just persistently slow in replying.

Ultimately slow nodes do affect payment latency e2e. There's a credible design direction here where we start to factor it in at the first hop level, but then also have the link sample the ping RTT of a peer, and decide if the link is even eligible to send based on that.

Collaborator

@ellemouton ellemouton left a comment

the last commit does not change existing behaviour afaict

@Roasbeef Roasbeef force-pushed the pong-relax branch 3 times, most recently from 3c75628 to f0fcffa Compare May 14, 2025 21:12
@Roasbeef
Member Author

The latest version inverts the original PR: default stays, but users have an option to turn off the disconnect behavior.

@guggero guggero force-pushed the pong-relax branch 2 times, most recently from 1f498d3 to d6d25a9 Compare May 15, 2025 07:49
Collaborator

@guggero guggero left a comment

Pushed up a fix for the sample config and also made the behavior of return vs. continue in the ping manager clearer.

With that, LGTM 🎉

@guggero guggero requested a review from ellemouton May 15, 2025 07:51
Collaborator

@ellemouton ellemouton left a comment

just one comment - otherwise lgtm!

Roasbeef added 2 commits May 15, 2025 16:36
In this commit, we add a new CLI option to control whether we disconnect
on slow pongs. Due to head-of-line blocking at various levels of
abstraction (app buffer, slow processing, TCP kernel buffers, etc.), if
there's a flurry of gossip messages (e.g., 1K channel updates), then
even with reasonable processing latency, a peer may still not read our
ping in time.

To give users another option, we add a flag that lets them disable the
disconnect. The default behavior (disconnecting) is unchanged.
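
As a rough illustration of the behavior described in this commit message, the sketch below gates the disconnect on a boolean. Only the `NoDisconnectOnPongFailure` name and the "not disconnecting due to config" log suffix appear in this PR; the `pongPolicy` type, `handlePongFailure`, `errPongTimeout`, and the other log wording are assumptions made for the example and are not lnd's actual code.

```go
package main

import (
	"errors"
	"fmt"
)

// errPongTimeout is a hypothetical error used only for this example.
var errPongTimeout = errors.New("pong response timed out")

// pongPolicy is a hypothetical stand-in for the relevant config. Only the
// field name is taken from the PR discussion.
type pongPolicy struct {
	// NoDisconnectOnPongFailure defaults to false, which keeps the
	// existing behavior of disconnecting on a failed pong.
	NoDisconnectOnPongFailure bool
}

// handlePongFailure always logs the failure and reports whether the caller
// should disconnect from the peer.
func (p pongPolicy) handlePongFailure(err error,
	logf func(format string, args ...any)) bool {

	if p.NoDisconnectOnPongFailure {
		logf("pong response failure: %v -- not disconnecting "+
			"due to config", err)

		return false
	}

	logf("pong response failure: %v -- disconnecting", err)

	return true
}

func main() {
	logf := func(format string, args ...any) {
		fmt.Printf(format+"\n", args...)
	}

	// Zero value: the default, which still disconnects.
	def := pongPolicy{}
	fmt.Println("disconnect:", def.handlePongFailure(errPongTimeout, logf))

	// Opt-out: only log the failure.
	relaxed := pongPolicy{NoDisconnectOnPongFailure: true}
	fmt.Println("disconnect:", relaxed.handlePongFailure(errPongTimeout, logf))
}
```

Phrasing the flag negatively means the zero value of the config preserves the old behavior, which matches "the default remains".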
Collaborator

@ziggie1984 ziggie1984 left a comment

LGTM

Comment on lines +751 to +753
// If NoDisconnectOnPongFailure is true, we don't
// disconnect. Otherwise (if it's false, the default),
// we disconnect.
Collaborator

Nit: This comment can be removed; it just restates the code.

}

// getLastRTT safely retrieves the last known RTT, returning 0 if none exists.
func (m *PingManager) getLastRTT() time.Duration {
Collaborator

Nit: Use fn.Option here as well, similar to pendingPingWait below?
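
For context on this nit, the sketch below shows why an optional type can be preferable to returning 0 when no RTT exists: it lets callers tell "no pong seen yet" apart from a measured RTT of zero. The `Option` type here is a small local stand-in written for this example, not lnd's `fn.Option`, whose API may differ; `rttTracker` and its methods are likewise hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

// Option is a minimal local stand-in for an optional value. It is NOT lnd's
// fn.Option and only mirrors the general idea.
type Option[T any] struct {
	val   T
	isSet bool
}

// Some wraps a present value.
func Some[T any](v T) Option[T] { return Option[T]{val: v, isSet: true} }

// None returns an empty option.
func None[T any]() Option[T] { return Option[T]{} }

// IsSome reports whether a value is present.
func (o Option[T]) IsSome() bool { return o.isSet }

// UnwrapOr returns the value if present, otherwise the given default.
func (o Option[T]) UnwrapOr(def T) T {
	if o.isSet {
		return o.val
	}

	return def
}

// rttTracker is a hypothetical holder for the last measured round-trip time.
// Its zero value has no RTT recorded.
type rttTracker struct {
	lastRTT Option[time.Duration]
}

func (r *rttTracker) record(rtt time.Duration) { r.lastRTT = Some(rtt) }

func (r *rttTracker) last() Option[time.Duration] { return r.lastRTT }

func main() {
	var r rttTracker

	// Before any pong arrives there is no RTT at all.
	fmt.Println("has RTT:", r.last().IsSome())

	r.record(209 * time.Millisecond)
	fmt.Println("has RTT:", r.last().IsSome())
	fmt.Println("last RTT:", r.last().UnwrapOr(0))
}
```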

close(pingSent)
})
},
OnPongFailure: func(err error,
Collaborator

Nit: Maybe this should be protected by a sync.Once as well, since it closes a channel?
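
The concern behind this nit is that closing an already-closed channel panics in Go, so a callback that can fire more than once needs a guard. A minimal sketch of the pattern using the standard library's `sync.Once` follows; the `onceCloser` type and its method names are hypothetical and only illustrate the idea, they are not the test code under review.

```go
package main

import (
	"fmt"
	"sync"
)

// onceCloser is a hypothetical wrapper that closes its signal channel at
// most once, so a callback that fires repeatedly cannot cause a
// "close of closed channel" panic.
type onceCloser struct {
	once sync.Once
	ch   chan struct{}
}

func newOnceCloser() *onceCloser {
	return &onceCloser{ch: make(chan struct{})}
}

// Close closes the channel on the first call and is a no-op afterwards.
func (c *onceCloser) Close() {
	c.once.Do(func() {
		close(c.ch)
	})
}

// Done exposes the channel for receivers to wait on.
func (c *onceCloser) Done() <-chan struct{} {
	return c.ch
}

// isClosed reports whether Close has already run, without blocking.
func (c *onceCloser) isClosed() bool {
	select {
	case <-c.ch:
		return true
	default:
		return false
	}
}

func main() {
	c := newOnceCloser()

	c.Close()
	c.Close() // second call is a no-op rather than a panic

	<-c.Done()
	fmt.Println("closed:", c.isClosed())
}
```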

@Roasbeef Roasbeef merged commit 71dbc18 into lightningnetwork:master May 15, 2025
32 of 36 checks passed
6 participants