
kad exceeds substream limit due to outbound timeout, but no inbound timeout #5981

@teor2345


Summary

When an outbound kad substream times out (10s), it is removed from the substream list, and a new outbound substream can be opened.

But on the inbound side, there appears to be no timeout, so stale inbound substreams keep counting towards the substream limit, and the node drops new inbound substreams once it is at that limit. (If all other substreams are waiting for their first message, or are in another state, they can't be re-used, so the new substream gets dropped.)

This causes thousands of "substream limit exceeded" warnings on the inbound side. It can also slow down syncing a lot, in some cases making it impossible.

This bug is self-triggering, because the dropped inbound substreams also time out on the outbound side.

Edit: this is not a duplicate of #3236. The cause is different, and it only happens under specific load conditions.

Expected behavior

Inbound substreams time out after approximately 10 seconds.

Ideally the inbound timeout is slightly shorter, because the timeout starts on the outbound side immediately, but only starts on the inbound side after the network transmission delay. If there is a long network delay for earlier substreams, but a short network delay for later substreams, this warning can still happen occasionally.
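As a rough illustration of the expected behavior, here is a minimal sketch of inbound slots that expire on their own. It is not the kad handler's code: `InboundSlots`, `inbound_timeout`, and the 500 ms `INBOUND_MARGIN` are hypothetical names and values chosen for the example. Only the 10 second outbound timeout comes from this issue.

```rust
use std::time::{Duration, Instant};

// From the issue: outbound kad substreams time out after 10 seconds.
const OUTBOUND_TIMEOUT: Duration = Duration::from_secs(10);
// Assumption: an arbitrary safety margin, not a value from rust-libp2p.
const INBOUND_MARGIN: Duration = Duration::from_millis(500);

/// Inbound timeout, slightly shorter than the outbound one.
fn inbound_timeout() -> Duration {
    OUTBOUND_TIMEOUT.saturating_sub(INBOUND_MARGIN)
}

/// Hypothetical tracker for inbound substreams (not the kad handler's type).
struct InboundSlots {
    limit: usize,
    timeout: Duration,
    opened_at: Vec<Instant>,
}

impl InboundSlots {
    fn new(limit: usize, timeout: Duration) -> Self {
        Self { limit, timeout, opened_at: Vec::new() }
    }

    /// Evicts substreams older than the timeout, then accepts the new
    /// substream only if a slot is free. Returns false if it is dropped.
    fn try_accept(&mut self, now: Instant) -> bool {
        let timeout = self.timeout;
        self.opened_at
            .retain(|opened| now.saturating_duration_since(*opened) < timeout);
        if self.opened_at.len() < self.limit {
            self.opened_at.push(now);
            true
        } else {
            false
        }
    }

    /// Number of substreams currently counted against the limit.
    fn len(&self) -> usize {
        self.opened_at.len()
    }
}
```

With a limit of 2, two substreams opened at the same instant fill the tracker, a third one arriving 5 seconds later is dropped, and one arriving 10 seconds later is accepted because both older entries have expired by then.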

Actual behavior

Inbound substreams that have timed out on the outbound side seem to hang around for much longer than 10s. Maybe they are only removed when a read fails on them, or when some other error happens?

Relevant log output

2025-04-08T06:24:27.293722Z WARN Consensus: libp2p_kad::handler: New inbound substream to peer exceeds inbound substream limit. No older substream waiting to be reused. Dropping new substream. peer=PeerId("12D3KooWN6kFp2Ev181UGq3BUDfk1jfjaNu6sDTqxCZUBpmp8kRQ")

Possible Solution

On the sending side, outbound substreams only count towards the limit until they time out:

StreamUpgradeError::Timeout => io::ErrorKind::TimedOut.into(),

if self.outbound_substreams.len() < MAX_NUM_STREAMS {

And the outbound timeout is 10 seconds:

Duration::from_secs(10),

But on the receiving side, inbound substreams count towards the limit until they've received a message:

if let Poll::Ready(Some(event)) = self.inbound_substreams.poll_next_unpin(cx) {

} => match substream.poll_next_unpin(cx) {

Poll::Ready(Some(Err(e))) => {

and can't be re-used if the sender times out on the first message:

InboundSubstreamState::WaitingMessage { first: false, .. }
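The reuse rule described above can be pictured with a simplified model. This is a sketch under assumptions, not the handler's implementation: `InboundState`, `Admission`, and `admit` are hypothetical names, the enum has far fewer states than the real `InboundSubstreamState`, and the idle slot is overwritten in place for brevity.

```rust
/// Simplified stand-in for the handler's inbound substream states.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum InboundState {
    /// Waiting for a request; `first` is true until the first message arrives.
    WaitingMessage { first: bool },
    /// Any other state (for example, a request is being handled).
    Processing,
}

/// Outcome of a new inbound substream arriving (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
enum Admission {
    Accepted,
    ReplacedIdle,
    Dropped,
}

/// Admits a new inbound substream under `limit`. At the limit, only a
/// substream that already served a message and is idle again
/// (`first: false`) can make room; otherwise the new one is dropped.
fn admit(substreams: &mut Vec<InboundState>, limit: usize) -> Admission {
    if substreams.len() < limit {
        substreams.push(InboundState::WaitingMessage { first: true });
        return Admission::Accepted;
    }
    let idle = substreams
        .iter_mut()
        .find(|s| matches!(**s, InboundState::WaitingMessage { first: false }));
    match idle {
        Some(slot) => {
            // The new substream takes over the idle slot.
            *slot = InboundState::WaitingMessage { first: true };
            Admission::ReplacedIdle
        }
        None => Admission::Dropped,
    }
}
```

In this model, a substream whose sender gave up before sending anything stays at `first: true` indefinitely, so once the list is full of such entries every new substream ends up as `Dropped`. That matches the warning in the log output.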

There is no inbound timeout:
https://github.com/libp2p/rust-libp2p/blob/b56b47aa6510ab4af0ae797a7f036364d414ae3e/protocols/kad/src/handler.rs#L75C5-L75C23

Here is how other protocols implement matching inbound and outbound timeouts:

inbound_workers: futures_bounded::FuturesSet::new(

Version

Latest main, back to at least 0.54.2

Would you like to work on fixing this bug?

Yes
