fix: propagate connection errors for blocked peer fallback (#2021) #2105

sanity · 2025-11-18T02:18:06Z

Problem

Issue #2021: Blocked peer mechanism not working - when connection to blocked peer fails, operations don't fall back to alternative peers.

Solution (Option A)

Propagate connection errors to operation handlers so they can retry with different peers:

- Connection error propagation: When connection establishment fails in p2p_protoc.rs, call handle_aborted_op
- GET retry logic: Inject empty ReturnGet to trigger existing fallback mechanism
- Test infrastructure: Fix contract loading in run_app_blocked_peers test

Status

✅ Option A implemented and validated
✅ Contract loading fixed
✅ GET abort handling wired up
⚠️ Test remains #[ignore] - exposes pre-existing PUT/subscribe bug (filed separately)

Main branch can't even run this test (contract loading fails). This PR is an improvement.

Next Steps

Separate stacked PR will fix PUT/subscribe parent/child completion bug.

Closes #2021

- Add handle_aborted_op call when connection establishment fails
- Inject empty ReturnGet to trigger existing retry logic for GET ops
- Fix contract loading in run_app_blocked_peers test
- Test remains #[ignore] until PUT/subscribe bug fixed separately

Implements Option A from issue #2021

iduartgomez · 2025-11-18T13:59:50Z

crates/core/src/operations/get.rs

+            ..
+        }) = &self.state
+        {
+            let return_msg = GetMsg::ReturnGet {


What is the next state after doing this? What do we achieve?

This injects an empty ReturnGet while staying in AwaitingResponse, so the existing fallback will pop the next candidate from the stored alternatives/skip list and retry with another peer instead of hanging on the failed connection.
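The fallback described above can be sketched in isolation. This is a minimal model, not the code in get.rs: `Peer`, `AwaitingResponse`, `ReturnGet`, `Outcome`, `on_return_get`, and `on_connection_failed` are simplified, hypothetical stand-ins, and the real operation goes through `op_manager.notify_op_change` rather than a direct function call.

```rust
use std::collections::{HashSet, VecDeque};

/// Hypothetical stand-in for a peer identifier.
type Peer = &'static str;

/// Simplified GET state: only the fields the fallback needs.
struct AwaitingResponse {
    current_target: Peer,
    alternatives: VecDeque<Peer>,
    tried_peers: HashSet<Peer>,
}

/// Simplified response message; `value: None` models the empty ReturnGet.
struct ReturnGet {
    sender: Peer,
    value: Option<Vec<u8>>,
}

#[derive(Debug, PartialEq)]
enum Outcome {
    Found(Vec<u8>),
    Retrying(Peer),
    Exhausted,
}

/// Existing fallback path: an empty response moves on to the next untried peer.
fn on_return_get(state: &mut AwaitingResponse, msg: ReturnGet) -> Outcome {
    if let Some(value) = msg.value {
        return Outcome::Found(value);
    }
    state.tried_peers.insert(msg.sender);
    while let Some(next) = state.alternatives.pop_front() {
        // `insert` returns false if the peer was already tried, so it is skipped.
        if state.tried_peers.insert(next) {
            state.current_target = next;
            return Outcome::Retrying(next);
        }
    }
    Outcome::Exhausted
}

/// Abort handling: synthesize an empty response "from" the failed target so
/// the same fallback path runs instead of the operation hanging.
fn on_connection_failed(state: &mut AwaitingResponse) -> Outcome {
    let synthetic = ReturnGet {
        sender: state.current_target,
        value: None,
    };
    on_return_get(state, synthetic)
}
```

The point of the design is that the abort path adds no retry logic of its own; it only converts a connection failure into an input the state machine already knows how to handle.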

iduartgomez · 2025-11-18T14:00:31Z

crates/core/src/operations/get.rs

+            let return_msg = GetMsg::ReturnGet {
+                id: self.id,
+                key: *key,
+                value: StoreResponse {
+                    state: None,
+                    contract: None,
+                },
+                sender: current_target.clone(),
+                target: op_manager.ring.connection_manager.own_location(),
+                skip_list: skip_list.clone(),
+            };
+
+            op_manager
+                .notify_op_change(NetMessage::from(return_msg), OpEnum::Get(self))
+                .await?;
+            return Err(OpError::StatePushed);


@claude what happens after doing this, what is achieved

Same intent as above: pushing an empty ReturnGet back into the GET state machine so it reuses the existing retry path (alternatives/skip list) and moves on to another peer instead of waiting on the failed connection.

crates/core/src/ring/connection_manager.rs

Copilot

Pull Request Overview

This PR fixes issue #2021 by implementing connection error propagation to enable blocked peer fallback, adds a transient connection management system for gateways, and fixes contract loading in tests.

Key changes:

- Propagates connection failures through handle_aborted_op to allow GET operations to retry with alternative peers
- Adds transient connection tracking with configurable budget and TTL for gateway resource management
- Fixes contract loading in the blocked peers test by using the load_contract helper

Reviewed Changes

Copilot reviewed 12 out of 13 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| crates/core/src/operations/get.rs | Adds `handle_abort` method to retry GET operations on connection failures; extends state to track current target and skip list |
| crates/core/src/node/network_bridge/p2p_protoc.rs | Implements connection error propagation, transient connection handling, and connection reuse logic |
| crates/core/src/node/mod.rs | Extends `handle_aborted_op` to process GET operation failures and adds transient configuration fields |
| crates/core/src/ring/connection_manager.rs | Adds transient connection tracking with registration, budget management, and TTL expiration |
| crates/core/src/config/mod.rs | Introduces `transient_budget` and `transient_ttl_secs` configuration options with defaults |
| apps/freenet-ping/app/tests/run_app_blocked_peers.rs | Fixes contract loading by properly computing code hash before creating container |
| crates/freenet-macros/src/codegen.rs | Updates generated test configs to include new transient fields |
| crates/core/tests/*.rs | Updates test configurations with new transient connection parameters |
| apps/freenet-ping/app/tests/common/mod.rs | Adds transient config fields to test helper functions |

Copilot · 2025-11-19T01:01:34Z

crates/core/src/node/network_bridge/p2p_protoc.rs

+                .get_peer_key()
+                .expect("peer key should be set when connection exists");
+            callback
+                .send_result(Ok((resolved_peer_id, None)))


This code incorrectly returns the local node's own peer ID instead of the remote peer's ID. When reusing an existing connection, the callback should receive the peer ID of the connection being reused (peer), not the local node's peer key. This would cause the operation to incorrectly believe it has connected to itself rather than to the target peer.

The correct code should be:

    callback
        .send_result(Ok((peer.clone(), None)))
        .await

Suggested change

-                .send_result(Ok((resolved_peer_id, None)))
+                .send_result(Ok((peer.clone(), None)))

Fixed: when reusing an existing connection we now return the remote peer ID (peer.clone()) instead of the local peer key so the requester gets the correct peer back.

yeah this was weird, was wondering the same, good catch copilot

Copilot · 2025-11-19T01:01:35Z

crates/core/src/node/network_bridge/p2p_protoc.rs

+        if is_transient {
+            let connection_manager = &self.bridge.op_manager.ring.connection_manager;
+            let current = connection_manager.transient_count();
+            if current >= connection_manager.transient_budget() {
+                tracing::warn!(
+                    remote = %peer_id.addr,
+                    budget = connection_manager.transient_budget(),
+                    current,
+                    "Transient connection budget exhausted; dropping inbound connection"
+                );
+                return Ok(());


There's a potential race condition in the transient connection budget check. The check at line 1611 reads the current count, but the transient connection is only registered later at line 1711 (after the connection is inserted). Between these two points, multiple concurrent connections could pass the budget check before any of them are registered, allowing the budget to be exceeded.

Consider checking and atomically incrementing a counter, or moving the registration earlier in the flow before inserting the connection.

Added an early transient reservation that atomically bumps the budget before notifying callbacks or inserting the connection; if the budget is exhausted we drop the inbound connection immediately and release the reservation when we skip insertion. The location gets updated later once we know it.
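A check-then-register race of this kind is commonly closed by folding the check and the increment into one atomic step. The sketch below shows that idea with the standard library only; `TransientBudget`, `try_reserve`, and `release` are hypothetical names and are not taken from connection_manager.rs.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Hypothetical budget counter for transient connections.
struct TransientBudget {
    in_use: AtomicUsize,
    budget: usize,
}

impl TransientBudget {
    fn new(budget: usize) -> Self {
        Self {
            in_use: AtomicUsize::new(0),
            budget,
        }
    }

    /// Atomically claims one slot. Returns false when the budget is exhausted.
    /// The comparison and the increment happen in a single compare-and-swap
    /// loop, so concurrent callers cannot all pass the check on a stale count.
    fn try_reserve(&self) -> bool {
        self.in_use
            .fetch_update(Ordering::AcqRel, Ordering::Acquire, |current| {
                if current < self.budget {
                    Some(current + 1)
                } else {
                    None
                }
            })
            .is_ok()
    }

    /// Gives a slot back, for example when the connection is not inserted
    /// after all. `checked_sub` keeps the counter from underflowing.
    fn release(&self) {
        let _ = self
            .in_use
            .fetch_update(Ordering::AcqRel, Ordering::Acquire, |current| {
                current.checked_sub(1)
            });
    }

    fn in_use(&self) -> usize {
        self.in_use.load(Ordering::Acquire)
    }
}
```

With this shape, a caller reserves before doing any work on the inbound connection and calls `release` on every path that skips insertion.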

Copilot · 2025-11-19T01:01:36Z

crates/core/src/operations/get.rs

                                    attempts_at_hop: attempts_at_hop + 1,
+                                    key,
+                                    current_target: next_target,
+                                    skip_list: updated_tried_peers,


The skip_list at line 865 is incorrectly set to updated_tried_peers, which discards the original skip_list parameter. The skip_list should accumulate peers to avoid across retries, not be replaced. When trying an alternative peer at the same hop, the skip_list should remain unchanged (it represents peers to skip when finding new candidates), while tried_peers tracks what we've tried at this specific hop.

The correct assignment should be:

skip_list: skip_list.clone(),

Or if you want to merge both:

    skip_list: {
        let mut merged = skip_list.clone();
        merged.extend(updated_tried_peers.iter().cloned());
        merged
    },

Suggested change

-                                    skip_list: updated_tried_peers,
+                                    skip_list: skip_list.clone(),

Updated to keep the incoming skip_list intact (tried_peers handles per-hop attempts) and added a brief comment clarifying the separation between skip_list and tried_peers.
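The separation between the two sets can be illustrated with a small model. This is a sketch under assumptions: `RetryState`, `retry_same_hop`, and `next_candidate` are hypothetical, peers are plain integers, and only the field names `skip_list`, `tried_peers`, and `attempts_at_hop` come from the diff.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for a peer identifier.
type Peer = u32;

struct RetryState {
    /// Peers to avoid when looking for new candidates; carried across retries.
    skip_list: HashSet<Peer>,
    /// Peers already attempted at the current hop.
    tried_peers: HashSet<Peer>,
    attempts_at_hop: usize,
}

/// Retrying at the same hop records the new target in `tried_peers` and
/// leaves `skip_list` exactly as it was.
fn retry_same_hop(state: &RetryState, next_target: Peer) -> RetryState {
    let mut updated_tried_peers = state.tried_peers.clone();
    updated_tried_peers.insert(next_target);
    RetryState {
        skip_list: state.skip_list.clone(),
        tried_peers: updated_tried_peers,
        attempts_at_hop: state.attempts_at_hop + 1,
    }
}

/// Picks the first candidate that is in neither set.
fn next_candidate(candidates: &[Peer], state: &RetryState) -> Option<Peer> {
    candidates
        .iter()
        .copied()
        .find(|p| !state.skip_list.contains(p) && !state.tried_peers.contains(p))
}
```

Overwriting `skip_list` with the per-hop set would silently forget every peer that was skipped for reasons unrelated to this hop, which is the bug the review comment describes.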

Copilot · 2025-11-19T01:01:37Z

crates/core/src/ring/connection_manager.rs

+    pub fn register_transient(&self, peer: PeerId, location: Option<Location>) {
+        self.transient_connections.write().insert(
+            peer,
+            TransientEntry {
+                opened_at: Instant::now(),
+                location,
+            },
+        );
+    }


The public method register_transient lacks documentation. Consider adding a doc comment explaining:

- What a transient connection is
- When and why this method should be called
- The parameters and their purpose
- Any side effects or invariants

Example:

    /// Registers a new transient connection for the given peer.
    ///
    /// Transient connections are temporary connections accepted by gateways
    /// that are not yet added to the ring topology. They have a limited TTL
    /// and budget to prevent resource exhaustion.
    ///
    /// # Arguments
    /// * `peer` - The peer ID of the transient connection
    /// * `location` - Optional location of the peer in the ring
    pub fn register_transient(&self, peer: PeerId, location: Option<Location>) {

Added doc comments and renamed to try_register_transient to make the budgeted, fail-fast behavior explicit. Also tracks the count atomically so we respect the configured quota even under concurrency.

Copilot · 2025-11-19T01:01:38Z

crates/core/src/ring/connection_manager.rs

+            },
+        );
+    }
+


The public method drop_transient lacks documentation. Consider adding a doc comment explaining:

- What this method does
- When it should be called
- What the return value represents
- Any side effects

Example:

    /// Removes a transient connection from tracking.
    ///
    /// This should be called when a transient connection is either promoted
    /// to a normal connection or when it expires and needs to be dropped.
    ///
    /// # Arguments
    /// * `peer` - The peer ID of the transient connection to drop
    ///
    /// # Returns
    /// The `TransientEntry` if it existed, `None` otherwise
    pub fn drop_transient(&self, peer: &PeerId) -> Option<TransientEntry> {

Suggested change

+    /// Removes a transient connection from tracking.
+    ///
+    /// This should be called when a transient connection is either promoted
+    /// to a normal connection or when it expires and needs to be dropped.
+    ///
+    /// # Arguments
+    /// * `peer` - The peer ID of the transient connection to drop
+    ///
+    /// # Returns
+    /// The `TransientEntry` if it existed, `None` otherwise.
+    ///
+    /// # Side Effects
+    /// Removes the transient connection from the internal tracking map.

Documented drop_transient and made it decrement the transient counter when removing an entry so the budget remains accurate.
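Keeping the counter accurate means the decrement has to be tied to an entry actually leaving the map. A minimal sketch of that pairing follows; it assumes a `std::sync::Mutex` around a `HashMap` and an integer `PeerId`, neither of which is taken from the real `ConnectionManager`, and the `location` field is omitted.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Mutex;
use std::time::Instant;

/// Hypothetical stand-in for the real peer ID type.
type PeerId = u64;

#[derive(Debug)]
struct TransientEntry {
    opened_at: Instant,
}

/// Hypothetical tracker pairing a map of entries with an atomic count.
struct TransientTracker {
    entries: Mutex<HashMap<PeerId, TransientEntry>>,
    count: AtomicUsize,
    budget: usize,
}

impl TransientTracker {
    fn new(budget: usize) -> Self {
        Self {
            entries: Mutex::new(HashMap::new()),
            count: AtomicUsize::new(0),
            budget,
        }
    }

    /// Fail-fast registration: reserve a slot first, then record the entry.
    fn try_register_transient(&self, peer: PeerId) -> bool {
        let reserved = self
            .count
            .fetch_update(Ordering::AcqRel, Ordering::Acquire, |c| {
                (c < self.budget).then_some(c + 1)
            })
            .is_ok();
        if !reserved {
            return false;
        }
        let entry = TransientEntry {
            opened_at: Instant::now(),
        };
        let previous = self.entries.lock().unwrap().insert(peer, entry);
        if previous.is_some() {
            // The peer was already counted; give back the extra slot.
            self.count.fetch_sub(1, Ordering::AcqRel);
        }
        true
    }

    /// Removes the entry and decrements the count only if it existed, so a
    /// repeated drop for the same peer cannot free budget twice.
    fn drop_transient(&self, peer: &PeerId) -> Option<TransientEntry> {
        let removed = self.entries.lock().unwrap().remove(peer);
        if removed.is_some() {
            self.count.fetch_sub(1, Ordering::AcqRel);
        }
        removed
    }

    fn transient_count(&self) -> usize {
        self.count.load(Ordering::Acquire)
    }
}
```

One consequence of this simple shape is that re-registering an already tracked peer while the budget is full returns false; whether that is desirable depends on the caller.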

I think we can handle transient connections at the type level, which would make all of this safer and less error-prone, but that will require further changes and refactoring, so I will do that in a separate PR in the future.

Copilot · 2025-11-19T01:01:38Z

crates/core/src/operations/get.rs

+    pub(crate) async fn handle_abort(self, op_manager: &OpManager) -> Result<(), OpError> {
+        if let Some(GetState::AwaitingResponse {
+            key,
+            current_target,
+            skip_list,
+            ..
+        }) = &self.state
+        {
+            // We synthesize an empty ReturnGet back to ourselves to reuse the existing
+            // fallback path that tries the next candidate. The state stays
+            // AwaitingResponse so the retry logic can pick up from the stored
+            // alternatives/skip list.
+            let return_msg = GetMsg::ReturnGet {
+                id: self.id,
+                key: *key,
+                value: StoreResponse {
+                    state: None,
+                    contract: None,
+                },
+                sender: current_target.clone(),
+                target: op_manager.ring.connection_manager.own_location(),
+                skip_list: skip_list.clone(),
+            };
+
+            op_manager
+                .notify_op_change(NetMessage::from(return_msg), OpEnum::Get(self))
+                .await?;
+            return Err(OpError::StatePushed);
+        }
+
+        // If we weren't awaiting a response, just put the op back.
+        // No retry needed; another handler may pick it up later.
+        op_manager.push(self.id, OpEnum::Get(self)).await?;
+        Ok(())
+    }


The public method handle_abort lacks documentation. Consider adding a doc comment explaining:

- What scenarios trigger an abort
- How the retry mechanism works
- What the error return values mean (especially OpError::StatePushed)
- The relationship with handle_aborted_op

Example:

    /// Handles an aborted outbound connection attempt by triggering retry logic.
    ///
    /// When a connection to a peer fails, this method synthesizes an empty
    /// `ReturnGet` message to trigger the existing fallback path that tries
    /// alternative peers from the stored candidates list.
    ///
    /// # Arguments
    /// * `op_manager` - The operation manager for accessing ring state
    ///
    /// # Returns
    /// * `Err(OpError::StatePushed)` - Successfully initiated retry (not an error)
    /// * `Ok(())` - Operation was not awaiting a response, pushed back to queue
    /// * Other errors - Failed to process the abort
    pub(crate) async fn handle_abort(self, op_manager: &OpManager) -> Result<(), OpError> {

Added a doc comment explaining that this synthesizes an empty ReturnGet to reuse the retry path, when it returns OpError::StatePushed, and why the op is returned to the queue if it was not waiting on a response.
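Because `Err(OpError::StatePushed)` signals a successfully scheduled retry rather than a failure, a caller has to separate it from genuine errors. The sketch below shows one way to do that; the `OpError` enum here is a cut-down, hypothetical version (only the `StatePushed` name appears in the diff), and `AbortOutcome` and `classify_abort` are invented for illustration.

```rust
/// Cut-down, hypothetical error type; only `StatePushed` is named in the diff.
#[derive(Debug, PartialEq)]
enum OpError {
    StatePushed,
    ConnectionFailed(String),
}

/// Hypothetical summary of what an abort handler did.
#[derive(Debug, PartialEq)]
enum AbortOutcome {
    /// The op was handed back to the state machine with a synthetic message.
    RetryScheduled,
    /// The op was not awaiting a response and was simply pushed back.
    Requeued,
}

/// Maps the abort handler's result onto an outcome, treating `StatePushed`
/// as a control-flow signal and passing every other error through.
fn classify_abort(result: Result<(), OpError>) -> Result<AbortOutcome, OpError> {
    match result {
        Ok(()) => Ok(AbortOutcome::Requeued),
        Err(OpError::StatePushed) => Ok(AbortOutcome::RetryScheduled),
        Err(other) => Err(other),
    }
}
```

Making the distinction explicit at the call site keeps a successful retry from being logged or counted as a failed operation.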

iduartgomez

GET retry logic: Inject empty ReturnGet to trigger existing fallback mechanism

Gotta say it's a bit cheesy that we need to do this, but it's a shortcoming of the current callback mechanisms. This can also be improved in the future, when the op state machines are refactored once we are stable.

sanity requested a review from iduartgomez November 18, 2025 13:56

iduartgomez reviewed Nov 18, 2025

sanity added 2 commits November 18, 2025 11:11

docs: clarify GET abort fallback state

2f55435

Merge branch 'main' into fix/i2021-blocked-peers-option-a

0edc051

This was referenced Nov 18, 2025

fix: WebSocket connection reset on test teardown in blocked-peers test #2108

Open

fix: complete parent PUT when child subscribe finishes #2109

Merged

sanity requested a review from Copilot November 19, 2025 00:57

Copilot started reviewing on behalf of sanity November 19, 2025 00:57 View session

Copilot finished reviewing on behalf of sanity November 19, 2025 01:00

Copilot AI reviewed Nov 19, 2025

fix: address review feedback for option A

89faadd

iduartgomez approved these changes Nov 19, 2025

sanity added this pull request to the merge queue Nov 19, 2025

Merged via the queue into main with commit 7ab46a0 Nov 19, 2025
16 checks passed

sanity deleted the fix/i2021-blocked-peers-option-a branch November 19, 2025 15:11

This was referenced Nov 19, 2025

refactor: handle transient connections at type level #2115

Open

Restore transient connection boundary during connect handshake #2084

Open
