Conversation

@andrewjstone
Contributor

Builds on #9232

This is the first step in wrapping the trust_quorum::Node so that it can be used in an async context and integrated with sled-agent. Only the sprockets networking has been fully integrated so far such that each NodeTask has a ConnMgr that sets up a full mesh of sprockets connections. A test for this connectivity behavior has been written but the code is not wired into the production code yet.

Messages can be sent between NodeTasks over sprockets connections. Each connection exists in its own task managed by an EstablishedConn. The main NodeTask task sends messages to and receives messages from this task to interact with the outside world via sprockets. Currently only Ping messages are sent over the wire as a means to keep the connections alive and detect disconnects.

A NodeHandle allows one to interact with the NodeTask. Currently only three operations are implemented, with messages defined in NodeApiRequest. The user can inform the node who its peers are on the bootstrap network to establish connectivity, can poll for connectivity status, and can shut down the node. All of this functionality is used in the accompanying test.
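As a rough illustration of the handle/task split described above, here is a std-only sketch. The names (`NodeApiRequest`, the three operation variants) come from the description, but their exact shapes are assumptions, and std threads and channels stand in for the async tasks the PR actually uses:

```rust
use std::sync::mpsc::{channel, Sender};
use std::thread;

// Hypothetical request enum mirroring the three operations the PR
// describes; the exact variant shapes are assumptions.
enum NodeApiRequest {
    SetPeers(Vec<String>),
    ConnStatus(Sender<Vec<String>>),
    Shutdown,
}

// A `NodeHandle`-style wrapper: just a sender into the node task.
struct NodeHandle {
    tx: Sender<NodeApiRequest>,
}

impl NodeHandle {
    fn set_peers(&self, peers: Vec<String>) {
        let _ = self.tx.send(NodeApiRequest::SetPeers(peers));
    }
    fn conn_status(&self) -> Vec<String> {
        // One-shot reply channel, as a stand-in for an async oneshot.
        let (reply_tx, reply_rx) = channel();
        let _ = self.tx.send(NodeApiRequest::ConnStatus(reply_tx));
        reply_rx.recv().unwrap_or_default()
    }
    fn shutdown(&self) {
        let _ = self.tx.send(NodeApiRequest::Shutdown);
    }
}

fn spawn_node_task() -> (NodeHandle, thread::JoinHandle<()>) {
    let (tx, rx) = channel();
    let join = thread::spawn(move || {
        let mut peers: Vec<String> = Vec::new();
        while let Ok(req) = rx.recv() {
            match req {
                NodeApiRequest::SetPeers(p) => peers = p,
                // For the sketch, report the configured peers as connected.
                NodeApiRequest::ConnStatus(reply) => {
                    let _ = reply.send(peers.clone());
                }
                NodeApiRequest::Shutdown => break,
            }
        }
    });
    (NodeHandle { tx }, join)
}

fn main() {
    let (handle, join) = spawn_node_task();
    handle.set_peers(vec!["node-b".to_string(), "node-c".to_string()]);
    println!("{:?}", handle.conn_status()); // prints ["node-b", "node-c"]
    handle.shutdown();
    join.join().unwrap();
}
```

The key point the sketch captures is that the handle owns only a sender; all node state lives in the single task, so callers never touch it directly.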

It's important to reiterate that this code only implements connectivity between trust quorum nodes; no actual trust quorum messages are sent. They can't be, as a handle cannot yet initiate a reconfiguration or LRTQ upgrade. That behavior will come in a follow-up. This PR is large enough.

A lot of this code is similar to the LRTQ connection management code, except that it operates over sprockets rather than TCP channels. This introduces some complexity, but it is mostly abstracted away into the SprocketsConfig.

@andrewjstone andrewjstone changed the base branch from main to update-sprockets October 18, 2025 22:43
@andrewjstone andrewjstone force-pushed the tq-sprockets branch 5 times, most recently from f24e5fb to 8257402 on October 18, 2025 23:22
}
}

pub async fn run(&mut self) {
Contributor Author
This is nearly identical to LRTQ

Comment on lines 182 to 343
    async fn on_read(
        &mut self,
        res: Result<usize, std::io::Error>,
    ) -> Result<(), ConnErr> {
        match res {
            Ok(n) => {
                self.total_read += n;
            }
            Err(e) => {
                return Err(ConnErr::FailedRead(e));
            }
        }

        // We may have more than one message that has been read
        loop {
            if self.total_read < FRAME_HEADER_SIZE {
                return Ok(());
            }
            // Read frame size
            let size = read_frame_size(
                self.read_buf[..FRAME_HEADER_SIZE].try_into().unwrap(),
            );
            let end = size + FRAME_HEADER_SIZE;

            // If we haven't read the whole message yet, then return
            if end > self.total_read {
                return Ok(());
            }
            let msg: WireMsg =
                ciborium::from_reader(&self.read_buf[FRAME_HEADER_SIZE..end])?;
            // Move any remaining bytes to the beginning of the buffer.
            self.read_buf.copy_within(end..self.total_read, 0);
            self.total_read -= end;

            self.last_received_msg = Instant::now();
            debug!(self.log, "Received {msg:?}");
            match msg {
                WireMsg::Tq(msg) => {
                    if let Err(e) = self
                        .main_tx
                        .send(ConnToMainMsg {
                            task_id: self.task_id,
                            msg: ConnToMainMsgInner::Received {
                                from: self.peer_id.clone(),
                                msg,
                            },
                        })
                        .await
                    {
                        warn!(
                            self.log,
                            "Failed to send received fsm msg to main task: {e:?}"
                        );
                    }
                }
                WireMsg::Ping => {
                    // Nothing to do here, since Ping is just to keep us alive
                    // and we updated self.last_received_msg above.
                }
                WireMsg::NetworkConfig(config) => {
                    let generation = config.generation;
                    if let Err(e) = self
                        .main_tx
                        .send(ConnToMainMsg {
                            task_id: self.task_id,
                            msg: ConnToMainMsgInner::ReceivedNetworkConfig {
                                from: self.peer_id.clone(),
                                config,
                            },
                        })
                        .await
                    {
                        warn!(
                            self.log,
                            "Failed to send received NetworkConfig with \
                             generation {generation} to main task: {e:?}"
                        );
                    }
                }
            }
        }
    }

    async fn check_write_result(
        &mut self,
        res: Result<usize, std::io::Error>,
    ) -> Result<(), ConnErr> {
        match res {
            Ok(_) => {
                if !self.current_write.has_remaining() {
                    self.current_write = Cursor::new(Vec::new());
                }
                Ok(())
            }
            Err(e) => {
                let _ = self.writer.shutdown().await;
                Err(ConnErr::FailedWrite(e))
            }
        }
    }

    async fn on_msg_from_main(
        &mut self,
        msg: MainToConnMsg,
    ) -> Result<(), ConnErr> {
        match msg {
            MainToConnMsg::Close => Err(ConnErr::Close),
            MainToConnMsg::Msg(msg) => self.write_framed_to_queue(msg).await,
        }
    }

    async fn write_framed_to_queue(
        &mut self,
        msg: WireMsg,
    ) -> Result<(), ConnErr> {
        if self.write_queue.len() == MSG_WRITE_QUEUE_CAPACITY {
            return Err(ConnErr::WriteQueueFull);
        }
        let msg = write_framed(&msg)?;
        self.write_queue.push_back(msg);
        Ok(())
    }

    async fn ping(&mut self) -> Result<(), ConnErr> {
        if Instant::now() - self.last_received_msg > INACTIVITY_TIMEOUT {
            return Err(ConnErr::InactivityTimeout);
        }
        self.write_framed_to_queue(WireMsg::Ping).await
    }
}

// Decode the 4-byte big-endian frame size header
fn read_frame_size(buf: [u8; FRAME_HEADER_SIZE]) -> usize {
    u32::from_be_bytes(buf) as usize
}

/// Serialize `msg` into a new buffer, prefixed by a 4-byte big-endian size
/// header
///
/// Return the framed buffer, including the 4-byte header.
fn write_framed<T: Serialize + ?Sized>(
    msg: &T,
) -> Result<Vec<u8>, ciborium::ser::Error<std::io::Error>> {
    let mut cursor = Cursor::new(vec![]);
    // Write a size placeholder; we backfill it once we know the actual size.
    std::io::Write::write_all(&mut cursor, &[0u8; FRAME_HEADER_SIZE])?;
    ciborium::into_writer(msg, &mut cursor)?;
    let size: u32 =
        (cursor.position() - FRAME_HEADER_SIZE as u64).try_into().unwrap();
    let mut buf = cursor.into_inner();
    buf[0..FRAME_HEADER_SIZE].copy_from_slice(&size.to_be_bytes());
    Ok(buf)
}
Contributor Author
This is all nearly identical to LRTQ. The logic is the same, and LRTQ has been running without incident for over two years. What has changed are some names and error types.

socket2 = { version = "0.5", features = ["all"] }
sp-sim = { path = "sp-sim" }
-sprockets-tls = { git = "https://github.com/oxidecomputer/sprockets.git", rev = "7da1f0b5dcd3d631da18b43ba78a84b1a2b425ee" }
+sprockets-tls = { git = "https://github.com/oxidecomputer/sprockets.git", rev = "dea3bbfac7d9d3c45f088898fcd05ee5d2ec2210" }
Contributor Author
This just pulls in a couple of helpers.

@andrewjstone
Contributor Author

I added @jgallagher and @hawkw as reviewers to look over the bulk of the code as it is async. @labbott and @flihp you can mostly ignore code that isn't related to sprockets and dice setup/test helpers.

Base automatically changed from update-sprockets to main October 21, 2025 15:52
@andrewjstone andrewjstone force-pushed the tq-sprockets branch 2 times, most recently from b44c01e to 7d20b2d on October 21, 2025 19:48
@andrewjstone
Contributor Author

@hawkw Thanks for the great review! I believe I fixed up everything related to your comments, and then some ;)

@andrewjstone
Contributor Author

I think I'm going to have to put all the code in this PR into a new crate; or rather, put the existing code into a trust-quorum-protocol crate. The problem is that tqdb depends on trust-quorum, but now that trust-quorum pulls in sprockets, tqdb transitively depends on sprockets. While this is bad in that it makes little sense, it's made worse by the fact that sprockets pulls in libipcc, which isn't available for all platforms and which causes the helios / build TUF repo job to fail in CI. I could probably make this work by having tqdb link with libipcc, but that makes so little sense considering tqdb operates on the sans-io protocol that I should just bite the bullet and create a separate crate. I really should have done that from the beginning.

@andrewjstone
Contributor Author

Done in 79a6730

@andrewjstone
Contributor Author

I realized that there is a possibility of deadlock between the main task and the connection tasks due to the use of bounded channels and blocking sends. The main task can be sending a message to write to a connection task while that task is simultaneously sending a message read off the wire to the main task. The channels can be sized differently, but the risk is inherent in the structure. We can't use try_send and drop messages because most messages cannot safely be dropped: they will not be retried for a given connection unless the connection gets torn down and re-established. There is a serialized write buffer for each connection that can absorb some burst from the main task without blocking on the sprockets network channel. If this gets exceeded we tear down the connections. It's possible that if we size this buffer smaller than the channel capacity, this will resolve the deadlock issue by restarting slow connections.

A few other thoughts for resolution:

  1. Use unbounded channels. The frequency of messages is very low and I'm not particularly concerned about memory usage here. I'm more concerned about picking a size that's too small, or unnecessarily using a bound that's too large.
  2. Put the socket writer in its own task so that we have a DAG of tasks. The main task will only send to writers and the read task will only send to the main task. This strategy unfortunately does prevent doing things like sending messages to get statistics from tasks, but that probably isn't that big a deal. We can always use atomics or some other strategy, like yet another task, if necessary.
  3. For every send from the main task to a writer, spawn a sending task to do the write. This seems quite silly and is basically another form of an unbounded channel.
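Option 2 can be sketched with std threads and bounded channels standing in for tokio tasks: the reader task only sends to main, and main only sends to the writer task, so there is no cycle of blocking sends and no deadlock regardless of channel capacities. All names here (`run_pipeline`, the `ack` formatting) are illustrative:

```rust
use std::sync::mpsc::sync_channel;
use std::thread;

// With a DAG of channels (reader -> main -> writer) there is no cycle of
// blocking sends, so small bounded channels cannot deadlock each other.
fn run_pipeline() -> Vec<String> {
    // main -> writer: the writer task never sends back to main.
    let (to_writer, writer_rx) = sync_channel::<String>(2);
    // reader -> main: the reader task only produces toward main.
    let (to_main, main_rx) = sync_channel::<String>(2);

    // Stand-in for the socket writer task: only consumes.
    let writer = thread::spawn(move || writer_rx.into_iter().collect());

    // Stand-in for the socket reader task: only produces.
    let reader = thread::spawn(move || {
        for i in 0..3 {
            to_main.send(format!("msg-{i}")).unwrap();
        }
    });

    // Main task: drains the reader and forwards toward the writer. Even if
    // a send here blocks, the writer is draining, so progress is guaranteed.
    for msg in main_rx {
        to_writer.send(format!("ack {msg}")).unwrap();
    }
    reader.join().unwrap();
    drop(to_writer); // close the channel so the writer's collect finishes
    writer.join().unwrap()
}

fn main() {
    println!("{:?}", run_pipeline());
}
```

The trade-off noted above still applies: because the writer never sends back, main can't query it directly for statistics over the same channels.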
