Skip to content

ThreadX: Fix auto-reconnect and session cleanup#1126

Open
g4sp3r wants to merge 8 commits intoeclipse-zenoh:mainfrom
g4sp3r:main
Open

ThreadX: Fix auto-reconnect and session cleanup#1126
g4sp3r wants to merge 8 commits intoeclipse-zenoh:mainfrom
g4sp3r:main

Conversation

@g4sp3r
Copy link
Contributor

@g4sp3r g4sp3r commented Dec 23, 2025

In our system (stm32 - threadx) with serial link to rmw_zenohd we experience issues if router gets restarted - zenoh-pico client would just keep publishing with no indication anything is wrong.
After restarting the router the topics are of course not shown because no re-connection attempt is done by the client. This is even with enabling Z_FEATURE_AUTO_RECONNECT.

  1. First issue is found in src/transport/unicast/read.c where peer->common._received is always set to true even if no data is received. Probably this problem was created by STM32 ThreadX network.c implementation which uses non blocking read in _z_read_serial_internal. Non-blocking read was needed because otherwise client never re-sends the initial handshake to the router, meaning router would need to be started before the client.
  2. With 1 fixed, the client now detects loss of lease when the router is restarted. Next problem is closing and reopening the connection. Some ThreadX system support functions were missing to properly stop all running tasks and are added in this PR.

We tried using Z_FEATURE_AUTO_RECONNECT unsuccessfully, so in our system we currently check if the read_task is still running, if not we undeclare pub/sub and close the session. After small timeout we again open the session and declare pub/sub.
Like this we finally have a system that survives common communication disturbances without any reboots.


🏷️ Label-Based Checklist

No specific label requirements detected.

Current labels: No labels

Add one of these labels to this PR to see relevant checklist items: api-sync, breaking-change, bug, ci, dependencies, documentation, enhancement, new feature, internal

This section updates automatically when labels change.

@g4sp3r g4sp3r marked this pull request as draft December 25, 2025 21:49
@g4sp3r
Copy link
Contributor Author

g4sp3r commented Dec 25, 2025

Needs more work to support Z_FEATURE_AUTO_RECONNECT=1

Implemented cleanup queue for threads that need to self clean up. Memory
is released when creating new threads.

In current implementation new threads are created before _z_task_exit()
is called so there is always one z_task waiting to be released between
reconnects.
Allows closing session and undeclaring resources when lease expires.

When lease expires the link is cleared. In case of
Z_FEATURE_AUTO_RECONNECT=0, we want to call z_close, z_drop_ and
z_undeclare_ to release resources but that sends messages on the closed
link.
@g4sp3r
Copy link
Contributor Author

g4sp3r commented Dec 27, 2025

Current version working with both manual and automatic reconnect.

Added cleanup queue for tasks that need to self-terminate (auto-reconnect case). Tasks are freed from the queue when creating new tasks. Currently one task remains in queue between reconnects due to _z_reopen() creating new tasks before _z_task_exit() is called.

Another issue is found in manual reconnect case where trying to drop session and undeclare resources after lease expires leads to sending messages with cleared link. This locks the thread in our system in assert->exit() and leads to segmentation fault in tests with tcp Linux client. Adding the NULL checks in send_msg allows us to release all the memory before re-opening the connection again.

After quick test of memory usage we observe ~22KB used with Z_FEATURE_AUTO_RECONNECT=0 and ~40KB used with Z_FEATURE_AUTO_RECONNECT=1. Considering memory usage is x2 and we currently loose another 4KB with the "stuck" task we will be using the manual re-connection method handled by application logic for now.

@g4sp3r g4sp3r changed the title ThreadX: Fix unicast link detection. Fix closing connection. (Router restart handling) ThreadX: Fix auto-reconnect and session cleanup Dec 27, 2025
Should not use TX_WAIT_FOREVER from ISR.
Also properly check the return code.

Messages are now dropped if read thread can't keep up with the load.
@g4sp3r g4sp3r marked this pull request as ready for review January 9, 2026 10:11
@g4sp3r
Copy link
Contributor Author

g4sp3r commented Jan 30, 2026

Pinging again for review. Changes in src/transport/... are kind of unexpected at this stage to say the least.

The PR is already running in real deployment.

@g4sp3r g4sp3r mentioned this pull request Feb 4, 2026
@shoniecaplan
Copy link

Following this

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants