ThreadX: Fix auto-reconnect and session cleanup#1126
ThreadX: Fix auto-reconnect and session cleanup#1126g4sp3r wants to merge 8 commits intoeclipse-zenoh:mainfrom
Conversation
|
Needs more work to support Z_FEATURE_AUTO_RECONNECT=1 |
Implemented cleanup queue for threads that need to self clean up. Memory is released when creating new threads. In current implementation new threads are created before _z_task_exit() is called so there is always one z_task waiting to be released between reconnects.
Allows closing session and undeclaring resources when lease expires. When lease expires the link is cleared. In case of Z_FEATURE_AUTO_RECONNECT=0, we want to call z_close, z_drop_ and z_undeclare_ to release resources but that sends messages on the closed link.
|
Current version working with both manual and automatic reconnect. Added cleanup queue for tasks that need to self-terminate (auto-reconnect case). Tasks are freed from the queue when creating new tasks. Currently one task remains in queue between reconnects due to _z_reopen() creating new tasks before _z_task_exit() is called. Another issue is found in manual reconnect case where trying to drop session and undeclare resources after lease expires leads to sending messages with cleared link. This locks the thread in our system in assert->exit() and leads to segmentation fault in tests with tcp Linux client. Adding the NULL checks in send_msg allows us to release all the memory before re-opening the connection again. After quick test of memory usage we observe ~22KB used with Z_FEATURE_AUTO_RECONNECT=0 and ~40KB used with Z_FEATURE_AUTO_RECONNECT=1. Considering memory usage is x2 and we currently loose another 4KB with the "stuck" task we will be using the manual re-connection method handled by application logic for now. |
Should not use TX_WAIT_FOREVER from ISR. Also properly check the return code. Messages are now dropped if read thread can't keep up with the load.
|
Pinging again for review. Changes in The PR is already running in real deployment. |
|
Following this |
In our system (stm32 - threadx) with serial link to rmw_zenohd we experience issues if router gets restarted - zenoh-pico client would just keep publishing with no indication anything is wrong.
After restarting the router the topics are of course not shown because no re-connection attempt is done by the client. This is even with enabling Z_FEATURE_AUTO_RECONNECT.
src/transport/unicast/read.cwherepeer->common._receivedis always set totrueeven if no data is received. Probably this problem was created by STM32 ThreadX network.c implementation which uses non blocking read in_z_read_serial_internal. Non-blocking read was needed because otherwise client never re-sends the initial handshake to the router, meaning router would need to be started before the client.We tried using Z_FEATURE_AUTO_RECONNECT unsuccessfully, so in our system we currently check if the read_task is still running, if not we undeclare pub/sub and close the session. After small timeout we again open the session and declare pub/sub.
Like this we finally have a system that survives common communication disturbances without any reboots.
🏷️ Label-Based Checklist
No specific label requirements detected.
Current labels: No labels
Add one of these labels to this PR to see relevant checklist items:
api-sync,breaking-change,bug,ci,dependencies,documentation,enhancement,new feature,internalThis section updates automatically when labels change.