zephyr: fix pthread stack pool overflow on reconnect (#1064) #1122
steils wants to merge 1 commit into eclipse-zenoh:main
@steils I have tried this code change in our project; it works for the network disconnect and reconnect scenario, but it does not work for the Zenoh router down-and-up scenario. In our case the router ID stays the same all the time; we are fetching it by using the -i option of zenohd and the motherboard ID of the instrument. Please try this scenario once.

@vinayramlingeg I'm afraid I cannot reproduce the crash in the router restart scenario. I'm doing the following:
No effect. Each time I stop zenohd, I stop receiving messages from the publisher, and when I start zenohd again, the subscriber resumes receiving messages.

@steils We tried a few more times and found that if we bring the router back up within about 20 seconds, it works fine; after 20 seconds, it fails to get an alive message.

@vinayramlingeg When you say "~20 seconds", do you mean

@steils The scenario is this: we kill the router process and wait. If we wait more than 20 seconds, the MCU doesn't receive any liveliness messages; if we wait less than 20 seconds, it recovers.

@steils Here are the main configurations we are using for the Zephyr project:
Kernel options
CONFIG_MAIN_STACK_SIZE=10240
Generic library options
CONFIG_NEWLIB_LIBC=y
Generic networking options
CONFIG_NETWORKING=y
CONFIG_NET_IPV4=y
Sockets
CONFIG_NET_SOCKETS=y
Network buffers
CONFIG_NET_PKT_RX_COUNT=16
Network address config
CONFIG_NET_CONFIG_SETTINGS=y
CONFIG_NET_IF_UNICAST_IPV6_ADDR_COUNT=3
Logging
CONFIG_NET_LOG=y
CONFIG_JSON_LIBRARY=y

@steils I verified that Z_FEATURE_AUTO_RECONNECT is enabled. Using breakpoints, I observed that when the router goes down, _z_reopen is executed. Inside this function, _z_scout_inner is called, but execution never leaves it, even after the router comes back. I traced further with breakpoints through __z_scout_loop() → _z_link_recv_zbuf(), and execution stops there with no further progress.

@Shreyas-CS15 Did you try reproducing the bug with the fix or without?

@steils Yes, I am able to reproduce the issue. As I said before, if I reconnect the router or Ethernet cable before the log below appears, auto-reconnection works; otherwise it fails to reconnect. After the logs below, I no longer get any zenoh-pico logs.
On Zephyr, _z_task_init() assigns preallocated pthread stacks from thread_stack_area[]. During repeated link resets/reconnects, zenoh-pico recreates the read/lease threads.
The previous implementation used a monotonically increasing thread_index++, which eventually indexed past the end of thread_stack_area[], corrupting memory and causing crashes.
Replace thread_index++ with a stack-slot pool. When attr == NULL, pick a free slot in thread_stack_area[], set it with pthread_attr_setstack(), and start the thread.
Slot release is now done with a thread-specific data key destructor.
Add CONFIG_ZENOH_PICO_ZEPHYR_THREADS_NUM (default 4) to configure the stack pool size.
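The slot-pool idea described above can be sketched in portable POSIX C. This is a minimal illustration, not the code in this PR: every zp_* name, the ZP_POOL_SLOTS macro (standing in for CONFIG_ZENOH_PICO_ZEPHYR_THREADS_NUM), and the trampoline are hypothetical, and the sketch only tracks slot ownership. On Zephyr, the acquired slot would additionally select an entry of thread_stack_area[] to pass to pthread_attr_setstack().

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for CONFIG_ZENOH_PICO_ZEPHYR_THREADS_NUM. */
#define ZP_POOL_SLOTS 4

/* Per-slot context handed to the new thread (hypothetical type). */
typedef struct {
    void *(*fn)(void *);
    void *arg;
    int slot;
} zp_task_ctx_t;

static bool zp_slot_used[ZP_POOL_SLOTS];
static zp_task_ctx_t zp_slot_ctx[ZP_POOL_SLOTS];
static pthread_mutex_t zp_pool_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_key_t zp_slot_key;
static pthread_once_t zp_slot_key_once = PTHREAD_ONCE_INIT;

/* Key destructor: runs on the exiting thread and returns its slot to the pool. */
static void zp_slot_release(void *value) {
    zp_task_ctx_t *ctx = value;
    pthread_mutex_lock(&zp_pool_lock);
    assert(ctx->slot >= 0 && ctx->slot < ZP_POOL_SLOTS);
    zp_slot_used[ctx->slot] = false;
    pthread_mutex_unlock(&zp_pool_lock);
}

static void zp_slot_key_create(void) {
    (void)pthread_key_create(&zp_slot_key, zp_slot_release);
}

/* Claims the lowest free slot, or returns -1 when the pool is exhausted. */
int zp_slot_acquire(void) {
    int found = -1;
    pthread_mutex_lock(&zp_pool_lock);
    for (int i = 0; i < ZP_POOL_SLOTS; i++) {
        if (!zp_slot_used[i]) {
            zp_slot_used[i] = true;
            found = i;
            break;
        }
    }
    pthread_mutex_unlock(&zp_pool_lock);
    return found;
}

/* Number of slots currently owned by a thread. */
int zp_slots_in_use(void) {
    int n = 0;
    pthread_mutex_lock(&zp_pool_lock);
    for (int i = 0; i < ZP_POOL_SLOTS; i++) {
        n += zp_slot_used[i] ? 1 : 0;
    }
    pthread_mutex_unlock(&zp_pool_lock);
    return n;
}

/* Registers the slot with the key so the destructor fires at thread exit. */
static void *zp_task_trampoline(void *value) {
    zp_task_ctx_t *ctx = value;
    (void)pthread_setspecific(zp_slot_key, ctx);
    return ctx->fn(ctx->arg);
}

/* Starts fn(arg) on a pooled slot; returns the slot index, or -1 on failure. */
int zp_pool_task_init(pthread_t *task, void *(*fn)(void *), void *arg) {
    pthread_once(&zp_slot_key_once, zp_slot_key_create);
    int slot = zp_slot_acquire();
    if (slot < 0) {
        return -1; /* pool exhausted: fail instead of indexing out of bounds */
    }
    zp_slot_ctx[slot] = (zp_task_ctx_t){.fn = fn, .arg = arg, .slot = slot};
    /* On Zephyr, the slot's stack would be attached here via pthread_attr_setstack(). */
    if (pthread_create(task, NULL, zp_task_trampoline, &zp_slot_ctx[slot]) != 0) {
        zp_slot_release(&zp_slot_ctx[slot]); /* thread never ran: free the slot here */
        return -1;
    }
    return slot;
}

/* Trivial task body used for demonstration. */
void *zp_demo_task(void *arg) { return arg; }
```

Creating and joining a thread more times than there are slots always yields an index below ZP_POOL_SLOTS, because each exiting thread hands its slot back, whereas an ever-growing counter would run off the end of the array. When every slot is busy, zp_pool_task_init() returns -1 rather than touching memory outside the pool.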
Fixes: #1064
Reproducing
Verify messages arrive:
[sub] demo/example/h753/sub = ...
Reproduce reconnection:
Expected result before the fix (fail)
After several reconnects, the firmware crashes due to a corrupted stack, as in issue #1064.
Expected result (pass)
alive tick=...
[sub] ... messages.
With -DZENOH_LOG_DEBUG, you may also see zenoh-pico debug logs; there should be no "slot OOM" errors.