zephyr: fix pthread stack pool overflow on reconnect (#1064) by steils · Pull Request #1122 · eclipse-zenoh/zenoh-pico

steils · 2025-12-16T14:04:13Z

On Zephyr, _z_task_init() assigns preallocated pthread stacks from thread_stack_area[]. During repeated link resets/reconnects, zenoh-pico recreates the read/lease threads.
The previous implementation used an ever-increasing thread_index++, which eventually indexed past the end of thread_stack_area[], corrupting memory and causing crashes.
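The arithmetic behind the overflow can be sketched as follows. This is only a model of the index bookkeeping, not the zenoh-pico code: the function name first_overflowing_connect is hypothetical, and the array size of 4 and the two tasks per session (read + lease) are assumptions chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Models an allocator whose index only ever grows.
 * Returns the number of the connect (0 = initial connect, 1 = first
 * reconnect, ...) during which the index first leaves the array, or -1
 * if that does not happen within max_connects connects.
 */
static int first_overflowing_connect(size_t slots, size_t tasks_per_session,
                                     int max_connects) {
    size_t thread_index = 0; /* never reset, never reused */
    for (int connect = 0; connect < max_connects; connect++) {
        for (size_t t = 0; t < tasks_per_session; t++) {
            if (thread_index >= slots) {
                return connect; /* this task would get an out-of-bounds stack */
            }
            thread_index++;
        }
    }
    return -1;
}
```

Under these assumptions, first_overflowing_connect(4, 2, 10) returns 2: the initial connect and the first reconnect consume all four stacks, and the second reconnect indexes past the array.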

Replace thread_index++ with a stack-slot pool. When attr == NULL, pick a free slot in thread_stack_area[], set it with pthread_attr_setstack(), and start the thread.
Slot release is now done with a thread-specific data key destructor.

Add CONFIG_ZENOH_PICO_ZEPHYR_THREADS_NUM (default 4) to configure the stack pool size.
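A minimal host-side sketch of the slot-pool idea, assuming plain POSIX threads. It models only the slot bookkeeping and the release through a thread-specific data key destructor; it does not call pthread_attr_setstack() or touch real stack memory, and every name in it (POOL_SLOTS, slot_acquire, task_start, and so on) is hypothetical rather than taken from the patch.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>

#define POOL_SLOTS 4 /* stands in for CONFIG_ZENOH_PICO_ZEPHYR_THREADS_NUM */

static pthread_mutex_t pool_lock = PTHREAD_MUTEX_INITIALIZER;
static bool slot_used[POOL_SLOTS];
static pthread_key_t slot_key;
static pthread_once_t slot_key_once = PTHREAD_ONCE_INIT;

/* Returns a free slot index, or -1 when every slot is taken. */
static int slot_acquire(void) {
    int found = -1;
    pthread_mutex_lock(&pool_lock);
    for (int i = 0; i < POOL_SLOTS; i++) {
        if (!slot_used[i]) {
            slot_used[i] = true;
            found = i;
            break;
        }
    }
    pthread_mutex_unlock(&pool_lock);
    return found;
}

static void slot_release(int idx) {
    pthread_mutex_lock(&pool_lock);
    slot_used[idx] = false;
    pthread_mutex_unlock(&pool_lock);
}

/* Key destructor: runs in the exiting thread and frees its slot. */
static void slot_destructor(void *value) {
    slot_release((int)((intptr_t)value - 1));
}

static void slot_key_init(void) {
    (void)pthread_key_create(&slot_key, slot_destructor);
}

static void *task_entry(void *arg) {
    /* Store slot + 1: destructors are skipped for NULL values, and slot 0
     * would otherwise be stored as NULL. */
    intptr_t slot = (intptr_t)arg;
    pthread_setspecific(slot_key, (void *)(slot + 1));
    return NULL; /* a real task would run its loop here */
}

/* Starts a task on a free slot. Returns the slot index, or -1 on failure. */
static int task_start(pthread_t *thread) {
    pthread_once(&slot_key_once, slot_key_init);
    int slot = slot_acquire();
    if (slot < 0) {
        return -1; /* pool exhausted */
    }
    if (pthread_create(thread, NULL, task_entry, (void *)(intptr_t)slot) != 0) {
        slot_release(slot); /* creation failed, the destructor will not run */
        return -1;
    }
    return slot;
}

/* Simulates reconnects: each cycle starts two tasks and waits for both.
 * Returns the highest slot index ever used, or -1 if the pool ran dry. */
static int highest_slot_after_cycles(int cycles) {
    int highest = -1;
    for (int c = 0; c < cycles; c++) {
        pthread_t read_task, lease_task;
        int a = task_start(&read_task);
        if (a < 0) {
            return -1;
        }
        int b = task_start(&lease_task);
        if (b < 0) {
            pthread_join(read_task, NULL);
            return -1;
        }
        pthread_join(read_task, NULL);
        pthread_join(lease_task, NULL);
        if (a > highest) highest = a;
        if (b > highest) highest = b;
    }
    return highest;
}
```

Because slots are returned to the pool when a thread exits, the highest index used stays bounded by the number of tasks alive at once, however many cycles run; with two tasks per cycle it never exceeds 1 in this sketch.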

Fixes: #1064

Reproducing

Create a new Zephyr project following the Zenoh-Pico README, with the following main.c:

#include <stdio.h>
#include <string.h>
#include <zenoh-pico.h>
#include <unistd.h>

#define MODE "client"
#define LOCATOR "tcp/192.168.11.1:7447"

#define KEY_SUB "demo/example/h753/sub"

static void sub_handler(z_loaned_sample_t *s, void *arg) {
    (void)arg;
    z_view_string_t k;
    z_keyexpr_as_view_string(z_sample_keyexpr(s), &k);
    z_owned_string_t v;
    z_bytes_to_string(z_sample_payload(s), &v);
    printf("[sub] %.*s = %.*s\n",
           (int)z_string_len(z_loan(k)), z_string_data(z_loan(k)),
           (int)z_string_len(z_loan(v)), z_string_data(z_loan(v)));
    z_drop(z_move(v));
}

int main(void) {
    printf("zenoh-pico reconnection reproduction start\n");

    z_owned_config_t cfg;
    z_config_default(&cfg);
    zp_config_insert(z_loan_mut(cfg), Z_CONFIG_MODE_KEY, MODE);
    if (strlen(LOCATOR) > 0) {
        zp_config_insert(z_loan_mut(cfg), Z_CONFIG_CONNECT_KEY, LOCATOR);
    }

    z_owned_session_t sess;
    if (z_open(&sess, z_move(cfg), NULL) < 0) {
        printf("Unable to open session\n");
        return -1;
    }
    printf("Session opened\n");

    zp_start_read_task(z_loan_mut(sess), NULL);
    zp_start_lease_task(z_loan_mut(sess), NULL);

    z_view_keyexpr_t ke_sub;
    z_view_keyexpr_from_str_unchecked(&ke_sub, KEY_SUB);
    z_owned_closure_sample_t sub_cb;
    z_closure(&sub_cb, sub_handler, NULL, NULL);
    z_owned_subscriber_t sub;
    if (z_declare_subscriber(z_loan(sess), &sub, z_loan(ke_sub), z_move(sub_cb), NULL) < 0) {
        printf("Unable to declare subscriber\n");
        return -2;
    }
    printf("Subscriber declared on %s\n", KEY_SUB);

    for (int tick = 0;; ++tick) {
        printf("alive tick=%d\n", tick);
        sleep(1);
    }
    return 0;
}

Connect the board and run:

pio run
pio run -t upload

Verify messages arrive:
[sub] demo/example/h753/sub = ...
Reproduce reconnection:
- unplug Ethernet cable for ~3-5 seconds
- plug it back in
- repeat 4-5 times

Expected result before the fix (fail)

After several reconnects, the firmware crashes because of a corrupted stack, as in issue #1064.

Expected result (pass)

No crashes
The board keeps printing alive tick=...
After each reconnect, the subscriber resumes receiving [sub] ... messages.
With -DZENOH_LOG_DEBUG, you may also see zenoh-pico debug logs; there should be no "slot OOM" errors.



vinayramlingeg · 2025-12-17T08:55:20Z

@steils I have tried this code change in our project; it works for the network disconnect/reconnect scenario, but it does not work for the Zenoh router down-and-up scenario. In our case the router ID always stays the same: we set it with zenohd's -i option, using the motherboard ID of the instrument. Please try this scenario once.

steils · 2025-12-18T00:59:21Z

@vinayramlingeg I'm afraid I cannot reproduce the crash in the router restart scenario. I'm doing the following:

start zenohd (I tried using the same id via -i, or without it);
examples/z_pub -e "tcp/192.168.11.1:7447" -k "demo/example/h753/sub";
start the zephyr app with the fix;
then restart zenohd (^C in the terminal and start again after a few seconds)
repeat several times.

No effect. Each time when I stop zenohd, I stop receiving messages from the publisher, and when I start zenohd again, the subscriber resumes receiving messages.
If I revert the fix, the bug reproduces again: I always have to stop zenohd twice, and then I get the crash.
But with this fix reconnections work perfectly for me.

vinayramlingeg · 2025-12-18T07:29:08Z

@steils We have tried a few more times and found that if we bring the router back up within around 20 seconds, it works fine; after 20 seconds, it fails to get an alive message.
Does any specific configuration need to be done to avoid this 20-second limit?

steils · 2025-12-18T12:53:23Z

@vinayramlingeg When you say "~20 seconds", do you mean
a) the router was down for >20s, then after restart you never receive messages from the publisher?
b) after starting the router the messages keep coming for 20s, and then the connection breaks?
I tried both cases several times but still get no errors, and the subscriber on the board keeps receiving messages from the publisher.
Could you please elaborate on how exactly you bring the router down? Do you kill the process? Do you pass the same id with -i, and is there any difference with or without it?
What build flags do you use to build zenoh-pico for zephyr?

vinayramlingeg · 2025-12-19T04:59:05Z

@steils The scenario is this: we kill the router process and wait more than 20 seconds, after which the MCU doesn't receive any liveliness messages; if the wait is less than 20 seconds, it recovers.
We are killing the router and running it again, and yes, we are using the same -i option every time.
We are also using DHCP for the MCU network connection.

vinayramlingeg · 2025-12-19T06:55:18Z

@steils Here are the main configurations we are using for the zephyr project

# Kernel options
CONFIG_MAIN_STACK_SIZE=10240
CONFIG_HEAP_MEM_POOL_SIZE=150000
CONFIG_ENTROPY_GENERATOR=y
CONFIG_TEST_RANDOM_GENERATOR=y
CONFIG_INIT_STACKS=y

# Generic library options
CONFIG_NEWLIB_LIBC=y
CONFIG_NEWLIB_LIBC_NANO=n
CONFIG_POSIX_API=y

# Generic networking options
CONFIG_NETWORKING=y
CONFIG_NET_L2_ETHERNET=y

CONFIG_NET_IPV4=y
CONFIG_NET_TCP=y
CONFIG_NET_ARP=y
CONFIG_NET_UDP=y
CONFIG_NET_DHCPV4=y
CONFIG_NET_SHELL=y
CONFIG_NET_MGMT=y
CONFIG_NET_MGMT_EVENT=y
CONFIG_DNS_RESOLVER=y

# Sockets
CONFIG_NET_SOCKETS=y
CONFIG_NET_SOCKETS_POLL_MAX=4

# Network buffers
CONFIG_NET_PKT_RX_COUNT=16
CONFIG_NET_PKT_TX_COUNT=16
CONFIG_NET_BUF_RX_COUNT=80
CONFIG_NET_BUF_TX_COUNT=80
CONFIG_NET_CONTEXT_NET_PKT_POOL=y

# Network address config
CONFIG_NET_CONFIG_SETTINGS=y
CONFIG_NET_CONFIG_NEED_IPV4=y
CONFIG_NET_IPV4_IGMP=y
CONFIG_NET_CONFIG_NEED_IPV6=y
CONFIG_NET_IPV6_MLD=y
CONFIG_NET_IPV4_AUTO=y

CONFIG_NET_IF_UNICAST_IPV6_ADDR_COUNT=3
CONFIG_NET_IF_MCAST_IPV6_ADDR_COUNT=4
CONFIG_NET_MAX_CONTEXTS=10

# Logging
CONFIG_NET_LOG=y
CONFIG_LOG=y
CONFIG_NET_STATISTICS=y

CONFIG_JSON_LIBRARY=y

Shreyas-CS15 · 2026-01-13T04:40:39Z

@steils I verified that Z_FEATURE_AUTO_RECONNECT is enabled. Using breakpoints, I observed that when the router goes down, _z_reopen is executed. Inside this function, _z_scout_inner is called, but the system does not exit this function even after the router reconnects. I traced further with breakpoints through __z_scout_loop() → _z_link_recv_zbuf(), and execution stops there with no further progress.
These are my zenoh-pico logs when I disconnect the router. After reconnecting, there are no further logs and it does not reconnect (auto-reconnection fails).
[1970-01-01T00:00:50Z DEBUG ::_zp_unicast_lease_task] Sending keep alive
[1970-01-01T00:00:50Z DEBUG ::_z_transport_tx_send_t_msg] Send session message
[1970-01-01T00:00:50Z DEBUG ::_z_keep_alive_encode] Encoding _Z_MID_T_KEEP_ALIVE
[1970-01-01T00:00:50Z INFO ::_zp_unicast_lease_task] Closing session because it has expired after 10000ms
[1970-01-01T00:00:57Z DEBUG ::_z_zephyr_task_release_stack] zephyr task slot 1 released
[1970-01-01T00:00:57Z DEBUG ::_z_transport_tx_send_t_msg] Send session message
[1970-01-01T00:00:57Z DEBUG ::_z_close_encode] Encoding _Z_MID_T_CLOSE
[1970-01-01T00:00:57Z DEBUG ::_z_scout_encode] Encoding _Z_MID_SCOUT

steils · 2026-01-24T11:54:01Z

@Shreyas-CS15 did you try reproducing the bug with the fix or without?

Shreyas-CS15 · 2026-01-27T05:28:19Z

@steils Yes, I am able to reproduce the issue. As I said before, if I reconnect the router or Ethernet cable before the log below appears, auto-reconnection works; otherwise it fails to reconnect. After the logs below, I don't get any further zenoh-pico logs.
[1970-01-01T00:00:50Z INFO ::_zp_unicast_lease_task] Closing session because it has expired after 10000ms
[1970-01-01T00:00:57Z DEBUG ::_z_zephyr_task_release_stack] zephyr task slot 1 released
[1970-01-01T00:00:57Z DEBUG ::_z_transport_tx_send_t_msg] Send session message
[1970-01-01T00:00:57Z DEBUG ::_z_close_encode] Encoding _Z_MID_T_CLOSE
[1970-01-01T00:00:57Z DEBUG ::_z_scout_encode] Encoding _Z_MID_SCOUT

steils added the bug label Dec 16, 2025

steils requested review from gmartin82 and sashacmc December 16, 2025 14:06

steils wants to merge 1 commit into eclipse-zenoh:main from ZettaScaleLabs:zephyr-stack-fix
