
zephyr: fix pthread stack pool overflow on reconnect (#1064) #1122

Open
steils wants to merge 1 commit into eclipse-zenoh:main from ZettaScaleLabs:zephyr-stack-fix

Conversation

@steils
Member

@steils steils commented Dec 16, 2025

On Zephyr, _z_task_init() assigns preallocated pthread stacks from thread_stack_area[]. During repeated link resets/reconnects, zenoh-pico recreates the read/lease threads.
The previous implementation used an ever-increasing thread_index++, which eventually indexed past the end of thread_stack_area[], corrupting memory and causing crashes.
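
The arithmetic behind the overflow can be sketched as a small simulation. This is an illustration only, not the zenoh-pico code: `reconnects_survived`, `TASKS_PER_CONNECT`, and the `slots` parameter are hypothetical names, and the actual size of thread_stack_area[] before the fix is not stated here.

```c
#include <assert.h>

/* Tasks started per (re)connect: the read task and the lease task. */
#define TASKS_PER_CONNECT 2

/* Simulates the old scheme: every task start consumes thread_index++ and
 * nothing is ever handed back. Returns how many reconnects complete before
 * an index falls outside a stack array of `slots` entries, or -1 if even
 * the initial connect does not fit. `slots` is a hypothetical array size. */
static int reconnects_survived(int slots) {
    assert(slots > 0);
    int thread_index = 0;
    int completed = -1; /* the first pass is the initial connect */
    for (;;) {
        for (int task = 0; task < TASKS_PER_CONNECT; task++) {
            int slot = thread_index++;
            if (slot >= slots) {
                return completed; /* real code would index out of bounds here */
            }
        }
        completed++;
    }
}
```

With a hypothetical four-entry array, `reconnects_survived(4)` returns 1: the initial connect and one reconnect fit, and the second reconnect runs off the end of the array.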

Replace thread_index++ with a stack-slot pool. When attr == NULL, pick a free slot in thread_stack_area[], set it with pthread_attr_setstack(), and start the thread.
Slot release is now done with a thread-specific data key destructor.

Add CONFIG_ZENOH_PICO_ZEPHYR_THREADS_NUM (default 4) to configure the stack pool size.
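
A minimal host-side sketch of the slot-pool idea, assuming plain POSIX threads rather than Zephyr. It is not the code from this PR: `THREADS_NUM` stands in for CONFIG_ZENOH_PICO_ZEPHYR_THREADS_NUM, and `slot_acquire`, `slot_release`, `task_start`, `task_trampoline`, `slots_in_use`, and `noop_task` are hypothetical names. The sketch only tracks slot ownership; the point where the real code would call pthread_attr_setstack() with the slot's stack is marked in a comment.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

#define THREADS_NUM 4 /* stands in for CONFIG_ZENOH_PICO_ZEPHYR_THREADS_NUM */

static pthread_mutex_t pool_lock = PTHREAD_MUTEX_INITIALIZER;
static bool slot_used[THREADS_NUM];
static pthread_key_t slot_key;
static pthread_once_t slot_key_once = PTHREAD_ONCE_INIT;

/* Key destructor: runs in the exiting thread and frees its slot.
 * The value is slot + 1 because destructors are skipped for NULL values. */
static void slot_release(void *value) {
    int slot = (int)(intptr_t)value - 1;
    pthread_mutex_lock(&pool_lock);
    slot_used[slot] = false;
    pthread_mutex_unlock(&pool_lock);
}

static void slot_key_create(void) { pthread_key_create(&slot_key, slot_release); }

/* Returns the first free slot index and marks it used, or -1 if the pool is full. */
static int slot_acquire(void) {
    int found = -1;
    pthread_mutex_lock(&pool_lock);
    for (int i = 0; i < THREADS_NUM && found < 0; i++) {
        if (!slot_used[i]) {
            slot_used[i] = true;
            found = i;
        }
    }
    pthread_mutex_unlock(&pool_lock);
    return found;
}

static int slots_in_use(void) {
    int n = 0;
    pthread_mutex_lock(&pool_lock);
    for (int i = 0; i < THREADS_NUM; i++) n += slot_used[i] ? 1 : 0;
    pthread_mutex_unlock(&pool_lock);
    return n;
}

typedef struct {
    int slot;
    void *(*fun)(void *);
    void *arg;
} task_ctx_t;

/* Records the slot in thread-specific data so the destructor can free it on exit. */
static void *task_trampoline(void *p) {
    task_ctx_t ctx = *(task_ctx_t *)p;
    free(p);
    pthread_setspecific(slot_key, (void *)(intptr_t)(ctx.slot + 1));
    return ctx.fun(ctx.arg);
}

/* Starts a task on a pooled slot; returns 0 on success, -1 if no slot or on error. */
static int task_start(pthread_t *thread, void *(*fun)(void *), void *arg) {
    pthread_once(&slot_key_once, slot_key_create);
    int slot = slot_acquire();
    if (slot < 0) return -1; /* pool exhausted */
    task_ctx_t *ctx = malloc(sizeof(*ctx));
    if (ctx != NULL) {
        ctx->slot = slot;
        ctx->fun = fun;
        ctx->arg = arg;
        /* On Zephyr, pthread_attr_setstack() would bind the slot's stack here. */
        if (pthread_create(thread, NULL, task_trampoline, ctx) == 0) return 0;
        free(ctx);
    }
    slot_release((void *)(intptr_t)(slot + 1)); /* undo the reservation */
    return -1;
}

/* Demo task that exits immediately. */
static void *noop_task(void *arg) { return arg; }
```

Starting and joining `noop_task` more than `THREADS_NUM` times in a row should keep succeeding, because each exiting thread returns its slot through the key destructor. One design point worth checking in any real implementation: the destructor runs while the thread is still executing on its stack, so a slot should not be reused until the thread has actually finished.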

Fixes: #1064

Reproducing

  1. Create a new Zephyr project following the zenoh-pico README, with the following main.c:
#include <stdio.h>
#include <string.h>
#include <zenoh-pico.h>
#include <unistd.h>

#define MODE "client"
#define LOCATOR "tcp/192.168.11.1:7447"

#define KEY_SUB "demo/example/h753/sub"

static void sub_handler(z_loaned_sample_t *s, void *arg) {
    (void)arg;
    z_view_string_t k;
    z_keyexpr_as_view_string(z_sample_keyexpr(s), &k);
    z_owned_string_t v;
    z_bytes_to_string(z_sample_payload(s), &v);
    printf("[sub] %.*s = %.*s\n",
           (int)z_string_len(z_loan(k)), z_string_data(z_loan(k)),
           (int)z_string_len(z_loan(v)), z_string_data(z_loan(v)));
    z_drop(z_move(v));
}

int main(void) {
    printf("zenoh-pico reconnection reproduction start\n");

    z_owned_config_t cfg;
    z_config_default(&cfg);
    zp_config_insert(z_loan_mut(cfg), Z_CONFIG_MODE_KEY, MODE);
    if (strlen(LOCATOR) > 0) {
        zp_config_insert(z_loan_mut(cfg), Z_CONFIG_CONNECT_KEY, LOCATOR);
    }

    z_owned_session_t sess;
    if (z_open(&sess, z_move(cfg), NULL) < 0) {
        printf("Unable to open session\n");
        return -1;
    }
    printf("Session opened\n");

    zp_start_read_task(z_loan_mut(sess), NULL);
    zp_start_lease_task(z_loan_mut(sess), NULL);

    z_view_keyexpr_t ke_sub;
    z_view_keyexpr_from_str_unchecked(&ke_sub, KEY_SUB);
    z_owned_closure_sample_t sub_cb;
    z_closure(&sub_cb, sub_handler, NULL, NULL);
    z_owned_subscriber_t sub;
    if (z_declare_subscriber(z_loan(sess), &sub, z_loan(ke_sub), z_move(sub_cb), NULL) < 0) {
        printf("Unable to declare subscriber\n");
        return -2;
    }
    printf("Subscriber declared on %s\n", KEY_SUB);

    for (int tick = 0;; ++tick) {
        printf("alive tick=%d\n", tick);
        sleep(1);
    }
    return 0;
}
  2. Connect the board and run:
pio run
pio run -t upload
  3. Verify messages arrive:
    [sub] demo/example/h753/sub = ...

  4. Reproduce reconnection:

    • unplug Ethernet cable for ~3-5 seconds
    • plug it back in
    • repeat 4-5 times

Expected result before the fix (fail)

After several reconnects, the firmware crashes because of a corrupted stack, as in issue #1064.

Expected result (pass)

  • No crashes
  • The board keeps printing alive tick=...
  • After each reconnect, the subscriber resumes receiving [sub] ... messages.
  • With -DZENOH_LOG_DEBUG, you may also see zenoh-pico debug logs; there should be no "slot OOM" errors.

🏷️ Label-Based Checklist

Based on the labels applied to this PR, please complete these additional requirements:

Labels: bug

🐛 Bug Fix Requirements

Since this PR is labeled as a bug fix, please ensure:

  • Root cause documented - Explain what caused the bug in the PR description
  • Reproduction test added - Test that fails on main branch without the fix
  • Test passes with fix - The reproduction test passes with your changes
  • Regression prevention - Test will catch if this bug reoccurs in the future
  • Fix is minimal - Changes are focused only on fixing the bug
  • Related bugs checked - Verified no similar bugs exist in related code

Why this matters: Bugs without tests often reoccur.

Instructions:

  1. Check off items as you complete them (change - [ ] to - [x])
  2. The PR checklist CI will verify these are completed

This checklist updates automatically when labels change, but preserves your checked boxes.

@steils steils added the bug Something isn't working label Dec 16, 2025
@steils steils requested review from gmartin82 and sashacmc December 16, 2025 14:06
@vinayramlingeg

vinayramlingeg commented Dec 17, 2025

@steils I have tried this code change in our project. It works for the network disconnect and reconnect scenario, but it does not work when the Zenoh router goes down and comes back up. The router ID stays the same in our case; we set it using zenohd's -i option and the motherboard ID of the instrument. Please try this scenario.

@steils
Member Author

steils commented Dec 18, 2025

@vinayramlingeg I'm afraid I cannot reproduce the crash in the router restart scenario. I'm doing the following:

  • start zenohd (I tried using the same id via -i, or without it);
  • examples/z_pub -e "tcp/192.168.11.1:7447" -k "demo/example/h753/sub";
  • start the zephyr app with the fix;
  • then restart zenohd (^C in the terminal and start again after a few seconds)
  • repeat several times.

No effect. Each time I stop zenohd, I stop receiving messages from the publisher, and when I start zenohd again, the subscriber resumes receiving messages.
If I revert the fix, the bug reproduces again: it always takes two zenohd stops before I get the crash.
With this fix, reconnections work perfectly for me.

@vinayramlingeg

vinayramlingeg commented Dec 18, 2025

@steils We tried a few more times and found that if we bring the router back up within about 20 seconds, it works fine; after 20 seconds, it fails to get an alive message.
Does any specific configuration need to be done to avoid this 20-second delay?

@steils
Member Author

steils commented Dec 18, 2025

@vinayramlingeg When you say "~20 seconds", do you mean
a) the router was down for >20s, then after restart you never receive messages from the publisher?
b) after starting the router the messages keep coming for 20s, and then the connection breaks?
I tried both cases several times but still get no errors, and the subscriber on the board keeps receiving messages from the publisher.
Could you please elaborate on how exactly you bring the router down: do you kill the process? Do you pass the same ID with -i, and is there any difference with or without it?
What build flags do you use to build zenoh-pico for zephyr?

@vinayramlingeg

vinayramlingeg commented Dec 19, 2025

@steils The scenario is: we kill the router process and wait more than 20 seconds, and then the MCU doesn't receive any liveliness messages. If the wait is less than 20 seconds, it recovers.
We kill the router and run it again, and yes, we use the same -i option every time.
The MCU gets its network configuration via DHCP.

@vinayramlingeg

@steils Here are the main configurations we are using for the zephyr project

# Kernel options

CONFIG_MAIN_STACK_SIZE=10240
CONFIG_HEAP_MEM_POOL_SIZE=150000
CONFIG_ENTROPY_GENERATOR=y
CONFIG_TEST_RANDOM_GENERATOR=y
CONFIG_INIT_STACKS=y

# Generic library options

CONFIG_NEWLIB_LIBC=y
CONFIG_NEWLIB_LIBC_NANO=n
CONFIG_POSIX_API=y

# Generic networking options

CONFIG_NETWORKING=y
CONFIG_NET_L2_ETHERNET=y

CONFIG_NET_IPV4=y
CONFIG_NET_TCP=y
CONFIG_NET_ARP=y
CONFIG_NET_UDP=y
CONFIG_NET_DHCPV4=y
CONFIG_NET_SHELL=y
CONFIG_NET_MGMT=y
CONFIG_NET_MGMT_EVENT=y
CONFIG_DNS_RESOLVER=y

# Sockets

CONFIG_NET_SOCKETS=y
CONFIG_NET_SOCKETS_POLL_MAX=4

# Network buffers

CONFIG_NET_PKT_RX_COUNT=16
CONFIG_NET_PKT_TX_COUNT=16
CONFIG_NET_BUF_RX_COUNT=80
CONFIG_NET_BUF_TX_COUNT=80
CONFIG_NET_CONTEXT_NET_PKT_POOL=y

# Network address config

CONFIG_NET_CONFIG_SETTINGS=y
CONFIG_NET_CONFIG_NEED_IPV4=y
CONFIG_NET_IPV4_IGMP=y
CONFIG_NET_CONFIG_NEED_IPV6=y
CONFIG_NET_IPV6_MLD=y
CONFIG_NET_IPV4_AUTO=y

CONFIG_NET_IF_UNICAST_IPV6_ADDR_COUNT=3
CONFIG_NET_IF_MCAST_IPV6_ADDR_COUNT=4
CONFIG_NET_MAX_CONTEXTS=10

# Logging

CONFIG_NET_LOG=y
CONFIG_LOG=y
CONFIG_NET_STATISTICS=y

CONFIG_JSON_LIBRARY=y

@Shreyas-CS15

@steils I verified that Z_FEATURE_AUTO_RECONNECT is enabled. Using breakpoints, I observed that when the router goes down, _z_reopen is executed. Inside this function, _z_scout_inner is called, but the system does not exit this function even after the router reconnects. I traced further with breakpoints through __z_scout_loop() → _z_link_recv_zbuf(), and execution stops there with no further progress.
These are my zenoh-pico logs when I disconnect the router. After reconnecting, there are no further logs and it does not reconnect (auto-reconnection fails).
[1970-01-01T00:00:50Z DEBUG ::_zp_unicast_lease_task] Sending keep alive
[1970-01-01T00:00:50Z DEBUG ::_z_transport_tx_send_t_msg] Send session message
[1970-01-01T00:00:50Z DEBUG ::_z_keep_alive_encode] Encoding _Z_MID_T_KEEP_ALIVE
[1970-01-01T00:00:50Z INFO ::_zp_unicast_lease_task] Closing session because it has expired after 10000ms
[1970-01-01T00:00:57Z DEBUG ::_z_zephyr_task_release_stack] zephyr task slot 1 released
[1970-01-01T00:00:57Z DEBUG ::_z_transport_tx_send_t_msg] Send session message
[1970-01-01T00:00:57Z DEBUG ::_z_close_encode] Encoding _Z_MID_T_CLOSE
[1970-01-01T00:00:57Z DEBUG ::_z_scout_encode] Encoding _Z_MID_SCOUT

@steils
Member Author

steils commented Jan 24, 2026

@Shreyas-CS15 did you try reproducing the bug with the fix or without?

@Shreyas-CS15

@steils Yes, I am able to reproduce the issue. As I said before, if I reconnect the router or Ethernet cable before the log below appears, auto-reconnection works; otherwise it fails to reconnect. After the logs below, I get no further zenoh-pico logs.
[1970-01-01T00:00:50Z INFO ::_zp_unicast_lease_task] Closing session because it has expired after 10000ms
[1970-01-01T00:00:57Z DEBUG ::_z_zephyr_task_release_stack] zephyr task slot 1 released
[1970-01-01T00:00:57Z DEBUG ::_z_transport_tx_send_t_msg] Send session message
[1970-01-01T00:00:57Z DEBUG ::_z_close_encode] Encoding _Z_MID_T_CLOSE
[1970-01-01T00:00:57Z DEBUG ::_z_scout_encode] Encoding _Z_MID_SCOUT


Development

Successfully merging this pull request may close these issues.

[Bug] STM32 nucleo board is getting USAGE FAULT Error when ethernet connection is restored
