@earlephilhower
Owner

Thanks to Yohine. He identified via email a leak of DHCP state that would cause LWIP to panic() after 256 disconnects.

Properly clean up DHCP state on link ::end (shutdown).

@earlephilhower earlephilhower merged commit b832dee into master Oct 21, 2025
32 checks passed
@earlephilhower earlephilhower deleted the pop1 branch October 21, 2025 16:27
@yohine

yohine commented Oct 22, 2025

Thank you for the quick correction and feedback.

@earlephilhower
Owner Author

@yohine with this plus #3213 I got 3301 WiFi.begin()/WiFi.end() cycles overnight with no leaks. Unfortunately at the 3302nd loop the CYW43 chip started timing out and not responding to messages from the CYW43 driver running on the Pico. Debugging w/GDB I can see the driver try to send packets to the CYW43, and it doesn't respond w/in the timeout.

So AFAICT the binary blob running on the 2nd ARM chip has hung/died/something at this point. Nothing we can do about that here since it's completely opaque.

/*
    This sketch establishes a TCP connection to a "quote of the day" service.
    It sends a "hello" message, and then prints received data.
*/

#include <WiFi.h>

#ifndef STASSID
#define STASSID "your-ssid"
#define STAPSK "your-password"
#endif

const char* ssid = STASSID;
const char* password = STAPSK;

const char* host = "djxmmx.net";
const uint16_t port = 17;

//WiFiMulti multi;

void setup() {
  Serial.begin(115200);

  // We start by connecting to a WiFi network

  Serial.println();
  Serial.println();
  Serial.print("Connecting to ");
  Serial.println(ssid);

//  multi.addAP(ssid, password);

//  if (multi.run() != WL_CONNECTED) {
    //Serial.println("Unable to connect to network, rebooting in 10 seconds...");
    //delay(10000);
    //rp2040.reboot();
  //}
  delay(5000);
  Serial.println("");
  Serial.println("WiFi connected");
  Serial.println("IP address: ");
  //Serial.println(WiFi.localIP());
}

static int l = 0;  // loop iteration counter
static int f = 0;  // failed-connect counter
void loop() {
  static bool wait = false;
  Serial.printf("++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ %d %d\n", l++, rp2040.getFreeHeap());
  //stats_display();
  WiFi.begin(ssid, password);
  if (!WiFi.connected()) {
    Serial.printf("--------------------------------------------------------------------------------- fail %d\n", f++);
    delay(1000);
    return;
  }

  Serial.print("connecting to ");
  Serial.print(host);
  Serial.print(':');
  Serial.println(port);

  // Use WiFiClient class to create TCP connections
  WiFiClient client;
  if (!client.connect(host, port)) {
    Serial.println("connection failed");
    delay(5000);
    return;
  }

  // This will send a string to the server
  Serial.println("sending data to server");
  if (client.connected()) {
    client.println("hello from RP2040");
  }

  // wait for data to be available
  unsigned long timeout = millis();
  while (client.available() == 0) {
    if (millis() - timeout > 5000) {
      Serial.println(">>> Client Timeout !");
      client.stop();
      WiFi.end();
      //delay(60000);
      return;
    }
  }

  // Read all the lines of the reply from server and print them to Serial
  Serial.println("receiving from remote server");
  // not testing 'client.connected()' since we do not need to send data here
  while (client.available()) {
    char ch = static_cast<char>(client.read());
    Serial.print(ch);
  }

  // Close the connection
  Serial.println();
  Serial.println("closing connection");
  client.stop();

  if (wait) {
//    delay(300000);  // execute once every 5 minutes, don't flood remote service
  }
  //Serial.printf("++++at end++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ %d %d\n", l, rp2040.getFreeHeap());
  //stats_display();
//delay(16000);
//  Serial.printf("++after 16s ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ %d %d\n", l, rp2040.getFreeHeap());
//  stats_display();


  //wait = true;
  WiFi.end();
//  Serial.printf("++++after wifiend++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ %d %d\n", l, rp2040.getFreeHeap());
//  stats_display();

//while(1);
}

@yohine

yohine commented Oct 23, 2025

  • An important additional note: My current tests use only UDP, and I'm not connecting and disconnecting each time (I have only tested that a few times). Therefore, please treat this information as coming from different conditions. I'll be running a similar reconnection test from now on.

I've been testing for a few days now. While I haven't reached a final conclusion yet, after making the same changes as in your #3213, the problem appears to have been resolved in my environment.

Deleting netif_remove() from netif_add() no longer causes the panic.
CYW43.begin() has also not hung.
There have been no crashes in two days.

However, upon careful inspection, there appear to be slight differences in our fixes. These may or may not be related, but I'll introduce them for testing.

@LwipIntfDev::begin

    if (_isDHCP) {
        ip4_addr_set_u32(ip_2_ip4(&_netif.ip_addr), 0);

        netif_set_up(&_netif);
        if (netif_is_link_up(&_netif)) {
            switch (dhcp_start(&_netif)) {
            case ERR_OK:
                break;
            case ERR_IF:
                netif_remove(&_netif);
                return false;
            default:
                netif_remove(&_netif);
                return false;
            }
        }
    } else {
        netif_set_link_up(&_netif);
        netif_set_up(&_netif);
    }

void LwipIntfDev<RawDev>::end() {
    if (_started) {
        if (_intrPin < 0) {
            __removeEthernetPacketHandler(_phID);
        } else {
            detachInterrupt(_intrPin);
            __removeEthernetGPIO(_intrPin);
        }

        if (_removeNetifCB) {
            _removeNetifCB(&_netif);
        }

        RawDev::end();
	
        if (_isDHCP) {
            dhcp_stop(&_netif);
            dhcp_cleanup(&_netif);
        }

        netif_remove(&_netif);

        _started = false;
    }
}

@cyw43_spi_transfer() in cyw43_bus_pio_spi.c

        uint32_t fdebug_tx_stall = 1u << (PIO_FDEBUG_TXSTALL_LSB + bus_data->pio_sm);
        bus_data->pio->fdebug = fdebug_tx_stall;

        //! timeout !
        uint32_t start_time = time_us_32();
        uint32_t timeout_us = 1000;
        while (!(bus_data->pio->fdebug & fdebug_tx_stall)) {
            if (time_us_32() - start_time > timeout_us) {
                pio_error = 1;
                break; 
            }
            tight_loop_contents();
        }

        __compiler_memory_barrier();
        pio_sm_set_enabled(bus_data->pio, bus_data->pio_sm, false);
        pio_sm_set_consecutive_pindirs(bus_data->pio, bus_data->pio_sm, CYW43_PIN_WL_DATA_IN, 1, false);
    } else if (rx != NULL) { /* currently do one at a time */
        DUMP_SPI_TRANSACTIONS(
                printf("[%lu] bus TX %u bytes:", counter++, rx_length);
                dump_bytes(rx, rx_length);
        )
        panic_unsupported();
    }
    pio_sm_exec(bus_data->pio, bus_data->pio_sm, pio_encode_mov(pio_pins, pio_null)); // for next time we turn output on

    stop_spi_comms();
    DUMP_SPI_TRANSACTIONS(
            printf("RXed:");
            dump_bytes(rx, rx_length);
            printf("\n");
    )

    //! pio error !
    if (pio_error > 0){
        return CYW43_EIO;
    }

    return 0;
}
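The timeout check in the snippet above relies on unsigned 32-bit subtraction, which still yields the true elapsed time when the microsecond counter wraps between the two samples. A minimal standalone sketch of that idiom follows. `timed_out` is a hypothetical helper name (not a Pico SDK function), and the timestamps are made-up values.

```cpp
#include <cstdint>

// Hypothetical helper (not an SDK function): returns true once more than
// timeout_us microseconds have elapsed between start_us and now_us.
// Unsigned subtraction is modulo 2^32, so the difference is still the
// elapsed time when the counter wrapped between the two samples.
bool timed_out(uint32_t now_us, uint32_t start_us, uint32_t timeout_us) {
    return static_cast<uint32_t>(now_us - start_us) > timeout_us;
}
```

For example, with a start sample of 0xFFFFFF00 and a current sample of 0x00000100, the elapsed time comes out as 512 µs, so a 1000 µs timeout has not yet expired even though the raw counter value went "backwards".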

@earlephilhower
Owner Author

The 1st difference is trivial. I just cleaned up that ugly, ugly switch. I think originally it had more to it, but you can clearly see it's doing a {netif_remove/return false} on dhcp_start != ERR_OK. So, I cleaned that up as I was going because I don't want to end up on TheDailyWTF.

The 2nd diff, in your case, I think has a race condition. Because networking is IRQ driven, it would be possible for you to get an IRQ right after RawDev::end (say, a DHCP retry). At that point the global netif list still has the old, ended interface and will try to call CYW43::sendPacket, which may not go so well. (TBH, looking at mine I see the same thing, only with a slightly smaller window.)
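To make that window concrete, here is a toy model of the two teardown orders. Everything in it (`ToyDevice`, `ToyNetifList`, `irq_retry`) is a hypothetical stand-in rather than lwIP or core code, and the simulated IRQ is simply a function call placed at the worst possible moment.

```cpp
#include <algorithm>
#include <vector>

// Toy stand-ins for the real objects; none of these are lwIP or core types.
struct ToyDevice {
    bool running = true;
    int sends_after_end = 0;  // counts sends attempted on a stopped device
    void end() { running = false; }
    void send() { if (!running) { ++sends_after_end; } }
};

struct ToyNetifList {
    std::vector<ToyDevice*> active;
    void add(ToyDevice* d) { active.push_back(d); }
    void remove(ToyDevice* d) {
        active.erase(std::remove(active.begin(), active.end(), d), active.end());
    }
    // Simulated timer IRQ (e.g. a DHCP retry) that transmits on every
    // interface still present in the list.
    void irq_retry() { for (ToyDevice* d : active) { d->send(); } }
};

// Device stopped first: an IRQ in the gap still finds it in the list.
int teardown_end_then_remove() {
    ToyDevice dev;
    ToyNetifList list;
    list.add(&dev);
    dev.end();
    list.irq_retry();  // IRQ lands inside the window
    list.remove(&dev);
    return dev.sends_after_end;
}

// Interface unlinked first: the same IRQ no longer reaches the device.
int teardown_remove_then_end() {
    ToyDevice dev;
    ToyNetifList list;
    list.add(&dev);
    list.remove(&dev);
    list.irq_retry();
    dev.end();
    return dev.sends_after_end;
}
```

In this model the first order records one send on a stopped device and the second records none. Whether unlinking first is actually safe in the real driver (receive paths, callbacks, and so on) is not established here.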

For the 3rd diff, if you found a bug in the CYW43 driver then please do post something on Pico-SDK to get the fix for everyone. I don't modify the upstream SDK for this core, at all, for sanity's sake. (Also, unless you reran make-libpico.sh your change would not be used by the core...I build the SDK and ship it as a blob we link to).

I think the real diff may just be in testing methods. I was banging WiFi.begin() and WiFi.end() as fast as the chip would do it w/o actually turning on/off the AP...

@yohine

yohine commented Oct 26, 2025

The reason for the different results could be the test content, or it could be related to the PIO or hardware. At this point, I don't think there's a clear answer.

In my long-term testing, the complete stoppage has not recurred even once. Instead, I've found another serious problem, which I'm currently investigating. This is a phenomenon where the reconnection status gets stuck at 6 after exceeding a certain number of connection attempts or time. Status 3 indicates a successful connection, but it remains at 6. Disconnecting the AP returns it from 6 to 4, but the same problem persists afterward.

I haven't yet determined the specific time or number of attempts, and I'm still collecting data. It's unclear whether this is related to the current problem, but if my PIO timeout is triggered, that might be the cause. The behavior of the CYW43 after disconnection is undefined. However, at this point, it's only a possibility.

Regarding point 3: If it's ultimately confirmed to be a PIO problem, I will report it on the Pi Forum. However, the current information is probably not enough to convince them. Based on my experience, they are unlikely to trust my report.

In my environment, all necessary sources have been removed from the static library, and everything is compiled locally. I've confirmed that creating an infinite loop in the local function of cyw43_bus_pio_spi results in a correct stop. For example, I used a command like this:
arm-none-eabi-ar.exe d liblwip.a cyw43_bus_pio_spi.c.o

Unfortunately, I have other work to do, so I won't be able to test for about a week. Therefore, the resumption of the above retesting will be after that. However, I plan to continue investigating this problem until I can solve it or until I give up.

@yohine

yohine commented Nov 26, 2025

I've identified the general cause of the problem, assuming it's not specific to my environment. It appears to be an error in the SDK's 64-bit timer value transfer. The following code helps visualize when the problem occurs.

uint64_t minutes = time_us_64() / 60000000ULL;
printf("T:%llu", minutes);

The result is a jump from "T:93" to "T:22". After the jump, the WiFi status does not transition from 6 to 3. The size of the jump (71 minutes) matches the rollover period of a 32-bit microsecond counter, 2^32 µs or roughly 71.6 minutes. After more than an hour, the connection is restored.
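As a rough check of that arithmetic, the sketch below shows what the minute counter would read if only the low 32 bits of the 64-bit microsecond count survived. The function names and the sample timestamp (5635000000 µs, a point inside minute 93) are illustrative assumptions. This models the observed symptom only and says nothing about its cause.

```cpp
#include <cstdint>

// Microseconds per minute, matching the divisor used in the snippet above.
constexpr uint64_t kUsPerMinute = 60000000ULL;

// Whole minutes in a 64-bit microsecond count.
constexpr uint64_t minutes_of(uint64_t t_us) {
    return t_us / kUsPerMinute;
}

// Minutes seen if only the low 32 bits of the count survive.
// Illustrative model of the symptom; this is not SDK code.
constexpr uint64_t minutes_after_losing_high_word(uint64_t t_us) {
    return (t_us & 0xFFFFFFFFULL) / kUsPerMinute;
}

// 2^32 us is 4294.967296 s, i.e. 71 whole minutes (about 71.6).
static_assert(minutes_of(1ULL << 32) == 71, "32-bit rollover period");
```

With the sample timestamp, `minutes_of` gives 93 and `minutes_after_losing_high_word` gives 22, the same pair of values as the reported jump.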

The solution to this issue is as follows.

```cpp
void LwipIntfDev<RawDev>::end() {
    if (_started) {
        if (_isDHCP) {
            dhcp_stop(&_netif);
            dhcp_cleanup(&_netif);
        }

        if (_intrPin < 0) {
            __removeEthernetPacketHandler(_phID);
        } else {
            detachInterrupt(_intrPin);
            __removeEthernetGPIO(_intrPin);
        }

        if (_removeNetifCB) {
            _removeNetifCB(&_netif);
        }

        RawDev::end();
        netif_remove(&_netif);

        ////Additional point here
        sys_timeouts_init();
        sys_restart_timeouts();

        _started = false;
    }
}
```

The cause of the SDK timer value skipping is unknown, but the Pico W is now able to reconnect after more than 95 minutes have passed.

@earlephilhower
Owner Author

I think you should open something up on the SDK about this, if you can reproduce the timer issue.

I don't think the added sys_timeouts_init call is safe (it would at least need to be mutex protected, but I also think it kills MDNS and anything else that survives a single interface dropping).

@earlephilhower
Owner Author

Running on a Pico:

void setup() {
}

void loop() {
  uint64_t t = time_us_64();
  uint64_t m = t / 60000000LL;
  Serial.printf("%llu %llu\n", t, m);
  delay(500);
}

I've run past the 32-bit overflow (4294967296) without issue, which is where you'd expect something odd:

...
12:31:17.315 -> 4290940323 71
12:31:17.796 -> 4291440373 71
12:31:18.309 -> 4291940428 71
12:31:18.791 -> 4292440480 71
12:31:19.303 -> 4292940532 71
12:31:19.817 -> 4293440585 71
12:31:20.299 -> 4293940637 71
12:31:20.814 -> 4294440689 71
12:31:21.295 -> 4294940739 71
12:31:21.810 -> 4295440792 71
12:31:22.295 -> 4295940862 71
12:31:22.812 -> 4296440922 71
12:31:23.296 -> 4296940979 71
....

And at 93->94 minutes I also see no problem

....
12:53:39.959 -> 5633595389 93
12:53:40.474 -> 5634095446 93
12:53:40.956 -> 5634595507 93
12:53:41.470 -> 5635095564 93
12:53:41.951 -> 5635595624 93
12:53:42.467 -> 5636095680 93
12:53:42.982 -> 5636595740 93
12:53:43.464 -> 5637095798 93
12:53:43.980 -> 5637595854 93
12:53:44.462 -> 5638095912 93
12:53:44.977 -> 5638595968 93
12:53:45.460 -> 5639096022 93
12:53:45.973 -> 5639596081 93
12:53:46.455 -> 5640096140 94
12:53:46.969 -> 5640596198 94
12:53:47.451 -> 5641096253 94
12:53:47.965 -> 5641596313 94
....

earlephilhower added a commit that referenced this pull request Nov 27, 2025
The divider is a shared HW resource, and when an IRQ comes in (i.e.
when a packet is processed by LWIP and the user's callbacks) its state can
be corrupted silently and randomly.

Change the Pico-SDK defaults to disable IRQs during division operations,
avoiding the issue by disallowing the LWIP callback to happen until after
division is completed.

Fixes #3212
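The failure mode described in that commit message can be illustrated with a toy model. `ToyDivider` and `ToyIrq` below are hypothetical stand-ins, not the RP2040 register interface or any SDK API. They only capture the idea that operands are loaded in one step and the quotient is read in a later one, so an interrupt handler that divides in between can clobber the result unless it is held off.

```cpp
#include <cstdint>
#include <functional>

// Toy shared divider: operands persist between load and result read.
struct ToyDivider {
    uint32_t dividend = 0;
    uint32_t divisor = 1;
    void load(uint32_t n, uint32_t d) { dividend = n; divisor = d; }
    uint32_t quotient() const { return dividend / divisor; }
};

// Toy interrupt line: an IRQ raised while masked stays pending until unmask.
struct ToyIrq {
    std::function<void()> handler;
    bool masked = false;
    bool pending = false;
    void mask() { masked = true; }
    void raise() {
        if (masked) { pending = true; } else if (handler) { handler(); }
    }
    void unmask() {
        masked = false;
        if (pending) { pending = false; if (handler) { handler(); } }
    }
};

// Unprotected: the handler's own division overwrites the operands mid-way.
uint32_t demo_unprotected() {
    ToyDivider div;
    ToyIrq irq;
    irq.handler = [&div] { div.load(7, 7); };
    div.load(100, 5);
    irq.raise();            // simulated packet IRQ at the worst moment
    return div.quotient();  // reads 7 / 7 instead of 100 / 5
}

// Protected: the IRQ is deferred until after the quotient has been read.
uint32_t demo_protected() {
    ToyDivider div;
    ToyIrq irq;
    irq.handler = [&div] { div.load(7, 7); };
    irq.mask();
    div.load(100, 5);
    irq.raise();            // held pending while masked
    uint32_t q = div.quotient();
    irq.unmask();           // handler runs now, after the read
    return q;
}
```

In this model `demo_unprotected()` returns 1 while `demo_protected()` returns the expected 20. The real change is an SDK build default rather than hand-written masking, so treat this purely as a picture of the hazard.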
@earlephilhower
Owner Author

@yohine I had an inspiration looking at your code vs my own.

I think there's a possibility of division errors in the main app (i.e. WiFi.begin()) when an IRQ happens at a bad spot...and the CYW43 LWIP is all running at IRQ level. My own basic test had no IRQs going on whereas yours probably has lots of IRQs happening due to WiFi.

If you can try the change in #3250 (you'll need to pull the whole PR because it includes a rebuilt set of libraries), rerun your failing case without your ////Additional point here code (i.e. no sys_timeouts_init() and no sys_restart_timeouts()), and report back, it would be much appreciated.

@yohine

yohine commented Nov 27, 2025

Thank you for your quick response and investigation.

I'm retesting under different conditions based on the results of your SDK verification. I expected that excluding the WiFi.begin() related code would allow the count to function properly, but contrary to my expectations, the counter is jumping.

It's becoming increasingly likely that this is an entirely different issue dependent on my environment. Therefore, it's no longer an issue that should be continued in this tree, as it's no longer a direct WiFi issue. I'll move on to #3250 for the rest of this post.

I'm sorry. First, I'll re-investigate the cause of the environment-dependent issue. If #3250 is likely to be involved, I'll test it. Please wait a moment.

@yohine

yohine commented Nov 29, 2025

I had not yet added an important report related to the SDK.

The suspicion of a PIO hang-up has been resolved. I added a process to light up an LED when the PIO timeout is executed, but it has never lit up even once in several weeks of testing.

So it seems that the SDK suspicions have now been resolved.
