-
Notifications
You must be signed in to change notification settings - Fork 517
Fix LWIP failure after 256 Ethernet disconnects #3212
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Thanks to Yohine. He identified via email a leak of DHCP state that would cause LWIP to panic() after 256 disconnects. Properly clean up DHCP state on link ::end (shutdown).
|
Thank you for the quick correction and feedback. |
|
@yohine with this plus #3213 I got 3301 WiFi.begin()/Wifi.end() cycles overnight with no leaks. Unfortunately at the 3302nd loop the CYW43 chip started timing out and not responding to messages from the CYW43 driver running on the Pico. Debugging w/GDB I can see the driver try and send packets to the CYW43 and it doesn't respond w/in the timeout. So AFAICT the binary blob running on the 2nd ARM chip has hung/died/something at this point. Nothing we can do about that here since it's completely opaque. |
I've been testing for a few days now. While I haven't reached a final conclusion yet, after making the same changes as in your #3213, the problem appears to have been resolved in my environment. Deleting netif_remove() from netif_add() no longer causes the panic. However, upon careful inspection, there appear to be slight differences in our fixes. These may or may not be related, but I'll introduce them for testing. @LwipIntfDev::begin @cyw43_spi_transfer() in cyw43_bus_pio_spi.c |
|
1 difference is trivial. I just cleaned up that ugly, ugly switch. I think originally it had more to it, but you can clearly see it's doing a {netif_remove/return false} on dhcp_start != ERR_OK. So, I cleaned that up as I was going because I don't want to end up on TheDailyWTF. 2 diff in your case I think has a race condition. Because networking is IRQ driven, it would be possible for you to get an IRQ right after 3 diff, if you found a bug in the CYW43 driver then please do post something on Pico-SDK to get the fix for everyone. I don't modify the upstream SDK for this core, at all, for sanity's sake. (Also, unless you reran I think real the diff may just be in testing methods. I was banging |
|
The reason for the different results could be the test content, or it could be related to the PIO or hardware. At this point, I don't think there's a clear answer. In my long-term testing, the complete stoppage has not recurred even once. Instead, I've found another serious problem, which I'm currently investigating. This is a phenomenon where the reconnection status gets stuck at 6 after exceeding a certain number of connection attempts or time. Status 3 indicates a successful connection, but it remains at 6. Disconnecting the AP returns it from 6 to 4, but the same problem persists afterward. I haven't yet determined the specific time or number of attempts, and I'm still collecting data. It's unclear whether this is related to the current problem, but if my PIO timeout is triggered, that might be the cause. The behavior of the CYW43 after disconnection is undefined. However, at this point, it's only a possibility. Regarding point 3: If it's ultimately confirmed to be a PIO problem, I will report it on the Pi Forum. However, the current information is probably not enough to convince them. Based on my experience, they are unlikely to trust my report. In my environment, all necessary sources have been removed from the static library, and everything is compiled locally. I've confirmed that creating an infinite loop in the local function of cyw43_bus_pio_spi results in a correct stop. For example, I used a command like this: Unfortunately, I have other work to do, so I won't be able to test for about a week. Therefore, the resumption of the above retesting will be after that. However, I plan to continue investigating this problem until I can solve it or until I give up. |
|
I've identified the general cause of the problem. Assuming it's not a problem specific to my environment. It appears to be an error in the SDK's 64-bit timer value transfer. The following code will help visualize when the problem occurs.
The result is a jump from "T:93" to "T:22". After the jump, the WiFi status does not transition from 6 to 3. The jump matches the 32-bit value of 72 minutes. And after more than an hour, the connection is restored. The solution to this issue is as follows. ` } The cause of the SDK timer value skipping is unknown. But now, PicoW able to reconnect after more than 95 minutes have passed. |
|
I think you should open something up on the SDK about this, if you can reproduce the timer issue. I don't think the added sys_timeouts_init call is safe (it would at least need to be mutex protected, but I also think it kills MDNS and anything else that survives a single interface dropping). |
|
Running on a Pico: I've run to the 32b overflow without issue (4294967296) which is where you'd expect something odd And at 93->94 minutes I also see no problem |
The divider is a shared HW resource, and when an IRQ comes in (i.e. when a packet is processed by LWIP and the user's callbacks) its state can be corrupted silently and randomly. Change the Pico-SDK defaults to disable IRQs during division operations, avoiding the issue by disallowing the LWIP callback to happen until after division is completed. Fixes #3212
|
@yohine I had an inspiration looking at your code vs my own. I think there's a possibility of division errors in the main app (i.e. WiFi.begin()) when an IRQ happens at a bad spot...and the CYW43 LWIP is all running at IRQ level. My own basic test had no IRQs going on whereas yours probably has lots of IRQs happening due to WiFi. If you can try the change in #3250 (you'll need to pull the whole PR because it includes a rebuilt set of libraries) and rerun your failing case without your |
|
Thank you for your quick response and investigation. I'm retesting under different conditions based on the results of your SDK verification. I expected that excluding the WiFi.begin() related code would allow the count to function properly, but contrary to my expectations, the counter is jumping. It's becoming increasingly likely that this is an entirely different issue dependent on my environment. Therefore, it's no longer an issue that should be continued in this tree, as it's no longer a direct WiFi issue. I'll move on to #3250 for the rest of this post. I'm sorry. First, I'll re-investigate the cause of the environment-dependent issue. If #3250 is likely to be involved, I'll test it. Please wait a moment. |
|
I had not added an important report related to the SDK. The suspicion of a PIO hang-up has been resolved. I added a process to light up an LED when the PIO timeout is executed, but it has never lit up even once in several weeks of testing. So it seems that the SDK suspicions have now been resolved. |
Thanks to Yohine. He identified via email a leak of DHCP state that would cause LWIP to panic() after 256 disconnects.
Properly clean up DHCP state on link ::end (shutdown).