Fix runtime on rp2350/2040 for multicore scheduler #5286
vejmarie wants to merge 5 commits into tinygo-org:dev from
Conversation
- Add interrupt handler core detection to properly call the garbage collector
- Switch to a software spinlock to avoid hardware bugs on rp2350
- Reset the second core before running it (in some cases it stays stuck)

Signed-off-by: Jean-Marie Verdun <verdun@hpe.com>
Signed-off-by: Jean-Marie Verdun <verdun@hpe.com>
@vejmarie looks like you need to run
Yes, I have seen it. I am working on it. I am currently running the full test suite on my Mac. It has been a little bumpy, as it needs Node and my machine is not fully under my control; it is my corporate laptop.
Signed-off-by: Jean-Marie Verdun <verdun@hpe.com>
Signed-off-by: Jean-Marie Verdun <verdun@hpe.com>
Dealing with RP2xxx so as to enable EXTEXCLALL on rp2350. Could be used for other chip-specific initialization. Signed-off-by: Jean-Marie Verdun <verdun@hpe.com>
Thanks for the contribution! Could you confirm that HPE's approval covers contributing this under the project's existing license terms? A link to HPE's open source contribution policy (if one exists) would be ideal.
Pull request overview
Fixes multicore-scheduler runtime lockups on RP2350/RP2040, particularly GC-related deadlocks reported in #5151, by adjusting RP-specific initialization, core bring-up behavior, FIFO interrupt handling, and spinlock implementation.
Changes:
- Add RP-family init hook (initRP) and invoke it during runtime init (RP2350 sets ACTLR_EXTEXCLALL).
- Modify secondary-core startup to force-reset core1 before running the FIFO start sequence.
- Update SIO FIFO IRQ handlers to route GC stop-the-world handling based on the current core, and replace hardware spinlocks with a software (atomic) spinlock.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| src/runtime/runtime_rp2350.go | Adds RP2350-specific initRP() to configure core behavior via PPB ACTLR. |
| src/runtime/runtime_rp2040.go | Adds RP2040 initRP() stub to match shared init flow. |
| src/runtime/runtime_rp2.go | Updates multicore startup/reset logic, FIFO IRQ GC routing, and replaces RP hardware spinlocks with an atomic spinlock. |
```diff
 type spinLock struct {
-	id uint8
+	atomic.Uint32
 }

-// Return the spinlock register: rp.SIO.SPINLOCKx
-func (l *spinLock) spinlock() *volatile.Register32 {
-	return (*volatile.Register32)(unsafe.Add(unsafe.Pointer(&rp.SIO.SPINLOCK0), l.id*4))
-}

 var (
 	schedulerLock spinLock
 	futexLock     spinLock
 	atomicsLock   spinLock
 	printLock     spinLock
 )

+func resetSpinLocks() {
+	schedulerLock.Store(0)
+	futexLock.Store(0)
+	atomicsLock.Store(0)
+	printLock.Store(0)
+}

 func (l *spinLock) Lock() {
-	// Wait for the lock to be available.
-	spinlock := l.spinlock()
-	for spinlock.Get() == 0 {
-		arm.Asm("wfe")
-	}
+	for !l.Uint32.CompareAndSwap(0, 1) {
+		spinLoopWait()
+	}
 }
```
The new spinLock implementation relies on atomic.Uint32.CompareAndSwap. On RP2040 (thumbv6m), atomic CAS operations are typically lowered to the __sync/__atomic libcalls implemented in src/runtime/atomics_critical.go, which enter lockAtomics(). With the multicore scheduler, lockAtomics() itself takes atomicsLock (scheduler_cores.go:281-289), creating a recursion/deadlock (spinLock -> CAS libcall -> lockAtomics -> atomicsLock.Lock -> spinLock...). Consider keeping the previous hardware-spinlock-based implementation for RP2040, or implement this lock using a mechanism that does not depend on sync/atomic CAS on targets without native atomics.
This looks like BS at a glance. Thoughts @vejmarie ?
I think it might be right. The M0 in the rp2040 doesn't seem to support CompareAndSwap in an atomic way. AI will keep surprising me. My challenge is that I don't have an rp2040 handy to test. I can say it works on the rp2350 without any issue (I tested it dozens of times). The hardware spinlock works on the rp2040, so I could create some wrappers and push them into the related architecture-specific files, which would be compiled based on target selection. The deadlock mentioned could appear, since we would face a recursive call into the lock subsystem in such cases. But if a core grabs a lock, it will go through, I think. Let me know what you would prefer.
Just checked the code, and I bet it should work on rp2040, as CompareAndSwap exists for that platform. Even if it calls lockAtomics recursively, it will either wait (which means the other core is holding the lock and will get out), or it will acquire it. If the second core crashes with the atomics lock acquired, it means there is a bug somewhere.
```go
// (*spinLock).Unlock body:
if schedulerAsserts && l.Uint32.Load() != 1 {
	runtimePanic("unlock of unlocked spinlock")
}
l.Uint32.Store(0)
```
spinLock.Lock() uses WFE in the contention loop (spinLoopWait), but Unlock() no longer executes SEV after releasing the lock. If the waiter is sleeping in WFE with interrupts masked (common in runtime critical sections), it may not wake promptly (or at all) when the lock becomes available. Consider restoring an explicit SEV in Unlock (after the Store(0)) or switching the wait strategy away from WFE if you don't want to send events on unlock.
Suggested change:

```diff
 l.Uint32.Store(0)
+arm.Asm("sev")
```
Not necessary: the scheduler takes care of waking up the core when needed. Tested with multiple goroutines running at the same time and calls to the GC.
```go
	sleepTicks(100)
}
rp.PSM.SetFRCE_OFF_PROC1(0)
multicore_fifo_pop_blocking()
```
multicore_fifo_pop_blocking() here can deadlock boot if the FIFO is empty (which is a valid state after reset/power-on). If the intent is to clear stale FIFO data before restarting core1, use multicore_fifo_drain() (non-blocking) or gate the pop behind a FIFO_ST.VLD check to avoid blocking forever.
Suggested change:

```diff
-multicore_fifo_pop_blocking()
+multicore_fifo_drain()
```
The FIFO will be empty, but CORE1 will write to it as soon as it starts. The logic is to reset the core and wait for it to tell us it is ready, so we block while waiting for the message coming from CORE1 through the FIFO.
Yes. It is approved without any need to change the license. We do not have a generic public open source contribution policy. The process we use is committee approval: the committee reviews a project's existing license before providing agreement, as well as the code and IP. This has been the case here, and I couldn't have made the PR without it.
Can this be tested on RP2040? That's the only W variant I have at hand to test some ugly bugs I was running into with multicore.

Edit: the title says rp2040, so I guess that's a yes. I was just confused since the branch name and other places only mention rp2350. Will test now.
So I'm testing the blinky example on my Raspberry Pi Pico W (RP2040-based) microcontroller, and something is breaking before initialization. I'm not even able to connect to serial, though there is a time.Sleep(2*time.Second) before the entire program begins.

This PR: (screenshot)
Current dev: (screenshot) Also kind of broken, since multicore breaks serial write order, but at least it boots.
Sorry for the mess. Let me think about it. We might, in the end, need two approaches, one for the 2040 and one for the 2350, as they do not use the same core and logic. I will try to get my hands on a 2040. I will refactor the PR to keep things business as usual for the 2040 and add a new path for the 2350. blinky seems to work on my 2350, which is still good news.
I found the Pico 2 W Ron had gifted me while in Brussels for FOSDEM and tested this PR out and have some unhappy but crucial findings.
Seems like a change in this PR broke something that was previously working. It's odd, because the slow webpage experience happens on every single request, and yet DHCP, NTP and the other protocols worked just fine. It may be that there is a race condition in the cyw43439 package, but I'd like to be sure before merging this. Are you able to test this with a Pico 2 W? Would be great if you were able to reproduce this behavior.
Yes, I confirm blinky works without any issues on the rp2350. I have had it running for 1h now and got no issues with the print locking. And the build
I don't have this device unfortunately. What I know is that the multicore scheduler plus a TCP/IP stack is tricky to make work, as you will be facing a couple of wait periods. I have identified a deadlock situation which needs to be addressed short term: when both cores enter the wfi instruction, tinygo can wake up a core only if a timer is running or an interrupt appears. But interrupts from GPIOs are masked and not configured to wake up the core by default, which leads to a situation where both cores sleep. I faced that issue with a w5500 setup and the SPI interrupt, which is connected to a GPIO. Without having that specific GPIO configured to wake up a core, I ended up in situations where the rp2350 just totally ignores everything. Where it gets even trickier is that interrupt management is per core and needs to be configured explicitly on each core, which we can't do easily for the moment, but I will work on that as soon as we can move forward with this PR. If you are unlucky you can have core0 sleeping and core1 ignoring the incoming interrupt too. We need to meet during the next FOSDEM. I will be attending the next OCP Summit in Barcelona in April, just in case.
I'd be OK moving forward with this PR if it did not affect -scheduler=tasks. I know I have users using the RP2350 with wifi, so having these changes as-is in a release is a no-go for me: it removes wifi support on the Pico 2, and who knows what else. If you can limit the blast radius to -scheduler=cores, I think we'd have a good-to-go, ready-to-merge change, since this has a net positive effect for those who use the cores scheduler: now there is no crash.

Also: please include comments explaining the why of the code. Have you read the RP2350 reference manual and taken a decision based on documentation? Is there a special decision you've taken based on an implementation note somewhere? A blog post? Are you basing yourself off the code in pico-sdk? This is valuable information so we can more easily audit the code and even potentially help you out. Please be precise in comments; include the documentation section number, or the pico-sdk file and line number.
That'd be nice :)
I found an RP2040 board at home! (Thanks to 9elements and the OSFC conference.) This code makes it more generic across both chips (I don't like it, as it disables interrupts, but at least it is stable). My preferred option would be to use the hardware spinlock on the M0 and the software one on the M33. Can you try with your test code whether this more generic approach is more stable?

func runCore1() {
This would be very welcome. I'll try it tomorrow; going to sleep.
I don't own a crystal ball, so this is mainly coming from the ARM documentation about the M0 and M33 cores, as well as the rp2350 datasheet, which includes an erratum about the hardware spinlock and its bugs (RP2350-E2). The interrupt handler modification comes from the rp2350 datasheet, section 3.2 and onward, about interrupts (https://pip-assets.raspberrypi.com/categories/1214-rp2350/documents/RP-008373-DS-2-rp2350-datasheet.pdf?disposition=inline). The core reset comes from the pico-sdk.
claude-research.md

Ok, gave 3 models a go at researching the topic so I can get a better feel for the context regarding multicore. Long day; still have not gotten around to looking at this. Would you say the models cover the brunt of the topics you applied to implement this? I'm going through all this rigamarole because this would be the first time I dive into the multicore side of things in TinyGo.
You are missing the Mistral one (just kidding, but that is because I am French). I agree that the topic is pretty complex. I like the concise output of Deepseek. By the way, use the model feedback as a way to accelerate your learning curve, and do not assume it will always work the way it is described (I speak from my experience with LLMs and firmware coding: it helps me but never gives me 100% good answers).
Thanks for the reassuring feedback <3 Yeah, I've definitely gotten quite disastrous results with paid models when building firmware. And the results for helping out with developing https://github.com/soypat/lneto (networking stack) are not much better. I still have to put in a good amount of manual labor to get good results.
I didn't realize you are the lneto developer. I am using the library for this program https://github.com/opencomputeproject/ocp-fwupdater-PoC (I haven't published the lneto side yet), which aims to automatically update the firmware of HPE servers before starting them up. I have a nice PoC which works, and roughly, I started working on the multicore scheduler to accelerate lneto. I reach about 1.7MB/s with my TFTP client, and around 300kB/s with an lneto TCP connection (the TCP ack doesn't really help, and HTTP partial download neither). When I enabled multicore I was able to get more bandwidth by transferring data through queues between cores. We wrote a driver for lneto which supports the w5500. The TFTP code runs faster with the multicore scheduler (about 2.2MB/s, and I think I am limited by the w5500 and the performance of my SPI bus, which is implemented on a breadboard; that is why I am designing a small custom board).
Holy moly, awesome! I'm very curious as to your findings! I'd love to know more so I can improve lneto. What do you mean the ack and HTTP partial download do not help? Please file an issue!!! I'd love to work on improving performance!! I do recall users having issues serving HTTP pages whose size exceeded the TCP buffer size; maybe that's still an issue? Were you using the same hardware with lneto?
Also FYI, might be of help in future projects seeing your interests: https://github.com/soypat/fat
The application that I develop uses an rp2350 to control the bring-up of a server. It keeps a ProLiant in reset mode until the SPI NOR(s) containing the ROM and the BMC code have been wiped and updated with the latest and greatest firmware. The idea is to enforce a full reflash of a server every time it starts. My ultimate goal is to tend toward a solution where the server becomes flashless and can be fully managed through network connectivity. So I have attached a TMUX1574 from TI to an rp2350 and put a SPI NOR behind it, so the SPI NOR can be controlled from two different endpoints. Attached to the rp2350 on spi0, you can find a w5500. The rp2350 starts, issues a couple of basic networking inits (DHCP, ARP, etc ...), and then issues some HTTP commands: one to retrieve the firmware size, and multiple commands to retrieve the firmware itself (each firmware is around 32MB, so no way to make it fit locally). What I do is download around 1024 bytes of useful data, then write it to the flash in parallel while downloading the next range of data, until reaching the SPI NOR size. I hit two issues
I think we can improve a few things by using more goroutines inside lneto to transfer data. What was most important for me was to have a PoC, so as to prove inside HPE that it could work, and to the OCP community that it is working. I have a nice demo setup for the next OCP summit in Barcelona, and it will be on our small booth. The code is given to the OCP foundation and it is using tinygo. Hope it provides a good overview. I still have a lot of work and ideas on how to improve my firmware management stack to enhance the manageability of servers inside the datacenter. I am a big fan of high performance even when dealing with tiny embedded apps. One of my next steps is to have a fully functional server displayed at the Silicon Valley open compute summit, probably using ethernet T1 instead of Tx, demonstrating that a server can in fact become a client of a datacenter and easing the scaling of the management stack, if that works. I think my approach can also be applied to thermal sensors and other devices which are used inside a server to manage it.
Addresses the issue reported in:
GC-caused scheduler deadlock on RP2350 #5151
With these fixes I have been able to run https://github.com/opencomputeproject/ocp-fwupdater-PoC/ with the cores scheduler (it still needs to be fully updated upstream).
ps: I had to add the SPDX sign-off as this work was performed during my work time at Hewlett Packard Enterprise, and it has been approved to be released publicly.