
[Bug]: Critical fault #12 flash filesystem corruption and format on nrf52 platform #5839

@dahanc

Description


Category

Other

Hardware

Heltec Mesh Node T114

Firmware Version

2.5.13.1a06f88

Description

Back in November 2024, my T114 showed a Critical fault #12. I rebooted it and it seemed to work OK, but a few days later, it got into a boot loop. The serial debug output was:

INFO  | ??:??:?? 6 Adding node to database with 1 nodes and 147308 bytes free!
DEBUG | ??:??:?? 6 Expanding short PSK #1
INFO  | ??:??:?? 6 Wanted region 1, using US
INFO  | ??:??:?? 6 Saving /prefs/db.proto
lfs warn:314: No more free space 224
ERROR | ??:??:?? 6 Error: can't encode protobuf io error
ERROR | ??:??:?? 6 LFS assert: head >= 2 && head <= lfs->cfg->block_count
ERROR | ??:??:?? 6 LFS assert: head >= 2 && head <= lfs->cfg->block_count
ERROR | ??:??:?? 6 LFS assert: block < lfs->cfg

Then about a week later, another of my T114s also got into a boot loop. This one doesn't have a screen, so I don't know if it also had a Critical fault #12. I erased the flash, installed 2.5.13.1a06f88, connected the USB port to a PC, and logged all of the serial output, so that if it happened again, I could see what happened right before the first crash and reboot. After about a month, it rebooted:

INFO  | 20:29:31 2159141 [Router] Received nodeinfo from=0xa20a11c8, id=0x7d3b35e6, portnum=4, payloadlen=44
INFO  | 20:29:31 2159141 [Router] Node database full with 80 nodes and 109096 bytes free. Erasing oldest entry
INFO  | 20:29:31 2159141 [Router] Adding node to database with 80 nodes and 109096 bytes free!
DEBUG | 20:29:31 2159141 [Router] old user /, channel=0
DEBUG | 20:29:31 2159141 [Router] Update changed=1 user MKTX G2 Gateway/MKG3, channel=0
DEBUG | 20:29:31 2159141 [Router] State: ON
DEBUG | 20:29:31 2159141 [Router] Node status update: 22 online, 80 total
INFO  | 20:29:31 2159141 [Router] Save /prefs/db.proto
lfs debug:617: Bad block at 126
lfs debug:617: Bad block at 180
INFO  | 20:29:36 2159146 [Router] BLE Disconnected, reason = 0x8
DEBUG | 20:29:36 2159146 [Router] PhoneAPI::close()
lfs debug:640: Relocating 126 160 to 181 160
ERROR | 20:29:37 2159147 [Router] LFS assert: head >= 2 && head <= lfs->cfg->block_count
ERROR | 20:29:37 2159147 [Rou

And after the reboot, this is what it logged before rebooting again:

INFO  | ??:??:?? 3 Adding node to database with 1 nodes and 153800 bytes free!
DEBUG | ??:??:?? 4 Expand short PSK #1
INFO  | ??:??:?? 4 Wanted region 1, using US
DEBUG | ??:??:?? 4 Coerce telemetry to min of 30 minutes on defaults
INFO  | ??:??:?? 4 Save /prefs/db.proto
ERROR | ??:??:?? 4 LFS assert: block < lfs->cfg->block_cou

So it seems that the "Bad block" errors are what triggered this, and maybe the relocation code is buggy and corrupts the filesystem? In any case, the "Bad block" errors look like the more relevant lead. From what I see in lfs.c, the "Bad block" message means a flash routine returned LFS_ERR_CORRUPT. However, I didn't see anything in InternalFileSystem.cpp that would return LFS_ERR_CORRUPT, so I think that means lfs_cache_cmp() returned false (e.g., line 194 of lfs.c).

I haven't looked into the details of how the caching works, but since I don't think the flash is going bad on either of my T114s (V2 hasn't been out that long), I wonder if something else in the firmware is corrupting the memory buffer used for the cache.

