-
-
Notifications
You must be signed in to change notification settings - Fork 1.6k
NRF52 - Remove file totally before opening write #5916
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
I've reached the same conclusion. Thanks for this. @todd-herbert let's try to get this one merged too meshtastic/Adafruit_nRF52_Arduino#1. I think this PR and yours should resolve the issues folks are seeing. Note: I'm away from my test node till Monday, so I can't test. But peeking in on the logs shows it is still running after removing 85de193 & 1c0f43c. So I expect this PR to work just fine. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Definitely agree! Just one line that might want attention first though.
I'll bring that one forward to match the upstream PR on the Adafruit repo and then yeah lets go ahead with it. |
Give me just a few minutes for a quick test run on real hardware 🙏 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems to all be running well here! (Tested on T114)
- .proto files not doubling in size
- BLE disconnect still safe with meshtastic/Adafruit_nRF52_Arduino#1
- recovered from corruption induced by power removal during flash write
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
makes sense.
PROBLEM: -------- Users reported that BLE keyboards fail to reconnect after sleep, requiring manual re-pairing. Investigation revealed that bonding data files were becoming corrupted during sleep/wake cycles, when BLE connections are most unstable and prone to unexpected disconnections. ROOT CAUSE ANALYSIS: ------------------- The flash_cache_flush() function in flash_cache.c had a critical bug where it ignored return values from flash erase and program operations: ```c // BROKEN CODE (lines 89-90): fc->erase(fc->cache_addr); // Return value ignored fc->program(fc->cache_addr, fc->cache_buf, FLASH_CACHE_SIZE); // Return value ignored fc->cache_addr = FLASH_CACHE_INVALID_ADDR; // Always cleared cache ``` This meant that: 1. Flash operations could fail (especially during BLE disconnections) 2. The failure was silently ignored 3. The cache was cleared anyway, losing the data forever 4. LittleFS believed the write succeeded when it actually failed 5. Subsequent operations worked on corrupted filesystem state TECHNICAL DETAILS: ----------------- The underlying flash operations (fal_erase/fal_program) already had proper retry logic (20 attempts each, added in commit 3ea7855), but their failures were being ignored by flash_cache_flush(). This issue was particularly problematic during BLE sleep/wake cycles because: - BLE connections are unstable during wake-up - Flash operations are more likely to be interrupted - Multiple filesystem operations occur simultaneously - Unexpected disconnections corrupt in-progress writes INVESTIGATION RESOURCES: ----------------------- This fix was developed after extensive research including: 1. Adafruit nRF52 Arduino Issue adafruit#350: Documents this exact problem "The library completely ignores flash errors, both when reported synchronously and when reported via NRF_EVT_FLASH_OPERATION_ERROR" 2. Meshtastic firmware Issue #5839: BLE disconnection corruption meshtastic/firmware#5839 3. Meshtastic PR #5916: Application-layer workaround meshtastic/firmware#5916 4. Todd Herbert's flash operation improvements (our commit 3ea7855) Already implemented retry logic, but failures still ignored 5. Analysis of Nordic SoftDevice 6.1.1 flash operation behavior Confirmed that sd_flash_write/sd_flash_page_erase can fail during BLE instability SOLUTION: -------- Modified flash_cache_flush() to: 1. Check return values from both fc->erase() and fc->program() operations 2. Only clear cache (fc->cache_addr = FLASH_CACHE_INVALID_ADDR) if BOTH succeed 3. Preserve cache data for retry if either operation fails 4. Handle the verify-skip optimization path correctly This ensures that failed flash operations are properly detected and the cache remains valid for subsequent retry attempts, preventing data loss and filesystem corruption. TESTING IMPACT: -------------- This fix addresses the root cause of BLE reconnection failures after sleep. Users should experience: - Reliable BLE reconnection after keyboard wake from sleep - No more manual re-pairing requirements - Elimination of "corrupted dir pair" LittleFS errors - Overall improved filesystem integrity Related commits: - d8b2ea8: Disabled CCCD persistence (high-level workaround) - 3ea7855: Added flash operation retry logic (Todd Herbert) 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <[email protected]>
… pattern PROBLEM: -------- Kaleidoscope's NRF52Flash storage driver used a file overwrite pattern that is vulnerable to LittleFS corruption during BLE disconnections. The savePage() function would: 1. Open existing file with FILE_O_OVERWRITE 2. Call file_.seek(0) to reposition at beginning 3. Write new data over existing content This seek/overwrite pattern is problematic because if a BLE disconnection occurs during the write operation, LittleFS can be left in an inconsistent state with partially written data. ROOT CAUSE RESEARCH: ------------------- The seek(0)/truncate() pattern was identified as a corruption vector through analysis of the Meshtastic project's fixes: 1. Meshtastic firmware PR #5916: meshtastic/firmware#5916 Changed their SafeFile.cpp from: ```cpp // PROBLEMATIC PATTERN: File file = FSCom.open(filename, FILE_O_WRITE); file.seek(0); // ← Corruption risk ``` To: ```cpp // SAFER PATTERN: FSCom.remove(filename); // Delete first return FSCom.open(filename, FILE_O_WRITE); // Create fresh ``` 2. Meshtastic Issue #5839: Documents BLE disconnection corruption meshtastic/firmware#5839 Key findings: - Corruption occurs during unexpected BLE disconnections (timeout 0x08, 0x22) - Graceful disconnections (0x13) do not cause corruption - The issue is specifically related to abrupt connection losses during writes 3. Todd Herbert's analysis in Meshtastic discussions: - ".proto files not doubling in size" after fix - "BLE disconnect still safe" - "recovered from corruption induced by power removal during flash write" TECHNICAL EXPLANATION: --------------------- The vulnerability exists because: 1. LittleFS maintains metadata about file structure and allocation 2. During seek/overwrite operations, both old and new data coexist temporarily 3. If power loss or system interrupt occurs mid-operation, metadata can become inconsistent with actual data blocks SOLUTION IMPLEMENTATION: ----------------------- Modified savePage() in NRF52Flash.h (lines 314-327) to use the remove/create pattern instead of seek/overwrite: BEFORE (vulnerable): ```cpp bool file_exists = InternalFS.exists(path); if (\!file_.open(path, file_exists ? FILE_O_OVERWRITE : FILE_O_WRITE)) { return false; } if (file_exists && \!file_.seek(0)) { // ← CORRUPTION RISK return false; } ``` AFTER (safe): ```cpp // Remove existing file first to avoid seek/truncate corruption if (InternalFS.exists(path)) { if (\!InternalFS.remove(path)) { return false; } } // Create new file if (\!file_.open(path, FILE_O_WRITE)) { return false; } ``` BENEFITS OF THIS APPROACH: ------------------------- 1. Eliminates the vulnerable seek(0) operation entirely 2. Creates clean file state - no mixing of old/new data during write 3. Reduces LittleFS metadata complexity during write operations 4. Aligns with proven fix from Meshtastic community testing 5. Maintains atomic file operations - file either exists fully or not at all MESHTASTIC SUCCESS STORY: ------------------------ After implementing this exact pattern, Meshtastic users reported: - Elimination of filesystem corruption during BLE operations - Successful recovery from power-loss scenarios during writes - No more "lfs error:494: Corrupted dir pair" messages - Stable operation even with frequent BLE disconnections RELATIONSHIP TO OTHER FIXES: --------------------------- This is the second of two complementary fixes for flash corruption: 1. Previous commit (7fa304f7): Fixed flash_cache_flush() to handle operation failures - Addresses low-level flash operation error propagation - Ensures failed writes don't corrupt cache state 2. This commit: Fixed high-level file write patterns - Addresses application-level corruption during BLE instability - Eliminates seek/overwrite vulnerability Together, these fixes provide comprehensive protection against flash corruption during both normal operation and BLE connection instability. TESTING IMPACT: -------------- Users should experience: - Reliable Kaleidoscope settings persistence across sleep/wake cycles - No data loss during unexpected BLE disconnections - Elimination of settings corruption requiring factory reset - Stable operation during host sleep/wake scenarios References: - Meshtastic PR #5916: Application-layer seek/truncate fix - Meshtastic Issue #5839: BLE disconnection corruption documentation - Adafruit nRF52 Arduino Issue #350: Flash error handling problems - Nordic SoftDevice 6.1.1 documentation on flash operation behavior 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <[email protected]>
… pattern PROBLEM: -------- Kaleidoscope's NRF52Flash storage driver used a file overwrite pattern that is vulnerable to LittleFS corruption during BLE disconnections. The savePage() function would: 1. Open existing file with FILE_O_OVERWRITE 2. Call file_.seek(0) to reposition at beginning 3. Write new data over existing content This seek/overwrite pattern is problematic because if a BLE disconnection occurs during the write operation, LittleFS can be left in an inconsistent state with partially written data. ROOT CAUSE RESEARCH: ------------------- The seek(0)/truncate() pattern was identified as a corruption vector through analysis of the Meshtastic project's fixes: 1. Meshtastic firmware PR #5916: meshtastic/firmware#5916 Changed their SafeFile.cpp from: ```cpp // PROBLEMATIC PATTERN: File file = FSCom.open(filename, FILE_O_WRITE); file.seek(0); // ← Corruption risk ``` To: ```cpp // SAFER PATTERN: FSCom.remove(filename); // Delete first return FSCom.open(filename, FILE_O_WRITE); // Create fresh ``` 2. Meshtastic Issue #5839: Documents BLE disconnection corruption meshtastic/firmware#5839 Key findings: - Corruption occurs during unexpected BLE disconnections (timeout 0x08, 0x22) - Graceful disconnections (0x13) do not cause corruption - The issue is specifically related to abrupt connection losses during writes 3. Todd Herbert's analysis in Meshtastic discussions: - ".proto files not doubling in size" after fix - "BLE disconnect still safe" - "recovered from corruption induced by power removal during flash write" TECHNICAL EXPLANATION: --------------------- The vulnerability exists because: 1. LittleFS maintains metadata about file structure and allocation 2. During seek/overwrite operations, both old and new data coexist temporarily 3. If power loss or system interrupt occurs mid-operation, metadata can become inconsistent with actual data blocks SOLUTION IMPLEMENTATION: ----------------------- Modified savePage() in NRF52Flash.h (lines 314-327) to use the remove/create pattern instead of seek/overwrite: BEFORE (vulnerable): ```cpp bool file_exists = InternalFS.exists(path); if (\!file_.open(path, file_exists ? FILE_O_OVERWRITE : FILE_O_WRITE)) { return false; } if (file_exists && \!file_.seek(0)) { // ← CORRUPTION RISK return false; } ``` AFTER (safe): ```cpp // Remove existing file first to avoid seek/truncate corruption if (InternalFS.exists(path)) { if (\!InternalFS.remove(path)) { return false; } } // Create new file if (\!file_.open(path, FILE_O_WRITE)) { return false; } ``` BENEFITS OF THIS APPROACH: ------------------------- 1. Eliminates the vulnerable seek(0) operation entirely 2. Creates clean file state - no mixing of old/new data during write 3. Reduces LittleFS metadata complexity during write operations 4. Aligns with proven fix from Meshtastic community testing 5. Maintains atomic file operations - file either exists fully or not at all MESHTASTIC SUCCESS STORY: ------------------------ After implementing this exact pattern, Meshtastic users reported: - Elimination of filesystem corruption during BLE operations - Successful recovery from power-loss scenarios during writes - No more "lfs error:494: Corrupted dir pair" messages - Stable operation even with frequent BLE disconnections RELATIONSHIP TO OTHER FIXES: --------------------------- This is the second of two complementary fixes for flash corruption: 1. Previous commit (7fa304f7): Fixed flash_cache_flush() to handle operation failures - Addresses low-level flash operation error propagation - Ensures failed writes don't corrupt cache state 2. This commit: Fixed high-level file write patterns - Addresses application-level corruption during BLE instability - Eliminates seek/overwrite vulnerability Together, these fixes provide comprehensive protection against flash corruption during both normal operation and BLE connection instability. TESTING IMPACT: -------------- Users should experience: - Reliable Kaleidoscope settings persistence across sleep/wake cycles - No data loss during unexpected BLE disconnections - Elimination of settings corruption requiring factory reset - Stable operation during host sleep/wake scenarios References: - Meshtastic PR #5916: Application-layer seek/truncate fix - Meshtastic Issue #5839: BLE disconnection corruption documentation - Adafruit nRF52 Arduino Issue #350: Flash error handling problems - Nordic SoftDevice 6.1.1 documentation on flash operation behavior 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <[email protected]>
Given the discoveries in #5839, I believe we should abandon the seek(0) / truncate approach to opening files to write, as this has issues in the current NRF52 framework version of LittleFS.
The more crude approach of removing the file from the filesystem prior to opening for write seems to be the only path forward, unless I am overlooking an alternative approach.