Skip to content

Conversation

@thebentern
Copy link
Contributor

Given the discoveries in #5839, I believe we should abandon the seek(0) / truncate approach to opening files to write, as this has issues in the current NRF52 framework version of LittleFS.
The more crude approach of removing the file from the filesystem prior to opening for write seems to be the only path forward, unless I am overlooking an alternative approach.

@thebentern thebentern changed the title Remove file totally before opening write NRF52 - Remove file totally before opening write Jan 23, 2025
@esev
Copy link
Contributor

esev commented Jan 23, 2025

I've reached the same conclusion. Thanks for this.

@todd-herbert let's try to get this one merged too meshtastic/Adafruit_nRF52_Arduino#1. I think this PR and yours should resolve the issues folks are seeing.

Note: I'm away from my test node till Monday, so I can't test. But peeking in on the logs shows it is still running after removing 85de193 & 1c0f43c. So I expect this PR to work just fine.

Copy link
Contributor

@todd-herbert todd-herbert left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Definitely agree! Just one line that might want attention first though.

@todd-herbert
Copy link
Contributor

let's try to get this one merged too meshtastic/Adafruit_nRF52_Arduino#1. I think this PR and yours should resolve the issues folks are seeing.

I'll bring that one forward to match the upstream PR on the Adafruit repo and then yeah lets go ahead with it.

@todd-herbert
Copy link
Contributor

Give me just a few minutes for a quick test run on real hardware 🙏

Copy link
Contributor

@todd-herbert todd-herbert left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems to all be running well here! (Tested on T114)

Copy link
Member

@fifieldt fifieldt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

makes sense.

@thebentern thebentern merged commit 0d86088 into master Jan 24, 2025
57 checks passed
@thebentern thebentern deleted the remove-first branch January 24, 2025 01:40
obra added a commit to keyboardio/Keyboardio_nRF52_Arduino that referenced this pull request Aug 13, 2025
PROBLEM:
--------
Users reported that BLE keyboards fail to reconnect after sleep, requiring
manual re-pairing. Investigation revealed that bonding data files were
becoming corrupted during sleep/wake cycles, when BLE connections are most
unstable and prone to unexpected disconnections.

ROOT CAUSE ANALYSIS:
-------------------
The flash_cache_flush() function in flash_cache.c had a critical bug where
it ignored return values from flash erase and program operations:

```c
// BROKEN CODE (lines 89-90):
fc->erase(fc->cache_addr);     // Return value ignored
fc->program(fc->cache_addr, fc->cache_buf, FLASH_CACHE_SIZE); // Return value ignored
fc->cache_addr = FLASH_CACHE_INVALID_ADDR; // Always cleared cache
```

This meant that:
1. Flash operations could fail (especially during BLE disconnections)
2. The failure was silently ignored
3. The cache was cleared anyway, losing the data forever
4. LittleFS believed the write succeeded when it actually failed
5. Subsequent operations worked on corrupted filesystem state

TECHNICAL DETAILS:
-----------------
The underlying flash operations (fal_erase/fal_program) already had proper
retry logic (20 attempts each, added in commit 3ea7855), but their failures
were being ignored by flash_cache_flush().

This issue was particularly problematic during BLE sleep/wake cycles because:
- BLE connections are unstable during wake-up
- Flash operations are more likely to be interrupted
- Multiple filesystem operations occur simultaneously
- Unexpected disconnections corrupt in-progress writes

INVESTIGATION RESOURCES:
-----------------------
This fix was developed after extensive research including:

1. Adafruit nRF52 Arduino Issue adafruit#350: Documents this exact problem
   "The library completely ignores flash errors, both when reported
   synchronously and when reported via NRF_EVT_FLASH_OPERATION_ERROR"

2. Meshtastic firmware Issue #5839: BLE disconnection corruption
   meshtastic/firmware#5839

3. Meshtastic PR #5916: Application-layer workaround
   meshtastic/firmware#5916

4. Todd Herbert's flash operation improvements (our commit 3ea7855)
   Already implemented retry logic, but failures still ignored

5. Analysis of Nordic SoftDevice 6.1.1 flash operation behavior
   Confirmed that sd_flash_write/sd_flash_page_erase can fail during
   BLE instability

SOLUTION:
--------
Modified flash_cache_flush() to:
1. Check return values from both fc->erase() and fc->program() operations
2. Only clear cache (fc->cache_addr = FLASH_CACHE_INVALID_ADDR) if BOTH succeed
3. Preserve cache data for retry if either operation fails
4. Handle the verify-skip optimization path correctly

This ensures that failed flash operations are properly detected and the cache
remains valid for subsequent retry attempts, preventing data loss and
filesystem corruption.

TESTING IMPACT:
--------------
This fix addresses the root cause of BLE reconnection failures after sleep.
Users should experience:
- Reliable BLE reconnection after keyboard wake from sleep
- No more manual re-pairing requirements
- Elimination of "corrupted dir pair" LittleFS errors
- Overall improved filesystem integrity

Related commits:
- d8b2ea8: Disabled CCCD persistence (high-level workaround)
- 3ea7855: Added flash operation retry logic (Todd Herbert)

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
obra added a commit to keyboardio/Kaleidoscope that referenced this pull request Aug 13, 2025
… pattern

PROBLEM:
--------
Kaleidoscope's NRF52Flash storage driver used a file overwrite pattern that
is vulnerable to LittleFS corruption during BLE disconnections. The savePage()
function would:

1. Open existing file with FILE_O_OVERWRITE
2. Call file_.seek(0) to reposition at beginning
3. Write new data over existing content

This seek/overwrite pattern is problematic because if a BLE disconnection
occurs during the write operation, LittleFS can be left in an inconsistent
state with partially written data.

ROOT CAUSE RESEARCH:
-------------------
The seek(0)/truncate() pattern was identified as a corruption vector through
analysis of the Meshtastic project's fixes:

1. Meshtastic firmware PR #5916:
   meshtastic/firmware#5916

   Changed their SafeFile.cpp from:
   ```cpp
   // PROBLEMATIC PATTERN:
   File file = FSCom.open(filename, FILE_O_WRITE);
   file.seek(0);  // ← Corruption risk
   ```

   To:
   ```cpp
   // SAFER PATTERN:
   FSCom.remove(filename);  // Delete first
   return FSCom.open(filename, FILE_O_WRITE); // Create fresh
   ```

2. Meshtastic Issue #5839: Documents BLE disconnection corruption
   meshtastic/firmware#5839

   Key findings:
   - Corruption occurs during unexpected BLE disconnections (timeout 0x08, 0x22)
   - Graceful disconnections (0x13) do not cause corruption
   - The issue is specifically related to abrupt connection losses during writes

3. Todd Herbert's analysis in Meshtastic discussions:
   - ".proto files not doubling in size" after fix
   - "BLE disconnect still safe"
   - "recovered from corruption induced by power removal during flash write"

TECHNICAL EXPLANATION:
---------------------
The vulnerability exists because:

1. LittleFS maintains metadata about file structure and allocation
2. During seek/overwrite operations, both old and new data coexist temporarily
3. If power loss or system interrupt occurs mid-operation, metadata can become
   inconsistent with actual data blocks

SOLUTION IMPLEMENTATION:
-----------------------
Modified savePage() in NRF52Flash.h (lines 314-327) to use the remove/create
pattern instead of seek/overwrite:

BEFORE (vulnerable):
```cpp
bool file_exists = InternalFS.exists(path);
if (\!file_.open(path, file_exists ? FILE_O_OVERWRITE : FILE_O_WRITE)) {
  return false;
}
if (file_exists && \!file_.seek(0)) {  // ← CORRUPTION RISK
  return false;
}
```

AFTER (safe):
```cpp
// Remove existing file first to avoid seek/truncate corruption
if (InternalFS.exists(path)) {
  if (\!InternalFS.remove(path)) {
    return false;
  }
}
// Create new file
if (\!file_.open(path, FILE_O_WRITE)) {
  return false;
}
```

BENEFITS OF THIS APPROACH:
-------------------------
1. Eliminates the vulnerable seek(0) operation entirely
2. Creates clean file state - no mixing of old/new data during write
3. Reduces LittleFS metadata complexity during write operations
4. Aligns with proven fix from Meshtastic community testing
5. Maintains atomic file operations - file either exists fully or not at all

MESHTASTIC SUCCESS STORY:
------------------------
After implementing this exact pattern, Meshtastic users reported:
- Elimination of filesystem corruption during BLE operations
- Successful recovery from power-loss scenarios during writes
- No more "lfs error:494: Corrupted dir pair" messages
- Stable operation even with frequent BLE disconnections

RELATIONSHIP TO OTHER FIXES:
---------------------------
This is the second of two complementary fixes for flash corruption:

1. Previous commit (7fa304f7): Fixed flash_cache_flush() to handle operation failures
   - Addresses low-level flash operation error propagation
   - Ensures failed writes don't corrupt cache state

2. This commit: Fixed high-level file write patterns
   - Addresses application-level corruption during BLE instability
   - Eliminates seek/overwrite vulnerability

Together, these fixes provide comprehensive protection against flash corruption
during both normal operation and BLE connection instability.

TESTING IMPACT:
--------------
Users should experience:
- Reliable Kaleidoscope settings persistence across sleep/wake cycles
- No data loss during unexpected BLE disconnections
- Elimination of settings corruption requiring factory reset
- Stable operation during host sleep/wake scenarios

References:
- Meshtastic PR #5916: Application-layer seek/truncate fix
- Meshtastic Issue #5839: BLE disconnection corruption documentation
- Adafruit nRF52 Arduino Issue #350: Flash error handling problems
- Nordic SoftDevice 6.1.1 documentation on flash operation behavior

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
obra added a commit to keyboardio/Kaleidoscope that referenced this pull request Aug 13, 2025
… pattern

PROBLEM:
--------
Kaleidoscope's NRF52Flash storage driver used a file overwrite pattern that
is vulnerable to LittleFS corruption during BLE disconnections. The savePage()
function would:

1. Open existing file with FILE_O_OVERWRITE
2. Call file_.seek(0) to reposition at beginning
3. Write new data over existing content

This seek/overwrite pattern is problematic because if a BLE disconnection
occurs during the write operation, LittleFS can be left in an inconsistent
state with partially written data.

ROOT CAUSE RESEARCH:
-------------------
The seek(0)/truncate() pattern was identified as a corruption vector through
analysis of the Meshtastic project's fixes:

1. Meshtastic firmware PR #5916:
   meshtastic/firmware#5916

   Changed their SafeFile.cpp from:
   ```cpp
   // PROBLEMATIC PATTERN:
   File file = FSCom.open(filename, FILE_O_WRITE);
   file.seek(0);  // ← Corruption risk
   ```

   To:
   ```cpp
   // SAFER PATTERN:
   FSCom.remove(filename);  // Delete first
   return FSCom.open(filename, FILE_O_WRITE); // Create fresh
   ```

2. Meshtastic Issue #5839: Documents BLE disconnection corruption
   meshtastic/firmware#5839

   Key findings:
   - Corruption occurs during unexpected BLE disconnections (timeout 0x08, 0x22)
   - Graceful disconnections (0x13) do not cause corruption
   - The issue is specifically related to abrupt connection losses during writes

3. Todd Herbert's analysis in Meshtastic discussions:
   - ".proto files not doubling in size" after fix
   - "BLE disconnect still safe"
   - "recovered from corruption induced by power removal during flash write"

TECHNICAL EXPLANATION:
---------------------
The vulnerability exists because:

1. LittleFS maintains metadata about file structure and allocation
2. During seek/overwrite operations, both old and new data coexist temporarily
3. If power loss or system interrupt occurs mid-operation, metadata can become
   inconsistent with actual data blocks

SOLUTION IMPLEMENTATION:
-----------------------
Modified savePage() in NRF52Flash.h (lines 314-327) to use the remove/create
pattern instead of seek/overwrite:

BEFORE (vulnerable):
```cpp
bool file_exists = InternalFS.exists(path);
if (\!file_.open(path, file_exists ? FILE_O_OVERWRITE : FILE_O_WRITE)) {
  return false;
}
if (file_exists && \!file_.seek(0)) {  // ← CORRUPTION RISK
  return false;
}
```

AFTER (safe):
```cpp
// Remove existing file first to avoid seek/truncate corruption
if (InternalFS.exists(path)) {
  if (\!InternalFS.remove(path)) {
    return false;
  }
}
// Create new file
if (\!file_.open(path, FILE_O_WRITE)) {
  return false;
}
```

BENEFITS OF THIS APPROACH:
-------------------------
1. Eliminates the vulnerable seek(0) operation entirely
2. Creates clean file state - no mixing of old/new data during write
3. Reduces LittleFS metadata complexity during write operations
4. Aligns with proven fix from Meshtastic community testing
5. Maintains atomic file operations - file either exists fully or not at all

MESHTASTIC SUCCESS STORY:
------------------------
After implementing this exact pattern, Meshtastic users reported:
- Elimination of filesystem corruption during BLE operations
- Successful recovery from power-loss scenarios during writes
- No more "lfs error:494: Corrupted dir pair" messages
- Stable operation even with frequent BLE disconnections

RELATIONSHIP TO OTHER FIXES:
---------------------------
This is the second of two complementary fixes for flash corruption:

1. Previous commit (7fa304f7): Fixed flash_cache_flush() to handle operation failures
   - Addresses low-level flash operation error propagation
   - Ensures failed writes don't corrupt cache state

2. This commit: Fixed high-level file write patterns
   - Addresses application-level corruption during BLE instability
   - Eliminates seek/overwrite vulnerability

Together, these fixes provide comprehensive protection against flash corruption
during both normal operation and BLE connection instability.

TESTING IMPACT:
--------------
Users should experience:
- Reliable Kaleidoscope settings persistence across sleep/wake cycles
- No data loss during unexpected BLE disconnections
- Elimination of settings corruption requiring factory reset
- Stable operation during host sleep/wake scenarios

References:
- Meshtastic PR #5916: Application-layer seek/truncate fix
- Meshtastic Issue #5839: BLE disconnection corruption documentation
- Adafruit nRF52 Arduino Issue #350: Flash error handling problems
- Nordic SoftDevice 6.1.1 documentation on flash operation behavior

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants