prevent range tree corruption race by updating dnode_sync() #18235
alek-p wants to merge 1 commit into openzfs:master from
Conversation
Pull request overview
This PR fixes a race condition in dnode_sync() that causes range tree corruption, leading to kernel panics and unimportable pools. The issue has been documented across multiple versions (2.0.x through 2.4.0) and manifests as "list_add corruption", "overlapping segment", or "duplicate free" panics during zfs destroy operations.
Changes:
- Removed the `dnode_sync_free_range()` callback function and its argument struct that dropped `dn_mtx` during processing
- Replaced `zfs_range_tree_walk()` and `zfs_range_tree_vacate()` with an incremental loop that processes one segment at a time
- Changed from `zfs_range_tree_remove()` to `zfs_range_tree_clear()` to handle concurrent modifications gracefully
pcd1193182 left a comment
We discussed this on slack, so I'm familiar with the problem and with the alternative solution. My take on the situation is that as long as this approach doesn't cause any performance issues (from repeatedly locking/unlocking, or from the inefficient method of clearing the tree), it's fine. If it does, we will probably want to switch to the other approach, even if that means doing some more investigation to ensure it can't deadlock or anything.
amotin left a comment
I agree that existing code is potentially unsafe if something else may modify the tree. And I generally agree with the proposed change, except a bit worrying about possible zfs_range_tree_clear() calls cost, since it has to search again for each entry. And I would reduce all the endless comment to one line, saying that we can't drop the lock inside normal iterator and expect it to protect.
What I don't understand is the specific scenario of simultaneous access. The code here works on a synced transaction group, while new additions should likely happen in open transaction group, which has its own range tree. Are there cases when we modify the range tree for the syncing txg outside of the sync thread? Because if not, I suspect we would only need the lock when we are vacating and destroy the range tree.
I have a similar take: the more targeted proposed fix makes good sense to me as long as we don't discover use cases where there's a significant performance penalty. If we do, then we'll just have to look into possible optimizations. Job 1, of course, is resolving this race — nice work running this case down.
@alek-p do you have an example stack trace for Thread B in your top comment? It'd be helpful to see the full stacks for the exact scenario here.
This really is the key question. We are not supposed to be able to do any modification to the syncing range-tree outside of the sync thread.
I've tried to find the exact path by looking through the coredump we have, but I didn't find anything that looked like Thread B. I assumed it had already finished by the time we panicked, since the panic happens when the walk resumes. However, I'll try to find such a stack by adding the asserts described above.
Running with the ASSERTs added and reproduce.c, I'm able to get a look at what I think is a culprit thread. If we didn't panic on this ASSERT, we'd be modifying the range tree several lines later in dnode_free_range(). Both Thread A (
Force-pushed dea0fd2 to 9bed3ca
They are both executed in syncing context, but that's because they're both executed by the
Switch to incremental range tree processing in dnode_sync() to avoid unsafe lock dropping during zfs_range_tree_walk(). This also ensures the free ranges remain visible to dnode_block_freed() throughout the sync process, preventing potential stale data reads.

This patch:
- Keeps the range tree attached during processing for visibility.
- Processes segments one-by-one by restarting from the tree head.
- Uses zfs_range_tree_clear() to safely handle ranges that may have been modified while the lock was dropped.

Closes openzfs#18186
Signed-off-by: Alek Pinchuk <apinchuk@axcient.com>
Force-pushed 9bed3ca to eadd7a8
I was just going to say the same, but there is one more aspect to consider: to improve performance, the sync thread can sync multiple datasets and dnodes at the same time in separate taskq threads, periodically waiting for completion. One dnode should only be handled by one task at a time, but I wonder if we may miss the taskq waiting at some point, allowing a race with some other sync thread activity that might care about the same dnode somehow.
This does seem plausible, since this is what our stack trace looks like for Thread A in the coredump we have:
@alek-p any luck ginning up a reproducer to get a stack trace for thread B? Or perhaps additional debugging can be added to detect if other sync thread activity operating on the dnode at the same time. |
So far, my best guess for Thread B is #18235 (comment). I've also been running with some range tree sanity checking, but that hasn't yielded a panic yet.
I've tried to look from this perspective, but unfortunately I found no way it could happen.
Motivation and Context
We (at ConnectWise) have been looking into a rare range-tree corruption that we've seen in our private cloud and that has also been reported here on GitHub.
There is a (seemingly AI-generated) meta issue that captures some of the related context (PRs and reported issues) - #18186
TL;DR: people have been ending up with panics and unimportable pools with "duplicate free" or "overlapping segment" errors due to range tree corruption.
Sadly, we have not been able to reproduce this on demand, but we think we have found the root cause of it, and the patch in this PR is the proposed fix.
The root cause is a race that can be described as the following parallel execution scenario:
This code in `dnode_sync()` last changed here when there was a similar panic, and it seems like that fix only moved the problem to happen earlier. That fix for #10708 was included in 2.0.x, so the timing lines up with the issue reports saying the range tree issue started in ~2.0.x.

I should mention that I also investigated making `dnode_sync_free_range()` re-entrant. While this would be a "cleaner" architectural solution, eliminating the need to drop and reacquire the mutex, it requires more extensive locking changes. Specifically, the alternative approach involves a number of modifications around `dn_mtx` and a couple of adjustments to `dn_struct_rwlock`. Although my limited initial testing didn't uncover regressions, this broader shift in the locking strategy introduces a level of risk that may exceed the benefits of the cleanup. I've opted for the current, more localized approach to ensure stability and minimize unanticipated side effects. However, I am happy to reconsider and upstream the re-entrant implementation if the community prefers that.

Description
To avoid the above referenced race, we cannot simply detach the range tree (set dn_free_ranges to NULL) before processing it, because `dnode_block_freed()` relies on it to correctly identify blocks that have been freed in the current TXG (for `dbuf_read()` calls on holes). If we detached it early, a concurrent reader might see the block as valid on disk and return stale data instead of zeros.

We also can't use `zfs_range_tree_walk()` or `zfs_range_tree_vacate()` with a callback that drops dn_mtx (`dnode_sync_free_range()`). This is unsafe because another thread (`dnode_free_range()`) or even the same thread via recursion (`dnode_sync_free_range_impl()` -> `free_children()` -> `dbuf_dirty()`) could acquire dn_mtx and modify the tree while the walk or vacate was in progress. This leads to tree corruption or a panic when the walk resumes.

To fix the race while maintaining visibility, we process the tree incrementally. We pick a segment, drop the lock to sync it, and re-acquire the lock to remove it. By always restarting from the head of the tree, we ensure we are never using an invalid iterator.
We use `zfs_range_tree_clear()` instead of `zfs_range_tree_remove()` because the range might have already been removed while the lock was dropped (specifically in the `dbuf_dirty()` path mentioned above). `zfs_range_tree_clear()` handles this gracefully, while `zfs_range_tree_remove()` would panic on a missing segment.
I've run local zfs-tests with positive results.