
prevent range tree corruption race by updating dnode_sync() #18235

Open
alek-p wants to merge 1 commit into openzfs:master from alek-p:range_tree_corruption

Conversation

@alek-p
Contributor

@alek-p alek-p commented Feb 18, 2026

Motivation and Context

We (at Connectwise) have been looking into a rare range-tree corruption that we've seen in our private cloud and that has been reported here on GitHub.
There is a (seemingly AI-generated) meta issue that captures some of the related context (PRs and reported issues) - #18186.
TL;DR: people have been ending up with panics and unimportable pools, with duplicate-free or overlapping segments, due to range tree corruption.
Sadly, we have not been able to reproduce this on demand, but we think we have found the root cause of it, and the patch in this PR is the proposed fix.

The root cause is a race that can be described as the following parallel execution scenario:

  Thread A (`txg_sync` thread):
   1. Enters dnode_sync for a dnode.
   2. Acquires dn->dn_mtx.
   3. Calls zfs_range_tree_walk on dn->dn_free_ranges.
   4. Inside walk, calls callback dnode_sync_free_range.
   5. Drops `dn->dn_mtx`.
   6. Begins processing a range (e.g., freeing blocks).

  Thread B (`zfs destroy` command):
   1. Running concurrently.
   2. Needs to add a range to the same dnode's free list (e.g., in dnode_free_range).
   3. Acquires dn->dn_mtx (SUCCESS, because Thread A dropped it!).
   4. Accesses dn->dn_free_ranges[txgoff]. It sees the pointer is valid (not NULL).
   5. Calls zfs_range_tree_add on that same tree.

  The Crash:
   * Thread B modifies the range tree while Thread A is in the middle of iterating over it.
   * zfs_range_tree_walk (Thread A) reads a pointer that Thread B just freed or moved.
   * Result: Use-after-free, infinite loop, or invalid memory access leading to panic.
   * Alternatively: Thread B adds a range that overlaps with one Thread A is currently freeing, leading to the "overlapping segment" panic later. 

This code in dnode_sync() last changed when there was a similar panic, and it seems that fix only moved the problem to happen earlier. That fix, for #10708, was included in 2.0.x, so the timing lines up with the issue reports saying the range tree issue started in ~2.0.x.

I should mention that I also investigated making dnode_sync_free_range() re-entrant. While this would be a "cleaner" architectural solution, eliminating the need to drop and reacquire the mutex, it requires more extensive locking changes. Specifically, the alternative approach involves a bunch of modifications around dn_mtx and a couple of adjustments to dn_struct_rwlock. Although my limited initial testing didn't uncover regressions, this broader shift in the locking strategy introduces a level of risk that may exceed the benefits of the cleanup. I’ve opted for the current, more localized approach to ensure stability and minimize unanticipated side effects. However, I am happy to reconsider and upstream the re-entrant implementation if the community prefers that.

Description

To avoid the above referenced race, we cannot simply detach the range tree (set dn_free_ranges to NULL) before processing it because dnode_block_freed() relies on it to correctly identify blocks that have been freed in the current TXG (for dbuf_read() calls on holes). If we detached it early, a concurrent reader might see the block as valid on disk and return stale data instead of zeros.

We also can't use zfs_range_tree_walk() nor zfs_range_tree_vacate() with a callback that drops dn_mtx (dnode_sync_free_range()). This is unsafe because another thread (dnode_free_range()) or even the same thread via recursion (dnode_sync_free_range_impl() -> free_children() -> dbuf_dirty()) could acquire dn_mtx and modify the tree while the walk or vacate was in progress. This leads to tree corruption or panic when we resume.

To fix the race while maintaining visibility, we process the tree incrementally. We pick a segment, drop the lock to sync it, and re-acquire the lock to remove it. By always restarting from the head of the tree, we ensure we are never using an invalid iterator.
We use zfs_range_tree_clear() instead of zfs_range_tree_remove() because the range might have already been removed while the lock was dropped (specifically in the dbuf_dirty() path mentioned above). zfs_range_tree_clear() handles this gracefully, while zfs_range_tree_remove() would panic on a missing segment.

How Has This Been Tested?

I've run local zfs-tests with positive results.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Quality assurance (non-breaking change which makes the code more robust against bugs)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

Copilot AI review requested due to automatic review settings February 18, 2026 22:07
@alek-p alek-p added Status: Code Review Needed Ready for review and testing Type: Defect Incorrect behavior (e.g. crash, hang) labels Feb 18, 2026
Copilot AI left a comment

Pull request overview

This PR fixes a race condition in dnode_sync() that causes range tree corruption, leading to kernel panics and unimportable pools. The issue has been documented across multiple versions (2.0.x through 2.4.0) and manifests as "list_add corruption", "overlapping segment", or "duplicate-free" panics during zfs destroy operations.

Changes:

  • Removed the dnode_sync_free_range() callback function and its argument struct that dropped dn_mtx during processing
  • Replaced zfs_range_tree_walk() and zfs_range_tree_vacate() with an incremental loop that processes one segment at a time
  • Changed from zfs_range_tree_remove() to zfs_range_tree_clear() to handle concurrent modifications gracefully


Contributor

@pcd1193182 pcd1193182 left a comment

We discussed this on slack, so I'm familiar with the problem with the alternative solution. My take on the situation is that as long as this approach doesn't cause any performance issues (from repeatedly locking/unlocking or the inefficient method of clearing the tree), it's fine. If it does, we will probably want to switch to the other approach, even if that means doing some more investigation to ensure it can't deadlock or anything.

Member

@amotin amotin left a comment

I agree that existing code is potentially unsafe if something else may modify the tree. And I generally agree with the proposed change, except a bit worrying about possible zfs_range_tree_clear() calls cost, since it has to search again for each entry. And I would reduce all the endless comment to one line, saying that we can't drop the lock inside normal iterator and expect it to protect.

What I don't understand is the specific scenario of simultaneous access. The code here works on a synced transaction group, while new additions should likely happen in open transaction group, which has its own range tree. Are there cases when we modify the range tree for the syncing txg outside of the sync thread? Because if not, I suspect we would only need the lock when we are vacating and destroy the range tree.

@behlendorf behlendorf self-requested a review February 19, 2026 18:44
Contributor

@behlendorf behlendorf left a comment

I have a similar take, the more targeted proposed fix makes good sense to me as long as we don't discover use cases where there's a significant performance penalty. If we do, then we'll just have to look in to possible optimizations. Job 1 of course is resolve this race, nice work running this case down.

@behlendorf
Contributor

@alek-p do you have an example stack trace for Thread B in your top comment? It'd be helpful to see the full stacks for the exact scenario here.

@alek-p
Contributor Author

alek-p commented Feb 20, 2026

What I don't understand is the specific scenario of simultaneous access. The code here works on a synced transaction group, while new additions should likely happen in open transaction group, which has its own range tree. Are there cases when we modify the range tree for the syncing txg outside of the sync thread? Because if not, I suspect we would only need the lock when we are vacating and destroy the range tree.

This really is the key question. We are not supposed to be able to do any modification to the syncing range-tree outside dnode_sync(), but I don't think this is true in practice. I will try to prove this by adding an ASSERT(tx->tx_txg > spa_syncing_txg(dn->dn_objset->os_spa)) to the places that modify dn_free_ranges and running some tests.

@alek-p do you have an example stack trace for Thread B in your top comment? It'd be helpful to see the full stacks for the exact scenario here.

I've tried to find the exact path, looking through the coredump we have, but I didn't find anything that looked like thread B. I assumed it was already finished by the time we panicked, since the panic happens when the walk resumes. However, I'll try to find such a stack by adding the asserts described above.

@alek-p
Contributor Author

alek-p commented Feb 20, 2026

Running with the ASSERTs added and reproduce.c, I'm able to get a look at what I think is a culprit thread. If we didn't panic on this ASSERT, we'd be modifying the range tree several lines later in dnode_free_range().

[  486.041407] VERIFY(tx->tx_txg > spa_syncing_txg(dn->dn_objset->os_spa)) failed
[  486.041441] PANIC at dnode.c:2418:dnode_free_range()
[  486.041456] Showing stack for process 47536
[  486.041457] CPU: 1 PID: 47536 Comm: txg_sync Tainted: P           OE      6.8.0-100-generic #100-Ubuntu
[  486.041459] Call trace:
[  486.041460]  dump_backtrace+0xa4/0x150
[  486.041488]  show_stack+0x24/0x50
[  486.041489]  dump_stack_lvl+0xc8/0x138
[  486.041510]  dump_stack+0x1c/0x38
[  486.041511]  spl_dumpstack+0x30/0x58 [spl]
[  486.041522]  spl_panic+0xfc/0x120 [spl]
[  486.041527]  spl_assert+0x2c/0x60 [zfs]
[  486.041647]  dnode_free_range+0x6dc/0x980 [zfs]
[  486.041742]  dmu_free_range+0x88/0x130 [zfs]
[  486.041832]  bpobj_iterate_blkptrs+0x4f4/0x588 [zfs]
[  486.041934]  bpobj_iterate_impl+0x31c/0x820 [zfs]
[  486.042022]  bpobj_iterate+0x24/0x58 [zfs]
[  486.042108]  spa_sync_deferred_frees+0x98/0x148 [zfs]
[  486.042200]  spa_sync_iterate_to_convergence+0x1b4/0x358 [zfs]
[  486.042290]  spa_sync+0x234/0x650 [zfs]
[  486.042378]  txg_sync_thread+0x24c/0x370 [zfs]
[  486.042467]  thread_generic_wrapper+0x80/0xb0 [spl]
[  486.042474]  kthread+0xf8/0x110
[  486.042479]  ret_from_fork+0x10/0x20

Both Thread A (dnode_sync()) and Thread B (spa_sync_deferred_frees()) execute within syncing context, allowing modifications to the same dn_free_ranges tree.

@alek-p alek-p added the Status: Understood The root cause of the issue is known label Feb 20, 2026
@alek-p alek-p force-pushed the range_tree_corruption branch from dea0fd2 to 9bed3ca Compare February 20, 2026 15:11
@pcd1193182
Contributor

Running with the ASSERTs added and reproduce.c, I'm able to get a look at what I think is a culprit thread. If we didn't panic on this ASSERT, we'd be modifying the range tree several lines later in dnode_free_range()
....


Both Thread A (`dnode_sync()`) and Thread B (`spa_sync_deferred_frees()`) execute within syncing context, allowing modifications to the same dn_free_ranges tree.

They are both executed in syncing context, but that's because they're both executed by the spa_sync thread directly, right? So they can't race against each other. There would need to be some caller of one or the other that operates asynchronously to the spa_sync thread. dnode_sync definitely looks like it only gets called from the spa_sync thread, but dmu_free_range is called from so many places that that's probably the culprit. We'd need to modify the assert to allow for this being the spa_sync thread, to actually catch the culprit, I think.

Switch to incremental range tree processing in dnode_sync() to avoid
unsafe lock dropping during zfs_range_tree_walk(). This also ensures
the free ranges remain visible to dnode_block_freed() throughout the
sync process, preventing potential stale data reads.

This patch:
- Keeps the range tree attached during processing for visibility.
- Processes segments one-by-one by restarting from the tree head.
- Uses zfs_range_tree_clear() to safely handle ranges that may have
  been modified while the lock was dropped.

Closes openzfs#18186

Signed-off-by: Alek Pinchuk <apinchuk@axcient.com>
@alek-p alek-p force-pushed the range_tree_corruption branch from 9bed3ca to eadd7a8 Compare February 20, 2026 17:43
@amotin
Member

amotin commented Feb 20, 2026

They are both executed in syncing context, but that's because they're both executed by the spa_sync thread directly, right? So they can't race against each other.

I was just going to say the same, but there is one more aspect to consider: to improve performance, the sync thread can sync multiple datasets and dnodes at the same time in separate taskq threads, periodically waiting for completion. One dnode should only be handled by one task at a time, but I wonder if we may miss the taskq waiting at some point, allowing a race with some other sync thread activity that might care about the same dnode somehow.

@alek-p
Contributor Author

alek-p commented Feb 20, 2026

I was just going to say the same, but there is one more aspect to consider: to improve performance, the sync thread can sync multiple datasets and dnodes at the same time in separate taskq threads, periodically waiting for completion. One dnode should only be handled by one task at a time, but I wonder if we may miss the taskq waiting at some point, allowing a race with some other sync thread activity that might care about the same dnode somehow.

This does seem plausible since this is what our stacktrace looks like for Thread A in the coredump we have:

panic: VERIFY(BP_GET_FILL(db->db_blkptr) == 0 || db->db_dirtycnt > 0) failed

cpuid = 2
time = 1770620751
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe0f7dd629f0
vpanic() at vpanic+0x161/frame 0xfffffe0f7dd62b20
spl_panic() at spl_panic+0x3a/frame 0xfffffe0f7dd62b80
free_children() at free_children+0x453/frame 0xfffffe0f7dd62bf0
free_children() at free_children+0x2f0/frame 0xfffffe0f7dd62c60
dnode_sync_free_range() at dnode_sync_free_range+0x1ff/frame 0xfffffe0f7dd62ce0
range_tree_walk() at range_tree_walk+0xa9/frame 0xfffffe0f7dd62d30
dnode_sync() at dnode_sync+0x368/frame 0xfffffe0f7dd62de0
sync_dnodes_task() at sync_dnodes_task+0x8c/frame 0xfffffe0f7dd62e20
taskq_run() at taskq_run+0x17/frame 0xfffffe0f7dd62e40
taskqueue_run_locked() at taskqueue_run_locked+0x182/frame 0xfffffe0f7dd62ec0
taskqueue_thread_loop() at taskqueue_thread_loop+0xc2/frame 0xfffffe0f7dd62ef0
fork_exit() at fork_exit+0x81/frame 0xfffffe0f7dd62f30
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe0f7dd62f30

@behlendorf
Contributor

@alek-p any luck ginning up a reproducer to get a stack trace for thread B? Or perhaps additional debugging can be added to detect if other sync thread activity operating on the dnode at the same time.

@alek-p
Contributor Author

alek-p commented Feb 25, 2026

@alek-p any luck ginning up a reproducer to get a stack trace for thread B? Or perhaps additional debugging can be added to detect if other sync thread activity operating on the dnode at the same time.

so far, my best guess for Thread B is #18235 (comment)

I've also been running with some range tree sanity checking, but that hasn't yielded a panic yet.

@amotin
Member

amotin commented Mar 4, 2026

I was just going to say the same, but there is one more aspect to consider: to improve performance, the sync thread can sync multiple datasets and dnodes at the same time in separate taskq threads, periodically waiting for completion. One dnode should only be handled by one task at a time, but I wonder if we may miss the taskq waiting at some point, allowing a race with some other sync thread activity that might care about the same dnode somehow.

I've tried to look from this perspective, but unfortunately found no way it could happen.
