Inconsistencies in RW refs and bucket statuses #620
Serpentian started this conversation in General
---

**Reply:** Thanks, top work. I suggest considering another name instead of `PREPARED`.

---
**Changelog**

1. First iteration fixes: renamed `PREPARED` to `READONLY`.

## 1. Problem overview
According to #573, the replication between master and replica breaks due to the `on_replace` trigger on the `_bucket` space: the trigger refuses a bucket update while the bucket still holds RW refs, so the replicated update fails to apply.

It makes the situation much worse for #576, since the new master cannot receive the update of the bucket and its state remains `ACTIVE`, which prolongs the time of doubled buckets in the cluster.

Neither this check can be relaxed, nor can the RW refs be dropped, since that may lead to a lost change of the data (e.g. the instance holds RW refs, it was a master but is not anymore, the request is still in progress, the new master is already sending the bucket and will receive the update after sending, and then the bucket is deleted by GC) or to deleting the bucket and its data right under an active request.
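To make the failure mode concrete, here is a minimal sketch of the kind of check such a trigger performs; the `bucket_refs` table and the error text are assumptions for illustration, not vshard's actual code:

```lua
-- Illustrative only: bucket_refs and the error text are assumed.
local bucket_refs = {}  -- bucket_id -> {rw = <count>, ro = <count>}

-- An on_replace trigger on the _bucket space. When the new master's
-- bucket update arrives via replication while this node still holds
-- RW refs on the bucket, the trigger raises, the applier fails, and
-- replication between master and replica breaks.
local function bucket_on_replace(old_tuple, new_tuple)
    local tuple = new_tuple or old_tuple
    local refs = bucket_refs[tuple.id]
    if refs ~= nil and refs.rw > 0 then
        error(('Bucket %s update with %s active RW refs'):format(
            tuple.id, refs.rw))
    end
end

box.space._bucket:on_replace(bucket_on_replace)
```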
## 2. Solution

### 2.0 Summary

A new `READONLY` bucket state. `bucket_send` makes the bucket `ACTIVE` -> `READONLY`, waits for 0 RW refs on the replicas and for them to get that `READONLY` state, and only after that makes the bucket `SENDING`.

### 2.1 Solution for inconsistent RW refs and bucket states (#573)
There's only one way to deal with this: making sure that the new master doesn't start sending a bucket while there are active RW requests on top of it on the old master. A service and checking for RW refs before `SENDING` don't work (check out the rejected alternatives below), so we should move towards a new bucket status here.

We have the `ACTIVE` state, where `rw_lock` is `false` and RW refs are allowed. We have the `SENDING` state, where `rw_lock` is `true` and RW refs are prohibited. We're missing the intermediate state, where `rw_lock` is `true` and RW refs are still allowed. I propose to name it `READONLY`.

`bucket_send` will take an `ACTIVE` bucket, make it `READONLY` and wait for 0 RW refs locally. After that the node goes to every instance in the current replicaset, waits for 0 RW refs there and for the bucket to become `READONLY`, and only after that the bucket can be made `SENDING`. In case of any error we make the bucket `ACTIVE` back; the new state simply replaces `drop_rw_lock` in the code, it has the same behavior. A sketch of the flow follows.
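This is a minimal sketch, assuming `wait_no_rw_refs` and `wait_replicas_readonly` are hypothetical helpers and the bucket status is the second field of a `_bucket` tuple:

```lua
-- Sketch of the proposed bucket_send flow; the wait_* helpers are
-- hypothetical. State/flag summary:
--   ACTIVE:   rw_lock = false, RW refs allowed
--   READONLY: rw_lock = true,  RW refs still allowed (new state)
--   SENDING:  rw_lock = true,  RW refs prohibited
local function bucket_send(bucket_id, destination, opts)
    -- ACTIVE -> READONLY: new RW refs are prohibited from now on,
    -- but the already taken ones may finish.
    box.space._bucket:update(bucket_id, {{'=', 2, 'READONLY'}})
    -- Wait for the local RW refs to drop to 0.
    local ok, err = wait_no_rw_refs(bucket_id, opts.timeout)
    if ok then
        -- Wait for every replica to replicate the READONLY state
        -- and to drop its own RW refs to 0.
        ok, err = wait_replicas_readonly(bucket_id, opts.timeout)
    end
    if not ok then
        -- On any error roll back to ACTIVE; this transition replaces
        -- drop_rw_lock and behaves the same way.
        box.space._bucket:update(bucket_id, {{'=', 2, 'ACTIVE'}})
        return nil, err
    end
    box.space._bucket:update(bucket_id, {{'=', 2, 'SENDING'}})
    -- ... the actual transfer to `destination` proceeds as before ...
    return true
end
```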
Why is the new bucket state better than manually setting and removing RW locks on the instances? Because we can guarantee that sooner or later all `rw_lock`s on properly working replicas will be dropped, as soon as they get the replication of the `READONLY` -> `ACTIVE` transition.

**Reminder on how `map_callro` is going to work**
The initial idea was to do the following (a code sketch follows the list):

1. `map_callro` storage refs are created on replicas; the master is supposed to check for that before starting the rebalancer.
2. `sched.move_start` is called locally (it waits for the storage refs to become 0). It already does that even without the patches, but this happens for every individual bucket.
3. Instead, we'll do that once before sending a batch.
4. Buckets are made `SENDING`. Buckets are not sent yet.
5. On the replica:
   1. `sched.move_start` is called (waits for `map_callro` to end, prohibits new `map_callro`).
   2. Wait for the replica to get at least one `SENDING` bucket.
   3. `sched.move_end` is called.
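A sketch of this batched flow, assuming a `replicas` list of net.box connections and hypothetical remote wrappers (`sched_move_start`, `wait_bucket_status`, `sched_move_end`); `sched` is the scheduler referenced above:

```lua
-- Sketch of the batched flow; remote function names are assumed.
-- Step 4 is where the proposal below applies: READONLY instead of
-- SENDING.
local function send_bucket_batch(bucket_ids, replicas, timeout)
    -- Steps 2-3: wait for map_callro storage refs to become 0 once
    -- per batch instead of once per bucket.
    sched.move_start(timeout)
    -- Step 4: mark the whole batch; nothing is sent yet.
    for _, id in ipairs(bucket_ids) do
        box.space._bucket:update(id, {{'=', 2, 'SENDING'}})
    end
    -- Step 5: synchronize every replica.
    for _, replica in ipairs(replicas) do
        -- 5.1: waits for map_callro to end, prohibits new ones.
        replica:call('sched_move_start', {timeout})
        -- 5.2: wait for at least one SENDING bucket to arrive via
        -- replication.
        replica:call('wait_bucket_status', {bucket_ids[1], 'SENDING'})
        -- 5.3: allow map_callro again.
        replica:call('sched_move_end')
    end
end
```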
I propose to reuse the `READONLY` status for `map_callro`: in the fourth point we'll make the bucket `READONLY` instead of `SENDING`.

## 3. Rejected alternatives
### 3.1 Checking for RW refs after becoming a master

Firstly, my main solution was the following service, which is started after a node becomes a master (described here and sketched below).
The service will wait for all nodes in the replicaset to have no RW or storage refs and to become non-masters (which guarantees that new RW refs cannot appear there). Then the master waits for replication sync with all of the replicas in order to get the latest updates from the `_bucket` space; after all of the nodes have become non-masters, new updates to the `_bucket` space cannot happen (see above why). As soon as these conditions are satisfied, `M.rebalance_allowed` is set to `true`, rebalancing is allowed, and the service dies; now recovery and the rebalancer can do their stuff.

I had to reject it, because at any point in time any other node in the replicaset may become a master and start taking RW refs; rebalancing will already be allowed at that point and will try sending the bucket, which will again lead to the replication error.
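A rough sketch of the rejected service, assuming `M` is the module-local state table and the `replicaset_*`/`wait_*` helpers are hypothetical blocking checks:

```lua
local fiber = require('fiber')

-- Sketch of the rejected service; the helpers are assumed.
local function rebalance_allow_service()
    -- Wait for every node in the replicaset to drop all RW and
    -- storage refs and to become a non-master, which guarantees
    -- that new RW refs cannot appear there.
    while not (replicaset_has_no_refs() and
               replicaset_has_no_master()) do
        fiber.sleep(0.1)
    end
    -- Sync replication to get the latest _bucket updates; with no
    -- masters left, no new updates to _bucket can happen.
    wait_replication_sync()
    -- Conditions are satisfied: allow rebalancing and die. Recovery
    -- and the rebalancer can do their work now.
    M.rebalance_allowed = true
end
```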
### 3.2 Checking for RW refs before making a bucket `SENDING`

This is another idea I checked after rejecting 3.1: do not introduce the background service, but forcefully check the RW refs on the other instances before making a bucket `SENDING`.

I didn't really like it and preferred the service, because with this solution the user will always have to pay for that check when rebalancing, even if a master switch never happened, or the user has failover disabled or in manual mode. It also doesn't fit the synchronizations we're going to use during the implementation of `map_callro`: there we synchronize the replicaset after making a bucket `SENDING`, but applying that solution would require us to sync the replicaset for RW refs before `SENDING` and for storage refs after it, which seems like way too many syncs.

But since 3.1 didn't work, it seemed to me that this might work. However, even if we check that a node is a non-master and has 0 RW refs before making a bucket `SENDING`, it doesn't guarantee that in the next moment it won't become a master and take RW refs.
In order to guarantee that, we need to set the `rw_lock` on the bucket. But how do we guarantee that the bucket won't stay locked forever? This is a standard 2PC problem: we need to set the locks everywhere and then remove them from everywhere in case of any error, as in the sketch below.
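A sketch of that 2PC-style locking; the `set_rw_lock` / `drop_rw_lock` remote functions are hypothetical:

```lua
-- Sketch of 2PC-style rw_lock handling; remote names are assumed.
local function lock_bucket_everywhere(bucket_id, replicas, timeout)
    local locked = {}
    for _, replica in ipairs(replicas) do
        local ok = pcall(replica.call, replica, 'set_rw_lock',
                         {bucket_id}, {timeout = timeout})
        if not ok then
            -- Any failure: remove the lock from every node where it
            -- was already set, otherwise a bucket may stay locked
            -- forever.
            for _, r in ipairs(locked) do
                pcall(r.call, r, 'drop_rw_lock', {bucket_id})
            end
            return false
        end
        table.insert(locked, replica)
    end
    return true
end
```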
Using the `rw_lock` with a timeout may help here, as we do in `map_callrw` with storage refs. But the problem is that the timeout may pass while the node still hasn't got the `SENDING` status of the bucket; the node will then become a master and again break the replication.
In the end, I don't see any way to guarantee that a `SENDING` bucket has no RW refs with this approach.