Inconsistencies in RW refs and bucket statuses #620
Serpentian started this conversation in General
---

**Reply:** Thanks, top work. I suggest considering another name instead of `PREPARED`.

---
**Changelog**

1. First iteration fixes: renamed `PREPARED` to `READONLY`.

## 1. Problem overview
According to #573, the replication between master and replica breaks due to the `on_replace` trigger on the `_bucket` space: the trigger refuses a bucket update while the bucket still holds RW refs, so the replicated update fails to apply.

It makes the situation much worse for #576, since the new master cannot receive the update of the bucket and its state remains `ACTIVE`, which prolongs the time of doubled buckets in the cluster.

Neither this check can be relaxed, nor can the RW refs be dropped, since that may lead to a lost change of the data (e.g. the instance holds RW refs, it was a master but is not anymore, the request is still in progress, the new master is already sending the bucket and will receive the update after sending, and then the bucket is deleted by GC) or to deleting the bucket and its data right under an active request.
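To make the failure mode concrete, here is a minimal sketch of the kind of check such a trigger performs; the `bucket_refs` table and the error text are assumptions for illustration, not vshard's actual code:

```lua
-- Illustrative only: bucket_refs and the error text are assumed.
local bucket_refs = {}  -- bucket_id -> {rw = <count>, ro = <count>}

-- An on_replace trigger on the _bucket space. When the new master's
-- bucket update arrives via replication while this node still holds
-- RW refs on the bucket, the trigger raises, the applier fails, and
-- replication between master and replica breaks.
local function bucket_on_replace(old_tuple, new_tuple)
    local tuple = new_tuple or old_tuple
    local refs = bucket_refs[tuple.id]
    if refs ~= nil and refs.rw > 0 then
        error(('Bucket %s update with %s active RW refs'):format(
            tuple.id, refs.rw))
    end
end

box.space._bucket:on_replace(bucket_on_replace)
```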
## 2. Solution

### 2.0 Summary

A new `READONLY` bucket state. `bucket_send` makes the bucket `ACTIVE` -> `READONLY`, waits for 0 RW refs on the replicas and for them to get that `READONLY` state, and only after that makes the bucket `SENDING`.

### 2.1 Solution for inconsistent RW refs and bucket states (#573)
There's only one way to deal with this: making sure that the new master doesn't start sending a bucket while there are active RW requests on top of it on the old master. A service and checking for RW refs before `SENDING` don't work (check out the rejected alternatives below), so we should move towards a new bucket status here.

We have the `ACTIVE` state, where `rw_lock` is `false` and RW refs are allowed. We have the `SENDING` state, where `rw_lock` is `true` and RW refs are prohibited. We're missing the intermediate state, where `rw_lock` is `true` and RW refs are still allowed. I propose to name it `READONLY`.

`bucket_send` will take an `ACTIVE` bucket, make it `READONLY` and wait for 0 RW refs locally. After that the node goes to every instance in the current replicaset, waits for 0 RW refs there and for the bucket to become `READONLY`, and only after that the bucket can be made `SENDING`. In case of any error we make the bucket `ACTIVE` back; the new state simply replaces `drop_rw_lock` in the code, it has the same behavior. A sketch of the flow follows.
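This is a minimal sketch, assuming `wait_no_rw_refs` and `wait_replicas_readonly` are hypothetical helpers and the bucket status is the second field of a `_bucket` tuple:

```lua
-- Sketch of the proposed bucket_send flow; the wait_* helpers are
-- hypothetical. State/flag summary:
--   ACTIVE:   rw_lock = false, RW refs allowed
--   READONLY: rw_lock = true,  RW refs still allowed (new state)
--   SENDING:  rw_lock = true,  RW refs prohibited
local function bucket_send(bucket_id, destination, opts)
    -- ACTIVE -> READONLY: new RW refs are prohibited from now on,
    -- but the already taken ones may finish.
    box.space._bucket:update(bucket_id, {{'=', 2, 'READONLY'}})
    -- Wait for the local RW refs to drop to 0.
    local ok, err = wait_no_rw_refs(bucket_id, opts.timeout)
    if ok then
        -- Wait for every replica to replicate the READONLY state
        -- and to drop its own RW refs to 0.
        ok, err = wait_replicas_readonly(bucket_id, opts.timeout)
    end
    if not ok then
        -- On any error roll back to ACTIVE; this transition replaces
        -- drop_rw_lock and behaves the same way.
        box.space._bucket:update(bucket_id, {{'=', 2, 'ACTIVE'}})
        return nil, err
    end
    box.space._bucket:update(bucket_id, {{'=', 2, 'SENDING'}})
    -- ... the actual transfer to `destination` proceeds as before ...
    return true
end
```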
Why is the new bucket state better than manually setting and removing RW locks on the instances? Because we can guarantee that sooner or later all `rw_lock`s on properly working replicas will be dropped, as soon as they get the replication of the `READONLY` -> `ACTIVE` transition.

**Reminder on how `map_callro` is going to work**
The initial idea was to do the following (a code sketch follows the list):

1. `map_callro` storage refs are created on replicas; the master is supposed to check for that before starting the rebalancer.
2. `sched.move_start` is called locally (it waits for the storage refs to become 0). It already does that even without the patches, but this happens for every individual bucket.
3. Instead, we'll do that once before sending a batch.
4. Buckets are made `SENDING`. Buckets are not sent yet.
5. On the replica:
   1. `sched.move_start` is called (waits for `map_callro` to end, prohibits new `map_callro`).
   2. Wait for the replica to get at least one `SENDING` bucket.
   3. `sched.move_end` is called.
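A sketch of this batched flow, assuming a `replicas` list of net.box connections and hypothetical remote wrappers (`sched_move_start`, `wait_bucket_status`, `sched_move_end`); `sched` is the scheduler referenced above:

```lua
-- Sketch of the batched flow; remote function names are assumed.
-- Step 4 is where the proposal below applies: READONLY instead of
-- SENDING.
local function send_bucket_batch(bucket_ids, replicas, timeout)
    -- Steps 2-3: wait for map_callro storage refs to become 0 once
    -- per batch instead of once per bucket.
    sched.move_start(timeout)
    -- Step 4: mark the whole batch; nothing is sent yet.
    for _, id in ipairs(bucket_ids) do
        box.space._bucket:update(id, {{'=', 2, 'SENDING'}})
    end
    -- Step 5: synchronize every replica.
    for _, replica in ipairs(replicas) do
        -- 5.1: waits for map_callro to end, prohibits new ones.
        replica:call('sched_move_start', {timeout})
        -- 5.2: wait for at least one SENDING bucket to arrive via
        -- replication.
        replica:call('wait_bucket_status', {bucket_ids[1], 'SENDING'})
        -- 5.3: allow map_callro again.
        replica:call('sched_move_end')
    end
end
```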
I propose to reuse the `READONLY` status for `map_callro`: in the fourth point we'll make the bucket `READONLY` instead of `SENDING`.

## 3. Rejected alternatives
### 3.1 Checking for RW refs after becoming a master

Firstly, my main solution was the following service, which is started after a node becomes a master (described here and sketched below).
The service will wait for all nodes in the replicaset to have no RW or storage refs and to become non-masters (which guarantees that new RW refs cannot appear there). Then the master waits for replication sync with all of the replicas in order to get the latest updates from the `_bucket` space; after all of the nodes have become non-masters, new updates to the `_bucket` space cannot happen (see above why). As soon as these conditions are satisfied, `M.rebalance_allowed` is set to `true`, rebalancing is allowed, and the service dies; now recovery and the rebalancer can do their stuff.

I had to reject it, because at any point in time any other node in the replicaset may become a master and start taking RW refs; rebalancing will already be allowed at that point and will try sending the bucket, which will again lead to the replication error.
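A rough sketch of the rejected service, assuming `M` is the module-local state table and the `replicaset_*`/`wait_*` helpers are hypothetical blocking checks:

```lua
local fiber = require('fiber')

-- Sketch of the rejected service; the helpers are assumed.
local function rebalance_allow_service()
    -- Wait for every node in the replicaset to drop all RW and
    -- storage refs and to become a non-master, which guarantees
    -- that new RW refs cannot appear there.
    while not (replicaset_has_no_refs() and
               replicaset_has_no_master()) do
        fiber.sleep(0.1)
    end
    -- Sync replication to get the latest _bucket updates; with no
    -- masters left, no new updates to _bucket can happen.
    wait_replication_sync()
    -- Conditions are satisfied: allow rebalancing and die. Recovery
    -- and the rebalancer can do their work now.
    M.rebalance_allowed = true
end
```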
### 3.2 Checking for RW refs before making a bucket `SENDING`

This is another idea I checked after rejecting 3.1: do not introduce the background service, but forcefully check the RW refs on the other instances before making a bucket `SENDING`.

I didn't really like it and preferred the service, because with this solution the user will always have to pay for that check when rebalancing, even if a master switch never happened, or the user has failover disabled or in manual mode. It also doesn't fit the synchronizations we're going to use during the implementation of `map_callro`: there we synchronize the replicaset after making a bucket `SENDING`, but applying that solution would require us to sync the replicaset for RW refs before `SENDING` and for storage refs after it, which seems like way too many syncs.

But since 3.1 didn't work, it seemed to me that this might work. However, even if we check that a node is a non-master and has 0 RW refs before making a bucket `SENDING`, it doesn't guarantee that in the next moment it won't become a master and take RW refs.
In order to guarantee that, we need to set the `rw_lock` on the bucket. But how do we guarantee that the bucket won't stay locked forever? This is a standard 2PC problem: we need to set the locks everywhere and then remove them from everywhere in case of any error, as in the sketch below.
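A sketch of that 2PC-style locking; the `set_rw_lock` / `drop_rw_lock` remote functions are hypothetical:

```lua
-- Sketch of 2PC-style rw_lock handling; remote names are assumed.
local function lock_bucket_everywhere(bucket_id, replicas, timeout)
    local locked = {}
    for _, replica in ipairs(replicas) do
        local ok = pcall(replica.call, replica, 'set_rw_lock',
                         {bucket_id}, {timeout = timeout})
        if not ok then
            -- Any failure: remove the lock from every node where it
            -- was already set, otherwise a bucket may stay locked
            -- forever.
            for _, r in ipairs(locked) do
                pcall(r.call, r, 'drop_rw_lock', {bucket_id})
            end
            return false
        end
        table.insert(locked, replica)
    end
    return true
end
```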
Using the `rw_lock` with a timeout may help here, as we do in `map_callrw` with storage refs. But the problem is that the timeout may pass while the node still hasn't got the `SENDING` status of the bucket; the node will then become a master and again break the replication.
In the end, I don't see any way to guarantee that a `SENDING` bucket has no RW refs with this approach.