Problem Statement
At the very end of atomic slot migration, we guarantee all keys are synchronized to the target shard and then, as sketched below, we:
- Assign the slot to ourselves
- Bump the epoch on the target shard
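For context, the finalization step can be pictured roughly as follows. This is a minimal Python sketch, not Valkey's actual C implementation; the `ClusterNode` model, its field names, and the `broadcast` hook are all illustrative assumptions.

```python
# Minimal model of the finalization step at the end of atomic slot
# migration, as described above. All names are illustrative; the real
# logic lives in Valkey's C cluster code.

class ClusterNode:
    def __init__(self, name, config_epoch):
        self.name = name
        self.config_epoch = config_epoch
        self.slots = set()

    def finalize_slot_migration(self, slot, cluster_max_epoch, broadcast):
        # 1. Assign the slot to ourselves (the target primary).
        self.slots.add(slot)
        # 2. Bump our config epoch past everything we have seen, so
        #    other nodes accept our claim when they reconcile.
        self.config_epoch = cluster_max_epoch + 1
        # 3. Advertise the new ownership; peers adopt the claim with
        #    the highest epoch they hear about.
        broadcast(self.name, slot, self.config_epoch)

target = ClusterNode("target-primary", config_epoch=5)
target.finalize_slot_migration(
    slot=42,
    cluster_max_epoch=7,
    broadcast=lambda n, s, e: print(f"{n} claims slot {s} at epoch {e}"),
)
```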
If no other epoch bumps are occurring at the same time, this works fine. If a shard that is not involved in the slot migration bumps the epoch, it will reconcile and the slot ownership transfer will still remain.
However, if either the target or source shard bumps its epoch at the same time (for example, to perform a failover), we can end up failing the slot migration:
- Source shard bump: if the new primary on the source shard bumps the epoch before hearing about the target shard's slot ownership transfer, the new primary will also claim ownership of the slot. There is a 50/50 chance that this results in slot ownership being transferred back to the source shard. Since the old primary may have already started cleaning up the dirty slots, this may result in data loss.
- Target shard bump: if the new primary on the target shard bumps the epoch before hearing about the old primary's slot ownership transfer, the new primary will not claim ownership of the slot. If the old primary is still around, there is a 50/50 chance the old primary will take the shard back over; otherwise, the new primary will claim no ownership of the slot. Since the source shard may have already removed the slot from its view, there is a chance nobody claims ownership of the slot. The epoch race behind these 50/50 outcomes is sketched below.
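Both 50/50 outcomes come down to which conflicting claim carries the higher config epoch when peers reconcile. A rough sketch, under the simplifying assumption that nodes adopt whichever slot claim has the strictly higher epoch (the function names are illustrative):

```python
# Simplified reconciliation rule: a node adopts a slot claim only if it
# carries a strictly higher config epoch than the claim it already holds.
# Which shard ends up owning the slot therefore depends on which epoch
# bump lands higher, which is effectively a race.

def reconcile(current, incoming):
    """current and incoming are (owner, epoch) claims for the same slot."""
    return incoming if incoming[1] > current[1] else current

# Interleaving A: the target finalizes at epoch 8, then the source's new
# primary fails over and bumps to epoch 9 without having seen epoch 8.
claim = ("target-primary", 8)
claim = reconcile(claim, ("source-new-primary", 9))
print(claim)  # ('source-new-primary', 9): slot bounces back, data loss risk

# Interleaving B: the failover epoch lands lower, so the transfer sticks.
claim = ("target-primary", 10)
claim = reconcile(claim, ("source-new-primary", 9))
print(claim)  # ('target-primary', 10): migration result survives
```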
Potential Solutions
1. Synchronous replication to replicas: this is how we solved the problem for legacy slot migrations. SETSLOT will be persisted to replicas before being applied locally (see the sketch after this list).
2. Election-based epoch bump: this ensures an ordering between the slot migration and failover epoch bumps, and guarantees that nodes can only win an election after catching up to the latest epoch (and thereby seeing the slot transfer).
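A rough sketch of option 1. `replicate_and_wait`-style synchronous replication is modeled by a hypothetical ack count here; neither `Replica` nor `set_slot_synchronously` is a real Valkey primitive.

```python
# Option 1 sketch: persist the slot ownership change to replicas before
# applying it locally, so a replica promoted later already knows about
# the transfer. All names are hypothetical stand-ins.

class ReplicationTimeout(Exception):
    pass

class Replica:
    def __init__(self, name, reachable=True):
        self.name, self.reachable, self.log = name, reachable, []

    def apply(self, entry):
        if self.reachable:
            self.log.append(entry)
        return self.reachable  # ack only if the entry was persisted

def set_slot_synchronously(slots, replicas, slot, new_owner, quorum):
    entry = ("SETSLOT", slot, "NODE", new_owner)
    acks = sum(1 for r in replicas if r.apply(entry))
    if acks < quorum:
        # Fail closed: with the replicas partitioned away we refuse the
        # ownership change rather than risk promoting a replica that
        # never heard about it.
        raise ReplicationTimeout(f"only {acks}/{quorum} replicas acked")
    slots[slot] = new_owner  # apply locally only after replicas have it

slots = {}
replicas = [Replica("r1"), Replica("r2", reachable=False)]
try:
    set_slot_synchronously(slots, replicas, 42, "target-primary", quorum=2)
except ReplicationTimeout as e:
    print("migration aborted:", e)
```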
Option 1 has challenges if the replicas are partitioned away, in which case it will fail closed (not enough good replicas). After #2635 we already send the slot migration state to the replicas, but its delivery would need to be guaranteed. As long as the ESTABLISH command is guaranteed to be replicated, the newly promoted primary should (depending on whether the most up-to-date replica wins) be aware of the migration and can complete it. The same would need to be guaranteed for the replicas of the source shard (so that the newly promoted replica can accept the target shard's ownership even if the epoch is lower).
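Under that guarantee, the failover path on the target shard could look roughly like the sketch below: a replica that wins the election checks for an in-flight migration it inherited via the replicated ESTABLISH state and completes it. The `pending_migration` field and `complete_migration` hook are assumptions for illustration, not Valkey internals.

```python
# Sketch of the failover path under option 1, assuming the ESTABLISH
# step was reliably replicated before promotion.

def on_promotion(replica_state, complete_migration):
    migration = replica_state.get("pending_migration")
    if migration is not None:
        # The new primary saw the replicated ESTABLISH, so it can finish
        # the job instead of leaving slot ownership ambiguous.
        complete_migration(migration)

on_promotion(
    {"pending_migration": {"slots": [42], "source": "source-primary"}},
    complete_migration=lambda m: print("resuming migration of", m["slots"]),
)
```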
Option 2 has a performance challenge, given that the election requires broadcasting messages to the whole cluster. However, we already support batching of slots into a single migration job, so this may be acceptable (it is not a per-slot vote, but a per-migration vote).
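A rough sketch of option 2, mirroring the existing failover election rules under the assumption that each primary grants at most one vote per epoch; all names here are illustrative.

```python
# Option 2 sketch: instead of a unilateral epoch bump, the target primary
# requests votes from the other primaries, like a failover election. A
# majority vote orders the migration's epoch bump against any concurrent
# failover. The vote covers the whole migration job (possibly many
# slots), so the broadcast cost is paid once per migration, not per slot.

def request_votes(primaries, proposed_epoch):
    # A primary grants its vote only if it has not already seen (or
    # voted for) an equal or higher epoch.
    votes = 0
    for p in primaries:
        if proposed_epoch > p["last_voted_epoch"]:
            p["last_voted_epoch"] = proposed_epoch
            votes += 1
    return votes

def finalize_with_election(primaries, slots_in_job, proposed_epoch):
    needed = len(primaries) // 2 + 1
    if request_votes(primaries, proposed_epoch) >= needed:
        return {"epoch": proposed_epoch, "slots": slots_in_job}  # claim wins
    return None  # lost to a concurrent bump; retry with a higher epoch

primaries = [{"last_voted_epoch": 7} for _ in range(3)]
print(finalize_with_election(primaries, slots_in_job=[40, 41, 42],
                             proposed_epoch=8))
```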