Problem Statement
At the very end of atomic slot migration, we guarantee all keys are synchronized to the target shard and then, as sketched below, we:
- Assign the slot to ourselves
- Bump the epoch on the target shard
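For context, the finalization step can be pictured roughly as follows. This is a minimal Python sketch, not Valkey's actual C implementation; the `ClusterNode` model, its field names, and the `broadcast` hook are all illustrative assumptions.

```python
# Minimal model of the finalization step at the end of atomic slot
# migration, as described above. All names are illustrative; the real
# logic lives in Valkey's C cluster code.

class ClusterNode:
    def __init__(self, name, config_epoch):
        self.name = name
        self.config_epoch = config_epoch
        self.slots = set()

    def finalize_slot_migration(self, slot, cluster_max_epoch, broadcast):
        # 1. Assign the slot to ourselves (the target primary).
        self.slots.add(slot)
        # 2. Bump our config epoch past everything we have seen, so
        #    other nodes accept our claim when they reconcile.
        self.config_epoch = cluster_max_epoch + 1
        # 3. Advertise the new ownership; peers adopt the claim with
        #    the highest epoch they hear about.
        broadcast(self.name, slot, self.config_epoch)

target = ClusterNode("target-primary", config_epoch=5)
target.finalize_slot_migration(
    slot=42,
    cluster_max_epoch=7,
    broadcast=lambda n, s, e: print(f"{n} claims slot {s} at epoch {e}"),
)
```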
If no other epoch bumps are occurring at the same time, this works fine. If a shard that is not involved in the slot migration bumps the epoch, it will reconcile and the slot ownership transfer will still remain.
However, if either the target or source shard bumps its epoch at the same time (for example, to perform a failover), we can end up failing the slot migration:
- Source shard bump: if the new primary on the source shard bumps the epoch before hearing about the target shard's slot ownership transfer, the new primary will also claim ownership of the slot. There is a 50/50 chance that this results in slot ownership being transferred back to the source shard. Since the old primary may have already started cleaning up the dirty slots, this may result in data loss.
- Target shard bump: if the new primary on the target shard bumps the epoch before hearing about the old primary's slot ownership transfer, the new primary will not claim ownership of the slot. If the old primary is still around, there is a 50/50 chance the old primary will take the shard back over; otherwise, the new primary will claim no ownership of the slot. Since the source shard may have already removed the slot from its view, there is a chance nobody claims ownership of the slot. The epoch race behind these 50/50 outcomes is sketched below.
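Both 50/50 outcomes come down to which conflicting claim carries the higher config epoch when peers reconcile. A rough sketch, under the simplifying assumption that nodes adopt whichever slot claim has the strictly higher epoch (the function names are illustrative):

```python
# Simplified reconciliation rule: a node adopts a slot claim only if it
# carries a strictly higher config epoch than the claim it already holds.
# Which shard ends up owning the slot therefore depends on which epoch
# bump lands higher, which is effectively a race.

def reconcile(current, incoming):
    """current and incoming are (owner, epoch) claims for the same slot."""
    return incoming if incoming[1] > current[1] else current

# Interleaving A: the target finalizes at epoch 8, then the source's new
# primary fails over and bumps to epoch 9 without having seen epoch 8.
claim = ("target-primary", 8)
claim = reconcile(claim, ("source-new-primary", 9))
print(claim)  # ('source-new-primary', 9): slot bounces back, data loss risk

# Interleaving B: the failover epoch lands lower, so the transfer sticks.
claim = ("target-primary", 10)
claim = reconcile(claim, ("source-new-primary", 9))
print(claim)  # ('target-primary', 10): migration result survives
```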
Potential Solutions
1. Synchronous replication to replicas: this is how we solved the problem for legacy slot migrations. SETSLOT will be persisted to replicas before being applied locally (see the sketch after this list).
2. Election-based epoch bump: this ensures an ordering between the slot migration and failover epoch bumps, and guarantees that nodes can only win an election after catching up to the latest epoch (and thereby seeing the slot transfer).
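A rough sketch of option 1. `replicate_and_wait`-style synchronous replication is modeled by a hypothetical ack count here; neither `Replica` nor `set_slot_synchronously` is a real Valkey primitive.

```python
# Option 1 sketch: persist the slot ownership change to replicas before
# applying it locally, so a replica promoted later already knows about
# the transfer. All names are hypothetical stand-ins.

class ReplicationTimeout(Exception):
    pass

class Replica:
    def __init__(self, name, reachable=True):
        self.name, self.reachable, self.log = name, reachable, []

    def apply(self, entry):
        if self.reachable:
            self.log.append(entry)
        return self.reachable  # ack only if the entry was persisted

def set_slot_synchronously(slots, replicas, slot, new_owner, quorum):
    entry = ("SETSLOT", slot, "NODE", new_owner)
    acks = sum(1 for r in replicas if r.apply(entry))
    if acks < quorum:
        # Fail closed: with the replicas partitioned away we refuse the
        # ownership change rather than risk promoting a replica that
        # never heard about it.
        raise ReplicationTimeout(f"only {acks}/{quorum} replicas acked")
    slots[slot] = new_owner  # apply locally only after replicas have it

slots = {}
replicas = [Replica("r1"), Replica("r2", reachable=False)]
try:
    set_slot_synchronously(slots, replicas, 42, "target-primary", quorum=2)
except ReplicationTimeout as e:
    print("migration aborted:", e)
```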
Option 1 has challenges if the replicas are partitioned away, in which case it will fail closed (not enough good replicas). After #2635 we already send the slot migration state to the replicas, but its delivery would need to be guaranteed. As long as the ESTABLISH command is guaranteed to be replicated, the newly promoted primary should (depending on whether the most up-to-date replica wins) be aware of the migration and can complete it. The same would need to be guaranteed for the replicas of the source shard (so that the newly promoted replica can accept the target shard's ownership even if the epoch is lower).
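Under that guarantee, the failover path on the target shard could look roughly like the sketch below: a replica that wins the election checks for an in-flight migration it inherited via the replicated ESTABLISH state and completes it. The `pending_migration` field and `complete_migration` hook are assumptions for illustration, not Valkey internals.

```python
# Sketch of the failover path under option 1, assuming the ESTABLISH
# step was reliably replicated before promotion.

def on_promotion(replica_state, complete_migration):
    migration = replica_state.get("pending_migration")
    if migration is not None:
        # The new primary saw the replicated ESTABLISH, so it can finish
        # the job instead of leaving slot ownership ambiguous.
        complete_migration(migration)

on_promotion(
    {"pending_migration": {"slots": [42], "source": "source-primary"}},
    complete_migration=lambda m: print("resuming migration of", m["slots"]),
)
```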
Option 2 has a performance challenge, given that the election requires broadcasting messages to the whole cluster. However, we already support batching of slots into a single migration job, so this may be acceptable (it is not a per-slot vote, but a per-migration vote).
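A rough sketch of option 2, mirroring the existing failover election rules under the assumption that each primary grants at most one vote per epoch; all names here are illustrative.

```python
# Option 2 sketch: instead of a unilateral epoch bump, the target primary
# requests votes from the other primaries, like a failover election. A
# majority vote orders the migration's epoch bump against any concurrent
# failover. The vote covers the whole migration job (possibly many
# slots), so the broadcast cost is paid once per migration, not per slot.

def request_votes(primaries, proposed_epoch):
    # A primary grants its vote only if it has not already seen (or
    # voted for) an equal or higher epoch.
    votes = 0
    for p in primaries:
        if proposed_epoch > p["last_voted_epoch"]:
            p["last_voted_epoch"] = proposed_epoch
            votes += 1
    return votes

def finalize_with_election(primaries, slots_in_job, proposed_epoch):
    needed = len(primaries) // 2 + 1
    if request_votes(primaries, proposed_epoch) >= needed:
        return {"epoch": proposed_epoch, "slots": slots_in_job}  # claim wins
    return None  # lost to a concurrent bump; retry with a higher epoch

primaries = [{"last_voted_epoch": 7} for _ in range(3)]
print(finalize_with_election(primaries, slots_in_job=[40, 41, 42],
                             proposed_epoch=8))
```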