
Conversation

Contributor

@murphyjacob4 murphyjacob4 commented Sep 21, 2025

Problem

In the current slot migration design, replicas are completely unaware of the slot migration. As a result, they do not know to hide keys that are being imported, so those keys are exposed to commands like KEYS, SCAN, RANDOMKEY, and DBSIZE.

Design

The main part of the design is that we now listen for and process the SYNCSLOTS ESTABLISH command on the replica. When a SYNCSLOTS ESTABLISH command is received from the primary, we begin tracking a new slot import in a special SLOT_IMPORT_OCCURRING_ON_PRIMARY state. Replicas use this state to track the import and await a future SYNCSLOTS FINISH message that tells them whether the import succeeded or failed.
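
A minimal sketch of this replica-side tracking, assuming hypothetical struct and helper names; only SLOT_IMPORT_OCCURRING_ON_PRIMARY and the SYNCSLOTS subcommands come from the design itself, the rest is illustrative.

```c
#include <string.h>

/* Illustrative only: SLOT_IMPORT_OCCURRING_ON_PRIMARY comes from the design,
 * the surrounding names are hypothetical. */
typedef enum {
    SLOT_IMPORT_IN_PROGRESS,          /* target primary performing the import */
    SLOT_IMPORT_OCCURRING_ON_PRIMARY, /* replica tracking its primary's import */
    SLOT_IMPORT_DONE
} slotImportState;

typedef struct {
    char job_name[64];
    slotImportState state;
    int awaiting_finish; /* cleared once SYNCSLOTS FINISH arrives */
} slotImportJob;

/* Called when the replica sees SYNCSLOTS ESTABLISH in the replication stream. */
static void replicaHandleSyncslotsEstablish(slotImportJob *job, const char *name) {
    strncpy(job->job_name, name, sizeof(job->job_name) - 1);
    job->job_name[sizeof(job->job_name) - 1] = '\0';
    job->state = SLOT_IMPORT_OCCURRING_ON_PRIMARY;
    /* Keys in the importing slots stay hidden from KEYS/SCAN/RANDOMKEY/DBSIZE
     * until a SYNCSLOTS FINISH tells us whether the import succeeded. */
    job->awaiting_finish = 1;
}
```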

Success Case

     Source                                          Target                         Target Replica
       |                                                |                                 |
       |------------ SYNCSLOTS ESTABLISH -------------->|                                 |
       |                                                |----- SYNCSLOTS ESTABLISH ------>|
       |<-------------------- +OK ----------------------|                                 |
       |                                                |                                 |
       |~~~~~~~~~~~~~~ snapshot as AOF ~~~~~~~~~~~~~~~~>|                                 |
       |                                                |~~~~~~ forward snapshot ~~~~~~~~>|
       |----------- SYNCSLOTS SNAPSHOT-EOF ------------>|                                 |
       |                                                |                                 |
       |<----------- SYNCSLOTS REQUEST-PAUSE -----------|                                 |
       |                                                |                                 |
       |~~~~~~~~~~~~ incremental changes ~~~~~~~~~~~~~~>|                                 |
       |                                                |~~~~~~ forward changes ~~~~~~~~~>|
       |--------------- SYNCSLOTS PAUSED -------------->|                                 |
       |                                                |                                 |
       |<---------- SYNCSLOTS REQUEST-FAILOVER ---------|                                 |
       |                                                |                                 |
       |---------- SYNCSLOTS FAILOVER-GRANTED --------->|                                 |
       |                                                |                                 |
       |                                            (performs takeover &                  |
       |                                             propagates topology)                 |
       |                                                |                                 |
       |                                                |------- SYNCSLOTS FINISH ------->|
 (finds out about topology                              |                                 |
  change & marks migration done)                        |                                 |
       |                                                |                                 |

Failure Case

     Source                                          Target                         Target Replica
       |                                                |                                 |
       |------------ SYNCSLOTS ESTABLISH -------------->|                                 |
       |                                                |----- SYNCSLOTS ESTABLISH ------>|
       |<-------------------- +OK ----------------------|                                 |
     ...                                              ...                               ...
       |                                                |                                 |
       |                                             <FAILURE>                            |
       |                                                |                                 |
       |                                      (performs cleanup)                          |
       |                                                | ~~~~~~ UNLINK <key> ... ~~~~~~~>|
       |                                                |                                 |
       |                                                | ------ SYNCSLOTS FINISH ------->|
       |                                                |                                 |
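
Putting the two diagrams together, here is a rough sketch of how the replica could react to SYNCSLOTS FINISH in either case; it reuses the hypothetical slotImportJob from the earlier sketch, and the actual handling in src/cluster_migrateslots.c differs in detail.

```c
/* Sketch only, reusing the hypothetical slotImportJob from the earlier block.
 * On failure the primary has already replicated UNLINK <key> ... for the
 * partially imported keys, so the replica just drops its tracking state. */
static void replicaHandleSyncslotsFinish(slotImportJob *job, int success) {
    job->awaiting_finish = 0;
    job->state = SLOT_IMPORT_DONE;
    if (success) {
        /* The topology change learned from the primary makes the slots owned,
         * so their keys become visible to KEYS/SCAN/RANDOMKEY/DBSIZE. */
    } else {
        /* Nothing to delete here: the UNLINKs arrived via the replication
         * stream, as shown in the failure diagram above. */
    }
}
```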

Full Sync, Partial Sync, and RDB

In order to ensure replicas that resync during the import are still aware of the import, the slot import is serialized to a new cluster-slot-imports aux field. The encoding includes the job name, the source node name, and the slot ranges being imported. Upon loading an RDB with the cluster-slot-imports aux field, replicas will add a new migration in the SLOT_IMPORT_OCCURRING_ON_PRIMARY state.

It's important to note that a previously saved RDB file can be used as the basis for partial sync with a primary. Because of this, whenever we load an RDB file with the cluster-slot-imports aux field, even from disk, we will still add a new migration to track the import. If after loading the RDB, the Valkey node is a primary, it will cancel the slot migration. Having this tracking state loaded on primaries will ensure that replicas partial syncing to a restarted primary still get their SYNCSLOTS FINISH message in the replication stream.
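
A sketch of the decision made right after RDB load, as described above; it reuses the hypothetical slotImportJob, and in the real design the state is carried by the cluster-slot-imports encoding rather than a standalone helper like this.

```c
/* Sketch of the post-RDB-load handling of a persisted slot import. */
static void afterRdbLoadSlotImport(slotImportJob *job, int node_is_primary) {
    if (node_is_primary) {
        /* A restarted primary cancels the import, but having the tracking
         * state loaded first means replicas that partial sync against it
         * still receive SYNCSLOTS FINISH in the replication stream. */
        job->state = SLOT_IMPORT_DONE;
        job->awaiting_finish = 0;
    } else {
        /* A replica keeps tracking the import until FINISH arrives. */
        job->state = SLOT_IMPORT_OCCURRING_ON_PRIMARY;
        job->awaiting_finish = 1;
    }
}
```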

AOF

Since AOF cannot be used as the basis for a partial sync, we don't necessarily need to persist the SYNCSLOTS ESTABLISH and FINISH commands to the AOF.

However, since there is ongoing work to change this (#59, #1901), the design does not rely on that assumption.

We will propagate the ESTABLISH and FINISH commands to the AOF, and ensure that they can be properly replayed on AOF load to get to the right state. Similar to RDB, if there are any pending "ESTABLISH" commands that don't have a "FINISH" afterwards upon becoming primary, we will make sure to fail those in verifyClusterConfigWithData.

Additionally, there was a bug in the existing slot migration where slot import clients were not having their commands persisted to AOF. This has been fixed by ensuring we still propagate to AOF even for slot import clients.
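
A hedged sketch of the propagation rule behind that fix; the clientInfo struct and shouldFeedAof helper are invented for illustration and are not the actual AOF feeding code in Valkey.

```c
#include <stdbool.h>

/* Invented for illustration; not the real AOF feeding code. */
typedef struct {
    bool is_slot_import_client; /* connection fed by the migration source node */
} clientInfo;

/* Before the fix, commands arriving on slot import clients were skipped when
 * feeding the AOF, so imported keys never reached it. The rule is now simply:
 * if AOF is enabled, propagate, regardless of the client type. */
static bool shouldFeedAof(const clientInfo *c, bool aof_enabled) {
    (void)c; /* slot import clients are no longer treated specially */
    return aof_enabled;
}
```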

Promotion & Demotion

Since the primary is solely responsible for cleaning up unowned slots, primaries that are demoted will not clean up previously active slot imports. The promoted replica will be responsible for both cleaning up the slot (verifyClusterConfigWithData) and sending a SYNCSLOTS FINISH.
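
A sketch of that promotion path, again using the hypothetical slotImportJob; verifyClusterConfigWithData is the real hook named in this design, while the loop and field names here are illustrative.

```c
#include <stddef.h>

/* Sketch: on promotion, any import still tracked in the
 * SLOT_IMPORT_OCCURRING_ON_PRIMARY state must be failed by the new primary,
 * which also cleans up the unowned slots (inside verifyClusterConfigWithData)
 * and propagates SYNCSLOTS FINISH to its own replicas. */
static void onPromotionFailPendingImports(slotImportJob *jobs, size_t njobs) {
    for (size_t i = 0; i < njobs; i++) {
        if (jobs[i].state != SLOT_IMPORT_OCCURRING_ON_PRIMARY) continue;
        jobs[i].state = SLOT_IMPORT_DONE;
        jobs[i].awaiting_finish = 0; /* FINISH (failure) goes out to replicas */
    }
}
```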

Other Options Considered

I also considered tracking "dirty" slots rather than using the slot import state machine. In this setup, primaries and replicas would simply mark a slot's hashtable in the kvstore as dirty whenever something is written to a slot the node does not currently own.

This approach is simpler, but it has the drawback that modules loaded on the replica would still not receive slot migration start/end notifications. Without those notifications, modules cannot properly contain these dirty keys during slot migration events.
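
For comparison, a sketch of that rejected alternative, using an invented per-slot bitmap purely to illustrate the idea.

```c
#include <stdbool.h>
#include <stdint.h>

#define CLUSTER_SLOTS 16384

/* Hypothetical per-node bitmap, one bit per slot. */
typedef struct {
    uint8_t dirty[CLUSTER_SLOTS / 8];
} dirtySlotTracker;

static void markDirtyIfUnowned(dirtySlotTracker *t, int slot, bool slot_owned) {
    /* "Dirty" = something was written to a slot this node does not own. */
    if (!slot_owned) t->dirty[slot / 8] |= (uint8_t)(1u << (slot % 8));
}
/* Drawback noted above: modules on the replica would never receive slot
 * migration start/end notifications, so they could not contain these keys. */
```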


codecov bot commented Sep 21, 2025

Codecov Report

❌ Patch coverage is 88.69863% with 33 lines in your changes missing coverage. Please review.
✅ Project coverage is 72.26%. Comparing base (0646749) to head (d0a1e76).
⚠️ Report is 1 commits behind head on unstable.

Files with missing lines Patch % Lines
src/cluster_migrateslots.c 91.08% 23 Missing ⚠️
src/valkey-check-rdb.c 0.00% 10 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##           unstable    #2635      +/-   ##
============================================
+ Coverage     72.18%   72.26%   +0.07%     
============================================
  Files           128      128              
  Lines         71037    71237     +200     
============================================
+ Hits          51277    51477     +200     
  Misses        19760    19760              
Files with missing lines Coverage Δ
src/cluster_legacy.c 87.36% <100.00%> (-0.03%) ⬇️
src/commands.def 100.00% <ø> (ø)
src/db.c 93.24% <100.00%> (-0.01%) ⬇️
src/rdb.c 77.60% <100.00%> (+1.16%) ⬆️
src/rdb.h 100.00% <ø> (ø)
src/replication.c 86.14% <100.00%> (+0.08%) ⬆️
src/server.c 88.43% <100.00%> (+0.06%) ⬆️
src/server.h 100.00% <ø> (ø)
src/valkey-check-rdb.c 71.18% <0.00%> (-1.37%) ⬇️
src/cluster_migrateslots.c 91.98% <91.08%> (+0.69%) ⬆️

... and 7 files with indirect coverage changes


Contributor

@zuiderkwast zuiderkwast left a comment

Great work!

Contributor

@zuiderkwast zuiderkwast left a comment

LGTM, but I'm not fully familiar with the ASM code in general. I didn't review the main ASM PR carefully enough.

if {$owning_prim == $idx_to_restart && ! $do_save} {
assert_match "0" [R $owning_prim CLUSTER COUNTKEYSINSLOT 16379]
assert_match "0" [R $owning_prim CLUSTER COUNTKEYSINSLOT 16381]
assert_match "0" [R $owning_prim CLUSTER COUNTKEYSINSLOT 16383]
Contributor

Why do you prefer assert_match instead of assert_equal? Match is using a glob-style pattern. No need to use it if you don't use wildcards in the pattern.

Contributor Author

There was no reason, I can replace them now

Contributor Author

Actually - updating the whole file makes the line count go crazy. Can I fix it in a follow-up so that we can keep this change reviewable? It is a simple find/replace, but I don't want this change to have +900 LOC

@zuiderkwast
Contributor

@valkey-io/valkey-committers Another pair of eyes on this wouldn't hurt. If anyone has time. It's needed for 9.0.0 to be released ASAP.

Member

@madolson madolson left a comment

I'm mostly aligned with the high level details. I'd like us to trim down the data in the RDB. The implementation seems pretty clean otherwise.

@zuiderkwast zuiderkwast moved this to In Progress in Valkey 9.0 Oct 1, 2025
@murphyjacob4 murphyjacob4 requested a review from madolson October 2, 2025 16:20
@murphyjacob4
Contributor Author

Hey @madolson, thanks for the review, please take another look!

Member

@madolson madolson left a comment

I think everything else looks fine outside some of the questions around the RDB format.

@murphyjacob4
Contributor Author

@madolson we now don't track source node ID on the replica, and we use an RDB opcode to save the slot imports. PTAL!

Member

@madolson madolson left a comment

LGTM, thanks!

Signed-off-by: Jacob Murphy <[email protected]>
@murphyjacob4 murphyjacob4 merged commit 383ffe4 into valkey-io:unstable Oct 7, 2025
53 checks passed
@github-project-automation github-project-automation bot moved this from In Progress to To be backported in Valkey 9.0 Oct 7, 2025
murphyjacob4 added a commit to murphyjacob4/valkey that referenced this pull request Oct 7, 2025
…ration (valkey-io#2635)

murphyjacob4 added a commit to murphyjacob4/valkey that referenced this pull request Oct 7, 2025
…ration (valkey-io#2635)

murphyjacob4 added a commit that referenced this pull request Oct 8, 2025
…ration (#2635)

@murphyjacob4 murphyjacob4 moved this from To be backported to Done in Valkey 9.0 Oct 8, 2025
roshkhatri pushed a commit to roshkhatri/valkey that referenced this pull request Oct 14, 2025
…ration (valkey-io#2635)

murphyjacob4 added a commit that referenced this pull request Oct 18, 2025
…on (#2749)

When working on #2635 I erroneously duplicated the
setSlotImportingStateInAllDbs call for successful imports. This doubled
the key count in the kvstore, which makes DBSIZE report an incorrect sum
and also causes BIT corruption that can eventually result in a crash.

The solution is:

1. Only call setSlotImportingStateInAllDbs once (in our
finishSlotMigrationJob function)
2. Make setSlotImportingStateInAllDbs idempotent by checking if the
delete from the kvstore importing hashtable is a no-op

This also fixes a bug where the number of importing keys is not lowered
after the migration, but this is less critical since it is only used
when resizing the dictionary on RDB load. However, it could result in
un-loadable RDBs if the importing key count gets large enough.

Signed-off-by: Jacob Murphy <[email protected]>
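
A minimal sketch of the idempotency guard described in the commit message above; setSlotImportingStateInAllDbs is the real function named there, but the types and body below are illustrative only.

```c
#include <stdbool.h>

/* Hypothetical per-db counter, just to show the shape of the fix. */
typedef struct {
    long long importing_keys; /* count kept for dict sizing on RDB load */
} dbImportCounters;

/* Per-key step of the clean-up: only adjust the count when the key really was
 * removed from the importing hashtable, so a second (erroneous) call for the
 * same import becomes a no-op instead of double-counting keys in the kvstore. */
static bool clearImportingKey(dbImportCounters *db, bool removed_from_table) {
    if (!removed_from_table) return false; /* delete was a no-op: do nothing */
    db->importing_keys--;
    return true;
}
```
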
cherukum-Amazon pushed a commit to cherukum-Amazon/valkey that referenced this pull request Oct 19, 2025
…on (valkey-io#2749)

cherukum-Amazon pushed a commit to cherukum-Amazon/valkey that referenced this pull request Oct 19, 2025
…on (valkey-io#2749)

cherukum-Amazon pushed a commit to cherukum-Amazon/valkey that referenced this pull request Oct 21, 2025
…on (valkey-io#2749)

diego-ciciani01 pushed a commit to diego-ciciani01/valkey that referenced this pull request Oct 21, 2025
…on (valkey-io#2749)

madolson pushed a commit that referenced this pull request Oct 21, 2025
…on (#2749)