aljoscha
Contributor

… read-only mode

Before, this scenario could happen when starting two deploy generations
for the same environment concurrently:

1. Gen 1 bootstraps and sets up catalog state just enough, without
   minting global id -> shard id mappings.
2. Gen 2 bootstraps and notices that the catalog state already exists
   but was written by gen 1. Because of this, it starts in read-only
   mode. In
   https://github.com/MaterializeInc/materialize/blob/e3c1b3b3620ea6d6c619f59efbfada11a5575ecb/src/storage-client/src/storage_collections.rs#L1384,
   we mint new shard ID mappings, which go into ephemeral catalog state
   because we're in read-only mode.
3. Gen 1 gets around to minting global id -> shard id mappings, writes
   them down durably, and then proceeds with bootstrap.
4. Gen 2 waits for collections to hydrate, which never happens because
   no one ever advances their frontiers. It never reports as ready to
   promote, so tests that expect this will hang.

This isn't a correctness issue: the environment will either hang forever
or hit a timeout, halt, and restart, at which point it sees the catalog
state with shard id mappings.

With this change, we notice the situation early: we halt when we would
have to mint new ids in read-only mode. This triggers a restart, at
which point gen 2 probably sees the updated catalog state with shard id
mappings.
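
To make the new behavior concrete, here is a minimal sketch of the check,
not the actual code in this PR. All names (`shard_for`, `MintError`, the
`u64` id types, the `BTreeMap` standing in for durable catalog state) are
hypothetical and only illustrate the rule "never mint a shard id mapping
in read-only mode; signal a halt instead".

```rust
use std::collections::BTreeMap;

// Hypothetical stand-ins for the real identifier types.
type GlobalId = u64;
type ShardId = u64;

#[derive(Debug, PartialEq)]
enum MintError {
    // No durable mapping exists and we are read-only: the caller is
    // expected to halt so that a restart can pick up the durable state.
    WouldMintInReadOnly(GlobalId),
}

// Returns the shard id for `id`. A new mapping is minted and recorded
// only when writes are allowed; in read-only mode a missing mapping is
// reported as an error instead of being minted into ephemeral state.
fn shard_for(
    durable: &mut BTreeMap<GlobalId, ShardId>,
    read_only: bool,
    id: GlobalId,
    next_shard: &mut ShardId,
) -> Result<ShardId, MintError> {
    if let Some(shard) = durable.get(&id) {
        return Ok(*shard);
    }
    if read_only {
        return Err(MintError::WouldMintInReadOnly(id));
    }
    // Read-write mode: mint a fresh shard id and record it durably.
    let shard = *next_shard;
    *next_shard += 1;
    durable.insert(id, shard);
    Ok(shard)
}
```

In this sketch, gen 2 calling `shard_for` in read-only mode before gen 1
has written its mappings gets `WouldMintInReadOnly` (the stand-in for a
halt). Once gen 1 has recorded the mapping, the same read-only call
returns the durable shard id.
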
@aljoscha aljoscha requested review from petrosagg and teskje August 15, 2025 16:50
@aljoscha aljoscha requested a review from a team as a code owner August 15, 2025 16:50