
Conversation

@ywangd ywangd (Member) commented Jul 30, 2025

This is the same failure as observed in #129644, for which the original fix #129680 did not really work because of the ordering of checks: the shutdown marker is removed only after the cluster passes the ready check, so that new shards can be allocated, but the cluster cannot pass the ready check until those shards are allocated. Hence the circular dependency. In hindsight, there is no need to put a shutdown record on every node. It is only needed on the node that upgrades last, to prevent the snapshot from completing during the upgrade. This PR does exactly that, which ensures there are always two nodes available to host new shards.

Resolves: #132135
Resolves: #132136
Resolves: #132137
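
To make the mechanism concrete, here is a minimal, hypothetical sketch of the approach (not the actual test change; the class, the method names, and the node id are invented for illustration): a shutdown marker is registered via the node shutdown API only on the node that upgrades last, which holds snapshot completion during that upgrade while leaving the other two nodes free to host new shards.

```java
import java.io.IOException;

import org.elasticsearch.client.Request;
import org.elasticsearch.client.Response;
import org.elasticsearch.client.RestClient;

/**
 * Hypothetical sketch of the fix described above: instead of registering a
 * shutdown marker on every node, register one only on the node that upgrades
 * last. All names here are invented; the real change lives in the
 * Elasticsearch rolling-upgrade test suite.
 */
final class LastNodeShutdownMarkerSketch {

    /**
     * Marks a single node for restart via the node shutdown API. While the
     * marker is present, snapshots do not finalize, which is what the test
     * needs while the last node upgrades. The other nodes are never marked,
     * so new shards always have two eligible hosts and the ready check can
     * still pass.
     */
    static void markLastUpgradingNodeForRestart(RestClient client, String lastNodeId) throws IOException {
        Request request = new Request("PUT", "/_nodes/" + lastNodeId + "/shutdown");
        request.setJsonEntity("""
            {
              "type": "restart",
              "reason": "hold snapshot completion while the last node upgrades"
            }""");
        Response response = client.performRequest(request);
        assert response.getStatusLine().getStatusCode() == 200;
    }

    /**
     * Removes the marker once the upgrade has finished, allowing the
     * snapshot to complete without racing the ready check.
     */
    static void clearShutdownMarker(RestClient client, String lastNodeId) throws IOException {
        client.performRequest(new Request("DELETE", "/_nodes/" + lastNodeId + "/shutdown"));
    }
}
```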

@ywangd ywangd requested a review from nicktindall July 30, 2025 08:18
@ywangd ywangd added the >test, :Distributed Coordination/Snapshot/Restore, auto-backport, v9.2.0, and v9.1.1 labels Jul 30, 2025
@elasticsearchmachine elasticsearchmachine added the Team:Distributed Coordination label Jul 30, 2025
@elasticsearchmachine (Collaborator)

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

@nicktindall nicktindall (Contributor) left a comment

LGTM

@ywangd ywangd (Member, Author) commented Jul 31, 2025

@elasticmachine update branch

@ywangd ywangd added the auto-merge-without-approval label Jul 31, 2025
@elasticsearchmachine elasticsearchmachine merged commit f39ccb5 into elastic:main Jul 31, 2025
33 checks passed
@ywangd ywangd deleted the es-132135-fix branch July 31, 2025 05:13
@elasticsearchmachine (Collaborator)

💔 Backport failed

Branch 9.1: Commit could not be cherrypicked due to conflicts

You can use sqren/backport to manually backport by running `backport --upstream elastic/elasticsearch --pr 132157`

@ywangd ywangd (Member, Author) commented Jul 31, 2025

💚 All backports created successfully

Branch 9.1: backport created successfully

Questions? Please refer to the Backport tool documentation.

ywangd added a commit to ywangd/elasticsearch that referenced this pull request Jul 31, 2025
…2157)

(cherry picked from commit f39ccb5)

# Conflicts:
#	muted-tests.yml
elasticsearchmachine pushed a commit that referenced this pull request Jul 31, 2025
…132233)

(cherry picked from commit f39ccb5)

# Conflicts:
#	muted-tests.yml
afoucret pushed a commit to afoucret/elasticsearch that referenced this pull request Jul 31, 2025
…2157)

smalyshev pushed a commit to smalyshev/elasticsearch that referenced this pull request Jul 31, 2025
…2157)
