
Conversation

@eggfoobar
Contributor

During upgrades we need the arbiter MCP nodes to be counted in the same calculation used for upgrading control plane nodes, so that an arbiter node does not update at the same time as a control plane node and cause a quorum loss in etcd.

- What I did

- How to verify it

- Description for the changelog

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Nov 21, 2025
@openshift-ci-robot
Contributor

@eggfoobar: This pull request references Jira Issue OCPBUGS-64681, which is invalid:

  • expected the bug to target the "4.21.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

During upgrades we need the arbiter MCP nodes to be counted in the same calculation used for upgrading control plane nodes, so that an arbiter node does not update at the same time as a control plane node and cause a quorum loss in etcd.

- What I did

- How to verify it

- Description for the changelog

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot openshift-ci-robot added the jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. label Nov 21, 2025
@openshift-ci
Contributor

openshift-ci bot commented Nov 21, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: eggfoobar
Once this PR has been reviewed and has the lgtm label, please assign umohnani8 for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@eggfoobar
Contributor Author

/test e2e-metal-ovn-two-node-arbiter
/payload-job periodic-ci-openshift-release-master-nightly-4.21-e2e-metal-ovn-two-node-arbiter-upgrade
/jira refresh

@openshift-ci
Contributor

openshift-ci bot commented Nov 21, 2025

@eggfoobar: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-master-nightly-4.21-e2e-metal-ovn-two-node-arbiter-upgrade

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/6f953230-c699-11f0-960e-472a28fcb2d9-0

@openshift-ci-robot openshift-ci-robot added jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. and removed jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Nov 21, 2025
@openshift-ci-robot
Contributor

@eggfoobar: This pull request references Jira Issue OCPBUGS-64681, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.21.0) matches configured target version for branch (4.21.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @jogeo

In response to this:

/test e2e-metal-ovn-two-node-arbiter
/payload-job periodic-ci-openshift-release-master-nightly-4.21-e2e-metal-ovn-two-node-arbiter-upgrade
/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci bot requested a review from jogeo November 21, 2025 05:18
during upgrades we need the arbiter mcp nodes to be counted in the same calculation used for upgrading control plane nodes, so that an arbiter node does not update at the same time as a control plane node and cause a quorum loss in etcd

Signed-off-by: ehila <[email protected]>
@eggfoobar eggfoobar force-pushed the ocpbugs-64681-arbiter-update branch from 300a354 to 301ec41 November 21, 2025 18:36
@eggfoobar
Contributor Author

/test e2e-metal-ovn-two-node-arbiter
/payload-job periodic-ci-openshift-release-master-nightly-4.21-e2e-metal-ovn-two-node-arbiter-upgrade

@openshift-ci
Contributor

openshift-ci bot commented Nov 21, 2025

@eggfoobar: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-master-nightly-4.21-e2e-metal-ovn-two-node-arbiter-upgrade

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/ffac67c0-c708-11f0-9349-9d312453556e-0

@eggfoobar
Contributor Author

/retest-required

@openshift-ci
Contributor

openshift-ci bot commented Nov 22, 2025

@eggfoobar: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/bootstrap-unit | 301ec41 | link | false | /test bootstrap-unit |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

}
combinedNodes := append([]*corev1.Node{}, nodes...)
combinedNodes = append(combinedNodes, arbiterNodes...)
combinedMax, err := maxUnavailable(pool, combinedNodes)
Contributor

So, this basically implies that the arbiter pool's maxUnavailable no longer has an effect. Maybe we should not allow setting that field at all.

(In practice nobody should be fiddling with maxUnavailable for either masters or arbiters, so it probably doesn't matter; just noting it in case.)
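
A minimal sketch of what flagging (or ignoring) the arbiter pool's field could look like, assuming the MachineConfigPool spec's maxUnavailable field and a klog warning; the wording and placement are illustrative, not part of this PR:

```go
// Assumed guard: the combined budget is derived from the master pool's
// maxUnavailable, so a value set on the arbiter pool would silently do nothing.
// Surfacing that (or rejecting it via validation) would make the behaviour explicit.
if arbiterPool.Spec.MaxUnavailable != nil {
	klog.Warningf("maxUnavailable on pool %q is ignored; the %q pool's setting governs the combined rollout",
		arbiterPool.Name, pool.Name)
}
```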

arbiterUnavailable := len(getUnavailableMachines(arbiterNodes, arbiterPool))
// Adjust maxunavail to account for arbiter unavailable nodes
// This ensures we don't exceed the combined maxUnavailable across both pools
maxunavail = combinedMax - arbiterUnavailable
Contributor

The previously set maxunavail should just be the value set in the pool, and the candidate selection below filters out any in-progress or not-ready nodes. Given that we have some complex logic below, is it necessary to subtract arbiter here?
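
For concreteness (numbers are illustrative, not from the PR): with the typical control-plane maxUnavailable of 1, combinedMax is 1, so an arbiter node that is mid-update drives maxunavail to 0 and no master is selected that sync; once the arbiter settles, the budget returns to 1. Whether that subtraction is redundant with the candidate filtering below is exactly what this question is asking.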

}

// If coordinating with arbiter pool, also handle arbiter node updates
if arbiterPool != nil && len(arbiterNodes) > 0 {
Contributor

Should this whole section also be gated by `pool.Name == ctrlcommon.MachineConfigPoolMaster && controlPlaneTopology == configv1.HighlyAvailableArbiterMode`, similar to the check above? It seems possible that we'd be syncing the arbiter pool during a worker sync, which can happen in parallel with masters and might be unsafe.
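
A minimal sketch of the gate being suggested, reusing the identifiers from the comment and the surrounding diff; the exact placement is an assumption:

```go
// Only the master-pool sync on an arbiter-topology cluster should drive
// arbiter node updates; a worker sync running in parallel must not reach
// this branch.
if pool.Name == ctrlcommon.MachineConfigPoolMaster &&
	controlPlaneTopology == configv1.HighlyAvailableArbiterMode &&
	arbiterPool != nil && len(arbiterNodes) > 0 {
	// ... existing arbiter handling ...
}
```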

// If coordinating with arbiter pool, also handle arbiter node updates
if arbiterPool != nil && len(arbiterNodes) > 0 {
// Set cluster config annotation for arbiter nodes
if err := ctrl.setClusterConfigAnnotation(arbiterNodes, controlPlaneTopology); err != nil {
Contributor

Wondering if we could combine this with the same function call earlier somehow, or move it closer (although I guess functionally speaking it shouldn't matter).

There's quite a bit of duplicated code in general. I guess the main reason we can't just merge into the above functions is that arbiter has a different desiredConfig annotation it would need to set? Would it be easier if we modified the updateCandidateMachines function to account for that and treat the arbiter node as a master node in this function?
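
A hedged sketch of the merge being floated here; the helper names and signatures (updateCandidateMachines taking the combined list, isArbiterNode, setDesiredMachineConfigAnnotation) are assumptions for illustration, not the PR's actual code:

```go
// Assumed shape: pick candidates from the combined master+arbiter node list
// against the combined budget, then choose the desired rendered config per
// node based on which pool owns it.
candidates := updateCandidateMachines(pool, combinedNodes, combinedMax) // assumed signature
for _, node := range candidates {
	desired := pool.Spec.Configuration.Name
	if isArbiterNode(node) { // placeholder check, e.g. by node-role label
		desired = arbiterPool.Spec.Configuration.Name
	}
	if err := ctrl.setDesiredMachineConfigAnnotation(node.Name, desired); err != nil { // assumed helper
		return err
	}
}
```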

combinedNodes = append(combinedNodes, arbiterNodes...)
combinedMax, err := maxUnavailable(pool, combinedNodes)
if err == nil {
remainingCapacity := combinedMax - masterUnavailable - masterTargeted - arbiterUnavailable
Contributor

I'm having trouble following the logic used for this calculation. Given that we calculated the capacity (including arbiter) earlier and tracked the number of masters being updated in this round, wouldn't it just be capacity - masterTargeted? Would it be possible to simplify this somehow? It feels like a lot of potentially unnecessary calculation on top of all the duplication, and it would be hard to maintain in the future. I still think I'd prefer merging arbiter into the master calculation and keeping most of the function unduplicated, just setting the desired annotation differently if an arbiter node gets selected. Ultimately this function is designed to update some nodes' annotations to the new desired config, and I'm hoping we can keep the core functionality the same without special-casing too much.
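
To make the arithmetic concrete, here is a small standalone sketch that mirrors the expression in the diff; all names and numbers are illustrative:

```go
package main

import "fmt"

// remainingMasterCapacity mirrors the expression in the diff:
// combinedMax - masterUnavailable - masterTargeted - arbiterUnavailable.
// If the unavailable counts were already folded into an earlier capacity
// figure, this would reduce to capacity - masterTargeted, which is the
// simplification being suggested above.
func remainingMasterCapacity(combinedMax, masterUnavailable, masterTargeted, arbiterUnavailable int) int {
	remaining := combinedMax - masterUnavailable - masterTargeted - arbiterUnavailable
	if remaining < 0 {
		return 0
	}
	return remaining
}

func main() {
	// combinedMax=1 (typical control plane), arbiter mid-update, no masters
	// targeted yet: no additional master may start updating this sync.
	fmt.Println(remainingMasterCapacity(1, 0, 0, 1)) // 0
}
```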

