Conversation

@hongkailiu hongkailiu commented Sep 23, 2025

This covers the cluster scaling case from the recently introduced rule [1]:

Operators should not report Progressing only because DaemonSets
owned by them are adjusting to a new node from cluster scaleup or
a node rebooting from cluster upgrade.

The check plugs into the existing scaling test: it records each CO's Progressing condition before and after scaling, and flags every CO that either left Progressing=False or re-entered Progressing=False with a different LastTransitionTime.

Bugs have already been created for the node-rebooting case. There,
the condition goes Progressing=True for the same reason we observed
during cluster scale-up/down, so we reuse those bugs instead of
filing a new set that would likely be closed as duplicates.

[1]. https://github.com/openshift/api/blob/61248d910ff74aef020492922d14e6dadaba598b/config/v1/types_cluster_operator.go#L163-L164
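
To make the check concrete, here is a minimal Go sketch of the before/after comparison described above, assuming the openshift/api config/v1 types. The helper names (progressingState, snapshotProgressing, findViolations) are illustrative stand-ins, not the PR's actual code:

```go
package machines

import (
	configv1 "github.com/openshift/api/config/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// progressingState records one ClusterOperator's Progressing condition.
type progressingState struct {
	status         configv1.ConditionStatus
	lastTransition metav1.Time
}

// snapshotProgressing extracts the Progressing condition of every CO.
func snapshotProgressing(operators []configv1.ClusterOperator) map[string]progressingState {
	states := map[string]progressingState{}
	for _, co := range operators {
		for _, cond := range co.Status.Conditions {
			if cond.Type == configv1.OperatorProgressing {
				states[co.Name] = progressingState{status: cond.Status, lastTransition: cond.LastTransitionTime}
			}
		}
	}
	return states
}

// findViolations returns every CO that either left Progressing=False or
// re-entered Progressing=False with a different LastTransitionTime, i.e.
// it flapped through Progressing=True while the cluster was scaling.
func findViolations(before, after map[string]progressingState) []string {
	var violations []string
	for name, b := range before {
		if b.status != configv1.ConditionFalse {
			continue // only COs that started at Progressing=False are judged
		}
		a, ok := after[name]
		if !ok || a.status != configv1.ConditionFalse || !a.lastTransition.Equal(&b.lastTransition) {
			violations = append(violations, name)
		}
	}
	return violations
}
```

Comparing LastTransitionTime, not just the final status, is what catches operators that flap to Progressing=True and back during the scaling window.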

@hongkailiu hongkailiu changed the title ClusterOperators should not go Progressing only for cluster scaling OTA-1637: ClusterOperators should not go Progressing only for cluster scaling Sep 23, 2025
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Sep 23, 2025

openshift-ci-robot commented Sep 23, 2025

@hongkailiu: This pull request references OTA-1637 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.21.0" version, but no target version was set.

In response to this:

This covers the cluster scaling case from the recently introduced rule [1]:

Operators should not report Progressing only because DaemonSets
owned by them are adjusting to a new node from cluster scaleup or
a node rebooting from cluster upgrade.

The check plugs into the existing scaling test: it records each CO's Progressing condition before and after scaling, and flags every CO that either left Progressing=False or re-entered Progressing=False with a different LastTransitionTime.

[1]. https://github.com/openshift/api/blob/61248d910ff74aef020492922d14e6dadaba598b/config/v1/types_cluster_operator.go#L163-L164

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.


hongkailiu commented Sep 24, 2025

This is what I expect to see (from this job):

```
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/30297/pull-ci-openshift-origin-main-e2e-aws-ovn-serial-2of2/1970572328379092992/artifacts/e2e-aws-ovn-serial/openshift-e2e-test/build-log.txt | rg 'failed.*scaling different machineSets simultaneously|fail.*Progressing=False'
fail [github.com/openshift/origin/test/extended/machines/scale.go:253]: those cluster operators left Progressing=False while cluster was scaling: [network image-registry node-tuning storage dns]
failed: (6m0s) 2025-09-23T22:57:26 "[sig-cluster-lifecycle][Feature:Machines][Serial] Managed cluster should grow and decrease when scaling different machineSets simultaneously [Timeout:30m][apigroup:machine.openshift.io] [Suite:openshift/conformance/serial]"
```

And the time matches perfectly.

(Two screenshots, taken 2025-09-25 at 09:39:47 and 09:40:56.)

Interestingly, it is the same list caught by another case. I suspect they stem from the same underlying issue, so the same bug can be shared by the two cases.

@petr-muller

/cc

@openshift-ci openshift-ci bot requested a review from petr-muller September 25, 2025 23:32
@DavidHurta

/cc

@openshift-ci openshift-ci bot requested a review from DavidHurta September 29, 2025 12:22

@petr-muller petr-muller left a comment


LGTM


@petr-muller petr-muller left a comment


/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Sep 30, 2025
This covers the cluster scaling case from the recently introduced
rule [1]:

```
Operators should not report Progressing only because DaemonSets
owned by them are adjusting to a new node from cluster scaleup or
a node rebooting from cluster upgrade.
```

The check plugs into the existing scaling test: it records each
CO's Progressing condition before and after scaling, and flags
every CO that either left Progressing=False or re-entered
Progressing=False with a different LastTransitionTime.

[1]. https://github.com/openshift/api/blob/61248d910ff74aef020492922d14e6dadaba598b/config/v1/types_cluster_operator.go#L163-L164
@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Sep 30, 2025
@hongkailiu

/wip

Creating bugs for exceptions ...


@hongkailiu

/hold

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 30, 2025

openshift-trt bot commented Sep 30, 2025

Job Failure Risk Analysis for sha: 787e9be

pull-ci-openshift-origin-main-e2e-openstack-ovn: IncompleteTests. Tests for this run (143) are below the historical average (2170); not enough tests ran to make a reasonable risk analysis, which could be due to infra, installation, or upgrade problems.

Bugs have already been created for the node-rebooting case. There,
the condition goes Progressing=True for the same reason we observed
during cluster scale-up/down, so we reuse those bugs instead of
filing a new set that would likely be closed as duplicates.

openshift-ci-robot commented Oct 2, 2025

@hongkailiu: This pull request references OTA-1637 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.21.0" version, but no target version was set.

In response to this:

This covers the cluster scaling case from the recently introduced rule [1]:

Operators should not report Progressing only because DaemonSets
owned by them are adjusting to a new node from cluster scaleup or
a node rebooting from cluster upgrade.

The check plugs into the existing scaling test: it records each CO's Progressing condition before and after scaling, and flags every CO that either left Progressing=False or re-entered Progressing=False with a different LastTransitionTime.

Bugs have already been created for the node-rebooting case. There,
the condition goes Progressing=True for the same reason we observed
during cluster scale-up/down, so we reuse those bugs instead of
filing a new set that would likely be closed as duplicates.

[1]. https://github.com/openshift/api/blob/61248d910ff74aef020492922d14e6dadaba598b/config/v1/types_cluster_operator.go#L163-L164


@hongkailiu

The result from e2e-aws-ovn-serial-2of2 looks good:

```
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/30297/pull-ci-openshift-origin-main-e2e-aws-ovn-serial-2of2/1973475766385512448/artifacts/e2e-aws-ovn-serial/openshift-e2e-test/artifacts/e2e.log | grep grow
started: 0/5/36 "[sig-cluster-lifecycle][Feature:Machines][Serial] Managed cluster should grow and decrease when scaling different machineSets simultaneously [Timeout:30m][apigroup:machine.openshift.io] [Suite:openshift/conformance/serial]"
passed: (5m9s) 2025-10-01T22:58:36 "[sig-cluster-lifecycle][Feature:Machines][Serial] Managed cluster should grow and decrease when scaling different machineSets simultaneously [Timeout:30m][apigroup:machine.openshift.io] [Suite:openshift/conformance/serial]"
```

I wanted to show some logs for the exceptions, but that does not seem easy to do when the job succeeds. 🤷

@hongkailiu

/hold cancel

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 2, 2025
@hongkailiu

/verified "periodic-ci-openshift-release-master-nightly-4.21-e2e-aws-ovn-serial-2of2"

@openshift-ci-robot

@hongkailiu: The /verified command must be used with one of the following actions: by, later, remove, or bypass. See https://docs.ci.openshift.org/docs/architecture/jira/#premerge-verification for more information.

In response to this:

/verified "periodic-ci-openshift-release-master-nightly-4.21-e2e-aws-ovn-serial-2of2"


@hongkailiu

/verified by periodic-ci-openshift-release-master-nightly-4.21-e2e-aws-ovn-serial-2of2

@openshift-ci-robot openshift-ci-robot added the verified Signifies that the PR passed pre-merge verification criteria label Oct 7, 2025
@openshift-ci-robot

@hongkailiu: This PR has been marked as verified by periodic-ci-openshift-release-master-nightly-4.21-e2e-aws-ovn-serial-2of2.

In response to this:

/verified by periodic-ci-openshift-release-master-nightly-4.21-e2e-aws-ovn-serial-2of2


```go
			violations = append(violations, operator)
		}
	}
	// A single assertion over the aggregated list of offenders.
	o.Expect(violations).To(o.BeEmpty(), "those cluster operators left Progressing=False while cluster was scaling: %v", violations)
```
Contributor


This will become one of those test failures (if any) that are hard to assign to individual components, since it is a single test that can impact multiple operators.

Could Expect be called in the for loop and include the operator in the name?
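
For illustration only, the suggestion might look like the sketch below; expectNoProgressingFlaps and violatesProgressingRule are hypothetical stand-ins, and o is the gomega alias used in the excerpt above:

```go
import (
	configv1 "github.com/openshift/api/config/v1"
	o "github.com/onsi/gomega"
)

// Hypothetical reshaping of the single assertion above: one Expect per
// operator, so each offender fails with its own message.
func expectNoProgressingFlaps(operators []configv1.ClusterOperator) {
	for _, operator := range operators {
		flapped := violatesProgressingRule(operator) // stand-in predicate
		o.Expect(flapped).To(o.BeFalse(),
			"cluster operator %s left Progressing=False while cluster was scaling", operator.Name)
	}
}
```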

Contributor


Ahh, never mind, I see it isn't a new test, just new information regarding the test failure.


@hongkailiu hongkailiu Oct 7, 2025


Thanks for the review.
I thought about it too: "hard to assign to individual components".
The modified test is an extended test (under /test/extended), and I do not see how to insert a junitapi.JUnitTestCase the way a monitortest does in CollectData and EvaluateTestsFromConstructedIntervals.

And yes, if it fails in the future, we will have to check the error message from Expect and manually triage the OCPBugs.
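
For contrast, a monitortest can emit one JUnit case per offending operator, which is what makes per-component assignment possible there. A rough sketch, assuming origin's junitapi package; the import path and the JUnitTestCase/FailureOutput field names are from memory and should be treated as assumptions:

```go
package monitortests

import (
	"fmt"

	"github.com/openshift/origin/pkg/test/ginkgo/junitapi"
)

// progressingJunits turns each offending operator into its own JUnit test
// case so a failure can be routed to the owning component.
func progressingJunits(violations []string) []*junitapi.JUnitTestCase {
	var junits []*junitapi.JUnitTestCase
	for _, name := range violations {
		junits = append(junits, &junitapi.JUnitTestCase{
			Name: fmt.Sprintf("operator %s should not report Progressing during cluster scaling", name),
			FailureOutput: &junitapi.FailureOutput{
				Output: fmt.Sprintf("cluster operator %s left Progressing=False while cluster was scaling", name),
			},
		})
	}
	return junits
}
```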

@neisw

neisw commented Oct 7, 2025

/approve


openshift-ci bot commented Oct 7, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: hongkailiu, neisw, petr-muller, wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 7, 2025
@hongkailiu

/retest-required

@openshift-ci-robot

/retest-required

Remaining retests: 0 against base HEAD 3581fe4 and 2 for PR HEAD c9c5fa5 in total

@openshift-ci-robot

/retest-required

Remaining retests: 0 against base HEAD d3b6fa6 and 1 for PR HEAD c9c5fa5 in total

@hongkailiu

/test e2e-aws-ovn-serial-2of2
/test e2e-gcp-csi

@hongkailiu

It was green once on the same commit.

/test e2e-gcp-csi


openshift-trt bot commented Oct 9, 2025

Job Failure Risk Analysis for sha: c9c5fa5

pull-ci-openshift-origin-main-e2e-openstack-ovn: IncompleteTests. Tests for this run (143) are below the historical average (2293); not enough tests ran to make a reasonable risk analysis, which could be due to infra, installation, or upgrade problems.
pull-ci-openshift-origin-main-okd-scos-e2e-aws-ovn: IncompleteTests. Tests for this run (140) are below the historical average (1770); not enough tests ran to make a reasonable risk analysis, which could be due to infra, installation, or upgrade problems.


openshift-trt bot commented Oct 9, 2025

Job Failure Risk Analysis for sha: c9c5fa5

pull-ci-openshift-origin-main-e2e-openstack-ovn: IncompleteTests. Tests for this run (143) are below the historical average (2293); not enough tests ran to make a reasonable risk analysis, which could be due to infra, installation, or upgrade problems.
pull-ci-openshift-origin-main-okd-scos-e2e-aws-ovn: IncompleteTests. Tests for this run (140) are below the historical average (1788); not enough tests ran to make a reasonable risk analysis, which could be due to infra, installation, or upgrade problems.

@openshift-ci-robot

/retest-required

Remaining retests: 0 against base HEAD 816619b and 0 for PR HEAD c9c5fa5 in total


openshift-ci bot commented Oct 9, 2025

@hongkailiu: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Failed tests (none required):
ci/prow/e2e-aws-ovn-single-node-serial (c9c5fa5): /test e2e-aws-ovn-single-node-serial
ci/prow/e2e-openstack-ovn (c9c5fa5): /test e2e-openstack-ovn
ci/prow/e2e-aws-ovn-single-node (c9c5fa5): /test e2e-aws-ovn-single-node
ci/prow/e2e-aws-ovn-edge-zones (0d05b6a): /test e2e-aws-ovn-edge-zones
ci/prow/okd-scos-e2e-aws-ovn (c9c5fa5): /test okd-scos-e2e-aws-ovn

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@openshift-ci-robot

/hold

Revision c9c5fa5 was retested 3 times: holding

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 9, 2025

openshift-trt bot commented Oct 9, 2025

Job Failure Risk Analysis for sha: c9c5fa5

pull-ci-openshift-origin-main-e2e-openstack-ovn: IncompleteTests. Tests for this run (143) are below the historical average (2293); not enough tests ran to make a reasonable risk analysis, which could be due to infra, installation, or upgrade problems.
pull-ci-openshift-origin-main-okd-scos-e2e-aws-ovn: IncompleteTests. Tests for this run (140) are below the historical average (1807); not enough tests ran to make a reasonable risk analysis, which could be due to infra, installation, or upgrade problems.


openshift-trt bot commented Oct 9, 2025

Job Failure Risk Analysis for sha: c9c5fa5

pull-ci-openshift-origin-main-e2e-openstack-ovn: IncompleteTests. Tests for this run (143) are below the historical average (2326); not enough tests ran to make a reasonable risk analysis, which could be due to infra, installation, or upgrade problems.
pull-ci-openshift-origin-main-okd-scos-e2e-aws-ovn: IncompleteTests. Tests for this run (140) are below the historical average (1807); not enough tests ran to make a reasonable risk analysis, which could be due to infra, installation, or upgrade problems.

@hongkailiu

/retest-required
/hold cancel

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 9, 2025

openshift-trt bot commented Oct 9, 2025

Job Failure Risk Analysis for sha: c9c5fa5

pull-ci-openshift-origin-main-e2e-openstack-ovn: IncompleteTests. Tests for this run (143) are below the historical average (2326); not enough tests ran to make a reasonable risk analysis, which could be due to infra, installation, or upgrade problems.
pull-ci-openshift-origin-main-okd-scos-e2e-aws-ovn: IncompleteTests. Tests for this run (140) are below the historical average (1921); not enough tests ran to make a reasonable risk analysis, which could be due to infra, installation, or upgrade problems.

@openshift-ci-robot

/retest-required

Remaining retests: 0 against base HEAD 7343864 and 2 for PR HEAD c9c5fa5 in total

@openshift-ci-robot

/retest-required

Remaining retests: 0 against base HEAD bb9f65a and 1 for PR HEAD c9c5fa5 in total


openshift-trt bot commented Oct 10, 2025

Job Failure Risk Analysis for sha: c9c5fa5

pull-ci-openshift-origin-main-e2e-openstack-ovn: IncompleteTests. Tests for this run (143) are below the historical average (2158); not enough tests ran to make a reasonable risk analysis, which could be due to infra, installation, or upgrade problems.
pull-ci-openshift-origin-main-okd-scos-e2e-aws-ovn: IncompleteTests. Tests for this run (140) are below the historical average (1873); not enough tests ran to make a reasonable risk analysis, which could be due to infra, installation, or upgrade problems.


@openshift-merge-bot openshift-merge-bot bot merged commit 440edc3 into openshift:main Oct 10, 2025
11 of 26 checks passed