OTA-1637: ClusterOperators should not go Progressing only for cluster scaling #30297
Conversation
@hongkailiu: This pull request references OTA-1637 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.21.0" version, but no target version was set. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
This is what I expect to see (from this job):
And the time matches perfectly. [screenshots] Interestingly, it is the same list caught by another case. I suspect they stem from the same underlying issue, so the same bug can be shared by the two cases. |
/cc |
/cc |
LGTM
/lgtm
This is to cover the cluster scaling case from the rule [1] that was introduced recently:
```
Operators should not report Progressing only because DaemonSets owned by them are adjusting to a new node from cluster scaleup or a node rebooting from cluster upgrade.
```
The test plugs into the existing scaling test. It checks each CO's Progressing condition before and after the test, and identifies every CO that either left Progressing=False or re-entered Progressing=False with a different LastTransitionTime.
[1]. https://github.com/openshift/api/blob/61248d910ff74aef020492922d14e6dadaba598b/config/v1/types_cluster_operator.go#L163-L164
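For illustration, here is a minimal sketch of the before/after comparison described above, assuming the test has already captured each ClusterOperator's Progressing condition into two maps keyed by operator name. The package, function name, and map shape are hypothetical, not the helpers used in this PR:

```go
package scalingcheck

import (
	"fmt"

	configv1 "github.com/openshift/api/config/v1"
)

// findViolations returns a description of every cluster operator that either
// left Progressing=False during the scaling test, or returned to
// Progressing=False with a different LastTransitionTime (i.e. it flapped).
func findViolations(before, after map[string]configv1.ClusterOperatorStatusCondition) []string {
	var violations []string
	for name, b := range before {
		if b.Status != configv1.ConditionFalse {
			// Only operators that started out Progressing=False are checked.
			continue
		}
		a, ok := after[name]
		if !ok {
			continue
		}
		if a.Status != configv1.ConditionFalse || !a.LastTransitionTime.Equal(&b.LastTransitionTime) {
			violations = append(violations, fmt.Sprintf("%s (reason=%s)", name, a.Reason))
		}
	}
	return violations
}
```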
Force-pushed 5891e83 to 787e9be (Compare)
/wip Creating bugs for exceptions ... |
/hold |
Job Failure Risk Analysis for sha: 787e9be
|
Force-pushed 6e43bdc to 0d05b6a (Compare)
The bugs were created for the node-rebooting case. The condition goes to Progressing=True with the same reason that we found for cluster scaling up/down, so we reuse those bugs instead of creating a new set that would likely be closed as duplicates.
Force-pushed 0d05b6a to c9c5fa5 (Compare)
@hongkailiu: This pull request references OTA-1637 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.21.0" version, but no target version was set. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
The result from e2e-aws-ovn-serial-2of2 looks good:
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/30297/pull-ci-openshift-origin-main-e2e-aws-ovn-serial-2of2/1973475766385512448/artifacts/e2e-aws-ovn-serial/openshift-e2e-test/artifacts/e2e.log | grep grow
started: 0/5/36 "[sig-cluster-lifecycle][Feature:Machines][Serial] Managed cluster should grow and decrease when scaling different machineSets simultaneously [Timeout:30m][apigroup:machine.openshift.io] [Suite:openshift/conformance/serial]"
passed: (5m9s) 2025-10-01T22:58:36 "[sig-cluster-lifecycle][Feature:Machines][Serial] Managed cluster should grow and decrease when scaling different machineSets simultaneously [Timeout:30m][apigroup:machine.openshift.io] [Suite:openshift/conformance/serial]"
I wanted to show some logs for the exceptions, but that does not seem easy to do when the job succeeds. 🤷 |
/hold cancel |
/verified "periodic-ci-openshift-release-master-nightly-4.21-e2e-aws-ovn-serial-2of2" |
@hongkailiu: The In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
/verified by periodic-ci-openshift-release-master-nightly-4.21-e2e-aws-ovn-serial-2of2 |
@hongkailiu: This PR has been marked as verified by In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
violations = append(violations, operator)
}
}
o.Expect(violations).To(o.BeEmpty(), "those cluster operators left Progressing=False while cluster was scaling: %v", violations)
This will become one of those test failures (if any) that are hard to assign to individual components, since it is a single test that can impact multiple operators.
Could Expect be called in the for loop and include the operator in the name?
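For what it's worth, a rough sketch of that suggestion might look like the following, assuming the surrounding test already imports gomega as `o` and openshift/api's config/v1 as `configv1`, and that `clusterOperators` is a ClusterOperatorList; `progressingCondition` is a hypothetical helper, not code from this PR:

```go
// Hypothetical per-operator assertion so a failure names the offending operator.
for _, operator := range clusterOperators.Items {
	condition := progressingCondition(operator) // hypothetical helper returning the Progressing condition
	o.Expect(condition.Status).To(o.Equal(configv1.ConditionFalse),
		"cluster operator %s left Progressing=False while the cluster was scaling", operator.Name)
}
```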
Ahh, never mind, I see it isn't a new test, just new information regarding the test failure.
Thanks for the review.
I thought about it too: "hard to assign to individual components".
The modified test is an extended test (under /test/extended). I do not see how to insert junitapi.JUnitTestCase the way a monitortest does in CollectData and EvaluateTestsFromConstructedIntervals.
And yes, if it fails in the future, we have to check the error message from the Expect call and manually triage the OCPBugs.
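For comparison, a monitortest attributes failures by emitting one junitapi.JUnitTestCase per operator, roughly like the sketch below; the exact field usage and variable names here are assumptions, not code from this PR or from origin's monitortest framework:

```go
// Hypothetical per-operator JUnit case, so each failure is attributed to the
// component that owns that cluster operator rather than to one shared test.
testCase := &junitapi.JUnitTestCase{
	Name: fmt.Sprintf("clusteroperator/%s should not report Progressing during cluster scaling", operator.Name),
	FailureOutput: &junitapi.FailureOutput{
		Output: fmt.Sprintf("Progressing changed while the cluster was scaling: %s", condition.Message),
	},
}
testCases = append(testCases, testCase)
```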
/approve |
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: hongkailiu, neisw, petr-muller, wking
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment. |
/retest-required |
/test e2e-aws-ovn-serial-2of2 |
It was green once on the same commit. /test e2e-gcp-csi |
Job Failure Risk Analysis for sha: c9c5fa5
|
Job Failure Risk Analysis for sha: c9c5fa5
|
@hongkailiu: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
/hold Revision c9c5fa5 was retested 3 times: holding |
Job Failure Risk Analysis for sha: c9c5fa5
|
Job Failure Risk Analysis for sha: c9c5fa5
|
/retest-required |
Job Failure Risk Analysis for sha: c9c5fa5
|
Job Failure Risk Analysis for sha: c9c5fa5
|
Merged 440edc3 into openshift:main
This is to cover the cluster scaling case from the rule [1] that was introduced recently:
The test plugs into the existing scaling test. It checks each CO's Progressing condition before and after the test, and identifies every CO that either left Progressing=False or re-entered Progressing=False with a different LastTransitionTime.
The bugs were created for the node-rebooting case. The condition
goes to Progressing=True with the same reason that we found for
cluster scaling up/down, so we reuse those bugs instead of
creating a new set that would likely be closed as duplicates.
[1]. https://github.com/openshift/api/blob/61248d910ff74aef020492922d14e6dadaba598b/config/v1/types_cluster_operator.go#L163-L164