Conversation

@hongkailiu hongkailiu commented Sep 23, 2025

This covers the cluster scaling case from the recently introduced rule [1]:

Operators should not report Progressing only because DaemonSets
owned by them are adjusting to a new node from cluster scaleup or
a node rebooting from cluster upgrade.

The check plugs into the existing scaling test: it records each CO's Progressing condition before and after scaling, and flags every CO that either left Progressing=False or re-entered Progressing=False with a different LastTransitionTime.

Bugs have already been created for the node-rebooting case. There,
the condition goes Progressing=True for the same reason we observed
during cluster scale-up/down, so we reuse those bugs instead of
filing a new set that would likely be closed as duplicates.

[1]. https://github.com/openshift/api/blob/61248d910ff74aef020492922d14e6dadaba598b/config/v1/types_cluster_operator.go#L163-L164
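
To make the check concrete, here is a minimal Go sketch of the before/after comparison described above, assuming the openshift/api config/v1 types. The helper names (progressingState, snapshotProgressing, findViolations) are illustrative stand-ins, not the PR's actual code:

```go
package machines

import (
	configv1 "github.com/openshift/api/config/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// progressingState records one ClusterOperator's Progressing condition.
type progressingState struct {
	status         configv1.ConditionStatus
	lastTransition metav1.Time
}

// snapshotProgressing extracts the Progressing condition of every CO.
func snapshotProgressing(operators []configv1.ClusterOperator) map[string]progressingState {
	states := map[string]progressingState{}
	for _, co := range operators {
		for _, cond := range co.Status.Conditions {
			if cond.Type == configv1.OperatorProgressing {
				states[co.Name] = progressingState{status: cond.Status, lastTransition: cond.LastTransitionTime}
			}
		}
	}
	return states
}

// findViolations returns every CO that either left Progressing=False or
// re-entered Progressing=False with a different LastTransitionTime, i.e.
// it flapped through Progressing=True while the cluster was scaling.
func findViolations(before, after map[string]progressingState) []string {
	var violations []string
	for name, b := range before {
		if b.status != configv1.ConditionFalse {
			continue // only COs that started at Progressing=False are judged
		}
		a, ok := after[name]
		if !ok || a.status != configv1.ConditionFalse || !a.lastTransition.Equal(&b.lastTransition) {
			violations = append(violations, name)
		}
	}
	return violations
}
```

Comparing LastTransitionTime, not just the final status, is what catches operators that flap to Progressing=True and back during the scaling window.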

@hongkailiu hongkailiu changed the title ClusterOperators should not go Progressing only for cluster scaling OTA-1637: ClusterOperators should not go Progressing only for cluster scaling Sep 23, 2025
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Sep 23, 2025

openshift-ci-robot commented Sep 23, 2025

@hongkailiu: This pull request references OTA-1637 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.21.0" version, but no target version was set.

In response to this:

This covers the cluster scaling case from the recently introduced rule [1]:

Operators should not report Progressing only because DaemonSets
owned by them are adjusting to a new node from cluster scaleup or
a node rebooting from cluster upgrade.

The check plugs into the existing scaling test: it records each CO's Progressing condition before and after scaling, and flags every CO that either left Progressing=False or re-entered Progressing=False with a different LastTransitionTime.

[1]. https://github.com/openshift/api/blob/61248d910ff74aef020492922d14e6dadaba598b/config/v1/types_cluster_operator.go#L163-L164

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.


hongkailiu commented Sep 24, 2025

This is what I expect to see (from this job):

```
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/30297/pull-ci-openshift-origin-main-e2e-aws-ovn-serial-2of2/1970572328379092992/artifacts/e2e-aws-ovn-serial/openshift-e2e-test/build-log.txt | rg 'failed.*scaling different machineSets simultaneously|fail.*Progressing=False'
fail [github.com/openshift/origin/test/extended/machines/scale.go:253]: those cluster operators left Progressing=False while cluster was scaling: [network image-registry node-tuning storage dns]
failed: (6m0s) 2025-09-23T22:57:26 "[sig-cluster-lifecycle][Feature:Machines][Serial] Managed cluster should grow and decrease when scaling different machineSets simultaneously [Timeout:30m][apigroup:machine.openshift.io] [Suite:openshift/conformance/serial]"
```

And the time matches perfectly.

(Two screenshots, taken 2025-09-25 at 09:39:47 and 09:40:56.)

Interestingly, it is the same list caught by another case. I suspect they stem from the same underlying issue, so the same bug can be shared by the two cases.

@petr-muller

/cc

@openshift-ci openshift-ci bot requested a review from petr-muller September 25, 2025 23:32
@DavidHurta

/cc

@openshift-ci openshift-ci bot requested a review from DavidHurta September 29, 2025 12:22

@petr-muller petr-muller left a comment


LGTM


@petr-muller petr-muller left a comment


/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Sep 30, 2025
This covers the cluster scaling case from the recently introduced
rule [1]:

```
Operators should not report Progressing only because DaemonSets
owned by them are adjusting to a new node from cluster scaleup or
a node rebooting from cluster upgrade.
```

The check plugs into the existing scaling test: it records each
CO's Progressing condition before and after scaling, and flags
every CO that either left Progressing=False or re-entered
Progressing=False with a different LastTransitionTime.

[1]. https://github.com/openshift/api/blob/61248d910ff74aef020492922d14e6dadaba598b/config/v1/types_cluster_operator.go#L163-L164
@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Sep 30, 2025
@hongkailiu

/wip

Creating bugs for exceptions ...


@hongkailiu

/hold

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 30, 2025

openshift-trt bot commented Sep 30, 2025

Job Failure Risk Analysis for sha: 787e9be

pull-ci-openshift-origin-main-e2e-openstack-ovn: IncompleteTests. Tests for this run (143) are below the historical average (2170); not enough tests ran to make a reasonable risk analysis, which could be due to infra, installation, or upgrade problems.

Bugs have already been created for the node-rebooting case. There,
the condition goes Progressing=True for the same reason we observed
during cluster scale-up/down, so we reuse those bugs instead of
filing a new set that would likely be closed as duplicates.

openshift-ci-robot commented Oct 2, 2025

@hongkailiu: This pull request references OTA-1637 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.21.0" version, but no target version was set.

In response to this:

This covers the cluster scaling case from the recently introduced rule [1]:

Operators should not report Progressing only because DaemonSets
owned by them are adjusting to a new node from cluster scaleup or
a node rebooting from cluster upgrade.

The check plugs into the existing scaling test: it records each CO's Progressing condition before and after scaling, and flags every CO that either left Progressing=False or re-entered Progressing=False with a different LastTransitionTime.

Bugs have already been created for the node-rebooting case. There,
the condition goes Progressing=True for the same reason we observed
during cluster scale-up/down, so we reuse those bugs instead of
filing a new set that would likely be closed as duplicates.

[1]. https://github.com/openshift/api/blob/61248d910ff74aef020492922d14e6dadaba598b/config/v1/types_cluster_operator.go#L163-L164


@hongkailiu

The result from e2e-aws-ovn-serial-2of2 looks good:

```
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/30297/pull-ci-openshift-origin-main-e2e-aws-ovn-serial-2of2/1973475766385512448/artifacts/e2e-aws-ovn-serial/openshift-e2e-test/artifacts/e2e.log | grep grow
started: 0/5/36 "[sig-cluster-lifecycle][Feature:Machines][Serial] Managed cluster should grow and decrease when scaling different machineSets simultaneously [Timeout:30m][apigroup:machine.openshift.io] [Suite:openshift/conformance/serial]"
passed: (5m9s) 2025-10-01T22:58:36 "[sig-cluster-lifecycle][Feature:Machines][Serial] Managed cluster should grow and decrease when scaling different machineSets simultaneously [Timeout:30m][apigroup:machine.openshift.io] [Suite:openshift/conformance/serial]"
```

I wanted to show some logs for the exceptions, but that does not seem easy to do when the job succeeds. 🤷

@hongkailiu

/hold cancel

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 2, 2025
@hongkailiu

/verified "periodic-ci-openshift-release-master-nightly-4.21-e2e-aws-ovn-serial-2of2"

@openshift-ci-robot

@hongkailiu: The /verified command must be used with one of the following actions: by, later, remove, or bypass. See https://docs.ci.openshift.org/docs/architecture/jira/#premerge-verification for more information.

In response to this:

/verified "periodic-ci-openshift-release-master-nightly-4.21-e2e-aws-ovn-serial-2of2"


@hongkailiu

/verified by periodic-ci-openshift-release-master-nightly-4.21-e2e-aws-ovn-serial-2of2

@openshift-ci-robot openshift-ci-robot added the verified Signifies that the PR passed pre-merge verification criteria label Oct 7, 2025
@openshift-ci-robot

@hongkailiu: This PR has been marked as verified by periodic-ci-openshift-release-master-nightly-4.21-e2e-aws-ovn-serial-2of2.

In response to this:

/verified by periodic-ci-openshift-release-master-nightly-4.21-e2e-aws-ovn-serial-2of2


```go
			violations = append(violations, operator)
		}
	}
	// A single assertion over the aggregated list of offenders.
	o.Expect(violations).To(o.BeEmpty(), "those cluster operators left Progressing=False while cluster was scaling: %v", violations)
```
Contributor


This will become one of those test failures (if any) that are hard to assign to individual components, since it is a single test that can impact multiple operators.

Could Expect be called in the for loop and include the operator in the name?
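
For illustration only, the suggestion might look like the sketch below; expectNoProgressingFlaps and violatesProgressingRule are hypothetical stand-ins, and o is the gomega alias used in the excerpt above:

```go
import (
	configv1 "github.com/openshift/api/config/v1"
	o "github.com/onsi/gomega"
)

// Hypothetical reshaping of the single assertion above: one Expect per
// operator, so each offender fails with its own message.
func expectNoProgressingFlaps(operators []configv1.ClusterOperator) {
	for _, operator := range operators {
		flapped := violatesProgressingRule(operator) // stand-in predicate
		o.Expect(flapped).To(o.BeFalse(),
			"cluster operator %s left Progressing=False while cluster was scaling", operator.Name)
	}
}
```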

Contributor


Ahh, never mind, I see it isn't a new test, just new information regarding the test failure.


@hongkailiu hongkailiu Oct 7, 2025


Thanks for the review.
I thought about it too: "hard to assign to individual components".
The modified test is an extended test (under /test/extended), and I do not see how to insert a junitapi.JUnitTestCase the way a monitortest does in CollectData and EvaluateTestsFromConstructedIntervals.

And yes, if it fails in the future, we will have to check the error message from Expect and manually triage the OCPBugs.
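
For contrast, a monitortest can emit one JUnit case per offending operator, which is what makes per-component assignment possible there. A rough sketch, assuming origin's junitapi package; the import path and the JUnitTestCase/FailureOutput field names are from memory and should be treated as assumptions:

```go
package monitortests

import (
	"fmt"

	"github.com/openshift/origin/pkg/test/ginkgo/junitapi"
)

// progressingJunits turns each offending operator into its own JUnit test
// case so a failure can be routed to the owning component.
func progressingJunits(violations []string) []*junitapi.JUnitTestCase {
	var junits []*junitapi.JUnitTestCase
	for _, name := range violations {
		junits = append(junits, &junitapi.JUnitTestCase{
			Name: fmt.Sprintf("operator %s should not report Progressing during cluster scaling", name),
			FailureOutput: &junitapi.FailureOutput{
				Output: fmt.Sprintf("cluster operator %s left Progressing=False while cluster was scaling", name),
			},
		})
	}
	return junits
}
```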

@neisw

neisw commented Oct 7, 2025

/approve


openshift-ci bot commented Oct 7, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: hongkailiu, neisw, petr-muller, wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 7, 2025
@hongkailiu

/retest-required

@openshift-ci-robot

/retest-required

Remaining retests: 0 against base HEAD 3581fe4 and 2 for PR HEAD c9c5fa5 in total

@openshift-ci-robot

/retest-required

Remaining retests: 0 against base HEAD d3b6fa6 and 1 for PR HEAD c9c5fa5 in total

@hongkailiu

/test e2e-aws-ovn-serial-2of2
/test e2e-gcp-csi

@hongkailiu

It was green once on the same commit.

/test e2e-gcp-csi


openshift-trt bot commented Oct 9, 2025

Job Failure Risk Analysis for sha: c9c5fa5

pull-ci-openshift-origin-main-e2e-openstack-ovn: IncompleteTests. Tests for this run (143) are below the historical average (2293); not enough tests ran to make a reasonable risk analysis, which could be due to infra, installation, or upgrade problems.
pull-ci-openshift-origin-main-okd-scos-e2e-aws-ovn: IncompleteTests. Tests for this run (140) are below the historical average (1770); not enough tests ran to make a reasonable risk analysis, which could be due to infra, installation, or upgrade problems.


openshift-trt bot commented Oct 9, 2025

Job Failure Risk Analysis for sha: c9c5fa5

pull-ci-openshift-origin-main-e2e-openstack-ovn: IncompleteTests. Tests for this run (143) are below the historical average (2293); not enough tests ran to make a reasonable risk analysis, which could be due to infra, installation, or upgrade problems.
pull-ci-openshift-origin-main-okd-scos-e2e-aws-ovn: IncompleteTests. Tests for this run (140) are below the historical average (1788); not enough tests ran to make a reasonable risk analysis, which could be due to infra, installation, or upgrade problems.

@openshift-ci-robot

/retest-required

Remaining retests: 0 against base HEAD 816619b and 0 for PR HEAD c9c5fa5 in total


openshift-ci bot commented Oct 9, 2025

@hongkailiu: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Failed tests (none required):
ci/prow/e2e-aws-ovn-single-node-serial (c9c5fa5): /test e2e-aws-ovn-single-node-serial
ci/prow/e2e-openstack-ovn (c9c5fa5): /test e2e-openstack-ovn
ci/prow/e2e-aws-ovn-single-node (c9c5fa5): /test e2e-aws-ovn-single-node
ci/prow/e2e-aws-ovn-edge-zones (0d05b6a): /test e2e-aws-ovn-edge-zones
ci/prow/okd-scos-e2e-aws-ovn (c9c5fa5): /test okd-scos-e2e-aws-ovn

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@openshift-ci-robot

/hold

Revision c9c5fa5 was retested 3 times: holding

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 9, 2025

openshift-trt bot commented Oct 9, 2025

Job Failure Risk Analysis for sha: c9c5fa5

pull-ci-openshift-origin-main-e2e-openstack-ovn: IncompleteTests. Tests for this run (143) are below the historical average (2293); not enough tests ran to make a reasonable risk analysis, which could be due to infra, installation, or upgrade problems.
pull-ci-openshift-origin-main-okd-scos-e2e-aws-ovn: IncompleteTests. Tests for this run (140) are below the historical average (1807); not enough tests ran to make a reasonable risk analysis, which could be due to infra, installation, or upgrade problems.


openshift-trt bot commented Oct 9, 2025

Job Failure Risk Analysis for sha: c9c5fa5

pull-ci-openshift-origin-main-e2e-openstack-ovn: IncompleteTests. Tests for this run (143) are below the historical average (2326); not enough tests ran to make a reasonable risk analysis, which could be due to infra, installation, or upgrade problems.
pull-ci-openshift-origin-main-okd-scos-e2e-aws-ovn: IncompleteTests. Tests for this run (140) are below the historical average (1807); not enough tests ran to make a reasonable risk analysis, which could be due to infra, installation, or upgrade problems.

@hongkailiu

/retest-required
/hold cancel

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 9, 2025

openshift-trt bot commented Oct 9, 2025

Job Failure Risk Analysis for sha: c9c5fa5

pull-ci-openshift-origin-main-e2e-openstack-ovn: IncompleteTests. Tests for this run (143) are below the historical average (2326); not enough tests ran to make a reasonable risk analysis, which could be due to infra, installation, or upgrade problems.
pull-ci-openshift-origin-main-okd-scos-e2e-aws-ovn: IncompleteTests. Tests for this run (140) are below the historical average (1921); not enough tests ran to make a reasonable risk analysis, which could be due to infra, installation, or upgrade problems.

@openshift-ci-robot

/retest-required

Remaining retests: 0 against base HEAD 7343864 and 2 for PR HEAD c9c5fa5 in total

@openshift-ci-robot

/retest-required

Remaining retests: 0 against base HEAD bb9f65a and 1 for PR HEAD c9c5fa5 in total


openshift-trt bot commented Oct 10, 2025

Job Failure Risk Analysis for sha: c9c5fa5

pull-ci-openshift-origin-main-e2e-openstack-ovn: IncompleteTests. Tests for this run (143) are below the historical average (2158); not enough tests ran to make a reasonable risk analysis, which could be due to infra, installation, or upgrade problems.
pull-ci-openshift-origin-main-okd-scos-e2e-aws-ovn: IncompleteTests. Tests for this run (140) are below the historical average (1873); not enough tests ran to make a reasonable risk analysis, which could be due to infra, installation, or upgrade problems.


@openshift-merge-bot openshift-merge-bot bot merged commit 440edc3 into openshift:main Oct 10, 2025
11 of 26 checks passed