Skip to content

[Bugfix] Implement upgrade-aware controller ordering for FE/BEs/CNs#704

Closed
jmjm15x wants to merge 1 commit intoStarRocks:mainfrom
jmjm15x:bugfix/upgrade-sequence
Closed

[Bugfix] Implement upgrade-aware controller ordering for FE/BEs/CNs#704
jmjm15x wants to merge 1 commit intoStarRocks:mainfrom
jmjm15x:bugfix/upgrade-sequence

Conversation

@jmjm15x
Copy link

@jmjm15x jmjm15x commented Oct 7, 2025

Description

Add upgrade detection and implement BE-first ordering for cluster upgrades in the StarRocks Kubernetes Operator. Previously, the operator always used FE-first ordering, which is correct for initial deployments but incorrect for upgrades. According to StarRocks guidelines:

  • Initial Deployment: FE → BEs/CNs
  • Cluster Upgrades: BEs/CNs → FE

From official documentation:

Upgrade procedure
By design, BEs and CNs are backward compatible with the FEs. Therefore, you need to upgrade BEs and CNs first and then FEs to allow your cluster to run properly while being upgraded. Upgrading them in an inverted order may lead to incompatibility between FEs and BEs/CNs, and thereby cause the service to crash.

Solution

Implemented an upgrade detection mechanism that dynamically switches controller execution order based on cluster state.
Key Changes

  • isUpgrade() function

    • Detects if the cluster is in the running phase.
    • Compares current deployed images vs. spec images for FE, BE, and CN components.
    • Returns true if the cluster is running and any image has changed.
  • getControllersInOrder() function

    • Returns BE-first ordering during upgrades: [be, cn, fe, feproxy].
    • Returns FE-first ordering for initial deployments: [fe, be, cn, feproxy].
    • Adds clear logging for ordering decisions.

Logic Flow:

Reconcile() called
    ↓
Is cluster phase == "running" AND images changed?
    ↓                           ↓
   YES                         NO
    ↓                           ↓
BE-first ordering          FE-first ordering
(BE → CN → FE → FeProxy)   (FE → BE → CN → FeProxy)

Test Case 1: Initial Deployment (v3.1.0)

Result: Correctly used FE-first ordering

Operator Log:
  INFO  initial deployment detected: using FE-first ordering
  INFO  sub controller sync spec  {"subController": "feController"}
  INFO  sub controller sync spec  {"subController": "beController"}

StatefulSet Creation:
  upgrade-test-fe: 22:14:39Z (created first)
  upgrade-test-be: 22:19:10Z (created 4m31s later, after FE was ready)

Test Case 2: Upgrade (v3.1.0 → v3.1.8)

Result: Correctly detected upgrade and used BE-first ordering

Operator Log:
  INFO  image changes detected in running cluster, this is an upgrade
  INFO  upgrade detected: using BE-first ordering (StarRocks official upgrade procedure)
  INFO  sub controller sync spec  {"subController": "beController"}
  INFO  sub controller sync spec  {"subController": "cnController"}
  INFO  sub controller sync spec  {"subController": "feController"}

StatefulSet Updates:
  upgrade-test-be: Updated at 06:24:26.881 (updated first)
  upgrade-test-fe: Updated at 06:24:26.905 (updated 24ms later)

Benefits

  1. Aligned with StarRocks upgrade guidelines – BE nodes upgraded before FE nodes for compatibility.
  2. Safe initial deployments – FE deployed first to ensure proper cluster initialization.
  3. Automatic detection – Behavior switches dynamically; no manual intervention needed.
  4. Clear diagnostics – Operator logs explicitly show which ordering is applied and why.
  5. Backward compatible – Existing deployments remain unaffected and work as expected.

Verification

To verify the fix works in your environment:

# Check initial deployment uses FE-first
kubectl logs -n <namespace> deployment/kube-starrocks-operator | grep "initial deployment detected"

# Check upgrade uses BE-first
kubectl logs -n <namespace> deployment/kube-starrocks-operator | grep "upgrade detected.*BE-first"

# Verify StatefulSet ordering
kubectl get statefulsets -n <namespace> -o custom-columns=NAME:.metadata.name,CREATED:.metadata.creationTimestamp

Related Issue(s)

#238

Checklist

For operator, please complete the following checklist:

  • run make generate to generate the code.
  • run golangci-lint run to check the code style.
  • run make test to run UT.
  • run make manifests to update the yaml files of CRD.

For helm chart, please complete the following checklist:

  • make sure you have updated the values.yaml
    file of starrocks chart.
  • In scripts directory, run bash create-parent-chart-values.sh to update the values.yaml file of the parent
    chart( kube-starrocks chart).

@CLAassistant
Copy link

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.


jmjm15x seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you have already a GitHub account, please add the email address used for this commit to your account.
You have signed the CLA already but the status is still pending? Let us recheck it.

@jmjm15x jmjm15x force-pushed the bugfix/upgrade-sequence branch from a8d04a0 to a0a8087 Compare October 7, 2025 09:00
Replace static subcontroller array with dynamic ordering based on cluster state:
- Add isUpgrade() state machine detecting phase + image changes
- Implement getControllersInOrder() for runtime controller sequence building
- Switch FE-first (deploy) ↔ BE-first (upgrade) automatically
- Add unit tests for upgrade detection logic and controller ordering verification
- Include state transition logging for debugging

Follows StarRocks upgrade procedures (BE→FE) while preserving safe deployments (FE→BE).

Signed-off-by: jmjm15x <jmjm15x@gmail.com>
@jmjm15x jmjm15x force-pushed the bugfix/upgrade-sequence branch from a0a8087 to 5c79201 Compare October 7, 2025 17:05
@jmjm15x
Copy link
Author

jmjm15x commented Oct 9, 2025

Found a race condition issue and addressed it with a better approach in PR #707. Closing this PR to avoid confusion.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants