
[Bugfix] Implement upgrade-aware controller ordering for FE/BEs/CNs #707

Open
jmjm15x wants to merge 4 commits into StarRocks:main from jmjm15x:bugfix/fe-be-upgrade-sequence

Conversation

jmjm15x commented Oct 9, 2025

Description

Add upgrade sequencing control to the StarRocks Kubernetes Operator to ensure proper component ordering during both initial deployments and upgrades. Previously, the operator always used FE-first ordering, which is correct for initial deployments but incorrect for upgrades. According to StarRocks guidelines:

  • Initial Deployment: FE → BEs/CNs (FE must be leader before workers join)
  • Cluster Upgrades: BEs/CNs → FE (data nodes upgraded before metadata nodes)

From official documentation:

Upgrade procedure
By design, BEs and CNs are backward compatible with the FEs. Therefore, you need to upgrade BEs and CNs first and then FEs to allow your cluster to run properly while being upgraded. Upgrading them in an inverted order may lead to incompatibility between FEs and BEs/CNs, and thereby cause the service to crash.

Solution

Implemented a comprehensive upgrade detection and sequencing mechanism with robust component readiness validation to prevent premature progression between components.

Key Changes

1. Upgrade Detection (isUpgrade())

Detects upgrade scenarios by checking whether StatefulSets already exist with pending changes; compares Generation against ObservedGeneration to detect spec changes (a minimal sketch is shown below).

Why this approach?

  • Simple and reliable: uses Kubernetes-native generation tracking
  • Works for any spec change (images, resources, configs)
  • Handles transient states: correctly identifies an upgrade while a rollout is in progress
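
For reference, a minimal sketch of this kind of check, assuming a controller-runtime client; the function name and signature are illustrative, not this PR's actual code:

// Sketch only: detect a pending spec change on an existing StatefulSet.
package controller

import (
    "context"

    appsv1 "k8s.io/api/apps/v1"
    "k8s.io/apimachinery/pkg/types"
    "sigs.k8s.io/controller-runtime/pkg/client"
)

// isUpgradeSketch reports whether an existing StatefulSet has a pending spec
// change: the controller has not yet observed the latest Generation, or the
// rollout has not yet converged on the new revision.
func isUpgradeSketch(ctx context.Context, c client.Client, namespace, stsName string) bool {
    sts := &appsv1.StatefulSet{}
    if err := c.Get(ctx, types.NamespacedName{Namespace: namespace, Name: stsName}, sts); err != nil {
        // No existing StatefulSet (or lookup failure): treat as initial deployment.
        return false
    }
    return sts.Generation != sts.Status.ObservedGeneration ||
        sts.Status.CurrentRevision != sts.Status.UpdateRevision
}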

2. Controller Ordering (getControllersInOrder())

Dynamically switches controller execution order based on the deployment scenario (see the sketch after this list):

  • Upgrade scenario: [be, cn, fe, feproxy] (BE-first ordering)
  • Initial deployment: [fe, be, cn, feproxy] (FE-first ordering)
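
A minimal sketch of this switch; the subController interface and reconcilerSketch struct below are placeholders standing in for the operator's real sub-controller and reconciler types:

// Sketch only: scenario-based controller ordering.
package controller

// subController is a placeholder for the operator's sub-controller type.
type subController interface {
    GetControllerName() string
}

type reconcilerSketch struct {
    feController, beController, cnController, feProxyController subController
}

// getControllersInOrder returns BE-first ordering for upgrades and FE-first
// ordering for initial deployments, following the StarRocks upgrade guideline.
func (r *reconcilerSketch) getControllersInOrder(isUpgrade bool) []subController {
    if isUpgrade {
        return []subController{r.beController, r.cnController, r.feController, r.feProxyController}
    }
    return []subController{r.feController, r.beController, r.cnController, r.feProxyController}
}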

3. Component Readiness Validation (isComponentReady())

Multi-layer validation that prevents premature component progression: it avoids race conditions, ensures rollout stability, and adds detailed logging for debugging.

Logic Flow

Implements the waiting logic directly in the reconciliation loop (a condensed code sketch follows the diagram):

Reconcile() called
    ↓
Get controllers in order based on isUpgrade()
    ↓
For each controller in order:
    ↓
    ├─ If upgrade && feController
    │   └─ Check BE/CN ready? → If NO, wait and requeue
    │
    ├─ Sync controller (create/update resources)
    │
    ├─ If initial && feController  
    │   └─ Check FE ready? → If NO, wait and requeue
    │
    └─ If upgrade && (beController || cnController)
        └─ Check component ready? → If NO, wait and requeue
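
A condensed code sketch of the same flow; the component struct and the ready callback are placeholders for the PR's sub-controllers and isComponentReady(), not the actual implementation:

// Sketch only: upgrade-aware reconcile loop with wait-and-requeue.
package controller

import "context"

type component struct {
    name string
    sync func(ctx context.Context) error
}

func reconcileSketch(ctx context.Context, upgrade bool, ordered []component,
    ready func(ctx context.Context, name string) bool) (requeue bool, err error) {
    for _, c := range ordered {
        // Upgrade: hold the FE back until BE and CN rollouts have converged.
        if upgrade && c.name == "fe" && (!ready(ctx, "be") || !ready(ctx, "cn")) {
            return true, nil // requeue and retry later
        }
        if err := c.sync(ctx); err != nil {
            return false, err
        }
        // Initial deployment: do not move past the FE until it is ready.
        if !upgrade && c.name == "fe" && !ready(ctx, "fe") {
            return true, nil
        }
        // Upgrade: each BE/CN rollout must complete before the next component.
        if upgrade && (c.name == "be" || c.name == "cn") && !ready(ctx, c.name) {
            return true, nil
        }
    }
    return false, nil
}

Returning requeue simply asks controller-runtime to run the reconcile again later, so the operator never blocks while waiting for a component.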

End-to-End Test Results

Test Case 1: Initial FE+BE Deployment (v3.1.0)

Expected: FE-first ordering (FE must be ready before BE starts)

Timeline:
  T0:       FE Pod Created - 2025-10-09 07:29:58 UTC
  T0+48s:   BE Pod Created - 2025-10-09 07:30:46 UTC

Operator Logs:
  - "initial deployment: waiting for FE to be ready before creating BE/CN"
  - Component progression: feController → wait for FE ready → beController

Verification:
  - FE StatefulSet created first
  - BE StatefulSet created 48 seconds later (after FE became ready)

Test Case 2: Version Upgrade (v3.1.0 → v3.1.8)

Expected: BE-first ordering (BE must complete before FE starts)

Timeline:
  T0:       BE Pod Created - 2025-10-09 07:44:45 UTC
  T0+6s:    FE Pod Created - 2025-10-09 07:44:51 UTC

Operator Logs:
  - "component not ready: StatefulSet spec change not yet observed"
  - "component not ready: StatefulSet rollout in progress"
  - "component not ready: no ready endpoints"
  - "upgrade: waiting for component rollout to complete before proceeding"

Verification:
  - BE StatefulSet updated first (detected generation change)
  - BE pod rolled out to v3.1.8
  - FE update waited for BE rollout completion
  - FE pod rolled out to v3.1.8 after BE was ready

Test Case 3: Configuration Change (Memory: 2Gi → 4Gi)

Expected: BE-first ordering (config changes treated as upgrades)

Timeline:
  T0:       BE Pod Created - 2025-10-09 07:47:05 UTC
  T0+6s:    FE Pod Created - 2025-10-09 07:47:11 UTC

Operator Logs:
  - "component not ready: no ready endpoints"
  - "upgrade: waiting for component rollout to complete before proceeding"

Verification:
  - Configuration change correctly detected as upgrade scenario
  - BE rolled out first with new memory limits (4Gi)
  - FE waited for BE readiness before rolling out
  - Both components running with 4Gi memory

Verification

# Check initial deployment uses FE-first (wait message)
kubectl logs -n <namespace> deployment/kube-starrocks-operator | grep "initial deployment: waiting for FE"

# Check upgrade uses BE-first (wait message)
kubectl logs -n <namespace> deployment/kube-starrocks-operator | grep "upgrade: waiting for component rollout"

# Verify component readiness details
kubectl logs -n <namespace> deployment/kube-starrocks-operator | grep "component not ready"

# Check StatefulSet rollout sequence with timestamps
kubectl get statefulsets -n <namespace> -o json | jq -r '.items[] | "\(.metadata.name): \(.status.currentRevision) -> \(.status.updateRevision)"'

# Verify pod creation/recreation timestamps
kubectl get pods -n <namespace> -o custom-columns=NAME:.metadata.name,CREATED:.metadata.creationTimestamp

Checklist

For operator, please complete the following checklist:

  • run make generate to generate the code.
  • run golangci-lint run to check the code style (0 issues).
  • run make test to run UT (all controller tests passing).
  • run make manifests to update the yaml files of CRD.

For helm chart, please complete the following checklist:

  • make sure you have updated the values.yaml
    file of the starrocks chart.
  • In the scripts directory, run bash create-parent-chart-values.sh to update the values.yaml file of the parent
    chart (kube-starrocks chart).

CLAassistant commented Oct 9, 2025

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
1 out of 2 committers have signed the CLA.

✅ yandongxiao
❌ jmjm15x


jmjm15x seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you have already a GitHub account, please add the email address used for this commit to your account.
You have signed the CLA already but the status is still pending? Let us recheck it.

yandongxiao (Collaborator) commented:

The upgrade sequence you mentioned is indeed a problem, as it doesn't follow the rules. I'd like to ask, have you encountered any issues during this upgrade process?

jmjm15x (Author) commented Oct 9, 2025

Not with this approach; I encountered a critical race condition during upgrades with the previous implementation I tried (#704). When the operator updated a StatefulSet's spec, it would immediately check component readiness using only endpoint availability. However, endpoints don't immediately reflect the new state; the old pods remain "ready" for a few seconds while Kubernetes starts the rollout. This caused the FE to upgrade prematurely, before BE/CN completed their rollouts.

Fix: Implemented isComponentReady() with the following validation (a sketch is shown after the list):

  1. Service endpoints exist
  2. StatefulSet controller observed the spec change (ObservedGeneration check)
  3. Rollout is complete (currentRevision == updateRevision)
  4. All replicas are ready
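
For illustration, a minimal sketch of these four checks, assuming a controller-runtime client; the function name, signature, and the Endpoints lookup are my assumptions, not the exact code in this PR:

// Sketch only: multi-layer component readiness check.
package controller

import (
    "context"

    appsv1 "k8s.io/api/apps/v1"
    corev1 "k8s.io/api/core/v1"
    "k8s.io/apimachinery/pkg/types"
    "sigs.k8s.io/controller-runtime/pkg/client"
)

func isComponentReadySketch(ctx context.Context, c client.Client, namespace, serviceName, stsName string) bool {
    // 1. Service endpoints exist and at least one address is ready.
    eps := &corev1.Endpoints{}
    if err := c.Get(ctx, types.NamespacedName{Namespace: namespace, Name: serviceName}, eps); err != nil {
        return false
    }
    hasReadyAddress := false
    for _, subset := range eps.Subsets {
        if len(subset.Addresses) > 0 {
            hasReadyAddress = true
            break
        }
    }
    if !hasReadyAddress {
        return false
    }

    sts := &appsv1.StatefulSet{}
    if err := c.Get(ctx, types.NamespacedName{Namespace: namespace, Name: stsName}, sts); err != nil {
        return false
    }
    // 2. The StatefulSet controller has observed the latest spec change.
    if sts.Generation != sts.Status.ObservedGeneration {
        return false
    }
    // 3. The rollout has converged on the new revision.
    if sts.Status.CurrentRevision != sts.Status.UpdateRevision {
        return false
    }
    // 4. All desired replicas are ready.
    return sts.Spec.Replicas == nil || sts.Status.ReadyReplicas == *sts.Spec.Replicas
}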

Implementation approach: I kept the existing logic flow and controller structure intact, only enhancing the readiness checks for robustness. This ensures backward compatibility while fixing the race condition.

I included logs from a few E2E tests showing the proper sequencing (BE/CN → FE) in the description.

yandongxiao (Collaborator) commented:

The upgrade sequence you mentioned is indeed a problem, as it doesn't follow the rules. I'd like to ask, have you encountered any issues during this upgrade process?

@jmjm15x What I want to express here is, when you use the current version of the operator and upgrade FE and BE simultaneously, did you encounter any issues? Currently, we have not received any other reports of issues caused by simultaneous FE/BE upgrades.

jmjm15x (Author) commented Oct 16, 2025

The upgrade sequence you mentioned is indeed a problem, as it doesn't follow the rules. I'd like to ask, have you encountered any issues during this upgrade process?

@jmjm15x What I want to express here is, when you use the current version of the operator and upgrade FE and BE simultaneously, did you encounter any issues? Currently, we have not received any other reports of issues caused by simultaneous FE/BE upgrades.

@yandongxiao, sorry for the misunderstanding. Yes, I’ve seen instability during upgrades in our clusters. Our current mitigation is using custom scripts to enforce BE-first ordering for stability, but this workaround risks disrupting the operator workflow.

jmjm15x force-pushed the bugfix/fe-be-upgrade-sequence branch from 16c0162 to 7063994 on October 16, 2025 07:17
jmjm15x added 2 commits October 16, 2025 00:21
Fixes upgrade sequence issues and prevents premature component updates

Key changes:
- Add isUpgrade() detection based on StatefulSet existence
- Implement getControllersInOrder() for scenario-based sequencing
- Add isComponentReady() with endpoint, generation, rollout, and replica checks
- Detect and log corrupted state (BE without FE) with recovery attempt

Signed-off-by: jmjm15x <jmjm15x@gmail.com>
only

Previously, any StatefulSet existence triggered BE-first ordering. Now only actual image changes trigger upgrade ordering, preventing unnecessary use of the upgrade path for all changes.

Remove the redundant checks in the reconcile
method

Signed-off-by: jmjm15x <jmjm15x@gmail.com>
jmjm15x force-pushed the bugfix/fe-be-upgrade-sequence branch from 7063994 to abc305e on October 16, 2025 07:21
yandongxiao (Collaborator) commented:

Please rebase your code onto the latest main branch. Some auto-tests were not being executed, and I have fixed that.

* [Enhancement] Support arrow_flight_port

Signed-off-by: yandongxiao <yandongxiao@starrocks.com>

* [BugFix] fix failed test cases and add test cases for arrow flight

Signed-off-by: yandongxiao <yandongxiao@starrocks.com>

---------

Signed-off-by: yandongxiao <yandongxiao@starrocks.com>
jmjm15x (Author) commented Oct 20, 2025

Please rebase your code onto the latest main branch. Some auto-tests were not being executed, and I have fixed that.

Rebased from the main branch.

yandongxiao (Collaborator) commented:

Please fix the failed test cases; I think you can run make test on your local computer.

yandongxiao (Collaborator) commented:

Another question: I think this PR should exclude the third PR.

yandongxiao (Collaborator) commented:

CLA assistant check: Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution. 1 out of 2 committers have signed the CLA. ✅ yandongxiao ❌ jmjm15x

jmjm15x seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you have already a GitHub account, please add the email address used for this commit to your account.
You have signed the CLA already but the status is still pending? Let us recheck it.

@jmjm15x Please sign the CLA.

Signed-off-by: jmjm15x <jmjm15x@gmail.com>
jmjm15x (Author) commented Oct 22, 2025

Please fix the failed test cases; I think you can run make test on your local computer.

Fixed the broken tests in the last commit.

jmjm15x (Author) commented Oct 22, 2025

Another question: I think this PR should exclude the third PR.

@yandongxiao what do you mean by 3rd PR?

jmjm15x (Author) commented Oct 22, 2025

@yandongxiao I noticed a typo in my CLA signature; that's why it's pending. Could you revoke it so I can sign again?

yandongxiao (Collaborator) commented:

@yandongxiao I noticed a typo in my CLA signature; that's why it's pending. Could you revoke it so I can sign again?

@kevincai

yandongxiao (Collaborator) commented:

Another question: I think this PR should exclude the third PR.

@yandongxiao what do you mean by 3rd PR?

Sorry, there are errors in what I wrote. Now the PR contains four commits, but one commit is from yandongxiao.

jmjm15x (Author) commented Oct 22, 2025

Another question: I think this PR should exclude the third PR.

@yandongxiao what do you mean by 3rd PR?

Sorry, there are errors in what I wrote. Now the PR contains four commits, but one commit is from yandongxiao.

I rebased from main, and I think that's the reason for the 3rd commit.

yandongxiao (Collaborator) left a comment


some names need to be updated

feSts := &appsv1.StatefulSet{}
feExists := kubeClient.Get(ctx, types.NamespacedName{
Namespace: cluster.Namespace,
Name: cluster.Name + "-fe",

use load.Name(cluster.Name, cluster.Spec.StarRocksFeSpec)

beSts := &appsv1.StatefulSet{}
beExists := kubeClient.Get(ctx, types.NamespacedName{
Namespace: cluster.Namespace,
Name: cluster.Name + "-be",

use load.Name(cluster.Name, cluster.Spec.StarRocksBeSpec)

return true // Component not configured, consider it ready
}
serviceName = rutils.ExternalServiceName(cluster.Name, cluster.Spec.StarRocksFeSpec)
statefulSetName = cluster.Name + "-fe"

use load.Name(cluster.Name, cluster.Spec.StarRocksFeSpec), and you can pass a nil pointer for the second parameter.

return true
}
serviceName = rutils.ExternalServiceName(cluster.Name, cluster.Spec.StarRocksBeSpec)
statefulSetName = cluster.Name + "-be"

the same issue

return true
}
serviceName = rutils.ExternalServiceName(cluster.Name, cluster.Spec.StarRocksCnSpec)
statefulSetName = cluster.Name + "-cn"

the same issue

Namespace: cluster.Namespace,
Name: cluster.Name + "-be",
}, beSts) == nil


I think the CN check is missing here

// Corrupted state safeguard: BE exists but FE doesn't (invalid configuration).
// Treat as initial deployment so FE is reconciled first.
// Rationale: FE is a prerequisite for BE/CN; prioritizing FE allows recovery without misordering.
if beExists && !feExists {

this duplicates the following !feExists condition

return false
}

return checkForImageChanges(ctx, kubeClient, cluster)

The above code detects whether the sts exists, and checkForImageChanges compares their images. My suggestion is: can we merge them together?

  1. If the FE spec exists in the cluster, check whether the sts exists, then check the image.


// After syncing, check if we need to wait for this component to be ready before proceeding
// Initial deployment: Wait for FE to be ready before creating BE/CN
if !isUpgradeScenario && controllerName == r.FeController.GetControllerName() {

This brings up a potential issue: for example, if the FE fails to start for some reason, such as excessive metadata or a long startup time, the probe will fail. In that case, users would surely want to modify the probe time, but the logic here prevents that.


In the sub-controller logic there is a fe.CheckFEReady(ctx, be.Client, src.Namespace, src.Name) check; if the FE is not ready, BE/CN will stop reconciling.


If it is an upgrade scenario, then after the FE image is updated, the Operator will no longer consider it an upgrade scenario. So your logic here waits for the FE until it becomes Ready on the next sync. I don't think this operation is really necessary. What do you think?
