
ShubhamRwt (Contributor)

This PR aims to introduce the self-healing feature in Strimzi. This proposal incorporates all the comments and suggestions left on the old proposal #145. It aims to utilize the auto-rebalancing feature of Strimzi to introduce self-healing.

@tomncooper left a comment

I did a first pass. I think this is a better proposal which is more in line with how Strimzi currently uses CC.

I think you need more detail on the interaction with the current auto-rebalancing and also a clearer description of the FSM states and their transitions. I found it hard to follow the sequence you are proposing.

For the notifier, I actually think we should stop users from using custom notifiers (we could make it conditional on whether the full mode is set or not). As we are creating K8s resources in response to detected anomalies, users can create alerting based on that if they need it. If users do need custom behaviour, we could provide implementations of the various notifiers which extend our notifier rather than the CC one.

@scholzj (Member) left a comment

I think this is going in the right direction. But I think it needs to go a bit deeper:

  • We need to establish our own terminology and not take over the Cruise Control one. There is not really any self-healing, and most of the anomalies are not really anomalies.
  • If I read this proposal right, you want to focus on when the cluster is out of balance. That is a great start. But perhaps that should not be called mode: full? Calling it full seems confusing - does it mean that full includes scale-up / scale-down? Also, I guess in the future we would add some actual self-healing to handle broken disks or brokers. That would probably create additional modes. So maybe mode: rebalance or mode: skew or something like that would make more sense?

@ppatierno (Member)

@scholzj good to know that you like the track we are on now :-)

Regarding the "full" related naming, we were just reusing the underlying mode naming for the KafkaRebalance custom resource that will be used for fixing the anomaly (a rebalance which includes the entire cluster).
This is kind of similar to the usage of add-brokers and remove-brokers we are using when auto-rebalancing on scaling.
That said, we can find a better mode name at the higher level but still use the "full" mode at the KafkaRebalance level.
Not sure about mode "rebalance" as suggested because it would be weird within an "autoRebalance" field. The "skew" suggestion could sound better. But also what about something around "goal-violation" or "fix-goal-violation" if we are focusing on such anomaly right now? Anyway, naming is difficult so let's see what others think as well.

@scholzj (Member) commented Jul 17, 2025

> Regarding the "full" related naming, we were just reusing the underlying mode naming for the KafkaRebalance custom resource that will be used for fixing the anomaly (a rebalance which includes the entire cluster).
> This is kind of similar to the usage of add-brokers and remove-brokers we are using when auto-rebalancing on scaling.
> That said, we can find a better mode name at the higher level but still use the "full" mode at the KafkaRebalance level.
> Not sure about mode "rebalance" as suggested because it would be weird within an "autoRebalance" field. The "skew" suggestion could sound better. But also what about something around "goal-violation" or "fix-goal-violation" if we are focusing on such anomaly right now? Anyway, naming is difficult so let's see what others think as well.

I do not think this works here. KafkaRebalance is essentially an imperative API (although implemented through a declarative resource). You are sending a command to the CO to do a full rebalance.

The autoRebalance section in the Kafka CR is a declarative API. You are declaring how the CO should automatically react to some situations. add-brokers and remove-brokers work well in both as each is a command as well as an event description. full IMHO does not work that well in the declarative mode because, as I said, it can be easily interpreted as full == all available options (i.e. including scale-up or scale-down). That is where the idea of skew comes from, as from my understanding in this proposal we are reacting to skew -> the skew can be a CPU imbalance, disk imbalance etc.

goal-violation sounds reasonable ... but I wonder if it is too generic. I assume that the future modes ... e.g. CC's suggestion to scale-up, scale-down, bad distribution across racks, broken disks or brokers ... wouldn't those also be goal violations? But you cannot solve those by creating a KafkaRebalance. So they will need their own modes as well. That is kind of the context in which I'm trying to see the mode names.

@ShubhamRwt changed the title from "Added proposal for self-healing feature in operator" to "Added proposal for auto-rebalance on imbalanced cluster feature in operator" on Jul 24, 2025

This proposal is about adding support for auto-rebalancing the Kafka cluster in case it gets imbalanced due to some issues like unevenly distributed replicas or overloaded brokers e.t.c.
When enabled, the Strimzi operator should automatically resolve these issues detected by the Anomaly Detector Manager by running KafkaRebalance via Cruise Control using the KafkaRebalance resource.
Anomalies are detected by Cruise Control using the anomaly detector manager (see section [ Anomaly Detector Manager](./106-auto-rebalance-on-imbalanced-clusters.md#anomaly-detector-manager) below for a detailed description).
Member:

It's a repetition of the above sentence. Maybe you can delete it and add the link to the anomaly detector manager to the previous sentence.

Member:

+1

Member:

+1


## Motivation

Currently, any anomaly that the user is notified about would need to be fixed manually by using the `KafkaRebalance` custom resource.
Member:

How is a user currently notified about anomalies? What are you referring to?

Yeah, this isn't enabled by default. The user could configure notification but most (I assume) don't.

With smaller clusters, it is feasible to fix things manually. However, for larger ones it can be very time-consuming, or just not feasible, to fix all the anomalies on your own.
It would be useful for users of Strimzi to be able to have these anomalies fixed automatically whenever they are detected.

### Introduction to Self Healing
Member:

Suggested change:
- ### Introduction to Self Healing
+ ### Introduction to Self Healing in Cruise Control


You need a short intro to why this section is here: "In order to set the context, for how we plan to automatically fix unbalanced Kafka clusters, the sections below go over how Cruise Control's anomaly detection and self-healing features work..."


The above flow diagram depicts the self-healing process in Cruise Control.
The anomaly detector manager detects an anomaly (using the detector classes) and forwards it to the notifier.
The configured notifiers provides alerts to the users about the detected anomaly and also returns the action that needs to be taken on the anomaly i.e. whether to fix it, ignore it or delay it.
Member:

The alert mechanism isn't out of the box. A notifier can have its own logic without generating any alerts, even just triggering the fix without notifying anyone about what's happening. So there is no assumption that a "configured" notifier provides alerts. I think this sentence should say that the notifier makes the decision about the action to take. Then CC provides some notifiers which are able to alert the user in several ways (MS Teams, Slack, etc.).


If the users really want to have their own way of dealing with the imbalanced clusters then they can just disable auto-rebalance in `skew` mode and use their own notifier.

#### What happens if some unfixable goal violation happens
In case, there is an unfixable goal violation then the notifier would simply ignore that anomaly and prompt the user about the unfixable violation in the auto-rebalancing status section.
Member:

Still need an example here to better understand how this is prompted to the user.

Member:

Having Prometheus metrics for such cases might be a reasonable default way?

#### What happens if same anomaly is detected again while the auto-rebalance is happening
Since the cluster operator has the knowledge regarding the detected violation, we will ignore the anomalies while the rebalancing is happening. In case the anomaly still exists after the rebalance, Cruise Control will detect it again and a new rebalance would be triggered
Member:

So it seems to assume that if a first anomaly is created, the notifier creates the corresponding ConfigMap and the CO takes care of running a rebalancing. While the rebalancing is running, CC detects other anomalies, so the notifier is creating a bunch of other ConfigMaps that the CO is ignoring. Finally, the rebalancing ends ... the CO will find all these ConfigMaps ... what is it going to do? This is where, if it takes care of them, we could:

  1. lose the priority of them (ConfigMaps don't have priority)
  2. the old anomalies could have been fixed by the previous rebalancing, so it's useless to handle them

Contributor Author:

I think the best option in this case would be to ignore the ConfigMap and also delete it at the same time. I think I didn't mention it here, which is my mistake, but later in the flowchart I say that if anomalies are detected while a rebalance is happening, we will just ignore that ConfigMap and delete it.

* from **RebalanceOnScaleDown** to:
* **RebalanceOnScaleDown**: if a rebalancing on scale down is still running or another one was requested while the first one ended.
* **RebalanceOnScaleUp**: if a scale down operation was requested together with a scale up and, because they run sequentially, the rebalance on scale down had the precedence, was executed first and completed successfully. We can now move on with rebalancing for the scale up.
* **RebalanceOnAnomalyDetection**: if a configmap related to goal violation was detected. It will run once the queued scale down and scale up is completed
Member:

Are we really sure that if a rebalance is running for scale up or scale down, after that we should take care of the anomaly? Is it possible that the anomaly was somehow fixed by the auto-rebalancing due to scale up or down? My gut feeling is that we could avoid taking care of such an anomaly, because if the problem is still in place it will be raised again by CC and then we'll deal with it. @tomncooper @scholzj wdyt?

@tomncooper commented Aug 1, 2025:

I think this ties into your question above Paolo: what happens if a load of anomaly CMs stack up while you are waiting for a scale up or scale down rebalance to finish?

Even if only one anomaly is detected and a CM created, it could be hours old by the time the scaling operation and rebalance are done. The add/remove-broker rebalances can apply goal fixes as well, so they may well fix the original anomaly.

I think you need the concept of freshness for an anomaly. You could just blanket reject (delete) any anomalies detected during an ongoing rebalance.

Member:

I do not think I can really comment as I do not know how this really works in CC. I raised a similar point before with regards to an imbalance that cannot be fixed (e.g. because one partition causing the imbalance is too big etc.). Will it be raised again and again? Do we need to somehow detect those and ignore them? Etc. So this is a bit similar. How do you know whether it was already resolved and whether it will be repeated? 🤷

@ShubhamRwt (Contributor Author) commented Aug 4, 2025:

@tomncooper @ppatierno you are correct, we should ignore and delete the ConfigMap at the same time if a rebalance is happening. I think I didn't mention it here, which is my mistake, but later in the flowchart I show that if anomalies are detected while a rebalance is happening, we will just ignore that ConfigMap and delete it. As for unfixable anomalies which can keep appearing, there is code present in the Cruise Control SelfHealingNotifier which I am going to utilize. That method checks whether the rebalance can be performed for the goal violation or not. If the goal violation cannot be fixed, then we just ignore the anomaly and no ConfigMap would be created in that case.

@tomncooper left a comment

OK, I did another pass. I have a few questions:

  • How are you going to distinguish anomaly CMs from different Kafka clusters in the same namespace? I know it is not recommended, but users do deploy multiple Kafka clusters in the same NS.
  • You need to deal with GC'ing all these anomaly CMs in the case where a rebalance is ongoing. Do you delete them? Do you have some kind of timeout based on the detection interval?
  • It is not clear what you mean by scale up/down auto-rebalances being queued up? I assume you mean generated KafkaRebalance CRs? But it is not clear.



finalizers:
  - strimzi.io/auto-rebalancing
spec:
  mode: skew
Member:

So it will always be full and there will be no new mode like imbalance or skew, right?

#### AnomalyDetectorNotifier

Cruise Control provides the `AnomalyNotifier` interface, which has multiple abstract methods on what to do if certain anomalies are detected.
Some of those methods are:`onGoalViolation()`, `onBrokerFailure()`, `onDiskFailure`, `alert()` etc.
Member:

Suggested change:
- Some of those methods are:`onGoalViolation()`, `onBrokerFailure()`, `onDiskFailure`, `alert()` etc.
+ Some of those methods are: `onGoalViolation()`, `onBrokerFailure()`, `onDiskFailure()`, `alert()`.

I guess you don't have to use etc. here if you are just naming some of them, but I'm not a native speaker :)

# ...
```

The operator will then check if any configmap with prefix `goal-violation` is created or not, if it finds one created then operator will trigger the rebalance.
Member:

Yeah, should it follow the names of other things, like <cluster-name>-goal-violation-<anomalyID>? I guess that will also be easier to find in case you would like to search all namespaces for this kind of ConfigMap.
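
For illustration, a minimal sketch of what such an anomaly ConfigMap could look like, assuming the `<cluster-name>-goal-violation-<anomalyID>` naming suggested above; all names and data keys here are hypothetical, not something the proposal has settled on:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  # hypothetical name following the suggested <cluster-name>-goal-violation-<anomalyID> pattern
  name: my-cluster-goal-violation-a1b2c3
  labels:
    # a label like this would let the operator tell apart CMs of different clusters in one namespace
    strimzi.io/cluster: my-cluster
data:
  anomalyType: GOAL_VIOLATION
  detectionTimeMs: "1721900000000"
  violatedGoals: DiskUsageDistributionGoal
```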


Users cannot configure the notifier if they are utilising the auto-rebalance on imbalanced cluster.
This is because the operator is using our custom notifier for getting alerts about goal violations.
If the users try to override the notifier while the `skew` mode is enabled, the auto-rebalance `skew` configuration then the operator would throw errors in the auto-rebalance status field
@im-konge (Member) commented Aug 8, 2025:

Maybe could you re-phrase it a bit - I'm a bit confused by "the auto-rebalance `skew` configuration then the operator would throw errors". What do you mean by that?

finalizers:
  - strimzi.io/auto-rebalancing
spec:
  mode: skew
Contributor:

I think I would get confused to see a different mode name, full, here. We would have to explain how that maps to the skew or imbalance mode we introduced.

@nickgarvey:

Chiming in as an end user - glad to see this proposal! We have been debating internally whether we want to have a cronjob to issue rebalances; this is a lot better. In particular, the model of using Cruise Control's anomaly detection while issuing the rebalances through KafkaRebalance CRs seems like it will fit perfectly into our workflows.

I see discussion on how to represent the anomalies. Any solution here is fine for us; I envision we will mostly be interacting with the KafkaRebalance CR and not much with anything else.

An area that could be more explicit is the right way to stop all rebalances and not issue any more. Rebalance operations often saturate bandwidth, either disk or network, and cause major latency during producing. We often find ourselves needing to cancel them as we scale and learn our limits. It looks like we might be able to delete mode: skew on the CruiseControl CR to stop automatic rebalances, but it could be clearer.

Thanks for putting this together, excited to see this.

@ppatierno (Member) commented Aug 19, 2025:

@nickgarvey Thanks for the feedback! Usually you are able to stop the current rebalancing by applying the stop annotation on the KafkaRebalance (of course the current batch has to finish first). With auto-rebalancing, the KafkaRebalance is owned by the operator and not by the user. That's good feedback anyway, because there is no clear way for the user to stop an auto-rebalancing in progress. I think you could apply the stop annotation on the KafkaRebalance resource, but you can't delete it due to a finalizer. Then you should delete the corresponding mode within the spec.cruiseControl.autoRebalance.mode field to avoid the re-triggering. It's something to think about.
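
For context, the stop annotation mentioned above is the existing `strimzi.io/rebalance` annotation on the `KafkaRebalance` resource; a minimal sketch, with an illustrative resource name (under auto-rebalancing this resource would be operator-owned):

```yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaRebalance
metadata:
  name: my-rebalance   # illustrative name
  annotations:
    strimzi.io/rebalance: stop   # asks the operator to stop the ongoing rebalance
```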

Comment on lines 10 to 11
With smaller clusters, it is feasible to fix things manually. However, for larger ones it can be very time-consuming, or just not feasible, to fix all the anomalies on your own.
It would be useful for users of Strimzi to be able to have these anomalies fixed automatically whenever they are detected.
Member:

Suggested change:
- With smaller clusters, it is feasible to fix things manually. However, for larger ones it can be very time-consuming, or just not feasible, to fix all the anomalies on your own.
- It would be useful for users of Strimzi to be able to have these anomalies fixed automatically whenever they are detected.
+ In smaller clusters, anomalies can still be fixed manually. But as clusters grow, doing this becomes time-consuming or even impractical. For Strimzi users, it would be highly valuable if such anomalies could be detected and fixed automatically.

* Metric anomaly - This failure happens if metrics collected by Cruise Control have some anomaly in their value (e.g. a sudden rise in the log flush time metrics).

The detected anomalies are inserted into a priority queue where comparator is based upon the priority value and the detection time.
The smaller the priority value and detected time is, the higher priority the anomaly type has.
Member:

Suggested change:
- The smaller the priority value and detected time is, the higher priority the anomaly type has.
+ An anomaly is considered more important if it has a lower priority value and shorter detection time.

They can configure auto-rebalance to enable only for their specific case i.e. setting only `skew` mode or other scaling related modes.
Once the auto-rebalance with `skew` mode is enabled, the operator will be ready to trigger auto-rebalance whenever the cluster becomes imbalanced.
To trigger the auto-rebalance, the operator must know that the cluster is imbalanced due to some goal violation anomaly.
We will create our own custom notifier named `AnomalyDetectorNotifier` to do the same.
Member:

Yeah, that would make it more flexible for future changes, so +1 for naming it in a more generic way...

The auto-rebalance configuration for the `spec.cruiseControl.autoRebalance.template` property in the `Kafka` custom resource is provided through a `KafkaRebalance` custom resource defined as a "template".
That is a `KafkaRebalance` custom resource with the `strimzi.io/rebalance-template: true` annotation set.
When it is created, the `KafkaRebalanceAssemblyOperator` doesn't run any rebalancing.
This is because it doesn't represent an "actual" rebalance request to get an optimization proposal, but it's just the place where configuration related to auto-rebalancing is defined.
Member:

Suggested change:
- This is because it doesn't represent an "actual" rebalance request to get an optimization proposal, but it's just the place where configuration related to auto-rebalancing is defined.
+ This is not an actual rebalance request to get an optimization proposal; it is simply where the configuration for auto-rebalancing is defined.

That is a `KafkaRebalance` custom resource with the `strimzi.io/rebalance-template: true` annotation set.
When it is created, the `KafkaRebalanceAssemblyOperator` doesn't run any rebalancing.
This is because it doesn't represent an "actual" rebalance request to get an optimization proposal, but it's just the place where configuration related to auto-rebalancing is defined.
The user can specify rebalancing goals and other configuration for rebalancing, within the resource.
Member:

Suggested change:
- The user can specify rebalancing goals and other configuration for rebalancing, within the resource.
+ The user can specify rebalancing goals and configuration in the resource.
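
For reference, a template `KafkaRebalance` as described above might look like the following sketch; the goal names are only examples:

```yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaRebalance
metadata:
  name: my-rebalance-template
  annotations:
    strimzi.io/rebalance-template: "true"  # marks this as a template; no rebalance is run for it
spec:
  goals:
    # example goals; the rebalancing configuration used for auto-rebalancing goes here
    - CpuCapacityGoal
    - DiskCapacityGoal
```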

```

The operator will then check if any configmap with prefix `goal-violation` is created or not, if it finds one created then operator will trigger the rebalance.
Separate configmaps would be created for every goal violation such that on completion of the rebalance we can remove the particular configmap.
Member:

Guess the operator will remove the CM instead of users, right?


A[KafkaClusterCreator] --creates--> B[KafkaCluster]
B -- calls --> D[KafkaAutoRebalancingReconciler.reconcile]
D -- check for configmap with goal-violation prefix --> E{if config map present?}
D -- if rebalance in progress --> F[ignore new configmaps and delete them]
Member:

Which CMs will be deleted in that case? My understanding is that KafkaAutoRebalancingReconciler will not create new ones. What about the CMs that are used for the current rebalancing? Should those be deleted by the rebalancing itself, or?

This could cause potential conflicts with other administration operations and is the primary reason self-healing has been disabled until now.
To resolve this issue, we will only make use of Cruise Control's anomaly detection ability, the triggering of the partition reassignments (rebalance) will the responsibility of the Strimzi Cluster Operator.
To enable this, we will use approach based on the existing auto-rebalance for scaling feature (see the [documentation](https://strimzi.io/docs/operators/latest/deploying#proc-automating-rebalances-str) for more details).
We will be using the goal violation anomaly detection related classes in Cruise Control to detect imbalanced cluster and not other detection related class like Disk failures or broker failure.
Member:

Suggested change:
- We will be using the goal violation anomaly detection related classes in Cruise Control to detect imbalanced cluster and not other detection related class like Disk failures or broker failure.
+ We will be using the anomaly detection classes related to goal violations that can be addressed by a partition rebalances but not other anomaly detection classes related to goal violations that would require manual intervention like disk or broker failures.

To resolve this issue, we will only make use of Cruise Control's anomaly detection ability, the triggering of the partition reassignments (rebalance) will the responsibility of the Strimzi Cluster Operator.
To enable this, we will use approach based on the existing auto-rebalance for scaling feature (see the [documentation](https://strimzi.io/docs/operators/latest/deploying#proc-automating-rebalances-str) for more details).
We will be using the goal violation anomaly detection related classes in Cruise Control to detect imbalanced cluster and not other detection related class like Disk failures or broker failure.
THe reason behind it is that disk failures and broker failures can be fixed in a much better way than rebalancing the cluster. It is much easier to spin up a new disk in case of disk failures and in the same way it is better to fix the issue with the broker directly instead just moving the partitions replicas away from it.
Member:

These sentences should be put on their own line. For the first sentence, it may be more direct to say something like:

Suggested change:
- THe reason behind it is that disk failures and broker failures can be fixed in a much better way than rebalancing the cluster. It is much easier to spin up a new disk in case of disk failures and in the same way it is better to fix the issue with the broker directly instead just moving the partitions replicas away from it.
+ The reason behind thus is that disk failures and broker failures can cannot be fixed by rebalancing alone, they require manual intervention.

As for the second sentence, is this the real reason why we are leaving out these specific anomaly detection classes? It seems like we would want to leave them out because the detected issues (disk failure, broker failure, etc) would be non-trivial for the Strimzi Operator to fix (also out of scope of this feature). We want to narrow the scope to goal violations that the Operator can fix with a rebalance.

Contributor Author:

I think the reason I mentioned was that Cruise Control can only do rebalancing, which wouldn't help us fix these failures, so we are not supporting them. But I think it would be good to frame it the way you mentioned.

To enable this, we will use approach based on the existing auto-rebalance for scaling feature (see the [documentation](https://strimzi.io/docs/operators/latest/deploying#proc-automating-rebalances-str) for more details).
We will be using the goal violation anomaly detection related classes in Cruise Control to detect imbalanced cluster and not other detection related class like Disk failures or broker failure.
THe reason behind it is that disk failures and broker failures can be fixed in a much better way than rebalancing the cluster. It is much easier to spin up a new disk in case of disk failures and in the same way it is better to fix the issue with the broker directly instead just moving the partitions replicas away from it.
Doing this will provide us with the following advantages:
Member:

When we say "doing this", do we mean disabling the anomaly detection classes that detect goal violations that would require manual intervention? Or disabling anomaly detection classes that attempt to resolve goal violations that would require manual intervention?

Contributor Author:

By "doing this", I was referring to the proposal's approach of detecting the imbalances using CC and letting the operator fix them.

We will be using the goal violation anomaly detection related classes in Cruise Control to detect imbalanced cluster and not other detection related class like Disk failures or broker failure.
THe reason behind it is that disk failures and broker failures can be fixed in a much better way than rebalancing the cluster. It is much easier to spin up a new disk in case of disk failures and in the same way it is better to fix the issue with the broker directly instead just moving the partitions replicas away from it.
Doing this will provide us with the following advantages:
* we will ensure that the operator is in control of when rebalances will be triggered.
Member:

Suggested change:
- * we will ensure that the operator is in control of when rebalances will be triggered.
+ * we ensure that the operator controls all rebalance and cluster remediation operations.

THe reason behind it is that disk failures and broker failures can be fixed in a much better way than rebalancing the cluster. It is much easier to spin up a new disk in case of disk failures and in the same way it is better to fix the issue with the broker directly instead just moving the partitions replicas away from it.
Doing this will provide us with the following advantages:
* we will ensure that the operator is in control of when rebalances will be triggered.
* using the existing `KafkaRebalance` CR system make it easier for users to see what is happening and when, which (as we don't support the Cruise Control UI) enhances observability and will also aids in debugging.
Member:

Suggested change:
- * using the existing `KafkaRebalance` CR system make it easier for users to see what is happening and when, which (as we don't support the Cruise Control UI) enhances observability and will also aids in debugging.
+ * using the existing `KafkaRebalance` CR system gives more visibility into what is happening and when, which (as we don't support the Cruise Control UI) enhances observability and will also aids in debugging.

The new mode will be called `imbalance`, which means that cluster imbalance was detected and rebalancing should be applied to the all the brokers.
The mode is defined by setting the `spec.cruiseControl.autoRebalance.mode` field as `imbalance` and the corresponding rebalancing configuration is defined as a reference to a "template" `KafkaRebalance` custom resource, by using the `spec.cruiseControl.autoRebalance.template` field as a [LocalObjectReference](https://kubernetes.io/docs/reference/kubernetes-api/common-definitions/local-object-reference/).
This field is optional and if not specified, the auto-rebalancing runs with the default Cruise Control configuration (i.e. the same used for unmodified manual `KafkaRebalance` invocations).
To provide users more flexibility, they only have to configure the auto-rebalance modes they wish to customise.
Member:

Suggested change:
- To provide users more flexibility, they only have to configure the auto-rebalance modes they wish to customise.
+ To provide users more flexibility, they only have to configure the auto-rebalance modes they wish to use whether it be `add-brokers`, `remove-brokers`, or `imbalance`.
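
To make the configuration concrete, here is a sketch of the `Kafka` CR section described above, assuming the proposed mode lands under the name `imbalance`; the template references are optional:

```yaml
spec:
  cruiseControl:
    autoRebalance:
      # existing scaling-related modes
      - mode: add-brokers
        template:
          name: my-rebalance-template
      - mode: remove-brokers
        template:
          name: my-rebalance-template
      # proposed new mode for imbalanced clusters
      - mode: imbalance
        template:
          name: my-rebalance-template
```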

This field is optional and if not specified, the auto-rebalancing runs with the default Cruise Control configuration (i.e. the same used for unmodified manual `KafkaRebalance` invocations).
To provide users more flexibility, they only have to configure the auto-rebalance modes they wish to customise.
They don't require to set up all the modes and can enable the modes they require.
They can configure auto-rebalance to enable only for their specific case i.e. setting only `imbalance` mode or other scaling related modes.
Member:

It seems like the first sentence of the above three summarizes this well enough; we could probably remove the bottom two sentences.

To provide users more flexibility, they only have to configure the auto-rebalance modes they wish to customise.
They don't require to set up all the modes and can enable the modes they require.
They can configure auto-rebalance to enable only for their specific case i.e. setting only `imbalance` mode or other scaling related modes.
Once the auto-rebalance with `imbalance` mode is enabled, the operator will be ready to trigger auto-rebalance whenever the cluster becomes imbalanced.
Member:

Suggested change:
- Once the auto-rebalance with `imbalance` mode is enabled, the operator will be ready to trigger auto-rebalance whenever the cluster becomes imbalanced.
+ When the auto-rebalance configuration is set with `imbalance` mode enabled, the operator will trigger a partition rebalance whenever a goal violation is detected by the anomaly detector.

To trigger the auto-rebalance, the operator must know that the cluster is imbalanced due to some goal violation anomaly.
We will create our own custom notifier named `StrimziCruiseControlNotifier` to do the same.
This notifier's job will be to update the operator regarding the goal violations so that the operator can trigger a rebalance (see section [AnomalyDetectorNotifier](./106-auto-rebalance-on-imbalanced-clusters.md#anomalydetectornotifier)).
With this proposal, we are only going to support auto-rebalance on imbalanced cluster.
Member:

I am not sure I understand, does this mean that the operator will only trigger a partition rebalance for goal violations that don't require manual intervention?

We will create our own custom notifier named `StrimziCruiseControlNotifier` to do the same.
This notifier's job will be to update the operator regarding the goal violations so that the operator can trigger a rebalance (see section [AnomalyDetectorNotifier](./106-auto-rebalance-on-imbalanced-clusters.md#anomalydetectornotifier)).
With this proposal, we are only going to support auto-rebalance on imbalanced cluster.
We also plan to implement the same for topic and metrics related issues, but it will be part of future work since their implementation require different approach.
Member:

It might be worth coming up with some terminology to distinguish the different types of goal violations that are flagged by the anomaly detector. Then it'd be easier to describe what will and will not be implemented as part of this feature versus features in the future, e.g. something like:

  • resource distribution violations: violations that can be resolved by a partition rebalance.
  • component failure violations: violations that can be resolved through manual intervention (disk or broker failure)
  • ...

We would just need to look at the set of possible violations and divide them into groups based on how they are resolved.

Contributor Author:

The notifier that we will create will ignore all the broker and disk related anomalies, so I wonder why we need to come up with terminologies for goal violations. I have also configured the notifier to alert the users in cases where a goal like DiskDistributionGoal is violated and it cannot be fixed even by rebalancing, so in that case the anomaly will be ignored.

Member:

I think Shubham is just using the same terminology we have in Cruise Control. Any "goal violation" anomaly is related to the violation of goals as we know them from Cruise Control.
But disk or broker failures are two different things, they are not under the "goal violation" umbrella.
As well as the topic and metrics anomalies he mentioned.
All of them won't be covered here, just the "goal violation". What's your doubt about that naming @kyguy?

Member:

> But disk or broker failures are two different things, they are not under the "goal violation" umbrella.
> As well as the topic and metrics anomaly he mentioned.
> All of them won't be covered here, but just the "goal violation". What's your doubt about that naming

We don't necessarily have to come up with new terminology here, I just wasn't sure what issues are and are not covered by the statement "we are only going to support auto-rebalance on imbalanced cluster.". The term "imbalanced cluster" seemed a little ambiguous here and in other places in the proposal if it doesn't cover all the issues that could cause an imbalanced cluster, e.g. broker failure.

If we are covering all the "goal violations", maybe we can just say "we are only going to support auto-rebalance whenever goal violations are detected". However, it sounds like we aren't planning on auto-rebalancing on all goal violations, e.g. when "DiskDistributionGoal is violated and it cannot be fixed even by rebalancing".

It looks like the current iteration of the proposal clears up this distinction anyway, so maybe we can safely remove the line "With this proposal, we are only going to support auto-rebalance on imbalanced cluster." or say
"With this proposal, we are only going to support auto-rebalance whenever goal violations are detected and the violation can be addressed by a rebalance".

Member:

I guess that by DiskDistributionGoal you are referring to DiskUsageDistributionGoal, right?
Now, you had a valid point regarding non-fixable violations. I would assume that when the anomaly detector detects a DiskUsageDistributionGoal violation it doesn't know yet whether it's fixable or not. Our notifier just adds the goal violation anomaly to the ConfigMap and the operator will take care of it by creating a KafkaRebalance resource to fix it. I think that, if the violation is not fixable via a rebalancing, CC will respond by NOT providing a proposal, is that right? Or will it provide a proposal and then fail when the actual rebalancing is requested?

Finally, I agree that we can replace "With this proposal, we are only going to support auto-rebalance on imbalanced cluster." with "With this proposal, we are only going to support auto-rebalance whenever goal violations are detected and the violation can be addressed by a rebalance".

Member:

> I guess that by DiskDistributionGoal you are referring to DiskUsageDistributionGoal. Right?

Sorry, yes that is what I meant.

> I think that, if the violation is not fixable via a rebalancing, CC will respond by NOT providing a proposal, is that right? Or it will provide a proposal and then when the actual rebalancing is requested then it will fail?

I believe the former.

> Our notifier just adds the goal violation anomaly to the ConfigMap and the operator will take care by creating a KafkaRebalance resource to fix it.

That makes sense. I guess there will be times when an imbalance due to a disk failure could be solved by a rebalance and sometimes not; we delegate the decision to Cruise Control.

> Finally, I agree that we can replace "With this proposal, we are only going to support auto-rebalance on imbalanced cluster." with "With this proposal, we are only going to support auto-rebalance whenever goal violations are detected and the violation can be addressed by a rebalance"

Sounds good to me!

Signed-off-by: ShubhamRwt <[email protected]>

Currently, if the cluster is imbalanced, the user would need to manually rebalance the cluster by using the `KafkaRebalance` custom resource.
With smaller clusters, it is feasible to fix things manually. However, for larger ones it can be very time-consuming, or just not feasible, to fix all the imbalances on your own.
It would be useful for users of Strimzi to be able to have these imbalanced cluster balanced automatically.
Member:

Suggested change:
- It would be useful for users of Strimzi to be able to have these imbalanced cluster balanced automatically.
+ It would be useful for users of Strimzi to be able to have these imbalanced clusters balanced automatically.


The above flow diagram depicts the self-healing process in Cruise Control.
The anomaly detector manager detects an anomaly (using the detector classes) and forwards it to the notifier.
The notifier then decides what action to take on the anomaly whether to fix it, ignore it or delay. Cruise Control provides various notifiers to alert the users about the detected anomaly in several ways like Slack, Alerta, MS Teams etc.
Member:

Two sentences are on the same line.

It acts as a coordinator between the detector classes and the classes which will handle resolving the anomalies.
Various detector classes like `GoalViolationDetector`, `DiskFailureDetector`, `KafkaBrokerFailureDetector` etc. are used for the anomaly detection, which runs periodically to check if the cluster has their corresponding anomalies or not.
The frequency of this check can be changed via the `anomaly.detection.interval.ms` configuration.
Detector classes have different mechanisms to detect their corresponding anomalies.
Member:

Suggested change:
- Detector classes have different mechanisms to detect their corresponding anomalies.
+ Detector classes use different mechanisms to detect their corresponding anomalies.
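
As an aside, the detection interval mentioned in the hunk above would presumably be tuned through the Cruise Control config in the `Kafka` CR; a sketch with an illustrative value:

```yaml
spec:
  cruiseControl:
    config:
      # run the anomaly detectors every 5 minutes (illustrative value)
      anomaly.detection.interval.ms: 300000
```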

Furthermore, `MetricAnomalyDetector` use metrics and `GoalViolationDetector` uses the load distribution to detect their anomalies.
The detected anomalies can be of various types:
* Goal Violation - This happens if certain [optimization goals](https://strimzi.io/docs/operators/in-development/deploying#optimization_goals) are violated (e.g. DiskUsageDistributionGoal etc.). These goals can be configured through the `self.healing.goals` option in Cruise Control configuration. However, this option is forbidden in the `spec.cruiseControl.config` section of the `Kafka` CR.
* Topic Anomaly - Where one or more topics in cluster violates user-defined properties (e.g. some partitions are too large in disk).
Member:

Suggested change:
- * Topic Anomaly - Where one or more topics in cluster violates user-defined properties (e.g. some partitions are too large in disk).
+ * Topic Anomaly - When one or more topics in the cluster violate user-defined properties (e.g. some partitions are too large on disk).

* Goal Violation - This happens if certain [optimization goals](https://strimzi.io/docs/operators/in-development/deploying#optimization_goals) are violated (e.g. DiskUsageDistributionGoal etc.). These goals can be configured through the `self.healing.goals` option in Cruise Control configuration. However, this option is forbidden in the `spec.cruiseControl.config` section of the `Kafka` CR.
* Topic Anomaly - Where one or more topics in cluster violates user-defined properties (e.g. some partitions are too large in disk).
* Broker Failure - This happens when a non-empty broker crashes or leaves a cluster for a long time.
* Disk Failure - This failure happens if one of the non-empty disks fails (related to a Kafka Cluster with JBOD disks).
Member:

Suggested change:
- * Disk Failure - This failure happens if one of the non-empty disks fails (related to a Kafka Cluster with JBOD disks).
+ * Disk Failure - This failure happens when one of the non-empty disks fails (in a Kafka cluster with JBOD disks).

## Motivation

Currently, if the cluster is imbalanced, the user would need to manually rebalance the cluster by using the `KafkaRebalance` custom resource.
With smaller clusters, it is feasible to fix things manually. However, for larger ones it can be very time-consuming, or just not feasible, to fix all the imbalances on your own.
Member:

It's also worth noting that configuring a Kafka cluster to detect and report partition imbalances in the first place also requires manual effort. Currently, users must set up and tune the anomaly detection settings themselves. One likely benefit of implementing this feature is that it would provide sensible default configurations which would help get users started with detecting partition imbalances.


The above flow diagram depicts the self-healing process in Cruise Control.
The anomaly detector manager detects an anomaly (using the detector classes) and forwards it to the notifier.
The notifier then decides what action to take on the anomaly whether to fix it, ignore it or delay. Cruise Control provides various notifiers to alert the users about the detected anomaly in several ways like Slack, Alerta, MS Teams etc.
Member:

From what I understand, the notifier decides what action to take on the anomaly, whether to fix it, ignore it or delay it, based on the user-provided Cruise Control server self.healing configurations, e.g. self.healing.broker.failure.enabled, self.healing.goal.violation.enabled, etc.

Is that correct?

Member:

The above are properties that Cruise Control provides to the notifiers, but you can even create your own notifier which doesn't take them into account. Within CC you have a base notifier class which takes the above into account, but you are not forced to write your own notifier by inheriting the base one.
That said, in general yes, it's the notifier deciding if an anomaly should be fixed or not. Its response to the anomaly detector determines what to do next. As a bad but "on the hype" example, the notifier could even use some AI :-P by providing it with the anomaly and getting an idea of whether the anomaly needs a fix or not.

Member:

Ah yes, thanks, that makes sense.

Currently, in any such scenario these issues need to be fixed manually i.e. if the cluster is imbalanced then a user might instruct Cruise Control to move the partition replicas across the brokers in order to fix the imbalance using the `KafkaRebalance` custom resource.

Users can currently enable anomaly detection and can also [set](https://strimzi.io/docs/operators/latest/full/deploying.html#setting_up_alerts_for_anomaly_detection) the notifier to one of those included with Cruise Control (`SelfHealingNotifier`, `AlertaSelfHealingNotifier`, `SlackSelfHealingNotifier` etc.).
All the `self.healing` prefixed properties were disabled in Strimzi's Cruise Control integration because, initially, it was not clear how self-healing would act if pods were rolled in middle of rebalances or how Strimzi triggered manual rebalances should interact with Cruise Control triggered self-healing ones.
Member:

Suggested change:
- All the `self.healing` prefixed properties were disabled in Strimzi's Cruise Control integration because, initially, it was not clear how self-healing would act if pods were rolled in middle of rebalances or how Strimzi triggered manual rebalances should interact with Cruise Control triggered self-healing ones.
+ All the `self.healing` prefixed properties are currently disabled in Strimzi's Cruise Control integration because, initially, it was not clear how self-healing would act if pods were rolled in middle of rebalances or how Strimzi triggered manual rebalances should interact with Cruise Control triggered self-healing ones.


This proposal allows users to have their cluster rebalanced automatically whenever it becomes imbalanced due to an overloaded broker, CPU usage, etc.
If we were to enable the self-healing ability of Cruise Control then, in response to detected anomalies, Cruise Control would issue partition reassignments without involving the Strimzi Cluster Operator.
This could cause potential conflicts with other administration operations and is the primary reason self-healing has been disabled until now.

Member

Since the self-healing feature of Cruise Control isn't being used as part of this proposal, would the two sentences above be better suited in a "Rejected Alternatives" section at the end of the proposal?

To resolve this issue, we will only make use of Cruise Control's anomaly detection ability; triggering the partition reassignments (the rebalance) will be the responsibility of the Strimzi Cluster Operator.
To enable this, we will use an approach based on the existing auto-rebalance on scaling feature (see the [documentation](https://strimzi.io/docs/operators/latest/deploying#proc-automating-rebalances-str) for more details).
We will use the anomaly detection classes related to goal violations that can be addressed by a partition rebalance, but not the detection classes for other anomaly types such as disk or broker failures.
The reason is that disk failures and broker failures cannot be fixed by rebalancing alone; they require manual intervention.

Member

Suggested change:

- TThe reason behind thus is that disk failures and broker failures can cannot be fixed by rebalancing alone, they require manual intervention.
+ The reason behind thus is that disk failures and broker failures can cannot be fixed by rebalancing alone, they require manual intervention.
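
To make the proposed wiring concrete, here is a sketch of how the new mode could sit alongside the existing scaling modes in the `autoRebalance` section of the Kafka CR (the `imbalance` mode name is the one used later in this proposal and is still under discussion; the optional template reference follows the existing convention):

```yaml
spec:
  cruiseControl:
    autoRebalance:
      # existing auto-rebalancing on scaling
      - mode: add-brokers
        template:
          name: my-rebalance-template
      - mode: remove-brokers
        template:
          name: my-rebalance-template
      # proposed: rebalance when a fixable goal violation is detected
      - mode: imbalance
        template:
          name: my-rebalance-template
```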

@scholzj scholzj left a comment

I left some comments. Some are nits, some are questions, etc. I feel like it would be great to have more clarifications on:

  • How are the anomalies removed from the ConfigMap?
  • How exactly do we prevent repeated imbalances which cannot be fixed?

Various detector classes like `GoalViolationDetector`, `DiskFailureDetector`, `KafkaBrokerFailureDetector` etc. are used for anomaly detection; they run periodically to check whether the cluster has their corresponding anomalies.
The frequency of this check can be changed via the `anomaly.detection.interval.ms` configuration.
Detector classes have different mechanisms to detect their corresponding anomalies.
For example, `KafkaBrokerFailureDetector` utilises the Kafka Metadata API, whereas `DiskFailureDetector` and `TopicAnomalyDetector` utilise the Kafka Admin API.

Member

What is Kafka Metadata API?
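
As an illustration of the settings mentioned above, the detection interval and the goals checked by the `GoalViolationDetector` can be tuned through the Cruise Control options in the Kafka CR. A sketch, with example values and an example goal list (option names are from the Cruise Control documentation):

```yaml
spec:
  cruiseControl:
    config:
      # run the anomaly detectors every 10 minutes
      anomaly.detection.interval.ms: 600000
      # goals whose violation the GoalViolationDetector reports
      anomaly.detection.goals: >
        com.linkedin.kafka.cruisecontrol.analyzer.goals.RackAwareGoal,
        com.linkedin.kafka.cruisecontrol.analyzer.goals.ReplicaCapacityGoal,
        com.linkedin.kafka.cruisecontrol.analyzer.goals.DiskCapacityGoal
```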


Whenever anomalies are detected, Cruise Control provides the ability to notify the user about them using optional notifier classes.
The notifications sent by these classes increase the visibility of the operations taken by Cruise Control.
The notifier class used by Cruise Control is configurable, and custom notifiers can be used by setting the `anomaly.notifier.class` property.

Member

Is it always only one notifier? Or can there be more of them?
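
For context, in a Strimzi-managed cluster the notifier class is selected through the Cruise Control configuration in the Kafka CR. A sketch using the Slack notifier (class and option names are from the Cruise Control documentation; the webhook and channel values are placeholders):

```yaml
spec:
  cruiseControl:
    config:
      anomaly.notifier.class: com.linkedin.kafka.cruisecontrol.detector.notifier.SlackSelfHealingNotifier
      slack.self.healing.notifier.webhook: https://hooks.slack.com/services/...
      slack.self.healing.notifier.channel: "#kafka-alerts"
```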


The default `NoopNotifier` always sets the notifier action to `IGNORE`, which means that the detected anomaly is silently ignored and no notification is sent to the user.

Cruise Control also provides [custom notifiers](https://github.com/linkedin/cruise-control/wiki/Configure-notifications) like the Slack notifier, Alerta notifier etc. for notifying users about anomalies. There are several other [self-healing notifier](https://github.com/linkedin/cruise-control/wiki/Configurations#selfhealingnotifier-configurations) configurations that can be used to tune notifier behaviour for a particular use case.

Member

How does something like a Slack notifier work? Does it send a message to Slack and mark the anomaly as IGNORE?
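
To make the tuning options above concrete, here is a sketch of two `SelfHealingNotifier` threshold settings (option names are from the Cruise Control documentation; the values are examples):

```yaml
spec:
  cruiseControl:
    config:
      anomaly.notifier.class: com.linkedin.kafka.cruisecontrol.detector.notifier.SelfHealingNotifier
      # alert about a broker failure once it has persisted for 15 minutes
      broker.failure.alert.threshold.ms: 900000
      # consider the failure eligible for self-healing after 30 minutes
      broker.failure.self.healing.threshold.ms: 1800000
```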

Even under normal operation, it's common for Kafka clusters to encounter problems such as partition key skew leading to an uneven partition distribution, or hardware issues like disk failures, which can degrade the cluster's overall health and performance.

Users can currently enable anomaly detection and can also [set](https://strimzi.io/docs/operators/latest/full/deploying.html#setting_up_alerts_for_anomaly_detection) the notifier to one of those included with Cruise Control (`SelfHealingNotifier`, `AlertaSelfHealingNotifier`, `SlackSelfHealingNotifier` etc.).

Member

I guess if they can set the option they can set it to anything, including a custom notifier? Or how does Strimzi prevent the use of custom notifiers today?

Comment on lines +75 to +76

Member

So, what is the actual consequence of this? Users can use the anomaly detection and use, for example, a notifier which sends them a Slack message. But no self-healing is ever done?


#### What happens if an unfixable goal violation happens

If there is an unfixable goal violation, e.g. the disk usage distribution goal (`DiskUsageDistributionGoal`) is violated but cannot be fixed even by a rebalance because all the disks are already full, the notifier will simply ignore that anomaly. This is because Cruise Control first checks whether the violated goal can be fixed by internally running a dry run. If the violated goal is unfixable, it is ignored and not added to the ConfigMap, but the user will be informed about the unfixable violation in the status section of the Kafka CR.

Member

Ehh, you will need to have it in the ConfigMap in order to add it to the status. So this needs more detail.
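
Purely as an illustration of the reporting idea, and not an agreed API (the condition type, reason, and message below are hypothetical), the Kafka CR status could surface an unfixable violation roughly like this:

```yaml
status:
  conditions:
    - type: Warning # hypothetical condition for an unfixable violation
      status: "True"
      reason: UnfixableGoalViolation
      message: DiskUsageDistributionGoal is violated but cannot be fixed by a rebalance
```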


### Auto-rebalancing execution for `imbalance` mode

#### Auto-rebalancing Finite State Machine (FSM) for `imbalance` mode

Member

Suggested change:

- ### Auto-rebalancing Finite State Machine (FSM) for `imbalance` mode
+ #### Auto-rebalancing Finite State Machine (FSM) for `imbalance` mode

* **RebalanceOnScaleDown**: a rebalancing related to a scale down operation is running.
* **RebalanceOnScaleUp**: a rebalancing related to a scale up operation is running.

With the new `imbalance` mode, we will introduce a new FSM state called `RebalanceOnAnomalyDetection`.

Member

Should it be RebalanceOnImbalance instead if the type is imbalance?

If, during an ongoing auto-rebalancing, the `KafkaRebalance` custom resource is not there anymore on the next reconciliation, it could mean the user deleted it while the operator was stopped/crashed/not running.
In this case, the FSM will treat it as `NotReady`, falling into the last case above.
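
For orientation, a sketch of how the new state could appear in the existing `status.autoRebalance` section of the Kafka CR (the layout follows the current auto-rebalancing on scaling feature; the state name is the one proposed above):

```yaml
status:
  autoRebalance:
    # the cluster operator is running a rebalance triggered by a detected anomaly
    state: RebalanceOnAnomalyDetection
    lastTransitionTime: "2025-07-17T10:15:30Z"
```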

Member

Should have a backwards compatibility section as well to clarify/summarize all the compatibility issues (the custom notifier I guess being the only one).


## Affected/not affected projects

This change will affect the Strimzi cluster operator and a new repository named `strimzi-notifier` will be added under the Strimzi organisation.

Member

+1 for a separate repository for the notifier. But that should likely already be detailed earlier in the proposal.
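
If the operator wires the new notifier in automatically, the net effect on the Cruise Control configuration would be roughly the following (the class name is hypothetical, pending the proposed `strimzi-notifier` repository):

```yaml
spec:
  cruiseControl:
    config:
      # hypothetical class name from the proposed strimzi-notifier repository
      anomaly.notifier.class: io.strimzi.kafka.notifier.StrimziNotifier
```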
