
ShubhamRwt (Contributor)

This PR aims to introduce the self-healing feature in Strimzi. This proposal incorporates all the comments and suggestions left on the old proposal #145. It aims to utilize the auto-rebalancing feature of Strimzi to introduce self-healing.

@tomncooper left a comment

I did a first pass. I think this is a better proposal which is more in line with how Strimzi currently uses CC.

I think you need more detail on the interaction with the current auto-rebalancing and also a clearer description of the FSM states and their transitions. I found it hard to follow the sequence you are proposing.

For the notifier, I actually think we should stop users from using custom notifiers (we could make it conditional on whether the full mode is set or not). As we are creating K8s resources in response to detected anomalies, users can create alerting based on that if they need it. If users do need custom behaviour, we could provide implementations of the various notifiers which extend our notifier rather than the CC one.

@scholzj (Member) left a comment

I think this is going in the right direction. But I think it needs to go a bit deeper:

  • We need to establish our own terminology and not take over the Cruise Control one. There is not really any self-healing, and most of the anomalies are not really anomalies.
  • If I read this proposal right, you want to focus on when the cluster is out of balance. That is a great start. But perhaps that should not be called mode: full? Calling it full seems confusing - does it mean that full includes scale-up / scale-down? Also, I guess in the future we would add some actual self-healing to handle broken disks or brokers. That would probably create additional modes. So maybe mode: rebalance or mode: skew or something like that would make more sense?

@ppatierno (Member)

@scholzj good to know that you like the track we are on now :-)

Regarding the "full" related naming, we were just reusing the underlying mode naming for the KafkaRebalance custom resource that will be used for fixing the anomaly (a rebalance which includes the entire cluster).
This is kind of similar to the usage of add-brokers and remove-brokers we are using when auto-rebalancing on scaling.
That said, we can find a better mode name at the higher level but still use the "full" mode at the KafkaRebalance level.
Not sure about mode "rebalance" as suggested because it would be weird within an "autoRebalance" field. The "skew" suggestion could sound better. But also what about something around "goal-violation" or "fix-goal-violation" if we are focusing on such anomaly right now? Anyway, naming is difficult so let's see what others think as well.

@scholzj (Member) commented Jul 17, 2025

> Regarding the "full" related naming, we were just reusing the underlying mode naming for the KafkaRebalance custom resource that will be used for fixing the anomaly (a rebalance which includes the entire cluster).
> This is kind of similar to the usage of add-brokers and remove-brokers we are using when auto-rebalancing on scaling.
> That said, we can find a better mode name at the higher level but still use the "full" mode at the KafkaRebalance level.
> Not sure about mode "rebalance" as suggested because it would be weird within an "autoRebalance" field. The "skew" suggestion could sound better. But also what about something around "goal-violation" or "fix-goal-violation" if we are focusing on such anomaly right now? Anyway, naming is difficult so let's see what others think as well.

I do not think this works here. KafkaRebalance is essentially an imperative API (although implemented through a declarative resource). You are sending a command to the CO to do a full rebalance.

The autoRebalance section in the Kafka CR is a declarative API. You are declaring how the CO should automatically react to some situations. add-brokers and remove-brokers work well in both as each is a command as well as an event description. full IMHO does not work that well in the declarative mode because, as I said, it can be easily interpreted as full == all available options (i.e. including scale-up or scale-down). That is where the idea of skew comes from, as from my understanding in this proposal we are reacting to skew -> the skew can be a CPU imbalance, disk imbalance etc.

goal-violation sounds reasonable ... but I wonder if it is too generic. I assume that the future modes ... e.g. CC's suggestion to scale-up, scale-down, bad distribution across racks, broken disks or brokers ... wouldn't those also be goal violations? But you cannot solve those by creating a KafkaRebalance. So they will need their own modes as well. That is kind of the context in which I'm trying to see the mode names.

@ShubhamRwt changed the title from "Added proposal for self-healing feature in operator" to "Added proposal for auto-rebalance on imbalanced cluster feature in operator" on Jul 24, 2025

This proposal is about adding support for auto-rebalancing the Kafka cluster in case it gets imbalanced due to some issues like unevenly distributed replicas or overloaded brokers e.t.c.
When enabled, the Strimzi operator should automatically resolve these issues detected by the Anomaly Detector Manager by running KafkaRebalance via Cruise Control using the KafkaRebalance resource.
Anomalies are detected by Cruise Control using the anomaly detector manager (see section [ Anomaly Detector Manager](./106-auto-rebalance-on-imbalanced-clusters.md#anomaly-detector-manager) below for a detailed description).
Member:

It's a repetition of the above sentence. Maybe you can delete it and add the link to the anomaly detector manager to the previous sentence.

Member:

+1

Member:

+1


## Motivation

Currently, any anomaly that the user is notified about would need to be fixed manually by using the `KafkaRebalance` custom resource.
Member:

How is a user currently notified about anomalies? What are you referring to?

Yeah, this isn't enabled by default. The user could configure notification but most (I assume) don't.

With smaller clusters, it is feasible to fix things manually. However, for larger ones it can be very time-consuming, or just not feasible, to fix all the anomalies on your own.
It would be useful for users of Strimzi to be able to have these anomalies fixed automatically whenever they are detected.

### Introduction to Self Healing
Member:

Suggested change:
- ### Introduction to Self Healing
+ ### Introduction to Self Healing in Cruise Control


You need a short intro to why this section is here: "In order to set the context, for how we plan to automatically fix unbalanced Kafka clusters, the sections below go over how Cruise Control's anomaly detection and self-healing features work..."


The above flow diagram depicts the self-healing process in Cruise Control.
The anomaly detector manager detects an anomaly (using the detector classes) and forwards it to the notifier.
The configured notifiers provides alerts to the users about the detected anomaly and also returns the action that needs to be taken on the anomaly i.e. whether to fix it, ignore it or delay it.
Member:

The alert mechanism isn't out of the box. A notifier can have its own logic without generating any alerts, even just triggering the fix without notifying anyone about what's happening. So there is no assumption that a "configured" notifier provides alerts. I think this sentence should say that the notifier makes the decision about the action to take. Then CC provides some notifiers which are able to alert the user in several ways (MS Teams, Slack, etc.).


If the users really want to have their own way of dealing with the imbalanced clusters then they can just disable auto-rebalance in `skew` mode and use their own notifier.

#### What happens if some unfixable goal violation happens
In case, there is an unfixable goal violation then the notifier would simply ignore that anomaly and prompt the user about the unfixable violation in the auto-rebalancing status section.
Member:

Still need an example here to better understand how this is prompted to the user.

Member:

Having Prometheus metrics for such cases might be a reasonable default way?

#### What happens if same anomaly is detected again while the auto-rebalance is happening
Since the cluster operator has the knowledge regarding the detected violation, we will ignore the anomalies while the rebalancing is happening. In case the anomaly still exists after the rebalance, Cruise Control will detect it again and a new rebalance would be triggered
Member:

So it seems to assume that if a first anomaly is created, the notifier creates the corresponding ConfigMap and the CO takes care of running a rebalancing. While the rebalancing is running, CC detects other anomalies, so the notifier is creating a bunch of other ConfigMaps that the CO is ignoring. Finally, the rebalancing ends ... the CO will find all these ConfigMaps ... what is it going to do? This is where, if it takes care of them, we could:

  1. lose the priority of them (ConfigMaps don't have priority)
  2. the old anomalies could have been fixed by the previous rebalancing, so it's useless to handle them

Contributor Author:

I think the best option in this case would be to ignore the ConfigMap and also delete it at the same time. I think I didn't mention it here, which is my mistake, but later in the flowchart I say that if anomalies are detected while a rebalance is happening, we will just ignore that ConfigMap and delete it.

* from **RebalanceOnScaleDown** to:
* **RebalanceOnScaleDown**: if a rebalancing on scale down is still running or another one was requested while the first one ended.
* **RebalanceOnScaleUp**: if a scale down operation was requested together with a scale up and, because they run sequentially, the rebalance on scale down had the precedence, was executed first and completed successfully. We can now move on with rebalancing for the scale up.
* **RebalanceOnAnomalyDetection**: if a configmap related to goal violation was detected. It will run once the queued scale down and scale up is completed
Member:

Are we really sure that if a rebalance is running for scale up or scale down, after that we should take care of the anomaly? Is it possible that the anomaly was somehow fixed by the auto-rebalancing due to scale up or down? My gut feeling is that we could avoid taking care of such an anomaly, because if the problem is still in place it will be raised again by CC and then we'll deal with it. @tomncooper @scholzj wdyt?

@tomncooper commented Aug 1, 2025:

I think this ties into your question above Paolo: what happens if a load of anomaly CMs stack up while you are waiting for a scale up or scale down rebalance to finish?

Even if only one anomaly is detected and a CM created, it could be hours old by the time the scaling operation and rebalance are done. The add/remove-broker rebalances can apply goal fixes as well, so they may well fix the original anomaly.

I think you need the concept of freshness for an anomaly. You could just blanket reject (delete) any anomalies detected during an ongoing rebalance.

Member:

I do not think I can really comment as I do not know how this really works in CC. I raised a similar point before with regards to an imbalance that cannot be fixed (e.g. because one partition causing the imbalance is too big etc.). Will it be raised again and again? Do we need to somehow detect those and ignore them? Etc. So this is a bit similar. How do you know whether it was already resolved and whether it will be repeated? 🤷

@ShubhamRwt (Contributor Author) commented Aug 4, 2025:

@tomncooper @ppatierno you are correct, we should ignore and delete the ConfigMap at the same time if a rebalance is happening. I think I didn't mention it here, which is my mistake, but later in the flowchart I show that if anomalies are detected while a rebalance is happening, we will just ignore that ConfigMap and delete it. As for unfixable anomalies which can keep appearing, there is code present in the Cruise Control SelfHealingNotifier which I am going to utilize. That method checks whether the rebalance can be performed for the goal violation or not. If the goal violation cannot be fixed, then we just ignore the anomaly and no ConfigMap would be created in that case.

@tomncooper left a comment

OK, I did another pass. I have a few questions:

  • How are you going to distinguish anomaly CMs from different Kafka clusters in the same namespace? I know it is not recommended, but users do deploy multiple Kafka clusters in the same NS.
  • You need to deal with GC'ing all these anomaly CMs in the case where a rebalance is ongoing. Do you delete them? Do you have some kind of timeout based on the detection interval?
  • It is not clear what you mean by scale up/down auto-rebalances being queued up? I assume you mean generated KafkaRebalance CRs? But it is not clear.



finalizers:
  - strimzi.io/auto-rebalancing
spec:
  mode: skew
Member:

So it will always be full and there will be no new mode like imbalance or skew, right?

#### AnomalyDetectorNotifier

Cruise Control provides the `AnomalyNotifier` interface, which has multiple abstract methods on what to do if certain anomalies are detected.
Some of those methods are:`onGoalViolation()`, `onBrokerFailure()`, `onDiskFailure`, `alert()` etc.
Member:

Suggested change:
- Some of those methods are:`onGoalViolation()`, `onBrokerFailure()`, `onDiskFailure`, `alert()` etc.
+ Some of those methods are: `onGoalViolation()`, `onBrokerFailure()`, `onDiskFailure()`, `alert()`.

I guess you don't have to use etc. here if you are just naming some of them, but I'm not a native speaker :)

# ...
```

The operator will then check if any configmap with prefix `goal-violation` is created or not, if it finds one created then operator will trigger the rebalance.
Member:

Yeah, should it follow the names of other things, like <cluster-name>-goal-violation-<anomalyID>? I guess that will also be easier to find in case you would like to search all namespaces for this kind of ConfigMap.
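
For illustration, a minimal sketch of what such an anomaly ConfigMap could look like, assuming the `<cluster-name>-goal-violation-<anomalyID>` naming suggested above; all names and data keys here are hypothetical, not something the proposal has settled on:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  # hypothetical name following the suggested <cluster-name>-goal-violation-<anomalyID> pattern
  name: my-cluster-goal-violation-a1b2c3
  labels:
    # a label like this would let the operator tell apart CMs of different clusters in one namespace
    strimzi.io/cluster: my-cluster
data:
  anomalyType: GOAL_VIOLATION
  detectionTimeMs: "1721900000000"
  violatedGoals: DiskUsageDistributionGoal
```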


Users cannot configure the notifier if they are utilising the auto-rebalance on imbalanced cluster.
This is because the operator is using our custom notifier for getting alerts about goal violations.
If the users try to override the notifier while the `skew` mode is enabled, the auto-rebalance `skew` configuration then the operator would throw errors in the auto-rebalance status field
@im-konge (Member) commented Aug 8, 2025:

Maybe could you re-phrase it a bit - I'm a bit confused by "the auto-rebalance `skew` configuration then the operator would throw errors". What do you mean by that?

finalizers:
  - strimzi.io/auto-rebalancing
spec:
  mode: skew
Contributor:

I think I would get confused to see a different mode name, full, here. We would have to explain how that maps to the skew or imbalance mode we introduced.

@nickgarvey:

Chiming in as an end user - glad to see this proposal! We have been debating internally whether we want to have a cronjob to issue rebalances; this is a lot better. In particular, the model of using Cruise Control's anomaly detection while issuing the rebalances through KafkaRebalance CRs seems like it will fit perfectly into our workflows.

I see discussion on how to represent the anomalies. Any solution here is fine for us; I envision we will mostly be interacting with the KafkaRebalance CR and not much with anything else.

An area that could be more explicit is the right way to stop all rebalances and not issue any more. Rebalance operations often saturate bandwidth, either disk or network, and cause major latency during producing. We often find ourselves needing to cancel them as we scale and learn our limits. It looks like we might be able to delete mode: skew on the CruiseControl CR to stop automatic rebalances, but it could be clearer.

Thanks for putting this together, excited to see this.

@ppatierno (Member) commented Aug 19, 2025:

@nickgarvey Thanks for the feedback! Usually you are able to stop the current rebalancing by applying the stop annotation on the KafkaRebalance (of course the current batch has to finish first). With auto-rebalancing, the KafkaRebalance is owned by the operator and not by the user. That's good feedback anyway, because there is no clear way for the user to stop an auto-rebalancing in progress. I think you could apply the stop annotation on the KafkaRebalance resource, but you can't delete it due to a finalizer. Then you should delete the corresponding mode within the spec.cruiseControl.autoRebalance.mode field to avoid the re-triggering. It's something to think about.
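
For context, the stop annotation mentioned above is the existing `strimzi.io/rebalance` annotation on the `KafkaRebalance` resource; a minimal sketch, with an illustrative resource name (under auto-rebalancing this resource would be operator-owned):

```yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaRebalance
metadata:
  name: my-rebalance   # illustrative name
  annotations:
    strimzi.io/rebalance: stop   # asks the operator to stop the ongoing rebalance
```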

Comment on lines 10 to 11
With smaller clusters, it is feasible to fix things manually. However, for larger ones it can be very time-consuming, or just not feasible, to fix all the anomalies on your own.
It would be useful for users of Strimzi to be able to have these anomalies fixed automatically whenever they are detected.
Member:

Suggested change:
- With smaller clusters, it is feasible to fix things manually. However, for larger ones it can be very time-consuming, or just not feasible, to fix all the anomalies on your own.
- It would be useful for users of Strimzi to be able to have these anomalies fixed automatically whenever they are detected.
+ In smaller clusters, anomalies can still be fixed manually. But as clusters grow, doing this becomes time-consuming or even impractical. For Strimzi users, it would be highly valuable if such anomalies could be detected and fixed automatically.

* Metric anomaly - This failure happens if metrics collected by Cruise Control have some anomaly in their value (e.g. a sudden rise in the log flush time metrics).

The detected anomalies are inserted into a priority queue where comparator is based upon the priority value and the detection time.
The smaller the priority value and detected time is, the higher priority the anomaly type has.
Member:

Suggested change:
- The smaller the priority value and detected time is, the higher priority the anomaly type has.
+ An anomaly is considered more important if it has a lower priority value and shorter detection time.

They can configure auto-rebalance to enable only for their specific case i.e. setting only `skew` mode or other scaling related modes.
Once the auto-rebalance with `skew` mode is enabled, the operator will be ready to trigger auto-rebalance whenever the cluster becomes imbalanced.
To trigger the auto-rebalance, the operator must know that the cluster is imbalanced due to some goal violation anomaly.
We will create our own custom notifier named `AnomalyDetectorNotifier` to do the same.
Member:

Yeah, that would make it more flexible for future changes, so +1 for naming it in a more generic way...

The auto-rebalance configuration for the `spec.cruiseControl.autoRebalance.template` property in the `Kafka` custom resource is provided through a `KafkaRebalance` custom resource defined as a "template".
That is a `KafkaRebalance` custom resource with the `strimzi.io/rebalance-template: true` annotation set.
When it is created, the `KafkaRebalanceAssemblyOperator` doesn't run any rebalancing.
This is because it doesn't represent an "actual" rebalance request to get an optimization proposal, but it's just the place where configuration related to auto-rebalancing is defined.
Member:

Suggested change:
- This is because it doesn't represent an "actual" rebalance request to get an optimization proposal, but it's just the place where configuration related to auto-rebalancing is defined.
+ This is not an actual rebalance request to get an optimization proposal; it is simply where the configuration for auto-rebalancing is defined.

That is a `KafkaRebalance` custom resource with the `strimzi.io/rebalance-template: true` annotation set.
When it is created, the `KafkaRebalanceAssemblyOperator` doesn't run any rebalancing.
This is because it doesn't represent an "actual" rebalance request to get an optimization proposal, but it's just the place where configuration related to auto-rebalancing is defined.
The user can specify rebalancing goals and other configuration for rebalancing, within the resource.
Member:

Suggested change:
- The user can specify rebalancing goals and other configuration for rebalancing, within the resource.
+ The user can specify rebalancing goals and configuration in the resource.
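
For reference, a template `KafkaRebalance` as described above might look like the following sketch; the goal names are only examples:

```yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaRebalance
metadata:
  name: my-rebalance-template
  annotations:
    strimzi.io/rebalance-template: "true"  # marks this as a template; no rebalance is run for it
spec:
  goals:
    # example goals; the rebalancing configuration used for auto-rebalancing goes here
    - CpuCapacityGoal
    - DiskCapacityGoal
```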

```

The operator will then check if any configmap with prefix `goal-violation` is created or not, if it finds one created then operator will trigger the rebalance.
Separate configmaps would be created for every goal violation such that on completion of the rebalance we can remove the particular configmap.
Member:

Guess the operator will remove the CM instead of users, right?


A[KafkaClusterCreator] --creates--> B[KafkaCluster]
B -- calls --> D[KafkaAutoRebalancingReconciler.reconcile]
D -- check for configmap with goal-violation prefix --> E{if config map present?}
D -- if rebalance in progress --> F[ignore new configmaps and delete them]
Member:

Which CMs will be deleted in that case? My understanding is that KafkaAutoRebalancingReconciler will not create new ones. What about the CMs that are used for the current rebalancing? Should those be deleted by the rebalancing itself, or?

This could cause potential conflicts with other administration operations and is the primary reason self-healing has been disabled until now.
To resolve this issue, we will only make use of Cruise Control's anomaly detection ability, the triggering of the partition reassignments (rebalance) will the responsibility of the Strimzi Cluster Operator.
To enable this, we will use approach based on the existing auto-rebalance for scaling feature (see the [documentation](https://strimzi.io/docs/operators/latest/deploying#proc-automating-rebalances-str) for more details).
We will be using the goal violation anomaly detection related classes in Cruise Control to detect imbalanced cluster and not other detection related class like Disk failures or broker failure.
Member:

Suggested change:
- We will be using the goal violation anomaly detection related classes in Cruise Control to detect imbalanced cluster and not other detection related class like Disk failures or broker failure.
+ We will be using the anomaly detection classes related to goal violations that can be addressed by a partition rebalances but not other anomaly detection classes related to goal violations that would require manual intervention like disk or broker failures.

To resolve this issue, we will only make use of Cruise Control's anomaly detection ability, the triggering of the partition reassignments (rebalance) will the responsibility of the Strimzi Cluster Operator.
To enable this, we will use approach based on the existing auto-rebalance for scaling feature (see the [documentation](https://strimzi.io/docs/operators/latest/deploying#proc-automating-rebalances-str) for more details).
We will be using the goal violation anomaly detection related classes in Cruise Control to detect imbalanced cluster and not other detection related class like Disk failures or broker failure.
THe reason behind it is that disk failures and broker failures can be fixed in a much better way than rebalancing the cluster. It is much easier to spin up a new disk in case of disk failures and in the same way it is better to fix the issue with the broker directly instead just moving the partitions replicas away from it.
Member:

These sentences should be put on their own line. For the first sentence, it may be more direct to say something like:

Suggested change:
- THe reason behind it is that disk failures and broker failures can be fixed in a much better way than rebalancing the cluster. It is much easier to spin up a new disk in case of disk failures and in the same way it is better to fix the issue with the broker directly instead just moving the partitions replicas away from it.
+ The reason behind thus is that disk failures and broker failures can cannot be fixed by rebalancing alone, they require manual intervention.

As for the second sentence, is this the real reason why we are leaving out these specific anomaly detection classes? It seems like we would want to leave them out because the detected issues (disk failure, broker failure, etc) would be non-trivial for the Strimzi Operator to fix (also out of scope of this feature). We want to narrow the scope to goal violations that the Operator can fix with a rebalance.

Contributor Author:

I think the reason I mentioned was that Cruise Control can only do rebalancing, which wouldn't help us fix these failures, so we are not supporting them. But I think it would be good to frame it the way you mentioned.

To enable this, we will use approach based on the existing auto-rebalance for scaling feature (see the [documentation](https://strimzi.io/docs/operators/latest/deploying#proc-automating-rebalances-str) for more details).
We will be using the goal violation anomaly detection related classes in Cruise Control to detect imbalanced cluster and not other detection related class like Disk failures or broker failure.
THe reason behind it is that disk failures and broker failures can be fixed in a much better way than rebalancing the cluster. It is much easier to spin up a new disk in case of disk failures and in the same way it is better to fix the issue with the broker directly instead just moving the partitions replicas away from it.
Doing this will provide us with the following advantages:
Member:

When we say "doing this", do we mean disabling the anomaly detection classes that detect goal violations that would require manual intervention? Or disabling anomaly detection classes that attempt to resolve goal violations that would require manual intervention?

Contributor Author:

By "doing this", I was referring to the proposal's approach of detecting the imbalances using CC and letting the operator fix them.

We will be using the goal violation anomaly detection related classes in Cruise Control to detect imbalanced cluster and not other detection related class like Disk failures or broker failure.
THe reason behind it is that disk failures and broker failures can be fixed in a much better way than rebalancing the cluster. It is much easier to spin up a new disk in case of disk failures and in the same way it is better to fix the issue with the broker directly instead just moving the partitions replicas away from it.
Doing this will provide us with the following advantages:
* we will ensure that the operator is in control of when rebalances will be triggered.
Member:

Suggested change:
- * we will ensure that the operator is in control of when rebalances will be triggered.
+ * we ensure that the operator controls all rebalance and cluster remediation operations.

THe reason behind it is that disk failures and broker failures can be fixed in a much better way than rebalancing the cluster. It is much easier to spin up a new disk in case of disk failures and in the same way it is better to fix the issue with the broker directly instead just moving the partitions replicas away from it.
Doing this will provide us with the following advantages:
* we will ensure that the operator is in control of when rebalances will be triggered.
* using the existing `KafkaRebalance` CR system make it easier for users to see what is happening and when, which (as we don't support the Cruise Control UI) enhances observability and will also aids in debugging.
Member:

Suggested change:
- * using the existing `KafkaRebalance` CR system make it easier for users to see what is happening and when, which (as we don't support the Cruise Control UI) enhances observability and will also aids in debugging.
+ * using the existing `KafkaRebalance` CR system gives more visibility into what is happening and when, which (as we don't support the Cruise Control UI) enhances observability and will also aids in debugging.

The new mode will be called `imbalance`, which means that cluster imbalance was detected and rebalancing should be applied to the all the brokers.
The mode is defined by setting the `spec.cruiseControl.autoRebalance.mode` field as `imbalance` and the corresponding rebalancing configuration is defined as a reference to a "template" `KafkaRebalance` custom resource, by using the `spec.cruiseControl.autoRebalance.template` field as a [LocalObjectReference](https://kubernetes.io/docs/reference/kubernetes-api/common-definitions/local-object-reference/).
This field is optional and if not specified, the auto-rebalancing runs with the default Cruise Control configuration (i.e. the same used for unmodified manual `KafkaRebalance` invocations).
To provide users more flexibility, they only have to configure the auto-rebalance modes they wish to customise.
Member:

Suggested change:
- To provide users more flexibility, they only have to configure the auto-rebalance modes they wish to customise.
+ To provide users more flexibility, they only have to configure the auto-rebalance modes they wish to use whether it be `add-brokers`, `remove-brokers`, or `imbalance`.
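
To make the configuration concrete, here is a sketch of the `Kafka` CR section described above, assuming the proposed mode lands under the name `imbalance`; the template references are optional:

```yaml
spec:
  cruiseControl:
    autoRebalance:
      # existing scaling-related modes
      - mode: add-brokers
        template:
          name: my-rebalance-template
      - mode: remove-brokers
        template:
          name: my-rebalance-template
      # proposed new mode for imbalanced clusters
      - mode: imbalance
        template:
          name: my-rebalance-template
```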

This field is optional and if not specified, the auto-rebalancing runs with the default Cruise Control configuration (i.e. the same used for unmodified manual `KafkaRebalance` invocations).
To provide users more flexibility, they only have to configure the auto-rebalance modes they wish to customise.
They don't require to set up all the modes and can enable the modes they require.
They can configure auto-rebalance to enable only for their specific case i.e. setting only `imbalance` mode or other scaling related modes.
Member:

It seems like the first sentence of the above three summarizes this well enough; we could probably remove the bottom two sentences.

To provide users more flexibility, they only have to configure the auto-rebalance modes they wish to customise.
They don't require to set up all the modes and can enable the modes they require.
They can configure auto-rebalance to enable only for their specific case i.e. setting only `imbalance` mode or other scaling related modes.
Once the auto-rebalance with `imbalance` mode is enabled, the operator will be ready to trigger auto-rebalance whenever the cluster becomes imbalanced.
Member:

Suggested change:
- Once the auto-rebalance with `imbalance` mode is enabled, the operator will be ready to trigger auto-rebalance whenever the cluster becomes imbalanced.
+ When the auto-rebalance configuration is set with `imbalance` mode enabled, the operator will trigger a partition rebalance whenever a goal violation is detected by the anomaly detector.

To trigger the auto-rebalance, the operator must know that the cluster is imbalanced due to some goal violation anomaly.
We will create our own custom notifier named `StrimziCruiseControlNotifier` to do the same.
This notifier's job will be to update the operator regarding the goal violations so that the operator can trigger a rebalance (see section [AnomalyDetectorNotifier](./106-auto-rebalance-on-imbalanced-clusters.md#anomalydetectornotifier)).
With this proposal, we are only going to support auto-rebalance on imbalanced cluster.
Member:

I am not sure I understand, does this mean that the operator will only trigger a partition rebalance for goal violations that don't require manual intervention?

We will create our own custom notifier named `StrimziCruiseControlNotifier` to do the same.
This notifier's job will be to update the operator regarding the goal violations so that the operator can trigger a rebalance (see section [AnomalyDetectorNotifier](./106-auto-rebalance-on-imbalanced-clusters.md#anomalydetectornotifier)).
With this proposal, we are only going to support auto-rebalance on imbalanced cluster.
We also plan to implement the same for topic and metrics related issues, but it will be part of future work since their implementation require different approach.
Member:

It might be worth coming up with some terminology to distinguish the different types of goal violations that are flagged by the anomaly detector. Then it'd be easier to describe what will and will not be implemented as part of this feature versus features in the future, e.g. something like:

  • resource distribution violations: violations that can be resolved by a partition rebalance.
  • component failure violations: violations that can be resolved through manual intervention (disk or broker failure)
  • ...

We would just need to look at the set of possible violations and divide them into groups based on how they are resolved.

Contributor Author:

The notifier that we will create will ignore all the broker and disk related anomalies, so I wonder why we need to come up with terminologies for goal violations. I have also configured the notifier to alert the users in cases where a goal like DiskDistributionGoal is violated and it cannot be fixed even by rebalancing, so in that case the anomaly will be ignored.

Member:

I think Shubham is just using the same terminology we have in Cruise Control. Any "goal violation" anomaly is related to the violation of goals as we know them from Cruise Control.
But disk or broker failures are two different things, they are not under the "goal violation" umbrella.
As well as the topic and metrics anomalies he mentioned.
All of them won't be covered here, just the "goal violation". What's your doubt about that naming @kyguy?

Member:

> But disk or broker failures are two different things, they are not under the "goal violation" umbrella.
> As well as the topic and metrics anomaly he mentioned.
> All of them won't be covered here, but just the "goal violation". What's your doubt about that naming

We don't necessarily have to come up with new terminology here, I just wasn't sure what issues are and are not covered by the statement "we are only going to support auto-rebalance on imbalanced cluster.". The term "imbalanced cluster" seemed a little ambiguous here and in other places in the proposal if it doesn't cover all the issues that could cause an imbalanced cluster, e.g. broker failure.

If we are covering all the "goal violations", maybe we can just say "we are only going to support auto-rebalance whenever goal violations are detected". However, it sounds like we aren't planning on auto-rebalancing on all goal violations, e.g. when "DiskDistributionGoal is violated and it cannot be fixed even by rebalancing".

It looks like the current iteration of the proposal clears up this distinction anyway, so maybe we can safely remove the line "With this proposal, we are only going to support auto-rebalance on imbalanced cluster." or say
"With this proposal, we are only going to support auto-rebalance whenever goal violations are detected and the violation can be addressed by a rebalance".

Member:

I guess that by DiskDistributionGoal you are referring to DiskUsageDistributionGoal, right?
Now, you had a valid point regarding non-fixable violations. I would assume that when the anomaly detector detects a DiskUsageDistributionGoal violation it doesn't know yet whether it's fixable or not. Our notifier just adds the goal violation anomaly to the ConfigMap and the operator will take care of it by creating a KafkaRebalance resource to fix it. I think that, if the violation is not fixable via a rebalancing, CC will respond by NOT providing a proposal, is that right? Or will it provide a proposal and then fail when the actual rebalancing is requested?

Finally, I agree that we can replace "With this proposal, we are only going to support auto-rebalance on imbalanced cluster." with "With this proposal, we are only going to support auto-rebalance whenever goal violations are detected and the violation can be addressed by a rebalance".

Member:

> I guess that by DiskDistributionGoal you are referring to DiskUsageDistributionGoal. Right?

Sorry, yes that is what I meant.

> I think that, if the violation is not fixable via a rebalancing, CC will respond by NOT providing a proposal, is that right? Or it will provide a proposal and then when the actual rebalancing is requested then it will fail?

I believe the former.

> Our notifier just adds the goal violation anomaly to the ConfigMap and the operator will take care by creating a KafkaRebalance resource to fix it.

That makes sense. I guess there will be times when an imbalance due to a disk failure could be solved by a rebalance and sometimes not; we delegate the decision to Cruise Control.

> Finally, I agree that we can replace "With this proposal, we are only going to support auto-rebalance on imbalanced cluster." with "With this proposal, we are only going to support auto-rebalance whenever goal violations are detected and the violation can be addressed by a rebalance"

Sounds good to me!

Signed-off-by: ShubhamRwt <[email protected]>

Currently, if the cluster is imbalanced, the user would need to manually rebalance the cluster by using the `KafkaRebalance` custom resource.
With smaller clusters, it is feasible to fix things manually. However, for larger ones it can be very time-consuming, or just not feasible, to fix all the imbalances on your own.
It would be useful for users of Strimzi to be able to have these imbalanced cluster balanced automatically.
Member:

Suggested change:
- It would be useful for users of Strimzi to be able to have these imbalanced cluster balanced automatically.
+ It would be useful for users of Strimzi to be able to have these imbalanced clusters balanced automatically.


The above flow diagram depicts the self-healing process in Cruise Control.
The anomaly detector manager detects an anomaly (using the detector classes) and forwards it to the notifier.
The notifier then decides what action to take on the anomaly whether to fix it, ignore it or delay. Cruise Control provides various notifiers to alert the users about the detected anomaly in several ways like Slack, Alerta, MS Teams etc.
Member:

Two sentences are on the same line.

It acts as a coordinator between the detector classes and the classes which will handle resolving the anomalies.
Various detector classes like `GoalViolationDetector`, `DiskFailureDetector`, `KafkaBrokerFailureDetector` etc. are used for the anomaly detection, which runs periodically to check if the cluster has their corresponding anomalies or not.
The frequency of this check can be changed via the `anomaly.detection.interval.ms` configuration.
Detector classes have different mechanisms to detect their corresponding anomalies.
Member:

Suggested change:
- Detector classes have different mechanisms to detect their corresponding anomalies.
+ Detector classes use different mechanisms to detect their corresponding anomalies.
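
As an aside, the detection interval mentioned in the hunk above would presumably be tuned through the Cruise Control config in the `Kafka` CR; a sketch with an illustrative value:

```yaml
spec:
  cruiseControl:
    config:
      # run the anomaly detectors every 5 minutes (illustrative value)
      anomaly.detection.interval.ms: 300000
```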

Furthermore, `MetricAnomalyDetector` use metrics and `GoalViolationDetector` uses the load distribution to detect their anomalies.
The detected anomalies can be of various types:
* Goal Violation - This happens if certain [optimization goals](https://strimzi.io/docs/operators/in-development/deploying#optimization_goals) are violated (e.g. DiskUsageDistributionGoal etc.). These goals can be configured through the `self.healing.goals` option in Cruise Control configuration. However, this option is forbidden in the `spec.cruiseControl.config` section of the `Kafka` CR.
* Topic Anomaly - Where one or more topics in cluster violates user-defined properties (e.g. some partitions are too large in disk).
Member:

Suggested change:
- * Topic Anomaly - Where one or more topics in cluster violates user-defined properties (e.g. some partitions are too large in disk).
+ * Topic Anomaly - When one or more topics in the cluster violate user-defined properties (e.g. some partitions are too large on disk).

* Goal Violation - This happens if certain [optimization goals](https://strimzi.io/docs/operators/in-development/deploying#optimization_goals) are violated (e.g. DiskUsageDistributionGoal etc.). These goals can be configured through the `self.healing.goals` option in Cruise Control configuration. However, this option is forbidden in the `spec.cruiseControl.config` section of the `Kafka` CR.
* Topic Anomaly - Where one or more topics in cluster violates user-defined properties (e.g. some partitions are too large in disk).
* Broker Failure - This happens when a non-empty broker crashes or leaves a cluster for a long time.
* Disk Failure - This failure happens if one of the non-empty disks fails (related to a Kafka Cluster with JBOD disks).
Member:

Suggested change:
- * Disk Failure - This failure happens if one of the non-empty disks fails (related to a Kafka Cluster with JBOD disks).
+ * Disk Failure - This failure happens when one of the non-empty disks fails (in a Kafka cluster with JBOD disks).

## Motivation

Currently, if the cluster is imbalanced, the user would need to manually rebalance the cluster by using the `KafkaRebalance` custom resource.
With smaller clusters, it is feasible to fix things manually. However, for larger ones it can be very time-consuming, or just not feasible, to fix all the imbalances on your own.
Member:

It's also worth noting that configuring a Kafka cluster to detect and report partition imbalances in the first place also requires manual effort. Currently, users must set up and tune the anomaly detection settings themselves. One likely benefit of implementing this feature is that it would provide sensible default configurations which would help get users started with detecting partition imbalances.


The above flow diagram depicts the self-healing process in Cruise Control.
The anomaly detector manager detects an anomaly (using the detector classes) and forwards it to the notifier.
The notifier then decides what action to take on the anomaly whether to fix it, ignore it or delay. Cruise Control provides various notifiers to alert the users about the detected anomaly in several ways like Slack, Alerta, MS Teams etc.
Member:

From what I understand, the notifier decides what action to take on the anomaly, whether to fix it, ignore it or delay it, based on the user-provided Cruise Control server self.healing configurations, e.g. self.healing.broker.failure.enabled, self.healing.goal.violation.enabled, etc.

Is that correct?

Member:

The above are properties that Cruise Control provides to the notifiers, but you can even create your own notifier which doesn't take them into account. Within CC you have a base notifier class which takes the above into account, but you are not forced to write your own notifier by inheriting the base one.
That said, in general yes, it's the notifier deciding if an anomaly should be fixed or not. Its response to the anomaly detector determines what to do next. As a bad but "on the hype" example, the notifier could even use some AI :-P by providing it with the anomaly and getting an idea of whether the anomaly needs a fix or not.

Member:

Ah yes, thanks, that makes sense.

Currently, in any such scenario these issues need to be fixed manually i.e. if the cluster is imbalanced then a user might instruct Cruise Control to move the partition replicas across the brokers in order to fix the imbalance using the `KafkaRebalance` custom resource.

Users can currently enable anomaly detection and can also [set](https://strimzi.io/docs/operators/latest/full/deploying.html#setting_up_alerts_for_anomaly_detection) the notifier to one of those included with Cruise Control (`SelfHealingNotifier`, `AlertaSelfHealingNotifier`, `SlackSelfHealingNotifier` etc.).
All the `self.healing` prefixed properties were disabled in Strimzi's Cruise Control integration because, initially, it was not clear how self-healing would act if pods were rolled in middle of rebalances or how Strimzi triggered manual rebalances should interact with Cruise Control triggered self-healing ones.
Member:

Suggested change:
- All the `self.healing` prefixed properties were disabled in Strimzi's Cruise Control integration because, initially, it was not clear how self-healing would act if pods were rolled in middle of rebalances or how Strimzi triggered manual rebalances should interact with Cruise Control triggered self-healing ones.
+ All the `self.healing` prefixed properties are currently disabled in Strimzi's Cruise Control integration because, initially, it was not clear how self-healing would act if pods were rolled in middle of rebalances or how Strimzi triggered manual rebalances should interact with Cruise Control triggered self-healing ones.


This proposal allows users to have their cluster rebalanced automatically whenever it becomes imbalanced due to an overloaded broker, CPU usage, etc.
If we were to enable the self-healing ability of Cruise Control then, in response to detected anomalies, Cruise Control would issue partition reassignments without involving the Strimzi Cluster Operator.
This could cause potential conflicts with other administration operations and is the primary reason self-healing has been disabled until now.

Member

Since the self-healing feature of Cruise Control isn't being used as part of this proposal, would the two sentences above be better suited in a "Rejected Alternatives" section at the end of the proposal?

To resolve this issue, we will only make use of Cruise Control's anomaly detection ability; triggering the partition reassignments (the rebalance) will be the responsibility of the Strimzi Cluster Operator.
To enable this, we will use an approach based on the existing auto-rebalance on scaling feature (see the [documentation](https://strimzi.io/docs/operators/latest/deploying#proc-automating-rebalances-str) for more details).
We will use the anomaly detection classes related to goal violations that can be addressed by a partition rebalance, but not the detection classes for other anomaly types such as disk or broker failures.
The reason is that disk failures and broker failures cannot be fixed by rebalancing alone; they require manual intervention.

Member

Suggested change:

- TThe reason behind thus is that disk failures and broker failures can cannot be fixed by rebalancing alone, they require manual intervention.
+ The reason behind thus is that disk failures and broker failures can cannot be fixed by rebalancing alone, they require manual intervention.
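
To make the proposed wiring concrete, here is a sketch of how the new mode could sit alongside the existing scaling modes in the `autoRebalance` section of the Kafka CR (the `imbalance` mode name is the one used later in this proposal and is still under discussion; the optional template reference follows the existing convention):

```yaml
spec:
  cruiseControl:
    autoRebalance:
      # existing auto-rebalancing on scaling
      - mode: add-brokers
        template:
          name: my-rebalance-template
      - mode: remove-brokers
        template:
          name: my-rebalance-template
      # proposed: rebalance when a fixable goal violation is detected
      - mode: imbalance
        template:
          name: my-rebalance-template
```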

@scholzj scholzj left a comment

I left some comments. Some are nits, some are questions, etc. I feel like it would be great to have more clarifications on:

  • How are the anomalies removed from the ConfigMap?
  • How exactly do we prevent repeated imbalances which cannot be fixed?

Various detector classes like `GoalViolationDetector`, `DiskFailureDetector`, `KafkaBrokerFailureDetector` etc. are used for anomaly detection; they run periodically to check whether the cluster has their corresponding anomalies.
The frequency of this check can be changed via the `anomaly.detection.interval.ms` configuration.
Detector classes have different mechanisms to detect their corresponding anomalies.
For example, `KafkaBrokerFailureDetector` utilises the Kafka Metadata API, whereas `DiskFailureDetector` and `TopicAnomalyDetector` utilise the Kafka Admin API.

Member

What is Kafka Metadata API?
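
As an illustration of the settings mentioned above, the detection interval and the goals checked by the `GoalViolationDetector` can be tuned through the Cruise Control options in the Kafka CR. A sketch, with example values and an example goal list (option names are from the Cruise Control documentation):

```yaml
spec:
  cruiseControl:
    config:
      # run the anomaly detectors every 10 minutes
      anomaly.detection.interval.ms: 600000
      # goals whose violation the GoalViolationDetector reports
      anomaly.detection.goals: >
        com.linkedin.kafka.cruisecontrol.analyzer.goals.RackAwareGoal,
        com.linkedin.kafka.cruisecontrol.analyzer.goals.ReplicaCapacityGoal,
        com.linkedin.kafka.cruisecontrol.analyzer.goals.DiskCapacityGoal
```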


Whenever anomalies are detected, Cruise Control provides the ability to notify the user about them using optional notifier classes.
The notifications sent by these classes increase the visibility of the operations taken by Cruise Control.
The notifier class used by Cruise Control is configurable, and custom notifiers can be used by setting the `anomaly.notifier.class` property.

Member

Is it always only one notifier? Or can there be more of them?
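
For context, in a Strimzi-managed cluster the notifier class is selected through the Cruise Control configuration in the Kafka CR. A sketch using the Slack notifier (class and option names are from the Cruise Control documentation; the webhook and channel values are placeholders):

```yaml
spec:
  cruiseControl:
    config:
      anomaly.notifier.class: com.linkedin.kafka.cruisecontrol.detector.notifier.SlackSelfHealingNotifier
      slack.self.healing.notifier.webhook: https://hooks.slack.com/services/...
      slack.self.healing.notifier.channel: "#kafka-alerts"
```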


The default `NoopNotifier` always sets the notifier action to `IGNORE`, which means that the detected anomaly is silently ignored and no notification is sent to the user.

Cruise Control also provides [custom notifiers](https://github.com/linkedin/cruise-control/wiki/Configure-notifications) like the Slack notifier, Alerta notifier etc. for notifying users about anomalies. There are several other [self-healing notifier](https://github.com/linkedin/cruise-control/wiki/Configurations#selfhealingnotifier-configurations) configurations that can be used to tune notifier behaviour for a particular use case.

Member

How does something like a Slack notifier work? Does it send a message to Slack and mark the anomaly as IGNORE?
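
To make the tuning options above concrete, here is a sketch of two `SelfHealingNotifier` threshold settings (option names are from the Cruise Control documentation; the values are examples):

```yaml
spec:
  cruiseControl:
    config:
      anomaly.notifier.class: com.linkedin.kafka.cruisecontrol.detector.notifier.SelfHealingNotifier
      # alert about a broker failure once it has persisted for 15 minutes
      broker.failure.alert.threshold.ms: 900000
      # consider the failure eligible for self-healing after 30 minutes
      broker.failure.self.healing.threshold.ms: 1800000
```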

Even under normal operation, it's common for Kafka clusters to encounter problems such as partition key skew leading to an uneven partition distribution, or hardware issues like disk failures, which can degrade the cluster's overall health and performance.

Users can currently enable anomaly detection and can also [set](https://strimzi.io/docs/operators/latest/full/deploying.html#setting_up_alerts_for_anomaly_detection) the notifier to one of those included with Cruise Control (`SelfHealingNotifier`, `AlertaSelfHealingNotifier`, `SlackSelfHealingNotifier` etc.).

Member

I guess if they can set the option they can set it to anything, including a custom notifier? Or how does Strimzi prevent the use of custom notifiers today?

Comment on lines +75 to +76

Member

So, what is the actual consequence of this? Users can use the anomaly detection and use, for example, a notifier which sends them a Slack message. But no self-healing is ever done?


#### What happens if an unfixable goal violation happens

If there is an unfixable goal violation, e.g. the disk usage distribution goal (`DiskUsageDistributionGoal`) is violated but cannot be fixed even by a rebalance because all the disks are already full, the notifier will simply ignore that anomaly. This is because Cruise Control first checks whether the violated goal can be fixed by internally running a dry run. If the violated goal is unfixable, it is ignored and not added to the ConfigMap, but the user will be informed about the unfixable violation in the status section of the Kafka CR.

Member

Ehh, you will need to have it in the ConfigMap in order to add it to the status. So this needs more detail.
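
Purely as an illustration of the reporting idea, and not an agreed API (the condition type, reason, and message below are hypothetical), the Kafka CR status could surface an unfixable violation roughly like this:

```yaml
status:
  conditions:
    - type: Warning # hypothetical condition for an unfixable violation
      status: "True"
      reason: UnfixableGoalViolation
      message: DiskUsageDistributionGoal is violated but cannot be fixed by a rebalance
```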


### Auto-rebalancing execution for `imbalance` mode

#### Auto-rebalancing Finite State Machine (FSM) for `imbalance` mode

Member

Suggested change:

- ### Auto-rebalancing Finite State Machine (FSM) for `imbalance` mode
+ #### Auto-rebalancing Finite State Machine (FSM) for `imbalance` mode

* **RebalanceOnScaleDown**: a rebalancing related to a scale down operation is running.
* **RebalanceOnScaleUp**: a rebalancing related to a scale up operation is running.

With the new `imbalance` mode, we will introduce a new FSM state called `RebalanceOnAnomalyDetection`.

Member

Should it be RebalanceOnImbalance instead if the type is imbalance?

If, during an ongoing auto-rebalancing, the `KafkaRebalance` custom resource is not there anymore on the next reconciliation, it could mean the user deleted it while the operator was stopped/crashed/not running.
In this case, the FSM will treat it as `NotReady`, falling into the last case above.
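
For orientation, a sketch of how the new state could appear in the existing `status.autoRebalance` section of the Kafka CR (the layout follows the current auto-rebalancing on scaling feature; the state name is the one proposed above):

```yaml
status:
  autoRebalance:
    # the cluster operator is running a rebalance triggered by a detected anomaly
    state: RebalanceOnAnomalyDetection
    lastTransitionTime: "2025-07-17T10:15:30Z"
```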

Member

Should have a backwards compatibility section as well to clarify/summarize all the compatibility issues (the custom notifier I guess being the only one).


## Affected/not affected projects

This change will affect the Strimzi cluster operator and a new repository named `strimzi-notifier` will be added under the Strimzi organisation.

Member

+1 for a separate repository for the notifier. But that should likely already be detailed earlier in the proposal.
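
If the operator wires the new notifier in automatically, the net effect on the Cruise Control configuration would be roughly the following (the class name is hypothetical, pending the proposed `strimzi-notifier` repository):

```yaml
spec:
  cruiseControl:
    config:
      # hypothetical class name from the proposed strimzi-notifier repository
      anomaly.notifier.class: io.strimzi.kafka.notifier.StrimziNotifier
```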
