
Commit 3a18a8b

Proposal for KRaft Support in Kafka Roller (#91)
* Kafka Roller in KRaft mode
  Co-authored-by: Gantigmaa Selenge <[email protected]>
  Signed-off-by: Katherine Stanley <[email protected]>
* Clarify wording around including leader in quorum health check
  Signed-off-by: Katherine Stanley <[email protected]>
* Address Paul's comments
  Co-authored-by: Katherine Stanley <[email protected]>
  Signed-off-by: Gantigmaa Selenge <[email protected]>
* Correct the statement about stuck pod
  Signed-off-by: Gantigmaa Selenge <[email protected]>
---------
Signed-off-by: Katherine Stanley <[email protected]>
Signed-off-by: Gantigmaa Selenge <[email protected]>
Co-authored-by: Katherine Stanley <[email protected]>
1 parent 79c5135 commit 3a18a8b

File tree

2 files changed: +162 -0 lines changed


060-kafka-roller-kraft.md

Lines changed: 161 additions & 0 deletions
# KafkaRoller KRaft Support

This proposal describes the actions that the KafkaRoller should take when operating
against a Strimzi cluster in KRaft mode.
It describes the checks the KafkaRoller should perform, how it should perform
those checks, and in what order, but does not discuss exactly how the KafkaRoller works.
This proposal is expected to apply to both the current KafkaRoller and a future iteration of the KafkaRoller.

This proposal assumes that liveness/readiness of nodes is as described in [proposal #46](https://github.com/strimzi/proposals/blob/main/046-kraft-liveness-readiness.md)
and [PR #8892](https://github.com/strimzi/strimzi-kafka-operator/pull/8892).

## Current situation

When operating on a ZooKeeper-based cluster the KafkaRoller has the behaviour
described below.

### Order to restart nodes
Restart in order:
1. Unready brokers
2. Ready brokers
3. Current controller

### Triggers
The following are some of the triggers that restart a broker:
- Pod or its StrimziPodSet is annotated for manual rolling update
- Pod is unready
- Broker's configuration has changed

### Restart conditions
The KafkaRoller considers the following in order to determine if it is ok to restart a pod:
1. Does not restart a broker performing log recovery
2. Attempts to connect an admin client to a broker and, if it can't connect, restarts the broker
3. Restarts a broker if the pod is in a stuck state
4. Restarts a broker if the configuration has changed and cannot be updated dynamically
5. Does not restart a broker if doing so would take the in-sync replicas count below `min.insync.replicas` for any topic hosted on that broker (a sketch of this availability check follows the list)
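
As a rough illustration of the availability check in condition 5, the sketch below uses the Kafka `Admin` client to verify that taking a broker down would not shrink any partition's in-sync replica set below `min.insync.replicas`. The helper name, its parameters, and the per-topic `min.insync.replicas` map are illustrative assumptions, not part of the existing KafkaRoller code:

```java
import java.util.Map;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.TopicDescription;
import org.apache.kafka.common.TopicPartitionInfo;

/** Illustrative only: returns false if taking brokerId down would drop any
 *  partition's ISR below its topic's min.insync.replicas. */
static boolean canRollBroker(Admin admin, int brokerId,
                             Map<String, Integer> minIsrPerTopic) throws Exception {
    Map<String, TopicDescription> topics =
            admin.describeTopics(minIsrPerTopic.keySet()).allTopicNames().get();
    for (TopicDescription topic : topics.values()) {
        int minIsr = minIsrPerTopic.get(topic.name());
        for (TopicPartitionInfo partition : topic.partitions()) {
            boolean hostsReplica = partition.replicas().stream()
                    .anyMatch(node -> node.id() == brokerId);
            boolean inIsr = partition.isr().stream()
                    .anyMatch(node -> node.id() == brokerId);
            // Restarting the broker shrinks the ISR only if the broker is currently part of it.
            if (hostsReplica && inIsr && partition.isr().size() - 1 < minIsr) {
                return false;
            }
        }
    }
    return true;
}
```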

#### Unready pod
If a pod is unready but not stuck, KafkaRoller awaits its readiness until the operational timeout is reached before doing anything. A pod is considered stuck if it is in one of the following states (a rough sketch of this check follows the list):
- `CrashLoopBackOff`
- `ImagePullBackOff`
- `ContainerCreating`
- `Pending` and `Unschedulable`
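
For illustration, such a stuck-state check could look roughly like the sketch below, using the fabric8 Kubernetes `Pod` model that the cluster operator already uses. The method name is an illustrative assumption and this is not the actual KafkaRoller implementation:

```java
import io.fabric8.kubernetes.api.model.ContainerStatus;
import io.fabric8.kubernetes.api.model.Pod;
import io.fabric8.kubernetes.api.model.PodCondition;

/** Illustrative only: mirrors the stuck-pod definition above. */
static boolean isStuck(Pod pod) {
    // CrashLoopBackOff, ImagePullBackOff and ContainerCreating appear as container "waiting" reasons.
    for (ContainerStatus status : pod.getStatus().getContainerStatuses()) {
        if (status.getState() != null && status.getState().getWaiting() != null) {
            String reason = status.getState().getWaiting().getReason();
            if ("CrashLoopBackOff".equals(reason)
                    || "ImagePullBackOff".equals(reason)
                    || "ContainerCreating".equals(reason)) {
                return true;
            }
        }
    }
    // Pending and Unschedulable: the pod phase is Pending and the PodScheduled condition reports Unschedulable.
    if ("Pending".equals(pod.getStatus().getPhase())) {
        for (PodCondition condition : pod.getStatus().getConditions()) {
            if ("PodScheduled".equals(condition.getType())
                    && "False".equals(condition.getStatus())
                    && "Unschedulable".equals(condition.getReason())) {
                return true;
            }
        }
    }
    return false;
}
```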

If a pod is stuck, KafkaRoller restarts it only if the pod is out of date. Otherwise, KafkaRoller fails the reconciliation process and does not restart other pods: if a pod is stuck, restarting the other pods might lead them into the same stuck state.
If a pod is unready but not stuck, KafkaRoller considers the restart conditions based on configuration changes and the availability check (#4 and #5 of `Restart conditions` above) when deciding whether to restart the pod.

#### Configuration changes
The KafkaRoller currently handles configuration updates to the brokers in the following way (an illustrative Admin API sketch follows the list):
- Retrieves the current Kafka configuration of the broker via the admin client and compares it with the desired configuration specified in the Kafka CR.
- Performs dynamic configuration updates if possible.
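
For illustration, the two steps above map roughly onto the Kafka `Admin` API as follows. The method and variable names are illustrative assumptions, and the diffing of current versus desired configuration is omitted from the sketch:

```java
import java.util.Collection;
import java.util.List;
import java.util.Map;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.Config;
import org.apache.kafka.common.config.ConfigResource;

/** Illustrative only: fetch a broker's current config and apply a dynamic update. */
static void updateBrokerConfig(Admin admin, int brokerId,
                               Collection<AlterConfigOp> dynamicChanges) throws Exception {
    ConfigResource brokerResource =
            new ConfigResource(ConfigResource.Type.BROKER, Integer.toString(brokerId));

    // Current broker configuration as seen by Kafka; the roller compares this with the
    // desired configuration from the Kafka CR (the diff itself is omitted in this sketch).
    Config currentConfig = admin.describeConfigs(List.of(brokerResource))
            .all().get().get(brokerResource);

    // Apply only those changes that Kafka allows to be updated dynamically.
    admin.incrementalAlterConfigs(Map.of(brokerResource, dynamicChanges)).all().get();
}
```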

In KRaft mode the KafkaRoller currently skips controller-only pods, but performs the above steps on any combined or broker-only pods.
This causes a problem in combined mode: if the quorum has not formed because some of the pods are not ready,
the KafkaRoller will still try to contact the broker via the admin client.
This call fails because the quorum is not formed, so in some cases the cluster ends up stuck with some pods
in a pending state.

## Motivation

When running in KRaft mode, the controller pods need to be rolled if they are manually annotated, their configuration changes, or the pod is unready. At the moment the
existing logic is blocking the ZooKeeper to KRaft migration that is being proposed in [PR #90](https://github.com/strimzi/proposals/pull/90), as well as the full implementation of KRaft liveness and readiness checks as described in [proposal #46](https://github.com/strimzi/proposals/blob/main/046-kraft-liveness-readiness.md).

## Proposal

The KafkaRoller behaviour should be unchanged when operating against a ZooKeeper-based cluster.

The proposed behaviour when operating against a KRaft cluster is described below.

### Order to restart nodes
Restart in order (a sketch expressing this ordering follows the list):
1. Unready controller/combined nodes
2. Ready controller/combined nodes in follower state
3. Active controller (applies to both pure controller and combined nodes)
4. Unready broker-only nodes
5. Ready broker-only nodes
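
For illustration only, the ordering above could be expressed as a numeric rank per node. `KafkaNode` and its accessors here are hypothetical types introduced for this sketch, not existing Strimzi classes:

```java
import java.util.Comparator;
import java.util.List;

/** Hypothetical view of a node; not an existing Strimzi type. */
interface KafkaNode {
    boolean isController();       // true for controller-only and combined nodes
    boolean isReady();
    boolean isActiveController(); // true only for the current quorum leader
}

class KRaftRestartOrder {
    /** Lower rank restarts first, matching the list above. */
    static int rank(KafkaNode node) {
        if (node.isController()) {
            if (!node.isReady())            return 1; // unready controller/combined nodes
            if (!node.isActiveController()) return 2; // ready follower controller/combined nodes
            return 3;                                 // active controller
        }
        return node.isReady() ? 5 : 4;                // broker-only nodes: unready before ready
    }

    static List<KafkaNode> order(List<KafkaNode> nodes) {
        return nodes.stream()
                .sorted(Comparator.comparingInt(KRaftRestartOrder::rank))
                .toList();
    }
}
```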
### Triggers
The following are some of the triggers that would restart a KRaft controller or combined node:
- Pod or its StrimziPodSet is annotated for manual rolling update
- Pod is unready
- Controller's configuration has changed

The triggers for brokers remain the same as in ZooKeeper mode.

### Restart conditions
The restart conditions that the KafkaRoller considers for different modes are described below.

#### The new quorum check
The restart conditions include a new check for controllers to verify that restarting the Kafka node does not affect the quorum health.
The proposed check ensures that a sufficient majority of controller nodes are caught up before allowing a restart (a sketch of the check follows this list):
- Create an admin client connection to the brokers and call the `describeMetadataQuorum` API.
- If the connection to the brokers fails, return `UnforceableProblem`, which currently results in a delay and retry for the pod until the maximum number of attempts is reached.
- From the quorum info returned by the admin API, read the `lastCaughtUpTimestamp` of each controller. `lastCaughtUpTimestamp` is the last millisecond timestamp at which a replica controller was known to be caught up with the quorum leader.
- Check the quorum leader id using the quorum info and identify the `lastCaughtUpTimestamp` of the quorum leader.
- Retrieve the value of the Kafka property `controller.quorum.fetch.timeout.ms` from the desired configurations specified in the Kafka CR. If this property does not exist in the desired configurations, then use its hard-coded default value, which is `2000`. The reason for this is explained further in the **NOTE** below.
- Mark a KRaft controller node as caught up if `leaderLastCaughtUpTimestamp - replicaLastCaughtUpTimestamp < controllerQuorumFetchTimeoutMs`, or if it is the current quorum leader. This excludes the controller that KafkaRoller is currently considering restarting, because we are trying to check whether the quorum would stay healthy when this controller is restarted.
- Count each controller node that is caught up (`numOfCaughtUpControllers`).
- Can restart if: `numOfCaughtUpControllers >= ceil((double) (totalNumOfControllers + 1) / 2)`.
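
Putting the steps above together, a condensed sketch of the check could look like the following. It uses the Kafka `Admin` API (`describeMetadataQuorum`, available only via a broker connection until KIP-919); the method name and parameters are illustrative assumptions, and `controllerQuorumFetchTimeoutMs` is assumed to have been read from the desired configuration in the Kafka CR (defaulting to 2000):

```java
import java.util.List;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.QuorumInfo;

/** Illustrative only: true if a majority of controllers would still be caught up
 *  after restarting the controller with id nodeIdToRestart. */
static boolean canRollController(Admin admin, int nodeIdToRestart,
                                 long controllerQuorumFetchTimeoutMs) throws Exception {
    // Quorum state is obtained via a broker connection until KIP-919 is available (see NOTE below).
    QuorumInfo quorum = admin.describeMetadataQuorum().quorumInfo().get();
    List<QuorumInfo.ReplicaState> voters = quorum.voters();
    int leaderId = quorum.leaderId();

    long leaderLastCaughtUp = voters.stream()
            .filter(v -> v.replicaId() == leaderId)
            .findFirst()
            .map(v -> v.lastCaughtUpTimestamp().orElse(-1L))
            .orElse(-1L);

    long numOfCaughtUpControllers = voters.stream()
            .filter(v -> v.replicaId() != nodeIdToRestart)   // exclude the node we want to restart
            .filter(v -> v.replicaId() == leaderId            // the leader counts as caught up
                    || leaderLastCaughtUp - v.lastCaughtUpTimestamp().orElse(-1L)
                       < controllerQuorumFetchTimeoutMs)
            .count();

    int totalNumOfControllers = voters.size();
    return numOfCaughtUpControllers >= Math.ceil((double) (totalNumOfControllers + 1) / 2);
}
```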

> NOTE: Until [KIP-919](https://cwiki.apache.org/confluence/display/KAFKA/KIP-919%3A+Allow+AdminClient+to+Talk+Directly+with+the+KRaft+Controller+Quorum) is implemented, KafkaRoller cannot create an admin connection to the controller directly to describe its configuration or the quorum state. Therefore KafkaRoller checks the desired configurations to get the value of `controller.quorum.fetch.timeout.ms` and creates an admin client connection to the brokers for the quorum check. If KafkaRoller cannot connect to any of the brokers after the maximum number of retry attempts, the controllers will be marked as failed to reconcile because the quorum health cannot be determined. KafkaRoller would then try to reconcile the brokers, which may help restore admin client connections to them. In this scenario, the reconciliation will be completed with failure and reported to the operator. The controller quorum check will be retried in the next round of reconciliation.

#### Separate brokers and controllers
For controller-only:
1. Does not restart a controller performing log recovery
2. Restarts a controller if the pod is in a stuck state
3. Attempts to connect an admin client to any of the brokers; if it cannot connect, returns `UnforceableProblem`, which results in a delay and retry for the pod until the maximum number of attempts is reached.
When the maximum number of attempts is reached, the KafkaRoller will move on to restart broker pods in case this resolves the connection issues, but will still ultimately mark the reconciliation as failed.
4. Restarts a controller if the controller configuration has changed
5. Does not restart a controller if doing so would take the number of caught-up controllers (including the leader) below a majority of the quorum

For broker-only:
1. Does not restart a broker performing log recovery
2. Restarts a broker if the pod is in a stuck state
3. Attempts to connect an admin client to a broker and, if it can't connect, restarts the broker
> **NEW** Currently, when a pod is stuck and out of date, it gets restarted because the admin client connection fails. With this change, we will not attempt an admin client connection to the broker if the pod is stuck. Instead, the pod will be restarted without making the admin client connection. The resulting action for a stuck pod (a restart) therefore stays the same as in ZooKeeper mode. This resolves the issue mentioned at the beginning of the proposal that leaves some pods stuck in a pending state.
4. Restarts a broker if the broker configuration has changed and cannot be updated dynamically
5. Does not restart a broker if doing so would take the in-sync replicas count below `min.insync.replicas` for any topic hosted on that broker

#### Combined mode
1. Does not restart a combined node performing log recovery
2. Restarts a combined node if the pod is in a stuck state
3. Attempts to connect an admin client to a broker and, if it can't connect, restarts the broker
4. Restarts a combined node if the controller configuration has changed, OR if the broker configuration has changed and cannot be updated dynamically
5. Does not restart a ready combined node if doing so would take the number of caught-up controllers (including the leader) below a majority of the quorum
6. Does not restart a ready combined node if doing so would take the in-sync replicas count below `min.insync.replicas` for any topic hosted on that broker

#### Unready pod
Remains the same as in ZooKeeper mode.

#### Configuration changes
As implemented in [PR #9125](https://github.com/strimzi/strimzi-kafka-operator/pull/9125), until [KIP 919](https://cwiki.apache.org/confluence/display/KAFKA/KIP-919%3A+Allow+AdminClient+to+Talk+Directly+with+the+KRaft+Controller+Quorum+and+add+Controller+Registration) is implemented:
- For broker-only nodes, configuration changes are handled in the same way as brokers in the ZooKeeper case. If controller configurations changed, they will be ignored and the brokers will not restart.
- For controller-only nodes, a hash of the controller configs is calculated (see the illustrative sketch after this list). A change to this hash causes a restart of the node, regardless of whether the configuration could be updated dynamically or not.
- For combined nodes, both controller and broker configurations are checked and handled in the same way as brokers in the ZooKeeper case.
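
As an illustration of the controller-config hash mentioned above, a minimal sketch could look like the following. The hashing algorithm and the method name are assumptions made for illustration and not necessarily what PR #9125 implements:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Map;
import java.util.TreeMap;

/** Illustrative only: a deterministic digest of the desired controller configuration.
 *  A change in this value would trigger a restart of the controller-only node. */
static String controllerConfigHash(Map<String, String> desiredControllerConfig) throws Exception {
    // Sort the keys so that the hash does not depend on map iteration order.
    StringBuilder canonical = new StringBuilder();
    for (Map.Entry<String, String> entry : new TreeMap<>(desiredControllerConfig).entrySet()) {
        canonical.append(entry.getKey()).append('=').append(entry.getValue()).append('\n');
    }
    byte[] digest = MessageDigest.getInstance("SHA-256")
            .digest(canonical.toString().getBytes(StandardCharsets.UTF_8));
    StringBuilder hex = new StringBuilder();
    for (byte b : digest) {
        hex.append(String.format("%02x", b));
    }
    return hex.toString();
}
```

In practice such a digest would typically be recorded somewhere the operator can compare it on each reconciliation (for example, as a pod annotation), so that a change in the desired controller configuration marks the node for restart.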

Once KIP 919 is implemented, the configurations of controller-only nodes will be diffed to see which values have changed.
If the configurations that were updated are dynamic configurations, the KafkaRoller will call the Admin API to dynamically update
these values. This will be similar to how dynamic configuration updates are handled in ZooKeeper mode.

## Affected/not affected projects

The only affected project is the Strimzi cluster operator.

## Compatibility

This proposal does not affect the KafkaRoller behaviour for ZooKeeper-based clusters.
This proposal does change the way that KRaft nodes are rolled; however, since KRaft mode is not supported for production use
and the existing logic is incomplete, this is acceptable.

## Rejected alternatives

### Node rolling order
We considered rolling all unready pods, then all ready pods, regardless of whether they were controllers or brokers.
However, the problem with this approach is that for broker nodes to become ready, the controller quorum must be formed.

README.md

Lines changed: 1 addition & 0 deletions
@@ -9,6 +9,7 @@ This repository list of proposals for the Strimzi project. A template for new pr
 
 | # | Title |
 | :-: |:----------------------------------------------------------------------|
+| 60 | [Kafka Roller KRaft Support](./060-kafka-roller-kraft.md) |
 | 59 | [ZooKeeper to KRaft migration](./059-zk-kraft-migration.md) |
 | 58 | [Deprecate and remove EnvVarConfigProvider](./058-deprecate-and-remove-envvar-config-provider.md) |
 | 57 | [Allow running ZooKeeper and KRaft based clusters in parallel](./057-run-zk-kraft-clusters-parallel.md) |
