Commit 7cccffb

Address review comments from Federico and Paolo
Signed-off-by: Gantigmaa Selenge <[email protected]>
1 parent ec38009 commit 7cccffb

2 files changed: +26 −25 lines changed


06x-new-kafka-roller.md

Lines changed: 26 additions & 25 deletions
@@ -46,7 +46,7 @@ The objective of this proposal is to introduce a new KafkaRoller with more struc
 
 KafkaRoller decisions would be informed by observations coming from different sources (e.g. Kubernetes API, KafkaAgent, Kafka Admin API). These sources will be abstracted so that KafkaRoller is not dependent on their specifics as long as it's getting the information it needs. The abstractions also enable much better unit testing.
 
-Nodes would categorised based on the observed states, the roller will perform specific actions on nodes in each category. Those actions should cause a subsequent observation to cause a state transition. This iterative process continues until each node's state aligns with the desired state.
+Nodes would be categorized based on the observed states, and the roller will perform specific actions on nodes in each category. Those actions should cause a subsequent observation to trigger a state transition. This iterative process continues until each node's state aligns with the desired state.
 
 In addition, the new KafkaRoller will introduce an algorithm to restart brokers in parallel when safety conditions are met. These conditions ensure Kafka producer availability and minimize the impact on controllers and overall cluster stability. It will also wait for partitions to be reassigned to their preferred leaders to avoid triggering unnecessary partition leader elections.
 
@@ -57,7 +57,7 @@ When a new reconciliation starts up, a context object is created for each node t
 - <i>nodeRef</i>: NodeRef object that contains Node ID.
 - <i>currentNodeRole</i>: Currently assigned process roles for this node (e.g. controller, broker).
 - <i>lastKnownState</i>: It contains the last known state of the node based on information collected from the abstracted sources (Kubernetes API, KafkaAgent and Kafka Admin API). The table below describes the possible states.
-- <i>restartReason</i>: It is updated based on the current predicate logic from the `Reconciler`. For example, an update in the Kafka CR is detected.
+- <i>restartReason</i>: It is updated based on the current predicate logic passed from the `KafkaReconciler` class. For example, an update in the Kafka CR is detected.
 - <i>numRestartAttempts</i>: The value is incremented each time the node has been restarted or attempted to be restarted.
 - <i>numReconfigAttempts</i>: The value is incremented each time the node has been reconfigured or attempted to be reconfigured.
 - <i>numRetries</i>: The value is incremented each time the node is evaluated/processed but was not restarted/reconfigured due to not meeting safety conditions, for example: availability check failed, log recovery in progress, or timed out waiting for the pod to become ready.
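
As an editorial aside, the per-node context fields described in this hunk could be pictured as a small class. This is a hypothetical sketch, not Strimzi code: the field names follow the proposal, but the class name `NodeRollingContext`, the types, and the shape are assumed.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical sketch of the per-node context the proposal describes.
// Field names follow the proposal; the class shape itself is assumed.
class NodeRollingContext {
    enum Role { CONTROLLER, BROKER }

    final int nodeId;                  // taken from nodeRef
    final Set<Role> currentNodeRoles;  // e.g. controller, broker
    String lastKnownState = "UNKNOWN"; // see the state table below
    String restartReason;              // set from the reconciler's predicate logic
    int numRestartAttempts = 0;        // incremented on each (attempted) restart
    int numReconfigAttempts = 0;       // incremented on each (attempted) reconfiguration
    int numRetries = 0;                // incremented when safety conditions defer the node

    NodeRollingContext(int nodeId, Set<Role> roles) {
        this.nodeId = nodeId;
        this.currentNodeRoles = EnumSet.copyOf(roles);
    }
}
```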
@@ -68,9 +68,9 @@ When a new reconciliation starts up, a context object is created for each node t
 | :--------------- | :--------------- | :----------- |
 | UNKNOWN | The initial state when creating `Context` for a node or state just after the node gets restarted/reconfigured. We expect to transition from this state fairly quickly. | `NOT_RUNNING` `NOT_READY` `RECOVERING` `READY` |
 | NOT_RUNNING | Node is not running (Kafka process is not running). This is determined via Kubernetes API, more details for it below. | `READY` `UNKNOWN` `NOT_READY` `RECOVERING` |
-| NOT_READY | Node is running but not ready to serve requests which is determined by Kubernetes readiness probe (broker state < 2 OR == 127 OR controller is not listening on port). | `READY` `UNKNOWN` `NOT_RUNNING` `RECOVERING` |
+| NOT_READY | Node is running but not ready to serve requests, which is determined by the Kubernetes readiness probe (broker state is not RUNNING OR controller is not listening on port). | `READY` `UNKNOWN` `NOT_RUNNING` `RECOVERING` |
 | RECOVERING | Node has started but is in log recovery (broker state == 2). This is determined via the KafkaAgent. | `READY` `NOT_RUNNING` `NOT_READY` |
-| READY | Node is in running state and ready to serve requests which is determined by Kubernetes readiness probe (broker state >= 3 AND != 127 OR controller is listening on port). | `LEADING_ALL_PREFERRED` `UNKNOWN` |
+| READY | Node is in running state and ready to serve requests, which is determined by the Kubernetes readiness probe (broker state is RUNNING OR controller is listening on port). | `LEADING_ALL_PREFERRED` `UNKNOWN` |
 | LEADING_ALL_PREFERRED | Node is leading all the partitions that it is the preferred leader for. Node's state can transition into this only from `READY` state. | This is the final state we expect |
 
 Context about broker states and restart reasons:
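
The state table above maps naturally onto an enum whose allowed transitions can be unit tested. A minimal sketch, assuming an illustrative enum named `NodeState` with an `allowedTransitions()` helper (not Strimzi code; the state names and transitions come from the table):

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical sketch: node states and the transitions listed in the table.
enum NodeState {
    UNKNOWN, NOT_RUNNING, NOT_READY, RECOVERING, READY, LEADING_ALL_PREFERRED;

    Set<NodeState> allowedTransitions() {
        switch (this) {
            case UNKNOWN:     return EnumSet.of(NOT_RUNNING, NOT_READY, RECOVERING, READY);
            case NOT_RUNNING: return EnumSet.of(READY, UNKNOWN, NOT_READY, RECOVERING);
            case NOT_READY:   return EnumSet.of(READY, UNKNOWN, NOT_RUNNING, RECOVERING);
            case RECOVERING:  return EnumSet.of(READY, NOT_RUNNING, NOT_READY);
            case READY:       return EnumSet.of(LEADING_ALL_PREFERRED, UNKNOWN);
            // The expected final state: no further transitions.
            case LEADING_ALL_PREFERRED: return EnumSet.noneOf(NodeState.class);
            default: throw new AssertionError("unreachable");
        }
    }
}
```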
@@ -87,20 +87,21 @@ If one of the following is true, then node's state is `NOT_RUNNING`:
 - the pod has container status `ContainerStateWaiting` with `CrashLoopBackOff` or `ImagePullBackOff` reason
 If none of the above is true but the node is not ready, then its state would be `NOT_READY`.
 
-#### Flow diagram describing the overall flow of the states
+#### High level flow diagram describing the flow of the states
 ![The new roller flow](./images/06x-new-roller-flow.png)
 
 
+
 ### Configurability
 The following are the configuration options for the new KafkaRoller. If exposed to the user, the user can configure it via `STRIMZI_` environment variables. Otherwise, the operator will set them to the default values (which are similar to what the current roller has):
 
 | Configuration | Default value | Exposed to user | Description |
 |:-----------------------|:--------------|:----------------|:----------------------------------------------------------------------|
-| maxRestartAttempts | 3 | No | The maximum number of times a node can be restarted before failing the reconciliation. This is checked against the node's `numRestartAttempts`. |
-| maxReconfigAttempts | 3 | No | The maximum number of times a node can be dynamically reconfigured before restarting it. This is checked against the node's `numReconfigAttempts`. |
+| maxRestartAttempts | 3 | No | The maximum number of restart attempts per node before failing the reconciliation. This is checked against the node's `numRestartAttempts`. |
+| maxReconfigAttempts | 3 | No | The maximum number of dynamic reconfiguration attempts per node before restarting the node. This is checked against the node's `numReconfigAttempts`. |
 | maxRetries | 10 | No | The maximum number of times a node can be retried after not meeting the safety conditions, e.g. availability check failed. This is checked against the node's `numRetries`. |
 | operationTimeoutMs | 60 seconds | Yes | The maximum amount of time we will wait for nodes to transition to `READY` state after an operation in each retry. This is already exposed to the user via environment variable `STRIMZI_OPERATION_TIMEOUT_MS`. |
-| maxRestartParallelism | 1 | Yes | The maximum number of broker nodes that can be restarted in parallel. This will be exposed to the user via the new environment variable `STRIMZI_MAX_RESTART_BATCH_SIZE`. |
+| maxRestartParallelism | 1 | Yes | The maximum number of broker nodes that can be restarted in parallel. This will be exposed to the user via the new environment variable `STRIMZI_MAX_RESTART_BATCH_SIZE`. However, if there are multiple brokers in `NOT_RUNNING` state, they may get restarted in parallel despite this configuration, for faster recovery. |
 | postRestartDelay | 0 | Yes | Delay between restarts of nodes or batches. It's set to 0 by default, but can be adjusted by users to slow down the restarts. This will also help JIT to reach a steady state and to reduce impact on clients. |
 | restartAndPreferredLeaderElectionDelay | 10 seconds | No | Delay between a restart and triggering partition leader election, so that the just-rolled broker is leading all the partitions it is the preferred leader for. This is to avoid situations where leaders move to a newly started node that does not yet have established networking to some outside networks, e.g. through load balancers. |
 
@@ -114,7 +115,7 @@ The following are the configuration options for the new KafkaRoller. If exposed
 nodeRoles: <Set using pod labels `strimzi.io/controller-role` and `strimzi.io/broker-role`>,
 state: UNKNOWN,
 lastTransition: <SYSTEM_TIME>,
-reason: <Result of predicate function from KafkaReconciler>,
+restartReason: <Result of predicate function from KafkaReconciler>,
 numRestartAttempts: 0,
 numReconfigAttempts: 0,
 numRetries: 0
@@ -123,10 +124,10 @@ The following are the configuration options for the new KafkaRoller. If exposed
 Contexts are recreated in each reconciliation with the above initial data.
 
 2. **Transition Node States:**
-Update each node's state based on information from abstracted sources. If failed to retrieve information, the reconciliation fails and restarts from step 1.
+Update each node's state based on information from the abstracted sources. If the information cannot be retrieved, the current reconciliation immediately fails. When the next reconciliation is triggered, it will restart from step 1.
 
 3. **Handle `NOT_READY` Nodes:**
-Wait for `NOT_READY` nodes to become `READY` within `operationTimeoutMs`. If the timeout is reached, check if nodes need to be restarted.
+Wait for `NOT_READY` nodes to become `READY` within `operationTimeoutMs`.
 
 4. **Categorize Nodes:**
 Group nodes based on their state and connectivity:
@@ -153,24 +154,24 @@ The following are the configuration options for the new KafkaRoller. If exposed
 
 8. **Refine `MAYBE_RECONFIGURE_OR_RESTART` Nodes:**
 Describe Kafka configurations via Admin API:
-   - Nodes with dynamic config changes go to `RECONFIGURE`.
-   - Nodes with non dynamic config changes, go to `RESTART`.
-   - Nodes with no config changes go to `NOP`.
+   - Nodes with dynamic config changes are added to the `RECONFIGURE` group.
+   - Nodes with non-dynamic config changes are added to the `RESTART` group.
+   - Nodes with no config changes are added to the `NOP` group.
 
 9. **Reconfigure Nodes:**
 Reconfigure nodes in the `RECONFIGURE` group:
-   - If `numReconfigAttempts` exceeds `maxReconfigAttempts`, add a restart reason and repeat from step 2.
+   - Check if `numReconfigAttempts` exceeds `maxReconfigAttempts`. If exceeded, add a restart reason and repeat from step 2. Otherwise, continue.
   - Send `incrementalAlterConfig` request, transition state to `UNKNOWN`, and increment `numReconfigAttempts`.
-   - Wait for each node's state to transition to `READY` within `operationTimeoutMs`. If timeout is reached, repeat from step 2.
+   - Wait for each node's state to transition to `READY` within `operationTimeoutMs`. If the timeout is reached, repeat from step 2; otherwise, continue.
 
 10. **Check for `NOT_READY` Nodes:**
-If `RESTART` group is empty and no nodes are `NOT_READY`, reconciliation is successful. Otherwise, wait for `NOT_READY` nodes' state to transition to `READY` within `operationTimeoutMs`. If timeout is reached, increment `numRetries` and repeat from step 2.
+If the `RESTART` group is empty and no nodes are `NOT_READY`, reconciliation is successful. Otherwise, wait for `NOT_READY` nodes' state to transition to `READY` within `operationTimeoutMs`. If the timeout is reached, increment `numRetries` and repeat from step 2. Otherwise, continue.
 
-11. **Batch and Restart Nodes:**
+11. **Categorize and Batch Nodes:**
 Categorize and batch nodes for restart:
    - Ensure controllers are restarted sequentially in an order of pure controllers, mixed nodes and the active controller to maintain quorum.
    - Group broker nodes without common partitions for parallel restart to maintain availability.
-   - If no safe nodes to restart, check `numRetries`. If exceeded, throw `UnrestartableNodesException`.Otherwise, increment `numRetries` and repeat from step 2. More on safety conditions below.
+   - If there are no safe nodes to restart, check `numRetries`. If exceeded, throw `UnrestartableNodesException`. Otherwise, increment `numRetries` and repeat from step 2. More on safety conditions below.
 
 12. **Restart Nodes in Parallel:**
 Restart broker nodes in the batch:
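
Step 11's rule of grouping broker nodes without common partitions, capped by `maxRestartParallelism`, could be realized with a simple greedy pass. This is a hypothetical sketch under assumed inputs (broker ID mapped to the set of partition names it hosts), not the algorithm the proposal prescribes:

```java
import java.util.*;

// Greedy sketch: a broker joins the current batch only if it shares no
// partition replicas with brokers already in the batch, and batch size is
// capped at maxRestartParallelism. Illustration only, not Strimzi code.
final class RestartBatcher {
    static List<List<Integer>> batch(Map<Integer, Set<String>> replicasByBroker,
                                     int maxRestartParallelism) {
        List<List<Integer>> batches = new ArrayList<>();
        List<Integer> current = new ArrayList<>();
        Set<String> partitionsInBatch = new HashSet<>();
        for (Map.Entry<Integer, Set<String>> e : replicasByBroker.entrySet()) {
            boolean disjoint = Collections.disjoint(partitionsInBatch, e.getValue());
            if (current.isEmpty() || (disjoint && current.size() < maxRestartParallelism)) {
                current.add(e.getKey());
                partitionsInBatch.addAll(e.getValue());
            } else {
                // Close the current batch and start a new one with this broker.
                batches.add(current);
                current = new ArrayList<>(List.of(e.getKey()));
                partitionsInBatch = new HashSet<>(e.getValue());
            }
        }
        if (!current.isEmpty()) batches.add(current);
        return batches;
    }
}
```

The map's iteration order determines the grouping, so a caller would pass an ordered map; a real implementation would also apply the availability (ISR/min.insync.replicas) checks the proposal describes.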
@@ -221,7 +222,7 @@ All the nodes except `mixed-3` have the following Context with `nodeRef` being t
 nodeRoles: controller
 state: READY
 lastTransition: 0123456
-reason: MANUAL_ROLLING_UPDATE
+restartReason: MANUAL_ROLLING_UPDATE
 numRestartAttempts: 0
 numReconfigAttempts: 0
 numRetries: 0
@@ -232,18 +233,18 @@ The `mixed-3` node has the following context because the operator could not esta
 nodeRoles: controller,broker
 state: NOT_RUNNING
 lastTransition: 0123456
-reason: POD_UNRESPONSIVE
+restartReason: POD_UNRESPONSIVE
 numRestartAttempts: 0
 numReconfigAttempts: 0
 numRetries: 0
 ```
-2. The roller checks if all of the controller nodes are mixed and in `NOT_RUNNING` state. Since they are not and it has `POD_UNRESPONSIVE` reason, it restarts `mixed-3` node and waits for it to have `READY` state. The `mixed-3`'s context becomes:
+2. The roller checks if all of the controller nodes are in `NOT_RUNNING` state. Since they are not, and the `mixed-3` node has the `POD_UNRESPONSIVE` reason, the node is restarted and the roller waits for it to reach `READY` state. The `mixed-3`'s context becomes:
 ```
 nodeRef: mixed-3/3
 nodeRoles: controller,broker
 state: RESTARTED
 lastTransition: 654987
-reason: POD_UNRESPONSIVE
+restartReason: POD_UNRESPONSIVE
 numRestartAttempts: 1
 numReconfigAttempts: 0
 numRetries: 0
@@ -277,7 +278,7 @@ topic("topic-E"), Replicas(6, 10, 11), ISR(6, 10, 11), MinISR(2)
 nodeRoles: broker
 state: RECOVERING
 lastTransition: 987456
-reason:
+restartReason:
 numRestartAttempts: 1
 numReconfigAttempts: 0
 numRetries: 10
@@ -294,7 +295,7 @@ topic("topic-E"), Replicas(6, 10, 11), ISR(6, 10, 11), MinISR(2)
 
 ### Switching from the old KafkaRoller to the new KafkaRoller
 
-The new KafkaRoller will only work with KRaft clusters therefore when running in Zookeeper mode, the current KafkaRoller will be used. Kafka CR's `KafkaMetadataState` represents where the metadata is stored for the cluster. It is set to `KRaft` when a cluster is fully migrated to KRaft or was created in KRaft mode. `KafkaReconciler` will be updated to switch to the new roller based on this state. This means the old KafkaRoller will be used during migration of existing clusters from Zookeeper to KRaft mode and the new roller is used only after the migration is completed and for new clusters created in KRaft mode.
+The new KafkaRoller will only work with KRaft clusters; therefore, when running in Zookeeper mode, the current KafkaRoller will be used. The Kafka CR's `KafkaMetadataState` represents where the metadata is stored for the cluster. It is set to `KRaft` when a cluster is fully migrated to KRaft or was created in KRaft mode. The `KafkaReconciler` class will be updated to switch to the new roller based on this state. This means the old KafkaRoller will be used during migration of existing clusters from Zookeeper to KRaft mode, and the new roller is used only after the migration is completed and for new clusters created in KRaft mode.
 
 ### Future improvement
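
The switching rule in this hunk amounts to a one-line decision on `KafkaMetadataState`. A sketch, assuming an abbreviated set of metadata states (the real Strimzi enum has more values) and an illustrative `RollerSelector` helper:

```java
// Hypothetical sketch of the switch described above: the reconciler would pick
// the new roller only once the cluster's metadata state is KRaft. The enum
// below is an abbreviated stand-in, not the full Strimzi KafkaMetadataState.
enum KafkaMetadataState { ZooKeeper, KRaftMigration, KRaft }

final class RollerSelector {
    static String rollerFor(KafkaMetadataState state) {
        // Old roller in ZooKeeper mode and throughout migration; new roller
        // only for fully migrated or KRaft-native clusters.
        return state == KafkaMetadataState.KRaft ? "NewKafkaRoller" : "OldKafkaRoller";
    }
}
```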

images/06x-new-roller-flow.png

154 KB