06x-new-kafka-roller.md

KafkaRoller decisions would be informed by observations coming from different sources (e.g. Kubernetes API, KafkaAgent, Kafka Admin API). These sources will be abstracted so that KafkaRoller is not dependent on their specifics as long as it's getting the information it needs. The abstractions also enable much better unit testing.

Nodes would be categorized based on the observed states, and the roller will perform specific actions on the nodes in each category. Those actions should cause a subsequent observation to trigger a state transition. This iterative process continues until each node's state aligns with the desired state.

In addition, the new KafkaRoller will introduce an algorithm to restart brokers in parallel when safety conditions are met. These conditions ensure Kafka producer availability and minimize the impact on controllers and overall cluster stability. It will also wait for partitions to be reassigned to their preferred leaders to avoid triggering unnecessary partition leader elections.

When a new reconciliation starts up, a context object is created for each node to hold the following information:

- <i>nodeRef</i>: NodeRef object that contains Node ID.
- <i>currentNodeRole</i>: Currently assigned process roles for this node (e.g. controller, broker).
- <i>lastKnownState</i>: It contains the last known state of the node based on information collected from the abstracted sources (Kubernetes API, KafkaAgent and Kafka Admin API). The table below describes the possible states.
- <i>restartReason</i>: It is updated based on the current predicate logic passed from the `KafkaReconciler` class. For example, an update in the Kafka CR is detected.
- <i>numRestartAttempts</i>: The value is incremented each time the node has been restarted or attempted to be restarted.
- <i>numReconfigAttempts</i>: The value is incremented each time the node has been reconfigured or attempted to be reconfigured.
- <i>numRetries</i>: The value is incremented each time the node is evaluated/processed but not restarted/reconfigured because a safety condition was not met, for example the availability check failed, the node was in log recovery, or it timed out waiting for the pod to become ready.
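Illustratively, the per-node context above could be modeled like this (a hypothetical Python sketch using the field names from this proposal; the actual implementation lives in the Java operator):

```python
from dataclasses import dataclass
from typing import Optional, Set

@dataclass
class Context:
    """Per-node bookkeeping, recreated at the start of each reconciliation (sketch)."""
    node_ref: str                         # node ID
    current_node_roles: Set[str]          # e.g. {"controller", "broker"}
    last_known_state: str = "UNKNOWN"     # initial state for a freshly created Context
    restart_reason: Optional[str] = None  # set from the reconciler's predicate logic
    num_restart_attempts: int = 0
    num_reconfig_attempts: int = 0
    num_retries: int = 0

ctx = Context("pool-a-0", {"broker"})
```

A fresh context always starts in `UNKNOWN` with all counters at zero; the counters are only ever incremented by the roller's actions.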

| State | Description | Possible transitions |
|:------|:------------|:---------------------|
| UNKNOWN | The initial state when creating the `Context` for a node, or the state just after the node is restarted/reconfigured. We expect to transition out of this state fairly quickly. | `NOT_RUNNING` `NOT_READY` `RECOVERING` `READY` |
| NOT_RUNNING | The node is not running (the Kafka process is not running). This is determined via the Kubernetes API; more details below. | `READY` `UNKNOWN` `NOT_READY` `RECOVERING` |
| NOT_READY | The node is running but not ready to serve requests, as determined by the Kubernetes readiness probe (broker state is not RUNNING OR the controller is not listening on its port). | `READY` `UNKNOWN` `NOT_RUNNING` `RECOVERING` |
| RECOVERING | The node has started but is in log recovery (broker state == 2). This is determined via the KafkaAgent. | `READY` `NOT_RUNNING` `NOT_READY` |
| READY | The node is running and ready to serve requests, as determined by the Kubernetes readiness probe (broker state is RUNNING OR the controller is listening on its port). | `LEADING_ALL_PREFERRED` `UNKNOWN` |
| LEADING_ALL_PREFERRED | The node is leading all the partitions that it is the preferred leader for. A node can transition into this state only from `READY`. | This is the final state we expect. |

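The permitted transitions in the table could be captured as a simple lookup (a hypothetical sketch; state names match the table):

```python
# Possible state transitions from the table above (sketch; names match the proposal).
TRANSITIONS = {
    "UNKNOWN": {"NOT_RUNNING", "NOT_READY", "RECOVERING", "READY"},
    "NOT_RUNNING": {"READY", "UNKNOWN", "NOT_READY", "RECOVERING"},
    "NOT_READY": {"READY", "UNKNOWN", "NOT_RUNNING", "RECOVERING"},
    "RECOVERING": {"READY", "NOT_RUNNING", "NOT_READY"},
    "READY": {"LEADING_ALL_PREFERRED", "UNKNOWN"},
    "LEADING_ALL_PREFERRED": set(),  # the final, desired state
}

def can_transition(current: str, target: str) -> bool:
    """Return True if the roller may observe a node moving from `current` to `target`."""
    return target in TRANSITIONS[current]
```

Encoding the transitions as data makes it easy to unit test that an observation never produces an impossible state change.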
Context about broker states and restart reasons:

If one of the following is true, then the node's state is `NOT_RUNNING`:

- the pod has container status `ContainerStateWaiting` with `CrashLoopBackOff` or `ImagePullBackOff` reason

If none of the above is true but the node is not ready, then its state would be `NOT_READY`.
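Based on the conditions above, the classification could be sketched as follows (a hypothetical function; only the conditions named in this section are encoded, and the real roller inspects additional Kubernetes pod conditions):

```python
def classify_node(waiting_reason, ready):
    """Classify a node from its observed pod status (hypothetical sketch)."""
    # Container stuck waiting with one of these reasons -> NOT_RUNNING.
    if waiting_reason in ("CrashLoopBackOff", "ImagePullBackOff"):
        return "NOT_RUNNING"
    # Running but failing the readiness probe -> NOT_READY.
    if not ready:
        return "NOT_READY"
    return "READY"
```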

#### High-level flow diagram describing the flow of the states


### Configurability
The following are the configuration options for the new KafkaRoller. If an option is exposed to the user, it can be configured via `STRIMZI_` environment variables; otherwise, the operator sets it to a default value (similar to what the current roller has):

| Configuration | Default value | Exposed to user | Description |
|:--------------|:--------------|:----------------|:------------|
| maxRestartAttempts | 3 | No | The maximum number of restart attempts per node before failing the reconciliation. This is checked against the node's `numRestartAttempts`. |
| maxReconfigAttempts | 3 | No | The maximum number of dynamic reconfiguration attempts per node before restarting the node. This is checked against the node's `numReconfigAttempts`. |
| maxRetries | 10 | No | The maximum number of times a node can be retried after not meeting the safety conditions, e.g. the availability check failed. This is checked against the node's `numRetries`. |
| operationTimeoutMs | 60 seconds | Yes | The maximum amount of time to wait for nodes to transition to the `READY` state after an operation in each retry. This is already exposed to the user via the environment variable `STRIMZI_OPERATION_TIMEOUT_MS`. |
| maxRestartParallelism | 1 | Yes | The maximum number of broker nodes that can be restarted in parallel. This will be exposed to the user via the new environment variable `STRIMZI_MAX_RESTART_BATCH_SIZE`. However, if there are multiple brokers in the `NOT_RUNNING` state, they may be restarted in parallel despite this configuration, for faster recovery. |
| postRestartDelay | 0 | Yes | Delay between restarts of nodes or batches. It is set to 0 by default, but can be adjusted by users to slow down the restarts. This also helps the JIT compiler reach a steady state and reduces the impact on clients. |
| restartAndPreferredLeaderElectionDelay | 10 seconds | No | Delay between a restart and triggering partition leader election, so that the just-rolled broker is leading all the partitions it is the preferred leader for. This avoids situations where leaders move to a newly started node that does not yet have established networking to some outside networks, e.g. through load balancers. |

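The defaults and environment-variable overrides could be loaded roughly like this (a sketch; only `STRIMZI_OPERATION_TIMEOUT_MS` and `STRIMZI_MAX_RESTART_BATCH_SIZE` are named in the proposal as user-facing, the rest stay internal defaults):

```python
import os

def load_roller_config(env=None):
    """Internal defaults, with the two user-exposed values read from STRIMZI_ env vars (sketch)."""
    env = os.environ if env is None else env
    return {
        "maxRestartAttempts": 3,
        "maxReconfigAttempts": 3,
        "maxRetries": 10,
        "operationTimeoutMs": int(env.get("STRIMZI_OPERATION_TIMEOUT_MS", 60_000)),
        "maxRestartParallelism": int(env.get("STRIMZI_MAX_RESTART_BATCH_SIZE", 1)),
        "postRestartDelayMs": 0,
        "restartAndPreferredLeaderElectionDelayMs": 10_000,
    }
```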

Each node's `Context` is initialized with the following data:

```
nodeRoles: <Set using pod labels `strimzi.io/controller-role` and `strimzi.io/broker-role`>,
state: UNKNOWN,
lastTransition: <SYSTEM_TIME>,
restartReason: <Result of predicate function from KafkaReconciler>,
numRestartAttempts: 0,
numReconfigAttempts: 0,
numRetries: 0
```

Contexts are recreated in each reconciliation with the above initial data.

2. **Transition Node States:**

Update each node's state based on information from the abstracted sources. If the information cannot be retrieved, the current reconciliation fails immediately; when the next reconciliation is triggered, it restarts from step 1.

3. **Handle `NOT_READY` Nodes:**

Wait for `NOT_READY` nodes to become `READY` within `operationTimeoutMs`.

4. **Categorize Nodes:**

Group nodes based on their state and connectivity:

- Nodes with dynamic config changes are added to the `RECONFIGURE` group.
- Nodes with non-dynamic config changes are added to the `RESTART` group.
- Nodes with no config changes are added to the `NOP` group.

9. **Reconfigure Nodes:**

Reconfigure nodes in the `RECONFIGURE` group:

- Check whether `numReconfigAttempts` exceeds `maxReconfigAttempts`. If it does, add a restart reason and repeat from step 2. Otherwise, continue.
- Send an `incrementalAlterConfigs` request, transition the state to `UNKNOWN`, and increment `numReconfigAttempts`.
- Wait for each node's state to transition to `READY` within `operationTimeoutMs`. If the timeout is reached, repeat from step 2; otherwise, continue.
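Step 9 for a single node could be sketched like this (hypothetical helpers and field names; `send_incremental_alter_configs` stands in for the Kafka Admin API call, and the restart reason name is illustrative, not from the proposal):

```python
def reconfigure_node(ctx, max_reconfig_attempts, send_incremental_alter_configs):
    """Sketch of step 9 for one node in the RECONFIGURE group."""
    if ctx["numReconfigAttempts"] >= max_reconfig_attempts:
        # Too many reconfiguration attempts: fall back to a restart instead.
        ctx["restartReason"] = "CONFIG_CHANGE"  # hypothetical reason name
        return "REPEAT_FROM_STEP_2"
    send_incremental_alter_configs(ctx["nodeRef"])
    ctx["state"] = "UNKNOWN"  # a fresh observation must confirm the node again
    ctx["numReconfigAttempts"] += 1
    return "WAIT_FOR_READY"
```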

10. **Check for `NOT_READY` Nodes:**

If the `RESTART` group is empty and no nodes are `NOT_READY`, the reconciliation is successful. Otherwise, wait for the `NOT_READY` nodes' state to transition to `READY` within `operationTimeoutMs`. If the timeout is reached, increment `numRetries` and repeat from step 2. Otherwise, continue.

11. **Categorize and Batch Nodes:**

Categorize and batch nodes for restart:

- Ensure controllers are restarted sequentially, in the order of pure controllers first, then mixed nodes, and the active controller last, to maintain quorum.
- Group broker nodes without common partitions for parallel restart to maintain availability.
- If there are no safe nodes to restart, check `numRetries`. If it exceeds the maximum, throw `UnrestartableNodesException`. Otherwise, increment `numRetries` and repeat from step 2. More on the safety conditions below.

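The broker-batching rule above (brokers within one batch host disjoint partition sets, so restarting the whole batch at once cannot take any partition fully offline) could be sketched as a greedy grouping (hypothetical sketch, not the actual batching algorithm):

```python
def batch_brokers(partitions_by_broker, max_parallelism):
    """Greedily group brokers that share no partitions (hypothetical sketch)."""
    batches = []  # list of (broker_list, set_of_covered_partitions)
    for broker, partitions in partitions_by_broker.items():
        for batch, covered in batches:
            # Join an existing batch only if it has room and no partition overlap.
            if len(batch) < max_parallelism and not (partitions & covered):
                batch.append(broker)
                covered |= partitions  # in-place set update
                break
        else:
            batches.append(([broker], set(partitions)))
    return [batch for batch, _ in batches]
```

Brokers 0 and 2 below share no partitions, so they land in one batch, while broker 1 (which overlaps broker 0 on `t-0`) must restart separately.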

12. **Restart Nodes in Parallel:**

Restart the broker nodes in the batch:

All the nodes except `mixed-3` have the following Context, with `nodeRef` set to the corresponding node:

```
nodeRoles: controller
state: READY
lastTransition: 0123456
restartReason: MANUAL_ROLLING_UPDATE
numRestartAttempts: 0
numReconfigAttempts: 0
numRetries: 0
```

The `mixed-3` node has the following context because the operator could not establish a connection to it:

```
nodeRoles: controller,broker
state: NOT_RUNNING
lastTransition: 0123456
restartReason: POD_UNRESPONSIVE
numRestartAttempts: 0
numReconfigAttempts: 0
numRetries: 0
```

2. The roller checks whether all of the controller nodes are in the `NOT_RUNNING` state. Since they are not, and the `mixed-3` node has the `POD_UNRESPONSIVE` reason, it is restarted and the roller waits for it to reach the `READY` state. The `mixed-3` node's context becomes:

### Switching from the old KafkaRoller to the new KafkaRoller

The new KafkaRoller will only work with KRaft clusters; therefore, when running in ZooKeeper mode, the current KafkaRoller will be used. The Kafka CR's `KafkaMetadataState` represents where the metadata is stored for the cluster. It is set to `KRaft` when a cluster is fully migrated to KRaft or was created in KRaft mode. The `KafkaReconciler` class will be updated to switch to the new roller based on this state. This means the old KafkaRoller will be used during the migration of existing clusters from ZooKeeper to KRaft mode, and the new roller is used only after the migration is completed and for new clusters created in KRaft mode.
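The switching logic amounts to a single check on `KafkaMetadataState` (a hypothetical sketch of the decision, not the actual `KafkaReconciler` code):

```python
def pick_roller(kafka_metadata_state: str) -> str:
    """Choose the roller from the Kafka CR's KafkaMetadataState (sketch)."""
    # Only fully-KRaft clusters use the new roller; anything else (including
    # clusters mid-migration from ZooKeeper) keeps using the old KafkaRoller.
    return "new" if kafka_metadata_state == "KRaft" else "old"
```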