06x-new-kafka-roller.md
@@ -103,54 +103,63 @@ Context: {
numRetries: 0
}
```
2. Observe and transition each node's state to the corresponding state based on the information collected from the abstracted sources.
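   As a minimal, non-authoritative sketch of this observation step: the roller could map what it sees from the abstracted sources onto the node states used in this proposal. The input enums and the helper class below are assumptions for illustration only.

   ```java
   // Sketch of step 2 (illustrative only): map observations from the abstracted sources
   // onto the node states used in this proposal. The input enums are assumptions.
   enum PodState { NOT_RUNNING, PENDING_OR_UNREADY, READY }
   enum BrokerState { RECOVERY, RUNNING, OTHER }
   enum NodeState { NOT_RUNNING, NOT_READY, RECOVERING, SERVING, LEADING_ALL_PREFERRED }

   class NodeStateObserver {
       static NodeState observe(PodState pod, BrokerState broker, boolean leadingAllPreferred) {
           if (pod == PodState.NOT_RUNNING) return NodeState.NOT_RUNNING;   // pod not scheduled or crash-looping
           if (broker == BrokerState.RECOVERY) return NodeState.RECOVERING; // log recovery in progress
           if (pod != PodState.READY || broker != BrokerState.RUNNING) return NodeState.NOT_READY;
           return leadingAllPreferred ? NodeState.LEADING_ALL_PREFERRED : NodeState.SERVING;
       }
   }
   ```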
3. If there are nodes in `NOT_READY` state, wait for them to have `SERVING` within the `postOperationalTimeoutMs`.

   We want to give nodes a chance to become ready before we try to connect to them or consider them for rolling. This is especially important for nodes that were just started.

   This is consistent with how the current roller handles unready nodes.

   - If the timeout is reached, proceed to the next step and check if any of the nodes need to be restarted.
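   The "wait for `SERVING` within `postOperationalTimeoutMs`" pattern used here recurs in most of the later steps. A minimal polling sketch, reusing the `NodeState` enum from the sketch above (the supplier and the one-second poll interval are assumptions, not part of the proposal):

   ```java
   import java.util.function.Supplier;

   // Illustrative sketch of the recurring "wait for a state within postOperationalTimeoutMs" pattern.
   class StateWaiter {
       /** Returns true if the node reached the desired state before the timeout, false otherwise. */
       static boolean awaitState(Supplier<NodeState> observeNode, NodeState desired, long postOperationalTimeoutMs)
               throws InterruptedException {
           long deadline = System.currentTimeMillis() + postOperationalTimeoutMs;
           while (System.currentTimeMillis() < deadline) {
               if (observeNode.get() == desired) {
                   return true;
               }
               Thread.sleep(1_000L); // re-observe the node periodically
           }
           return false; // the caller decides whether to retry or escalate
       }
   }
   ```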
4. Group the nodes into the following categories based on their state and connectivity (a rough sketch of this grouping follows the list):
   - `RESTART_FIRST` - Nodes that have `NOT_READY` or `NOT_RUNNING` state in their contexts. The group will also include nodes that we cannot connect to via the Admin API.
   - `WAIT_FOR_LOG_RECOVERY` - Nodes that have `RECOVERING` state.
   - `RESTART` - Nodes that have a non-empty list of reasons from the predicate function and have not been restarted yet (Context.numRestartAttempts == 0).
   - `MAYBE_RECONFIGURE` - Broker nodes (including combined nodes) that have an empty list of reasons and have not been reconfigured yet (Context.numReconfigAttempts == 0).
   - `NOP` - Nodes that have at least one restart or reconfiguration attempt (Context.numRestartAttempts > 0 || Context.numReconfigAttempts > 0) and have either `LEADING_ALL_PREFERRED` or `SERVING` state.
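   A rough, non-authoritative sketch of this grouping, assuming a `Context` shaped like the one shown earlier and the `NodeState` enum from the sketch above (field names and the connectivity flag are illustrative):

   ```java
   import java.util.List;

   // Minimal stand-in for the Context shown earlier; fields are illustrative.
   class Context {
       NodeState state;
       List<String> reasons;
       boolean isBroker;
       boolean isCombined;
       int numRestartAttempts;
       int numReconfigAttempts;
       int numRetries;
   }

   enum Group { RESTART_FIRST, WAIT_FOR_LOG_RECOVERY, RESTART, MAYBE_RECONFIGURE, NOP }

   class Grouping {
       // Illustrative classification for step 4; adminConnectable reflects whether the
       // Admin API connection to the node succeeded.
       static Group groupOf(Context ctx, boolean adminConnectable) {
           boolean serving = ctx.state == NodeState.SERVING || ctx.state == NodeState.LEADING_ALL_PREFERRED;
           if (ctx.state == NodeState.NOT_READY || ctx.state == NodeState.NOT_RUNNING || !adminConnectable) {
               return Group.RESTART_FIRST;
           } else if (ctx.state == NodeState.RECOVERING) {
               return Group.WAIT_FOR_LOG_RECOVERY;
           } else if (!ctx.reasons.isEmpty() && ctx.numRestartAttempts == 0) {
               return Group.RESTART;
           } else if (ctx.reasons.isEmpty() && ctx.isBroker && ctx.numReconfigAttempts == 0) {
               return Group.MAYBE_RECONFIGURE;
           } else if ((ctx.numRestartAttempts > 0 || ctx.numReconfigAttempts > 0) && serving) {
               return Group.NOP;
           }
           return Group.RESTART; // fallback for unclassified nodes (an assumption, not from the proposal)
       }
   }
   ```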
5. Wait for nodes in `WAIT_FOR_LOG_RECOVERY` group to finish performing log recovery.
   - Wait for the nodes to have `SERVING` within the `postOperationalTimeoutMs`.
   - If the timeout is reached for a node and its `numRetries` is greater than or equal to `maxRetries`, throw `UnrestartableNodesException` with the log recovery progress (number of remaining logs and segments). Otherwise increment the node's `numRetries` and repeat from step 2.
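   The retry bookkeeping applied here (and to the similar timeouts below) could look like the following sketch; the exception type comes from this proposal, the rest is assumed:

   ```java
   // Illustrative escalation decision when the log-recovery wait times out.
   class RetryHandling {
       static void onLogRecoveryTimeout(Context ctx, int maxRetries, String recoveryProgress) {
           if (ctx.numRetries >= maxRetries) {
               // Give up and surface the remaining logs/segments to the user.
               throw new UnrestartableNodesException("Log recovery not finished: " + recoveryProgress);
           }
           ctx.numRetries++; // otherwise retry from step 2 on the next loop iteration
       }
   }

   // Assumed to be defined elsewhere in the roller; shown here only so the sketch is self-contained.
   class UnrestartableNodesException extends RuntimeException {
       UnrestartableNodesException(String message) {
           super(message);
       }
   }
   ```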
6. Restart nodes in `RESTART_FIRST` category:
   - If one or more nodes have `NOT_RUNNING` state, we first need to check 2 special conditions (see the sketch at the end of this step):
     - If all of the nodes are combined and are in `NOT_RUNNING` state, restart them in parallel to give the best chance of forming the quorum.
       > This is to address the issue described in https://github.com/strimzi/strimzi-kafka-operator/issues/9426.
     - If a node is in `NOT_RUNNING` state, restart it only if it has the `POD_HAS_OLD_REVISION` reason. This is because, if the node is not running at all, restarting it likely won't make any difference unless the node is out of date.
       > For example, if a pod is in pending state due to a misconfigured affinity rule, there is no point restarting this pod again or restarting other pods, because that would leave them in pending state as well. If the user then fixes the misconfigured affinity rule, we should detect that the pod has an old revision and restart it so that it is scheduled correctly and runs.
   - At this point we have either started nodes or decided not to because they did not have the `POD_HAS_OLD_REVISION` reason. Regardless, wait for the nodes to have `SERVING` within `postOperationalTimeoutMs`. If the timeout is reached and a node's `numRetries` is greater than or equal to `maxRetries`, throw `TimeoutException`. Otherwise increment the node's `numRetries` and repeat from step 2.
   - Otherwise the nodes will be restarted one by one in the following order:
     - Pure controller nodes
     - Combined nodes
     - Broker-only nodes
   - Wait for each restarted node to have `SERVING` within `postOperationalTimeoutMs`. If the timeout is reached and the node's `numRetries` is greater than or equal to `maxRetries`, throw `TimeoutException`. Otherwise increment the node's `numRetries` and repeat from step 2.
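   A sketch of the `NOT_RUNNING` special-casing above, building on the `Context` sketch from step 4 (the restart action is a placeholder for the operator's actual pod-deletion logic, not an existing API):

   ```java
   import java.util.List;
   import java.util.function.Consumer;

   class RestartFirstHandling {
       // Illustrative handling of the NOT_RUNNING special cases in step 6.
       static void restartNotRunningNodes(List<Context> notRunning, Consumer<Context> restartAction) {
           boolean allCombinedNotRunning = !notRunning.isEmpty() && notRunning.stream()
                   .allMatch(c -> c.isCombined && c.state == NodeState.NOT_RUNNING);
           if (allCombinedNotRunning) {
               // Restart all combined nodes together to give the quorum the best chance of forming.
               notRunning.forEach(restartAction);
               return;
           }
           for (Context c : notRunning) {
               // A node that is not running only benefits from a restart if its pod spec is out of date.
               if (c.state == NodeState.NOT_RUNNING && c.reasons.contains("POD_HAS_OLD_REVISION")) {
                   restartAction.accept(c);
               }
           }
       }
   }
   ```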
7. Further refine the broker nodes in `MAYBE_RECONFIGURE` group:
   - Describe the Kafka configurations of each node via the Admin API and compare them against the desired configurations (see the sketch below). This is essentially the same mechanism we use today in the current KafkaRoller.
   - If a node has configuration changes and they can be dynamically updated, add the node into another group called `RECONFIGURE`.
   - If a node has configuration changes but they cannot be dynamically updated, add the node into the `RESTART` group.
   - If a node has no configuration changes, put the node into the `NOP` group.
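   A minimal Admin API sketch of the comparison (the desired-config map and the decision about dynamic updatability are assumptions; Strimzi's real implementation is more involved):

   ```java
   import java.util.HashSet;
   import java.util.Map;
   import java.util.Set;
   import java.util.concurrent.ExecutionException;
   import org.apache.kafka.clients.admin.Admin;
   import org.apache.kafka.clients.admin.Config;
   import org.apache.kafka.clients.admin.ConfigEntry;
   import org.apache.kafka.common.config.ConfigResource;

   class ConfigDiff {
       // Illustrative sketch of step 7: compare a broker's live configuration with the desired one
       // and return the keys that differ.
       static Set<String> changedConfigKeys(Admin admin, int brokerId, Map<String, String> desired)
               throws ExecutionException, InterruptedException {
           ConfigResource resource = new ConfigResource(ConfigResource.Type.BROKER, Integer.toString(brokerId));
           Config current = admin.describeConfigs(Set.of(resource)).all().get().get(resource);
           Set<String> changed = new HashSet<>();
           desired.forEach((key, value) -> {
               ConfigEntry entry = current.get(key);
               if (entry == null || !value.equals(entry.value())) {
                   changed.add(key); // differs -> candidate for RECONFIGURE or RESTART
               }
           });
           return changed;
       }
   }
   ```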
8. Reconfigure each node in `RECONFIGURE` group:
   - If `numReconfigAttempts` of a node is greater than the configured `maxReconfigAttempts`, add a restart reason to its context and repeat from step 2. Otherwise continue.
   - Send an `incrementalAlterConfig` request with its config updates (see the sketch below).
   - Transition the node's state to `RECONFIGURED` and increment its `numReconfigAttempts`.
   - Wait for each node whose configuration was updated to have `LEADING_ALL_PREFERRED` within the `postOperationalTimeoutMs`.
   - If the `postOperationalTimeoutMs` is reached, repeat from step 2.
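   A sketch of applying the dynamic update via the Admin API (assuming `updates` only contains keys that are known to be dynamically updatable):

   ```java
   import java.util.Collection;
   import java.util.Map;
   import java.util.concurrent.ExecutionException;
   import java.util.stream.Collectors;
   import org.apache.kafka.clients.admin.Admin;
   import org.apache.kafka.clients.admin.AlterConfigOp;
   import org.apache.kafka.clients.admin.ConfigEntry;
   import org.apache.kafka.common.config.ConfigResource;

   class Reconfiguration {
       // Illustrative sketch of step 8: send the dynamically updatable changes to one broker.
       static void reconfigure(Admin admin, int brokerId, Map<String, String> updates)
               throws ExecutionException, InterruptedException {
           ConfigResource resource = new ConfigResource(ConfigResource.Type.BROKER, Integer.toString(brokerId));
           Collection<AlterConfigOp> ops = updates.entrySet().stream()
                   .map(e -> new AlterConfigOp(new ConfigEntry(e.getKey(), e.getValue()), AlterConfigOp.OpType.SET))
                   .collect(Collectors.toList());
           Map<ConfigResource, Collection<AlterConfigOp>> request = Map.of(resource, ops);
           admin.incrementalAlterConfigs(request).all().get();
           // On success the roller transitions the node to RECONFIGURED and bumps numReconfigAttempts.
       }
   }
   ```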
9. If at this point the `RESTART` group is empty and there are no nodes in `NOT_READY` state, the reconciliation completes successfully.
   - If there are nodes in `NOT_READY` state, wait for them to have `SERVING` within the `postOperationalTimeoutMs`.
   - If the timeout is reached for a node and its `numRetries` is greater than or equal to `maxRetries`, throw `TimeoutException`.
   - Otherwise increment the node's `numRetries` and repeat from step 2.

   This is consistent with how the current roller handles unready nodes.
10. Otherwise, batch nodes in `RESTART` group and get the next batch to restart:
    - Further categorize nodes based on their roles so that the following restart order can be enforced:
      1. `NON_ACTIVE_CONTROLLER` - Pure controller that is not the active controller
      2. `ACTIVE_CONTROLLER` - Pure controller that is the active controller (the quorum leader)
@@ -169,17 +178,17 @@ Context: {
    - batch the nodes that do not have any partitions in common and can therefore be restarted together (sketched below)
    - remove nodes that have an impact on availability from the batches (more on this later)
    - return the largest batch
    - If an empty batch is returned, that means none of the nodes met the safety conditions such as availability and quorum health impact. In this case, check their `numRetries` and if any of them is equal to or greater than `maxRetries`, throw `UnrestartableNodesException`. Otherwise increment their `numRetries` and repeat from step 2.
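    The "no partitions in common" batching could be sketched as a greedy grouping over broker-to-partition mappings (the availability and quorum checks from this proposal are not shown; the map parameter is an assumption):

    ```java
    import java.util.ArrayList;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Map;
    import java.util.Set;

    class Batching {
        // Illustrative sketch of step 10: greedily group brokers whose partition sets are disjoint,
        // so that every node in a batch can be restarted at the same time.
        static List<Set<Integer>> disjointBatches(Map<Integer, Set<String>> partitionsByBroker) {
            List<Set<Integer>> batches = new ArrayList<>();
            List<Set<String>> partitionsPerBatch = new ArrayList<>();
            for (Map.Entry<Integer, Set<String>> entry : partitionsByBroker.entrySet()) {
                boolean placed = false;
                for (int i = 0; i < batches.size() && !placed; i++) {
                    // A broker joins a batch only if it shares no partitions with the batch so far.
                    if (partitionsPerBatch.get(i).stream().noneMatch(entry.getValue()::contains)) {
                        batches.get(i).add(entry.getKey());
                        partitionsPerBatch.get(i).addAll(entry.getValue());
                        placed = true;
                    }
                }
                if (!placed) {
                    batches.add(new HashSet<>(Set.of(entry.getKey())));
                    partitionsPerBatch.add(new HashSet<>(entry.getValue()));
                }
            }
            return batches; // the roller then removes unsafe nodes and picks the largest batch
        }
    }
    ```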
11. Restart the nodes from the returned batch in parallel:
    - If `numRestartAttempts` of a node is larger than `maxRestartAttempts`, throw `MaxRestartsExceededException`.
    - Otherwise, restart each node, transition its state to `RESTARTED` and increment its `numRestartAttempts`.
    - After restarting all the nodes in the batch, wait for their states to become `SERVING` until the configured `postOperationalTimeoutMs` is reached.
    - If the timeout is reached and a node's `numRetries` is greater than or equal to `maxRetries`, throw `TimeoutException`. Otherwise increment their `numRetries` and repeat from step 2.
    - After all the nodes are in `SERVING` state, trigger preferred leader elections via the Admin client. Wait for their states to become `LEADING_ALL_PREFERRED` until the configured `postOperationalTimeoutMs` is reached. If the timeout is reached, log a `WARN` message.
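    Triggering the preferred leader election via the Admin client could look like the following minimal sketch (passing `null` asks the brokers to run the election for all eligible partitions; a real implementation would likely scope it to the partitions hosted by the restarted nodes):

    ```java
    import java.util.Map;
    import java.util.Optional;
    import java.util.concurrent.ExecutionException;
    import org.apache.kafka.clients.admin.Admin;
    import org.apache.kafka.common.ElectionType;
    import org.apache.kafka.common.TopicPartition;

    class PreferredLeaderElection {
        // Illustrative sketch of the election at the end of step 11.
        static void electPreferredLeaders(Admin admin) throws ExecutionException, InterruptedException {
            Map<TopicPartition, Optional<Throwable>> results =
                    admin.electLeaders(ElectionType.PREFERRED, null).partitions().get();
            results.forEach((partition, error) ->
                    error.ifPresent(t -> System.out.println("WARN: election failed for " + partition + ": " + t)));
            // The roller then waits for LEADING_ALL_PREFERRED within postOperationalTimeoutMs.
        }
    }
    ```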
12. If there are no exceptions thrown at this point, the reconciliation completes successfully. If there were `UnrestartableNodesException`, `TimeoutException`, `MaxRestartsExceededException` or any other unexpected exceptions thrown, the reconciliation fails.