# KRaft upgrades and downgrades

This proposal introduces the Kafka upgrade mechanism for KRaft clusters.
It describes how Strimzi will upgrade or downgrade KRaft clusters from one Apache Kafka version to another.

## Current situation

Strimzi currently does not support Apache Kafka upgrades / downgrades on KRaft clusters.
The expectation is that KRaft-based Apache Kafka clusters are deleted and freshly created both when upgrading / downgrading the Strimzi Cluster Operator and when upgrading / downgrading Apache Kafka from one version to another.

### ZooKeeper-based clusters

Strimzi currently supports upgrades of ZooKeeper-based Kafka clusters.
Upgrading ZooKeeper-based clusters involves a multi-stage process:
1. First, ZooKeeper nodes are rolled to use the new container image with the new Kafka version and possibly also a new ZooKeeper version (if the new Kafka version uses a new ZooKeeper version).
2. Next, Kafka brokers are rolled to use the new container image with the new Apache Kafka version.
3. Finally, the Kafka brokers are rolled one more time to update the `inter.broker.protocol.version` (in case the `inter.broker.protocol.version` changed).
   This step is done either when the user changes the `inter.broker.protocol.version` in `Kafka.spec.kafka.config` (see the example after this list), or automatically in the next reconciliation if `inter.broker.protocol.version` is not set at all in `Kafka.spec.kafka.config`.
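
For illustration, this is how the `inter.broker.protocol.version` is typically set in a `Kafka` custom resource today (the cluster name and version values are placeholders):

```yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: my-cluster
spec:
  kafka:
    version: 3.5.0
    config:
      # Kept at the old protocol version until the new images are verified
      inter.broker.protocol.version: "3.4"
```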

Strimzi also supports downgrades of ZooKeeper-based Kafka clusters.
Downgrading ZooKeeper-based clusters involves a multi-stage process:
1. If the `inter.broker.protocol.version` is set to a version higher than what is supported by the Kafka version being downgraded to, it must be set to a version that is supported by the older Kafka version.
   This step has to be done manually by the user by setting the version in `Kafka.spec.kafka.config`.
   The Cluster Operator will roll the Kafka brokers to set the new value but will still use the container images of the newer Kafka version.
   Strimzi validates the `inter.broker.protocol.version` and does not proceed with the next steps until a version compatible with the older Kafka version is used.
   Kafka allows downgrading the `inter.broker.protocol.version`, but does not guarantee full functionality for all versions.
2. Next, ZooKeeper nodes are rolled to use the container image with the older Kafka version and possibly also an older ZooKeeper version (if the older Kafka version uses an older ZooKeeper version).
3. Finally, the Kafka brokers are rolled one more time to use the container image with the older Kafka version.

The upgrade / downgrade happens in two situations:
* When the Kafka version is changed in `Kafka.spec.kafka.version`
* When the Strimzi Cluster Operator is upgraded / downgraded and uses a different default version of Apache Kafka

## Motivation

KRaft-to-KRaft upgrades and downgrades are one of the missing parts of Strimzi's KRaft support.
Supporting upgrades and downgrades is essential to consider KRaft as production-ready.
This work can be done in parallel with the work on other KRaft proposals, such as migration of ZooKeeper-based clusters to KRaft-based clusters.

## Proposal

Strimzi should follow the Apache Kafka [procedure for upgrading / downgrading](https://kafka.apache.org/documentation/#upgrade_350_kraft) Kafka.

### KRaft upgrades according to Apache Kafka

The Apache Kafka [KRaft upgrade procedure](https://kafka.apache.org/documentation/#upgrade_350_kraft) consists of two steps:
1. Roll out the new Kafka version and verify everything is working fine.
2. Update the `metadata.version` using the `kafka-features.sh` command line tool (or using the Kafka Admin API).
   Unlike updating the `inter.broker.protocol.version`, updating the `metadata.version` does not require a rolling update of all Kafka nodes.

Downgrade is possible as well:
1. Downgrading the `metadata.version`
2. Rolling out the older Kafka version

Changes to the metadata formats - which are defined by the `metadata.version` - might not be backwards compatible.
As a result, the first step of the downgrade procedure - downgrading the `metadata.version` - can be done safely without any metadata loss only for selected metadata versions.
For other versions, the downgrade will be possible only as an _unsafe_ downgrade that might result in losing some metadata information, which might have a negative impact on the cluster.
This is similar to how the `inter.broker.protocol.version` works today, as it also doesn't offer any guarantees of full compatibility when downgrading it.

#### Existing downgrade limitations

As of today, the KRaft downgrade situation is as follows:
1. The unsafe downgrade is currently not supported by Apache Kafka.
2. None of the `metadata.version`s supported by current Kafka versions support the safe downgrade.
   _(You should be able to downgrade from 3.5-IV1 to 3.5-IV0. But you cannot downgrade between the 3.5 and 3.4 metadata versions.)_

The first point should be addressed in the future when support for the unsafe downgrade is implemented.
The second point is expected to remain an issue.
While in the future the metadata formats might become more stable and it might be possible to safely downgrade the metadata between different Kafka versions, there still might be some versions that do not support the safe downgrade.

Although Kafka does not currently support downgrades in practice, this proposal designs the Strimzi implementation based on how the downgrade is expected to work.

### Strimzi implementation

Strimzi will follow the same approach to upgrading / downgrading a KRaft cluster as it does today for ZooKeeper-based clusters.

#### Configuring the metadata version

In ZooKeeper-based clusters, the `inter.broker.protocol.version` is stored in the Kafka configuration (`Kafka.spec.kafka.config`).
The `metadata.version` is not part of the Kafka configuration.
To allow users to configure it in our Kafka custom resource, a new String-type field named `metadataVersion` will be added to `Kafka.spec.kafka` (see the example below).
The operator will validate this field to ensure it contains a valid Kafka metadata version.
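
A minimal sketch of how the new field might be used in the `Kafka` custom resource (the version values are illustrative only):

```yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: my-cluster
spec:
  kafka:
    version: 3.5.0
    # New field proposed here; plays the role that
    # inter.broker.protocol.version plays in ZooKeeper-based clusters
    metadataVersion: 3.5-IV2
```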

Users can use this new field to configure the metadata version.
When the field is not set, it will default to the metadata version corresponding to the current Kafka version.
The default metadata version for a given Kafka version will be stored in `kafka-versions.yaml` in the same way we today store the default `inter.broker.protocol.version` and `log.message.format.version`.
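
A sketch of what an entry in `kafka-versions.yaml` might look like; the `metadata` key is an assumption introduced here, and the final key name may differ:

```yaml
- version: 3.5.0
  format: "3.5"
  protocol: "3.5"
  # Hypothetical new key holding the default metadata version
  # for this Kafka version
  metadata: "3.5-IV2"
  default: true
```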

The operator will use this field to set the metadata version in the Kafka cluster.
In the initial deployment of the Kafka cluster, it will be set using the `kafka-storage.sh` tool (see the _Configuring the initial metadata version_ section for more details about why `kafka-storage.sh` needs to be used for new clusters).
In existing Kafka clusters, it will be queried and changed using the Kafka Admin API.

The current metadata version will be tracked in the `.status` section of the Kafka CR, in `.status.kafkaMetadataVersion` (see the sketch below).
This field will be used for validations and to track whether an upgrade / downgrade can be executed.
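
For illustration, the tracked status field might be reported like this (the value is illustrative):

```yaml
status:
  # The metadata version currently used by the Kafka cluster,
  # as observed by the operator through the Admin API
  kafkaMetadataVersion: 3.5-IV2
```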

##### Upgrading metadata version

When upgrading the metadata version, the operator will validate the desired version and check that it can be expected to work with the Kafka version in use (i.e. that it is not higher than the highest `metadata.version` supported by the Kafka version specified in `Kafka.spec.kafka.version` or by its default value).
If the validation passes, the Kafka Admin API will be used to upgrade the metadata version.
If the validation fails, the existing metadata version will be kept and an error will be raised in the log and in the `.status` section of the Kafka CR.
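
A validation failure might surface in the CR status as a condition along these lines; the `reason` and `message` strings are hypothetical, not a final API:

```yaml
status:
  conditions:
    - type: Warning
      status: "True"
      # Hypothetical reason and message for a rejected metadata version
      reason: MetadataVersionValidationFailure
      message: "Metadata version 3.6-IV0 is not supported by Kafka 3.5.0"
```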

##### Downgrading metadata version

When downgrading the `metadata.version` in Kafka, the operator will always attempt a _safe_ downgrade only.
If the safe downgrade fails, it will report an error (in the log and in the `.status` section of the Kafka CR) and continue the reconciliation without changing the metadata version.

In case the safe downgrade is not supported, an _unsafe_ downgrade can be attempted (as explained in one of the earlier sections, the _unsafe_ downgrade is not implemented in current Kafka versions, but it is planned to be supported in the future).
An unsafe downgrade might result in metadata loss that might cause various problems in the Kafka cluster.
Users who decide they want to do the unsafe downgrade can do so manually (see the sketch after this list):
* Pause the reconciliation of the Kafka CR
* Update the Kafka CR to the target (downgraded) metadata version (to avoid the operator upgrading it back immediately after the reconciliation is unpaused)
* Use the Kafka Admin API or CLI tools to do the unsafe downgrade
* Unpause the reconciliation of the Kafka CR
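
A minimal sketch of the first two steps, using Strimzi's existing `strimzi.io/pause-reconciliation` annotation (the version values are illustrative):

```yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: my-cluster
  annotations:
    # Step 1: pause reconciliation so the operator does not interfere
    strimzi.io/pause-reconciliation: "true"
spec:
  kafka:
    version: 3.5.0
    # Step 2: already set to the target version, so the operator does not
    # upgrade it back once the reconciliation is unpaused
    metadataVersion: 3.4-IV0
```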

##### Configuring the initial metadata version

Normally, when a new Kafka cluster is deployed, it will use its default metadata version.
If a user wants to start a new Kafka cluster with an older metadata version, it has to be specified using the `--release-version` option of the `kafka-storage.sh` script when preparing the storage of the new cluster.
Without using this option, the Kafka cluster will start with its default metadata version, and a downgrade to the older one might not be possible anymore.
To avoid this, Strimzi will use this mechanism to allow users to deploy new Kafka clusters with older metadata versions as well.

Users will specify the desired metadata version as usual in `Kafka.spec.kafka.metadataVersion`.
Internally, Strimzi will store the desired `metadata.version` in the per-broker configuration Config Map (see the sketch below).
The value will be mounted into the Kafka pods and used when formatting the storage of a new Kafka cluster using the `kafka-storage.sh` utility.
That way - unlike when using environment variables - we will not need to roll the Pod every time the `metadata.version` changes (since in such cases we can update it dynamically without rolling the pods).
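
A sketch of what the per-broker Config Map entry might look like; the Config Map name and the key are assumptions, not final naming:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: my-cluster-kafka-0   # hypothetical per-broker Config Map name
data:
  # Hypothetical key; mounted into the pod and passed to
  # `kafka-storage.sh format --release-version ...` when the storage
  # of a new cluster is formatted
  metadata.version: "3.5-IV2"
```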

#### Upgrade procedure

Upgrade can happen in two situations:
* After the upgrade of the Strimzi Cluster Operator:
  * When the new version of the Cluster Operator supports a new default Kafka version and `Kafka.spec.kafka.version` is not set.
  * When the user changes `Kafka.spec.kafka.version` in parallel with the Strimzi Cluster Operator upgrade.
* When the user requests a Kafka upgrade by changing `Kafka.spec.kafka.version`.

In both cases, the Strimzi Cluster Operator will:
1. Validate that the currently used metadata version is compatible with the new Kafka version (i.e. that it is not higher than the metadata version supported by the new Kafka version).
   This validation will be done based on the `kafka-versions.yaml` file and by comparing the versions.
   This check is expected to pass for upgrades, since a Kafka cluster that would not pass it would have been invalid already before the upgrade.
2. Roll all Kafka pods to use the containers with the new Kafka version.

Once all the Kafka pods are rolled, the next step will depend on whether `Kafka.spec.kafka.metadataVersion` is set or not.
When it is not set, Strimzi will automatically update the metadata version in the Kafka cluster to the default version corresponding to the new Kafka version, using the Kafka Admin API.
The operator will not change the `Kafka.spec.kafka.metadataVersion` field - it will remain unset.
With that, the Kafka upgrade will be complete.

In case the `Kafka.spec.kafka.metadataVersion` field is set, the operator will just check that the Kafka cluster has the desired metadata version, as in any other reconciliation.
The user will be expected to verify the Kafka cluster and change `Kafka.spec.kafka.metadataVersion` when everything seems to work fine (see the sketch below).
Only after changing the metadata version will the upgrade be considered complete.
Similarly to today's ZooKeeper-based implementation, a warning will be issued (in the log and in the Kafka CR `.status` section) when the metadata version does not correspond to the Kafka version, telling the user to complete the upgrade by updating the metadata version.
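
A sketch of the two user-driven phases when `metadataVersion` is set explicitly (version values illustrative):

```yaml
# Phase 1: bump only the Kafka version; the pods are rolled,
# but the metadata version stays at the old level
spec:
  kafka:
    version: 3.5.0
    metadataVersion: 3.4-IV0
---
# Phase 2: once the cluster is verified, bump the metadata version
# to complete the upgrade (no rolling update is needed for this change)
spec:
  kafka:
    version: 3.5.0
    metadataVersion: 3.5-IV2
```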

Upgrades will support skipping multiple Strimzi and Kafka versions in a single step.
For example, upgrading from Strimzi 1.1.0 / Kafka 4.1.0 to Strimzi 1.5.0 / Kafka 4.4.0 (the exact versions are for demonstration purposes only).
The exact number of versions a user can skip during the upgrade might be limited by other changes in Strimzi and by Apache Kafka itself.

#### Downgrade procedure

Downgrade can happen only in one situation: when the user requests a Kafka downgrade by changing `Kafka.spec.kafka.version`.

Unlike upgrades, downgrading the Strimzi Cluster Operator is expected to be done only step-by-step and does not support skipping Strimzi versions:
1. First downgrade the Kafka version to the oldest version supported by the newer Strimzi version
2. Only then downgrade the Cluster Operator while keeping the same Kafka version
3. Repeat if you need to downgrade to older Strimzi / Kafka versions

In both scenarios - a Kafka-only downgrade or a step-by-step Strimzi downgrade - the user is expected to ensure, before triggering the downgrade, that the metadata version is compatible with the Kafka version being downgraded to.
If the `metadata.version` was already bumped before the user decided to downgrade, the user can downgrade it by changing `Kafka.spec.kafka.metadataVersion` (see the sketch below).
Setting the exact version is required even if the metadata version was not set in the `Kafka` CR and was bumped automatically.
(See the section about configuring metadata versions for the limitations of downgrading metadata.)
Unlike upgrades, which might happen automatically, downgrades are expected to always be driven by the users.
So this level of upfront preparation is acceptable.
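
A sketch of the preparation and downgrade edits (version values illustrative):

```yaml
# Step 1: while still on the newer Kafka version, pin the metadata
# version to one supported by the target Kafka version
spec:
  kafka:
    version: 3.5.0
    metadataVersion: 3.4-IV0
---
# Step 2 (a later edit, once step 1 has reconciled): downgrade the
# Kafka version itself; the operator rolls the pods to the older image
spec:
  kafka:
    version: 3.4.0
    metadataVersion: 3.4-IV0
```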

In both scenarios, the operator will:
1. Validate that the metadata version is suitable for the Kafka version we are downgrading to (i.e. the same or lower).
   If the validation fails, it will fail the reconciliation and expect the user to address it.
   Since the user triggers the downgrade by a manual action (editing the Kafka CR), it is expected that this error will be caught early and the user will fix the problem or revert the downgrade.
2. Next, the operator will roll the Kafka pods to use the container with the desired version.

_Note:_
_This corresponds to what Strimzi supports today for ZooKeeper-based clusters._
_Today, the users are responsible for downgrading the `inter.broker.protocol.version` before the Kafka downgrade by updating their `Kafka` CR._
_Similarly, even today they have to downgrade Strimzi versions step-by-step in the same way as described in this section._

#### (Stretch goal) Downgrade of Kafka after Strimzi Cluster Operator downgrade

As a _stretch_ goal, we should also support Kafka downgrades right after the Strimzi downgrade (i.e. jumping multiple Strimzi and Kafka versions during the downgrade).
Implementing this has additional technical complications, because the downgraded operator needs to understand Kafka versions that did not exist at the time it was released.
While this feature might be useful in some situations, it does not need to be implemented to consider our KRaft support _production-ready_.
So it might be implemented only later, after the other parts.

Even in this downgrade scenario, the user has to make sure the metadata version used by Strimzi is valid for the Kafka version that we should downgrade to.
After the Cluster Operator is downgraded, the operator will validate the Kafka version and the metadata version.
If the validation passes, it will automatically roll all Kafka pods to use the Kafka version we are downgrading to.
If the validation fails, it will be up to the user to address the issue.
This might require moving back to the original Strimzi and Kafka version.
#### Example YAML files

Given the current limitations when downgrading the metadata versions, it seems reasonable for users to prefer managing the metadata version manually rather than having it changed automatically.
So we should encourage users to set `Kafka.spec.kafka.metadataVersion` by having it set in our KRaft example YAML files, as shown below.
It will be set there to the metadata version corresponding to the Kafka version the example files use.
This also corresponds to the current ZooKeeper-based examples, which have the `inter.broker.protocol.version` set.
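
For example, assuming the example files ship with Kafka 3.5.0, they might contain:

```yaml
spec:
  kafka:
    version: 3.5.0
    # Set explicitly so that metadata version changes are always
    # a deliberate user action
    metadataVersion: 3.5-IV2
```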

#### Risks

Kafka provides APIs to manage the `metadata.version`.
Strimzi has no way to block users from using these APIs.
If users decide to manipulate the `metadata.version` themselves, this can lead to unexpected issues.
This risk is deemed acceptable, because we expect that:
* Users would prefer to have the `metadata.version` managed through Strimzi.
* Users would have no reason to manipulate the `metadata.version` on their own.

The risk can also be at least partially mitigated through documentation.

## Affected projects

This proposal affects only the Strimzi Cluster Operator.
No other components are affected.

## Backwards compatibility

This proposal has no impact on any existing features or on the upgrades / downgrades of ZooKeeper-based clusters.

## Rejected alternatives

### Leaving the `metadata.version` updates to users

One of the alternatives considered was to have Strimzi only update the container images (the software) and leave the update of the `metadata.version` to the user.
However, this would not allow users to execute the whole upgrade declaratively.
There would also be a risk that users would not follow up with the `metadata.version` update and keep using the old version, which might have negative consequences in the future when the Kafka cluster runs with a too old `metadata.version`.