# Stretch Kafka cluster
The Strimzi Kafka operator currently manages Kafka clusters within a single Kubernetes cluster.
This proposal aims to extend support to stretch Kafka clusters, where brokers and controllers of a single Kafka cluster are distributed across multiple Kubernetes clusters.
## Current situation
At present, the availability of Strimzi-managed Kafka clusters is directly tied to the availability of the underlying Kubernetes cluster.
If a Kubernetes cluster experiences an outage, the entire Kafka cluster becomes unavailable, disrupting all connected Kafka clients.
## Motivation
A stretch Kafka cluster allows Kafka nodes to be distributed across multiple Kubernetes clusters, significantly enhancing resilience by enabling the system to tolerate the outage of an entire Kubernetes cluster without disrupting service to clients.
This configuration ensures high availability and seamless client operations, even in the event of cluster-specific failures.
In addition to improving fault tolerance, this approach also facilitates other valuable use cases, such as:
- **Migration Flexibility**: The ability to move Kafka clusters between Kubernetes environments without downtime, supporting maintenance or migrations.
- **Resource Optimization**: Efficiently utilizing resources across multiple clusters, which can be advantageous in environments with varying cluster capacities or during scaling operations.
### Limitations and Considerations
While a stretch Kafka cluster offers several advantages, it also introduces some challenges and considerations:
- **Increased Network Complexity and Costs**: The communication between brokers and controllers across clusters relies on network connectivity, which can be less reliable and more costly than intra-cluster communication.
This necessitates careful consideration of network architecture and associated costs.
- **Latency Requirements**: The stretch Kafka cluster is best suited for environments with low-latency network connections between the Kubernetes clusters. High latency can adversely affect the performance and synchronization of Kafka nodes, potentially leading to delays or errors in replication and client communication.
Defining the minimal acceptable latency between clusters is crucial to ensure optimal performance.
## Proposal
This proposal seeks to enhance the Strimzi Kafka operator to support stretch Kafka clusters, distributing brokers and controllers across multiple Kubernetes clusters.
The intent is to focus on high availability of the data plane. The proposal outlines high-level topology and design concepts for such deployments, with a plan to incrementally include finer design and implementation details for various aspects.
### Prerequisites
- **Multiple Kubernetes Clusters**: Stretch Kafka clusters will require multiple Kubernetes clusters.
Ideally, an odd number of clusters (at least three) is needed to maintain quorum in the event of a cluster outage.
- **Low Latency**: Kafka clusters should be deployed in environments that allow low-latency communication between Kafka brokers and controllers.
Stretch Kafka clusters should be deployed in environments such as data centers or availability zones within a single region, and not across distant regions where high latency could impair performance.
- **KRaft**: As Kafka and Strimzi transition towards KRaft-based clusters, this proposal focuses exclusively on enabling stretch deployments for KRaft-based Kafka clusters.
While ZooKeeper-based deployments are still supported, they are outside the scope of this proposal.
### Design
The cluster operator will be deployed in all Kubernetes clusters and will manage Kafka brokers/controllers running on that cluster.
One Kubernetes cluster will act as the control point for defining the custom resources (Kafka, KafkaNodePool) required for the stretch Kafka cluster.
The KafkaNodePool custom resource will be extended to include information about the Kubernetes cluster where the pool should be deployed.
The cluster operator will create the necessary resources (StrimziPodSets, Services, etc.) on the target clusters specified within the KafkaNodePool resource.
This approach will allow users to specify/manage the definition of a stretch Kafka cluster in a single location.
The operators will then create necessary resources in target Kubernetes clusters, which can then be reconciled/managed by operators on those clusters.
### Reconciling Kafka and KafkaNodePool resources

A new optional field (`target`) will be introduced in the KafkaNodePool resource specification, to allow users to specify the details of the Kubernetes cluster where the node pool should be deployed.
This section will include the target cluster's URL (Kubernetes cluster where resources for this node pool will be created) and the secret containing the kubeconfig data for that cluster.
An example of the KafkaNodePool resource with the new fields might look like:
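The nested field names under `target` shown here (`url`, `secretName`) are illustrative assumptions rather than a finalized schema, and the trailing listener override anticipates the per-cluster overrides described later in this proposal:

```yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaNodePool
metadata:
  name: pool-a
  labels:
    strimzi.io/cluster: my-cluster
spec:
  replicas: 3
  roles:
    - broker
  storage:
    type: jbod
    volumes:
      - id: 0
        type: persistent-claim
        size: 100Gi
        deleteClaim: false
  # Proposed new section: the Kubernetes cluster where this pool is deployed.
  # The nested field names are illustrative and may change in the final design.
  target:
    url: https://cluster-a.example.com:6443   # API server of the target cluster
    secretName: cluster-a-kubeconfig          # Secret with kubeconfig for that cluster
  # Illustrative per-cluster listener override (see cross-cluster communication below):
  listeners:
    - name: external
      type: ingress
```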
A new annotation (`stretch-mode: enabled`) will be introduced in the Kafka custom resource to indicate that it represents a stretch Kafka cluster.
This approach is similar to how Strimzi currently enables features like KafkaNodePool (KNP) and KRaft mode.
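For example, assuming the new annotation uses the usual `strimzi.io/` prefix (the exact key is not fixed by this proposal), enabling stretch mode might look like:

```yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: my-cluster
  annotations:
    strimzi.io/node-pools: enabled
    strimzi.io/kraft: enabled
    # Hypothetical annotation for this proposal; the key name is an assumption.
    strimzi.io/stretch-mode: enabled
```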
In a stretch Kafka cluster, we'll need bootstrap and broker services to be present on each Kubernetes cluster and be accessible from other clusters.
The Kafka reconciler will identify all target clusters from KafkaNodePool resources and create these services in target Kubernetes clusters.
This will ensure that even if the central cluster experiences an outage, external clients can still connect to the stretch cluster and continue their operations without interruption.
#### Cross-cluster communication
Kafka controllers/brokers are distributed across multiple Kubernetes environments and will need to communicate with each other.
Currently, the Strimzi Kafka operator defines Kafka listeners for internal communication (control plane and replication) between brokers/controllers (Kubernetes services using ports 9090 and 9091).
The user is not able to influence how these services are set up and exposed outside the cluster.
We would remove this limitation and allow users to define how these internal listeners are configured in the Kafka resource, just like they do for Kafka client listeners.
Users will also be able to override listener configurations in each KafkaNodePool resource, if the listeners need to be exposed in different ways (ingress host names, Ingress annotations, etc.) for each Kubernetes cluster.
This will be similar to how KafkaNodePools are used to override other configuration, such as storage.
To override a listener, the KafkaNodePool will define configuration with the same listener name as in the Kafka resource.
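As a sketch, assuming internal listeners become declarable in `spec.kafka.listeners` and overridable per pool (all field placements, exposure types, and hosts below are illustrative):

```yaml
# Kafka resource: hypothetical declaration of the internal listeners,
# alongside the usual client listeners.
spec:
  kafka:
    listeners:
      - name: controlplane        # control plane traffic, today fixed on port 9090
        port: 9090
        type: ingress             # exposed so peers in other clusters can reach it
        tls: true
      - name: replication         # inter-broker replication, today fixed on port 9091
        port: 9091
        type: ingress
        tls: true
---
# KafkaNodePool: hypothetical override matching the listener by name,
# e.g. to set a cluster-specific Ingress host.
spec:
  listeners:
    - name: replication
      type: ingress
      configuration:
        bootstrap:
          host: replication.cluster-a.example.com
```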
#### Resource cleanup on remote Kubernetes clusters
As some of the Kubernetes resources will be created on a remote cluster, we will not be able to use standard Kubernetes approaches for deleting resources based on owner references.
The operator will need to delete remote resources explicitly when the owning resource is deleted.
The exact mechanism that will be used for such cleanup in various scenarios has not been detailed yet and will be added here before the proposal is complete.
#### Network policies
In a stretch Kafka cluster, some network policies will be relaxed to allow communication from other Kubernetes clusters that are specified as targets in various KafkaNodePool resources.
This will allow brokers/controllers on separate Kubernetes clusters to communicate effectively.
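For illustration, assuming cross-cluster traffic arrives from a known peer address range (the CIDR below is a placeholder), a relaxed policy might admit it on the internal listener ports:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-cross-cluster-kafka
spec:
  podSelector:
    matchLabels:
      strimzi.io/cluster: my-cluster   # select the Kafka broker/controller pods
  ingress:
    - from:
        - ipBlock:
            cidr: 10.20.0.0/16         # placeholder: egress range of a peer cluster
      ports:
        - port: 9090                   # control plane listener
        - port: 9091                   # replication listener
```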
#### Secrets
We need to create Kubernetes Secrets in the central cluster that will store the credentials required for creating resources on the target clusters.
These secrets will be referenced in the KafkaNodePool custom resource.
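A minimal sketch of such a Secret, assuming the kubeconfig is stored under a `kubeconfig` key (the key and Secret names are illustrative):

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: cluster-a-kubeconfig   # referenced from the KafkaNodePool target section
  namespace: kafka
type: Opaque
stringData:
  kubeconfig: |
    # kubeconfig granting the central cluster operator access to
    # create and manage resources on the target cluster
```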
#### Entity operator
We would recommend that all KafkaTopic and KafkaUser resources are managed from the cluster that holds the Kafka and KafkaNodePool resources, and that this is the cluster where the entity operator is enabled.
This will allow all resource management/configuration from a central place.
The entity operator should not be impacted by changes in this proposal.
## Additional considerations
Once the general approach is agreed, this proposal will be updated to include an…
## Affected/not affected projects
This proposal only impacts the strimzi-kafka-operator project.
## Rejected alternatives
An alternative approach considered was setting up a stretch Kafka cluster with synchronized `KafkaStretchCluster` and `Kafka` custom resources (CRs).
The idea was to introduce a new CR called `KafkaStretchCluster`, which would contain details of all the clusters involved in the stretch Kafka deployment.
The spec would include information such as cluster names, secrets for connecting to each Kubernetes cluster, and a list of node pools across the entire stretch cluster.
The Kafka CR could be created in any of the Kubernetes clusters, and it would be propagated to the remaining clusters through coordinated actions by the Cluster Operator.
Similarly, changes to the Kafka CR could be made in any Kubernetes cluster, and once detected by the Cluster Operator, the changes would be propagated to the CRs in the other clusters.
The `KafkaNodePool` resources would be deployed to individual Kubernetes clusters, requiring users to apply the KafkaNodePool CR in each cluster separately.