Description
I came across a disturbing bug while building a controller quorum. The difficulty is the use of SASL_SSL with SCRAM-SHA-512 as the primary and only mechanism for intra-cluster communication (controller-broker, controller-controller, broker-broker). With only one controller in CurrentVoters, the cluster survives any server failure or restart, even when all servers go down at the same time. Once I promote more controllers from CurrentObservers to CurrentVoters, losing one controller still causes no problems, but if I lose all controllers at once, even briefly, the quorum can never re-form.
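The difference between the one-voter and the three-voter case can be illustrated with simple majority arithmetic. This is a minimal sketch, assuming KRaft follows the usual Raft rule that a leader needs votes from a strict majority of the voter set; `votes_needed` and `can_elect` are hypothetical helper names, not Kafka APIs.

```python
def votes_needed(voter_count: int) -> int:
    """Smallest strict majority of the voter set."""
    return voter_count // 2 + 1


def can_elect(voter_count: int, votes_obtainable: int) -> bool:
    """True if a candidate can collect enough votes to become leader.

    votes_obtainable includes the candidate's own vote, so it is 1
    when no peer can be reached or authenticated.
    """
    return votes_obtainable >= votes_needed(voter_count)


for voters in (1, 3):
    print(voters, "voter(s): need", votes_needed(voters), "vote(s)")

# A lone voter elects itself without talking to any peer.
print(can_elect(1, 1))
# With three voters, a candidate whose Vote requests all fail has only its own vote.
print(can_elect(3, 1))
```

Under that assumption, a single voter never has to authenticate to a peer in order to recover, while each of three voters needs at least one successful Vote exchange with another controller.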
I add the SCRAM credentials when formatting storage:
1st controller$ /opt/kafka/bin/kafka-storage.sh format -t $CLUSTER_ID --feature kraft.version=1 --initial-controllers 1@kafka1.domain.local:19095,2@kafka2.domain.local:19095,3@kafka3.domain.local:19095 -c /etc/kafka/controller.properties --add-scram 'SCRAM-SHA-512=[name="admin",password="qaz123456"]'
other controllers$ /opt/kafka/bin/kafka-storage.sh format -t $CLUSTER_ID --feature kraft.version=1 --no-initial-controllers -c /etc/kafka/controller.properties --add-scram 'SCRAM-SHA-512=[name="admin",password="qaz123456"]'
controller config:
process.roles=controller
node.id=1
controller.quorum.bootstrap.servers=kafka1.domain.local:19095,kafka2.domain.local:19095,kafka3.domain.local:19095
listeners=CONTROLLER_SASL_SSL://kafka1.pietka.local:19095
controller.listener.names=CONTROLLER_SASL_SSL
listener.security.protocol.map=CONTROLLER_SASL_SSL:SASL_PLAINTEXT
authorizer.class.name=org.apache.kafka.metadata.authorizer.StandardAuthorizer
super.users=User:pietka.local;User:admin;User:ANONYMOUS
security.protocol=SASL_PLAINTEXT
#ssl
ssl.principal.mapping.rules=RULE:^.*CN=\\*\\.(.*?),OU=Corporation.*$/$1/L
ssl.client.auth=required
ssl.keystore.location=/etc/kafka/certs/client.keystore.jks
ssl.keystore.password=qaz123456
ssl.truststore.location=/etc/kafka/certs/server.truststore.jks
ssl.truststore.password=qaz123456
#sasl
sasl.mechanism=SCRAM-SHA-512
listener.name.controller_sasl_ssl.scram-sha-512.sasl.jaas.config=org.apache.kafka.common.security.scram.ScramLoginModule required username="admin" password="qaz123456" user_admin="qaz123456";
sasl.jaas.config=org.apache.kafka.common.security.scram.ScramLoginModule required username="admin" password="qaz123456" user_admin="qaz123456";
sasl.enabled.mechanisms=SCRAM-SHA-512
sasl.mechanism.controller.protocol=SCRAM-SHA-512
sasl.mechanism.inter.broker.protocol=SCRAM-SHA-512
log.dirs=/data/controller
num.network.threads=3
num.io.threads=8
socket.send.buffer.bytes=102400
socket.receive.buffer.bytes=102400
socket.request.max.bytes=104857600
num.partitions=2
num.recovery.threads.per.data.dir=2
offsets.topic.replication.factor=2
transaction.state.log.replication.factor=2
transaction.state.log.min.isr=1
log.retention.hours=2
log.segment.bytes=1073741824
log.retention.check.interval.ms=300000
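In Kafka, a listener name is only a label; the protocol actually used comes from `listener.security.protocol.map`. The sketch below parses that entry to show the effective protocol per listener. It is a hedged illustration only: `effective_protocols` is a hypothetical helper, and the sample text copies three lines from the config above.

```python
def effective_protocols(properties_text: str) -> dict:
    """Return {listener_name: security_protocol} from a .properties text."""
    props = {}
    for line in properties_text.splitlines():
        line = line.strip()
        # Skip blanks, comments, and lines without a key/value separator.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        props[key.strip()] = value.strip()

    mapping = {}
    for entry in props.get("listener.security.protocol.map", "").split(","):
        if ":" in entry:
            name, protocol = entry.split(":", 1)
            mapping[name.strip()] = protocol.strip()
    return mapping


sample = """
listeners=CONTROLLER_SASL_SSL://kafka1.pietka.local:19095
controller.listener.names=CONTROLLER_SASL_SSL
listener.security.protocol.map=CONTROLLER_SASL_SSL:SASL_PLAINTEXT
"""

print(effective_protocols(sample))  # {'CONTROLLER_SASL_SSL': 'SASL_PLAINTEXT'}
```

For the config as posted, the listener named CONTROLLER_SASL_SSL is mapped to SASL_PLAINTEXT, which suggests that the `ssl.*` settings are not in effect on that listener.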
logs:
[2025-05-18 11:48:54,895] INFO [RaftManager id=1] Failed authentication with kafka2.domain.local/10.10.21.2 (channelId=2) (Authentication failed during authentication due to invalid credentials with SASL mechanism SCRAM-SHA-512) (org.apache.kafka.common.network.Selector)
[2025-05-18 11:48:54,895] INFO [RaftManager id=1] Node 2 disconnected. (org.apache.kafka.clients.NetworkClient)
[2025-05-18 11:48:54,896] ERROR [RaftManager id=1] Connection to node 2 (kafka2.domain.local/10.10.21.2:19095) failed authentication due to: Authentication failed during authentication due to invalid credentials with SASL mechanism SCRAM-SHA-512 (org.apache.kafka.clients.NetworkClient)
[2025-05-18 11:48:54,896] ERROR [kafka-1-raft-outbound-request-thread]: Failed to send the following request due to authentication error: ClientRequest(expectResponse=true, callback=org.apache.kafka.raft.KafkaNetworkChannel$$Lambda/0x00007f4d3f3cd778@55782294, destination=2, correlationId=576, clientId=raft-client-1, createdTimeMs=1747561734586, requestBuilder=VoteRequestData(clusterId='R8gKVujrQ5qNagUU3rwdHw', voterId=2, topics=[TopicData(topicName='__cluster_metadata', partitions=[PartitionData(partitionIndex=0, replicaEpoch=5, replicaId=1, replicaDirectoryId=vE9YPPpdfucShPkUZgFA8g, voterDirectoryId=HzZ9HhGDsjlbWPgBXFd67g, lastOffsetEpoch=3, lastOffset=580, preVote=true)])])) (org.apache.kafka.raft.KafkaNetworkChannel$SendThread)
[2025-05-18 11:48:54,896] ERROR Request OutboundRequest(correlationId=574, data=VoteRequestData(clusterId='R8gKVujrQ5qNagUU3rwdHw', voterId=2, topics=[TopicData(topicName='__cluster_metadata', partitions=[PartitionData(partitionIndex=0, replicaEpoch=5, replicaId=1, replicaDirectoryId=vE9YPPpdfucShPkUZgFA8g, voterDirectoryId=HzZ9HhGDsjlbWPgBXFd67g, lastOffsetEpoch=3, lastOffset=580, preVote=true)])]), createdTimeMs=1747561734586, destination=kafka2.domain.local:19095 (id: 2 rack: null isFenced: false)) failed due to authentication error (org.apache.kafka.raft.KafkaNetworkChannel)
org.apache.kafka.common.errors.SaslAuthenticationException: Authentication failed during authentication due to invalid credentials with SASL mechanism SCRAM-SHA-512
[2025-05-18 11:48:54,898] ERROR [RaftManager id=1] Unexpected error NETWORK_EXCEPTION in VOTE response: InboundResponse(correlationId=574, data=VoteResponseData(errorCode=13, topics=[], nodeEndpoints=[]), source=kafka2.domain.local:19095 (id: 2 rack: null isFenced: false)) (org.apache.kafka.raft.KafkaRaftClient)
[2025-05-18 11:48:54,939] INFO [SocketServer listenerType=CONTROLLER, nodeId=1] Failed authentication with /10.10.21.2 (channelId=10.10.21.1:19095-10.10.21.2:51690-2-194) (Authentication failed during authentication due to invalid credentials with SASL mechanism SCRAM-SHA-512) (org.apache.kafka.common.network.Selector)
[2025-05-18 11:48:54,942] INFO [MetadataLoader id=1] initializeNewPublishers: the loader is still catching up because we still don't know the high water mark yet. (org.apache.kafka.image.loader.MetadataLoader)
[2025-05-18 11:48:55,043] INFO [MetadataLoader id=1] initializeNewPublishers: the loader is still catching up because we still don't know the high water mark yet. (org.apache.kafka.image.loader.MetadataLoader)
[2025-05-18 11:48:55,144] INFO [MetadataLoader id=1] initializeNewPublishers: the loader is still catching up because we still don't know the high water mark yet. (org.apache.kafka.image.loader.MetadataLoader)
[2025-05-18 11:48:55,247] INFO [MetadataLoader id=1] initializeNewPublishers: the loader is still catching up because we still don't know the high water mark yet. (org.apache.kafka.image.loader.MetadataLoader)
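To triage logs like these, it can help to count authentication failures per peer. The following is a rough sketch, not a Kafka tool: `failed_auth_by_peer` is a hypothetical helper, the timestamp pattern is assumed from the entries above, and the sample entries are abbreviated copies of three of them. Splitting on the timestamp lets it handle entries that were pasted onto a single line.

```python
import re
from collections import Counter

# Zero-width match in front of every "[YYYY-MM-DD HH:MM:SS,mmm]" timestamp.
ENTRY_START = re.compile(r"(?=\[\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}\])")
FAILED_AUTH = re.compile(r"Failed authentication with (\S+)")


def failed_auth_by_peer(log_text: str) -> Counter:
    """Count 'Failed authentication with <peer>' entries per peer."""
    counts = Counter()
    for entry in ENTRY_START.split(log_text):
        match = FAILED_AUTH.search(entry)
        if match:
            counts[match.group(1)] += 1
    return counts


# Abbreviated entries, deliberately joined on one line.
sample = (
    "[2025-05-18 11:48:54,895] INFO [RaftManager id=1] Failed authentication "
    "with kafka2.domain.local/10.10.21.2 (channelId=2) "
    "[2025-05-18 11:48:54,895] INFO [RaftManager id=1] Node 2 disconnected. "
    "[2025-05-18 11:48:54,939] INFO [SocketServer listenerType=CONTROLLER, "
    "nodeId=1] Failed authentication with /10.10.21.2"
)

for peer, count in failed_auth_by_peer(sample).items():
    print(peer, count)
```

In the excerpt above, one failure is logged by the RaftManager client for the outbound connection to node 2 and one by the SocketServer for an inbound connection from 10.10.21.2, so authentication appears to fail in both directions.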