Skip to content

KAFKA-20312: Handle null leader during OffsetFetcher regroup safely#21760

Open
nileshkumar3 wants to merge 4 commits intoapache:trunkfrom
nileshkumar3:KAFKA-20312-fix-offsetfetcher-null-leader-regroup
Open

KAFKA-20312: Handle null leader during OffsetFetcher regroup safely#21760
nileshkumar3 wants to merge 4 commits intoapache:trunkfrom
nileshkumar3:KAFKA-20312-fix-offsetfetcher-null-leader-regroup

Conversation

@nileshkumar3
Copy link
Copy Markdown
Contributor

@nileshkumar3 nileshkumar3 commented Mar 15, 2026

Description:

This PR fixes a potential NullPointerException in
OffsetFetcherUtils.regroupPartitionMapByNode when regrouping partitions
by leader during offset reset / list-offsets.

Background

Partitions are grouped by leader via metadata.fetch().leaderFor(tp). If
metadata changes between the initial leader lookup and the regroup step
(e.g. leadership change or stale metadata), leaderFor(tp) can return
null. The previous implementation used Collectors.groupingBy(...,
leaderFor(...)), which throws an NPE when the classifier returns null.

Fix

OffsetFetcherUtils.regroupPartitionMapByNode Replaced the stream-based
grouping with a loop that skips partitions whose leader is null, adds
them to a caller-provided partitionsToRetry set, and does not trigger
metadata refresh (callers are responsible for retry and metadata).

Callers

OffsetFetcher (classic consumer): passes partitionsToRetry into the
helper; in resetPositionsAsync, when the set is non-empty, calls
setNextAllowedRetry(partitionsToRetry, now + retryBackoffMs) and
metadata.requestUpdate(false). OffsetsRequestManager (new consumer):
passes a local retry set into the helper, then adds skipped partitions
to state.remainingToSearch (with timestamp) and calls
metadata.requestUpdate(false) when the set is non-empty. This keeps
existing retry semantics and avoids the NPE.

Tests

OffsetFetcherTest.testResetPositionsMetadataRefreshWhenLeaderBecomesUnknownDuringRegroup
Simulates leaderFor(tp) returning null during regroup (first
metadata.fetch() stubbed to a cluster with no partition, then real
method). Asserts no exception, partition stays pending reset, and after
backoff and a second attempt with valid metadata the offset reset
succeeds.

OffsetsRequestManagerTest.testFetchOffsetsRegroupSkipsNullLeaderPartition_NoNPE
Simulates the same scenario in the fetch-offsets path: currentLeader has
a leader but metadata.fetch() returns a cluster where one partition has
no leader. Asserts no NPE, one request sent (for the partition with a
leader), and that the skipped partition is retried after metadata update
and completes successfully.

@github-actions github-actions Bot added triage PRs from the community consumer clients labels Mar 15, 2026
@github-actions
Copy link
Copy Markdown

A label of 'needs-attention' was automatically added to this PR in order to raise the
attention of the committers. Once this issue has been triaged, the triage label
should be removed to prevent this automation from happening again.

Copy link
Copy Markdown
Contributor

@frankvicky frankvicky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nileshkumar3 Thanks for the PR.
It seems the failed test is due to OOM. PTAL

@github-actions github-actions Bot added streams core Kafka Broker producer tools connect performance kraft mirror-maker-2 dependencies Pull requests that update a dependency file storage Pull requests that target the storage module tiered-storage Related to the Tiered Storage feature KIP-932 Queues for Kafka build Gradle build or GitHub Actions docker Official Docker image generator RPC and Record code generator transactions Transactions and EOS group-coordinator and removed needs-attention triage PRs from the community labels Apr 15, 2026
@nileshkumar3 nileshkumar3 force-pushed the KAFKA-20312-fix-offsetfetcher-null-leader-regroup branch from 1f47ccc to 440f4cf Compare April 15, 2026 05:00
@nileshkumar3
Copy link
Copy Markdown
Contributor Author

@nileshkumar3 Thanks for the PR. It seems the failed test is due to OOM. PTAL

Thanks @frankvicky for reviewing.I traced the failure to a tight poll loop in that test scenario and pushed a small follow-up fix (no extra backoff on that path, metadata refresh only). Should be in better shape for another look whenever you can.

Copy link
Copy Markdown
Contributor

@frankvicky frankvicky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nileshkumar3
Thanks for the update. Overall LGTM

nileshkumar3 and others added 3 commits April 15, 2026 17:49
…als/OffsetFetcherTest.java


review comment

Co-authored-by: TengYao Chi <kitingiao@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

build Gradle build or GitHub Actions ci-approved clients connect consumer core Kafka Broker dependencies Pull requests that update a dependency file docker Official Docker image generator RPC and Record code generator group-coordinator KIP-932 Queues for Kafka kraft mirror-maker-2 performance producer storage Pull requests that target the storage module streams tiered-storage Related to the Tiered Storage feature tools transactions Transactions and EOS

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants