KAFKA-20312: Handle null leader during OffsetFetcher regroup safely#21760
KAFKA-20312: Handle null leader during OffsetFetcher regroup safely#21760nileshkumar3 wants to merge 4 commits intoapache:trunkfrom
Conversation
|
A label of 'needs-attention' was automatically added to this PR in order to raise the |
frankvicky
left a comment
There was a problem hiding this comment.
@nileshkumar3 Thanks for the PR.
It seems the failed test is due to OOM. PTAL
1f47ccc to
440f4cf
Compare
Thanks @frankvicky for reviewing.I traced the failure to a tight poll loop in that test scenario and pushed a small follow-up fix (no extra backoff on that path, metadata refresh only). Should be in better shape for another look whenever you can. |
frankvicky
left a comment
There was a problem hiding this comment.
@nileshkumar3
Thanks for the update. Overall LGTM
…als/OffsetFetcherTest.java review comment Co-authored-by: TengYao Chi <kitingiao@gmail.com>
Description:
This PR fixes a potential NullPointerException in
OffsetFetcherUtils.regroupPartitionMapByNode when regrouping partitions
by leader during offset reset / list-offsets.
Background
Partitions are grouped by leader via metadata.fetch().leaderFor(tp). If
metadata changes between the initial leader lookup and the regroup step
(e.g. leadership change or stale metadata), leaderFor(tp) can return
null. The previous implementation used Collectors.groupingBy(...,
leaderFor(...)), which throws an NPE when the classifier returns null.
Fix
OffsetFetcherUtils.regroupPartitionMapByNode Replaced the stream-based
grouping with a loop that skips partitions whose leader is null, adds
them to a caller-provided partitionsToRetry set, and does not trigger
metadata refresh (callers are responsible for retry and metadata).
Callers
OffsetFetcher (classic consumer): passes partitionsToRetry into the
helper; in resetPositionsAsync, when the set is non-empty, calls
setNextAllowedRetry(partitionsToRetry, now + retryBackoffMs) and
metadata.requestUpdate(false). OffsetsRequestManager (new consumer):
passes a local retry set into the helper, then adds skipped partitions
to state.remainingToSearch (with timestamp) and calls
metadata.requestUpdate(false) when the set is non-empty. This keeps
existing retry semantics and avoids the NPE.
Tests
OffsetFetcherTest.testResetPositionsMetadataRefreshWhenLeaderBecomesUnknownDuringRegroup
Simulates leaderFor(tp) returning null during regroup (first
metadata.fetch() stubbed to a cluster with no partition, then real
method). Asserts no exception, partition stays pending reset, and after
backoff and a second attempt with valid metadata the offset reset
succeeds.
OffsetsRequestManagerTest.testFetchOffsetsRegroupSkipsNullLeaderPartition_NoNPE
Simulates the same scenario in the fetch-offsets path: currentLeader has
a leader but metadata.fetch() returns a cluster where one partition has
no leader. Asserts no NPE, one request sent (for the partition with a
leader), and that the skipped partition is retried after metadata update
and completes successfully.