Contributor

@yyj8 yyj8 commented Nov 24, 2024

Fixes #23635

Main Issue: #xyz

PIP: #xyz

Motivation

In some scenarios, a deadlocked broker needs to recover automatically instead of requiring manual intervention. For example, when the service is deployed in a customer environment, we cannot manage it directly. If the service deadlocks, the k8s probe should return a failure because the service may be unavailable. The probe failure then triggers a restart of the broker pod, which resolves the deadlock.

Modifications

Add deadlock detection to the probe. If a deadlock is detected, print the thread stacks and return a service unavailable exception.
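A minimal sketch of the idea, assuming a standalone helper rather than the class actually changed in this PR: `DeadlockProbe`, its method names, and the 200/503 return values are hypothetical and chosen only for illustration. It relies on the standard `ThreadMXBean.findDeadlockedThreads()` call, which returns `null` when no threads are deadlocked.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.Arrays;
import java.util.Objects;
import java.util.stream.Collectors;

// Hypothetical helper for illustration only; not the class modified by this PR.
final class DeadlockProbe {
    static final int OK = 200;
    static final int SERVICE_UNAVAILABLE = 503;

    private DeadlockProbe() {
    }

    // Returns the IDs of deadlocked threads, or an empty array when there are none.
    static long[] findDeadlockedThreadIds() {
        ThreadMXBean threadBean = ManagementFactory.getThreadMXBean();
        long[] ids = threadBean.findDeadlockedThreads(); // null when no deadlock exists
        return ids == null ? new long[0] : ids;
    }

    // Builds a short "name(tid=N)" summary for the given thread IDs.
    static String describeThreads(long[] threadIds) {
        ThreadInfo[] infos = ManagementFactory.getThreadMXBean()
                .getThreadInfo(threadIds, false, false);
        return Arrays.stream(infos)
                .filter(Objects::nonNull) // an entry is null if the thread no longer exists
                .map(info -> info.getThreadName() + "(tid=" + info.getThreadId() + ")")
                .collect(Collectors.joining(", "));
    }

    // Status a probe endpoint could report: 200 when healthy, 503 when a deadlock exists.
    static int probeStatus() {
        long[] ids = findDeadlockedThreadIds();
        if (ids.length == 0) {
            return OK;
        }
        System.err.println("Deadlocked threads detected. " + describeThreads(ids));
        return SERVICE_UNAVAILABLE;
    }
}
```

A probe handler could call `DeadlockProbe.probeStatus()` and map a non-200 result to its own error response; how the real endpoint reports the failure is defined by the PR's diff, not by this sketch.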

Verifying this change

  • Make sure that the change passes the CI checks.

(Please pick either of the following options)

This change is a trivial rework / code cleanup without any test coverage.

(or)

This change is already covered by existing tests, such as (please describe tests).

(or)

This change added tests and can be verified as follows:

(example:)

  • Added integration tests for end-to-end deployment with large payloads (10MB)
  • Extended integration test for recovery after broker failure

Does this pull request potentially affect one of the following parts:

If the box was checked, please highlight the changes

  • Dependencies (add or upgrade a dependency)
  • The public API
  • The schema
  • The default values of configurations
  • The threading model
  • The binary protocol
  • The REST endpoints
  • The admin CLI options
  • The metrics
  • Anything that affects deployment

Documentation

  • doc
  • doc-required
  • doc-not-needed
  • doc-complete

Matching PR in forked repository

PR in forked repository:
yyj8#10

…return a failure because the service may be unavailable

@yyj8 Please add the following content to your PR description and select a checkbox:

- [ ] `doc` <!-- Your PR contains doc changes -->
- [ ] `doc-required` <!-- Your PR changes impact docs and you will update later -->
- [ ] `doc-not-needed` <!-- Your PR changes do not impact docs -->
- [ ] `doc-complete` <!-- Docs have been already added -->

@github-actions github-actions bot added doc-not-needed Your PR changes do not impact docs and removed doc-label-missing labels Nov 24, 2024
Member

@lhotari lhotari left a comment

There's already a deadlock check in the health check:

@GET
@Path("/health")
@ApiOperation(value = "Run a healthCheck against the broker")
@ApiResponses(value = {
        @ApiResponse(code = 200, message = "Everything is OK"),
        @ApiResponse(code = 403, message = "Don't have admin permission"),
        @ApiResponse(code = 404, message = "Cluster doesn't exist"),
        @ApiResponse(code = 500, message = "Internal server error")})
public void healthCheck(@Suspended AsyncResponse asyncResponse,
                        @ApiParam(value = "Topic Version")
                        @QueryParam("topicVersion") TopicVersion topicVersion) {
    validateSuperUserAccessAsync()
            .thenAccept(__ -> checkDeadlockedThreads())
            .thenCompose(__ -> internalRunHealthCheck(topicVersion))
            .thenAccept(__ -> {
                LOG.info("[{}] Successfully run health check.", clientAppId());
                asyncResponse.resume(Response.ok("ok").build());
            }).exceptionally(ex -> {
                LOG.error("[{}] Fail to run health check.", clientAppId(), ex);
                resumeAsyncResponseExceptionally(asyncResponse, ex);
                return null;
            });
}

private void checkDeadlockedThreads() {
    ThreadMXBean threadBean = ManagementFactory.getThreadMXBean();
    long[] threadIds = threadBean.findDeadlockedThreads();
    if (threadIds != null && threadIds.length > 0) {
        ThreadInfo[] threadInfos = threadBean.getThreadInfo(threadIds, false, false);
        String threadNames = Arrays.stream(threadInfos)
                .map(threadInfo -> threadInfo.getThreadName() + "(tid=" + threadInfo.getThreadId() + ")")
                .collect(Collectors.joining(", "));
        if (System.currentTimeMillis() - threadDumpLoggedTimestamp
                > LOG_THREADDUMP_INTERVAL_WHEN_DEADLOCK_DETECTED) {
            threadDumpLoggedTimestamp = System.currentTimeMillis();
            LOG.error("Deadlocked threads detected. {}\n{}", threadNames,
                    ThreadDumpUtil.buildThreadDiagnosticString());
        } else {
            LOG.error("Deadlocked threads detected. {}", threadNames);
        }
        throw new IllegalStateException("Deadlocked threads detected. " + threadNames);
    }
}

It also contains an example of how to check for deadlocks.
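The health check above logs a full thread dump at most once per interval and falls back to a short message otherwise. That throttling can be sketched in isolation as follows. `ThreadDumpThrottle` and its injectable clock are hypothetical names introduced here for illustration, and the use of `AtomicLong` is an assumption of this sketch rather than what the Pulsar code does.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;

// Hypothetical helper for illustration only; not part of Pulsar.
final class ThreadDumpThrottle {
    private static final long NEVER = Long.MIN_VALUE; // sentinel: no dump logged yet

    private final long intervalMillis;
    private final LongSupplier clock; // injectable so tests can control time
    private final AtomicLong lastDumpMillis = new AtomicLong(NEVER);

    ThreadDumpThrottle(long intervalMillis, LongSupplier clock) {
        this.intervalMillis = intervalMillis;
        this.clock = clock;
    }

    // Returns true when a full thread dump should be logged now, i.e. on the
    // first call or when more than intervalMillis has passed since the last dump.
    boolean shouldLogFullDump() {
        long now = clock.getAsLong();
        long last = lastDumpMillis.get();
        boolean due = last == NEVER || now - last > intervalMillis;
        // compareAndSet lets only one of several concurrent callers claim the dump.
        return due && lastDumpMillis.compareAndSet(last, now);
    }
}
```

In production code the clock would typically be `System::currentTimeMillis`; a caller logs the full dump when `shouldLogFullDump()` returns true and only the thread names otherwise.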

…return a failure because the service may be unavailable
@yyj8 yyj8 requested a review from lhotari November 25, 2024 14:38
@lhotari lhotari changed the title from "[fix][broker]If there is a deadlock in the service, the probe should return a failure because the service may be unavailable" to "[improvement][broker] If there is a deadlock in the service, the probe should return a failure because the service may be unavailable" Nov 26, 2024
yyj8 added 2 commits November 26, 2024 23:45
…e should return a failure because the service may be unavailable
…e should return a failure because the service may be unavailable
Member

@lhotari lhotari left a comment

Good work @yyj8. Some suggestions for field naming and simplifying the code comment.

…e should return a failure because the service may be unavailable
…e should return a failure because the service may be unavailable.
@yyj8 yyj8 requested a review from lhotari November 27, 2024 07:41
…e should return a failure because the service may be unavailable.
Member

lhotari commented Nov 27, 2024

@yyj8 btw, when you add commits to the PR, it's useful to make each commit title describe the change instead of copying the PR title into the follow-up commits. When the PR is merged, all commits are squashed, so the individual commit titles won't end up in the final merged commit. The benefit of descriptive commit messages in the PR commits is that the reviewer can follow the changes.

@lhotari lhotari added this to the 4.1.0 milestone Nov 29, 2024
yyj8 added 2 commits December 4, 2024 21:42
…e should return a failure because the service may be unavailable. Add lastPrintThreadDumpTimestamp field to control the interval time for printing complete thread stack information.
…e should return a failure because the service may be unavailable. Add unit testing code.
@coderzc coderzc modified the milestones: 4.1.0, 4.2.0 Sep 1, 2025
Member

@lhotari lhotari left a comment

LGTM

@codecov-commenter

Codecov Report

❌ Patch coverage is 72.34043% with 13 lines in your changes missing coverage. Please review.
✅ Project coverage is 74.27%. Comparing base (ccbd245) to head (49af659).
⚠️ Report is 2 commits behind head on master.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| .../apache/pulsar/common/configuration/VipStatus.java | 72.34% | 8 Missing and 5 partials ⚠️ |
Additional details and impacted files

@@              Coverage Diff              @@
##             master   #23634       +/-   ##
=============================================
+ Coverage     38.30%   74.27%   +35.97%     
- Complexity      100    33281    +33181     
=============================================
  Files          1844     1901       +57     
  Lines        144273   148403     +4130     
  Branches      16726    17204      +478     
=============================================
+ Hits          55262   110227    +54965     
+ Misses        81479    29400    -52079     
- Partials       7532     8776     +1244     
| Flag | Coverage Δ | |
|---|---|---|
| inttests | 26.47% <0.00%> (-0.01%) | ⬇️ |
| systests | 22.74% <0.00%> (-0.10%) | ⬇️ |
| unittests | 73.79% <72.34%> (+39.47%) | ⬆️ |

Flags with carried forward coverage won't be shown.

| Files with missing lines | Coverage Δ | |
|---|---|---|
| .../apache/pulsar/common/configuration/VipStatus.java | 72.91% <72.34%> (+72.91%) | ⬆️ |

... and 1407 files with indirect coverage changes

@lhotari lhotari merged commit d833b8b into apache:master Sep 19, 2025
96 of 99 checks passed
lhotari added a commit that referenced this pull request Sep 22, 2025
…ould return a failure because the service may be unavailable (#23634)

Co-authored-by: Lari Hotari <[email protected]>
Co-authored-by: Lari Hotari <[email protected]>
(cherry picked from commit d833b8b)
KannarFr pushed a commit to CleverCloud/pulsar that referenced this pull request Sep 22, 2025
…ould return a failure because the service may be unavailable (apache#23634)

Co-authored-by: Lari Hotari <[email protected]>
Co-authored-by: Lari Hotari <[email protected]>
lhotari added a commit that referenced this pull request Sep 23, 2025
…ould return a failure because the service may be unavailable (#23634)

Co-authored-by: Lari Hotari <[email protected]>
Co-authored-by: Lari Hotari <[email protected]>
(cherry picked from commit d833b8b)
lhotari added a commit that referenced this pull request Sep 23, 2025
…ould return a failure because the service may be unavailable (#23634)

Co-authored-by: Lari Hotari <[email protected]>
Co-authored-by: Lari Hotari <[email protected]>
(cherry picked from commit d833b8b)
lhotari added a commit that referenced this pull request Sep 23, 2025
…ould return a failure because the service may be unavailable (#23634)

Co-authored-by: Lari Hotari <[email protected]>
Co-authored-by: Lari Hotari <[email protected]>
(cherry picked from commit d833b8b)
manas-ctds pushed a commit to datastax/pulsar that referenced this pull request Sep 26, 2025
…ould return a failure because the service may be unavailable (apache#23634)

Co-authored-by: Lari Hotari <[email protected]>
Co-authored-by: Lari Hotari <[email protected]>
(cherry picked from commit d833b8b)
(cherry picked from commit cb223f7)
srinath-ctds pushed a commit to datastax/pulsar that referenced this pull request Sep 26, 2025
…ould return a failure because the service may be unavailable (apache#23634)

Co-authored-by: Lari Hotari <[email protected]>
Co-authored-by: Lari Hotari <[email protected]>
(cherry picked from commit d833b8b)
(cherry picked from commit cb223f7)
manas-ctds pushed a commit to datastax/pulsar that referenced this pull request Sep 29, 2025
…ould return a failure because the service may be unavailable (apache#23634)

Co-authored-by: Lari Hotari <[email protected]>
Co-authored-by: Lari Hotari <[email protected]>
(cherry picked from commit d833b8b)
(cherry picked from commit e199b24)
srinath-ctds pushed a commit to datastax/pulsar that referenced this pull request Sep 29, 2025
…ould return a failure because the service may be unavailable (apache#23634)

Co-authored-by: Lari Hotari <[email protected]>
Co-authored-by: Lari Hotari <[email protected]>
(cherry picked from commit d833b8b)
(cherry picked from commit e199b24)
walkinggo pushed a commit to walkinggo/pulsar that referenced this pull request Oct 8, 2025
…ould return a failure because the service may be unavailable (apache#23634)

Co-authored-by: Lari Hotari <[email protected]>
Co-authored-by: Lari Hotari <[email protected]>