8359820: Improve handshake/safepoint timeout diagnostic messages #26309

Conversation

toxaart
Contributor

@toxaart toxaart commented Jul 15, 2025

Hi, please consider the following changes:

The behavior described in the issue is not a problem in itself and is not unexpected, but it is somewhat difficult to find out what caused the SIGILL to be sent.

We propagate this information from handshake::handle_timeout() to VMError::report() with the help of a global variable. The same mechanism is used to address a similar issue in the safepoint timeout handler.

Tested in tiers 1-3.
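
A minimal sketch of the global-variable mechanism described above, written in standard C++ rather than HotSpot code. All names here (`g_handshake_timed_out_thread`, `record_handshake_timeout`, `sigill_origin`, and so on) are hypothetical and are not taken from the patch; the sketch only illustrates the idea that the timeout handler records the target before signalling it, and the error reporter compares the current thread against that record.

```cpp
#include <atomic>
#include <cstdint>
#include <string>

// Hypothetical global slots; 0 means "no timeout signal has been sent".
static std::atomic<std::intptr_t> g_handshake_timed_out_thread{0};
static std::atomic<std::intptr_t> g_safepoint_timed_out_thread{0};

// Assumed to be called by the timeout handler just before it signals the target.
void record_handshake_timeout(std::intptr_t target) {
  g_handshake_timed_out_thread.store(target);
}

void record_safepoint_timeout(std::intptr_t target) {
  g_safepoint_timed_out_thread.store(target);
}

// Assumed to be called by the error reporter on the thread that received SIGILL.
// Falls back to the generic text when the current thread was not a recorded target.
std::string sigill_origin(std::intptr_t current_thread) {
  if (current_thread != 0) {
    if (current_thread == g_handshake_timed_out_thread.load()) {
      return "sent by handshake timeout handler";
    }
    if (current_thread == g_safepoint_timed_out_thread.load()) {
      return "sent by safepoint timeout handler";
    }
  }
  return "sent by kill";
}
```

Because each slot holds a single value, this design can only describe the most recent target per timeout kind, which is the limitation discussed later in the review.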


Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue

Issue

  • JDK-8359820: Improve handshake/safepoint timeout diagnostic messages (Bug - P4)

Reviewers

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/26309/head:pull/26309
$ git checkout pull/26309

Update a local copy of the PR:
$ git checkout pull/26309
$ git pull https://git.openjdk.org/jdk.git pull/26309/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 26309

View PR using the GUI difftool:
$ git pr show -t 26309

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/26309.diff

Using Webrev

Link to Webrev Comment

@bridgekeeper

bridgekeeper bot commented Jul 15, 2025

👋 Welcome back toxaart! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk

openjdk bot commented Jul 15, 2025

@toxaart This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8359820: Improve handshake/safepoint timeout diagnostic messages

Reviewed-by: dholmes, stuefe

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 24 new commits pushed to the master branch:

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

As you do not have Committer status in this project an existing Committer must agree to sponsor your change. Possible candidates are the reviewers of this PR (@dholmes-ora, @tstuefe) but any other Committer may sponsor as well.

➡️ To flag this PR as ready for integration with the above commit message, type /integrate in a new comment. (Afterwards, your sponsor types /sponsor in a new comment to perform the integration).

@openjdk

openjdk bot commented Jul 15, 2025

@toxaart The following label will be automatically applied to this pull request:

  • hotspot-runtime

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

@toxaart toxaart marked this pull request as ready for review July 15, 2025 10:51
@openjdk openjdk bot added the rfr Pull request is ready for review label Jul 15, 2025
@mlbridge

mlbridge bot commented Jul 15, 2025

@dholmes-ora
Member

@toxaart I'm really looking for something in the fatal error handler so that instead of seeing just:

#
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGILL (0x4) at pc=0x00007d9dc5a98d71 (sent by kill), pid=329828, tid=329852
#
# JRE version: Java(TM) SE Runtime Environment (26.0+3) (fastdebug build 26-ea+3-153)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (fastdebug 26-ea+3-153, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, g1 gc, linux-amd64)
# Problematic frame:
# C [libc.so.6+0x98d71] 

There is something there that indicates it was a handshake timeout. E.g.

# SIGILL (0x4) at pc=0x00007d9dc5a98d71 (sent by handshake timeout handler), pid=329828, tid=329852

We may need the handshake code to set a flag on the target Thread that the error code can query if it sees a SIGILL.

@toxaart toxaart marked this pull request as draft July 16, 2025 09:49
@openjdk openjdk bot removed the rfr Pull request is ready for review label Jul 16, 2025
@toxaart toxaart closed this Jul 16, 2025
@toxaart toxaart force-pushed the JDK-8359820-SIGILL-with-low-handshake-timeout-on-intel-sde branch from a8179fc to 310ef85 Compare July 16, 2025 12:19
@toxaart toxaart reopened this Jul 16, 2025
@openjdk

openjdk bot commented Jul 17, 2025

⚠️ @toxaart This pull request contains merges that bring in commits not present in the target repository. Since this is not a "merge style" pull request, these changes will be squashed when this pull request is integrated. If this is your intention, then please ignore this message. If you want to preserve the commit structure, you must change the title of this pull request to Merge <project>:<branch> where <project> is the name of another project in the OpenJDK organization (for example Merge jdk:master).

@toxaart toxaart changed the title 8359820: SIGILL with low -XX:HandshakeTimeout 8359820: Improve handshake/safepoint timeout diagnostic messages Jul 17, 2025
@toxaart toxaart marked this pull request as ready for review July 17, 2025 14:04
@openjdk openjdk bot added the rfr Pull request is ready for review label Jul 17, 2025
@tstuefe
Member

tstuefe commented Jul 18, 2025

BTW, for artificially generated signals we already have a clear indication in hs_err files. We print the siginfo structure associated with the signal.

e.g.

 siginfo: si_signo: 4 (SIGILL), si_code: 0 (SI_USER), si_pid: 13281, si_uid: 1027

SI_USER => sent via kill command or pthread_kill
si_pid = sending process or thread id
si_uid = sending user (in case of outside process)

See also: https://pubs.opengroup.org/onlinepubs/007904875/functions/sigaction.html

I have nothing against making this clearer, just saying that the info is already kind of there.
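
As a rough illustration of how the `siginfo` fields listed above can be read, here is a small POSIX C++ sketch. The function name `describe_sender` and its output format are assumptions made for this example and do not reproduce the hs_err printing code; only `siginfo_t`, `si_code`, `si_pid`, `si_uid`, and `SI_USER` come from POSIX.

```cpp
#include <signal.h>
#include <string>

// Hypothetical helper: summarize who sent a signal, based on its siginfo_t.
// Only SI_USER is treated specially here; other si_code values are reported
// numerically, because their meaning depends on the signal and the platform.
std::string describe_sender(const siginfo_t& si) {
  if (si.si_code == SI_USER) {
    // For SI_USER, POSIX defines si_pid and si_uid as the sender's
    // process ID and real user ID.
    return "SI_USER: sent by pid " + std::to_string(si.si_pid) +
           ", uid " + std::to_string(si.si_uid);
  }
  return "si_code " + std::to_string(si.si_code) + ": not SI_USER";
}
```

In a real handler the `siginfo_t` would come from a handler installed with `sigaction` and the `SA_SIGINFO` flag; the helper is kept as a pure function here so it can be exercised without raising a signal.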

@tstuefe
Member

tstuefe commented Jul 18, 2025

Not sure what is so hard to understand here @tstuefe . A thread is hit with a SIGILL and we report that now, but we don't report why it was hit with the SIGILL. If there were only one reason (like it executed an illegal instruction) then it would be obvious, but we have hijacked SIGILL as a generic "something happened" signal. So the proposal here is to record the identity of the thread being sent a SIGILL due to a handshake or safepoint timeout, so that when that thread responds to the SIGILL it can see that is why it got it and report that fact. If a different thread also got a SIGILL for a different reason we don't want it reporting it was due to the timeout mechanism.

Thank you, @dholmes-ora . I already answered Anton, but I get that now.

@toxaart
Contributor Author

toxaart commented Jul 18, 2025

Was (A) the winner, and it started error handling? And you maybe saw a "Thread XXX also had an error" line from the sending thread?

Yes, the slow thread started reporting, and I think I also observed the latter message. Note that the fatal error is still processed at the end of the timeout handler, but it is not reported by VMError, as it can report only one such error.

So yes, we want to improve the reporting for case A: when a slow thread receives a SIGILL and dies being able to handle the error, we want to know if SIGILL came from handshake/safepoint timeout and print extra info if that is the case.

To help with this case, I suggest a simple addition in handshake.cpp:

Thanks, added to the latest change.

Yes, it could happen. The mechanism could be improved by storing the fact that a SIGILL has been sent to thread X not in a global variable but in the Thread structure of X. Then, in VMError, one checks whether the current thread has been the target of a recent pthread_kill, and only writes "sent by xxx" in that case. I ignore here the possible case of multiple senders and one receiver, because I think that is extremely unlikely.

I think this would be a more invasive change, we can do it when there is a real need.
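
For comparison, the per-thread alternative discussed above could look roughly like the following sketch. It was not adopted in this PR, and every name in it (`ThreadSketch`, `SigillReason`, `mark_before_signal`, `sender_annotation`) is hypothetical; it only shows the idea of keeping the marker in the target's own structure so that an unrelated thread receiving SIGILL would not pick it up.

```cpp
#include <atomic>
#include <string>

// Hypothetical reasons a thread may have been sent SIGILL by the VM itself.
enum class SigillReason { None, HandshakeTimeout, SafepointTimeout };

// Stand-in for a per-thread structure; not the HotSpot Thread class.
struct ThreadSketch {
  std::atomic<SigillReason> pending_sigill_reason{SigillReason::None};
};

// The sender marks the target before signalling it.
void mark_before_signal(ThreadSketch& target, SigillReason reason) {
  target.pending_sigill_reason.store(reason);
}

// The receiving thread consumes its own marker when reporting the error.
// Returns an empty string when this thread was not a recorded target.
std::string sender_annotation(ThreadSketch& current) {
  switch (current.pending_sigill_reason.exchange(SigillReason::None)) {
    case SigillReason::HandshakeTimeout:
      return "sent by handshake timeout handler";
    case SigillReason::SafepointTimeout:
      return "sent by safepoint timeout handler";
    case SigillReason::None:
      break;
  }
  return "";
}
```

The `exchange` call clears the marker as it is read, so a later unrelated SIGILL on the same thread would not be attributed to the timeout. As noted in the thread, the case of multiple senders targeting one receiver is not handled.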

Member

@tstuefe tstuefe left a comment

Ok thanks.

@openjdk openjdk bot added the ready Pull request is ready to be integrated label Jul 18, 2025
Member

@dholmes-ora dholmes-ora left a comment

The structure of this looks good, but I have a few remaining nits. Thanks.

@openjdk openjdk bot removed the ready Pull request is ready to be integrated label Jul 28, 2025
@toxaart toxaart requested a review from dholmes-ora July 28, 2025 10:06
Member

@dholmes-ora dholmes-ora left a comment

LGTM!

Thanks

@openjdk openjdk bot added the ready Pull request is ready to be integrated label Aug 5, 2025
@toxaart
Contributor Author

toxaart commented Aug 5, 2025

/integrate

@openjdk openjdk bot added the sponsor Pull request is ready to be sponsored label Aug 5, 2025
@openjdk

openjdk bot commented Aug 5, 2025

@toxaart
Your change (at version d85769a) is now ready to be sponsored by a Committer.

@dholmes-ora
Member

/sponsor

@openjdk

openjdk bot commented Aug 6, 2025

Going to push as commit 6656e76.
Since your change was applied there have been 32 commits pushed to the master branch:

Your commit was automatically rebased without conflicts.

@openjdk openjdk bot added the integrated Pull request has been integrated label Aug 6, 2025
@openjdk openjdk bot closed this Aug 6, 2025
@openjdk openjdk bot removed ready Pull request is ready to be integrated rfr Pull request is ready for review sponsor Pull request is ready to be sponsored labels Aug 6, 2025
@openjdk

openjdk bot commented Aug 6, 2025

@dholmes-ora @toxaart Pushed as commit 6656e76.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

Member

@shipilev shipilev left a comment

A few post-integration notes, maybe do a little follow-up cleanup?

Labels
hotspot-runtime, integrated (Pull request has been integrated)
4 participants