
[GR-67169] Support JFR emergency dumps on out of memory #11530


Open
wants to merge 16 commits into master
Conversation

@roberttoyonaga (Collaborator) commented Jul 2, 2025

Related issue: #10600

Summary

One of JFR's primary goals is to provide insight in the event of a crash such as an OutOfMemoryError (OOME). For example, JFR's CPU and allocation profiling can help locate problem areas, and its garbage collection events and thread data can help diagnose the underlying cause.

Currently, it's possible to get heap dumps on out of memory (OOM), but the equivalent does not yet exist for JFR. OpenJDK has this feature, and we should implement it in Native Image too.

Goals

  • Add support for emergency dumps on OOME.
  • Add support for the jdk.DumpReason event.
  • Rework the existing JFR code to make flushing completely allocation free.

Non-Goals

  • Add support for jdk.Shutdown events (will be done in a follow-up PR).
  • Perform JFR emergency dumps on VM crashes; the focus is on OOME for now.
  • Change existing JFR infrastructure, beyond making more of it allocation free.

Details

This PR can be broken into two main parts: (1) making JFR flushing allocation free and (2) creating the emergency dump file.

(1) Making JFR flushing allocation free

Many small changes had to be made to make the JFR flushing procedure allocation free:

  • JfrChunkFileWriter#writeString was adapted to use native memory.
  • JfrSerializer classes pre-initialize a small amount of data while in hosted mode.
  • Enhanced for loops of the form for (Object name : names) were rewritten as indexed for (int i = 0; i < names.length; i++) loops.
  • Some visitor patterns and lambdas were replaced.
  • The SecondsNanos class was made into a RawStructure so it can be allocated on the stack (see the sketch below).
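
For context, a RawStructure is a Java interface that Native Image maps onto a C-style struct, which StackValue can then allocate in the current stack frame. A minimal sketch of what the SecondsNanos struct could look like (the accessor names are illustrative, not necessarily the PR's exact fields):

```java
import org.graalvm.nativeimage.StackValue;
import org.graalvm.nativeimage.c.struct.RawField;
import org.graalvm.nativeimage.c.struct.RawStructure;
import org.graalvm.word.PointerBase;

// Illustrative shape of a stack-allocatable SecondsNanos struct.
@RawStructure
interface SecondsNanos extends PointerBase {
    @RawField
    long getSeconds();
    @RawField
    void setSeconds(long seconds);

    @RawField
    long getNanos();
    @RawField
    void setNanos(long nanos);
}

// Usage: no Java heap allocation; the struct lives in the current stack frame.
// SecondsNanos time = StackValue.get(SecondsNanos.class);
// time.setSeconds(seconds);
// time.setNanos(nanos);
```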

Larger changes to the JfrTypeRepository and JfrSymbolRepository were also required. The general procedure used by the JfrTypeRepository remains the same, but we cannot use the Package, Module, and ClassLoader classes directly because their methods may allocate and they are not pinned objects referenceable from AbstractUninterruptibleHashtable. To work around this, I've made JfrTypeInfo RawStructures corresponding to each of these Java classes (PackageInfoRaw, ModuleInfoRaw, etc.). Some type data, such as package names, must be computed manually to avoid allocation (see setPackageNameAndLength).

In some cases, symbols must be serialized to native memory buffers earlier (in JfrTypeRepository instead of JfrSymbolRepository) to avoid allocating new Java Strings. The JfrSymbolRepository has been modified accordingly to cache pointers to serialized data rather than String objects, and the regular Java hash maps have been replaced with new implementations of AbstractUninterruptibleHashtable.
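
As a rough illustration (the field layout below is hypothetical; the PR's actual PackageInfoRaw may differ), such a raw type-info struct might carry a pointer to the serialized name plus its length and a constant-pool id:

```java
import org.graalvm.nativeimage.c.struct.RawField;
import org.graalvm.nativeimage.c.struct.RawStructure;
import org.graalvm.nativeimage.c.type.CCharPointer;
import org.graalvm.word.PointerBase;

// Hypothetical shape of a raw package descriptor, standing in for the
// PR's PackageInfoRaw. All fields live in native memory, so entries can be
// referenced from AbstractUninterruptibleHashtable without Java allocation.
@RawStructure
interface PackageInfoRaw extends PointerBase {
    @RawField
    CCharPointer getName();          // package name bytes in native memory
    @RawField
    void setName(CCharPointer name);

    @RawField
    int getNameLength();
    @RawField
    void setNameLength(int length);

    @RawField
    long getId();                    // constant-pool id used by JFR
    @RawField
    void setId(long id);
}
```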

One large obstacle was that JfrTypeRepository#collectTypeInfo originally needed to walk the image heap and allocate a list of loaded classes, a process that is not easy to make allocation free. I experimented with pre-allocating the loaded-class list at start-up, but found that this hurt startup time. My solution was to make the JfrTypeRepository behave more like the other JFR repositories in SubstrateVM by maintaining previous/current epoch data. Specifically, during event emission, JfrTypeRepository#getClassId now caches the class constant data used by events, and the types used by JFR are stored in previous/current epoch hash tables. This uses somewhat more memory than the old approach, but it avoids allocation and is consistent with the other JFR repositories in SubstrateVM. Because it is lazy, it also avoids the start-up penalty of pre-allocation.
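
The previous/current epoch pattern itself is simple. The sketch below shows only the rotation logic, using Java collections for brevity; the real repository keeps its entries in AbstractUninterruptibleHashtable instances backed by native memory, and the method names here are illustrative:

```java
import java.util.HashSet;
import java.util.Set;
import java.util.function.Consumer;

// Illustrative sketch of the previous/current epoch pattern only.
final class EpochTables<T> {
    private Set<T> current = new HashSet<>();
    private Set<T> previous = new HashSet<>();

    /** Tag a value lazily as events reference it (cf. getClassId). */
    void tag(T value) {
        current.add(value);
    }

    /** At a chunk rotation: swap epochs and serialize what the ended epoch tagged. */
    void rotate(Consumer<Set<T>> serializeConstantPool) {
        Set<T> ended = current;
        current = previous;           // reuse the emptied table for the new epoch
        previous = ended;
        serializeConstantPool.accept(previous);
        previous.clear();
    }
}
```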

A small bug in JfrTypeRepository was also fixed: the bootstrap class loader was not being serialized to chunks. Hotspot gives this class loader the reserved ID 0 and serializes it if it was tagged during the epoch.

(2) Creating the emergency dump file

New classes implement this support: JfrEmergencyDumpSupport, JfrEmergencyDumpFeature, and PosixJfrEmergencyDumpSupport. I have tried to keep the components and logic as similar as possible to the Hotspot class JfrEmergencyDump in jfrEmergencyDump.cpp.

After the emergency dump flush has completed, the JFR disk repository directory is scanned. The chunk file names are gathered and sorted, which also implicitly orders them chronologically. Each chunk file in the sorted list is then copied into the emergency dump snapshot.

Much of the work in PosixJfrEmergencyDumpSupport involves creating and handling filenames as C strings. As in Hotspot JFR, a pre-allocated native memory path buffer is used as scratch space for constructing filenames and paths.
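
To illustrate the idea (the helper below is hypothetical, not the PR's actual API), appending a path component into a pre-allocated native buffer amounts to copying bytes at an offset:

```java
import org.graalvm.nativeimage.c.type.CCharPointer;

// Hypothetical helper: append the ASCII bytes of 's' into a pre-allocated
// native 'buffer' at 'offset' and return the new offset. Nothing is
// allocated on the Java heap.
static int append(CCharPointer buffer, int offset, String s) {
    for (int i = 0; i < s.length(); i++) {
        buffer.write(offset + i, (byte) s.charAt(i));
    }
    buffer.write(offset + s.length(), (byte) 0); // keep the buffer NUL-terminated
    return offset + s.length();
}
```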

Hotspot JFR uses quicksort to sort chunk filenames. In SubstrateVM, a Java quicksort implementation has been added to GrowableWordArrayAccess so that chunk files can be sorted without using the Java heap.
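
For reference, an in-place, index-based quicksort of this kind needs no Java heap. A minimal sketch, with a hypothetical accessor interface standing in for GrowableWordArrayAccess (the PR's real code compares chunk filename bytes; plain numeric comparison is used here):

```java
// Hypothetical accessor standing in for GrowableWordArrayAccess; 'get' and
// 'set' are assumptions, not the PR's actual signatures.
interface WordArray {
    long get(int index);
    void set(int index, long value);
}

// Minimal in-place quicksort (Lomuto partition). Recursion uses only the
// call stack, so nothing is allocated on the Java heap.
static void quickSort(WordArray array, int low, int high) {
    if (low >= high) {
        return;
    }
    long pivot = array.get(high);
    int i = low - 1;
    for (int j = low; j < high; j++) {
        if (array.get(j) < pivot) {
            i++;
            swap(array, i, j);
        }
    }
    swap(array, i + 1, high);       // pivot ends up at index i + 1
    quickSort(array, low, i);       // left of the pivot
    quickSort(array, i + 2, high);  // right of the pivot
}

static void swap(WordArray array, int a, int b) {
    long tmp = array.get(a);
    array.set(a, array.get(b));
    array.set(b, tmp);
}
```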

@oracle-contributor-agreement oracle-contributor-agreement bot added the OCA Verified All contributors have signed the Oracle Contributor Agreement. label Jul 2, 2025
@roberttoyonaga roberttoyonaga marked this pull request as ready for review July 3, 2025 17:50
@christianhaeubl christianhaeubl requested a review from zapster July 4, 2025 08:36
@zapster zapster changed the title Support JFR emergency dumps on out of memory [GR-67169] Support JFR emergency dumps on out of memory Jul 4, 2025
Labels: native-image, native-image-jfr, OCA Verified, redhat-interest